|
Data Quality, Part 1: The Discovery
|
|
9/28/2008 11:42:00 AM
|
|
I wrote a post a few months back about a healthcare data conversion project that I’ve been working on for the better part of 2 years. My task on this project is to convert data from an old UNIX-based Universe database to a SQL Server-based application; the database we are extracting from is quite old, both in terms of technology as well as the length of time this application has been in use at this facility. Universe, and IBM product, is a multivalue database, which in my understanding is like the anti-RDBMS, allowing for multiple values within each row/column intersection. Getting data stored in this manner extracted and imported to SQL Server has been the quite challenging, but it pales in comparison with the difficulty we have had with the quality of the data itself.
We discovered early in the conversion spec process that we would need to make some significant changes to the data to allow it to fit the data model expected by our new software vendor. For example, the current application allows registration staff to enter a patient’s employer information by either direct text entry or through a lookup of existing employers in a lookup table. This was not a problem in itself, as we could reference the lookup table during the conversion to retrieve the employer information. However, a “feature” of the application is that a person can delete records from the lookup table, and since the patient’s visit information stored only the employer code, that information became orphaned; Bob’s Lumber Yard might be BLMYD or similar jibberish, and unless a tenured employee knew the code by heart, that data was lost to the bit bucket.
Another oddity we found was that the application allowed for the deletion of general ledger accounts, even if those accounts had transactions attached to them. Now I’m not an accounting guy (at least I wasn’t before this project – I’ve since gotten a crash course) but even I can tell you that orphaning those transactions can throw off your bottom line, and the effects could be felt for many quarters to come. Even though the transactions in question were several years old and I was able to address them relatively easily in the conversion, the overall quality of the data briefly came into question simply due to this little oddity.
Not all of the data quality issues we found were related to shortcomings of the system. The most dangerous threat to data quality – user input – remains the most unpredictable wildcard in the entire process. Staff turnover, training, and administrative policy changes through the years all contributed to inconsistencies in the data, causing us to create complex and often case-specific algorithms to correct.
Through the course of this conversion, we found a number of other similar issues that caused us grief. However, this old application has been a workhorse and has served its purpose well despite being very difficult to learn and expensive to maintain. For years, users have been able to use the system effectively on a day-to-day basis, and there is very little downtime. Reporting, on the other hand, has been the key shortcoming of this dinosaur. These data quality issues, along with the lack of an intuitive reporting tool, have made reporting from this system all but impossible. This organization has had to spend a painful amount of staff time creating reports because of the manual work required to do so. More often than not, the reporting process goes like this:
The user generates a text-based report using one of two integrated reporting tools (neither of which is user-friendly or intuitive). Said user runs a copy/paste operation from the text report into Excel. Extraneous text such as page headers, etc., must be manually removed from the Excel file. The user creates extra rows or columns and copies/pastes the data from one cell to another to properly align the data as required for reporting. The user then performs all of the summarizing/sorting on the Excel document.
This is done for most of the critical reports each month. Some reports must be generated weekly, which of course further increases staff workload. The most discouraging thing, though, was that many users simply accepted this and didn’t know how much easier reporting could – and should – be.
In this first of what will hopefully be many posts about data quality, I’ve painted sort of a bleak picture – I generally start out on a more positive note, so please bear with me for now. In future installments I’ll break down some of the particulars I ran into and some common (and a few uncommon) methods that can be used to improve the quality of data.

|
|
[Off Topic] The Big Apple
|
|
8/20/2008 10:47:00 PM
|
|
I got my first opportunity to visit New York City last week. I was actually sitting in on a training class in Whippany NJ, which is about an hour away (by train) from the city. I went over to NYC for both evenings that I was in town to see the sights and generally act like a tourist.
On a SQL Server-related note, I did feel that I was cheating just a bit on the SQL BI stack. The training class in which I was participating was for a Cognos setup...
The picture above is uptown looking at the NYPD substation (which looks more like a 1950s diner than a cop shop). The picture below is from the 86th floor of the Empire State Building; if you look very closely at the middle of the picture, you can see Ellis Island and the Statue of Liberty, and to the left of there is where the World Trade Center buildings once stood.


|
|
Delayed gratification
|
|
7/20/2008 11:37:00 PM
|
|
Call me odd, but I enjoy taking certification tests. It may be that I've always tested well, or perhaps it's that I enjoy a challenge, but for whatever reason I have always looked forward to sitting for exams. So when the opportunity arose to take a beta exam for the SQL Server 2008 business intelligence components, I signed up as soon as I could.
I admit that I took the exam cold, only having worked with SQL Server 2008 beta a dozen times or so. I've been studying for the SQL Server 2005 BI exams, and my thought was that a free pass at the 2008 beta test would be a good study tool for me. I spent almost 2 hours in front of the terminal, using what information I knew from experience (mostly on SQL Server 2005) and making educated guesses for the rest. All went well until I reached the end of the exam - you know, the point at which your pulse races just a bit as you wait for the "You Passed" or "You Failed" message. Imagine my surprise when neither of these was shown, but rather I received a "Thanks For Playing, We'll Get Back To You" message. Specifically, the message indicated that the results of the beta exam would be delivered to me via mail in 90 days or so. Yes, that's mail as in paper and envelopes and stamps, right here in the midst of the digital age [Suggestion to Microsoft: Save some cash - post the results on my MCP record and send me an e-mail when it's ready].
I've never taken a Microsoft beta exam before, so I don't know if this is status quo for these types of tests. Even though certifications have gotten a bad rap lately, I still believe that, when properly administered, certification programs can still offer some measure of a person's willingness and ability to succeed on a particular software component or career track. And as one who intends to continue to pursue certifications pertinent to my skillset, I have a vested interest in helping to make the SQL Server certification track as relevant as possible, so I am glad that I participated in this trial. However, I have to admit - and yes, it's probably a little selfish - that I'm not sure I would have been so eager to participate if I'd known that I would have to wait months for any kind of hint as to how I performed on the exam.
So here are my questions. First of all, is the delayed response common to all Microsoft exams, and second, is this a deterrent to other potential beta exam test-takers? 
|
|
Slacking and data quality (in that order)
|
|
6/30/2008 11:10:00 PM
|
|
Well, I almost missed blogging for the entire month of June. I'm sure that this fact didn't go unnoticed by both of the people who read my blog... I'm working on a major data conversion and am in a mad dash to finish converting and validating years of healthcare and financial data, and unfortunately my free time (including the time allocated for blogging) has been scarce. The good news is that the project - at least the data conversion piece - will be over in late September and perhaps life will return to some semblance of normalcy.
The aforementioned project has been an interesting exercise in data quality. The system from which I am extracting data is quite old, in technology years anyway, and the application design lacks some of the keystones of modern systems - not the least of which is relational integrity. The de facto standard for data entry was free text, which made for many (in some cases, tens of thousands) of duplicates. Fortunately, the system to which I am converting has a well designed SQL Server backend, and in spite of a few disagreements, the vendor has been open to modifying the system to suite or needs. As to the quality of our data, I've had lots of opportunities to expand my SSIS skills to gently (most of the time) massage the data into the target system. I've even been able to write some code, which I don't do that much any more, for some advanced text parsing and manipulation.
Once this project is complete, I'll write a more comprehensive - and coherent - post to discuss in more detail my travels through this conversion and some of the data quality lessons I've learned. 
|
|
Using IN with possible NULL values
|
|
5/30/2008 12:01:00 AM
|
|
There is a little quirk with NULLs when using the NOT IN qualifier. I use the term "quirk" loosely here because the behavior is exactly as intended, though it may not be obvious. The following query shows a trivial example:
At first glance, one might think that we'll see returned the row for the Chevy MINIVAN. However, when you run the above query (with the default option of ANSI nulls set to ON), you'll always receive zero rows returned. Why? Because the NOT IN is still a set-based comparison, and with ANSI_NULLS ON, comparing any value to NULL yields False (technically it yields NULL, but you see what I mean).
A quick solution can be found, of course, by simply adding a NOT NULL to the lookup column in your subquery. This isn't rocket science, and really isn't an advanced T-SQL topic, but it's one of those thing that can sneak up on you if you're not expecting it, particularly if you have one of these deeply nested in a complex query. 
|
|
A little fun with SSIS Expression Language
|
|
5/8/2008 5:21:00 PM
|
|
The SSIS expression language is a powerful yet enigmatic entity. Once you get used to its syntax - which is part C#, part T-SQL and part *WTH?!?* - it's actually somewhat fun to use. However, one thing it is lacking (as far as I can tell - correct me if I've missed something) is the ability to use an IF...ELSE IF...ELSE statement.
So I ran into a situation earlier in which I needed to do an advanced conditional expression. Here's the situation: My input has a column named provider. This column comes to me as an integer but has to be padded with zeros where necessary to make it a total length of 3 ("2" becomes "002", "13" becomes "013", etc.). A special case exception is the value "1" which is to be left as is.
I knew that the *proper* way to do this was to either create a script component and drop down into VB.NET to map the value appropriately, or use a Conditional Split transform and then merge the values together. However, a colleague had just asked an unrelated question about the ternary operator in the expression language (officially called the Conditional Operator in Redmond), and I decided to push it a bit and force this processing into the Derived Column transform using the Conditional Operator.
So just for fun, I've posted the expression below. It's actually three ternary Conditional Operators working together to provide an advance IF...ELSE IF...ELSE statement. [For the record and in case it's not clear from the code below, the syntax for the Conditional Operator is (value_to_test) ? (value_to_return_if_true) : (value_to_return_if_false).]
(LEN(provider) < 3 && LEN(provider) > 0 && provider != "1") ? (LEN(provider) == 1 ? "00" + provider: "0" + provider) : (LEN(provider) == 0 ? "000" : provider) 
|
|
SQL Backup 5 (Red Gate)
|
|
5/6/2008 4:37:00 PM
|
|
So after many months of trudging through native SQL Server backups for new mission-critical application I'm deploying, I have decided to do some evaluations and cost/benefit analyses on aftermarket SQL Server backup products. I've started off with SQL Backup 5 from Red Gate. No disclaimers here - I have no affiliation with Red Gate except that they are kind enough to allow me to blog on SqlServerCentral.com (which is owned by Red Gate).
The download and installation of this product was very straightforward - the download is just under 50mb and includes the entire suite of applications on the SQL Toolbelt. I installed only the Backup tool, which was ready for use in under 5 minutes. The interface is very clean in intuitive, and allowed me to register my test server and view its backup history.
To actually backup and restore on my test server (to which I am connecting remotely), I have to install a server-side component. This process happened quickly (less than a minute), and resulted in the addition of a new service (SQL Backup Agent) on this test machine. This service is required to schedule backup jobs from within this application, as it appears that SQL Backup 5 does not rely on the SQL Server Agent service to run scheduled backups. Note that the server-side piece must be installed on the server even if you are not running scheduled jobs.
I ran a baseline backup using native SQL backup tools on my database. The database, somewhere north of 26gb in size, generated a 17gb backup file in just under 1/2 hour. Restoring this backup file to a new database took about 10 minutes. By comparison, the same backup using Red Gate SQL Backup 5 took just over 16 minutes and created a backup file a little less than 3gb in size; restoring this file took about 6 minutes. These operations were done using the default settings in SQL Backup 5, including a compression level of 2 (the middle-of-the-road) setting, and bypassing the Multiple Threads option. The total time for a restore + backup operation (which will be required for our datawarehousing app) would be cut down to just over half the original time required, which was not as good as what I had hoped. However, the size of the resulting file was less than 20% of the original size, which is right in line with my expectations.
I was interested to find that SQL Server reports the last backup date as the date and time at which I ran the backup from SQL Backup 5, which surprised me since this is a third party app and not truly integrated with SQL Server.
Licensing seems to be straightforward - the SQL Toolbelt, an all-inclusive suite of SQL Server administration tools, includes one server license for SQL Backup 5. Additional server licenses are available as well, and licenses for test/demo installations are available at 30% of list price. One important item of note is that the gentleman I spoke to by telephone at Red Gate indicated that we will need a license for both nodes of our SQL cluster, though the second node qualifies for the same price break as a test/demo machine.
My eval version of this product expires in a few weeks, and in the meantime I plan on pushing it further by testing some of the advanced features (changing compression levels, using multiple threads, etc.) to see how it responds. I'm also interested in hearing from others who have used this product - please ping me or just comment here if you've had experiences, good or bad, with Red Gate SQL Backup 5. I plan to eval at least two more products in this category as well in the coming weeks and months, and I'd appreciate any feedback. 
|
|
|
SQL Saturday Jacksonville
|
|
5/4/2008 12:01:00 PM
|
I just wrapped up a high-octane day of SQL Server community education at SQL Saturday in Jacksonville, Florida. This event, the third thus far in a series of training opportunities for SQL Server professionals, is loosely based on the Code Camp model and is billed as a community event. I enjoy these kind of events, having attended several Code Camps and other similar training opportunities. They are community-based, meaning that most of the trainers are peer professionals and not professional technical educators. Events are held on the weekends, and by the way, they're free to attendees.
I stepped in and did something I had never done before: I volunteered to be a presenter at this event. I've been working toward expanding my capabilities by writing and speaking more, and this was a great opportunity to do the latter (and hopefully will open some doors for the former). The session I presented addressed the SMO (SQL Server Management Objects) framework and how DBAs and developers can use these objects to automate the administration of SQL Server instances throughout an organization. As I shared with my attendees, SMO is the coolest technology that nobody is using. I've posted my notes from this session on my website, so if you're interested in learning more about SMO feel free to download the slide deck and samples and use them as you will (these will likely be posted on the SQLSaturday website as well).
I got to spend some time visiting with Andy Warren and Brian Knight about the framework of the SQL Saturday event, and I am convinced that something like this would be a big hit back home (the Dallas/Ft. Worth area). There is a large user base, and there are enough people who have demonstrated a willingness to volunteer their time to make such an event a success in north Texas. I plan to seriously pursue this so if you are interested in seeing a SQL Saturday event in the Dallas area, please let me know.
If you are a north Floridian or will be in that area next month, there is actually another SQL Saturday event coming up next month at the Orange County Convention Center. If you can make it, I encourage you to stop by.
|
|
Gotta love the Information Age
|
|
4/27/2008 2:07:00 PM
|
|
First it was CNN 24x7. Next the Internet brought us live information from around the globe.
Now this... real-time pizza tracking...


|
|
Free SQL Server 2008 online courseware from Microsoft
|
|
4/14/2008 1:53:00 PM
|
|
Just received word that Microsoft has posted 3 new courseware titles under the "What's New in SQL Server 2008" umbrella. Available titles are Business Intelligence, Database Development, and Enterprise Data Platform:
https://www.microsoftelearning.com/eLearning/offerDetail.aspx?offerPriceId=139087
On a related note, I have downloaded and installed SQL Server 2008. I'll post more information later once I've had a chance to completely review it, but two things immediately stand out as positive improvements: Intellisense in the query window, and C# scripting in the SSIS script task/component. 
|
|
SQL Saturday
|
|
3/30/2008 9:34:00 PM
|
|
I've just received notice that I will be a presenter at the upcoming SQL Saturday event in Jacksonville, Florida. I'll be speaking about the SQL SMO object namespace and how this powerful set of classes can be used to administer SQL Server instances programmatically.
I'm looking forward to this opportunity. I've not had much experience delivering presentations to those outside my own places of employment, so this should be a learning experience for me. I'm also the last presentation of the day, so I'll get to attend many other sessions as well.
So if you can get to Jacksonville the first weekend in May, I encourage you to come out to SQL Saturday. Hope to see you there! 
|
|
Give us SP3!
|
|
3/12/2008 9:03:00 PM
|
|
There has been a great deal of wailing and gnashing of teeth regarding the scuttlebutt that SP2 will be the last service pack for SQL Server 2005. I, too, am excited about the release of the next version of SQL Server later this year, but SQL Server 2005 is likely to be around for a long while. C'mon, it's 2008 and I still have SQL Server 2000 boxes running in production (not by choice, but I digress...). I believe it was Steve Jones who suggested that there be a new SP every six months as long as the product is fully supported; I don't necessarily think we need service packs that often, but I concur that there should be a commitment to continue with full support - which includes regular service pack releases - as long as the product is mainstream.
There is a feedback poll on Microsoft's website to allow us to weigh in on the release (or nonrelease) of Service Pack 3. I've never voted on one of these polls, so I don't know how seriously the Blue Badges take these things, but at least it's worth voicing my opinion.... 
|
|
We Are Microsoft Charity Challenge Weekend
|
|
1/21/2008 12:20:00 PM
|
|
This weekend, I had the unique opportunity to donate some time to a worthwhile charity organization. Through the efforts of the Dallas-area .NET user group community (and specifically, Toi Wright), Microsoft, BravoTech and a number of other vendors, the first annual We Are Microsoft Charity Challenge was born in a flurry of activity over the last three days. This event was held in Dallas and matched up 100 or so developers with 18 charities in need of development services.
I was placed on a team assisting SER Child Development Center with developing a new website . This organization aids low-income families by providing low-cost child care and education, as well as adult education and career development services. From the outset, the organization staff were prepared and readily available; they arrived at the kickoff with design ideas, a packet of information as well as a USB drive full of electronic content and photos. Juan Torres, the CEO and President, and Dr. Carol Johnson-Gerendas, director of the center, were very enthusiastic about this project, volunteering to stay with us for as long as necessary to bring us up to speed on their needs. These two were gracious and appreciative, and made us feel like our efforts really will make a difference.
My team initially consisted of four people, but we lost one member to the flu on Friday. The other remaining members were Ryan Magnusson, a developer working for Wal-Mart in Arkansas (yes, he actually drove in for the weekend) and Raymond Sanchez, a web developer local to the Dallas area. After reviewing the charity's business model and website requirements, we opted to use a SubSonic starter kit for our project. This allowed us to quickly roll out the base application (essentially a CMS) and gave us a framework on which we could develop a couple of requested custom components. I took on the role of project lead as well as writing the custom components, the latter of which was very gratifying since I don't get to write as much code as I used to.
Like all software projects, we had a few glitches. The most frustrating issue was the web space provided for the event had a number of issues which were not resolved until the last day of the event. This left us with the unfortunate choice to leave their existing site in place (having run the demo from my laptop, where the code resides) until we resolve the issues with their web host. We also had a phantom error in SubSonic that slowed us down for a few hours. Since we only had 48 hours in which to work, sleep was simply an afterthought (in fact, one of the guys actually pitched a tent in the break room and slept on site). But the food was good and plentiful, they kept us filled up with caffiene, and the facilities were spacious.
We also had the opportunity to be interviewed by some of the guys from GeeksWithBlogs.net, which was published as a podcast [listen here] . This was my first - and hopefully not the last - podcast interview.
All things considered, the project was a success; when we met with the charity staff at the wrap-up meeting on Sunday, they were highly impressed with the product. Though we didn't win any awards, I'm confident that we have created a solid application on which they can promote their charity for many years to come.
The We Are Microsoft event was billed as a "first annual", suggesting that this will be an ongoing gig. It was suggested that perhaps other user groups will follow suit and host their own WAM event. As for me, I'll be first in line to participate again next year. And maybe I'll bring my tent.... 
|
|
|