Datamartist gives you data profiling and data transformation in one easy to use visual tool.

« | »

Setting the stage: managing data issues

Anyone who has done any data analysis with more than a few lines of data knows that some of the biggest time wasters are data quality issues.

What is bad data?

 Well, some of it is easy to see, some is downright impossible to find. Lets look at an easy example; a row of data where the country is “US” and the state/province is “Ontario”.

 You just know both those values can’t be right.  So why did the source system let it happen-  good question- when the programmers of the application tell you let me know…

So for this easy one  should we assume that the country is right, and change the state to Michigan since that’s close to Ontario?  Or maybe Ohio because they both start with “O”?

The right answer is of course to go back to the source and fix the problem-  but if you’ve got hundreds or thousands of users, an application that can’t be modified to stop this type of error at the source, and and IT department that is overworked then that is probably not an option.

Say your company does $500 million dollars of business in a year, and the Ontario, US data represents $1500 of sales- for practical analysis purposes it just doesn’t really matter where it goes-  just so you don’t see Ontario as a State in the report you give the CEO.

Often the solution in Excel is to just “fix it”- but if you reload the data each month, you have to go back and fix it again and again.

Or maybe you only load in the new months data, so you don’t overwrite last months.  This works until some definitions change, and now all the historical data is out of sync with the new months data.  A good example of this is if you have sales regions.  If the regions are changed (new ones added, existing ones split up or merged) then the historical data you have on your machine will have to be dumped to get the sales region codes corrected for the past.  But then all your fixes have to be redone-  could be a real nightmare if you’ve been using it for a while. 

Another issue is that eventually the data in the source system might get fixed- and it turns out the IT departments fix wasn’t the same as yours-  Ohio rather than Michigan- so you have more discrepancies to chase down.

On top of that, even though the amounts are small, its disconcerting to see different spreadsheets show different totals because you “fixed it” in one of them, and didn’t in the other even just within your spreadsheets- not to mention the spreadsheets of your colleagues.  I hate it when my boss sees three different numbers.

The key is to create a single data-set from all your sources, fix the problem once, and do it in a way that the entire data set can be refreshed automatically, the fixes (those that still apply) can be “re-run” and then all your spreadsheets link to THAT single version of your truth.  The more you share this “master” sheet with your co-workers the better.

Of course, that’s the whole point of having data warehouses and data marts and Enterprise business intelligence systems.  But what if the analysis you need to do hasn’t been covered, or isn’t scheduled to go live for another eight months?

If its just you and your best friend Excel, then here are some pointers;

  • Stage your data.  Don’t make 10 spreadsheets that all take data from the raw source, rather make one spreadsheet that is the “Fixed data”, and have all other spreadsheets link to this. This will mean you will have a single “staging” spreadsheet, and then a number of “reporting” spreadsheets.
  • Record all your “fixes” on a sheet in this staging spreadsheet called “Known Issues”.  The ideal would be to automate it so that the fixes get applied each time you reload the entire data-set, but by at least having a clear record you wil be able to quickly get the data where you want it each time you reload.
  • Don’t think about report layout or formatting in your staging spreadsheet–  keep the data in simple tables that is more aligned with how you get the raw data- every column has the same values in it all the way down.  If you have different data-sets, use different sheets.  Do the reporting in your reporting spreadsheets where you can have tables with different mixes of data from different sources.

The more macros or scripts you can use the better- but even if you have a set order of cut and paste, and follow it to the letter (write it down on a sheet called “Refresh Steps” in the staging spreadsheet maybe) it will reduce the amount of time it takes to update each time new data is available, and you only do the fixes once, and if you’ve linked it cleanly to the reporting sheets then the rework will be reduced.

 Of course, its still a lot of work.  In a nutshell that’s why I’m working on the Datamartist tool that will automate this and much more, allowing you to easily and without programming be able to manage your spreadsheets much more effectively.  Stay tuned, and in the meantime happy staging.

Tagged as: ,

Twitter

« | »

Leave a Response