Datamartist is a new visual tool that gives you control of your data.


Data Profiling and Data Completeness

There are various steps in data analysis, but for me the very first one is always "what have we got?".  You have a data set, and some broad requests or ideas about what you want to get out of it, but the first question is: how good is the data?  So the first thing I always do with a data set is profile it: explore it a bit, and try to figure out whether I'm good to go or whether I have serious data quality issues that have to be fixed first.

As I develop the Datamartist Tool, I know that I need to have some serious data profiling functionality baked in.  To date, I've included only a simple row count report, but it illustrates where I want to go: simple and to the point, with an intuitive interface that lets you get right at the data.  Click on a bar, and you see the rows involved.

I've built the user interface to handle many, many reports and visualisations; the data profiler tab is designed to show hundreds of them.  It's an area I feel will be of great use to analysts.

The simple row count I've started with gives you a clear, vertical bar chart that shows the row count for each unique value in each column.  Nothing fancy, just the facts.  But it's extremely useful.  It immediately lets you see if you have issues with Null values, and it shows you whether the distribution of values has a long tail or is more evenly or randomly distributed.  It even serves as basic duplicate detection, since sorted views with counts will often reveal issues in categories, such as the same category entered with slightly different spellings.
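To make the idea concrete, here is a minimal sketch of a row count profile in Python.  This is not Datamartist's own code; the function names (`profile_value_counts`, `render_bars`) and the sample rows are made up for illustration, and the bars are drawn as plain text rather than a real chart.

```python
from collections import Counter


def profile_value_counts(rows):
    """Return {column: Counter(value -> row count)} for a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, Counter())[value] += 1
    return counts


def render_bars(counter, width=20):
    """Render one column's counts as text bars, most frequent value first."""
    top = max(counter.values())
    lines = []
    for value, count in counter.most_common():
        label = "NULL" if value is None else str(value)
        bar = "#" * max(1, round(width * count / top))
        lines.append(f"{label:<10} {bar} {count}")
    return lines


# Hypothetical sample data: note the Null and the "North"/"north" variants.
rows = [
    {"region": "North", "status": "open"},
    {"region": "North", "status": "closed"},
    {"region": "north", "status": "open"},
    {"region": None, "status": "open"},
]

profile = profile_value_counts(rows)
for column, counter in profile.items():
    print(column)
    for line in render_bars(counter):
        print("  " + line)
```

Even on four rows, the counts for `region` show a Null and a "North" versus "north" split, which is the kind of category issue a sorted count view tends to surface.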

But it's only the beginning.  Data profiling can give amazing views into even very large data sets by using visualisation techniques.  Next on my list are bitmaps that illustrate completeness using colour coding and techniques such as time lines, column maps and heuristic scale compression.  And these ideas are just the two-dimensional concepts; one day I'll play with even more, but one step at a time.
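As a rough illustration of what a completeness map could look like, the sketch below splits the rows into blocks and records, for each column, the fraction of non-null values in each block.  It is only one possible reading of the idea, not the planned Datamartist design: the names `completeness_map` and `render_map` are hypothetical, the fixed block size is a crude stand-in for scale compression, and text shades stand in for colour coding.

```python
def completeness_map(rows, columns, block_size):
    """For each column, return the fraction of non-null values per block of rows."""
    result = {column: [] for column in columns}
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        for column in columns:
            filled = sum(1 for row in block if row.get(column) is not None)
            result[column].append(filled / len(block))
    return result


def render_map(cmap):
    """Render each column as one line of shade characters, one per block."""
    shades = " .:#"  # from empty (space) to fully complete (#)
    lines = []
    for column, fractions in cmap.items():
        cells = "".join(shades[int(f * (len(shades) - 1))] for f in fractions)
        lines.append(f"{column:<8}|{cells}|")
    return lines


# Hypothetical sample data: "email" becomes sparser further down the set.
rows = [
    {"id": 1, "email": None},
    {"id": 2, "email": "a@example.com"},
    {"id": 3, "email": None},
    {"id": 4, "email": None},
]

cmap = completeness_map(rows, ["id", "email"], block_size=2)
for line in render_map(cmap):
    print(line)
```

With a larger block size, the same few characters can summarise millions of rows, at the cost of hiding small gaps inside a block.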

My goal is to provide visibility into data that lets analysts quickly know where the issues are, where the solutions might lie, and in the extreme cases, to know that a data set is "junk" and wave off before wasting time on analysis.

Send me your ideas for data profiling reports and visualisations.  The first step to great data analysis is accessing the data.
