
Data quality from a four year old

I think my four year old would make a good data quality dude. He explained to me recently why it’s better to use stickers than crayons “for the things people use a lot”.

“Dad, if you use crayons, you might draw it different, but stickers- they are all the same.” He then pointed to the sheet of identical, machine generated stickers. “All the same- so everyone who gets one of these, knows what it is.”

“Using the crayon takes too long and sometimes I make mistakes.” Then he paused for a second. “But if it’s something different- then I have to draw it. No stickers for that.”

And off he went, blending hand drawn custom crayon work with high speed sticker application.

It strikes me that what my son has figured out as a basic rule of thumb in arts and crafts for the use of stickers, is a pretty good analogy for design of data entry systems.

Whenever you can, use something that restricts the user’s choices to a fixed, understood set of responses. Use pre-made data stickers.

The enemy of data quality everywhere is the gaping, un-validated free form text entry field. Only linguists and unstructured text analysts can get excited about the “endless possibilities” of what your users and customers can enter into those fields.

We’ve all seen the horrors of names and addresses run amok- “John A Smith”, “Jon A. Smith”, “John Smith Jr.”, “Smith, John A” or the even more amazing “John Smith (new customer)”.

If you’re in data, you don’t want endless possibilities. You want ordered sets of data that conform strictly to well defined rules. Eliminating duplicates is a complex and time consuming effort. Stopping as many of them as possible before they are created is the first, best thing you can do to get a handle on the problem.

So think stickers. For every field ask yourself- can I make this a combo box? Radio buttons? Can I do an auto search in the existing records to suggest close matches? Anything to stop users or customers from making things up- and to have the data points they enter conform to a defined domain.
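To make the sticker idea concrete, here is a minimal sketch in Python of both tricks: a field restricted to a fixed set of values, and a close-match lookup against existing records. The field name, the list of customer types and the 0.8 similarity cutoff are all made up for illustration; the matching uses the standard library's `difflib`, which is only one of many ways to do fuzzy matching.

```python
import difflib

# Hypothetical fixed domain for a "customer type" field (the stickers).
CUSTOMER_TYPES = {"retail", "wholesale", "government"}


def validate_customer_type(value):
    """Accept only values from the fixed domain, ignoring case and padding."""
    cleaned = value.strip().lower()
    if cleaned not in CUSTOMER_TYPES:
        raise ValueError(f"{value!r} is not one of {sorted(CUSTOMER_TYPES)}")
    return cleaned


def suggest_existing(name, existing_names, cutoff=0.8):
    """Return up to three existing records that look close to the new name."""
    # Compare in lower case, but hand back the names as they are stored.
    by_lowercase = {n.lower(): n for n in existing_names}
    matches = difflib.get_close_matches(
        name.lower(), list(by_lowercase), n=3, cutoff=cutoff
    )
    return [by_lowercase[m] for m in matches]
```

Calling `suggest_existing("Jon A. Smith", ["John A Smith", "Mary Jones"])` offers the clerk "John A Smith" before a near duplicate gets created. The cutoff is a judgment call: too low and every entry triggers suggestions, too high and the duplicates slip through.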

The more constrained a field is, the better the chances are that the data stored in it will be useful… unless of course you make it so constrained that you force data quality to suffer.

There is such a thing as too much…

Every good rule has its exceptions, and the evil side of overly constraining your data entry folks is that because they are smarter than computers, they’ll find ways to invent entirely new encoding methods.

If you tighten the entry on the postal code too much, so that international postal codes won’t fit, you can be sure that data entry clerks will discover that by entering their own postal code, and putting the customer’s postal code in the comment field, they can get the system to accept the record (and at least feel as if they had tried their best to get the data needed in there).

This is where which stickers you have in your collection starts to matter.

Have you ever noticed that at well run events, they always have some blank name tags, as well as the pre-printed ones? That and a magic marker makes sure the process can go on.
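For the postal code case, a blank name tag might look something like the sketch below: strict patterns for the countries you know well, and a looser check for everything else so the clerk never has to fake a value. The two patterns and the 12 character limit are illustrative assumptions, not a complete or authoritative list of postal formats.

```python
import re

# Illustrative patterns only; real postal formats should come from an
# authoritative source for each country you support.
POSTAL_PATTERNS = {
    "US": re.compile(r"\d{5}(-\d{4})?"),
    "CA": re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d"),
}


def check_postal_code(country, code):
    """Validate strictly where a pattern is known, loosely everywhere else."""
    cleaned = code.strip().upper()
    pattern = POSTAL_PATTERNS.get(country.strip().upper())
    if pattern is None:
        # The blank name tag: for countries without a pattern, accept any
        # short, non-empty value instead of rejecting the record.
        return bool(cleaned) and len(cleaned) <= 12
    return pattern.fullmatch(cleaned) is not None
```

With this in place `check_postal_code("US", "K1A 0B1")` is rejected, but `check_postal_code("GB", "SW1A 1AA")` goes through, so the real postal code ends up in the postal code field and not in the comments.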

In the end, you’ll need to balance between the two extremes. Tighten up your data entry and interfaces as much as you can, but realize that there is a point of diminishing returns, and in fact probably even a point where your data totalitarianism will be hurting your data quality, not helping it.

Now of course, there are some pretty high end tools that let you create all sorts of rules, and others that let you comb through the data and cleanse it, checking those postal codes to states and cities, and doing all sorts of fancy matching and analysis. There is definitely an important role in many organisations and systems for approaches and tools such as these.

Using data profiling tools like Datamartist will help you understand what issues are making it through your defenses.
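To give a feel for what profiling means, here is a rough sketch of one common technique: reduce every value in a column to its shape and count the shapes. This is a generic illustration in Python and makes no claim about how Datamartist or any other tool works internally; the function names are my own.

```python
from collections import Counter


def shape(value):
    """Reduce a value to its pattern: letters become 'A', digits become '9'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # keep spaces and punctuation as they are
    return "".join(out)


def profile(values):
    """Count how often each shape occurs in a column of text values."""
    return Counter(shape(v.strip()) for v in values)
```

Run over a postal code column, the common shapes are your expected formats, and the rare ones (an "A/A" from somebody typing "n/a", say) are the entries that got past your defenses.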

But if you are not doing it already, focusing on the point of entry with practical, balanced techniques will make a step change improvement to your data quality.


1 Comment

  1. Great post, James!

    That place–the gaping, un-validated free form text entry field–is strong with the dark side of the Quality. A domain of evil it is. Within you must sometimes enter data. However, you must be cautious, for once you start down the dark path, forever will it dominate your destiny, consume your data management it will.

    May the Quality be with your Data–and always the Good Quality.

    Best Regards,


    P.S. Is your son accepting CIO applications? I know a few companies who need him 🙂