Duplicate Data – Datamartist.com

Data quality from a four year old

James Standen — Tue, 08 Jun 2010 14:10:22 +0000

I think my four year old would make a good data quality dude. He explained to me recently why it’s better to use stickers than crayons “for the things people use a lot”.

“Dad, if you use crayons, you might draw it different, but stickers- they are all the same.” He then pointed to the sheet of identical, machine-generated stickers. “All the same- so everyone who gets one of these knows what it is.”

“Using the crayon takes too long and sometimes I make mistakes.” Then he paused for a second. “But if it’s something different- then I have to draw it. No stickers for that.”

And off he went, blending hand drawn custom crayon work with high speed sticker application.

It strikes me that the rule of thumb my son has worked out for using stickers in arts and crafts is a pretty good analogy for the design of data entry systems.

Whenever you can, use something that restricts the user’s choices to a fixed, understood set of responses. Use pre-made data stickers.

The enemy of data quality everywhere is the gaping, unvalidated free-form text entry field. Only linguists and unstructured text analysts can get excited about the “endless possibilities” of what your users and customers can enter into those fields.

We’ve all seen the horrors of names and addresses run amok- “John A Smith”, “Jon A. Smith”, “John Smith Jr.”, “Smith, John A” or the even more amazing “John Smith (new customer)”.

If you’re in data, you don’t want endless possibilities. You want ordered sets of data that conform strictly to well-defined rules. Eliminating duplicates is a complex and time-consuming effort. Stopping as many of them as possible before they are created is the first, best thing you can do to get a handle on the problem.

So think stickers. For every field ask yourself: can I make this a combo box? Radio buttons? Can I automatically search the existing records to suggest close matches? Anything to stop users or customers from making things up, and to have the data points they enter conform to a defined domain.
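As a rough sketch of the "suggest close matches" idea, here is one way it could look in Python using the standard library's difflib. The customer list, the function name and the 0.6 similarity cutoff are all assumptions made for illustration, not part of any particular data entry system.

```python
from difflib import get_close_matches

# Hypothetical list of customer names already in the system.
EXISTING_CUSTOMERS = ["John A Smith", "Jane Doe", "Acme Rail Ltd"]


def suggest_existing(entry, known=EXISTING_CUSTOMERS, limit=3, cutoff=0.6):
    """Return known values that look similar to what the user typed."""
    # Compare in lower case so "john a smith" still matches "John A Smith".
    lowered = {name.lower(): name for name in known}
    hits = get_close_matches(entry.lower(), list(lowered), n=limit, cutoff=cutoff)
    return [lowered[hit] for hit in hits]


# A near-duplicate spelling brings up the existing record as a suggestion.
suggestions = suggest_existing("Jon A. Smith")
```

The entry form would show these suggestions before creating a new record, so the user can pick the existing "sticker" instead of drawing a new one. The cutoff would need tuning against real data.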

The more constrained a field is, the better the chances are that the data stored in it will be useful… unless of course you make it so constrained that you force data quality to suffer.

There is such a thing as too much…

Every good rule has its exceptions, and the evil side of overly constraining your data entry folks is that because they are smarter than computers, they’ll find ways to invent entirely new encoding methods.

If you tighten the entry on the postal code so much that international postal codes won’t fit, you can be sure that data entry clerks will discover a workaround: by entering their own postal code, and putting the customer’s postal code in the comment field, they can get the system to accept the record (and at least feel as if they had tried their best to get the data needed in there).

This is where which stickers you have in your collection starts to matter.

Have you ever noticed that at well-run events, they always have some blank name tags as well as the pre-printed ones? That and a magic marker make sure the process can go on.

In the end, you’ll need to balance between the two extremes. Tighten up your data entry and interfaces as much as you can, but realize that there is a point of diminishing returns, and in fact probably even a point where your data totalitarianism will be hurting your data quality, not helping it.
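One way to picture that balance is a check that is strict where the rules are known and leaves a "blank name tag" everywhere else. The sketch below is only an illustration: the two patterns are simplified, and the country codes, function name and review flag are assumptions.

```python
import re

# Strict patterns for the countries we know well (simplified, not exhaustive).
POSTAL_PATTERNS = {
    "US": re.compile(r"\d{5}(-\d{4})?"),
    "CA": re.compile(r"[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d"),
}


def check_postal_code(country, code):
    """Return (accepted, needs_review) for a postal code entry."""
    code = code.strip()
    if not code:
        return False, False  # an empty code is never acceptable
    pattern = POSTAL_PATTERNS.get(country.upper())
    if pattern is None:
        # The blank name tag: unknown country, so accept but flag for review.
        return True, True
    return pattern.fullmatch(code) is not None, False
```

A clerk entering a customer from a country without a pattern gets the record accepted with the real postal code in the right field, and the review flag lets someone look at those records later rather than hunting through comment fields.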

Now of course, there are some pretty high end tools that let you create all sorts of rules, and others that let you comb through the data and cleanse it, checking those postal codes to states and cities, and doing all sorts of fancy matching and analysis. There is definitely an important role in many organisations and systems for approaches and tools such as these.

Using data profiling tools like Datamartist will help you understand what issues are making it through your defenses.

But if you are not doing it already, focusing on the point of entry with practical, balanced techniques will make a step change improvement to your data quality.

Connecting the dimension table to the fact table- Vendor Example (Part 3)

James Standen — Mon, 09 Feb 2009 20:47:55 +0000

In parts one and two of this series we introduced our challenge (to make a data mart to analyze the Acme Company’s spending) and showed how the Datamartist tool could import millions of rows of data and then turn it into a fact table we can use in Excel.

Now we need to create a Vendor dimension table and join it to this fact table to determine who our big vendors are.

In Datamartist it is a simple task to create this vendor dimension. As always we use blocks and connect them together. We define a dimension by using a reference definition block. All we have to do to configure the reference block is to specify which columns uniquely define the dimension (or almost uniquely: if you have some data glitches, Datamartist will resolve duplicate keys for you using a majority/first rule set).

We start with an import block that brings in the Vendor master text file, then we define the reference by specifying “Vendor_ID” as the key. These first two blocks look like this:

Then we join it to the fact table we created in part two of this series with a join block. This means that now instead of just the vendor ID number that was in the fact table, we have the name, and address for the vendor in our mini star schema.

And finally we put a summarize block after that to total up all the monthly values for each vendor, and we export to Excel. This is what the canvas looks like:

After we do this, we grab the Excel file Datamartist just created for us, do a quick sort, and come up with a list of Acme’s top ten suppliers. Feeling pretty good about ourselves, we do a review with the head of purchasing.
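For readers who think in code rather than blocks, the join, summarize and sort steps described above amount to roughly the following. This is a sketch in plain Python, not how Datamartist does it internally. Only the Vendor_ID key comes from the example; the vendor names, amounts and function name are made up.

```python
from collections import defaultdict

# Hypothetical vendor dimension, keyed by Vendor_ID.
vendors = {
    101: "Acme Steel",
    102: "Mega Brothers Rail East",
    103: "Mega Brothers Rail West",
}

# Hypothetical fact rows: (Vendor_ID, month, amount spent).
facts = [
    (101, "2009-01", 500.0),
    (102, "2009-01", 300.0),
    (103, "2009-01", 400.0),
    (101, "2009-02", 100.0),
]


def top_vendors(facts, vendors, n=10):
    """Total spend per Vendor_ID, attach the vendor name, return the top n."""
    totals = defaultdict(float)
    for vendor_id, _month, amount in facts:
        totals[vendor_id] += amount  # the summarize step
    joined = [
        # The join step: look up the name for each ID in the dimension.
        (vendor_id, vendors.get(vendor_id, "(unknown vendor)"), total)
        for vendor_id, total in totals.items()
    ]
    # Largest total first, then keep the top n.
    return sorted(joined, key=lambda row: row[2], reverse=True)[:n]


ranking = top_vendors(facts, vendors)
```

Note that the totals are kept per Vendor_ID, so the ranking is only as good as the dimension: two vendor records for the same real-world company are totalled separately.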

“Where’s Mega Brothers?” she says with a frown. “I think your data is screwy- no way that Mega Brothers didn’t make the top ten- we spend a fortune on railways, and a lot of our freight goes with the Mega Brothers Rail company. Of course it is probably entered under different vendors, each location works with the office local to them… But we’ve got to view them as a single vendor in the data mart- you can do that, right?”

Fixing Duplicate Rows

Having to deal with duplicate data is a very common issue in any type of data analysis. So, back to the canvas. By simply adding a de-duplicate block to our Vendor dimension table (after the Reference block, and before the join) we can find and resolve the Mega Brothers duplicates.
We just use the filter to find the records (easy to do, looking for “Mega”, “rail”, “brothers” etc.) and we map them to a single instance. This is the filter control that lets us find and tag the duplicates:

As we tag them, they show up in the mapper, which lets us see which duplicate records we have eliminated for the dimension. We run the canvas again, and this time, sure enough, Mega Brothers Rail is in our top ten. But even though the head of purchasing knew it was a lot, this is actually the first time she’s seen the number. “Wow. I’ve got to give them a call- can you give me that in an Excel spreadsheet?”
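The filtering step is easy to picture outside the tool as well. Here is a minimal sketch of a keyword filter that pulls out candidate duplicates for a person to review. The vendor names and the function name are invented for illustration, and this is not Datamartist's own filter logic.

```python
# Hypothetical vendor names from the dimension table.
vendor_names = [
    "Mega Brothers Rail",
    "MEGA BROS. RAIL - WEST",
    "Acme Steel",
    "Mega Bros Railway (East)",
]


def find_candidates(names, keywords):
    """Return the names that contain any of the keywords, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        name for name in names
        if any(keyword in name.lower() for keyword in lowered)
    ]


candidates = find_candidates(vendor_names, ["mega", "rail", "brothers"])
```

A filter like this only proposes candidates. Deciding which of them really are the same vendor is still a human call.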

Stay tuned, more to come as we go further into Datamartist’s ability to segment, filter and organize large data sets.

If you want to see the interface in action, watch our first Tutorial Video. Or just get right to it with your own data: download the free trial now. There is no registration required, and it installs in minutes.

This is part of a 5 part series. Here are the links to the various parts: 1, 2, 3, 4 and 5

Duplicate Data and removing duplicate records

James Standen — Wed, 15 Oct 2008 02:07:19 +0000

Duplicate records, doubles, redundant data, duplicate rows; it doesn’t matter what you call them, they are one of the biggest problems in any data analyst’s life.

There are lots of different types of data quality problems, but in this post I’ll focus on Duplicates.

I’ll share some hints on how to find duplicate records and remove duplicate records, at least from your sight, if not from the source system.

Duplicate Records

A lot of the duplicate records that you’re apt to meet belong to two distinct types.

Non-unique Keys

This is where two records in the same table have the same code or key, but may or may not have different values and meanings. This can happen when you’re mixing data, or when data is coming from non-database sources like text files (CSV files from a CSV import, say) or Excel files. Databases usually have some sort of unique key, so they don’t tend to have this problem, but if you merge data from two different databases the uniqueness might be lost. For example, say you have an Oracle database (System 1) and a MySQL database (System 2), both of which use a “unique” integer to track products. When you merge the two, you are going to have two of everything:

Notice I’ve added a column that specifies the source system, that is, where the record came from. This is the first step in solving this problem: you need to concatenate, or combine, the keys. Although “Product Key” is not unique by itself, “Source System” + “Product Key” is unique, because each source system is internally unique. Now there is a trick to concatenation: add a string of unusual characters when combining. This ensures that two different pairs of values can’t, by bad luck, combine into the same key. Here’s a different example that illustrates the point:

I like to use one or more pipe “|” characters because they’re often not present, or even not allowed, in source data and codes. Of course, you need a tool that is willing to accept that character as part of a string key for this to work. If you are doing this in Excel, use the “&” operator to concatenate fields together and add in other characters as needed. The above example used the following syntax in the formula: ="||" & A1 & "||||" & B1 & "||"
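The same trick can be sketched in Python. The function names and the sample codes are invented for illustration. The separators mirror the Excel formula above, and the approach assumes the pipe character never appears in the source values themselves.

```python
def naive_key(source_system, product_key):
    """Concatenate with no separator: different pairs can collide."""
    return f"{source_system}{product_key}"


def composite_key(source_system, product_key, sep="||||", edge="||"):
    """Mirror of the Excel formula: edge, source, separator, key, edge."""
    return f"{edge}{source_system}{sep}{product_key}{edge}"


# Hypothetical codes: source "1" with key "11" versus source "11" with key "1".
collides = naive_key("1", "11") == naive_key("11", "1")
distinct = composite_key("1", "11") != composite_key("11", "1")
```

Without a separator both pairs collapse to the same string, so two different records end up sharing a key. With the pipes in between, each pair keeps its own key.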

It’s a trick you can use to make VLOOKUP more effective. Another example: say you have a list that has first name, last name, and some address information. First Name + Last Name might not be unique on its own. Throw in the street address, though, and chances are you can get a more accurate list of unique people from a key point of view. Of course this doesn’t solve the John Smith, J. Smith, Johnny Smith, Johnathan Smith problem, or addresses like 123 Any Street vs 123 Any St. vs 123 Any Avenue (often all the same, with errors in data entry)… which leads us to:

Duplicate Meaning

This is more common, and sometimes harder to deal with. In the example above, even though you can fix the duplicate key problem by concatenating a code for the source system to the key (along with some unused characters to ensure no “gotchas”), it’s pretty clear that “Television” and “TV” are probably the same thing, and you don’t really want to see two products. These types of duplicates are often the most damaging to good analysis. Everything works, but your reports are difficult to read, or worse, you make decisions based on your “top 20 products” when in fact 15 of them are not in the top twenty at all, because the REAL best sellers got split between “TV” and “Television” and “TV Screen” etc. Some automated duplicate detection tools exist (particularly in the area of people’s names and addresses), but in the end, for many types of data, it’s the old human eyeball that has to do the work, and you need some sort of system to keep a map of all the duplicates you’ve identified.

And obviously, you know by now where all this is going: the tool you need is the tool I’m creating, Datamartist.

Here are some teaser screen shots of the work in progress, and examples of the functionality that deals with the two problems we’ve discussed above:

To resolve duplicate keys, Datamartist scans the data and allows you to select keys and experiment with as many as you like (doing the concatenation trick that I described above automatically). It informs you which keys are duplicates, and shows you the duplicates for the various key combinations. If there is no way around it, you can keep a non-unique key. Datamartist will fix the reference by taking the value that is most common within a given data set and mapping attributes from that record, giving you a clean, unique reference set to work with, and eliminating that handful of bad records that are messing things up.
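To make the "most common value wins" idea concrete, here is a small sketch in Python. It is my own illustration of a majority rule with a first-seen tie-break, not Datamartist's actual implementation, and the rows and function name are made up.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (product key, product name), with one glitch for key 1.
rows = [(1, "Television"), (1, "Television"), (1, "Telvision"), (2, "Radio")]


def resolve_by_majority(rows):
    """Keep one value per key: the most common, first seen winning ties."""
    values_by_key = defaultdict(list)
    for key, value in rows:
        values_by_key[key].append(value)
    resolved = {}
    for key, values in values_by_key.items():
        # Counter.most_common orders equal counts by first appearance.
        resolved[key] = Counter(values).most_common(1)[0][0]
    return resolved


reference = resolve_by_majority(rows)
```

The misspelled record is outvoted, and every key ends up with exactly one value in the reference set.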

In the case of the second type of duplicates, Datamartist provides a filter/search capability to let you find all the duplicate rows (with “Smith” as the last name, for example). Then it allows you to identify which records are the “Master” and which are to be treated as duplicates. From that point on, the duplicates are mapped to the master, and the reference set shows a single, consistent set of data.
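The master/duplicate mapping can be pictured as a simple lookup table that is applied before any totals are calculated. The sketch below uses invented product names and sales figures, and the function name is an assumption. It shows the general idea rather than the tool's internals.

```python
from collections import defaultdict

# Hypothetical map built by a person: duplicate value -> chosen master value.
master_map = {"TV": "Television", "TV Screen": "Television"}

# Hypothetical sales rows: (product name, units sold).
sales = [("Television", 40), ("TV", 35), ("TV Screen", 30), ("Radio", 60)]


def totals_by_master(sales, master_map):
    """Total units per product after mapping duplicates to their master."""
    totals = defaultdict(int)
    for name, units in sales:
        # Values with no mapping are their own master.
        totals[master_map.get(name, name)] += units
    return dict(totals)


merged = totals_by_master(sales, master_map)
```

In this made-up data "Radio" looks like the best seller until the three television variants are mapped to one master. After mapping, "Television" comes out on top.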

In both these cases, and as a general rule of how Datamartist works, the mapping and configuration you do is not lost if you change input files, or update with new data. As long as the minimum data structure consistency is there, the mapping you did stays with you, so you only have to do it once. You might need to remap some field names, but Datamartist lets you do that easily too, so the same mapping can be used to analyze many different data sets that use the same underlying keys (and would have had the same underlying data quality issues).

Download Datamartist now and see the de-duplication functionality in action.