Data Quality – Datamartist.com

Small data in the age of Big Data

James Standen — Fri, 28 Mar 2014 15:09:34 +0000

Just a quick thought on the subject of small data.

Don’t hear much about that in the press these days- everything is BIG BIG BIG.

But small- (and remember, small is still hundreds of thousands or millions of rows) is still critically important. I hope I’m stating the obvious here- but with some of the “irrational exuberance” around Big data, we need to keep our perspective.

Your entire customer list that (hopefully) you’ve worked hard to keep clean and accurate is probably small data- unless you’re Facebook and it has around a Billion rows.

Your entire product list is likely smallish data.

Certainly the categories that you want to look at, the geographical groupings and regions, your chart of accounts, pretty much all of your master data is small data.

What is clear is it is not just BIG data that matters- it’s how we mix our small data with our big data, and how we turn our big data into the small (aggregated, summarized, cleansed, ready for consumption by other than a data scientist) data sets we use day to day.

While the abilities of Big data are fantastic, and the capabilities of today’s systems are truly awesome, they do not eliminate the need for good old data quality and master data management.

What they do though is open up whole new architectures and capabilities. So embrace big Data- take advantage of the amazing platforms that are available- but don’t get so excited about the big that you let your small data program slip. After all, small is still beautiful too.

Exact isn’t everything- Surf your data!

James Standen — Mon, 08 Apr 2013 20:38:04 +0000

Sometimes an analyst needs to take off the accountant’s hat, forget the urge to chase down every last penny, and instead put on their surfing gear, grab the data surf board (i.e. their set of preferred data tools), and just surf some data.

There are some cases where “Exact” is the only acceptable level of data quality. When we’re sending invoices to our customers, not only does the amount need to be right to the penny, the invoice needs to make it to the right person, on time.

But sending accurate invoices is not exactly the cutting edge in terms of data.

The challenges for today’s data driven organizations are to be able to make sense out of the ocean of data available, and to do it faster than the competition.

And the ocean does not have the highest quality water all the time. In fact, some of the data is downright dirty.

But there is a lot of it, and it can tell us things.

Just like ocean surfing, when you are data surfing you can sometimes ride the wrong wave and end up underwater, or you can spend hours on your board in flat water, hoping the surf will come up but instead getting nowhere.

But when the wave comes, and you ride it, letting yourself go with the flow- while it won’t give you any exact answers, it will give you a “feel” and sense of where the wave is going, and if you are fast to grab it, and your board is good, you might just find yourself getting some awesome, gnarly insight that will score you some competitive advantage- and you just can’t surf that wave if you are worried about only looking at Exact, perfect, “we’ve checked it three times” data.

Data Surf is up- and by the look of the ocean it’s going to be a rocking rolling ride- good luck out there!

Data quality monitoring and reporting

James Standen — Thu, 17 Jan 2013 15:43:00 +0000

In the vast majority of cases, useful data sets are not static, but are being updated, added to and purged constantly.

Data quality monitoring aims to provide data quality information that is also being constantly updated, and can be used to detect issues quickly, before the bad data piles up.

Don’t let those bad records pile up.

Imagine a company that does a mailing to its customers every 3 months. Imagine that a new call center training program is put in place, and unknown to all, customer information starts being incorrectly entered and updated due to an error in the program. When do you want to know about the growing number of invalid customer records? When it’s time to do the mailing, after 90 days of bad data generation, or after just a few days of problem entries?

ALERT! ALERT! Bad data Alert!

The trick to data quality monitoring is having a set of data quality rules, and profiling tests that will automatically give indications of issues. Some rules are easier than others, but the idea is that each record or group of records in the table(s) in question is checked against a series of tests, and the number of data quality rule infringements is tracked. As the size of the table grows, if the overall percentage of bad to good is increasing, you know you have an issue. If it spikes up, you sound the alarm.
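
As a rough illustration of that idea, here is a minimal Python sketch. The field names (`email`, `postal_code`), the two rules, and the `max_jump` threshold are all made up for the example; they are assumptions, not something taken from any particular tool.

```python
# Hypothetical data quality rules: each takes a record (a dict) and
# returns True when the record passes.
RULES = {
    "email_has_at_sign": lambda r: "@" in r.get("email", ""),
    "postal_code_present": lambda r: bool(r.get("postal_code", "").strip()),
}

def bad_percentage(records):
    """Percentage of records that break at least one rule."""
    if not records:
        return 0.0
    bad = sum(1 for r in records if not all(rule(r) for rule in RULES.values()))
    return 100.0 * bad / len(records)

def should_alert(history, latest, max_jump=5.0):
    """Alert when the latest bad percentage jumps well above the recent average.

    max_jump is in percentage points and is an arbitrary illustrative value.
    """
    if not history:
        return False
    baseline = sum(history) / len(history)
    return latest - baseline > max_jump

# Example: three quiet days, then a day where half of the entries are bad.
history = [2.0, 2.5, 1.8]
today = [
    {"email": "a@example.com", "postal_code": "K9J 2K2"},
    {"email": "not-an-email", "postal_code": ""},
]
latest = bad_percentage(today)
print(latest, should_alert(history, latest))
```

Comparing against a recent average rather than a fixed limit is just one possible choice; a real monitor would likely need tuning to keep false alarms down.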

Actually having a large fog horn wired up in the CEO’s office is optional, due to the challenges of false positives (detecting a “Bad” record when in fact the record is ok).

I would also be very cautious in having automatic data modification processes going on, but setting alerts that will notify those responsible for potential data quality issues is a relatively straightforward exercise and will at the very least improve your visibility of data quality trends. What you do to deal with them is up to you, but at least some of the battle is being aware of the problem.

It monitors a number of things, but the key here is to make a decision for every record if it is “Good” or “Bad”, and give some analysis of which data quality rules are broken. You can see in the upper right, there are row counts of infractions by a number of data quality rules. In the middle is the overall Bad vs Good split on the records.
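
The per-record Good/Bad decision and the infraction counts by rule could be computed along these lines. This is only a sketch in Python: the rule names and sample records are hypothetical, and it is not how Datamartist itself does it.

```python
from collections import Counter

# Hypothetical rules keyed by name; each returns True when the record passes.
RULES = {
    "name_not_blank": lambda r: bool(r["name"].strip()),
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
}

def classify(records):
    """Return (Good/Bad split, infraction count per rule)."""
    split = Counter()
    infractions = Counter()
    for record in records:
        # Names of every rule this record breaks.
        broken = [name for name, rule in RULES.items() if not rule(record)]
        split["Bad" if broken else "Good"] += 1
        infractions.update(broken)
    return split, infractions

records = [
    {"name": "Ann", "age": 34},
    {"name": "  ", "age": 29},
    {"name": "", "age": 150},
]
split, infractions = classify(records)
print(dict(split))
print(dict(infractions))
```

A record is counted as Bad once, however many rules it breaks, while each broken rule is counted separately, which is why the two totals can differ.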

The Datamartist Pro version can quite easily create automated data quality monitoring. Using visual blocks, and arranging them on a Canvas, Datamartist lets you create data quality rules, define profiling on selected columns, and then automatically, as a scheduled process, place the results either in Excel reports/dashboards or into a database for use by your reporting tool of choice.

By understanding how your data quality evolves, and catching problems early, you can reduce the amount of cleansing you need to do, and communicate clearly and regularly to your organisation the progress your data quality programs are making.

A New Year’s resolution to data profile

James Standen — Tue, 10 Jan 2012 15:54:05 +0000

Well, it’s the time of making and breaking resolutions, a time when setting realistic goals is sometimes hard to do with all the optimism of the new year.

Sometimes, we make decisions NOT to set a goal, because we don’t want to break it.

You might be thinking you really should step up your data quality monitoring- get some data profiling underway to help identify the data domains and areas you most want to tackle in 2012. But you might be also thinking that with all the pressures and cutbacks that many companies are facing, you don’t have the resources to implement a full scale profiling and monitoring effort, and so might decide to delay.

Don’t wait. Just do it. The perfect is the enemy of the good.

Rather than worrying about how much of your data you are going to be able to cover, or that you can’t devote enough resources to tackle all of your reference areas at once, work at the problem from another direction.

First, start with master data.

Master data is the data that all your other data is made from. It’s the data everyone uses to view the massive piles of transactional data, so one bad row in a master data table, and the impact is felt across perhaps hundreds of reports, and multiple time periods. If you have a product in the wrong category, then every transaction, across perhaps hundreds of customers, and all time, will be mis-categorized, and every total, sub-total and calculated metric using it will suffer.

While bad transactions are bad, bad reference data is deadly. Bad reference data takes a good transaction and messes it up.

Worst first!

Make a list of your reference tables/areas. Customer, Product, Chart of accounts, etc. etc. What are the most important for your business? This isn’t something I can tell you- you have to think about what is most critical.

If you are a company that purchases large amounts of materials from many vendors, and purchasing decisions are fast paced and critical, then maybe it’s your vendor master, and your accounts payable.

On the other hand, if you have lots of interaction with your customers, and errors in the customer master cost you business, then start with that.

The key is to first make the list, and then think to yourself “if I have bad quality data, where am I most afraid it will be?” Start profiling there. You want to find the worst first, and fixing that will have the greatest positive impact.

Get to know your data

Don’t worry about setting complex or work intensive goals right away. Data profiling is about data discovery sometimes. You need to wade into your reference data, play with it, tease out patterns and relationships. As you get to know your data, you will be able to better identify where there are issues to tackle, and where root causes might lie for data quality issues.

One approach might be to simply resolve to spend an hour a week, every week, profiling some data. If you aren’t doing that now, you will find that even just a bit of time set aside will give huge insight- sometimes we get too busy to do the basics, and we miss opportunities to make significant improvements with relatively little effort in our data.

Data Quality Rules

James Standen — Thu, 16 Jun 2011 17:00:07 +0000

What’s the difference between good data and bad data? It is much like the difference between good children and bad children- the bad data doesn’t follow the rules.

But what are the rules? Unlike the rules for kids, which have been fixed in stone for decades (or at least, parents wish it were so), the rules for data are slippery things that depend very much on the context and the database.

While it’s a complex subject, some basic rules of thumb can avoid the deeper rabbit holes.

The first thing to understand about Data Quality rules is they aren’t as easy as they may look. Data is in theory something in the ordered world of computers, but in reality is in the “flexible” world of humans. A huge amount of data is entered by members of the group “Homo sapiens” (or mutilated by software written by members of that group) and as a result is not as ordered as we would all like.

The challenge for data quality practitioners is to remove the chaos injected by those highly involved primates (us) and make the data the sterile, ordered, never any question about anything type that we all imagine in our fantasies.

But how?

In the end, it is amazing how powerful and complex the various solutions to this problem are.

But I suggest that there are some basic principles that can help guide us.

First- do no harm.

One of the risks of any data quality initiative is that it actually screws up the data more. Don’t define rules that are so complex, and so sure of themselves that they actually make the data worse. Be humble. Don’t change data unless you are pretty sure it’s a good idea. Err on the side of not screwing up the original. And keep a copy of the original- so if things do go off the rails you can undo- or at least try to understand what went wrong.

Go out and talk to the people

Don’t sit in your ivory tower and speculate as to what the data means. Go out there and watch people enter it in. See what real world type things are happening that never make it into bits and bytes.

Attack the basics first

Focus your first efforts on dealing with the basics- they will resolve the vast majority of the issues- don’t chase after the outliers until you have the “easy” cases taken care of- the tough stuff is a case of diminishing returns- look first at how to fix processes and train your people to make the majority of typical data entry cases more accurate before you start looking into artificial intelligence based hyper-multi-semantic-algorithmic-learning-matching-holistic-flux-capacitor data quality systems.

Less is more- the fewer rules the better.

So what’s the rule about making rules? Try to make fewer rules, and test them in a pragmatic way. It is possible to have so many rules that the rules themselves have data quality issues- don’t go there.

Sometimes the simplest things will bring the greatest benefit.

In the coming weeks, I’ll be posting about how to design, implement and monitor Data quality rules using the Datamartist tool.

Data quality sizzle

James Standen — Tue, 22 Mar 2011 18:08:56 +0000

I’m an engineer. Being an engineer, I’m pretty product focused, pretty technology focused, and pretty “does it work or not” focused.

Having technical things like tools work is useful, and good. But just because you build it, does not mean they will come.

The challenge in Data Quality is that often what has to change even more than the technology or tools is the behaviours and perspectives of the people in the organisation with data quality issues. At the very least, the users have to use the tools. Very few data quality solutions are of the “full autopilot” bad-data-goes-in-here-good-comes-out-here type.

As much as we engineers would like to solve everything with software, people are involved in Data Quality.

While a fantastic bit of data profiling analysis or an elegant and powerful data transform would seem to be enough, the truth is sometimes how and when you present these things is key to getting the non-engineer people to buy in.

Sometimes preparing people over time, and introducing things in a step by step way helps them understand, and makes the technology and the change required less daunting.

Because I’m looking out my window at a tentative (very tentative it’s only March after all) spring day here in Toronto, I’m going to use a summer barbecue analogy.

The tools and technology are the steak. The steak is key to the party. In the end (at least for me in this analogy) the steak delivers most of the value in your summer BBQ party value proposition, but you’ll have more guests and be more successful overall if you package the whole.

Sometimes, part of selling the steak is the sizzle, the preparation, the things around the steak.

It’s the smell of the BBQ getting ready, it’s the sound of the steak hitting the grill- it’s the cold drink, the conversation, the games on the lawn for the kids.

In the end, even if you know that 90% of the deal was that steak, if you just put a steak on a plate and give it to each guest the moment they arrive, it’s just not going to get the same response.

In my usual roundabout way the point I’m trying to get to is that you can’t solve technical problems, then drop them on people’s desks and say “do it”. You need to invite them to the party. Prepare them for the menu, ask preferences, give them some time to hear the sizzle, smell the charcoal, enjoy the sunshine in expectation of that steak.

Steak is good. Remember to plan some sizzle too.

Data quality challenges: behavioral inertia and its evil opposite

James Standen — Tue, 05 Oct 2010 16:39:04 +0000

Often, I hear someone say something like “this would be much easier if users would just…” or “If only we could convince the sales people that…”. Technology folks often are frustrated by the people component of the complex systems they are trying to install.

People are not a problem solved by technology

Some try to ignore the issue, or solve it with technology alone- “If we write complex enough validation into the data entry form people HAVE to enter good data” or “Our matching algorithms will resolve the issues in real time.”

Others try to use sophisticated training, documentation, bonus plans or punishment plans to get the behavior they want.

Obviously, components of both approaches are going to be used to some extent- but don’t lose sight of the fact that people ARE the process- and the heart of your business. It’s the sales guys that drive revenue, and it’s the sales order people, or help desk operators, or engineers in your manufacturing facilities that you are building the new system for that are creating all the value. You are a person too- think about their motivations, and how to take advantage of their abilities and enthusiasm- not how to remove them from the equation.

I often think that there are two powerful forces at work in the minds of all of us- oddly, they are opposites, and yet can co-exist even in the same person at the same moment. Some people are strongly to one side or the other.

Behavioral Inertia: Change is bad

We’ve all seen this resistance to change, and in many cases people have this tendency for good reasons (that last disastrous ERP implementation where the new processes were not properly checked, and everyone worked 15 hour days for weeks while customers were screaming into their phones about how screwed up everything was, for example.)

Remember, resistance to bad change that is going to screw everything up is a good thing.

In other cases, however, it is unfounded, and it is a real problem- things have to change to move forward. Sometimes risks have to be taken, and there will be bumpy periods before a much better steady state is achieved.

People have a natural resistance to this because change is the unknown.

Hyper Active change syndrome: We can’t wait to do it right- we have to act NOW

This is the evil opposite twin of behavioral inertia. (It’s like that episode when Captain Kirk got split in a transporter incident- you know.)

You can identify people with this force at work by phrases like “We’re a dynamic organisation, we’re being proactive not reactive, our processes are fluid- it’s the way it is with business in the fast lane” or my personal favorite- “We don’t have time to get the data, we’ll have to go with our gut.”

Hyperactive changers will often try to get their way by always creating a sense of urgency: “The technology isn’t moving fast enough for us, we can’t wait for those changes to be approved, all the process is slowing us down, our customers are demanding speed”

Hyperactive changers are dangerous because they often ignore or circumvent processes in the name of expediency, generating risk and forcing others to waste effort compensating, and generally causing chaos. They want to change things so often, that efficiencies of new processes are never realized- everyone is on a constant learning curve and never gets in the groove.

Balance the forces, find your high-speed tortoise

Think of the story of the Tortoise and the Hare. The Hare, with all its speed, could not figure out that the process was start, run, finish, and completely wasted his speed advantage by having a nap.

On the other hand, while the Tortoise’s complete dedication to his goal and process is admirable, you can’t count on the incompetence of your competition. (And now that all Hares are no doubt told this story throughout their childhood, it’s unlikely many tortoises get away with the same trick.)

The key lies in between- we need to work with our organization to foster an environment where we value process, and consistency, but understand that a steady, relentless change to optimize is needed, and valuable. When one or the other of our behavioral urges overcomes us, we’ll find that people are the problem in our initiatives. If we balance them, and communicate with everybody, we can find ways to make things work, even without perfect cooperation at all times from everyone.

Not too slow, not too fast, always value process without letting it be your slave master. And for goodness sake, forget about going with your gut- go out and get some DATA!

An introduction to using regular expressions for data quality validation

James Standen — Thu, 23 Sep 2010 17:43:31 +0000

Regular expressions (sometimes referred to as regex or regexp) are a powerful formal language that can be used to match text strings to patterns.

The way regular expressions work is like this:

A pattern is defined. This is a string of symbols that act as a set of rules.
A text string to test, and the pattern are given to a regular expression engine, and compared.
The engine returns a true/false value meaning the string follows the rules or not (“PASSED” or “REJECTED” by the pattern).
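
The three steps above can be sketched in Python using its standard `re` module. The five-digit pattern here is a deliberately simple, made-up rule chosen only to show the flow, and `check` is a hypothetical helper name.

```python
import re

# Step 1: define a pattern. This hypothetical rule says the text must be
# exactly five digits.
pattern = re.compile(r"[0-9]{5}")

def check(text):
    # Steps 2 and 3: give the text and the pattern to the regex engine,
    # then turn the match result into a PASSED/REJECTED verdict.
    # fullmatch() only succeeds when the whole string fits the pattern.
    return "PASSED" if pattern.fullmatch(text) else "REJECTED"

print(check("90210"))   # five digits
print(check("9021O"))   # letter O instead of zero
```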

This is obviously very useful for someone interested in Data Quality- If you had a pattern that said “Is this a valid email address?”, and got PASS or REJECT back, it would give you a good idea as to the quality of that field in your contact database.

One advantage of regular expressions is that because they are widely used, lots and lots of them have been created, detecting all sorts of patterns- meaning that while you can write your own, you can also look up useful ones you need in libraries.

Regular expressions aren’t magic, of course- the result is only as good as the program. (As always.) Depending on how well the regex is written (or not) there may be false positives or negatives.

A regex example- Canadian Postal Code

Let’s look at a simple example, and see how they work. Being from Canada, I’m going to use the example of validating a Canadian postal code.

Canadian Postal codes take the format ANA NAN, where “A” is a letter and “N” is a number. So what we want is a regular expression that will return TRUE for valid postal codes, and FALSE for postal codes that just can’t be right. “K9J 2K2” could be a valid postal code, but we know that “38X AB2” just can’t be.

In a regular expression, we use anchors to say where to start matching. In this case, we want just the Canadian postal code, so we’ll use the anchor character “^” to specify the beginning of the string.

To specify that a character has to be within a given set or range of characters, we use square brackets. So to match whenever the first character of a string is a letter, the regex would be:

^[a-zA-Z]

This regex will return TRUE for all strings that start with a letter. That’s fine, but not yet specific enough for a Canadian postal code (we Canadians are very very picky).

So we can add a number, then another letter constraint to our pattern.

^[a-zA-Z][0-9][a-zA-Z]

So far, any string that starts with ANA will result in TRUE- we’re almost halfway there! Next, we want to specify that the space is optional- that is, it’s acceptable to have the space or not.
To do this, we use the “?” to specify that the space is optional. And then to finish up, we add the part of the expression that detects the NAN, and end with a dollar sign which specifies that that needs to be the end of the string (otherwise all strings that started with a valid postal code would pass):

^[a-zA-Z][0-9][a-zA-Z][ ]?[0-9][a-zA-Z][0-9]$
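
To try this pattern out, here is a small Python sketch using the standard `re` module. The function name `looks_like_postal_code` and the sample strings are my own choices for illustration.

```python
import re

# The ANA NAN pattern from above, unchanged.
POSTAL_CODE = re.compile(r"^[a-zA-Z][0-9][a-zA-Z][ ]?[0-9][a-zA-Z][0-9]$")

def looks_like_postal_code(text):
    """True when the text has the ANA NAN shape, with or without the space."""
    return POSTAL_CODE.search(text) is not None

for candidate in ["K9J 2K2", "k9j2k2", "38X AB2", "K9J 2K2 Canada"]:
    print(candidate, looks_like_postal_code(candidate))
```

One Python-specific caveat: `$` also matches just before a trailing newline, so it is worth stripping the input first (or using `fullmatch`) if the data may carry line endings.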

So there it is- or is it?

But is the REGEX fussy enough?

While this pattern that we’ve created does detect the ANA NAN pattern, and even allows the space to be optional, if you know Canadian postal codes, you’ll know that in fact ANA NAN is not enough by itself. There are only certain letters that actually exist in certain locations. So a better REGEX pattern for Canadian postal code validity would be the following:

^[abceghjklmnprstvxyABCEGHJKLMNPRSTVXY][0-9][abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][ ]?[0-9][abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][0-9]$

This pattern explicitly lists valid letters. Canadian postal codes do not use the letters D, F, I, O, Q or U anywhere and they do not use W or Z in the first position. Of course, this brings up another issue with any data quality method- remember Canada Post could decide to change the rules- then your data quality test would need to be updated.
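
Here is one way the stricter check might look in Python. As a shortening of my own, it lists only the upper case letters and relies on `re.IGNORECASE` rather than spelling out both cases; the names `FIRST`, `OTHER` and `is_plausible_postal_code` are hypothetical.

```python
import re

FIRST = "ABCEGHJKLMNPRSTVXY"      # first position: no D, F, I, O, Q, U, W or Z
OTHER = "ABCEGHJKLMNPRSTVWXYZ"    # other letter positions: no D, F, I, O, Q or U

STRICT_POSTAL_CODE = re.compile(
    rf"[{FIRST}][0-9][{OTHER}] ?[0-9][{OTHER}][0-9]",
    re.IGNORECASE,
)

def is_plausible_postal_code(text):
    """True when the whole string fits the stricter letter rules."""
    return STRICT_POSTAL_CODE.fullmatch(text) is not None

for candidate in ["K9J 2K2", "k9j2k2", "W9J 2K2", "K9D 2K2"]:
    print(candidate, is_plausible_postal_code(candidate))
```

Keeping the allowed letters in two named strings means that if the rules ever change, there is one obvious place to update.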

Ok, so that means the Postal code is ok right?… uh, No.

So this regular expression will detect if a text string of length 6 or 7 is a valid Canadian postal code- but remember that this alone is probably not enough. Chances are that this postal code is stored as part of an address, which will also include the city and province. In Canada, postal codes are of course tied to a given province or territory- the first letter defines the area, and each area exists within a particular province (large provinces have more than one letter assigned to them, and the Northwest Territories and Nunavut share the letter X).

This means that a properly formed postal code could be invalid- for example, an address in Quebec that has a postal code that starts with the letter “V” which is for British Columbia has clearly got something amiss.
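
A cross-field check like this could be sketched in Python as follows. The first-letter mapping is written from general knowledge of the Canada Post scheme and should be verified before anyone relies on it; the two-letter province codes and the function name are assumptions for the example.

```python
# First letter of the postal code -> province/territory codes it can belong to.
# Assumption: mapping written from general knowledge, not from an official file.
FIRST_LETTER_TO_PROVINCES = {
    "A": {"NL"}, "B": {"NS"}, "C": {"PE"}, "E": {"NB"},
    "G": {"QC"}, "H": {"QC"}, "J": {"QC"},
    "K": {"ON"}, "L": {"ON"}, "M": {"ON"}, "N": {"ON"}, "P": {"ON"},
    "R": {"MB"}, "S": {"SK"}, "T": {"AB"}, "V": {"BC"},
    "X": {"NT", "NU"}, "Y": {"YT"},
}

def postal_code_matches_province(postal_code, province):
    """True when the first letter of the code is used in the given province."""
    first_letter = postal_code.strip()[:1].upper()
    allowed = FIRST_LETTER_TO_PROVINCES.get(first_letter, set())
    return province.upper() in allowed

print(postal_code_matches_province("V6B 1A1", "QC"))  # a V code on a Quebec address
print(postal_code_matches_province("H2X 1Y4", "QC"))
```

This only checks consistency between two fields; it says nothing about whether the code actually exists.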

So while learning a bit about regular expressions, we’ve also learned that if you had a big mailing list to clean you would probably want to use a dedicated tool- postal addresses are an area of data quality where lots of attention has been paid over the years, and writing a lot of custom logic and regex patterns is probably not a good use of your time. But for application specific codes and strings it might be very useful. In my next post, we’ll look at some more tricks with regular expressions that can be used to analyze data quality.

I’ve posted a small collection of useful regular expressions to the datamartist website here.

Datamartist V1.3.0 PRO and Regular expressions

The professional edition of the Datamartist tool provides a function REGEX(text,regex expression) that returns TRUE or FALSE depending on if the text “matches” with the regular expression specified. This function can be used anywhere in Datamartist where expressions are available, making it a powerful way to test if a string matches one or more patterns.

When should you data profile? Morning, Noon and Night!

James Standen — Wed, 22 Sep 2010 13:16:26 +0000

Data profiling is an important part of any data related project. The question often arises of when the best time to data profile is. As you would expect from a software company that sells a really cool visual data profiling tool, our view is “all the time”.

Using data profiling tools before the project

Data profiling is useful even before the project is defined. By doing a first higher level data profiling on key data sets, you will get:

better project scope definition
a more accurate budget estimate
a clear baseline from which improvement can be measured
the ability to correctly manage expectations

The last two are important ones. Making wild promises based on what the data model says should be in the tables and then failing to deliver due to data quality issues is not nearly as career enhancing as defining an ambitious but doable scope and delivering based on the actual data, clearly communicating the progress made based on facts.

Data quality issues can seriously (double digit percentage seriously) affect the final cost of a project. Knowing about issues will let you set a realistic budget.

Data profiling at the beginning of the project

After the initial higher level data profiling done before the project, budgeting for a more detailed data profiling of the source data will:

give clear design guidance to ETL developers
clearly identify the subject matter experts needed to understand the data, and let you engage them early- rather than in a rush when ETL development hits the underlying data issues, and the project is already running late and over budget.

Data profiling during the project

By setting up automated data profiling tasks for the output of each ETL process, ETL developers and architects can track the progress for migration, cleansing or conversion tasks using concrete information.
Objective criteria can be set for each profile task to determine what level of data quality is considered “acceptable” for the final data load or fact set deliverables.
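
As a rough sketch of what an automated profile task with an objective criterion might look like, here is a small Python example. The column name, the sample ETL output, and the 30% blank limit are invented for illustration.

```python
def profile_column(rows, column):
    """Basic profile of one column: row count, blanks, distinct non-blank values."""
    values = [row.get(column) for row in rows]
    non_blank = [v for v in values if v is not None and str(v).strip() != ""]
    return {
        "rows": len(values),
        "blanks": len(values) - len(non_blank),
        "distinct": len(set(non_blank)),
    }

def is_acceptable(profile, max_blank_pct):
    """Objective criterion: the blank percentage must not exceed the agreed limit."""
    if profile["rows"] == 0:
        return False  # an empty output is treated as a failure in this sketch
    return 100.0 * profile["blanks"] / profile["rows"] <= max_blank_pct

# Hypothetical output of one ETL step.
etl_output = [
    {"customer_id": 1, "region": "East"},
    {"customer_id": 2, "region": ""},
    {"customer_id": 3, "region": "West"},
    {"customer_id": 4, "region": "East"},
]
profile = profile_column(etl_output, "region")
print(profile, is_acceptable(profile, max_blank_pct=30.0))
```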

Data profiling at the end of the project

Doing a final data profiling run, and comparing it to the baseline established before the project will provide a clear Before/After view that will not only clearly communicate the progress made, but also assist in justifying and promoting the next data quality initiative.
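
The Before/After view can be as simple as subtracting one profile run from the other. A minimal Python sketch, with metric names and numbers invented purely for illustration:

```python
def compare_profiles(baseline, final):
    """Change in each metric present in both runs (final minus baseline)."""
    return {k: final[k] - baseline[k] for k in baseline if k in final}

# Invented example numbers, not real results.
baseline = {"rows": 10000, "blank_emails": 1200, "invalid_postal_codes": 850}
final = {"rows": 10000, "blank_emails": 90, "invalid_postal_codes": 40}

for metric, change in compare_profiles(baseline, final).items():
    print(metric, change)
```

Negative numbers here mean fewer problem records than at the start of the project.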

Too much data storage hurts data quality- the toothpaste effect

James Standen — Thu, 09 Sep 2010 15:36:34 +0000

When I brush my teeth there is a wide range in terms of amount of toothpaste that is acceptable to me. This is not a profound statement- bear with me.

Only as the tube of toothpaste starts getting near to its end do I start conserving toothpaste because I know I need to make it last.

Another example is the all you can eat buffet- we eat because it’s there and we can. Unlike wasting toothpaste, this has more immediate negative consequences.

When there is lots of something, we tend to use more of it than we should.

When the tube of enterprise storage capacity seems to be always full, and when massive databases make an all-you-can-store buffet the standard mode of operation, very often the tendency is to store everything.

Rather than try to determine what information is of a useful level of quality, or focusing on the key information (and ensuring it IS of useful data quality), we stuff our systems full of every type of field and attribute, with massive bloated forms that are too long for anyone to really fill out properly.

Sadly, this doesn’t matter because there are too many fields to check anyways (who can define so many business and data quality rules?), so no one is checking.

If we were forced to make a choice between data A and data B, we might think a bit more about which is more useful for answering key business questions (and by connection, actually think about what the key business questions are).

Instead, how many times have I heard an overworked, rushed subject matter expert say – “Just collect it all, we might need it.”

By collecting more, we end up with less.