<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datamartist.com &#187; Data Quality</title>
	<atom:link href="http://www.datamartist.com/tag/data-quality/feed" rel="self" type="application/rss+xml" />
	<link>http://www.datamartist.com</link>
	<description>Reduce cost with self serve data transformation</description>
	<lastBuildDate>Wed, 25 Jan 2012 15:47:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>A New Year's resolution to data profile</title>
		<link>http://www.datamartist.com/a-new-years-resolution-to-data-profile</link>
		<comments>http://www.datamartist.com/a-new-years-resolution-to-data-profile#comments</comments>
		<pubDate>Tue, 10 Jan 2012 15:54:05 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Reality Check]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=6165</guid>
		<description><![CDATA[Well, it's the time of making and breaking resolutions, a time when setting realistic goals is sometimes hard to do with all the optimism of the new year. Sometimes, we make decisions NOT to set a goal, because we don't want to break it. You might be thinking you really should step up your data [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.datamartist.com/wp-content/uploads/2012/01/data-profiling-some-data.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2012/01/data-profiling-some-data-300x225.jpg" alt="" title="data-profiling-some-data" width="300" height="225" class="alignright size-medium wp-image-6171" /></a>Well, it's the time of making and breaking resolutions, a time when setting realistic goals is sometimes hard to do with all the optimism of the new year.  </p>
<p>Sometimes, we make decisions NOT to set a goal, because we don't want to break it.  </p>
<p>You might be thinking you really should step up your data quality monitoring- get some data profiling underway to help identify the data domains and areas you most want to tackle in 2012.  But you might also be thinking that with all the pressures and cutbacks that many companies are facing, you don't have the resources to implement a full scale profiling and monitoring effort, and so might decide to delay. </p>
<p>Don't wait. Just do it.  The perfect is the enemy of the good.</p>
<p>Rather than worrying about how much of your data you are going to be able to cover, or that you can't devote enough resources to tackle all of your reference areas at once, work at the problem from another direction.  </p>
<h1>First, start with master data.</h1>
<p>Master data is the data that all your other data is made from.  It's the data everyone uses to view the massive piles of transactional data, so the impact of one bad row in a master data table is felt across perhaps hundreds of reports, and multiple time periods.  If you have a product in the wrong category, then every transaction for that product, across perhaps hundreds of customers, and all time, will be miscategorized, and every total, sub-total and calculated metric using it will suffer.</p>
<p>While bad transactions are bad, bad reference data is deadly.  Bad reference data takes a good transaction and messes it up.</p>
<h1>Worst first!</h1>
<p>Make a list of your reference tables/areas.  Customer, Product, Chart of accounts, etc.  Which are the most important for your business?  This isn't something I can tell you- you have to think about what is most critical.</p>
<p>If you are a company that purchases large amounts of materials from many vendors, and purchasing decisions are fast paced and critical, then maybe it's your vendor master, and your accounts payable.</p>
<p>On the other hand, if you have lots of interaction with your customers, and errors in the customer master cost you business, then start with that.</p>
<p>The key is to first make the list, and then think to yourself "if I have bad quality data, where am I most afraid it will be?"  Start profiling there.  You want to find the worst first, and fixing that will have the greatest positive impact.</p>
<h1>Get to know your data</h1>
<p>Don't worry about setting complex or work intensive goals right away.  Data profiling is about data discovery sometimes.  You need to wade into your reference data, play with it, tease out patterns and relationships.  As you get to know your data, you will be able to better identify where there are issues to tackle, and where root causes might lie for data quality issues.</p>
<p>One approach might be to simply resolve to spend an hour a week, every week, profiling some data.  If you aren't doing that now, you will find that even just a bit of time set aside will give huge insight- sometimes we get too busy to do the basics, and we miss opportunities to make significant improvements in our data with relatively little effort.</p>
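<p>As a rough idea of what a first hour of profiling can look like, here is a minimal Python sketch that counts what is actually in one reference column.  The function name, the sample category values and the choice of statistics are all assumptions made for illustration, not the output of any particular profiling tool.</p>

```python
from collections import Counter

def profile_column(values):
    """Return basic profile statistics for one column of raw string values."""
    total = len(values)
    # Treat None and blank or whitespace-only strings as missing.
    present = [v.strip() for v in values if v is not None and v.strip()]
    counts = Counter(present)
    return {
        "rows": total,
        "missing": total - len(present),
        "distinct": len(counts),
        "top_values": counts.most_common(3),
    }

# Hypothetical product-category column from a product master table.
categories = ["Widgets", "widgets", "Widgets ", None, "Gadgets", "", "Gadgets", "Widgets"]
profile = profile_column(categories)
print(profile)
```

<p>Even on this tiny made-up sample, the distinct count shows "Widgets" and "widgets" being treated as two different categories, which is exactly the kind of pattern worth teasing out of reference data.</p>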
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/a-new-years-resolution-to-data-profile/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data Quality Rules</title>
		<link>http://www.datamartist.com/data-quality-rules</link>
		<comments>http://www.datamartist.com/data-quality-rules#comments</comments>
		<pubDate>Thu, 16 Jun 2011 17:00:07 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[data culture]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Reality Check]]></category>
		<category><![CDATA[Data Quality rules]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=5995</guid>
		<description><![CDATA[What's the difference between good data and bad data? It is much like the difference between good children and bad children- the bad data doesn't follow the rules. But what are the rules? Unlike the rules for kids, which have been fixed in stone for decades (or at least, parents wish it were so), the [...]]]></description>
			<content:encoded><![CDATA[<p>What's the difference between good data and bad data?  It is much like the difference between good children and bad children- the bad data doesn't follow the rules.<br />
<a href="http://www.datamartist.com/wp-content/uploads/2011/04/data-quality-rules-data-freedom-or-death.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2011/04/data-quality-rules-data-freedom-or-death-300x269.jpg" alt="" title="data-quality-rules-data-freedom-or-death" width="300" height="269" class="alignright size-medium wp-image-6011" /></a><br />
But what are the rules?  Unlike the rules for kids, which have been fixed in stone for decades  (or at least, parents wish it were so), the rules for data are slippery things that depend very much on the context and the database.</p>
<p>While it's a complex subject, some basic rules of thumb can avoid the deeper rabbit holes.</p>
<p>The first thing to understand about Data Quality rules is they aren't as easy as they may look.  Data is in theory something in the ordered world of computers, but in reality is in the "flexible" world of humans.  A huge amount of data is entered by members of the group "Homo sapiens" (or mutilated by software written by members of that group) and as a result is not as ordered as we would all like.</p>
<p>The challenge for data quality practitioners is to remove the chaos injected by those highly involved primates (us) and make the data the sterile, ordered, never any question about anything type that we all imagine in our fantasies.</p>
<p>But how?</p>
<p>In the end, it is amazing how powerful and complex the various solutions to this problem are.</p>
<p>But I suggest that there are some basic principles that can help guide us.</p>
<h2>First- do no harm.</h2>
<p>One of the risks of any data quality initiative is that it actually screws up the data more.  Don't define rules that are so complex, and so sure of themselves that they actually make the data worse.  Be humble. Don't change data unless you are pretty sure it's a good idea.  Err on the side of not screwing up the original.  And keep a copy of the original- so if things do go off the rails you can undo- or at least try to understand what went wrong.</p>
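<p>One way to keep a copy of the original is to store the untouched value right next to the cleaned one.  The sketch below assumes rows are plain Python dictionaries; the function name, the "_original" suffix and the sample rule are illustrative choices, not a prescribed method.</p>

```python
def clean_with_audit(rows, field, rule):
    """Apply rule to one field, keeping the original value beside the cleaned one."""
    cleaned = []
    for row in rows:
        new_row = dict(row)                        # copy, so the source row is never mutated
        new_row[f"{field}_original"] = row[field]  # keep the untouched value for undo or review
        new_row[field] = rule(row[field])
        cleaned.append(new_row)
    return cleaned

# Hypothetical customer rows and a deliberately simple, low-risk rule.
rows = [{"city": " toronto "}, {"city": "OTTAWA"}]
result = clean_with_audit(rows, "city", lambda s: s.strip().title())
print(result)
```

<p>Because the source rows are left alone and every cleaned row carries its original value, a rule that turns out to be wrong can be reversed, or at least investigated.</p>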
<h2>Go out and talk to the people</h2>
<p>Don't sit in your ivory tower and speculate as to what the data means.  Go out there and watch people enter it in.  See what real world type things are happening that never make it into bits and bytes.</p>
<h2>Attack the basics first</h2>
<p>Focus your first efforts on dealing with the basics- they will resolve the vast majority of the issues- don't chase after the outliers until you have the "easy" cases taken care of- the tough stuff is a case of diminishing returns- look first at how to fix processes and train your people to make the majority of typical data entry cases more accurate before you start looking into artificial intelligence based hyper-multi-semantic-algorithmic-learning-matching-holistic-flux-capacitor data quality systems.</p>
<h2>Less is more- the fewer rules the better.</h2>
<p>So what's the rule about making rules?  Try to make fewer rules, and test them in a pragmatic way.  It is possible to have so many rules that the rules themselves have data quality issues- don't go there.</p>
<p>Sometimes the simplest things will bring the greatest benefit.</p>
<p>In the coming weeks, I'll be posting about how to design, implement and monitor Data quality rules using the <a href="/">Datamartist tool</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-quality-rules/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data quality sizzle</title>
		<link>http://www.datamartist.com/data-quality-sizzle</link>
		<comments>http://www.datamartist.com/data-quality-sizzle#comments</comments>
		<pubDate>Tue, 22 Mar 2011 18:08:56 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Project Management]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=5985</guid>
		<description><![CDATA[I'm an engineer. Being an engineer, I'm pretty product focused, pretty technology focused, and pretty "does it work or not" focused. Having technical things like tools work is useful, and good. But just because you build it, does not mean they will come. The challenge often in Data Quality is that often what has to [...]]]></description>
			<content:encoded><![CDATA[<p>I'm an engineer. Being an engineer, I'm pretty product focused, pretty technology focused, and pretty "does it work or not" focused.  </p>
<p>Having technical things like tools work is useful, and good.  But just because you build it, does not mean they will come.</p>
<p>The challenge in Data Quality is that often what has to change even more than the technology or tools is the behaviours and perspectives of the people in the organisation with data quality issues.  At the very least, the users have to use the tools.  Very few data quality solutions are of the "full autopilot" bad-data-goes-in-here-good-comes-out-here type.</p>
<p>As much as we engineers would like to solve everything with software, people are involved in Data Quality.  </p>
<p>While a fantastic bit of data profiling analysis or an elegant and powerful data transform would seem to be enough, the truth is sometimes how and when you present these things is key to getting the non-engineer people to buy in.  </p>
<p>Sometimes preparing people over time, and introducing things in a step by step way helps them understand, and makes the technology and the change required less daunting.</p>
<p><a href="http://www.datamartist.com/wp-content/uploads/2011/03/red-bbq.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2011/03/red-bbq-300x199.jpg" alt="" title="red-bbq" width="300" height="199" class="alignright size-medium wp-image-5987" /></a>Because I'm looking out my window at a tentative (very tentative it's only March after all) spring day here in Toronto, I'm going to use a summer barbecue analogy.</p>
<p>The tools and technology are the steak.  The steak is key to the party.   In the end (at least for me in this analogy) the steak delivers most of the value in your summer BBQ party value proposition, but you'll have more guests and be more successful overall if you package the whole experience. </p>
<p>Sometimes, part of selling the steak is the sizzle, the preparation, the things around the steak.</p>
<p>It's the smell of the BBQ getting ready, it's the sound of the steak hitting the grill- it's the cold drink, the conversation, the games on the lawn for the kids.</p>
<p>In the end, even if you know that 90% of the deal was that steak, if you just put a steak on a plate and give it to each guest the moment they arrive, it's just not going to get the same response.</p>
<p>In my usual roundabout way the point I'm trying to get to is that you can't solve technical problems, then drop them on people's desks and say "do it".  You need to invite them to the party.  Prepare them for the menu, ask preferences, give them some time to hear the sizzle, smell the charcoal, enjoy the sunshine in expectation of that steak.</p>
<p>Steak is good.  Remember to plan some sizzle too.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-quality-sizzle/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data quality challenges: behavioral inertia and its evil opposite</title>
		<link>http://www.datamartist.com/data-quality-challenges-behavioral-inertia-and-its-evil-opposite</link>
		<comments>http://www.datamartist.com/data-quality-challenges-behavioral-inertia-and-its-evil-opposite#comments</comments>
		<pubDate>Tue, 05 Oct 2010 16:39:04 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[data culture]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=5468</guid>
		<description><![CDATA[Often, I hear someone say something like "this would be much easier if users would just..." or "If only we could convince the sales people that...". Technology folks often are frustrated by the people component of the complex systems they are trying to install. People are not a problem solved by technology Some try to [...]]]></description>
			<content:encoded><![CDATA[<p>Often, I hear someone say something like "this would be much easier if users would just..." or "If only we could convince the sales people that...".   Technology folks often are frustrated by the people component of the complex systems they are trying to install.</p>
<h2>People are not a problem solved by technology</h2>
<p>Some try to ignore the issue, or solve it with technology alone-  "If we write complex enough validation into the data entry form people HAVE to enter good data" or "Our matching algorithms will resolve the issues in real time."</p>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/10/users-will-lose-chair-if-data-quality-suffers1.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/10/users-will-lose-chair-if-data-quality-suffers1.jpg" alt="" title="users-will-lose-chair-if-data-quality-suffers" width="311" height="229" class="alignright size-full wp-image-5473" /></a>Others try to use sophisticated training, documentation, bonus plans or punishment plans to get the behavior they want.</p>
<p>Obviously, components of both approaches are going to be used to some extent- but don't lose sight of the fact that people ARE the process- and the heart of your business.  It's the sales guys that drive revenue, and it's the sales order people, or help desk operators, or engineers in your manufacturing facilities that you are building the new system for that are creating all the value.   You are a person too- think about their motivations, and how to take advantage of their abilities and enthusiasm- not how to remove them from the equation.</p>
<p>I often think that there are two powerful forces at work in the minds of all of us- oddly, they are opposites, and yet can co-exist even in the same person at the same moment.  Some people are strongly to one side or the other.  </p>
<h2>Behavioral Inertia:  Change is bad</h2>
<p>We've all seen this resistance to change, and in many cases people have this tendency for good reasons (that last disastrous ERP implementation where the new processes were not properly checked, and everyone worked 15 hour days for weeks while customers were screaming into their phones about how screwed up everything was, for example.)</p>
<p>Remember, resistance to bad change that is going to screw everything up is a good thing.</p>
<p>In other cases, however, it is unfounded, and it is a real problem- things have to change to move forward.  Sometimes risks have to be taken, and there will be bumpy periods before a much better steady state is achieved.</p>
<p>People have a natural resistance to this because change is the unknown.</p>
<h2>Hyper Active change syndrome: We can't wait to do it right- we have to act NOW</h2>
<p>This is the evil opposite twin of behavioral inertia. (It's like that episode when Captain Kirk got split in a transporter incident- you know.) </p>
<p>You can identify people with this force at work by phrases like "We're a dynamic organisation, we're being proactive not reactive, our processes are fluid- it's the way it is with business in the fast lane" or my personal favorite- "We don't have time to get the data, we'll have to go with our gut."</p>
<p>Hyperactive changers will often try to get their way by always creating a sense of urgency: "The technology isn't moving fast enough for us, we can't wait for those changes to be approved, all the process is slowing us down, our customers are demanding speed"</p>
<p>Hyperactive changers are dangerous because they often ignore or circumvent processes in the name of expediency, generating risk and forcing others to waste effort compensating, and generally causing chaos.  They want to change things so often, that efficiencies of new processes are never realized- everyone is on a constant learning curve and never gets in the groove.</p>
<h2>Balance the forces, find your high-speed tortoise </h2>
<p>Think of the story of the Tortoise and the Hare.  The Hare, with all its speed, could not figure out that the process was start, run, finish, and completely wasted his speed advantage by having a nap.</p>
<p>On the other hand, while the Tortoise's complete dedication to his goal and process is admirable, you can't count on the incompetence of your competition.  (And now that all Hares are no doubt told this story throughout their childhood, it's unlikely many tortoises get away with the same trick.)</p>
<p>The key lies in between- we need to work with our organization to foster an environment where we value process, and consistency, but understand that a steady, relentless change to optimize is needed, and valuable.  When one or the other of our behavioral urges overcomes us, we'll find that people are the problem in our initiatives.  If we balance them, and communicate with everybody, we can find ways to make things work, even without perfect cooperation at all times from everyone. </p>
<p>Not too slow, not too fast, always value process without letting it be your slave master.  And for goodness sake, forget about going with your gut-  go out and get some DATA!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-quality-challenges-behavioral-inertia-and-its-evil-opposite/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>An introduction to using regular expressions for data quality validation</title>
		<link>http://www.datamartist.com/an-introduction-to-using-regular-expressions-for-data-quality-validation</link>
		<comments>http://www.datamartist.com/an-introduction-to-using-regular-expressions-for-data-quality-validation#comments</comments>
		<pubDate>Thu, 23 Sep 2010 17:43:31 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4694</guid>
		<description><![CDATA[Regular expressions (sometimes referred to as regex or regexp) are a powerful formal language that can be used to match text strings to patterns. The way regular expressions work is like this: A pattern is defined. This is a string of symbols that act as a set of rules. A text string to test, and [...]]]></description>
			<content:encoded><![CDATA[<p>Regular expressions (sometimes referred to as regex or regexp) are a powerful formal language that can be used to match text strings to patterns.</p>
<p>The way regular expressions work is like this:<br />
<a href="http://www.datamartist.com/wp-content/uploads/2010/07/regular-expressions-data-profiling-and-data-quality-overview.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/07/regular-expressions-data-profiling-and-data-quality-overview.jpg" alt="" title="regular-expressions-data-profiling-and-data-quality-overview" width="333" height="221" class="alignright size-full wp-image-4734" /></a></p>
<ol>
<li>A <strong>pattern</strong> is defined.  This is a string of symbols that act as a set of rules.</li>
<li>A <strong>text string to test</strong>, and the pattern are given to a regular expression engine, and compared.</li>
<li>The engine returns a <strong>true/false</strong> value meaning the string follows the rules or not ("PASSED" or "REJECTED" by the pattern)</li>
</ol>
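<p>The three steps above can be sketched in a few lines of Python using the standard <code>re</code> module.  The part-code rule here is a made-up example, and <code>check</code> is just an illustrative helper name; <code>re.fullmatch</code> is used so that the whole string has to follow the pattern.</p>

```python
import re

def check(pattern, text):
    """Return True if the whole text follows the pattern, False otherwise."""
    return re.fullmatch(pattern, text) is not None

# Hypothetical rule: a part code is two capital letters followed by four digits.
PART_CODE = r"[A-Z]{2}[0-9]{4}"

for candidate in ["AB1234", "ab1234", "AB12345"]:
    verdict = "PASSED" if check(PART_CODE, candidate) else "REJECTED"
    print(candidate, verdict)
```

<p>Only the first candidate passes: the second uses lowercase letters and the third has one digit too many.</p>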
<p>This is obviously very useful for someone interested in Data Quality-  If you had a pattern that said "Is this a valid email address?", and got PASS or REJECT back, it would give you a good idea as to the quality of that field in your contact database.</p>
<p>One advantage of regular expressions is that because they are widely used, lots and lots of them have been created, detecting all sorts of patterns- meaning that while you can write your own, you can also look up useful ones you need in libraries.</p>
<p>Regular expressions aren't magic, of course- the result is only as good as the program. (As always.)  Depending on how well the regex is written (or not) there may be false positives or negatives.</p>
<h2>A regex example- Canadian Postal Code</h2>
<p>Let's look at a simple example, and see how they work.  Being from Canada, I'm going to use the example of validating a Canadian postal code.</p>
<p>Canadian Postal codes take the format ANA NAN, where "A" is a letter and "N" is a digit.  So what we want is a regular expression that will return TRUE for valid postal codes, and FALSE for postal codes that just can't be right.  "K9J 2K2" could be a valid postal code, but we know that "38X AB2" just can't be.</p>
<p>In a regular expression, we use anchors to say where to start matching. In this case, we want just the Canadian postal code, so we'll use the anchor character "^" to specify the beginning of the string.</p>
<p>To specify that a character has to be within a given set or range of characters, we use square brackets.  So to match when ever the first character of a string is a letter, the regex would be:</p>
<p><code>^[a-zA-Z]</code></p>
<p>This regex will return TRUE for all strings that start with a letter. That's fine, but not yet specific enough for a Canadian postal code (we Canadians are very very picky).</p>
<p>So we can add a digit, then another letter constraint to our pattern.  Real postal codes can contain zeros ("K1A 0B1", for example), so the digit range is 0-9.</p>
<p><code>^[a-zA-Z][0-9][a-zA-Z]</code></p>
<p>So far, now any string that starts with ANA will result in true- we're almost halfway there! Next, we want to specify that the space is optional- that is, it's acceptable to have the space or not.<br />
To do this, we use the "?" to specify that the space is optional.  And then to finish up, we add the part of the expression that detects the NAN, and end with a dollar sign which specifies that that needs to be the end of the string (otherwise all strings that started with a valid postal code would pass):</p>
<p><code>^[a-zA-Z][0-9][a-zA-Z][ ]?[0-9][a-zA-Z][0-9]$</code></p>
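<p>To see how each stage of the pattern behaves, here is a small Python sketch that runs the three stages against a few sample strings.  The digit positions are written as <code>[0-9]</code>, since real codes such as "K1A 0B1" contain zeros; the constant names and the <code>passes</code> helper are illustrative assumptions.</p>

```python
import re

# The three stages of the pattern, using [0-9] for the digit positions.
STARTS_WITH_LETTER = re.compile(r"^[a-zA-Z]")
STARTS_WITH_ANA = re.compile(r"^[a-zA-Z][0-9][a-zA-Z]")
FULL_ANA_NAN = re.compile(r"^[a-zA-Z][0-9][a-zA-Z][ ]?[0-9][a-zA-Z][0-9]$")

def passes(pattern, text):
    """Return True when the text matches the compiled pattern."""
    return pattern.search(text) is not None

samples = ["K9J 2K2", "K9J2K2", "38X AB2", "K9J 2K2 extra"]
for text in samples:
    print(
        text,
        passes(STARTS_WITH_LETTER, text),
        passes(STARTS_WITH_ANA, text),
        passes(FULL_ANA_NAN, text),
    )
```

<p>The last sample shows why the closing "$" matters: it starts with a well-formed code, so the first two stages accept it, but the full pattern rejects it.  One Python-specific detail: "$" also matches just before a trailing newline, so <code>re.fullmatch</code> or <code>\Z</code> is the stricter choice if stray line breaks are a concern.</p>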
<p>So there it is- or is it?</p>
<h2>But is the REGEX fussy enough?</h2>
<p>While this pattern that we've created does detect the ANA NAN pattern, and even allows the space to be optional, if you know Canadian postal codes, you'll know that in fact ANA NAN is not enough by itself.  There are only certain letters that actually exist in certain locations.  So a better REGEX pattern for Canadian postal code validity would be the following:</p>
<p><code>^[abceghjklmnprstvxyABCEGHJKLMNPRSTVXY][0-9][abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][ ]?[0-9][abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][0-9]$</code></p>
<p>This pattern explicitly lists valid letters.  Canadian postal codes do not use the letters D, F, I, O, Q or U anywhere and they do not use W or Z in the first position.  Of course, this brings up another issue with any data quality method-  remember Canada Post could decide to change the rules- then your data quality test would need to be updated.</p>
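<p>The long pattern is easier to read and maintain if the allowed letters are named once and reused.  The following Python sketch is one possible way to write it, with the case-insensitive flag standing in for the doubled upper and lower case lists; the constant and function names are assumptions for illustration.</p>

```python
import re

FIRST = "ABCEGHJKLMNPRSTVXY"      # letters allowed in the first position
OTHER = "ABCEGHJKLMNPRSTVWXYZ"    # letters allowed in the third and fifth positions

STRICT_POSTAL_CODE = re.compile(
    rf"[{FIRST}][0-9][{OTHER}] ?[0-9][{OTHER}][0-9]",
    re.IGNORECASE,                # accept lower case without listing every letter twice
)

def is_plausible_postal_code(text):
    """Return True if text is a well-formed code that uses only allowed letters."""
    return STRICT_POSTAL_CODE.fullmatch(text) is not None

for code in ["K9J 2K2", "k9j2k2", "K1A 0B1", "W9J 2K2", "K9D 2K2"]:
    print(code, is_plausible_postal_code(code))
```

<p>If the rules ever change, only the two letter lists need updating, which keeps the data quality test easy to maintain.</p>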
<h2>Ok, so that means the Postal code is ok right?... uh, No.</h2>
<p>So this regular expression will detect if a text string of length 6 or 7 is a valid Canadian postal code- but remember that this alone is probably not enough.  Chances are that this postal code is stored as part of an address, which will also include the city and province.   In Canada, postal codes are of course unique to a given province- the first letter defines the area, and each area exists within a particular province (large provinces have more than one letter assigned to them).<br />
<a href="http://www.datamartist.com/wp-content/uploads/2010/07/canadian-postal-code-first-letter-regions-map.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/07/canadian-postal-code-first-letter-regions-map.jpg" alt="" title="canadian-postal-code-first-letter-regions-map" width="402" height="343" class="alignright size-full wp-image-4709" /></a></p>
<p>This means that a properly formed postal code could be invalid-  for example, an address in Quebec that has a postal code that starts with the letter "V" which is for British Columbia has clearly got something amiss.</p>
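<p>A cross-field check like this is beyond what a single regex on the postal code can do, but it is simple to sketch in Python.  The lookup table below maps first letters to two-letter province and territory abbreviations and is a simplification: a few real addresses near provincial borders may not follow it, so treat a mismatch as something to review rather than a certain error.  The function and table names are illustrative.</p>

```python
# First letter of the postal code -> provinces/territories it normally belongs to.
FIRST_LETTER_PROVINCES = {
    "A": {"NL"}, "B": {"NS"}, "C": {"PE"}, "E": {"NB"},
    "G": {"QC"}, "H": {"QC"}, "J": {"QC"},
    "K": {"ON"}, "L": {"ON"}, "M": {"ON"}, "N": {"ON"}, "P": {"ON"},
    "R": {"MB"}, "S": {"SK"}, "T": {"AB"}, "V": {"BC"},
    "X": {"NT", "NU"}, "Y": {"YT"},
}

def postal_code_fits_province(postal_code, province):
    """Return True if the code's first letter is normally used in that province."""
    first = postal_code.strip()[:1].upper()
    return province.upper() in FIRST_LETTER_PROVINCES.get(first, set())

print(postal_code_fits_province("V6B 1A1", "QC"))   # a "V" code on a Quebec address
print(postal_code_fits_province("V6B 1A1", "BC"))
```

<p>This check only looks at the first letter, so it should be run on codes that have already passed the format test.</p>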
<p>So while learning a bit about regular expressions, we've also learned that if you had a big mailing list to clean you would probably want to use a dedicated tool-  postal addresses are an area of data quality where lots of attention has been paid over the years, and writing a lot of custom logic and regex patterns is unlikely to be a good use of your time.  But for application specific codes and strings it might be very useful.  In my next post, we'll look at some more tricks with regular expressions that can be used to analyze data quality.</p>
<p>I've posted a small collection of useful regular expressions to the datamartist website <a href="/useful-regular-expressions-for-data-quality">here</a>. </p>
<h1>Datamartist V1.3.0 PRO and Regular expressions</h1>
<p>The professional edition of the <a href="/">Datamartist tool</a> provides a function REGEX(text,regex expression) that returns TRUE or FALSE depending on whether the text "matches" the regular expression specified.   This function can be used anywhere in Datamartist where expressions are available, making it a powerful way to test if a string matches one or more patterns.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/an-introduction-to-using-regular-expressions-for-data-quality-validation/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>When should you data profile? Morning, Noon and Night!</title>
		<link>http://www.datamartist.com/when-should-you-data-profile-morning-noon-and-night</link>
		<comments>http://www.datamartist.com/when-should-you-data-profile-morning-noon-and-night#comments</comments>
		<pubDate>Wed, 22 Sep 2010 13:16:26 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4955</guid>
		<description><![CDATA[Data profiling is an important part of any data related project. The question often arises when the best time to data profile is. As you would expect from a software company that sells a really cool visual data profiling tool, our view is "all the time". Using data profiling tools before the project Data profiling [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.datamartist.com/wp-content/uploads/2010/09/sate-category-graph-250px.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/09/sate-category-graph-250px.jpg" alt="" title="sate-category-graph-250px" width="250" height="151" class="alignright size-full wp-image-5128" /></a>Data profiling is an important part of any data related project.  The question often arises when the best time to data profile is. As you would expect from a software company that sells <a href="/data-profiling-enhanced-datamartist-v1-3-released">a really cool visual</a> data profiling tool, our view is "all the time".</p>
<h2>Using data profiling tools before the project</h2>
<p>Data profiling is useful even before the project is defined.  By doing a first higher level data profiling on key data sets, you will get:</p>
<ul>
<li>better project scope definition</li>
<li>a more accurate budget estimate</li>
<li>a clear baseline from which improvement can be measured</li>
<li>the ability to correctly manage expectations</li>
</ul>
<p>The last two are important ones.  Making wild promises based on what the data model says should be in the tables and then failing to deliver due to data quality issues is not nearly as career enhancing as defining an ambitious but doable scope and delivering based on the actual data, clearly communicating the progress made based on facts.</p>
<p>Data quality issues can seriously (double digit percentage seriously) affect the final cost of a project.  Knowing about issues will let you set a realistic budget.</p>
<h2>Data profiling at the beginning of the project</h2>
<p>After the initial higher level data profiling done before the project,  budgeting for a more detailed data profiling of the source data will:</p>
<ul>
<li>give clear design guidance to ETL developers</li>
<li>clearly identify the subject matter experts needed to understand the data, and let you engage them early- rather than in a rush when ETL development hits the underlying data issues, and the project is already running late and over budget.</li>
</ul>
<h2>Data profiling during the project</h2>
<ul>
<li>By setting up automated data profiling tasks for the output of each ETL process, ETL developers and architects can track the progress for migration, cleansing or conversion tasks using concrete information.</li>
<li>Objective criteria can be set for each profile task to determine what level of data quality is considered "acceptable" for the final data load or fact set deliverables.</li>
</ul>
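<p>An objective criterion can be as simple as a minimum completeness level for one field.  The Python sketch below assumes the output of an ETL step is available as a list of values; the function names, the sample data and the 95% threshold are made-up illustrations, not recommended targets.</p>

```python
def completeness(values):
    """Share of values that are neither None nor blank, from 0.0 to 1.0."""
    if not values:
        return 0.0
    filled = sum(1 for v in values if v is not None and str(v).strip())
    return filled / len(values)

def meets_criterion(values, minimum):
    """Return True if the completeness of the field reaches the agreed minimum."""
    return completeness(values) >= minimum

# Hypothetical email column produced by one ETL step.
emails = ["a@example.com", "", "b@example.com", None, "c@example.com"]
print(completeness(emails))
print(meets_criterion(emails, 0.95))
```

<p>Running a check like this after every ETL step turns "acceptable" from an opinion into a number that can be tracked over the life of the project.</p>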
<h2>Data profiling at the end of the project</h2>
<ul>
<li>Doing a final data profiling run, and comparing it to the baseline established before the project will provide a clear Before/After view that will both clearly communicate the progress made and assist in justifying and promoting the next data quality initiative.</li>
</ul>
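<p>The Before/After view can be as simple as lining up the same metrics from the two profiling runs.  The sketch below assumes each run has been boiled down to a dictionary of issue counts; the metric names and numbers are invented for the example.</p>

```python
def compare_profiles(baseline, final):
    """Return (before, after, change) for every metric present in both runs."""
    report = {}
    shared_metrics = sorted(set(baseline).intersection(final))
    for metric in shared_metrics:
        before, after = baseline[metric], final[metric]
        report[metric] = (before, after, after - before)
    return report


# Invented issue counts from the baseline run and the final run.
baseline = {"duplicate_customers": 1200, "missing_postal_codes": 340}
final = {"duplicate_customers": 150, "missing_postal_codes": 25}

report = compare_profiles(baseline, final)
for metric, (before, after, change) in report.items():
    print(f"{metric}: {before} to {after} ({change:+d})")
```

<p>Only metrics measured in both runs are compared, which is one more reason to decide what to profile before the project starts.</p>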
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/when-should-you-data-profile-morning-noon-and-night/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Too much data storage hurts data quality- the toothpaste effect</title>
		<link>http://www.datamartist.com/too-much-data-storage-hurts-data-quality-the-toothpaste-effect</link>
		<comments>http://www.datamartist.com/too-much-data-storage-hurts-data-quality-the-toothpaste-effect#comments</comments>
		<pubDate>Thu, 09 Sep 2010 15:36:34 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[data culture]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Reality Check]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4960</guid>
		<description><![CDATA[When I brush my teeth there is a wide range in terms of amount of toothpaste that is acceptable to me. This is not a profound statement- bear with me. Only as the tube of toothpaste starts getting near to its end do I start conserving toothpaste because I know I need to make it [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.datamartist.com/wp-content/uploads/2010/09/data-quality-and-toothpaste-labour-issues.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/09/data-quality-and-toothpaste-labour-issues.jpg" alt="" title="data-quality-and-toothpaste-labour-issues" width="320" height="234" class="alignright size-full wp-image-4962" /></a><br />
When I brush my teeth there is a wide range in terms of amount of toothpaste that is acceptable to me.  This is not a profound statement- bear with me.</p>
<p>Only as the tube of toothpaste starts getting near to its end do I start conserving toothpaste because I know I need to make it last.</p>
<p>Another example is the all you can eat buffet- we eat because it's there and we can.  Unlike wasting toothpaste, this has  more immediate negative consequences.</p>
<p><strong>When there is lots of something, we tend to use more of it than we should.</strong></p>
<p>When the tube of enterprise storage capacity seems to be always full, and when massive databases make an all-you-can-store buffet the standard mode of operation, very often the tendency is to store everything.  </p>
<p>Rather than try to determine what information is of a useful level of quality, or focusing on the key information (and ensuring it IS of useful data quality), we stuff our systems full of every type of field and attribute, with massive bloated forms that are too long for anyone to really fill out properly.  </p>
<p>Sadly, this doesn't matter because there are too many fields to check anyways (who can define so many business and data quality rules?), so no one is checking.</p>
<p>If we were forced to make a choice between data A and data B, we might think a bit more about which is more useful for answering key business questions (and by connection, actually think about what the key business questions are).</p>
<p>Instead, how many times have I heard an overworked, rushed subject matter expert say - "Just collect it all, we might need it."</p>
<p>By collecting more, we end up with less.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/too-much-data-storage-hurts-data-quality-the-toothpaste-effect/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Data quality from a four year old</title>
		<link>http://www.datamartist.com/data-quality-templates-from-a-four-year-old</link>
		<comments>http://www.datamartist.com/data-quality-templates-from-a-four-year-old#comments</comments>
		<pubDate>Tue, 08 Jun 2010 14:10:22 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Duplicate Data]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4473</guid>
		<description><![CDATA[I think my four year old would make a good data quality dude. He explained to me recently, why it's better to use stickers than crayons, "for the things people use a lot". "Dad, if you use crayons, you might draw it different, but stickers- they are all the same." He then pointed to the [...]]]></description>
			<content:encoded><![CDATA[<p>I think my four year old would make a good data quality dude. He explained to me recently, why it's better to use stickers than crayons, "for the things people use a lot".<br />
<a href="http://www.datamartist.com/wp-content/uploads/2010/06/data-entry-problems-just-enter-anything.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/06/data-entry-problems-just-enter-anything-300x186.jpg" alt="" title="data-entry-problems-just-enter-anything" width="300" height="186" class="alignright size-medium wp-image-4539" /></a><br />
"Dad, if you use crayons, you might draw it different, but stickers- they are all the same."  He then pointed to the sheet of identical, machine generated stickers-  "All the same- so everyone who gets one of these, knows what it is."</p>
<p>"Using the crayon takes too long and sometimes I make mistakes."  Then he paused for a second. "But if it's something different- then I have to draw it. No stickers for that."</p>
<p>And off he went, blending hand drawn custom crayon work with high speed sticker application.</p>
<p>It strikes me that what my son has figured out as a basic rule of thumb in arts and crafts for the use of stickers, is a pretty good analogy for design of data entry systems.</p>
<p>Whenever you can, use something that restricts the user's choices to a fixed, understood set of responses.  Use pre-made data stickers.</p>
<p>The enemy of data quality everywhere is the gaping, un-validated free form text entry field.  Only linguists and unstructured text analysts can get excited about the "endless possibilities" of what your users and customers can enter into those fields.</p>
<p>We've all seen the horrors of names and addresses run amok-  "John A Smith", "Jon A. Smith", "John Smith Jr.", "Smith, John A" or the even more amazing "John Smith (new customer)".</p>
<p>If you're in data, you don't want endless possibilities.  You want ordered sets of data that conform strictly to well defined rules.  Eliminating duplicates is a complex and time consuming effort.  Stopping as many of them as possible before they are created is the first, best thing you can do to get a handle on the problem.</p>
<p>So think stickers.   For every field ask yourself- can I make this a combo box? radio buttons?  Can I do auto search in the existing records to suggest close matches?  Anything to stop users or customers from making things up- and to have the data points they enter conform to a defined domain.</p>
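<p>To make the "auto search" idea concrete, here is a small Python sketch that uses the standard library's <code>difflib.get_close_matches</code> to suggest existing records that look like what the user just typed.  The function name <code>suggest_existing</code>, the sample customer list and the 0.8 cutoff are assumptions for the example; a real system would tune the cutoff and probably use a proper matching engine.</p>

```python
import difflib


def suggest_existing(entered_name, existing_names, cutoff=0.8):
    """Return up to three existing names that closely resemble the entry."""
    return difflib.get_close_matches(entered_name, existing_names, n=3, cutoff=cutoff)


# Invented customer records already in the system.
existing_customers = ["John A Smith", "Mary Jones", "Ahmed Khan"]

# The data entry screen would show these before creating a new record.
suggestions = suggest_existing("Jon A. Smith", existing_customers)
print(suggestions)
```

<p>If the list comes back with a close match, the form can ask "did you mean this customer?" instead of quietly creating a near duplicate.</p>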
<p>The more constrained a field is, the better the chances are that the data stored in it will be useful... unless of course you make it so constrained that you force data quality to suffer.</p>
<h2>There is such a thing as too much...</h2>
<p>Every good rule has its exceptions, and the evil side of overly constraining your data entry folks is that because they are smarter than computers, they'll find ways to invent entirely new encoding methods.</p>
<p>If you tighten the entry on the postal code too much, so that international postal codes won't fit, you can be sure that data entry clerks will discover that by entering their own postal code, and putting the customer's postal code in the comment field, they can get the system to accept the record (and at least feel as if they had tried their best to get the data needed in there).</p>
<p>This is where which stickers you have in your collection starts to matter.  </p>
<p>Have you ever noticed that at well run events, they always have some blank name tags, as well as the pre-printed ones?  That and a magic marker makes sure the process can go on.</p>
<p>In the end, you'll need to balance between the two extremes.  Tighten up your data entry and interfaces as much as you can, but realize that there is a point of diminishing returns, and in fact probably even a point where your data totalitarianism will be hurting your data quality, not helping it.</p>
<p>Now of course, there are some pretty high end tools that let you create all sorts of rules, and others that let you comb through the data and cleanse it, checking those postal codes to states and cities, and doing all sorts of fancy matching and analysis.  There is definitely an important role in many organisations and systems for approaches and tools such as these.</p>
<p>Using data profiling tools like <a href="/">Datamartist</a> will help you understand what issues are making it through your defenses.</p>
<p>But if you are not doing it already, focusing on the point of entry with practical, balanced techniques will make a step change improvement to your data quality.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-quality-templates-from-a-four-year-old/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Data integration is like a pizza</title>
		<link>http://www.datamartist.com/data-integration-is-like-a-pizza</link>
		<comments>http://www.datamartist.com/data-integration-is-like-a-pizza#comments</comments>
		<pubDate>Tue, 18 May 2010 12:52:12 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Integration]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Data Transformation]]></category>
		<category><![CDATA[Business Intelligence]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4520</guid>
		<description><![CDATA[I enjoy a slice of pizza as much as the next person (perhaps a bit more). The key to a good pizza is the raw materials- use the right stuff, and you'll be happy every time. What's great about pizza is that it has all sorts of great stuff on it, and presents them all [...]]]></description>
			<content:encoded><![CDATA[<p>I enjoy a slice of pizza as much as the next person (perhaps a bit more).  The key to a good pizza is the raw materials- use the right stuff, and you'll be happy every time.  What's great about pizza is that it has all sorts of great stuff on it, and presents them all in a single, easy to hold and eat meal. </p>
<p>Data integration can be like a really well put together pizza- lots of good cross-referencing cheese-data to keep everything in its place, great crust that supports it all, and a universal appeal that might even get people to try something they wouldn't normally consume (data wise).</p>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/05/data-integration-if-the-data-was-any-good.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/05/data-integration-if-the-data-was-any-good.jpg" alt="" title="data-integration-if-the-data-was-any-good" width="357" height="297" class="alignleft size-full wp-image-4525" /></a>But without data quality, data integration can make pizza that nobody really wants to eat, and rather than enhancing the value of your data, your data integration efforts can make your bad data even less consumable than it was on its own.</p>
<p>While combining data from multiple systems can generate huge insights, it is important to understand that moving it and combining it with data from other systems will not <em>always</em> increase its value.  </p>
<p>With good quality data you can have fantastic results, but bad quality data requires so much effort and transformation that often your payback on doing the integration will be non-existent.</p>
<h2>Data integration enthusiasm </h2>
<p>So what happens when an enterprise hears its stomach rumble, and starts thinking data pizza?</p>
<p>Enthusiastic analysts spring into action, building various mockups of all the fantastic dashboards that they will be able to produce, once the data integration is done.  Terms like "near-real time, balanced, cross-functional score cards" start to get bounced around, and pretty soon, budget proposals and appropriation requests are flying from color printers everywhere.</p>
<p>What's unfortunate in many cases is that cooler heads don't stop to ask the question-  "So... all this data we are going to put together, is it any good?"</p>
<p>When you are making your pizza, you have to know if the cheese has been left out a bit too long or the green pepper is soggy.</p>
<p>What can be worse, is that if heroic measures are taken to try to get the data to fit together, the integration jobs themselves might actually degrade the data quality further- or eliminate levels of detail that are not compatible, actually hiding important trends and structures.  A risk of integrated dashboards is that they pander to the lowest common denominator.</p>
<p>So if you are planning to do some data integration, to build a data pizza, think twice about putting that moldy pepperoni from the CRM system on it- sometimes less is more.  </p>
<p>In fact, it might be that data integration is not your first concern- improving the quality of the data in all those data silos will actually improve day to day operations immediately- and make any future data integration project cheaper, and more successful. </p>
<p>Any great chef will tell you- no matter how complex the recipe, and how impressive your kitchen and equipment, the raw ingredients matter.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-integration-is-like-a-pizza/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Why you should data profile.</title>
		<link>http://www.datamartist.com/data-profiling-do-it-do-it-now</link>
		<comments>http://www.datamartist.com/data-profiling-do-it-do-it-now#comments</comments>
		<pubDate>Fri, 07 May 2010 02:11:58 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data migration]]></category>
		<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4496</guid>
		<description><![CDATA[Imagine that you have bought a new home, and you've decided to do some landscaping. So you pick three landscapers, draw a rough sketch of what you want, and ask them to bid on the job. But you don't allow them to come see your property, and your sketch doesn't specify anything about the existing [...]]]></description>
			<content:encoded><![CDATA[<p>Imagine that you have bought a new home, and you've decided to do some landscaping.  So you pick three landscapers, draw a rough sketch of what you want, and ask them to bid on the job.</p>
<p>But you don't allow them to come see your property, and your sketch doesn't specify anything about the existing landscaping- just the final configuration.  Do you think the landscapers would be willing to offer a reasonable price? </p>
<p>Unlikely.   What if there are existing patio stones to remove- or an in-ground swimming pool that's got to go? </p>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/05/did-the-consultants-data-profile-first.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/05/did-the-consultants-data-profile-first.jpg" alt="" title="did-the-consultants-data-profile-first" width="347" height="235" class="alignright size-full wp-image-4513" /></a>No landscaper would take on a job without understanding the lay of the land, and the existing conditions.  It would be impossible to estimate the job. Anyone who did would give you a huge price to cover themselves, or demand extras upon discovering the extra work.</p>
<p>Yet when companies hire consultants to build them business intelligence solutions, or do data migration,  it often happens with only the roughest outline of the existing data sets.  Certainly, a data model is often included- but knowing what the table SHOULD contain rather than what it does contain is just not the same thing.  It never ceases to amaze me that the simple, cost effective practice of data profiling is so often missing from the initial phases of business intelligence and data migration projects.</p>
<p>With the right data profiling tool, and just a few days' work, it's possible to gain a huge amount of insight into the data quality in your systems, and as a result, be able to make radically more accurate estimates of the cost to go from the "as is" to the "to be".</p>
<p>Phil Simon talked about this in a great post on the DataFlux blog called <a href="http://www.dataflux.com/dfblog/?p=2590" target="_blank">"What Consultants Don't tell you"</a>, and raises an important and somewhat ugly truth- many times, service providers don't WANT to do data profiling because it reveals the true extent of the work to be done, which increases the budget requirement and makes the project less likely to be approved.</p>
<p>Now certainly, we can't use a broad brush to paint all consultants, but this reluctance does reduce the number of times valuable tools such as data profiling are recommended, even though in my opinion they are a low cost, no-brainer, do it unless you are crazy first step to any major project.  </p>
<p>You are going to spend potentially millions of dollars on a business intelligence or data migration project- spend a few weeks to look at the data with the right tools first for goodness sake!</p>
<p>If you want to get a reasonable cost estimate, and you want to go into your business intelligence or data migration project with open eyes, don't imagine you can know what it will cost to get from here to there if you don't take a good look at where here really is.</p>
<p><a href="/resources/screenshots/Data-Profiler-on-States.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/05/Data-Profiler-on-States-Thumb.jpg" alt="" title="Data-Profiler-on-States-Thumb" width="220" height="165" class="alignright size-full wp-image-4506" /></a><strong>Full disclosure</strong>-  of course, you are reading the <a href="/">Datamartist</a> blog, and Datamartist has lots of data profiling functionality- so you have to understand that we are incredibly biased on this topic.  If you are able to overlook our inherent bias, <a href="/downloads">give the tool a try</a>- you'll discover things about your data you might not have wanted to know, but it's better to face the truth prepared than to rely on wishful thinking and then discover the bad news when you're well into the project, and your budget is almost gone.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-profiling-do-it-do-it-now/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

