<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datamartist.com &#187; Data Quality</title>
	<atom:link href="http://www.datamartist.com/category/data-quality/feed" rel="self" type="application/rss+xml" />
	<link>http://www.datamartist.com</link>
	<description>Reduce cost with self serve data transformation</description>
	<lastBuildDate>Thu, 09 Feb 2012 20:00:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>A New Year's resolution to data profile</title>
		<link>http://www.datamartist.com/a-new-years-resolution-to-data-profile</link>
		<comments>http://www.datamartist.com/a-new-years-resolution-to-data-profile#comments</comments>
		<pubDate>Tue, 10 Jan 2012 15:54:05 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Reality Check]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=6165</guid>
		<description><![CDATA[Well, it's the time of making and breaking resolutions, a time when setting realistic goals is sometimes hard to do with all the optimism of the new year. Sometimes, we make decisions NOT to set a goal, because we don't want to break it. You might be thinking you really should step up your data [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.datamartist.com/wp-content/uploads/2012/01/data-profiling-some-data.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2012/01/data-profiling-some-data-300x225.jpg" alt="" title="data-profiling-some-data" width="300" height="225" class="alignright size-medium wp-image-6171" /></a>Well, it's the time of making and breaking resolutions, a time when setting realistic goals is sometimes hard to do with all the optimism of the new year.  </p>
<p>Sometimes, we make decisions NOT to set a goal, because we don't want to break it.  </p>
<p>You might be thinking you really should step up your data quality monitoring- get some data profiling underway to help identify the data domains and areas you most want to tackle in 2012.  But you might also be thinking that with all the pressures and cutbacks that many companies are facing, you don't have the resources to implement a full scale profiling and monitoring effort, and so might decide to delay. </p>
<p>Don't wait. Just do it.  The perfect is the enemy of the good.</p>
<p>Rather than worrying about how much of your data you are going to be able to cover, or that you can't devote enough resources to tackle all of your reference areas at once, work at the problem from another direction.  </p>
<h1>First, start with master data.</h1>
<p>Master data is the data that all your other data is made from.  It's the data everyone uses to view the massive piles of transactional data, so with one bad row in a master data table, the impact is felt across perhaps hundreds of reports, and multiple time periods.  If you have a product in the wrong category, then every transaction, across perhaps hundreds of customers, and all time, will be miscategorized, and every total, sub-total and calculated metric using it will suffer.</p>
<p>While bad transactions are bad, bad reference data is deadly.  Bad reference data takes a good transaction and messes it up.</p>
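<p>To make that concrete, here is a rough Python sketch.  The product IDs, categories and amounts are all made up for illustration; the only point is that changing one row in the master moves every transaction that refers to it.</p>

```python
# Hypothetical data: a tiny product master and a few sales transactions.
product_master = {
    "P1": "Bikes",
    "P2": "Bikes",
    "P3": "Helmets",
}

transactions = [
    ("P1", 100.0),
    ("P2", 250.0),
    ("P3", 40.0),
    ("P3", 60.0),
    ("P2", 50.0),
]


def totals_by_category(master, rows):
    """Sum transaction amounts by the category looked up in the master."""
    totals = {}
    for product_id, amount in rows:
        category = master[product_id]
        totals[category] = totals.get(category, 0.0) + amount
    return totals


good = totals_by_category(product_master, transactions)

# One bad row in the master: P2 is filed under the wrong category.
bad_master = dict(product_master, P2="Helmets")
bad = totals_by_category(bad_master, transactions)

print("correct master:", good)
print("one bad row:   ", bad)
```

<p>Every transaction in this sketch is perfectly good, yet both category totals come out wrong once that single master row is off.</p>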
<h1>Worst first!</h1>
<p>Make a list of your reference tables/areas.  Customer, Product, Chart of accounts, etc.  Which are the most important for your business?  This isn't something I can tell you- you have to think about what is most critical.</p>
<p>If you are a company that purchases large amounts of materials from many vendors, and purchasing decisions are fast paced and critical, then maybe it's your vendor master, and your accounts payable.</p>
<p>On the other hand, if you have lots of interaction with your customers, and errors in the customer master cost you business, then start with that.</p>
<p>The key is to first make the list, and then think to yourself "if I have bad quality data, where am I most afraid it will be?"  Start profiling there.  You want to find the worst first, and fixing that will have the greatest positive impact.</p>
<h1>Get to know your data</h1>
<p>Don't worry about setting complex or work intensive goals right away.  Data profiling is about data discovery sometimes.  You need to wade into your reference data, play with it, tease out patterns and relationships.  As you get to know your data, you will be able to better identify where there are issues to tackle, and where root causes might lie for data quality issues.</p>
<p>One approach might be to simply resolve to spend an hour a week, every week, profiling some data.  If you aren't doing that now, you will find that even just a bit of time set aside will give huge insight- sometimes we get too busy to do the basics, and we miss opportunities to make significant improvements with relatively little effort in our data.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/a-new-years-resolution-to-data-profile/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data Quality Rules</title>
		<link>http://www.datamartist.com/data-quality-rules</link>
		<comments>http://www.datamartist.com/data-quality-rules#comments</comments>
		<pubDate>Thu, 16 Jun 2011 17:00:07 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[data culture]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Reality Check]]></category>
		<category><![CDATA[Data Quality rules]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=5995</guid>
		<description><![CDATA[What's the difference between good data and bad data? It is much like the difference between good children and bad children- the bad data doesn't follow the rules. But what are the rules? Unlike the rules for kids, which have been fixed in stone for decades (or at least, parents wish it were so), the [...]]]></description>
			<content:encoded><![CDATA[<p>What's the difference between good data and bad data?  It is much like the difference between good children and bad children- the bad data doesn't follow the rules.<br />
<a href="http://www.datamartist.com/wp-content/uploads/2011/04/data-quality-rules-data-freedom-or-death.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2011/04/data-quality-rules-data-freedom-or-death-300x269.jpg" alt="" title="data-quality-rules-data-freedom-or-death" width="300" height="269" class="alignright size-medium wp-image-6011" /></a><br />
But what are the rules?  Unlike the rules for kids, which have been fixed in stone for decades  (or at least, parents wish it were so), the rules for data are slippery things that depend very much on the context and the database.</p>
<p>While it's a complex subject, some basic rules of thumb can avoid the deeper rabbit holes.</p>
<p>The first thing to understand about Data Quality rules is they aren't as easy as they may look.  Data is in theory something in the ordered world of computers, but in reality is in the "flexible" world of humans.  A huge amount of data is entered by members of the group "Homo sapiens" (or mutilated by software written by members of that group) and as a result is not as ordered as we would all like.</p>
<p>The challenge for data quality practitioners is to remove the chaos injected by those highly involved primates (us) and make the data the sterile, ordered, never any question about anything type that we all imagine in our fantasies.</p>
<p>But how?</p>
<p>In the end, it is amazing how powerful and complex the various solutions to this problem are.</p>
<p>But I suggest that there are some basic principles that can help guide us.</p>
<h2>First- do no harm.</h2>
<p>One of the risks of any data quality initiative is that it actually screws up the data more.  Don't define rules that are so complex, and so sure of themselves that they actually make the data worse.  Be humble. Don't change data unless you are pretty sure it's a good idea.  Err on the side of not screwing up the original.  And keep a copy of the original- so if things do go off the rails you can undo- or at least try to understand what went wrong.</p>
<h2>Go out and talk to the people</h2>
<p>Don't sit in your ivory tower and speculate as to what the data means.  Go out there and watch people enter it in.  See what real world type things are happening that never make it into bits and bytes.</p>
<h2>Attack the basics first</h2>
<p>Focus your first efforts on dealing with the basics- they will resolve the vast majority of the issues- don't chase after the outliers until you have the "easy" cases taken care of- the tough stuff is a case of diminishing returns- look first at how to fix processes and train your people to make the majority of typical data entry cases more accurate before you start looking into artificial intelligence based hyper-multi-semantic-algorithmic-learning-matching-holistic-flux-capacitor data quality systems.</p>
<h2>Less is more- the fewer rules the better.</h2>
<p>So what's the rule about making rules?  Try to make fewer rules, and test them in a pragmatic way.  It is possible to have so many rules that the rules themselves have data quality issues- don't go there.</p>
<p>Sometimes the simplest things will bring the greatest benefit.</p>
<p>In the coming weeks, I'll be posting about how to design, implement and monitor Data quality rules using the <a href="/">Datamartist tool</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-quality-rules/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data quality sizzle</title>
		<link>http://www.datamartist.com/data-quality-sizzle</link>
		<comments>http://www.datamartist.com/data-quality-sizzle#comments</comments>
		<pubDate>Tue, 22 Mar 2011 18:08:56 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Project Management]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=5985</guid>
		<description><![CDATA[I'm an engineer. Being an engineer, I'm pretty product focused, pretty technology focused, and pretty "does it work or not" focused. Having technical things like tools work is useful, and good. But just because you build it, does not mean they will come. The challenge often in Data Quality is that often what has to [...]]]></description>
			<content:encoded><![CDATA[<p>I'm an engineer. Being an engineer, I'm pretty product focused, pretty technology focused, and pretty "does it work or not" focused.  </p>
<p>Having technical things like tools work is useful, and good.  But just because you build it, does not mean they will come.</p>
<p>The challenge in Data Quality is that often what has to change even more than the technology or tools are the behaviours and perspectives of the people in the organisation with data quality issues.  At the very least, the users have to use the tools.  Very few data quality solutions are of the "full autopilot" bad-data-goes-in-here-good-comes-out-here type.</p>
<p>As much as we engineers would like to solve everything with software, people are involved in Data Quality.  </p>
<p>While a fantastic bit of data profiling analysis or an elegant and powerful data transform would seem to be enough, the truth is sometimes how and when you present these things is key to getting the non-engineer people to buy in.  </p>
<p>Sometimes preparing people over time, and introducing things in a step by step way helps them understand, and makes the technology and the change required less daunting.</p>
<p><a href="http://www.datamartist.com/wp-content/uploads/2011/03/red-bbq.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2011/03/red-bbq-300x199.jpg" alt="" title="red-bbq" width="300" height="199" class="alignright size-medium wp-image-5987" /></a>Because I'm looking out my window at a tentative (very tentative it's only March after all) spring day here in Toronto, I'm going to use a summer barbecue analogy.</p>
<p>The tools and technology are the steak.  The steak is key to the party.   In the end (at least for me in this analogy) the steak delivers most of the value in your summer BBQ party value proposition, but you'll have more guests and be more successful over all if you package the whole. </p>
<p>Sometimes, part of selling the steak is the sizzle, the preparation, the things around the steak.</p>
<p>It's the smell of the BBQ getting ready, it's the sound of the steak hitting the grill- it's the cold drink, the conversation, the games on the lawn for the kids.</p>
<p>In the end, even if you know that 90% of the deal was that steak, if you just put a steak on a plate and give it to each guest the moment they arrive, it's just not going to get the same response.</p>
<p>In my usual roundabout way, the point I'm trying to get to is that you can't solve technical problems, then drop them on people's desks and say "do it".  You need to invite them to the party.  Prepare them for the menu, ask preferences, give them some time to hear the sizzle, smell the charcoal, enjoy the sunshine in expectation of that steak.</p>
<p>Steak is good.  Remember to plan some sizzle too.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-quality-sizzle/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data profiling- a search or a code to crack?</title>
		<link>http://www.datamartist.com/data-profiling-a-search-or-a-code-to-crac</link>
		<comments>http://www.datamartist.com/data-profiling-a-search-or-a-code-to-crac#comments</comments>
		<pubDate>Wed, 03 Nov 2010 17:50:08 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=5848</guid>
		<description><![CDATA[Often, tracking down data quality issues is presented as a search for bad data- but sometimes the data isn't so much bad, as not understood. In legacy systems, you might be more trying to first find the meaning of data- in effect, decoding it as if it had been encrypted (which in a way, time [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.datamartist.com/wp-content/uploads/2010/11/300px-Enigma-rotor-stack.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/300px-Enigma-rotor-stack.jpg" alt="" title="Photo by Bob Lord" width="300" height="225" class="alignright size-full wp-image-5850" /></a>Often, tracking down data quality issues is presented as a search for bad data- but sometimes the data isn't so much bad, as not understood.  In legacy systems, you might be more trying to first find the meaning of data- in effect, decoding it as if it had been encrypted (which in a way, time and lack of documentation might very well have done).</p>
<p>You know that all that data means something- but what?</p>
<p>One of my favorite code-busting stories is the epic victory over the Enigma code during the Second World War.  One of the reasons it's of interest is that it was one of the early applications of computing- but the key lesson I think is not from the brute force computation done, but from the strategies used to crack the code.</p>
<p>When you are trying to crack a code, one of the key things you need is "Cribs"- samples of coded messages together with their clear text.  These cribs can radically reduce the number of possible ways a code can be decoded.</p>
<p>In the case of Enigma, the Allies would listen for German U-boat radio transmissions, while also using direction finding equipment to estimate their location.  Standard procedure was for a U-boat to first radio a weather report.</p>
<p>By painstakingly backtracking known weather conditions and locations of U-boats when they transmitted, it was possible to take advantage of that first weather report- there were only so many ways to say "Sunny and calm".  Having this crib gave them a way to break into the code.</p>
<p>What is the point in terms of Data profiling?  While it's critical to have the right tools to analyse the data (a data profiler like <a href="/">Datamartist</a>, for example), it's also important to get out there and talk to people, understand what's going on- collect some Cribs that will help it all make sense. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-profiling-a-search-or-a-code-to-crac/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Good Data is a force for good.</title>
		<link>http://www.datamartist.com/good-data-is-a-force-for-good</link>
		<comments>http://www.datamartist.com/good-data-is-a-force-for-good#comments</comments>
		<pubDate>Wed, 20 Oct 2010 14:59:16 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[data culture]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Public Data]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=5596</guid>
		<description><![CDATA[The United Nations has declared that today is the first world statistics day, "celebrating the many contributions and achievements of official statistics". It's the kind of holiday that those of us in the data wrangling profession can really get behind. Data about people in general, and their well being, their needs and challenges is a [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.datamartist.com/wp-content/uploads/2010/10/UN-World-Statisics-Day-Logo.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/10/UN-World-Statisics-Day-Logo.jpg" alt="" title="UN-World-Statisics-Day-Logo" width="290" height="228" class="alignright size-full wp-image-5599" /></a>The United Nations has declared that today is the first world statistics day, "celebrating the many contributions and achievements of official statistics".</p>
<p>It's the kind of holiday that those of us in the data wrangling profession can really get behind.</p>
<p>Data about people in general, and their well being, their needs and challenges is a critical component of any plan for progress- and the UN focusing on "official statistics" highlights the huge good that this data does in our world.</p>
<p>Governments, educators, charities, and communities can use official statistics to best direct aid, tailor programs to be as efficient as possible, and dramatically improve the lives of billions of people.  </p>
<p>Citizens can use data to demand change from their governments, and businesses.  They can use data to make informed decisions about which products to buy, understanding their health, environmental and economic impact.</p>
<h2>Don't take all that data for granted.</h2>
<p>I am fortunate to be living in Canada, a wealthy country that provides a broad range of services to its citizens, and I know that my family and I benefit every day from decisions and policies that have been put in place thanks to decisions informed by a broad range of statistical information.  One of the key sources is the census.</p>
<p>Unfortunately, this summer, the Canadian government decided to eliminate the mandatory long form census in Canada (there is still a shorter one), and there has been a strong outcry of disagreement. The chief statistician of Statistics Canada resigned in July, but the government seems determined to eliminate this important source of data.</p>
<p>Our little drama in Canada is of course a tiny issue compared to the tragic state of affairs in many countries. Obviously, in many countries the lack of data is a symptom for much more fundamental issues.  But collecting and acting on statistical data to help your populace is an indicator of good governance, and encouraging statistics collection is a positive way to support change.</p>
<p>So on this world statistics day, I encourage everyone that loves data, facts, and decisions made using them, to consider that the anti-data forces of evil are still alive and well.  Fight those who want to "go with their gut", or worse those who know that data will expose their actions as contrary to the common good.</p>
<p>Good decisions are made based on good data. Good data does good.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/good-data-is-a-force-for-good/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Data quality challenges: behavioral inertia and its evil opposite</title>
		<link>http://www.datamartist.com/data-quality-challenges-behavioral-inertia-and-its-evil-opposite</link>
		<comments>http://www.datamartist.com/data-quality-challenges-behavioral-inertia-and-its-evil-opposite#comments</comments>
		<pubDate>Tue, 05 Oct 2010 16:39:04 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[data culture]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=5468</guid>
		<description><![CDATA[Often, I hear someone say something like "this would be much easier if users would just..." or "If only we could convince the sales people that...". Technology folks often are frustrated by the people component of the complex systems they are trying to install. People are not a problem solved by technology Some try to [...]]]></description>
			<content:encoded><![CDATA[<p>Often, I hear someone say something like "this would be much easier if users would just..." or "If only we could convince the sales people that...".   Technology folks often are frustrated by the people component of the complex systems they are trying to install.</p>
<h2>People are not a problem solved by technology</h2>
<p>Some try to ignore the issue, or solve it with technology alone-  "If we write complex enough validation into the data entry form people HAVE to enter good data" or "Our matching algorithms will resolve the issues in real time."</p>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/10/users-will-lose-chair-if-data-quality-suffers1.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/10/users-will-lose-chair-if-data-quality-suffers1.jpg" alt="" title="users-will-lose-chair-if-data-quality-suffers" width="311" height="229" class="alignright size-full wp-image-5473" /></a>Others try to use sophisticated training, documentation, bonus plans or punishment plans to get the behavior they want.</p>
<p>Obviously, components of both approaches are going to be used to some extent- but don't lose sight of the fact that people ARE the process- and the heart of your business.  It's the sales guys that drive revenue, and it's the sales order people, or help desk operators, or engineers in your manufacturing facilities that you are building the new system for that are creating all the value.   You are a person too- think about their motivations, and how to take advantage of their abilities and enthusiasm- not how to remove them from the equation.</p>
<p>I often think that there are two powerful forces at work in the minds of all of us- oddly, they are opposites, and yet can co-exist even in the same person at the same moment.  Some people are strongly to one side or the other.  </p>
<h2>Behavioral Inertia:  Change is bad</h2>
<p>We've all seen this resistance to change, and in many cases people have this tendency for good reasons (that last disastrous ERP implementation where the new processes were not properly checked, and everyone worked 15 hour days for weeks while customers were screaming into their phones about how screwed up everything was, for example.)</p>
<p>Remember, resistance to bad change that is going to screw everything up is a good thing.</p>
<p>In other cases, however, it is unfounded, and it is a real problem- things have to change to move forward.  Sometimes risks have to be taken, and there will be bumpy periods before a much better steady state is achieved.</p>
<p>People have a natural resistance to this because change is the unknown.</p>
<h2>Hyper Active change syndrome: We can't wait to do it right- we have to act NOW</h2>
<p>This is the evil opposite twin of behavioral inertia. (It's like that episode when Captain Kirk got split in a transporter incident- you know.) </p>
<p>You can identify people with this force at work by phrases like "We're a dynamic organisation, we're being proactive not reactive, our processes are fluid- it's the way it is with business in the fast lane" or my personal favorite- "We don't have time to get the data, we'll have to go with our gut."</p>
<p>Hyperactive changers will often try to get their way by always creating a sense of urgency: "The technology isn't moving fast enough for us, we can't wait for those changes to be approved, all the process is slowing us down, our customers are demanding speed"</p>
<p>Hyperactive changers are dangerous because they often ignore or circumvent processes in the name of expediency, generating risk and forcing others to waste effort compensating, and generally causing chaos.  They want to change things so often, that efficiencies of new processes are never realized- everyone is on a constant learning curve and never gets in the groove.</p>
<h2>Balance the forces, find your high-speed tortoise </h2>
<p>Think of the story of the Tortoise and the Hare.  The Hare, with all its speed, could not figure out that the process was start, run, finish, and completely wasted his speed advantage by having a nap.</p>
<p>On the other hand, while the Tortoise's complete dedication to his goal and process is admirable, you can't count on the incompetence of your competition.  (And now that all Hares are no doubt told this story throughout their childhood, it's unlikely many tortoises get away with the same trick.)</p>
<p>The key lies in between- we need to work with our organization to foster an environment where we value process, and consistency, but understand that a steady, relentless change to optimize is needed, and valuable.  When one or the other of our behavioral urges overcomes us, we'll find that people are the problem in our initiatives.  If we balance them, and communicate with everybody, we can find ways to make things work, even without perfect cooperation at all times from everyone. </p>
<p>Not too slow, not too fast, always value process without letting it be your slave master.  And for goodness sake, forget about going with your gut-  go out and get some DATA!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-quality-challenges-behavioral-inertia-and-its-evil-opposite/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Using regular expressions to check data quality Part 2</title>
		<link>http://www.datamartist.com/how-to-use-regular-expressions-to-check-data-quality-part-2</link>
		<comments>http://www.datamartist.com/how-to-use-regular-expressions-to-check-data-quality-part-2#comments</comments>
		<pubDate>Mon, 27 Sep 2010 13:34:21 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Regular expressions]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4740</guid>
		<description><![CDATA[Regular expressions are a powerful way to test if strings match a given pattern or rule set. They can be used to validate the structure of a string field in data, highlighting any obviously incorrect string values. Note: In a previous post, I introduced regular expressions and went through a simple example using Canadian Postal [...]]]></description>
			<content:encoded><![CDATA[<p>Regular expressions are a powerful way to test if strings match a given pattern or rule set.  They can be used to validate the structure of a string field in data, highlighting any obviously incorrect string values.</p>
<p>Note: In a <a href="/an-introduction-to-using-regular-expressions-for-data-quality-validation">previous</a> post, I introduced regular expressions and went through a simple example using Canadian Postal code validation.</p>
<p>In this post, we'll learn some more regular expression syntax, and explore some examples.</p>
<h1>Using regular expressions to validate codes</h1>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/07/we-are-changing-all-the-product-codes-again-problem1.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/07/we-are-changing-all-the-product-codes-again-problem1.jpg" alt="" title="we-are-changing-all-the-product-codes-again-problem" width="373" height="212" class="alignright size-full wp-image-4750" /></a>Often, within ERP systems, various entities such as organisational units, products, etc. are assigned structured alphanumeric codes.  These codes have a defined structure- all valid codes must have this structure.  For example, a product code might have the pattern:</p>
<p><strong>aaa-nnnnn</strong></p>
<p>Where the first three characters are a product group code, followed by a mandatory dash and then a five digit product number.</p>
<p>You will remember from last time, a set of acceptable characters is defined using square brackets.  So digits only can be specified by  [0123456789] and letters by [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ].</p>
<p>To simplify things, dashes can be used to include all digits or letters in a range, and these two formats can of course be combined.  The string [a-zA-Z0-9] specifies that any letter or digit is acceptable.</p>
<p>In addition, you can make a rule as to how many consecutive characters are required that follow the rule by adding curly brackets.  So the following pattern would test for a valid product code in this format:</p>
<p><code>^[a-zA-Z]{3}[-][0-9]{5}$</code></p>
<p>The {3} and {5} apply a length limit to the immediately preceding test- so there must be exactly three letters, a dash and then five digits.</p>
<p>To specify a range of lengths, you can put two numbers in the curly brackets, separated by a comma.  So to allow product codes like ABC-1,   DFG-12 or HGF-34564 we can use {1,5} instead for the second test- this allows the number after the dash to be made up of between one and five digits.</p>
<p><code>^[a-zA-Z]{3}[-][0-9]{1,5}$</code></p>
<p>From this example, you can see how it's possible to fairly easily make patterns to test for a wide range of structured codes.</p>
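<p>If you want to try the pattern outside of any particular tool, here is a minimal sketch in Python using its standard <code>re</code> module.  The sample codes are made up for illustration, and this is just one way to run the test, not how any specific product does it.</p>

```python
import re

# Pattern from the post: three letters, a dash, then one to five digits.
PRODUCT_CODE = re.compile(r"^[a-zA-Z]{3}[-][0-9]{1,5}$")

# Hypothetical sample codes, made up for illustration.
samples = ["ABC-1", "DFG-12", "HGF-34564", "AB-123", "ABC-123456", "ABC_123"]

# Map each code to True (well formed) or False (problem).
results = {code: bool(PRODUCT_CODE.match(code)) for code in samples}

for code, ok in results.items():
    print(code, "valid" if ok else "INVALID")
```

<p>One Python-specific wrinkle: <code>$</code> also matches just before a trailing newline, so if your values might carry stray line breaks, <code>re.fullmatch</code> is the stricter choice.</p>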
<h1>Defining more than one rule set at a time</h1>
<p>Sometimes, you might have two different formats possible for a given value.  As painful as it is, companies often change their coding rules, but due to legacy constraints, keep legacy coded entities in their data sets as well.</p>
<p>In regular expressions, it is easy to combine two completely different patterns in one test by using the pipe character ("|").  In fact, this operator can also be used within a single pattern to check for one OR the other of two different things.</p>
<p>For example, let's pretend that we've added a whole new product line to our offering, and the codes for these new products will have the structure XY-NNNNN-AA, where XY is either the letter "A" followed by a single non-zero digit (i.e. A1, A2, A3...) OR if X is any letter other than A then Y should be a letter too. (I'm just making this up, but I can tell you some of the rules I've seen in the real world are a lot more complex and seemingly arbitrary than this.)</p>
<p>A regular expression that would validate this new product code would be as follows:</p>
<p><code>^([aA][1-9]|[b-zB-Z]{2})[-][0-9]{5}[-][a-zA-Z]{2}$</code></p>
<p>Note that we put the first two expressions inside parentheses, separated by the "|" to make an OR.  So if [aA][1-9] OR [b-zB-Z]{2} matches at the start of the string, then that part of the test is OK.  By starting at "b" we exclude bad codes such as "AB" or "AG", because the rule says that if the code starts with A, the second character needs to be a digit.  Be aware that, as written, [b-zB-Z]{2} also rejects an "A" in the second position (for example "BA"); if those codes should be allowed, use [b-zB-Z][a-zA-Z] for that alternative instead.</p>
<p>But, of course all of our old product codes are still going to be there, and they won't pass this new test-  so to combine the two, we just take each of them, and put a pipe between them:</p>
<p><code>^[a-zA-Z]{3}[-][0-9]{1,5}$|^([aA][1-9]|[b-zB-Z]{2})[-][0-9]{5}[-][a-zA-Z]{2}$</code></p>
<p>It looks like crazy gibberish- but when you build it a bit at a time, it all makes sense.  And now, you can easily detect if there are any product codes that don't conform to either valid structure.</p>
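<p>Building it a bit at a time can be done literally in code.  The following is a rough Python sketch (the variable names and sample codes are my own, not anything from Datamartist) that keeps the two formats as separate strings, joins them with a pipe, and lists the codes that fit neither.</p>

```python
import re

# The two patterns from the text, kept separate so each stays readable.
OLD_FORMAT = r"^[a-zA-Z]{3}[-][0-9]{1,5}$"
NEW_FORMAT = r"^([aA][1-9]|[b-zB-Z]{2})[-][0-9]{5}[-][a-zA-Z]{2}$"

# Joining them with a pipe accepts a code that matches either format.
EITHER_FORMAT = re.compile(OLD_FORMAT + "|" + NEW_FORMAT)

def find_bad_codes(codes):
    """Return the codes that match neither the old nor the new format."""
    return [code for code in codes if EITHER_FORMAT.search(code) is None]

# Hypothetical sample data.
codes = ["ABC-123", "A1-12345-XY", "BC-12345-XY",
         "AB-12345-XY", "A0-12345-XY", "ABCD-1"]
bad_codes = find_bad_codes(codes)
```

<p>Keeping each format in its own named string also makes it easier to retire the legacy rule later without untangling one long expression.</p>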
<h1>Using this code in Datamartist</h1>
<p>If you were using the <a href="/">Datamartist tool</a>, you could find the bad product codes by adding a new column using a calculation block, called, say "ProductCodeValid", that was defined as the following Datamartist expression:</p>
<p><code>REGEX([PRODUCT],"^[a-zA-Z]{3}[-][0-9]{1,5}$|^([aA][1-9]|[b-zB-Z]{2})[-][0-9]{5}[-][a-zA-Z]{2}$")</code></p>
<p>This column will now have TRUE for the records where the product code is well formed, and FALSE where there are issues.</p>
<p>Regular expressions are a very useful way to check many types of data quality and they can help you avoid all sorts of crazy tests with LEFT, RIGHT and MID string functions. </p>
<p>Using REGEX will let you create much more powerful data quality code.</p>
<p>If you want to try building some REGEX yourself, you can <a href="/downloads">download the Datamartist free trial</a>, and give it a go with your own data.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/how-to-use-regular-expressions-to-check-data-quality-part-2/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>An introduction to using regular expressions for data quality validation</title>
		<link>http://www.datamartist.com/an-introduction-to-using-regular-expressions-for-data-quality-validation</link>
		<comments>http://www.datamartist.com/an-introduction-to-using-regular-expressions-for-data-quality-validation#comments</comments>
		<pubDate>Thu, 23 Sep 2010 17:43:31 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4694</guid>
		<description><![CDATA[Regular expressions (sometimes referred to as regex or regexp) are a powerful formal language that can be used to match text strings to patterns. The way regular expressions work is like this: A pattern is defined. This is a string of symbols that act as a set of rules. A text string to test, and [...]]]></description>
			<content:encoded><![CDATA[<p>Regular expressions (sometimes referred to as regex or regexp) are a powerful formal language that can be used to match text strings to patterns.</p>
<p>The way regular expressions work is like this:<br />
<a href="http://www.datamartist.com/wp-content/uploads/2010/07/regular-expressions-data-profiling-and-data-quality-overview.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/07/regular-expressions-data-profiling-and-data-quality-overview.jpg" alt="" title="regular-expressions-data-profiling-and-data-quality-overview" width="333" height="221" class="alignright size-full wp-image-4734" /></a></p>
<ol>
<li>A <strong>pattern</strong> is defined.  This is a string of symbols that act as a set of rules.</li>
<li>A <strong>text string to test</strong>, and the pattern are given to a regular expression engine, and compared.</li>
<li>The engine returns a <strong>true/false</strong> value meaning the string follows the rules or not ("PASSED" or "REJECTED" by the pattern)</li>
</ol>
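<p>The three steps above can be sketched in a few lines of Python, using its built-in <code>re</code> module as the regular expression engine.  The pattern here is only a placeholder meaning "digits only", and the function name is made up for illustration.</p>

```python
import re

def check(pattern, text):
    """Steps 2 and 3: give the pattern and the text to the engine, get True/False back."""
    return re.search(pattern, text) is not None

# Step 1: define a pattern (a hypothetical one: the string is digits only).
pattern = r"^[0-9]+$"

passed = check(pattern, "12345")    # follows the rules
rejected = check(pattern, "12a45")  # breaks the rules
```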
<p>This is obviously very useful for someone interested in Data Quality-  If you had a pattern that said "Is this a valid email address?", and got PASS or REJECT back, it would give you a good idea as to the quality of that field in your contact database.</p>
<p>One advantage of regular expressions is that because they are widely used, lots and lots of them have been created, detecting all sorts of patterns- meaning that while you can write your own, you can also look up useful ones you need in libraries.</p>
<p>Regular expressions aren't magic, of course- the result is only as good as the program. (As always.)  Depending on how well the regex is written (or not) there may be false positives or negatives.</p>
<h2>A regex example- Canadian Postal Code</h2>
<p>Let's look at a simple example, and see how they work.  Being from Canada, I'm going to use the example of validating a Canadian postal code.</p>
<p>Canadian Postal codes take the format ANA NAN, where "A" is a letter and "N" is a number.  So what we want is a regular expression that will return TRUE for valid postal codes, and FALSE for postal codes that just can't be right.  "K9J 2K2" could be a valid postal code, but we know that "38X AB2" just can't be.</p>
<p>In a regular expression, we use anchors to say where to start matching. In this case, we want just the Canadian postal code, so we'll use the anchor character "^" to specify the beginning of the string.</p>
<p>To specify that a character has to be within a given set or range of characters, we use square brackets.  So to match whenever the first character of a string is a letter, the regex would be:</p>
<p><code>^[a-zA-Z]</code></p>
<p>This regex will return TRUE for all strings that start with a letter. That's fine, but not yet specific enough for a Canadian postal code (we Canadians are very, very picky).</p>
<p>So we can add a number, then another letter constraint to our pattern.</p>
<p><code>^[a-zA-Z][0-9][a-zA-Z]</code></p>
<p>Now any string that starts with ANA will result in TRUE, so we're almost halfway there! Next, we want to specify that the space is optional; that is, it's acceptable to have the space or not.<br />
To do this, we put a "?" after the space.  And then to finish up, we add the part of the expression that detects the NAN, and end with a dollar sign, which specifies that this needs to be the end of the string (otherwise all strings that started with a valid postal code would pass):</p>
<p><code>^[a-zA-Z][0-9][a-zA-Z][ ]?[0-9][a-zA-Z][0-9]$</code></p>
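<p>Here is a small Python sketch of that pattern in action.  The function name and the test strings (other than the two postal codes mentioned above) are my own.</p>

```python
import re

# Pattern from the text: ANA, an optional space, then NAN, and nothing else.
POSTAL_SHAPE = re.compile(r"^[a-zA-Z][0-9][a-zA-Z][ ]?[0-9][a-zA-Z][0-9]$")

def has_postal_shape(value):
    """Return True when the value has the ANA NAN shape, with or without the space."""
    return POSTAL_SHAPE.match(value) is not None

checks = {
    value: has_postal_shape(value)
    for value in ["K9J 2K2", "K9J2K2", "38X AB2", "K9J 2K2 extra", "K9J  2K2"]
}
```

<p>The last two values show the anchors and the "?" doing their job: trailing text is rejected, and so is a second space.</p>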
<p>So there it is- or is it?</p>
<h2>But is the REGEX fussy enough?</h2>
<p>While this pattern that we've created does detect the ANA NAN pattern, and even allows the space to be optional, if you know Canadian postal codes, you'll know that in fact ANA NAN is not enough by itself.  There are only certain letters that actually exist in certain locations.  So a better REGEX pattern for Canadian postal code validity would be the following:</p>
<p><code>^[abceghjklmnprstvxyABCEGHJKLMNPRSTVXY][0-9][abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][ ]?[0-9][abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][0-9]$</code></p>
<p>This pattern explicitly lists valid letters.  Canadian postal codes do not use the letters D, F, I, O, Q or U anywhere, and they do not use W or Z in the first position.  Of course, this brings up another issue with any data quality method: remember that Canada Post could decide to change the rules, and then your data quality test would need to be updated.</p>
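<p>Long character lists like these are easy to mistype, so one option is to build the pattern from named pieces.  This is only a sketch in Python: the constant and function names are my own, and it uses the <code>re.IGNORECASE</code> flag instead of listing lower and upper case letters separately.</p>

```python
import re

# Letters allowed in the first position (no D, F, I, O, Q, U, W or Z).
FIRST = "ABCEGHJKLMNPRSTVXY"
# Letters allowed in the other letter positions (no D, F, I, O, Q or U).
OTHER = "ABCEGHJKLMNPRSTVWXYZ"

STRICT_POSTAL = re.compile(
    "^[" + FIRST + "][0-9][" + OTHER + "][ ]?[0-9][" + OTHER + "][0-9]$",
    re.IGNORECASE,  # accept lower case input as well
)

def is_well_formed_postal_code(value):
    """Return True when the value uses only letters allowed in each position."""
    return STRICT_POSTAL.match(value) is not None
```

<p>If the rules ever change, only the two letter lists need updating.</p>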
<h2>Ok, so that means the Postal code is ok right?... uh, No.</h2>
<p>So this regular expression will detect if a text string of length 6 or 7 is a valid Canadian postal code- but remember that this alone is probably not enough.  Chances are that this postal code is stored as part of an address, which will also include the city and province.   In Canada, postal codes are of course unique to a given province- the first letter defines the area, and each area exists within a particular province (large provinces have more than one letter assigned to them).<br />
<a href="http://www.datamartist.com/wp-content/uploads/2010/07/canadian-postal-code-first-letter-regions-map.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/07/canadian-postal-code-first-letter-regions-map.jpg" alt="" title="canadian-postal-code-first-letter-regions-map" width="402" height="343" class="alignright size-full wp-image-4709" /></a></p>
<p>This means that a properly formed postal code could be invalid-  for example, an address in Quebec that has a postal code that starts with the letter "V" which is for British Columbia has clearly got something amiss.</p>
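<p>A cross-field check like this is beyond what a single pattern on the postal code can do, but it is simple to sketch.  The Python below assumes provinces are stored as two-letter abbreviations and uses a first-letter lookup table that I typed in by hand; treat the table as an assumption to verify against Canada Post before relying on it.</p>

```python
# Assumed mapping of postal code first letter to province/territory codes.
# Verify against Canada Post before using it on real data.
FIRST_LETTER_REGIONS = {
    "A": {"NL"}, "B": {"NS"}, "C": {"PE"}, "E": {"NB"},
    "G": {"QC"}, "H": {"QC"}, "J": {"QC"},
    "K": {"ON"}, "L": {"ON"}, "M": {"ON"}, "N": {"ON"}, "P": {"ON"},
    "R": {"MB"}, "S": {"SK"}, "T": {"AB"}, "V": {"BC"},
    "X": {"NT", "NU"}, "Y": {"YT"},
}

def postal_code_fits_province(postal_code, province):
    """Return True when the first letter of the code belongs to the given province."""
    first = postal_code.strip()[:1].upper()
    return province.strip().upper() in FIRST_LETTER_REGIONS.get(first, set())

quebec_with_bc_code = postal_code_fits_province("V6B 1A1", "QC")
quebec_with_qc_code = postal_code_fits_province("H2X 1Y4", "QC")
```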
<p>So while learning a bit about regular expressions, we've also learned that if you had a big mailing list to clean, you would probably want to use a dedicated tool.  Postal addresses are an area of data quality where lots of attention has been paid over the years, and writing a lot of custom logic and regex patterns for them is unlikely to be a good use of your time.  But for application-specific codes and strings, regular expressions can be very useful.  In my next post, we'll look at some more tricks with regular expressions that can be used to analyze data quality.</p>
<p>I've posted a small collection of useful regular expressions to the datamartist website <a href="/useful-regular-expressions-for-data-quality">here</a>. </p>
<h1>Datamartist V1.3.0 PRO and Regular expressions</h1>
<p>The professional edition of the <a href="/">Datamartist tool</a> provides a function REGEX(text,regular expression) that returns TRUE or FALSE depending on whether the text "matches" the regular expression specified.   This function can be used anywhere in Datamartist where expressions are available, making it a powerful way to test if a string matches one or more patterns.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/an-introduction-to-using-regular-expressions-for-data-quality-validation/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>When should you data profile? Morning, Noon and Night!</title>
		<link>http://www.datamartist.com/when-should-you-data-profile-morning-noon-and-night</link>
		<comments>http://www.datamartist.com/when-should-you-data-profile-morning-noon-and-night#comments</comments>
		<pubDate>Wed, 22 Sep 2010 13:16:26 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4955</guid>
		<description><![CDATA[Data profiling is an important part of any data related project. The question often arises when the best time to data profile is. As you would expect from a software company that sells a really cool visual data profiling tool, our view is "all the time". Using data profiling tools before the project Data profiling [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.datamartist.com/wp-content/uploads/2010/09/sate-category-graph-250px.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/09/sate-category-graph-250px.jpg" alt="" title="sate-category-graph-250px" width="250" height="151" class="alignright size-full wp-image-5128" /></a>Data profiling is an important part of any data related project.  The question often arises when the best time to data profile is. As you would expect from a software company that sells <a href="/data-profiling-enhanced-datamartist-v1-3-released">a really cool visual</a> data profiling tool, our view is "all the time".</p>
<h2>Using data profiling tools before the project</h2>
<p>Data profiling is useful even before the project is defined.  By doing a first higher level data profiling on key data sets, you will get:</p>
<ul>
<li>better project scope definition</li>
<li>a more accurate budget estimate</li>
<li>a clear baseline from which improvement can be measured</li>
<li>the ability to correctly manage expectations</li>
</ul>
<p>The last two are important ones.  Making wild promises based on what the data model says should be in the tables and then failing to deliver due to data quality issues is not nearly as career enhancing as defining an ambitious but doable scope and delivering based on the actual data, clearly communicating the progress made based on facts.</p>
<p>Data quality issues can seriously (double digit percentage seriously) affect the final cost of a project.  Knowing about issues will let you set a realistic budget.</p>
<h2>Data profiling at the beginning of the project</h2>
<p>After the initial higher level data profiling done before the project,  budgeting for a more detailed data profiling of the source data will:</p>
<ul>
<li>give clear design guidance to ETL developers</li>
<li>clearly identify the subject matter experts needed to understand the data, and let you engage them early, rather than in a rush when ETL development hits the underlying data issues and the project is already running late and over budget</li>
</ul>
<h2>Data profiling during the project</h2>
<ul>
<li>By setting up automated data profiling tasks for the output of each ETL process, ETL developers and architects can track the progress for migration, cleansing or conversion tasks using concrete information.</li>
<li>Objective criteria can be set for each profile task to determine what level of data quality is considered "acceptable" for the final data load or fact set deliverables.</li>
</ul>
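<p>As a rough illustration of an objective criterion, the Python sketch below measures what fraction of values in a column pass a pattern test and compares it to a minimum acceptable rate.  The pattern, the sample values and the thresholds are all invented for the example; a real project would agree on its own.</p>

```python
import re

def pass_rate(values, pattern):
    """Return the fraction of values that match the pattern (0.0 for an empty list)."""
    if not values:
        return 0.0
    compiled = re.compile(pattern)
    passed = sum(1 for value in values if compiled.search(value))
    return passed / len(values)

def meets_criterion(values, pattern, minimum_rate):
    """Return True when the pass rate reaches the agreed minimum."""
    return pass_rate(values, pattern) >= minimum_rate

# Hypothetical column of product codes and a hypothetical format rule.
column = ["ABC-1", "DFG-12", "bad", "HGF-34564"]
rule = r"^[a-zA-Z]{3}[-][0-9]{1,5}$"

rate = pass_rate(column, rule)
ok_at_95_percent = meets_criterion(column, rule, 0.95)
ok_at_75_percent = meets_criterion(column, rule, 0.75)
```

<p>Recording the rate after each ETL run gives the concrete numbers needed to track progress over the life of the project.</p>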
<h2>Data profiling at the end of the project</h2>
<ul>
<li>Doing a final data profiling run, and comparing it to the baseline established before the project, will provide a clear Before/After view that will both clearly communicate the progress made and assist in justifying and promoting the next data quality initiative.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/when-should-you-data-profile-morning-noon-and-night/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>What is data profiling?  Data in the real world.</title>
		<link>http://www.datamartist.com/what-is-data-profiling-and-when-and-why-should-it-be-done</link>
		<comments>http://www.datamartist.com/what-is-data-profiling-and-when-and-why-should-it-be-done#comments</comments>
		<pubDate>Tue, 21 Sep 2010 20:44:36 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4491</guid>
		<description><![CDATA[First of all, full disclosure, if you haven't already noticed, this blog is written by a software company that makes a pretty cool data profiling tool. We've just released our new version V1.3.0, so obviously we can't pretend to be completely objective in the "should you data profile debate". But bear with me, because I [...]]]></description>
			<content:encoded><![CDATA[<p>First of all, full disclosure: if you haven't already noticed, this blog is written by a software company that makes a <a href="/">pretty cool data profiling tool</a>.  We've just released our new version V1.3.0,  so obviously we can't pretend to be completely objective in the "should you data profile" debate.  But bear with me, because I think I'm still going to be able to convince you it's a good idea, regardless of which tool you select. </p>
<p>The way I think of data profiling is that it is the "Reality check" of all your data activities.</p>
<h2> The data model is the theory.</h2>
<h2> Data profiling lets you see reality.</h2>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/09/the-data-model-looks-good-but-data-profiling-tools-say-ooops.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/09/the-data-model-looks-good-but-data-profiling-tools-say-ooops.jpg" alt="" title="the-data-model-looks-good-but-data-profiling-tools-say-ooops" width="393" height="268" class="alignright size-full wp-image-4944" /></a>The bottom line is that you know what is <em>supposed</em> to be in the column by looking at the data model, but you can know what is <em>actually</em> in the column by data profiling the data.</p>
<p>For a fantastic detailed look at data profiling, check out <a href="http://www.ocdqblog.com/home/adventures-in-data-profiling-part-1.html" target="_blank">this series of blog posts</a> from Jim Harris, over at obsessive compulsive data quality.  It's a great series that clearly communicates the type of analysis that might be done while doing data profiling and the kind of detective work it sometimes takes.  While I'll be looking at a number of the same concepts in this series of blog posts, being a software vendor, I won't be as tool agnostic as Jim was- we'll be demonstrating the techniques using the <a href="/">Datamartist</a> tool. </p>
<p>You can download a free trial of the Datamartist tool and try the profiling techniques and features yourself as you follow along.</p>
<p>We hope you enjoy the upcoming data profiling tutorials and that you give the Datamartist data profiler and transformer serious consideration for your data profiling needs.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/what-is-data-profiling-and-when-and-why-should-it-be-done/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

