<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datamartist.com &#187; Data profiling</title>
	<atom:link href="http://www.datamartist.com/category/data-profiling/feed" rel="self" type="application/rss+xml" />
	<link>http://www.datamartist.com</link>
	<description>Reduce cost with self serve data transformation</description>
	<lastBuildDate>Thu, 09 Feb 2012 20:00:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>A New Year's resolution to data profile</title>
		<link>http://www.datamartist.com/a-new-years-resolution-to-data-profile</link>
		<comments>http://www.datamartist.com/a-new-years-resolution-to-data-profile#comments</comments>
		<pubDate>Tue, 10 Jan 2012 15:54:05 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Reality Check]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=6165</guid>
		<description><![CDATA[Well, it's the time of making and breaking resolutions, a time when setting realistic goals is sometimes hard to do with all the optimism of the new year. Sometimes, we make decisions NOT to set a goal, because we don't want to break it. You might be thinking you really should step up your data [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.datamartist.com/wp-content/uploads/2012/01/data-profiling-some-data.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2012/01/data-profiling-some-data-300x225.jpg" alt="" title="data-profiling-some-data" width="300" height="225" class="alignright size-medium wp-image-6171" /></a>Well, it's the time of making and breaking resolutions, a time when setting realistic goals is sometimes hard to do with all the optimism of the new year.  </p>
<p>Sometimes, we make decisions NOT to set a goal, because we don't want to break it.  </p>
<p>You might be thinking you really should step up your data quality monitoring- get some data profiling underway to help identify the data domains and areas you most want to tackle in 2012.  But you might also be thinking that with all the pressures and cutbacks that many companies are facing, you don't have the resources to implement a full-scale profiling and monitoring effort, and so might decide to delay. </p>
<p>Don't wait. Just do it.  The perfect is the enemy of the good.</p>
<p>Rather than worrying about how much of your data you are going to be able to cover, or that you can't devote enough resources to tackle all of your reference areas at once, work at the problem from another direction.  </p>
<h1>First, start with master data.</h1>
<p>Master data is the data that all your other data is made from.  It's the data everyone uses to view the massive piles of transactional data, so the impact of one bad row in a master data table is felt across perhaps hundreds of reports and multiple time periods.  If you have a product in the wrong category, then every transaction for it, across perhaps hundreds of customers and all time, will be miscategorized, and every total, sub-total and calculated metric using it will suffer.</p>
<p>While bad transactions are bad, bad reference data is deadly.  Bad reference data takes a good transaction and messes it up.</p>
<h1>Worst first!</h1>
<p>Make a list of your reference tables/areas.  Customer, Product, Chart of accounts, etc.  Which are the most important for your business?  This isn't something I can tell you- you have to think about what is most critical.</p>
<p>If you are a company that purchases large amounts of materials from many vendors, and purchasing decisions are fast paced and critical, then maybe it's your vendor master, and your accounts payable.</p>
<p>On the other hand, if you have lots of interaction with your customers, and errors in the customer master cost you business, then start with that.</p>
<p>The key is to first make the list, and then think to yourself "if I have bad quality data, where am I most afraid it will be?"  Start profiling there.  You want to find the worst first, and fixing that will have the greatest positive impact.</p>
<h1>Get to know your data</h1>
<p>Don't worry about setting complex or work intensive goals right away.  Data profiling is about data discovery sometimes.  You need to wade into your reference data, play with it, tease out patterns and relationships.  As you get to know your data, you will be able to better identify where there are issues to tackle, and where root causes might lie for data quality issues.</p>
<p>One approach might be to simply resolve to spend an hour a week, every week, profiling some data.  If you aren't doing that now, you will find that even just a bit of time set aside will give huge insight- sometimes we get too busy to do the basics, and we miss opportunities to make significant improvements in our data with relatively little effort.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/a-new-years-resolution-to-data-profile/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data profiling- a search or a code to crack?</title>
		<link>http://www.datamartist.com/data-profiling-a-search-or-a-code-to-crac</link>
		<comments>http://www.datamartist.com/data-profiling-a-search-or-a-code-to-crac#comments</comments>
		<pubDate>Wed, 03 Nov 2010 17:50:08 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=5848</guid>
		<description><![CDATA[Often, tracking down data quality issues is presented as a search for bad data- but sometimes the data isn't so much bad, as not understood. In legacy systems, you might be more trying to first find the meaning of data- in effect, decoding it as if it had been encrypted (which in a way, time [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.datamartist.com/wp-content/uploads/2010/11/300px-Enigma-rotor-stack.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/300px-Enigma-rotor-stack.jpg" alt="" title="Photo by Bob Lord" width="300" height="225" class="alignright size-full wp-image-5850" /></a>Often, tracking down data quality issues is presented as a search for bad data- but sometimes the data isn't so much bad, as not understood.  In legacy systems, you might first be trying to find the meaning of the data- in effect, decoding it as if it had been encrypted (which in a way, time and lack of documentation might very well have done).</p>
<p>You know that all that data means something- but what?</p>
<p>One of my favorite code-busting stories is the epic victory over the Enigma code during the Second World War.  One of the reasons it's of interest is that it was one of the early applications of computing- but the key lesson, I think, comes not from the brute force computation done, but from the strategies used to crack the code.</p>
<p>When you are trying to crack a code, one of the key things you need is "Cribs"- some way to have samples of both the coded message and the clear text.  These cribs can radically reduce the number of possible ways a code can be decoded.</p>
<p>In the case of Enigma, the Allies would listen for German U-boat radio transmissions, while also using direction finding equipment to estimate their location.  Standard procedure was for a U-boat to first radio a weather report.</p>
<p>By painstakingly backtracking known weather conditions and the locations of U-boats when they transmitted, it was possible to take advantage of that first weather report- there were only so many ways to say "Sunny and calm".  Having this crib gave them a way to break into the code.</p>
<p>What is the point in terms of Data profiling?  While it's critical to have the right tools to analyse the data (a data profiler like <a href="/">Datamartist</a>, for example), it's also important to get out there and talk to people, understand what's going on- collect some Cribs that will help it all make sense. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-profiling-a-search-or-a-code-to-crac/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using regular expressions to check data quality Part 2</title>
		<link>http://www.datamartist.com/how-to-use-regular-expressions-to-check-data-quality-part-2</link>
		<comments>http://www.datamartist.com/how-to-use-regular-expressions-to-check-data-quality-part-2#comments</comments>
		<pubDate>Mon, 27 Sep 2010 13:34:21 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Regular expressions]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4740</guid>
		<description><![CDATA[Regular expressions are a powerful way to test if strings match a given pattern or rule set. They can be used to validate the structure of a string field in data, highlighting any obviously incorrect string values. Note: In a previous post, I introduced regular expressions and went through a simple example using Canadian Postal [...]]]></description>
			<content:encoded><![CDATA[<p>Regular expressions are a powerful way to test if strings match a given pattern or rule set.  They can be used to validate the structure of a string field in data, highlighting any obviously incorrect string values.</p>
<p>Note: In a <a href="/an-introduction-to-using-regular-expressions-for-data-quality-validation">previous</a> post, I introduced regular expressions and went through a simple example using Canadian Postal code validation.</p>
<p>In this post, we'll learn some more regular expression syntax, and explore some examples.</p>
<h1>Using regular expressions to validate codes</h1>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/07/we-are-changing-all-the-product-codes-again-problem1.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/07/we-are-changing-all-the-product-codes-again-problem1.jpg" alt="" title="we-are-changing-all-the-product-codes-again-problem" width="373" height="212" class="alignright size-full wp-image-4750" /></a>Often, within ERP systems, various entities such as organisational units, products, etc. are assigned structured alphanumeric codes.  These codes have a defined structure- all valid codes must have this structure.  For example, a product code might have the pattern:</p>
<p><strong>aaa-nnnnn</strong></p>
<p>Here the first three characters are a product group code, followed by a mandatory dash and then a five digit product number.</p>
<p>You will remember from last time, a set of acceptable characters is defined using square brackets.  So digits only can be specified by  [0123456789] and letters by [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ].</p>
<p>To simplify things, dashes can be used to include all digits or letters in a range, and these two formats can of course be combined.  The string [a-zA-Z0-9] specifies that any letter or digit is acceptable.</p>
<p>In addition, you can make a rule as to how many consecutive characters are required that follow the rule by adding curly brackets.  So the following pattern would test for a valid product code in this format:</p>
<p><code>^[a-zA-Z]{3}[-][0-9]{5}$</code></p>
<p>The {3} and {5} apply a length limit to the immediately preceding test- so there must be exactly three letters, a dash and then five digits.</p>
<p>To specify a range of lengths, you can put two numbers in the curly brackets, separated by a comma.  So to allow product codes like ABC-1, DFG-12 or HGF-34564 we can use {1,5} instead for the second test- this allows the number after the dash to be made up of between one and five digits.</p>
<p><code>^[a-zA-Z]{3}[-][0-9]{1,5}$</code></p>
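<p>As a rough sketch, here is how the same test might look in Python using the standard <code>re</code> module.  The function name and the sample codes are made up for illustration; only the pattern itself comes from the text above.</p>

```python
import re

# Pattern from the text: three letters, a dash, then one to five digits.
PRODUCT_CODE = re.compile(r"^[a-zA-Z]{3}[-][0-9]{1,5}$")


def is_valid_product_code(code):
    """Return True when the whole string follows the aaa-n...n structure."""
    # fullmatch also rejects a trailing newline, which "$" alone tolerates in Python.
    return PRODUCT_CODE.fullmatch(code) is not None


# Hypothetical sample values: three that should pass, three that should not.
for sample in ["ABC-1", "DFG-12", "HGF-34564", "AB-123", "ABC-123456", "ABC123"]:
    print(sample, is_valid_product_code(sample))
```

<p>The first three samples pass; the last three fail because of a short prefix, too many digits and a missing dash respectively.</p>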
<p>From this example, you can see how it's possible to fairly easily make patterns to test for a wide range of structured codes.</p>
<h1>Defining more than one rule set at a time</h1>
<p>Sometimes, you might have two different formats possible for a given value.  As painful as it is, companies often change their coding rules, but due to legacy constraints, keep legacy coded entities in their data sets as well.</p>
<p>In regular expressions, it is easy to combine two completely different patterns in one test by using the pipe character ("|").  In fact, this operator can also be used within a single pattern to check for one OR the other of two different things.</p>
<p>For example, let's pretend that we've added a whole new product line to our offering, and the codes for these new products will have the structure XY-NNNNN-AA, where XY is either the letter "A" followed by a single non-zero digit (ie A1,A2,A3...) OR, if X is any letter other than A, then Y should be a letter too. (I'm just making this up, but I can tell you some of the rules I've seen in the real world are a lot more complex and seemingly arbitrary than this.)</p>
<p>A regular expression that would validate this new product code would be as follows:</p>
<p><code>^([aA][1-9]|[b-zB-Z]{2})[-][0-9]{5}[-][a-zA-Z]{2}$</code></p>
<p>Note that we put the first expressions inside round brackets, and separated the two different rules by the "|" to make an or.  So if [aA][1-9] OR [b-zB-Z]{2} is true at the start of the string, then that part of the test is OK.  By starting at "b" we exclude the bad codes of "AB" or "AG" etc. because the rule says if it starts with A, the second character needs to be a number.</p>
<p>But, of course all of our old product codes are still going to be there, and they won't pass this new test-  so to combine the two, we just take each of them, and put a pipe between them:</p>
<p><code>^[a-zA-Z]{3}[-][0-9]{1,5}$|^([aA][1-9]|[b-zB-Z]{2})[-][0-9]{5}[-][a-zA-Z]{2}$</code></p>
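<p>Building it a bit at a time is easy to mirror in code.  The sketch below, in Python with the standard <code>re</code> module, keeps the two patterns from this post as separate strings and joins them with the pipe.  The variable names, function name and sample codes are invented for the example.</p>

```python
import re

# The two patterns from the text, kept separate so each can be read on its own.
OLD_FORMAT = r"^[a-zA-Z]{3}[-][0-9]{1,5}$"
NEW_FORMAT = r"^([aA][1-9]|[b-zB-Z]{2})[-][0-9]{5}[-][a-zA-Z]{2}$"

# Joining them with "|" accepts a code that matches either structure.
EITHER_FORMAT = re.compile(OLD_FORMAT + "|" + NEW_FORMAT)


def is_valid(code):
    """Return True when the code conforms to the old or the new structure."""
    return EITHER_FORMAT.fullmatch(code) is not None


# Hypothetical sample codes.
samples = ["ABC-123", "A1-12345-XY", "BC-12345-XY", "A0-12345-XY", "AB-12345-XY", "BA-12345-XY"]
for code in samples:
    print(code, is_valid(code))
```

<p>The first three samples pass and the last three fail.  One thing the sketch makes visible: as written, [b-zB-Z]{2} also rejects a code such as BA-12345-XY, because the "b to z" range applies to both characters.  If the rule is meant to allow any letter in the second position, [b-zB-Z][a-zA-Z] would be the variant to use.</p>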
<p>It looks like crazy gibberish- but when you build it a bit at a time, it all makes sense.  And now, you can easily detect if there are any product codes that don't conform to either valid structure.</p>
<h1>Using this code in Datamartist</h1>
<p>If you were using the <a href="/">Datamartist tool</a>, you could find the bad product codes by adding a new column using a calculation block, called, say "ProductCodeValid", that was defined as the following Datamartist expression:</p>
<p>REGEX([PRODUCT],"^[a-zA-Z]{3}[-][0-9]{1,5}$|^([aA][1-9]|[b-zB-Z]{2})[-][0-9]{5}[-][a-zA-Z]{2}$")</p>
<p>This column will now have TRUE for the records where the product code is well formed, and FALSE where there are issues.</p>
<p>Regular expressions are a very useful way to check many types of data quality and they can help you avoid all sorts of crazy tests with LEFT, RIGHT and MID string functions. </p>
<p>Using REGEX will let you create much more powerful data quality code.</p>
<p>If you want to try building some REGEX yourself, you can <a href="/downloads">download the Datamartist free trial</a>, and give it a go with your own data.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/how-to-use-regular-expressions-to-check-data-quality-part-2/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>An introduction to using regular expressions for data quality validation</title>
		<link>http://www.datamartist.com/an-introduction-to-using-regular-expressions-for-data-quality-validation</link>
		<comments>http://www.datamartist.com/an-introduction-to-using-regular-expressions-for-data-quality-validation#comments</comments>
		<pubDate>Thu, 23 Sep 2010 17:43:31 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4694</guid>
		<description><![CDATA[Regular expressions (sometimes referred to as regex or regexp) are a powerful formal language that can be used to match text strings to patterns. The way regular expressions work is like this: A pattern is defined. This is a string of symbols that act as a set of rules. A text string to test, and [...]]]></description>
			<content:encoded><![CDATA[<p>Regular expressions (sometimes referred to as regex or regexp) are a powerful formal language that can be used to match text strings to patterns.</p>
<p>The way regular expressions work is like this:<br />
<a href="http://www.datamartist.com/wp-content/uploads/2010/07/regular-expressions-data-profiling-and-data-quality-overview.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/07/regular-expressions-data-profiling-and-data-quality-overview.jpg" alt="" title="regular-expressions-data-profiling-and-data-quality-overview" width="333" height="221" class="alignright size-full wp-image-4734" /></a></p>
<ol>
<li>A <strong>pattern</strong> is defined.  This is a string of symbols that act as a set of rules.</li>
<li>A <strong>text string to test</strong>, and the pattern are given to a regular expression engine, and compared.</li>
<li>The engine returns a <strong>true/false</strong> value meaning the string follows the rules or not ("PASSED" or "REJECTED" by the pattern).</li>
</ol>
<p>This is obviously very useful for someone interested in Data Quality-  If you had a pattern that said "Is this a valid email address?", and got PASS or REJECT back, it would give you a good idea as to the quality of that field in your contact database.</p>
<p>One advantage of regular expressions is that because they are widely used, lots and lots of them have been created, detecting all sorts of patterns- meaning that while you can write your own, you can also look up useful ones you need in libraries.</p>
<p>Regular expressions aren't magic, of course- the result is only as good as the program. (As always.)  Depending on how well the regex is written (or not) there may be false positives or negatives.</p>
<h2>A regex example- Canadian Postal Code</h2>
<p>Let's look at a simple example, and see how they work.  Being from Canada, I'm going to use the example of validating a Canadian postal code.</p>
<p>Canadian Postal codes take the format ANA NAN, where "A" is a letter and "N" is a number.  So what we want is a regular expression that will return TRUE for valid postal codes, and FALSE for postal codes that just can't be right.  "K9J 2K2" could be a valid postal code, but we know that "38X AB2" just can't be.</p>
<p>In a regular expression, we use anchors to say where to start matching. In this case, we want just the Canadian postal code, so we'll use the anchor character "^" to specify the beginning of the string.</p>
<p>To specify that a character has to be within a given set or range of characters, we use square brackets.  So to match whenever the first character of a string is a letter, the regex would be:</p>
<p><code>^[a-zA-Z]</code></p>
<p>This regex will return TRUE for all strings that start with a letter. That's fine, but not yet specific enough for a Canadian postal code (we Canadians are very very picky).</p>
<p>So we can add a number, then another letter constraint to our pattern.</p>
<p><code>^[a-zA-Z][0-9][a-zA-Z]</code></p>
<p>So far, any string that starts with ANA will result in true- we're almost half way there! Next, we want to specify that the space is optional- that is, it's acceptable to have the space or not.<br />
To do this, we put a "?" after the space.  And then to finish up, we add the part of the expression that detects the NAN, and end with a dollar sign, which specifies that this needs to be the end of the string (otherwise all strings that started with a valid postal code would pass):</p>
<p><code>^[a-zA-Z][0-9][a-zA-Z][ ]?[0-9][a-zA-Z][0-9]$</code></p>
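<p>To see the pattern at work, here is a minimal sketch in Python using the standard <code>re</code> module.  The function name is made up, and the test strings are the two examples from earlier in this post plus a few invented ones.</p>

```python
import re

# The ANA NAN pattern from the text, with an optional space in the middle.
POSTAL_SHAPE = re.compile(r"^[a-zA-Z][0-9][a-zA-Z][ ]?[0-9][a-zA-Z][0-9]$")


def has_postal_code_shape(value):
    """Return True when the string has the letter-digit-letter digit-letter-digit shape."""
    return POSTAL_SHAPE.fullmatch(value) is not None


for value in ["K9J 2K2", "K9J2K2", "38X AB2", "K9J 2K2 extra", "K9J  2K2"]:
    print(repr(value), has_postal_code_shape(value))
```

<p>The first two values pass.  The others fail: one has the wrong shape, one carries extra text after the code (which is what the dollar sign is for), and one has two spaces where at most one is allowed.</p>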
<p>So there it is- or is it?</p>
<h2>But is the REGEX fussy enough?</h2>
<p>While this pattern that we've created does detect the ANA NAN pattern, and even allows the space to be optional, if you know Canadian postal codes, you'll know that in fact ANA NAN is not enough by itself.  Only certain letters actually exist in certain positions.  So a better REGEX pattern for Canadian postal code validity would be the following:</p>
<p><code>^[abceghjklmnprstvxyABCEGHJKLMNPRSTVXY][0-9][abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][ ]?[0-9][abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][0-9]$</code></p>
<p>This pattern explicitly lists valid letters.  Canadian postal codes do not use the letters D,F,I,O,Q or U anywhere and they do not use W or Z in the first position.  Of course, this brings up another issue with any data quality method-  remember Canada Post could decide to change the rules- then your data quality test would need to be updated.</p>
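<p>Long character lists like these are easy to mistype, so one option is to build the pattern from named letter sets.  The following is a sketch in Python, not how Datamartist does it: the constant and function names are invented, and it uses the <code>re.IGNORECASE</code> flag instead of listing both upper and lower case.</p>

```python
import re

# Letters allowed in the first position (no D, F, I, O, Q, U, and no W or Z).
FIRST_LETTERS = "ABCEGHJKLMNPRSTVXY"
# Letters allowed in the other letter positions (no D, F, I, O, Q, U).
OTHER_LETTERS = "ABCEGHJKLMNPRSTVWXYZ"

STRICT_POSTAL = re.compile(
    f"^[{FIRST_LETTERS}][0-9][{OTHER_LETTERS}][ ]?[0-9][{OTHER_LETTERS}][0-9]$",
    re.IGNORECASE,
)


def is_plausible_postal_code(value):
    """Return True when the shape is ANA NAN and only permitted letters are used."""
    return STRICT_POSTAL.fullmatch(value) is not None


for value in ["K9J 2K2", "k9j2k2", "D1A 1A1", "W1A 1A1", "K9O 2K2"]:
    print(value, is_plausible_postal_code(value))
```

<p>The first two values pass; the other three each use a letter that is not allowed in that position.  If the rules change, only the two letter sets need to be edited.</p>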
<h2>Ok, so that means the Postal code is ok right?... uh, No.</h2>
<p>So this regular expression will detect if a text string of length 6 or 7 is a validly formatted Canadian postal code- but remember that this alone is probably not enough.  Chances are that this postal code is stored as part of an address, which will also include the city and province.   In Canada, postal codes are of course unique to a given province- the first letter defines the area, and each area exists within a particular province (large provinces have more than one letter assigned to them).<br />
<a href="http://www.datamartist.com/wp-content/uploads/2010/07/canadian-postal-code-first-letter-regions-map.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/07/canadian-postal-code-first-letter-regions-map.jpg" alt="" title="canadian-postal-code-first-letter-regions-map" width="402" height="343" class="alignright size-full wp-image-4709" /></a></p>
<p>This means that a properly formed postal code could be invalid-  for example, an address in Quebec that has a postal code that starts with the letter "V" which is for British Columbia has clearly got something amiss.</p>
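<p>A cross-field check like this is a lookup rather than a regex.  Here is a small Python sketch of the idea.  The table of first letters is my own illustrative summary using two-letter province abbreviations, and should be checked against current Canada Post information before being relied on; the function name is made up too.</p>

```python
# Illustrative lookup: first letter of the postal code -> province/territory codes.
# Assumption: this table is a hand-written summary, not an official source.
FIRST_LETTER_PROVINCES = {
    "A": {"NL"}, "B": {"NS"}, "C": {"PE"}, "E": {"NB"},
    "G": {"QC"}, "H": {"QC"}, "J": {"QC"},
    "K": {"ON"}, "L": {"ON"}, "M": {"ON"}, "N": {"ON"}, "P": {"ON"},
    "R": {"MB"}, "S": {"SK"}, "T": {"AB"}, "V": {"BC"},
    "X": {"NT", "NU"}, "Y": {"YT"},
}


def postal_code_matches_province(postal_code, province):
    """Return True when the first letter of the code belongs to the given province."""
    first_letter = postal_code.strip()[:1].upper()
    return province.strip().upper() in FIRST_LETTER_PROVINCES.get(first_letter, set())


# A Quebec address with a "V" postal code is flagged; an "H" code is fine.
print(postal_code_matches_province("V6B 1A1", "QC"))
print(postal_code_matches_province("H2X 1Y4", "QC"))
```

<p>This only checks the first letter, so it catches gross mismatches between province and postal code, not a wrong code within the right province.</p>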
<p>So while learning a bit about regular expressions, we've also learned that if you had a big mailing list to clean you would probably want to use a dedicated tool-  postal addresses are an area of data quality where lots of attention has been paid over the years, and writing a lot of custom logic and regex patterns is probably not a good use of your time.  But for application specific codes and strings it might be very useful.  In my next post, we'll look at some more tricks with regular expressions that can be used to analyze data quality.</p>
<p>I've posted a small collection of useful regular expressions to the datamartist website <a href="/useful-regular-expressions-for-data-quality">here</a>. </p>
<h1>Datamartist V1.3.0 PRO and Regular expressions</h1>
<p>The professional edition of the <a href="/">Datamartist tool</a> provides a function REGEX(text,regex expression) that returns TRUE or FALSE depending on if the text "matches" with the regular expression specified.   This function can be used anywhere in Datamartist where expressions are available, making it a powerful way to test if a string matches one or more patterns.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/an-introduction-to-using-regular-expressions-for-data-quality-validation/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data profiling rules and data format strings</title>
		<link>http://www.datamartist.com/data-profiling-rules-and-data-formats</link>
		<comments>http://www.datamartist.com/data-profiling-rules-and-data-formats#comments</comments>
		<pubDate>Thu, 23 Sep 2010 13:26:24 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Datamartist Tool]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4585</guid>
		<description><![CDATA[A very useful technique in data profiling is data format analysis. Rather than looking at the actual individual values for a given column, by profiling the structure of the values you can understand at a higher level the quality of data. This technique is primarily used for string based data. Can't find the forest because [...]]]></description>
			<content:encoded><![CDATA[<p>A very useful technique in data profiling is data format analysis.</p>
<p>Rather than looking at the actual individual values for a given column, by profiling the  structure of the values you can understand at a higher level the quality of data.  This technique is primarily used for string based data.</p>
<h2>Can't find the forest because of the trees?</h2>
<p>Let's look at a simple but very real example.  You have a data set with a few tens of thousands of rows that includes a text field for <strong>telephone number.</strong> </p>
<p> What kind of phone numbers do you have?</p>
<p>Nobody wants to look through 40,000 rows value by value to see where the bad records are.</p>
<h2>Profiling with character substitution and elimination</h2>
<p>With a data profiler you can define rules:</p>
<ol>
<li>IF the character is a digit (0-9) then replace it with the letter "n"</li>
<li>IF the character is a letter (a-z) then replace it with the letter "a"</li>
</ol>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/07/phone-number-data-profiled-data-formats1.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/07/phone-number-data-profiled-data-formats1.jpg" alt="" title="phone-number-data-profiled-data-formats" width="450" height="327" class="alignright size-full wp-image-4761" /></a></p>
<p>These two simple rules turn 40,000 rows of phone numbers into a short list of different phone number formats- and we can see which ones seem valid or seem to have problems.</p>
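<p>The two substitution rules are simple enough to sketch in a few lines of Python.  This is not how Datamartist implements them, just an illustration of the idea; the function name and the sample phone numbers are made up.</p>

```python
import string
from collections import Counter


def format_of(value):
    """Replace each digit with "n" and each letter with "a"; keep everything else."""
    out = []
    for ch in value:
        if ch in string.digits:
            out.append("n")
        elif ch in string.ascii_letters:
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)


# Made-up sample values, standing in for the phone number column.
phones = ["(613) 555-0142", "(416) 555-0199", "613-555-0123", "6135550177", "unknown"]

# Count how many rows share each format, most common first.
format_counts = Counter(format_of(phone) for phone in phones)
for fmt, count in format_counts.most_common():
    print(count, fmt)
```

<p>Five different values collapse to four formats here; on a real column of tens of thousands of rows the reduction is far more dramatic.</p>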
<p>By adding one more rule, eliminating a few characters that do not directly affect the phone number, we can reduce the list even further.<br />
<a href="http://www.datamartist.com/wp-content/uploads/2010/09/phone-rules-data-formating.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/09/phone-rules-data-formating.jpg" alt="" title="phone-rules-data-formating" width="598" height="266" class="aligncenter size-full wp-image-5190" /></a></p>
<p>The last rule ignores the space, open and close brackets and dash characters- and that simplifies the different formats in the analysis to a mere five formats:</p>
<ol>
<li>10 digits</li>
<li>11 digits</li>
<li>9 digits</li>
<li>A 7 character string (drilling down by clicking on this bar reveals the word "unknown"- not so useful.)</li>
<li>Missing.</li>
</ol>
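<p>Here is a rough Python sketch of the same idea with the extra "ignore" rule added.  The set of ignored characters follows the rule described above; the function names, the "(missing)" label and the sample values are assumptions made for the example.</p>

```python
import string
from collections import Counter

# Characters that do not change the phone number itself, so they are dropped.
IGNORED = set(" ()-")


def simplified_format(value):
    """Digits become "n", letters become "a", ignored characters are removed."""
    if value is None or value.strip() == "":
        return "(missing)"
    out = []
    for ch in value:
        if ch in IGNORED:
            continue
        if ch in string.digits:
            out.append("n")
        elif ch in string.ascii_letters:
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)


def describe(fmt):
    """Summarise an all-digit format such as "nnnnnnnnnn" as "10 digits"."""
    if fmt and set(fmt) == {"n"}:
        return f"{len(fmt)} digits"
    return fmt


# Made-up sample values covering the five cases in the list above.
phones = ["(613) 555-0142", "613-555-0123", "1-416-555-0199", "613 555 014", "unknown", ""]

summary = Counter(describe(simplified_format(phone)) for phone in phones)
for label, count in summary.most_common():
    print(count, label)
```

<p>With punctuation out of the way, the first two samples fall into the same "10 digits" bucket even though they were typed differently.</p>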
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/07/phone-number-data-profiled-data-formats-simplified.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/07/phone-number-data-profiled-data-formats-simplified.jpg" alt="" title="phone-number-data-profiled-data-formats-simplified" width="450" height="327" class="aligncenter size-full wp-image-4774" /></a></p>
<p>Of course, data format profiling does not always give a final answer regarding data quality.  In this case, just because a phone number has the right number of digits does not mean that it is a valid phone number, and even if it's a valid phone number it may not be the right phone number for that customer... </p>
<h2>Data format analysis for structured codes</h2>
<p>This same technique is also useful for any fields that are meant to contain structured codes.</p>
<p>For example, say a valid product code is supposed to start with one letter, which should be one of A, B, D or G, followed by a dash, and then four digits.</p>
<p>So A-2324 is a valid code, but M-2334 and J234 are not, for example.</p>
<p>The following rules would help detect a validly formatted string (and point out the issues):</p>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/09/data-format-rules-product-codes.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/09/data-format-rules-product-codes.jpg" alt="" title="data-format-rules-product-codes" width="664" height="278" class="aligncenter size-full wp-image-5168" /></a></p>
<p>The result on a mocked up set of data:</p>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/09/product-codes-data-format-graph.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/09/product-codes-data-format-graph.jpg" alt="" title="product-codes-data-format-graph" width="550" height="243" class="aligncenter size-full wp-image-5172" /></a></p>
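<p>For readers who want to try the idea outside the tool, here is one way it might be sketched in Python.  The rules below are my own reading of the example (keep A, B, D and G as themselves, turn any other letter into "a" and any digit into "n") and are not copied from the rule screen above; the names and sample codes are invented.</p>

```python
import string
from collections import Counter

# Letters that are allowed as the product code prefix in this example.
VALID_PREFIXES = set("ABDG")
# After substitution, a well-formed code looks like one of these formats.
VALID_FORMATS = {prefix + "-nnnn" for prefix in VALID_PREFIXES}


def code_format(code):
    """Keep valid prefix letters, replace other letters with "a" and digits with "n"."""
    out = []
    for ch in code:
        if ch in string.digits:
            out.append("n")
        elif ch in VALID_PREFIXES:
            out.append(ch)
        elif ch in string.ascii_letters:
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)


# Made-up product codes, including the three from the text.
codes = ["A-2324", "M-2334", "J234", "B-0012", "G-12345", "D-987"]

counts = Counter(code_format(code) for code in codes)
for fmt, count in counts.most_common():
    status = "ok" if fmt in VALID_FORMATS else "check"
    print(count, fmt, status)
```

<p>Each invalid format shows what kind of problem it is: "a-nnnn" is a bad prefix letter, "annn" is a missing dash and digit, and "G-nnnnn" or "D-nnn" is the wrong number of digits.</p>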
<p>Using data format patterns to examine the contents of a string column is a very useful way to start to understand what's in the column.  More than just a Yes-No result, it actually gives you a visual look at what types of issues exist.</p>
<p>The analysis here was done using the <a href="/">Datamartist tool</a>, an easy to use data profiler and data transformation tool.  To try making some data format rules for your own data, give the <a href="/downloads">free trial a try</a>.</p>
<p>Up next in our data profiling series of blog posts- an even more powerful (although often more complex) technique called "Regular Expressions" or "Regex".  This specification language can define complex rules that analyze strings and determine if they belong to a given set (say, "Valid product codes" as in this example) or not.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-profiling-rules-and-data-formats/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bill&#8217;s Epic data project Fail- a cautionary tale</title>
		<link>http://www.datamartist.com/data-profiling-avoidance-a-cautionary-tale</link>
		<comments>http://www.datamartist.com/data-profiling-avoidance-a-cautionary-tale#comments</comments>
		<pubDate>Wed, 22 Sep 2010 17:02:13 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Humour]]></category>
		<category><![CDATA[Just for fun]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=5133</guid>
		<description><![CDATA[In which Bill decides that data profiling is not necessary. We sit down to keyboard to tell a sad tale Of data project manager Bill, and his epic huge fail Bill thought he had all that it took He looked through the data model, "It reads like a book" He happily flipped through tables and [...]]]></description>
			<content:encoded><![CDATA[<h1>In which Bill decides that data profiling is not necessary.</h1>
<p>We sit down to keyboard to tell a sad tale<br />
Of data project manager Bill, and his epic huge fail</p>
<p>Bill thought he had all that it took<br />
He looked through the data model, "It reads like a book"</p>
<p>He happily flipped through tables and maps<br />
Overjoyed to see a model so free of all gaps.</p>
<p>Then through the sales system checking the norm<br />
Oh Joy! Oh Bliss! validation for each form!</p>
<p>"Why, golly" he said, "I'm really in luck"<br />
"With all of this checking the data won't suck!"</p>
<p>He knew data quality was a number one killer,<br />
But with a model like this and all those good rules,<br />
The data would be clean, perfect, tables of jewels.</p>
<p>He told his boss- "Hey, piece of cake!"<br />
His boss was most happy at how little time it would take.</p>
<p>Then late one night, with a knock on his door,<br />
In came his development lead, eyes to the floor.</p>
<p>"Something is wrong, and I don't understand it."<br />
"The product code extractor, well, it's not like we planned it!"</p>
<p>The codes were all wrong!  But how could that be?<br />
Bill went out to the data entry folks to take a look see.</p>
<p>"Oh those codes?" They said, not a care in the world.</p>
<p>"We just enter 'X' then '4' and then double bee."<br />
"The checker always accepts that, you see."</p>
<p>"We can't spend time looking up codes!"<br />
"We have orders to fill, we have to ship loads!"</p>
<p>"Besides the info you want is in a spreadsheet we use."<br />
"When they ask us which system, it's that one that we choose."</p>
<p>Bill's boss was most thoughtful when told the bad news<br />
He thought about all the money and time that he'd lose</p>
<p>"Bill" he said, a frown on his face.<br />
"Your assumptions are a total disgrace"</p>
<p>"You need to look at what data is there."<br />
"When it comes to the model, the real world does not care."</p>
<p>While getting HIS boss on the line with his auto-dialer,<br />
Bill's boss said to poor Bill "Don't just stand there! Go get a profiler!"</p>
<h2>Don't ignore your data.</h2>
<p>Don't you be like Bill!  Check out the <a href="/">Datamartist data profiler</a>.</p>
<h2>Look your data right in the face.</h2>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/09/caution-data-may-not-follow-data-model1.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/09/caution-data-may-not-follow-data-model1.jpg" alt="" title="caution-data-may-not-follow-data-model" width="288" height="206" class="aligncenter size-full wp-image-5155" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-profiling-avoidance-a-cautionary-tale/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>When should you data profile? Morning, Noon and Night!</title>
		<link>http://www.datamartist.com/when-should-you-data-profile-morning-noon-and-night</link>
		<comments>http://www.datamartist.com/when-should-you-data-profile-morning-noon-and-night#comments</comments>
		<pubDate>Wed, 22 Sep 2010 13:16:26 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4955</guid>
		<description><![CDATA[Data profiling is an important part of any data-related project. The question often arises of when the best time to data profile is. As you would expect from a software company that sells a really cool visual data profiling tool, our view is "all the time". Using data profiling tools before the project Data profiling [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.datamartist.com/wp-content/uploads/2010/09/sate-category-graph-250px.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/09/sate-category-graph-250px.jpg" alt="" title="sate-category-graph-250px" width="250" height="151" class="alignright size-full wp-image-5128" /></a>Data profiling is an important part of any data-related project.  The question often arises of when the best time to data profile is. As you would expect from a software company that sells <a href="/data-profiling-enhanced-datamartist-v1-3-released">a really cool visual</a> data profiling tool, our view is "all the time".</p>
<h2>Using data profiling tools before the project</h2>
<p>Data profiling is useful even before the project is defined.  By doing a first, high-level data profiling pass on key data sets, you will get:</p>
<ul>
<li>better project scope definition</li>
<li>a more accurate budget estimate</li>
<li>a clear baseline from which improvement can be measured</li>
<li>the ability to correctly manage expectations</li>
</ul>
<p>The last two are important ones.  Making wild promises based on what the data model says should be in the tables, and then failing to deliver because of data quality issues, is not nearly as career enhancing as defining an ambitious but doable scope, delivering based on the actual data, and clearly communicating the progress made based on facts.</p>
<p>Data quality issues can seriously (double-digit percentage seriously) affect the final cost of a project.  Knowing about these issues up front will let you set a realistic budget.</p>
<h2>Data profiling at the beginning of the project</h2>
<p>After the initial high-level data profiling done before the project, budgeting for a more detailed data profiling of the source data will:</p>
<ul>
<li>give clear design guidance to ETL developers</li>
<li>clearly identify the subject matter experts needed to understand the data, and let you engage them early, rather than in a rush when ETL development hits the underlying data issues and the project is already running late and over budget</li>
</ul>
<h2>Data profiling during the project</h2>
<ul>
<li>By setting up automated data profiling tasks for the output of each ETL process, ETL developers and architects can track the progress for migration, cleansing or conversion tasks using concrete information.</li>
<li>Objective criteria can be set for each profile task to determine what level of data quality is considered "acceptable" for the final data load or fact set deliverables.</li>
</ul>
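<p>As a rough illustration of an automated profile task with an objective acceptance criterion, here is a minimal Python sketch. It is not how the Datamartist tool works internally; the names <code>ColumnProfile</code>, <code>profile_column</code> and <code>is_acceptable</code>, the sample rows and the 10% threshold are all made up for this example.</p>

```python
from dataclasses import dataclass


@dataclass
class ColumnProfile:
    """Basic profile of one column in an ETL output (hypothetical structure)."""
    row_count: int
    null_count: int
    distinct_count: int

    @property
    def null_rate(self):
        # Fraction of rows with a missing value; 0.0 for an empty data set.
        return self.null_count / self.row_count if self.row_count else 0.0


def profile_column(rows, column):
    """Profile one column of a list of dict rows (None or '' counts as missing)."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None and v != ""]
    return ColumnProfile(
        row_count=len(values),
        null_count=len(values) - len(present),
        distinct_count=len(set(present)),
    )


def is_acceptable(profile, max_null_rate):
    """Objective pass/fail criterion: the null rate must not exceed the threshold."""
    return profile.null_rate <= max_null_rate


# Made-up ETL output: four rows, one of them missing its product code.
etl_output = [
    {"order_id": 1, "product_code": "A100"},
    {"order_id": 2, "product_code": "A100"},
    {"order_id": 3, "product_code": None},
    {"order_id": 4, "product_code": "B200"},
]

profile = profile_column(etl_output, "product_code")
print(profile.null_rate)             # 0.25
print(is_acceptable(profile, 0.10))  # False: 25% missing is over the 10% limit
```

<p>Run after each ETL step, a check like this turns "is the data good enough yet?" into a number you can track from one run to the next.</p>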
<h2>Data profiling at the end of the project</h2>
<ul>
<li>Doing a final data profiling run and comparing it to the baseline established before the project will provide a clear Before/After view that both communicates the progress made and helps justify and promote the next data quality initiative.</li>
</ul>
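<p>The Before/After comparison can be as simple as lining up the same metrics from the two profiling runs. The Python sketch below assumes each run has been reduced to a dictionary of counts; the function name <code>compare_profiles</code>, the metric names and the numbers are invented for illustration only.</p>

```python
def compare_profiles(baseline, final):
    """Return {metric: (before, after, change)} for metrics present in both runs."""
    return {
        metric: (baseline[metric], final[metric], final[metric] - baseline[metric])
        for metric in baseline
        if metric in final
    }


# Invented issue counts from the pre-project baseline and the final run.
baseline = {"null_product_codes": 1200, "duplicate_customers": 340}
final = {"null_product_codes": 150, "duplicate_customers": 20}

for metric, (before, after, change) in compare_profiles(baseline, final).items():
    print(f"{metric}: {before} -> {after} ({change:+d})")
```

<p>Metrics that appear in only one of the two runs are skipped, so the report only shows like-for-like comparisons.</p>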
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/when-should-you-data-profile-morning-noon-and-night/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>What is data profiling?  Data in the real world.</title>
		<link>http://www.datamartist.com/what-is-data-profiling-and-when-and-why-should-it-be-done</link>
		<comments>http://www.datamartist.com/what-is-data-profiling-and-when-and-why-should-it-be-done#comments</comments>
		<pubDate>Tue, 21 Sep 2010 20:44:36 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4491</guid>
		<description><![CDATA[First of all, full disclosure, if you haven't already noticed, this blog is written by a software company that makes a pretty cool data profiling tool. We've just released our new version V1.3.0, so obviously we can't pretend to be completely objective in the "should you data profile debate". But bear with me, because I [...]]]></description>
			<content:encoded><![CDATA[<p>First of all, full disclosure: if you haven't already noticed, this blog is written by a software company that makes a <a href="/">pretty cool data profiling tool</a>.  We've just released our new version V1.3.0, so obviously we can't pretend to be completely objective in the "should you data profile" debate.  But bear with me, because I think I'm still going to be able to convince you it's a good idea, regardless of which tool you select. </p>
<p>The way I think of data profiling is that it is the "Reality check" of all your data activities.</p>
<h2> The data model is the theory.</h2>
<h2> Data profiling lets you see reality.</h2>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/09/the-data-model-looks-good-but-data-profiling-tools-say-ooops.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/09/the-data-model-looks-good-but-data-profiling-tools-say-ooops.jpg" alt="" title="the-data-model-looks-good-but-data-profiling-tools-say-ooops" width="393" height="268" class="alignright size-full wp-image-4944" /></a>The bottom line is that you know what is <em>supposed</em> to be in the column by looking at the data model, but you can know what is <em>actually</em> in the column by data profiling the data.</p>
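<p>To make the "supposed versus actual" idea concrete, here is a small Python sketch that counts how often each value occurs in a column and flags the values that break the rule the data model promises. The rule itself (one capital letter followed by three digits), the sample product codes and the names <code>MODEL_PATTERN</code> and <code>profile_values</code> are assumptions made up for this example.</p>

```python
import re
from collections import Counter

# Assumed for illustration: the data model says a product code is
# one capital letter followed by exactly three digits.
MODEL_PATTERN = re.compile(r"[A-Z]\d{3}")


def profile_values(values):
    """Return (value frequencies, the subset of values that break the model rule)."""
    frequencies = Counter(values)
    violations = {
        value: count
        for value, count in frequencies.items()
        if not MODEL_PATTERN.fullmatch(value)
    }
    return frequencies, violations


# Made-up column contents: what is actually in the table.
product_codes = ["A100", "X4BB", "X4BB", "B200", "X4BB", "a100"]

frequencies, violations = profile_values(product_codes)
print(frequencies.most_common(1))  # [('X4BB', 3)]
print(violations)                  # {'X4BB': 3, 'a100': 1}
```

<p>Even a frequency count this simple shows when one suspicious value dominates a column that the model says should be clean.</p>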
<p>For a fantastic detailed look at data profiling, check out <a href="http://www.ocdqblog.com/home/adventures-in-data-profiling-part-1.html" target="_blank">this series of blog posts</a> from Jim Harris, over at Obsessive-Compulsive Data Quality.  It's a great series that clearly communicates the type of analysis that might be done while doing data profiling and the kind of detective work it sometimes takes.  While I'll be looking at a number of the same concepts in this series of blog posts, being a software vendor, I won't be as tool agnostic as Jim was: we'll be demonstrating the techniques using the <a href="/">Datamartist</a> tool. </p>
<p>You can download a free trial of the Datamartist tool and try the profiling techniques and features yourself as you follow along.</p>
<p>We hope you enjoy the upcoming data profiling tutorials and that you give the Datamartist data profiler and transformer serious consideration for your data profiling needs.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/what-is-data-profiling-and-when-and-why-should-it-be-done/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Datamartist V1.3.0 Value Distribution data profiling</title>
		<link>http://www.datamartist.com/datamartist-v1-3-0-value-distribution-data-profiling</link>
		<comments>http://www.datamartist.com/datamartist-v1-3-0-value-distribution-data-profiling#comments</comments>
		<pubDate>Mon, 26 Jul 2010 18:33:50 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Datamartist Tool]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4855</guid>
		<description><![CDATA[This video gives a quick (under two-minute) look at the Datamartist data profiler's ability to explore the distribution of numeric values in a data set by counting the number of values that fall into a series of equal-size buckets. It highlights the Datamartist's calculation, visualization, selection and drill-down features using a simple [...]]]></description>
			<content:encoded><![CDATA[<p>This video gives a quick (under two-minute) look at the Datamartist data profiler's ability to explore the distribution of numeric values in a data set by counting the number of values that fall into a series of equal-size buckets.  It highlights the Datamartist's calculation, visualization, selection and drill-down features using a simple example.</p>
<p><center>
<div id="media">
            <object id="csSWF" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="640" height="498" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,115,0"><param name="src" value="/resources/video/V1_3_0/Value-Dist-Quick-Look-1/Value-Distribution-Quick-Look_controller.swf"/><param name="bgcolor" value="#1a1a1a"/><param name="quality" value="best"/><param name="allowScriptAccess" value="always"/><param name="allowFullScreen" value="false"/><param name="scale" value="showall"/><param name="flashVars" value="autostart=false&#038;thumb=/resources/video/V1_3_0/Value-Dist-Quick-Look-1/FirstFrame.png&#038;thumbscale=65"/><embed name="csSWF" src="/resources/video/V1_3_0/Value-Dist-Quick-Look-1/Value-Distribution-Quick-Look_controller.swf" width="640" height="498" bgcolor="#1a1a1a" quality="best" allowScriptAccess="always" allowFullScreen="false" scale="showall" flashVars="autostart=false&#038;thumb=/resources/video/V1_3_0/Value-Dist-Quick-Look-1/FirstFrame.png&#038;thumbscale=65" pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash"></embed></object>
        </div>
<p></center></p>
<p>This value profiling tool is just one of the Datamartist data profiling tool's many capabilities; <a href="/download/beta-download">download the free trial of the BETA</a> to try all the functionality with your own data.</p>
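<p>If you want a feel for the bucket counting described above, the idea can be sketched in a few lines of Python. This is only an illustration of equal-width bucketing, not the Datamartist implementation; the function name <code>bucket_counts</code>, the choice to put the maximum value in the last bucket and the sample order totals are all assumptions.</p>

```python
def bucket_counts(values, bucket_count):
    """Count values per equal-width bucket; returns a list of (low, high, count)."""
    if bucket_count < 1:
        raise ValueError("bucket_count must be at least 1")
    if not values:
        return []
    low, high = min(values), max(values)
    if low == high:
        # Every value is identical, so a single bucket holds them all.
        return [(low, high, len(values))]
    width = (high - low) / bucket_count
    counts = [0] * bucket_count
    for value in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bucket.
        index = min(int((value - low) / width), bucket_count - 1)
        counts[index] += 1
    return [
        (low + i * width, low + (i + 1) * width, counts[i])
        for i in range(bucket_count)
    ]


# Made-up order totals, split into four buckets of width 25.
order_totals = [0, 10, 20, 30, 45, 50, 60, 99, 100]
for bucket_low, bucket_high, count in bucket_counts(order_totals, 4):
    print(f"{bucket_low:6.1f} to {bucket_high:6.1f}: {count}")
```

<p>With this sample the four buckets hold 3, 2, 2 and 2 values; a bucket that is unexpectedly empty or overfull is usually the first place worth drilling into.</p>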
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/datamartist-v1-3-0-value-distribution-data-profiling/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why you should data profile.</title>
		<link>http://www.datamartist.com/data-profiling-do-it-do-it-now</link>
		<comments>http://www.datamartist.com/data-profiling-do-it-do-it-now#comments</comments>
		<pubDate>Fri, 07 May 2010 02:11:58 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data migration]]></category>
		<category><![CDATA[Data profiling]]></category>
		<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4496</guid>
		<description><![CDATA[Imagine that you have bought a new home, and you've decided to do some landscaping. So you pick three landscapers, draw a rough sketch of what you want, and ask them to bid on the job. But you don't allow them to come see your property, and your sketch doesn't specify anything about the existing [...]]]></description>
			<content:encoded><![CDATA[<p>Imagine that you have bought a new home, and you've decided to do some landscaping.  So you pick three landscapers, draw a rough sketch of what you want, and ask them to bid on the job.</p>
<p>But you don't allow them to come see your property, and your sketch doesn't specify anything about the existing landscaping, just the final configuration.  Do you think the landscapers would be willing to offer a reasonable price? </p>
<p>Unlikely.   What if there are existing patio stones to remove, or an in-ground swimming pool that's got to go? </p>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/05/did-the-consultants-data-profile-first.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/05/did-the-consultants-data-profile-first.jpg" alt="" title="did-the-consultants-data-profile-first" width="347" height="235" class="alignright size-full wp-image-4513" /></a>No landscaper would take on a job without understanding the lay of the land, and the existing conditions.  It would be impossible to estimate the job. Anyone who did would give you a huge price to cover themselves, or demand extras upon discovering the extra work.</p>
<p>Yet when companies hire consultants to build them business intelligence solutions, or do data migration,  it often happens with only the roughest outline of the existing data sets.  Often a data model is included, but knowing what the table SHOULD contain is just not the same as knowing what it actually contains.  It never ceases to amaze me that the simple, cost-effective practice of data profiling is so often missing from the initial phases of business intelligence and data migration projects.</p>
<p>With the right data profiling tool, and just a few days' work, it's possible to gain a huge amount of insight into the data quality in your systems, and as a result, be able to make radically more accurate estimates of the cost to go from the "as is" to the "to be".</p>
<p>Phil Simon talked about this in a great post on the DataFlux blog called <a href="http://www.dataflux.com/dfblog/?p=2590" target="_blank">"What Consultants Don't tell you"</a>, raising an important and somewhat ugly truth: many times, service providers don't WANT to do data profiling because it reveals the true extent of the work to be done, which increases the budget requirement and makes the project less likely to be approved.</p>
<p>Now certainly, we can't use a broad brush to paint all consultants, but this incentive does reduce the number of times valuable tools such as data profiling are recommended, even though in my opinion they are a low-cost, no-brainer, do-it-unless-you-are-crazy first step to any major project.  </p>
<p>You are going to spend potentially millions of dollars on a business intelligence or data migration project- spend a few weeks to look at the data with the right tools first for goodness sake!</p>
<p>If you want to get a reasonable cost estimate, and you want to go into your business intelligence or data migration project with open eyes, don't imagine you can know what it will cost to get from here to there if you don't take a good look at where here really is.</p>
<p><a href="/resources/screenshots/Data-Profiler-on-States.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/05/Data-Profiler-on-States-Thumb.jpg" alt="" title="Data-Profiler-on-States-Thumb" width="220" height="165" class="alignright size-full wp-image-4506" /></a><strong>Full disclosure</strong>: of course, you are reading the <a href="/">Datamartist</a> blog, and Datamartist has lots of data profiling functionality, so you have to understand that we are incredibly biased on this topic.  If you are able to overlook our inherent bias, <a href="/downloads">give the tool a try</a>. You'll discover things about your data you might not have wanted to know, but it's better to face the truth prepared than to rely on wishful thinking and then discover the bad news when you're well into the project and your budget is almost gone.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-profiling-do-it-do-it-now/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

