Datamartist gives you data profiling and data transformation in one easy to use visual tool.

« | »

An introduction to using regular expressions for data quality validation

Regular expressions (sometimes referred to as regex or regexp) are a powerful formal language that can be used to match text strings to patterns.

They way regular expressions work is like this:

  1. A pattern is defined. This is a string of symbols that act as a set of rules.
  2. A text string to test, and the pattern are given to a regular expression engine, and compared.
  3. The engine returns a true/false value meaning the string follows the rules or not ("PASSED" or "REJECTED by the pattern)

This is obviously very useful for someone interested in Data Quality- If you had a pattern that said "Is this a valid email address?", and got PASS or REJECT back, it would give you a good idea as to the quality of that field in your contact database.

One advantage of regular expressions is that because they are widely used, lots and lots of them have been created, detecting all sorts of patterns- meaning that while you can write your own, you can also look up useful ones you need in libraries.

Regular expressions aren't magic, of course- the result is only as good as the program. (As always.) Depending on how well the regex is written (or not) there may be false positives or negatives.

A regex example- Canadian Postal Code

Lets look at a simple example, and see how they work. Being from Canada, I'm going to use the example of validating a Canadian postal code.

Canadian Postal codes take the format ANA NAN, where "A" is a letter and "N" is a number. So what we want is a regular expression that will return TRUE for valid postal codes, and FALSE for postal codes that just can't be right. "K9J 2K2" could be a valid postal code, but we know that "38X AB2" just can't be.

In a regular expression, we use anchors to say where to start matching. In this case, we want just the Canadian postal code, so we'll use the anchor character "^" to specify the beginning of the string.

To specify that a character has to be within a given set or range of characters, we use square brackets. So to match when ever the first character of a string is a letter, the regex would be:

^[a-zA-Z]

This regex will return TRUE for all strings that start with a letter. Thats fine, but not yet specific enough for a Canadian postal code (we Canadians are very very picky).

So we can add a number, then another letter constraint to our pattern.

^[a-zA-Z][0-9][a-zA-Z]

So far, now any string that starts with ANA will result in true- we're almost half way there! Next, we want to specify that the space is optional- that is, its acceptable to have the space or not.
To do this, we use the "?" to specify that the space is optional. And then to finish up, we add the part of the expression that detects the NAN, and end with a dollar sign which specifies that that needs to be the end of the string (otherwise all strings that started with a valid postal code would pass);

^[a-zA-Z][0-9][a-zA-Z][ ]?[0-9][a-zA-Z][0-9]$

So there it is- or is it?

But is the REGEX fussy enough?

While this pattern that we've created does detect the ANA NAN pattern, and even allows the space to be optional, if you know Canadian postal codes ,you'll know that in fact ANA NAN is not enough by itself. There are only certain letters that actually exist in certain locations. So a better REGEX pattern for Canadian postal code validity would be the following:

^[abceghjklmnprstvxyABCEGHJKLMNPRSTVXY][0-9][abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][ ]?[0-9][[abceghjklmnprstvwxyzABCEGHJKLMNPRSTVWXYZ][0-9]$

This pattern explicitly lists valid letters. Canadian postal codes do not use the letters D,F,O,Q or U anywhere and they do not use W or Z in the first position. Of course, this brings up another issue with any data quality method- remember Canada post could decide to change the rules- then your data quality test would need to be updated.

Ok, so that means the Postal code is ok right?... uh, No.

So this regular expression will detect if a text string of length 6 or 7 is a valid Canadian postal code- but remember that this alone is probably not enough. Chances are that this postal code is stored as part of an address, which will also include the city and province. In Canada, postal codes are of course unique to a given province- the first letter defines the area, and each area exists within a particular province (large provinces have more than one letter assigned to them).

This means that a properly formed postal code could be invalid- for example, an address in Quebec that has a postal code that starts with the letter "V" which is for British Columbia has clearly got something amiss.

So while learning a bit about regular expressions, we've also learned that probably if you had a big mailing list to clean you would probably want to use a dedicated tool- postal addresses are an area of data quality where lots of attention has been paid over the years, and writing a lot of custom logic and regex patterns is probably not a good use of your time. But for application specific codes and strings it might be very useful. In my next post, we'll look at some more tricks with regular expressions that can be used to analyze data quality.

I've posted a small collection of useful regular expressions to the datamartist website here.

Datamartist V1.3.0 PRO and Regular expressions

The professional edition of the Datamartist tool provides a function REGEX(text,regex expression) that returns TRUE or FALSE depending on if the text "matches" with the regular expression specified. This function can be used anywhere in Datamartist where expressions are available, making it a powerful way to test if a string matches one or more patterns.

Tagged as: ,

Twitter

« | »

Leave a Response