Datamartist gives you data profiling and data transformation in one easy to use visual tool.

« | »

Excel auto formating is getting into your genes

We often give Excel our data, and trust it to do the right thing.

There was a link posted on meta-filter today that sparked some lively discussion amongst the crowd. The Excel auto formating “feature” loves to scramble common genetic nomenclature.my-gene-therapist-is-an-excel-nut

It turns out that in the genetics field, common codes get converted to incorrect values regularly. One example given was the code for tumor suppressor “DEC1” which gets coverted to the date December 1. Another was the code “2310009E13” (apparently a “RIKEN clone identifier”) – which would be converted to a number, 2.31E+19. I’m not a geneticist but I can just see how this wouldn’t be helpful.

I checked these examples on Excel 2007, and sure enough, the default will make changes right at import that scramble the mentioned codes- no error, no notification, no problem. Of course Excel is perfectly capable of handling this data properly- the user needs to specify the field as text, and the conversions won’t be done.
The key point brought up in the article (and is always true about excel spreadsheets) is not just that in this case the data gets corrupted but that depending on how carefully a user checks the error may not be detected.
If undetected, what decisions, conclusions and actions will be taken based on the incorrect information?

Excel is super powerful, and super useful, but we have to always remind ourselves to balance the ease of use with how critical our data is, and what the impact of errors might be. In the end, as with all computer use, we have to test, validate and test again at a level consistent with whatever use we are putting the data to.

Tagged as: , , ,

Twitter

« | »

Leave a Response