

Too much data storage hurts data quality: the toothpaste effect

When I brush my teeth, the amount of toothpaste I find acceptable varies widely. This is not a profound statement, but bear with me.

Only as the tube of toothpaste starts getting near to its end do I start conserving toothpaste because I know I need to make it last.

Another example is the all-you-can-eat buffet: we eat because it's there and we can. Unlike wasting toothpaste, this has more immediate negative consequences.

When there is lots of something, we tend to use more of it than we should.

When the tube of enterprise storage capacity seems to be always full, and when massive databases make an all-you-can-store buffet the standard mode of operation, very often the tendency is to store everything.

Rather than determining which information is of a useful level of quality, or focusing on the key information (and ensuring it IS of useful data quality), we stuff our systems full of every type of field and attribute, with massive, bloated forms that are too long for anyone to fill out properly.

Sadly, this doesn't matter, because there are too many fields to check anyway (who can define so many business and data quality rules?), so no one is checking.

If we were forced to make a choice between data A and data B, we might think a bit more about which is more useful for answering key business questions (and, by extension, actually think about what the key business questions are).

Instead, how many times have I heard an overworked, rushed subject matter expert say, "Just collect it all, we might need it"?

By collecting more, we end up with less.



  1. Minty fresh post, James 🙂

    It reminds me of the psychology experiment described by Barry Schwartz in the book The Paradox of Choice.

    Imagine two tables in a grocery store where you can taste different kinds of jam. One table has 24 kinds of jam, and the other table has only 6.

    Both tasting tables were popular with customers. But when the sales of jam were tallied, the table with only 6 jars generated 10 times as many sales as the other table.

    People simply couldn't decide which jar, among 24, to buy. This is commonly referred to as decision paralysis.

    When our data management strategy is to "just collect it all, we might need it", it's like creating databases with tables loaded with 24 jars of data-jam.

    How could the organization possibly choose which data-jam to spread on their business decision toast?

    Best Regards,


    P.S. Perhaps this comment was influenced by the fact I just finished eating breakfast (I used the grape jam on my toast) and I now need to brush my teeth (with the new bar of toothpaste I bought yesterday).

  2. I am working through an interesting variant at my current client. Due to horrid data warehouse performance, data is duplicated many times, at varying degrees of pre-aggregation to ensure adequate query performance. Of course, keeping all the data consistent as the granular facts are refreshed is becoming a nightmare.

    The client is close to purchasing a DW appliance that will increase capacity by an order of magnitude. Ironically, the same technology will allow us to cut our storage usage by a similar order of magnitude by facilitating the elimination of duplicated, aggregated data and storing it once with the correct grain.

    Increased throughput reduces storage requirements, at least in this case.

