Dimension Tables – Datamartist.com

Mystery or Junk data warehouse dimensions

James Standen — Mon, 18 Jan 2010 17:10:46 +0000

Sometimes, when you are designing a star schema model, you’ll find yourself in a dilemma. You’ve come up with a beautiful design, right out of the pages of a Ralph Kimball book, with 5 dimensions and 5 measures, and you are on your way to star schema heaven when suddenly the users start asking awkward questions- where is such and such flag? Where’s the transaction type? Why can’t I sort based on the “e7” code from the system?

You can try to explain to them that pure star schemas should not be cluttered with a bunch of tiny dimensions, that your fact table just won’t stand for 100 million rows of the e7 code, and besides, computery things like transaction codes should not be in a business savvy data model. But face it, after some digging you determine the user is right (happens quite often in fact)- they really do use that information, it is critical that you include it, and you don’t have the time or budget to make the perfect data warehouse.

So how do you deliver to them what they need, and avoid messing up your dimensional model?

One answer is to create one or more junk dimensions, sometimes also referred to as mystery dimensions.

In the end, although the content of a mystery dimension may or may not be mysterious, there is nothing particularly mysterious about how to implement this type of dimension table.

Even if it’s perfectly clear what each column is, there are often a number of them with very low cardinality (that is, they have very few distinct values). It really does not make sense to add a column to the fact table for each one, and to have a bunch of tiny dimension tables with only a handful of rows in them.

Faced with this the data architect can wrap all these columns up into a junk dimension.

A junk dimension is a dimension that holds all the unique combinations of a set of columns, and assigns each combination a unique key. This key is what is stored in the fact table, in the mystery dimension column.

Let’s look at a mystery dimension example. We’ll make up an example dimension that’s very small for simplicity’s sake. Let’s say that the transactional table that is used to generate one of our facts has three columns, “Zortz”, “a3” and “uudl”, which fully satisfy our mystery dimension criteria (i.e. we don’t know what they are, but people use them in queries).

“Zortz” is a true/false value, “a3” is one of two values “Confirmed” or “Pending” and “uudl” is either “” or “k”. All the possible combinations of these values would be put into a dimension table and assigned an integer surrogate key. Thus the mystery dimension table would look like this:
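As a minimal sketch of how such a table could be generated, the Python below enumerates every combination of the three columns and numbers them. The column values come from the description above; the `junk_key` column name, the row order and starting the keys at 1 are assumptions made for illustration.

```python
from itertools import product

# Allowed values for each source column, as described in the text.
zortz_values = [True, False]
a3_values = ["Confirmed", "Pending"]
uudl_values = ["", "k"]

# One dimension row per combination, each with an integer surrogate key.
# The key name "junk_key" and the starting value 1 are arbitrary choices.
junk_dimension = [
    {"junk_key": key, "Zortz": zortz, "a3": a3, "uudl": uudl}
    for key, (zortz, a3, uudl) in enumerate(
        product(zortz_values, a3_values, uudl_values), start=1
    )
]

for row in junk_dimension:
    print(row)
```

Three two-valued columns give 2 x 2 x 2 = 8 rows, so the fact table only needs one small integer column to carry all three attributes.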

A key consideration when forming mystery dimensions is how many combinations exist. If the number of combinations is too high, the mystery dimension’s size may be unmanageable.

And be careful about assuming that every combination has already shown up in the data. You are safe if each column has a fixed set of values (like Boolean, or codes from a known finite set) because you can be sure you’ve created a dimension row for every combination.

But if there are free form string columns, then you need to make sure your ETL is able to generate new dimension rows and surrogate keys as new combinations are created in the source system. This might still be worthwhile, depending on how many new combinations get created.
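One way to handle this in an ETL step is a "look up the key, or create it" routine. The sketch below keeps the dimension in memory purely for illustration; the `JunkDimension` class, its method names and the sample rows are all hypothetical, and a real load would do the same lookup against the dimension table in the warehouse.

```python
class JunkDimension:
    """In-memory stand-in for a junk dimension table (hypothetical helper)."""

    def __init__(self):
        self._key_by_combo = {}  # combination tuple -> surrogate key
        self._next_key = 1

    def key_for(self, combo):
        """Return the surrogate key for a combination, adding a row if it is new."""
        if combo not in self._key_by_combo:
            self._key_by_combo[combo] = self._next_key
            self._next_key += 1
        return self._key_by_combo[combo]

    def rows(self):
        """Return the dimension rows as (key, value, value, ...) tuples."""
        return [(key, *combo) for combo, key in self._key_by_combo.items()]


# Made-up source rows; the third repeats the combination of the first.
source_rows = [
    {"amount": 10.0, "Zortz": True, "a3": "Confirmed", "uudl": "k"},
    {"amount": 25.5, "Zortz": False, "a3": "Pending", "uudl": ""},
    {"amount": 7.25, "Zortz": True, "a3": "Confirmed", "uudl": "k"},
]

dimension = JunkDimension()
fact_rows = [
    {
        "amount": row["amount"],
        "junk_key": dimension.key_for((row["Zortz"], row["a3"], row["uudl"])),
    }
    for row in source_rows
]
```

With this approach the dimension only ever contains combinations that actually occur, which also keeps it smaller than the full set of possible combinations.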

You can also manage the size of the mystery dimension tables by having two or more mystery dimensions, which might reduce the overall number of dimensional rows depending on the makeup of the data. Different columns and values may tend to cluster together, and you will find that grouping them correctly makes, say, two small mystery dimensions rather than one huge one.

If, however, the number of rows is manageable, a mystery dimension allows all the columns to be queryable while only adding one column to the fact table, providing a much more efficient solution than either creating multiple dimensions or leaving all the data in the fact table.

By moving these columns to a junk or “mystery” dimension, you also end up with fewer indexes on the fact table, which might be important depending on its size.

So if you find yourself telling your end users that they will just have to do without a column, think twice about it. The role of a data warehouse is to deliver the data- sometimes you just have to find the right packaging to get the job done.

Wolfram Alpha- Dimensional Generator?

James Standen — Fri, 10 Apr 2009 15:56:56 +0000

Wolfram Research is always doing some interesting things- and now they are aiming at providing an answer machine- they are calling it Wolfram Alpha. It’s not a search engine that returns documents related to the entered search terms, but something that computes the answer to a factual question.

This is interesting, because often when we are searching for something via, say, Google, we are actually looking for an answer. We do it in two steps- “Country population list”- which gets us the document, and then we look up the countries we are interested in.

Unfortunately, the way Wolfram Alpha was launched and the way the media and observers in general tended to react has created a fair amount of hype and misconception. Although Wolfram Alpha (let’s call it WA) will have a natural language interface, people always get carried away with their expectations for such things. I’m certain that WA will be impressive, but I’m equally sure that you won’t be able to say “Roughly how many people like to have peanut butter on their toast in Ohio” and get a reasonable answer.

In a recent interview with Rudy Rucker, Stephen Wolfram said that rather than sell WA to the search engines, “We’d rather look for things like partnerships or licensing deals or APIs. I see a new field of knowledge-based computing. Imagine a spread sheet that can pull in knowledge about the entries.”

Now this is really interesting. What if it has a way to ask questions like “what is the GDP of [country]” just like that? What if it can tell you the population of any given Zip Code? What if it knows the rate of income change by county? What if it can tell you for any geo-location/date if it was a statutory holiday or not on that day? These are things that could be very very useful in doing data analysis- and are the kinds of things that can build interesting real world dimensional tables in a data warehouse or data mart.

There are of course, various sources to get this information now- but if one, massive, super flexible, broad “answer engine” existed- this might be a real boon to business intelligence practitioners.

Imagine generating a dimensional table by accessing the web service and enriching what your business users can analyze- and knowing that the values are as up to date and as accurate as the brain trust at Wolfram Research can make them.

Although there is some question as to whether the API will really be a focus, it’s clear that there is some interest in business applications. It’s just not clear if Wolfram Research shares this vision.

Although the masses that are expecting to have a conversation with HAL will be disappointed, there might be a new resource in the world for dimension building for data warehousing- I will be following this with interest.

Joining the Dimension Table to the Fact Table- Purchasing Data mart (Part 5)

James Standen — Tue, 17 Feb 2009 16:31:48 +0000

After we have created the dimension tables and the fact table and populated them with data, the final step to getting a star schema is of course to actually join the dimension tables to the fact table. In the Datamartist tool we do this with a Join block.

Check out the first four parts of this series (1, 2, 3 and 4), where we created an example data mart with some fictitious purchasing data.

The final step is to join the dimensions we have created to the fact table. To do this, we connect up the two dimensions (Vendor and Item) to the Join block and connect an export block to the output. What has in effect been created is a complete Extract, Transform, Load (ETL) process and the final star schema join.

With the generated data set I used for this example, summarizing the data to yearly totals but keeping all the detail on Vendor and Item causes the roughly 4 million row raw data file to be reduced to around 800 thousand rows. (This summarizing was done on another canvas- although it could have been done on this canvas just as easily).

This data mart, with 800 thousand rows and two dimensions of about three thousand members each, took my laptop about a minute and 45 seconds to solve and save out to a 360 MB text file.

Of course, by summarizing or filtering (just add blocks) analysis subsets could easily be exported directly to Excel, managing the data volumes involved, and letting you create the graphs, dashboards and reports that you need.

This is part of a 5 part series- here are the links to the various parts: 1, 2, 3, 4 and 5

Connecting the dimension table to the fact table- Vendor Example (Part 3)

James Standen — Mon, 09 Feb 2009 20:47:55 +0000

In parts one and two of this series we introduced our challenge (to make a data mart to analyze the Acme Company’s spending) and showed how the Datamartist tool could import millions of rows of data and then turn it into a fact table we can use in Excel.

Now we need to create a Vendor dimension table and join it to this fact table to determine who our big vendors are.

In Datamartist it is a simple task to create this vendor dimension. As always we use blocks and connect them together. We define a dimension by using a reference definition block. All we have to do to configure the reference block is to specify which columns uniquely define the dimension (or almost uniquely, Datamartist will resolve duplicate keys using a majority/first rule set for you if you have some data glitches).

We start with an import block that brings in the Vendor master text file, then we define the reference by specifying “Vendor_ID” as the key. These first two blocks look like this:

Then we join it to the fact table we created in part two of this series with a join block. This means that now instead of just the vendor ID number that was in the fact table, we have the name, and address for the vendor in our mini star schema.
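Outside the tool, the same idea can be sketched in a few lines of Python: look up each fact row’s vendor attributes by key, then total the amounts per vendor. This is only an illustration of the join and of the summarize step that follows it, not of how Datamartist works internally; the vendor names, IDs and amounts are made up.

```python
from collections import defaultdict

# Vendor dimension: one row per Vendor_ID (made-up sample data).
vendor_dim = {
    101: {"name": "Acme Paper Supply", "city": "Toronto"},
    102: {"name": "Northern Freight", "city": "Chicago"},
}

# Fact rows carry only the vendor key and the measure.
fact_rows = [
    {"vendor_id": 101, "month": "2008-01", "amount": 1200.0},
    {"vendor_id": 102, "month": "2008-01", "amount": 560.0},
    {"vendor_id": 101, "month": "2008-02", "amount": 300.0},
]

# Join: attach the vendor's attributes to each fact row by key.
joined = [{**row, **vendor_dim[row["vendor_id"]]} for row in fact_rows]

# Summarize: total the monthly amounts for each vendor name.
totals = defaultdict(float)
for row in joined:
    totals[row["name"]] += row["amount"]

# Sort so the biggest vendors come first.
top_vendors = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(top_vendors)
```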

And finally we put a summarize block after that to total up all the monthly values for each vendor, and we export to Excel. This is what the canvas looks like:

After we do this, we grab the Excel file Datamartist just created for us, do a quick sort, and come up with a list of Acme’s top ten suppliers. Feeling pretty good about ourselves, we do a review with the head of purchasing.

“Where’s Mega Brothers?” she says with a frown. “I think your data is screwy- no way that Mega Brothers didn’t make the top ten- we spend a fortune on railways, and a lot of our freight goes with the Mega Brothers Rail company. Of course it is probably entered under different vendors, each location works with the office local to them… But we’ve got to view them as a single vendor in the data mart- you can do that right?”

Fixing Duplicate Rows

Having to deal with duplicate data is a very common issue in any type of data analysis. So, back to the canvas. By simply adding a de-duplicate block to our Vendor dimension table (after the Reference block, and before the join) we can find and resolve the Mega Brothers duplicates. We just use the filter to find the records (easy to do, looking for “Mega”, “rail”, “brothers” etc.) and map them to a single instance. This is the filter control that lets us find and tag the duplicates:

As we tag them, they show up in the mapper, which lets us see which duplicate records we have eliminated for the dimension. We run the canvas again, and this time, sure enough, Mega Brothers Rail is in our top ten. But even though the head of purchasing knew it was a lot, this is actually the first time she’s seen the number. “Wow. I’ve got to give them a call- can you give me that in an Excel spreadsheet?”
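The effect of mapping duplicates to a single vendor can be sketched as below. The vendor IDs, name variants, amounts and the `duplicate_map` structure are all invented for illustration; the point is only that spend recorded under several IDs rolls up to one surviving record once the mapping is applied.

```python
from collections import defaultdict

# Hypothetical vendor rows: the same company entered under three IDs.
vendors = {
    201: "Mega Brothers Rail Co.",
    202: "MEGA BROS RAIL - WEST",
    203: "Mega Brothers Railway",
    204: "Acme Paper Supply",
}

# Result of tagging duplicates: duplicate ID -> surviving ID.
duplicate_map = {202: 201, 203: 201}

# (vendor ID, amount) pairs from the fact table, also made up.
spend = [(201, 400.0), (202, 350.0), (203, 500.0), (204, 900.0)]

totals = defaultdict(float)
for vendor_id, amount in spend:
    # IDs that are not tagged as duplicates map to themselves.
    master_id = duplicate_map.get(vendor_id, vendor_id)
    totals[vendors[master_id]] += amount

print(dict(totals))
```

Before the mapping, no single Mega Brothers row beats the 900.0 for the other vendor; after it, the combined total does.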

Stay tuned, more to come as we go further into Datamartist’s ability to segment, filter and organize large data sets.

If you want to see the interface in action watch our first Tutorial Video. Or just get right to it with your own data- download the free trial now– there is no registration required, and it installs in minutes.

This is part of a 5 part series- here are the links to the various parts: 1, 2, 3, 4 and 5

Degenerate Dimensions in Datamarts

James Standen — Sun, 28 Dec 2008 02:15:32 +0000

Not all dimensions are created equal. A typical dimension is defined by a table that holds the reference data that is being joined to the fact data. So in the fact table, for example, we have the product ID, or the product code, and in the product dimension table we have a single row for each product, that lists all the attributes of that product (its size, its color, its category, its segment, etc. etc.)

So it would follow, then, that there must be a dimension table for every dimension, right? Well, not if the dimension is degenerate. In fact, you could argue that calling it a dimension at all is pushing it, but I think the idea was to keep things tidy.

In any well structured data mart (a star schema), every column in the fact table should be either a measure or a dimension. If it’s a measure, then it’s storing a value for that particular fact- usually a number, and we use it for calculations and aggregations. If it’s a dimension, then we join it to the appropriate dimension table and thereby look up all the interesting things about that fact on that dimension.

Where degenerate dimensions come in is that there are often some columns that we want to have, but that are not measures, and don’t have a table of stuff we want to join to. Example: a purchase order number. These columns store something that we want to have (the purchase order number), but to create an empty dimension table would only slow things down. So, to ensure we don’t feel bad about breaking the “only a measure or a dimension in the fact table” rule, we just CALL them dimensions- even without the table.

In the fact table itself, any attribute of the purchase order that was of interest, and whose values each have further attributes we care about, would already have been turned into a dimension with its own dimension table.

But to create a dimension table that contains a row for every purchase order would create a very large dimension with nothing in it (since there are lots of purchase orders, possibly as many as there are facts if the grain of your fact table is one row per purchase order). But our users would not be happy if they could not get a list of the purchase orders included in a given total, or be able to drill down to that bottom level of detail that we’ve gone to all the trouble to include.

So, when we create transactional level fact tables, it is normal, in fact, necessary to include some degenerate dimensions- include columns that have useful information (very often referencing back to the source system) but that do not join to any dimension table. Plus you can just impress everyone with your dimensional modelling knowledge when you say “degenerate dimension”.
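A tiny sketch of what this looks like in practice: the fact rows below carry a `vendor_key` that would join to a vendor dimension, and a `po_number` that joins to nothing but is still there for drill-down. The column names and values are made up for illustration.

```python
# Fact rows at purchase-order grain. vendor_key joins to a vendor dimension;
# po_number is a degenerate dimension kept in the fact row with no table behind it.
fact_rows = [
    {"po_number": "PO-1001", "vendor_key": 1, "amount": 250.0},
    {"po_number": "PO-1002", "vendor_key": 2, "amount": 80.0},
    {"po_number": "PO-1003", "vendor_key": 1, "amount": 125.5},
]

# A total for one vendor...
vendor_1_rows = [row for row in fact_rows if row["vendor_key"] == 1]
vendor_1_total = sum(row["amount"] for row in vendor_1_rows)

# ...and the list of purchase orders that make up that total.
vendor_1_pos = [row["po_number"] for row in vendor_1_rows]

print(vendor_1_total, vendor_1_pos)
```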

Since we are very close to closing out 2008 and starting the new year, I’ll share with you one of my new year’s resolutions (there are many)- I’m going to start a data mart data modelling 101 series of blog posts in January, in which I will go through a complete data mart example. My intention is to both explain the data model concepts, and illustrate how they are executed using Datamartist. And I think I’ll run with the purchase order example, because given the economic situation we’re going to have in 2009, identifying unnecessary spending and finding ways to cut costs is one of the most important uses of a data mart- and one with potentially a huge payback.

Update: I’ve posted more recently on junk or mystery dimensions which might be of interest too.

Download the free, no risk Datamartist trial now and try it out on your own data. You’ll be amazed what’s possible. No registration required, and the install takes just minutes.

Dimensional Tables and Fact Tables

James Standen — Fri, 31 Oct 2008 02:41:21 +0000

One of the secrets to putting together a good set of data marts is the concept of dimensions. There are two key steps to being able to analyse your data and to build a working data mart model.

Build a set of clean, consistent dimension tables that store reference information about your key dimensions like Product, Customer, Geographical Areas, Sales Areas etc.
Join them up to a fact table that does NOT have dimensional data in it. Just the facts, ma’am.

Usually, to make a proper star schema data mart, it is necessary to transform the source data set, removing dimensional data, and generating a fact set. The dimensional data that is removed must be transformed to remove duplicate rows and to resolve any data quality issues that might exist. Transactional systems don’t know about dimensions- but you do.

A key part of the data modelling is to determine which fields in the source data should be put in the Dimensional tables and which fields should go to the Fact table.

Determining the Grain of the fact table

The very first step is to determine WHAT exactly is one fact in our fact table going to be? The GRAIN or GRANULARITY of the fact table refers to the level of detail of each row in the fact table. For example, an order fact table might have a grain of order, with one row per order, or order line, with a row for every line on each order (meaning more than one line for some orders). It is key to make a decision on the grain of the fact table first. This is often a balance between keeping detail, and managing complexity.
This is a key question, and is driven by what it is you want to analyse. For example, if the decision is made to have a granularity of one row per order, then it might be necessary to remove all product information (since any given order might have multiple products) and only have total order value. This won’t work if you want to analyse product segments, or compare different products.
To have our cake and eat it too, we’ll use a simplified example of order data where the grain is one row equals one order and every order in our system has one and only one product. This table has the following columns:

Order Number, Order Date, Ship Date, Customer Name, Customer Segment, Product Name, Product Category, Product Sub Category, Quantity Sold, Unit Price

Some Simple Questions to guide us

To determine which columns should be in the dimension table and which columns in the fact table, ask yourself these questions:

Is the data in the column something that is unique for every order? If yes, then it’s definitely part of the fact table. So Order Number is definitely in the fact table, as are Order Date, Ship Date, Quantity Sold and (most likely) Unit Price, since all these things are linked to the order and might change for each order.
Is the data in the column referring to data in another column and will always be the same? If yes, then this is probably a candidate for a dimensional table. In this example, the Customer Segment is probably something that is the same for a given customer on ALL the orders, so should be in a Customer Dimension. Likewise, the product category and sub-category are probably used to organise products, and therefore can be determined from the product name alone and don’t change from order to order.
Another way to help determine which columns go into the fact table is to think about the directness of the relationship between what is stored in the column and the grain of the fact table. For example in this case the Customer Name field is directly related to the order, but the Customer Segment field is related to the Customer Name field, which is then related to the order. Once removed or more usually means it should be in the dimensional table (again, providing the value is consistent for all orders, or should be).
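Applying these questions to the example columns, a rough sketch of the split might look like the following. The sample orders, the `build_dimension` helper and the surrogate key columns (`customer_key`, `product_key`) are assumptions for illustration; replacing the customer and product names in the fact table with integer keys is one common choice, not something the column list above dictates.

```python
# Flat source rows with the ten columns listed above (made-up values).
orders = [
    {"order_number": "SO-1", "order_date": "2008-10-01", "ship_date": "2008-10-03",
     "customer_name": "Blue Sky Ltd", "customer_segment": "Corporate",
     "product_name": "Desk Lamp", "product_category": "Furniture",
     "product_sub_category": "Lighting", "quantity_sold": 4, "unit_price": 19.5},
    {"order_number": "SO-2", "order_date": "2008-10-02", "ship_date": "2008-10-06",
     "customer_name": "Blue Sky Ltd", "customer_segment": "Corporate",
     "product_name": "Stapler", "product_category": "Office Supplies",
     "product_sub_category": "Desk Accessories", "quantity_sold": 10, "unit_price": 3.0},
    {"order_number": "SO-3", "order_date": "2008-10-02", "ship_date": "2008-10-04",
     "customer_name": "Hillside Cafe", "customer_segment": "Small Business",
     "product_name": "Desk Lamp", "product_category": "Furniture",
     "product_sub_category": "Lighting", "quantity_sold": 1, "unit_price": 19.5},
]


def build_dimension(rows, key_column, attribute_columns):
    """Build one dimension row per distinct value of key_column.

    If a value appears with inconsistent attributes, the first row seen wins;
    a real load would flag that as a data quality issue instead.
    """
    dimension = {}
    for row in rows:
        natural_key = row[key_column]
        if natural_key not in dimension:
            dimension[natural_key] = {
                "key": len(dimension) + 1,  # integer surrogate key
                **{column: row[column] for column in attribute_columns},
            }
    return dimension


customer_dim = build_dimension(
    orders, "customer_name", ["customer_name", "customer_segment"]
)
product_dim = build_dimension(
    orders, "product_name",
    ["product_name", "product_category", "product_sub_category"],
)

# The fact table keeps the order-level columns plus keys into the dimensions.
fact_table = [
    {
        "order_number": order["order_number"],
        "order_date": order["order_date"],
        "ship_date": order["ship_date"],
        "customer_key": customer_dim[order["customer_name"]]["key"],
        "product_key": product_dim[order["product_name"]]["key"],
        "quantity_sold": order["quantity_sold"],
        "unit_price": order["unit_price"],
    }
    for order in orders
]
```

Customer Segment and the product categories now live in exactly one place each, so changing a customer’s segment means changing one dimension row rather than every order.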

Taking the time to think about the fact table grain, and determine which dimension tables you are going to build and what you are going to put in them is an important first step to creating a good data model for your data mart, and needs to be done no matter which tools you use to build it. If you want to try a visual, easy to use data transformation tool that lets you get at your data without having to resort to database programming, check out the Datamartist tool.

Data modelling Hierarchies- how to make a dimension

James Standen — Thu, 23 Oct 2008 02:45:52 +0000

One of the most useful data model structures in a data mart is a Hierarchy (also called a Tree structure). Tree structures let us take a large number of things and organise them in a way that makes sense. More importantly, a tree structure lets us “drill down” into information.

Hierarchy Rules

In a simple tree structure, every object has one and only one parent, or it is at the top level of the tree.
For each level of the tree, all the objects are the same type.

All fine in theory, but what do the actual table structures look like?

Parent Child Relationships

The most efficient way to store a tree structure of objects is in a parent child type structure.

Parent Child Structure

For every object you store one row recording the parent of the object. This means that every relationship in the tree is stored only once.

This is the best form to store the “master copy” of the tree- because there is no ambiguity- one row, one object, one parent. Rule number one is enforced strictly by the physical model in this case- and that’s a good thing.

The downside of this structure is that it requires looking at multiple rows to summarise data. And it’s just not easy to read.
To find out which country a city is in, we have to first look up the parent (the state/province), then we have to look up the parent of that to find the country. If a hierarchy has 10 levels, we have to look at ten rows for every row that we want to summarise to the top level. Not so good.
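The lookup cost is easy to see in a small sketch. The parent child table below uses a three level city / state-province / country tree with sample place names; the dictionary layout and the `ancestor_at_level` function are assumptions made for illustration.

```python
# Parent child table: one row per object, recording its level and its parent.
# The place names are illustrative sample data.
parent_child = {
    "Detroit": ("City", "Michigan"),
    "Michigan": ("State/Province", "USA"),
    "Toronto": ("City", "Ontario"),
    "Ontario": ("State/Province", "Canada"),
    "USA": ("Country", None),
    "Canada": ("Country", None),
}


def ancestor_at_level(name, level):
    """Walk up the tree one row at a time until the requested level is reached.

    Returns the ancestor's name and the number of rows that had to be read.
    """
    lookups = 0
    while name is not None:
        node_level, parent = parent_child[name]
        lookups += 1
        if node_level == level:
            return name, lookups
        name = parent
    return None, lookups


country, lookups = ancestor_at_level("Detroit", "Country")
print(country, lookups)
```

Even in this three level tree, finding the country for one city means reading three rows, and the count grows with every level added to the hierarchy.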

Dimensional Tables

In a dimensional table, we store one row for each object at the bottom of the hierarchy. In that row, we store its parent, its grand parent, its great grand parent, its great great grand parent etc. etc. Here’s what that table looks like for our example:

This way, we have everything right there and it makes it easy to summarise. To find the totals for a country, just add up every row with a given value in the country field. The advantages of this form are clear when it comes time to do the analysis- but what are the disadvantages? Well, if you want to change a parent child relationship between level 1 and 2, then you have to change lots of rows- the relationship between a country and a state/province is repeated many times.

Depending on where the data is, and what applications have access to read and write it’s also possible to have inconsistencies- you could have some rows that say Michigan is in the USA, and others that put it in Canada.

The ideal solution is to store the master copy of the tree as a Parent Child relationship, and generate the Dimensional table automatically so that when the analysis is run, it’s fast and easy, and users can view it in spreadsheet tools in an easy to read format, knowing that it is guaranteed to be consistent.
This is what is done by the Datamartist tool– but rather than worrying about data models and table structures, managing tree structures is done with drag and drop.

Then dimensional tables are generated that are in the “everything in one row” format that is so easy to use in Excel, either through an auto-filter, or with pivot tables.
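To make the idea concrete, here is a minimal sketch of generating the “everything in one row” table from a parent child master copy. It is not how Datamartist does it internally, just an illustration of the transformation; the `flatten` function, the sample tree and the sales figures are made up.

```python
from collections import defaultdict

# Master copy of the tree: name -> (level, parent). Sample data only.
parent_child = {
    "Detroit": ("City", "Michigan"),
    "Lansing": ("City", "Michigan"),
    "Toronto": ("City", "Ontario"),
    "Michigan": ("State/Province", "USA"),
    "Ontario": ("State/Province", "Canada"),
    "USA": ("Country", None),
    "Canada": ("Country", None),
}


def flatten(tree, leaf_level):
    """Return one row per bottom-level object, with a column for each ancestor level."""
    rows = []
    for name, (level, _parent) in tree.items():
        if level != leaf_level:
            continue
        row = {}
        current = name
        while current is not None:
            current_level, parent = tree[current]
            row[current_level] = current
            current = parent
        rows.append(row)
    return rows


dimension_rows = flatten(parent_child, "City")

# With the flat table, a country total needs only one dimension row per fact.
sales_by_city = {"Detroit": 100.0, "Lansing": 40.0, "Toronto": 75.0}
sales_by_country = defaultdict(float)
for row in dimension_rows:
    sales_by_country[row["Country"]] += sales_by_city[row["City"]]

print(dimension_rows)
print(dict(sales_by_country))
```

Because the flat table is regenerated from the single master copy, every row that mentions Michigan is guaranteed to put it in the same country.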

Find out more about Datamartist– and download a free trial version.