Data Modelling – Datamartist.com

Data granularity- avoid going against the grain

James Standen — Wed, 15 Sep 2010 15:43:37 +0000

In the world of data warehousing, the grain of a fact table defines the level of detail that is stored, and the dimensions that are included make up this grain. Generally, the finer the grain the better- although source systems and data volume/performance may intervene.

Using the example in the Wikipedia article on fact tables, a sales fact table holding sales transactions might have a grain of day, store and product. This means that with this grain in place, you can’t analyze intra-day patterns, or which checkout was used, or which shelf the product was on. Makes sense, you have to stop somewhere.
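To make the loss of detail concrete, here is a minimal sketch of rolling transactions up to a grain of day, store and product. The transaction rows, the column order and the function name are all made up for illustration; they are not from any real system.

```python
from collections import defaultdict

# Hypothetical point-of-sale rows: (timestamp, store, checkout, product, amount)
transactions = [
    ("2010-09-15 09:05", "Store A", "Checkout 1", "Milk", 2.50),
    ("2010-09-15 17:40", "Store A", "Checkout 3", "Milk", 2.50),
    ("2010-09-15 17:45", "Store A", "Checkout 3", "Bread", 3.00),
    ("2010-09-16 08:10", "Store B", "Checkout 1", "Milk", 2.50),
]


def to_daily_grain(rows):
    """Summarise transactions to a grain of (day, store, product)."""
    facts = defaultdict(float)
    for timestamp, store, _checkout, product, amount in rows:
        day = timestamp.split(" ")[0]  # the time of day is thrown away here
        facts[(day, store, product)] += amount  # so is the checkout
    return dict(facts)


daily_facts = to_daily_grain(transactions)
for key, total in sorted(daily_facts.items()):
    print(key, total)
```

The two milk sales in Store A collapse into one fact row, and nothing in that row can tell you that one sale was in the morning and the other in the evening.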

But what if you want more detail? “Going against the grain” might be good for societal change and rebellious youth but accepting the grain in data is usually the right thing to do.

It is possible, if you are a rebel, and don’t worry too much about accuracy, to “generate a finer grain” by “allocating” or “interpolating” data between points from multiple data sets.

The request might come as something like: “I know we only collect data Y at region level, but can we allocate it down to stores to have more detail so we can put it in the cube?”

This is a slippery slope. It is always possible to allocate- but based on what? Sales? Shipping costs? Units sold? Employees? Some mix of things which should correlate to data Y? Your sister’s shoe size? The bottom line is you are making approximations and assumptions.
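A small sketch shows how much the answer depends on the allocation driver. The region total, the store names and both sets of driver values are invented for the example.

```python
def allocate(total, drivers):
    """Split a region-level total across stores in proportion to a driver."""
    driver_sum = sum(drivers.values())
    return {store: total * value / driver_sum for store, value in drivers.items()}


region_value = 1000.0  # "data Y", which is only known at region level

# Two plausible drivers (hypothetical numbers)
sales = {"Store A": 600.0, "Store B": 300.0, "Store C": 100.0}
employees = {"Store A": 10, "Store B": 10, "Store C": 20}

by_sales = allocate(region_value, sales)
by_employees = allocate(region_value, employees)

for store in sales:
    print(store, by_sales[store], by_employees[store])
```

Store C gets 100 when allocating by sales and 500 when allocating by employees. Both numbers look equally precise in the cube, and neither was ever measured.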

In the end, while sometimes it must be done, it is better to avoid going against the grain of the data. Spending effort on complex data fabrication processes will probably not drive real insight, and might even risk creating misleading information. Another potential issue is that if people know the data is “massaged” they will not treat it as credible, thus wasting your time as no-one is using your magic numbers.

If you don’t have the right grain, and you need it, then try to go get it. Change the extraction from the source system, or if needed, increase the level of detail at which the data is captured at the source.

Real data is always best- trying to generate details you don’t have is likely to lead you astray.

Mystery or Junk data warehouse dimensions

James Standen — Mon, 18 Jan 2010 17:10:46 +0000

Sometimes, when you are designing a star schema model, you’ll find yourself in a dilemma. You’ve come up with a beautiful design, right out of the pages of a Ralph Kimball book with 5 dimensions, and 5 measures, and you are on your way to star schema heaven when suddenly the users start asking awkward questions- where is such and such flag? Where’s the transaction type? Why can’t I sort based on the “e7” code from the system?

You can try to explain to them that pure star schemas should not be cluttered with a bunch of tiny dimensions, that your fact table just won’t stand for 100 million rows of the e7 code, and besides, computery things like transaction codes should not be in a business savvy data model. But face it, after some digging you determine the user is right (happens quite often in fact)- they really do use that information, it is critical that you include it, and you don’t have the time or budget to make the perfect data warehouse.

So how do you deliver to them what they need, and avoid messing up your dimensional model?

One answer is to create one or more junk dimensions, sometimes also referred to as mystery dimensions.

In the end, although the content of a mystery dimension may or may not be mysterious, there is nothing particularly mysterious about how to implement this type of dimension table.

Even if it’s perfectly clear what each column is, there are often a number of them with very low cardinality (that is, they have very few distinct values). It really does not make sense to add columns in the fact table for each one, and to have a bunch of tiny dimension tables with only a handful of rows in them.

Faced with this the data architect can wrap all these columns up into a junk dimension.

A junk dimension is a dimension that holds all the unique combinations of a set of columns, and assigns a unique key. This key is what is stored in the fact table, in the mystery dimension column.

Let’s look at a mystery dimension example. We’ll make up an example dimension that’s very small for simplicity’s sake. Let’s say that the transactional table that is used to generate one of our facts has three columns, “Zortz”, “a3” and “uudl”, which fully satisfy our mystery dimension criteria (i.e. we don’t know what they are, but people use them in queries).

“Zortz” is a true/false value, “a3” is one of two values “Confirmed” or “Pending” and “uudl” is either “” or “k”. All the possible combinations of these values would be put into a dimension table and assigned an integer surrogate key. Thus the mystery dimension table would look like this:
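One way to enumerate those combinations is a simple cross product, sketched here. The key column name "mystery_key" and the order in which keys are assigned are assumptions for illustration; any stable assignment of integers would do.

```python
from itertools import product

# The three low-cardinality columns from the example
zortz_values = [True, False]
a3_values = ["Confirmed", "Pending"]
uudl_values = ["", "k"]

# One dimension row per combination, each with an integer surrogate key
mystery_dimension = [
    {"mystery_key": key, "Zortz": zortz, "a3": a3, "uudl": uudl}
    for key, (zortz, a3, uudl) in enumerate(
        product(zortz_values, a3_values, uudl_values), start=1
    )
]

for row in mystery_dimension:
    print(row)
```

Two values times two values times two values gives eight rows, so the fact table only needs a single integer column holding a key from 1 to 8.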

A key consideration when forming mystery dimensions is how many combinations exist. If the number of combinations is too high, the mystery dimension’s size may be unmanageable.

And be careful assuming that all the combinations have been used yet. You are safe if the data type has a fixed set of values (like Boolean, or codes from a known finite set) because you can be sure you’ve created a dimension row for every combination.

But if there are free form string columns, then you need to make sure your ETL is able to generate new dimension rows and surrogate keys as new combinations are created in the source system. This might still be worthwhile, depending on how many new combinations get created.
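A minimal sketch of that "look up the key, or create it if the combination is new" step might look like this. The class name, its methods and the sample rows are hypothetical; a real ETL would keep the mapping in the dimension table itself rather than in memory.

```python
class JunkDimension:
    """Hands out a surrogate key per distinct combination of column values."""

    def __init__(self):
        self._keys = {}  # combination tuple -> surrogate key

    def key_for(self, combination):
        combination = tuple(combination)
        if combination not in self._keys:
            # New combination seen in the source: create the next key
            self._keys[combination] = len(self._keys) + 1
        return self._keys[combination]

    def rows(self):
        """Dimension table rows as (key, value1, value2, ...)."""
        return [(key, *combo) for combo, key in self._keys.items()]


dimension = JunkDimension()

# Hypothetical source rows: (Zortz, a3, uudl, amount)
source_rows = [
    (True, "Confirmed", "k", 120.0),
    (False, "Pending", "", 75.5),
    (True, "Confirmed", "k", 60.0),  # repeats the first combination
]

# The fact rows carry only the surrogate key and the measure
fact_rows = [(dimension.key_for(row[:3]), row[3]) for row in source_rows]
print(fact_rows)
print(dimension.rows())
```

Only combinations that actually occur get a row, which is what keeps this workable when a column is free form.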

You can also manage the size of the mystery dimension tables by having 2 or more mystery dimensions, which might reduce the overall number of dimensional rows depending on the makeup of the data. Different columns and values may tend to cluster together, and you will find that grouping them correctly gives you, say, two small mystery dimensions rather than one huge one.
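The arithmetic behind this can be checked by counting distinct combinations. In this made-up sketch, columns a and b move together, columns c and d move together, but the two pairs are independent of each other.

```python
from itertools import product

# Hypothetical source data: (a, b) only ever takes 3 values, (c, d) takes 4,
# and every (a, b) pairing shows up with every (c, d) pairing.
ab_pairs = [("X", 1), ("Y", 2), ("Z", 3)]
cd_pairs = [("p", "u"), ("p", "v"), ("q", "u"), ("q", "v")]
source_rows = [ab + cd for ab, cd in product(ab_pairs, cd_pairs)]

# Option 1: one mystery dimension over all four columns
one_dimension = set(source_rows)

# Option 2: two mystery dimensions, grouping the columns that cluster together
dim_ab = {row[:2] for row in source_rows}
dim_cd = {row[2:] for row in source_rows}

print(len(one_dimension), "rows in a single dimension")
print(len(dim_ab) + len(dim_cd), "rows in total across two dimensions")
```

That is 12 rows against 7 here, and the gap grows quickly with more values. The trade-off is a second key column in the fact table.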

If, however, the number of rows is manageable, a mystery dimension allows all the columns to be queryable while only adding one column to the fact table, providing a much more efficient solution than either creating multiple dimensions or leaving all the data in the fact table.

By moving it to a junk dimension or “mystery” dimension then you’ve got fewer indexes on the fact table which might be important depending on the size.

So if you find yourself telling your end users that they will just have to do without a column, think twice about it. The role of a data warehouse is to deliver the data- sometimes you just have to find the right packaging to get the job done.

Data migration Part 3- Mapping the legacy systems

James Standen — Mon, 14 Dec 2009 18:17:07 +0000

This is part three of an ongoing series that’s taking a look at data migration projects. In this part we’re going to talk about how important it is to know where you are starting from, before you head off on a new application journey. Understanding and mapping your legacy systems is a key success factor for a data migration project, but can be a very difficult and time consuming battle. In this post, I’ll talk a bit about some approaches I’ve found useful in my experience.

If you like, you can start with Part one which was a light hearted introduction to data migration projects in general, and part two, where we talked about the importance of data quality.

Why are we spending so much time on this? That’s the OLD system- we need to focus on the future!

Here are just some of the important things the legacy mapping needs to clarify:

Data location. You can’t migrate data if you don’t know what it is and where it is.
Data dependencies to other systems. All processes and systems that rely on interfaces to the legacy systems need to be either replaced or shut off. Often this means that even if the new system is not involved, other systems may stop working because they get data from the legacy systems. The data migration project is not just about turning on the new system. The consequences of turning off the old system have to be known and managed.
Legal requirements to keep legacy data available. Even if data is not migrated to the new system, there may be additional data migration requirements into data warehouses or documents that have nothing to do with the new application.
Infrastructure dependencies. The actual infrastructure that the legacy systems are on might perform other tasks that, although not directly related to the legacy system, will cause issues when that infrastructure is removed. (For example, someone installed a service of some sort on one of the servers that is used by other applications that are completely unrelated from a data point of view.)

Often the first time the Legacy system is documented is just before it’s shut down.

Despite our best intentions, sometimes documentation doesn’t get updated. This is the reality for many systems, and particularly for legacy systems.

One of the first steps in a data migration project is to gather all the existing documentation for the legacy systems, and all the systems they talk to, and make sure it’s accessible to the data migration project team.

It is critical to have tight control over these documents, and to ensure that everyone works off a “live” version- because your mapping is going to update that documentation, and every developer, data modeler and application team member needs to know that they have the best and latest version.

The application interface diagram.

Now, the ideal situation is to have a dynamic, self correcting, scanning Configuration Management Database tool (CMDB tool) that already has every scrap of meta data about every application and all its interfaces ready to go.

If you have one of these, good for you, and you can stop reading.

For the rest of us, let’s talk about practical methods of mapping what we have.

How to get the data.

Scan the environment- catch the interfaces in the act.
- Monitor network traffic to detect exchanges between applications.
- Scan file systems to find interface files and determine frequency.
- Catalog services and activity of those services on servers.
Get out there and talk to people.
- Ask people- where is data from this system used?
- Look at management reports and trace backwards to find where information is pulled.
- Don’t assume the interface is direct. My record is 6 hops from the source to the Excel sheet used by the CEO, with the information passing through two of the same systems twice.
- Hunt down people that were involved in the original installation. Often they’ll have key information that can save you time.
Any other way that works.

What to do with it.

If you don’t have a complex tool to do the mapping of all your systems, then one approach that is a step above the “lots of Excel sheets and PowerPoint slides” approach is to use a tool like Microsoft Visio. I’ve used it successfully to map applications, by having the drawing and the interfaces BE the database. This ensures that everything in the drawing is on the interface list, and everything on the interface list shows up on the drawing.

Create different objects in Visio, and give them attributes. At a minimum you need an application, interface and database object.
Draw the applications and the interfaces between them in a single large Visio drawing, and fill in the attributes in the Visio objects.
Make some simple VBA code in the drawing to dump all the data into flat files or Excel sheets (or directly to a DB if you get ambitious).

It’s simple, but it is far better than having spreadsheets, and a drawing- and then constantly trying to determine if the two agree with each other.

In the ERP project where I used this technique, we identified over 1500 interfaces between hundreds of application instances. The ERP project was a very large effort with hundreds of project resources, and multiple phased projects implementing a new common system. The actual original mapping took two people about 3 months to do. They had to work with about 30 different applications support people to systematically map all the applications, and the interfaces, one by one.

A key part of the job was to actually validate the documentation, i.e. if the documentation said there was a cron job that ran a script on server X, actually go to server X and watch it run. This meant that we could be confident in the map, and make plans based on it.

Everyone on the team used the drawing and lists generated from the drawing to stay on the same page. And it was a big page- the key is to also have access to a plotter- we were plotting out a pretty good size wall poster by the time we were done.

The ERP teams had the drawing taped up to the wall- and they were making notes right on it and emailing my team. We would update the master, and publish a new version, along with the generated lists.

In building this drawing, we found that most of the interfaces were “under” or “un”-documented, and that if documentation did exist, generally it was wrong. By establishing the “official” document for the legacy systems, we focused and coordinated the design effort in a way that would not have happened, if each team just had their own marked up copy of the original documentation or the part that was of interest to them.

Having the map means you can make the plan

This drawing and the interfaces mapped were critical in planning the migration.

Create different layers in your drawing for each phase “Phase 1”, “Phase 2”, “Phase 3”, or “Feb 2010”, “Aug 2010”, “Jan 2011” etc.
Hide or show systems and interfaces (including the new applications and interfaces) as they were phased in or out for each layer.
By viewing and printing layers separately, you can see a step by step plan for the migration- with your application architecture and integration map at each phase.

This was a powerful tool to both do the planning, and to make sure everyone understood the timing and sequence. With multiple phases over a three year period, the project needed it, and without such an overall view, such critical planning would have been haphazard.
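The layer idea boils down to recording, for each item, the phase in which it appears and the phase in which it goes away. Here is a rough sketch of that as data; the system names, phase numbers and field names are invented, and this is not the Visio mechanism itself.

```python
# Hypothetical systems with the phase they are switched on and switched off.
# phased_out = None means the system stays to the end of the plan.
systems = [
    {"name": "Legacy ERP", "phased_in": 0, "phased_out": 2},
    {"name": "Legacy Billing", "phased_in": 0, "phased_out": 3},
    {"name": "New ERP", "phased_in": 2, "phased_out": None},
    {"name": "New Billing", "phased_in": 3, "phased_out": None},
]


def visible_in_phase(phase):
    """Names of the systems that are live during the given phase number."""
    return [
        system["name"]
        for system in systems
        if system["phased_in"] <= phase
        and (system["phased_out"] is None or phase < system["phased_out"])
    ]


for phase in (1, 2, 3):
    print("Phase", phase, visible_in_phase(phase))
```

Printing one list (or one drawing layer) per phase gives the step by step view, and the same attributes work for interfaces.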

The challenge with this mapping is to find the right level of detail required. Not detailed enough and it is wasted effort. Too detailed and it will consume excessive resources and time.

A simple approach- What talks to what and what it runs on.

There are two key aspects to mapping your application architecture.

Functional relationships- applications talking to other applications, with interfaces between them.
Infrastructure relationships – which servers, network connections, services and databases are involved in the functional relationship

You can’t show both completely on a single drawing- don’t try. Some applications run on multiple servers, many servers run more than one application, databases are shared by many, interfaces often use common infrastructure such as EAI tools etc.

The approach we took, and it worked well, was to show the functional relationships on the diagram, and hold the physical relationships (which databases were on which servers/clusters and which application ran on which server etc.) in the attributes of the applications.

We did sometimes show some physical attributes on the diagram for easy reading, but only as an annotation- the relationship was done via the attributes in the Visio application objects.

This meant that you could ask “What runs on this server?” and could ask “Which servers are involved with this application?” by doing a filter or query on the data. Very useful things if you are planning to shut down a server. You make a checklist, and one by one make sure everything is either shut down, or moved.
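Once the attributes have been dumped out of the drawing, those two questions are simple filters. This is a sketch over made-up data; the application names, server names and field names are assumptions, not the actual export format we used.

```python
# Hypothetical application objects with their physical attributes
applications = [
    {"name": "Billing", "servers": ["srv01", "srv02"], "databases": ["BILLDB"]},
    {"name": "Inventory", "servers": ["srv02"], "databases": ["INVDB"]},
    {"name": "Reporting", "servers": ["srv03"], "databases": ["BILLDB", "INVDB"]},
]


def applications_on(server):
    """What runs on this server?"""
    return sorted(app["name"] for app in applications if server in app["servers"])


def servers_for(application_name):
    """Which servers are involved with this application?"""
    for app in applications:
        if app["name"] == application_name:
            return sorted(app["servers"])
    return []


# The shutdown checklist for srv02
print(applications_on("srv02"))
print(servers_for("Billing"))
```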

Here’s a simple example to illustrate what the diagram might look like;

The circles with the numbers were the interfaces, each one had attributes like “To” , “From” and “Method” etc. The level of detail you go to is a function of how ambitious you are, but at a minimum you need to record the fact that the interface exists.
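As a rough sketch, an interface list with those attributes lets you ask what is affected when an application is switched off. The interface records and application names here are invented for illustration.

```python
# Hypothetical interface objects, one per numbered circle on the drawing
interfaces = [
    {"id": 1, "from": "Legacy ERP", "to": "Data Warehouse", "method": "nightly file"},
    {"id": 2, "from": "Legacy ERP", "to": "Billing", "method": "database link"},
    {"id": 3, "from": "Billing", "to": "Data Warehouse", "method": "web service"},
]


def interfaces_touching(application):
    """Ids of every interface that reads from or writes to the application."""
    return [
        interface["id"]
        for interface in interfaces
        if application in (interface["from"], interface["to"])
    ]


def downstream_of(application):
    """Applications that receive data directly from the given application."""
    return sorted({i["to"] for i in interfaces if i["from"] == application})


print(interfaces_touching("Legacy ERP"))
print(downstream_of("Legacy ERP"))
```

Every id in the first list is an interface that has to be replaced or retired before the legacy system can be turned off.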

So in summary:

Create a single map of all your applications and interfaces and share it with everyone on the team
Make sure you validate your map carefully, looking into the actual systems, and talking with as many people as needed to ensure you have captured everything
Make a step by step plan for the migration, showing when each application, interface and infrastructure item is phased in or out.

Next up- the data dictionary and how do we get everyone to agree on those definitions?

Joining the Dimension Table to the Fact Table- Purchasing Data mart (Part 5)

James Standen — Tue, 17 Feb 2009 16:31:48 +0000

After we have created the dimension tables and the fact table and populated them with data, the final step to getting a star schema is of course to actually join the dimension tables to the fact table. In the Datamartist tool we do this with a Join block.

Check out the first four parts of this series (1, 2, 3 and 4) where we created an example data mart, with some fictitious purchasing data.

The final step is to join the dimensions we have created to the fact table. To do this, we connect up the two dimensions (Vendor and Item) to the Join block and connect an export block to the output. What has in effect been created is a complete Extract, Transform, Load (ETL) process and the final star schema join.


With the generated data set I used for this example, summarizing the data to yearly totals but keeping all the detail on Vendor and Item causes the roughly 4 million row raw data file to be reduced to around 800 thousand rows. (This summarizing was done on another canvas- although it could have been done on this canvas just as easily).

This data mart, with 800 thousand rows and two dimensions of about three thousand members each, took my laptop about a minute and 45 seconds to solve and save out to a 360 MB text file.

Of course, by summarizing or filtering (just add blocks) analysis subsets could easily be exported directly to Excel, managing the data volumes involved, and letting you create the graphs, dashboards and reports that you need.

This is part of a 5 part series- here are the links to the various parts: 1, 2, 3, 4 and 5

Degenerate Dimensions in Datamarts

James Standen — Sun, 28 Dec 2008 02:15:32 +0000

Not all dimensions are created equal. A typical dimension is defined by a table that holds the reference data that is being joined to the fact data. So in the fact table, for example, we have the product ID, or the product code, and in the product dimension table we have a single row for each product, that lists all the attributes of that product (its size, its color, its category, its segment, etc. etc.)

So it would follow, then, that there must be a dimension table for every dimension, right? Well, not if the dimension is degenerate. In fact, you could argue that calling it a dimension at all is pushing it, but I think the idea was to keep things tidy.

In any well structured data mart (a star schema), every column in the fact table should be either a measure or a dimension. If it’s a measure, then it’s storing a value for that particular fact- usually a number, and we use it for calculations and aggregations. If it’s a dimension, then we join it to the appropriate dimension table and thereby look up all the interesting things about that fact on that dimension.

Where degenerate dimensions come in is that there are often some columns that we want to have, but that are not measures, and don’t have a table of stuff we want to join to. Example: a purchase order number. These columns store something that we want to have (the purchase order number), but to create an empty dimension table would only slow things down. So, to ensure we don’t feel bad about breaking the “only a measure or a dimension in the fact table” rule, we just CALL them dimensions- even without the table.

Any attribute of the purchase order that was of interest in its own right, with values that each carry further attributes we care about, would already have been turned into a dimension, and a dimension table would have been created for it.

But to create a dimension table that contains a row for every purchase order would create a very large dimension with nothing in it (since there are lots of purchase orders, possibly as many as there are facts if the grain of your fact table is one per purchase order). But our users would not be happy if they could not get a list of the purchase orders included in a given total, or be able to drill down to that bottom level of detail that we’ve gone to all the trouble to include.

So, when we create transactional level fact tables, it is normal, in fact, necessary to include some degenerate dimensions- include columns that have useful information (very often referencing back to the source system) but that do not join to any dimension table. Plus you can just impress everyone with your dimensional modelling knowledge when you say “degenerate dimension”.
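Here is a minimal sketch of what that looks like in a table structure, using an in-memory SQLite database. The table names, column names and rows are made up; the point is only that po_number sits in the fact table with no dimension table behind it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vendor_dim (
        vendor_key  INTEGER PRIMARY KEY,
        vendor_name TEXT
    );
    CREATE TABLE purchase_fact (
        vendor_key INTEGER REFERENCES vendor_dim(vendor_key), -- regular dimension
        po_number  TEXT,  -- degenerate dimension: no table to join to
        amount     REAL   -- measure
    );
    INSERT INTO vendor_dim VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO purchase_fact VALUES
        (1, 'PO-1001', 250.0),
        (1, 'PO-1002', 100.0),
        (2, 'PO-1003', 400.0);
""")

# A total by vendor, using the regular dimension
total = conn.execute("""
    SELECT SUM(f.amount)
    FROM purchase_fact f
    JOIN vendor_dim v ON v.vendor_key = f.vendor_key
    WHERE v.vendor_name = 'Acme'
""").fetchone()[0]

# Drill down to the purchase orders behind that total, using the
# degenerate dimension straight from the fact table
po_numbers = [
    row[0]
    for row in conn.execute("""
        SELECT f.po_number
        FROM purchase_fact f
        JOIN vendor_dim v ON v.vendor_key = f.vendor_key
        WHERE v.vendor_name = 'Acme'
        ORDER BY f.po_number
    """)
]

print(total, po_numbers)
```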

Since we are very close to closing out 2008 and starting the new year, I’ll share with you one of my new year’s resolutions (there are many)- I’m going to start a data mart data modelling 101 series of blog posts in January, in which I will go through a complete data mart example. My intention is to both explain the data model concepts, and illustrate how they are executed using Datamartist. And I think I’ll run with the purchase order example, because given the economic situation we’re going to have in 2009, identifying unnecessary spending and finding ways to cut costs is one of the most important uses of a data mart- and one with potentially a huge payback.

Update: I’ve posted more recently on junk or mystery dimensions which might be of interest too.

Download the free, no risk Datamartist trial now and try it out on your own data. You’ll be amazed what’s possible. No registration required, and the install takes just minutes.

Dimensional Tables and Fact Tables

James Standen — Fri, 31 Oct 2008 02:41:21 +0000

One of the secrets to putting together a good set of data marts is the concept of dimensions. There are two key steps to being able to analyse your data and build a working data mart model.

Build a set of clean, consistent dimension tables that store reference information about your key dimensions like Product, Customer, Geographical Areas, Sales Areas etc.
Join them up to a fact table that does NOT have dimensional data in it. Just the facts, ma’am.

Usually, to make a proper star schema data mart, it is necessary to transform the source data set, removing dimensional data, and generating a fact set. The dimensional data that is removed must be transformed to remove duplicate rows and to resolve any data quality issues that might exist. Transactional systems don’t know about dimensions- but you do.

A key part of the data modelling is to determine which fields in the source data should be put in the Dimensional tables and which fields should go to the Fact table.

Determining the Grain of the fact table

The very first step is to determine WHAT exactly is one fact in our fact table going to be? The GRAIN or GRANULARITY of the fact table refers to the level of detail of each row in the fact table. For example, an order fact table might have a grain of order, with one row per order, or order line, with a row for every line on each order (meaning more than one line for some orders). It is key to make a decision on the grain of the fact table first. This is often a balance between keeping detail, and managing complexity.
This is a key question, and is driven by what it is you want to analyse. For example, if the decision is made to have a granularity of one row per order, then it might be necessary to remove all product information (since any given order might have multiple products) and only have total order value. This won’t work if you want to analyse product segments, or compare different products.
To have our cake and eat it too, we’ll use a simplified example of order data where the grain is one row equals one order and every order in our system has one and only one product. This table has the following columns:

Order Number, Order Date, Ship Date, Customer Name, Customer Segment, Product Name, Product Category, Product Sub Category, Quantity Sold, Unit Price

Some Simple Questions to guide us

To determine which columns should be in the dimension table and which columns in the fact table, ask yourself these questions:

Is the data in the column something that is unique for every order? If yes, then it’s definitely part of the fact table. So Order Number is definitely in the fact table, as are Order Date, Ship Date, Quantity Sold and (most likely) Unit Price, since all these things are linked to the order and might change for each order.
Is the data in the column referring to data in another column, and will it always be the same? If yes, then this is probably a candidate for a dimensional table. In this example, the Customer Segment is probably something that is the same for a given customer on ALL the orders, so it should be in a Customer Dimension. Likewise, the Product Category and Product Sub Category are probably used to organise products, and therefore can be determined from the product name alone and don’t change from order to order.
Another way to help determine which columns go into the fact table is to think about the directness of the relationship between what is stored in the column and the grain of the fact table. For example, in this case the Customer Name field is directly related to the order, but the Customer Segment field is related to the Customer Name field, which is then related to the order. Once removed or more usually means it should be in the dimensional table (again, providing the value is consistent for all orders, or should be).
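Applied to the example columns, the split can be sketched like this. The order rows are invented, and for brevity the names themselves act as the dimension keys; a real data mart would more likely use surrogate keys.

```python
# Hypothetical rows using the columns listed above (one product per order)
orders = [
    {"Order Number": 1001, "Order Date": "2008-10-01", "Ship Date": "2008-10-03",
     "Customer Name": "Acme", "Customer Segment": "Corporate",
     "Product Name": "Desk", "Product Category": "Furniture",
     "Product Sub Category": "Tables", "Quantity Sold": 2, "Unit Price": 150.0},
    {"Order Number": 1002, "Order Date": "2008-10-02", "Ship Date": "2008-10-05",
     "Customer Name": "Acme", "Customer Segment": "Corporate",
     "Product Name": "Lamp", "Product Category": "Furniture",
     "Product Sub Category": "Lighting", "Quantity Sold": 5, "Unit Price": 20.0},
    {"Order Number": 1003, "Order Date": "2008-10-02", "Ship Date": "2008-10-04",
     "Customer Name": "Bob Smith", "Customer Segment": "Consumer",
     "Product Name": "Desk", "Product Category": "Furniture",
     "Product Sub Category": "Tables", "Quantity Sold": 1, "Unit Price": 150.0},
]


def build_dimension(rows, key_column, attribute_columns):
    """Collect one row per distinct key, refusing inconsistent attributes."""
    dimension = {}
    for row in rows:
        attributes = {column: row[column] for column in attribute_columns}
        existing = dimension.setdefault(row[key_column], attributes)
        if existing != attributes:
            # e.g. the same customer listed under two different segments
            raise ValueError(f"Inconsistent attributes for {row[key_column]!r}")
    return dimension


customer_dim = build_dimension(orders, "Customer Name", ["Customer Segment"])
product_dim = build_dimension(
    orders, "Product Name", ["Product Category", "Product Sub Category"]
)

# The fact table keeps the order-level columns plus the keys to the dimensions
FACT_COLUMNS = ["Order Number", "Order Date", "Ship Date", "Customer Name",
                "Product Name", "Quantity Sold", "Unit Price"]
fact_table = [{column: row[column] for column in FACT_COLUMNS} for row in orders]

print(customer_dim)
print(product_dim)
print(fact_table[0])
```

The consistency check is where the data quality issues surface: a customer who appears under two segments has to be resolved before the dimension can be built.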

Taking the time to think about the fact table grain, and determine which dimension tables you are going to build and what you are going to put in them is an important first step to creating a good data model for your data mart, and needs to be done no matter which tools you use to build it. If you want to try a visual, easy to use data transformation tool that lets you get at your data without having to resort to data base programming, check out the Datamartist tool.

Data modelling Hierarchies- how to make a dimension

James Standen — Thu, 23 Oct 2008 02:45:52 +0000

One of the most useful data model structures in a data mart is a Hierarchy (also called a Tree structure). Tree structures let us take a large number of things and organise them in a way that makes sense. More importantly, a tree structure lets us “drill down” into information.

Hierarchy Rules

In a simple tree structure, every object has one and only one parent, or it is at the top level of the tree.
For each level of the tree, all the objects are the same type.

All fine in theory, but what do the actual table structures look like?

Parent Child Relationships

The most efficient way to store a tree structure of objects is in a parent child type structure.

Parent Child Structure

For every object you store one row recording the parent of the object. This means that every relationship in the tree is stored only once.

This is the best form to store the “master copy” of the tree- because there is no ambiguity- one row, one object, one parent. Rule number one is enforced strictly by the physical model in this case- and that’s a good thing.

The downside of this structure is that it requires looking at multiple rows to summarise data. And it’s just not easy to read.
To find out which country a city is in, we have to first look up the parent (the state/province), then we have to look up the parent of that to find the country. If a hierarchy has 10 levels, we have to look at ten rows for every row that we want to summarise to the top level. Not so good.
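The row-by-row lookup can be sketched with a small parent child table. The place names are just sample data, and the dictionary stands in for a two column (object, parent) table.

```python
# Parent child structure: one entry per object, recording its parent.
# None marks the top level of the tree.
parent_of = {
    "USA": None,
    "Canada": None,
    "Michigan": "USA",
    "Ontario": "Canada",
    "Detroit": "Michigan",
    "Toronto": "Ontario",
}


def ancestors(node):
    """Walk up the tree one row at a time, from parent to top level."""
    path = []
    parent = parent_of[node]
    while parent is not None:
        path.append(parent)
        parent = parent_of[parent]
    return path


print(ancestors("Detroit"))
```

Finding the country for one city takes two lookups here, and every extra level in the hierarchy adds another.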

Dimensional Tables

In a dimensional table, we store one row for each object at the bottom of the hierarchy. In that row, we store its parent, its grand parent, its great grand parent, its great great grand parent etc. etc. Here’s what that table looks like for our example:

This way, we have everything right there and it makes it easy to summarise. To find the totals for a country, just add up every row with a given value in the country field. The advantages of this form are clear when it comes time to do the analysis- but what are the disadvantages? Well, if you want to change a parent child relationship between level 1 and 2, then you have to change lots of rows- the relationship between a country and a state/province is repeated many times.

Depending on where the data is, and what applications have access to read and write, it’s also possible to have inconsistencies- you could have some rows that say Michigan is in the USA, and others that put it in Canada.

The ideal solution is to store the master copy of the tree as a Parent Child relationship, and generate the Dimensional table automatically so that when the analysis is run, it’s fast and easy, and users can view it in spreadsheet tools in an easy to read format, knowing that it is guaranteed to be consistent.
This is what is done by the Datamartist tool– but rather than worrying about data models and table structures, managing tree structures is done with drag and drop.

Then dimensional tables are generated that are in the “everything in one row” format that is so easy to use in Excel, either through an auto-filter, or with pivot tables.
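Outside of any particular tool, the idea of generating the “everything in one row” table from a parent child master can be sketched in a few lines. This is only an illustration of the concept, not how Datamartist does it internally; the place names, level names and sales figures are made up, and it assumes every bottom level object sits at the same depth.

```python
from collections import defaultdict

# Master copy of the tree: one entry per object, recording its parent
parent_of = {
    "USA": None,
    "Canada": None,
    "Michigan": "USA",
    "Ontario": "Canada",
    "Detroit": "Michigan",
    "Lansing": "Michigan",
    "Toronto": "Ontario",
}
LEVELS = ["Country", "State/Province", "City"]


def flatten(parent_of):
    """Build one dimensional row per bottom level object."""
    parents = {parent for parent in parent_of.values() if parent is not None}
    rows = []
    for node in parent_of:
        if node in parents:
            continue  # not at the bottom of the hierarchy
        path = [node]
        while parent_of[path[-1]] is not None:
            path.append(parent_of[path[-1]])
        # path runs bottom-up, the level names run top-down
        rows.append(dict(zip(LEVELS, reversed(path))))
    return rows


dimension_rows = flatten(parent_of)
for row in dimension_rows:
    print(row)

# Summarising to country is now a single pass over the rows
sales_by_city = {"Detroit": 100.0, "Lansing": 50.0, "Toronto": 80.0}
totals_by_country = defaultdict(float)
for row in dimension_rows:
    totals_by_country[row["Country"]] += sales_by_city[row["City"]]
print(dict(totals_by_country))
```

Because every row is derived from the same master, Michigan can only ever end up in one country.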

Find out more about Datamartist– and download a free trial version.