<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datamartist.com &#187; Data Modelling</title>
	<atom:link href="http://www.datamartist.com/category/data-modelling/feed" rel="self" type="application/rss+xml" />
	<link>http://www.datamartist.com</link>
	<description>Reduce cost with self serve data transformation</description>
	<lastBuildDate>Mon, 26 Jul 2010 18:33:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Mystery or Junk data warehouse dimensions</title>
		<link>http://www.datamartist.com/mystery-or-junk-data-warehouse-dimensions</link>
		<comments>http://www.datamartist.com/mystery-or-junk-data-warehouse-dimensions#comments</comments>
		<pubDate>Mon, 18 Jan 2010 17:10:46 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[Data warehouse]]></category>
		<category><![CDATA[Dimension Tables]]></category>
		<category><![CDATA[ETL]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=2933</guid>
		<description><![CDATA[Sometimes, when you are designing a star schema model, you'll find yourself in a dilemma. You've come up with a beautiful design, right out of the pages of a Ralph Kimball book with 5 dimensions, and 5 measures, and you are on your way to star schema heaven when suddenly the users start asking awkward [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.datamartist.com/wp-content/uploads/2010/01/ralph-kimball-on-the-phone-too-many-dimensions1.jpg" alt="ralph-kimball-on-the-phone-too-many-dimensions" title="ralph-kimball-on-the-phone-too-many-dimensions" width="317" height="199" class="alignright size-full wp-image-3952" />Sometimes, when you are designing a star schema model, you'll find yourself in a dilemma.  You've come up with a beautiful design, right out of the pages of a Ralph Kimball book with 5 dimensions, and 5 measures, and you are on your way to star schema heaven when suddenly the users start asking awkward questions: where is such-and-such flag?  Where's the transaction type?  Why can't I sort based on the "e7" code from the system?</p>
<p>You can try to explain to them that pure star schemas should not be cluttered with a bunch of tiny dimensions, that your fact table just won't stand for 100 million rows of the e7 code, and besides, computery things like transaction codes should not be in a business-savvy data model.  But face it: after some digging you determine the user is right (which happens quite often, in fact).  They really do use that information, it is critical that you include it, and you don't have the time or budget to make the perfect data warehouse.</p>
<p>So how do you deliver to them what they need, and avoid messing up your dimensional model?</p>
<p>One answer is to create one or more junk dimensions, sometimes also referred to as mystery dimensions.</p>
<p>In the end, although the content of a mystery dimension may or may not be mysterious, there is nothing particularly mysterious about how to implement this type of dimension table.</p>
<p>Even if it's perfectly clear what each column is, there are often a number of them with very low cardinality (that is, they have very few distinct values).  It really does not make sense to add a column to the fact table for each one, or to have a bunch of tiny dimension tables with only a handful of rows in them.</p>
<p>Faced with this the data architect can wrap all these columns up into a junk dimension.</p>
<p>A junk dimension is a dimension that holds all the unique combinations of a set of columns, and assigns each combination a unique key.  This key is what is stored in the fact table, in the mystery dimension column.</p>
<p>Let's look at a mystery dimension example.  We'll make up an example dimension that's very small for simplicity's sake.  Let's say that the transactional table that is used to generate one of our facts has three columns, "Zortz", "a3" and "uudl", which fully satisfy our mystery dimension criteria (i.e. we don't know what they are, but people use them in queries).</p>
<p>"Zortz" is a true/false value, "a3" is one of two values "Confirmed" or "Pending" and "uudl" is either "" or "k".  All the possible combinations of these values would be put into a dimension table and assigned an integer surrogate key.  Thus the mystery dimension table would look like this:<br />
<img src="http://www.datamartist.com/wp-content/uploads/2009/11/mystery-dimension-example-data-set.jpg" alt="mystery-dimension-example-data-set" title="mystery-dimension-example-data-set" width="425" height="200" class="aligncenter size-full wp-image-3548" /></p>
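<p>As a rough sketch of the idea (in Python, and not taken from any particular ETL tool), the table can be generated by enumerating every combination of the three columns and numbering them.  The names <code>build_junk_dimension</code> and <code>mystery_key</code> are made up for this illustration, and the key order is arbitrary, so it may not match the picture above.</p>

```python
from itertools import product

# Column domains taken from the example above.
DOMAINS = {
    "Zortz": [True, False],
    "a3": ["Confirmed", "Pending"],
    "uudl": ["", "k"],
}

def build_junk_dimension(domains):
    """Return one row per combination of values, each with an integer surrogate key."""
    columns = list(domains)
    rows = []
    for key, combo in enumerate(product(*domains.values()), start=1):
        row = {"mystery_key": key}          # hypothetical surrogate key column
        row.update(zip(columns, combo))
        rows.append(row)
    return rows

mystery_dim = build_junk_dimension(DOMAINS)

# Lookup used while loading facts: combination of source values -> surrogate key.
key_lookup = {(r["Zortz"], r["a3"], r["uudl"]): r["mystery_key"] for r in mystery_dim}

for row in mystery_dim:
    print(row)
```

<p>Three columns with two values each give 2 x 2 x 2 = 8 rows, and the fact table then stores only the single key from <code>key_lookup</code> instead of three separate columns.</p>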
<p>A key consideration when forming mystery dimensions is how many combinations exist.  If the number of combinations is too high, the mystery dimension's size may be unmanageable.</p>
<p>And be careful about assuming that every combination has already shown up in the data.  You are safe if the data type has a fixed set of values (like Boolean, or codes from a known finite set) because you can be sure you've created a dimension row for every combination.</p>
<p>But if there are free-form string columns, then you need to make sure your ETL is able to generate new dimension rows and surrogate keys as new combinations are created in the source system.  This might still be worthwhile, depending on how many new combinations get created.</p>
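<p>One way to picture that ETL step is a "look up the key, or create it if missing" helper.  The sketch below is only an in-memory illustration under that assumption; the class name <code>JunkDimension</code> and the sample values are hypothetical, and a real load would persist the rows in the dimension table.</p>

```python
class JunkDimension:
    """Hands out surrogate keys for combinations, adding a row on first sight."""

    def __init__(self):
        self._keys = {}        # combination tuple -> surrogate key
        self._next_key = 1

    def key_for(self, combination):
        combination = tuple(combination)
        if combination not in self._keys:
            # New combination from the source system: create a dimension row.
            self._keys[combination] = self._next_key
            self._next_key += 1
        return self._keys[combination]

    def __len__(self):
        return len(self._keys)

dim = JunkDimension()
source_rows = [
    (True, "Confirmed", ""),
    (False, "Pending", "k"),
    (True, "Confirmed", ""),   # seen before, reuses key 1
    (True, "On hold", ""),     # new free-form value, gets a new key
]
fact_keys = [dim.key_for(row) for row in source_rows]
print(fact_keys)   # [1, 2, 1, 3]
```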
<p>You can also manage the size of the mystery dimension tables by having two or more mystery dimensions, which might reduce the overall number of dimension rows depending on the makeup of the data.  Different columns and values may tend to cluster together, and you may find that grouping them correctly gives you, say, two small mystery dimensions rather than one huge one.</p>
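<p>To see why splitting can help, here is a small made-up data set in which columns A and B move together, C and D move together, but the two pairs vary independently of each other.  The values are invented purely for the arithmetic.</p>

```python
from itertools import product

# Hypothetical source data: (A, B) takes 3 combinations, (C, D) takes 4,
# and every (A, B) pair occurs with every (C, D) pair.
ab_pairs = [("x", 1), ("y", 2), ("z", 3)]
cd_pairs = [("p", "q"), ("r", "s"), ("t", "u"), ("v", "w")]
source_rows = [ab + cd for ab, cd in product(ab_pairs, cd_pairs)]

# One mystery dimension: one row per distinct (A, B, C, D).
one_dim_rows = len(set(source_rows))

# Two mystery dimensions: distinct (A, B) rows plus distinct (C, D) rows.
two_dim_rows = len({row[:2] for row in source_rows}) + len({row[2:] for row in source_rows})

print(one_dim_rows, two_dim_rows)   # 12 7
```

<p>The trade-off is one extra key column in the fact table, so the split only pays off when the combined dimension would otherwise be large.</p>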
<p>If, however, the number of rows is manageable, a mystery dimension allows all the columns to be queryable while adding only one column to the fact table, and provides a much more efficient solution than either creating multiple dimensions or leaving all the data in the fact table.</p>
<p>By moving these columns into a junk or "mystery" dimension, you also end up with fewer indexes on the fact table, which might be important depending on its size.</p>
<p>So if you find yourself telling your end users that they will just have to do without a column, think twice about it.  The role of a data warehouse is to deliver the data- sometimes you just have to find the right packaging to get the job done.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/mystery-or-junk-data-warehouse-dimensions/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data migration Part 3- Mapping the legacy systems</title>
		<link>http://www.datamartist.com/data-migration-part-3-mapping-the-legacy-systems-meta-data-and-application-mapping</link>
		<comments>http://www.datamartist.com/data-migration-part-3-mapping-the-legacy-systems-meta-data-and-application-mapping#comments</comments>
		<pubDate>Mon, 14 Dec 2009 18:17:07 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[Data migration]]></category>
		<category><![CDATA[Meta Data]]></category>
		<category><![CDATA[Meta]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=3640</guid>
		<description><![CDATA[This is part three of an ongoing series that's taking a look at data migration projects. In this part we're going to talk about how important it is to know where you are starting from, before you head off on a new application journey. Understanding and mapping your legacy systems is a key success factor [...]]]></description>
			<content:encoded><![CDATA[<p>This is part three of an ongoing series that's taking a look at data migration projects. In this part we're going to talk about how important it is to know where you are starting from, before you head off on a new application journey.  <strong>Understanding and mapping your legacy systems is a key success factor for a data migration project</strong>, but can be a very difficult and time consuming battle.  In this post, I'll talk a bit about some approaches I've found useful in my experience.</p>
<p><img src="http://www.datamartist.com/wp-content/uploads/2009/12/we-might-have-some-undocumented-interfaces-to-consider1.jpg" alt="we-might-have-some-undocumented-interfaces-to-consider" title="we-might-have-some-undocumented-interfaces-to-consider" width="374" height="275" class="alignright size-full wp-image-3656" />If you like, you can start with Part <a href="/data-migration-part-1-introduction-to-the-data-migration-delema">one</a> which was a light hearted introduction to data migration projects in general, and part <a href="/data-migration-part-2-determining-data-quality-is-the-first-key-step">two</a>, where we talked about the importance of data quality.</p>
<blockquote><p>Why are we spending so much time on this? That's the OLD system. We need to focus on the future!</p></blockquote>
<p>Here are just some of the important things the legacy mapping needs to clarify:</p>
<ol>
<li><strong>Data location.</strong> You can't migrate data if you don't know what it is and where it is.</li>
<li><strong>Data dependencies on other systems.</strong> All processes and interfaces that rely on the legacy systems need to be either replaced or shut off.  Often this means that even if the new system is not involved, other systems may stop working because they get data from the legacy systems.  The data migration project is not just about turning on the new system.  The consequences of turning off the old system have to be known and managed.</li>
<li><strong>Legal requirements to keep legacy data available.</strong> Even if data is not migrated to the new system there may be additional data migration requirements into data warehouses or documents that have nothing to do with the new application.</li>
<li><strong>Infrastructure dependencies.</strong> The actual infrastructure that the legacy systems are on might perform other tasks that, although not directly related to the legacy system, will cause issues when that infrastructure is removed. (For example, someone installed a service of some sort on one of the servers that is used by other applications that are completely unrelated from a data point of view.)</li>
</ol>
<h2>Often the first time the Legacy system is documented is just before it's shut down.</h2>
<p>Despite our best intentions, sometimes documentation doesn't get updated.  This is the reality for many systems, and particularly for legacy systems.  </p>
<p>One of the first steps in a data migration project is to gather all the existing documentation for the legacy systems, and all the systems they talk to, and make sure it's accessible to the data migration project team.</p>
<p>It is critical to have tight control over these documents, and to ensure that everyone works off a "live" version- because your mapping is going to update that documentation, and every developer, data modeler and application team member needs to know that they have the best and latest version.</p>
<h2>The application interface diagram.</h2>
<p>Now, the ideal situation is to have a dynamic, self correcting, scanning Configuration Management Database tool (CMDB tool) that already has every scrap of meta data about every application and all its interfaces ready to go. </p>
<p>If you have one of these, good for you, and you can stop reading.</p>
<p>For the rest of us, let's talk about practical methods of mapping what we have.</p>
<h3>How to get the data.</h3>
<ol>
<li>Scan the environment- catch the interfaces in the act.
<ul>
<li>Monitor network traffic to detect exchanges between applications.</li>
<li>Scan file systems to find interface files and determine frequency.</li>
<li>Catalog services and activity of those services on servers.</li>
</ul>
</li>
<li>Get out there and talk to people.
<ul>
<li>Ask people-  where is data from this system used?</li>
<li>Look at management reports and trace backwards to find where information is pulled.</li>
<li>Don't assume the interface is direct.  My record is six hops from the source to the Excel sheet used by the CEO, with the information passing through two of the same systems twice.</li>
<li>Hunt down people that were involved in the original installation. Often they'll have key information that can save you time.</li>
</ul>
</li>
<li>Any other way that works.</li>
</ol>
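<p>For the "scan file systems" step, something as simple as the following Python sketch can give a first estimate of how often an interface drops files.  It assumes that each interface writes its files into its own folder, which will not be true everywhere; the function names and the folder name in the demo are invented for the example.</p>

```python
import os
import statistics
import tempfile
from collections import defaultdict

def scan_drop_folders(root):
    """Map each folder under root to the sorted modification times of its files."""
    seen = defaultdict(list)
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            seen[folder].append(os.path.getmtime(os.path.join(folder, name)))
    return {folder: sorted(times) for folder, times in seen.items()}

def typical_interval_hours(mtimes):
    """Median gap between consecutive files, in hours (None if fewer than 2 files)."""
    if len(mtimes) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(mtimes, mtimes[1:])]
    return statistics.median(gaps) / 3600

# Demo on a throwaway folder holding three files written 24 hours apart.
with tempfile.TemporaryDirectory() as root:
    drop = os.path.join(root, "erp_to_billing")    # hypothetical interface folder
    os.mkdir(drop)
    for day in range(3):
        path = os.path.join(drop, f"export_{day}.csv")
        open(path, "w").close()
        stamp = 1_262_304_000 + day * 86_400       # arbitrary base timestamp
        os.utime(path, (stamp, stamp))
    report = {os.path.basename(folder): typical_interval_hours(times)
              for folder, times in scan_drop_folders(root).items()}

print(report)   # {'erp_to_billing': 24.0}
```

<p>A result like this is only a lead to follow up on: it tells you a daily file exists, not who produces it or who reads it.</p>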
<h2>What to do with it.</h2>
<p>If you don't have a complex tool to do the mapping of all your systems, then one approach that is a step above the "lots of Excel sheets and PowerPoint slides" approach is to use a tool like Microsoft Visio.  I've used it successfully to map applications, by having the drawing and the interfaces BE the database.  This ensures that everything in the drawing is on the interface list, and everything on the interface list shows up on the drawing.</p>
<ol>
<li>Create different objects in Visio, and give them attributes. At a minimum you need an application, interface and database object.</li>
<li>Draw the applications and the interfaces between them in a single large Visio drawing, and fill in the attributes in the Visio objects.</li>
<li>Make some simple VBA code in the drawing to dump all the data into flat files or Excel sheets (or directly to a DB if you get ambitious).</li>
</ol>
<p>  It's simple, but it is far better than having spreadsheets, and a drawing- and then constantly trying to determine if the two agree with each other.</p>
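<p>The original export was VBA inside the Visio drawing, which is not reproduced here.  The Python sketch below only illustrates the underlying idea under assumed names: applications and interfaces are records with attributes, the interface list is generated from them, and a consistency check flags any interface that points at an application missing from the map.</p>

```python
import csv
import io
from dataclasses import dataclass, asdict

@dataclass
class Application:
    name: str
    server: str

@dataclass
class Interface:
    number: int
    source: str    # the "From" application
    target: str    # the "To" application
    method: str

# Hypothetical contents of the drawing.
applications = [Application("ERP", "srv-01"), Application("Billing", "srv-02")]
interfaces = [
    Interface(1, "ERP", "Billing", "nightly file"),
    Interface(2, "Billing", "CRM", "web service"),   # CRM is not on the map yet
]

def dangling_interfaces(applications, interfaces):
    """Interface numbers whose source or target is not a known application."""
    known = {app.name for app in applications}
    return [i.number for i in interfaces
            if i.source not in known or i.target not in known]

def interface_list_csv(interfaces):
    """Render the interface list as CSV text, ready to save or load into Excel."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["number", "source", "target", "method"])
    writer.writeheader()
    writer.writerows(asdict(i) for i in interfaces)
    return buffer.getvalue()

print(dangling_interfaces(applications, interfaces))   # [2]
print(interface_list_csv(interfaces))
```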
<p>In the ERP project where I used this technique, we identified over 1500 interfaces between hundreds of application instances.  The ERP project was a very large effort with hundreds of project resources, and multiple phased projects implementing a new common system.  The actual original mapping took two people about 3 months to do.  They had to work with about 30 different applications support people to systematically map all the applications, and the interfaces, one by one.</p>
<p>A key part of the job was to actually validate the documentation.  That is, if the documentation said there was a cron job that ran a script on server X, we actually went to server X and watched it run.  This meant that we could be confident in the map, and make plans based on it.</p>
<p>Everyone on the team used the drawing and lists generated from the drawing to stay on the same page.  And it was a big page- the key is to also have access to a plotter- we were plotting out a pretty good size wall poster by the time we were done.</p>
<p>The ERP teams had the drawing taped up to the wall- and they were making notes right on it and emailing my team.  We would update the master, and publish a new version, along with the generated lists.  </p>
<p>In building this drawing, we found that most of the interfaces were "under" or "un"-documented, and that if documentation did exist, generally it was wrong. By establishing the "official" document for the legacy systems, we focused and coordinated the design effort in a way that would not have happened, if each team just had their own marked up copy of the original documentation or the part that was of interest to them.</p>
<h2>Having the map means you can make the plan</h2>
<p>This drawing and the interfaces mapped were critical in planning the migration.</p>
<ol>
<li>Create different layers in your drawing for each phase "Phase 1", "Phase 2", "Phase 3", or "Feb 2010", "Aug 2010", "Jan 2011" etc.</li>
<li>Hide or show systems and interfaces (including the new applications and interfaces) on each layer, according to whether they are phased in or out at that point.</li>
<li>By viewing and printing layers separately, you can see a step by step plan for the migration- with your application architecture and integration map at each phase.</li>
</ol>
<p>This was a powerful tool to both do the planning, and to make sure everyone understood the timing and sequence.  With multiple phases over a three year period, the project needed it, and without such an overall view, such critical planning would have been haphazard.</p>
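<p>The layer idea can also be expressed as data.  In this Python sketch, which assumes every item simply records the phase it arrives in and the phase it is retired in, the view for any phase is a filter over the full map.  The item names and phase numbers are invented.</p>

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MapItem:
    name: str
    phase_in: int                      # first phase in which the item exists
    phase_out: Optional[int] = None    # first phase in which it is gone (None = stays)

def visible_in(items, phase):
    """Names of the systems and interfaces that exist during the given phase."""
    return [item.name for item in items
            if item.phase_in <= phase
            and (item.phase_out is None or phase < item.phase_out)]

items = [
    MapItem("Legacy billing", phase_in=0, phase_out=2),
    MapItem("Interface 1: legacy billing to ledger", phase_in=0, phase_out=2),
    MapItem("New ERP", phase_in=1),
    MapItem("Interface 7: ERP to ledger", phase_in=2),
]

for phase in (0, 1, 2):
    print(phase, visible_in(items, phase))
```

<p>Comparing the output for two consecutive phases gives the list of things that must be switched on or off at that step.</p>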
<p>The challenge with this mapping is to find the right level of detail required.  Not detailed enough and it is wasted effort.  Too detailed and it will consume excessive resources and time.</p>
<h2>A simple approach- What talks to what and what it runs on.</h2>
<p>There are two key aspects to mapping your application architecture.  </p>
<ol>
<li>Functional relationships- applications talking to other applications, with interfaces between them.</li>
<li>Infrastructure relationships - which servers, network connections, services and databases are involved in the functional relationship</li>
</ol>
<p>You can't show both completely on a single drawing, so don't try.  Some applications run on multiple servers, many servers run more than one application, databases are shared by many applications, and interfaces often use common infrastructure such as EAI tools.</p>
<p>The approach we took, and it worked well, was to show the functional relationships on the diagram, and hold the physical relationships (which databases were on which servers/clusters and which application ran on which server etc.) in the attributes of the applications. </p>
<p>We did sometimes show some physical attributes on the diagram for easy reading, but only as an annotation; the relationship itself was held in the attributes of the Visio application objects.</p>
<p>This meant that you could ask "What runs on this server?" and could ask "Which servers are involved with this application?" by doing a filter or query on the data.  Very useful things if you are planning to shut down a server.  You make a checklist, and one by one make sure everything is either shutdown, or moved.</p>
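<p>Once the physical relationships live in attributes, both questions become simple filters.  A minimal Python sketch, with made-up application and server names standing in for the exported attribute data:</p>

```python
# Hypothetical attribute data, as it might be exported from the drawing.
applications = {
    "ERP":       {"servers": {"srv-01", "srv-02"}, "databases": {"erp_db"}},
    "Billing":   {"servers": {"srv-02"},           "databases": {"billing_db"}},
    "Reporting": {"servers": {"srv-03"},           "databases": {"erp_db", "billing_db"}},
}

def what_runs_on(server):
    """Applications that list the given server in their attributes."""
    return sorted(name for name, attrs in applications.items()
                  if server in attrs["servers"])

def servers_for(application):
    """Servers involved with the given application."""
    return sorted(applications[application]["servers"])

def shutdown_checklist(server):
    """One checklist line per application that must be shut down or moved."""
    return [f"[ ] {name}: shut down or move off {server}"
            for name in what_runs_on(server)]

print(what_runs_on("srv-02"))   # ['Billing', 'ERP']
print(servers_for("ERP"))       # ['srv-01', 'srv-02']
print("\n".join(shutdown_checklist("srv-02")))
```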
<p>Here's a simple example to illustrate what the diagram might look like:<br />
<img src="http://www.datamartist.com/wp-content/uploads/2009/12/application-and-interface-drawing-example.jpg" alt="application-and-interface-drawing-example" title="application-and-interface-drawing-example" width="543" height="335" class="aligncenter size-full wp-image-3662" /></p>
<p>The circles with the numbers were the interfaces, each one had attributes like "To" , "From" and "Method" etc.  The level of detail you go to is a function of how ambitious you are, but at a minimum you need to record the fact that the interface exists.</p>
<p>So in summary:</p>
<ol>
<li>Create a single map of all your applications and interfaces and share it with everyone on the team</li>
<li>Make sure you validate your map carefully, looking into the actual systems, and talking with as many people as needed to ensure you have captured everything</li>
<li>Make a step by step plan for the migration, showing when each application, interface and infrastructure item is phased in or out.</li>
</ol>
<p>Next up- <a href="/data-migration-creating-a-data-dictionary-how-to-tackle-master-data-management">the data dictionary</a> and how do we get everyone to agree on those definitions?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-migration-part-3-mapping-the-legacy-systems-meta-data-and-application-mapping/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MS Access query example and comparison to Datamartist</title>
		<link>http://www.datamartist.com/microsoft-access-query-example-and-comparision-to-datamartist</link>
		<comments>http://www.datamartist.com/microsoft-access-query-example-and-comparision-to-datamartist#comments</comments>
		<pubDate>Tue, 31 Mar 2009 22:59:55 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Business Intelligence Architecture]]></category>
		<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[Data Transformation]]></category>
		<category><![CDATA[MS Access]]></category>
		<category><![CDATA[Microsoft Excel]]></category>
		<category><![CDATA[Access]]></category>
		<category><![CDATA[Data Mart Example]]></category>
		<category><![CDATA[Excel]]></category>
		<category><![CDATA[Personal data mart]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=1321</guid>
		<description><![CDATA[Microsoft Access allows users to create complex queries and analyze large data sets. However, it can be complicated to use compared to Excel. In this post, I'll talk about ms access queries and the equivalent way to perform the same data transformation in the Datamartist tool- visually and simply. Microsoft Access has a clear role [...]]]></description>
			<content:encoded><![CDATA[<p>Microsoft Access allows users to create complex queries and analyze large data sets.  However, it can be complicated to use compared to Excel.  In this post, I'll talk about <a href="/help-support/tutorials/microsoft-access-examples-and-tutorials">ms access queries</a> and the equivalent way to perform the same data transformation in the <a href="/product">Datamartist tool</a>- visually and simply.</p>
<p>Microsoft Access has a clear role to play when a small, light database application is required.  However, it has a learning curve, and is not necessarily the best tool for data analysis.</p>
<h2>Product Segmentation Query Example</h2>
<p>Let's look at an example MS Access query or two and see how we can do the same thing in Datamartist, only without the queries and without any SQL. For this example, let's say that we have two sets of sales data from different time periods, and a product list, and we want to define some product segments based on color and price.  We want to get a summary of the sales quantity and average price sold by month, broken out by the new categories, which are as follows:</p>
<ul>
<li> "Red and High Priced" If the product is Red and its minimum price is more than $1000</li>
<li> "Red Low Price wide price range" If the product is Red, has a minimum price less than $1000 but has a min-to-max price range of more than $200</li>
<li> "Red Low Price small price range" If it's Red and not in the first two segments</li>
<li> "Yellow" if the product is yellow. </li>
<li> "Other" for all the rest</li>
</ul>
<p>The three data tables we have are as follows:</p>
<ol>
<li> Sales 03-06 with about 120 000 rows, which contains sales data from 2003 - 2006</li>
<li> Sales 2007  with about 30 000 rows, which contains sales data for 2007</li>
<li> Products  which contains the colors for all the products and their minimum and maximum prices</li>
</ol>
<p>So the first step is to combine the two sales tables.  In Access, this is done with a UNION query, using the following SQL code:</p>
<blockquote><p>select * from [Sales Data 03-06] UNION select * from [Sales Data 2007];</p></blockquote>
<p>In Datamartist, we simply connect the two tables up to a combine block.<br />
<img src="/wp-content/uploads/2009/03/segmentation-example-datamartist-combine1.jpg" alt="segmentation-example-datamartist-combine1" title="segmentation-example-datamartist-combine1" width="264" height="234" class="alignnone size-full wp-image-1394" /></p>
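<p>One detail worth checking when combining tables with SQL: in standard SQL, <code>UNION</code> removes duplicate rows, while <code>UNION ALL</code> keeps every row.  If two genuinely separate sales happen to have identical values in every column, a plain <code>UNION</code> will merge them into one.  The sketch below shows the difference using Python's built-in SQLite rather than Access, with simplified, made-up table and column names.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_03_06 (product_id INTEGER, sale_month TEXT, qty INTEGER, price REAL);
    CREATE TABLE sales_2007  (product_id INTEGER, sale_month TEXT, qty INTEGER, price REAL);
    -- Two identical rows: two separate sales of the same product in the same month.
    INSERT INTO sales_03_06 VALUES (1, '2006-12', 2, 10.0), (1, '2006-12', 2, 10.0);
    INSERT INTO sales_2007  VALUES (1, '2007-01', 1, 10.0);
""")

union_rows = conn.execute(
    "SELECT * FROM sales_03_06 UNION SELECT * FROM sales_2007").fetchall()
union_all_rows = conn.execute(
    "SELECT * FROM sales_03_06 UNION ALL SELECT * FROM sales_2007").fetchall()

print(len(union_rows), len(union_all_rows))   # 2 3
```

<p>Whether this matters depends on the data: if every sales row carries a unique order or line number, the two forms return the same rows.</p>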
<p>Next, we need to define the segmentation.  Again, in Access this is done with a query, this time by nesting IIf statements to add a new column called "Product_Segment" to the resulting query.</p>
<blockquote><p>SELECT Products.Product_ID, Products.Product_Name, Products.Product_Group, Products.Product_Category, Products.Product_SubCategory, Products.Shipping_Weight, Products.Color, Products.Price_Min, Products.Price_Max, IIf([Color]="Red" And [Price_Min]>1000,"Red and High Priced",IIf([Color]="Red" And ([Price_max]-[Price_min])>200,"Red Low Price wide price range",IIf([Color]="Red","Red Low Price small price range",IIf([Color]="Yellow","Yellow","Other")))) AS Product_Segment<br />
FROM Products;</p></blockquote>
<p>In Datamartist, we use a segmentation block to do the same thing.  The interface is graphical, and the syntax is the same as you would use in Excel.  There is no need to nest any IF statements, because the overall block is designed to do that.  Here's what the blocks look like, with the MS Access import block on the left and the segmentation rule block on the right.<br />
<img src="/wp-content/uploads/2009/03/segmentation-example-datamartist-segment-block.jpg" alt="segmentation-example-datamartist-segment-block" title="segmentation-example-datamartist-segment-block" width="418" height="211" class="alignnone size-full wp-image-1428" /><br />
Each segment has a statement that defines whether a row is in the segment or not.   The block tests each segment rule in order, starting at the top, and the first statement that evaluates to "TRUE" defines the value of the Product_Segment column for that row. Dragging the segments up and down changes the order in which the rules are checked.</p>
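<p>The "first rule that is true wins" behaviour is the same logic as the nested IIf query, just written as an ordered list.  The Python sketch below illustrates that logic only; it is not how Datamartist is implemented, and the function name <code>product_segment</code> is invented for the example.  The column names come from the Access query above.</p>

```python
# Ordered rules: (segment name, test). The first test that returns True wins.
SEGMENT_RULES = [
    ("Red and High Priced",
     lambda p: p["Color"] == "Red" and p["Price_Min"] > 1000),
    ("Red Low Price wide price range",
     lambda p: p["Color"] == "Red" and p["Price_Max"] - p["Price_Min"] > 200),
    ("Red Low Price small price range",
     lambda p: p["Color"] == "Red"),
    ("Yellow",
     lambda p: p["Color"] == "Yellow"),
]

def product_segment(product, rules=SEGMENT_RULES, default="Other"):
    """Return the name of the first matching rule, or the default segment."""
    for name, test in rules:
        if test(product):
            return name
    return default

products = [
    {"Color": "Red", "Price_Min": 1500, "Price_Max": 1600},
    {"Color": "Red", "Price_Min": 300, "Price_Max": 700},
    {"Color": "Red", "Price_Min": 300, "Price_Max": 350},
    {"Color": "Yellow", "Price_Min": 50, "Price_Max": 60},
    {"Color": "Blue", "Price_Min": 50, "Price_Max": 60},
]
for p in products:
    print(p["Color"], p["Price_Min"], p["Price_Max"], "->", product_segment(p))
```

<p>Reordering the entries in <code>SEGMENT_RULES</code> changes the result in the same way that dragging segments up and down does: a red product would never reach the first two rules if the plain "Red" rule were moved to the top.</p>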
<p><a href="/resources/images/Segmentation-Example-Product.jpg" target="_blank" onClick="javascript: pageTracker._trackPageview('/screenshots/Segmentation-Example-Product'); "><img src="/resources/images/Segmentation-Example-Product-Thumb.jpg"></a></p>
<p style="padding:8px;">(Click to Enlarge)</p>
<p>Then we have to Join this new product dimension (with the segmentation column) to the sales data, and summarize.</p>
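<p>For readers who want to see what that join-and-summarize step amounts to in SQL terms, here is a sketch using Python's built-in SQLite instead of Access.  The tables are cut down to the columns needed, and the table names, column names and sample rows are all assumptions made for the example.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_segment TEXT);
    CREATE TABLE sales (product_id INTEGER, sale_month TEXT, qty INTEGER, price REAL);
    INSERT INTO products VALUES (1, 'Yellow'), (2, 'Other');
    INSERT INTO sales VALUES
        (1, '2007-01', 2, 10.0),
        (1, '2007-01', 4, 20.0),
        (2, '2007-01', 1, 5.0),
        (1, '2007-02', 3, 12.0);
""")

# Join each sale to its product segment, then total the quantity and
# average the price per month and segment.
summary = conn.execute("""
    SELECT s.sale_month, p.product_segment,
           SUM(s.qty) AS total_qty, AVG(s.price) AS avg_price
    FROM sales AS s
    JOIN products AS p ON p.product_id = s.product_id
    GROUP BY s.sale_month, p.product_segment
    ORDER BY s.sale_month, p.product_segment
""").fetchall()

for row in summary:
    print(row)
```

<p>Note that <code>AVG(price)</code> here is a simple average over sales rows; a quantity-weighted average price would need <code>SUM(qty * price) / SUM(qty)</code> instead.</p>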
<p>In MS Access, this is done with more queries.  Here's what Access looks like when we're done.<br />
<img src="/wp-content/uploads/2009/03/segmentation-example-access-gui1.jpg" alt="segmentation-example-access-gui1" title="segmentation-example-access-gui1" width="450" height="485" class="alignnone size-full wp-image-1405" /><br />
Compare that list of tables and queries to the visual, left-to-right layout of the Datamartist data canvas that does the same thing, without ever having to write any SQL code:</p>
<h2>The VISUAL way to do it</h2>
<p><img src="/wp-content/uploads/2009/03/segmentation-example-solved-canvas.jpg" alt="segmentation-example-solved-canvas" title="segmentation-example-solved-canvas" width="406" height="314" class="alignnone size-full wp-image-1403" /></p>
<p><a href="/resources/images/Segmentation-Example-Datamartist-full-app-shot.jpg" target="_blank" onClick="javascript: pageTracker._trackPageview('/screenshots/Segmentation-Example-Datamartist-full-app-shot'); "><img src="/resources/images/Segmentation-Example-Datamartist-full-app-shot-Thumb.jpg" class="alignright size-full wp-image-1430" ></a><br />
In Datamartist you can see the flow of the data, the row counts are clearly displayed, and clicking on the connectors will bring up the underlying data set in the data viewer.  It's clear which block feeds which, and by adding more blocks and connecting them at the desired point in the data flow, new analysis can be created.</p>
<p>Take Datamartist for a trial run:  <a href="/downloads">download it now</a>, because maybe you don't have to learn Microsoft Access queries after all.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/microsoft-access-query-example-and-comparision-to-datamartist/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>MS Access vs Excel vs Datamartist</title>
		<link>http://www.datamartist.com/ms-access-vs-excel-vs-datamartist-a-do-it-yourself-guide</link>
		<comments>http://www.datamartist.com/ms-access-vs-excel-vs-datamartist-a-do-it-yourself-guide#comments</comments>
		<pubDate>Fri, 06 Mar 2009 02:33:06 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[MS Access]]></category>
		<category><![CDATA[MS Excel]]></category>
		<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[Excel Data Import]]></category>
		<category><![CDATA[Personal Data Marts]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=1251</guid>
		<description><![CDATA[When data analysis requirements really get tough, the tough get going- and start to seriously use databases. Let's face it, if you're considering Microsoft Access chances are what you need to get done is beyond what Excel does well, so you're looking for options. It's also likely that your IT department is unable or unwilling [...]]]></description>
			<content:encoded><![CDATA[<p><img src="/wp-content/uploads/2009/03/excel-database-datamartist1.jpg" alt="excel-database-datamartist1" title="excel-database-datamartist1" width="200" height="183" class="alignright size-full wp-image-1301" />When data analysis requirements really get tough, the tough get going- and start to seriously use databases.</p>
<p>Let's face it, if you're considering Microsoft Access, chances are what you need to get done is beyond what Excel does well, so you're looking for options.  It's also likely that your IT department is unable or unwilling to help you out, and this is even more likely as the recession reduces reporting budgets left, right and center.</p>
<p>Two of the key things that lead someone to search for a database solution are:</p>
<ul>
<li><strong>Data Volume</strong>- More than a million rows and Excel becomes very difficult, even before that the performance suffers.</li>
<li><strong>Flexibility to Join Tables</strong> - VLOOKUP and VBA code only go so far. Access gives an easy way to make joins between tables, one of the powerful features of relational databases.</li>
</ul>
<p>Now, the data volume is what it is- if you have millions and millions of rows, you need something to cut it down to size before you move it into your Excel spreadsheet. </p>
<p>On the other point, however, I can hear the Excel fans saying "now wait a minute, Excel can do that, I don't really need a database" and they are right.  But they are almost always right: Excel can do almost anything. It does not mean, however, that it's the best tool for the job. Using VLOOKUP and VBA scripts to join up multiple tables is not my idea of a fun time. And even in Excel 2007 I find the pivot tables annoying and prone to break if I'm adding categories, moving data sets or, heaven forbid, changing the number and order of columns.</p>
<p>Microsoft Access has a very nice interface for creating joins between tables, just a simple drag and drop between fields. The cross tab query capability is useful and good, and being a relational database it's more tolerant of changes to table structure because it's not messing with cell references.</p>
<p>"But", many who have used MS Access will say, "it's pretty complex to learn, and even if I do start to get the query stuff down, it doesn't handle bad data well."</p>
<p>Bad data?  Who has bad data? Isn't all data pristine, as intended, correctly formatted and accurate?</p>
<p><img src="/wp-content/uploads/2009/03/enough-to-make-access-decide-its-text1.jpg" alt="enough-to-make-access-decide-its-text1" title="enough-to-make-access-decide-its-text1" width="210" height="225" class="alignright size-full wp-image-1289" />One of the huge differences between Excel and MS Access is that Excel is extremely flexible.  (Probably more flexible than your auditor would like, but that's a different story.)  One source of Excel's flexibility is its ability to accept different data types in the same column, and to allow editing of cells quickly. Microsoft Access, by contrast, either discards the data or defaults to the data type "Text" when it sees some variation, meaning you can no longer perform the calculations you need to do on your data.<img src="/wp-content/uploads/2009/03/sales-data-import-errors.jpg" alt="sales-data-import-errors" title="sales-data-import-errors" width="365" height="232" class="alignright size-full wp-image-1289" /></p>
<p>This illustrates one of the challenges people face in trying to use a database - databases are very strict on data types.  Once you declare a data type for a column, if you import data into the table, the database will discard the values that do not conform to that data type.  In Excel, you get cell errors if you try calculations but the original data is still there.</p>
<p>One of the powerful features of the <a href="/product">Datamartist tool</a> is the fact that it has an underlying database structure that provides flexibility on data types.  Unlike MS Access and other databases, Datamartist can store dates, numbers, strings and booleans natively in a single column. (It does not convert to strings- it stores the full object).  Take a look at this example:<br />
<img src="/wp-content/uploads/2009/03/datamartist-dynamicly-handles-data-type-at-row-level1.jpg" alt="datamartist-dynamicly-handles-data-type-at-row-level1" title="datamartist-dynamicly-handles-data-type-at-row-level1" width="425" height="226" class="aligncenter size-full wp-image-1294" /></p>
<p>In each individual row, Datamartist completes the calculation if possible.  Datamartist is a database that gives you the freedom of a spreadsheet. Of course, just like Excel, if you ask for a calculation on a value that is meaningless you will get an error- but only for that individual value, not a discard of the full row.  This means that you can still work with messy data: bring it in, then fix it.  In Access or another database, you can't even get it through the front door (or it defaults to text, making many calculations impossible).</p>
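<p>To make the idea concrete, here is a minimal Python sketch of a per-value calculation over a mixed-type column. It is only an illustration of the behaviour described above, not how Datamartist is implemented; the sample values, the <code>add_tax</code> function and the <code>#ERROR</code> marker are all made up for the example.</p>

```python
from datetime import date

# A single column holding mixed value types, as messy imported data often does.
# The values here are invented for illustration.
sales = [120.5, 80, "n/a", date(2009, 3, 1), True, "45.10"]

def add_tax(value, rate=0.05):
    """Return value plus tax, or an error marker if the value is not numeric."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return round(value * (1 + rate), 2)
    return "#ERROR"

# Every row is kept: a bad value produces an error in that cell only.
results = [(value, add_tax(value)) for value in sales]

for original, computed in results:
    print(repr(original), "becomes", computed)
```

<p>All six rows survive the calculation; only the cells that cannot be computed carry the error marker, so they can be found and fixed afterwards.</p>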
<p>This won't be the last time I compare these three tools- and the types of data structures and tasks each of them is most effective with.</p>
<p>In the meantime- download <a href="/downloads">Datamartist</a>- and see what I'm talking about with your own data.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/ms-access-vs-excel-vs-datamartist-a-do-it-yourself-guide/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Datamartist Beta tests out on 100 Million rows</title>
		<link>http://www.datamartist.com/datamartist-beta-very-large-data</link>
		<comments>http://www.datamartist.com/datamartist-beta-very-large-data#comments</comments>
		<pubDate>Sun, 22 Feb 2009 20:04:47 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[Data Transformation]]></category>
		<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[easy to use etl]]></category>
		<category><![CDATA[Excel Performance]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=1073</guid>
		<description><![CDATA[Since one of the key features I want Datamartist to have is the ability to manage very large data sets, I've built in some new caching functionality- and was kicking the tires late one night recently. First, using a custom program, I generated a 100 million row data file with 10 columns (three integer measures, [...]]]></description>
			<content:encoded><![CDATA[<p><img src="/wp-content/uploads/2009/02/just-getting-started-labeled.jpg" alt="just-getting-started-labeled" title="just-getting-started-labeled" width="249" height="229" class="alignright size-full wp-image-1089" /> Since one of the key features I want Datamartist to have is the ability to manage very large data sets, I've built in some new caching functionality- and was kicking the tires late one night recently.</p>
<p>First, using a custom program, I generated a 100 million row data file with 10 columns (three integer measures, two floating point measures and five dimensional columns).  The input file was about 5.6 gigabytes.</p>
<p>Then I fired up the Datamartist Beta and imported this file with a Text Import block.  When I initially put the block on the canvas, Datamartist counted the rows (showing me its progress, of course), so it took a few minutes to get through all 100 million of them- but only a few. After that, since the file was unchanged, response was normal (as if I was working on a much smaller set, because I was using preview mode).</p>
<p>I added a calculation block, which added a calculated column, and then I summarized the data, including two summed measures and an average on one of the floats- here's a screen shot of the Summarize block controls:</p>
<p><img src="/wp-content/uploads/2009/02/summary-block-config-100m-load-test-450wide.jpg" alt="summary-block-config-100m-load-test-450wide" title="summary-block-config-100m-load-test-450wide" width="450" height="189" class="alignnone size-full wp-image-1080" /></p>
<p>Everything looked good in the preview, so I pressed the RUN button, and went to bed.</p>
<p>It took just under three hours and thirty-eight minutes (3:37:42.7343750 to be exact) to churn through the 100 million rows and summarize down to the 100 combinations of the two dimensions selected.  This was running on my Dell Latitude D630 laptop to be more real world (it's probably a bit faster on the quad-core workstation <img src='http://www.datamartist.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> )<br />
<img src="/wp-content/uploads/2009/02/load-complete-blocks1.jpg" alt="load-complete-blocks1" title="load-complete-blocks1" width="450" height="108" class="alignnone size-full wp-image-1078" /></p>
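<p>For readers wondering how a summary like this can work on a file far bigger than memory, here is a rough Python sketch of a streaming group-by with two sums and one average. It is not Datamartist's caching code, just the general technique; the column names and the three sample rows are hypothetical stand-ins for the real file.</p>

```python
import csv
import io
from collections import defaultdict

# Tiny stand-in for the 100 million row file; column names are hypothetical.
sample = io.StringIO(
    "region,product,units,amount,weight\n"
    "East,A,2,10,1.5\n"
    "East,A,3,20,2.5\n"
    "West,B,1,5,4.0\n"
)

# One small accumulator per (region, product) combination.
totals = defaultdict(lambda: {"rows": 0, "units": 0, "amount": 0, "weight": 0.0})

for row in csv.DictReader(sample):  # reads one row at a time
    acc = totals[(row["region"], row["product"])]
    acc["rows"] += 1
    acc["units"] += int(row["units"])
    acc["amount"] += int(row["amount"])
    acc["weight"] += float(row["weight"])

# Turn the running totals into two sums and one average per combination.
summary = {
    key: {
        "sum_units": acc["units"],
        "sum_amount": acc["amount"],
        "avg_weight": acc["weight"] / acc["rows"],
    }
    for key, acc in totals.items()
}

for key in sorted(summary):
    print(key, summary[key])
```

<p>Because only one accumulator per dimension combination is held in memory, the memory needed depends on the number of combinations (100 in the test above), not on the number of input rows.</p>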
<p>Now, for those of you downloading the Beta, I would not advise doing this at home.  I haven't put any limiters in the application, but keep your testing to under 5 million rows for now <img src='http://www.datamartist.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> .  And of course, using such large data volumes means you need lots of hard drive space to do the caching- but the bottom line is I'm building the plumbing to let you go to these kinds of data volumes if you need to.  Obviously there is lots more testing to do, and real-world data always turns up new things- so download the beta as soon as it's out, and tackle those big data sets you've had to rely on the IT department for in the past!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/datamartist-beta-very-large-data/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Joining the Dimension Table to the Fact Table- Purchasing Data mart (Part 5)</title>
		<link>http://www.datamartist.com/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5</link>
		<comments>http://www.datamartist.com/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5#comments</comments>
		<pubDate>Tue, 17 Feb 2009 16:31:48 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Cost Reduction]]></category>
		<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Personal Data Marts]]></category>
		<category><![CDATA[Purchasing Analysis]]></category>
		<category><![CDATA[Data Mart Example]]></category>
		<category><![CDATA[Dimension Tables]]></category>
		<category><![CDATA[Purchasing Data Warehouse]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=991</guid>
		<description><![CDATA[After we have created the dimension tables and the fact table and populated them with data the final step to getting a star schema is of course to actually join the dimension tables to the fact table. In the datamartist tool we do this with a Join block. Check out the first four parts of [...]]]></description>
			<content:encoded><![CDATA[<p><img src="/wp-content/uploads/2009/02/join1.jpg" alt="join1" title="join1" width="200" height="200" class="alignright size-full wp-image-995" />After we have created the dimension tables and the fact table and populated them with data, the final step to getting a star schema is of course to actually join the dimension tables to the fact table.  In the Datamartist tool we do this with a Join block.</p>
<p>Check out the first four parts of this series (<a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">1</a>,<a href="http://www.datamartist.com/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2">2</a> , <a href="/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3">3</a> and <a href="/hierarchies-and-tree-structures-in-dimensions-an-example-item-dimension-part-4">4</a>) where we created an example data mart, with some fictitious purchasing data.</p>
<p>The final step is to join the dimensions we have created to the fact table. To do this, we connect up the two dimensions (Vendor and Item) to the Join block and connect an export block to the output.  What has in effect been created is a complete Extract, Transform, Load (ETL) process, ending with the final star schema join.<br />
<a href="/wp-content/uploads/2009/02/po-data-mart-screen-shot2.png"><img src="/wp-content/uploads/2009/02/po-datamart-blocks1.jpg" alt="po-datamart-blocks1" title="po-datamart-blocks1" width="400" height="208" class="alignnone size-full wp-image-1002" /></a></p>
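<p>If you think in code rather than blocks, the join step amounts to a key lookup from each fact row into each dimension. The Python sketch below shows that idea under some assumptions: the tables, column names and values are invented, and keeping fact rows whose key has no match (a left join) is my choice for the example, not a statement about how the Join block behaves.</p>

```python
# Hypothetical miniature versions of the two dimensions and the fact table.
vendor_dim = {
    "V1": {"vendor_name": "Example Supplies", "city": "Toronto"},
    "V2": {"vendor_name": "Sample Freight", "city": "Calgary"},
}
item_dim = {
    "I1": {"item_name": "Laptop", "category": "Computers"},
    "I2": {"item_name": "Printer", "category": "Computers"},
}
fact_rows = [
    {"vendor_id": "V1", "item_id": "I1", "year": 2008, "total": 1500.0},
    {"vendor_id": "V2", "item_id": "I2", "year": 2008, "total": 300.0},
    {"vendor_id": "V9", "item_id": "I1", "year": 2008, "total": 50.0},
]

def star_join(facts, vendors, items):
    """Attach dimension attributes to each fact row by key lookup."""
    joined = []
    for fact in facts:
        row = dict(fact)
        # A missing key gets an empty dict, so the fact row is kept
        # (left join) rather than silently dropped.
        row.update(vendors.get(fact["vendor_id"], {}))
        row.update(items.get(fact["item_id"], {}))
        joined.append(row)
    return joined

star = star_join(fact_rows, vendor_dim, item_dim)
for row in star:
    print(row)
```

<p>The third fact row refers to a vendor that is not in the dimension, so it comes through with its item attributes but no vendor name, which makes the gap easy to spot.</p>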
<p>(If that's a bit hard to read- click on the image to see the full size screen shot.)</p>
<p>With the generated data set I used for this example, summarizing the data to yearly totals but keeping all the detail on Vendor and Item causes the roughly 4 million row raw data file to be reduced to around 800 thousand rows.  (This summarizing was done on another canvas- although it could have been done on this canvas just as easily).</p>
<p><img src="/wp-content/uploads/2009/02/join-column-selection.jpg" alt="join-column-selection" title="join-column-selection" width="249" height="361" class="alignleft size-full wp-image-1007" />This data mart, with 800 thousand rows and two dimensions of about three thousand members each, took my laptop about a minute and 45 seconds to solve and save out to a 360 MB text file.</p>
<p>Of course, by summarizing or filtering (just add blocks) analysis subsets could easily be exported directly to Excel, managing the data volumes involved, and letting you create the graphs, dashboards and reports that you need.</p>
<p>This is part of a 5 part series- here are the links to the various parts: <a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">1</a>,<a href="/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2">2</a> , <a href="/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3">3</a> , <a href="/hierarchies-and-tree-structures-in-dimensions-an-example-item-dimension-part-4">4</a> and <a href="/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5">5</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hierarchies and Tree Structures in Dimensions- an Example Item Dimension (Part 4)</title>
		<link>http://www.datamartist.com/hierarchies-and-tree-structures-in-dimensions-an-example-item-dimension-part-4</link>
		<comments>http://www.datamartist.com/hierarchies-and-tree-structures-in-dimensions-an-example-item-dimension-part-4#comments</comments>
		<pubDate>Wed, 11 Feb 2009 16:09:09 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[Personal Data Marts]]></category>
		<category><![CDATA[Purchasing Analysis]]></category>
		<category><![CDATA[Data Mart Example]]></category>
		<category><![CDATA[Hierarchies and Tree Structures]]></category>
		<category><![CDATA[Purchasing Data Warehouse]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=903</guid>
		<description><![CDATA[Having a way to create and manage tree structures (Hierarchies) with your dimension and fact tables is a key part of making a dimensional model in any data warehouse or data mart. Hierarchical structures lend themselves to managing a very large number of categories and we use them to create drill down paths. Check out [...]]]></description>
			<content:encoded><![CDATA[<p><object width="450" height="412"><param name="movie" value="/resources/video/DemoClips/beta2_tree_edit_clip_un_prod.swf"><embed src="/resources/video/DemoClips/beta2_tree_edit_clip_un_prod.swf" width="450" height="412"></embed></object></p>
<p>Having a way to create and manage tree structures (Hierarchies) with your dimension and fact tables is a key part of making a dimensional model in any data warehouse or data mart. Hierarchical structures lend themselves to managing a very large number of categories and we use them to create drill down paths.</p>
<p>Check out the first three parts of this series (<a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">1</a>,<a href="/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2">2</a> and <a href="/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3">3</a>) to see what we've done so far.</p>
<p>In this installment, we will make another dimension, the Item dimension.  This will illustrate how the Datamartist tool allows you to quickly and easily generate hierarchies, and even edit and manage them in a graphical user interface.</p>
<p>The head of purchasing for Acme has asked us to analyze the company's spend on computer equipment- "I have a feeling some offices are spending more than others- but I don't have the numbers to back it up.  But I don't want you to use the categories in the source system- I just want it broken down by Desktops, Laptops, Printers, PDAs and other.  Can you do that with the data mart?"</p>
<p>In their source system, Acme is using the <a href="http://unstats.un.org/unsd/cr/registry/cpc-2.asp">United Nations Central Product Classification</a> (UNCPC), and so we know that all the computer spending we're interested in is in division "C45 Office accounting and computing machinery".  The codes are structured like "C45222", so we want to take all codes whose left three characters are "C45".  We can do this easily with a filter block. After the filter block we connect a define reference block (to make a dimension), just as we did before- and finally, since we're looking at hierarchies, we'll add a recategorise block too- that last block in the chain is what we use to change the drill down structure;</p>
<p><img src="/wp-content/uploads/2009/02/items-modify-computer-categories.jpg" alt="items-modify-computer-categories" title="items-modify-computer-categories" width="500" height="141" class="alignnone size-full wp-image-932" /></p>
<h2> Tree structures simplify alternate categorisation</h2>
<p>The advantage of using a tree structure is we only have to rearrange the level of the hierarchy that encompasses the level of detail we need: we don't have to map each individual product, just the higher levels.  So it's much less work to start, and when new products are added in the source system, they will automatically map up into the new categorization.  Recategorising in Excel often means search and replace at the bottom level, which can cause errors and has to be redone manually every time the data is updated.</p>
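<p>Here is a small Python sketch of why mapping at a higher level of the tree pays off. It filters on the "C45" prefix and then assigns a new category based on a longer code prefix. The specific prefixes and what they stand for are invented for the example (they are not taken from the real classification), and this is a sketch of the idea rather than of the recategorise block itself.</p>

```python
# Hypothetical mapping from a higher-level code prefix to the new category.
# The prefixes and their meanings are invented for this example.
NEW_CATEGORIES = {
    "C4521": "Desktops",
    "C4522": "Laptops",
    "C4523": "PDAs",
    "C4526": "Printers",
}

def recategorise(code):
    """Return the new category for an item code, or None if out of scope."""
    if not code.startswith("C45"):   # the filter step: keep only division C45
        return None
    for prefix, category in NEW_CATEGORIES.items():
        if code.startswith(prefix):
            return category
    return "Other"

item_codes = ["C45222", "C45210", "C45229", "C45990", "C31000"]
mapped = {code: recategorise(code) for code in item_codes}
print(mapped)
```

<p>Note that "C45229" was never mapped individually: any new product code that appears under an already mapped prefix lands in the right category without further work.</p>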
<p>When we open the recategorise block, we simply pick the levels we want to see, and then are presented with a tree view that shows us the hierarchy, automatically generated from the underlying data.<br />
<img src="/wp-content/uploads/2009/02/acme-computer-categories-edit.jpg" alt="acme-computer-categories-edit" title="acme-computer-categories-edit" width="500" height="245" class="alignnone size-full wp-image-936" /></p>
<p>Now, directly within the hierarchy we can edit categories, add new categories, and drag and drop categories around to build the new drill down that we want.  <img src="/wp-content/uploads/2009/02/acme-computer-updated-categories1.jpg" alt="acme-computer-updated-categories1" title="acme-computer-updated-categories1" width="250" height="331" class="alignleft size-full wp-image-945" /> The interface is a lot like the Windows file explorer, just like renaming and moving folders, except that you are building dimensional data. Of course, the underlying input data is not changed, so there is no need to modify the source system in any way, but the Datamartist tool records all the mapping and is able to reproduce it when new data arrives. </p>
<p>You only have to edit the hierarchy once, and from that point on your analysis can use both the existing and the edited tree structure.  It's possible to create as many different hierarchies as required- it's a fast way to do "what if" analysis, trying out different drill down paths and categorisations.</p>
<p>This is part of a 5 part series- here are the links to the various parts: <a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">1</a>,<a href="/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2">2</a> , <a href="/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3">3</a> , <a href="/hierarchies-and-tree-structures-in-dimensions-an-example-item-dimension-part-4">4</a> and <a href="/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5">5</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/hierarchies-and-tree-structures-in-dimensions-an-example-item-dimension-part-4/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Connecting the dimension table to the fact table- Vendor Example (Part 3)</title>
		<link>http://www.datamartist.com/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3</link>
		<comments>http://www.datamartist.com/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3#comments</comments>
		<pubDate>Mon, 09 Feb 2009 20:47:55 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Cost Reduction]]></category>
		<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Personal Data Marts]]></category>
		<category><![CDATA[Data Mart Example]]></category>
		<category><![CDATA[Dimension Tables]]></category>
		<category><![CDATA[Duplicate Data]]></category>
		<category><![CDATA[Purchasing Data Warehouse]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=858</guid>
		<description><![CDATA[In parts one and two of this series we introduced our challenge (to make a data mart to analyze the Acme Company's spending) and showed how the Datamartist tool could import millions of rows of data and then turn it into a fact table we can use in Excel. Now we need to create a [...]]]></description>
			<content:encoded><![CDATA[<p><img src="/wp-content/uploads/2009/02/makingdimseasyway.jpg" alt="makingdimseasyway" title="makingdimseasyway" width="250" height="97" class="alignright size-full wp-image-883" />In parts <a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">one</a> and <a href="/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2">two</a> of this series we introduced our challenge (to make a data mart to analyze the Acme Company's spending) and showed how the <a href="/product">Datamartist tool</a> could import millions of rows of data and then turn it into a fact table we can use in Excel.</p>
<p>Now we need to create a Vendor dimension table and join it to this fact table to determine who our big vendors are.</p>
<p>In Datamartist it is a simple task to create this vendor dimension. As always we use blocks and connect them together.  We define a dimension by using a reference definition block. All we have to do to configure the reference block is to specify which columns uniquely define the dimension (or almost uniquely: Datamartist will resolve duplicate keys for you using a majority/first rule set if you have some data glitches).</p>
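<p>One plausible reading of a "majority/first" rule is: for each key, take the most common value in each column, and fall back to the first value seen when there is a tie. The Python sketch below implements that reading. It is my interpretation for illustration only, not a description of Datamartist's actual rule set, and the vendor rows are invented.</p>

```python
from collections import Counter

# Vendor master rows with a duplicated key; the data is invented.
rows = [
    {"Vendor_ID": "100", "name": "Example Rail", "city": "Toronto"},
    {"Vendor_ID": "100", "name": "Example Rail", "city": "Tornto"},
    {"Vendor_ID": "100", "name": "Exmple Rail", "city": "Toronto"},
    {"Vendor_ID": "200", "name": "Sample Paper", "city": "Calgary"},
]

def resolve_duplicates(rows, key):
    """Build one row per key: majority value per column, first seen on ties."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row)

    dimension = {}
    for key_value, group in grouped.items():
        resolved = {}
        for column in group[0]:
            counts = Counter(r[column] for r in group)
            # most_common orders equal counts by first appearance,
            # which gives the "first" tie-break.
            resolved[column] = counts.most_common(1)[0][0]
        dimension[key_value] = resolved
    return dimension

vendor_dim = resolve_duplicates(rows, "Vendor_ID")
print(vendor_dim)
```

<p>The three conflicting rows for vendor "100" collapse into a single dimension row carrying the majority spelling of the name and the city.</p>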
<p>We start with an import block that brings in the Vendor master text file, then we define the reference by specifying "Vendor_ID" as the key.  These first two blocks look like this:<br />
<img src="/wp-content/uploads/2009/02/vendor-master-in-and-reference-block.jpg" alt="vendor-master-in-and-reference-block" title="vendor-master-in-and-reference-block" width="302" height="148" class="alignnone size-full wp-image-878" /></p>
<p>Then we join it to the fact table we created in part two of this series with a join block.  This means that now instead of just the vendor ID number that was in the fact table, we have the name, and address for the vendor in our mini star schema.</p>
<p><img src="/wp-content/uploads/2009/02/vendor-dimension-and-join.jpg" alt="vendor-dimension-and-join" title="vendor-dimension-and-join" width="436" height="283" class="alignnone size-full wp-image-879" /></p>
<p>And finally we put a summarize block after that to total up all the monthly values for each vendor, and we export to Excel. This is what the canvas looks like:<br />
<img src="/wp-content/uploads/2009/02/vendor-dimension-without-dedup1.jpg" alt="vendor-dimension-without-dedup1" title="vendor-dimension-without-dedup1" width="501" height="198" class="alignnone size-full wp-image-865" /><br />
After we do this, we grab the Excel file Datamartist just created for us, do a quick sort, and come up with a list of Acme's top ten suppliers.  Feeling pretty good about ourselves, we do a review with the head of purchasing.</p>
<p>"Where's Mega Brothers?" she says with a frown. "I think your data is screwy- no way that Mega Brothers didn't make the top ten- we spend a fortune on railways, and a lot of our freight goes with the Mega Brothers Rail company. Of course it is probably entered under different vendors, each location works with the office local to them... But we've got to view them as a single vendor in the data mart- you <em><strong>can</strong></em> do that, right?"</p>
<p><img src="/wp-content/uploads/2009/02/vendor-dimension-with-dedupe1.jpg" alt="vendor-dimension-with-dedupe1" title="vendor-dimension-with-dedupe1" width="300" height="205" class="alignright size-full wp-image-870" /></p>
<h2>Fixing Duplicate Rows</h2>
<p>Having to deal with duplicate data is a very common issue in any type of data analysis.  So, back to the canvas.  By simply adding a de-duplicate block to our Vendor dimension table (after the Reference block, and before the join) we can find and resolve the Mega Brothers duplicates.<br />
We just use the filter to find the records (easy to do, looking for "Mega", "rail", "brothers" etc.) and we map them to a single instance.  This is the filter control that lets us find and tag the duplicates:<br />
<img src="/wp-content/uploads/2009/02/mega-bros-duplicates-in-picker1.jpg" alt="mega-bros-duplicates-in-picker1" title="mega-bros-duplicates-in-picker1" width="400" height="280" class="alignnone size-full wp-image-871" /></p>
<p><img src="/wp-content/uploads/2009/02/mega-bros-duplicates-in-mapper.jpg" alt="mega-bros-duplicates-in-mapper" title="mega-bros-duplicates-in-mapper" width="312" height="247" class="alignright size-full wp-image-872" />As we tag them, they show up in the mapper, which lets us see which duplicate records we have eliminated for the dimension. We run the canvas again, and this time, sure enough, Mega Brothers Rail is in our top ten.  But even though the head of purchasing knew it was a lot, this is actually the first time she's seen the number.  "Wow. I've got to give them a call- can you give me that in an Excel spreadsheet?"</p>
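<p>The same find, tag and map sequence can be sketched in a few lines of Python. This is only an illustration of the idea behind the de-duplicate block: the vendor IDs, name variants and spend figures are invented, and picking "V10" as the surviving record is an arbitrary choice for the example.</p>

```python
from collections import defaultdict

# Invented vendor records and spend totals keyed by vendor ID.
vendors = {
    "V10": "Mega Brothers Rail Co.",
    "V11": "MEGA BROS RAIL",
    "V12": "Mega Brothers Railway (West)",
    "V20": "Example Office Supplies",
}
spend = {"V10": 400.0, "V11": 250.0, "V12": 350.0, "V20": 600.0}

def find_duplicates(vendors, keyword):
    """Return IDs whose name contains the keyword, ignoring case."""
    return [vid for vid, name in vendors.items() if keyword.lower() in name.lower()]

# Tag the duplicates and map them all to one surviving vendor ID.
duplicates = find_duplicates(vendors, "mega")
mapping = {vid: "V10" for vid in duplicates}

# Re-total the spend using the mapped IDs; unmapped vendors keep their own ID.
merged_spend = defaultdict(float)
for vid, amount in spend.items():
    merged_spend[mapping.get(vid, vid)] += amount

print(dict(merged_spend))
```

<p>Individually, none of the three Mega records beats the office supplies vendor; once they are mapped to one instance, their combined total does.</p>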
<p>Stay tuned, more to come as we go further into Datamartist's ability to segment, filter and organize large data sets.</p>
<p>If you want to see the interface in action watch our first <a href="/product/video-and-screenshots/introductory-tutorial-video">Tutorial Video</a>.  Or just get right to it with your own data- <a href="/downloads">download the free 30 day trial now</a>- there is no registration required, and it installs in minutes.</p>
<p>This is part of a 5 part series- here are the links to the various parts: <a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">1</a>,<a href="/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2">2</a> , <a href="/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3">3</a> , <a href="/hierarchies-and-tree-structures-in-dimensions-an-example-item-dimension-part-4">4</a> and <a href="/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5">5</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Creating a Fact Table with the Vendor dimension Purchasing DM (Part 2)</title>
		<link>http://www.datamartist.com/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2</link>
		<comments>http://www.datamartist.com/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2#comments</comments>
		<pubDate>Fri, 06 Feb 2009 00:23:50 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[Data Transformation]]></category>
		<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Personal Data Marts]]></category>
		<category><![CDATA[Data Mart Example]]></category>
		<category><![CDATA[Excel Data Import]]></category>
		<category><![CDATA[Excel Performance]]></category>
		<category><![CDATA[Purchasing Data Warehouse]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=781</guid>
		<description><![CDATA[In creating a data warehouse or data mart data model there are two key types of tables- fact tables and dimension tables. Fact tables hold the data to be analyzed, dimensional tables provide categories and analysis values that organize the data. So we have our mission from Part 1: to analyze the "Acme does everything" [...]]]></description>
			<content:encoded><![CDATA[<p><img src="/wp-content/uploads/2009/02/four_million_rows_no_worries1.jpg" alt="four_million_rows_no_worries1" title="four_million_rows_no_worries1" width="300" height="136" class="alignright size-full wp-image-812" />In creating a data warehouse or data mart data model there are two key types of tables- fact tables and dimension tables.  Fact tables hold the data to be analyzed; dimension tables provide categories and analysis values that organize the data.<br />
So we have our <a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">mission from Part 1</a>: to analyze the "Acme does everything" company's purchasing data and find ways to save money.  The first step, however, is getting a handle on the data.  The IT department has given us the files, and with a smug smile told us to "have fun".  We've been given three files that are a snapshot of the purchasing data:</p>
<ul>
<li><strong>Item_Master.txt</strong>  - this holds all the items that Acme buys</li>
<li><strong>Vendor_Master.txt</strong> - this holds a list of all the vendors, with information such as their address</li>
<li><strong>PO_Detail.txt</strong> - this is the huge data set, all the purchase order data for the last four years</li>
</ul>
<p>The Item and Vendor files aren't very big, but the PO_Detail file is over 340 MB, and it holds almost four million purchase order lines.  Don't try to import it into Excel. Of course, you need Excel 2007 to even try to import 4 million rows; in Excel 2003 it would take over sixty sheets and probably some VBA code. I tried the import in Excel 2007- it takes 20 seconds just to tell me I'll have to go back to the text file import multiple times to do multiple imports onto separate sheets. It took almost two minutes to do the first million rows.  Even once we have the data spread across four sheets, it's not clear how to summarize millions of rows in Excel easily.<img src="/wp-content/uploads/2009/02/po_detail_columns.jpg" alt="po_detail_columns" title="po_detail_columns" width="247" height="398" class="alignright size-full wp-image-785" /></p>
<p>Instead, let's use the <a href="/product">Datamartist tool</a> to manage this data set and generate one that's more useful.</p>
<p>The first analysis we will do will be on the Vendor dimension, to determine who Acme's big vendors are, and if we can negotiate some price reductions where we have leverage.</p>
<p>In Datamartist, very large files are not an issue because the tool can load in only preview data- this means that it's possible to look at a sampling of a few hundred thousand rows, and design the transformation before running it on the whole data set.</p>
<p>The PO Detail file has the columns shown- let's answer the question - "Who are our biggest suppliers"?<br />
 So which columns do we need?  We probably want to have some sense of trends over time so we'll keep the <strong>order date</strong>, but summarize to <strong>Month</strong>,  we'll keep the <strong>Vendor ID</strong> of course, and then we need to use the <strong>Quantity and Price</strong> fields to calculate the total amount spent.  Then we want to write this summarized data into Excel to check it out.</p>
<p>To do this in Datamartist all it takes is four simple blocks: a Text import block to load in the PO_Detail.txt file, a calculate block to multiply QTY by PRICE, a Summarize block to do all the summarizing, and an Excel export block to generate the Excel file;</p>
<p><img src="/wp-content/uploads/2009/02/po_detail_summarize_blocks.jpg" alt="po_detail_summarize_blocks" title="po_detail_summarize_blocks" width="463" height="92" class="alignnone size-full wp-image-806" /></p>
<p>Each block passes its result to the next block via the connectors, and the last block saves it to an Excel file we've specified.</p>
<p>Defining the calculation uses standard spreadsheet functions- here's what the config area looks like;<br />
<img src="/wp-content/uploads/2009/02/calculate_total_closeup.jpg" alt="calculate_total_closeup" title="calculate_total_closeup" width="400" height="91" class="alignnone size-full wp-image-801" /></p>
<p>And defining the summary is as simple as it looks- pick the columns you want, and select what kind of summary you want done.<br />
<img src="/wp-content/uploads/2009/02/summary_block_closeup1.jpg" alt="summary_block_closeup1" title="summary_block_closeup1" width="417" height="111" class="alignnone size-full wp-image-797" /></p>
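<p>For comparison, here is roughly what the calculate and summarize steps do, written as a short Python sketch. QTY, PRICE and Vendor_ID follow the names used in this series; the ORDER_DATE column name, the tab delimiter, the date format and the four sample rows are assumptions made for the example, since the real file layout is only shown in the screenshot.</p>

```python
import csv
import io
from collections import defaultdict

# Stand-in for PO_Detail.txt. QTY, PRICE and Vendor_ID follow the post;
# the ORDER_DATE name, tab delimiter and date format are assumptions.
po_detail = io.StringIO(
    "ORDER_DATE\tVendor_ID\tQTY\tPRICE\n"
    "2008-01-05\tV1\t2\t10.00\n"
    "2008-01-20\tV1\t1\t5.00\n"
    "2008-02-03\tV1\t4\t2.50\n"
    "2008-01-11\tV2\t3\t1.00\n"
)

monthly_spend = defaultdict(float)
for row in csv.DictReader(po_detail, delimiter="\t"):
    total = int(row["QTY"]) * float(row["PRICE"])      # the calculate step
    month = row["ORDER_DATE"][:7]                      # reduce the date to YYYY-MM
    monthly_spend[(row["Vendor_ID"], month)] += total  # the summarize step

for key in sorted(monthly_spend):
    print(key, monthly_spend[key])
```

<p>The output has one row per vendor per month, which is the shape of the mini fact table built in this post.</p>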
<p>We run it on a preview set of 100 thousand rows (takes about twelve seconds to run), and check the output.</p>
<p>It looks good, so we run on the whole 4 million rows;</p>
<p><img src="/wp-content/uploads/2009/02/summarize_progress_po_detail.jpg" alt="summarize_progress_po_detail" title="summarize_progress_po_detail" width="466" height="128" class="alignnone size-full wp-image-804" /></p>
<p>About seven minutes later we have our result- an Excel sheet with a manageable 130 thousand rows, total spend, by vendor, by month for four years;<br />
<img src="/wp-content/uploads/2009/02/completed_po_detail_summary.jpg" alt="completed_po_detail_summary" title="completed_po_detail_summary" width="461" height="95" class="alignnone size-full wp-image-807" /></p>
<p>Next up we need to create our vendor dimension, and join it to this mini fact table we have created.  Stay tuned.</p>
<p>This is part of a 5 part series- here are the links to the various parts: <a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">1</a>,<a href="/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2">2</a> , <a href="/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3">3</a> , <a href="/hierarchies-and-tree-structures-in-dimensions-an-example-item-dimension-part-4">4</a> and <a href="/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5">5</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Degenerate Dimensions in Datamarts</title>
		<link>http://www.datamartist.com/degenerate-dimensions-in-datamarts</link>
		<comments>http://www.datamartist.com/degenerate-dimensions-in-datamarts#comments</comments>
		<pubDate>Sun, 28 Dec 2008 02:15:32 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[Personal Data Marts]]></category>
		<category><![CDATA[Dimension Tables]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=633</guid>
		<description><![CDATA[Not all dimensions are created equal.  A typical dimension is defined by a table that holds the reference data that is being joined to the fact data.  So in the fact table, for example, we have the product ID, or the product code, and in the product dimension table we have a single row for [...]]]></description>
			<content:encoded><![CDATA[<p><img class="size-full wp-image-635 alignleft" title="degenerate-dimension-graphic" src="/wp-content/uploads/2008/12/degenerate-dimension-graphic.jpg" alt="" width="241" height="291" /></p>
<p>Not all dimensions are created equal.  A typical dimension is defined by a table that holds the reference data that is being joined to the fact data.  So in the fact table, for example, we have the product ID, or the product code, and in the product dimension table we have a single row for each product, that lists all the attributes of that product (its size, its color, its category, its segment, etc. etc.) </p>
<p>So it would follow, then, that there must be a dimension table for every dimension, right?  Well, not if the dimension is degenerate.  In fact, you could argue that calling it a dimension at all is pushing it, but I think the idea was to keep things tidy.</p>
<p>In any well-structured data mart (a star schema), every column in the fact table should be either a measure or a dimension.  If it's a measure, then it's storing a value for that particular fact, usually a number, and we use it for calculations and aggregations.  If it's a dimension, then we join it to the appropriate dimension table and thereby look up all the interesting things about that fact on that dimension.</p>
<p>Where degenerate dimensions come in is that there are often some columns that we want to have, but that are not measures, and don't have a table of stuff we want to join to.  Example:  a purchase order number.  These columns store something that we want to have (the purchase order number), but to create an empty dimension table would only slow things down.  So, to ensure we don't feel bad about breaking the "only a measure or a dimension in the fact table" rule, we just CALL them dimensions- even without the table.</p>
<p>In the fact itself, any attribute of the purchase order that was of interest, and whose values each carry further attributes we care about, would already have been turned into a dimension with its own dimension table.</p>
<p>But to create a dimension table that contains a row for every purchase order would create a very large dimension with nothing in it (since there are lots of purchase orders, possibly as many as there are facts if the grain of your fact table is one row per purchase order).  At the same time, our users would not be happy if they could not get a list of the purchase orders included in a given total, or drill down to that bottom level of detail that we've gone to all the trouble to include.</p>
<p>So, when we create transactional level fact tables, it is normal, in fact necessary, to include some degenerate dimensions: columns that hold useful information (very often referencing back to the source system) but that do not join to any dimension table. Plus you can just impress everyone with your dimensional modelling knowledge when you say "degenerate dimension".</p>
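<p>To make the distinction concrete, here is a small sketch in Python, assuming a tiny purchase order fact table.  All the names and values (<code>vendor_dim</code>, <code>fact_po</code>, the PO numbers) are invented for the example.  The vendor key joins to a dimension table, while the purchase order number sits in the fact row with no table behind it.</p>
<pre>
```python
# Hypothetical vendor dimension: one row of attributes per vendor key.
vendor_dim = {
    1: {"name": "Acme", "country": "CA"},
    2: {"name": "Bolt Co", "country": "US"},
}

# Hypothetical fact rows, one per purchase order.
# vendor_id joins to vendor_dim; amount is the measure;
# po_number is a degenerate dimension: it is stored only in the
# fact table and has no lookup table of its own.
fact_po = [
    {"vendor_id": 1, "po_number": "PO-1001", "amount": 120.0},
    {"vendor_id": 1, "po_number": "PO-1002", "amount": 80.0},
    {"vendor_id": 2, "po_number": "PO-1003", "amount": 200.0},
]


def spend_by_country(facts, vendors):
    """Aggregate the measure using an attribute looked up in the dimension."""
    totals = {}
    for row in facts:
        country = vendors[row["vendor_id"]]["country"]
        totals[country] = totals.get(country, 0.0) + row["amount"]
    return totals


def drill_down(facts, vendors, country):
    """List the purchase orders behind one country's total."""
    return [
        row["po_number"]
        for row in facts
        if vendors[row["vendor_id"]]["country"] == country
    ]


print(spend_by_country(fact_po, vendor_dim))
print(drill_down(fact_po, vendor_dim, "CA"))
```
</pre>
<p>The drill down needs nothing beyond the fact rows themselves to produce the list of purchase orders, which is why a separate purchase order table would add size without adding information.</p>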
<p>Since we are very close to closing out 2008 and starting the new year, I'll share with you one of my new year's resolutions (there are many)- I'm going to start a data mart data modelling 101 series of blog posts in January, in which I will go through a complete data mart example.  My intention is to both explain the data model concepts, and illustrate how they are executed using <a href="/product">datamartist</a>.  And I think I'll run with the purchase order example, because given the economic situation we're going to have in 2009, identifying unnecessary spending, and finding ways to cut costs is one of the most important uses of a data mart- and one with potentially a huge payback.</p>
<p>Update:  I've posted more recently on <a href="/mystery-or-junk-data-warehouse-dimensions">junk or mystery dimensions</a> which might be of interest too.</p>
<p>Download the free, no risk <a href="/downloads">Datamartist trial now</a> and try it out on your own data.  You'll be amazed what's possible.  No registration required, and the install takes just minutes.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/degenerate-dimensions-in-datamarts/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
