<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datamartist.com &#187; Data warehouse</title>
	<atom:link href="http://www.datamartist.com/category/data-warehouse/feed" rel="self" type="application/rss+xml" />
	<link>http://www.datamartist.com</link>
	<description>Reduce cost with self serve data transformation</description>
	<lastBuildDate>Thu, 09 Feb 2012 20:00:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Mystery or Junk data warehouse dimensions</title>
		<link>http://www.datamartist.com/mystery-or-junk-data-warehouse-dimensions</link>
		<comments>http://www.datamartist.com/mystery-or-junk-data-warehouse-dimensions#comments</comments>
		<pubDate>Mon, 18 Jan 2010 17:10:46 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[Data warehouse]]></category>
		<category><![CDATA[Dimension Tables]]></category>
		<category><![CDATA[ETL]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=2933</guid>
		<description><![CDATA[Sometimes, when you are designing a star schema model, you'll find yourself in a dilemma. You've come up with a beautiful design, right out of the pages of a Ralph Kimball book with 5 dimensions, and 5 measures, and you are on your way to star schema heaven when suddenly the users start asking awkward [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.datamartist.com/wp-content/uploads/2010/01/ralph-kimball-on-the-phone-too-many-dimensions1.jpg" alt="ralph-kimball-on-the-phone-too-many-dimensions" title="ralph-kimball-on-the-phone-too-many-dimensions" width="317" height="199" class="alignright size-full wp-image-3952" />Sometimes, when you are designing a star schema model, you'll find yourself in a dilemma.  You've come up with a beautiful design, right out of the pages of a Ralph Kimball book, with 5 dimensions and 5 measures, and you are on your way to star schema heaven when suddenly the users start asking awkward questions: Where is such and such flag?  Where's the transaction type?  Why can't I sort based on the "e7" code from the system?</p>
<p>You can try to explain to them that pure star schemas should not be cluttered with a bunch of tiny dimensions, that your fact table just won't stand for 100 million rows of the e7 code, and that, besides, computery things like transaction codes should not be in a business-savvy data model.  But face it: after some digging you determine the user is right (which happens quite often, in fact).  They really do use that information, it is critical that you include it, and you don't have the time or budget to make the perfect data warehouse.</p>
<p>So how do you deliver to them what they need, and avoid messing up your dimensional model?</p>
<p>One answer is to create one or more junk dimensions, sometimes also referred to as mystery dimensions.</p>
<p>In the end, although the content of a mystery dimension may or may not be mysterious, there is nothing particularly mysterious about how to implement this type of dimension table.</p>
<p>Even if it's perfectly clear what each column is, there are often a number of them with very low cardinality (that is, they have very few distinct values).  It really does not make sense to add a column to the fact table for each one, and to have a bunch of tiny dimension tables with only a handful of rows in them.</p>
<p>Faced with this, the data architect can wrap all these columns up into a junk dimension.</p>
<p>A junk dimension is a dimension that holds all the unique combinations of a set of columns, and assigns each combination a unique key.  This key is what is stored in the fact table, in the mystery dimension column.</p>
<p>Let's look at a mystery dimension example.  We'll make up an example dimension that's very small for simplicity's sake.  Let's say that the transactional table that is used to generate one of our facts has three columns, "Zortz", "a3" and "uudl", which fully satisfy our mystery dimension criteria (i.e. we don't know what they are, but people use them in queries).</p>
<p>"Zortz" is a true/false value, "a3" is one of two values, "Confirmed" or "Pending", and "uudl" is either "" or "k".  All the possible combinations of these values (2 x 2 x 2 = 8 of them) would be put into a dimension table and each assigned an integer surrogate key.  Thus the mystery dimension table would look like this:<br />
<img src="http://www.datamartist.com/wp-content/uploads/2009/11/mystery-dimension-example-data-set.jpg" alt="mystery-dimension-example-data-set" title="mystery-dimension-example-data-set" width="425" height="200" class="aligncenter size-full wp-image-3548" /></p>
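<p>To make the idea concrete, here is a minimal Python sketch (not any particular tool's implementation) that enumerates every combination of the three example columns and assigns each one an integer surrogate key.  The name <code>mystery_key</code> and the order in which keys are handed out are assumptions for illustration, so the numbering may not match the table above.</p>
<pre>
```python
from itertools import product

# Allowed values for each column, taken from the example in the text.
COLUMN_VALUES = {
    "Zortz": [True, False],
    "a3": ["Confirmed", "Pending"],
    "uudl": ["", "k"],
}


def build_junk_dimension(column_values):
    """Return one row per combination, each with an integer surrogate key."""
    names = list(column_values)
    rows = []
    # product() walks every combination; enumerate() supplies keys 1, 2, 3...
    for key, combo in enumerate(product(*column_values.values()), start=1):
        row = {"mystery_key": key}  # "mystery_key" is a hypothetical column name
        row.update(zip(names, combo))
        rows.append(row)
    return rows


mystery_dimension = build_junk_dimension(COLUMN_VALUES)
for row in mystery_dimension:
    print(row)
```
</pre>
<p>Three columns with two values each give 2 x 2 x 2 = 8 rows, and the fact table only needs to store the <code>mystery_key</code>.</p>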
<p>A key consideration when forming mystery dimensions is how many combinations exist.  If the number of combinations is too high, the mystery dimension's size may be unmanageable.</p>
<p>And be careful about assuming that every combination has already shown up in the data.  You are safe if each column has a fixed set of values (like a Boolean, or codes from a known finite set), because you can be sure you've created a dimension row for every combination.</p>
<p>But if there are free-form string columns, then you need to make sure your ETL is able to generate new dimension rows and surrogate keys as new combinations are created in the source system.  This might still be worthwhile, depending on how many new combinations get created.</p>
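<p>One way to picture that ETL step is a "look up or create" routine: for each source row, find the surrogate key for its combination, and add a new dimension row if the combination has never been seen.  The sketch below is an in-memory Python stand-in under that assumption; the class <code>JunkDimension</code>, the column <code>mystery_key</code> and the sample rows are all made up for illustration, and a real job would do the same thing against the dimension table itself.</p>
<pre>
```python
class JunkDimension:
    """In-memory stand-in for a junk dimension table (hypothetical helper)."""

    def __init__(self, columns):
        self.columns = tuple(columns)
        self._key_by_combo = {}  # maps each combination to its surrogate key

    def key_for(self, source_row):
        """Return the surrogate key for this row's combination of values."""
        combo = tuple(source_row[c] for c in self.columns)
        if combo not in self._key_by_combo:
            # New combination: add a dimension row with the next integer key.
            self._key_by_combo[combo] = len(self._key_by_combo) + 1
        return self._key_by_combo[combo]

    def rows(self):
        """Dimension rows as dicts, in the order the keys were assigned."""
        return [
            {"mystery_key": key, **dict(zip(self.columns, combo))}
            for combo, key in self._key_by_combo.items()
        ]


dim = JunkDimension(["Zortz", "a3", "uudl"])

# Made-up source rows; the last one carries a "uudl" value not seen before.
source_rows = [
    {"amount": 10.0, "Zortz": True, "a3": "Confirmed", "uudl": ""},
    {"amount": 25.5, "Zortz": False, "a3": "Pending", "uudl": "k"},
    {"amount": 7.25, "Zortz": True, "a3": "Confirmed", "uudl": ""},
    {"amount": 3.0, "Zortz": True, "a3": "Confirmed", "uudl": "zz"},
]

# The fact rows keep the measure plus a single key into the junk dimension.
fact_rows = [
    {"amount": row["amount"], "mystery_key": dim.key_for(row)}
    for row in source_rows
]
print(fact_rows)
print(dim.rows())
```
</pre>
<p>Repeated combinations reuse their existing key, so the dimension only grows when the source system really does produce something new.</p>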
<p>You can also manage the size of the mystery dimension tables by having two or more mystery dimensions, which might reduce the overall number of dimension rows depending on the makeup of the data.  Different columns and values may tend to cluster together, and you may find that grouping them correctly gives you, say, two small mystery dimensions rather than one huge one.</p>
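<p>A quick way to test a candidate grouping is to count the distinct combinations actually present in a sample of the source data.  The Python sketch below does that for made-up data in which the hypothetical columns <code>c1</code> and <code>c2</code> move together, as do <code>c3</code> and <code>c4</code>; the column names, the data and the helper functions are all assumptions for illustration.</p>
<pre>
```python
def distinct_combinations(rows, columns):
    """Count the distinct value combinations seen for the given columns."""
    return len({tuple(row[c] for c in columns) for row in rows})


def compare_groupings(rows, groupings):
    """Total dimension rows needed for each candidate grouping of columns."""
    return {
        name: sum(distinct_combinations(rows, group) for group in groups)
        for name, groups in groupings.items()
    }


# Made-up sample: c2 is determined by c1 and c4 by c3, but the two pairs
# vary independently of each other.
rows = [
    {"c1": a, "c2": a % 2, "c3": b, "c4": b % 2}
    for a in range(5)
    for b in range(5)
]

groupings = {
    "one dimension": [("c1", "c2", "c3", "c4")],
    "two dimensions": [("c1", "c2"), ("c3", "c4")],
}

sizes = compare_groupings(rows, groupings)
print(sizes)
```
</pre>
<p>With this sample the single dimension needs 25 rows while the two smaller ones need 5 each.  The trade-off is that the fact table then carries two key columns instead of one.</p>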
<p>If, however, the number of rows is manageable, a mystery dimension makes all the columns queryable while adding only one column to the fact table, which is a much more efficient solution than either creating multiple dimensions or leaving all the data in the fact table.</p>
<p>By moving these columns to a junk or "mystery" dimension, you also end up with fewer indexes on the fact table, which might be important depending on its size.</p>
<p>So if you find yourself telling your end users that they will just have to do without a column, think twice about it.  The role of a data warehouse is to deliver the data; sometimes you just have to find the right packaging to get the job done.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/mystery-or-junk-data-warehouse-dimensions/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Making rapid prototypes for data warehouse ETL jobs</title>
		<link>http://www.datamartist.com/making-rapid-prototypes-for-data-warehouse-etl-jobs</link>
		<comments>http://www.datamartist.com/making-rapid-prototypes-for-data-warehouse-etl-jobs#comments</comments>
		<pubDate>Mon, 14 Sep 2009 20:39:00 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data warehouse]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Project Management]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=3022</guid>
		<description><![CDATA[Data warehouses and even data marts can be expensive, complex projects. They are not projects to start lightly, and they are not projects that you want to launch without doing some solid planning. But there is a way to get a handle on the tricky parts of your data warehouse scope, and to reduce your [...]]]></description>
			<content:encoded><![CDATA[<p>Data warehouses and even data marts can be expensive, complex projects. They are not projects to start lightly, and they are not projects that you want to launch without doing some solid planning. </p>
<p>But there is a way to get a handle on the tricky parts of your data warehouse scope, and to reduce your project's overall cost.</p>
<p><img src="http://www.datamartist.com/wp-content/uploads/2009/09/ETL-Cost-vs-Other-Data-Warehouse-Cost.jpg" alt="ETL-Cost-vs-Other-Data-Warehouse-Cost" title="ETL-Cost-vs-Other-Data-Warehouse-Cost" width="302" height="236" class="alignright size-full wp-image-3122" />The major cost component of any data warehouse project is the Extract, Transform and Load (ETL) development. Obviously every project is slightly different, but in my experience ETL will often make up on the order of 70% of the development cost.  One of the drivers of this cost is the relatively high-priced ETL development resources required.  In the markets where I've hired resources, an ETL developer will often demand a 30-40% higher hourly rate than a business intelligence report writer, for example.</p>
<p><strong>Making ETL prototypes will give you insights that can reduce cost </strong> by shortening the ETL development process and making the optimum use of those highly talented and expensive ETL resources.</p>
<h2>What affects the cost and complexity of ETL jobs?</h2>
<p><img src="http://www.datamartist.com/wp-content/uploads/2009/09/DW-prototype-not-all-data-in-erp.jpg" alt="DW-prototype-not-all-data-in-erp" title="DW-prototype-not-all-data-in-erp" width="353" height="250" class="alignright size-full wp-image-3072" />For any given scope, the following will have a large impact on the number and complexity of ETL jobs and therefore their cost.</p>
<ol>
<li>The number of different data sources involved.</li>
<li>The consistency in terms of master data definitions between systems.</li>
<li>The level of data quality in the systems.</li>
</ol>
<p>Ideally, you want to get a good handle on these three things before you hire all the ETL developers, and be confident that you are going to satisfy the users' needs before millions of dollars are spent on ETL jobs and business intelligence reports.</p>
<p>One part of the preparation needed to do this can be the creation of a proof of concept or mockup of key parts of the data warehouse ETL deliverable.</p>
<p>Now, there are mockups, there are prototypes, and there are "first versions".  The most effective approach is to create a mockup or prototype that:</p>
<ul>
<li>Goes just deep enough into the data to:
<ul>
<li>Establish all data sources that will be required</li>
<li>Give a high-level audit of their master data and data quality</li>
</ul>
</li>
<li>Provides enough output that:
<ul>
<li>End users can be supplied with example reports or cubes to get hands on</li>
<li>The functional scope can be locked down with confidence on all sides.</li>
</ul>
</li>
</ul>
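<p>The high-level audit mentioned in the list above does not need to be elaborate.  As a rough Python sketch, assuming the sample has already been pulled into a list of dictionaries, a per-column count of blanks and distinct values is often enough to flag trouble.  The function <code>profile</code>, the column names and the sample rows are hypothetical.</p>
<pre>
```python
def profile(rows, columns):
    """Per-column row count, blank count and distinct non-blank value count."""
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        blanks = sum(1 for v in values if v is None or v == "")
        distinct = len({v for v in values if v is not None and v != ""})
        report[col] = {"rows": len(values), "blank": blanks, "distinct": distinct}
    return report


# Made-up sample rows from a hypothetical customer table.
sample = [
    {"customer_id": 1, "country": "CA"},
    {"customer_id": 2, "country": ""},
    {"customer_id": 2, "country": "ca"},
    {"customer_id": 3, "country": None},
]

audit = profile(sample, ["customer_id", "country"])
for column, stats in audit.items():
    print(column, stats)
```
</pre>
<p>Here the audit shows that half the country values are blank, that "CA" and "ca" are being counted as two different values, and that a customer id repeats.  Those are exactly the kinds of master data and data quality questions you want on the table early.</p>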
<p>The goal of a data warehouse prototype is to learn about the underlying data, and to be able to try different data transformation techniques and approaches on the real data.  The goal is not to make the finished product, nor to deliver actually usable reports to end users, although it may be to generate an example result for users to validate.</p>
<p>An example might be to create a prototype to calculate total sales by segment for a period under a new customer segmentation.  This would identify if the segmentation rules that have been suggested actually result in the expected segmentation of sales data, and if the fields involved are complete and correctly populated in the source systems.</p>
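<p>As a rough illustration of such a prototype, the Python sketch below totals sales by segment and parks anything it cannot classify under "UNASSIGNED".  The segmentation rule (a lookup on an <code>industry</code> field), the field names and the sample data are all assumptions made up for this example, not rules from any real project.</p>
<pre>
```python
from collections import defaultdict

# Hypothetical segmentation rule: map a customer's industry to a segment.
SEGMENT_BY_INDUSTRY = {"retail": "Consumer", "banking": "Financial"}


def sales_by_segment(sales, customers):
    """Total sales per segment; unclassifiable sales go to UNASSIGNED."""
    totals = defaultdict(float)
    for sale in sales:
        # A sale may point at a customer that is missing from the master data.
        customer = customers.get(sale["customer_id"], {})
        segment = SEGMENT_BY_INDUSTRY.get(customer.get("industry"), "UNASSIGNED")
        totals[segment] += sale["amount"]
    return dict(totals)


# Made-up sample data: customer 3 has no industry, customer 4 does not exist.
customers = {
    1: {"industry": "retail"},
    2: {"industry": "banking"},
    3: {"industry": None},
}
sales = [
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 2, "amount": 250.0},
    {"customer_id": 3, "amount": 40.0},
    {"customer_id": 4, "amount": 10.0},
]

totals = sales_by_segment(sales, customers)
print(totals)
```
</pre>
<p>The size of the UNASSIGNED bucket is the useful output here: it tells you how much of the sales data the proposed rules fail to cover, before anyone writes production ETL.</p>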
<p>A prototype should focus on the dimensions and data sources that are expected to be the most difficult, and involve multi-source integration.  Don't spend time prototyping the easy stuff.</p>
<p>When you are making a prototype, remember it's a one-time development.  Manual steps and doing some "data cleaning by hand" are perfectly reasonable; it's what you learn from the prototype, not how you learn it, that is important.  Take a snapshot, or a sample of the various tables, and put them in a sandbox environment where you can manipulate them quickly and easily.</p>
<p>The whole point is to move quickly, get lots of feedback from users, and be able to avoid unpleasant discoveries during the actual data warehouse development.  </p>
<p>If you find a data quality issue, and it's a tough one, then just remove those rows and continue on.  Remember, you don't have to solve all the problems in the prototype; you need to identify them.  Be open with your users about what the exercise is about, and that it is a very rough pass and a mockup.</p>
<h2>How much could this impact cost? </h2>
<p>If you can identify issues during the prototype then you can solve them before all the ETL development resources are brought onto the project. </p>
<p>If you do not do a prototype, and find a data quality issue that requires some back and forth with the business, every week of delay will probably represent thousands or tens of thousands of dollars, with the project team waiting on the resolution before being able to resume coding the ETL jobs in question.</p>
<p>So in summary, making prototypes will:</p>
<ul>
<li>Reduce the risk of scope creep because users have actually seen and "touched" a mockup of the final output.</li>
<li>Reduce the amount of rework in ETL code because different data transformation approaches can be tested early.</li>
<li>Reduce the risk of the expensive ETL development phase of the project slipping due to unknown data quality issues.</li>
</ul>
<h2>The right tool for ETL Prototypes.</h2>
<p>Often prototypes are built in a combination of Excel, MS Access or other databases. These tools can work, but Excel has serious issues handling larger data sets, and database development is often cumbersome; the idea is to make a prototype, not actually build the SQL code.  Things like different data types, field formats, column naming rules etc. between different source databases often frustrate attempts to do something quickly.</p>
<p>Obviously another option is the enterprise ETL tools themselves, but the cost, complexity and overhead of these tools again make them better suited to the production system, not a quick mockup or rapid prototype.</p>
<p>What you need to make an ETL prototype is an easy to use ETL tool that provides the basic type of functionality and graphical user interface of high end ETL tools, but also allows a more flexible treatment of data types, all with the ability to pull data from multiple sources, including more informal sources like Excel spreadsheets.</p>
<p>The <a href="/">Datamartist tool</a> was created to provide exactly such a data scratchpad, ideal for rapid prototyping of data transformations. It lets you profile your data and build data transformations using a visual, block and connector interface.  It is a clear, focused and easy-to-use ETL tool, without all the feature bloat, cost and server configuration required by many expensive enterprise ETL solutions.</p>
<p><a href="/downloads">Download the free trial</a>, and see for yourself.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/making-rapid-prototypes-for-data-warehouse-etl-jobs/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

