<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datamartist.com &#187; ETL</title>
	<atom:link href="http://www.datamartist.com/category/etl/feed" rel="self" type="application/rss+xml" />
	<link>http://www.datamartist.com</link>
	<description>Reduce cost with self serve data transformation</description>
	<lastBuildDate>Wed, 25 Jan 2012 15:47:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Preparing Data for QlikView</title>
		<link>http://www.datamartist.com/preparing-data-for-qlikview</link>
		<comments>http://www.datamartist.com/preparing-data-for-qlikview#comments</comments>
		<pubDate>Thu, 18 Nov 2010 14:44:34 +0000</pubDate>
		<dc:creator>Cam Quinn</dc:creator>
				<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Qlik View]]></category>
		<category><![CDATA[Qlikview]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=5870</guid>
		<description><![CDATA[In this blog post, I am going to play with some economic data- specifically, Canadian Import and Export data using Datamartist and then use QlikView Business Intelligence Software to analyze the results. The trick with public data like this is that often (ok almost ALWAYS) either data is missing, or the codes don't match up. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.datamartist.com/wp-content/uploads/2010/11/QlikView-Introduction-Screen-Shot.jpeg"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/QlikView-Introduction-Screen-Shot-300x150.jpg" alt="" width="300" height="150" class="alignleft size-medium wp-image-5875" /></a>In this blog post, I am going to play with some economic data- specifically,  Canadian Import and Export data using Datamartist and then use QlikView Business Intelligence Software to analyze the results.</p>
<p>The trick with public data like this is that often (ok almost ALWAYS) either data is missing, or the codes don't match up.  In this case, the country descriptions from various data sets I want to use don't match- and different data sets have different holes (i.e. not all datasets include data for all countries).  Finally, some data sets have a different definition of a country- for example, they break out places like "British Indian Ocean Territories" that need to get rolled up in the UK numbers.</p>
<p> Country statistics data such as GDP, GNI and Population were also incorporated to provide dimensions to carry the analysis out on. The raw trade data was obtained from Industry Canada's "Trade Data Online" (<a href="http://www.ic.gc.ca/sc_mrkti/tdst/tdo/tdo.php?lang=30&amp;productType=HS6" target="_blank">http://www.ic.gc.ca/sc_mrkti/tdst/tdo/tdo.php?lang=30&amp;productType=HS6</a>). The World Bank was the source of the country statistics data (<a href="http://data.worldbank.org/indicator" target="_blank">http://data.worldbank.org/indicator</a>). A zip file containing the raw data, as well as the Datamartist data transformation .dmc file is provided at the bottom of this post.</p>
<p>I started by transforming the country statistics data. The raw data included information on GDP, GNI, Total Population and Urban Population. A screenshot of the Datamartist canvas for the first portion of this data transformation is provided below.<a href="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-Country-Statistics-Canvas-Screenshot.jpeg"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-Country-Statistics-Canvas-Screenshot-300x133.jpg" alt="" width="300" height="133" class="aligncenter size-medium wp-image-5878" /></a>As seen above, the first step in the data transformation involved importing the four Excel data files. During this import, columns with data not relevant to the year 2009 were filtered out and zeros were inserted into any null data rows, signifying that data for that row was not available. <a href="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-GDP-GNI-Join-Screenshot.jpeg"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-GDP-GNI-Join-Screenshot-300x128.jpg" alt="" width="300" height="128" class="alignright size-medium wp-image-5886" /></a>A series of data "Join" functions were then carried out to create one data file containing all of the country statistics information. Once these data files were joined, a "Calculation" block was used to replace any null data values resulting from the join with zeros. Finally, the country statistics information was joined with a country cross reference list. Basically, this join standardizes all of the country names.</p>
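<p>For readers who think in code, the null-filling, joining and name-standardizing steps described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample figures, the <code>CROSS_REF</code> mapping and the function names are hypothetical and are not taken from the Datamartist file or the World Bank data, and the join is assumed to keep unmatched countries because the post fills the nulls it produces.</p>

```python
# Hypothetical 2009 extracts keyed by the country name used in each source.
gdp = {"Korea, Rep.": 900.0, "Canada": 1300.0, "Tuvalu": None}
population = {"Korea, Rep.": 49.0, "Canada": 33.0}

# Hypothetical cross reference list: source spelling -> standard name.
CROSS_REF = {"Korea, Rep.": "South Korea"}


def zero_if_null(value):
    """Replace a missing value with zero, like the null-replacement step."""
    return 0.0 if value is None else value


def join_country_stats(gdp_rows, pop_rows, cross_ref):
    """Outer-style join on country name, then standardize the names."""
    joined = {}
    for name in sorted(set(gdp_rows) | set(pop_rows)):
        standard = cross_ref.get(name, name)
        joined[standard] = {
            "gdp": zero_if_null(gdp_rows.get(name)),
            "population": zero_if_null(pop_rows.get(name)),
        }
    return joined


stats = join_country_stats(gdp, population, CROSS_REF)
```

<p>In this sketch a country missing from one source still appears in the result with a zero, which mirrors the convention in the post that zero means "not available".</p>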
<p>The second part of the country statistics data transformation focused on segmenting the data, as shown in the Datamartist canvas screenshot below.<a href="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-Country-Statistics-Segmentation-Screenshot.jpeg"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-Country-Statistics-Segmentation-Screenshot-300x82.jpg" alt="" width="300" height="82" class="aligncenter size-medium wp-image-5891" /></a>Before the data could be segmented, it was summarized so that there was only one data row for each standardized country name. A "Calculation" block was then added to calculate the GDP per Capita, GNI per Capita and Urban Population Percentage using the Population data. With these calculations complete, a series of "Segment" blocks were added to the canvas. The "Segment" blocks are extremely useful because they add an additional column to the data set which is <a href="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-Population-Segment-Block-Screenshot.jpeg"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-Population-Segment-Block-Screenshot-300x141.jpg" alt="" width="300" height="141" class="alignleft size-medium wp-image-5895" /></a>populated according to a set of segmentation rules defined by the user. In this example, the "Segment" block was used to segment the GDP per Capita, GNI per Capita, Urban Population Percentage and Population data. The segmentation rules for the Population "Segment" block are shown in the screenshot on the left.</p>
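<p>The per capita calculation and the segmentation rules can be pictured roughly as follows. This is a sketch, not the "Segment" block itself: the boundaries and labels in <code>POPULATION_RULES</code> are made up for illustration, since the real rules appear only in the screenshot.</p>

```python
def per_capita(total, population):
    """Return total / population, or 0.0 when the population is zero (missing)."""
    return total / population if population else 0.0


# Hypothetical segmentation rules: (upper bound, label), checked in order.
POPULATION_RULES = [
    (1_000_000, "Under 1 Million"),
    (10_000_000, "1 Million - 10 Million"),
    (100_000_000, "10 Million - 100 Million"),
]


def segment(value, rules, default="Over 100 Million"):
    """Return the label of the first rule whose upper bound exceeds value."""
    for upper_bound, label in rules:
        if value < upper_bound:
            return label
    return default


# The segment label becomes an extra column on each row.
row = {"country": "Canada", "population": 33_000_000}
row["population_segment"] = segment(row["population"], POPULATION_RULES)
```

<p>Ordering the rules from smallest to largest bound keeps each rule to a single comparison, and the default label catches everything above the last bound.</p>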
<p>A similar set of data transformations was also carried out on the Canadian Import and Export Trade data. <a href="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-Trade-Canvas-Screenshot.jpeg"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-Trade-Canvas-Screenshot-300x92.jpg" alt="" width="300" height="92" class="aligncenter size-medium wp-image-5900" /></a>As seen in the Datamartist canvas screenshot, the Canadian Import and Export Trade data was imported and joined, and null data values were replaced with zeros. The country names were then standardized and the data summarized so that the Canadian Import and Export Trade data could be joined with the Country Statistics data.</p>
<p>With all of the raw data transformed into a suitable format, a final set of data transformations was carried out to create a single text file. This text file was then exported so that QlikView could be used to analyze the data. A screenshot of the Datamartist canvas for this final set of data transformations is shown below.<a href="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-Star-Schema-Canvas-Screenshot.jpeg"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-Star-Schema-Canvas-Screenshot-300x135.jpg" alt="" width="300" height="135" class="aligncenter size-medium wp-image-5907" /></a> In this final set of data transformations, the "Star Schema" block was used first. The "Star Schema" block is a handy data transformation tool because it allows numerous data join operations to be carried out simultaneously.<a href="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-Star-Schema-Block-Screenshot1.jpeg"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/Datamartist-Star-Schema-Block-Screenshot1-300x121.jpg" alt="" width="300" height="121" class="alignright size-medium wp-image-5912" /></a> It was used to combine the Country Statistics data and Canadian Import and Export Trade data with data defining a country's geographical region. A screenshot of the "Star Schema" block configuration window is shown to the right. The joined data was then put through a "Calculation" block one last time to eliminate any null data values. Finally, the transformed data was exported as a text file so that it could be analyzed in QlikView.</p>
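<p>Conceptually, joining several tables to one central table in a single step looks something like the Python sketch below. It is an assumption-laden illustration rather than a description of how the "Star Schema" block works internally: the function <code>star_join</code>, the column names and the sample rows are all hypothetical.</p>

```python
def star_join(facts, dimensions, key="country"):
    """Left-join each fact row to every dimension table on a shared key."""
    result = []
    for fact in facts:
        row = dict(fact)
        for dim in dimensions:
            # A missing dimension row adds no columns here; the post fills
            # such gaps with zeros in a later "Calculation" block.
            row.update(dim.get(fact[key], {}))
        result.append(row)
    return result


# Hypothetical sample rows.
trade = [
    {"country": "China", "exports": 10.0},
    {"country": "Chile", "exports": 2.0},
]
stats = {"China": {"population_segment": "Over 100 Million"}}
regions = {"China": {"region": "Asia"}, "Chile": {"region": "South America"}}

combined = star_join(trade, [stats, regions])
```

<p>Each dimension is looked up by the same key, so adding another dimension table means adding one more entry to the list rather than chaining another join.</p>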
<p>The transformed data was then imported<a href="http://www.datamartist.com/wp-content/uploads/2010/11/Screen-shot-2010-11-11-at-10.40.08-AM.png"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/Screen-shot-2010-11-11-at-10.40.08-AM-300x187.png" alt="" width="300" height="187" class="alignright size-medium wp-image-5917" /></a> into QlikView and a dashboard was created to analyze it. QlikView is a great data analysis tool because it allows data to be filtered and visualized very efficiently. In this example, I made a dashboard that allows Canada Import and Export Trade data to be visualized using the Country Statistics Data segments created using the Datamartist software as filters. A screenshot of the dashboard with no filters applied is shown to the right. I am now going to show a series of screenshots with different data filters applied. To start off, I want to see the countries that Canada exports the most goods to. To do this, I just dragged a box over the largest bars on the export graph as seen below.<a href="http://www.datamartist.com/wp-content/uploads/2010/11/Screen-shot-2010-11-11-at-11.05.05-AM.png"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/Screen-shot-2010-11-11-at-11.05.05-AM-300x187.png" alt="" width="300" height="187" class="aligncenter size-medium wp-image-5919" /></a> Once I finished making the data selection box on the graph, I released the mouse and QlikView automatically zoomed in on the area I selected. The upper right table in the dashboard updated as well. The image below shows the results. 
As seen in the image, Canada's biggest export trade partners in 2009 were the United States, the United Kingdom and China.<a href="http://www.datamartist.com/wp-content/uploads/2010/11/Screen-shot-2010-11-11-at-11.14.16-AM.png"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/Screen-shot-2010-11-11-at-11.14.16-AM-300x187.png" alt="" width="300" height="187" class="aligncenter size-medium wp-image-5922" /></a> As a second example, I will filter the data using the data segments created in Datamartist. If I click on "Asia" in the "Region" box then only data from the countries in Asia is shown in the table and graphs. Furthermore, the segments in the other filter boxes (GDP per Capita, GNI per Capita, etc.) update to reflect the region selection. QlikView does this by highlighting the data filter segments that are valid for the "Asia" region in white. For example, in the "GDP per Capita" box, all data segments are valid except for the "GDP &gt; $100 Thousand" segment. A screenshot of QlikView with the "Asia" region filter on is shown below.<a href="http://www.datamartist.com/wp-content/uploads/2010/11/Screen-shot-2010-11-11-at-12.15.11-PM.png"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/Screen-shot-2010-11-11-at-12.15.11-PM-300x187.png" alt="" width="300" height="187" class="aligncenter size-medium wp-image-5929" /></a> I can further filter the data by clicking on any other data filter segments that are white. As an example, if I select "$1 Thousand - $5 Thousand" in the "GDP per Capita" box and "40% - 60%" in the "Urban Population Percentage" box, the graphs and table update again. In this instance, the only countries that meet these filter requirements are China, Georgia and Mongolia. 
A QlikView screenshot with all three of the filters chosen is shown below.<a href="http://www.datamartist.com/wp-content/uploads/2010/11/Screen-shot-2010-11-11-at-12.21.03-PM.png"><img src="http://www.datamartist.com/wp-content/uploads/2010/11/Screen-shot-2010-11-11-at-12.21.03-PM-300x187.png" alt="" width="300" height="187" class="aligncenter size-medium wp-image-5930" /></a></p>
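<p>The filtering behaviour described above, where a selection in one box narrows which segments remain valid in the others, can be approximated in Python as follows. This is a rough sketch of the idea, not QlikView's engine: the rows are hypothetical, only some of the segment labels come from this post, and <code>apply_filters</code> and <code>valid_values</code> are names invented for the example.</p>

```python
# Hypothetical rows; only some segment labels are taken from the post.
ROWS = [
    {"country": "China", "region": "Asia", "gdp_segment": "$1 Thousand - $5 Thousand"},
    {"country": "Japan", "region": "Asia", "gdp_segment": "$25 Thousand - $50 Thousand"},
    {"country": "France", "region": "Europe", "gdp_segment": "$25 Thousand - $50 Thousand"},
]


def apply_filters(rows, selections):
    """Keep only the rows that match every selected field value."""
    return [
        row for row in rows
        if all(row[field] == value for field, value in selections.items())
    ]


def valid_values(rows, field):
    """Return the values of a field that still occur in the filtered rows."""
    return {row[field] for row in rows}


asia = apply_filters(ROWS, {"region": "Asia"})
still_valid = valid_values(asia, "gdp_segment")

narrowed = apply_filters(
    ROWS, {"region": "Asia", "gdp_segment": "$1 Thousand - $5 Thousand"}
)
```

<p>The values in <code>still_valid</code> play the role of the segments shown in white: they are the ones that can still be clicked without emptying the result.</p>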
<h2>Try it out yourself with the free trial</h2>
<p>You can give Datamartist a try with this data. Just <a href="/downloads">sign up and download</a> the free trial, and then download <a href="http://www.nmodal.com/downloads/CanadaWorldTradingExample.zip">a zip file with all the data, and the example Datamartist file</a>.</p>
<p>Just extract all the files in the above ZIP file into the "My Datamartist" folder that the Datamartist trial will create when you run it, and open the "World Trading Example.DMC" file with Datamartist.</p>
<p>You'll find that Datamartist gives you a powerful, visual way to transform data from lots of places, and get it ready for great visualization tools like QlikView in a step by step, clean, repeatable way.</p>
<p>On top of that, Datamartist can be automated- so if you have data transformations you need to run on a schedule, you can design them in a graphical environment, test them, and then have them run automatically.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/preparing-data-for-qlikview/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A simple ETL tool with data profiling tools built in</title>
		<link>http://www.datamartist.com/a-simple-etl-tool-with-data-profiling-tools-built-in</link>
		<comments>http://www.datamartist.com/a-simple-etl-tool-with-data-profiling-tools-built-in#comments</comments>
		<pubDate>Thu, 08 Jul 2010 04:36:36 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[ETL]]></category>
		<category><![CDATA[Datamartist Tool]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4554</guid>
		<description><![CDATA[Datamartist is a new idea in ETL and data profiling tools. It gives people who are serious about getting at their data a powerful, simple to use, right sized tool. Easy to install Easy to use ETL features and data profiling capability Avoid using the wrong tool for the job Enterprise ETL tools (Extract Transform [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.datamartist.com/wp-content/uploads/2010/06/Sales-example-full-screen-shot-profiler-perspective-300w.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/06/Sales-example-full-screen-shot-profiler-perspective-300w.jpg" alt="" title="Sales-example-full-screen-shot-profiler-perspective-300w" width="300" height="228" class="alignright size-full wp-image-4557" /></a>Datamartist is a new idea in ETL and data profiling tools.  It gives people who are serious about getting at their data a powerful, simple to use, right sized tool.</p>
<ul>
<li>Easy to install</li>
<li>Easy to use</li>
<li>ETL features and data profiling capability</li>
</ul>
<h2>Avoid using the wrong tool for the job</h2>
<p>Enterprise ETL tools (Extract Transform and Load) are very powerful but often extremely difficult to use.  </p>
<ul>
<li>expensive, particularly if multiple environments are needed</li>
<li>require server infrastructure, configuration and setup.</li>
<li>require expensive developers who have been trained in the specific programming language of each particular vendor's tool.</li>
<li>designed for performance and data volume, not ease of use.</li>
</ul>
<p>Obviously they have their time and place, but when you want fast, visual access to your data, you end up getting slowed down by expensive ETL server overkill.</p>
<h2>A better choice- the visual, clean ETL tool</h2>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/07/Join-Block-Edit.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/07/Join-Block-Edit.jpg" alt="" title="Join-Block-Edit" width="300" height="232" class="alignright size-full wp-image-4594" /></a>Datamartist is designed to let you extract data from multiple sources, and then mix it, match it, transform it, and understand it.</p>
<p>It uses a visual block and connector model, with the concept of "Data canvases" that let you easily manage and simplify complex data transformations.  But unlike many overly complex ETL tools, Datamartist provides visual, configurable blocks, rather than requiring code.</p>
<h2>Easy to install</h2>
<p>Datamartist installs in minutes, and runs on your desktop, giving you control of your data, and what you need to do.  Don't configure servers, don't worry about installing the right version of Java, don't spend hours searching wikis and forums and tweaking config files.  Just <a href="/downloads">download it</a>, single step install it, and use it.</p>
<p>It makes it easy for you to take a snapshot of the data you need- locally with cut and paste or drag and drop from files, and locally or remotely with native connections to SQL Server, Oracle, MySQL and MS Access, and pretty much anything else via ODBC.</p>
<p>And since the Datamartist data transformation engine can be run from the command line or scripted, it can also be automated to implement ETL tasks running on a Windows server.</p>
<h2>Speed up data delivery, reduce cost.</h2>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/07/Tree-Structure-Management.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/07/Tree-Structure-Management.jpg" alt="" title="Tree-Structure-Management" width="400" height="300" class="alignleft size-full wp-image-4599" /></a>Datamartist provides a flexible, simple to use ETL environment that will let you shorten your time to delivery significantly for a wide range of data transformation tasks.</p>
<ul>
<li>Deliver small and medium sized data transformation tasks more quickly</li>
<li>Build rapid prototypes and proofs of concepts</li>
<li>Automate data profiling and data quality monitoring</li>
</ul>
<h2>Give the Datamartist ETL Tool a try</h2>
<p>You can <a href="/downloads">download the Datamartist trial</a> and be up and running in minutes.  You don't even have to register- and you will have full access to a fully functioning version of Datamartist to try out this simple, visual ETL tool on your own data.</p>
<p>We're also very excited about V1.3.0, currently in private beta.  If you'd like to participate in the public beta, drop me a line at "beta at datamartist.com", and we'll send you a link when that download is available.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/a-simple-etl-tool-with-data-profiling-tools-built-in/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Datamartist V1.2 now available</title>
		<link>http://www.datamartist.com/datamartist-v1-2-now-available</link>
		<comments>http://www.datamartist.com/datamartist-v1-2-now-available#comments</comments>
		<pubDate>Tue, 02 Mar 2010 14:45:33 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Transformation]]></category>
		<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[ETL]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=4266</guid>
		<description><![CDATA[nModal solutions is pleased to announce that Datamartist V1.2 is now available. In this version, we've introduced a Standard and Pro edition, letting customers get the features they need at the right price. Datamartist Standard: $349 Datamartist Professional: $745 A comparison of the feature sets explains the details. Whats new in V1.2 Data source import [...]]]></description>
			<content:encoded><![CDATA[<p>nModal solutions is pleased to announce that Datamartist V1.2 is now available.</p>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/02/Sales-example-full-screen-shot-profiler-perspective-300w.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/02/Sales-example-full-screen-shot-profiler-perspective-300w.jpg" alt="" title="Sales-example-full-screen-shot-profiler-perspective-300w" width="300" height="228" class="alignright size-full wp-image-4302" /></a>In this version, we've introduced a Standard and Pro edition, letting customers get the features they need at the right price. </p>
<ul>
<li>Datamartist Standard:       $349</li>
<li>Datamartist Professional:   $745</li>
</ul>
<p>A <a href="/product/datamartist-pricing-and-edition-comparison">comparison of the feature sets explains</a> the details.</p>
<h1>What's new in V1.2</h1>
<h2>Data source import enhancements</h2>
<ul style="margin-top:10px;">
<li>Ability to cut and paste between Excel, Text files, the Datamartist canvas and any Datamartist data viewer.</li>
<li>New integrated data source repository with drag and drop to canvas.</li>
<li>SQL Editor to allow the creation of SQL queries to get data from databases.</li>
</ul>
<p><a href="http://www.datamartist.com/wp-content/uploads/2010/02/Edit-SQL-Datamartist1.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/02/Edit-SQL-Datamartist1.jpg" alt="" title="Edit-SQL-Datamartist" width="609" height="342" class="aligncenter size-full wp-image-4307" /></a></p>
<h2>Running Datamartist canvases automatically</h2>
<p>Now that Datamartist can be run from the command line, it is possible to schedule Datamartist transforms- even running it on a Windows server.  Details about the logging and options <a href="/resources/datamartist-doc-files/V1_0_Documentation/DM-running-from-cmd-line-Doc.html">are here</a>.<br />
<a href="http://www.datamartist.com/wp-content/uploads/2010/02/Running-datamartist-from-the-command-line-610w.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/02/Running-datamartist-from-the-command-line-610w.jpg" alt="" title="Running-datamartist-from-the-command-line-610w" width="610" height="308" class="aligncenter size-full wp-image-4310" /></a></p>
<h2>Edit Internal data sets.</h2>
<p>The addition of fully editable internal data sets that are stored within the DMC file itself gives a powerful new ability to create "What if" type scenarios.  Imagine you want to see the effect of changing the sales regions slightly.  Just copy and paste the existing data from a data viewer onto the canvas, which gives you an internal data set block with that data in it.  Now you can add a column "New Region" or rename the column, edit some values, join it back into the original data with a join block, and be trying different scenarios in no time.<br />
<a href="http://www.datamartist.com/wp-content/uploads/2010/02/Internal-edit-regions-list.jpg"><img src="http://www.datamartist.com/wp-content/uploads/2010/02/Internal-edit-regions-list.jpg" alt="" title="Internal-edit-regions-list" width="547" height="389" class="aligncenter size-full wp-image-4313" /></a></p>
<p>We're excited about this new release, and thanks to all our customers and testers for their feedback- we're glad to be incorporating some of those great ideas into the product.</p>
<p>If you haven't tried Datamartist yet, <a href="/downloads">this is the perfect time</a>, and now with two editions to choose from you can get the features you need at the right price.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/datamartist-v1-2-now-available/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Inner and outer joins SQL examples and the Join block</title>
		<link>http://www.datamartist.com/sql-inner-join-left-outer-join-full-outer-join-examples-with-syntax-for-sql-server</link>
		<comments>http://www.datamartist.com/sql-inner-join-left-outer-join-full-outer-join-examples-with-syntax-for-sql-server#comments</comments>
		<pubDate>Wed, 10 Feb 2010 16:13:45 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[SQL Code]]></category>
		<category><![CDATA[Joining data]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=3966</guid>
		<description><![CDATA[In this post I'll show you how to do all the main types of Joins with clear SQL examples. The examples are written for Microsoft SQL Server, but very similar syntax is used in Oracle, MySQL and other databases. Joins can be said to be INNER or OUTER joins, and the two tables involved are [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.datamartist.com/wp-content/uploads/2010/02/join-block-venn-diagram-datamartist.jpg" alt="join-block-venn-diagram-datamartist" title="join-block-venn-diagram-datamartist" width="212" height="188" class="alignright size-full wp-image-4068" /><br />
In this post I'll show you how to do all the main types of Joins with clear SQL examples.  The examples are written for Microsoft SQL Server, but very similar syntax is used in Oracle, MySQL and other databases.</p>
<p>Joins can be said to be INNER or OUTER joins, and the two tables involved are referred to as LEFT and RIGHT.  By combining these two concepts you get all the various types of joins in join land: Inner, left outer, right outer, and the full outer join.  </p>
<h2>Tables used for SQL Examples</h2>
<p><img src="http://www.datamartist.com/wp-content/uploads/2010/02/Join-Example-Students-And-Advisors-Tables.jpg" alt="Join-Example-Students-And-Advisors-Tables" title="Join-Example-Students-And-Advisors-Tables" width="606" height="214" class="aligncenter size-full wp-image-4057" /></p>
<p>In the screen shots I've configured Datamartist to show only the name columns to save space.  The SQL code shown is "Select *" so it will return all the columns.  You can see that in the <a href="/">Datamartist tool</a> the type of join is selected by just checking the parts of the Venn diagram that contain the rows you want.</p>
<h2>1) Inner Join SQL Example</h2>
<p><code>select * from dbo.Students S INNER JOIN dbo.Advisors A ON S.Advisor_ID=A.Advisor_ID</code></p>
<p><img src="http://www.datamartist.com/wp-content/uploads/2010/02/Join-Example-Students-And-Advisors-Inner-Join.jpg" alt="Join-Example-Students-And-Advisors-Inner-Join" title="Join-Example-Students-And-Advisors-Inner-Join" width="560" height="234" class="aligncenter size-full wp-image-4058" /></p>
<h2>2) Left Outer Join SQL Example</h2>
<p><code>select * from dbo.Students S LEFT OUTER JOIN dbo.Advisors A ON S.Advisor_ID=A.Advisor_ID</code></p>
<p><img src="http://www.datamartist.com/wp-content/uploads/2010/02/Join-Example-Students-And-Advisors-Left-Outer-Join.jpg" alt="Join-Example-Students-And-Advisors-Left-Outer-Join" title="Join-Example-Students-And-Advisors-Left-Outer-Join" width="625" height="265" class="aligncenter size-full wp-image-4059" /></p>
<h2>3) Right Outer Join SQL Example</h2>
<p><code>select * from dbo.Students S RIGHT OUTER JOIN dbo.Advisors A ON S.Advisor_ID=A.Advisor_ID</code></p>
<h2>4) Full Outer Join SQL Example</h2>
<p><code>select * from dbo.Students S FULL OUTER JOIN dbo.Advisors A ON S.Advisor_ID=A.Advisor_ID</code><br />
<img src="http://www.datamartist.com/wp-content/uploads/2010/02/Join-Example-Students-And-Advisors-Full-Outer-Join.jpg" alt="Join-Example-Students-And-Advisors-Full-Outer-Join" title="Join-Example-Students-And-Advisors-Full-Outer-Join" width="581" height="291" class="aligncenter size-full wp-image-4063" /></p>
<h2>5) SQL example for just getting the rows that don't join</h2>
<p><code>select * from dbo.Students S FULL OUTER JOIN dbo.Advisors A ON S.Advisor_ID=A.Advisor_ID where A.Advisor_ID is null or S.Student_ID is null</code><br />
<img src="http://www.datamartist.com/wp-content/uploads/2010/02/Join-Example-Students-And-Advisors-non-joining-Join.jpg" alt="Join-Example-Students-And-Advisors-non-joining-Join" title="Join-Example-Students-And-Advisors-non-joining-Join" width="638" height="227" class="aligncenter size-full wp-image-4065" /></p>
<h2>6) SQL example for just rows from one table that don't join</h2>
<p><code>select * from dbo.Students S FULL OUTER JOIN dbo.Advisors A ON S.Advisor_ID=A.Advisor_ID where A.Advisor_ID is null</code><br />
<img src="http://www.datamartist.com/wp-content/uploads/2010/02/Join-Example-Students-And-Advisors-left-exlusive-Join.jpg" alt="Join-Example-Students-And-Advisors-left-exlusive-Join" title="Join-Example-Students-And-Advisors-left-exlusive-Join" width="615" height="228" class="aligncenter size-full wp-image-4070" /></p>
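<p>If you don't have SQL Server handy, the inner join and the "rows that don't join" pattern can be tried out with Python's built-in <code>sqlite3</code> module. Treat this as a sketch under assumptions: the table contents and the name columns are hypothetical (the real example tables are only shown in the screenshots), and a LEFT OUTER JOIN is used for the unmatched students because SQLite versions before 3.39.0 do not support FULL OUTER JOIN.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Students (Student_ID INTEGER, Student_Name TEXT, Advisor_ID INTEGER);
    CREATE TABLE Advisors (Advisor_ID INTEGER, Advisor_Name TEXT);
    -- Hypothetical rows: one match, one dangling key, one student with no advisor.
    INSERT INTO Students VALUES (1, 'Student A', 10), (2, 'Student B', 20), (3, 'Student C', NULL);
    INSERT INTO Advisors VALUES (10, 'Advisor X'), (30, 'Advisor Z');
    """
)

# Inner join: only students whose Advisor_ID exists in Advisors.
inner = conn.execute(
    "SELECT S.Student_Name, A.Advisor_Name FROM Students S "
    "INNER JOIN Advisors A ON S.Advisor_ID = A.Advisor_ID"
).fetchall()

# Students that do not join to any advisor.
unmatched_students = conn.execute(
    "SELECT S.Student_Name FROM Students S "
    "LEFT OUTER JOIN Advisors A ON S.Advisor_ID = A.Advisor_ID "
    "WHERE A.Advisor_ID IS NULL ORDER BY S.Student_ID"
).fetchall()
```

<p>For the students side, filtering a left outer join on a null advisor key returns the same rows as filtering the full outer join, since rows that exist only in the advisors table always have a non-null <code>A.Advisor_ID</code>.</p>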
<h1>But what about the duplicate row thing?</h1>
<p>Now, since in this case we had a simple one to one relationship, the number of rows that were returned made the Venn diagrams make sense, and add up pretty normally with tables one and two.</p>
<p>What happens if the data in the tables are not a simple one to one relationship?  What happens if we add one duplicate advisor with the same ID, but a different name?<br />
<img src="http://www.datamartist.com/wp-content/uploads/2010/02/Join-Example-Students-And-Advisors-duplicate-advisors.jpg" alt="Join-Example-Students-And-Advisors-duplicate-advisors" title="Join-Example-Students-And-Advisors-duplicate-advisors" width="431" height="184" class="aligncenter size-full wp-image-4080" /></p>
<p>A join will create a row for every combination of rows that join together.  So if there are two advisors with the same key, for every student record that has that key, you will have two rows in the inner part of the join.  The advisor duplicate makes duplicate student records for every student with that advisor.</p>
<p>You can see how this could add up to a lot of extra rows.  The number of rows is the product of the two sets of joining rows. If the tables get big, just a few duplicates will cause the results of a join to be much larger than the total number of rows in the input tables- this is something you have to watch very carefully when joining- check your row counts.</p>
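<p>The row multiplication is easy to see with a quick row count check. The sketch below uses Python's built-in <code>sqlite3</code> module with hypothetical data: three students who share one advisor key, and two advisor rows carrying that same key.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Students (Student_ID INTEGER, Advisor_ID INTEGER);
    CREATE TABLE Advisors (Advisor_ID INTEGER, Advisor_Name TEXT);
    -- Hypothetical rows: three students share key 10, which appears twice in Advisors.
    INSERT INTO Students VALUES (1, 10), (2, 10), (3, 10);
    INSERT INTO Advisors VALUES (10, 'Advisor X'), (10, 'Advisor X2');
    """
)


def count(sql):
    """Run a COUNT(*) query and return the single number it produces."""
    return conn.execute(sql).fetchone()[0]


students = count("SELECT COUNT(*) FROM Students")
advisors = count("SELECT COUNT(*) FROM Advisors")
joined = count(
    "SELECT COUNT(*) FROM Students S "
    "INNER JOIN Advisors A ON S.Advisor_ID = A.Advisor_ID"
)

# Keys that appear more than once on the advisor side of the join.
duplicate_keys = conn.execute(
    "SELECT Advisor_ID, COUNT(*) FROM Advisors "
    "GROUP BY Advisor_ID HAVING COUNT(*) > 1"
).fetchall()
```

<p>Here three student rows and two advisor rows produce six joined rows, more than the five rows in both input tables combined. Running a duplicate key query like the last one before joining is one simple way to catch the problem early.</p>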
<p>So there you have it.  If you want to try joining tables with the Datamartist tool- <a href="/downloads">give it a try</a>.  It's a super fast install, and you'll be joining like a pro in no time.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/sql-inner-join-left-outer-join-full-outer-join-examples-with-syntax-for-sql-server/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Making rapid prototypes for data warehouse ETL jobs</title>
		<link>http://www.datamartist.com/making-rapid-prototypes-for-data-warehouse-etl-jobs</link>
		<comments>http://www.datamartist.com/making-rapid-prototypes-for-data-warehouse-etl-jobs#comments</comments>
		<pubDate>Mon, 14 Sep 2009 20:39:00 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data warehouse]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Project Management]]></category>
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=3022</guid>
		<description><![CDATA[Data warehouses and even data marts can be expensive, complex projects. They are not projects to start lightly, and they are not projects that you want to launch without doing some solid planning. But there is a way to get a handle on the tricky parts of your data warehouse scope, and to reduce your [...]]]></description>
			<content:encoded><![CDATA[<p>Data warehouses and even data marts can be expensive, complex projects. They are not projects to start lightly, and they are not projects that you want to launch without doing some solid planning. </p>
<p>But there is a way to get a handle on the tricky parts of your data warehouse scope, and to reduce your project's overall cost.</p>
<p><img src="http://www.datamartist.com/wp-content/uploads/2009/09/ETL-Cost-vs-Other-Data-Warehouse-Cost.jpg" alt="ETL-Cost-vs-Other-Data-Warehouse-Cost" title="ETL-Cost-vs-Other-Data-Warehouse-Cost" width="302" height="236" class="alignright size-full wp-image-3122" />The major cost component of any data warehouse project is the Extract, Transform and Load (ETL) development. Obviously every project is slightly different, but in my experience ETL will often make up on the order of 70% of the development cost.  One of the drivers of this cost is the relatively high priced ETL development resources required.  In the markets where I've hired resources, an ETL developer will often command a 30-40% higher hourly rate than a business intelligence report writer, for example.</p>
<p><strong>Making ETL prototypes will give you insights that can reduce cost </strong> by shortening the ETL development process and making the optimum use of those highly talented and expensive ETL resources.</p>
<h2>What affects the cost and complexity of ETL jobs?</h2>
<p><img src="http://www.datamartist.com/wp-content/uploads/2009/09/DW-prototype-not-all-data-in-erp.jpg" alt="DW-prototype-not-all-data-in-erp" title="DW-prototype-not-all-data-in-erp" width="353" height="250" class="alignright size-full wp-image-3072" />For any given scope, the following will have a large impact on the number and complexity of ETL jobs and therefore their cost.</p>
<ol>
<li>The number of different data sources involved.</li>
<li>The consistency in terms of master data definitions between systems.</li>
<li>The level of data quality in the systems.</li>
</ol>
<p>Ideally, you want to get a good handle on these three things before you hire all the ETL developers, and be confident that you are going to satisfy the users' needs before millions of dollars are spent on Extract, Transform and Load (ETL) jobs and business intelligence reports.  </p>
<p>One part of the preparation needed to do this can be the creation of a proof of concept or mockup of key parts of the data warehouse ETL deliverable.</p>
<p>Now, there are mockups, there are prototypes, and there are "first versions".  The most effective approach is to create a mockup or prototype that:</p>
<ul>
<li>Goes just deep enough into the data to:
<ul>
<li>Establish all data sources that will be required</li>
<li>Give a high level audit of their master data and data quality</li>
</ul>
</li>
<li>Provides enough output that:
<ul>
<li>End users can be supplied with example reports or cubes to get hands on</li>
<li>The functional scope can be locked down with confidence on all sides.</li>
</ul>
</li>
</ul>
<p>The goal of a data warehouse prototype is to learn about the underlying data, and to be able to try different data transformation techniques and approaches on the real data.  The goal is not to make the finished product, nor to deliver actually usable reports to end users, although it may be to generate an example result for users to validate.</p>
<p>An example might be to create a prototype to calculate total sales by segment for a period under a new customer segmentation.  This would identify if the segmentation rules that have been suggested actually result in the expected segmentation of sales data, and if the fields involved are complete and correctly populated in the source systems.</p>
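<p>As a rough illustration of that kind of prototype, the Python sketch below totals sales by segment and reports how much of the total could not be segmented because the source field was blank. The sample rows, the <code>revenue_band</code> field and the segmentation rule are all hypothetical.</p>

```python
from collections import defaultdict

# Hypothetical sample rows; None marks a field left blank in the source system.
sales = [
    {"customer": "C1", "revenue_band": "large", "amount": 1200.0},
    {"customer": "C2", "revenue_band": "small", "amount": 300.0},
    {"customer": "C3", "revenue_band": None, "amount": 450.0},
    {"customer": "C4", "revenue_band": "large", "amount": 800.0},
]

def segment_of(row):
    """Hypothetical segmentation rule: segment on revenue band, flag blanks."""
    return row["revenue_band"] or "UNASSIGNED"

totals = defaultdict(float)
for row in sales:
    totals[segment_of(row)] += row["amount"]

# Share of sales that the proposed rule cannot place in any segment.
unassigned_share = totals["UNASSIGNED"] / sum(totals.values())

for segment, amount in sorted(totals.items()):
    print(segment, amount)
print("unassigned share:", round(unassigned_share, 3))
```

<p>A large unassigned share is exactly the sort of finding the prototype is meant to surface before the real ETL work starts.</p>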
<p>A prototype should focus on the dimensions and data sources that are expected to be the most difficult, and involve multi-source integration.  Don't spend time prototyping the easy stuff.</p>
<p>When you are making a prototype remember it's a one-time development.  Manual steps and doing some "data cleaning by hand" are perfectly reasonable-  it's what you learn from the prototype, not how you learn it that is important.  Take a snapshot, or a sample of the various tables and put them in a sandbox environment where you can manipulate them quickly and easily.</p>
<p>The whole point is to move quickly, get lots of feedback from users, and be able to avoid unpleasant discoveries during the actual data warehouse development.  </p>
<p>If you find a data quality issue, and it's a tough one, then just remove those rows and continue on- remember you don't have to solve all the problems in the prototype- you need to identify them.  Be open with your users about what the exercise is about- and that it is a very rough pass, and a mockup.</p>
<h2>How much could this impact cost? </h2>
<p>If you can identify issues during the prototype then you can solve them before all the ETL development resources are brought onto the project. </p>
<p>If you do not do a prototype, and find a data quality issue that requires some back and forth with the business, every week of delay will probably represent thousands or tens of thousands of dollars, with the project team waiting on the resolution before being able to resume coding the ETL jobs in question.</p>
<p>So in summary, making prototypes will:</p>
<ul>
<li>Reduce the risk of scope creep because users have actually seen and "touched" a mockup of the final output.</li>
<li>Reduce the amount of rework in ETL code because different data transformation approaches can be tested early.</li>
<li>Reduce the risk of the expensive ETL development phase of the project slipping due to unknown data quality issues.</li>
</ul>
<h2>The right tool for ETL Prototypes.</h2>
<p>Often prototypes are built in a combination of Excel, MS Access or other databases. These tools can work, but Excel has serious issues handling larger data sets, and database development is often cumbersome-  the idea is to make a prototype, not actually build the SQL code.  Things like different data types, field formats, column naming rules etc. between different source databases often frustrate attempts to do something quickly.</p>
<p>Obviously another option is the enterprise ETL tools themselves- but the cost, complexity and overhead of these tools again makes them better suited to the production system- not a quick mockup or rapid prototype.</p>
<p>What you need to make an ETL prototype is an easy to use ETL tool that provides the basic type of functionality and graphical user interface of high end ETL tools, but also allows a more flexible treatment of data types, all with the ability to pull data from multiple sources, including more informal sources like Excel spreadsheets.</p>
<p>The <a href="/">Datamartist tool</a> was created to provide exactly such a data scratchpad, ideal for rapid prototyping data transformations. It lets you profile your data and build data transformations using a visual, block and connector interface.  But it represents a clear, focused and easy to use ETL tool, without all the feature bloat, cost and server configuration required by many expensive enterprise ETL solutions.  </p>
<p><a href="/downloads">Download the free trial</a>, and see for yourself.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/making-rapid-prototypes-for-data-warehouse-etl-jobs/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Data to the people- why self serve ETL</title>
		<link>http://www.datamartist.com/data-to-the-people-why-self-serve-etl</link>
		<comments>http://www.datamartist.com/data-to-the-people-why-self-serve-etl#comments</comments>
		<pubDate>Tue, 21 Jul 2009 17:11:37 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Business Intelligence Architecture]]></category>
		<category><![CDATA[Data Transformation]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Analyst tools]]></category>
		<category><![CDATA[Business Intelligence trends]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=2866</guid>
		<description><![CDATA[As regular readers of this blog know, I believe in a balance between formal and informal data analysis tools. I believe in an approach that firmly places people in the center of a new way of looking at the data analysis process. In the past, “big business intelligence” created an infrastructure heavy, highly centralised and [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.datamartist.com/wp-content/uploads/2009/07/you-have-used-unautorised-data-transformation.jpg" alt="you-have-used-unautorised-data-transformation" title="you-have-used-unautorised-data-transformation" width="403" height="354" class="alignright size-full wp-image-2887" />As regular readers of this blog know, I believe in a balance between formal and informal data analysis tools.</p>
<p>I believe in an approach that firmly places people in the center of a new way of looking at the data analysis process.</p>
<p>In the past, “big business intelligence” created an infrastructure heavy, highly centralised and technology focused approach to getting data from source systems into reports in the hands of the users.  Under this regime, users were not to be trusted with raw data, but were given tightly controlled, managed and aggregated reports in order to protect the “single version of the truth”.</p>
<blockquote><p>The theory and practice were tightly defined, and had been honed over decades of business intelligence and data warehouse orthodoxy.   Giving raw data to end users would lead to chaos. Letting end users define new ways to look at the data would corrupt the master data, and lead to everyone looking at something different.</p></blockquote>
<p>You can guess the  <a href="http://datadoodle.com/2009/07/16/just-give-me-the-data/" target="_blank">sort of response</a> this "don't give them the raw data" approach gets from capable, curious people that want to get down to some real analysis.  </p>
<p>But to be fair you can see why these concerns are thought to be well founded.  Almost every large enterprise is awash in a sea of Excel files and a tangle of links and formulas.  Excel is a wonderful tool, but it only offers the illusion of solving the data transformation problem.  It is a much better reporting/dashboard tool than an ETL tool. (Although in the right hands it can do remarkable things.)</p>
<p>And this is the true state of affairs now.  When the “official” system does not provide the answers that the business needs, the people who need to make decisions get the data anyway, and they do it themselves. They do it in Excel, they take night courses in Structured Query Language (SQL), they hire consultants (or even summer students) to build rogue databases that they run on servers hidden under desks to get at the answers they need.</p>
<p>It is easy for the data warehouse theorists to highlight the clear issues with "spreadmarts" and "shadow systems".  </p>
<p>But we need to be pragmatic. The reality of building a centralized structure that imposes strict formal rules and change management processes is that often while it does ensure that there is only one version of the truth,  it is a version of the truth that no one can use because it has been so formalized, aggregated,  compromised and delayed that by the time it is delivered the pressing business questions have changed and meaning has been expunged.  The data warehouse becomes reporting rather than analysis.</p>
<p>It's clear that enterprises need this kind of reporting- I'm not advocating abandoning the existing approach- but augmenting it.  Up till now, the solution has often been "more of the same".</p>
<blockquote><p>The regime decided that the solution was to add more technology to the central systems, increase enforcement, and search out and repress all the dissident data manipulators.  The data resistance was forced to go underground, to hide their spreadsheets, to outwardly appear to be following the official line.</p></blockquote>
<p>It is very true that there are some risks in allowing people to analyze their own data, but there is also a reward.  There is a small group of people who love data, who understand the business questions, who work to tease insight out of a steaming pile of raw data and can find things that are game changing.  Massive, formal, designed by committee data warehouses can deliver a powerful and useful view of things, but they rarely offer flashes of insight.  When they do, it is often during the design and discovery process- rarely by users using the system after it has gone live.</p>
<p>The <a href="/product">Datamartist tool</a> has been built based on the belief that both formal, centralized systems AND local, personal data transformation have a place in the architecture and that both should be official places.</p>
<p>People can be trusted with the data.  In fact I think for an organisation to truly be successful at mastering its information, they have to be.</p>
<p>We have to realize that we can't allow our obsession with the quest for a single version of the truth to turn us into totalitarian regimes, certain that OUR truth is THE truth, and that messing around with the data is by its very nature subversive and dangerous.</p>
<p>Data to the people.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/data-to-the-people-why-self-serve-etl/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Self Serve Business Intelligence</title>
		<link>http://www.datamartist.com/self-serve-business-intelligence</link>
		<comments>http://www.datamartist.com/self-serve-business-intelligence#comments</comments>
		<pubDate>Tue, 31 Mar 2009 16:40:14 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Business Intelligence Architecture]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Analyst tools]]></category>
		<category><![CDATA[easy to use etl]]></category>
		<category><![CDATA[Fixing Data]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=1354</guid>
		<description><![CDATA[Self serve business intelligence dreams of letting everyone whip up any report or analysis they want. The reality is that it's often not the report that's the problem- it's the underlying data and model. So the idea of self serve business intelligence is a wonderful idea- the problem is that it's not all about pretty [...]]]></description>
			<content:encoded><![CDATA[<p><img src="/wp-content/uploads/2009/03/self-serve-bi-garbage.jpg" alt="Self Serve Business Intelligence" title="Self Serve Business Intelligence" width="300" height="209" class="alignright size-full wp-image-1384" />Self serve business intelligence dreams of letting everyone whip up any report or analysis they want.  The reality is that it's often not the report that's the problem- it's the underlying data and model.</p>
<p>So the idea of self serve business intelligence is a wonderful idea- the problem is that it's not all about pretty graphs and fancy web user interfaces.  You need to somehow design the data model so that every possible report that users dream up is possible- or move so much data modeling functionality into the report writer that it ends up looking more like a data transformation tool, yet is easy to use.  There are "new" techniques that are getting lots of discussion- columnar databases are one, and certainly they provide interesting techniques, but only IF you've got meaningful data.</p>
<p>Here's something to try-  call up your favorite Business Intelligence vendor, and ask for a demo.  It will be wonderful, it will be clear, easy, and simple- and it will be done on a set of data that was made by someone who knew exactly what the demo was going to be.</p>
<p>Real world data is messy, and it often does not follow simple rules.  As a result, it takes significant work to build analysis- and to have a system that is ready for ANY analysis that any user might think up at the moment is non-trivial.</p>
<p>But it's a worthy pursuit.  And I think there are three key fronts in this battle:</p>
<ol>
<li>Reporting and Analysis front end tools</li>
<li>Data transformation and integration tools</li>
<li>Fixing the source systems.</li>
</ol>
<p>All the big players have integrated suites of products that perform the functions of the first two categories-  IBM/Cognos, Oracle, SAP.</p>
<p>There are also lots of very interesting new tools in the first category- <a href="http://www.tableausoftware.com/">Tableau</a> is one that gets a lot of buzz.  Of course, the king of graphing and dashboarding is still the spreadsheet, and Excel has the crown.</p>
<p>There are lots of tools for the IT department in the second category.  <a href="/product">Datamartist</a> is a unique new tool for the end user in this area- a self-serve desktop tool for data transformation.  It is a tool that allows users to quickly transform data sets to experiment and create new analysis that can then be queried and viewed using the tools in the first category.</p>
<p>But in the end, I think the number one limiter on achieving the dream of self-serve business intelligence will be getting a handle on the quality of the data in the source systems. Garbage in means garbage out- and the last two layers shouldn't have to tie themselves in knots trying to fix issues that are generated in the transactional systems.</p>
<p>But they do tie themselves in knots- and it is those knots that stop users from more freely accessing their data using an intuitive and almost brainstorming approach- which in the end is the goal of self serve.</p>
<p>Unless the underlying data in the source systems is quality controlled, and designed to capture the information that is critical for analysts, then the Business Intelligence layer will have to work too hard to "fix" the data, keeping a large IT team busy writing code, and there's nothing self serve about that.</p>
<p>In the end, the data quality issue is often what makes data warehouses and data marts so expensive to build- and drives users to spreadsheets or even databases (Microsoft Access is a common one).  This is what I call "self serve data transformation"- and this is what Datamartist does-  if you're frustrated with the access you have to your data- <a href="/downloads">download it </a>and give it a try.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/self-serve-business-intelligence/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Joining the Dimension Table to the Fact Table- Purchasing Data mart (Part 5)</title>
		<link>http://www.datamartist.com/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5</link>
		<comments>http://www.datamartist.com/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5#comments</comments>
		<pubDate>Tue, 17 Feb 2009 16:31:48 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Cost Reduction]]></category>
		<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Personal Data Marts]]></category>
		<category><![CDATA[Purchasing Analysis]]></category>
		<category><![CDATA[Data Mart Example]]></category>
		<category><![CDATA[Dimension Tables]]></category>
		<category><![CDATA[Purchasing Data Warehouse]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=991</guid>
		<description><![CDATA[After we have created the dimension tables and the fact table and populated them with data the final step to getting a star schema is of course to actually join the dimension tables to the fact table. In the datamartist tool we do this with a Join block. Check out the first four parts of [...]]]></description>
			<content:encoded><![CDATA[<p><img src="/wp-content/uploads/2009/02/join1.jpg" alt="join1" title="join1" width="200" height="200" class="alignright size-full wp-image-995" />After we have created the dimension tables and the fact table and populated them with data the final step to getting a star schema is of course to actually join the dimension tables to the fact table.  In the datamartist tool we do this with a Join block.</p>
<p>Check out the first four parts of this series (<a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">1</a>,<a href="http://www.datamartist.com/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2">2</a> , <a href="/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3">3</a> and <a href="/hierarchies-and-tree-structures-in-dimensions-an-example-item-dimension-part-4">4</a>) where we created an example data mart, with some fictitious purchasing data.</p>
<p>The final step is to join the dimensions we have created to the fact table. To do this, we connect up the two dimensions (Vendor and Item) to the Join block and connect an export block to the output.  What has in effect been created is a complete Extract, Transform, Load (ETL) process and the final star schema join.<br />
<a href="/wp-content/uploads/2009/02/po-data-mart-screen-shot2.png"><img src="/wp-content/uploads/2009/02/po-datamart-blocks1.jpg" alt="po-datamart-blocks1" title="po-datamart-blocks1" width="400" height="208" class="alignnone size-full wp-image-1002" /></a></p>
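<p>For readers who think in code rather than blocks, here is a small Python sketch of what a star schema join does logically. It is not how the Join block is implemented; the tables, the <code>star_join</code> helper and the choice to keep unmatched fact rows are assumptions made for illustration.</p>

```python
# Hypothetical dimension tables keyed by their ID, and a small fact table.
vendor_dim = {
    "V1": {"vendor_name": "Mega Brothers Rail"},
    "V2": {"vendor_name": "Acme Supply"},
}
item_dim = {
    "I1": {"item_name": "Freight"},
    "I2": {"item_name": "Bolts"},
}
fact = [
    {"vendor_id": "V1", "item_id": "I1", "amount": 500.0},
    {"vendor_id": "V2", "item_id": "I2", "amount": 75.0},
    {"vendor_id": "V9", "item_id": "I2", "amount": 20.0},  # vendor missing from the dimension
]

def star_join(fact_rows, dims):
    """Attach each dimension's attributes to every fact row.

    Fact rows with no match are kept (a left join) and their keys are reported,
    so that missing dimension members show up instead of silently vanishing.
    """
    joined, unmatched = [], []
    for row in fact_rows:
        out = dict(row)
        for key, dim in dims.items():
            match = dim.get(row[key])
            if match is None:
                unmatched.append((key, row[key]))
            else:
                out.update(match)
        joined.append(out)
    return joined, unmatched

rows, unmatched = star_join(fact, {"vendor_id": vendor_dim, "item_id": item_dim})
print(len(fact), "fact rows in,", len(rows), "joined rows out")
print("unmatched keys:", unmatched)
```

<p>Because each dimension has exactly one row per key, the joined output has the same number of rows as the fact table.</p>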
<p>(If that's a bit hard to read- click on the image to see the full size screen shot.)</p>
<p>With the generated data set I used for this example, summarizing the data to yearly totals but keeping all the detail on Vendor and Item causes the roughly 4 million row raw data file to be reduced to around 800 thousand rows.  (This summarizing was done on another canvas- although it could have been done on this canvas just as easily).</p>
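<p>The summarizing step amounts to a group-by on year, Vendor and Item. The Python sketch below shows the idea on a handful of made-up purchase order lines; the tuple layout is an assumption, not the actual file format.</p>

```python
from collections import defaultdict

# Hypothetical purchase order lines: (date as YYYY-MM-DD, vendor_id, item_id, amount).
po_lines = [
    ("2007-01-15", "V1", "I1", 100.0),
    ("2007-06-30", "V1", "I1", 50.0),
    ("2007-03-02", "V2", "I1", 20.0),
    ("2008-02-11", "V1", "I1", 70.0),
]

# One output row per (year, vendor, item) combination, with the amounts totalled.
yearly = defaultdict(float)
for date, vendor_id, item_id, amount in po_lines:
    year = int(date[:4])
    yearly[(year, vendor_id, item_id)] += amount

print(len(po_lines), "rows in,", len(yearly), "rows out")
```

<p>How much the row count shrinks depends entirely on how many lines share the same year, vendor and item.</p>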
<p><img src="/wp-content/uploads/2009/02/join-column-selection.jpg" alt="join-column-selection" title="join-column-selection" width="249" height="361" class="alignleft size-full wp-image-1007" />This data mart, with 800 thousand rows and two dimensions of about three thousand members each, took my laptop about a minute and 45 seconds to solve and save to a 360 MB text file.</p>
<p>Of course, by summarizing or filtering (just add blocks) analysis subsets could easily be exported directly to Excel, managing the data volumes involved, and letting you create the graphs, dashboards and reports that you need.</p>
<p>This is part of a 5 part series- here are the links to the various parts: <a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">1</a>,<a href="/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2">2</a> , <a href="/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3">3</a> , <a href="/hierarchies-and-tree-structures-in-dimensions-an-example-item-dimension-part-4">4</a> and <a href="/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5">5</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Connecting the dimension table to the fact table- Vendor Example (Part 3)</title>
		<link>http://www.datamartist.com/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3</link>
		<comments>http://www.datamartist.com/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3#comments</comments>
		<pubDate>Mon, 09 Feb 2009 20:47:55 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Cost Reduction]]></category>
		<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Personal Data Marts]]></category>
		<category><![CDATA[Data Mart Example]]></category>
		<category><![CDATA[Dimension Tables]]></category>
		<category><![CDATA[Duplicate Data]]></category>
		<category><![CDATA[Purchasing Data Warehouse]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=858</guid>
		<description><![CDATA[In parts one and two of this series we introduced our challenge (to make a data mart to analyze the Acme Company's spending) and showed how the Datamartist tool could import millions of rows of data and then turn it into a fact table we can use in Excel. Now we need to create a [...]]]></description>
			<content:encoded><![CDATA[<p><img src="/wp-content/uploads/2009/02/makingdimseasyway.jpg" alt="makingdimseasyway" title="makingdimseasyway" width="250" height="97" class="alignright size-full wp-image-883" />In parts <a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">one</a> and <a href="/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2">two</a> of this series we introduced our challenge (to make a data mart to analyze the Acme Company's spending) and showed how the <a href="/product">Datamartist tool</a> could import millions of rows of data and then turn it into a fact table we can use in Excel.</p>
<p>Now we need to create a Vendor dimension table and join it to this fact table to determine who our big vendors are.</p>
<p>In Datamartist it is a simple task to create this vendor dimension. As always we use blocks and connect them together.  We define a dimension by using a reference definition block. All we have to do to configure the reference block is to specify which columns uniquely define the dimension (or almost uniquely, Datamartist will resolve duplicate keys using a majority/first rule set for you if you have some data glitches).</p>
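<p>The exact rule set Datamartist applies is not spelled out here, so the following Python sketch is only one plausible reading of "majority/first": for each key, keep the version of the row that occurs most often, and break ties by taking the one seen first. The <code>resolve_duplicate_keys</code> function and the sample rows are hypothetical.</p>

```python
from collections import Counter, defaultdict

def resolve_duplicate_keys(rows, key):
    """Keep one row per key: the most frequent version, ties broken by first seen."""
    versions = defaultdict(list)
    for row in rows:
        # Freeze each row into a hashable tuple so identical versions can be counted.
        versions[row[key]].append(tuple(sorted(row.items())))
    resolved = []
    for candidates in versions.values():
        # Counter.most_common lists equal counts in the order first encountered.
        winner, _count = Counter(candidates).most_common(1)[0]
        resolved.append(dict(winner))
    return resolved

# Hypothetical vendor master with a data glitch on Vendor_ID V1.
vendors = [
    {"Vendor_ID": "V1", "name": "Mega Bros"},
    {"Vendor_ID": "V1", "name": "Mega Brothers Rail"},
    {"Vendor_ID": "V1", "name": "Mega Brothers Rail"},
    {"Vendor_ID": "V2", "name": "Acme Supply"},
]

dimension = resolve_duplicate_keys(vendors, "Vendor_ID")
for member in dimension:
    print(member["Vendor_ID"], member["name"])
```

<p>Whatever rule is used, the point is that the dimension ends up with exactly one row per key, which keeps the later join from multiplying fact rows.</p>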
<p>We start with an import block that brings in the Vendor master text file, then we define the reference by specifying "Vendor_ID" as the key.  These first two blocks look like this:<br />
<img src="/wp-content/uploads/2009/02/vendor-master-in-and-reference-block.jpg" alt="vendor-master-in-and-reference-block" title="vendor-master-in-and-reference-block" width="302" height="148" class="alignnone size-full wp-image-878" /></p>
<p>Then we join it to the fact table we created in part two of this series with a join block.  This means that now instead of just the vendor ID number that was in the fact table, we have the name, and address for the vendor in our mini star schema.</p>
<p><img src="/wp-content/uploads/2009/02/vendor-dimension-and-join.jpg" alt="vendor-dimension-and-join" title="vendor-dimension-and-join" width="436" height="283" class="alignnone size-full wp-image-879" /></p>
<p>And finally we put a summarize block after that to total up all the monthly values for each vendor, and we export to Excel. This is what the canvas looks like:<br />
<img src="/wp-content/uploads/2009/02/vendor-dimension-without-dedup1.jpg" alt="vendor-dimension-without-dedup1" title="vendor-dimension-without-dedup1" width="501" height="198" class="alignnone size-full wp-image-865" /><br />
After we do this, we grab the Excel file Datamartist just created for us, do a quick sort, and come up with a list of Acme's top ten suppliers.  Feeling pretty good about ourselves, we do a review with the head of purchasing.</p>
<p>"Where's Mega Brothers?" she says with a frown. "I think your data is screwy- no way that Mega Brothers didn't make the top ten- we spend a fortune on railways, and a lot of our freight goes with the Mega Brothers Rail company. Of course it is probably entered under different vendors, each location works with the office local to them... But we've got to view them as a single vendor in the data mart- you <em><strong>can</strong></em> do that right?"</p>
<p><img src="/wp-content/uploads/2009/02/vendor-dimension-with-dedupe1.jpg" alt="vendor-dimension-with-dedupe1" title="vendor-dimension-with-dedupe1" width="300" height="205" class="alignright size-full wp-image-870" /></p>
<h2>Fixing Duplicate Rows</h2>
<p>  Having to deal with duplicate data is a very common issue in any type of data analysis.  So, back to the canvas.  By simply adding a de-duplicate block to our Vendor dimension table (after the Reference block, and before the join) we can find and resolve the Mega Brothers duplicates.<br />
We just use the filter to find the records (easy to do, looking for "Mega", "rail", "brothers" etc.) and we map them to a single instance.  This is the filter control that lets us find and tag the duplicates:<br />
<img src="/wp-content/uploads/2009/02/mega-bros-duplicates-in-picker1.jpg" alt="mega-bros-duplicates-in-picker1" title="mega-bros-duplicates-in-picker1" width="400" height="280" class="alignnone size-full wp-image-871" /></p>
<p><img src="/wp-content/uploads/2009/02/mega-bros-duplicates-in-mapper.jpg" alt="mega-bros-duplicates-in-mapper" title="mega-bros-duplicates-in-mapper" width="312" height="247" class="alignright size-full wp-image-872" />As we tag them, they show up in the mapper, which lets us see which duplicate records we have eliminated for the dimension. We run the canvas again, and this time, sure enough, Mega Brothers Rail is in our top ten.  But even though the head of purchasing knew it was a lot, this is actually the first time she's seen the number.  "Wow. I've got to give them a call- can you give me that in an Excel spreadsheet?"</p>
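<p>The de-duplication step can be sketched in Python as a filter followed by a mapping. This is an illustration of the idea only, not the de-duplicate block itself; the vendor IDs, name variants and spend figures are invented.</p>

```python
from collections import defaultdict

# Hypothetical vendor master: the same supplier entered under several vendor IDs.
vendor_names = {
    "V1": "Mega Brothers Rail",
    "V2": "Mega Bros. Rail (West)",
    "V3": "MEGA BROTHERS RAILWAY",
    "V4": "Acme Supply",
}
spend = [("V1", 300.0), ("V2", 250.0), ("V3", 200.0), ("V4", 400.0)]

# Filter step: find candidate duplicates by keyword.
candidates = sorted(
    vendor_id for vendor_id, name in vendor_names.items() if "mega" in name.lower()
)

# Mapping step: point every candidate at one surviving ID (chosen by hand here).
canonical = {vendor_id: "V1" for vendor_id in candidates}

# Total the spend with duplicates folded into the surviving vendor.
totals = defaultdict(float)
for vendor_id, amount in spend:
    totals[canonical.get(vendor_id, vendor_id)] += amount

for vendor_id, amount in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(vendor_names[vendor_id], amount)
```

<p>Before the mapping, no single Mega Brothers entry beats Acme Supply in this made-up data; after it, the combined vendor moves to the top of the list.</p>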
<p>Stay tuned, more to come as we go further into Datamartist's ability to segment, filter and organize large data sets.</p>
<p>If you want to see the interface in action watch our first <a href="/product/video-and-screenshots/introductory-tutorial-video">Tutorial Video</a>.  Or just get right to it with your own data- <a href="/downloads">download the free trial now</a>- there is no registration required, and it installs in minutes.</p>
<p>This is part of a 5 part series- here are the links to the various parts: <a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">1</a>,<a href="/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2">2</a> , <a href="/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3">3</a> , <a href="/hierarchies-and-tree-structures-in-dimensions-an-example-item-dimension-part-4">4</a> and <a href="/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5">5</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Creating a Fact Table with the Vendor dimension Purchasing DM (Part 2)</title>
		<link>http://www.datamartist.com/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2</link>
		<comments>http://www.datamartist.com/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2#comments</comments>
		<pubDate>Fri, 06 Feb 2009 00:23:50 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Modelling]]></category>
		<category><![CDATA[Data Transformation]]></category>
		<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[ETL]]></category>
		<category><![CDATA[Personal Data Marts]]></category>
		<category><![CDATA[Data Mart Example]]></category>
		<category><![CDATA[Excel Data Import]]></category>
		<category><![CDATA[Excel Performance]]></category>
		<category><![CDATA[Purchasing Data Warehouse]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=781</guid>
		<description><![CDATA[In creating a data warehouse or data mart data model there are two key types of tables- fact tables and dimension tables. Fact tables hold the data to be analyzed, dimensional tables provide categories and analysis values that organize the data. So we have our mission from Part 1: to analyze the "Acme does everything" [...]]]></description>
			<content:encoded><![CDATA[<p><img src="/wp-content/uploads/2009/02/four_million_rows_no_worries1.jpg" alt="four_million_rows_no_worries1" title="four_million_rows_no_worries1" width="300" height="136" class="alignright size-full wp-image-812" />In creating a data warehouse or data mart data model there are two key types of tables- fact tables and dimension tables.  Fact tables hold the data to be analyzed, dimension tables provide categories and analysis values that organize the data.<br />
So we have our <a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">mission from Part 1</a>: to analyze the "Acme does everything" company's purchasing data and find ways to save money.  The first step, however, is getting a handle on the data.  The IT department has given us the files, and with a smug smile told us to "have fun".  We've been given three files that are a snapshot of the purchasing data:</p>
<ul>
<li><strong>Item_Master.txt</strong>  - this holds all the items that Acme buys</li>
<li><strong>Vendor_Master.txt</strong> - this holds a list of all the vendors, with information such as their address</li>
<li><strong>PO_Detail.txt</strong> - this is the huge data set, all the purchase order data for the last four years</li>
</ul>
<p>The Item and Vendor files aren't very big, but PO_Detail is over 340 MB and holds almost four million purchase order lines.  Don't try to import it into Excel. You need Excel 2007 to even attempt 4 million rows; in Excel 2003, with its limit of 65,536 rows per sheet, it would take over sixty sheets and probably some VBA code.  I tried the import in Excel 2007: it takes 20 seconds just to tell me that I'll have to rerun the text file import several times, loading each batch onto a separate sheet. It took almost two minutes to load the first million rows.  Even once the data is spread across four sheets, it's not clear how to summarize millions of rows in Excel easily.<img src="/wp-content/uploads/2009/02/po_detail_columns.jpg" alt="po_detail_columns" title="po_detail_columns" width="247" height="398" class="alignright size-full wp-image-785" /></p>
<p>Instead, let's use the <a href="/product">Datamartist tool</a> to manage this data set and generate one that's more useful.</p>
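<p>For the curious, the sheet counts are easy to check. Here is a minimal sketch of the arithmetic, assuming a round 4,000,000 rows (the file actually holds slightly fewer) and ignoring header rows. The function name <code>sheets_needed</code> is just an illustration.</p>

```python
import math

# Documented worksheet row limits for each Excel version.
EXCEL_2003_MAX_ROWS = 65_536
EXCEL_2007_MAX_ROWS = 1_048_576


def sheets_needed(total_rows, rows_per_sheet):
    """Return how many worksheets are needed to hold total_rows."""
    return math.ceil(total_rows / rows_per_sheet)


# Assumption: 4,000,000 is a round upper bound for "almost four million" rows.
po_rows = 4_000_000
sheets_2003 = sheets_needed(po_rows, EXCEL_2003_MAX_ROWS)
sheets_2007 = sheets_needed(po_rows, EXCEL_2007_MAX_ROWS)
print(sheets_2003, sheets_2007)
```

<p>With those inputs the result is 62 sheets for Excel 2003 and 4 for Excel 2007.</p>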
<p>The first analysis we will do will be on the Vendor dimension, to determine who Acme's big vendors are, and if we can negotiate some price reductions where we have leverage.</p>
<p>In Datamartist, very large files are not an issue because the tool can load just a preview of the data. This makes it possible to look at a sample of a few hundred thousand rows and design the transformation before running it on the whole data set.</p>
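<p>The same preview idea can be sketched outside the tool. This is not how Datamartist implements it internally; it is only an illustration in Python of reading the first N rows of a delimited file without loading the rest. The tab delimiter and the column names (<code>VENDOR_ID</code>, <code>ORDER_DATE</code>, <code>QTY</code>, <code>PRICE</code>) are assumptions, and a tiny in-memory sample stands in for PO_Detail.txt.</p>

```python
import csv
import io
from itertools import islice


def load_preview(text_stream, max_rows, delimiter="\t"):
    """Read at most max_rows data rows from a delimited text stream."""
    reader = csv.DictReader(text_stream, delimiter=delimiter)
    # islice stops after max_rows, so the rest of the stream is never read.
    return list(islice(reader, max_rows))


# In-memory stand-in for PO_Detail.txt; column names are hypothetical.
sample = io.StringIO(
    "VENDOR_ID\tORDER_DATE\tQTY\tPRICE\n"
    "V001\t2008-01-15\t10\t2.50\n"
    "V002\t2008-01-20\t3\t100.00\n"
    "V001\t2008-02-02\t4\t2.50\n"
)
preview = load_preview(sample, max_rows=2)
print(len(preview), preview[0]["VENDOR_ID"])
```

<p>For a real file you would pass an open file handle, for example <code>open("PO_Detail.txt", newline="")</code>, in place of the sample.</p>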
<p>The PO Detail file has the columns shown. Let's answer the question: "Who are our biggest suppliers?"<br />
So which columns do we need?  We probably want some sense of trends over time, so we'll keep the <strong>order date</strong> but summarize it to <strong>Month</strong>. We'll keep the <strong>Vendor ID</strong> of course, and then we need the <strong>Quantity and Price</strong> fields to calculate the total amount spent.  Finally, we want to write this summarized data into Excel to check it out.</p>
<p>To do this in Datamartist, all it takes is four simple blocks: a Text import block to load in the PO_Detail.txt file, a Calculate block to multiply QTY by PRICE, a Summarize block to do all the summarizing, and an Excel export block to generate the Excel file:</p>
<p><img src="/wp-content/uploads/2009/02/po_detail_summarize_blocks.jpg" alt="po_detail_summarize_blocks" title="po_detail_summarize_blocks" width="463" height="92" class="alignnone size-full wp-image-806" /></p>
<p>Each block passes its result to the next block via the connectors, and the last block saves it to an Excel file we've specified.</p>
<p>Defining the calculation uses standard spreadsheet functions. Here's what the config area looks like:<br />
<img src="/wp-content/uploads/2009/02/calculate_total_closeup.jpg" alt="calculate_total_closeup" title="calculate_total_closeup" width="400" height="91" class="alignnone size-full wp-image-801" /></p>
<p>And defining the summary is as simple as it looks: pick the columns you want, and select what kind of summary you want done.<br />
<img src="/wp-content/uploads/2009/02/summary_block_closeup1.jpg" alt="summary_block_closeup1" title="summary_block_closeup1" width="417" height="111" class="alignnone size-full wp-image-797" /></p>
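<p>If you want to see the logic of the Calculate and Summarize blocks written out, here is a rough Python equivalent. It is a sketch under stated assumptions, not what Datamartist runs internally: the column names are hypothetical, the file is assumed to be tab-delimited, and the order date is assumed to be in ISO <code>YYYY-MM-DD</code> form so the month is simply its first seven characters. The Excel export step is left out.</p>

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def summarize_spend(rows):
    """Sum QTY * PRICE per (vendor, month) pair."""
    totals = defaultdict(Decimal)  # missing keys start at Decimal("0")
    for row in rows:
        # Assumption: ORDER_DATE is an ISO date, so [:7] yields "YYYY-MM".
        month = row["ORDER_DATE"][:7]
        # Calculate step: line total for this purchase order line.
        amount = Decimal(row["QTY"]) * Decimal(row["PRICE"])
        # Summarize step: group by vendor and month, summing the totals.
        totals[(row["VENDOR_ID"], month)] += amount
    return dict(totals)


# In-memory stand-in for PO_Detail.txt; column names are hypothetical.
sample = io.StringIO(
    "VENDOR_ID\tORDER_DATE\tQTY\tPRICE\n"
    "V001\t2008-01-15\t10\t2.50\n"
    "V001\t2008-01-28\t2\t2.50\n"
    "V002\t2008-01-20\t3\t100.00\n"
    "V001\t2008-02-02\t4\t2.50\n"
)
totals = summarize_spend(csv.DictReader(sample, delimiter="\t"))
for (vendor, month), total in sorted(totals.items()):
    print(vendor, month, total)
```

<p>Because the rows are consumed one at a time and only the running totals are kept, memory use depends on the number of vendor and month combinations rather than on the size of the input file.</p>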
<p>We run it on a preview set of 100,000 rows (which takes about twelve seconds) and check the output.</p>
<p>It looks good, so we run it on the whole 4 million rows:</p>
<p><img src="/wp-content/uploads/2009/02/summarize_progress_po_detail.jpg" alt="summarize_progress_po_detail" title="summarize_progress_po_detail" width="466" height="128" class="alignnone size-full wp-image-804" /></p>
<p>About seven minutes later we have our result: an Excel sheet with a manageable 130 thousand rows showing total spend by vendor, by month, for four years:<br />
<img src="/wp-content/uploads/2009/02/completed_po_detail_summary.jpg" alt="completed_po_detail_summary" title="completed_po_detail_summary" width="461" height="95" class="alignnone size-full wp-image-807" /></p>
<p>Next up we need to create our vendor dimension, and join it to this mini fact table we have created.  Stay tuned.</p>
<p>This is part of a five-part series. Here are the links to the various parts: <a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">1</a>, <a href="/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2">2</a>, <a href="/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3">3</a>, <a href="/hierarchies-and-tree-structures-in-dimensions-an-example-item-dimension-part-4">4</a> and <a href="/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5">5</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

