<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datamartist.com &#187; Spreadsheet Tips</title>
	<atom:link href="http://www.datamartist.com/category/spreadsheet-tips/feed" rel="self" type="application/rss+xml" />
	<link>http://www.datamartist.com</link>
	<description>Reduce cost with self serve data transformation</description>
	<lastBuildDate>Mon, 26 Jul 2010 18:33:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Purchasing Data Mart &#8211; cutting costs with analysis (Part 1)</title>
		<link>http://www.datamartist.com/purchasing-data-mart-cutting-costs-with-analysis-part-1</link>
		<comments>http://www.datamartist.com/purchasing-data-mart-cutting-costs-with-analysis-part-1#comments</comments>
		<pubDate>Tue, 27 Jan 2009 20:08:35 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Personal Data Marts]]></category>
		<category><![CDATA[Purchasing Analysis]]></category>
		<category><![CDATA[Spreadsheet Tips]]></category>
		<category><![CDATA[Accounts payable]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[Data Mart Example]]></category>
		<category><![CDATA[Data Warehouse Example]]></category>
		<category><![CDATA[Example Data mart]]></category>
		<category><![CDATA[Purchasing]]></category>
		<category><![CDATA[Purchasing Data Warehouse]]></category>

		<guid isPermaLink="false">http://www.datamartist.com.php5-2.dfw1-1.websitetestlink.com/?p=774</guid>
		<description><![CDATA[In these difficult economic times, cutting costs isn't just optimization, it's survival. You can't reduce what you can't quantify so it's critical to analyze the accounts payable (AP), or purchasing data to identify the areas where cost savings are possible. This is one of the most useful financial data marts because spending is often something [...]]]></description>
			<content:encoded><![CDATA[<p><img src="/wp-content/uploads/2009/02/purchasingdatamartgraphic-300x224.jpg" alt="purchasingdatamartgraphic" title="purchasingdatamartgraphic" width="300" height="224" class="alignright size-medium wp-image-775" />In these difficult economic times, cutting costs isn't just optimization, it's survival. You can't reduce what you can't quantify so it's critical to analyze the accounts payable (AP), or purchasing data to identify the areas where cost savings are possible.  This is one of the most useful financial data marts because spending is often something that can be controlled quickly once understood.</p>
<p>In the next series of posts I am going to walk through the design and implementation of a purchasing data mart, including its fact tables and dimensions, so that we can analyze some typical purchasing data.  I’ll build this data mart model using the <a href="/product">Datamartist tool</a>.  </p>
<p>This will create a “snapshot” analysis of purchasing data with a desktop data analysis tool that can be built quickly yet will access millions of rows of data, and deal with data quality issues such as duplicate rows.</p>
<p>For the purchasing data mart model that we’ll be defining, I’ll use a fictitious company that manufactures and sells a broad range of things- the "Acme does everything company".  </p>
<p>Acme is a long-standing enterprise, with a number of offices and factories in the US. But they’ve never done an in-depth analysis of their costs because they didn’t have to until now- profits were good, and the business was growing well.  But then the economy took a turn for the worse, and Acme’s customers are cutting back on pretty much everything.  Acme’s CFO has announced that if costs aren't reduced quickly, Acme is going to simply run out of cash.  He wants you to head up the analysis on the company’s purchases- where can Acme save?</p>
<p>I look forward to showcasing the functionality in the Datamartist tool that makes it possible to do this without programming, and without requiring database software, developers or servers.  This kind of snapshot, immediate data transformation is what we think will make Datamartist such a cost effective and efficient addition to any serious analyst's toolkit.</p>
<p>This is part of a 5 part series- here are the links to the various parts: <a href="/purchasing-data-mart-cutting-costs-with-analysis-part-1">1</a>, <a href="/creating-a-fact-table-with-the-vendor-dimension-purchasing-dm-part-2">2</a>, <a href="/connecting-the-dimension-table-to-the-fact-table-vendor-example-part-3">3</a>, <a href="/hierarchies-and-tree-structures-in-dimensions-an-example-item-dimension-part-4">4</a> and <a href="/joining-the-dimension-table-to-the-fact-table-purchasing-data-mart-part-5">5</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/purchasing-data-mart-cutting-costs-with-analysis-part-1/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Duplicate Data and removing duplicate records</title>
		<link>http://www.datamartist.com/duplicate-data-and-removing-duplicate-records</link>
		<comments>http://www.datamartist.com/duplicate-data-and-removing-duplicate-records#comments</comments>
		<pubDate>Wed, 15 Oct 2008 02:07:19 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Datamartist Tool]]></category>
		<category><![CDATA[Spreadsheet Tips]]></category>
		<category><![CDATA[Concatonated Keys]]></category>
		<category><![CDATA[Duplicate Data]]></category>
		<category><![CDATA[Fixing Data]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=159</guid>
		<description><![CDATA[Duplicate records, doubles, redundant data, duplicate rows; it doesn't matter what you call them, they are one of the biggest problems in any data analyst's life. There are lots of different types of data quality problems, but in this post I'll focus on Duplicates. I'll share some hints on how to find duplicate records and remove duplicate records, [...]]]></description>
			<content:encoded><![CDATA[<p><img src="/wp-content/uploads/2008/10/duplicate-customers-john-smith.jpg" alt="duplicate-customers-john-smith" title="duplicate-customers-john-smith" width="300" height="242" class="alignright size-full wp-image-1527" />Duplicate records, doubles, redundant data, duplicate rows; it doesn't matter what you call them, they are one of the biggest problems in any data analyst's life.</p>
<p>There are lots of different types of data quality problems, but in this post I'll focus on Duplicates.</p>
<p>I'll share some hints on how to find duplicate records and remove duplicate records, at least from your sight, if not from the source system.</p>
<h2>Duplicate Records</h2>
<p>A lot of the duplicate records that you're apt to meet belong to two distinct types.</p>
<h3>Non-unique Keys</h3>
<p style="text-align: left;">This is where two records in the same table have the same code or key, but may or may not have different values and meanings.  This can happen when you're mixing data, or when data is coming from non-database sources like text files (CSV files from a CSV import, say) or Excel files.  Databases usually have some sort of unique key, so they don't tend to have this problem- but if you merge data from two different databases the uniqueness might be lost.  For example, say you have an Oracle database (System 1) and a MySQL database (System 2), both of which use a "unique" integer to track products.  When you merge the two, you are going to have two of everything:</p>
<p style="text-align: center;"><img class="size-full wp-image-161 aligncenter" title="duplicate-keys" src="/wp-content/uploads/2008/10/duplicate-keys.jpg" alt="" width="274" height="94" /></p>
<p>Notice I've added a column that specifies the source system- where the record came from.  This is the first step in solving this problem: you need to concatenate, or combine, the keys.  Although "Product Key" is not unique by itself, "Source System" + "Product Key" is unique, because each source system is internally unique. Now there is a trick to concatenation- <strong>add a string of unusual characters when combining</strong>.  This ensures that two different key pairs can't, by bad luck, combine into the same string (without a separator, "1" + "23" and "12" + "3" would both become "123").  Here's a different example that illustrates the point:</p>
<p style="text-align: center;"><img class="aligncenter size-full wp-image-162" title="concatonatedgoodandbad" src="/wp-content/uploads/2008/10/concatonatedgoodandbad.jpg" alt="" width="500" height="120" /></p>
<p>I like to use one or more pipe "|" characters because the pipe is often not present, or even not allowed, in source data and codes. Of course, you need a tool that is willing to accept that character as part of a string key for this to work.  If you are doing this in Excel, use the "&amp;" operator to concatenate fields together and add in other characters as needed.  The above example used the following syntax in the formula:  ="||" &amp; A1 &amp; "||||" &amp; B1 &amp; "||"</p>
<p>It's a trick you can use to make VLOOKUP more effective.  Another example: say you have a list that has first name, last name, and some address information.  First Name + Last Name might not be unique on its own- throw in the street address though, and chances are you can get a more accurate list of unique people from a key point of view.  Of course this doesn't solve the John Smith, J. Smith, Johnny Smith, Johnathan Smith problem, or addresses like 123 Any Street vs 123 Any St. vs 123 Any Avenue (often all the same, with errors in data entry)... which leads us to:</p>
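<p>For readers working outside Excel, the same concatenation trick can be sketched in Python.  This is only an illustration of the idea, not how any particular tool does it; the function name <code>composite_key</code> and the sample key values are made up for the example.</p>

```python
def composite_key(parts, sep="||||"):
    """Join key parts with an unusual separator, wrapped in "||" like the Excel formula."""
    return "||" + sep.join(str(p) for p in parts) + "||"

# Without a separator, two different key pairs collapse into the same string.
naive_a = "1" + "23"
naive_b = "12" + "3"

# With the separator, the two pairs stay distinct.
safe_a = composite_key(["1", "23"])
safe_b = composite_key(["12", "3"])

print(naive_a == naive_b)  # True
print(safe_a == safe_b)    # False
print(safe_a)              # ||1||||23||
```

<p>The separator only helps if it cannot appear inside the key values themselves, which is why an unusual character such as the pipe is a reasonable default.</p>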
<h3>Duplicate Meaning</h3>
<p>This is more common, and sometimes harder to deal with. In the example above, even though you can fix the duplicate key problem by concatenating a code for the source system to the key (along with some unused characters to ensure no "gotchas"), it's pretty clear that "Television" and "TV" are probably the same thing, and you don't really want to see two products.  These types of duplicates are often the most damaging to good analysis.  Everything works, but your reports are difficult to read, or worse, you make decisions based on your "top 20 products" when in fact 15 of them are not in the top twenty at all, because the REAL best sellers got split between "TV" and "Television" and "TV Screen" etc. Some automated duplicate detection tools exist (particularly in the area of people's names and addresses), but in the end, for many types of data, it's the old human eyeball that has to do the work- and you need some sort of system to keep a map of all the duplicates you've identified.</p>
<p>And obviously, you know by now where all this is going-  the tool you need is the tool I'm creating; Datamartist.</p>
<p>Here are some teaser screen shots of the work in progress, and examples of the functionality that deals with the two problems we've discussed above:</p>
<p><img class="alignright size-medium wp-image-183" title="duplicatekeys" src="/wp-content/uploads/2008/10/duplicatekeys-300x185.jpg" alt="" width="300" height="185" />To resolve duplicate keys, Datamartist scans the data and lets you select keys and experiment with as many as you like (doing the concatenation trick that I described above automatically).  It informs you which keys are duplicated, and shows you the duplicates for the various key combinations.  If there is no way around it, you can keep a non-unique key- Datamartist will fix the reference by taking the value that is most common within a given data set and mapping attributes from that record, giving you a clean, unique reference set to work with, and eliminating that handful of bad records that are messing things up.</p>
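<p>The "most common value wins" idea can be sketched in a few lines of Python.  To be clear, this is a rough illustration of the general technique and not Datamartist's actual implementation; the function name, field names and sample rows are all hypothetical.</p>

```python
from collections import Counter, defaultdict

def most_common_per_key(rows, key_field, value_field):
    """Return {key: most frequent value} for a list of dict rows."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[key_field]][row[value_field]] += 1
    # most_common(1) returns a one-item list of (value, count) pairs
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

rows = [
    {"product_key": 100, "name": "Television"},
    {"product_key": 100, "name": "Television"},
    {"product_key": 100, "name": "Telvision"},  # one of the handful of bad records
    {"product_key": 200, "name": "Radio"},
]

reference = most_common_per_key(rows, "product_key", "name")
print(reference)  # {100: 'Television', 200: 'Radio'}
```

<p>If two values are tied for a key, this sketch keeps whichever was seen first, so a real data set might need an explicit tie-breaking rule.</p>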
<p> </p>
<p><img class="alignleft size-medium wp-image-184" style="border: white 10px solid;" title="duplicatesmiths" src="/wp-content/uploads/2008/10/duplicatesmiths-300x145.jpg" alt="" width="300" height="145" />In the case of the second type of duplicates, Datamartist provides a filter/search capability to let you find all the duplicate rows (with "Smith" as the last name, for example). Then it allows you to identify which records are the "Master" and which are to be treated as duplicates.  From that point on, the duplicates are mapped to the master, and the reference set shows a single, consistent set of data. </p>
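<p>Whatever tool you use, the underlying structure is a simple map from each duplicate to its master.  Here is a minimal Python sketch of that idea using the "TV" and "Television" example from earlier; the map contents and the sales figures are invented for illustration.</p>

```python
# Hypothetical map from each duplicate spelling to its chosen "master" record.
master_map = {
    "TV": "Television",
    "TV Screen": "Television",
}

# Made-up sales rows: (product name as it appears in the source, amount)
sales = [
    ("Television", 1200),
    ("TV", 900),
    ("TV Screen", 300),
    ("Radio", 500),
]

totals = {}
for product, amount in sales:
    master = master_map.get(product, product)  # unmapped names are their own master
    totals[master] = totals.get(master, 0) + amount

print(totals)  # {'Television': 2400, 'Radio': 500}
```

<p>Because the map is kept separate from the data, it can be applied again unchanged when a new file arrives with the same underlying keys.</p>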
<p> In both these cases, and as a general rule of how Datamartist works, the mapping and configuration you do is not lost if you change input files, or update with new data.  As long as the minimum data structure consistency is there, the mapping you did stays with you, so you only have to do it once. You might need to remap some field names, but Datamartist lets you do that easily too, so the same mapping can be used to analyze many different data sets that use the same underlying keys (and would have had the same underlying data quality issues).</p>
<p> <a href="/download">Download Datamartist now</a>- see the de-duplication functionality in action.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/duplicate-data-and-removing-duplicate-records/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Importing Data into Excel</title>
		<link>http://www.datamartist.com/importing-data-into-excel</link>
		<comments>http://www.datamartist.com/importing-data-into-excel#comments</comments>
		<pubDate>Mon, 01 Sep 2008 15:50:43 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Personal Data Marts]]></category>
		<category><![CDATA[Spreadsheet Tips]]></category>
		<category><![CDATA[Business Intelligence]]></category>
		<category><![CDATA[Excel]]></category>
		<category><![CDATA[Excel Data Import]]></category>
		<category><![CDATA[Excel Performance]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=20</guid>
		<description><![CDATA[I've seen lots of Business Intelligence (BI) solutions, (data marts, data warehouses and the accompanying reports and dashboards) using all sorts of different tools. But I'll tell you- NO tool has yet been as successful as Microsoft Excel for providing a do it yourself data analysis platform to import data into. Now, I'm not suggesting [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-107" title="excelisking3" src="/wp-content/uploads/2008/09/excelisking3.jpg" alt="" width="254" height="269" />I've seen lots of <a href="http://en.wikipedia.org/wiki/Business_intelligence" target="_blank">Business Intelligence</a> (BI) solutions, (<a href="http://en.wikipedia.org/wiki/Data_mart" target="_blank">data marts</a>, <a href="http://en.wikipedia.org/wiki/Data_warehouse" target="_blank">data warehouses</a> and the accompanying reports and dashboards) using all sorts of <a href="http://en.wikipedia.org/wiki/Business_intelligence_tools" target="_blank">different tools</a>. But I'll tell you- NO tool has yet been as successful as Microsoft Excel for providing a do it yourself data analysis platform to import data into. Now, I'm not suggesting that Excel (even when used with the <a href="/product">upcoming Datamartist tool </a> <img src='http://www.datamartist.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  ) will make traditional data marts obsolete. Clearly the <a title="Market Growth of Enterprise BI" href="http://www.gartner.com/it/page.jsp?id=580708" target="_blank">billions of dollars being spent on "enterprise BI"</a>are not going to dry up. But there are enough times you have to wait- or your needs are "too specific"- for a large BI project. Often the existing data marts or data warehouses will be the source of raw data. But you will still need to prepare data for Excel import. In the next few posts I'm going to discuss various aspects of using excel for data analysis. In this first part, I'll talk about data size in excel and performance which is important - when should you import the data? Import the HUGE raw file, or treat it before import to reduce its size?</p>
<h2>Data Size Limits in Excel</h2>
<p>There are different types of limits-</p>
<ol>
<li>The size in rows and columns the actual spreadsheet has.</li>
<li>Excel's (and your PC's) ability to crunch the numbers in a reasonable time. (RAM, CPU)</li>
<li>The size of the files involved and load and save times.</li>
</ol>
<p>In Excel 2003, a spreadsheet has rows 1 to 65 536 and columns A to IV. This makes it a grid 256 X 65536. In Excel 2007 the spreadsheet is much, much larger, with rows from 1 to 1 048 576 and columns from A to XFD. (Making a grid 16384 X 1 048 576).<a href="/wp-content/uploads/2008/09/importtoexcel1.jpg"><img class="alignright size-medium wp-image-63" title="importtoexcel1" src="/wp-content/uploads/2008/09/importtoexcel1-300x227.jpg" alt="" width="300" height="227" /></a> Now before you get too excited about how much space you have in 2007, the reality is that limits number 2 and 3 define how you can actually use that space. But it is more, and more is good.</p>
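<p>To put those grid sizes in perspective, here is a small Python sketch that multiplies them out and checks whether a table of a given shape would fit on one sheet.  The limits are the ones quoted above; the helper name <code>fits</code> and the 250,000-row test table are just assumptions for the example.</p>

```python
# Worksheet limits quoted in the text: (rows, columns)
LIMITS = {
    "Excel 2003": (65_536, 256),
    "Excel 2007": (1_048_576, 16_384),
}

def fits(version, n_rows, n_cols):
    """True if a table of n_rows x n_cols fits on a single worksheet."""
    max_rows, max_cols = LIMITS[version]
    return max_rows >= n_rows and max_cols >= n_cols

for version, (rows, cols) in LIMITS.items():
    print(version, rows * cols)
# Excel 2003 16777216
# Excel 2007 17179869184

print(fits("Excel 2003", 250_000, 4))  # False
print(fits("Excel 2007", 250_000, 4))  # True
```

<p>Fitting on the sheet is only the first of the three limits, of course; it says nothing about how long Excel will take to work with the data.</p>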
<p>So let's kick the tires on large data sets in Excel 2007. For these very informal tests I'm using a quad-core workstation with 4 GB of RAM, so the results I get represent a best case compared to a typical laptop or desktop PC. First of all- putting a million rows of data in Excel 2007 (even a "narrow table" of only 3-4 columns) slows everything down. Delete a column, and you'll often see a 5-10 second freeze-up while Excel churns away in the background- roughly the same amount of time needed to save the file. Plus, when I push it, I've had it lock up on me a few times- requiring some Ctrl-Alt-Del action to kill it. Even a narrow table such as this makes the Excel file 15-20 megabytes at minimum. For the particular text file I used, the .txt version was 9 MB, and the .xlsx file was double the size at 18 MB. I added a few columns and the file quickly became 80 MB.</p>
<p>Also, strangely, doing exactly the same thing multiple times results in very different times to complete- when I mention times, it's the average of 2-3 trials (see graph).</p>
<p> <a href="/wp-content/uploads/2008/09/excel-operations-times.jpg"><img class="size-medium wp-image-65 alignleft" title="excel-operations-times" src="/wp-content/uploads/2008/09/excel-operations-times-300x169.jpg" alt="" width="300" height="169" /></a>All in all, although Excel 2007 can technically store a million rows, I'd advise against it. There are other reasons it's a pain- scroll bars and page-up page-down don't scale well to 1M rows. It's just hard to copy 250,000 rows accurately- it takes forever to get to the end, and then you overshoot by a mile, and page up again forever to find it etc. etc. (And yes, you can use the Go To command under Home&gt;Editing&gt;Find and Select&gt;Go To- but a model of ease it's not.)</p>
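<p>One way to "treat it before import" is to summarize the raw file down to the level you actually report on, and import only the summary.  The Python sketch below is a hedged example of that approach and not something from my tests above; the column names (<code>region</code>, <code>product</code>, <code>amount</code>) and the tiny in-memory file are stand-ins for whatever your real export looks like.</p>

```python
import csv
import io
from collections import defaultdict

# Stand-in for a huge raw export; in practice this would be an open file.
raw = io.StringIO(
    "region,product,amount\n"
    "East,TV,100\n"
    "East,TV,250\n"
    "West,Radio,75\n"
    "East,Radio,20\n"
)

# Stream the rows one at a time and keep only a running total per group.
totals = defaultdict(float)
for row in csv.DictReader(raw):
    totals[(row["region"], row["product"])] += float(row["amount"])

# Write the much smaller summary as CSV, ready to import into Excel.
summary = io.StringIO()
writer = csv.writer(summary, lineterminator="\n")
writer.writerow(["region", "product", "total_amount"])
for (region, product), total in sorted(totals.items()):
    writer.writerow([region, product, total])

print(summary.getvalue())
```

<p>Because the rows are streamed, memory use depends on the number of distinct groups rather than the number of raw rows, which is what makes this workable for files far bigger than a worksheet.</p>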
<p>I can tell you, however, that using all the other features on more reasonable data sets (up to, say, 100 k rows), I LOVE what it can do in terms of analysis and reporting. Once you have the data in reasonable result sets, there is in my opinion no better place to have it than in Excel if you want full control. But how do you get it there? Next posts: how to link to data in Access and build a mini personal data mart. We'll learn how to make a personal data mart given the currently available tools. (And you just know there will be some posts later where I show you how to do the same thing, but using Datamartist. ) <strong>Update:  Datamartist now available.</strong>  <a href="/downloads">Download the tool now</a>, and find a whole new way to transform and manage your data, including <strong>managing huge data imports into Excel</strong>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/importing-data-into-excel/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Setting the stage: managing data issues</title>
		<link>http://www.datamartist.com/setting-the-stage-managing-data-issues</link>
		<comments>http://www.datamartist.com/setting-the-stage-managing-data-issues#comments</comments>
		<pubDate>Thu, 01 May 2008 17:56:55 +0000</pubDate>
		<dc:creator>James Standen</dc:creator>
				<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Spreadsheet Tips]]></category>
		<category><![CDATA[Excel]]></category>

		<guid isPermaLink="false">http://www.datamartist.com/?p=12</guid>
		<description><![CDATA[Anyone who has done any data analysis with more than a few lines of data knows that some of the biggest time wasters are data quality issues. What is bad data?  Well, some of it is easy to see, some is downright impossible to find. Let's look at an easy example; a row of data where the country [...]]]></description>
			<content:encoded><![CDATA[<p>Anyone who has done any data analysis with more than a few lines of data knows that some of the biggest time wasters are data quality issues.</p>
<p>What is bad data?</p>
<p> Well, some of it is easy to see, some is downright impossible to find. Let's look at an easy example: a row of data where the country is "US" and the state/province is "Ontario".</p>
<p> You just know both those values can't be right.  So why did the source system let it happen?  Good question- when the programmers of the application tell you, let me know...</p>
<p>So for this easy one, should we assume that the country is right, and change the state to Michigan since that's close to Ontario?  Or maybe Ohio because they both start with "O"?</p>
<p>The right answer is of course to go back to the source and fix the problem-  but if you've got hundreds or thousands of users, an application that can't be modified to stop this type of error at the source, and an IT department that is overworked, then that is probably not an option.</p>
<p>Say your company does $500 million of business in a year, and the Ontario, US data represents $1500 of sales- for practical analysis purposes it just doesn't really matter where it goes, just so long as you don't see Ontario as a State in the report you give the CEO.</p>
<p>Often the solution in Excel is to just "fix it"- but if you reload the data each month, you have to go back and fix it again and again.</p>
<p>Or maybe you only load in the new month's data, so you don't overwrite last month's.  This works until some definitions change, and now all the historical data is out of sync with the new month's data.  A good example of this is if you have sales regions.  If the regions are changed (new ones added, existing ones split up or merged) then the historical data you have on your machine will have to be dumped to get the sales region codes corrected for the past.  But then all your fixes have to be redone-  could be a real nightmare if you've been using it for a while. </p>
<p>Another issue is that eventually the data in the source system might get fixed- and it turns out the IT department's fix wasn't the same as yours-  Ohio rather than Michigan- so you have more discrepancies to chase down.</p>
<p>On top of that, even though the amounts are small, it's disconcerting to see different spreadsheets show different totals because you "fixed it" in one of them and didn't in the other- and that's just within your own spreadsheets, not to mention the spreadsheets of your colleagues.  I hate it when my boss sees three different numbers.</p>
<p>The key is to create a single data-set from all your sources, fix the problem once, and do it in a way that the entire data set can be refreshed automatically, the fixes (those that still apply) can be "re-run" and then all your spreadsheets link to THAT single version of your truth.  The more you share this "master" sheet with your co-workers the better.</p>
<p>Of course, that's the whole point of having data warehouses and data marts and Enterprise business intelligence systems.  But what if the analysis you need to do hasn't been covered, or isn't scheduled to go live for another eight months?</p>
<p>If it's just you and your best friend Excel, then here are some pointers:</p>
<ul>
<li><strong>Stage your data.</strong>  Don't make 10 spreadsheets that all take data from the raw source, rather make one spreadsheet that is the "Fixed data", and have all other spreadsheets link to this. This will mean you will have a single "staging" spreadsheet, and then a number of "reporting" spreadsheets.</li>
<li><strong>Record all your "fixes" on a sheet in this staging spreadsheet</strong> called "Known Issues".  The ideal would be to automate it so that the fixes get applied each time you reload the entire data-set, but by at least having a clear record you will be able to quickly get the data where you want it each time you reload.</li>
<li><strong>Don't think about report layout or formatting in your staging spreadsheet</strong>-  keep the data in simple tables that are closely aligned with how you get the raw data, where every column has the same kind of value in it all the way down.  If you have different data-sets, use different sheets.  Do the reporting in your reporting spreadsheets, where you can have tables with different mixes of data from different sources.</li>
</ul>
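<p>If you do want to automate the "Known Issues" idea, here is a rough Python sketch of what re-running recorded fixes against freshly loaded data could look like.  Everything in it is an assumption made for illustration: the structure of the issue log, the function name <code>apply_fixes</code>, and in particular the choice to turn "US" into "Canada", which is just one arbitrary resolution of the Ontario example.</p>

```python
# Hypothetical "Known Issues" log: which rows to match, and what to change on them.
KNOWN_ISSUES = [
    {"match": {"country": "US", "state": "Ontario"},
     "set": {"country": "Canada"},  # arbitrary choice of fix, for illustration only
     "note": "Ontario is a Canadian province"},
]

def apply_fixes(rows, issues):
    """Return fixed copies of rows plus a count of how often each fix applied."""
    applied = [0] * len(issues)
    fixed_rows = []
    for row in rows:
        row = dict(row)  # copy, so the raw data is left untouched
        for i, issue in enumerate(issues):
            if all(row.get(k) == v for k, v in issue["match"].items()):
                row.update(issue["set"])
                applied[i] += 1
        fixed_rows.append(row)
    return fixed_rows, applied

raw_rows = [
    {"country": "US", "state": "Ontario", "sales": 1500},
    {"country": "US", "state": "Ohio", "sales": 9000},
]

staged, applied = apply_fixes(raw_rows, KNOWN_ISSUES)
print(staged[0]["country"], applied)  # Canada [1]
```

<p>The per-fix counts are there so that, after a reload, you can see which fixes no longer match anything- a hint that the source system may have been corrected in the meantime.</p>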
<p>The more macros or scripts you can use the better- but even a set order of cut and paste, followed to the letter (write it down on a sheet called "Refresh Steps" in the staging spreadsheet maybe), will reduce the amount of time it takes to update each time new data is available.  You only do the fixes once, and if you've linked the staging spreadsheet cleanly to the reporting sheets then the rework will be reduced.</p>
<p> Of course, it's still a lot of work.  In a nutshell, that's why I'm working on the Datamartist tool, which will automate this and much more, allowing you to manage your spreadsheets much more effectively, easily and without programming.  Stay tuned, and in the meantime happy staging.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamartist.com/setting-the-stage-managing-data-issues/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
