ETL – Datamartist.com

A simple ETL tool with data profiling tools built in

James Standen — Thu, 08 Jul 2010 04:36:36 +0000

Datamartist is a new idea in ETL and data profiling tools. It gives people who are serious about getting at their data a powerful, simple to use, right sized tool.

Easy to install
Easy to use
ETL features and data profiling capability

Avoid using the wrong tool for the job

Enterprise ETL (Extract, Transform and Load) tools are very powerful but often extremely difficult to use. They typically:

are expensive, particularly if multiple environments are needed
require server infrastructure, configuration and setup
require expensive developers who have been trained in the specific programming language of each particular vendor’s tool
are designed for performance and data volume, not ease of use

Obviously they have their time and place, but when you want fast, visual access to your data, you end up getting slowed down by expensive ETL server overkill.

A better choice- the visual, clean ETL tool

Datamartist is designed to let you extract data from multiple sources, and then mix it, match it, transform it, and understand it.

It uses a visual block and connector model, with the concept of “Data canvases” that let you easily manage and simplify complex data transformations. But unlike many overly complex ETL tools, Datamartist provides visual, configurable blocks, rather than requiring code.

Easy to install

Datamartist installs in minutes, and runs on your desktop, giving you control of your data, and what you need to do. Don’t configure servers, don’t worry about installing the right version of Java, don’t spend hours searching wikis and forums and tweaking config files. Just download it, single step install it, and use it.

It makes it easy for you to take a snapshot of the data you need- locally with cut and paste or drag and drop from files, and locally or remotely with native connections to SQL Server, Oracle, MySQL and MS Access, and pretty much anything else via ODBC.

And since the Datamartist data transformation engine can be run from the command line or scripted, it can also be automated to implement ETL tasks running on a Windows server.

Speed up data delivery, reduce cost.

Datamartist provides a flexible, simple to use ETL environment that will let you shorten your time to delivery significantly for a wide range of data transformation tasks.

Deliver small and medium sized data transformation tasks more quickly
Build rapid prototypes and proofs of concepts
Automate data profiling and data quality monitoring

Give the Datamartist ETL Tool a try

You can download the Datamartist trial and be up and running in minutes. You don’t even have to register- and you will have full access to a fully functioning version of Datamartist to try out this simple, visual ETL tool on your own data.

We’re also very excited about V1.3.0, currently in private beta. If you’d like to participate in the public beta, drop me a line at “beta at datamartist.com”, and we’ll send you a link when that download is available.

Mystery or Junk data warehouse dimensions

James Standen — Mon, 18 Jan 2010 17:10:46 +0000

Sometimes, when you are designing a star schema model, you’ll find yourself in a dilemma. You’ve come up with a beautiful design, right out of the pages of a Ralph Kimball book with 5 dimensions, and 5 measures, and you are on your way to star schema heaven when suddenly the users start asking awkward questions- where is such and such flag? Where’s the transaction type? Why can’t I sort based on the “e7” code from the system?

You can try to explain to them that pure star schemas should not be cluttered with a bunch of tiny dimensions, that your fact table just won’t stand for 100 million rows of the e7 code, and besides, computery things like transaction codes should not be in a business-savvy data model. But face it, after some digging you determine the user is right (happens quite often in fact)- they really do use that information, it is critical that you include it, and you don’t have the time or budget to make the perfect data warehouse.

So how do you deliver to them what they need, and avoid messing up your dimensional model?

One answer is to create one or more junk dimensions, sometimes also referred to as mystery dimensions.

In the end, although the content of a mystery dimension may or may not be mysterious, there is nothing particularly mysterious about how to implement this type of dimension table.

Even if it’s perfectly clear what each column is, there are often a number of them with very low cardinality (that is, they have very few distinct values). It really does not make sense to add columns in the fact table for each one, or to have a bunch of tiny dimension tables with only a handful of rows in them.

Faced with this the data architect can wrap all these columns up into a junk dimension.

A junk dimension is a dimension that holds all the unique combinations of a set of columns, and assigns a unique key. This key is what is stored in the fact table, in the mystery dimension column.

Let’s look at a mystery dimension example. We’ll make up an example dimension that’s very small for simplicity’s sake. Let’s say that the transactional table that is used to generate one of our facts has three columns, “Zortz”, “a3” and “uudl”, which fully satisfy our mystery dimension criteria (i.e. we don’t know what they are, but people use them in queries).

“Zortz” is a true/false value, “a3” is one of two values, “Confirmed” or “Pending”, and “uudl” is either “” or “k”. All the possible combinations of these values would be put into a dimension table and assigned an integer surrogate key. Thus the mystery dimension table would contain eight rows (2 x 2 x 2 combinations), each with its own surrogate key.
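As a minimal sketch of how the rows of such a table could be generated, the Python below enumerates every combination of the three example columns. The column name `mystery_key` and the choice to number the keys from 1 in enumeration order are assumptions made for illustration; any unique integers would do.

```python
from itertools import product

# Allowed values for each low-cardinality column in the example.
COLUMN_VALUES = {
    "Zortz": [True, False],
    "a3": ["Confirmed", "Pending"],
    "uudl": ["", "k"],
}


def build_junk_dimension(column_values):
    """Return one row per combination of values, each with an integer surrogate key."""
    names = list(column_values)
    rows = []
    # product() enumerates every combination; keys are assigned 1, 2, 3, ...
    # ("mystery_key" is a hypothetical column name.)
    for key, combo in enumerate(product(*column_values.values()), start=1):
        row = {"mystery_key": key}
        row.update(zip(names, combo))
        rows.append(row)
    return rows


junk_dimension = build_junk_dimension(COLUMN_VALUES)

if __name__ == "__main__":
    for row in junk_dimension:
        print(row)
```

The fact table then stores only the `mystery_key` value for each transaction.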

A key consideration when forming mystery dimensions is how many combinations exist. If the number of combinations is too high the mystery dimensions size may be unmanageable.

And be careful assuming that all the combinations have been used yet. You are safe if the data type has a fixed set of values (like Boolean, or codes from a known finite set) because you can be sure you’ve created a dimension row for every combination.

But if there are free form string columns, then you need to make sure your ETL is able to generate new dimension rows and surrogate keys as new combinations are created in the source system. This might still be worthwhile, depending on how many new combinations get created.
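One way to picture this "look up the key, or create a new row" step is the sketch below. It keeps the dimension in memory purely for illustration; the class `JunkDimension`, its method names and the sample rows are all hypothetical, and a real ETL job would persist the dimension in the warehouse instead.

```python
class JunkDimension:
    """In-memory stand-in for a junk dimension table (hypothetical helper)."""

    def __init__(self, columns):
        self.columns = tuple(columns)
        self._key_by_combo = {}  # combination of values -> surrogate key
        self._next_key = 1

    def key_for(self, source_row):
        """Return the surrogate key for a source row, adding a dimension row if needed."""
        combo = tuple(source_row[c] for c in self.columns)
        if combo not in self._key_by_combo:
            # New combination seen in the source: assign the next surrogate key.
            self._key_by_combo[combo] = self._next_key
            self._next_key += 1
        return self._key_by_combo[combo]

    def row_count(self):
        return len(self._key_by_combo)


dim = JunkDimension(["Zortz", "a3", "uudl"])

# Made-up source rows; the last one carries a previously unseen "uudl" value.
source_rows = [
    {"amount": 10.0, "Zortz": True, "a3": "Confirmed", "uudl": ""},
    {"amount": 25.5, "Zortz": False, "a3": "Pending", "uudl": "k"},
    {"amount": 7.25, "Zortz": True, "a3": "Confirmed", "uudl": ""},
    {"amount": 3.0, "Zortz": True, "a3": "Confirmed", "uudl": "q"},
]

fact_rows = [
    {"amount": r["amount"], "mystery_key": dim.key_for(r)} for r in source_rows
]
```

Watching how fast `row_count()` grows from load to load is a simple way to judge whether a free form column is still worth keeping in the dimension.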

You can also manage the size of the mystery dimension tables by having 2 or more mystery dimensions, which might reduce the overall number of dimensional rows depending on the makeup of the data. Different columns and values may tend to cluster together and you will find that grouping them correctly makes say, two small mystery dimensions rather than one huge one.
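To see whether a split would help, you can count distinct combinations on a sample of the source data. The sketch below uses synthetic rows and hypothetical column names in which two pairs of columns cluster together; real data will rarely separate this cleanly.

```python
def distinct_combinations(rows, columns):
    """Count the dimension rows needed to cover the given columns."""
    return len({tuple(row[c] for c in columns) for row in rows})


# Synthetic sample: "status_group" depends only on "status",
# and "channel_type" depends only on "channel".
source_rows = [
    {
        "status": f"s{i}",
        "status_group": f"g{i // 5}",
        "channel": f"c{j}",
        "channel_type": f"t{j % 2}",
    }
    for i in range(10)
    for j in range(10)
]

one_big = distinct_combinations(
    source_rows, ["status", "status_group", "channel", "channel_type"]
)
status_dim = distinct_combinations(source_rows, ["status", "status_group"])
channel_dim = distinct_combinations(source_rows, ["channel", "channel_type"])

if __name__ == "__main__":
    print("one dimension:", one_big, "rows")
    print("two dimensions:", status_dim, "+", channel_dim, "rows")
```

The trade-off is that the fact table carries two key columns instead of one.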

If, however, the number of rows is manageable, a mystery dimension allows all the columns to be queryable while adding only one column to the fact table, providing a much more efficient solution than either creating multiple dimensions or leaving all the data in the fact table.

Moving these columns into a junk or “mystery” dimension also means fewer indexes on the fact table, which might be important depending on its size.

So if you find yourself telling your end users that they will just have to do without a column, think twice about it. The role of a data warehouse is to deliver the data- sometimes you just have to find the right packaging to get the job done.

Data quality at the burger joint

James Standen — Thu, 24 Sep 2009 00:22:46 +0000

I have noticed that when I go to a fast food outlet no matter what I get to drink with my meal it is almost always listed as “Cola” on the receipt. But I didn’t order Cola. Ever. Usually I get juice, or milk. So every time I order a burger, I’m clearly a source of bad quality data.

I have looked over the counter on many occasions while I was waiting for my burger and watched the server key in other people’s orders; their fingers flew across the key pads, but only ever hit the cola key (always in the more central location it seemed). I could actually see the extra wear on the surface of the touch pad. I suspect that the number one reason for keypad replacement in the fast food industry is “cola key not working”. I am guessing that employees understand that speed is important, it is fast food after all. I wonder how much data quality is discussed.

Now, this is hardly a scientific study, and falls clearly in the “anecdotal evidence” column, but when I see this it strikes me that somewhere a data warehouse is probably capturing my drink, and it’s being ignored because all the analysts know that it’s always cola.

Or, perhaps, there is a complicated ETL job, one that took hundreds of hours of expensive consulting time to write, that cross-feeds drink information from the inventory system tracking the quantities of syrup required by each location, and then estimates the drinks sold by randomly allocating those percentages across the number of meals sold.

If this were done you would not have good information about who drank what with what- is orange soda or milk more popular with the cheese burger? Are the fancy fruit drinks (which have a lower margin) more likely to be ordered by people getting the spicy wrap or the regular? What is the real margin on each meal taking into account the drink?

Or maybe the drink dimension is a special dimension that only shows drink categories at a summarized level because that’s the granularity the inventory system uses.

Messy. Reduces the value of the information, hard to explain to the end users. But what can you do if you don’t collect the data at the level of each individual order?

Of course, the point is not that I think this wrong drink keyed in issue is an important one for the fast food industry. The point is that if the information at the point of capture is wrong, we can spend far more effort on the extract, transform and load (ETL) logic than we need to, with little or no result.

In fact, if we spend enough time on the ETL to make the final data warehouse data appear to be telling us something, it might even be damaging, since the ETL itself might be generating patterns that don’t exist, and will lead analysts down dead ends, forever chasing the apparent relationship between Dr. Pepper and curly fries.

And like many issues in business intelligence, and data quality in general, the root problem is one of process and people. Here is what I think the problem is: it is harder to have a thousand data entry people be careful about their data capture than to hire one ETL developer to write some crazy twenty thousand dollar chunk of code. The result of this is that instead of fixing the problem at its source- in this case right at the point of order- we try to fix it in the database, after the fact.

We need to get the people on the ground who actually experience the event to be motivated to get the data right, right from the start.

Obviously this is true for retail, it’s true for the loading dock, it’s true for the order desk, it’s true even for self-serve and online processes where the data is coming directly from the customer. It’s true for all data. Get it right as soon as it goes in, and you’ve won a big part of the battle.

In my experience, often the key is to only ask for the information you really want, and when you do ask for it, make it clear that it must be accurate, and put in place closed-loop processes that ensure it is. Syrup purchases don’t match the 100% cola data? Ask why. Include data accuracy as part of the store supervisor’s assessment criteria. Obviously, the more the process can be automated with bar codes, radio frequency ID tags (RFID) or other technologies, the better.
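A closed-loop check of this kind can be very simple. The sketch below compares the drink mix recorded at the till with the mix implied by syrup usage, and flags drinks whose share differs by more than a tolerance. All of the numbers, the drink names, the 10% tolerance and the function names are made up for illustration.

```python
def shares(counts):
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


def mismatched_drinks(recorded, implied, tolerance=0.10):
    """Return drinks whose recorded share differs from the implied share by more than tolerance."""
    rec, imp = shares(recorded), shares(implied)
    drinks = sorted(set(rec) | set(imp))
    return [d for d in drinks if abs(rec.get(d, 0.0) - imp.get(d, 0.0)) > tolerance]


# Hypothetical week for one store: what the till says versus what the syrup says.
till_counts = {"cola": 980, "orange soda": 15, "lemon-lime": 5}
syrup_servings = {"cola": 600, "orange soda": 250, "lemon-lime": 150}

flagged = mismatched_drinks(till_counts, syrup_servings)

if __name__ == "__main__":
    print("Ask why:", flagged)
```

A non-empty list is the cue for a conversation with the store, not something for the ETL to paper over.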

Data quality starts on the ground. The further from the ground, and the deeper into various operational systems, ETL jobs, staging tables, data warehouses or data marts we try to fix the problem, the harder it will be.

Making rapid prototypes for data warehouse ETL jobs

James Standen — Mon, 14 Sep 2009 20:39:00 +0000

Data warehouses and even data marts can be expensive, complex projects. They are not projects to start lightly, and they are not projects that you want to launch without doing some solid planning.

But there is a way to get a handle on the tricky parts of your data warehouse scope, and to reduce your project’s overall cost.

The major cost component of any data warehouse project is the Extract Transform and Load (ETL) development. Obviously every project is slightly different, but in my experience ETL will often make up in the order of 70% of the development cost. One of the drivers of this cost is the relatively high priced ETL development resources required. In the markets where I’ve hired resources, an ETL developer will often demand a 30-40% higher hourly rate than a business intelligence report writer, for example.

Making ETL prototypes will give you insights that can reduce cost by shortening the ETL development process and making the optimum use of those highly talented and expensive ETL resources.

What affects the cost and complexity of ETL jobs?

For any given scope, the following will have a large impact on the number and complexity of ETL jobs and therefore their cost.

The number of different data sources involved.
The consistency in terms of master data definitions between systems.
The level of data quality in the systems.

Ideally, you want to get a good handle on these three things before you hire all the ETL developers, and be confident that you are going to satisfy the users’ needs before millions of dollars are spent on Extract Transform and Load (ETL) jobs and business intelligence reports.

One part of the preparation needed to do this can be the creation of a proof of concept or mockup of key parts of the data warehouse ETL deliverable.

Now, there are mockups, there are prototypes, and there are “first versions”. The most effective approach is to create a mockup or prototype that:

Goes just deep enough into the data to:
- Establish all data sources that will be required
- Get a high level audit of their master data and data quality
Provides enough output that:
- End users can be supplied with example reports or cubes to get hands on
- The functional scope can be locked down with confidence on all sides.
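For the high level audit in particular, even a very small profiling pass over a sample of each source table tells you a lot. The sketch below counts missing and distinct values per column; the function name `profile`, the sample rows and the rule that an empty string counts as missing are assumptions for illustration.

```python
def profile(rows, columns):
    """Return row, missing and distinct-value counts for each column."""
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None and v != ""]
        report[col] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


# Made-up sample of a customer table from one source system.
sample = [
    {"customer_id": 1, "country": "CA"},
    {"customer_id": 2, "country": "ca"},
    {"customer_id": 3, "country": ""},
    {"customer_id": 3, "country": "US"},
]

audit = profile(sample, ["customer_id", "country"])

if __name__ == "__main__":
    for column, stats in audit.items():
        print(column, stats)
```

Here the audit would surface a duplicated customer id, a missing country, and inconsistent country codes, all before any ETL job is written.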

The goal of a data warehouse prototype is to learn about the underlying data, and to be able to try different data transformation techniques and approaches on the real data. The goal is not to make the finished product, nor to deliver actually usable reports to end users, although it may be to generate an example result for users to validate.

An example might be to create a prototype to calculate total sales by segment for a period under a new customer segmentation. This would identify if the segmentation rules that have been suggested actually result in the expected segmentation of sales data, and if the fields involved are complete and correctly populated in the source systems.
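A throwaway version of that prototype might look like the sketch below. The segmentation rule (500 or more employees means “Enterprise”), the field names and the sample data are all hypothetical; the point is that rows which cannot be segmented end up in a visible “UNKNOWN” bucket rather than disappearing.

```python
from collections import defaultdict


def assign_segment(customer):
    """Apply a hypothetical segmentation rule based on employee count."""
    employees = customer.get("employees")
    if employees is None:
        return "UNKNOWN"  # field not populated in the source system
    return "Enterprise" if employees >= 500 else "SMB"


def sales_by_segment(customers, sales):
    """Total sales per segment; sales for unmatched customers go to UNKNOWN."""
    segment_of = {c["customer_id"]: assign_segment(c) for c in customers}
    totals = defaultdict(float)
    for sale in sales:
        totals[segment_of.get(sale["customer_id"], "UNKNOWN")] += sale["amount"]
    return dict(totals)


customers = [
    {"customer_id": 1, "employees": 1200},
    {"customer_id": 2, "employees": 40},
    {"customer_id": 3, "employees": None},
]
sales = [
    {"customer_id": 1, "amount": 5000.0},
    {"customer_id": 2, "amount": 750.0},
    {"customer_id": 3, "amount": 300.0},
    {"customer_id": 2, "amount": 250.0},
    {"customer_id": 4, "amount": 100.0},  # customer missing from the master data
]

totals = sales_by_segment(customers, sales)

if __name__ == "__main__":
    print(totals)
```

The size of the “UNKNOWN” total is itself a finding to take back to the users.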

A prototype should focus on the dimensions and data sources that are expected to be the most difficult, and involve multi-source integration. Don’t spend time prototyping the easy stuff.

When you are making a prototype remember it’s a one-time development. Manual steps and doing some “data cleaning by hand” are perfectly reasonable- it’s what you learn from the prototype, not how you learn it, that is important. Take a snapshot, or a sample of the various tables and put them in a sandbox environment where you can manipulate them quickly and easily.

The whole point is to move quickly, get lots of feedback from users, and be able to avoid unpleasant discoveries during the actual data warehouse development.

If you find a data quality issue, and it’s a tough one, then just remove those rows and continue on- remember you don’t have to solve all the problems in the prototype- you need to identify them. Be open with your users about what the exercise is about- and that it is a very rough pass, and a mockup.

How much could this impact cost?

If you can identify issues during the prototype then you can solve them before all the ETL development resources are brought onto the project.

If you do not do a prototype, and find a data quality issue that requires some back and forth with the business, every week of delay will probably represent thousands or tens of thousands of dollars, with the project team waiting on the resolution before being able to resume coding the ETL jobs in question.

So in summary, making prototypes will:

Reduce the risk of scope creep because users have actually seen and “touched” a mockup of the final output.
Reduce the amount of rework in ETL code because different data transformation approaches can be tested early.
Reduce the risk of the expensive ETL development phase of the project slipping due to unknown data quality issues.

The right tool for ETL Prototypes.

Often prototypes are built in a combination of Excel, MS Access or other databases. These tools can work, but Excel has serious issues handling larger data sets, and database development is often cumbersome- the idea is to make a prototype, not actually build the SQL code. Things like different data types, field formats, column naming rules etc. between different source databases often frustrate attempts to do something quickly.

Obviously another option is the enterprise ETL tools themselves- but the cost, complexity and overhead of these tools again makes them better suited to the production system- not a quick mockup or rapid prototype.

What you need to make an ETL prototype is an easy to use ETL tool that provides the basic type of functionality and graphical user interface of high end ETL tools, but also allows a more flexible treatment of data types, all with the ability to pull data from multiple sources, including more informal sources like Excel spreadsheets.

The Datamartist tool was created to provide exactly such a data scratchpad, ideal for rapid prototyping data transformations. It lets you profile your data and build data transformations using a visual, block and connector interface. But it represents a clear, focused and easy to use ETL tool, without all the feature bloat, cost and server configuration required by many expensive enterprise ETL solutions.

Download the free trial, and see for yourself.

Easy to use ETL

James Standen — Fri, 05 Dec 2008 21:32:02 +0000

As we’ve been creating Datamartist, we’ve been trying to avoid using acronyms to describe what it is, but when I’m talking to people who have a background in data warehousing, I only have to say “it’s an easy to use desktop ETL tool”, and suddenly they know what I am talking about.

An Extract, Transform, Load (ETL) tool is an intermediate software application that extracts the data from the source system, transforms it (often another way of saying it FIXES it) and then loads it into the destination system. These tools are also very expensive.

The destination system is usually a data warehouse or data mart, and most of the ETL tools available are server based. The ETL tool and related development is key to any data warehouse project (and represents a third or more of the cost on a typical project).

Although most ETL tools use a visual interface of one sort or another, at the core they require programming skills and specialized knowledge. Google “datastage training” and you’ll see that there is an industry grown up around learning how to use these tools.

But there’s nothing magical about it. If you ever made a spreadsheet with data from multiple sources, transformed the data, and then either made reports or moved the data into another spreadsheet then you have made (or most likely were an integral human part of) an ETL. The problem is that out of the box tools like Excel and Access are so flexible, that too much is possible. Where to start?

The amazing thing is that EVERYONE needs ETL functionality, yet overwhelmingly the tools available are expensive, hard to learn and designed for the really, really heavy lifting.

Surely not every data manipulation task that is too much for Excel needs an enterprise ready server based ETL tool? Particularly in the current economic environment, oversized solutions are not an option.

A hard working analyst that has a bit of data analysis to do, and nothing but Excel or maybe Access on his/her desktop is short on options and long on messy spreadsheets or the need to “learn SQL in 21 easy steps”.

The vision behind Datamartist is to provide an easy to use, powerful, yet low cost data transformation tool, that guides users to generate well structured data analysis sets. And all at a price that represents less than a single day of those consultants you have to hire to use the other software you paid too much for.

This is the perfect time to find out what easy, flexible, visual data transformation can be like- download now.