Data warehouse – Datamartist.com

Making rapid prototypes for data warehouse ETL jobs

James Standen — Mon, 14 Sep 2009 20:39:00 +0000

Data warehouses and even data marts can be expensive, complex projects. They are not projects to start lightly, and they are not projects that you want to launch without doing some solid planning.

But there is a way to get a handle on the tricky parts of your data warehouse scope, and to reduce your project’s overall cost.

The major cost component of any data warehouse project is the Extract Transform and Load (ETL) development. Obviously every project is slightly different, but in my experience ETL will often make up in the order of 70% of the development cost. One of the drivers of this cost is the relatively high priced ETL development resources required. In the markets where I’ve hired resources, an ETL developer will often demand a 30-40% higher hourly rate than a business intelligence report writer, for example.

Making ETL prototypes will give you insights that can reduce cost by shortening the ETL development process and making the optimum use of those highly talented and expensive ETL resources.

What affects the cost and complexity of ETL jobs?

For any given scope, the following will have a large impact on the number and complexity of ETL jobs and therefore their cost.

The number of different data sources involved.
The consistency in terms of master data definitions between systems.
The level of data quality in the systems.

Ideally, you want to get a good handle on these three things before you hire all the ETL developers, and be confident that you are going to satisfy the users’ needs before millions of dollars are spent on ETL jobs and business intelligence reports.

One part of the preparation needed to do this can be the creation of a proof of concept or mockup of key parts of the data warehouse ETL deliverable.

Now, there are mockups, there are prototypes, and there are “first versions”. The most effective approach is to create a mockup or prototype that:

Goes just deep enough into the data to:
- Establish all data sources that will be required
- Give a high level audit of their master data and data quality
Provides enough output that:
- End users can be supplied with example reports or cubes to get hands on
- The functional scope can be locked down with confidence on all sides.
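As a rough illustration of what a “high level audit” of a source table can look like, here is a minimal Python sketch. The table, the column names and the `profile_column` helper are all hypothetical; the point is only that counting missing and distinct values per column is often enough to spot trouble early.

```python
from collections import Counter

def profile_column(rows, column):
    """Return basic quality statistics for one column of a list of dict rows."""
    values = [row.get(column) for row in rows]
    # Treat None and blank strings as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(3),
    }

# Hypothetical sample pulled from a source system (assumed data).
customers = [
    {"customer_id": "C1", "segment": "RETAIL", "country": "JP"},
    {"customer_id": "C2", "segment": "", "country": "JP"},
    {"customer_id": "C3", "segment": "retail", "country": None},
    {"customer_id": "C4", "segment": "WHOLESALE", "country": "SG"},
]

for column in ("customer_id", "segment", "country"):
    print(column, profile_column(customers, column))
```

Even on four rows this shows a blank segment and two spellings of the same code, which is exactly the kind of master data issue you want on the table before the ETL developers arrive.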

The goal of a data warehouse prototype is to learn about the underlying data, and to be able to try different data transformation techniques and approaches on the real data. The goal is not to make the finished product, nor to deliver actually usable reports to end users, although it may be to generate an example result for users to validate.

An example might be to create a prototype to calculate total sales by segment for a period under a new customer segmentation. This would identify if the segmentation rules that have been suggested actually result in the expected segmentation of sales data, and if the fields involved are complete and correctly populated in the source systems.
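A prototype like that can be very small. The sketch below is plain Python with made-up data; the revenue threshold and the `assign_segment` rule are assumptions for illustration, not a recommended segmentation. It totals sales by segment and keeps separate buckets for customers the rule cannot classify.

```python
from collections import defaultdict

def assign_segment(customer):
    """Hypothetical segmentation rule: classify by annual revenue."""
    revenue = customer.get("annual_revenue")
    if revenue is None:
        return "UNSEGMENTED"  # the field the rule depends on is not populated
    return "ENTERPRISE" if revenue >= 1_000_000 else "SMALL_BUSINESS"

def sales_by_segment(customers, sales):
    """Total sales amount per segment for the given sales rows."""
    segment_of = {c["customer_id"]: assign_segment(c) for c in customers}
    totals = defaultdict(float)
    for sale in sales:
        # Sales whose customer is missing from the customer table get their own bucket.
        segment = segment_of.get(sale["customer_id"], "UNKNOWN_CUSTOMER")
        totals[segment] += sale["amount"]
    return dict(totals)

# Assumed sample data for one period.
customers = [
    {"customer_id": "C1", "annual_revenue": 5_000_000},
    {"customer_id": "C2", "annual_revenue": 200_000},
    {"customer_id": "C3", "annual_revenue": None},
]
sales = [
    {"customer_id": "C1", "amount": 1200.0},
    {"customer_id": "C2", "amount": 300.0},
    {"customer_id": "C3", "amount": 450.0},
    {"customer_id": "C9", "amount": 50.0},
]

totals = sales_by_segment(customers, sales)
grand_total = sum(totals.values())
for segment, amount in sorted(totals.items()):
    print(f"{segment:18} {amount:10.2f} {amount / grand_total:6.1%}")
```

The share of sales landing in the unsegmented and unknown buckets is the number to bring back to the users: it tells them how well the proposed rule is supported by the data that actually exists.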

A prototype should focus on the dimensions and data sources that are expected to be the most difficult, and involve multi-source integration. Don’t spend time prototyping the easy stuff.

When you are making a prototype remember it’s a one-time development. Manual steps and doing some “data cleaning by hand” are perfectly reasonable- it’s what you learn from the prototype, not how you learn it, that is important. Take a snapshot, or a sample of the various tables and put them in a sandbox environment where you can manipulate them quickly and easily.

The whole point is to move quickly, get lots of feedback from users, and be able to avoid unpleasant discoveries during the actual data warehouse development.

If you find a data quality issue, and it’s a tough one, then just remove those rows and continue on- remember you don’t have to solve all the problems in the prototype- you need to identify them. Be open with your users about what the exercise is about- and that it is a very rough pass, and a mockup.
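One way to “remove those rows and continue on” without losing track of the problem is to set the bad rows aside with a note about why they were rejected. This is a minimal sketch with hypothetical checks and order data, assuming rows are simple dictionaries.

```python
from collections import Counter

def split_clean_and_rejected(rows, checks):
    """Separate rows that pass every check from rows that fail at least one.

    `checks` maps an issue name to a function returning True when the row is OK.
    """
    clean, rejected = [], []
    for row in rows:
        failed = [name for name, is_ok in checks.items() if not is_ok(row)]
        if failed:
            rejected.append({"row": row, "issues": failed})
        else:
            clean.append(row)
    return clean, rejected

# Hypothetical checks for an order extract (assumed rules).
checks = {
    "missing_customer": lambda r: bool(r.get("customer_id")),
    "non_positive_amount": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] > 0,
}

orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 100.0},
    {"order_id": 2, "customer_id": "", "amount": 80.0},
    {"order_id": 3, "customer_id": "C2", "amount": -5.0},
    {"order_id": 4, "customer_id": None, "amount": None},
]

clean, rejected = split_clean_and_rejected(orders, checks)
issue_counts = Counter(issue for r in rejected for issue in r["issues"])
print(f"{len(clean)} clean rows, {len(rejected)} rejected")
print(dict(issue_counts))
```

The prototype carries on with the clean rows, and the issue counts become the list of problems to raise with the business.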

How much could this impact cost?

If you can identify issues during the prototype then you can solve them before all the ETL development resources are brought onto the project.

If you do not do a prototype, and find a data quality issue that requires some back and forth with the business, every week of delay will probably represent thousands or tens of thousands of dollars, with the project team waiting on the resolution before being able to resume coding the ETL jobs in question.

So in summary, making prototypes will:

Reduce the risk of scope creep because users have actually seen and “touched” a mockup of the final output.
Reduce the amount of rework in ETL code because different data transformation approaches can be tested early.
Reduce the risk of the expensive ETL development phase of the project slipping due to unknown data quality issues.

The right tool for ETL Prototypes.

Often prototypes are built in a combination of Excel, MS Access or other databases. These tools can work, but Excel has serious issues handling larger data sets, and database development is often cumbersome- the idea is to make a prototype, not actually build the SQL code. Things like different data types, field formats, column naming rules etc. between different source databases often frustrate attempts to do something quickly.

Obviously another option is the enterprise ETL tools themselves- but the cost, complexity and overhead of these tools again makes them better suited to the production system- not a quick mockup or rapid prototype.

What you need to make an ETL prototype is an easy to use ETL tool that provides the basic type of functionality and graphical user interface of high end ETL tools, but also allows a more flexible treatment of data types, all with the ability to pull data from multiple sources, including more informal sources like Excel spreadsheets.

The Datamartist tool was created to provide exactly such a data scratchpad, ideal for rapid prototyping data transformations. It lets you profile your data and build data transformations using a visual, block and connector interface. But it represents a clear, focused and easy to use ETL tool, without all the feature bloat, cost and server configuration required by many expensive enterprise ETL solutions.

Download the free trial, and see for yourself.

6 Tips for making a business intelligence project budget

James Standen — Mon, 15 Jun 2009 14:17:04 +0000

Sometimes it seems like going over budget on a data warehouse project is an unwritten rule. But very often, there are some simple ways to help avoid making a budget that is doomed to be overrun. By budgeting correctly at the start, and managing cost, the project manager can deliver the benefits that the business was expecting, in the time frame and budget envelope agreed to. Here are a few tips that I’ve found through experience to help:

1) Establish scope clearly before you establish budget. This sounds obvious but it’s amazing how often I’ve seen this go wrong. There is no way to know what something will cost if you don’t know what it is. If you leave the scope too broad, user expectations will drive feature and scope creep that will wipe out your original budget. There is only one way to have a clear scope- write it down. And the best kind of scope document is one that is signed by all the various players. Make sure everyone understands what is being delivered, when, and at what cost. If there are gray zones, at least flag them as potential areas where additional costs might arise.

2) Take a look at the data and talk to people who work with it. It seems obvious to say that what’s in the data matters. (Or more accurately what’s not in the data). But it’s amazing how many budgets are made without doing any serious analysis of the data that is actually in the source systems. Don’t look at the data model, see a field called “customer birthday” and design functionality around it. Problems can range from missing data, to mandatory fields that are filled with garbage because “otherwise the system won’t let us put the order through” to differences in interpretation of definitions between groups within the company. For example if all the Asian sales offices classify customers into the same segments, but have slightly different rules then you will need to “reallocate” this segmentation- even though it is the same field, and the same codes. That reallocation is an ETL job you didn’t budget for, unless you found it in a pre-budget data audit. Often the key here is not to launch a massive data audit, but to find the people who have been trying to make the global reports in spreadsheets- they’ve run up against these sorts of issues, and can probably even offer some solutions. (That well respected analyst in head office who has already painstakingly established a cross mapping for the customer segments working with his colleagues in Asia, for example). These same folks are also going to be key in terms of adoption of the final solution, since they are often the current source of data for the underground data system the new data warehouse is supposed to be improving on. By involving the key people, you gain credibility, save time, and ensure that the final solution addresses business needs.
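To make the “reallocation” in tip 2 concrete, here is a small Python sketch of a cross mapping keyed on both office and local code. The office names, codes and global segments are invented for illustration; in practice the mapping would come from the analyst who already built it in a spreadsheet.

```python
# Hypothetical cross mapping: (office, local segment code) -> global segment.
SEGMENT_MAP = {
    ("TOKYO", "A"): "KEY_ACCOUNT",
    ("TOKYO", "B"): "MID_MARKET",
    ("SINGAPORE", "A"): "MID_MARKET",  # same code as Tokyo, different local rule
    ("SINGAPORE", "B"): "SMALL_BUSINESS",
}

def reallocate_segments(rows, segment_map):
    """Attach a global segment to each row; collect rows with no mapping."""
    mapped, unmapped = [], []
    for row in rows:
        key = (row["office"], row["segment_code"])
        if key in segment_map:
            mapped.append({**row, "global_segment": segment_map[key]})
        else:
            unmapped.append(row)
    return mapped, unmapped

# Assumed sample customer rows from two offices.
customers = [
    {"customer_id": "C1", "office": "TOKYO", "segment_code": "A"},
    {"customer_id": "C2", "office": "SINGAPORE", "segment_code": "A"},
    {"customer_id": "C3", "office": "SINGAPORE", "segment_code": "C"},
]

mapped, unmapped = reallocate_segments(customers, SEGMENT_MAP)
for row in mapped:
    print(row["customer_id"], row["office"], row["segment_code"], "->", row["global_segment"])
print("No mapping for:", [row["customer_id"] for row in unmapped])
```

Two customers with the same code “A” end up in different global segments, and the unmapped list is the part of the job nobody has budgeted for yet.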

3) Do proofs of concept for the tricky bits. In many projects there are areas where something is being tried for the first time (certainly in the more interesting ones)- not surprisingly this is often where the issues arise. One way to help quantify how much effort will really be required by these areas is to do some quick and dirty proofs of concept to validate the basic technical and/or functional aspects of the component or system. Often, if it is early in the project and software and hardware selection has not yet been done, your vendors will be willing to assist with a proof of concept (or even do it themselves) as part of the evaluation process. You can learn important things at this stage. For example, if you are doing a reporting project that needs to deliver 300 reports, by doing a proof of concept of 3-4 reports (even if they are not much more than mock-ups) you can at least get a first estimate of how much development effort is involved, how easy the tool is, and how the software performs. I’ve done proofs of concept that doubled my estimates- because when we actually sat down and used the tool, we realized there was additional data cleanup and hardware required to make it work. Better to know that before you set your budget rather than after. The ideal proof of concept is to actually build a “wire frame” version of the key data marts with real data and let the users try it out. Often, it’s possible to do this quickly, particularly with a tool that lets you do rapid prototypes of ETL transformations.
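The arithmetic behind that first estimate is simple enough to sketch. The hours below are assumed figures for four proof of concept reports, not measurements from a real project, and a straight multiplication ignores learning effects and reuse, so treat the result as a range rather than a forecast.

```python
from statistics import mean

def extrapolate_effort(sample_hours, total_reports):
    """Scale the hours observed on a few sample reports up to the full report count."""
    return {
        "low": min(sample_hours) * total_reports,
        "expected": mean(sample_hours) * total_reports,
        "high": max(sample_hours) * total_reports,
    }

# Assumed hours spent on four proof of concept reports.
sample_hours = [6.0, 9.0, 14.0, 11.0]
estimate = extrapolate_effort(sample_hours, total_reports=300)

for label, hours in estimate.items():
    print(f"{label:9} {hours:8.0f} hours")
```

If the spread between the low and high figures is wide, that in itself is useful: it suggests the sample reports were not alike, and the budget should not be built on the average alone.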

4) Involve the project team in the budgeting process. As a project manager, the responsibility for the budget is ultimately yours, however by getting input from the experts in the various fields you can greatly improve its accuracy. Don’t guess how long it will take to configure the server- go to the infrastructure team and find out how long it took the last three times they did a similar install and configuration. Be aware of the differences, but for many items by talking to the people who have been there and done that you will get good estimates of the true cost/duration of your project line items.

5) Watch out for infrastructure and user centric costs. The following costs are often overlooked, and end up being part of the cost overrun in the end. Don’t get caught by these classic end of project costs:

Infrastructure costs- We start to take our IT infrastructure for granted, and assume it will accommodate the new system without modifications- but is the network fast enough? Is there enough data storage? Just assuming that the existing server capacity, storage and bandwidth are sufficient may hide significant costs that the project will need to take on just to make the system operational. I’ve also seen cases where three projects that were launched at the same time all checked that the storage was available- but of course each project did not take into account the other two. The last project to go live ended up having to buy more hard drive capacity- and went over budget.
Training for end users. There are few systems that don’t require some amount of training for end users. Not taking this into account will provide a rude surprise at the end of the project. Costs here include actual training by third parties, but also travel expenses for trainers or trainees if they are not all based in the same location. Web based training can be a cost effective alternative, but results vary- and it’s difficult to ensure that “attendees” are really attending.
Transitional support costs. If the project has a wide scope and involves a large number of users, be aware that there may be an initial spike in help desk calls and PC support as the system goes live. Depending on how your help desk is structured, you might end up paying more to your outsourcing company, or need to hire some temporary employees to help handle the extra calls for a few months.

6) Have a contingency amount included in your budget and do everything you can to keep it till the end. There are two big mistakes commonly made regarding budget contingencies- first, not having one at all, and second, using it early in the project. Contingencies are often unpopular, first because by increasing the amount that needs to be approved, they might make it harder to get the green light for the project, and secondly because some view them as a “fudge” or a lack of willingness to do a proper cost estimate. However, ideally a contingency is a realistic number that should be based on the risk inherent in the project. Very small, simple projects might only need a 5% contingency. Large, complex projects that involve multiple departments, hundreds or thousands of users and multiple software and hardware vendors need higher contingencies. There are simply more things that can go wrong, and it’s not realistic to expect that cost analysis can be accurate enough to foresee everything. The key is not to consume your entire contingency with the first scope change- I’ve seen it happen again and again. Contingency is supposed to be for those unforeseen things- for example, a technical problem with the interaction between two vendors’ packages requires custom development to provide the original functionality envisioned, or requires more hardware than expected.

By spending the time required up front, a realistic, practical budget can be created- and you can get your project started off on the right foot.

A Cost comparison between Data Marts and a Data Warehouse

James Standen — Mon, 19 Jan 2009 02:19:51 +0000

I’ve noticed a fair bit of search traffic focusing on cost questions, particularly which is cheaper; a series of data marts or a single enterprise data warehouse. I think it’s a bit like the question of lease vs buy. Starting off building a single departmental data mart will represent a much smaller cash flow out. But by the time you’ve built all the data marts, and then have to redo them all again to integrate between subject areas and departments, I’d have to say that I’m with Bill Inmon when he says no number of data marts add up to a data warehouse.

With data marts (just like leasing a car) you get behind the wheel quickly, and it gets you where you want to go in style. And the monthly payment is something you can afford now. However, long term, well, in three years you don’t own it, and have paid a bundle.

But let’s be realistic. Just as having all the cash on hand to buy the car outright just might not be in the cards, a true data warehouse might require a very significant outlay before anything comes out the other end, making it unaffordable. A quick, focused departmental data mart could be delivering value in a matter of weeks with relatively little investment. (Your actual mileage may vary- depending on where you’re at, it’s always dangerous to believe someone when they say “a matter of weeks” when software and people are involved.)

Will that departmental data mart, or even a number of data marts lead you to a single version of the truth? Will it give you deep competitive advantage through a culture of data analytics and cross enterprise master data management? In my honest opinion, No.

But is it something you can afford in today’s economy, and will you learn things about your data, your company’s information culture, and your business that will be useful if in the future you embark on a true data warehouse initiative? Yes. Yes it is, and yes you will.

And I’ll take it one (blatantly promotional) step further. Is a personal data mart on your desk top as good as a full fledged departmental data mart with an army of highly paid developers maintaining it? Probably not.

Is the personal data mart on your desk basically free in comparison to the servers, software and hired help the data mart requires? Yes. And does it, just like the data mart does for the data warehouse, prepare the ground for the next evolution when the economy turns around? Yes. Yes it does.

In difficult times companies that are pragmatic, and do what is possible, preparing for the day when more will be, survive to see that day.

It seems obvious that doing nothing because you can’t afford to do the best thing is a bad strategy- but we need to ask ourselves, how often do we make that exact choice through inaction?

Data Warehouse vs Data Mart

James Standen — Tue, 02 Dec 2008 17:20:00 +0000

Very often, the question is asked- what’s the difference between a data mart and a data warehouse- which of them do I need?

Data warehouse or Data Mart?

Data Warehouse:

Holds multiple subject areas
Holds very detailed information
Works to integrate all data sources
Does not necessarily use a dimensional model but feeds dimensional models.

Data Mart

Often holds only one subject area- for example, Finance, or Sales
May hold more summarised data (although many hold full detail)
Concentrates on integrating information from a given subject area or set of source systems
Is built focused on a dimensional model using a star schema.

More Info about Data mart models

More Detail regarding Data Warehouse Vs Datamart: and Inmon vs Kimball

As the concepts of decisional systems, data warehouses and data marts evolved, two major points of view came into existence. There are two giants in this field: Bill Inmon and Ralph Kimball.

There are some that argue the best approach is to start with data marts, department by department, then merge them together to form a data warehouse- this is more in line with Kimball’s approach.

Now, Bill Inmon is an advocate of the Data warehouse. Here’s one of his articles, which contains the following quote that makes it clear what he thinks about the idea:

“You can catch all the minnows in the ocean and stack them together and they still do not make a whale,” Bill Inmon, January 8, 1998.

Ralph Kimball, on the other hand, advocates what he calls a bus architecture data warehouse. His methodology specifies conformed dimensions, where multiple fact tables share common dimensional tables. For me, each of these fact tables represents a data mart. The row of dimensional tables that all the fact tables plug into is the bus, and because, for example, the finance and the sales data marts both use the same product dimension table there is integration between departments.
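Here is a toy version of that idea using Python’s built-in `sqlite3` module. The table names, columns and figures are invented for illustration and this is not Kimball’s own example: a sales fact table and a finance fact table both reference one shared product dimension, so results from the two marts can be lined up by product category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One conformed product dimension shared by both fact tables.
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category TEXT
);
-- Sales data mart fact table.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product (product_key),
    units_sold INTEGER,
    revenue REAL
);
-- Finance data mart fact table.
CREATE TABLE fact_finance (
    product_key INTEGER REFERENCES dim_product (product_key),
    cost REAL
);
INSERT INTO dim_product VALUES
    (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Media');
INSERT INTO fact_sales VALUES (1, 10, 500.0), (2, 4, 400.0), (3, 20, 200.0);
INSERT INTO fact_finance VALUES (1, 300.0), (2, 250.0), (3, 50.0);
""")

# Summarise each fact table by the shared dimension, then line the results up.
query = """
WITH sales AS (
    SELECT p.category AS category, SUM(s.revenue) AS revenue
    FROM fact_sales s JOIN dim_product p ON p.product_key = s.product_key
    GROUP BY p.category
),
costs AS (
    SELECT p.category AS category, SUM(f.cost) AS cost
    FROM fact_finance f JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
)
SELECT sales.category, sales.revenue, costs.cost
FROM sales JOIN costs ON costs.category = sales.category
ORDER BY sales.category
"""
rows = conn.execute(query).fetchall()
for category, revenue, cost in rows:
    print(f"{category:10} revenue={revenue:8.2f} cost={cost:8.2f}")
```

The comparison only works because both marts agree on what a product and a category are. If sales and finance each kept their own product table, the final join would have nothing reliable to match on.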

The more extreme data mart strategy is that of the completely stand alone data mart, the concept being that it’s fast, easy, cheap, and delivers value immediately. I’m a supporter of this at the desktop level- that’s the point of the Datamartist tool after all. But I don’t buy this for server based architectures- what is really fast, easy and cheap when you have to buy servers, create a project and form a committee? In my mind if you’ve decided you need a central server solution, some level of integration is needed, and don’t pretend it’s going to be magic.

The interesting thing about these approaches is that the harder you work on really conforming your dimensions, the more your data marts look like the data warehouse that Inmon advocates. (Data modellers in the know will be jumping up and down right now shouting NO they don’t- but this is a high level conversation…) But the reality is, even in a data warehouse, issues will arise that require compromise- things that just don’t map or conform, and budget, schedule and business reality will mean that nothing is ever perfect, and in the end the world is full of data warehouses that are less conformed than some data mart clusters. It’s just not simple.

Generally, it is probably true that data warehouses provide a solution that is closer to the “single version of the truth”, but they do take a HUGE amount of effort, and an ability to coordinate across the entire organisation. If you have not already built at least half a dozen data marts, don’t think you can estimate how much effort a data warehouse will take. You can’t. And bring your cheque book.

Whereas data marts might deliver some value early, if built without sufficient effort on cross functional mapping and data cleanup they are just more silo systems and have their own set of costs and issues. Don’t measure payback on datamarts in years- nothing is the same in a few years, you’ll be back to the drawing board shortly.

It’s a real dilemma. So which one? Data warehouse? Data mart? In my view, the right answer is “it depends” and “yes”. However, never launch a data warehouse project as your first shot. A successful strategy will balance the fast, pain point addressing solutions, with a medium and long term plan, and investment in infrastructure and competencies to build the technology, processes and culture that a company needs to manage information. Depending on your industry and how successful you are, a massive data warehouse might be in your future. But sorry, no magic bullet.

Build a multi-level data strategy

Level 1- Get the data to the people
Level 2- Build Departmental Data Marts
Level 3- Plan long term infrastructure and architecture

Don’t do these things in order- this isn’t step 1, 2, 3- actively work on all three levels at once and ensure the plans at each level are coordinated.

Data to the People

People are building spreadsheets, and spending money on database development now- you know they are. Give them better tools, help them better use the spreadsheets, and formalize the way they do it. The do-it-yourself approach exists- but it doesn’t have to be completely informal.

The Datamartist tool is adding another capability that can accelerate the process- letting you move more quickly, prototyping and analysing to determine which areas are ready for additional analysis capability.

In some cases Datamartist itself might simply be the best choice for certain types of analysis, cutting costs dramatically. In addition, if your end users are building their own data marts, when it comes time to build server based data marts they know the concept, they understand the structure, and can even provide concrete examples of the dimensions they need.

But whatever you do- don’t make the mistake of thinking this is all you need. Work on all levels at once.

Build Departmental Data marts

If a whole department is flying blind, and big money is on the table, then don’t launch a three year data warehouse project- create some departmental data marts. These projects should be designed to be 3-6 months long, and be sold to management honestly and clearly as being for short term gains, and as part of a broader discovery process. The hardware and software licenses will be reusable- but be clear that the data marts will have a limited lifespan- they always do.

And trust me, when you build these data marts you will discover all sorts of things about your data, your organisation, and your definitions and business processes. You will discover that the sales organisation needs to analyse product segments in a way that is fundamentally inconsistent with how finance has been reporting it for years- and neither group is willing to budge (and they may both be right). You will discover that 80% of your sales orders have errors on them or are incomplete. These discoveries will help you build the next data mart, and assess if a data warehouse is possible. They should also send you back to your transactional system and business processes to work to clean up the problems.

Build the infrastructure and deal with the foundation

There are lots of pitfalls in creating a decisional architecture- this short list from Gartner resonates with me- I’ve seen and battled these issues on project after project.

Set standards in terms of tools, project management etc. Buy infrastructure for multiple projects, not on a project by project basis- don’t have multiple servers when one with multiple data marts is radically cheaper.

And probably most important, I honestly think that you can build anything you want, any way you want, but it will not succeed if you don’t have your definitions, both data and information, under control. (See Gartner issue #8) In the end, it’s not what language you speak, it’s if you have a dictionary or not, and if everyone is using the same dictionary across your organisation.

You should have common reference data sets that are used by all levels. Datamartist can load in and use reference data that is coordinated with departmental data marts and the eventual warehouse. Make these data sets available to everyone- you’ll be amazed that if they are easy to get and use, people will put them in their spreadsheets, and things might actually start matching up.
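As a small illustration of why shared reference data helps, the Python sketch below checks the segment values found in a spreadsheet against a common reference set. The reference codes and spreadsheet values are made up; the idea is simply that once there is one published list, mismatches become easy to find.

```python
from collections import Counter

# Hypothetical shared reference set of segment codes, published for all levels.
REFERENCE_SEGMENTS = {"KEY_ACCOUNT", "MID_MARKET", "SMALL_BUSINESS"}

def unknown_codes(values, reference):
    """Count values that do not appear in the shared reference set."""
    return Counter(v for v in values if v not in reference)

# Assumed values typed into a departmental spreadsheet.
spreadsheet_segments = ["KEY_ACCOUNT", "Mid Market", "SMALL_BUSINESS", "Mid Market", "SMB"]

mismatches = unknown_codes(spreadsheet_segments, REFERENCE_SEGMENTS)
print(mismatches.most_common())
```

Each mismatch is either a typo to fix in the spreadsheet or a definition that is missing from the dictionary, and both are worth knowing about.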