Datamartist.com

How much data is enough?

James Standen — Fri, 19 Jan 2018 15:05:24 +0000

While I don’t feel an urge to argue the “should we use data or use our gut to make decisions” question nearly as much as I used to, I am noticing that the “data is good, data is great, more must be better” camp is starting to occasionally make me think. The pendulum is swinging- where it is in its arc is anyone’s guess, but I’m thinking we’re past the center line.

“Don’t argue against over reliance on data! You sell a data tool, for goodness sake!”

Now, I’m a huge fan of data. I’ve been known to exclaim that more data is better, and to store it all because storage is cheap and you never know if you’ll need it.

I really do believe that you should data profile early and often.

But, as we get older (not old, mind you, just a bit older than before is all), we start to wonder if perhaps we’re suffering from a type of “irrational exuberance” in terms of our data.

Obviously, since I’m in the business of data tools, and managing lots of it, and doing interesting things with it, data is important and key and I always want lots around.

But too much of anything is too much. And sometimes piles of bad data are worse than no data at all. And unquestioned beliefs are the first step to blind spots, and after that it’s out of touch time.

Being too sure of your methods is often accompanied by being overly harsh on alternatives. “Going with our gut” is not really making decisions with our lower intestine at all. Even with the upper intestinal tract lending some opinions, we all know that the brain is involved in all those “gut” calls. When we make leaps in judgement that might not be completely supported by an MDM, ODS, DW, data-profiling-approved method, there is still data involved. It’s been through a very advanced neural net that is the product of millions of years of evolution.

Even if we don’t know exactly WHY we feel a certain course is right, our brains have often sorted out patterns and trends in the data we’ve seen that might defy our formal systems. In reality, what’s often called “gut feel” or “intuition” is fact based, just a bit more mystical.

So where am I going with all this?

I think in the end, there are no absolutes. Data is a critical thing. Nurture your processes, infrastructure and tools for collecting, storing and analyzing data.

But never forget that at the end of the day, a decision needs to be made. Not making it is a decision in itself, and if you wait long enough, if you wait for the data to support you completely, it’s possible the only decision you’ll make is not to decide, waiting for more analysis. Sometimes you need to leap. Leap based on Data- do fast ad-hoc analysis, use your Data Lake, don’t wait for the massive IT Data warehouse project to get done – get your Gut and Your Brain and at least some data- but leap.

Finding balance in your ETL strategy

James Standen — Wed, 22 Mar 2017 21:08:26 +0000

Organizations large and small can experience ETL (Extract, Transform, Load) bottlenecks in one way or another. The task is complex, and an inadequate ETL strategy (or the lack of one) can do significant damage to your business.

In general, companies approach solving ETL needs in three different ways.

ETL empires

These are companies that have a developed, structured, and well-documented approach to ETL activities. They typically have implemented an end-to-end integration process using batch file transfers. Some of them are able to do a complete load and replace, or a load and merge, every hour. Yet they struggle with integrating new data sources and creating new ETL processes, because of undesired downtime, the large amounts of manual work involved, and the logistics of process management.

ETL ninjas

In organizations that do not have a data warehouse, do not integrate their data at all, or use a large data lake (in which a large amount of raw flat data is stored), people tend to master some form of self-service ETL process. But these ninjas act solo most of the time. Data transformations are repeated among different ETL ninjas and hours are lost every day.

Others

Companies that have inconsistent, undocumented and cumbersome ETL processes can be scared of the huge pile of ETL jobs they have on their hands. The lack of consistency and documentation creates a great deal of confusion. They frequently are not sure what data has been refreshed, when, and who has access to it.

The reason that so many companies struggle with ETL is that their approach to it is inadequate. They are either constrained by an overly structured and inflexible ETL process, or are reluctant to establish an ETL process at all, ending up with analysts doing too much self-serve data preparation – which in turn results in the numbers not adding up and time wasted on repeat data preparation activities.

Use a hybrid approach to break through ETL bottlenecks in your organization

More and more innovative data services pop up on the cloud every day – so ETL is not going to go away anytime soon. Integrating this data means that your organization achieves a well-rounded understanding of the business over time. Regardless of which of the three categories above you are in, you have the opportunity to act immediately.

So what is the hybrid approach to ETL? In fact, in most cases, it is a balance between the three categories listed above. Analysts need to have a standardized structured ETL process available while also being able to rely on self-service ETL tools like Datamartist for agile ETL and ad-hoc analysis. New data should be allowed to come in and be used until it is documented and either implemented as a part of an ETL job, or left for self-service tools, to be accessed when required.

Think of running an ETL process as operating a flight route for large aircraft. Establishing a new flight route is a long and laborious process. Although it will take a lot of people from point A to point B, in most cases it only makes economic sense to run this flight once a day, to prevent aircraft from flying empty. Same with data – even if it is inexpensive to do a complete refresh every minute, why do it if the refreshed data isn’t used?

Also, don’t limit your people to air transportation. It doesn’t make sense to fly a large aircraft to a town that’s only 30 minutes away by car. It also doesn’t make sense to create a flight to a new destination, unless you know that it will remain in high demand for a long time. So why limit your people to a standardized ETL? Let them ‘drive’ using self-service tools! Yes, they may take different routes to the destination and drive at different speeds, but they’ll get their job done without the establishment of yet another ETL job.

Finally, don’t let all of them drive for days to the same destination. If many of your employees are spending days creating nearly identical data pipelines using self-service data preparation tools, this repetition of work is slowing your business down and wasting their time.

Benefits of the hybrid approach

This approach can help in the following ways:

Self-service data preparation will reduce the number of ETL jobs that an IT department has to get through by increasing the number of people who can handle small-scale ETL jobs.
Self-service ETL tools reduce technical demands: time and money spent on the creation, maintenance and administration of ETL tasks.
Automation of routine ETL activities will allow you to maintain highly consistent, reliable, and available data sets.
Automation of ETL processes that handle the most frequently used data will increase your reporting and analytics capacity.

There are many BI tools out there that will promise everything: automation of highly sophisticated ETL processes; comfortable use for both IT and business users; flexible workflows; integration with all data sources, including streaming data, etc. The truth is that the hybrid approach may require the use of various ETL tools that do the job that they were designed to do. Find your own way to balance, to prevent ETL bottlenecks from killing your business.

The top 3 reasons to automate your data preparation

James Standen — Thu, 30 Jun 2016 01:43:22 +0000

Almost everyone has “that data crunching” that they do once a month, or once a quarter or (these are often the really brutal ones) once a year.

You go get the raw data, then you painfully, step by step, work your way through it- copy paste, cut and move, formula after formula in your spreadsheet until you get the report that everyone knows you deliver every period.

Every time you clean up the same mess, you do the same steps (perhaps with a tiny variation, but fundamentally the core work is repeated again and again).

There are three BIG reasons that you should automate

1. You need to spend less time doing mindless work. It’s painful.
2. Manual steps create errors, errors create pain, pain leads to suffering… via just plain wrong business decisions.
3. You are leaving huge piles of money on the table because your analysis cycle is too slow.

Big Piles of Money?

Ok, so 1 and 2 make sense- no one likes doing mindless work, it’s a waste of your time and it means you can’t be doing interesting analysis. And errors are a big problem- garbage in, garbage out.

But what’s this talk of piles of money?

One of the things we sometimes don’t think about when we are doing things manually is that because they are manual, how often we do them is reduced.

That report that takes 2 hours to do? You can’t do that every day- you’d spend 25% of your time on it. So you do it once a month.

That analysis of product performance that takes 3 days of number crunching and copy-paste? Only once a quarter.

That in-depth analysis that actually cross checks customer profitability- that takes a week- so we do it just once a year.

But think about it- if you do the analysis once a quarter and find something that would improve results- then you have lost up to 3 months of the new improved results! Even if it’s only a few percentage points- what’s 5% for 3 months? It’s likely a pile of money.
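To put rough numbers on that, here is a minimal Python sketch. The 5% improvement and the 3-month delay come from the example above; the monthly baseline is a made-up figure for illustration only, and the calculation assumes the improvement would have applied evenly to every month of the delay.

```python
def cost_of_delay(monthly_result, improvement_rate, months_delayed):
    """Value missed by discovering an improvement late.

    Simplifying assumption: the improvement applies evenly to each
    month of the delay.
    """
    return monthly_result * improvement_rate * months_delayed


# Hypothetical baseline of 200,000 per month (an assumed figure, not real data).
missed = cost_of_delay(monthly_result=200_000, improvement_rate=0.05, months_delayed=3)
print(f"Missed over the delay: {missed:,.0f}")
```

Plug in your own monthly figure; the point is only that the cost scales directly with how long the analysis cycle is.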

But you say- automation is hard- I have to wait for the IT department and by the time they get back to me the problem has changed. It’s impossible.

It’s not impossible- you might just be using the wrong tool. Spreadsheets are fine- but unless you are a programmer they don’t really automate all that well.

But with a visual data preparation tool that lets you draw your data transformations, then run them with a single click, it is possible for an analyst to automate their own reporting.

It takes a while to learn it, and it takes longer to automate something than to do it once manually too, but think about it- say it takes 3 times as long to automate something as to do it manually. Say you spend a day on something now, and it will take 3 days to automate it. That means if you are doing that thing monthly now, within about 3 months the effort has paid for itself- AND you’ll have that report every week if you want it- or every day. Not just every month.
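The payback arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and parameters are hypothetical, and it assumes an automated run costs essentially no analyst time unless you say otherwise.

```python
import math


def months_to_break_even(manual_days_per_run, automation_days,
                         runs_per_month=1, automated_days_per_run=0.0):
    """Whole months until the one-off automation effort is paid back."""
    saved_per_month = (manual_days_per_run - automated_days_per_run) * runs_per_month
    if saved_per_month <= 0:
        raise ValueError("automation never pays back if it saves no time per run")
    return math.ceil(automation_days / saved_per_month)


# The example from the text: one day of manual work a month, three days to automate.
monthly = months_to_break_even(manual_days_per_run=1, automation_days=3)

# Same job, but run weekly (roughly 4 runs a month) once it is worth doing that often.
weekly = months_to_break_even(manual_days_per_run=1, automation_days=3, runs_per_month=4)

print(monthly, weekly)
```

The more often you would like to run the analysis, the faster the investment pays back.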

Invest in automation. Invest in tools beyond the humble spreadsheet. It will make you a better analyst, and deliver results to the bottom line.

Every day you keep manually preparing everything, you make your lost time bigger and bigger. Every day you wait before doing the analysis because you can’t do it frequently enough, an optimisation is sitting there in the raw data, ready to be discovered, but not being taken advantage of.

When spreadsheets were a new technology, the manual cut and paste approach was state of the art. Now, new visual block and connector tools let analysts automate things quickly- without the need for technical resources.

Avoid the Spaghetti

James Standen — Sun, 26 Jul 2015 13:19:19 +0000

Has this ever happened to you? You have built some fantastic analytics- you are the Business Intelligence hero, the fixer of data, the creator of reports.

They ask you for something, you fix it. Data quality issue? No problem, you add filters, views, and report edits that hide the dirt. It’s fast, it’s easy. You are the King or Queen of Data.

Then they ask you for a new analysis, and looking into it, you realize that there is no way to fix that.

And then you have to face the pile of spaghetti.

You have built so many reports on the existing structure that even the smallest change to the data model will break them all- and all the data quality patches, so fast to implement at the time, have to be reproduced for the new report, or none of the numbers are going to match.

Often, when moving fast, we make poor choices. Choices like adding that data fix in the report rather than in the database. Choices like pre-calculating ratios, pre-aggregating away detail and generally taking data steps that “we are sure will always be the case”.

Then we have to look at the spaghetti.

In my 20 years of messing with data, one of the most important lessons learned is this- don’t fix data quality problems in the reporting layer. Sure it works at first, but even medium term it’s a nightmare.

Reports and analysis should be views of clean, ready data in the data layer. Reports don’t contain data- they define a window on the data. When we add data fixes into a report, we are turning the report into a data transformation step. Any other report will not have access to this version of the data- as a result, any fixes must be reproduced across every report. This creates a source of extra effort, and increases the likelihood of errors and omissions.
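As a rough illustration of the principle, here is a small Python sketch in which the fix lives in one place in the data layer and both reports simply read the cleaned rows. The field names, the sample rows and the cleaning rules are all made up for the example.

```python
def clean_customer(row):
    """Data-layer fix, applied once before any report sees the row."""
    return {
        "name": row["name"].strip().title(),
        "country": row["country"].strip().upper(),
        "sales": row["sales"],
    }


# Made-up raw rows with the kind of dirt reports are tempted to patch locally.
raw_rows = [
    {"name": "  acme ltd", "country": "ca ", "sales": 120.0},
    {"name": "Acme Ltd", "country": "CA", "sales": 80.0},
    {"name": "globex", "country": "us", "sales": 50.0},
]

clean_rows = [clean_customer(r) for r in raw_rows]


def total_by(rows, field):
    """A 'report': nothing but a view over the already clean data."""
    totals = {}
    for r in rows:
        totals[r[field]] = totals.get(r[field], 0.0) + r["sales"]
    return totals


sales_by_country = total_by(clean_rows, "country")
sales_by_customer = total_by(clean_rows, "name")
print(sales_by_country, sales_by_customer)
```

Because neither report contains a fix of its own, their totals cannot drift apart when a third report is added.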

What this means is that while many tools have great features that let you do data cleansing right in the dashboard, and practicality means of course we use these features from time to time, every time we do we have to remember it’s a temporary patch. We need to fix the underlying data, either in the source system or at least in our repository, and then restore the report to its proper function.

Reports and Analytics tools should not fix data- they should display it.

Set up a test then check the data then decide

James Standen — Mon, 23 Feb 2015 15:38:07 +0000

How often do you argue the importance of being “data driven” to someone making a gut call? In the business intelligence business, we’ve been talking about this for more than 20 years.

I was very pleased to see that the City of Toronto, Canada (our home town) is doing some driving with data, and it is a great example of how real data makes real decisions better.

One of the biggest challenges all cities face is congestion and traffic- and various solutions are (sometimes it seems endlessly) debated in council and at the kitchen table. In the city core the classic balance is between cars and pedestrians.

Toronto has been experimenting with all-way pedestrian crossings for a number of years– three intersections in the core of the city were converted, and data has been collected. Sometimes called “scramble” crossings, these intersections periodically give a red light in both directions to cars so that pedestrians can cross all ways, including diagonally, letting large crowds move quickly. Supporters argue pedestrians are greatly helped with limited impact on cars; car supporters argue that the intersections generate gridlock and are dangerous.

In Toronto, with three different intersections and lots of data, rather than settling for a simplistic “all-way crossings are bad for cars” or “eliminating all-way crossings is bad for pedestrians”, the city could actually look at the data, and found that it wasn’t a yes-no question.

For two of the intersections, the reduced wait time for pedestrians outweighed the impact on car traffic: a net win.

For the third, wait times for cars increased more (more idling, more pollution), accidents increased (side swipe collisions doubled, rear end collisions up by half), without significant improvement in corner crowding and pedestrian wait time. As a result, the authors of the study recommend keeping the intersections that work, and reverting the third back to a normal setting.

In other words- there is no simple black and white answer for these types of crossings. Everything is a balance. But there is an algorithm that can work.

Study the existing data, make decisions on where to test next, set up a good test, collect data, and then using defined criteria and metrics that evolve as we learn more, determine which intersections are best suited, and which should be returned to a normal configuration. Then do it all again, constantly optimizing.
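The “defined criteria and metrics” step can be sketched as a small decision rule in Python. Everything specific here is hypothetical: the field names, the thresholds and the sample measurements are invented for illustration and are not taken from the Toronto study.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    name: str
    pedestrian_wait_change: float  # fractional change; negative means shorter waits
    car_wait_change: float         # fractional change; positive means longer waits
    collision_change: float        # fractional change; positive means more collisions


def decide(result, min_pedestrian_gain=0.10,
           max_car_wait_increase=0.20, max_collision_increase=0.10):
    """Return 'keep' or 'revert' for one trial, using explicit thresholds."""
    helps_pedestrians = -result.pedestrian_wait_change >= min_pedestrian_gain
    acceptable_cost = (result.car_wait_change <= max_car_wait_increase
                       and result.collision_change <= max_collision_increase)
    return "keep" if helps_pedestrians and acceptable_cost else "revert"


# Invented measurements for two imaginary intersections.
trials = [
    TrialResult("Intersection A", -0.30, 0.10, 0.00),
    TrialResult("Intersection B", -0.02, 0.40, 0.50),
]
decisions = {t.name: decide(t) for t in trials}
print(decisions)
```

Writing the thresholds down as parameters is what lets them evolve between rounds of testing without reopening the whole argument.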

It works for business data, it works for cities. Test, collect, analyze, act and start testing again armed with your new knowledge.

Arguing about something then making a huge all or nothing decision is often the wrong choice- set up a test, try it, collect data, and use that to drive your decision. Do it as quickly as you can and you will manage risk, and- providing you listen to what the data says- make better decisions.

Putting data analysis into the hands of the business user

James Standen — Thu, 24 Jul 2014 18:33:26 +0000

Having too many people involved can slow down the entire analytics process- by using tools that are designed to be accessible by business users, huge gains can be made.

However, a challenge faced by all companies is connecting the business knowledge and data manipulation knowledge in a way that the data can be used.

As discussed in a recent McKinsey blog post, 98% of CMOs characterize finding data-scientist-type resources as a challenge.

Typically, what happens is that more than one person is required- often even three: the business person needing the answer, a Business Analyst who can formulate the question and apply a combination of business and data expertise, and then an IT wizard who can actually write the SQL or program to get the data, set up the servers and databases, etc.

Using a visual model to eliminate programming and SQL

Datamartist aims to simplify the process for at least some of the analysis that gets done- eliminating at least one, and perhaps sometimes even two of these players in some cases. (Hint- it’s the business person that always remains- there is no point in finding answers unless they are driving the business forward.)

It does this by providing a visual interface where data sets can be connected together, adjusted and combined visually. Blocks represent data sets, or transformations on the data, and the flow of data as it evolves can be tracked across the canvas.

In the above example, we can see two input data sets joined together and a calculation done on the result. Every circle on each block is called a “stub” and by clicking on the stub we can see the data at that step. The result is that a sophisticated data query can be built a step at a time, without ever having to use code.

For example, the join block above gives a simple user interface- just pick the columns that you want to match up, and it visually shows how many rows connected.
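For readers who want to see what such a join step amounts to underneath, here is a minimal Python sketch of joining two data sets on chosen columns, counting how many rows connected, and then doing a calculation on the result. It is not how Datamartist is implemented; the function, column names and sample rows are all hypothetical.

```python
def join_with_match_counts(left_rows, right_rows, left_key, right_key):
    """Inner-join two lists of dicts and count left rows with and without a match."""
    index = {}
    for right in right_rows:
        index.setdefault(right[right_key], []).append(right)

    joined, matched, unmatched = [], 0, 0
    for left in left_rows:
        partners = index.get(left[left_key], [])
        if partners:
            matched += 1
        else:
            unmatched += 1
        for right in partners:
            joined.append({**right, **left})
    return joined, matched, unmatched


# Made-up input data sets.
orders = [
    {"customer_id": 1, "qty": 2, "unit_price": 10.0},
    {"customer_id": 2, "qty": 1, "unit_price": 4.5},
    {"customer_id": 9, "qty": 3, "unit_price": 1.0},  # no such customer
]
customers = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]

joined, matched, unmatched = join_with_match_counts(orders, customers, "customer_id", "id")

# The calculation step on the joined result.
for row in joined:
    row["total"] = row["qty"] * row["unit_price"]

print(f"{matched} rows matched, {unmatched} did not")
```

The match counts are the useful part: an unexpectedly high number of unmatched rows is usually the first sign of a key or data quality problem.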

Small data in the age of Big Data

James Standen — Fri, 28 Mar 2014 15:09:34 +0000

Just a quick thought on the subject of small data.

Don’t hear much about that in the press these days- everything is BIG BIG BIG.

But small- (and remember, small is still hundreds of thousands or millions of rows) is still critically important. I hope I’m stating the obvious here- but with some of the “irrational exuberance” around Big data, we need to keep our perspective.

Your entire customer list that (hopefully) you’ve worked hard to keep clean and accurate is probably small data- unless you’re Facebook and it has around a Billion rows.

Your entire product list is likely smallish data.

Certainly the categories that you want to look at, the geographical groupings and regions, your chart of accounts, pretty much all of your master data is small data.

What is clear is that it is not just BIG data that matters- it’s how we mix our small data with our big data, and how we turn our big data into the small (aggregated, summarized, cleansed, ready for consumption by other than a data scientist) data sets we use day to day.
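A tiny Python sketch of that mixing: a stream of detailed transactions (standing in for the big data) is rolled up using a small master data lookup into a summary anyone can read. The store codes, regions and amounts are invented for the example.

```python
from collections import defaultdict

# Small master data: which region each store belongs to (made-up values).
region_by_store = {"S1": "East", "S2": "East", "S3": "West"}

# Stand-in for the big data: (store_id, amount) transaction records.
transactions = [("S1", 10.0), ("S2", 5.0), ("S3", 7.5), ("S1", 2.5), ("S9", 1.0)]


def summarize_by_region(transactions, region_by_store, unknown="UNMAPPED"):
    """Aggregate detailed transactions into a small per-region summary."""
    totals = defaultdict(float)
    for store_id, amount in transactions:
        totals[region_by_store.get(store_id, unknown)] += amount
    return dict(totals)


summary = summarize_by_region(transactions, region_by_store)
print(summary)
```

Note the store that is missing from the master data: however large the transaction volume, the summary is only as good as the small lookup table behind it.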

While the abilities of Big data are fantastic, and the capabilities of today’s systems are truly awesome, they do not eliminate the need for good old data quality and master data management.

What they do though is open up whole new architectures and capabilities. So embrace big Data- take advantage of the amazing platforms that are available- but don’t get so excited about the big that you let your small data program slip. After all, small is still beautiful too.

Exact isn’t everything- Surf your data!

James Standen — Mon, 08 Apr 2013 20:38:04 +0000

Sometimes an analyst needs to take off the accountant’s hat, forget the urge to chase down every last penny, and instead put on their surfing gear, grab the data surf board (i.e. their set of preferred data tools), and just surf some data.

There are some cases where “Exact” is the only acceptable level of data quality. When we’re sending invoices to our customers, not only does the amount need to be right to the penny, the invoice needs to make it to the right person, on time.

But sending accurate invoices is not exactly the cutting edge in terms of data.

The challenges for today’s data-driven organizations are to be able to make sense out of the ocean of data available, and to do it faster than the competition.

And the ocean does not have the highest quality water all the time. In fact, some of the data is downright dirty.

But there is a lot of it, and it can tell us things.

Just like ocean surfing, when you are data surfing you can sometimes ride the wrong wave and end up underwater, or you can spend hours on your board in flat water, hoping the surf will come up but instead getting nowhere.

But when the wave comes, and you ride it, letting yourself go with the flow- while it won’t give you any exact answers, it will give you a “feel” and sense of where the wave is going, and if you are fast to grab it, and your board is good, you might just find yourself getting some awesome, gnarly insight that will score you some competitive advantage- and you just can’t surf that wave if you are worried about only looking at Exact, perfect, “we’ve checked it three times” data.

Data Surf is up- and by the look of the ocean it’s going to be a rocking rolling ride- good luck out there!

Data quality monitoring and reporting

James Standen — Thu, 17 Jan 2013 15:43:00 +0000

In the vast majority of cases, useful data sets are not static, but are being updated, added to and purged constantly.

Data quality monitoring aims to provide data quality information that is also being constantly updated, and can be used to detect issues quickly, before the bad data piles up.

Don’t let those bad records pile up.

Imagine a company that does a mailing to its customers every 3 months. Imagine that a new call center training program is put in place, and unknown to all, customer information starts being incorrectly entered and updated due to an error in the program. When do you want to know about the growing number of invalid customer records? When it’s time to do the mailing, after 90 days of bad data generation, or after just a few days of problem entries?

ALERT! ALERT! Bad data Alert!

The trick to data quality monitoring is having a set of data quality rules and profiling tests that will automatically give indications of issues. Some rules are easier than others, but the idea is that each record or group of records in the table(s) in question is checked against a series of tests, and the number of data quality rule infringements is tracked. As the size of the table grows, if the overall percentage of bad to good is increasing, you know you have an issue. If it spikes up, you sound the alarm.
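The mechanics described above can be sketched in Python as follows. The two rules, the field names, the sample records and the 5-point spike threshold are all assumptions chosen for illustration; real rules would come from your own data.

```python
# Each rule returns True when the record passes. Both rules are made-up examples.
RULES = {
    "email_has_at_sign": lambda rec: "@" in rec.get("email", ""),
    "postal_code_present": lambda rec: bool(rec.get("postal_code", "").strip()),
}


def check_batch(records, rules=RULES):
    """Return (percentage of bad records, infraction count per rule)."""
    infractions = {name: 0 for name in rules}
    bad = 0
    for rec in records:
        broken = [name for name, rule in rules.items() if not rule(rec)]
        for name in broken:
            infractions[name] += 1
        if broken:
            bad += 1  # a record is "Bad" if it breaks at least one rule
    bad_pct = 100.0 * bad / len(records) if records else 0.0
    return bad_pct, infractions


def should_alert(history, latest_pct, spike_points=5.0):
    """Alert when the bad percentage jumps well above the recent average."""
    if not history:
        return False
    baseline = sum(history) / len(history)
    return latest_pct - baseline > spike_points


# Invented batch of customer records.
batch = [
    {"email": "a@example.com", "postal_code": "M5V 2T6"},
    {"email": "no-at-sign", "postal_code": "M4B 1B3"},
    {"email": "c@example.com", "postal_code": "  "},
    {"email": "d@example.com", "postal_code": "K1A 0B1"},
]
bad_pct, infractions = check_batch(batch)
alert = should_alert(history=[2.0, 2.5, 1.5], latest_pct=bad_pct)
print(bad_pct, infractions, alert)
```

Run on a schedule, the history of `bad_pct` values becomes the trend line, and `should_alert` is the fog horn.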

Actually having a large fog horn wired up in the CEOs office is optional, due to the challenges of false positives (detecting a “Bad” record when in fact the record is ok).

I would also be very cautious about having automatic data modification processes going on, but setting alerts that will notify those responsible for potential data quality issues is a relatively straightforward exercise and will at the very least improve your visibility of data quality trends. What you do to deal with them is up to you, but at least some of the battle is being aware of the problem.

The example dashboard monitors a number of things, but the key here is to make a decision for every record on whether it is “Good” or “Bad”, and to give some analysis of which data quality rules are broken. In the upper right, there are row counts of infractions for a number of data quality rules. In the middle is the overall Bad vs Good split on the records.

The Datamartist Pro version can quite easily create automated data quality monitoring. Using visual blocks, and arranging them on a Canvas, Datamartist lets you create data quality rules, define profiling on selected columns, and then automatically, as a scheduled process, place the results either in Excel reports/dashboards or into a database for use by your reporting tool of choice.

By understanding how your data quality evolves, and catching problems early, you can reduce the amount of cleansing you need to do, and communicate clearly and regularly to your organisation the progress your data quality programs are making.

Automating data update for Tableau Server using Tabcmd and Datamartist

James Standen — Wed, 29 Aug 2012 17:41:37 +0000

Tableau is a wonderful, powerful visualization tool. If you have the data, it will generate insightful, powerful dashboards and reports.

But actually getting the data is often the tricky bit. And having it appear in your dashboards and reports without having to press lots of buttons is the goal. (No one likes to have to do a long series of cut and paste operations to get the data in place).

Datamartist is a tool that lets you combine data from multiple sources, format it, clean it, organize it, and then make it ready for tools like Tableau.

Because Datamartist can be automated from the command line, and commands on Tableau Server can be run from the command line using tabcmd, Datamartist and Tableau together can give you a complete and automated reporting solution.

Have you ever wished that you could update your Tableau dashboards automatically every day? Do you have non-database information in the form of spreadsheets that you need to integrate into your Tableau workbooks?

With Datamartist and Tableau together, you can automate all these tasks, integrating formal database data with spreadsheet data from shared drives seamlessly, and quickly. And when your users log into Tableau, they see the dashboards they need to see.

Datamartist lets you import data from databases, Excel files and text files, define data transformations that prepare and clean the data for visualization, and then export it to a location (very often a database, but potentially an Excel workbook or file) where it can be easily, automatically read by a visualization tool like Tableau Server.

You can learn more about what Datamartist does here.

To give you an idea of what this looks like, here is a bit of simple batch file code that first runs a Datamartist canvas (which loads data from multiple sources, transforms it and makes it ready for Tableau) and then automatically refreshes workbooks on a remote Tableau Server.

The first line runs Datamartist- this runs a .DMC file that you have built with Datamartist that can extract the data from multiple sources- databases, files, internal hard coded sets- then do joins, segmentations, filters, calculations etc. and export the data to the location that the Tableau Server workbooks are connected to.

"C:\Program Files (x86)\nModal Solutions Inc\Datamartist\Datamartist.exe" RUN /f "C:\MyDataMartistFile.DMC" /l "C:\WhereIWantTheLogFiles\"

Then, using a utility called tabcmd from Tableau, you can log into a Tableau server and refresh the workbook, thus updating all your Tableau dashboards. To do this, you use the following lines in the batch file:

"C:\Program Files (x86)\Tableau\Tableau Server\7.0\extras\Command Line Utility\tabcmd.exe" login -s MyServerURL -u MyUserName -p MyPassword -t MySite
"C:\Program Files (x86)\Tableau\Tableau Server\7.0\extras\Command Line Utility\tabcmd.exe" refreshextracts --project "MyProject" --workbook "MyWorkbook" --site MySite

You can find more information about Tableau’s tabcmd on their site.
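If you would rather drive the same three steps from a script than from a batch file, a Python wrapper might look like the sketch below. The executable paths and arguments are copied from the batch lines above; the function names are hypothetical, and stopping on the first failure relies on the assumption that each tool returns a non-zero exit code when it fails, which you should verify for your installation.

```python
import subprocess

DATAMARTIST = r"C:\Program Files (x86)\nModal Solutions Inc\Datamartist\Datamartist.exe"
TABCMD = (r"C:\Program Files (x86)\Tableau\Tableau Server\7.0"
          r"\extras\Command Line Utility\tabcmd.exe")


def build_commands(dmc_file, log_dir, server, user, password, site, project, workbook):
    """Build the same three command lines as the batch file, as argument lists."""
    return [
        [DATAMARTIST, "RUN", "/f", dmc_file, "/l", log_dir],
        [TABCMD, "login", "-s", server, "-u", user, "-p", password, "-t", site],
        [TABCMD, "refreshextracts", "--project", project,
         "--workbook", workbook, "--site", site],
    ]


def run_pipeline(commands, runner=subprocess.run):
    """Run each step in order; check=True raises on a non-zero exit code,
    so a failed data load never triggers a refresh of stale data."""
    for cmd in commands:
        runner(cmd, check=True)


commands = build_commands(
    dmc_file="C:\\MyDataMartistFile.DMC",
    log_dir="C:\\WhereIWantTheLogFiles\\",
    server="MyServerURL", user="MyUserName", password="MyPassword",
    site="MySite", project="MyProject", workbook="MyWorkbook",
)
# To execute for real on the machine where both tools are installed:
# run_pipeline(commands)
```

Passing the commands as lists avoids the quoting problems that paths with spaces cause in batch files.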

Net result? You have full automation from data to visualization: Datamartist takes care of the data, and Tableau creates fantastic dashboards.

If you build a batch file, and set it to run once a day, every morning when your users log into Tableau server, they’ll see updated data, ready to go.

Try Datamartist- it’s a full-function free trial. Find out how easy it is to automate Tableau Server with sophisticated data extraction.