Datamartist gives you data profiling and data transformation in one easy to use visual tool.


Datamartist Beta tests out on 100 Million rows

Since one of the key features I want Datamartist to have is the ability to manage very large data sets, I've built in some new caching functionality, and I was kicking the tires late one night recently.

First, using a custom program, I generated a 100 million row data file with 10 columns (three integer measures, two floating-point measures and five dimensional columns). The input file was about 5.6 gigabytes.
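If you want to build a similar test file yourself, here is a minimal Python sketch. It is not the custom program mentioned above: the column names, value ranges and the choice of ten distinct values per dimension are all assumptions made for illustration.

```python
import csv
import random


def write_test_file(path, n_rows, seed=0):
    """Write n_rows of synthetic data: 3 integer measures,
    2 floating-point measures and 5 dimension columns."""
    rng = random.Random(seed)  # seeded so the file is reproducible
    # Hypothetical column names; the original file's headers are not known.
    header = ["int1", "int2", "int3", "float1", "float2",
              "dim1", "dim2", "dim3", "dim4", "dim5"]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for _ in range(n_rows):
            ints = [rng.randint(0, 1000) for _ in range(3)]
            floats = [round(rng.uniform(0, 100), 2) for _ in range(2)]
            # Assumed: 10 distinct values per dimension, so any two
            # dimensions give at most 100 combinations.
            dims = [f"D{i + 1}_{rng.randint(0, 9)}" for i in range(5)]
            writer.writerow(ints + floats + dims)
```

Calling `write_test_file("test.csv", 1000)` is a sensible first run. Rows are written one at a time, so memory use stays flat however large `n_rows` gets, but a 100 million row file will take a while and a lot of disk space.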

Then I fired up the Datamartist Beta and imported this file with a Text Import block. When I first put the block on the canvas, Datamartist counted the rows (showing me its progress, of course), so it took a few minutes to get through all 100 million of them, but only a few. After that, since the file was unchanged, response was normal: because I was using preview mode, it felt as if I were working on a much smaller set.

I added a calculation block, which added a calculated column, and then I summarized the data, including two summed measures and an average on one of the floats. Here's a screenshot of the Summarize block controls:

[Screenshot: Summarize block configuration for the 100M row load test]
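To make the summarize step concrete, here is a rough Python sketch of the same kind of aggregation: group by two dimensions, sum two measures and average a third. It only illustrates the idea and says nothing about how Datamartist does it internally; the function name, parameters and result layout are assumptions.

```python
import csv
from collections import defaultdict


def summarize(path, dims, sum_cols, avg_col):
    """Stream a CSV file and aggregate per combination of the dims columns."""
    n = len(sum_cols)
    # Per key: one running sum per sum_col, then the running total
    # for avg_col, then the row count.
    acc = defaultdict(lambda: [0.0] * (n + 1) + [0])
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # one row in memory at a time
            slot = acc[tuple(row[d] for d in dims)]
            for i, col in enumerate(sum_cols):
                slot[i] += float(row[col])
            slot[-2] += float(row[avg_col])
            slot[-1] += 1
    return {
        key: {
            "sums": dict(zip(sum_cols, slot[:n])),
            "avg": slot[-2] / slot[-1],
            "rows": slot[-1],
        }
        for key, slot in acc.items()
    }
```

Only the running totals are held in memory, one entry per combination of dimension values, so the size of the input file matters far less than the number of groups.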

Everything looked good in the preview, so I pressed the RUN button, and went to bed.

It took just under three hours and thirty-eight minutes (3:37:42.7343750 to be exact) to churn through the 100 million rows and summarize them down to the 100 combinations of the two dimensions selected. This was running on my Dell Latitude D630 laptop to be more real-world (it's probably a bit faster on the quad-core workstation :-) ).
[Screenshot: the blocks after the load completed]
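For a sense of scale, the run time above can be turned into a throughput figure with a few lines of Python. The only inputs are the row count and elapsed time quoted in this post; the variable names are just for illustration.

```python
from datetime import timedelta

# Elapsed time reported for the run: 3:37:42.7343750
elapsed = timedelta(hours=3, minutes=37, seconds=42.734375)
rows = 100_000_000

rows_per_second = rows / elapsed.total_seconds()
print(f"{rows_per_second:,.0f} rows per second")
```

That works out to roughly 7,655 rows per second for the whole pipeline (import, calculated column and summarize) on the laptop.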

Now, for those of you downloading the Beta, I would not advise doing this at home. I haven't put any limiters in the application, but keep your testing to under 5 million rows for now :-) . And of course, using such large data volumes means you need lots of hard drive space for the caching. The bottom line is that I'm building the plumbing to let you go to these kinds of data volumes if you need to. Obviously there's lots more testing to do, and real-world data always turns up new things, so download the beta as soon as it's out, and tackle those big data sets you've had to rely on the IT department for in the past!
