I've been poking around to find a fun dataset to play with in Datamartist, and I have to report that it's a frustrating experience.

I often find that many of the files are not really raw data at all. They are reports, rendered as files. When you look into one, you'll find that it will not parse as a delimited file, and that it contains subtotals, padding spaces, blank lines and so on. Or it is an Excel file with merged cells, a copy-paste from some sort of OLAP tool, or perhaps a pivot table. Add to this the complication of multiple files, all with headers and spacing of varying size because they were designed to be printed and read, not analyzed. What becomes clear is that, faced with deadlines to publish data, agencies just dumped reports into files and uploaded them. Looking at some DOD "datasets", I found that different years had different formats: HTML for earlier years, then a switch to PDF, with, of course, different columns, metrics and summary levels.

Of course, the data is there, and we can get it out, but there is a pile of work to do, with the data spread far and wide across multiple files, spreadsheets and formats.
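To give a feel for that pile of work, here is a rough Python sketch of the sort of cleanup a single report-style file needs before it behaves like data. Everything in it is an assumption for illustration: the function name `extract_data_rows`, the made-up sample report, and the rule that any row starting with "total" or "subtotal" is a summary line. Real files will need their own rules.

```python
import csv
import io

# Made-up sample of a report dumped to a delimited file: title lines,
# blank lines, a repeated header, and subtotal rows mixed in with the data.
SAMPLE = """Annual Widget Report
Printed 2009-01-01

Region,Year,Units
East,2007,10
East,2008,12
Subtotal East,,22

Region,Year,Units
West,2007,7
West,2008,9
Subtotal West,,16
Grand Total,,38
"""

SUMMARY_PREFIXES = ("total", "subtotal", "grand total")


def extract_data_rows(report_text, expected_columns, delimiter=","):
    """Return (header, rows), keeping only lines that look like real data."""
    header = None
    rows = []
    for fields in csv.reader(io.StringIO(report_text), delimiter=delimiter):
        fields = [f.strip() for f in fields]
        # Blank lines, or lines made up only of empty cells.
        if not any(fields):
            continue
        # Titles and footnotes do not have the expected number of columns.
        if len(fields) != expected_columns:
            continue
        # Subtotal and total rows are report decoration, not data.
        if fields[0].lower().startswith(SUMMARY_PREFIXES):
            continue
        # The first surviving line is the header; later copies are dropped.
        if header is None:
            header = fields
            continue
        if fields == header:
            continue
        rows.append(fields)
    return header, rows


header, rows = extract_data_rows(SAMPLE, expected_columns=3)
```

And that is one file in one format. Multiply by every year and every layout change and you get the idea.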

In a lot of these cases it probably took more time for someone to generate these reports than it would have to just publish the raw data. If you can make a pile of pivot table reports, then you must be able to just dump the raw data from somewhere.

Raw is OK; we data junkies can deal with raw. Just give us a row per line (delimited or fixed width is fine) and a nice data dictionary that tells us what each column is. We'll take it from there.
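To show how little we are asking for, here is a minimal Python sketch of reading a fixed-width file once a data dictionary exists. The dictionary, the field names, the offsets and the two sample lines are all invented for the example; the point is only that with column positions and types written down, parsing is a few lines.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    start: int   # 0-based offset of the first character
    width: int   # number of characters in the column
    kind: type   # how to convert the text (str, int, ...)


# Hypothetical data dictionary: one entry per column.
DATA_DICTIONARY = [
    Field("region", 0, 6, str),
    Field("year", 6, 4, int),
    Field("units", 10, 5, int),
]

# Hypothetical raw file content: one record per line, no decoration.
RAW_LINES = [
    "East  2007   10",
    "West  2008    9",
]


def parse_fixed_width(line, dictionary):
    """Slice one fixed-width line into a dict using the data dictionary."""
    record = {}
    for field in dictionary:
        raw = line[field.start:field.start + field.width].strip()
        record[field.name] = field.kind(raw)
    return record


records = [parse_fixed_width(line, DATA_DICTIONARY) for line in RAW_LINES]
```

No guessing where the header ends, no hunting for subtotals. The dictionary does the work.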

