We are, at our core, a data company, so my focus naturally turns to the quality of our data a fair bit. In the last week or so I have been looking at the data we have around car emissions and got a little freaked out. With a little digging, though, I realized the origin of the problem is the source data itself. Here are the, let’s call them lessons, I’ve learned since I started looking at this issue.

Lesson One: Your data is only as good as what it is made of

  • The US, Australian and UK data are all in completely different formats
  • All three release information on different schedules
  • Only one (the Australian one) makes it available in an easy programmatic manner
  • The UK data does not explicitly list the model year (just ‘since model year 1995’ as part of the description)

Lesson Two: Look at the code for quality smells

  • To go MVC for a second: if your model has a switch statement keyed on the value of a database column, that is a bad smell
  • I’ll be looking at the ‘Fuel Type’ and ‘Transmission’ columns first; they’re rank
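
To make the smell concrete, here is a minimal sketch in Python (chosen only for illustration). The real column contents are not shown in this post, so the alias table and the function name are hypothetical. The idea is that a switch in the model usually means one column holds several spellings of the same thing, and normalizing once at load time removes the need to branch.

```python
# Hypothetical raw values; the actual 'Fuel Type' contents are an assumption.
FUEL_TYPE_ALIASES = {
    "petrol": "petrol",
    "gasoline": "petrol",
    "unleaded": "petrol",
    "diesel": "diesel",
}


def normalize_fuel_type(raw):
    """Map a raw source value to one canonical label, or None if unrecognized."""
    if raw is None:
        return None
    # Ignore case and stray whitespace so near-duplicates collapse together.
    return FUEL_TYPE_ALIASES.get(raw.strip().lower())
```

With something like this in place, the model reads one canonical value and the branching lives in a single table that is easy to review.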

Lesson Three (A): Define what you want the data to look like

  • Right now I’m trying to figure out which information is common to all three sources
  • I’ll use that as the grounds for massaging our data into some common order
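
Finding the common ground can be as simple as a set intersection. This is a sketch under assumptions: the field names below are made up for illustration and are not the real column names from any of the three sources.

```python
# Hypothetical field names per source; the real feeds are not shown in this post.
SOURCE_FIELDS = {
    "us": {"make", "model", "model_year", "fuel_type", "transmission", "co2"},
    "uk": {"make", "model", "description", "fuel_type", "transmission", "co2"},
    "au": {"make", "model", "model_year", "fuel_type", "transmission", "co2", "body"},
}


def common_fields(source_fields):
    """Return the sorted field names that appear in every source."""
    return sorted(set.intersection(*source_fields.values()))
```

In this made-up example `model_year` drops out of the result because the UK entry lacks it, which is the kind of gap the common view has to account for.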

Lesson Three (B): But only with serious consultation

  • I could right now ‘fix’ the issue with the ‘Fuel Type’ column, but I would likely just break a bunch of stuff
  • I know which stuff is most at risk, but I’d rather not break it in the first place
  • So I’ll be bouncing things off the environmental scientists that use this data

Lesson Four: Big Bang is not just bad for deployments

  • I have identified 5 or 6 different things that affect the quality of this data, and all will be fixed. Eventually.
  • Each problem gets its own identifiable, rollback-able fix
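
One way to keep a fix both identifiable and reversible is to record what it changed. The following is only a sketch, assuming rows are plain dictionaries; `apply_fix` and `rollback_fix` are hypothetical names, not anything from our codebase.

```python
def apply_fix(rows, column, mapping):
    """Apply one value mapping to one column; return new rows plus an undo log."""
    undo_log = []  # (row index, original value) for every row that changed
    fixed = []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy so the input rows stay untouched
        old = row.get(column)
        if old in mapping:
            new_row[column] = mapping[old]
            undo_log.append((i, old))
        fixed.append(new_row)
    return fixed, undo_log


def rollback_fix(rows, column, undo_log):
    """Restore the original values recorded by apply_fix."""
    restored = [dict(row) for row in rows]
    for i, old in undo_log:
        restored[i][column] = old
    return restored
```

Because each call touches one column with one mapping, a bad fix can be undone without disturbing the others.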

Lesson Five: Best offense is a good defense

  • Of course, I don’t want to have to do this again, so I’ll have to identify and implement proper checks that run when the data gets updated; that way I won’t have to pay (as close) attention to this area in the future.
  • This may not be as easy as it seems as we expand our offerings and services. Do the checks live in Java? Rails? Right in the database? All three?
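
Wherever the checks end up living, their shape is probably similar. Here is a rough sketch of an update-time check, with assumed names throughout: the allowed fuel types and the plausible year range are placeholders, not our real rules.

```python
ALLOWED_FUEL_TYPES = {"petrol", "diesel", "electric"}  # hypothetical canonical set


def check_row(row):
    """Return a list of human-readable problems found in one row (empty if clean)."""
    problems = []
    if row.get("fuel_type") not in ALLOWED_FUEL_TYPES:
        problems.append(f"unexpected fuel_type: {row.get('fuel_type')!r}")
    year = row.get("model_year")
    # A missing year is tolerated here; a present one must be a plausible integer.
    if year is not None and not (isinstance(year, int) and 1900 <= year <= 2100):
        problems.append(f"implausible model_year: {year!r}")
    return problems


def check_batch(rows):
    """Collect problems for a whole update, keyed by row index."""
    report = {}
    for i, row in enumerate(rows):
        problems = check_row(row)
        if problems:
            report[i] = problems
    return report
```

An empty report means the update passed; anything else is a list of rows to look at before the data goes live.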

Lesson Six: Thank goodness I can script

  • I first learned Perl when I needed to hack up a data file. This task is no different, except I’ll likely do it in Ruby.
  • It would be really nice if all these organizations would decide on a standard information set and format and make use of it.
  • The formats change even within the same organization from year to year; I need to create 5 or 6 different scripts just to get a common representation of EPA data alone.
  • Not to mention the UK or Australia or ???
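
The pattern behind those scripts is one small parser per layout, each emitting the same common record. The sketch below uses Python for illustration; the two layouts and their header names are invented, since the actual file formats are not reproduced in this post.

```python
import csv
import io


def parse_layout_a(text):
    """Hypothetical layout A: headers 'Make', 'Model', 'CO2'."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"make": row["Make"], "model": row["Model"], "co2": float(row["CO2"])}


def parse_layout_b(text):
    """Hypothetical layout B: same data under different headers."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"make": row["Manufacturer"], "model": row["Car Line"], "co2": float(row["CO2"])}


# One entry per known layout; adding a new year's format means adding one parser.
PARSERS = {"a": parse_layout_a, "b": parse_layout_b}


def load(layout, text):
    """Parse text in the named layout into a list of common records."""
    return list(PARSERS[layout](text))
```

Everything downstream then sees only the common record, no matter which release the row came from.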

Lesson Seven: Null is still evil, but sometimes necessary

  • Null is Evil; there is usually a better value to put in
  • But when you are backporting a common view to existing data you might not know what the better value is
  • Better does not mean magic
  • The code might be making use of that null somewhere that you don’t (yet) know about
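
In code, "better does not mean magic" might look like the sketch below: fill a value only when it is actually known, and otherwise leave the null alone. The lookup table and function name are hypothetical.

```python
# Hypothetical lookup: only models we are sure about get a backfilled value.
KNOWN_TRANSMISSIONS = {"Roadster": "manual"}


def backfill_transmission(row):
    """Return a copy of row with transmission filled only when we actually know it."""
    filled = dict(row)
    if filled.get("transmission") is None:
        # .get returns None for unknown models, so the null survives
        # instead of being replaced by an invented placeholder.
        filled["transmission"] = KNOWN_TRANSMISSIONS.get(row.get("model"))
    return filled
```

Leaving the null in place also keeps any code that already tests for it working until that code has been found and reviewed.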