Real World Data is Dirty
We are, at our core, a data company, so my focus naturally turns to the quality of our data a fair bit. In the last week or so I have been looking at the data we have around car emissions and got a little freaked out. With a little digging I realized the origin of the problem is the source data itself. Here are the, let’s call them lessons, I’ve learned since I started looking at this issue.
Lesson One: Your data is only as good as what it is made of
- The US, Australian and UK data are all in completely different formats
- All three release information on different schedules
- Only one (the Australian one) makes it available in an easy programmatic manner
- The UK data does not explicitly list the model year (just ‘since model year 1995’ as part of the description)
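The last point can be worked around with a bit of pattern matching. Here is a minimal sketch in Python, assuming the description text really does contain a phrase like ‘since model year 1995’; the regular expression and the function name are my own illustration, not anything the UK source defines. Note that the value recovered is a lower bound (‘since’), not an exact model year.

```python
import re
from typing import Optional

# Hypothetical pattern: matches phrases such as "since model year 1995".
MODEL_YEAR_RE = re.compile(r"model year\s+(\d{4})", re.IGNORECASE)


def extract_model_year(description: str) -> Optional[int]:
    """Return the four-digit year that follows 'model year', or None if absent."""
    match = MODEL_YEAR_RE.search(description)
    return int(match.group(1)) if match else None
```

Returning None when the phrase is missing keeps the gap visible rather than papering over it with a guess.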
Lesson Two: Look at the code for quality smells
- To go MVC for a second, if your model has a switch statement on the value of a database column, then that is a bad smell
- I’ll be looking at the ‘Fuel Type’ and ‘Transmission’ columns first; they’re rank
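One common way to get rid of that kind of switch is to normalize the column once, through a lookup table, so the model only ever sees canonical values. The sketch below assumes a handful of made-up spellings for ‘Fuel Type’; the alias table and the canonical names are hypothetical, not the real contents of our column.

```python
from typing import Optional

# Hypothetical aliases; the real column values would come from the source data.
FUEL_TYPE_ALIASES = {
    "petrol": "petrol",
    "gasoline": "petrol",
    "diesel": "diesel",
}


def normalize_fuel_type(raw: Optional[str]) -> Optional[str]:
    """Map a raw fuel type string to a canonical value, or None if unknown."""
    if raw is None:
        return None
    return FUEL_TYPE_ALIASES.get(raw.strip().lower())
```

Unknown spellings come back as None so they can be collected and reviewed instead of silently falling into a default branch.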
Lesson Three (A): Define what you want the data to look like
- Right now I’m trying to figure out which information is common to all three sources
- I’ll use that as the grounds for massaging our data into some common order
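Finding the common ground is essentially a set intersection, once each source’s columns have been renamed to a shared vocabulary. A rough sketch, assuming hypothetical per-source rename maps (the column names below are invented for illustration and do not reflect the actual US, Australian or UK files):

```python
from typing import Dict, Set


def common_fields(*rename_maps: Dict[str, str]) -> Set[str]:
    """Return the canonical field names that every source can supply.

    Each rename map goes from a source's native column name to a canonical name.
    """
    canonical_sets = [set(mapping.values()) for mapping in rename_maps]
    if not canonical_sets:
        return set()
    return set.intersection(*canonical_sets)


# Hypothetical column names, for illustration only.
us = {"Fuel": "fuel_type", "Trans": "transmission", "Year": "model_year"}
au = {"FuelType": "fuel_type", "Transmission": "transmission"}
uk = {"Fuel Type": "fuel_type", "Transmission": "transmission", "Description": "description"}

shared = common_fields(us, au, uk)
```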
Lesson Three (B): But only with serious consultation
- I could ‘fix’ the issue with the ‘Fuel Type’ column right now, but I would likely just break a bunch of stuff
- I know which stuff is most at risk, but I’d rather not break anything in the first place
- So I’ll be bouncing things off the environmental scientists that use this data
Lesson Four: Big Bang is not just bad for deployments
- I have identified 5 or 6 different things that affect the quality of this data, and all will be fixed. Eventually.
- Each problem gets its own identifiable, rollback-able fix
Lesson Five: The best offense is a good defense
- Of course, I don’t want to have to do this again, so I’ll have to identify and implement the proper checks for when the data gets updated. That way I won’t have to pay as close attention to this area in the future.
- This may not be as easy as it seems as we expand our offerings and services. Do the checks live in Java? Rails? Right in the database? All three?
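Wherever the checks end up living, their shape is roughly the same: run every incoming record through a list of rules and report what fails. A minimal sketch, assuming records arrive as dictionaries; the field names, the allowed fuel types and the year range are all hypothetical placeholders.

```python
from typing import Any, Dict, List

# Hypothetical rules; the real ones would be agreed with the people who use the data.
ALLOWED_FUEL_TYPES = {"petrol", "diesel"}
MIN_YEAR, MAX_YEAR = 1900, 2100


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in one record."""
    problems = []
    if record.get("fuel_type") not in ALLOWED_FUEL_TYPES:
        problems.append(f"unexpected fuel_type: {record.get('fuel_type')!r}")
    year = record.get("model_year")
    # A missing year is tolerated here; only out-of-range values are flagged.
    if year is not None and not (MIN_YEAR <= year <= MAX_YEAR):
        problems.append(f"model_year out of range: {year!r}")
    return problems
```

An empty list means the record passed; anything else can be logged or used to reject the update before it reaches production.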
Lesson Six: Thank goodness I can script
- I first learned Perl when I needed to hack up a data file. This is no different a task, except I’ll likely do it in Ruby.
- It would be really nice if all these organizations would decide on a standard information set and format and make use of it.
- The format changes even within the same organization from year to year; I need to create 5 or 6 different scripts just to get a common representation of EPA data alone.
- Not to mention the UK or Australia or ???
Lesson Seven: Null is still evil, but sometimes necessary
- Null is Evil; there is usually a better value to put in
- But when you are backporting a common view to existing data you might not know what the better value is
- Better does not mean magic
- The code might be making use of that null somewhere that you don’t (yet) know about
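Before replacing any nulls, it helps to know where they actually are. The sketch below simply counts missing values per field, assuming records are dictionaries and that a missing key and an explicit None both count as null; the function name and the sample field names are illustrative only.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def count_nulls(records: Iterable[Dict[str, Any]], fields: List[str]) -> Counter:
    """Count, per field, how many records have no value (missing key or None)."""
    counts = Counter()
    for record in records:
        for field in fields:
            if record.get(field) is None:
                counts[field] += 1
    return counts


# Hypothetical sample rows, for illustration only.
rows = [
    {"fuel_type": "petrol", "model_year": 1995},
    {"fuel_type": None, "model_year": 2001},
    {"model_year": None},
]
null_counts = count_nulls(rows, ["fuel_type", "model_year"])
```

A report like this gives something concrete to take to the people who know the data before deciding what, if anything, the better value is.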