EPA Fuel Economy Guides – The data provisioning anti-pattern
I’ve started to convert the EPA’s Fuel Economy data into CSV so it can be imported nicely into our database for use in calculations. It has been nowhere as easy as it should be. If you are providing data to customers or the public at large, please do not use the EPA as your model. The following list is my new checklist for testing data exports (and could serve as a step-by-step guide for parsing the 1978 EPA guide.)
- Data Sets – One data set, one file. It’s simple really; do not put the data for California and the other 49 states in the same file.
- Data Formats – One format, one file. If you feel like you have to display data in two different ways, take a step back and think about it. Either put the data into separate files for each format or create a format that works for both.
- Portability – File dumps from your old mainframe or mini in their native format is not portable. CSV (or similar delineated file) or XML are your best choices here. .xls is also not as universally portable as business folk like; it is also not directly importable into a different system (without COM magic)
- Format Consistency – From 1978 – 2009, there are about 4 or 5 different formats that information is provided in. Please use one format for all information. Information that is not easy to use, will not be used.
- Embedded Data – One column, one piece of information. I understand that space used to be a premium in systems in the late 70s through about the early 90s, but there is really no excuse for overloading a single data point with multiple pieces of information. 2DR- 68/ 7 tells us that the particular vehicle is a 2DR body type, has 68 cubic feet of interior space and 7 cubic feet of trunk (or cargo) space. Embedded data makes it hard to do analytics on it.
- Consistent Data – The ‘Fuel System’ section of the data tells whether a car is FI (Fuel Injected) or a number. That number is the number of barrels in the carburetor. The opposite of FI is C, not a number. If you really want to tell use how many barrels there are, put that in a different column.
- Unnecessary Characters – Things like $ are unnecessary to include in the raw data. If you really need to communicate that it is in dollars put that in the column name, or better still, use an accurate heading that makes it clear.