The Universe of Discourse


Tue, 29 Jan 2008

The Census Bureau's data file
Last week I posted an article about calculations with data provided by the U.S. Census Bureau. I'm thinking about writing that up in more detail, but today I learned something so astonishing that I couldn't wait to mention it.

The data is available from the Census Bureau's web site. It is a CSV file. Most of the file contains actual data, like this:

20220,,"Dubuque, IA",Metropolitan Statistical Area,"92,384","91,603","91,223","90,635","89,571","89,216","89,265","89,156","89,143"
Experienced data mungers will feel a sense of foreboding as they look at the commas in those numerals. Commas are for people, and if the data file is written for people, rather than for computers, then getting the computer to read it is going to require at least a little bit of suffering. Indeed, the rest of the data is rather dirty. There is a useless header:
table with row headers in column A and column headers in rows 3 through 4 (leading dots indicate sub-parts),,,,,,,,,,,,^M
"Table 1. Annual Estimates of the Population of Metropolitan and Micropolitan Statistical Areas: April 1, 2000 to July 1, 2006",,,,,,,,,,,,^M
CBSA Code,"Metro
Division
Code",Geographic area,"Legal/statistical
area description",Population estimates,,,,,,,"April 1, 2000",^M
,,,,"July 1, 
2006","July 1, 
2005","July 1, 
2004","July 1, 
2003","July 1, 
2002","July 1, 
2001","July 1, 
2000",Estimates base,Census^M
,,Metropolitan statistical areas,,,,,,,,,,^M
And there is a similarly useless footer on the bottom of the file. Any program that wants to use this data has to trim off the header and the footer, or ignore them, or the user will have to trim them off manually.

(I've translated ASCII CR characters to ^M sequences so that you can see that although the lines of the file are CR-LF terminated, some of the items contain extra LFs for no particular reason.)

Well, all this is minor. My real complaint is that some of the state name abbreviations are garbled:

19740,,"Denver-Aurora, CO1",Metropolitan Statistical Area,"2,408,750","2,361,778","2,326,126","2,299,879","2,276,592","2,245,030","2,193,737","2,179,320","2,179,240"
Notice that it says CO1 rather than CO, short for "Colorado". I was fortunate to notice this garbling. Since it occurred on the line for Denver (among others) the result was that the program was unable to locate the population of Denver, which is the capital of Colorado, and a mandatory part of the program's output. So it raised a warning. Then I went in and manually corrected the CO1 to say CO. I also added a check to the program to make sure that it recognized all the state abbreviations; I should have had this in there in the first place.

Then I sent email to an acquaintance who works for the Census Bureau (identity suppressed to protect the innocent), pointing out the errors so that they could be corrected.

My contact checked with the people who produced the data, and informed me that, according to them, CO1 was not an error. Rather, the 1 was a footnote mark, directing me to a footnote at the bottom of the file:

"1Broomfield, CO was formed from parts of Adams, Boulder, Jefferson, and Weld Counties, CO on November 15, 2001 and was coextensive with Broomfield city.",,,,,,,,,,,,^M
"For purposes of presenting data for metropolitan and micropolitan statistical areas for Census 2000, Broomfield is treated as if it were a county at the time of the 2000 census.",,,,,,,,,,,,^M
A footnote.

I realize now that that footer was not as useless as I thought it was. Wow. A footnote. Wow.

I would like to suggest the following as a basic principle of computerized data processing:

Data files should contain data.

Not metadata. Not explanations. Not little essays. And not footnotes. Just the data.

There's a larger issue here about confusing content and presentation. But "Data files should contain data" is simpler and easier to remember.

I suspect that this file was exported from a spreadsheet program, probably Excel. Spreadsheet programs desperately want you to confuse content and presentation. This is why one should not use a spreadsheet as a database.

I now recall another occasion when I had to deal with data that was exported from a spreadsheet that was pretending to be a database. It was a database of products made by a large cosmetics company. A typical record looked like this:

"Soft-Pressed Powder Blusher","618J-05","Warm, natural-looking powder colour for all skins.  Wide range of shades-subtle to vibrant.  With applicator brush.","Cheeks","Nudes","Chestnut Blush","All","","19951201","Yes","","14.5",""
The 618J-05 here is a product code. Bonus points if you see what's coming next.

"Water-Dissolve Cream Cleanser","6.61E+01","Creamy cleanser for drier, more sensitive skins.  Dissolves even the most tenacious makeups.","Cleansers","","","","Sub I, I, II","19951201","Yes","1","14.5",""
That 6.61E+01 should have been 661E-01, but Excel decided that it was a numeral, in scientific notation, and put it into normal form.

Back to the Census Bureau, which almost screwed me by putting a footnote on a state name. What if they had decided to put footnotes on the population figures? Then I would have been really screwed, because it would have been completely undetectable.

No, wait! It's all become clear. That's why they put the commas in the numerals!

[ Addendum 20080129: My Census Bureau contact tells me that the authors of the data file have seen the wisdom of my point of view, in spite of my unconstructive and unhelpful feedback (I said "Wow, that is an incredibly terrible idea") and are planning to address the issue in the next release of the data. Hooray for happy endings! ]

[ Addendum 20080129: My Census Bureau contact tells me that they do sometimes put footnotes on the data items, so don't laugh too hard at my remark about the commas. ]


[Other articles in category /misc] permanent link