The Universe of Discourse
           
Tue, 29 Jan 2008

The Census Bureau's data file
Last week I posted an article about calculations with data provided by the U.S. Census Bureau. I'm thinking about writing that up in more detail, but today I learned something so astonishing that I couldn't wait to mention it.

The data is available from the Census Bureau's web site. It is a CSV file. Most of the file contains actual data, like this:

20220,,"Dubuque, IA",Metropolitan Statistical Area,"92,384","91,603","91,223","90,635","89,571","89,216","89,265","89,156","89,143"
Experienced data mungers will feel a sense of foreboding as they look at the commas in those numerals. Commas are for people, and if the data file is written for people, rather than for computers, then getting the computer to read it is going to require at least a little bit of suffering. Indeed, the rest of the data is rather dirty. There is a useless header:
table with row headers in column A and column headers in rows 3 through 4 (leading dots indicate sub-parts),,,,,,,,,,,,^M
"Table 1. Annual Estimates of the Population of Metropolitan and Micropolitan Statistical Areas: April 1, 2000 to July 1, 2006",,,,,,,,,,,,^M
CBSA Code,"Metro
Division
Code",Geographic area,"Legal/statistical
area description",Population estimates,,,,,,,"April 1, 2000",^M
,,,,"July 1, 
2006","July 1, 
2005","July 1, 
2004","July 1, 
2003","July 1, 
2002","July 1, 
2001","July 1, 
2000",Estimates base,Census^M
,,Metropolitan statistical areas,,,,,,,,,,^M
And there is a similarly useless footer on the bottom of the file. Any program that wants to use this data has to trim off the header and the footer, or ignore them, or the user will have to trim them off manually.

(I've translated ASCII CR characters to ^M sequences so that you can see that although the lines of the file are CR-LF terminated, some of the items contain extra LFs for no particular reason.)

Well, all this is minor. My real complaint is that some of the state name abbreviations are garbled:

19740,,"Denver-Aurora, CO1",Metropolitan Statistical Area,"2,408,750","2,361,778","2,326,126","2,299,879","2,276,592","2,245,030","2,193,737","2,179,320","2,179,240"
Notice that it says CO1 rather than CO, short for "Colorado". I was fortunate to notice this garbling. Since it occurred on the line for Denver (among others) the result was that the program was unable to locate the population of Denver, which is the capital of Colorado, and a mandatory part of the program's output. So it raised a warning. Then I went in and manually corrected the CO1 to say CO. I also added a check to the program to make sure that it recognized all the state abbreviations; I should have had this in there in the first place.

Then I sent email to an acquaintance who works for the Census Bureau (identity suppressed to protect the innocent), pointing out the errors so that they could be corrected.

My contact checked with the people who produced the data, and informed me that, according to them, CO1 was not an error. Rather, the 1 was a footnote mark, directing me to a footnote at the bottom of the file:

"1Broomfield, CO was formed from parts of Adams, Boulder, Jefferson, and Weld Counties, CO on November 15, 2001 and was coextensive with Broomfield city.",,,,,,,,,,,,^M
"For purposes of presenting data for metropolitan and micropolitan statistical areas for Census 2000, Broomfield is treated as if it were a county at the time of the 2000 census.",,,,,,,,,,,,^M
A footnote.

I realize now that that footer was not as useless as I thought it was. Wow. A footnote. Wow.

I would like to suggest the following as a basic principle of computerized data processing:

Data files should contain data.

Not metadata. Not explanations. Not little essays. And not footnotes. Just the data.

There's a larger issue here about confusing content and presentation. But "Data files should contain data" is simpler and easier to remember.

I suspect that this file was exported from a spreadsheet program, probably Excel. Spreadsheet programs desperately want you to confuse content and presentation. This is why one should not use a spreadsheet as a database.

I now recall another occasion when I had to deal with data that was exported from a spreadsheet that was pretending to be a database. It was a database of products made by a large cosmetics company. A typical record looked like this:

"Soft-Pressed Powder Blusher","618J-05","Warm, natural-looking powder colour for all skins.  Wide range of shades-subtle to vibrant.  With applicator brush.","Cheeks","Nudes","Chestnut Blush","All","","19951201","Yes","","14.5",""
The 618J-05 here is a product code. Bonus points if you see what's coming next.

"Water-Dissolve Cream Cleanser","6.61E+01","Creamy cleanser for drier, more sensitive skins.  Dissolves even the most tenacious makeups.","Cleansers","","","","Sub I, I, II","19951201","Yes","1","14.5",""
That 6.61E+01 should have been 661E-01, but Excel decided that it was a numeral, in scientific notation, and put it into normal form.

Back to the Census Bureau, which almost screwed me by putting a footnote on a state name. What if they had decided to put footnotes on the population figures? Then I would have been really screwed, because it would have been completely undetectable.

No, wait! It's all become clear. That's why they put the commas in the numerals!

[ Addendum 20080129: My Census Bureau contact tells me that the authors of the data file have seen the wisdom of my point of view, in spite of my unconstructive and unhelpful feedback (I said "Wow, that is an incredibly terrible idea") and are planning to address the issue in the next release of the data. Hooray for happy endings! ]

[ Addendum 20080129: My Census Bureau contact tells me that they do sometimes put footnotes on the data items, so don't laugh too hard at my remark about the commas. ]


[Other articles in category /misc] permanent link

Wed, 23 Jan 2008

Smallest state capitals
I think it was James Kushner who first suggested that I consider the question of which U.S. state capital had the smallest population in relation to its state's largest city. For example, New York City is the largest city in the United States, but the state capital of New York State is Albany, with a population of about 850,000. The population of New York City exceeds that of Albany by a factor of around 20.

At the other end of the scale, of course, we have state capitals like Boston, Denver, Atlanta, and Honolulu that are their state's largest cities. For these states, the population quotient is 1, its theoretical minimum.

Well, James, it only took me thirty years, but here it is.

I tried to resolve the question manually a few weeks ago, by browsing Wikipedia for the populations of likely candidates. Today I took a more methodical approach, downloading the U.S. Census Bureau's July 2006 estimates for populations of metropolitan areas, and writing a couple of little programs to grovel the data.

I had to augment the Census Bureau's data with two items: Annapolis, MD, and Montpelier, VT are not large enough to be included in the metropolitan area data file. I used U.S. Census 2006 estimates for these cities as well.

I discarded one conurbation: the Census Bureau includes a "Metropolitan Division" in New Hampshire that consists of Rockingham and Strafford counties; this was the most populous identified area in New Hampshire. It didn't seem entirely germane to the question, so I took it out. On the other hand, including it doesn't change the results much: its population is 416,000, compared with Manchester-Nashua's 402,000.

The results follow.

State Capital and
its Population
Largest metropolitan area
and its population
Quotient
MD Annapolis 36,408 Baltimore-Towson 2,658,405 73.02
IL Springfield 206,112 Chicago-Naperville-Joliet 9,505,748 46.12
NV Carson City 55,289 Las Vegas-Paradise 1,777,539 32.15
VT Montpelier 7,954 Burlington-South Burlington 206,007 25.90
NY Albany 850,957 New York-Northern New Jersey-Long Island 18,818,536 22.11
MO Jefferson City 144,958 St. Louis 2,796,368 19.29
KY Frankfort 69,068 Louisville-Jefferson County 1,222,216 17.70
FL Tallahassee 336,502 Miami-Fort Lauderdale-Miami Beach 5,463,857 16.24
WA Olympia 234,670 Seattle-Tacoma-Bellevue 3,263,497 13.91
AK Juneau 30,737 Anchorage 359,180 11.69
PA Harrisburg 525,380 Philadelphia-Camden-Wilmington 5,826,742 11.09
SD Pierre 19,761 Sioux Falls 212,911 10.77
MI Lansing 454,044 Detroit-Warren-Livonia 4,468,966 9.84
NJ Trenton 367,605 Edison 2,308,777 6.28
CA Sacramento 2,067,117 Los Angeles-Long Beach-Santa Ana 12,950,129 6.26
NM Santa Fe 142,407 Albuquerque 816,811 5.74
OR Salem 384,600 Portland-Vancouver-Beaverton 2,137,565 5.56
DE Dover 147,601 Wilmington 691,688 4.69
VA Richmond 1,194,008 Washington-Arlington-Alexandria 5,290,400 4.43
ME Augusta 121,068 Portland-South Portland-Biddeford 513,667 4.24
TX Austin 1,513,565 Dallas-Fort Worth-Arlington 6,003,967 3.97
AL Montgomery 361,748 Birmingham-Hoover 1,100,019 3.04
NE Lincoln 283,970 Omaha-Council Bluffs 822,549 2.90
WI Madison 543,022 Milwaukee-Waukesha-West Allis 1,509,981 2.78
NH Concord 148,085 Manchester-Nashua 402,789 2.72
KS Topeka 228,894 Wichita 592,126 2.59
MT Helena 70,558 Billings 148,116 2.10
ND Bismarck 101,138 Fargo 187,001 1.85
NC Raleigh 994,551 Charlotte-Gastonia-Concord 1,583,016 1.59
LA Baton Rouge 766,514 New Orleans-Metairie-Kenner 1,024,678 1.34
OH Columbus 1,725,570 Cleveland-Elyria-Mentor 2,114,155 1.23
AR Little Rock 652,834 Little Rock-North Little Rock 652,834 1.00
AZ Phoenix 4,039,182 Phoenix-Mesa-Scottsdale 4,039,182 1.00
CO Denver 2,408,750 Denver-Aurora 2,408,750 1.00
CT Hartford 1,188,841 Hartford-West Hartford-East Hartford 1,188,841 1.00
GA Atlanta 5,138,223 Atlanta-Sandy Springs-Marietta 5,138,223 1.00
HI Honolulu 909,863 Honolulu 909,863 1.00
IA Des Moines 534,230 Des Moines-West Des Moines 534,230 1.00
ID Boise 567,640 Boise City-Nampa 567,640 1.00
IN Indianapolis 1,666,032 Indianapolis-Carmel 1,666,032 1.00
MA Boston 4,455,217 Boston-Cambridge-Quincy 4,455,217 1.00
MN St. Paul 3,175,041 Minneapolis-St. Paul-Bloomington 3,175,041 1.00
MS Jackson 529,456 Jackson 529,456 1.00
OK Oklahoma City 1,172,339 Oklahoma City 1,172,339 1.00
RI Providence 1,612,989 Providence-New Bedford-Fall River 1,612,989 1.00
SC Columbia 703,771 Columbia 703,771 1.00
TN Nashville 1,455,097 Nashville-Davidson--Murfreesboro 1,455,097 1.00
UT Salt Lake City 1,067,722 Salt Lake City 1,067,722 1.00
WV Charleston 305,526 Charleston 305,526 1.00
WY Cheyenne 85,384 Cheyenne 85,384 1.00
Nineteen of fifty state capitals are their state's largest cities.

Vermont is an interesting outlier here. It makes fourth place not because it has a large city, but because its capital, Montpelier, is so very small. I tried doing some scatter plots, to see if anything else jumped out, but they weren't very illuminating. If anything, the data is suprisingly evenly distributed. Here's an example:

The x-axis is the population of the state capital; the y-axis is the quotient. (Both axes are log scale.) Vermont is the leftmost point, near the top. The large collection of points on the x-axis are of course the nineteen states for which the capital and largest city coincide.

[ Addendum 20080129: Some remarks about the format of the Census Bureau's data file. ]


[Other articles in category /misc] permanent link

Tue, 04 Dec 2007

An Austrian coincidence
Only one of these depicts a location in my obstetrician's waiting room where the paint is chipped.

(Andy Lester recently referred to my blog as "the single most intelligent blog out there". I'll make you eat those words, Andy!)


[Other articles in category /misc] permanent link

Thu, 13 Sep 2007

Girls of the SEC
I'm in the Raleigh-Durham airport, and I just got back from the newsstand, where I learned that the pictorial in this month's Playboy magazine this month is "Girls of the SEC". On seeing this, I found myself shaking my head in sad puzzlement.

This isn't the first time I've had this reaction on learning about a Playboy pictorial; last time was probably in August 2002 when I saw the "Women of Enron" cover. (I am not making this up.) I wasn't aware of The December 2002 feature, "Women of Worldcom" (I swear I'm not making this up), but I would have had the same reaction if I had been.

I know that in recent years the Playboy franchise has fallen from its former heights of glory: circulation is way down, the Playboy Clubs have all closed, few people still carry Playboy keychains. But I didn't remember that they had fallen quite so far. They seem to have exhausted all the plausible topics for pictorial features, and are now well into the scraping-the-bottom-of-the-barrel stage. The June 1968 feature was "Girls of Scandinavia". July 1999, "Girls of Hawaiian Tropic". Then "Women of Enron" and now "Girls of the SEC".

How many men have ever had a fantasy about sexy SEC employees, anyway? How can you even tell? Sexy flight attendants, sure; they wear recognizable uniforms. But what characterizes an SEC employee? A rumpled flannel suit? An interest in cost accounting? A tendency to talk about the new Basel II banking regulations? I tried to think of a category that would be less sexually inspiring than "SEC employees". It's difficult. My first thought was "Girls of Wal-Mart." But no, Wal-Mart employees wear uniforms.

If you go too far in that direction you end up in the realm of fetish. For example, Playboy is unlikely to do a feature on "girls of the infectious disease wards". But if they did, there is someone (probably on /b/), who would be extremely interested. It is hard to imagine anyone with a similarly intense interest in SEC employees.

So what's next for Playboy? Girls of the hospital gift shops? Girls of State Farm Insurance telephone customer service division? Girls of the beet canneries? Girls of Acadia University Grounds and Facilities Services? Girls of the DMV?

[ Pre-publication addendum: After a little more research, I figured out that SEC refers here to "Southeastern Conference" and that Playboy has done at least two other features with the same title, most recently in October 2001. I decided to run the article anyway, since I think I wouldn't have made the mistake if I hadn't been prepared ahead of time by "Women of Enron". ]

[ Addendum 20070913: This article is now on the first page of Google searches for "girls of the sec playboy". ]

[ Addendum 20070915: The article has moved up from tenth to third place. Truly, Google works in mysterious ways. ]


[Other articles in category /misc] permanent link

Fri, 20 Jul 2007

Tough questions
It's easy to recognize a good question: a good question is one that takes a lot longer to answer than it does to ask. Chip Buchholtz's example is "what is a byte?" To answer that you have to get into the nitty gritty of computer architecture and how, although the information in the computer is stored by the bit, the memory bus can only address it by the byte.

One of the biology interns asked a me a good one a couple of weeks ago: he asked how, if Perl runs Perl scripts, and the OS is running Perl, what is running the OS? Now that is a tough question to answer. I explained about logic gates, and how the logic gates are built into trivial arithmetic and memory circuits, how these are then built up into ALUs and memories, and how these in turn are controlled by microcode, and finally how the logical parts are assembled into a computer. I don't know how understandable it was, but it was the best I could do in five minutes, and I think I got some of the idea across. But I started and finished by saying that it was basically miraculous.

My daughter Iris asks a ton of questions, some better than others. On any given evening she is likely to ask "Daddy, what are you doing?" about fifteen times, and "why?" about fifteen million times. "Why" can be a great question, but sometimes it's not so great; Iris asks both kinds. Sometimes it's in response to "I'm eating a sandwich." Then the inevitable "why?" is rather annoying.

Some of the "why" questions are nearly impossible to answer. For example, we see a lady coming up the street toward us. "Is that Susanna?" "No." "Why is it not Susanna?"

I think what's happening here is that having discovered this magic word that often produces interesting information, Iris is employing it whenever possible, even when it doesn't make sense, because she hasn't yet learned when it works and when not. Why is that not Susanna? Hey, you never know when you might get an interesting answer. But there might be something else going on that I don't appreciate.

But the nice thing about Iris's incessant questions is that she listens to and remembers the answers, ponders them deeply, and then is likely to come back with an insightful followup when you least expect it.

Order
Make Way for Ducklings
Make Way for Ducklings
with kickback
no kickback
This weekend we went to visit my parents in New York, and as we drove down the Henry Hudson Parkway, we passed the North River wastewater treatment plant. Three-year-olds are fascinated with poop, so I took the opportunity to point out the plant to Iris. I said that although it had a park with trees on the roof, the inside was a giant machine for turning poop into garden soil; they cleaned it and mixed with with wood chips and it composted like the stuff in our composter. (I later found that some of these details were not quite accurate, but the general idea is correct. See the official site for the official story. My wife provided the helpful analogy with the composter.) As I expected, Iris was interested, and thought this over; she confirmed that they turned poop into soil, and then asked what they made pee into. I was not prepared for that one, and I had to promise her I would find out later. It took me some Internet research time to find out about denitrogenation.

Speaking of poop, last month Iris asked a puzzler: why don't birds use toilets? I think this was motivated by our earlier discussion of bird poop on our car.

In Make Way for Ducklings there's a picture of the friendly policeman Michael, running back to his police box to order a police escort to help the ducklings across Beacon Street. He's holding his billy club. Iris asked what that was for. I thought a moment, and then said "It's for hitting people with." Later I wondered if I had given an inaccurate or incomplete answer, so I asked around, and did some reading. It appears I got that one right. Some folks I know suggested that I should have said it was for hitting bad people, but I'd rather stick to the plain facts, and leave out the editorializing.


Order
The Defeat of the Spanish Armada
The Defeat of the Spanish Armada
with kickback
no kickback
Anyway, lately I've been rereading The Defeat of the Spanish Armada, by Garrett Mattingly, which is a really good book; it won a special Pulitzer Prize when it was published. It's about the attempt by Spain to invade England in 1588. The invasion was a failure, and the Spanish got clobbered. Most interesting minor detail: Francis Drake went to St. Vincent the year before the Armada sailed and captured a bunch of merchant ships that were carrying seasoned barrel-staves, which he burnt. As a result, when the mighty Armada sailed, many of the ships had to carry casks made of green wood, and they leaked; whenever the Spanish opened a cask that should have contained food or water, they were as likely as not to find it full of green slime instead.

So I was reading the Mattingly book this evening, and Iris was eating and playing with Play-Doh on the kitchen floor. After the eleventh repetition of "Daddy, what are you doing?" "Reading." I decided to tell Iris what I was reading about. I said that I was reading about ships, that ships are big boats; they carry lots of men and guns. Iris asked why they carried guns, and I explained that often the ships carried treasure, like spices or gold or jewels or cloth, and that pirates tried to steal it. Iris asked if the cloth was like a wash cloth, and I said no, it was more like the kind of cloth that Mommy makes quilts from, or like the silk that her silk dress is made of. I explained about the pirates, which she seemed to understand, because toddlers know all about people who try to take stuff that isn't theirs. And then she asked the question I couldn't answer: Why were there men on the ships, but no women?

I was totally stumped; I don't even know where to begin explaining to a three-year-old why there are no women on ships in 1588. The only answers I could think of had to do with women's traditional roles, with European mores, social constructions of gender, and so on, all stuff that wouldn't help. Sometimes women were smuggled aboard ship, but I wasn't going to say that either.

I don't usually give up, but this time I gave up. This is a tough question of the first order, easy to ask, hard to answer. It's a lot easier to explain wastewater treatment.


[Other articles in category /misc] permanent link