| The Universe of Discourse | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
12 recent entries Archive:
In this section:
Comments disabled |
Tue, 29 Jan 2008
The Census Bureau's data file
The data is available from the Census Bureau's web site. It is a CSV file. Most of the file contains actual data, like this:
20220,,"Dubuque, IA",Metropolitan Statistical Area,"92,384","91,603","91,223","90,635","89,571","89,216","89,265","89,156","89,143"Experienced data mungers will feel a sense of foreboding as they look at the commas in those numerals. Commas are for people, and if the data file is written for people, rather than for computers, then getting the computer to read it is going to require at least a little bit of suffering. Indeed, the rest of the data is rather dirty. There is a useless header: table with row headers in column A and column headers in rows 3 through 4 (leading dots indicate sub-parts),,,,,,,,,,,,^M "Table 1. Annual Estimates of the Population of Metropolitan and Micropolitan Statistical Areas: April 1, 2000 to July 1, 2006",,,,,,,,,,,,^M CBSA Code,"Metro Division Code",Geographic area,"Legal/statistical area description",Population estimates,,,,,,,"April 1, 2000",^M ,,,,"July 1, 2006","July 1, 2005","July 1, 2004","July 1, 2003","July 1, 2002","July 1, 2001","July 1, 2000",Estimates base,Census^M ,,Metropolitan statistical areas,,,,,,,,,,^MAnd there is a similarly useless footer on the bottom of the file. Any program that wants to use this data has to trim off the header and the footer, or ignore them, or the user will have to trim them off manually. (I've translated ASCII CR characters to ^M sequences so that you can see that although the lines of the file are CR-LF terminated, some of the items contain extra LFs for no particular reason.) Well, all this is minor. My real complaint is that some of the state name abbreviations are garbled:
19740,,"Denver-Aurora, CO1",Metropolitan Statistical Area,"2,408,750","2,361,778","2,326,126","2,299,879","2,276,592","2,245,030","2,193,737","2,179,320","2,179,240"
Notice that it says CO1 rather than CO, short for
"Colorado". I was fortunate to notice this garbling. Since it
occurred on the line for Denver (among others) the result was that the
program was unable to locate the population of Denver, which is the
capital of Colorado, and a mandatory part of the program's output. So
it raised a warning. Then I went in and manually corrected the
CO1 to say CO. I also added a check to the program
to make sure that it recognized all the state abbreviations; I should
have had this in there in the first place.Then I sent email to an acquaintance who works for the Census Bureau (identity suppressed to protect the innocent), pointing out the errors so that they could be corrected. My contact checked with the people who produced the data, and informed me that, according to them, CO1 was not an error. Rather, the 1 was a footnote mark, directing me to a footnote at the bottom of the file:
"1Broomfield, CO was formed from parts of Adams, Boulder, Jefferson, and Weld Counties, CO on November 15, 2001 and was coextensive with Broomfield city.",,,,,,,,,,,,^M "For purposes of presenting data for metropolitan and micropolitan statistical areas for Census 2000, Broomfield is treated as if it were a county at the time of the 2000 census.",,,,,,,,,,,,^MA footnote.
I would like to suggest the following as a basic principle of computerized data processing:
Data files should contain data. Not metadata. Not explanations. Not little essays. And not footnotes. Just the data.There's a larger issue here about confusing content and presentation. But "Data files should contain data" is simpler and easier to remember. I suspect that this file was exported from a spreadsheet program, probably Excel. Spreadsheet programs desperately want you to confuse content and presentation. This is why one should not use a spreadsheet as a database. I now recall another occasion when I had to deal with data that was exported from a spreadsheet that was pretending to be a database. It was a database of products made by a large cosmetics company. A typical record looked like this:
"Soft-Pressed Powder Blusher","618J-05","Warm, natural-looking powder colour for all skins. Wide range of shades-subtle to vibrant. With applicator brush.","Cheeks","Nudes","Chestnut Blush","All","","19951201","Yes","","14.5",""The 618J-05 here is a product code. Bonus points if you see what's coming next.
"Water-Dissolve Cream Cleanser","6.61E+01","Creamy cleanser for drier, more sensitive skins. Dissolves even the most tenacious makeups.","Cleansers","","","","Sub I, I, II","19951201","Yes","1","14.5",""
That 6.61E+01 should have been 661E-01, but Excel
decided that it was a numeral, in scientific notation, and put it into
normal form.Back to the Census Bureau, which almost screwed me by putting a footnote on a state name. What if they had decided to put footnotes on the population figures? Then I would have been really screwed, because it would have been completely undetectable. No, wait! It's all become clear. That's why they put the commas in the numerals! [ Addendum 20080129: My Census Bureau contact tells me that the authors of the data file have seen the wisdom of my point of view, in spite of my unconstructive and unhelpful feedback (I said "Wow, that is an incredibly terrible idea") and are planning to address the issue in the next release of the data. Hooray for happy endings! ] [ Addendum 20080129: My Census Bureau contact tells me that they do sometimes put footnotes on the data items, so don't laugh too hard at my remark about the commas. ]
[Other articles in category /misc] permanent link Wed, 23 Jan 2008
Smallest state capitals
At the other end of the scale, of course, we have state capitals like Boston, Denver, Atlanta, and Honolulu that are their state's largest cities. For these states, the population quotient is 1, its theoretical minimum. Well, James, it only took me thirty years, but here it is. I tried to resolve the question manually a few weeks ago, by browsing Wikipedia for the populations of likely candidates. Today I took a more methodical approach, downloading the U.S. Census Bureau's July 2006 estimates for populations of metropolitan areas, and writing a couple of little programs to grovel the data. I had to augment the Census Bureau's data with two items: Annapolis, MD, and Montpelier, VT are not large enough to be included in the metropolitan area data file. I used U.S. Census 2006 estimates for these cities as well. I discarded one conurbation: the Census Bureau includes a "Metropolitan Division" in New Hampshire that consists of Rockingham and Strafford counties; this was the most populous identified area in New Hampshire. It didn't seem entirely germane to the question, so I took it out. On the other hand, including it doesn't change the results much: its population is 416,000, compared with Manchester-Nashua's 402,000. The results follow.
Vermont is an interesting outlier here. It makes fourth place not because it has a large city, but because its capital, Montpelier, is so very small. I tried doing some scatter plots, to see if anything else jumped out, but they weren't very illuminating. If anything, the data is suprisingly evenly distributed. Here's an example:
[ Addendum 20080129: Some remarks about the format of the Census Bureau's data file. ]
[Other articles in category /misc] permanent link Tue, 04 Dec 2007
An Austrian coincidence
[Other articles in category /misc] permanent link Thu, 13 Sep 2007
Girls of the SEC
This isn't the first time I've had this reaction on learning about a Playboy pictorial; last time was probably in August 2002 when I saw the "Women of Enron" cover. (I am not making this up.) I wasn't aware of The December 2002 feature, "Women of Worldcom" (I swear I'm not making this up), but I would have had the same reaction if I had been. I know that in recent years the Playboy franchise has fallen from its former heights of glory: circulation is way down, the Playboy Clubs have all closed, few people still carry Playboy keychains. But I didn't remember that they had fallen quite so far. They seem to have exhausted all the plausible topics for pictorial features, and are now well into the scraping-the-bottom-of-the-barrel stage. The June 1968 feature was "Girls of Scandinavia". July 1999, "Girls of Hawaiian Tropic". Then "Women of Enron" and now "Girls of the SEC". How many men have ever had a fantasy about sexy SEC employees, anyway? How can you even tell? Sexy flight attendants, sure; they wear recognizable uniforms. But what characterizes an SEC employee? A rumpled flannel suit? An interest in cost accounting? A tendency to talk about the new Basel II banking regulations? I tried to think of a category that would be less sexually inspiring than "SEC employees". It's difficult. My first thought was "Girls of Wal-Mart." But no, Wal-Mart employees wear uniforms. If you go too far in that direction you end up in the realm of fetish. For example, Playboy is unlikely to do a feature on "girls of the infectious disease wards". But if they did, there is someone (probably on /b/), who would be extremely interested. It is hard to imagine anyone with a similarly intense interest in SEC employees. So what's next for Playboy? Girls of the hospital gift shops? Girls of State Farm Insurance telephone customer service division? Girls of the beet canneries? Girls of Acadia University Grounds and Facilities Services? Girls of the DMV? [ Pre-publication addendum: After a little more research, I figured out that SEC refers here to "Southeastern Conference" and that Playboy has done at least two other features with the same title, most recently in October 2001. I decided to run the article anyway, since I think I wouldn't have made the mistake if I hadn't been prepared ahead of time by "Women of Enron". ] [ Addendum 20070913: This article is now on the first page of Google searches for "girls of the sec playboy". ] [ Addendum 20070915: The article has moved up from tenth to third place. Truly, Google works in mysterious ways. ]
[Other articles in category /misc] permanent link Fri, 20 Jul 2007
Tough questions
One of the biology interns asked a me a good one a couple of weeks ago: he asked how, if Perl runs Perl scripts, and the OS is running Perl, what is running the OS? Now that is a tough question to answer. I explained about logic gates, and how the logic gates are built into trivial arithmetic and memory circuits, how these are then built up into ALUs and memories, and how these in turn are controlled by microcode, and finally how the logical parts are assembled into a computer. I don't know how understandable it was, but it was the best I could do in five minutes, and I think I got some of the idea across. But I started and finished by saying that it was basically miraculous. My daughter Iris asks a ton of questions, some better than others. On any given evening she is likely to ask "Daddy, what are you doing?" about fifteen times, and "why?" about fifteen million times. "Why" can be a great question, but sometimes it's not so great; Iris asks both kinds. Sometimes it's in response to "I'm eating a sandwich." Then the inevitable "why?" is rather annoying. Some of the "why" questions are nearly impossible to answer. For example, we see a lady coming up the street toward us. "Is that Susanna?" "No." "Why is it not Susanna?" I think what's happening here is that having discovered this magic word that often produces interesting information, Iris is employing it whenever possible, even when it doesn't make sense, because she hasn't yet learned when it works and when not. Why is that not Susanna? Hey, you never know when you might get an interesting answer. But there might be something else going on that I don't appreciate. But the nice thing about Iris's incessant questions is that she listens to and remembers the answers, ponders them deeply, and then is likely to come back with an insightful followup when you least expect it.
Speaking of poop, last month Iris asked a puzzler: why don't birds use toilets? I think this was motivated by our earlier discussion of bird poop on our car. In Make Way for Ducklings there's a picture of the friendly policeman Michael, running back to his police box to order a police escort to help the ducklings across Beacon Street. He's holding his billy club. Iris asked what that was for. I thought a moment, and then said "It's for hitting people with." Later I wondered if I had given an inaccurate or incomplete answer, so I asked around, and did some reading. It appears I got that one right. Some folks I know suggested that I should have said it was for hitting bad people, but I'd rather stick to the plain facts, and leave out the editorializing.
So I was reading the Mattingly book this evening, and Iris was eating and playing with Play-Doh on the kitchen floor. After the eleventh repetition of "Daddy, what are you doing?" "Reading." I decided to tell Iris what I was reading about. I said that I was reading about ships, that ships are big boats; they carry lots of men and guns. Iris asked why they carried guns, and I explained that often the ships carried treasure, like spices or gold or jewels or cloth, and that pirates tried to steal it. Iris asked if the cloth was like a wash cloth, and I said no, it was more like the kind of cloth that Mommy makes quilts from, or like the silk that her silk dress is made of. I explained about the pirates, which she seemed to understand, because toddlers know all about people who try to take stuff that isn't theirs. And then she asked the question I couldn't answer: Why were there men on the ships, but no women? I was totally stumped; I don't even know where to begin explaining to a three-year-old why there are no women on ships in 1588. The only answers I could think of had to do with women's traditional roles, with European mores, social constructions of gender, and so on, all stuff that wouldn't help. Sometimes women were smuggled aboard ship, but I wasn't going to say that either. I don't usually give up, but this time I gave up. This is a tough question of the first order, easy to ask, hard to answer. It's a lot easier to explain wastewater treatment.
[Other articles in category /misc] permanent link |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||