Sat, 01 Apr 2023
United States first names of newborns 1960–2021
Various United States government agencies keep statistics of forenames and surnames, which I have often found useful, but which can be hard to find. I am storing them here for future convenience. The data is in the public domain. Share and enjoy.
What's here?
A collection of 62×51 = 3162 CSV files with names. By the way, !!3162\approx\sqrt{10000000}!!.
File format
Each file contains at least 200 records in the following format:
The fields are: year; state; sex; name; count. The (year, state, sex, name) tuple is a unique key over the entire data set. The count is the number of babies with the specified name and sex born that year in that state. For example, the records above indicate that there were 152 male babies named David born in Alaska in year 1960, and 79 female babies named Mary.
At least the 100 most common names for each year-state-sex triple are included. I believe that the extra records are included when the least common names are tied for frequency.
Provenance
The United States Social Security Administration (SSA) provides a web form that will deliver data for one year-state pair. I automated 3162 requests to this form and then scraped the HTML output. (A couple of months ago I found a source for the raw data, as a single easy download, and then forgot where it was. Hence this post.)
Caution: data contains errors
There are several possible sources of error in these files. Most obviously, I might have made a mistake in the extraction, scraping, or recording. But it also seems to me that the SSA itself provided some bad data.
The data for Kentucky for year 2004 is clearly incorrect. The name “Jacob” for girls is not to be found in the data, except in the […] I've added a comment to the top of the […]
I have contacted the SSA web site but I am not hopeful that this will be corrected. [ Update 20230413: Here is their completely nonresponsive reply to my bug report. ]
Update 20230812: I recently discovered that some of the files contain entries for people named “Unknown”, and the 1989 Wisconsin file contains entries for 158 people named “Unnamed”. Also, there are no other names in any of the files that begin with the letter ‘U’. Really? Hundreds of Xaviers and Ximenas, but no significant numbers of Ulysseses or Ursulas? I don't know whether this is another error. If you notice other likely errors, please bring them to my attention.
Data limitations
The SSA has a page of qualifications about the data. Since I expect this page will outlast the SSA's, here is an archived copy.
What for?
A coworker recently mentioned that, of the 37 people who attended the most recent meeting of his (California) church youth group, no two had the same name. He asked if this was remarkable. (My conclusion: only somewhat; the computed probability was around 1 in 5, and would be higher if I hadn't had to ignore the long tail of names that aren't in the files.)
More recently, I had an argument with ChatGPT about whether the name “James” was commonly used for women. It is not listed as one of the top 100 names in any state for the last 60 years, except in the obviously erroneous Kentucky 2004 data, as I mentioned above.
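A minimal sketch of one way to estimate the youth-group probability from files like these; it is not necessarily the computation that produced the 1-in-5 figure. It treats each person's name as an independent draw weighted by the recorded counts, so the chance that n draws are all different is n! times the n-th elementary symmetric polynomial of the name probabilities. The function names `name_counts` and `prob_all_distinct` are hypothetical. The comma delimiter, the absence of a header row, and the `#` comment marker are assumptions; only the field order comes from the description of the records.

```python
import csv
from collections import Counter


def name_counts(lines, state, years):
    """Sum baby counts per name for one state over a set of birth years.

    Assumes records look like `year,state,sex,name,count` (field order
    from the text; comma delimiter and no header row are assumptions).
    Lines starting with '#' are skipped as comments (also an assumption).
    Counts for the same name under both sexes are added together.
    """
    totals = Counter()
    data = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    for year, st, _sex, name, count in csv.reader(data):
        if st == state and int(year) in years:
            totals[name] += int(count)
    return totals


def prob_all_distinct(weights, n):
    """Probability that n independent weighted draws are all different.

    Maintains f[k] = k! * e_k(p), where e_k is the k-th elementary
    symmetric polynomial of the probabilities processed so far.
    Adding a probability p updates f[k] += k * p * f[k - 1].
    """
    weights = list(weights)
    total = sum(weights)
    f = [1.0] + [0.0] * n
    for w in weights:
        p = w / total
        # Go downward so each name contributes to a term at most once.
        for k in range(n, 0, -1):
            f[k] += k * p * f[k - 1]
    return f[n]
```

With the relevant records gathered into `lines`, a call such as `prob_all_distinct(name_counts(lines, "CA", range(2005, 2012)).values(), 37)` would give an estimate. The state code and the birth years in that call are placeholders. Because the files list only the most common names, the weights leave out the long tail, which pushes the estimate down.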