The Universe of Discourse


Sat, 01 Apr 2023

United States first names of newborns 1960–2021

Various United States government agencies keep statistics of forenames and surnames, which I have often found useful, but which can be hard to find. I am storing them here for future convenience. The data is in the public domain. Share and enjoy.

What's here?

A collection of 62×51 = 3162 CSV files with names dddd-XX.csv, where dddd is a four-digit year between 1960 and 2021, and XX is a standard two-letter state abbreviation, or DC. (Information for Puerto Rico and other U.S. territories was available from the SSA but I did not collect it.)

By the way, !!3162\approx\sqrt{10000000}!!.
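The naming scheme is regular enough to enumerate. The following is a minimal Python sketch, not part of the data set itself; the `STATES` list and the function name `expected_filenames` are my own illustrative choices.

```python
# Two-letter abbreviations for the 50 states plus DC (51 entries).
STATES = """
AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS
MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY
""".split()


def expected_filenames(first_year=1960, last_year=2021):
    """Return every file name of the form dddd-XX.csv, year by year."""
    return [
        f"{year}-{state}.csv"
        for year in range(first_year, last_year + 1)
        for state in STATES
    ]
```

Comparing this list against a directory listing is one quick way to confirm that nothing is missing from a downloaded copy.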

File format

Each file contains at least 200 records in the following format:

    1960,AK,M,David,152
    1960,AK,F,Mary,79

The fields are: year; state; sex; name; count. The (year, state, sex, name) tuple is a unique key over the entire data set.

The count is the number of babies with the specified name and sex born that year in that state. For example, the records above indicate that there were 152 male babies named David born in Alaska in year 1960, and 79 female babies named Mary.
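Records in this format can be read with Python's standard `csv` module. This is only a sketch: the `Record` type and the `parse_records` function are names I made up for illustration, and the parser is deliberately strict about the five-field layout.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    year: int
    state: str
    sex: str
    name: str
    count: int


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Parse lines such as '1960,AK,M,David,152' into Record tuples."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if len(row) != 5:
            raise ValueError(f"expected 5 fields, got {row!r}")
        year, state, sex, name, count = row
        # int() raises ValueError if year or count is not numeric
        yield Record(int(year), state, sex, name, int(count))
```

An open file can be passed directly, for example `parse_records(open("1960-AK.csv", newline=""))`. Any line that is not a five-field record raises `ValueError` rather than being silently skipped.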

At least the 100 most common names for each year-state-sex triple are included. I believe that the extra records are included when the least common names are tied for frequency.
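My belief about ties is easy to test mechanically. Here is a rough Python sketch, assuming you already have the list of counts for a single year-state-sex triple; the function name `extras_are_ties` is hypothetical.

```python
def extras_are_ties(counts, limit=100):
    """True if every record beyond the first `limit` ties the cutoff count."""
    ordered = sorted(counts, reverse=True)
    if len(ordered) <= limit:
        return True  # no extra records to explain
    cutoff = ordered[limit - 1]  # count of the limit-th most common name
    return all(c == cutoff for c in ordered[limit:])
```

If this returns `False` for some triple, then the extra records there are not explained by ties alone.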

Provenance

The United States Social Security Administration (SSA) provides a web form that will deliver data for one year-state pair. I automated 3162 requests to this form and then scraped the HTML output.

(A couple of months ago I found a source for the raw data, as a single easy download, and then forgot where it was. Hence this post.)

Caution: data contains errors

There are several possible sources of error in these files. Most obviously, I might have made a mistake in the extraction, scraping, or recording.

But also it seems to me that the SSA itself provided some bad data. The data for Kentucky for year 2004 is clearly incorrect.

The name “Jacob” for girls is not to be found in the data — except in the 2004-KY.csv file. The SSA claims that there were 130 baby girls in Kentucky that year named Jacob, making it the 17th most common female name, ahead of Anna and just behind Lauren. The same file contains many similar oddities. For example, it claims that there were dozens of boys named Hannah and Emily, and dozens of girls named Michael, Joseph, and Christopher, but only that year and only in Kentucky.
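Oddities of this kind can be hunted for by listing every sex-name pair that appears in exactly one file. The sketch below is one possible approach, not necessarily how I found the Kentucky problem; it assumes the records are available as (year, state, sex, name, count) tuples, and `names_in_single_file` is a made-up name.

```python
from collections import defaultdict


def names_in_single_file(records):
    """Map each (sex, name) seen in exactly one (year, state) to that pair."""
    seen_in = defaultdict(set)
    for year, state, sex, name, _count in records:
        seen_in[(sex, name)].add((year, state))
    return {
        key: next(iter(where))
        for key, where in seen_in.items()
        if len(where) == 1
    }
```

The result is only a list of candidates. A name can legitimately reach the top 100 in a single state and year, so each hit still needs a human to look at it.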

I've added a comment to the top of the 2004-KY.csv file, which I hope will cause a processing failure so that nobody uses it accidentally.

I have contacted the SSA web site but I am not hopeful that this will be corrected. [ Update 20230413: Here is their completely nonresponsive reply to my bug report. ]

Update 20230812: I recently discovered that some of the files contain entries for people named “Unknown”, and the 1989 Wisconsin file contains entries for 158 people named “Unnamed”. Also, there are no other names in any of the files that begin with the letter ‘U’. Really? Hundreds of Xaviers and Ximenas, but no significant numbers of Ulysseses or Ursulas? I don't know whether this is another error.
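A check like this takes only a few lines. The following Python sketch assumes you have collected all the names into one iterable; the function name and the `PLACEHOLDERS` set are my own illustrative choices.

```python
PLACEHOLDERS = {"Unknown", "Unnamed"}


def names_starting_with(names, letter, ignore=PLACEHOLDERS):
    """Return the distinct names beginning with `letter`, minus placeholders."""
    prefix = letter.upper()
    matches = {n for n in names if n.upper().startswith(prefix)}
    return sorted(matches - set(ignore))
```

An empty result for `"U"` alongside a non-empty result for `"X"` reproduces the observation above.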

If you notice other likely errors, please bring them to my attention.

Data limitations

The SSA has a page of qualifications about the data.

Since I expect this page will outlast the SSA's, here is an archived copy.

What for?

A coworker recently mentioned that, of the 37 people who attended the most recent meeting of his (California) church youth group, no two had the same name. He asked if this was remarkable. (My conclusion: only somewhat; the computed probability was around 1 in 5, and would be higher if I hadn't had to ignore the long tail of names that aren't in the files.)
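One way to do a calculation of this sort is sketched below. It rests on simplifying assumptions that I should state plainly: every person's name is drawn independently from the same distribution, that distribution is proportional to the counts in the files, and names outside the files are ignored. With probabilities !!p_1, \ldots, p_m!!, the chance that !!n!! draws are all different is !!n!!! times the degree-!!n!! elementary symmetric polynomial in the !!p_i!!. The function name `prob_all_distinct` is made up.

```python
from math import factorial


def prob_all_distinct(weights, n):
    """Probability that n independent draws all yield different names.

    `weights` are non-negative counts, one per distinct name; they are
    normalized here, so names missing from the list are simply ignored.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    # e[k] is the degree-k elementary symmetric polynomial of the
    # probabilities processed so far.
    e = [1.0] + [0.0] * n
    for p in probs:
        for k in range(n, 0, -1):  # descending, so each p is used once per term
            e[k] += p * e[k - 1]
    return factorial(n) * e[n]
```

As a sanity check, 365 equal weights and !!n = 23!! give the familiar birthday-problem answer of a bit under one half.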

More recently, I had an argument with ChatGPT about whether the name “James” was commonly used for women. It is not listed as one of the top 100 names in any state for the last 60 years, except in the obviously erroneous Kentucky 2004 data, as I mentioned above.

