The Universe of Discourse

Fri, 16 Feb 2024

Etymology roundup 2024-02

The Recurse Center Zulip chat now has an Etymology channel, courtesy of Jesse Chen, so I have been posting whenever I run into something interesting. This is a summary of some of my recent discoveries. Everything in this article is, to the best of my knowledge, accurate. That is, there are no intentional falsehoods.

Baba ghanouj

I tracked down the meaning of (Arabic) baba ghanouj. It was not what I would have guessed.

Well, sort of. Baba is “father” just like in every language. I had thought of this and dismissed it as unlikely. (What is the connection with eggplants?) But that is what it is.

And ghanouj is …

So it's the father of coquetry.

Very mysterious.


Toph asked me if “nog” appeared in any word other than “eggnog”. Is there lemonnog or baconnog? I had looked this up before but couldn't remember what it was except that it was some obsolete word for some sort of drink.

“Nog” is an old Norfolk (England) term for a kind of strong beer which was an ingredient in the original recipe, sometime in the late 17th or early 18th century.

I think modern recipes don't usually include beer.


“Wow!” appears to be an 18th-century borrowing from an indigenous American language, because most of its early appearances are quotes from indigenous Americans. It is attested in standard English from 1766, spelled “waugh!”, and in Scots English from 1788, spelled “vow!”


Katara asked me for examples of words in English like “bear” where there are two completely unrelated meanings. (The word bear like to bear fruit, bear children, or bear a burden is not in any way related to the big brown animal with claws.)

There are a zillion examples of this. They're easy to find in a paper dictionary: you just go down the margin looking for a superscript. When you see “bear¹” and “bear²”, you know you've found an example.

The example I always think of first is “venery” because long, long ago Jed Hartman pointed it out to me: venery can mean stuff pertaining to hunting (it is akin to “venison”) and it can also mean stuff pertaining to sex (akin to “venereal”) and the fact that these two words are spelled the same is a complete coincidence.

Jed said “I bet this is a really rare phenomenon” so I harassed him for the next several years by emailing him examples whenever I happened to think of it.

Anyway, I found an excellent example for Katara that is less obscure than “venery”: “riddle” (like a puzzling question) has nothing to do with when things are riddled with errors. It's a complete coincidence.

The “bear” / “bear” example is a nice simple one, everyone understands it right away. When I was studying Korean I asked my tutor an etymology question, something like whether the “eun” in eunhaeng 은행, “bank”, was the same word as “eun” 은 which means “silver”. He didn't understand the question at first: what did I mean, “is it the same word”?

I gave the bear / bear example, and said that to bear fruit and to bear children are the same word, but the animal with claws is a different word, and just a coincidence that it is spelled the same way. Then he understood what I meant.

(Korean eunhaeng 은행 is a Chinese loanword, from 銀行. 銀 is indeed the word for silver, and 行 is a business-happening-place.)

Right and left

The right arm is the "right" arm because, being the one that is (normally) stronger and more adept, it is the right one to use for most jobs.

But if you ignore the right arm, there is only one left, so that is the "left" arm.

This sounds like a joke, but I looked it up and it isn't.

Leave and left

"Left" is the past tense passive of "leave". As in, I leave the room, I left the room, when I left the room I left my wallet there, my wallet was left, etc.

(As noted above, this is also where we get the left side.)

There are two other words "leave" in English. Leaves like the green things on trees are not related to leaving a room.

(Except I was once at a talk by J.H. Conway in which he was explaining some sort of tree algorithm in which certain nodes were deleted and he called the remaining ones "leaves" because they were the ones that were left. Conway was like that.)

The other "leave" is the one that means "permission" as in "by your leave…". This is the leave we find in "sick leave" or "shore leave". They are not related to the fact that you have left on leave, that is a coincidence.

Normal norms

Latin norma is a carpenter's square, for making sure that things are at right angles to one another.

So something that is normal is something that is aligned the way things are supposed to be aligned, that is to say at right angles. And a norm is a rule or convention or standard that says how things ought to line up.

In mathematics and physics we have terms like “normal vector”, “normal forces” and the like, which means that vectors or forces are at right angles to something. This is puzzling if you think of “normal” as “conventional” or “ordinary” but becomes obvious if you remember the carpenter's square.

In contrast, mathematical “normal forms” have nothing to do with right angles, they are conventional or standard forms. “Normal subgroups” are subgroups that behave properly, the way subgroups ought to.

The names Norman and Norma are not related to this. They are related to the surname Norman which means a person from Normandy. Normandy is so-called because it was inhabited by Vikings (‘northmen’) starting from the 9th century.

Hydrogen and oxygen

Jesse Chen observed that hydrogen means “water-forming”, because when you burn it you get water.

A lot of element names are like this. Oxygen is oxy- (“sharp” or “sour”) because it makes acids, or was thought to make acids. In German the analogous calque is “sauerstoff”.

Nitrogen makes nitre, which is an old name for saltpetre (potassium nitrate). German for nitre seems to be salpeter which doesn't work as well with -stoff.

The halogen gases are ‘salt-making’. (Greek for salt is hals.) Chlorine, for example, is a component of table salt, which is sodium chloride.

In Zulip I added that the capital of Denmark, Copenha-gen, is so-called because in the 11th century it was a major site for the production of koepenha, a Germanic term for a lye compound, used in leather tanning processes, produced from bull dung. I was somewhat ashamed when someone believed this lie despite my mention of bull dung.

Spas, baths, and coaches

Spas (like wellness spa or day spa) are named for the town of Spa, Belgium, which has been famous for its cold mineral springs for thousands of years!

(The town of Bath, England is named for its baths, not the other way around.)

The coach is named for the town of Kocs (pronounced “coach”), Hungary, where it was invented. This sounds like something I would make up to prank the kids, but it is not.

Spanish churches

“Iglesia” is Spanish for “church”, and you see it as a surname in Spanish as in English. (I guess, like “Church”, originally the name of someone who lived near a church).

Thinking on this, I realized: “iglesia” is akin to English “ecclesiastic”.

They're both from ἐκκλησία which is an assembly or congregation.

The mysterious Swedish hedgehog

In German, a hedgehog is “Igel”. This is a very ancient word, and several other Germanic languages have similar words. For example, in Frisian it's “ychel”.

In Swedish, “igel” means leech. The hedgehog is “igelkott”.

I tried to find out what -kott was about. “kotte” is a pinecone and may be so-called because “kott” originally meant some rounded object, so igelkott would mean the round igel rather than the blood igel, which is sometimes called blodigel in Swedish.

I was not able to find any other words in Swedish with this sense of -kott. There were some obviously unrelated words like bojkott (“boycott”). And there are a great many Swedish words that end in -skott, which is also unrelated. It means “tail”. For example, the grip of a handgun is revolverskott.

[ Addendum: Gustaf Erikson advises me that I have misunderstood ‑skott; see below. ]

Bonus hedgehog weirdness: In Michael Moorcock's Elric books, Elric's cousin is named “Yyrkoon”. The Middle English for a hedgehog is “yrchoun” (variously spelled). Was Moorcock thinking of this? The -ch- in “yrchoun” is t͡ʃ though, which doesn't match the stop consonant in “Yyrkoon”. It also makes clear that “yrchoun” is just a variant spelling of “urchin”. (Compare “sea urchin”, which is a sea hedgehog. Or compare “street urchin”, a small round bristly person who scuttles about in the gutter.)

In Italian a hedgehog is riccio, which I think is also used as a nickname for a curly-haired or bristly-haired person.

Slobs and schlubs

These are not related. Schlub is originally Polish, coming to English via (obviously!) Yiddish. But slob is Irish.

-euse vs. -ice

I tried to guess the French word for a female chiropractor. I guessed “chiropracteuse” by analogy with masseur, masseuse, but I was wrong. It is chiropractrice.

The '‑ice' suffix was clearly descended from the Latin '‑ix' suffix, but I had to look up ‘‑euse’. It's also from a Latin suffix, this time from ‘‑osa’.


When you jot something down on a notepad, the “jot” is from Greek iota, which is the name of the small, simple letter ι that is easily jotted.

Bonus: This is also the jot that is meant by someone who says “not a jot or a tittle”, for example Matthew 5:18 (KJV):

For verily I say unto you, Till heaven and earth pass, one jot or one tittle shall in no wise pass from the law, till all be fulfilled.

A tittle is the dot above the lowercase ‘i’ or ‘j’. The NIV translates this as “not the smallest letter, not the least stroke of a pen”, which I award an A-plus for translation.

Vilifying villains

I read something that suggested that these were cognate, but they are not.

“Vilify” is from Latin vīlificō which means to vilify. It is a compound of vīlis (of low value or worthless, I suppose the source of “vile”) and faciō (to make, as in “factory” and “manufacture”.)

A villain, on the other hand, was originally just a peasant or serf; that is, a person who lives in a village. “Village” is akin to Latin villa, which originally meant a plantation.

Döner kebab

I had always assumed that “Döner” and its “ö” were German, but they are not, at least not originally. “Döner kebab” is the original Turkish name of the dish, right down to the diaeresis on the ‘ö’, which is the normal Turkish spelling; Turkish has an ‘ö’ also. Döner is the Turkish word for a turning-around-thing, because döner kebab meat roasts on a vertical spit from which it is sliced off as needed.

“Döner” was also used in Greek as a loanword but at some point the Greeks decided to use the native Greek word gyro, also a turning-around-thing, instead. Greek is full of Turkish loanwords. (Ottoman Empire, yo.)

“Shawarma”, another variation on the turning-around-vertical-spit dish, is from a different Ottoman Turkish word for a turning-around thing, this time چویرمه (çevirme).

The Armenian word for shawarma is also shawarma, but despite Armenian being full of Turkish loanwords, this isn't one. They got it from Russian.

Everyone loves that turning-on-a-vertical-spit dish. Lebanese immigrants brought it to Mexico, where it is served in tacos with pineapple and called tacos al pastor (“shepherd style”). I do not know why the Mexicans think that Lebanese turning-around-meat plus pineapples adds up to shepherds. I suppose it must be because the meat is traditionally lamb.

Roll call

To roll is to turn over with a circular motion. This motion might wind a long strip of paper into a roll, or it might roll something into a flat sheet, as with a rolling pin. After rolling out the flat sheet you could then roll it up into a roll.

Dinner rolls are made by rolling up a wad of bread dough.

When you call the roll, it is because you are reading a list of names off a roll of paper.

Theatrical roles are from French rôle which seems to have something to do with rolls but I am not sure what. Maybe because the cast list is a roll (as in roll call).

Wombats and numbats

Both of these are Australian animals. Today it occurred to me to wonder: are the words related? Is -bat a productive morpheme, maybe a generic animal suffix in some Australian language?

The answer is no! The two words are from different (although distantly related) languages. Wombat is from Dharug, a language of the Sydney area. Numbat is from the Nyungar language, spoken on the other end of the continent.


Gustaf Erikson advises me that I have misunderstood ‑skott. It is akin to English shoot, and means something that springs forth suddenly, like little green shoots in springtime, or like the shooting of an arrow. In the former sense, it can mean a tail or a sticking-out thing more generally. But in revolverskott it is the latter sense, the firing of a revolver.

[Other articles in category /lang/etym] permanent link

Sun, 03 Dec 2023

Compass points in Czech

Over the weekend a Gentle Reader sent me an anecdote about getting lost in a Czech zoo. He had a map with a compass rose, and the points of the compass were labeled SVZJ. Gentle Reader expected that S and V were south and west, as they are in many European languages. (For example, Danish has syd and vest; English has “south” and “vvest”… sorry, “west”.)

Unfortunately in Czech, S and V are sever, “north”, and východ, “east”. Oops.

A while back I was thinking about the names of the cardinal directions in Catalán because I was looking at a Catalán map of the Sagrada Família, and observed that the Catalán word for east, llevant, is a form of llevar, which literally means “to rise”, because the east is where the sun rises. (Llevar is from Latin levāre and is akin to words like “levity” and “levitate”.) Similarly the Latin word for “east” is oriēns, from orior, to get up or to arise.

I looked into the Czech a little more and learned that východ, “east”, is also the Czech word for “exit”:

Photo of a sign in the Prague airport, labeled “VÝCHOD / EXIT”

“Aha,” I said. “They use východ for “east” not because that's where the sun comes up but because that's where it enters…”



No. Entrance is not exit. Východ is exit. Entrance is vchod.

I dunno, man. I love the Czechs, but this is a little messed up.


  • I think I recall that sever, “north”, is thought to be maybe akin to “shower”, since the north is whence the cold rains come, but maybe I made that up.

  • An earlier version of this article had an error about the Catalán. Thanks to Alex Corcoles for pointing this out.

  • I never mentioned the other Czech compass points. They are: sever, north; východ, east; západ, west; jih, south. Východ seems to be related to Russian восто́к (/vostók/) but I'm not sure how.

  • 20231220: A Gentle Reader asked about the pronunciation of vchod as compared with východ. Is it some unpronounceable Czechism? Nope! I was very pleased with the analogous example I found: the difference is no harder to hear or to say than the difference between “climb” and “keylime”.

[Other articles in category /lang/etym] permanent link

Fri, 01 Dec 2023

Obsolete spellings and new ligatures in the names of famous persons

There's this technique you learn in elementary calculus called l'Hospital's rule or l'Hôpital's rule, depending on where and when you learned it. It's named for Guillaume l'Hospital or Guillaume l'Hôpital.

In modern French the ‘s’ is silent before certain consonants, and sometime in the 18th century it became standard to omit it, instead putting a circumflex over the preceding vowel to show that the ‘s’ was lurking silently. You can see the same thing in many French words, where the relationship with English becomes clear if you remember that the circumflex indicates a silent letter ‘s’. For example

  • côte (coste, coast)
  • fête (feste, feast)
  • île (isle, isle)
  • pâté (paste, paste)

and of course

  • hôpital (hospital, hospital)

Wikipedia has a longer list.
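As a rough sketch of the circumflex-for-‘s’ pattern in the list above, this Python snippet puts the ‘s’ back by swapping the combining circumflex for a literal ‘s’. The name `restore_s` is made up for the illustration, and the rule is only a heuristic: not every French circumflex marks a lost ‘s’.

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT


def restore_s(word):
    """Heuristic (hypothetical helper): turn each circumflex into a following 's'."""
    # NFD splits 'ô' into 'o' plus a separate combining circumflex.
    decomposed = unicodedata.normalize("NFD", word)
    restored = decomposed.replace(CIRCUMFLEX, "s")
    # NFC puts any remaining accents back together with their letters.
    return unicodedata.normalize("NFC", restored)


for modern in ["côte", "fête", "île", "hôpital"]:
    print(modern, "->", restore_s(modern))
```

Run on the four words shown, it recovers the older spellings given in the list; words whose circumflex has some other origin would come out wrong.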

But the spelling change from ‘os’ to ‘ô’ didn't become common until the 18th century and l'Hôpital, who died in 1704, spelled his name the old way, as “l'Hospital”. The spelling with the circumflex is in some sense an anachronism. I've always felt a little funny about this. I suppose the old spelling looks weird to francophones but I'm not a francophone and it seems weird to me to spell it in a way that l'Hospital himself would not have recognized.

For a long time I felt this way about English names also, and spelled Shakespeare's name “Shakspere”. I eventually gave up on this, because I thought it would confuse people. But I still think about the question every time I have to spell it and wonder what Shakespeare would have thought. Perhaps he would have thought nothing of it, living as he did in a time of less consistent orthography.

To find out the common practice, I went to the German Wikipedia page for Karl Gauss, for whom a similar issue arises. They spell it the modern way, “Gauß”. But now another issue intrudes: They spell it “Carl” and not “Karl”! If the name were completely modernized, wouldn't it be “Karl Gauß” and not “Carl Gauß”? Is “Carl” still a thing in German?

Gauss is glowering down at me from his picture on an old ten-mark banknote I keep on my wall, so I checked just now and Deutsche Bundesbank also spells it “Carl Gauß”. (The caption sprouts forth from his left shoulder.)

Obverse of 1999 10 Deutsche mark banknote with portrait of Karl Gauss. The portrait is captioned “1777–1855 Carl Friedr. Gauß”

Now I wonder why I checked the German Wikipedia for Gauss before checking the French Wikipedia for l'Hôpital. Pure stupidity on my part. French Wikipedia uniformly spells it the modern way, with the circumflex.

I suppose I will have to change my practice, and feel the same strangeness whenever I write “Gauß” or “l'Hôpital” as I do when I write “Shakespeare”.


  • Math SE search for l'Hôpital produces 9,336 hits including many that omit the ‘s’ entirely, “l'Hopital”. A search for l'Hospital produces a surprisingly large 5,593 hits.

  • I also consulted the Chicago Manual of Style but found nothing helpful.

  • I once knew a graduate student named Chris Geib, who explained to me that his German ancestors had probably been named “Geiß” (“goat”) but that the ẞ was misinterpreted at some point.

[Other articles in category /lang] permanent link

Sun, 22 Oct 2023

Cats in Romance languages

We used to have a cat named Chase. To be respectful we would sometimes refer to him as “Mr. Cat”. And sometimes I amused myself by calling him “Señor Gato”.

Yesterday I got to wondering: Where did Spanish get “gato”, which certainly sounds like “cat”, when the Latin is fēles or fēlis (like in “feline”)? And similarly French has chat.

Well, the real question is, where did Latin get fēles? Because Latin also has cattus, which I think sounds like a joke. You're in Latin class, and you're asked to translate cat, but you haven't done your homework, so what do you say? “Uhhhh… ‘cattus’?”

But cattus is postclassical Latin, replacing the original word fēles no more than about 1500 years ago. The word seems to have wandered all over Europe and Western Asia and maybe North Africa, borrowed from one language into another, and its history is thoroughly mixed up. Nobody is sure where it came from originally, beyond “something Germanic”. The OED description of cat runs to 600 words and shrugs its shoulders.

I learned recently that such words (like brinjal, the eggplant) are called Wanderworts, wandering words.

[Other articles in category /lang/etym] permanent link

Sat, 21 Oct 2023

Portuguese food words in Asia

The other day I was looking into vindaloo curry and was surprised to learn that the word “vindaloo” is originally Portuguese vin d'alho, a wine and garlic sauce. Amazing.

In Japanese, squashes are called kabocha. (In English this refers to a specific type of squash associated with Japan, but in Japanese it's more generic.) Kabocha is from Portuguese again. The Portuguese introduced squashes to Japan via Cambodia, which in Portuguese is Camboja.

[Other articles in category /lang/etym] permanent link

Mon, 31 Jul 2023

Can you identify this language?

Rummaging around in the Internet Archive recently, I found a book in a language I couldn't recognize. Can you identify it? Here's a sample page:

The page is hard to read, but as far as I can tell, it begins: “plac'het iaouank a ioa ouz ho gortoz, ho chleuzeuriou var elum, a ieas gantho d ai zal a eured; ha goudeze e oue serret an or Ar plac'het iaouank all a erruas ive d'ar fin, ha setu hi da c'hervel ar goaz nevez en eur lavaret; …”

I regret that IA's scan is so poor.

Answer: Breton.


[ Addendum 20230731: Bernhard Schmalhofer informs me that HathiTrust has a more legible scan. ]

[Other articles in category /lang] permanent link

Wed, 31 May 2023

Why does this phrase sound so threatening?

Screenshot of tweet from Ari Cohn (@AriCohn) saying “If you are the lawyer for the Village of Melrose Park, this phrasing is really not what you want to see at the opening of the opinion.” Below that is Cohn's screenshot of the opening words of a 2022 opinion of U.S. District Judge Steven C. Seeger: “The Village of Melrose Park decided that it would be a good idea”.

I took it the same way:

The Village of Melrose Park decided that it would be a good idea

is a menacing way to begin, foreboding bad times ahead for the Village.

But what about this phrasing communicates that so unmistakably? I can't put my finger on it. Is it “decided that”? If so, why? What would have been a less threatening way to say the same thing? Does “good idea” contribute to the sense of impending doom? Why or why not?

(The rest of the case is interesting, but to avoid distractions I will post about it separately. The full opinion is here.)

[Other articles in category /lang] permanent link

Fri, 26 May 2023

Hieroglyphic monkeys holding stuff

I recently had occasion to mention this Unicode codepoint with the undistinguished name EGYPTIAN HIEROGLYPHIC SIGN E058A:

In a slightly more interesting world it would have been called STANDING MONKEY HOLDING SEVERED HEAD.

Unicode includes a group of eight similar hieroglyphic signs of monkeys holding stuff. Screenshots are from Unicode proposal N1944, Encoding Egyptian Hieroglyphs in Plane 1 of the UCS. The monkeys are on page 27. The names are my own proposals.


That monkey looks altogether too pleased with itself for my liking.


I have no idea what the triangle thingy is supposed to be. A thorn? A bread cone maybe? The object on the monkey's head is the crown of northern Egypt.


What if you want to type the character for a standing monkey holding the left eye of Ra? I suppose you have to compose several codepoints?


Is it a ball? An orb? A bowl? A dolerite pounder?


I have no idea what the flower thingy is supposed to represent. Budge's dictionary classifies it with the “trees, plants, flowers, etc.” but assigns it only a phonetic value. (Budge, E. Wallis; An Egyptian Hieroglyphic Dictionary (London 1920), v.1, p. cxxiii)


The monkey is holding, but not wearing, the crown of southern Egypt.


This last one is amazing.

I think the hook by the monkey's foot is a sign with no meaning other than the ‘s’ sound.

The object in the monkey's left hand is quite common in hieroglyphic writing but I do not know what it is. Budge (p.cxxxiii) says it is a “sacred object worshipped in the Delta” and that it is pronounced “tcheṭ” or “ṭeṭ”, but I have not been able to find what it is called at present. Hmmm…

Aha! It is called djed:

It is a pillar-like symbol in Egyptian hieroglyphs representing stability. It is associated with the creator god Ptah and Osiris, the Egyptian god of the afterlife, the underworld, and the dead. It is commonly understood to represent his spine.

Thanks to Wikipedia's list of hieroglyphs.

Addendum: This morning I feel a little foolish because I found tcheṭ in the “list of hieroglyphic characters” section of Budge's dictionary, but when I didn't know what it was, it didn't occur to me to actually look it up in the dictionary.

Screencap of the entry from Budge's dictionary, defining tcheṭ. The glyph is a sort of pillar or column with a fluted middle and a sort of vertebral thing on top. The definition reads: “an amulet that was supposed to endue the wearer with the permanence and stability of the backbone of Osiris”. Then there is another hieroglyph that incorporates tcheṭ as a component, glossed as “the backbone of Osiris, the sacrum bone”.

[Other articles in category /lang] permanent link

Fri, 05 May 2023

Water, polo, and water polo in Russian

I recently learned January First-of-May's favorite Russian anagrams:

  • австралопитек (/avstralopitek/, “Australopithecus”)
  • ватерполистка (/vaterpolistka/, “female water-polo player”)
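The anagram is easy to check mechanically. Here is a minimal Python sketch; the function name `is_anagram` is my own, and it simply compares letter counts.

```python
from collections import Counter


def is_anagram(a, b):
    """True if a and b use exactly the same letters the same number of times."""
    return Counter(a.lower()) == Counter(b.lower())


print(is_anagram("австралопитек", "ватерполистка"))
```

Both words have thirteen letters, with two each of ‘а’ and ‘т’, so the two counts match.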

Looking into this further, I learned that there appear to be two words in Russian for water polo. Ватерполо (/vaterpolo/) is obviously an English loanword. But Во́дное по́ло (/vódnoye pólo/) is native Russian; вода́ /vodá/ is water.

(Incidentally во́дка /vódka/ is the diminutive of вода́, it's the smaller and more adorable version of water. I feel like this one etymology encapsulates a great deal of the Russian national character.)

I am not sure where Russian got the word по́ло for polo. The English word is borrowed from Tibetan པོ་ལོ /polo/, meaning “ball”. Russian might have gotten it directly from Tibetan, or (more likely) via English. But here's a twist: The Tibetan word is itself a borrowing of the English word “ball”!

[Other articles in category /lang/etym] permanent link

Mon, 20 Mar 2023

Compass directions in Catalan

Looking over a plan of the Sagrada Família Sunday, I discovered that the names of the cardinal directions are interesting.

  • Nord (north). Okay, this is straightforward. It's borrowed from French, which for some reason seems to have borrowed from English.

  • Llevant (east). This one is fun. As in Spanish, llevar is “to rise”, from Latin levāre which also gives us “levity” and “levitate”. Llevant is the east, where the sun rises.

    This is also the source of the English name “Levant” for the lands to the east, in the Eastern Mediterranean. I enjoy the way this is analogous to the use of the word “Orient” for the lands even farther to the east: Latin orior is “to rise” or “to get up”. To orient a map is to turn it so that the correct (east) side is at the top, and to orient yourself is (originally) to figure out which way is east.

  • Migdia (south). The sun again. Migdia is analogous to “midday”. (Mig is “mid” and dia is “day”.) And indeed, the south is where the sun is at midday.

  • Ponent (west). This is ultimately from Latin ponens, which means putting down or setting down. It's where the sun sets.

Bonus unrelated trivia: The Russian word for ‘north’ is се́вер (/séver/), which refers to the cold north wind, and is akin to the English word “shower”.

[ Addendum 20231203: Compass directions in Czech ]

[Other articles in category /lang/etym] permanent link

Thu, 16 Feb 2023

Dog breeds in Korean

A couple of days ago I mentioned a Korean sign about “petiquette”. Part of the sign lists breeds that must be kept muzzled:

Sign with Korean text and pictures of five dogs wearing muzzles.

Here are the Korean texts and their approximate pronunciations. See if you can figure out what the five breeds are:

1. 도사견 (to-sa-gyŏn)
2. 아메리칸 핏불테리어 (a-me-ri-kan pit-bul-te-ri-ŏ)
3. 아메리칸 스태퍼드셔 테리어 (a-me-ri-kan seu-tae-pŏ-deu-syŏ te-ri-ŏ)
4. 스태퍼드셔 불 테리어 (seu-tae-pŏ-deu-syŏ bul te-ri-ŏ)
5. 로트와일러 (ro-teu-wa-il-lŏ)

(Answers below.)

Dog #1 is 도사견 (to-sa-gyŏn), in English called the Tosa. I had not heard of that before but if I had there would have been nothing to guess.

Dog #5, 로트와일러 (ro-teu-wa-il-lŏ), I figured out quickly; it's a Rottweiler. Korean Wikipedia spells it differently: 로트바일러 (ro-teu-ba-il-lŏ).

[ Addendum 20230904: I just realized the likely cause of the difference: Korean Wikipedia is using the German pronunciation. ]

The other three are all terriers of some sort. (테리어 (te-ri-ŏ) was clearly “terrier”). It didn't take long to understand that #2 was “American Pit Bull Terrier”, which I guessed was the official name for a Pit Bull. That isn't quite right but it is close and I did understand the name correctly.

Similarly #3 was an American ¿something? terrier and #4 was a ¿something? bull terrier. But what was ¿something?? I could not recognize 스태퍼드셔 (seu-tae-pŏ-deu-syŏ) as anything I knew and it was clearly not Korean.

Once I got home, I asked the Goog “What kinds of terriers are there?” and the answer to the puzzle was instantly revealed.

“Seu-tae-pŏ-deu-syŏ” is the Hangeul rendering of the very un-Korean word “Staffordshire”.

[Other articles in category /lang] permanent link

Wed, 15 Feb 2023

Multilingual transliteration corruption

The Greek alphabet has letters beta (Ββ) and delta (Δδ). In classical times these were analogous to Roman letters B and D, but over the centuries the pronunciation changed. Beta is now pronounced like an English ‘v’. For example, the Greek word for “alphabet”, αλφάβητο, is pronounced /alfavito/.

Modern Greek delta is pronounced like English voiced ‘th’, as in ‘this’ or ‘father’. The Greek word for “diameter” διάμετρος is pronounced /thiametros/.

Okay, but sometimes Greeks do have to deal with words that have hard /b/ and /d/ sounds, in loanwords if nowhere else. How do Greeks write that? They indicate it explicitly: For a /b/ they write the compound μπ ('mp'), and for a /d/ they write ντ ('nt'). So for example the word for the number fifty is spelled πενήντα, 'peninta', and pronounced 'penida' — the ‘-nt’ cluster is pronounced like English ‘d’. And the word for beer, borrowed from Italian birra, is spelled μπύρα, ‘mpyra’, and pronounced as in Italian, ‘birra’.

There is a Greek professional basketball player named Giannis Antetokounmpo. The first time I saw this I was a little bit boggled, particularly by that -nmpo cluster at the end. But then I realized what had happened.

Antetokounmpo's family is from Nigeria and their name is of Yoruba origin. In English, the name would be written as Adetokunbo and easily pronounced as written. But in Greek the ‘d’ and ‘b’ must be written as ‘nt’ and ‘mp’ so that, when pronounced as written in Greek, it sounds correct. This means that the correct, pronounce-as-written spelling in Greek is Γιάννης Αντετοκούνμπο.

The Yoruba-to-Greek transliteration was carried out perfectly. The problem here is that the Greek-to-English transliteration was chosen to preserve the spelling rather than the pronunciation, so that Αντετοκούνμπο turned into ‘Antetokounmpo’ instead of ‘Adetokunbo’.
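The two competing choices can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `by_spelling` and `by_sound` are hypothetical, the letter table covers just the letters in the two example words, and the digraph rules are the simplified ones described above (μπ as ‘b’, ντ as ‘d’), plus ου as ‘u’.

```python
import unicodedata

# Minimal letter table: only the letters needed for the two example words.
LETTERS = {"α": "a", "ε": "e", "η": "i", "κ": "k", "μ": "m", "ν": "n",
           "ο": "o", "π": "p", "τ": "t", "υ": "u"}
# Letter pairs treated here as a single sound (simplified assumption).
DIGRAPHS = {"μπ": "b", "ντ": "d", "ου": "u"}


def strip_accents(word):
    """Lowercase the word and drop accent marks such as the tonos."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def by_spelling(word):
    """Romanize letter by letter, preserving the Greek spelling."""
    return "".join(LETTERS[c] for c in strip_accents(word))


def by_sound(word):
    """Romanize digraphs first, preserving the pronunciation."""
    plain = strip_accents(word)
    out, i = [], 0
    while i < len(plain):
        pair = plain[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(LETTERS[plain[i]])
            i += 1
    return "".join(out)


for greek in ["πενήντα", "Αντετοκούνμπο"]:
    print(greek, by_spelling(greek), by_sound(greek))
```

The letter-by-letter version yields ‘peninta’ and ‘antetokounmpo’; the digraph-aware version yields ‘penida’ and ‘adetokunbo’.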

[Other articles in category /lang] permanent link

Tue, 14 Feb 2023

English loanwords in Korean

(Before I start, a note about the romanization of Korean words, which is simple and systematic but can be misleading in appearance.

  • The Korean vowel ㅓ is conventionally romanized as ‘eo’. This is so misleading that I have chosen instead to render it as ‘ŏ’ as was common in the 20th century. ㅓ is pronounced partway between "uh" and "aw".

  • ‘ae’ (ㅐ) is similar to the vowel in ‘air’.

  • ‘eu’ (ㅡ) does not sound like anything in English. The closest one can come is the vowel in ‘foot’, but ㅡ is farther back in the throat. Or say “boot” but without rounding your lips. It serves something of the default role of the English schwa vowel, and is often very reduced.

Are you seated comfortably? Then let's begin.)
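For the curious, the romanization used here can be approximated mechanically, because Unicode arranges the Hangul syllable blocks arithmetically: each codepoint encodes an initial consonant, a vowel, and an optional final consonant. The following Python is a rough sketch, not a complete romanizer. The function name `romanize` is made up, the final-consonant table is simplified, and it ignores the sound changes that happen between adjacent syllables, so its output will not always match hand romanization.

```python
# Tables follow the Unicode ordering of initial consonants, vowels, and finals.
LEADS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s", "ss", "",
         "j", "jj", "ch", "k", "t", "p", "h"]
# Vowels, with 'ŏ' in place of the conventional 'eo'.
VOWELS = ["a", "ae", "ya", "yae", "ŏ", "e", "yŏ", "ye", "o", "wa", "wae",
          "oe", "yo", "u", "wŏ", "we", "wi", "yu", "eu", "ui", "i"]
# Final consonants, simplified to their usual syllable-final sounds.
TAILS = ["", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l",
         "l", "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k", "t",
         "p", "t"]

FIRST_SYLLABLE = 0xAC00   # codepoint of the first Hangul syllable block
SYLLABLE_COUNT = 19 * 21 * 28


def romanize(text):
    """Romanize Hangul syllable by syllable (hypothetical helper)."""
    syllables = []
    for ch in text:
        index = ord(ch) - FIRST_SYLLABLE
        if not 0 <= index < SYLLABLE_COUNT:
            continue  # skip spaces and anything that is not a Hangul syllable
        lead, rest = divmod(index, 21 * 28)
        vowel, tail = divmod(rest, 28)
        syllables.append(LEADS[lead] + VOWELS[vowel] + TAILS[tail])
    return "-".join(syllables)


for word in ["에어로빅", "빌딩", "타워", "헤어"]:
    print(word, romanize(word))
```

For the four sample words this gives e-ŏ-ro-bik, bil-ding, ta-wŏ, and he-ŏ.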

A great deal of Korean vocabulary has been borrowed from English. For example here's a sign advertising aerobics.

I did not include alt texts for the photographs in this article because they are all photos of signs and the whole point of the article is to describe in detail what the signs say. If this was the wrong choice, please drop me an email so I can fix it.

It says “에어로빅” (e-ŏ-ro-bik). This is not surprising. You wouldn't expect there to have been an ancient traditional Korean word for aerobics.

But something that struck me when I was in Korea last year was how often signs would use borrowed English words even when there was already a perfectly good word in Korean. Here's a very typical example:

This is the Samsong Building. There is a Korean word for ‘building’ (Wiktionary says “건물” (gŏn-mul)) but that word isn't used here. Instead, the sign says “삼송빌딩”, pronounced ‘sam-song bil-ding’.

This use of “빌딩” (bil-ding) is extremely common. You can see it under the aerobics sign (연희빌딩, yŏn-hui bil-ding), and here's another one:

The green metal plate has Chinese words 起韓 (something like “arise Korea”) and then “빌딩” (bil-ding). This is the Arise Korea Building.

Also common in this context is “타워” (ta-wŏ, ‘tower’). (Remember that ‘ŏ’ (ㅓ) is pronounced similar to the vowels in ‘bought’ or ‘butt’, so ‘ta-wŏ’ represents something more like ‘ta-wuh’.) Here's a bit I clipped out of a Google Street View that translates “Trade Tower” as “트레이드타워” (teu-re-i-deu ta-wŏ)

Apparently this giant building does not have a Korean name. I tried to think of an analogous American example, and all I could come up with was this little grocery store in Philadelphia's Chinatown:

The Chinese name was 中美食品公司 (zhōng měi shípǐn gōngsī): “Chinese-American food company”, or maybe “Chin-Am food company” if you want to get cute. But the English name on the sign calls it the Chung May food market, transliterating 中美 rather than translating it.

[ Addendum 20230225: I found a much better example. ]

Okay, back to Korea. This banner from a small park is mainly in Korean, but its title is “펫티켓 가이드”: pet-ti-ket ga-i-deu, “pettiquette guide”:

(Item 2 is a list of dog breeds that must be muzzled. Item 4 is a list of the fines you will pay if you are insufficiently petiquettulous.)

I found this next one remarkable because, not only does it use the English word for “hair”, but it does so even though the pronunciation of “hair” is so alien to Korean phonology:

“헤어” (he-ŏ, “hair”). I think if I had been making this sign I might have rendered it as “핼” (haer) but what do I know? I suspect that “매직” in the red text farther down says “magic” but I don't recognize the four syllables before it. [ Addendum 20230517: “크리닉” /keu-ri-nik/ is probably “clinic”. ]

This purple sign for the Teepee Gym (짐티피, jim-ti-pi) advertises in big letters “헬스” (hel-seu, ‘health’). Korean doesn't have anything like English /-th/.

I'm lucky there was a helpful picture of a teepee on the sign or I would not have figured out “티피”.

The smaller sign, under the gyros, has a mixture of Korean and English. The first line says peu-ri-mi-ŏm, ‘premium’. The second says ho-tel-sik-hel-seu, which I think is ‘hotel식 health’, where ‘식’ is a suffix that means ‘-like’ or ‘-type’.

I can't make out the third line, even though it is evidently English. Hebrew words are recognizable as such in Latin script, just from their orthography: too many v's and z's, way too many consonant clusters like ‘tz’ and ‘zv’ that never happen in English. Recognizing English words in Hangeul is similarly easy: they have too many ㅌ's, ㅋ's, and ㅍ's, and too many ㅔ's. English is full of diphthongs like long A and I that Korean doesn't have and has to simulate with ㅐ이 and ㅏ이. Many borrowed words end in ㅡ because the English ended in a hard consonant, but in Korean that sounds weird so they add a vowel at the end.

That third line 다이어트 has all the signs, it's as clearly English as “shavuot” is Hebrew, but I can't quite make it out. It is pronounced ‘da-i-ŏ-teu’…

Oh, I get it now. It's ‘diet’!

Sometimes these things can be hard to figure out, and then they hit you in a flash and are obvious. Lorrie once told me about a sign that mystified her, “크로켓” (keu-ro-ket) and eventually she realized it was advertising croquettes.

I don't know what the fourth line is and I can't even tell if it's English or Korean. The 체 looks like it is going to be English but then it seems to change its mind. It is pronounced something like ‘che-hyŏng-gyu-jŏng’, so I guess probably Korean.

The last line is cut off in the picture but definitely starts with “바디” (ba-di, ‘body’) and probably some English word after that, judging by the next syllable ‘프’.

Let's see, what else do I have for you? I believe this is a dance or exercise studio named “Power Dance” (pa-wŏ-daen-seu).

I took this picture because the third floor is so mysterious:

What on earth is “PRIME IELTS”? A typo? No, apparently not; the Korean says peu-ra-im (‘prime’) a-i-el-jeu (wtf). It does at least reveal that the I in ‘IELTS’ is pronounced like in ‘Iowa’, not like in ‘Inez’.

Aha, the Goog tells me it is an acronym for “International English Language Testing System”. (Pause while I tick an item off a list… now there are only 14,228,093,174,028,595 things I don't know yet!)

By the way, the fifth-floor business has spelled out the French loanword “atelier” as “아뜰리에” (a-ddeul-li-e). I don't know what “VU” is (maybe the sign is for an optician?) but Korean has nothing like ‘V’ so in Korean it becomes “뷰” (byu).

(One of the members of BTS goes by the moniker “V”, which does not translate well into Korean at all; it has to be pronounced more like ‘bwi’.)

This next one is fun because the whole sentence is in English. The text at the top of the sign reads

peu-rang-seu peu-ri-mi-ŏm kŏ-pi NO. 1 beu-raen-deu

Can you figure this out? You might remember peu-ri-mi-ŏm from the Teepee Gym sign.

It says:

France Premium Coffee NO. 1 Brand

I leave you with this incredible example. In the annals of Korean signs using English words where there is already a Korean word for the same thing, this sign is really outstanding:

The Korean name for Korea is “한국” (han-guk).

But on this sign, “Korea” is rendered as “코리아” (ko-ri-a).

[ Thanks to SengMing Tan for identifying the character 韓. ]

[ Addendum: Prodded by Jesse Chen, I thought of a much better example of this happening in the U.S., so much better than the little Chung May food market. So good. But you will have to wait to hear about it until later this week. ]

[ Addendum 20230225: Here's the example. ]

[Other articles in category /lang] permanent link

Mon, 13 Feb 2023

English signs in Korea

I saw this sign in Korea last year:

A yellow banner with Korean text in
large red Korean script on top and smaller text below.  If you look
just at the bottom half of the most prominent Korean word, it looks
like it says TOOL.  But this is coincidence, because the Korean script
just happens to contain shapes that are the same as the letters of TOOL.

As you can imagine, I completely misread this. It appears to say TOOL. But of course it does not say that, because it is in Korean. The appearance of TOOL is an illusion. None of those letters is Latin script. The thing that looks like a ‘T’ is actually a vowel ㅜ, coincidentally pronounced like the ‘oo’ in TOOL. The things that look like ‘O’s are consonants ㅇ, pronounced like the ‘ng’ in ‘ring’. The thing that looks like an ‘L’ is letter ㄴ, pronounced like an ‘n’.

The mystery word here, 수행정진, is actually pronounced /soohaeng jeongjin/. I'm not sure what this means; I think it might be something about vigorous devotion (정진, 精進) to asceticism (수행, 修行), since I took the picture on the grounds of Bongeunsa, a Buddhist temple. I think the words 특별법회 in the blue oval are something about a special Buddhist ceremony to be held on 30 December.

I must have thought about such misleading oddities when I was first learning Korean, but I've never seen one in the wild before.

[Other articles in category /lang] permanent link

Wed, 08 Feb 2023

Misinterpretation of ‘my’

A couple of years back I complained about this stupid interaction I had once had:

I was once harangued by someone for using the phrase "my girlfriend." "She is not 'your' girlfriend," said this knucklehead. "She does not belong to you."

Sometimes you can't think of the right thing to say at the right time, but this time I did think of the right thing. "My father," I said. "My brother. My husband. My doctor. My boss. My congressman."

"Oh yeah."

I was thinking about this today (not for any reason, it doesn't keep happening, fortunately) and I thought of a new variation. You wait for your opportunity, and before long it will go like this:

Knucklehead: (blah blah blah) … I'll check when I get back to my house.

You: You own a house? In this market? Wow, where'd you get the money?

Knucklehead (now annoyed by your quibbling): I rent a house. It belongs to my landlord.

You: You own a landlord?

[Other articles in category /lang] permanent link

Sun, 18 Dec 2022

Den goede of den kwade?

Recently I encountered the Dutch phrase den goede of den kwade, which means something like “the good [things] or the bad [ones]”, similar to the English phrase “for better or for worse”.

Goede is obviously akin to “good”, but what is kwade? It turns out it is the plural of kwaad, which does mean “bad”. But are there any English cognates? I couldn't think of any, which is surprising, because Dutch words usually have one. (English is closely related to Frisian, which is still spoken in the northern Netherlands.)

I rummaged the dictionary and learned that kwaad is akin to “cud”, the yucky stuff that cows regurgitate. And “cud” is also akin to “quid”, which is a chunk of chewing tobacco that people chew on like a cow's cud. (It is not related to the other quids.)

I was not expecting any of that.

[ Addendum: this article, which I wrote at 3:00 in the morning, is filled with many errors, including some that I would not have made if it had been daytime. Please disbelieve what you have read, and await a correction. ]

[ Addendum 20221229: Although I wrote that addendum the same day, I forgot to publish it. I am now so annoyed that I can't bring myself to write the corrections. I will do it next year. Thanks to all the very patient Dutch people who wrote to correct my many errors. ]

[Other articles in category /lang/etym] permanent link

Minor etymological victory

A few days ago I was thinking about Rosneft (Росне́фть), the Russian national oil company. The “Ros” is obviously short for Rossiya, the Russian word for Russia, but what is neft?

“Hmm,” I wondered. “Maybe it is akin to naphtha?”

Yes! Ultimately both words are from Persian naft, which is the Old Persian word for petroleum. The Greeks borrowed it as νάφθα (naphtha), and the Russians borrowed it via Turkish. Petroleum is neft in many other languages, not just the ones you would expect like Azeri, Dari, and Turkmen, but also Finnish, French, Hebrew, and Japanese.

Sometimes I guess this stuff and it's just wrong, but it's fun when I get it right. I love puzzles!

[ Addendum 20230208: Tod McQuillin informs me that the Japanese word for petroleum is not related to naphtha; he says it is 石油 /sekiyu/ (literally "rock oil") or オイル /oiru/. The word I was thinking of was ナフサ /nafusa/ which M. McQuillin says means naphtha, not petroleum. (M. McQuillin also supposed that the word is borrowed from English, which I agree seems likely.)

I think my source for the original claim was this list of translations on Wiktionary. It is labeled as a list of words meaning “naturally occurring liquid petroleum”, and includes ナフサ and also entries purporting to be Finnish, French, and Hebrew. I did not verify any of the claims in Wiktionary, which could be many varieties of incorrect. ]

[Other articles in category /lang/etym] permanent link

Sun, 30 Oct 2022


A while back, discussing Vladimir Putin (not putain) I said

In English we don't seem to be so quivery. Plenty of people are named “Hoare”. If someone makes a joke about the homophone, people will just conclude that they're a boor.

Today I remembered Frances Trollope and her son Anthony Trollope. Where does the name come from? Surely it's not occupational?

Happily no, just another coincidence. According to Wikipedia it is a toponym, referring to a place called Troughburn in Northumberland, which was originally known as Trolhop, “troll valley”. Sir Andrew Trollope is known to have had the name as long ago as 1461.

According to the Times of London, Joanna Trollope, a 6th-generation descendant of Frances, once recalled

a night out with a “very prim and proper” friend who had the surname Hoare. The friend was dismayed by the amusement she caused in the taxi office when she phoned to book a car for Hoare and Trollope.

I guess the common name "Hooker" is occupational, perhaps originally referring to a fisherman.

[ Frances Trollope previously on this blog: [1] [2] ]

[ Addendum: Wiktionary says that Hooker is occupational, a person who makes hooks. I find it surprising that this would be a separate occupation. And what kind of hooks? I will try to look into this later. ]

[Other articles in category /lang] permanent link

Thu, 20 Oct 2022

A linguistic oddity

Last week I was in the kitchen and Katara tried to tell Toph a secret she didn't want me to hear. I said this was bad opsec, told them that if they wanted to exchange secrets they should do it away from me, and without premeditating it, I uttered the following:

You shouldn't talk about things you shouldn't talk about while I'm in the room while I'm in the room.

I suppose this is tautological. But it's not any sillier than Tarski's observation that "snow is white" is true exactly if snow is white, and Tarski is famous.

I've been trying to think of more examples that really work. The best I've been able to come up with is:

You shouldn't eat things you shouldn't eat because they might make you sick, because they might make you sick.

I'm trying to decide if the nesting can be repeated. Is this grammatical?

You shouldn't talk about things you shouldn't talk about things you shouldn't talk about while I'm in the room while I'm in the room while I'm in the room.

I think it isn't. But if it is, what does it mean?

[ Previously, sort of. ]

[Other articles in category /lang] permanent link

Thu, 13 Oct 2022


Today I realized I'm annoyed by the word "stethoscope". "Scope" is Greek for "look at". The telescope is for looking at far things (τῆλε). The microscope is for looking at small things (μικρός). The endoscope is for looking inside things (ἔνδον). The periscope is for looking around things (περί). The stethoscope is for looking at chests (στῆθος).

Excuse me? The hell it is! Have you ever tried looking through a stethoscope? You can't see for shit.

It should obviously have been called the stethophone.

(It turns out that “stethophone” was adopted as the name for a later elaboration of the stethoscope, shown at right, that can listen to two parts of the chest at the same time, and deliver the sounds to different ears.)

Stethophone illustration is in the public domain, via Wikipedia.

[Other articles in category /lang/etym] permanent link

Sat, 28 May 2022

“Llaves” and other vanishing consonants

Lately I asked:

Where did the ‘c’ go in llave (“key”)? It's from Latin clāvis

Several readers wrote in with additional examples, and I spent a little while scouring Wiktionary for more. I don't claim that this list is at all complete; I got bored partway through the Wiktionary search results.

Spanish    English                                   Latin antecedent
llagar     to wound                                  plāgāre
llama      flame                                     flamma
llamar     to summon, to call                        clāmāre
llano      flat, level                               plānus
llantén    plantain                                  plantāgō
llave      key                                       clāvis
llegar     to arrive, to get, to be sufficient       plicāre
lleno      full                                      plēnus
llevar     to take                                   levāre
llorar     to cry out, to weep                       plōrāre
llover     to rain                                   pluere

I had asked:

Is this the only Latin word that changed ‘cl’ → ‘ll’ as it turned into Spanish, or is there a whole family of them?

and the answer is no, not exactly. It appears that llave and llamar are the only two common examples. But there are many examples of the more general phenomenon that

(consonant) + ‘l’ → ‘ll’

including quite a few examples where the consonant is a ‘p’.

Spanish-related notes

  • Eric Roode directed me to this discussion of “Latin CL to Spanish LL” on the language forums. It also contains discussion of analogous transformations in Italian. For example, instead of plānus → llano, Italian has plānus → piano.

  • Alex Corcoles advises me that Fundéu often discusses this sort of issue on the Fundéu web site, and also responds to this sort of question on their Twitter account. Fundéu is the Foundation of Emerging Spanish, a collaboration with the Royal Spanish Academy that controls the official Spanish language standard.

  • Several readers pointed out that although llave is the key that opens your door, the word for musical keys and for encryption keys is still clave. There is also a musical instrument called the claves, and an associated technical term for the rhythmic role they play. Clavícula (‘clavicle’) has also kept its ‘c’.

  • The connection between plicāre and llegar is not at all clear to me. Plicāre means “to fold”; English cognates include ‘complicated’, ‘complex’, ‘duplicate’, ‘two-ply’, and, farther back, ‘plait’. What this has to do with llegar (‘to arrive’) I do not understand. Wiktionary has a long explanation that I did not find convincing.

  • The levāre → llevar example is a little weird. Wiktionary says "The shift of an initial 'l' to 'll' is not normal".

  • Llaves also appears to be the Spanish name for the curly brace characters { and }. (The square brackets are corchetes.)

Not related to Spanish

  • The llover example is a favorite of the Universe of Discourse, because Latin pluere is the source of the English word plover.

  • French parler (‘to talk’) and its English descendants ‘parley’ and ‘parlor’ are from Latin parabola.

  • Latin plōrāre (‘to cry out’) is obviously the source of English ‘implore’ and ‘deplore’. But less obviously, it is the source of ‘explore’. The original meaning of ‘explore’ was to walk around a hunting ground, yelling to flush out the hidden game.

  • English ‘autoclave’ is also derived from clavis, but I do not know why.

  • Wiktionary's advanced search has options to order results by “relevance” and last-edited date, but not alphabetically!


  • Thanks to readers Michael Lugo, Matt Hellige, Leonardo Herrera, Leah Neukirchen, Eric Roode, Brent Yorgey, and Alex Corcoles for hints, clues, and references.

[ Addendum: Andrew Rodland informs me that an autoclave is so-called because the steam pressure inside it forces the door lock closed, so that you can't scald yourself when you open it. ]

[ Addendum 20230319: llevar, to rise, is akin to the English place name Levant which refers to the region around Syria, Israel, Lebanon, and Palestine: the “East”. (The Catalán word llevant simply means “east”.) The connection here is that the east is where the sun (and everything else in the sky) rises. We can see the same connection in the way the word “orient”, which also means an eastern region, is from Latin orior, “to rise”. ]

[Other articles in category /lang/etym] permanent link

Thu, 26 May 2022

Quick Spanish etymology question

Where did the ‘c’ go in llave (“key”)? It's from Latin clāvis, like in “clavicle”, “clavichord”, “clavier” and “clef”.

Is this the only Latin word that changed ‘cl’ → ‘ll’ as it turned into Spanish, or is there a whole family of them?

[ Addendum 20220528: There are more examples. ]

[Other articles in category /lang/etym] permanent link

Sat, 14 May 2022

Cathedrals of various sorts

A while back I wrote a shitpost about octahedral cathedrals and in reply Daniel Wagner sent me this shitpost of a cat-hedron:

computer graphics drawing of a roughly cat-shaped polyhedron with a
glowing blue crucifix stuck on its head.

But that got me thinking: the ‘hedr-’ in “octahedron” (and other -hedrons) is actually the Greek word ἕδρα (/hédra/) for “seat”, and an octahedron is a solid with eight “seats”. The ἕδρα (/hédra/) is akin to Latin sedēs (like in “sedentary”, or “sedate”) by the same process that turned Greek ἡμι- (/hémi/, like in “hemisphere”) into Latin semi- (like in “semicircle”) and Greek ἕξ (/héx/, like in “hexagon”) into Latin sex (like in “sextet”).

So a cat-hedron should be a seat for cats. Such seats do of course exist:

combination “cat tree” and scratching post sits on the floor of a
living room in front of the sofa.  The object is about two feet high
and has a carpeted platform atop a column wrapped in sisal rope.
Hanging from the platform is a cat toy, and on
the platform resides a black and white domestic housecat.  A second
cat investigates the carpeted base of the cat tree.

But I couldn't stop there because the ‘hedr-’ in “cathedral” is the same word as the one in “octahedron”. A “cathedral” is literally a bishop's throne, and cathedral churches are named metonymically for the literal throne they contain or the metaphorical one they represent. A cathedral is where a bishop has his “seat” of power.

So a true cathedral should look like this:

same picture as before, but the cat has been digitally erased from the
platform, and replaced with a gorgeously uniformed cardinal of the
Catholic Church, wearing white and gold robes and miter.

[Other articles in category /lang/etym] permanent link

Sat, 26 Mar 2022

U.S. surnames with no vowels

While writing the recent article about Devika Icecreamwala (born Patel) I acquired the list of most common U.S. surnames. (“Patel” is 95th most common; there are about 230,000 of them.) Once I had the data I ran various queries on it, and one of the things I looked for was names with no vowels. Here are the results:

name    rank      count
NG      1125      31210
VLK     68547     287
SMRZ    91981     200
SRP     104156    172
SRB     129825    131
KRC     149395    110
SMRT    160975    100
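A query of this kind could be sketched as follows. This is only an illustration, not necessarily the query that produced the table above. The row format, the function names, and the decision to count Y as a vowel are all assumptions. The vowelless rows are copied from the table; the PATEL row uses the rank given earlier, with its exact count left out.

```python
# Hypothetical sample of (name, rank, count) rows. A real run would read
# the full surname list instead of this hand-typed sample.
ROWS = [
    ("PATEL", 95, None),      # rank from the text; exact count not given
    ("NG", 1125, 31210),
    ("VLK", 68547, 287),
    ("SMRZ", 91981, 200),
    ("SRP", 104156, 172),
    ("SRB", 129825, 131),
    ("KRC", 149395, 110),
    ("SMRT", 160975, 100),
]


def has_no_vowels(name, vowels="AEIOUY"):
    """True if no letter of the name is in `vowels` (Y included by assumption)."""
    return not any(ch in vowels for ch in name.upper())


def vowelless(rows, vowels="AEIOUY"):
    """Return the rows whose name has no vowels, most common first."""
    hits = [row for row in rows if has_no_vowels(row[0], vowels)]
    return sorted(hits, key=lambda row: row[1])


for name, rank, count in vowelless(ROWS):
    print(name, rank, count)
```

Whether Y counts as a vowel changes the answer: with `vowels="AEIOU"` a name like LYNCH would also qualify, which is probably not what one wants here.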

It is no surprise that Ng is by far the most common. It's an English transcription of the Cantonese pronunciation of 吳, which is one of the most common names in the world. 吳 belongs to at least twenty-seven million people. Its Mandarin pronunciation is Wu, which itself is twice as common in the U.S. as Ng.

I suspect the others are all Czech. Vlk definitely is; it's Czech for “wolf”. (Check out the footer of the Vlk page for eighty other common names that all mean “wolf”, including Farkas, López, Lovato, Lowell, Ochoa, Phelan, and Vuković.)

Similarly Smrz is common enough that Wikipedia has a page about it. In Czech it was originally Smrž, and Wikipedia mentions Jakub Smrž, a Czech motorcycle racer. In the U.S. the confusing háček is dropped from the z and one is left with just Smrz.

The next two are Srp and Srb. Here it's a little harder to guess. Srb means a Serbian person in several Slavic languages, including Czech, and it's not hard to imagine that it is a Czech toponym for a family from Serbia. (Srb is also the Serbian word for a Serbian person, but an immigrant to the U.S. named Srb, coming from Czechia, might fill out the immigration form with “Srb” and might end up with their name spelled that way, whereas a Serbian with that name would write the unintelligible Срб and would probably end up with something more like Serb.) There's also a town in Croatia with the name Srb and the surname could mean someone from that town.

I'm not sure whether Srp is similar. The Serbian-language word for the Serbian language itself is Srpski (српски), but srp is also Slavic for “sickle” and appears in quite a few Slavic agricultural-related names such as Sierpiński. (It's also the name for the harvest month of August.)

Next is Krc. I guessed maybe this was Czech for “church” but it seems that that is kostel. There is a town south of Prague named Krč and maybe Krc is the háčekless American spelling of the name of a person whose ancestors came from there.

Last is Smrt. Wikipedia has an article about Thomas J. Smrt but it doesn't say whether his ancestry was Czech. I had a brief fantasy that maybe some of the many people named Smart came from Czech families originally named Smrt, but I didn't find any evidence that this ever happened; all the Smarts seem to be British. Oh well.

[ Bonus trivia: smrt is the Czech word for “death”, which we also meet in the name of James Bond's antagonist SMERSH. SMERSH was a real organization, its name a combination of смерть (/smiert/, “death”) and шпио́нам (/shpiónam/, “to spies”). Шпио́нам, incidentally, is borrowed from the French espion, and ultimately akin to English spy itself. ]

[ Addenda 20220327: Thanks to several readers who wrote to mention that Smrž is a morel and Krč is (or was) a stump or a block of wood, I suppose analogous to the common German name Stock. Petr Mánek corrected my spelling of háček and also directed me to a web site providing information about Czech surnames. Finally, although Smrt is not actually a shortened form of Smart, I leave you with this consolation prize. ]

[Other articles in category /lang/etym] permanent link

Fri, 25 Mar 2022

My horse Pongo

I tried playing Red Dead Redemption 2 last week. I was a bit disappointed because I was hoping for Old West Skyrim but it's actually Old West GTA. I'm not sure how long I will continue.

Anyway, I acquired a new horse and was prompted to name it. My first try, “Pongo”, was rejected by the profanity filter. Puzzled, I supposed I had mistyped and included a ZWNJ or something. No, it was rejecting “Pongo”.

The only meaning I know for “Pongo” is that it is the name of the daddy dog in 101 Dalmatians. So I asked the Goog. The Goog shrugged and told me that was the only Pongo it knew also.

Steeling myself, I asked Urban Dictionary, preparing to learn that Pongo was obscene, racist, or probably both. Urban Dictionary told me that “Pongo” is 1900-era Brit slang for a soldier. (Which I suppose explains its appearance as the name of the dog.) Nothing obscene or racist.

I'm stumped. I forget what I ended up naming the horse.

[ Addendum 20220521: Apparently I'm not the first person to be puzzled by this. ]

[Other articles in category /lang] permanent link

Tue, 22 Feb 2022

“Shall” and “will” strike back from beyond the grave

In former times and other dialects of English, there was a distinction between ‘shall’ and ‘will’. To explain the distinction correctly would require research, and I have a busy day today. Instead I will approximate it by saying that up to the middle of the 19th century, ‘shall’ referred to events that would happen in due course, whereas ‘will’ was for events brought about intentionally, by force of will. An English child of the 1830's, stamping its foot and shouting “I will have another cookie”, was expressing its firm intention to get the cookie against all opposition. The same child shouting “I shall have another cookie” was making a prediction about the future that might or might not have turned out to be correct.

In American English at least, this distinction is dead. In The American Language, H.L. Mencken wrote:

Today the distinction between will and shall has become so muddled in all save the most painstaking and artificial varieties of American that it may almost be said to have ceased to exist.

That was no later than 1937, and he had been observing the trend as early as the first edition (1919):

… the distinction between will and shall, preserved in correct English but already breaking down in the most correct American, has been lost entirely in the American common speech.

But yesterday, to my amazement, I found myself grappling with it! I had written:

The problem to solve here … [is] “how can OP deal with the inescapable fact that they can't and won't pass the exam”.

To me, the “won't” connoted a willful refusal on the part of OP, in the sense of “I won't do it!”, and not what I wanted to express, which was an inevitable outcome. I'm not sure whether anyone else would have read it the same way, but I was happier after I rewrote it:

The problem to solve here … [is] “how can OP deal with the inescapable fact that they cannot and will not pass the exam”.

I could also have gotten the meaning I wanted by replacing “can't and won't” with “can't and shan't” — except that “shan't” is dead, I never use it, and, had I thought of it, I would have made a rude and contemptuous nose noise.

Mencken says “the future in English is most commonly expressed by neither shall nor will, but by the much commoner contraction 'll”. In this case that wasn't true! I wonder if he missed the connotation of “won't” that I felt, or if the connotation arose after he wrote his book, or if it's just something idiosyncratic to me.

[Other articles in category /lang] permanent link

Thu, 03 Feb 2022

Mosaic church

Driving around today I passed by Mosaic Community Church. I first understood “mosaic” in the sense of colored tiles, but shortly after realized it is probably “Mosaic” (that is, pertaining to Moses) and not “mosaic”. But maybe not, perhaps it is an intentional double meaning, with “mosaic” meant to suggest a diverse congregation.

This got me thinking about words that completely change meaning when you capitalize them. The word “polish” came to mind.

I wondered if there were any other examples and realized there must be a great many boring ones of a certain type, which I confirmed when I got home: Pennsylvania has towns named Perry, Auburn, Potter, Bath, and so on. I think what makes “Polish” and “Mosaic” more interesting may be that their meanings are not proper nouns themselves but are derived adjectives.

[Other articles in category /lang] permanent link

Sun, 16 Jan 2022


Yesterday I related Wiktionary's explanation of why Vladimir Putin's name is transliterated in French as Poutine:

in French, “Putin” would be pronounced /py.tɛ̃/, exactly like putain, which means “whore”.

In English we don't seem to be so quivery. Plenty of people are named “Hoare”. If someone makes a joke about the homophone, people will just conclude that they're a boor. “Hoare” or “hoar” is an old word for a gray-white color, one of a family of common hair-color names along with “Brown”, “White”, and “Grey”.

There is a legend at Harvard University that its twelve residential houses are named for the first twelve presidents of Harvard: Dunster House, Eliot House, Mather House, and so on. Except, says the legend, they were unwilling to name a house after the fourth president, Leonard Hoar, and called it North House instead. The only part of this that is true is that most of the houses were named for presidents of Harvard.

(The common name “Green” is not a hair-color name. It refers to someone who lives by the green.)

[ Addendum 20221030: Yet more on this ridiculous topic. ]

[Other articles in category /lang] permanent link

Sat, 15 Jan 2022


In French Canada, poutine is a dish of fried potatoes with cheese curds and brown gravy. But today I learned that in French, Vladimir Putin's name is Vladimir Poutine.

As described above: a white plate with shoestring-shaped french fries, crumbles of white cheese curds, and brown gravy.
=
Putin in 2018 is 66 years old but looks much younger. He has a round face, a calm and thoughtful expression, and cold blue eyes. His hair, medium-brown and thinning, is gray only at the temples. He wears a dark suit jacket, white shirt, and a dark red silk tie in a striking pattern.
Poutine (left), Putin (right)

Wiktionary explains: in French, “Putin” would be pronounced /py.tɛ̃/, exactly like putain, which means “whore”. “Poutine” is silly, but at least comparatively inoffensive.

Mario Tremblay of Montréal gave in to temptation, and opened a poutine restaurant named “Vladimir Poutine”. There was a poutine dish on the menu named “Vladimir Poutine”. In a sort of nod to borscht, it was topped with beet confit. The restaurant has since closed.

[ Other people who are accidentally foods ]

Left-hand poutine photograph by Joe Shlabotnik from Forest Hills, Queens, USA, CC BY 2.0, via Wikimedia Commons. Right-hand photograph also via Wikimedia Commons.

[ Addendum 20220116: Further musings on names with bad homophones ]

[ Addendum 20220307: a Paris restaurant, La Maison de la Poutine, defends itself against insults and threats from people confused about the meaning of poutine in the restaurant's name. Further reporting from Business Insider. ]

[ Addendum 20221030: Yet more names with bad homophones ]

[Other articles in category /lang] permanent link

Tue, 26 Oct 2021

Oe wowe is me

[ Content warning: pointless. ]

A colleague of mine recently remarked:

lose rhymes with choose; loose rhymes with goose

The parallel construction suggests that the two cases are similar. They're not. The words lose and choose are unique exceptions, and loose and goose aren't. All the -oose words other than choose rhyme with loose goose moose, and all the -ose words other than lose (and sometimes close) rhyme with rose nose pose.

English spelling is full of awful quagmires, but I don't remember noticing this one before. “-ough” gets talked up a lot, it's overplayed. This Goose thing is at least as bad.

For example, consider these rhyming words:

choose coos
lose shoes twos cruise

What a mess.

Shoes and woes don't rhyme.

Shoes rhymes with lose, but lose does not rhyme with close.

Close does rhyme with woes, and it also rhymes with gross, but they don't rhyme with each other. I suppose it is excusable that gross doesn't rhyme with woes, but it also doesn't rhyme with boss. And gross rhymes with dose for some reason. You'd think dose would rhyme with hose but if you want it to do that you have to spell it doze. Which at least makes sense: dose, unvoiced, doze, voiced. There are seven ways to write that -oze sound in doze, and the only words that I can find that actually spell it -oze are doze and froze:

nos owes sews

The most common ending here is -ows and looking at a word that has it you can't tell if it's crows shows slows or brows cows vows. Sometimes it's both, like with sows sows bows bows rows rows.

Oh, and does, which you see there in column 3 with toes and goes, but which is also an extremely common word that is usually pronounced “duz”, completely unlike any other word spelled that way. When it obviously should have been pronounced the way dues is.

Going the other way we have goose and loose which seems okay at first but turns into its own little quagmire:

spruce use Zeus

At least you can't get these words mixed up with other words spelled the same way. Except for use and use. The consonant is voiced when it's a verb, unvoiced when it's a noun. Because reasons.

Well, for completeness I suppose I should do use:

views use

Brits might want me to put news here and maybe some of its siblings.

Ugh, this could go on forever. Half the -ouse words rhyme with mouse and the other half rhyme with rouse. And the most important one rhymes with both: house.

Hmm, there's that verb-voiced, noun-unvoiced thing again. I should look into that. The close thing is similar: voiced is a verb, unvoiced is an adjective.

My respect for people who learn English as a second language was already high, but it has just gone up several notches.

[Other articles in category /lang] permanent link

Sun, 10 Oct 2021

More words change meanings

“Salient” seems to have lost its original meaning, and people mostly use it as if it were synonymous with “relevant” or “pertinent”. This is unfortunate. It's from Latin salīre, which is to jump, and it originally meant something that jumps out at you. In a document, the salient point isn't necessarily the one that is most important, most crucial, or most worth consideration; it's the one that jumps out.

It is useful to have a word specifically for something that jumps out, but people no longer understand it that way.

Cognates of salīre include “assail” and “assault”, “salmon” (the jumping fish), and the mysterious “somersault”.

[Other articles in category /lang] permanent link

Words change meanings

This Imgur gallery has a long text post, about a kid who saw the movie Labyrinth in London and met David Bowie after. The salient part was:

He seemed surprised I would want to know, and he told me the whole thing, all out of order, and I eked the details out of him.

This is a use of “eke” that I haven't seen before. Originally “eke” meant an increase, or a small addition, and it was also used in the sense of “also”. For example, from the prologue to the Wife of Bath's tale:

I hadde the bettre leyser for to pleye, And for to se, and eek for to be seye

(“I had more opportunity to play, and to see, and also to be seen.”)

Or also, “a nickname” started out as “an ekename”, an also-name.

From this we get the phrase “to eke out a living”, which means that you don't have quite enough resources, but by some sort of side hustle you are able to increase them to enough to live on.

But it seems to me that from there the meaning changed a little, so that while “eke out a living” continued to mean to increase one's income to make up a full living, it also began to connote increasing one's income bit by bit, in many small increments. This is the sense in which it appears to be used in the original quotation:

He seemed surprised I would want to know, and he told me the whole thing, all out of order, and I eked the details out of him.


Searching for something in a corpus of Middle English can be very frustrating. I searched and searched the University of Michigan Corpus of Middle English Prose and Verse looking for the Chaucer quotation, and couldn't find it, because it has “to se” and “to be seye”, but I searched for “to see” and “to seye”; it has “eek” and I had been searching for “eke”. Ouch.

In the Chaucer, “leyser” is “leisure”, but a nearly-dead sense that we now see only in “complete the task at your leisure”.

[Other articles in category /lang] permanent link

Fri, 08 Oct 2021

Diminishing resources in the Korean Language

Hangul, the Korean alphabet, was originally introduced in the year 1443. At that time it had 28 letters, four of which have since fallen out of use. If the trend continues, the Korean alphabet will be completely used up by the year 7889, preceded by an awful period in which all the words will look like

앙 앙앙앙 앙앙 앙 앙앙앙앙 앙

and eventually


[Other articles in category /lang] permanent link

Fri, 09 Jul 2021

“Forensic” doesn't mean what I thought it did

Last week at work we released bad code, which had somehow survived multiple reviews. I was very interested in finding out how this happened, dug into the Git history to find out, and wrote a report. Originally I titled the report something like “Forensic analysis of Git history” (and one of my co-workers independently referred to the investigation as forensic) but then I realized I wasn't sure what “forensic” meant. I looked it up, and learned it was the wrong word.

A forensic analysis is one performed in the service of a court or court case. “Forensic” itself is from Latin forum, which is a public assembly place where markets were held and court cases were heard.

Forensic medicine is medicine in service of a court case, for example to determine a cause of death. For this reason it often refers to a postmortem examination, and I thought that “forensic” meant a postmortem or other retrospective analysis. That was the sense in which I intended it. But no. I had written a postmortem analysis, but not a forensic one.

[Other articles in category /lang/etym] permanent link

Wed, 07 Jul 2021

Examples of dummy pronouns

Katara is interested in linguistics. When school was over for the year and she had time to think about things, I gave her all my old linguistics books. The other day for some reason I mentioned to her that I had known people who were engaged in formal research on the problem of how to get a computer to know what a pronoun referred to, and that this is very difficult.

(I once had a co-worker who claimed that it was simple: the pronoun always refers back to the nearest noun. It wasn't hard to go back in his Slack history and find a counterexample he had uttered a few minutes before.)

Today I wanted to tell Katara about dummy pronouns, which refer to nothing at all. I intended to send her the example from Wiktionary:

it is good to know that you are okay

I started my message:

Here's an interesting example of how hard it can be to find what a pronoun refers to

Then I realized I no longer needed the example.

[Other articles in category /lang] permanent link

Mon, 05 Jul 2021

Duckface in German

In English, this is called duckface:

Ariana Grande looking over her
shoulder with her lips abnormally everted

In German, I've learned, it's Schlauchbootlippen.

Schlauch is “tube”. A Schlauchboot is a tube-boat, an inflatable rubber dinghy. Schlauchbootlippen means dinghy-lips.

[Other articles in category /lang/etym] permanent link

Sun, 27 Jun 2021

Dogs that look like board games

In Korean, “바둑이” (/badugi/) is a common name for a spotted dog, especially a black-spotted dog. This is because “바둑” (/baduk/) is the native Korean name for the game of go, in which round black and white stones are placed on a board.

In English, black-and-white spotted dogs are sometimes named “Checkers” for essentially the same reason.

[Other articles in category /lang/etym] permanent link

Mon, 19 Apr 2021

Odd translation choices

Recently I've been complaining about unforced translation errors. ([1] [2]) Here's one I saw today:

picture of two cows in a field.  One has a child-sized toy plastic car
on its head.  The cow with the car on its head is saying: “БИП-БИП ВАШ

The translation was given as:

“honk honk, your Uber has arrived”

“Oleg, what the fuck”

Now, the Russian text clearly says “beep-beep” (“бип-бип”), not “honk honk”. I could understand translating this as “honk honk” if “beep beep” were not a standard car sound in English. But English-speaking cars do say “beep beep”, so why change the original?

(Also, a much smaller point: I have no objection to translating “Что за херня” as “what the fuck”. But why translate “Что за херня, Олег?” as “Oleg, what the fuck” instead of “What the fuck, Oleg”?)

[ Addendum 20210420: Katara suggested that perhaps the original translator was simply unaware that Anglophone cars also “beep beep”. ]

[Other articles in category /lang] permanent link

Wed, 14 Apr 2021

More soup-guzzling

A couple of days ago I discussed the epithet “soup-guzzling pie-muncher”, which in the original Medieval Italian was brodaiuolo manicator di torte. I had complained that where most translations rendered the delightful word brodaiuolo as something like “soup-guzzler” or “broth-swiller”, Richard Aldington used the much less vivid “glutton”.

A form of the word brodaiuolo appears in one other place in the Decameron, in the sixth story on the first day, also told by Emilia, who as you remember has nothing good to say about the clergy:

… lo 'nquisitore sentendo trafiggere la lor brodaiuola ipocrisia tutto si turbò…

J. M. Rigg (1903), who had elsewhere translated brodaiuolo as “broth-guzzling”, this time went with “gluttony”:

…the inquisitor, feeling that their gluttony and hypocrisy had received a home-thrust…

G. H. McWilliam (1972) does at least imply the broth:

…the inquisitor himself, on hearing their guzzling hypocrisy exposed…

John Payne (1886):

the latter, feeling the hit at the broth-swilling hypocrisy of himself and his brethren…

Cormac Ó Cuilleanáin's revision of Payne (2004):

…the inquisitor himself, feeling that the broth-swilling hypocrisy of himself and his brethren had been punctured…

And what about Aldington (1930), who dropped the ball the other time and rendered brodaiuolo merely as “glutton”? Here he says:

… he felt it was a stab at their thick-soup hypocrisy…

Oh, Richard.

I think you should have tried harder.

[Other articles in category /lang] permanent link

Fri, 12 Mar 2021

Trans-Siberian Railway

For no particular reason, I looked up the Trans-Siberian Railway today and learned that its name in Russian is

Транссибирская магистраль

pronounced roughly “trans-siberskaya magistral”. The Транссибирская is clear, but what is магистраль?

Wiktionary says it means "main line" or "trunkline". But it doesn't give an etymology. Still, it's not hard to guess: it's akin to the French (and also English) word “magistral” which means something that relates to a master.

So it's the Trans-Siberian master train line. But “train line” is implicit, the way English-speaking recording engineers use “master” to refer to a master tape, or Americans will call a trunk road an “arterial”. English loves to turn adjectives into nouns in that way, but I didn't know that Russian did it also.

[Other articles in category /lang] permanent link

Thu, 11 Mar 2021

Debate between Bird and Fish

I recently read Finkel and Taylor's excellent little book Cuneiform. On page 27 they discuss the kinds of texts that young boys studied in school:

Alongside ‘citizenship training’ through hymns, myths and law codes, schoolboys learnt how to debate. They trained on texts arguing the benefits to mankind of antagonistic pairs: winter and summer, sheep and grain or bird and fish.

“Hey,” I said. “I've read that!” I love when this happens, something pops up that I would have wanted to know a little more about, but it's already something I do know a little more about. I feel like I'm getting somewhere in my project of reading every book ever written. Progress!

From The Debate Between Bird and Fish, Sumerian, around 4000 years ago:

“You cause damage in the vegetable plots; you are a nuisance. In the damp parts of fields, there are your unpleasing footprints. Bird, you are shameless: you fill the courtyard with your droppings.”

Bird retorts:

You are bereft of hips!

It's not so much a debate as a diss battle.

[ Addendum 20210312: Now I would like to see a cartoon version of the debate, animated by Chuck Jones. ]

[Other articles in category /lang] permanent link

Mon, 08 Mar 2021

Canon in European languages and Arabic

Today I was reading about Avicenna's work The Canon of Medicine and learned that the original Arabic title

القانون في الطب

is rendered in Latin script as al-Qānūn fī al-Ṭibb with al-Qānūn (“the law”) being translated into English as “Canon” (“rule” or “law”). The English word comes via French and Latin, ultimately from Greek κανών, “rule”.

Is the resemblance between Qānūn and κανών a coincidence, or is the Arabic word originally borrowed from Greek?

I was about to write the next sentence “and where could I have looked this up?” but then I remembered that this kind of thing can be looked up in English Wiktionary. English Wiktionary is not a dictionary of English, but a universal dictionary in English. It not only defines English words, but also words in many other languages, with the descriptions and etymologies written in English.

So I looked it up, and it is a Greek loanword!

The Internet is amazing and wonderful. Truly, we live in an age of marvels.

[Other articles in category /lang/etym] permanent link

Tue, 02 Mar 2021


Often when I'm reading something that was translated from another language, I get to wondering what the original was. Often this appears in connection with some sort of wordplay. For example, the first chapter of Stanisław Lem's novel The Cyberiad begins:

One day Trurl the constructor put together a machine that could create anything starting with n. When it was ready, he tried it out, ordering it to make needles, then nankeens and negligees, which it did, then nail the lot to narghiles filled with nepenthe and numerous other narcotics. The machine carried out his instructions to the letter. Still not completely sure of its ability, he had it produce, one after the other, nimbuses, noodles, nuclei, neutrons, naphtha, noses, nymphs, naiads, and natrium. This last it could not do, and Trurl, considerably irritated, demanded an explanation.

"Never heard of it," said the machine.

"What? But it's only sodium. You know, the metal, the element..."

"Sodium starts with an s, and I work only in n."

In the end Trurl asks the machine to make “nothing”, which is an important plot point.

Okay, but The Cyberiad was written in Polish. I wondered for years: was it “N” in Polish also? If the Polish word for “nothing” happened to begin with a “W”, then the Polish text would have had to have had a machine that could create anything starting with “W”. Then the translator couldn't keep the “W” the way it was, because the whole point of the story leads up to “nothing”; they have to rewrite the whole thing with “N”.

One day I met the translator, Michael Kandel, and was able to ask. And yes, it was originally “N”; the Polish word for “nothing” is nic.

(Here's a related question on SF Stack Exchange. It discusses how the original “N” items turn into their somewhat-similar “N” counterparts in English.)

But anyway, I meant to talk about Pippi Longstocking, which was originally written in Swedish.

Pippi and the Ibex

In one episode, Pippi goes to school, where the teacher tries to teach her the alphabet. She shows her a card with a letter ‘i’ and a picture of an ibex. Pippi says:

“I think it looks exactly like a straight line with a little fly speck over it. But what I'd really like to know is, what has the ibex to do with the fly speck?”

Clearly Pippi is describing a lowercase letter ‘i’. “Ibex” is a pretty strange choice of animal, in English or in Swedish, so I wondered: was the picture an ibex in the original Swedish? It turns out it was not! “Ibex” in Swedish is stenbock. In the original Swedish, the picture is an igelkott, a hedgehog.

Well, in the translation I had as a kid, by Florence Lamborn, it was an ibex. But a different English translation (by Tiina Nunnally) makes it an iguana, and another that I found, by Edna Hurup, contains the following elaborate invention:

[The teacher] therefore brought out a picture of a pretty little green island surrounded by blue water.

My philosophy of translation is opposed to this sort of thing. I will take all sorts of liberties, and I might make up an island if I have to, but having done so I would not describe it in detail as Ms. Hurup did so shamelessly. In the original the hedgehog is not described:

Därför tog hon fram en liten vacker plansch föreställande en igelkott.

(“Therefore, she took out a small, beautiful poster depicting a hedgehog.”)


Today I was thinking about Pippi, and I recalled that one of her goals in attending school was to learn “pluttification”:

“Hey, everybody,” hollered Pippi, swinging her big hat. “Am I in time for pluttification?”

In English “pluttification” is obviously Pippi's misunderstanding of “multiplication”:

“All kinds of things,” said the officer. “Lots of useful things, like the multiplication tables, for instance.”

“I've been fine for nine years without any pluttification tables,” said Pippi…

What was pluttification in Swedish?

It turns out, it wasn't any different. The Swedish for “multiplication tables” is multiplikationstabellen.

”Hejsvejs”, hojtade Pippi och svängde sin stora hatt. ”Kommer jag lagom till pluttifikationen?”

Pippi's Name

Long ago I wondered about Pippi's full name, which in the Lamborn version I read was:

Pippilotta Delicatessa Windowshade Mackrelmint Ephraim's Daughter Longstocking

The original Swedish was:

Pippilotta Viktualia Rullgardina Krusmynta Efraimsdotter Långstrump

and the English was a fairly close translation. Viktualier is “victuals”, and I think turning it into “Delicatessa” is clever. (Viktualia is actually a real Swedish name, although quite rare.) Rullgardina is exactly “windowshade”. (Literally “roll-curtain”.) Krusmynta is a nonsense compound of krus (see below) and mynta (mint). I thought that krus was “mackerel” but I can't find anyone to agree with me; everyone says that the Swedish for “mackerel” is makrill, as in most European languages.

The Nunnally translation has:

Pippilotta Comestibles Windowshade Curlymint Ephraimsdaughter Longstocking

“Comestibles” is terrible, but “Curlymint” is just fine, because krusig is indeed “curly”.

The Hurup translation says:

Pippilotta Provisionia Gaberdina Dandeliona Ephraimsdaughter Longstocking

I don't like “Provisionia”, but it can be defended as a more literal translation than “Delicatessa”. I can't imagine why Hurup decided to replace “Windowshade Curlymint” with “Gaberdina Dandeliona”.

English Wikipedia has a whole section about this if you are not tired of it yet.


I recall that in the version I read, Captain Ephraim was "formerly the Terror of the Seas, and now a cannibal king", and that the original Swedish version of “cannibal king” was negerkung, “king of the negroes”. Mathilda Haraldsson's undergraduate thesis describes this as a “quite strong expression”, but adds that in the 1940s neger was considered inoffensive. (Recall that in the United States at the time, “negro” was the polite term.) It does appear that some people today consider negerkung offensive. And in any case it was never accurate; the people in question are not Africans, but Polynesians. In the Swedish version I looked at just now, the word has been changed to söderhavskung, “King of the South Seas”.

To me the most offensive part of all this is Lamborn's description of Ephraim's subjects as “cannibals”. As far as I can tell, the original Swedish says nothing about cannibalism, and this is a disgusting and completely unnecessary invention. Nunnally makes it just “king of the natives” but Hurup inexplicably retains “Cannibal King”.

Norwegian Wikipedia has an article about Lindgren's use of negerkung, but Swedish Wikipedia does not!

[ Addendum: I just noticed that my discussion of the cannibal thing omits the word “racist”. This was an oversight. The cannibal thing is racist. ]

[ Addendum 20210303: Justin Pearson, Anders Nielsen, and Adam Sjøgren have each informed me that krusmynta is not a nonsense compound as I said. It is a standard term for spearmint. Also, is not Swedish Wikipedia. The language code for Swedish is sv. Wikipedia SE is Northern Sami Wikipedia. ]

[ Addendum 20230507: The official Astrid Lindgren web site says “Did you know? Pippi's full name is: Pippilotta Victoriaria Tea-cosy Appleminta Ephraim’s-daughter Longstocking”. I have nothing good to say about any of this. ]

[ Addendum 20230509: Wikipedia informs me that, in English, spearmint is sometimes called “mackerel mint”, so it is clear now that Lamborn's translation of krusmynta is quite literal. I applaud her choice of the alliterative and rhythmic “mackerelmint” instead of the mundane “spearmint”. ]

[Other articles in category /lang] permanent link

Mon, 15 Feb 2021

Mystery twitter language

Today someone tweeted about an earlier blog article of mine, saying

10° bir kvadratda ən böyük şəhərləri görə biləcəyiniz bir xəritə olan bir sayt.

I looked at that and frowned, and said “What language is that? … is it Azerbaijani?” And it is Azerbaijani! Last time I encountered Azerbaijani I did not recognize it. So I not only learned something last April, I remembered it the following February when it came up again. Yay me!

[Other articles in category /lang] permanent link

Mon, 08 Feb 2021

Down in the dumps

I was reading The Life and Prankes of Long Meg of Westminster (1655), which opens with the story of how Long Meg first came to London with a posse of three or four girlfriends. After long travel they came within sight of London, “which joyed their hearts greatly.” But as they got closer, Meg's friends became less cheerful, and she said to them:

What Lasses in a dumpe, and we so nigh London?

If someone had asked me to guess when “in a dump” or “in the dumps” had been coined, I think I would have guessed sometime in the early 20th century. Nope! The Big Dictionary has cites back to 1535, which is when Long Meg takes place. It also cites a 1785 dictionary for “down in the dumps” specifically. The phrase is not connected with the dump where you dump a load of trash, which is of much later coinage.

It transpires that the lasses are in a dumpe because they realize that the time has come to pay the carrier who has helped transport them to London, and believe he is likely to try to cheat them and take everything they have. Meg says she will reason sweetly with the carrier, and if that doesn't work, she will beat the crap out of him.

The carrier does try to take everything they have, but becomes much more helpful after Meg has beaten him with a cudgel.

Here it is if you would like to read it yourself.

[Other articles in category /lang] permanent link

Sun, 03 Jan 2021

Snow White in German

Tonight I was thinking of

Mirror, mirror, on the wall
Who is the fairest of them all?

I remembered that the original was in German and wondered whether it had always rhymed. It turns out that it had:

Spieglein, Spieglein an der Wand,
Wer ist die Schönste im ganzen Land?

The English is a pretty literal translation.

When the wunderbare Spiegel gives the Queen the bad news, it says:

Frau Königin, Ihr seid die Schönste hier,
Aber Schneewittchen ist tausendmal schöner als Ihr.

(“Queen, you are the fairest one here, but Little Snow White is a thousand times as fair as you.”)

When the dwarfs see Snow White in one of their beds, they cry

Ei, du mein Gott!

which is German for “zOMG”.

Later the Queen returns to the mirror, expecting a better answer, but she gets this:

Frau Königin, Ihr seid die Schönste hier,
Aber Schneewittchen über den Bergen
Bei den sieben Zwergen
Ist noch tausendmal schöner als Ihr.

(“Queen, you are the fairest here, but Little Snow White up on the mountain with the seven dwarfs is still a thousand times as fair as you.”)

I like the way this poem here interpolates the earlier version, turning the A-A rhyme into A-B-B-A. The English version I have has “in the glen / little men” in place of “über den Bergen / sieben Zwergen”. The original is much better, but I am not sure English has any good rhymes for “dwarfs”. Except “wharfs”, but putting the dwarfs by the wharfs is much worse than putting them in the glen.

[ Thanks to Gaal Yahas for correcting my translation of noch and to Mario Lang for correcting my German grammar. ]

[ Addendum 20200115: Was the mirror magical? ]

[Other articles in category /lang] permanent link

Sat, 26 Dec 2020


Screenshot of a tweet.
It says “Keys for me: kibbe, cheese pie, spinach pie, stuff grape
leaves (no meat), olives, cheeses, soujuk (spicy lamb sausage),
basterma (err, spicy beef prosciutto), hummous, baba g., taramasalata,
immam bayadi”

This tweet from Raffi Melkonian describes the appetizer plate at his house on Christmas. One item jumped out at me:

basterma (err, spicy beef prosciutto)

I wondered what that was like, and then I realized I do have some idea, because I recognized the word. Basterma is not an originally Armenian word, it's a Turkish loanword, I think canonically spelled pastırma. And from Turkish it made a long journey through Romanian and Yiddish to arrive in English as… pastrami.

For which “spicy beef prosciutto” isn't a bad description at all.

[Other articles in category /lang/etym] permanent link

Sun, 13 Sep 2020

Weasel words in headlines

The front page of today has this headline:

Screenshot of part
of web page.  The main headline is “‘So Skeptical’: As Election Nears, Iowa Senator Under Pressure For
COVID-19 Remarks”.  There is a longer subheadline underneath, which I
discuss below.

It contains this annoying phrase:

The race for Joni Ernst's seat could help determine control of the Senate.

Someone has really committed to hedging.

I would have said that the race would certainly help determine control of the Senate, or that it could determine control of the Senate. The statement as written makes an extremely weak claim.

The article itself doesn't include this phrase. This is why reporters hate headline-writers.


[Other articles in category /lang] permanent link

Fri, 11 Sep 2020

Historical diffusion of words for “eggplant”

In reply to my recent article about the history of words for “eggplant”, a reader, Lydia, sent me this incredible map they had made that depicts the history and the diffusion of the terms:

A map of the world, with arrows depicting the sequential adoption
of different terms for eggplant, as the words mutated from language to
language.  For details, see the previous post.  The map is an
oval-shaped projection.  The ocean parts of the
map are a dark eggplant-purple color, and a eggplant stem has been
added at the eastern edge, in the Pacific Ocean.

Lydia kindly gave me permission to share their map with you. You can see the early Dravidian term vaḻutanaṅṅa in India, and then the arrows show it travelling westward across Persia and Arabia, from there to East Africa and Europe, and from there to the rest of the world, eventually making its way back to India as brinjal before setting out again on yet more voyages.

Thank you very much, Lydia! And Happy Diada Nacional de Catalunya, everyone!

[Other articles in category /lang/etym] permanent link

Fri, 28 Aug 2020

Zucchinis and Eggplants

This morning Katara asked me why we call these vegetables “zucchini” and “eggplant” but the British call them “courgette” and “aubergine”.

I have only partial answers, and the more I look, the more complicated they get.


The zucchini is a kind of squash, which means that in Europe it is a post-Columbian import from the Americas.

“Squash” itself is from Narragansett, and is not related to the verb “to squash”. So I speculate that what happened here was:

  • American colonists had some name for the zucchini, perhaps derived from a Narragansett or another Algonquian language, or perhaps just “green squash” or “little gourd” or something like that. A squash is not exactly a gourd, but it's not exactly not a gourd either, and the Europeans seem to have accepted it as a gourd (see below).

  • When the vegetable arrived in France, the French named it courgette, which means “little gourd”. (Courge = “gourd”.) Then the Brits borrowed “courgette” from the French.

  • Sometime much later, the Americans changed the name to “zucchini”, which also means “little gourd”, this time in Italian. (Zucca = “gourd”.)

The Big Dictionary has citations for “zucchini” only back to 1929, and “courgette” to 1931. What was this vegetable called before that? Why did the Americans start calling it “zucchini” instead of whatever they called it before, and why “zucchini” and not “courgette”? If it was brought in by Italian immigrants, one might expect the word to have appeared earlier; the mass immigration of Italians into the U.S. was over by 1920.

Following up on this thought, I found a mention of it in Cuniberti, J. Lovejoy., Herndon, J. B. (1918). Practical Italian recipes for American kitchens, p. 18: “Zucchini are a kind of small squash for sale in groceries and markets of the Italian neighborhoods of our large cities.” Note that Cuniberti explains what a zucchini is, rather than saying something like “the zucchini is sometimes known as a green summer squash” or whatever, which suggests that she thinks it will not already be familiar to the readers. It looks as though the story is: Colonial Europeans in North America stopped eating the zucchini at some point, and forgot about it, until it was re-introduced in the early 20th century by Italian immigrants.

When did the French start calling it courgette? When did the Italians start calling it zucchini? Is the Italian term a calque of the French, or vice versa? Or neither? And since courge (and gourd) are evidently descended from Latin cucurbita, where did the Italians get zucca?

So many mysteries.


Here I was able to get better answers. Unlike squash, the eggplant is native to Eurasia and has been cultivated in western Asia for thousands of years.

The puzzling name “eggplant” is because the fruit, in some varieties, is round, white, and egg-sized.

closeup of
an eggplant with several of its  round, white, egg-sized  fruits that
do indeed look just like eggs

The term “eggplant” was then adopted for other varieties of the same plant where the fruit is entirely un-egglike.

“Eggplant” in English goes back only to 1767. What was it called before that? Here the OED was more help. It gives this quotation, from 1785:

When this [sc. its fruit] is white, it has the name of Egg-Plant.

I inferred that the preceding text described it under a better-known name, so, thanks to the Wonders of the Internet, I looked up the original source:

Melongena or Mad Apple is also of this genus [solanum]; it is cultivated as a curiosity for the largeness and shape of its fruit; and when this is white, it has the name of Egg Plant; and indeed it then perfectly resembles a hen's egg in size, shape, and colour.

(Jean-Jacques Rousseau, Letters on the Elements of Botany, tr. Thos. Martyn 1785. Page 202. (Wikipedia))

The most common term I've found that was used before “egg-plant” itself is “mad apple”. The OED has cites from the late 1500s that also refer to it as a “rage apple”, which is a calque of French pomme de rage. I don't know how long it was called that in French. I also found “Malum Insanam” in the 1736 Lexicon technicum of John Harris, entry “Bacciferous Plants”.

Melongena was used as a scientific genus name around 1700 and later adopted by Linnaeus in 1753. I can't find any sign that it was used in English colloquial, non-scientific writing. Its etymology is a whirlwind trip across the globe. Here's what the OED says about it:

  • The neo-Latin scientific term is from medieval Latin melongena

  • Latin melongena is from medieval Greek μελιντζάνα (/melintzána/), a variant of Byzantine Greek ματιζάνιον (/matizánion/) probably inspired by the common Greek prefix μελανο- (/melano-/) “dark-colored”. (Akin to “melanin” for example.)

  • Greek ματιζάνιον is from Arabic bāḏinjān (بَاذِنْجَان). (The -ιον suffix is a diminutive.)

  • Arabic bāḏinjān is from Persian bādingān (بادنگان)

  • Persian bādingān is from Sanskrit and Pali vātiṅgaṇa (भण्टाकी)

  • Sanskrit vātiṅgaṇa is from Dravidian (for example, Malayalam is vaḻutana (വഴുതന); the OED says “compare… Tamil vaṟutuṇai”, which I could not verify.)


Okay, now how do we get to “aubergine”? The list above includes Arabic bāḏinjān, and this, like many Arabic words, was borrowed into Spanish, as berengena or alberingena. (The “al-” prefix is Arabic for “the” and is attached to many such borrowings, for example “alcohol” and “alcove”.)

From alberingena it's a short step to French aubergine. The OED entry for aubergine doesn't mention this. It claims that aubergine is from “Spanish alberchigo, alverchiga, ‘an apricocke’”. I think it's clear that the OED blew it here, and I think this must be the first time I've ever been confident enough to say that. Even the OED itself supports me on this: the note at the entry for brinjal says: “cognate with the Spanish alberengena is the French aubergine”. Okay then. (Brinjal, of course, is a contraction of berengena, via Portuguese bringella.)

Sanskrit vātiṅgaṇa is also the ultimate source of modern Hindi baingan, as in baingan bharta.

(Wasn't there a classical Latin word for eggplant? If so, what was it? Didn't the Romans eat eggplant? How do you conquer the world without any eggplants?)

[ Addendum: My search for antedatings of “zucchini” turned up some surprises. For example, I found what seemed to be many mentions in an 1896 history of Sicily. These turned out not to be about zucchini at all, but rather the computer's pathetic attempts at recognizing the word Σικελίαν. ]

[ Addendum 20200831: Another surprise: Google Books and Hathi Trust report that “zucchini” appears in the 1905 Collier Modern Eclectic Dictionary of the English Language, but it's an incredible OCR failure for the word “acclamation”. ]

[ Addendum 20200911: A reader, Lydia, sent me a beautiful map showing the evolution of the many words for ‘eggplant’. Check it out. ]

[ Addendum 20231021: The Japanese kabocha squash (カボチャ) is probably so-called because it was brought by the Portuguese from Camboja, Cambodia. ]

[ Addendum 20231127: A while back I looked into the question of whether the Romans had eggplants, and it seems that consensus was that they did not! Incredible. How much longer their empire would have lasted if they had been able to draw in the power of the eggplant? This probably goes some way to explaining why the Byzantine Empire lasted so much longer than the Western Empire. ]

[Other articles in category /lang/etym] permanent link

Tue, 09 Jun 2020

The two-bit huckster in medieval Italy

The eighth story on the seventh day of the Decameron concerns a Monna Sismonda, a young gentlewoman who is married to a merchant. She contrives to cheat on him, and then when her husband Arriguccio catches her, she manages to deflect the blame through a cunning series of lies. Arriguccio summons Sismonda's mother and brothers to witness her misbehavior, but when Sismonda seems to refute his claims, they heap abuse on him. Sismonda's mother rants about merchants with noble pretensions who marry above their station. My English translation (by G.H. McWilliam, 1972) included this striking phrase:

‘Have you heard how your poor sister is treated by this precious brother-in-law of yours? He’s a tuppenny-ha’penny pedlar, that's what he is!’

“Tuppenny-ha’penny” seemed rather odd in the context of medieval Florentines. It put me in mind of Douglas Hofstadter's complaint about an English translation of Crime and Punishment that rendered “S[toliarny] Pereulok” as “Carpenter’s Lane”:

So now we might imagine ourselves in London, … and in the midst of a situation invented by Dickens… . Is that what we want?

Intrigued by McWilliam's choice, I went to look at the other translation I had handy, John Payne's of 1886, as adapted by Cormac Ó Cuilleanáin in 2004:

‘Have you heard how your fine brother-in-law here, this two-bit huckster, is treating your sister?’

This seemed even more jarring, because Payne was English and Ó Cuilleanáin is Irish, but “two-bit” is 100% American. I wondered what the original had said.

Brown University has the Italian text online, so I didn't even have to go into the house to find out the answer:

‘Avete voi udito come il buono vostro cognato tratta la sirocchia vostra, mercatantuolo di quattro denari che egli è?’

In the coinage of the time, the denier or denarius was the penny, equal in value (at least notionally) to 1/240 of a pound (lira) of silver. It is the reason that pre-decimal British currency wrote fourpence as “4d.”. I think ‘-uolo’ is a diminutive suffix, so that Sismonda's mother is calling Arriguccio a fourpenny merchantling.

McWilliam’s and Ó Cuilleanáin’s translations are looking pretty good! I judged them too hastily.

While writing this up I was bothered by something else. I decided it was impossible that John Payne, in England in 1886, had ever written the words “two-bit huckster”. So I hunted up the original Payne translation from which Ó Cuilleanáin had adapted his version. I was only half right:

‘Have you heard how your fine brother-in-law here entreateth your sister? Four-farthing huckster that he is!’

“Four-farthing” is a quite literal translation of the original Italian, a farthing being an old-style English coin worth one-fourth of a penny. I was surprised to see “huckster”, which I would have guessed was 19th-century American slang. But my guess was completely wrong: “Huckster” is Middle English, going back at least to the 14th century.

In the Payne edition, there's a footnote attached to “four-farthing” that explains:

Or, in modern parlance, ‘twopenny-halfpenny.’

which is what McWilliam had. I don't know if the footnote is Payne's or belongs to the 1925 editor.

The Internet Archive's copy of the Payne translation was published in 1925, with naughty illustrations by Clara Tice. Wikipedia says “According to herself and the New York Times, in 1908 Tice was the first woman in Greenwich Village to bob her hair.”

[ Addendum 20210331: It took me until now to realize that -uolo is probably akin to the -ole suffix one finds in French words like casserole and profiterole, and derived from the Latin diminutive suffix -ulus that one finds in calculus and annulus. ]

[Other articles in category /lang] permanent link

Mon, 08 Jun 2020

More about Middle English and related issues

Quite a few people wrote me delightful letters about my recent article about how to read Middle English.


  • Paul Bolle pointed out that in my map, I had put the “Zeeland” label in Belgium. Here's the corrected map:

    A map of a small portion of Europe, with London at the west, a squiggly purple line proceeding eastward along the River Thames to the sea, stopping off in “Forland” on the eastern coast of Britain near Margate, and preparing to make a short run straight east across the North Sea to Middelburg in the Netherlands.

    I was so glad I had done the map in SVG! Moving the label was trivial.

  • I had said:

    The printing press was introduced in the late 15th century, and at that point, because most books were published in or around London, the Midlands dialect used there became the standard, and the other dialects started to disappear.

    But Derek Cotter pointed out the obvious fact that London is not in the Midlands; it is in the south. Whoooops. M. Cotter elaborates:

    You rightly say modern English comes largely from the Midlands dialect, but London isn't in the Midlands, as your map shows; it's in the South. And the South dialects were among the losers in the standardisation of English, as your Caxton story shows: we now say Northern "eggs", not Southern "eyren". William Tyndale from Gloucestershire, Shakespeare from Warwickshire, and Dr Johnson from Staffordshire were influential in the development of modern English, along with hundreds of aristocrats, thousands of prosperous middle class, and millions of migrating workers.

  • I had been puzzled about schuleth, saying:

    “Schuleth” goes with ‘ye’ so it ought to be ‘schulest’. I don't know what's up with that.

    Derek Cotter explained my mistake: the -st suffix is only for singular thou, but ye here is plural. For comparison, consider the analogous -t in “Thou shalt not kill”. I knew this, and felt a little silly that I did not remember it.

Regarding Old English / Anglo-Saxon

  • brian d foy pointed me to this video of a person trying to buy a cow from a Frisian farmer, by speaking in Old English. Friesland is up the coast from Zeeland, and approximately the original home of the Anglo-Saxon language. The attempt was successful! And the person is Eddie Izzard, who pops up in the oddest places.

Regarding Dutch

  • I had mentioned a couple of common Middle English words that are no longer in use, and M. Bolle informed me that several are current in Modern Dutch:

    • Middle English eke (“also”) is spelled ook and pronounced /oke/ in Dutch.

    • Wyf (“woman”) persists in Dutch as wijf, pronounced like Modern English “wife”. In Dutch this term is insulting, approximately “bitch”. (German cognates are weib (“woman”) and weibliche (“female”).)

    • Eyren (“eggs”). In Dutch this is eieren. (In German, one egg is ei and several is eier.) We aren't sure what the -en suffix is doing there but I speculated that it's the same plural suffix you still see only in “oxen”. (And, as Tony Finch pointed out to me, in “brethren” and “children”.) M. Bolle informs me that it is still common in Dutch.

Regarding German

  • My original article was about schuleþ, an old form of “shall, should”. Aristotle Pagaltzis informed me that in Modern German the word is spelled schulden, but the /d/ is very reduced, “merely hinted at in the transition between syllables”.

    One trick I didn't mention in the article was that if a Middle English word doesn't seem to make sense as English, try reading it as German instead and see if that works better. I didn't bring it up because it didn't seem as helpful as the other tricks, partly because it doesn't come up that often, and mainly because you actually have to know something. I didn't want to be saying “look how easy it is to read Middle English, you just have to know German”.

  • Tobias Boege and I had a long discussion about the intermutations of ‘ȝ’, ‘y’, ‘g’, and ‘gh’ in English and German. M. Boege tells me:

    I would just like to mention, although I suppose unrelated to the development in England, that in the Berlin/Brandenburg region close to where I live, the dialect often turns "g" into "y" sounds, for example "gestern" into "yestern".

    This somewhat spreads into Saxony-Anhalt, too. While first letter "g"s turn into "y"/"j", internal ones tend to become a soft "ch". The local pronunciation of my hometown Magdeburg is close to "Mach-tte-burch".

    and also brought to my attention this amusing remark about the pronunciation of ‘G’ in Magdeburg:

    Man sagt, die Magdeburger sprechen das G auf fünf verschiedene Arten, aber G ist nicht dabei!

    (“It is said, that the Magdeburgers pronounce the ‘G’ in five different ways, but none of them is /g/!”)

    The Wikipedia article provides more details, so check it out if you read German.

    It occurs to me now that the ‘G’ in Dutch is pronounced in many cases not at all as /g/, but as /ɣ/. We don't really have this sound in English, but if we did we might write it as ‘gh’, so it is yet another example of this intermutation. Dutch words with this ‘g’ include gouda and the first ‘G’ in Van Gogh.

  • Aristotle Pagaltzis pointed out that the singular / plural thou / ye distinction persists in Modern German. The German second person singular du is cognate with the Middle English singular thou, but the German plural is ihr.

Final note

The previous article about weirdos during the Depression hit #1 on Hacker News and was viewed 60,000 times. But I consider the Middle English article much more successful, because I very much prefer receiving interesting and thoughtful messages from six Gentle Readers to any amount of attention from Hacker News. Thanks to everyone who wrote, and also to everyone who read without writing.

[Other articles in category /lang] permanent link

Fri, 05 Jun 2020

You can learn to read Middle English

In a recent article I quoted this bit of Middle English:

Ȝelde ȝe to alle men ȝoure dettes: to hym þat ȝe schuleþ trybut, trybut.

and I said:

As often with Middle English, this is easier than it looks at first. In fact this one is so much easier than it looks that it might become my go-to example. The only strange word is schuleþ itself…

Yup! If you can read English, you can learn to read Middle English. It looks like a foreign language, but it's not. Not entirely foreign, anyway. There are tricks you can pick up. The tricks get you maybe 90 or 95% of the way there, at least for later texts, say after 1350 or so.

Disclaimer: I have never studied Middle English. This is just stuff I've picked up on my own. Any factual claims in this article might be 100% wrong. Nevertheless I have pretty good success reading Middle English, and this is how I do it.

Some quick historical notes

It helps to understand why Middle English is the way it is.

English started out as German. Old English, also called Anglo-Saxon, really is a foreign language, and requires serious study. I don't think an anglophone can learn to read it with mere tricks.

Over the centuries Old English diverged from German. In 1066 the Normans invaded England and the English language got a thick layer of French applied on top. Middle English is that mashup of English and French. It's still German underneath, but a lot of the spelling and vocabulary is Frenchified. This is good, because a lot of that Frenchification is still in Modern English, so it will be familiar.

For a long time each little bit of England had its own little dialect. The printing press was introduced in the late 15th century, and at that point, because most books were published in or around London, the Midlands dialect used there became the standard, and the other dialects started to disappear.

[ Addendum 20200606: The part about Midlands dialect is right. The part about London is wrong. London is not in the Midlands. ]

With the introduction of printing, the spelling, which had been fluid and do-as-you-please, became frozen. Unfortunately, during the 15th century, the Midlands dialect had been undergoing a change in pronunciation now called the Great Vowel Shift and many words froze with spelling and pronunciations that didn't match. This is why English vowel spellings are such a mess. For example, why are “meat” and “meet” spelled differently but pronounced the same? Why are “read” (present tense) and “read” (past tense) pronounced differently but spelled the same? In Old English, it made more sense. Modern English is a snapshot of the moment in the middle of a move when half your stuff is sitting in boxes on the sidewalk.

By the end of the 17th century things had settled down to the spelling mess that is Modern English.

The letters are a little funny

Depending on when it was written and by whom, you might see some of these obsolete letters:

  • Ȝ — This letter is called yogh. It's usually a ‘y’ sound, but if the word it's in doesn't make sense with a ‘y’ try pretending that it's a ‘g’ or ‘gh’ instead and see if the meaning becomes clearer. (It was originally more like a “gh-” sound. German words like gestern and garden change to yesterday and yard when they turn into English. This is also why we have words like ‘night’ that are still spelled with a ‘gh’ but are now pronounced with a ‘y’.)

  • þ — This is a thorn. It represents the sound we now write as th.

  • ð — This is an edh. This is usually also a th, but it might be a d. Originally þ and ð represented different sounds (“thin” and “this” respectively) but in Middle English they're kinda interchangeable. The uppercase version looks like Đ.

Some familiar letters behave a little differently:

  • u, v — Letters ‘u’ and ‘v’ are sometimes interchangeable. If there's a ‘u’ in a funny place, try reading it as a ‘v’ instead and see if it makes more sense. For example, what's the exotic-looking “haue”? When you know the trick, you see it's just the totally ordinary word “have”, wearing a funny hat.

  • w — When w is used as a vowel, Middle English just uses a ‘u’. For example, the word for “law” is often spelled “laue”.

  • y — Where Middle English uses ‘y’, we often use ‘i’. Also sometimes vice-versa.

The quotation I discussed in the earlier article looks like this:

Ȝelde ȝe to alle men ȝoure dettes: to hym þat ȝe schuleþ trybut, trybut.

Daunting, right? But it's not as bad as it looks. Let's get rid of the yoghs and thorns:

Yelde ye to alle men youre dettes: to hym that ye schuleth trybut, trybut.
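The letter substitution above is mechanical enough to sketch in a few lines of Python. This is only an illustration of the trick, not a real transliteration tool: the table and the function name `modernize_letters` are my own inventions, and it always picks the most common reading (‘y’ for yogh, ‘th’ for thorn and edh), so the cases where yogh really wants to be ‘g’ or ‘gh’ still need a human.

```python
# Hypothetical lookup table (an assumption, not from any library):
# each obsolete letter is mapped to its most usual modern spelling.
OBSOLETE_LETTERS = str.maketrans({
    "Ȝ": "Y", "ȝ": "y",    # yogh: usually a 'y' sound
    "Þ": "Th", "þ": "th",  # thorn: the sound we now write as 'th'
    "Ð": "Th", "ð": "th",  # edh: usually also 'th'
})


def modernize_letters(text: str) -> str:
    """Replace yogh, thorn, and edh with their usual modern equivalents."""
    return text.translate(OBSOLETE_LETTERS)


quotation = "Ȝelde ȝe to alle men ȝoure dettes: to hym þat ȝe schuleþ trybut, trybut."
print(modernize_letters(quotation))
```

Run on the quotation, this should print the same yogh-free, thorn-free line shown above. Everything after that (the vowels, “schuleth”) still takes reading aloud and squinting.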

The spelling is a little funny

Here's the big secret of reading Middle English: it sounds better than it looks. If you're not sure what a word is, try reading it aloud. For example, what's “alle men”? Oh, it's just “all men”, that was easy. What's “youre dettes”? It turns out it's “your debts”. That's not much of a disguise! It would be a stretch to call this “translation”.

Yelde ye to all men your debts: to him that ye schuleth trybut, trybut.

“Yelde” and “trybut” are a little trickier. As languages change, vowels nearly always change faster than consonants. Vowels in Middle English can be rather different from their modern counterparts; consonants less so. So if you can't figure out a word, try mashing on the vowels a little. For example, “much” is usually spelled “moche”.

With a little squinting you might be able to turn “trybut” into “tribute”, which is what it is. The first “tribute” is a noun, the second a verb. The construction is analogous to “if you have a drink, drink!”

I had to look up “yelde”, but after I had I felt a little silly, because it's “yield”.

Yield ye to all men your debts: to him that ye schuleth tribute, tribute.

We'll deal with “schuleth” a little later.

The word order is pretty much the same

That's because the basic grammar of English is still mostly the same as German. One thing English now does differently from German is that we no longer put the main verb at the end of the sentence. If a Middle English sentence has a verb hanging at the end, it's probably the main verb. Just interpret it as if you had heard it from Yoda.

The words are a little bit old-fashioned

… but many of them are old-fashioned in a way you might be familiar with. For example, you probably know what “ye” means: it's “you”, like in “hear ye, hear ye!” or “o ye of little faith!”.

Verbs in second person singular end in ‘-st’; in third person singular, ‘-th’. So for example:

  • I read
  • Thou readst
  • He readeth
  • I drink
  • Thou drinkst
  • She drinketh

In particular, the forms of “do” are: I do, thou dost, he doth.

Some words that were common in Middle English are just gone. You'll probably need to consult a dictionary at some point. The Oxford English Dictionary is great if you have a subscription. The University of Michigan has a dictionary of Middle English that you can use for free.

Here are a couple of common words that come to mind:

  • eke — “also”
  • wyf — “woman”

Verbs change form to indicate tense

In German (and the proto-language from which German descended), verb tense is indicated by a change in the vowel. Sometimes this persists in modern English. For example, it's why we have “drink, drank, drunk” and “sleep, slept”. In Modern German this is more common than in Modern English, and in Middle English it's also more common than it is now.

Past tense usually gets an ‘-ed’ on the end, like in Modern English.

The last mystery word here is “schuleth”:

Yield ye to all men your debts: to him that ye schuleth tribute, tribute.

This is the hard word here.

The first thing to know is that “sch-” is always pronounced “sh-” as it still is in German, never with a hard sound like “school” or “schedule”.

What's “schuleth” then? Maybe something to do with schools? It turns out not. This is a form of “shall, should” but in this context it has its old meaning, now lost, of “owe”. If I hadn't run across this while researching the history of the word “should”, I wouldn't have known what it was, and would have had to look it up.

But notice that it does follow a typical Middle English pattern: the consonants ‘sh-’ and ‘-l-’ stayed the same, while the vowels changed. In the modern word “should” we have a version of “schulen” with the past tense indicated by ‘-d’ just like usual.

“Schuleth” goes with ‘ye’ so it ought to be ‘schulest’. I don't know what's up with that.

[ Addendum 20200608: “ye” is plural, and ‘-st’ only goes on singular verbs. ]

Prose example

Let's try Wycliffe's Bible, which was written around 1380ish. This is Matthew 6:1:

Takith hede, that ye do not youre riytwisnesse bifor men, to be seyn of hem, ellis ye schulen haue no meede at youre fadir that is in heuenes.

Most of this reads right off:

Take heed, that you do not your riytwisnesse before men, to be seen of them, else you shall have no meede at your father that is in heaven.

“Take heed” is a bit archaic but still good English; it means “Be careful”.

Reading “riytwisnesse” aloud we can guess that it is actually “righteousness”. (Remember that that ‘y’ started out as a ‘gh’.)

“Schulen” we've already seen; here it just means “shall”.

I had to look up “meede”, which seems to have disappeared since 1380. It meant “reward”, and that's exactly how the NIV translates it:

Be careful not to practice your righteousness in front of others to be seen by them. If you do, you will have no reward from your Father in heaven.

That was fun, let's do another:

Therfore whanne thou doist almes, nyle thou trumpe tofore thee, as ypocritis doon in synagogis and stretis, that thei be worschipid of men; sotheli Y seie to you, they han resseyued her meede.

The same tricks work for most of this. “Whanne” is “when”. We still have the word “almes”, now spelled “alms”: it's the handout you give to beggars. The “sch” in “worschipid” is pronounced like ‘sh’ so it's “worshipped”.

“Resseyued” looks hard, but if you remember to try reading the ‘u’ as a ‘v’ and the ‘y’ as an ‘i’, you get “resseived” which is just one letter off of “received”. “Meede” we just learned. So this is:

Therefore when you do alms, nyle thou trumpe before you, as hypocrites do in synagogues and streets, that they be worshipped by men; sotheli I say to you, they have received their reward.
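The u/v and y/i trick that cracked “resseyued” can also be written down as a small search: generate every spelling you get by swapping those letters, then scan the list for something that looks like a modern word. Here is a minimal Python sketch under that assumption. The `SWAPS` table and the name `spelling_candidates` are hypothetical, and picking the right candidate is still left to the reader's eye.

```python
from itertools import product

# Letters treated as interchangeable. This table is an assumption
# that covers only the u/v and y/i tricks described in this article.
SWAPS = {"u": "uv", "v": "vu", "y": "yi", "i": "iy"}


def spelling_candidates(word: str) -> set:
    """Return every spelling reachable by swapping u/v and y/i."""
    # For each letter, list the letters it might stand for.
    options = [SWAPS.get(letter, letter) for letter in word]
    # Take one choice per position, in every possible combination.
    return {"".join(choice) for choice in product(*options)}


print(sorted(spelling_candidates("resseyued")))
```

The candidates for “resseyued” include “resseived”, and the candidates for “haue” include “have”. The number of candidates doubles with every swappable letter, so this is only practical for one word at a time, which is how you would use the trick anyway.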

Now we have the general meaning and some of the other words become clearer. What's “trumpe”? It's “trumpeting”. When you give to the needy, don't you trumpet before you, as the hypocrites do. So even though I don't know what “nyle” is exactly, the context makes it clear that it's something like “do not”. Negative words often began with ‘n’ just as they do now (no, nor, not, never, neither, nothing, etc.). Looking it up, I find that it's more usually spelled “nill”. This word is no longer used; it means the opposite of “will”. (It still appears in the phrase “willy-nilly”, which means “whether you want to or not”.)

“Sotheli” means “truly”. “Soth” or “sooth” is an archaic word for truth, like in “soothsayer”, a truth-speaker.

Here's the NIV translation:

So when you give to the needy, do not announce it with trumpets, as the hypocrites do in the synagogues and on the streets, to be honored by others. Truly I tell you, they have received their reward in full.

Poetic example

Let's try something a little harder, a random sentence from The Canterbury Tales, written around 1390. Wish me luck!

We olde men, I drede, so fare we:
Til we be roten, kan we nat be rype;
We hoppen alwey whil that the world wol pype.

The main difficulty is that it's poetic language, which might be a bit obscure even in Modern English. But first let's fix the spelling of the obvious parts:

We old men, I dread, so fare we:
Til we be rotten, can we not be ripe?
We hoppen alwey whil that the world will pipe.

The University of Michigan dictionary can be a bit tricky to use. For example, if you look up “meede” it won't find it; it's listed under “mede”. If you don't find the word you want as a headword, try doing full-text search.

Anyway, hoppen is in there. It can mean “hopping”, but in this poetic context it means dancing.

We old men, I dread, so fare we:
Til we be rotten, can we not be ripe?
We dance always while the world will pipe.

“Pipe” is a verb here, it means (even now) to play the pipes.

You try!

William Caxton is thought to have been the first person to print and sell books in England. This anecdote of his is one of my favorites. He wrote it around 1490, at the very tail end of Middle English:

In my dayes happened that certayn marchauntes were in a shippe in Tamyse, for to haue sayled ouer the see into zelande, and for lacke of wynde thei taryed atte Forlond, and wente to lande for to refreshe them; And one of theym named Sheffelde, a mercer, cam in-to an hows and axed for mete; and specyally he axyed after eggys; and the goode wyf answerde, that she coude speke no frenshe, And the marchaunt was angry, for he also coude speke no frenshe, but wolde haue hadde ‘egges’ and she understode hym not. And theene at laste another sayd that he wolde haue ‘eyren’ then the good wyf sayd that she vnderstod hym wel.

A “mercer” is a merchant, and “taryed” is now spelled “tarried”, which is now uncommon and means to stay somewhere temporarily.

I think the only other part of this that doesn't succumb to the tricks in this article is the place names:

A map of the route described in the paragraph, with London at the west, a squiggly purple line proceeding eastward along the River Thames to the sea, then stopping off in “Forland” on the eastern coast of Britain near Margate, and preparing to make a short run straight east across the North Sea to Middelburg in the Netherlands.

Caxton is bemoaning the difficulties of translating into “English” in 1490, at a time when English was still a collection of local dialects. He ends the anecdote by asking:

Loo, what sholde a man in thyse dayes now wryte, ‘egges’ or ‘eyren’?

Thanks to Caxton and those that followed him, we can answer: definitely “egges”.

[ Addenda 20200608: More about Middle English. ]

[ Addendum 20211027: An extended example of “half your stuff is sitting on the sidewalk” ]

[ Addendum 20211028: More about “eke” ]

[Other articles in category /lang] permanent link

Sat, 30 May 2020

Missing moods in English

Rob Hoelz mentioned that (one of?) the Nenets languages has different verb moods for concepts that in English are both indicated by “should” and “ought”:

"this is the state of the world as I assume it" vs "this is the state of the world as it currently isn't, but it would be ideal"

Examples of the former being:

  • That pie should be ready to come out of the oven
  • If you leave now, you should be able to catch the 8:15 to Casablanca

and of the latter:

  • People should be kinder to one another
  • They should manufacture these bags with stronger handles

I have often wished that English were clearer about distinguishing these. For example, someone will say to me

Using git splurch should fix that

and it is not always clear whether they are advising me how to fix the problem, or lamenting that it won't.

A similar issue applies to the phrase “ought to”. As far as I can tell the two phrases are completely interchangeable, and both share the same ambiguities. I want to suggest that everyone start using “should” for the deontic version (“you should go now”) and “ought to” for the predictive version (“you ought to be able to see the lighthouse from here”) and never vice versa, but that's obviously a lost cause.

I think the original meaning of both forms is more deontic. Both words originally meant monetary debts or obligations. With ought this is obvious, at least once it's pointed out, because it's so similar in spelling and pronunciation to owed. (Compare pass, passed, past with owe, owed, ought.)

For shall, the Big Dictionary has several citations, none later than 1425 CE. One is a Middle English version of Romans 13:7:

Ȝelde ȝe to alle men ȝoure dettes: to hym þat ȝe schuleþ trybut, trybut.

As often with Middle English, this is easier than it looks at first. In fact this one is so much easier than it looks that it might become my go-to example. The only strange word is schuleþ itself. Here's my almost word-for-word translation:

Yield ye to all men your debts: to him that ye oweth tribute, pay tribute.

The NIV translates it like this:

Give to everyone what you owe them: If you owe taxes, pay taxes

Anyway, this is a digression. I wanted to talk about different kinds of should. The Big Dictionary distinguishes two types but mixes them together in its listing, saying:

In statements of duty, obligation, or propriety. Also, in statements of expectation, likelihood, prediction, etc.

The first of these seems to correspond to what I was calling the deontic form (“people should be kinder”) and the second to the predictive form (“you should be able to catch that train”). But their quotations reveal several other shades of meaning that don't seem exactly like either of these:

Some men should have been women

This is not any of duty, obligation, propriety, expectation, likelihood, or prediction. But it is exactly M. Hoelz’ “state of the world as it currently isn't”.


I should have gotten out while I had the chance

Again, this isn't (necessarily) duty, obligation, or propriety. It's just a wish contrary to fact.

The OED does give several other related shades of “should” which are not always easy to distinguish. For example, its definition 18b says

ought according to appearances to be, presumably is

and gives as an example

That should be Barbados..unless my reckoning is far out.

Compare “We should be able to see Barbados from here”.

Its 18c is “you should have seen that fight!” which again is of the wish-contrary-to-fact type; they even gloss it as “I wish you could have…”.

Another distinction I would like to have in English is between “should” used for mere suggestion in contrast with the deontic use. For example

It's sunny out, we should go swimming!

(suggestion) versus

You should finish your homework before you play ball.


Say this distinction was marked in English. Then your mom might say to you in the morning

You should¹ wash the dishes now, before they get crusty

and later, when you still haven't washed the dishes:

You should² wash the dishes before you go out

meaning that you'll be in trouble if you fail in your dish washing duties.

When my kids were small I started to notice how poorly this sort of thing was communicated by many parents. They would regularly say things like

  • You should be quiet now
  • I need you to be quiet now
  • You need to be quiet now
  • You want to be quiet now

(the second one in particular) when what they really meant, it seemed to me, was “I want you to be quiet now”.

[ I didn't mean to end the article there, but after accidentally publishing it I decided it was as good a place as any to stop. ]

[Other articles in category /lang] permanent link

Thu, 30 Apr 2020

Geeky boasting about dictionaries

Yesterday Katara and I were talking about words for ‘song’. Where did ‘song’ come from? Obviously from German, because sing, song, sang, sung is maybe the perfect example of ablaut in Germanic languages. (In fact, I looked it up in Wikipedia just now and that's the example they actually used in the lede.)

But the German word I'm familiar with is Lied. So what happened there? Do they still have something like Song? I checked the Oxford English Dictionary but it was typically unhelpful. “It just says it's from Old German Sang, meaning ‘song’. To find out what happened, we'd need to look in the Oxford German Dictionary.”

Katara considered. “Is that really a thing?”

“I think so, except it's written in German, and obviously not published by Oxford.”

“What's it called?”

I paused and frowned, then said “Deutsches Wörterbuch.”

“Did you just happen to know that?”

“Well, I might be totally wrong, but yeah.” But I looked. Yeah, it's called Deutsches Wörterbuch:

The Deutsches Wörterbuch … is the largest and most comprehensive dictionary of the German language in existence. … The dictionary's historical linguistics approach … makes it to German what the Oxford English Dictionary is to English.

So, yes, I just happened to know that. Yay me!

Deutsches Wörterbuch was begun by Wilhelm and Jakob Grimm (yes, those Brothers Grimm) although the project was much too big to be finished in their lifetimes. Wilhelm did the letter ‘D’. Jakob lived longer, and was able to finish ‘A’, ‘B’, ‘C’, and ‘E’. Wikipedia mentions the detail that he died “while working on the entry for ‘Frucht’ (fruit)”.

Wikipedia says “the work … proceeded very slowly”:

Hermann Wunderlich, Hildebrand's successor, only finished Gestüme to Gezwang after 20 years of work …

(This isn't as ridiculous as it seems; German has a lot of words that begin with ‘ge-’.)

The project came to an end in 2016, after 178 years of effort. The revision of the Grimms’ original work on A–F, planned since the 1950s, is complete, and there are no current plans to revise the other letters.

[Other articles in category /lang] permanent link

Wed, 22 Apr 2020

Mystery spam language

This morning I got spam with this subject:

Subject: yaxşı xəbər

Now what language is that? The ‘şı’ looks Turkish, but I don't think Turkish has a letter ‘ə’. It took me a little while to find out the answer.

It's Azerbaijani. Azerbaijani has an Arabic script and a Latin script; this is the Latin script. Azerbaijani is very similar to Turkish and I suppose they use the ‘ş’ and ‘ı’ for the same things. I speculated that the ‘x’ was analogous to Turkish ‘ğ’, but it appears not; Azerbaijani also has ‘ğ’ and in former times they used ‘ƣ’ for this.

Bonus trivia: The official Unicode name of ‘ƣ’ is LATIN SMALL LETTER OI. Unicode Technical Note #27 says:

These should have been called letter GHA. They are neither pronounced 'oi' nor based on the letters 'o' and 'i'.
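Character names like this one are easy to check with Python's standard unicodedata module. Here is a minimal sketch; the letters are written as code-point escapes so that nothing depends on how this page happens to be encoded, and the particular selection of letters is just my own pick from the discussion above:

```python
import unicodedata

# Letters mentioned above, as code-point escapes: ş ı ə ğ ƣ
letters = ["\u015f", "\u0131", "\u0259", "\u011f", "\u01a3"]

# Map each letter to its formal Unicode character name.
names = {ch: unicodedata.name(ch) for ch in letters}

for ch, name in names.items():
    print(f"U+{ord(ch):04X} {ch} {name}")
```

The last line printed is the one for ‘ƣ’, and its name is indeed LATIN SMALL LETTER OI.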

[ Addendum 20210215: I was pleased to discover today that I have not yet forgotten what Azeri looks like. ]

[ Addendum 20230731: Another mystery language sample. ]

[Other articles in category /lang] permanent link

Hidden emeralds

Dave Turner pointed me to the 1939 Russian-language retelling of The Wizard of Oz, titled The Wizard of the Emerald City. In Russian the original title was Волшебник Изумрудного Города. It's fun to try to figure these things out. Often Russian words are borrowed from English or are at least related to things I know but this one was tricky. I didn't recognize any of the words. But from the word order I'd expect that Волшебник was the wizard. -ого is a possessive ending so maybe Изумрудного is “of emeralds”? But Изумрудного didn't look anything like emeralds… until it did.

Изумрудного is pronounced (approximately) “izumrudnogo”. But “emerald” used to have an ‘s’ in it, “esmerald”. (That's where we get the name “Esmeralda”.) So the “izumrud” is not that far off from “esmerad” and there they are!

[Other articles in category /lang] permanent link

Fri, 17 Apr 2020

Earlier dumpster fires

In my previous article I claimed

the oldest known metaphorical use of “dumpster fire” is in reference to the movie Shrek the Third.

However, this is mistaken. Eric Harley has brought to my attention that the phrase was used as early as 2003 to describe The Texas Chainsaw Massacre. According to this Salt Lake Tribune article:

One early use found by Oxford Dictionaries' Jeff Sherwood was a 2003 movie review by the Arizona Republic's Bill Muller that referred to that year's remake of "The Texas Chainsaw Massacre" as "the cinematic equivalent of a dumpster fire — stinky but insignificant."

If Sherwood is affiliated with Oxford Dictionaries, I wonder why this citation hasn't gotten into the Big Dictionary. The Tribune also pointed me to Claire Fallon's 2016 discussion of the phrase.

Thank you, M. Harley.

[Other articles in category /lang] permanent link

Thu, 16 Apr 2020

Dumpster fires

Today I learned that the oldest known metaphorical use of “dumpster fire” (to mean “a chaotic or disastrously mishandled situation”) is in reference to the movie Shrek the Third.

The OED's earliest citation is from a 2008 Usenet post. I looked in Google Book search for an earlier one, but everything I found was about literal dumpster fires.

I missed the movie, and now that I know it was the original Dumpster Fire, I feel lucky.

[ Addendum 20200417: More about this. ]

[Other articles in category /lang] permanent link

Mon, 06 Apr 2020

Anglo-Saxon and Hawai‘ian Wikipedias

Yesterday browsing the list of Wikipedias I learned there is an Anglo-Saxon Wikipedia. This seems really strange to me for several reasons: Who is writing it? And why?

And there is a vocabulary problem. Not just because Anglo-Saxon is dead, and one wouldn't expect it to have any words for anything not invented in the last 900 years or so. But also, there are very few extant Anglo-Saxon manuscripts, so we don't have a lot of vocabulary, even for things that had been invented before the language died.

Helene Hanff said:

I have these guilts about never having read Chaucer but I was talked out of learning Early Anglo-Saxon / Middle English by a friend who had to take it for her Ph.D. They told her to write an essay in Early Anglo-Saxon on any-subject-of-her-own-choosing. “Which is all very well,” she said bitterly, “but the only essay subject you can find enough Early Anglo-Saxon words for is ‘How to Slaughter a Thousand Men in a Mead Hall’.”

I don't read Anglo-Saxon but if you want to investigate, you might look at the Anglo-Saxon article about the Maybach Exelero (a hēahfremmende sportƿægn), Barack Obama, or taekwondo. I am pre-committing to not getting sucked into this, but sportƿægn is evidently intended to mean “sportscar” (the ƿ is an obsolete letter called wynn and is approximately a W, so that ƿægn is “wagon”) and I think that fremmende is “foreign” and hēah is something like "high" or "very". But I'm really not sure.

Anyway Wikipedia reports that the Anglo-Saxon Wikipedia has 3,197 articles (although most are very short) and around 30 active users. In contrast, the Hawai‘ian Wikipedia has 3,919 articles and only around 14 active users, and that is a language that people actually speak.

[Other articles in category /lang] permanent link

Caricatures of Nazis and the number four in Russian

[ Warning: this article is kinda all over the place. ]

I was looking at this awesome poster of D. Moor (Д. Моор), one of Russia's most famous political poster artists:

A Soviet propaganda poster, black,
with the foreground in yellowish-beige and a border of the same
color.  It depicts caricatures of the faces of Himmler, Göring,
Hitler, and Goebbels, labeled on the left with their names in
Russian.  Each name begins with the Russian letter Г, which is shaped
like an upside-down letter L.  Further description is below.

(original source at Artchive.RU)

This is interesting for a couple of reasons. First, in Russian, “Himmler”, “Göring”, “Hitler”, and “Goebbels” all begin with the same letter, ‘Г’, which is homologous to ‘G’. (Similarly, Harry Potter in Russian is Га́рри, ‘Garri’.)

I also love the pictures, and especially Goebbels. These four men were so ugly, each in his own distinctively loathsome way. The artist has done such a marvelous job of depicting them, highlighting their various hideousness. It's exaggerated, and yet not unfair, these are really good likenesses! It's as if D. Moor had drawn a map of all the ways in which these men were ugly.

My all-time favorite depiction of Goebbels is this one, by Boris Yefimov (Бори́с Ефи́мов):

A poster in black, blue, yellow, and muddy green, depicting
Goebbels as a hideous mashup with
Mickey Mouse. His tail divides into four at the end and is shaped like
a swastika.  His yellow-gloved hands are balled into fists and spittle
is flying from his mouth. The poster is captioned (in English) at the top: “WHAT
IS AN ‘ARYAN’?  He is HANDSOME” and at the bottom “AS GOEBBELS”.

For comparison, here's the actual Goebbels:

Actual archival photograph of Goebbels, in right profile, just
like Mickey Mouse Goebbels in the previous picture, but from the chest
up.  His mouth is
closed and he is wearing a wool suit, white shirt with collar, and a
wide necktie.

Looking at pictures of Goebbels, I had often thought “That is one ugly guy,” but never been able to put my finger on what specifically was wrong with his face. But since seeing the Yefimov picture, I have never been able to look at a picture of Goebbels without thinking of a rat. D. Moor has also drawn Goebbels as a tiny rat, scurrying around the baseboards of his poster.

Anyway, that was not what I had planned to write about. The right-hand side of D. Moor's poster imagines the initial ‘Г’ of the four Nazis’ names as the four bent arms of the swastika. The captions underneath mean “first Г”, “second Г” and so on.

[ Addendum: Darrin Edwards explains the meaning here that had escaped me:

One of the Russian words for shit is "govno" (говно). A euphemism for this is to just use the initial g; so "something na g" is roughly equivalent to saying "a crappy something". So the title "vse na g" (all on g) is literally "they all start with g" but pretty blatantly means "they're all crap" or "what a bunch of crap". I believe the trick of constructing the swastika out of four g's is meant to extend this association from the four men to the entire movement…

Thank you, M. Edwards! ]

Looking at the fourth one, четвертое /chetvyertoye/, I had a sudden brainwave. “Aha,” I thought, “I bet this is akin to Greek “tetra”, and the /t/ turned into /ch/ in Russian.”

Well, now that I'm writing it down it doesn't seem that exciting. I now remember that all the other Russian number words are clearly derived from PIE just as Greek, Latin, and German are:

English   German   Latin      Greek                  Russian
one       ein      unum       εἷς (eis)              оди́н (odeen)
two       zwei     duo        δύο (dyo)              два (dva)
three     drei     trēs       τρεῖς (treis)          три (tri)
four      vier     quattuor   τέτταρες (tettares)    четы́ре (chyetirye)
five      fünf     quinque    πέντε (pente)          пять (pyat’)

In Latin that /t/ turned into a /k/ and we get /quadra/ instead of /tetra/. The Russian Ч /ch/ is more like a /t/ than it is like a /k/.

The change from /t/ to /f/ in English and /v/ in German is a bit weird. (The Big Dictionary says it “presents anomalies of which the explanation is still disputed”.) The change from the /p/ of ‘pente’ to the /f/ of ‘five’ is much more typical. (Consider Latin ‘pater’, ‘piscis’, ‘ped’ and the corresponding English ‘father’, ‘fish’, ‘foot’.) This is called Grimm's Law, yeah, after that Grimm.

The change from /q/ in quinque to /p/ in pente is also not unusual. (The ancestral form in PIE is believed to have been more like the /q/.) There's a classification of Celtic languages into P-Celtic and Q-Celtic that's similar, exemplified by the change from the Irish patronymic prefix Mac- into the Welsh patronymic map or ap.

I could probably write a whole article comparing the numbers from one to ten in these languages. (And Sanskrit. Wouldn't want to leave out Sanskrit.) The line for ‘two’ would be a great place to begin because all those words are basically the same, with only minor and typical variations in the spelling and pronunciation. Maybe someday.

[Other articles in category /lang/etym] permanent link

Tue, 07 Jan 2020

Social classes identified by letters

Looking up the letter E in the Big Dictionary, I learned that British sociologists were dividing social classes into lettered strata long before Aldous Huxley did it in Brave New World (1932). The OED quoted F. G. D’Aeth, “Present Tendencies of Class Differentiation”, The Sociological Review, vol 3 no 4, October, 1910:

The present class structure is based upon different standards of life…

A. The Loafer
B. Low-skilled labour
C. Artizan
D. Smaller Shopkeeper and clerk
E. Smaller Business Class
F. Professional and Administrative Class
G. The Rich

The OED doesn't quote further, but D’Aeth goes on to explain:

A. represents the refuse of a race; C. is a solid, independent and valuable class in society. … E. possesses the elements of refinement; provincialisms in speech are avoided, its sons are selected as clerks, etc., in good class businesses, e.g., banking, insurance.

Notice that in D’Aeth's classification, the later letters are higher classes. According to the OED this was typical; they also quote a similar classification from 1887 in which A was the lowest class. But the OED labels this sort of classification, with A at the bottom, as “obsolete”.

In Brave New World, you will recall, it is in the other direction, with the Alphas (administrators and specialists) at the top, and the Epsilons (menial workers with artificially-induced fetal alcohol syndrome) at the bottom.

The OED's later quotations, from 1950–2014, all follow Huxley in putting class A at the top and E at the bottom. They also follow Huxley in having only five classes instead of seven or eight. (One has six classes, but two of them are C1 and C2.)

I wonder how much influence Brave New World had on this sort of classification. Was anyone before Huxley dividing British society into five lettered classes with A at the top?

[ By the way, I have been informed that this paper, which I have linked above, is “Copyright © 2020 by The Sociological Review Publication Limited. All rights are reserved.” This is a bald lie. Sociological Review Publication Limited should be ashamed of themselves. ]

[Other articles in category /lang] permanent link

Thu, 12 Dec 2019


Many ‘bene-’ words do have ‘male-’ opposites. For example, the opposite of a benefactor is a malefactor, the opposite of a benediction is a malediction, and the opposite of benevolence is malevolence. But strangely there is no ‘malefit’ that is opposite to ‘benefit’.

Or so I wrote, and then I thought I had better look it up.

The Big Dictionary has six examples, one as recent as 1989 and one as early as 1755:

I took it into my head to try for a benefit, and to that end printed some bills… but… instead of five and twenty pounds, I had barely four…. The morning after my malefit, I was obliged to strip my friend of the ownly decent gown she had, and pledged it to pay the players.

(Charlotte Charke, A narrative of the life of Mrs. Charlotte Charke (youngest daughter of Colley Cibber, Esq.), 1755.)

(I think the “benefit” here is short for “benefit performance”, an abbreviation we still use today.)

Mrs. Charke seems to be engaging in intentional wordplay. All but one of the other citations similarly suggest intentional wordplay; for example:

Malefactors used to commit malefactions. Why could they not still be said to do so, rather than disbenefits, or, perhaps, stretching a point, commit malefits?

(P. Howard, Word in Your Ear, 1983.)

The one exception is from no less a person than J.R.R. Tolkien:

Some very potent fiction is specially composed to be inspected by others and to deceive, to pass as record; but it is made for the malefit of Men.

(Around 1973, quoted in C. Tolkien, History of Middle-earth: Sauron Defeated, 1992.)

Incidentally, J.R.R. is quoted 362 times in the Big Dictionary.

[Other articles in category /lang/etym] permanent link

Tue, 26 Nov 2019


A chalupa is a fried tortilla that has been filled with meat, shredded cheese, or whatever. But it is also the name of the mayor of Prague from 2002–2011.

Tortilla  Tomáš

The boat-shaped food item is named after a kind of boat called a chalupa; I think the name is akin to English sloop. But in Czech a chalupa is neither a boat nor a comestible, but a cottage.

[ Other people whose names are accidentally boats ]

[ Addendum 20191201: I should probably mention that the two words are not pronounced the same; in Spanish, the “ch” is like in English “church”, and in Czech it is pronounced like in English “challah” or “loch”. To get the Spanish pronunciation in Czech you need to write “čalupa”, and this is indeed the way they spell the name of the fried-tortilla dish in Czech. ]

[ Addendum 20220115: Other people whose names are accidentally foods ]

[Other articles in category /lang/etym] permanent link

Wed, 30 Oct 2019

Russian names in English and English names in Russian

One day I was surprised to find that Michael Jordan's name in Russian is “Майкл” (‘mai-kl’), and not “Михаи́л” (‘Mikhail’, the Russian translation of Michael.) Which is just what I should have expected; we don't refer to Mikhail Gorbachev or Baryshnikov as “Michael”, and it would be just as odd, in the other direction, if the Russians referred to the famous basketball player “Mikhail” Jordan.

When I was taking high school Russian we were assigned Russian versions of our names and I was disappointed to receive “Марк” (“Mark”) rather than anything more interesting. My friend Jeremy was stiffed in a different way. Apparently there is no direct Russian analog of “Jeremy” so the teacher opted for “Юрий” (Yuri). Yuri is not in any way a correct translation of Jeremy; it is the Russian version of “George”. Looking into it now, I wish she had thought to use “Иереми́я” (Jeremiah), or perhaps “Иерони́м” (Jerome).

1688 engraved portrait of King
James I of England, captioned “IACOBUS I · Rex. Angliæ”. James is
shown from the chest up, wearing a stiff lace ruff and a dark,
close-fitting jacket of some sort. His beard (chin only) is cut in a
square and the ends of his mustache turn up.  His hairline has receded
and left a little tuft in the middle of the top of his head. His eyes
are wide and bright, and his lips slightly upturned, giving him an
appearance of private amusement.
King James I of England

It's funny how sometimes these names can be so easy to translate and sometimes so difficult. Mark is Mark, Aleksandr is Alexander, Viktor is Victor, Ivan is John, Yuri is George, Yakov is Jacob (or maybe James), Fyedor is Theodore, nothing is William, and Igor is nothing.

Italian Maria is obviously English “Mary” but how do you translate Mario? English has no male version of “Mary”.

(Side note: it is so bizarre that James and Jacob are somehow the same name, that when you turn Iacobus / Jacques / Iago (Latin / French / Spanish) into English it somehow turns into James. Another: What knucklehead decided to translate Frère Jacques as Brother John?)

[ Addendum: My previous article discussed the Korean translation of 邓小平, the name of Chinese leader Deng Xiaoping. Brian Lee points out that the usual Korean translation of Chinese 小 (“small”) is 소 (pronounced, roughly, as /shoo/), but, just as in my Michael-Jordan examples above, the Koreans have chosen to translate the name so as to preserve the foreign pronunciation, 샤오 (/shya-oh/). Thanks! ]

[ Addendum: Dmitry Ivanov points out that there is a second Russian version of George, less common but closer to the English version: Георгий (“Georgy”). He also drew my attention to another Russian version of Jeremy, Ерёма (“Yerema”). This led me to discover that Russian Wikipedia has an entire page about Jeremy-related names, and mentions at least the following:

  • Еремей
  • Ереми́й
  • Ерене́й
  • Ерёма
  • Иереме́й
  • Иереми́й
  • Иереми́я
  • Ириме́й
  • Ярёма

Clearly, my high school Russian teacher blew it. ]

[Other articles in category /lang] permanent link

Tue, 29 Oct 2019

Vowels in Korean and Mandarin

Something I've been wondering about for a while: there's this vowel in Mandarin which is usually written as ‘e’, for example in Deng (Xiaoping, 邓小平) or in feng shui (風水). But it's not pronounced like the ‘e’ in English “bed” or “pen”. It seems to my untrained ear to be more like the Korean vowel ‘ㅓ’, which is sort of between English “bought” and “but”. So I had wanted for a while to look up how Deng's name was spelled in Korean to see if they used ‘ㅓ’ or some other vowel. Partial success. Sure enough, Deng is spelled with ‘ㅓ’ in Korean: 덩(샤오핑).

“Feng shui” is spelled differently in Korean, with a different vowel: 풍수. But that's not too surprising, since the term “feng shui” presumably entered the Korean language centuries ago, and not only was the Chinese pronunciation probably different then, the Korean pronunciation would have changed over time after the adoption. In contrast, Deng's name presumably wasn't translated into Korean until sometime in the 20th century.

I was surprised that “Xiaoping” turns into three syllables in Korean. But Korean doesn't have that /aʊ/ diphthong, so that's the best it can do. This reminds me now of how amused I was by Corn Flakes boxes in Korea: in Korean, “Flake” is a four-syllable word. (플레이크).

[Other articles in category /lang] permanent link

Mon, 28 Oct 2019

A solution in search of a problem

I don't remember right now what inspired this, but I got to thinking last week, what if I were to start writing the English letter ‘C’ in two forms, to distinguish its two pronunçiations? Speçifically, when ‘C’ gets the soft /s/ sound, we'll write it with a çedilla, and when it gets the hard /k/ sound we'll write it as usual.

Many improvements have been proposed to English spelling, and why not? Almost any change would be an improvement. But most orthographic innovations produçe barbaric or bizarre spellings. For example, “enuff” is still just wrong and may remain so for a long time. “Thru” and “donut” have been in common use long enough that not everyone thinks they look entirely bizarre, and I think only the Brits still object to “catalog” in plaçe of “catalogue”. But my ‘ç’ suggestion seems to me to be less violent. All the words are still spelled the same way. Nobody would have to deal with the shock of new spellings like “sirkular” or “klearanse”. I think the difficulty of adjusting to “çircular” and “clearançe” seems quite low.

On the other hand, the benefit also seems quite low. There aren't that many C’s to begin with. And who does this help, exactly? Foreigners who might otherwise have trouble deçiding how to pronounçe a particular ‘C’? Are there any people who actually have trouble reading “circle” and would be helped if it were spelled “çircle”? And if there are, isn't c-vs-ç the least of their problems?

(Also, as Katara points out, ‘C’ is nearly superfluous in English as it is. You can almost always replaçe it with ‘S’ or ‘K’, accordingly. Although she did point out a counterexample: spelling “mace” as “mase” could be misleading. My proposal of “maçe”, though, is quite clear.)

I wonder, though, if this doesn't point the way toward a more general intervention that might be more generally helpful. The “ough” cluster gets a bad rap, but the real problems in English orthography are mostly in the totally inconsistent vowel spellings. Some diacritical marks might be a big help. For example, consider “bread” and “bead”. What if the close vowel in “bead” were indicated by spelling it “bēad”? Then it becomes easy to distinguish between “rēad” (/ɹid/, present tense) and “read” (/ɹɛd/, past tense), similarly “lēad” and “lead”. Native Anglophones will quickly learn to ignore the diacritical marks. A similar tactic might even help with the notorious “ough”. I don't really know what to do about words like “precious” or “ocean”, though. We can't leave them as they were, because that would unambiguously indicate the wrong pronunçiation “prekious”, “okean”. But to spell them “preçious” or “oçean” would be misleading. “Prećious”, maybe?

(I suppose someone wants to suggest “preşious” and “oşean”, but this is exactly what I'm trying to avoid. If you're going to do that you might as well go whole hog and use “preshus” and “oshun”.)
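The easy part of the ‘ç’ idea can be sketched mechanically. This is only a rough heuristic of my own choosing, not a real pronunciation rule: it assumes that ‘c’ is soft exactly when the next letter is ‘e’, ‘i’, or ‘y’. The function name `cedillify` is made up for the illustration.

```python
import re

def cedillify(text):
    """Mark each 'c' that is followed by e, i, or y with a cedilla.

    Assumption: "followed by e/i/y" is a crude stand-in for "pronounced /s/".
    """
    text = re.sub(r"c(?=[eiyEIY])", "ç", text)
    return re.sub(r"C(?=[eiyEIY])", "Ç", text)

for sample in ["circular clearance", "mace", "catalog", "precious ocean", "soccer"]:
    print(sample, "->", cedillify(sample))
```

It gets “çircular”, “clearançe”, and “maçe” right and leaves “catalog” alone. It also cheerfully produces the misleading “preçious” and “oçean”, which is exactly the problem described above, and it turns “soccer” into “socçer”, so a real version would need a pronouncing dictionary rather than a regex.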

If you follow this path too far (and in the wrong direction) you end up with Unifon. I think this is a better direction and could end in a better plaçe. Maybe not better enough to be worth doing, though.

Peaçe out.

[Other articles in category /lang] permanent link

Fri, 25 Oct 2019

Gringos and gringas

Wikipedia's article on the etymology of gringo is quite good, well-cited, and I did not detect any fishy smells. I had previously tried to look up gringo in the Big Dictionary, but it only informed me that it was from Mexican Spanish, which is not really helpful. (I know that's because their jurisdiction stops at the English border, and they aren't responsible for anything outside, but really, OED folks? Nothing else?)

Anyway Wikipedia helped me out. I had gotten onto this gringos thing because yesterday I learned about gringas, which are white flour tortillas. I immediately wondered: are they called gringas because (like gringos) they're made of white paste? Or is it because they're eaten by gringos, who don't care for corn tortillas? The answer seems to be: both explanations are current, but nobody knows if either is correct.

On the way to gringo I spent a while reading about yanqui, which Latin Americans use to refer to northerners.

So do people in the USA for that matter. Southerners will angrily deny being “yanqui”. They reserve that term to mean anyone from the north, such as myself. But folks like me from the Mid-Atlantic states also deny being Yankees and will tell you that it only means people from New England. Many New Englanders will disclaim being truly Yankee and say that to meet true Yankees you need to go to Maine or maybe New Hampshire. And I suppose people in Maine use it to mean one particular old Yankee farmer who lives up near the Canadian border.

Anyway, I wonder: in Latin America, does “yanqui” always mean specifically USA-ians, or would it also include Canadians? Would a typical Mexican or Guatemalan person refer casually to Canadians as yanquis? Or, if they were drinking beer with a Canadian, and the Canadian referred to themselves as yanqui, would they correct them? (“You're not a yanqui, you're Canadian! Not the same thing at all!”)

If Mexicans do consider Canadians to be a species of yanqui, what do they make of the Québécois? Also yanqui? Or do Francophones get a pass? (What about the Cajuns for that matter?)

[Other articles in category /lang/etym] permanent link

Wed, 28 Aug 2019

Opposites again

In a (still unpublished) discussion a while back, of the complexities of the idea of “opposites”, I said:

"Opposite" extends to all sorts of situations in which logic doesn't apply. Red is the opposite of green, but I'm not sure that it makes sense to ask for the logical negation of green. I suppose you can go with "not green", which is certainly quite different from "red".

A related example: Red is the opposite of green.

What's the opposite of “not green”? Is it “not red”? I think it isn't. The opposite of “not green” is “green”.

[Other articles in category /lang] permanent link

Mon, 20 May 2019

Alphabetical order in Korean

Alphabetical order in Korean has an interesting twist I haven't seen in any other language.

(Perhaps I should mention up front that Korean does not denote words with individual symbols the way Chinese does. It has a 24-letter alphabet, invented in the 15th century.)

Consider the Korean word “문어”, which means “octopus”. This is made up of five letters ㅁㅜㄴㅇㅓ. The ㅁㅜㄴ are respectively equivalent to English ‘m’, ‘oo’ (as in ‘moon’), and ‘n’. The ㅇ is silent, just like ‘k’ in “knit”. The ㅓ is a vowel we don't have in English, partway between “saw” and “bud”. Confusingly, it is usually rendered in Latin script as ‘eo’. (It is the first vowel in “Seoul”, for example.) So “문어” is transliterated to Latin script as “muneo”, or “munŏ”, and approximately pronounced “moon-aw”.

But as you see, it's not written as “ㅁㅜㄴㅇㅓ” but as “문어”. The letters are grouped into syllables of two or three letters each. (Or, more rarely, four or even five.)

Now consider the word “무해” (“harmless”). This word is made of the four letters ㅁㅜㅎㅐ. The first two, as before, are ‘m’, ‘oo’. The ㅎ is ‘h’ and the ‘ㅐ’ is a vowel that is something like the vowel in “air”, usually rendered in Latin script as ‘ae’. So it is written “muhae” and pronounced something like “moo-heh”.

ㅎ is the last letter of the alphabet. Because ㅎ follows ㄴ, you might think that 무해 would follow 문어. But it does not. In Korean, alphabetization is also done at the syllable level. The syllable 무 comes before 문, because it is a proper prefix, so 무해 comes before 문어. If the syllable break in 문어 were different, causing it to be spelled 무너, it would indeed come before 무해. But it isn't, so it doesn't. (“무너” does not seem to be an actual word, but it appears as a constituent in words like 무너지다 (“collapse”) and 무너뜨리다 (“demolish”) which do come before 무해 in the dictionary.)
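Here is a small sketch of the difference between the two orderings. It leans on two assumptions of mine: that the plain code-point order of Unicode's precomposed Hangul syllables is a fair stand-in for dictionary order (I believe it agrees with the South Korean convention for modern syllables, but I haven't checked it against a real dictionary), and that ordering individual letters by code point, consonants before vowels, is a reasonable guess at what a naive letter-by-letter sort would do. The helper name `letters` is made up for the illustration.

```python
# Jamo tables in the order Unicode uses to build precomposed syllables.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
# 28 possible finals; the first means "no final consonant".
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def letters(word):
    """Spell a word out as a flat string of letters, ignoring syllable breaks."""
    out = []
    for ch in word:
        offset = ord(ch) - 0xAC00          # 0xAC00 is the first syllable, 가
        if 0 <= offset < 19 * 21 * 28:
            lead, rest = divmod(offset, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(LEADS[lead] + VOWELS[vowel] + TAILS[tail])
        else:
            out.append(ch)                 # not a Hangul syllable; pass through
    return "".join(out)

words = ["문어", "무해", "무너"]

by_syllable = sorted(words)                # syllable by syllable
by_letter = sorted(words, key=letters)     # naive letter by letter

print(by_syllable)   # ['무너', '무해', '문어']
print(by_letter)     # ['문어', '무너', '무해']
```

The syllable-level sort puts 무해 before 문어, as in the dictionary, while the letter-by-letter sort compares ㄴ with ㅎ in third position and gets it the other way around.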

As far as I know, there is nothing in Korean analogous to the English alphabet song.

Or to alphabet soup! Koreans love soup! And they love the alphabet, so why no hangeul-tang? There is a hundred dollar bill lying on the sidewalk here, waiting to be picked up.

[ Previously, but just barely related: Medieval Chinese typesetting technique. ]

[Other articles in category /lang] permanent link

Thu, 02 May 2019

Mathematical jargon failures

A while back I wrote an article about confusing and misleading technical jargon, drawing special attention to botanists’ indefensible misuse of the word “berry” and then to the word “henge”, which archaeologists use to describe a class of Stonehenge-like structures of which Stonehenge itself is not a member.

I included a discussion of mathematical jargon and generally gave it a good grade, saying:

Nobody hearing the term “cobordism” … will think for an instant that they have any idea what it means … they will be perfectly correct.

But conversely:

The non-mathematician's idea of “line”, “ball”, and “cube” is not in any way inconsistent with what the mathematician has in mind …

Today I find myself wondering if I gave mathematics too much credit. Some mathematical jargon is pretty bad. Often brought up as an example are the topological notions of “open” and “closed” sets. It sounds as if they should be exclusive and exhaustive — surely a set that is open is not closed, and vice versa? — but no, there are sets that are neither open nor closed and other sets that are both. Really the problem here is entirely with “open”. The use of “closed” is completely in line with other mathematical uses of “closed” and “closure”. A “closed” object is one that is a fixed point of a closure operator. Topological closure is an example of a closure operator, and topologically closed sets are its fixed points.

(Last month someone asked on Stack Exchange if there was a connection between topological closure and binary operation closure and I was astounded to see a consensus in the comments that there was no relation between them. But given a binary operation !!\oplus!!, we can define an associated closure operator !!\text{cl}_\oplus!! as follows: !!\text{cl}_\oplus(S)!! is the smallest set !!\bar S!! that contains !!S!! and for which !!x,y\in\bar S!! implies !!x\oplus y\in \bar S!!. Then the binary operation !!\oplus!! is said to be “closed on the set !!S!!” precisely if !!S!! is closed with respect to !!\text{cl}_\oplus!!; that is if !!\text{cl}_\oplus(S) = S!!. But I digress.)
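For a finite example, the closure operator described in that digression can be sketched in a few lines of Python. The function names `closure` and `is_closed` are invented for this illustration, and the loop only terminates when the closure is finite, which is why the example uses addition modulo 10 rather than ordinary addition.

```python
def closure(seed, op):
    """Smallest superset of `seed` that contains op(x, y) for all its x, y.

    Only terminates when that set is finite.
    """
    closed = set(seed)
    while True:
        new = {op(x, y) for x in closed for y in closed} - closed
        if not new:
            return closed
        closed |= new


def is_closed(s, op):
    """`op` is closed on `s` exactly when `s` is a fixed point of the closure."""
    return closure(s, op) == set(s)


def add_mod_10(x, y):
    return (x + y) % 10


print(closure({2}, add_mod_10))
print(is_closed({0, 5}, add_mod_10))
print(is_closed({1, 2}, add_mod_10))
```

Starting from {2}, the closure under addition modulo 10 is the set of even residues. Applying `closure` a second time changes nothing, which is the idempotence one expects of any closure operator.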

Another example of poor nomenclature is “even” and “odd” functions. This is another case where it sounds like the terms ought to form a partition, as they do in the integers, but that is wrong; most functions are neither even nor odd, and there is one function that is both. I think what happened here is that first an “even” polynomial was defined to be a polynomial whose terms all have even exponents (such as !!x^4 - 10x^2 + 1!!) and similarly an “odd” polynomial. This already wasn't great, because most polynomials are neither even nor odd. But it was not too terrible. And at least the meaning is simple and easy to remember. (Also you might like the product of an even and an odd polynomial to be even, as it is for even and odd integers, but it isn't, it's always odd. As far as even-and-oddness is concerned the multiplication of the polynomials is analogous to addition of integers, and to get anything like multiplication you have to compose the polynomials instead.)

And once that step had been taken it was natural to extend the idea from polynomials to functions generally: odd polynomials have the property that !!p(-x) = -p(x)!!, so let's say that an odd function is one with that property. If an odd function is analytic, you can expand it as a Taylor series and the series will have only odd-degree terms even though it isn't a polynomial.

There were two parts to that journey, and each one made some sense by itself, but by the time we got to the end it wasn't so easy to see where we started from. Unfortunate.
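The claims above about products and compositions of even and odd polynomials are easy to check mechanically. This is a small sketch, not a proof: polynomials are plain coefficient lists (index = exponent), and the helpers `parity`, `mul`, and `compose` are names made up for the example.

```python
def parity(p):
    """Classify a coefficient list as 'even', 'odd', 'both', or 'neither'."""
    exponent_parities = {i % 2 for i, c in enumerate(p) if c != 0}
    if not exponent_parities:
        return "both"      # the zero polynomial
    if exponent_parities == {0}:
        return "even"
    if exponent_parities == {1}:
        return "odd"
    return "neither"


def mul(p, q):
    """Product of two polynomials."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def compose(p, q):
    """The polynomial p(q(x)), built up by Horner's rule."""
    result = [0]
    for c in reversed(p):
        result = mul(result, q)
        result[0] += c
    return result


EVEN = [1, 0, -10, 0, 1]   # x^4 - 10x^2 + 1
ODD = [0, 3, 0, 1]         # x^3 + 3x

print(parity(mul(EVEN, ODD)))      # product: behaves like integer addition
print(parity(mul(ODD, ODD)))
print(parity(compose(EVEN, ODD)))  # composition: behaves like integer multiplication
print(parity(compose(ODD, ODD)))
```

The product of an even and an odd polynomial comes out odd and the product of two odd ones comes out even, just as with sums of integers. Composition is the operation that behaves like integer multiplication: anything composed with an even polynomial is even, and odd composed with odd stays odd.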

I tried a web search for bad mathematics terminology and the top hit was this old blog article by my old friend Walt. (Not you, Walt, another Walt.) Walt suggests that

the worst terminology in all of mathematics may be that of !!G_\delta!! and !!F_\sigma!! sets…

I can certainly get behind that nomination. I have always hated those terms. Not only do they partake of the dubious open-closed terminology I complained of earlier (you'll see why in a moment), but all four letters are abbreviations for words in other languages, and not the same language. A !!G_\delta!! set is one that is a countable intersection of open sets. The !!G!! is short for Gebiet, which is German for an open neighborhood, and the !!\delta!! is for Durchschnitt, which is German for set intersection. And on the other side of the Ruhr Valley, an !!F_\sigma!! set, which is a countable union of closed sets, takes its !!F!! from French fermé (“closed”) and its !!\sigma!! from somme (set union). And the terms themselves are completely opaque if you don't keep track of the ingredients of this unwholesome German-French-Greek stew.

This put me in mind of a similarly obscure pair that I always mix up, the type I and type II errors. One of them is when you fail to ignore something insignificant, and the other is when you fail to notice something significant, but I don't remember which is which and I doubt I ever will.

But the one I was thinking about today that kicked all this off is, I think, worse than any of these. It's really shameful, worthy to rank with cucumbers being berries and with Stonehenge not being a henge.

These are all examples of elliptic curves:

These are not:

That's right, ellipses are not elliptic curves, and elliptic curves are not elliptical. I don't know who was responsible for this idiocy, but if I ever meet them I'm going to kick them in the ass.

[ Addendum 20200510: Several people have earnestly explained to me how this terminological disaster came about. Please be assured that I am well aware of the history here. The situation is similar to the one that gave us “even” and “odd” functions: a long chain of steps each of which made some sense individually, but whose concatenation ended in a completely different place. This MathOverflow post has a good summary. ]

[ Addendum 20200510: Mark Badros has solved the “Type I / II” problem for me. They point out that in the story of the Boy Who Cried Wolf, there are two episodes. In the first episode, the boy and the villagers commit a Type I error by reacting to the presence of a wolf when there is none. In the second episode, they commit a Type II error by failing to react to the actual wolf. Thank you! ]

[Other articles in category /lang] permanent link

Fri, 26 Apr 2019


What is the shed in “watershed”? Is it a garden shed? No.

I guessed that it meant a piece of land that sheds water into some stream or river. Wrong!

The Big Dictionary says that this shed is:

The parting made in the hair by combing along the top of the head.

This meaning of “shed” fell out of use after the end of the 17th century.

[Other articles in category /lang/etym] permanent link

Sat, 30 Mar 2019


Katara just read me the story she wrote in Latin, which concerns two men who chase after a corax. “What kind of animal is corax?” I asked.

“It's a raven.”

“Awesome,” I said. “I bet it's onomatopoeic.”

So I looked into it, and yup! It's from Greek κόραξ. Liddell and Scott's Greek-English lexicon says (p. 832):

The Root is to be found in the onomatop. words κράζω, κρώζω, croak, etc.

κράζω (krazo) and κρώζω (krozo) mean “to croak”. “Croak” itself is also onomatopoeic. And it hadn't occurred to me before that English “crow” is onomatopoeic too. Looking into it further, Wikipedia tells me that the rook is also named from the sound it makes.

(J.R.R. Tolkien was certainly aware of all of this. The Hobbit has a giant raven named Roäc, the son of Carc.)

Liddell and Scott continues:

The same Root often appears in the sense of curved, cf. κορ-ώνη … Latin cur-vus, etc.

κορωνίς (koronis) means “curved”, and in particular a “corona” or crown. Curvus of course means curved, and is akin to Latin corvus, which again means a crow.

The raven's beak does not look so curved to me, but the Greeks must have found it striking.

[Other articles in category /lang/etym] permanent link

Wed, 12 Sep 2018

Language fluency in speech and print

Long ago I worked among the graduate students at the University of Pennsylvania department of Computer and Information Sciences. Among other things, I did system and software support for them, and being about the same age and with many common interests, I socialized with them also.

There was one Chinese-Malaysian graduate student who I thought of as having poor English. But one day, reading one of his emailed support requests, I was struck by how clear and well-composed it was. I suddenly realized I had been wrong. His English was excellent. It was his pronunciation that was not so good. When speaking to him in person, this was all I had perceived. In email, his accent vanished and he spoke English like a well-educated native. When I next met him in person I paid more careful attention and I realized that, indeed, I had not seen past the surface: he spoke the way he wrote, but his accent had blinded me to his excellent grammar and diction.

Once I picked up on this, I started to notice better. There were many examples of the same phenomenon, and also the opposite phenomenon, where someone spoke poorly but I hadn't noticed because their pronunciation was good. But then they would send email and the veil would be lifted. This was even true of native speakers, who can get away with all sorts of mistakes because their pronunciation is so perfect. (I don't mean perfect in the sense of pronouncing things the way the book says you should; I mean in the sense of pronouncing things the way a native speaker does.) I didn't notice this unless I was making an effort to look for it.

I'm not sure I have anything else to say about this, except that it seems to me that when learning a foreign language, one ought to consider whether one will be using it primarily for speech or primarily for writing, and optimize one's study time accordingly. For speech, concentrate on good pronunciation; for writing, focus on grammar and diction.

Hmm, put that way it seems obvious. Also, the sky is sometimes blue.

[Other articles in category /lang] permanent link

Sun, 29 Apr 2018

Lipogrammatic math posts

In August 2011, on a particular famous discussion forum (brought up on this blog again and again) an individual A, notorious for such acts, posts a quasi-philosophical inquiry, incurring unpopularity, antagonism, and many bad marks, although also a surprising quantity of rational discussion, including a thoughtful solution or two.

Many months forward, a distinct party B puts up a substantial bounty on this inquiry, saying:

I would like a complete answer to this question which does not use the letter "e" at any point.

(My apology for any anguish you may go through at this point in my story on account of this quotation and its obvious and blatant faults. My wrongdoing was involuntary, but I had no way to avoid it and still maintain full accuracy.)

By and by, a valiant third individual constructs a brilliant disquisition satisfying this surprising condition and thus obtains B's award.

Now, this month, in our group's accompanying policy board, a fourth collaborator, a guy (or gal, for all I know) I shall call D, and who I think may lack a minimal inclination for fun, finds fault with A's original post and particularly with B's bounty, and complains as follows:

Should we discourage bounties that encourage “clever” but unclear answers?

(Again, I must ask you for absolution. This is a word-for-word quotation.)

A thorough dismissal of OP's complaint, from a fifth author, adds a fully satisfactory finish to our affair.

[Other articles in category /lang] permanent link

Wed, 04 Apr 2018

Genealogy of the Saudi royal family

[ Note: None of this is a joke, nothing here is intended humorously, and certainly none of it should be taken as mockery or disparagement. The naming conventions of Saudi royalty are not for me to judge or criticize, and if they cause problems for me, the problems are my own. It is, however, a serious lament. ]

The following innocuous claim appears in Wikipedia's article on Abdullah bin Abdul-Rahman:

He was the seventh son of the Emir of the Second Saudi State, Abdul Rahman bin Faisal.

Yesterday I tried to verify this claim and I was not able to do it.

Somewhere there must be a complete and authoritative pedigree of the entire Saudi royal family, but I could not find it online, perhaps because it is very big. There is a Saudi royal family official web site, and when I found that it does have a page about the family tree, I rejoiced, thinking my search was over. But the tree only lists the descendants of King Abdulaziz Ibn Saud, founder of the modern Saudi state. Abdullah was his half-brother and does not appear there.

Well, no problem, just Google the name, right? Ha!

Problem 1: These princes all have at least twenty kids each. No, seriously. The Wikipedia article on Ibn Saud himself lists twenty-one wives and then gives up, ending with an exhausted “Possibly other wives”. There is a separate article on his descendants that lists 72 children of various sexes, and the following section on grandchildren begins:

Due to the Islamic traditions of polygyny and easy divorce (on the male side), King Abdul Aziz [Ibn Saud] has approximately a thousand grandchildren.

Problem 2: They reuse many of the names. Because of course they do; if wife #12 wants to name her first son the same as the sixth son of wife #2, why not? They don't live in the same house. So among the children of Ibn Saud there are two Abdullahs (“servant of God”), two Badrs (“full moon”), two Fahds (“leopard”), two each of Majid (“majestic”), Mishari (I dunno), Talal (dunno), and Turki (“handsome”). There are three sons named Khalid (“eternal”). There is a Sa'ad and a Saad, which I think are the exact same name (“success”) as spelled by two different Wikipedia editors.

And then they reuse the names intergenerationally. Among Ibn Saud's numerous patrilineal grandsons there are at least six more Fahds, the sons respectively of Mohammed, Badr (the second one), Sultan, Turki (also the second one), Muqrin, and Salman. Abdulaziz Ibn Saud has a grandson also named Abdulaziz, whose name is therefore Abdulaziz bin Talal bin Abdulaziz Al Saud. (The “bin” means “son of”; the feminine form is “bint”.) It appears that the House of Saud does not name sons after their fathers, for which I am grateful.

Ibn Saud's father was Abdul Rahman (this is the Abdul Rahman of Abdullah bin Abdul-Rahman, who is the subject of this article. Remember him?) One of Ibn Saud's sons is also Abdul Rahman, I think probably the first one to be born after the death of his grandfather, and at least two of his patrilineal grandsons are also.

Problem 3: Romanization of Arabic names is done very inconsistently. I mentioned “Saad” and “Sa'ad” before. I find the name Abdul Rahman spelled variously “Abdul Rahman”, “Abdulrahman”, “Abdul-Rahman”, and “Abd al-Rahman”. This makes text searches difficult and unreliable. (The name, by the way, means “Servant of the gracious one”, referring to God.)
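One partial workaround is to reduce each spelling to a crude search key before comparing. The following Python sketch is purely illustrative: `name_key` is a hypothetical helper, and its single rewrite rule (treating “abd al” as “abdul”) is an assumption chosen to cover only the variants listed above, not a general romanization scheme.

```python
import re


def name_key(name):
    """Crude search key: lowercase, keep only a-z, unify 'abd al' with 'abdul'."""
    letters = re.sub(r"[^a-z]", "", name.lower())
    return letters.replace("abdal", "abdul")


variants = ["Abdul Rahman", "Abdulrahman", "Abdul-Rahman", "Abd al-Rahman"]
print({name_key(v) for v in variants})
print(name_key("Sa'ad") == name_key("Saad"))
```

All four spellings collapse to one key, as do “Saad” and “Sa'ad”. The sketch does nothing for vowel variations such as “Mohammed” and “Muhammad”, and it is no help at all with the patronymic problem.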

Problem 4: None of these people has a surname. Instead they all go by patronymics. Ibn Saud has six grandsons named Fahd; how do you tell them apart? No problem, their fathers all have different names, so they are Fahd bin Mohammed, Fahd bin Badr, Fahd bin Sultan, Fahd bin Turki, Fahd bin Muqrin, and Fahd bin Salman. But again this confuses text searches terribly.

You can search for “Abdullah bin Abdul-Rahman” but many of the results will be about his descendants Fahd bin Abdullah bin Abdul Rahman, Fahd bin Khalid bin Abdullah bin Abdul Rahman, Fahd bin Muhammad bin Abdullah bin Abdul Rahman, Abdullah bin Bandar bin Abdullah bin Abdul Rahman, Faisal bin Abdullah bin Abdul Rahman, Faisal bin Abdul Rahman bin Abdullah bin Abdul Rahman, etc.

In combination with the reuse of the same few names, the result is even more confusing. There is Bandar bin Khalid, and Khalid bin Bandar; Fahad bin Khalid and Khalid bin Fahd.

There is Mohammed al Saud (Mohammed of (the house of) Saud) and Mohammed bin Saud (Mohammed the son of Saud).

There are grandsons named Saad bin Faisal, Faisal bin Bandar, Bandar bin Sultan, Sultan bin Fahd, Fahd bin Turki, Turki bin Talal, Talal bin Mansour, Mansour bin Mutaib, Mutaib bin Abdullah, and Abdullah bin Saad. I swear I am not making this up.

Perhaps Abdullah was the seventh son of Abdul Rahman.

Perhaps not.

I surrender.

[Other articles in category /lang] permanent link

Thu, 22 Mar 2018

Does Skaði choose the husband with the best butt?

(Warning: I do not know anything about Old Norse, so everything I say about it should be understood as ill-informed speculation. I welcome corrections.)

In one of my favorite episodes from Norse mythology, the Æsir owe a payment to the Jötunn Skaði in compensation for killing her father. But they know she is very wealthy, and offer her an alternative compensation: one of their men in marriage.

Skaði wants to marry Baldr, because he is extremely handsome. But Baldr is already married. Odin proposes a compromise: the Æsir will line up behind a short curtain, and Skaði will choose her husband. She will marry whomever she picks; if she can pick out Baldr by his legs, she can have him. Skaði agrees, assuming that the beautiful Baldr will have the best legs.

(She chooses wrong. Njörðr has the best legs.)

Thinking on this as an adult, I said to myself “Aha, this is like that horn full of milk that was actually mead. I bet this was also cleaned up in the version I read, and that in the original material, Skaði was actually choosing the husband with the best butt.”

I went to check, and I was wrong. The sources say she was looking only at their feet.

I was going to just quote this:

she should choose for herself a husband from among the Æsir and choose by the feet only, seeing no more of him.

(Brodeur, 1916.)

But then I got worried. This is of course not the original source but an English translation; what if it is inaccurate?

Well, there was nothing else to do but ask Snorri about it. He says:

En æsir buðu henni sætt ok yfirbætr ok it fyrsta, at hon skal kjósa sér mann af ásum ok kjósa at fótum ok sjá ekki fleira af.

(Sætt is recompense or settlement; yfirbætr similarly. (Bætr is a cure, as in “I was sick, but I got better”.) The first (fyrsta) part of the settlement is that she “shall choose a man for herself” (skal kjósa sér mann) but choose by the feet (kjósa at fótum) seeing nothing else (sjá ekki fleira af).)

The crucial word here is fótum, which certainly looks like “foot”. (It is the dative form of fótr.) Could it possibly mean the buttocks? I don't think so. It's hard to be 100% certain, because it could be a euphemism — anything could be a euphemism for the buttocks if you paused before saying it and raised one eyebrow. (Did the Norse bards ever do this?) Also the Norse seem to have divided up the leg differently than we do. Many of the words seem to match, which is sometimes helpful but also can be misleading, because many don't. For example, I think leggr, despite its appearance, means just the shank. And I think fótum may not be just the foot itself, but some part of the leg that includes the foot.

But I'm pretty sure fótum is not the butt, at least not canonically. To do this right I would look at all the other instances of fótr to see what I could glean from the usage, but I have other work to do today. So anyway, Skaði probably was looking at their feet, and not at their butts. Oh well.

However! the other part of Skaði's settlement is that the Æsir must make her laugh. In the version I first read, Loki achieves this by tying his beard to a goat's. Nope!

Þá gerði Loki þat, at hann batt um skegg geitar nökkurrar ok öðrum enda um hreðjar sér, ok létu þau ýmsi eftir ok skrækði hvárt tveggja hátt.

Skegg geitar nökkurrar is indeed some goat's beard. But hann batt … ok öðrum enda um hreðjar sér is “he tied … the other end to his own scrotum”.

Useful resources:

[Other articles in category /lang] permanent link

Mon, 19 Mar 2018

English's -en suffix

In English we can sometimes turn an adjective into a verb by suffixing “-en”. For example:

black → blacken
red → redden
white → whiten
wide → widen

But not

blue → bluen*
green → greenen*
yellow → yellowen*
long → longen*

(Note that I am only looking at -en verbs that are adjective-derived present tenses. This post is not concerned with the many -en verbs that are past participles, such as “smitten” (past participle of “smite”), “spoken” (“speak”), “molten” (“melt”), “sodden” (“seethe”), etc.)

I asked some linguist about this once and they were sure it was purely morphological, something like: black, red, and white end in stop consonants, and blue, green, and yellow don't.

Well, let's see:

Stop       Blacken
Open (?)
Fricative  Coarsen
Nasal      Cleanen*
Vowel      Angrien*
Glide      Betteren*

There are some fine points:

  • “Biggen” used to exist but has fallen out of use
  • Perhaps I should have omitted “strengthen” and “hasten”, which are derived from nouns, not from adjectives
  • I'm not sure whether “closen”, “hotten” and “wetten” are good or bad so I left them off
  • “moisten” and “soften” might belong with the stops instead of the fricatives
  • etc.

but clearly the morphological explanation wins. I'm convinced.

[ Addendum: Wiktionary discusses this suffix, distinguishing it from the etymologically distinct participial “-en”, and says “it is not currently very productive in forming new words, being mostly restricted to monosyllabic bases which end in an obstruent”. ]
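The rule can be written down as a tiny predicate and checked against the examples in this post. This Python sketch is only as good as its hand-made table: the sound classes in `FINAL_SOUND` are assigned by eye for this illustration (nothing is derived from pronunciation data), `predicts_en` is an invented name, and the sketch ignores the “monosyllabic” half of Wiktionary's description.

```python
# Final sound of each adjective, labelled by hand (an assumption of this sketch).
FINAL_SOUND = {
    "black": "stop", "red": "stop", "white": "stop", "wide": "stop",
    "coarse": "fricative",
    "clean": "nasal", "green": "nasal", "long": "nasal",
    "angry": "vowel", "blue": "vowel", "yellow": "vowel",
    "better": "glide",
}

# Stops and fricatives together are the obstruents.
OBSTRUENTS = {"stop", "fricative"}

# Adjectives whose -en verb is given above as a real word.
ATTESTED = {"black", "red", "white", "wide", "coarse"}


def predicts_en(adjective):
    """True if the obstruent rule predicts that adjective + '-en' is a verb."""
    return FINAL_SOUND[adjective] in OBSTRUENTS


for adjective in sorted(FINAL_SOUND):
    print(adjective, predicts_en(adjective), adjective in ATTESTED)
```

On this small sample the prediction and the attested column agree on every row. The doubtful cases mentioned in the fine points are deliberately left out of the table.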

[Other articles in category /lang] permanent link

Sat, 06 Jan 2018

The horn of milk

When I was a kid I had a book of “Myths and Legends of the Ages”, by Marion N. French. One of the myths was the story of Thor's ill-fated visit to Utgard. The jötunns of Utgard challenge Thor and Loki to various contests and defeat them all through a combination of talent and guile. In one of these contests, Thor is given a drinking horn and told that even the wimpiest of the jötunns is able to empty it of its contents in three drinks. (The jötunns are lying. The pointy end of the horn has been invisibly connected to the ocean.)

The book specified that the horn was full of milk, and as a sweet and innocent kiddie I did not question this. Decades later it hit me suddenly: no way was the horn filled with milk. When the mighty jötunns of Utgard are sitting around in their hall, they do not hold contests to see who can drink the most milk. Obviously, the horn was full of mead.

The next sentence I wrote in the draft version of this article was:

   In the canonical source material (poetic edda maybe?) the horn is full
   of *mead*. Of course it is.

In my drafts, I often write this sort of bald statement of fact, intending to go back later and check it, and perhaps produce a citation. As the quotation above betrays, I was absolutely certain that when I hunted down the original source it would contradict Ms. French and say mead. But I have now hunted down the canonical source material (in the Prose Edda, it turns out, not the Poetic one) and as far as I can tell it does not say mead!

Here is an extract of an 1880 translation by Rasmus Björn Anderson, provided by WikiSource:

He went into the hall, called his cup-bearer, and requested him to take the sconce-horn that his thanes were wont to drink from. The cup-bearer immediately brought forward the horn and handed it to Thor. Said Utgard-Loke: From this horn it is thought to be well drunk if it is emptied in one draught, some men empty it in two draughts, but there is no drinker so wretched that he cannot exhaust it in three.

For comparison, here is the 1916 translation of Arthur Gilchrist Brodeur, provided by

He went into the hall and called his serving-boy, and bade him bring the sconce-horn which the henchmen were wont to drink off. Straightway the serving-lad came forward with the horn and put it into Thor's hand. Then said Útgarda-Loki: 'It is held that this horn is well drained if it is drunk off in one drink, but some drink it off in two; but no one is so poor a man at drinking that it fails to drain off in three.'

In both cases the following text details Thor's unsuccessful attempts to drain the horn, and Utgard-Loki's patronizing mockery of him after. But neither one mentions at any point what was in the horn.

I thought it would be fun to take a look at the original Old Norse to see if the translators had elided this detail, and if it would look interesting. It was fun and it did look interesting. Here it is, courtesy of Heimskringla.NO:

Útgarða-Loki segir, at þat má vel vera, ok gengr inn í höllina ok kallar skutilsvein sinn, biðr, at hann taki vítishorn þat, er hirðmenn eru vanir at drekka af. Því næst kemr fram skutilsveinn með horninu ok fær Þór í hönd. Þá mælti Útgarða-Loki: "Af horni þessu þykkir þá vel drukkit, ef í einum drykk gengr af, en sumir menn drekka af í tveim drykkjum, en engi er svá lítill drykkjumaðr, at eigi gangi af í þrimr."

This was written in Old Norse around 1220, and I was astounded at how much of it is recognizable, at least when you already know what it is going to say. However, the following examples are all ill-informed speculation, and at least one of my confident claims is likely to be wrong. I hope that some of my Gentle Readers are Icelanders and can correct my more ridiculous errors.

“Höllina” is the hall. “Kallar” is to call in. The horn appears three times, as ‘horninu’, ‘horni’, and in ‘vítishorn’, which is a compound that specifies what kind of horn it is. “Þór í hönd” is “in Thor's hand”. (The ‘Þ’ is pronounced like the /th/ of “Thor”.) “Drekka”, “drukkit”, “drykk”, “drykkjum”, and “drykkjumaðr” are about drinking or draughts; “vel drukkit” is “well-drunk”. You can see the one-two-three in there as “einum-tveim-þrimr”. (Remember that the “þ” is a /th/.) One can almost see English in:

sumir menn drekka af í tveim drykkjum

which says “some men drink it in two drinks”. And “lítill drykkjumaðr” is a little-drinking-person, which I translated above as “wimp”.

It might be tempting to guess that “með horninu” is a mead-horn, but I'm pretty sure it is not; mead is “mjað” or “mjöð”. I'm not sure, but I think “með” here is just “with”, akin to modern German “mit”, so that:

næst kemr fram skutilsveinn með horninu

is something like “next, the skutilsveinn came with the horn”. (The skutilsveinn is something we don't have in English; compare trying to translate “designated hitter” into Old Norse.)

For a laugh, I tried putting this into Google Translate, and I was impressed with the results. It makes a heroic effort, and produces something that does capture some of the sense of the passage. It identifies the language as Icelandic, which while not correct, isn't entirely incorrect either. (The author, Snorri Sturluson, was in fact Icelandic.) Google somehow mistakes the horn for a corner, and it completely fails to get the obsolete term “hirðmenn” (roughly, “henchmen”), mistaking it for herdsmen. The skutilsveinn is one of the hirðmenn.

Anyway there is no mead here, and none in the rest of the story, which details Thor's unsuccessful attempts to drink the ocean. Nor is there any milk, which would be “mjólk”.

So where does that leave us? The jötunns challenge Thor to a drinking contest, and bring him a horn, and even though it was obviously mead, the story does not say what was in the horn.

Because why would they bother to say what was in the horn? It was obviously mead. When the boys crack open a cold one, you do not have to specify what it was that was cold, and nobody should suppose that it was a cold bottle of milk.

I imagine Marion N. French sitting by the fire, listening while Snorri tells the story of Thor and the enchanted drinking horn of Utgard:

“Utgarða-Loki called his skutilsveinn, and requested him to bring the penalty-horn that his hirðmen were wont to drink from…”

“Excuse me! Excuse me, Mr. Sturluson! Just what were they wont to drink from it?”

“Eh, what's that?”

“What beverage was in the horn?”

“Why, mead, of course. What did you think it was, milk?”

(Merriment ensues, liberally seasoned with patronizing mockery.)

(In preparing this article, I found it helpful to consult Zoëga's Concise Dictionary of Old Icelandic of 1910.)

[ Addendum 2018-01-17: Holy cow, I was so wrong. It was so obviously not mead. I was so, so wrong. Amazingly, unbelievably wrong. ]

[ Addendum 2018-03-22: A followup in which I investigate what organs Skaði looked at when choosing her husband, and what two things Loki tied together to make her laugh. ]

[Other articles in category /lang] permanent link

Fri, 05 Jan 2018

Hebrew John Doe

Last month I wrote about the Turkish analog of “Joe Blow”. I got email from Gaal Yahas, who said

I bet you'll get plenty of replies on your last post about translating "John Doe" to different languages.

Sadly no. But M. Yahas did tell me in detail about the Hebrew version, and I did a little additional research.

The Hebrew version of “Joe Blow” / “John Doe” is unequivocally “Ploni Almoni” (”פלוני אלמוני“, I think). This usage goes back at least to the Book of Ruth, approximately 2500 years ago. Ruth's husband has died without leaving an heir, and custom demands that a close relative of her father-in-law should marry her, to keep the property in the family. Boaz takes on this duty, but first meets with another man, who is a closer relative than he:

Then went Boaz up to the gate, and sat him down there: and, behold, the kinsman of whom Boaz spake came by; unto whom he said, Ho, such a one! turn aside, sit down here. And he turned aside, and sat down.

(Ruth 4:1, KJV)

This other relative declines to marry Ruth. He is not named, and is referred to in the Hebrew version as Ploni Almoni, translated here as “such a one”. This article in The Jewish Chronicle discusses the possible etymology of these words, glossing “ploni” as akin to “covered” or “hidden” and “almoni” as akin to “silenced” or “muted”.

Ploni Almoni also appears in the book of Samuel, probably even older than Ruth:

David answered Ahimelek the priest, “The king sent me on a mission and said to me, 'No one is to know anything about the mission I am sending you on.' As for my men, I have told them to meet me at a certain place.”

(1 Samuel 21:2, NIV)

The mission is secret, so David does not reveal the meeting place to Ahimelek. Instead, he refers to it as Ploni Almoni. There is a similar usage at 2 Kings 6:8.

Apparently the use of “Ploni” in Hebrew to mean “some guy” continues through the Talmud and up to the present day. M. Yahas also alerted me to two small but storied streets in Tel Aviv. According to this article from Haaretz:

A wealthy American businessman was buying up chunks of real estate in Tel Aviv. He purchased the two alleyways with the intention of naming them after himself and his wife, even going so far as to put up temporary shingles with the streets’ new names. But he had christened the streets without official permission from the city council.

The mayor was so incensed by the businessman’s chutzpah that he decided to temporarily name the alleyways Simta Almonit and Simta Plonit.

And so they remain, 95 years later.

(M. Yahas explains that “Simta” means “alley” and is feminine, so that Ploni and Almoni take the feminine ‘-it’ ending to agree with it.)

Wikipedia has not one but many articles on this topic and related ones:

My own tiny contribution in this area: my in-laws live in a rather distant and undeveloped neighborhood on the periphery of Seoul, and I once referred to it as 아무데도동 (/amudedo-dong/), approximately “nowhereville”. This is not standard in Korean, but I believe the meaning is clear.

[ Addendum 20230423: Every time I reread this article, I am startled by Haaretz's use of the word “christened” in this context. ]

[Other articles in category /lang] permanent link

Mon, 18 Dec 2017

Turkish John Doe

A few weeks ago I was writing something about Turkey, and I needed a generic Turkish name, analogous to “John Doe”. I was going to use “Osman Yılmaz”, which I think would have been a decent choice, but I decided it would be more fun to ask a Turkish co-worker what the correct choice would be. I asked Kıvanç Yazan, who kindly allowed himself to be nerdsniped and gave me a great deal of information. In the rest of this article, anything about Turkish that is correct should be credited to him, while any mistakes are surely my own.

M. Yazan informs me that one common choice is “Ali Veli”. Here's a link he gave me to Ekşisözlük, which is the Turkish analog of Urban Dictionary, explaining (in Turkish) the connotations of “John Doe”. The page also mentions “John Smith”, which in turn links to a page about a footballer named Ali Öztürk (in fact two footballers: [1] [2]), which is along the same lines as my “Osman Yılmaz” suggestion.

But M. Yazan told me about a much closer match for “John Doe”. It is:

sarı çizmeli Mehmet Ağa

which translates as “Mehmet Agha with yellow boots”. (‘Sarı’ = ‘yellow’; ‘çizmeli’ = ‘booted’.)

This oddly specific phrase really seems to be what I was looking for. M. Yazan provided several links:

  • Ekşisözlük again
  • The official dictionary of the Turkish government

    Unfortunately I can't find any way to link to the specific entry, but the definition it provides is “kim olduğu, nerede oturduğu bilinmeyen kimse” which means approximately “someone whose identity/place is unknown”.

  • A paper on “Personal Names in Sayings and Idioms”.

    This is in Turkish, but M. Yazan has translated the relevant part as follows:

    At the time when yellow boots were in fashion, a guy from İzmir put "Mehmet Aga" in his account book. When time came to pay the debt, he sent his servant and asked him to find "Mehmet Aga with yellow boots". The helper did find a Mehmet Aga, but it was not the one they were looking for. Then guy gets angry at his servant, to which his helper responded, “Sir, this is a big city, there are lots of people with yellow boots, and lots of people named Mehmet! You should write it in your book one more time!”

Another source I found was this online Turkish-English dictionary which glosses it as “Joe Schmoe”.

Finding online mentions of sarı çizmeli Mehmet Ağa is a little bit tricky, because he is also the title of a song by the very famous Turkish musician Barış Manço, and the references to this song swamp all the other results. This video features Manço's boots and although we cannot see for sure (the recording is in grayscale) I presume that the boots are yellow.

Thanks again, Kıvanç!

[ Addendum: The Turkish word for “in style” is “moda”. I guessed it was a French loanword. Kıvanç tells me I was close: it is from Italian. ]

[ Addendum 20171219: Wikipedia has an impressive list of placeholder names by language that includes Mehmet Ağa. ]

[ Addendum 20180105: The Hebrew version of Mehmet Ağa is at least 2600 years old! ]

[Other articles in category /lang] permanent link

Mon, 31 Jul 2017

Sabotaged by Polish orthography

This weekend my family was doing a bookstore event related to Fantastic Beasts and Where to Find Them. One of the movie's characters, Jacob Kowalski, dreams of becoming a baker, and arrives at a bank appointment with a suitcase full of Polish confections, including pączki, a sort of Polish jelly donut. My wife wanted to serve these at the event.

The little tail on the ą in pączki is a diacritical mark called an ogonek, which is Polish for “little tail”. If I understand correctly, this nasalizes the sound of the a so that it is more like /an/, and furthermore in modern Polish the value of this particular letter has changed so that pączki is pronounced something like “pawnch-kee”. (Polish “cz” is approximately like English “ch”.)

I was delegated to travel to Philadelphia's Polish neighborhood to obtain the pączki. This turned out to be more difficult than I expected. The first address I visited was simply wrong. When I did find the bakery I was looking for, it was sold out of pączki. The bakery across the street was closed, so I started walking down Allegheny Avenue looking for the next bakery.

Before I got there, though, I passed a storefront with a sign listing its goods and services in blue capital letters. One of the items was PACZKI. Properly, of course, this should be PĄCZKI but Poles often omit the ogonek, especially when buying blue letter decals in Philadelphia, where large blue ogoneks are often unavailable. But when I went in to ask I immediately realized that I had probably made a mistake. The store seemed to sell toiletries, paper goods, and souvenirs, with no baked goods in sight.

I asked anyway: “Your sign outside says you sell PĄCZKI?”

“No,” replied the storekeeper. “Pach-kee.”

I thought she was correcting my pronunciation. “But I thought the ogonek made it ‘pawnch-kee’?”

“No, not pawnch-kee. Pach-kee. For sending, to Poland.” She pointed at a box.

I had misunderstood the sign. It did not say PĄCZKI, but PACZKI, which I have since learned means “boxes”.

The storekeeper directed me to the deli across the street, where I was able to buy the pączki. I also bought some interesting-looking cold roast pork loin and asked what it was called. A customer told me it was “po-lend-witsa”, and from this I was able to pick out the price label on the deli case, which said “POLEDWICA”.

After my embarrassment about the boxes I was concerned that I didn't understand ogoneks as well as I thought I did. I pointed to the ‘E’. “Shouldn't there be an ogonek on the ‘E’ here?”

“Yes,” he said, and shrugged. They had left it off, just as I had (incorrectly) thought had happened on the PACZKI sign.

I think the only way to win this one would have been to understand enough of the items in blue capital letters to guess from context that it really was PACZKI and not PĄCZKI.

[ Addendum 20170803: A thirty-year-old mystery has been cleared up! When I was a teenager the news was full of the struggles of the Polish workers’ union Solidarity and its charismatic leader, Lech Walesa, later president of Poland. But his name was always pronounced ‘walensa’. Why? Last night I suddenly understood the mysterious ‘n’: the name was actually ‘Walęsa’! ]

[ (Well, not quite. That does explain the mystery ‘n’. But on looking it up, I find that the name is actually ‘Wałęsa’. The ‘W’ is more like English ‘v’ than like English ‘w’, and the ‘ł’ is apparently very much like English ‘w’. So the correct pronunciation of ‘Wałęsa’ is more like ‘va-wen-sa’ than ‘wa-len-sa’. Perhaps the people who pronounced the ę but not the W or the ł were just being pretentious.) ]

[ Addendum 20170803: Maciej Cegłowski says that “paczki” is more like “packages” than like “boxes”; Google translate suggests “parcels”. He would also like me to remind you that “paczki” and “pączki” are plural, the singulars being “paczka” and “pączek”, respectively. Alicja Raszkowska says she loves my use of “ogoneks” (the English plural) in place of the Polish “ogonki”. ]

[ Addendum: Maciej also says “For Polish speakers, your post is like watching someone dive from a high platform onto a cactus.” ]

[ Addendum 20210710: Today I was looking at a list of common Polish surnames, and one was Dąbrowski. Trying to pronounce this out loud, I suddenly understood where the American name “Dombrowski” comes from. As with pączki (pronounced like “pawnch-kee”), Dąbrowski is pronounced something like “dawm-brovski”, with the nasalization of the /a/ sounding to an Anglophone more like an /m/ than an /n/ because of the following labial consonant. So “Dombrowski” is a pretty good English representation of this name. ]

[Other articles in category /lang] permanent link

Thu, 11 May 2017

Zomg lots more anagram stuff

I'm almost done with anagrams. For now, anyway. I think. This article is to mop up the last few leftover anagram-related matters so that I can put the subject to rest.

(Earlier articles: [1] [2] [3] [•] )

Code is available

Almost all the code I wrote for this project is available on Github.

The documentation is not too terrible, I think.

Anagram lists are available

I have also placed my scored anagram lists on my web site. Currently available are:

  • Original file from the 1990s. This contains 23,521 anagram pairs, the results of my original scoring algorithm on a hand-built dictionary that includes the Unix spellcheck dictionary (/usr/dict/words), the Webster's Second International Dictionary word list, and some lexicons copied from a contemporaneous release of WordNet. This file has been in the same place on my web site since 1997 and is certainly older than that.

  • New file from February. Unfortunately I forget what went into this file. Certainly everything in the previous file, and whatever else I had lying around, probably including the Moby Word Lists. It contains 38,333 anagram pairs.

  • Very big listing of Wikipedia article titles. (11 MB compressed) I acquired the current list of article titles from the English Wikipedia; there are around 13,000,000 of these. I scored these along with the other lexicons I had on hand. The results include 1,657,150 anagram pairs. See below for more discussion of this.

!‌!Con talk

On Saturday I gave a talk about the anagram-scoring work at !‌!Con in New York. The talk was not my best work, since I really needed 15 minutes to do a good job and I was unwilling to cut it short enough. (I did go overtime, which I deeply regret.) At least nobody came up to me afterward and complained.

Talk materials are on my web site and I will link other talk-related stuff from there when it becomes available. The video will be available around the end of May, and the text transcript probably before that.

[ Addendum 20170518: The video is available thanks to Confreaks. ]

Both algorithms are exponential

The day after the talk an attendee asked me a very good question: why did I say that one algorithm for scoring anagrams was better than the other, when they are both exponential? (Sorry, I don't remember who you were; if you would like credit please drop me a note.)

The two algorithms are:

  • A brute-force search to construct all possible mappings from word A to word B, and then calculate the minimum score over all mappings (more details)

  • The two words are converted into a graph; we find the maximum independent set in the graph, and the size of the MIS gives the score (more details)
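
The brute-force approach in the first bullet can be sketched roughly as follows. This is only a minimal Python illustration, not the code from the project: the names `mappings`, `chunks`, and `brute_force_score` are invented for the sketch, and it assumes the two words really are anagrams of one another.

```python
from itertools import permutations, product

def mappings(a, b):
    """Yield every letter-preserving bijection from positions of a to
    positions of b, as a tuple m with a[i] == b[m[i]]."""
    pos_a, pos_b = {}, {}
    for i, ch in enumerate(a):
        pos_a.setdefault(ch, []).append(i)
    for j, ch in enumerate(b):
        pos_b.setdefault(ch, []).append(j)
    letters = sorted(pos_a)
    # For each distinct letter, try every way of pairing its occurrences.
    per_letter = [list(permutations(pos_b[ch])) for ch in letters]
    for choice in product(*per_letter):
        m = [None] * len(a)
        for ch, perm in zip(letters, choice):
            for i, j in zip(pos_a[ch], perm):
                m[i] = j
        yield tuple(m)

def chunks(m):
    """Number of maximal runs of adjacent letters that stay adjacent."""
    return 1 + sum(m[i + 1] != m[i] + 1 for i in range(len(m) - 1))

def brute_force_score(a, b):
    """Minimum chunk count over all mappings (hypothetical helper)."""
    if sorted(a) != sorted(b):
        raise ValueError("not anagrams")
    return min(chunks(m) for m in mappings(a, b))

print(brute_force_score("acrididae", "cidaridae"))
```

The number of mappings generated is the product of the factorials of the letter multiplicities, which is where the blowup for words with many repeated letters comes from.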

The answer to this excellent question begins with: just because two problems are both hard doesn't mean they are equally hard. In this case, the MIS algorithm is better for several reasons:

  1. The number of possible mappings from A to B depends on the number of repeated letters in each word. For words of length n, in the worst case this is something like !! n! !!. This quantity is superexponential; it eventually exceeds !! c^n !! for all constants !!c!!. The naïve algorithm for MIS is only exponential, having !!c=2!!.

  2. The problem size for the mapping algorithm depends on the number of repeated letters in the words. The problem size for the MIS algorithm depends on the number of shared adjacent letter pairs in the two words. This is almost always much smaller.

  3. There appears to be no way to score all the mappings without constructing the mappings and scoring them. In contrast, MIS is well-studied and if you don't like the obvious !!2^n!! algorithm you can do something cleverer that takes only !!1.22^n!!.

  4. Branch-and-bound techniques are much more effective for the MIS problem, and in this particular case we know something about the graph structure, which can be exploited to make them even more effective. For example, when calculating the score for

    chromophotolithograph photochromolithograph

    my MIS implementation notices the matching trailing olithograph parts right away, and can then prune out any part of the MIS search that cannot produce a mapping with fewer than 11 chunks. Doing this in the mapping-generating algorithm is much more troublesome.
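
To get a feel for the two problem sizes in points 1 and 2, one can count them directly. The sketch below is an illustration under my own reading of those points, with invented names: `mapping_count` is the number of letter-preserving mappings the brute-force search would have to construct, and `shared_pair_count` is the number of matching adjacent letter pairs, which I am assuming is the number of nodes in the MIS graph.

```python
from collections import Counter
from math import factorial, prod

def mapping_count(word):
    """Number of letter-preserving mappings from a word to any of its
    anagrams: the product of k! over each letter's multiplicity k."""
    return prod(factorial(k) for k in Counter(word).values())

def shared_pair_count(a, b):
    """Number of (i, j) with a[i:i+2] == b[j:j+2], i.e. matching
    adjacent letter pairs (assumed here to be the graph's node count)."""
    pairs_b = Counter(b[j:j + 2] for j in range(len(b) - 1))
    return sum(pairs_b[a[i:i + 2]] for i in range(len(a) - 1))

for a, b in [("acrididae", "cidaridae"),
             ("chromophotolithograph", "photochromolithograph")]:
    print(a, b, mapping_count(a), shared_pair_count(a, b))
```

The naïve MIS search examines on the order of 2 to the power of the second number, before any pruning, while the mapping search must visit every one of the first.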

Stuff that didn't go into the talk

On Wednesday I tried out the talk on Katara and learned that it was around 75% too long. I had violated my own #1 content rule: “Do not begin with a long introduction”. My draft talk started with a tour of all my favorite anagrams, with illustrations. Included were:

  • “Please” and “asleep” and “elapse”.

  • “Spectrum” and “crumpets”; my wife noticed this while we were at a figure-skating event at the Philadelphia Spectrum, depicted above.

  • “English” and “shingle”; I came up with this looking at a teabag while at breakfast with my wife's parents. This prompted my mother-in-law to remark that it must be hard to always be thinking about such things—but then she admitted that when she sees long numerals she always checks them for divisibility by 9.

  • “Soupmaster” and “mousetraps”. The picture here is not perfect. I wanted a picture of the Soupmaster restaurant that was at the Liberty Place food court in Philadelphia, but I couldn't find one.

  • I also wanted to show the back end of a Honda Integra and a picture of granite, but I couldn't find a good picture of either one before I deleted them from the talk. (My wife also gets credit for noticing this one.) [ Addendum 20170515: On the road yesterday I was reminded of another one my wife noticed: “Pontiac” / “caption”. ]

Slide #1 defines what anagrams actually are, with an example of “soapstone” / “teaspoons”. I had originally thought I might pander to the left-wing sensibilities of the !‌!Con crowd by using the example “Donald Trump” / “Lord Dampnut” and even made the illustration. I eventually rejected this for a couple of reasons. First, it was misleading because I only intended to discuss single-word anagrams. Second, !‌!Con is supposed to be fun and who wants to hear about Donald Trump?

But the illustration might be useful for someone else, so here it is. Share and enjoy.

After I rejected this I spent some time putting together an alternative, depicting “I am Lord Voldemort” / “Tom Marvolo Riddle”. I am glad I went with the soapstone teaspoons instead.

People Magazine

Clearly one important ingredient in finding good anagrams is that they should have good semantics. I did not make much of an effort in this direction. But it did occur to me that if I found a list of names of well-known people I might get something amusing out of it. For example, it is well known that “Britney Spears” is an anagram of “Presbyterians” which may not be meaningful but at least provides something to mull over.

I had some trouble finding a list of names of well-known people, probably because I do not know where to look, but I did eventually find a list of a few hundred on the People Magazine web site so I threw it into the mix and was amply rewarded:

Cheryl Burke Huckleberry

I thought Cheryl Burke was sufficiently famous, sufficiently recently, that most people might have heard of her. (Even I know who she is!) But I gave a version of the !‌!Con talk to the Philadelphia Perl Mongers the following Monday and I was the only one in the room who knew. (That version of the talk took around 75 minutes, but we took a lot of time to stroll around and look at the scenery, much of which is in this article.)

I had a struggle finding the right Cheryl Burke picture for the !‌!Con talk. The usual image searches turned up lots of glamour and fashion pictures and swimsuit pictures. I wanted a picture of her actually dancing and for some reason this was not easy to find. The few I found showed her from the back, or were motion blurred. I was glad when I found the one above.


A few days before the !‌!Con talk my original anagram-scoring article hit #1 on Hacker News. Hacker News user Pxtl suggested using the Wikipedia article title list as an input lexicon. The article title list is available for download from the Wikimedia Foundation so you don't have to scrape the pages as Pxtl suggested. There are around 13 million titles and I found all the anagrams and scored them; this took around 25 minutes with my current code.

The results were not exactly disappointing, but neither did they deliver anything as awesomely successful as “cinematographer” / “megachiropteran”. The top scorer by far was “ACEEEFFGHHIILLMMNNOORRSSSTUV”, which is the pseudonym of 17th-century German writer Hans Jakob Christoffel von Grimmelshausen. Obviously, Grimmelshausen constructed his pseudonym by sorting the letters of his name into alphabetical order.

(Robert Hooke famously used the same scheme to claim priority for discovery of his spring law without actually revealing it. He published the statement as “ceiiinosssttuv” and then was able to claim, two years later, that this was an anagram of the actual law, which was “ut tensio, sic vis”. (“As the extension, so the force.”) An attendee of my Monday talk wondered if there is some other Latin phrase that Hooke could have claimed to have intended. Perhaps someone else can take the baton from me on this project.)
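
Sorting the letters is also the usual way to find anagrams in a word list in the first place: two titles are anagrams exactly when their sorted letters agree. Here is a minimal sketch of that idea. The function names are mine, and the normalization (lowercase, letters only) is an assumption, not necessarily what my own code does.

```python
from collections import defaultdict

def letter_key(title):
    """Sorted letters of the title, ignoring case, spaces, punctuation."""
    return "".join(sorted(ch for ch in title.lower() if ch.isalpha()))

def anagram_groups(titles):
    """Group titles that share a letter key; keep groups of 2 or more."""
    groups = defaultdict(set)
    for t in titles:
        groups[letter_key(t)].add(t)
    return [sorted(g) for g in groups.values() if len(g) > 1]

print(letter_key("ut tensio, sic vis"))
print(anagram_groups(["Winona Ryder", "Wynona Rider", "Weasel"]))
```

Note that `str.isalpha` keeps accented letters as they are, so under this sketch “ñ” and “n” are different letters.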

Anyway, the next few top scorers demonstrate several different problems:

    21 Abcdefghijklmnopqrstuvwxyz / Qwertyuiopasdfghjklzxcvbnm
    21 Abcdefghijklmnopqrstuvwxyz / Qwertzuiopasdfghjklyxcvbnm
    21 Ashland County Courthouse / Odontorhynchus aculeatus
    21 Daniel Francois Malherbe / Mindenhall Air Force Base

    20 Christine Amongin Aporu / Ethnic groups in Romania
    20 Message force multiplier / Petroleum fiscal regimes

    19 Cholesterol lowering agent / North West Regional College
    19 Louise de Maisonblanche / Schoenobius damienella
    19 Scorpaenodes littoralis / Steroidal spirolactones

The “Qwerty” ones are intrinsically uninteresting and anyway we could have predicted ahead of time that they would be there. And the others are just sort of flat. “Odontorhynchus aculeatus” has the usual problems. One can imagine that there could be some delicious irony in “Daniel Francois Malherbe” / “Mindenhall Air Force Base” but as far as I can tell there isn't any and neither was Louise de Maisonblanche killed by an S. damienella. (It's a moth. Mme de Maisonblanche was actually killed by Variola which is not an anagram of anything interesting.)

Wikipedia article titles include many trivial variations. For example, many people will misspell “Winona Ryder” as “Wynona Rider”, so Wikipedia has pages for both, with the real article at the correct spelling and the incorrect one redirecting to it. The anagram detector cheerfully picks these up although they do not get high scores. Similarly:

  • there are a lot of articles about weasels that have alternate titles about “weasles”
  • there are a lot of articles about the United States or the United Kingdom that have alternate titles about the “Untied States” or the “Untied Kingdom”
  • Articles about the “Center for” something or other with redirects to (or from) the “Centre for” the same thing.
  • There is an article about “Major professional sports leagues in Canada and the United States” with a redirect from “Major professional sports leagues in the United States and Canada”.
  • You get the idea.

The anagram scorer often had quite a bit of trouble with items like these because they are long and full of repeated letter pairs. The older algorithm would have done even worse. If you're still wondering about the difference between two exponential algorithms, some of these would make good example cases to consider.

As I mentioned above you can download the Wikipedia anagrams from my web site and check for yourself. My favorite item so far is:

    18 Atlantis Casino Resort Spa / Carter assassination plot


Some words appear with surprising frequency and I don't know why. As I mentioned above one of the top scorers was “Ethnic groups in Romania” and for some reason Romania appears in the anagram list over and over again:

    20 Christine Amongin Aporu / Ethnic groups in Romania
    17 List of Romanian actors / Social transformation
    15 Imperial Coronation / Romanian riot police
    14 Rakhine Mountains / Romanians in the UK
    14 Mindanao rasbora / Romanians abroad
    13 Romanian poets / ramosopinnate
    13 Aleuron carinatum / Aromanian culture
    11 Resita Montana / Romanian state
    11 Monte Schiara / The Romaniacs
    11 Monetarianism / Romanian Times
    11 Marion Barnes / Romanian Serb
    11 Maarsen railway station / Romanian State Railways
    11 Eilema androconia / Nicolae de Romania
    11 Ana Maria Norbis / Arabs in Romania

    ( 170 more )

Also I had never thought of this before, but Romania appears in this unexpected context:

    09 Alicia Morton / Clitoromania
    09 Carinito Malo / Clitoromania

(Alicia Morton played Annie in the 1999 film. Carinito Malo is actually Cariñito Malo. I've already discussed the nonequivalence of “n” and “ñ” so I won't beat that horse again.)

Well, this is something I can investigate. For each string of letters, we have here the number of anagram pairs in which the string appears (left column; anagrams with score less than 6 are not counted), the number of Wikipedia article titles in which the string appears (middle column), and the quotient of the two (right column).

            romania               110  4106  2.7%
            serbia                109  4400  2.5%
            croatia                68  3882  1.8%
            belarus                24  1810  1.3%

            ireland               140 11426  1.2%
            andorra                 7   607  1.2%
            austria                60  5427  1.1%
            russia                137 15944  0.9%

            macedonia              28  3167  0.9%
            france                111 14785  0.8%
            spain                  64  8880  0.7%
            slovenia               18  2833  0.6%

            wales                  47  9438  0.5%
            portugal               17  3737  0.5%
            italy                  21  4353  0.5%
            denmark                19  3698  0.5%

            ukraine                12  2793  0.4%
            england                37  8719  0.4%
            sweden                 11  4233  0.3%
            scotland               16  4945  0.3%

            poland                 22  6400  0.3%
            montenegro              4  1446  0.3%
            germany                16  5733  0.3%
            finland                 6  2234  0.3%

            albania                10  3268  0.3%
            slovakia                3  1549  0.2%
            norway                  9  3619  0.2%
            greece                 10  8307  0.1%

            belgium                 3  2414  0.1%
            switzerland             0  5439  0.0%
            netherlands             1  3522  0.0%
            czechia                 0    75  0.0%

As we see, Romania and Serbia are substantially ahead of the others. I suspect that it is a combination of some lexical property (the interesting part) and the relatively low coverage of those countries in English Wikipedia. That is, I think if we were to identify the lexical component, we might well find that russia has more of it, but scores lower than romania because Russia is much more important. My apologies if I accidentally omitted your favorite European country.
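
The tally behind a table like this is simple enough to sketch. The following is a hypothetical reconstruction, not the script I actually ran: the function name `tally`, the input shapes, and the case-insensitive substring matching are all assumptions.

```python
def tally(word, titles, scored_pairs, min_score=6):
    """Return (pairs, titles, ratio) for one search string.

    titles       -- iterable of article-title strings
    scored_pairs -- iterable of (score, title_a, title_b) tuples
    Matching is case-insensitive substring matching (an assumption),
    so "romania" also matches "Romanian" and "Clitoromania".
    """
    w = word.lower()
    n_titles = sum(w in t.lower() for t in titles)
    n_pairs = sum(
        score >= min_score and (w in a.lower() or w in b.lower())
        for score, a, b in scored_pairs
    )
    ratio = n_pairs / n_titles if n_titles else 0.0
    return n_pairs, n_titles, ratio

sample_titles = ["Romanian poets", "Romania", "Clitoromania", "Serbia"]
sample_pairs = [(13, "Romanian poets", "ramosopinnate"),
                (9, "Alicia Morton", "Clitoromania")]
pairs, count, ratio = tally("romania", sample_titles, sample_pairs)
print(f"{pairs:4d} {count:6d} {ratio:6.1%}")
```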

[ Oh, crap, I just realized I left out Bosnia. ]


Another one of the better high scorers turns out to be the delightful:

   16 Lesbian intercourse / Sunrise Celebration

“Lesbian”, like “Romania”, seems to turn up over and over; the next few are:

    11 Lesbian erotica / Oreste Bilancia
    11 Pitane albicollis / Political lesbian
    12 Balearic islands / Radical lesbians
    12 Blaise reaction / Lesbian erotica

    (43 more)

Wikipedia says:

The Blaise reaction is an organic reaction that forms a β-ketoester from the reaction of zinc metal with a α-bromoester and a nitrile.

A hundred points to anyone who can make a genuinely funny joke out of this.

Oreste Bilancia is an Italian silent-film star, and Pitane albicollis is another moth. I did not know there were so many anagrammatic moths. Christian Bale is an anagram of Birthana cleis, yet another moth.

[ Addendum 20220227: Sean Carney has applied my method to the headwords from Urban Dictionary and says “even though it doesn’t score quite as well, in my mind, the clear winner is genitals be achin / cheating lesbian”. ]

I ran the same sort of analysis on lesbian as on romania, except that since it wasn't clear what to compare it to, I picked a bunch of random words.

    nosehair                 3     3 100.0%
    margarine                4    16  25.0%
    penis                   95   573  16.6%
    weasel                  11   271   4.1%
    phallus                  5   128   3.9%
    lesbian                 26   863   3.0%
    center                 340 23969   1.4%
    flowers                 14  1038   1.3%
    trumpet                  6   487   1.2%
    potato                  10   941   1.1%
    octopus                  4   445   0.9%
    coffee                  12  1531   0.8%

It seems that lesbian appears with unusually high but not remarkably high frequency. The unusual part is its participation in so many anagrams with very high scores. The outstanding item here is penis (the top two being rare outliers). But penis still wins even if I throw away anagrams with scores less than 10 (instead of less than 6):

    margarine               1    16   6.2%
    penis                  13   573   2.3%
    lesbian                 8   863   0.9%
    trumpet                 2   487   0.4%
    flowers                 4  1038   0.4%
    center                 69 23969   0.3%
    potato                  2   941   0.2%
    octopus                 1   445   0.2%
    coffee                  1  1531   0.1%
    weasel                  0   271   0.0%
    phallus                 0   128   0.0%
    nosehair                0     3   0.0%

Since I'm sure you are wondering, here are the anagrams of nosehair and margarine:

    07 Nosehair / Rehsonia
    08 Aso Shrine / Nosehairs
    09 Nosehairs / hoariness

    04 Margaret Hines / The Margarines
    07 Magerrain / margarine
    07 Ramiengar / margarine
    08 Rae Ingram / margarine
    11 Erika Armstrong / Stork margarine

I think “Margaret Hines” / “The Margarines” should score more than 4, and that this exposes a defect in my method.

Acrididae graphs 

Here is the graph constructed by the MIS algorithm for the pair “acrididae” / “cidaridae”, which I discussed in an earlier article and also mentioned in my talk.

Each maximum independent set in this graph corresponds to a minimum-chunk mapping between “acrididae” and “cidaridae”. In the earlier article, I claimed:

This one has two maximum independent sets

which is wrong; it has three, yielding three different mappings with five chunks:

My daughter Katara points out that the graphs above resemble grasshoppers. My Gentle Readers will no doubt recall that acrididae is the family of grasshoppers, comprising around 10,000 species. I wanted to find an anagram “grasshopper” / “?????? graph”. There are many anagrams of “eoprs” and “eoprss” but I was not able to find anything good. The best I could do was “spore graphs”.

Thank you, Gentle Readers, for taking this journey with me. I hope nobody walks up to me in the next year to complain that my blog does not feature enough anagram-related material.

[ Addendum 20230423: A discussion on LanguageHat of the original article includes the interesting Russian pair австралопитек / ватерполистка. австралопитек is an Australopithecus. ватерполистка is a female water polo player. ]

[Other articles in category /lang] permanent link

Thu, 23 Feb 2017

Miscellaneous notes on anagram scoring

My article on finding the best anagram in English was well-received, and I got a number of interesting comments about it.

  • A couple of people pointed out that this does nothing to address the issue of multiple-word anagrams. For example it will not discover “I, rearrangement servant / Internet anagram server”. True, that is a different problem entirely.

  • Markian Gooley informed me that “megachiropteran / cinematographer” has been long known to Scrabble players, and Ben Zimmer pointed out that A. Ross Eckler, unimpressed by “cholecystoduodenostomy / duodenocholecystostomy”, proposed a method almost identical to mine for scoring anagrams in an article in Word Ways in 1976. M. Eckler also mentioned that the “remarkable” “megachiropteran / cinematographer” had been published in 1927 and that “enumeration / mountaineer” (which I also selected as a good example) appeared in the Saturday Evening Post in 1879!

  • The Hacker News comments were unusually pleasant and interesting. Several people asked “why didn't you just use the Levenshtein distance”? I don't remember that it ever occurred to me, but if it had I would have rejected it right away as being obviously the wrong thing. Remember that my original chunking idea was motivated by the observation that “cholecystoduodenostomy / duodenocholecystostomy” was long but of low quality. Levenshtein distance measures how far every letter has to travel to get to its new place and it seems clear that this would give “cholecystoduodenostomy / duodenocholecystostomy” a high score because most of the letters move a long way.

    Hacker News user tyingq tried it anyway, and reported that it produced a poor outcome. The top-scoring pair by Levenshtein distance is “anatomicophysiologic physiologicoanatomic”, which under the chunking method gets a score of 3. Repeat offender “cholecystoduodenostomy / duodenocholecystostomy” only drops to fourth place.

    A better idea seems to be Levenshtein score per unit of length, suggested by user cooler_ranch.

  • A couple of people complained about my “notaries / senorita” example, rightly observing that “senorita” is properly spelled “señorita”. This bothered me also while I was writing the article. I eventually decided that although “notaries” and “señorita” are certainly not anagrams in Spanish (even supposing that “notaries” were a Spanish word, which it isn't), the spelling of “senorita” without the tilde is a correct alternative in English. (Although I found out later that both the Big Dictionary and American Heritage seem to require the tilde.)

    Hacker News user ggambetta observed that while ‘é’ and ‘e’, and ‘ó’ and ‘o’ feel interchangeable in Spanish, ‘ñ’ and ‘n’ do not. I think this is right. The ‘é’ is an ‘e’, but with a mark on it to show you where the stress is in the word. An ‘ñ’ is not like this. It was originally an abbreviation for ‘nn’, introduced by medieval scribes. So I thought it might make sense to allow ‘ñ’ to be exchanged for ‘nn’, at least in some cases.

    (An analogous situation in German, which may be more familiar, is that it might be reasonable to treat ‘ö’ and ‘ü’ as if they were ‘oe’ and ‘ue’. Also note that in former times, “w” and “uu” were considered interchangeable in English anagrams.)

    Unfortunately my Spanish dictionary is small (7,000 words) and of poor quality and I did not find any anagrams of “señorita”. I wish I had something better for you. Also, “señorita” is not one of the cases where it is appropriate to replace “ñ” with “nn”, since it was never spelled “sennorita”.

    I wonder why sometimes this sort of complaint seems to me like useless nitpicking, and other times it seems like a serious problem worthy of serious consideration. I will try to think about this.

  • Mike Morton, who goes by the anagrammatic nickname of “Mr. Machine Tool”, referred me to his Higgledy-piggledy about megachiropteran / cinematographer, which is worth reading.

  • Regarding the maximum independent set algorithm I described yesterday, Shreevatsa R. suggested that it might be conceptually simpler to find the maximum clique in the complement graph. I'm not sure this helps, because the complement graph has a lot more edges than the original. Below right is the complement graph for “acrididae / cidaridae”. I don't think I can pick out the 4-cliques in that graph any more than the independent sets in the graph on the lower-left, and this is an unusually favorable example case for the clique version, because the original graph has an unusually large number of edges.

    But perhaps the cliques might be easier to see if you know what to look for: in the right-hand diagram the four nodes on the left are one clique, and the four on the right are the other, whereas in the left-hand diagram the two independent sets are all mixed together.

  • An earlier version of the original article mentioned the putative 11-pointer “endometritria / intermediator”. The word “endometritria” seemed pretty strange, and I did look into it before I published the article, but not carefully enough. When Philip Cohen wrote to me to question it, I investigated more carefully, and discovered that it had been an error in an early WordNet release, corrected (to “endometria”) in version 1.6. I didn't remember that I had used WordNet's word lists, but I am not surprised to discover that I did.

    A rare printing of Webster's 2¾th American International Lexican includes the word “endometritriostomoscopiotomous” but I suspect that it may be a misprint.

  • Philippe Bruhat wrote to inform me of Alain Chevrier’s book notes / sténo, a collection of thematically related anagrams in French. The full text is available online.

  • Alexandre Muñiz, who has a really delightful blog, and who makes and sells attractive and clever puzzles of his own invention, pointed out that soapstone teaspoons are available. The perfect gift for the anagram-lover in your life! They are not even expensive.

  • Thanks also to Clinton Weir, Simon Tatham, Jon Reeves, Wei-Hwa Huang, and Philip Cohen for their emails about this.

[ Addendum 20170507: Slides from my !!Con 2017 talk are now available. ]

[ Addendum 20170511: A large amount of miscellaneous related material ]


Tue, 21 Feb 2017

Moore's law beats a better algorithm

Yesterday I wrote about the project I did in the early 1990s to find the best anagrams. The idea is to give each pair of anagram words a score, which is the number of chunks into which you have to divide one word in order to rearrange the chunks to form the other word. This was motivated by the observation that while “cholecysto-duodeno-stomy” and “duodeno-cholecysto-stomy” are very long words that are anagrams of one another, they are not interesting because they require so few chunks that the anagram is obvious. A shorter but much more interesting example is “aspired / diapers”, where the letters get all mixed up.

I wrote:

One could do this with a clever algorithm, if one were available. There is a clever algorithm, based on finding maximum independent sets in a certain graph. I did not find this algorithm at the time; nor did I try. Instead, I used a brute-force search.

I wrote about the brute-force search yesterday. Today I am going to discuss the clever algorithm. (The paper is Avraham Goldstein, Petr Kolman, Jie Zheng “Minimum Common String Partition Problem: Hardness and Approximations”, The Electronic Journal of Combinatorics, 12 (2005).)

The plan is to convert a pair of anagrams into a graph that expresses the constraints on how the letters can move around when one turns into the other. Shown below is the graph for comparing acrididae (grasshoppers) with cidaridae (sea urchins):

The “2,4” node at the top means that the letters ri at position 2 in acrididae match the letters ri at position 4 in cidaridae; the “3,1” node is for the match between the first id and the first id. The two nodes are connected by an edge to show that the two matchings are incompatible: if you map the ri to the ri, you cannot also map the first id to the first id; instead you have to map the first id to the second one, represented by the node “3,5”, which is not connected to “2,4”. A maximum independent set in this graph is a maximum selection of compatible matchings in the words, which corresponds to a division into the minimum number of chunks.

Usually the graph is much less complicated than this. For simple cases it is empty and the maximum independent set is trivial. This one has two maximum independent sets, one (3,1; 5,5; 6,6; 7,7) corresponding to the obvious minimum splitting:

and the other (2,4; 3,5; 5,1; 6,2) to this other equally-good splitting:

[ Addendum 20170511: It actually has three maximum independent sets. ]

In an earlier draft of yesterday's post, I wrote:

I should probably do this over again, because my listing seems to be incomplete. For example, it omits “spectrum / crumpets” which would have scored 5, because the Webster's Second list contains crumpet but not crumpets.

I was going to leave it at that, but then I did do it over again, and this time around I implemented the “good” algorithm. It was not that hard. The code is on GitHub if you would like to see it.

To solve the maximum independent set instances, I used a guided brute-force search. Maximum independent set is NP-complete, and so the best known algorithm for it runs in exponential time. But the instances in which we are interested here are small enough that this doesn't matter. The example graph above has 8 nodes, so one needs to check at most 256 possible sets to see which is the maximum independent set.
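The graph construction and the search described above can be sketched in a few lines of Python. This is only an illustration, not the code on GitHub: the names `match_nodes`, `conflict`, and `maximum_independent_sets` are invented for the sketch, the search is plain rather than guided, and the conflict test is one reading of “incompatible” (two matchings clash when, taken together, they would send one letter to two places or two letters to the same place).

```python
from itertools import combinations

def match_nodes(src, dst):
    """Nodes (i, j): the two letters at src[i:i+2] equal those at dst[j:j+2]."""
    return [(i, j)
            for i in range(len(src) - 1)
            for j in range(len(dst) - 1)
            if src[i:i + 2] == dst[j:j + 2]]

def conflict(a, b):
    """True if the two matchings cannot both hold in a one-to-one letter map."""
    pairs = {(a[0], a[1]), (a[0] + 1, a[1] + 1),
             (b[0], b[1]), (b[0] + 1, b[1] + 1)}
    sources = {s for s, _ in pairs}
    targets = {t for _, t in pairs}
    return len(sources) < len(pairs) or len(targets) < len(pairs)

def maximum_independent_sets(nodes):
    """Brute force: try the largest subsets first, stop at the first size that works."""
    for size in range(len(nodes), -1, -1):
        found = [s for s in combinations(nodes, size)
                 if not any(conflict(a, b) for a, b in combinations(s, 2))]
        if found:
            return found

def score(src, dst):
    """Chunks needed = letters minus the number of adjacent pairs kept together."""
    best = maximum_independent_sets(match_nodes(src, dst))
    return len(src) - len(best[0])

nodes = match_nodes("acrididae", "cidaridae")
for s in maximum_independent_sets(nodes):
    print(s)
print(score("acrididae", "cidaridae"))
```

For “acrididae / cidaridae” this builds the eight nodes, finds the independent sets of size 4, and reports a score of 9 − 4 = 5. With only eight nodes the loop never looks at more than 256 subsets.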

I collated together all the dictionaries I had handy. (I didn't know yet about SCOWL.) These totaled 275,954 words, which is somewhat more than Webster's Second by itself. One of the new dictionaries did contain crumpets so the result does include “spectrum / crumpets”.

The old scored anagram list that I made in the 1990s contained 23,521 pairs. The new one contains 38,333. Unfortunately most of the new stuff is of poor quality, as one would expect. Most of the new words that were missing from my dictionary the first time around are obscure. Perhaps some people would enjoy discovering that “basiparachromatin” and “Marsipobranchiata” are anagrams, but I find it of very limited appeal.

But the new stuff is not all junk. It includes:

10 antiparticles paternalistic
10 nectarines transience
10 obscurantist subtractions

11 colonialists oscillations
11 derailments streamlined

which I think are pretty good.

I wasn't sure how long the old program had taken to run back in the early nineties, but I was sure it had been at least a couple of hours. The new program processes the 275,954 inputs in about 3.5 seconds. I wished I knew how much of this was due to Moore's law and how much to the improved algorithm, but as I said, the old code was long lost.

But then just as I was finishing up the article, I found the old brute-force code that I thought I had lost! I ran it on the same input, and instead of 3.5 seconds it took just over 4 seconds. So almost all of the gain since the 1990s was from Moore's law, and hardly any was from the “improved” algorithm.

I had written in the earlier article:

In 2016 [ the brute force algorithm ] would probably still [ run ] quicker than implementing the maximum independent set algorithm.

which turned out to be completely true, since implementing the maximum independent set algorithm took me a couple of hours. (Although most of that was building out a graph library because I didn't want to look for one on CPAN.)

But hey, at least the new program is only twice as much code!

38333 anagrams, scored

[ Addendum: The program had a minor bug: it would disregard capitalization when deciding if two words were anagrams, but then compute the scores with capitals and lowercase letters distinct. So for example Chaenolobus was considered an anagram of unchoosable, but then the Ch in Chaenolobus would not be matched to the ch in unchoosable, resulting in a score of 11 instead of 10. I have corrected the program and the output. Thanks to Philip Cohen for pointing this out. ]

[ Addendum 20170223: More about this ]

[ Addendum 20170507: Slides from my !!Con 2017 talk are now available. ]

[ Addendum 20170511: A large amount of miscellaneous related material ]


I found the best anagram in English

I planned to publish this last week sometime but then I wrote a line of code with three errors and that took over the blog.

A few years ago I mentioned in passing that in the 1990s I had constructed a listing of all the anagrams in Webster's Second International dictionary. (The Webster's headword list was available online.)

This was easy to do, even at the time, when the word list itself, at 2.5 megabytes, was a file of significant size. Perl and its cousins were not yet common; in those days I used Awk. But the task is not very different in any reasonable language:

  # Process word list
  my %anagrams;
  while (my $word = <>) {
    chomp $word;
    my $sorted = join "", sort split //, $word;  # normal form
    push @{$anagrams{$sorted}}, $word;
  }

  # Print each set of two or more anagrams
  for my $words (values %anagrams) {
      print "@$words\n" if @$words > 1;
  }

The key technique is to reduce each word to a normal form so that two words have the same normal form if and only if they are anagrams of one another. In this case we do this by sorting the letters into alphabetical order, so that both megalodon and moonglade become adeglmnoo.

Then we insert the words into a (hash | associative array | dictionary), keyed by their normal forms, and two or more words are anagrams if they fall into the same hash bucket. (There is some discussion of this technique in Higher-Order Perl pages 218–219 and elsewhere.)

(The thing you do not want to do is to compute every permutation of the letters of each word, looking for permutations that appear in the word list. That is akin to sorting a list by computing every permutation of the list and looking for the one that is sorted. I wouldn't have mentioned this, but someone on StackExchange actually asked this question.)

Anyway, I digress. This article is about how I was unhappy with the results of the simple procedure above. From the Webster's Second list, which contains about 234,000 words, it finds about 14,000 anagram sets (some with more than two words), consisting of 46,351 pairs of anagrams. The list starts with

aal ala

and ends with

zolotink zolotnik

which exemplify the problems with this simple approach: many of the 46,351 anagrams are obvious, uninteresting or even trivial. There must be good ones in the list, but how to find them?

I looked in the list to find the longest anagrams, but they were also disappointing:

cholecystoduodenostomy duodenocholecystostomy

(Webster's Second contains a large amount of scientific and medical jargon. A cholecystoduodenostomy is a surgical operation to create a channel between the gall bladder (cholecysto-) and the duodenum (duodeno-). A duodenocholecystostomy is the same thing.)

This example made clear at least one of the problems with boring anagrams: it's not that they are too short, it's that they are too simple. Cholecystoduodenostomy and duodenocholecystostomy are 22 letters long, but the anagrammatic relation between them is obvious: chop cholecystoduodenostomy into three parts:

cholecysto duodeno stomy

and rearrange the first two:

duodeno cholecysto stomy

and there you have it.

This gave me the idea to score a pair of anagrams according to how many chunks one had to be cut into in order to rearrange it to make the other one. On this plan, the “cholecystoduodenostomy / duodenocholecystostomy” pair would score 3, just barely above the minimum possible score of 2. Something even a tiny bit more interesting, say “abler / blare” would score higher, in this case 4. Even if this strategy didn't lead me directly to the most interesting anagrams, it would be a big step in the right direction, allowing me to eliminate the least interesting.

This rule would judge both “aal / ala” and “zolotink / zolotnik” as being uninteresting (scores 2 and 4 respectively), which is a good outcome. Note that some other boring-anagram problems can be seen as special cases of this one. For example, short anagrams never need to be cut into many parts: no four-letter anagrams can score higher than 4. The trivial anagramming of a word to itself always scores 1, and nontrivial anagrams always score more than this.

So what we need to do is: for each anagram pair, say acrididae (grasshoppers) and cidaridae (sea urchins), find the smallest number of chunks into which we can chop acrididae so that the chunks can be rearranged into cidaridae.

One could do this with a clever algorithm, if one were available. There is a clever algorithm, based on finding maximum independent sets in a certain graph. (More about this tomorrow.) I did not find this algorithm at the time; nor did I try. Instead, I used a brute-force search. Or rather, I used a very small amount of cleverness to reduce the search space, and then used brute-force search to search the reduced space.

Let's consider an example, scoring the anagram “abscise / scabies”. You do not have to consider every possible permutation of abscise. Rather, there are only two possible mappings from the letters of abscise to the letters of scabies. You know that the C must map to the C, the A must map to the A, and so forth. The only question is whether the first S of abscise maps to the first or to the second S of scabies. The first mapping gives us:

and the second gives us

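which is worse, because the S and the C no longer go to adjoining positions: the first mapping cuts abscise into five chunks (ab, sc, i, s, e), while the second needs six. So the minimum number of chunks is 5, and this anagram pair gets a score of 5.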

To fully analyze cholecystoduodenostomy by this method required considering 7680 mappings. (120 ways to map the five O's × 2 ways to map the two C's × 2 ways to map the two D's, etc.) In the 1990s this took a while, but not prohibitively long, and it worked well enough that I did not bother to try to find a better algorithm. In 2016 it would probably still run quicker than implementing the maximum independent set algorithm. Unfortunately I have lost the code that I wrote then so I can't compare.
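The idea is simple enough to sketch in Python. This is a reconstruction of the method as described, not the program I wrote then: the names `mappings`, `chunks`, and `score` are invented for the sketch, and it assumes that its two arguments really are anagrams of one another.

```python
from collections import defaultdict
from itertools import permutations, product

def mappings(src, dst):
    """Yield every way to send each letter of src to an equal letter of dst."""
    src_pos, dst_pos = defaultdict(list), defaultdict(list)
    for i, ch in enumerate(src):
        src_pos[ch].append(i)
    for j, ch in enumerate(dst):
        dst_pos[ch].append(j)
    letters = sorted(src_pos)
    # For each letter, try every assignment of its copies in src to its copies in dst.
    per_letter = [permutations(dst_pos[ch]) for ch in letters]
    for choice in product(*per_letter):
        target = [None] * len(src)
        for ch, perm in zip(letters, choice):
            for i, j in zip(src_pos[ch], perm):
                target[i] = j
        yield target

def chunks(target):
    """One chunk, plus one more for each adjacent pair of letters that is split apart."""
    return 1 + sum(1 for a, b in zip(target, target[1:]) if b != a + 1)

def score(src, dst):
    """The smallest chunk count over all letter-preserving mappings."""
    return min(chunks(t) for t in mappings(src, dst))

print(score("abscise", "scabies"))
print(sum(1 for _ in mappings("cholecystoduodenostomy", "duodenocholecystostomy")))
```

The first line printed is the score for “abscise / scabies”, and the second is the number of mappings examined for the 22-letter pair. A letter that appears k times contributes a factor of k! to that count, which is where the 120 for the five O's comes from.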

Assigning scores in this way produced a scored anagram list which began

2 aal ala

and ended

4 zolotink zolotnik

and somewhere in the middle was

3 cholecystoduodenostomy duodenocholecystostomy

all poor scores. But sorted by score, there were treasures at the end, and the clear winner was

14 cinematographer megachiropteran

I declare this the single best anagram in English. It is 15 letters long, and the only letters that stay together are the E and the R. “Cinematographer” is as familiar as a 15-letter word can be, and “megachiropteran” means a giant bat. GIANT BAT! DEATH FROM ABOVE!!!

And there is no serious competition. There was another 14-pointer, but both its words are Webster's Second jargon that nobody knows:

14 rotundifoliate titanofluoride

There are no score 13 pairs, and the score 12 pairs are all obscure. So this is the winner, and a deserving winner it is.

I think there is something in the list to make everyone happy. If you are the type of person who enjoys anagrams, the list rewards casual browsing. A few examples:

7 admirer married
7 admires sidearm

8 negativism timesaving
8 peripatetic precipitate
8 scepters respects
8 shortened threnodes
8 soapstone teaspoons

9 earringed grenadier
9 excitation intoxicate
9 integrals triangles
9 ivoriness revisions
9 masculine calumnies

10 coprophagist topographics
10 chuprassie haruspices
10 citronella interlocal

11 clitoridean directional
11 dispensable piebaldness

“Clitoridean / directional” has been one of my favorites for years. But my favorite of all, although it scores only 6, is

6 yttrious touristy

I think I might love it just because the word yttrious is so delightful. (What a debt we owe to Ytterby, Sweden!)

I also rather like

5 notaries senorita

which shows that even some of the low-scorers can be worth looking at. Clearly my chunk score is not the end of the story, because “notaries / senorita” should score better than “abets / baste” (which is boring) or “Acephali / Phacelia” (whatever those are), also 5-pointers. The length of the words should be worth something, and the familiarity of the words should be worth even more.

Here are the results:

38333 anagrams, scored

In former times there was a restaurant in Philadelphia named “Soupmaster”. My best unassisted anagram discovery was noticing that this is an anagram of “mousetraps”.

[ Addendum 20170222: There is a followup article comparing the two algorithms I wrote for computing scores. ]

[ Addendum 20170222: An earlier version of this article mentioned the putative 11-pointer “endometritria / intermediator”. The word “endometritria” seemed pretty strange, and I did look into it before I published the article, but not carefully enough. When Philip Cohen wrote to me to question it, I investigated more carefully, and discovered that it had been an error in an early WordNet release, corrected (to “endometria”) in version 1.6. I didn't remember that I had used WordNet's word lists, but I am not surprised to discover that I did. ]

[ Addendum 20170223: More about this ]

[ Addendum 20170507: Slides from my !!Con 2017 talk are now available. ]

[ Addendum 20170511: A large amount of miscellaneous related material ]


Mon, 30 Jan 2017

Digit symbols in the Parshvanatha magic square

In last month's article about the magic square at the Parshvanatha temple, shown at right, I said:

It has come to my attention that the digit symbols in the magic square are not too different from the current forms of the digit symbols in the Gujarati script. The temple is not very close to Gujarat or to the area in which Gujarati is common, so I guess that the digit symbols in Indian languages have evolved in the past thousand years, with the Gujarati versions remaining closest to the ancient forms, or else perhaps Gujarati was spoken more widely a thousand years ago. I would be interested to hear about this from someone who knows.

Shreevatsa R. replied in detail, and his reply was so excellent that, finding no way to improve it by adding or taking away, I begged his permission to republish it without change, which he generously granted.

Am sending this email to say:

  1. Why it shouldn't be surprising if the temple had Gujarati numerals
  2. Why the numerals aren't Gujarati numerals :-)

The Parshvanatha temple is located in the current state of Madhya Pradesh. Here is the location of the temple within a map of the state:

And here you can see that the above state of Madhya Pradesh (14 in the image below) is adjacent to the state of Gujarat (7):

The states of India are (sort of) organized along linguistic lines, and neighbouring states often have overlap or similarities in their languages. So a priori it shouldn't be too surprising if the language is that of a neighbouring state.

But, as you rightly say, the location of the Parshvanatha temple is actually quite far from the state (7) where Gujarati is spoken; it's closer to 27 in the above map (state named Uttar Pradesh).

Well, the Parshvanatha temple is believed to have been built "during the reign of the Chandela king Dhanga", and the Chandela kings were feudatories (though just beginning to assert sovereignty at the time) of the Gurjara-Pratihara kings, and "Gurjara" is where the name of the language of "Gujarati" comes from. So it's possible that they used the "official" language of the reigning kings, as with colonies. In fact the green area of the Gurjara-Pratihara kings in this map covers the location of the Parshvanatha temple:

But actually this is not a very convincing argument, because the link between Gurjara-Pratiharas and modern Gujarati is not too strong (at least I couldn't find it in a few minutes on Wikipedia :P)

So moving on...

Are the numerals really similar to Gujarati numerals? These are the numbers 1 to 16 from your blog post, ordered according to the usual order:

These are the numerals in a few current Indic scripts (as linked from your blog post):

Look at the first two rows above. Perhaps because of my familiarity with Devanagari, I cannot really see any big difference between the Devanagari and Gujarati symbols except for the 9: the differences are as minor as variation between fonts. (To see how much the symbols can change because of font variation, one can go to Google Fonts' Devanagari page and Google Fonts' Gujarati page and click on one of the sample texts and enter "० १ २ ३ ४ ५ ६ ७ ८ ९" and "૦ ૧ ૨ ૩ ૪ ૫ ૬ ૭ ૮ ૯" respectively, then "Apply to all fonts". Some fonts are bad, though.)

(In fact, even the Gurmukhi and Tibetan are somewhat recognizable, for someone who can read Devanagari.)

So if we decide that the Parshvanatha temple's symbols are actually closer not to modern Gujarati but to modern Devanagari (e.g. the "3" has a tail in the temple symbols which is present in Devanagari but missing in Gujarati), then the mystery disappears: Devanagari is still the script used in the state of Madhya Pradesh (and Uttar Pradesh, etc: it's the script used for Hindi, Marathi, Nepali, Sanskrit, and many other languages).

Finally, for the complete answer, we can turn to history.

The Parshvanatha temple was built during 950 to 970 CE. Languages: Modern Gujarati dates from 1800, Middle Gujarati from ~1500 to 1800, Old Gujarati from ~1100 to 1500. So the temple is older than the earliest language called "Gujarati". (Similarly, modern Hindi is even more recent.) Turning to scripts instead: see under Brahmic scripts.

So at the time the temple was built, neither Gujarati script nor Devanagari proper existed. The article on the Gujarati script traces its origin to the Devanagari script, which itself is a descendant of Nagari script.

At right are the symbols from the Nagari script, which I think are closer in many respects to the temple symbols.

So overall, if we trace the numerals in (a subset of) the family tree of scripts:

Brahmi > Gupta > Nagari > Devanagari > Gujarati

we'll find that the symbols of the temple are somewhere between the "Nagari" and "Devanagari" forms. (Most of the temple digits are the same as in the "Nagari" example above, except for the 5 which is closer to the Devanagari form.)

BTW, your post was about the numerals, but from being able to read modern Devanagari, I can also read some of the words above the square: the first line ends with ".. putra śrī devasarmma" (...पुत्र श्री देव‍सर्म्म) (Devasharma, son of...), and these words have the top bar which is missing in Gujarati script.


Sun, 20 Mar 2016

Technical jargon failure modes

Technical jargon is its own thing, intended for easy communication between trained practitioners of some art, but not necessarily between anyone else.

Jargon can be somewhat transparent, like the chemical jargon term “alcohol”. “Alcohol” refers to a large class of related chemical compounds, of which the simplest examples are methyl alcohol (traditionally called “wood alcohol”) and ethyl alcohol (the kind that you get in your martini). The extension of “alcohol” to the larger class is suggestive and helpful. Someone who doesn't understand the chemical jargon usage of “alcohol” can pick it up by analogy, and even if they don't they will probably have something like the right idea. A similar example is “aldehyde”. An outsider who hears this for the first time might reasonably ask “does that have something to do with formaldehyde?” and the reasonable answer is “yes indeed, formaldehyde is the simplest example of an aldehyde compound.” Again the common term is adapted to refer to the members of a larger but related class.

An opposite sort of adaptation is found in the term “bug”. The common term is extremely broad, encompassing all sorts of terrestrial arthropods, including mosquitoes, ladybugs, flies, dragonflies, spiders, and even isopods (“pillbugs”) and centipedes and so forth. It should be clear that this category is too large and heterogeneous to be scientifically useful, and the technical use of “bug” is much more restricted. But it does include many creatures commonly referred to as bugs, such as bed bugs, waterbugs, various plant bugs, and many other flat-bodied crawling insects.

Mathematics jargon often wanders in different directions. Some mathematical terms are completely opaque. Nobody hearing the term “cobordism” or “simplicial complex” or “locally compact manifold” for the first time will think for an instant that they have any idea what it means, and this is perfect, because they will be perfectly correct. Other mathematical terms are paradoxically so transparent seeming that they reveal their opacity by being obviously too good to be true. If you hear a mathematician mention a “field” it will take no more than a moment to realize that it can have nothing to do with fields of grain or track-and-field sports. (A field is a collection of things that are number-like, in the sense of having addition, subtraction, multiplication, and division that behave pretty much the way one would expect those operations to behave.) And some mathematical jargon is fairly transparent. The non-mathematician's idea of “line”, “ball”, and “cube” is not in any way inconsistent with what the mathematician has in mind, although the full technical meaning of those terms is pregnant with ramifications and connotations that are invisible to non-mathematicians.

But mathematical jargon sometimes goes to some bad places. The term “group” is so generic that it could mean anything, and outsiders often imagine that it means something like what mathematicians call a “set”. (It actually means a family of objects that behave like the family of symmetries of some other object.)

This last is not too terrible, as jargon failures go. There is a worse kind of jargon failure I would like to contrast with “bug”. There the problem, if there is a problem, is that entomologists use the common term “bug” much more restrictively than one expects. An entomologist will well-actually you to explain that a millipede is not actually a bug, but we are used to technicians using technical terms in more restrictive ways than we expect. At least you can feel fairly confident that if you ask for examples of bugs (“true bugs”, in the jargon) that they will all be what you will consider bugs, and the entomologist will not proceed to rattle off a list that includes bats, lobsters, potatoes, or the Trans-Siberian Railroad. This is an acceptable state of affairs.

Unacceptable, however, is the botanical use of the term “berry”:

It is one thing to adopt a jargon term that is completely orthogonal to common usage, as with “fruit”, where the technical term simply has no relation at all to the common meaning. That is bad enough. But to adopt the term “berry” for a class of fruits that excludes nearly everything that is commonly called a “berry” is an offense against common sense.

This has been on my mind a long time, but I am writing about it now because I think I have found, at last, an even more offensive example.

  • Stonehenge is so-called because it is a place of hanging stones: “henge” is cognate with “hang”.

  • In 1932 archaeologists adapted the name “Stonehenge” to create the word “henge” as a generic term for a family of ancient monuments that are similar to Stonehenge.

  • Therefore, if there were only one thing in the whole world that ought to be an example of a henge, it should be Stonehenge.

  • However, Stonehenge is not, itself, a henge.

  • Stonehenge is not a henge.


Stonehenge is not a henge. … Technically, [henges] are earthwork enclosures in which a ditch was dug to make a bank, which was thrown up on the outside edge of the ditch.

— Michael Pitts, Hengeworld, pp. 26–28.

“Henge” may just be the most ineptly coined item of technical jargon in history.

[ Addendum 20161103: Zimbabwe's Great Dyke is not actually a dyke. ]

[ Addendum 20190502: I found a mathematical example that is approximately as bad as the worst examples on this page. ]


Fri, 25 Apr 2014

My brush with Oulipo

Last night I gave a talk for the New York Perl Mongers, and got to see a number of people that I like but don't often see. Among these was Michael Fischer, who told me of a story about myself that I had completely forgotten, but I think will be of general interest.

The front end of the story is this: Michael first met me at some conference, shortly after the publication of Higher-Order Perl, and people were coming up to me and presenting me with copies of the book to sign. In many cases these were people who had helped me edit the book, or who had reported printing errors; for some of those people I would find the error in the text that they had reported, circle it, and write a thank-you note on the same page. Michael did not have a copy of my book, but for some reason he had with him a copy of Oulipo Compendium, and he presented this to me to sign instead.

Oulipo is a society of writers, founded in 1960, who pursue “constrained writing”. Perhaps the best-known example is the lipogrammatic novel La Disparition, written in 1969 by Oulipo member Georges Perec, entirely without the use of the letter e. Another possibly well-known example is the Exercises in Style of Raymond Queneau, which retells the same vapid anecdote in 99 different styles. The book that Michael put in front of me to sign is a compendium of anecdotes, examples of Oulipan work, and other Oulipalia.

What Michael did not realize, however, was that the gods of fate were handing me an opportunity. He says that I glared at him for a moment, then flipped through the pages, found the place in the book where I was mentioned, circled it, and signed that.

The other half of that story is how I happened to be mentioned in Oulipo Compendium.

Back in the early 1990s I did a few text processing projects which would be trivial now, but which were unusual at the time, in a small way. For example, I constructed a concordance of the King James Bible, listing, for each word, the number of every verse in which it appeared. This was a significant effort at the time; the Bible was sufficiently large (around five megabytes) that I normally kept the files compressed to save space. This project was surprisingly popular, and I received frequent email from strangers asking for copies of the concordance.

Another project, less popular but still interesting, was an anagram dictionary. The word list from Webster's Second International dictionary was available, and it was an easy matter to locate all the anagrams in it, and compile a file. Unlike the Bible concordance, which I considered inferior to simply running grep, I still have the anagram dictionary. It begins:

aal ala
aam ama
Aarhus (See `arusha')
Aaronic (See `Nicarao')
Aaronite aeration
Aaru aura

And ends:

zoosporic sporozoic
zootype ozotype
zyga gazy
zygal glazy

The cross-references are to save space. When two words are anagrams of one another, both are listed in both places. But when three or more words are anagrams, the words are listed in one place, with cross-references in the other places, so for example:

Ateles teasel stelae saltee sealet
saltee (See `Ateles')
sealet (See `Ateles')
stelae (See `Ateles')
teasel (See `Ateles')

saves 52 characters over the unabbreviated version. Even with this optimization, the complete anagram dictionary was around 750 kilobytes, a significant amount of space in 1991. A few years later I generated an improved version, which dispensed with the abbreviation, by that time unnecessary, and which attempted, successfully I thought, to score the anagrams according to interestingness. But I digress.

One day in August of 1994, I received a query about the anagram dictionary, including a question about whether it could be used in a certain way. I replied in detail, explaining what I had done, how it could be used, and what could be done instead, and the result was a reply from Harry Mathews, another well-known member of the Oulipo, of which I had not heard before. Mr. Mathews, correctly recognizing that I would be interested, explained what he was really after:

A poetic procedure created by the late Georges Perec falls into the latter category. According to this procedure, only the 11 commonest letters in the language can be used, and all have to be used before any of them can be used again. A poem therefore consists of a series of 11 multi-word anagrams of, in French, the letters e s a r t i n u l o c (a c e i l n o r s t u). Perec discovered only one one-word anagram for the letter-group, "ulcerations", which was adopted as a generic name for the procedure.

Mathews wanted, not exactly an anagram dictionary, but a list of words acceptable for the English version of "ulcerations". They should contain only the letters a d e h i l n o r s t, at most once each. In particular, he wanted a word containing precisely these eleven letters, to use as the translation of "ulcerations".

Producing the requisite list was much easier than producing the anagram dictionary itself, so I quickly did it and sent it back; it looked like this:

a A a
d D d
e E e
h H h
i I i
l L l
n N n
o O o
r R r
s S s
t T t
ad ad da
ae ae ea
ah Ah ah ha
lost lost lots slot
nors sorn
nort torn tron
nost snot
orst sort
adehl heald
adehn henad
adehr derah
adehs Hades deash sadhe shade
deilnorst nostriled
ehilnorst nosethirl
adehilnort threnodial
adehilnrst disenthral
aehilnorst hortensial

The leftmost column is the alphabetical list of letters. This is so that if you find yourself needing to use the letters 'a d e h s' at some point in your poem, you can jump to that part of the list and immediately locate the words containing exactly those letters. (It provides somewhat less help for discovering the shorter words that contain only some of those letters, but there is a limit to how much can be done with static files.)
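A list of this shape can be produced with a small filter. The Python below is a sketch, not the program actually used in 1994: `ALLOWED`, `usable`, and `letter_index` are names invented here, and the handful of sample words stands in for the real Webster's word list.

```python
from collections import defaultdict

ALLOWED = set("adehilnorst")  # the eleven letters Mathews asked for

def usable(word):
    """True if the word uses only allowed letters, each at most once."""
    letters = word.lower()
    return set(letters) <= ALLOWED and len(set(letters)) == len(letters)

def letter_index(words):
    """Group usable words under their sorted letters, shortest keys first."""
    groups = defaultdict(list)
    for word in words:
        if usable(word):
            groups["".join(sorted(word.lower()))].append(word)
    return [f"{key} {' '.join(sorted(groups[key]))}"
            for key in sorted(groups, key=lambda k: (len(k), k))]

# Hypothetical sample input; "teasel" repeats a letter and "zebra" uses
# letters outside the allowed eleven, so both are dropped.
sample = ["shade", "Hades", "sort", "torn", "tron", "threnodial",
          "teasel", "zebra", "a", "A"]
for line in letter_index(sample):
    print(line)
```

Sorting the keys by length and then alphabetically puts all the words for a given set of letters on one line, with the short sets first, which is the layout of the list above.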

As can be seen at the end of the list, there were three words that each used ten of the eleven required letters: “hortensial”, “threnodial”, “disenthral”, but none with all eleven. However, Mathews replied:

You have found the solution to my immediate problem: "threnodial" may only have 10 letters, but the 11th letter is "s". So, as an adjectival noun, "threnodials" becomes the one and only generic name for English "Ulcerations". It is not only less harsh a word than the French one but a sorrowfully appropriate one, since the form is naturally associated with Georges Perec, who died 12 years ago at 46 to the lasting consternation of us all.

(A threnody is a hymn of mourning.)

A few years later, the Oulipo Compendium appeared, edited by Mathews, and the article on Threnodials mentions my assistance. And so it was that when Michael Fischer handed me a copy, I was able to open it up to the place where I was mentioned.

[ Addendum 20140428: Thanks to Philippe Bruhat for some corrections: neither Perec nor Mathews was a founding member of Oulipo. ]

[ Addendum 20170205: To my consternation, Harry Mathews died on January 25. There was nobody like him, and the world is a smaller and poorer place. ]

[ Addendum 20170909: I should have mentioned that my appearance in Oulipo Compendium was brought to my attention by Robin Houston. Thank you M. Houston! ]

Tue, 03 Jan 2012

Eta-reduction in Haskell and English
The other day Katara and I were putting together a model, and she asked what a certain small green part was for. I said "It's a thing for connecting a thing to another thing."

Katara objected that this was a completely unhelpful explanation, but I disagreed. I would have agreed that it was an excessively verbose explanation, but she didn't argue that point.

Later, it occurred to me that Haskell has a syntax for eliding unnecessary variables in cases like this. In Haskell, one can abbreviate the expression

        λx → λy → x + y
to just (+). (Perl users may find it helpful to know that the Perl equivalent of the expression above is sub { my ($x) = @_; return sub { my ($y) = @_; return $x + $y } }.) This is an example of a general transformation called η-reduction. In general, for any function f, λx → f x is a function that takes an argument x and returns f x. But that's exactly what f does. So we can replace the longer version with the shorter version, and that's η-reduction, or we can go the other way, which is η-expansion.
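For readers more at home in Python, the same idea can be sketched roughly as follows. Python has no built-in curried `(+)`, so `curry2` here is a hypothetical helper introduced only for the illustration, as are the names `add_long`, `add_short`, and `eta_expand`.

```python
import operator

# The long form: a function of x that returns a function of y.
add_long = lambda x: lambda y: x + y

def curry2(f):
    """Turn a two-argument function into the nested one-argument form."""
    return lambda x: lambda y: f(x, y)

# The short form just names the addition function instead of
# spelling out its arguments.
add_short = curry2(operator.add)

def eta_expand(f):
    """Wrap f in a lambda that only passes its argument along."""
    return lambda x: f(x)

# The long and short forms agree, and eta-expanding a function
# does not change what it computes.
assert add_long(3)(4) == add_short(3)(4) == 7
assert eta_expand(abs)(-5) == abs(-5) == 5
```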

Anyway, once I thought of this it occurred to me that, just like the longer expression could be reduced to (+), my original explanation that the small green part was "a thing for connecting a thing to another thing" could be η-reduced to "a connector".

Perhaps if I had said that in the first place Katara would not have complained.

Happy new year, all readers.

Wed, 20 May 2009

No flimping
Advance disclaimer: I am not a linguist, have never studied linguistics, and am sure to get some of the details wrong in this article. Caveat lector.

There is a standard example in linguistics that is attached to the word "flimp". The idea it labels is that certain grammatical operations are restricted in the way they behave, and cannot reach deeply into grammatical structures and rearrange them.

For instance, you can ask "What did you use to see the girl on the hill in the blue dress?" and I can reply "I used a telescope to see the girl on the hill in the blue dress". Here "the girl on the hill in the blue dress" is operating as a single component, which could, in principle, be arbitrarily long. ("The girl on the hill that was fought over in the war between the two countries that have been at war since the time your mother saw that monkey climb the steeple of the church...") This component can be extracted whole from one sentence and made the object of a new sentence, or the subject of some other sentence.

But certain other structures are not transportable. For example, in "Bill left all his money to Fred and someone", one can reach down as far as "Fred and someone" and ask "What did Bill leave to Fred and someone?" but one cannot reach all the way down to "someone" and ask "Who did Bill leave all his money to Fred and"?

Under certain linguistic theories of syntax, analogous constraints rule out the existence of certain words. "Flimped" is the hypothetical nonexistent word which, under these theories, cannot exist. To flimp is to kiss a girl who is allergic to. For example, to flimp coconuts is to kiss a girl who is allergic to coconuts. (The grammatical failure in the last sentence but one illustrates the syntactic problem that supposedly rules out the word "flimped".)

I am not making this up; for more details (from someone who, unlike me, may know what he is talking about) see Word meaning and Montague grammar by David Dowty, p. 236. Dowty cites the earlier sources, from 1969–1973, that proposed this theory in the first place. The "flimped" example above is exactly the same as Dowty's, and I believe it is the standard one.

Dowty provides a similar, but different example: there is not, and under this theory there cannot be, a verb "to thork" which means "to lend your uncle and", so that "John thorked Harry ten dollars" would mean "John lent his uncle and Harry ten dollars".

I had these examples knocking around in my head for many years. I used to work for the University of Pennsylvania Computer and Information Sciences department, and from my frequent contacts with various cognitive-science types I acquired a lot of odds and ends of linguistic and computational folklore. Michael Niv told me this one sometime around 1992.

The "flimp" thing rattled around my head, surfacing every few months or so, until last week, when I thought of a counterexample: Wank.

The verb "to wank to" means "to rub one's genitals while considering", and so seems to provide a counterexample to the theory that says that verbs of this type are illegal in English.

When I went to investigate, I found that the theory had pretty much been refuted anyway. The Dowty book (published 1979) produced another example: "to cuckold" is "to have sexual intercourse with the woman who is married to".

Some Reddit person recently complained that one of my blog posts had no point. Eat this, Reddit person.

Fri, 08 May 2009

Most annoying phrase known to man?
I have been wasting time, those precious minutes of my life that will never return, by eliminating the odious phrase "known to man" from Wikipedia articles. It is satisfying, in much the same way as doing the crossword puzzle, or popping bubble wrap.

In the past I have gone on search-and-destroy missions against certain specific phrases, for example "It should be noted that...", which can nearly always be replaced with "" with no loss of meaning. But "known to man" is more fun.

One pleasant property of this phrase is that one can sidestep the issue of whether "man" is gender-neutral. People on both sides of this argument can still agree that "known to man" is best replaced with "known". For example:

  • The only albino gorilla known to man...
  • The most reactive and electronegative substance known to man...
  • Copper and iron were known to man well before the copper age and iron age...
In examples like these, "to man" is superfluous, and one can delete it with no regret.

As a pleonasm and a cliché, "known to man" is a signpost to prose that has been written by someone who was not thinking about what they were saying, and so one often finds it amid other prose that is pleonastic and clichéd. For example:

Diamond ... is one of the hardest naturally occurring material known (another harder substance known today is the man-made substance aggregated diamond nanorods which is still not the hardest substance known to man).
Which I trimmed to say:

Diamond ... is one of the hardest naturally-occurring materials known. (Some artificial substances, such as aggregated diamond nanorods, are harder.)
Many people ridicule Strunk and White's fatuous advice to "omit needless words"—if you knew which words were needless, you wouldn't need the advice—but all editors know that beginning writers will use ten words where five will do. The passage above is a good example.

Can "known to man" always be improved by replacement with "known"? I might have said so yesterday, but I mentioned the issue to Yaakov Sloman, who pointed out that the original use was meant to suggest a contrast not with female knowledge but with divine knowledge, an important point that completely escaped my atheist self. In light of this observation, it was easy to come up with a counterexample: "His acts descended to a depth of evil previously unknown to man" partakes of the theological connotations very nicely, I think, and so loses some of its force if it is truncated to "... previously unknown". I suppose that many similar examples appear in the work of H. P. Lovecraft.

It would be nice if some of the Wikipedia examples were of this type, but so far I haven't found any. The only cases so far that I haven't changed are all direct quotations, including several from the introductory narration of The Twilight Zone, which asserts that "There is a fifth dimension beyond that which is known to man...". I like when things turn out better than I expected, but this wasn't one of those times. Instead, there was one example that was even worse than I expected. Bad writing it may be, but the wrongness of "known to man" is at least arguable in most cases. (An argument I don't want to make today, although if I did, I might suggest that "titanium dioxide is the best whitening agent known to man" be rewritten as "titanium dioxide is the best whitening agent known to persons of both sexes with at least nine and a half inches of savage, throbbing cockmeat.") But one of the examples I corrected was risibly inept, in an unusual way:

Wonder Woman's Amazon training also gave her limited telepathy, profound scientific knowledge, and the ability to speak every language known to man.
I have difficulty imagining that the training imparted to Diana, crown princess of the exclusively female population of Paradise Island, would be limited to languages known to man.

Earle Martin drew my attention to the Wikipedia article on "The hardest metal known to man". I did not dare to change this.

[ Addendum 20090515: There is a followup article. ]

Sun, 15 Feb 2009

Stupid crap, presented by Plato
Yesterday I posted:

"She is not 'your' girlfriend," said this knucklehead. "She does not belong to you."
Through pure happenstance, I discovered last night that there is an account of this same bit of equivocation in Plato's Euthydemus. In this dialogue, Socrates tells of a sophist named Dionysodorus, who is so clever that he can refute any proposition, whether true or false. Here Dionysodorus demonstrates that Ctesippus's father is a dog:

You say that you have a dog.

Yes, a villain of a one, said Ctesippus.

And he has puppies?

Yes, and they are very like himself.

And the dog is the father of them?

Yes, he said, I certainly saw him and the mother of the puppies come together.

And is he not yours?

To be sure he is.

Then he is a father, and he is yours; ergo, he is your father, and the puppies are your brothers.

So my knuckleheaded interlocutor was not even being original.

I gratefully acknowledge the gift of Thomas Guest. Thank you very much!

Fri, 31 Oct 2008

A proposed correction to an inconsistency in English orthography
English contains exactly zero homophones of "zero", if one ignores the trivial homophone "zero", as is usually done.

English also contains exactly one homophone of "one", namely "won".

English does indeed contain two homophones of "two": "too" and "to".

However, the expected homophones of "three" are missing. I propose to rectify this inconsistency. This is sure to make English orthography more consistent and therefore easier for beginners to learn.

I suggest the following:

I also suggest the founding of a well-funded institute with the following mission:

  1. Determine the meanings of these three new homophones
  2. Conduct a public education campaign to establish them in common use
  3. Lobby politicians to promote these new words by legislation, educational standards, public funding, or whatever other means are appropriate
  4. Investigate the obvious sequel issues: "four" has only "for" and "fore" as homophones; what should be done about this?
Obviously, the director of this institute should be a thoughtful, far-seeing individual who will not allow his good judgement to be clouded by the generous salary. I refer, of course, to myself.

Happy Halloween. All Hail Discordia.

[ Addendum 20081106: Some readers inexplicably had nothing better to do than to respond to this ridiculous article. ]

Wed, 14 May 2008

More artificial Finnish
Several Finns wrote to me to explain in some detail what was wrong with the artificial Finnish in yesterday's article. As I surmised, the words "ssän" and "kkeen" are lexically illegal in Finnish. There were a number of similar problems. For example, my sample output included the non-word "t". I don't know how this could have happened, since the input probably didn't include anything like that, and the Markov process I used to generate it shouldn't have done so. But the code is lost, so I suppose I'll never know.

Of the various comments I received, perhaps the most interesting was from Ilmari Vacklin. ("Vacklin", huh? If my program had generated "Vacklin", the Finns would have been all over the error.) M. Vacklin pointed out that a number of words in my sample output violated the Finnish rules of vowel harmony.

(M. Vacklin also suggested that my article must have been inspired by this comic, but it wasn't. I venture to guess that the Internet is full of places that point out that you can manufacture pseudo-Finnish by stringing together a lot of k's and a's and t's; it's not that hard to figure out. Maybe this would be a good place to mention the word "saippuakauppias", the Finnish term for a soap-dealer, which was in the Guinness Book of World Records as the longest commonly-used palindromic word in any language.)

Anyway, back to vowel harmony. Vowel harmony is a phenomenon found in certain languages, including Finnish. These languages class vowels into two antithetical groups. Vowels from one group never appear in the same word as vowels from the other group. When one has a prefix or a suffix that normally has a group A vowel, and one wants to join it to a word with group B vowels, the vowel in the suffix changes to match. This happens a lot in Finnish, which has a zillion suffixes. In many languages, including Finnish, there is also a third group of vowels which are "neutral" and can be mixed with either group A or with group B.

Modern Korean does not have vowel harmony, mostly, but Middle Korean did have it, up until the early 16th century. The Korean alphabet was invented around 1443, and the notation for the vowels reflected the vowel harmony:

[ Addendum 20080517: The following paragraph about vowel harmony contains significant errors of fact. I got the groups wrong. ]

The first four vowels in this illustration, with the vertical lines, were incompatible with the second four vowels, the ones with the horizontal lines. The last two vowels were neutral, as was another one, not shown here, which was written as a single dot and which has since fallen out of use. Incidentally, vowel harmony is an unusual feature of languages, and its presence in Korean has led some people to suggest that it might be distantly related to Turkish.

The vowel harmony thing is interesting in this context for the following reason. My pseudo-Finnish was generated by a Markov process: each letter was selected at random so as to make the overall frequency of the output match that of real Finnish. Similarly, the overall frequency of two- and three-letter sequences in pseudo-Finnish should match that in real Finnish. Is this enough to generate plausible (although nonsensical) Finnish text? For English, we might say maybe. But for Finnish the answer is no, because this process does not respect the vowel harmony rules. The Markov process doesn't remember, by the time it gets to the end of a long word, whether it is generating a word in vowel category A or B, and so it doesn't know which vowels it should be generating. It will inevitably generate words with mixed vowels, which is forbidden. This problem does not come up in the generation of pseudo-English.
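A rough check for such mixed words might look like the sketch below. The vowel groups are not given in this article; the code assumes the usual textbook description of Finnish (back vowels a, o, u; front vowels ä, ö, y; neutral e and i). The names `violates_harmony` and `mixed_words` are invented for the example.

```python
# Assumed vowel classes for Finnish (not taken from the article):
BACK = set("aou")
FRONT = set("äöy")
# e and i are treated as neutral and so are simply ignored.

def violates_harmony(word):
    """True if the word contains both a back vowel and a front vowel."""
    letters = set(word.lower())
    return bool(letters & BACK) and bool(letters & FRONT)

def mixed_words(text):
    """Return the words of `text` that mix back and front vowels."""
    words = [w.strip(".,:;!?") for w in text.split()]
    return [w for w in words if w and violates_harmony(w)]

# A few words from the pseudo-Finnish samples in this article:
print(mixed_words("Ava pän svun kerekent batävomenasttenerga"))
```

Real Finnish compound words and loanwords can legitimately mix the two groups, so a check like this would flag some genuine words too; it is only a crude filter.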

None of that was what I was planning to write about, however. What I wanted to do was to present samples of pseudo-Finnish generated with various tunings of the Markov process.

The basic model is this: you choose a number N, say 2, and then you look at some input text. For each different sequence of N characters, you count how many times that sequence is followed by "a", how many times it is followed by "b", and so on.

Then you start generating text at random. You pick a sequence of N characters arbitrarily to start, and then you generate the next character according to the probabilities that you calculated. Then you look at the last N characters (the last N-1 from before, plus the new one) and repeat. You keep doing that until you get tired.
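The original code is lost, so what follows is only a minimal sketch of the process just described, not a reconstruction. The function names (`build_table`, `generate`), the restart-on-dead-end rule, and the toy input string are all my own choices for illustration.

```python
import random
from collections import Counter, defaultdict

def build_table(text, n):
    """Map each n-character sequence in `text` to a Counter of the
    characters that follow it."""
    table = defaultdict(Counter)
    for i in range(len(text) - n):
        table[text[i:i + n]][text[i + n]] += 1
    return table

def generate(table, n, length, rng=random):
    """Generate `length` characters by repeatedly choosing the next
    character according to the counts in `table`.

    `table` must be non-empty, i.e. the input text must have been
    longer than n characters.
    """
    state = rng.choice(list(table))      # arbitrary starting sequence
    out = state
    while len(out) < length:
        followers = table.get(state)
        if not followers:
            # This sequence appeared only at the very end of the
            # input, so nothing is known to follow it: start over
            # from another arbitrary sequence.
            state = rng.choice(list(table))
            continue
        chars = list(followers)
        weights = list(followers.values())
        nxt = rng.choices(chars, weights=weights)[0]
        out += nxt
        state = (state + nxt)[1:]        # keep only the last n characters
    return out

# Toy input; a real experiment needs far more text than this.
text = "in the beginning god created the heaven and the earth. " * 3
table = build_table(text, 2)
print(generate(table, 2, 60))
```

With N=0 the table has a single key, the empty string, and the process just picks each character independently according to its overall frequency.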

For example, suppose we have N=2. Then we have a big table whose keys are 2-character strings like "ab", and then associated with each such string, a table that looks something like this:
r 54.52
a 15.89
i 10.41
o 7.95
l 4.11
e 3.01
u 1.10
space 0.82
: 0.55
t 0.55
, 0.27
. 0.27
b 0.27
s 0.27
So in the input to this process, "ab" was followed by "r" more than 54% of the time, by "a" about 16% of the time, and so on. And when generating the output, every time our process happens to generate "ab", it will follow by generating an "r" 54.52% of the time, an "a" 15.89% of the time, and so on.
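A table of this kind for a single prefix could be computed along the following lines. This is a sketch only: `follower_percentages` is a made-up name, and the short input string is an invented stand-in for the real input text.

```python
from collections import Counter

def follower_percentages(text, prefix):
    """Return (character, percent) pairs for the characters that
    follow `prefix` in `text`, most common first."""
    n = len(prefix)
    counts = Counter(
        text[i + n]
        for i in range(len(text) - n)
        if text[i:i + n] == prefix
    )
    total = sum(counts.values())
    return [(ch, round(100 * c / total, 2))
            for ch, c in counts.most_common()]

# Invented sample text, far too small to be representative.
for ch, pct in follower_percentages("abracadabra abbot absurd", "ab"):
    print(ch, pct)
```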

Whether to count capital letters as the same as lowercase, and what to do about punctuation and spaces and so forth, are up to the designer.

Here, as examples, are some samples of pseudo-English, generated with various N. The input text was the book of Genesis, which is not entirely typical. In each case, I deleted the initial N characters and the final partial word, cleaned up the capitalization by hand, and appended a final period. The four samples are for N=0, 1, 2, and 3, in that order.

Lt per f idd et oblcs hs hae:uso ar w aaolt y tndh rl ohn n synenihbdrha,spegn.
Cachand t wim, heheethas anevem blsant ims, andofan, ieahrn anthaye s, lso iveeti alll t tand, w.
Ged hich callochbarthe of th to tre said nothem, and rin ing of brom. My and he behou spend the.
Sack one eved of and refor ther of the hand he will there that in the ful, when it up unto rangers.
It should be clear that the quality improves as one increases the N parameter. The N=3 sample has mostly real words, and the few nonsense ones it contains ("eved", "ful") are completely plausible English. N=2, on the other hand, is mostly nonsense, although it's mostly plausible nonsense. Even "callochbarthe" is almost plausible. (The unfortunate "chb" in the middle is just bad luck. It occurs because Genesis 36 mentions Baalhanan the son of Achbor.) The N=1 sample is recognizably bogus; no English word looks like "ieahrn", and the triple "l" in "alll" is nearly impossible. (I did once write to Jesse Sheidlower, an editor of the Big Dictionary, to ask his advice about whether "ballless" should be hyphenated.)

I have prepared samples of pseudo-Finnish of various qualities. The input here was a bunch of text I copied out of Finnish Wikipedia. (Where else? If you need Finnish text in 1988, you get it from the Usenet group; if you need Finnish text in 2008, you get it from Finnish Wikipedia.) I did a little bit of manual cleanup, as with the English, but not too much. Again the four samples are for N=0, 1, 2, and 3, in that order.

Vtnnstäklun so so rl sieesjo.Aiijesjeäyuiotiannorin traäl.N vpojanti jonn oteaanlskmt enhksaiaaiiv oenlulniavas. Rottlatutsenynöisu iikännam e lavantkektann eaagla admikkosulssmpnrtinrkudilsorirumlshsmoti,anlosa anuioessydshln.Atierisllsjnlu e.Itatlosyhi vnko ättr otneän akho smalloailäi jiaat kajvtaopnasneilstio tntin einteaonaiimotn:r apoya oruasnainttotne wknaiossäelaäinoev aobrs,vteorlokynv. Aevsrikhanä tp s s oälnlke rvmi il ynae nara ign ssm lkimttbhineaatismäi tst lli ahaltineshne kr keöunv ah s itenh s .Ia pa elstpnanmnuiksriil anaalnttt mr ti.Ooa ka eee eiiei,tnees äusee a nanhetv.Iopkijeatatits,i l eklbiik suössmap tioaotaktdiir rkeaviohiesotkeagarihv nnadvö jlape öt kaeakmjkhykoto tnt iunnuyknnelu rutliie.Leva eiriaösnaj,rk oyumtsle,iioa,aspa aeiaä wsuinn eta y tvati klssviutkuaktmlpnheomi.T akapskushhnuksnhnnheaaaaussitseminmpnamäiaä pät.Kaaaabl unnionuhnpa iaes,outka.Cväinvkshvrnlteeoea rmi re suodmpr autlysa tnliaanäass. Srs rnvrtsita kmidusvjn tii.
Ava pän svun kerekent lsita batävomenasttenerga kovosuujalules rma punntäni rtraliksainoi van eukällä. Enäkukänesinntampalä ttan kolpäsäkyönsllvitivenestakkesenelussivaliite kuuksä kttteni einsuekeita kuterissalietäkilpöikalit ojatäjä pinsin atollukole idoitenn kkaorhjajasteden en vuolynkoiverojaa hta puon ehalan vaivä ihoshäositi. Hde setua tämpitydi makta jasyn sää oinncgrkai jeeten. Ljalanekikeri toiskkksypohoin ta yö atenesällväkeesaatituuun. Paait pukata tuon ktusumitttan zagaleskli va kkanäsin siikutytowhenttvosa veste eten vunovivä. Vorytellkeeni stan jä taa eka kaine ja kurenntonsin kyn o nta ja. Aisst urksetaka. Hotimivaa ta mppussternallai ja. Hdä on koraleerermohtydelen on jon. Rgienon kulinoilisälsa ja holälimmpa vitin, kukausoompremänn ra, palestollebilsen kaalesta, oina. Blilullaushoingiötideispaanoksiton, mulurklimi kermalli pota atebau lmomarymin kypa hta vanon tin kela vanaspoita s kulitekkäjen jäleetuolpan, veesalekäilin oii. Häreli. Ymialisstermimpriekaksst on.
Omaalis onino osa josa hormastaaraktse tyi altäänä tyntellevääostoidesenä, la siä vuansilliana inöön akalkuulukempellys kisä nen myöhelyaminenkiemostamahti omuonsa onite oni kusissa. Kungin sykynteillalkaai ellahasiteisuunnaja eroniemmin javai musuuasinä, sittan tusuovatkryt tormon vuolisenitiivansaliuotkietjuuta sensa. Kutumppalvinen. Vaikintolat hän ja kilkuossa osa koiseuvo keyhdysvisakeemppolowistoisijouliuodosijolasissän muoli ogro soluksi valuksasverix intetormon patlantaan et muiksen paiettaatulun kan vuomesyklees ovain pun. Sesva sa hänerittämpiraun tyi vuoden sälisen sän yhtiit, set tämpiraalletä. Senssaikanoje leemp:tabeten ain raa olliukettyi su. Solulukuuttellerrotolit hee säkinessa hän sekketäärinenvaikeihakti umallailuksin sestunno klossi ilunuta. Klettisaa osen vua vuola, jani ja hinangia en ta kaineemonimien polin barkiviäliukkuta joseseva. Ebb rautta onistärään on ml jokoulistä oheksi anoton allysvallelsiliineuvoja kutuko ala ulkietutablohitkain. Ituno.
Ävivät mena osakeyhti yhdysvalmiininäkin rakenne tuliitä hermoni ja umpirauhastui liin baryshnikoneja. Ain viljelukuullisää olisäke spesideksyylikoliittu latvia. Helsina hän solukeskuksen kannumme, peri palkin vieskeinä sisään on orgaan poikanssisäätelukauno klee laisenäläinen tavastui kauno on länteen muttava hän voimista kilometsästymistettäjän lehtiöiksitoreisö. Sitoutuvat mukalle. Ainettiin sisäke suomaihin, jouluun. Verenkilpalveli valtaineen opisteri poli ohjasionee rakennuttikolan aivastisenäläistuu kehittisetoja, rajahormaailmanajan kulkopuolesti kuluu mooliitoutuvat ovat olle. Ainen yhdysvaltai valiolähtiöiksi vasta, S. Muidentilaisteri jotka verenkirovin verenkiehumistä nelle väliaivoittynyt baleviiliukoisiin maailmestavarasta, jokakuudessa laisu. Sai rakeyhti yhtiö eli gluksessa. Ebbin, ja linnosakkeen hormonien I hallistehtiin kilpirasvua jaajana hormaailusta kunnetteluskäyttöön suomalaivat yhdysvalmistämistammonit veteet olimistuvatta. Hormon oli rautta.
Before anyone objects to the non-word "ml" in the N=2 sample, let me explain that this is the standard abbreviation for "millilitra". The "i" in the N=3 sample was a puzzle, since Marko Heiskanen assures me that Finnish has no one-letter words. But it appears in my sample in connection with Sukselaisen I hallitus, whatever that is, so I capitalized it.

I must say that I found "yhdysvalmistämistammonit" rather far-fetched, even in Finnish. But then I discovered that "yhdeksänkymmenvuotiaaksi" and "yhdysvalloissakaan" are genuine, so who am I to judge?

[ Addendum 20080601: Some additional notes. ]

Mon, 12 May 2008

Artificial Finnish

By 1988 or 1989 I had read in several places, most recently in J. R. Pierce's Symbols, Signals, and Noise, that if you compile a table of the relative frequencies of three-letter sequences (trigraphs) in English text, and then generate random text with the same trigraph frequencies, the result cannot be distinguished from meaningful English text except by people who actually know English. Examples were provided, containing weird but legitimate-sounding words like "deamy" and "grocid", and the claim seemed plausible. But since I did actually know English, I could not properly evaluate it.

But around that time the Internet was just beginning to get into full swing. The Finnish government was investing a lot of money in networking infrastructure, and a lot of people in Finland were starting to appear on the Internet.

I have a funny story about that: Around the same time, a colleague named Marc Edgar approached me in the computer lab to ask if I knew of any Internet-based medium he could use to chat with his friend at the University of Oulu. I thought at first that he was putting me on (and maybe he was) because in 1989 the University of Oulu was just about the only place in the world where a large number of people were accessible via internet chat, IRC having been invented there the previous autumn.

A new set of Finnish-language newsgroups had recently appeared on Usenet, and people posted to them in Finnish. So I had access to an unlimited supply of computer-readable Finnish text, something which would have been unthinkable a few years before, and I could do the experiment in Finnish.

I wrote up the program, which is not at all difficult, gathered Finnish news articles, and produced the following sample:

Uttavalon estaa ain pahalukselle? Min omatunu selle menneet hy, toista. Palveljen alh tkö an välin oli ei alkohol pisten jol elenin. Että, ille, ittavaikki oli nim tor taisuuristä usein an sie a in sittä asia krista sillo si mien loinullun, herror os; riitä heitä suurinteen palve in kuk usemma. Tomalle, äs nto tai sattia yksin taisiä isiäk isuuri illää hetorista. Varsi kaikenlaineet ja pu distoja paikelmai en tulissa sai itsi mielim ssän jon sn ässäksi; yksen kos oihin! Jehovat oli kukahdol ten on teistä vak kkiasian aa itse ee eik tse sani olin mutta todistanut t llisivat oisessa sittä on raaj a vaisen opinen. Ihmisillee stajan opea tajat ja jumalang, sitten per sa ollut aantutta että voinen opeten. Ettuj, jon käs iv telijoitalikantaminun hä seen jälki yl nilla, kkeen, vaaraajil tuneitteistamaan same?

In those days, the world was 7-bit, and Finnish text was posted in a Finnish national variant of ASCII that caused words like "tkö an välin" to look like "tk| an v{lin". The presence of the curly braces heightened the apparent similarity, because that was all you could see at first glance.
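The substitution can be sketched with a translation table. Only the ä → { and ö → | pairs are confirmed by the example above; the other four pairs are my assumption, taken from the Finnish/Swedish national variant of ISO 646 as I remember it. The function names are invented.

```python
# ä and ö are confirmed by the article's example; å and the three
# capitals are assumed from the Finnish/Swedish ISO 646 variant.
NATIONAL = "äöåÄÖÅ"
SEVEN_BIT = "{|}[\\]"

TO_7BIT = str.maketrans(NATIONAL, SEVEN_BIT)
FROM_7BIT = str.maketrans(SEVEN_BIT, NATIONAL)

def to_7bit(text):
    """Show Finnish text as it would look on a plain ASCII display."""
    return text.translate(TO_7BIT)

def from_7bit(text):
    """Recover the Finnish letters from the 7-bit form."""
    return text.translate(FROM_7BIT)

assert to_7bit("tkö an välin") == "tk| an v{lin"
assert from_7bit("tk| an v{lin") == "tkö an välin"
```

Going in the reverse direction is lossy in practice, since any genuine brace or bracket in the text would be turned into a letter.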

At the time I was pleased, but now I think I see some defects. There are some vowelless words, such as "sn" and "t", which I think don't occur in Finnish. Some other words look defective: "ssän" and "kkeen", for example. Also, my input sample wasn't big enough, so once the program generated "alk" it was stuck doing the rest of "alkohol". Still, I think this could pass for Finnish if the reader wasn't paying much attention. I was satisfied with the results of the experiment, and was willing to believe that randomly-constructed English really did look enough like English to fool a non-English-speaking observer.

[ Addendum 20080514: There is a followup to this article. ]

[ Addendum 20080601: Some additional notes. ]

Tue, 04 Mar 2008

"Boolean" or "boolean"?
In a recent article I wrote:

... a logical negation function ... takes a boolean argument and returns a boolean result.
I worried for some time about whether to capitalize "boolean" here. But writing "Boolean" felt strange enough that I didn't actually try it to see how it looked on the page.

I looked at the Big Dictionary, and all the citations were capitalized. But the most recent one was from 1964, so that was not much help.

Then I tried Google search for "boolean capitalized". The first hit was a helpful article by Eric Lippert. M. Lippert starts by pointing out that "Boolean" means "pertaining to George Boole", and so should be capitalized. That much I knew already.

But then he pointed out a countervailing consideration:

English writers do not usually capitalize the eponyms "shrapnel" (Henry Shrapnel, 1761-1842), "diesel" (Rudolf Diesel, 1858-1913), "saxophone" (Adolphe Sax, 1814-1894), "baud" (Emile Baudot, 1845-1903), "ampere" (Andre Ampere, 1775-1836), "chauvinist" (Nicolas Chauvin, 1790-?), "nicotine" (Jean Nicot, 1530-1600) or "teddy bear" (Theodore Roosevelt, 1858-1916).
Isn't that a great paragraph? I just had to quote the whole thing.

Lippert concluded that the tendency is to capitalize an eponym when it is an adjective, but not when it is a noun. (Except when it isn't that way; consider "diesel engine". English is what it is.)

I went back to my example to see if that was why I resisted capitalizing "Boolean":

... takes a boolean argument and returns a boolean result.
Hmm, no, that wasn't it. I was using "boolean" as an adjective in both places. Wasn't I?

Something seemed wrong. I tried changing the example:

... takes an integer argument and returns an integer result.
Aha! Notice "integer", not "integral". "Integral" would have been acceptable also, but that isn't analogous to the expression I intended. I wasn't using "boolean" as an adjective to modify "argument" and "result". I was using it as a noun to denote a certain kind of data, as part of a noun phrase. So it is a noun, and that's why I didn't want to capitalize it.

I would have been happy to have written "takes a boolean and returns a boolean", and I think that's the controlling criterion.

Sorry, George.

Mon, 18 Feb 2008

Once I was visiting my grandparents while home from college. We were in the dining room, and they were talking about a book they were reading, in which the author had used a word they did not know: cornaptious. I didn't know it either, and got up from the table to look it up in their Webster's Second International Dictionary. (My grandfather, who was for his whole life both cantankerous and a professional editor, loathed the permissive and descriptivist Third International. The out-of-print Second International Edition was a prized Christmas present that in those days was hard to find.)

Webster's came up with nothing. Nothing but "corniculate", anyway, which didn't appear to be related. At that point we had exhausted our meager resources. That's what things were like in those days.

The episode stuck with me, though, and a few years later when I became the possessor of the First Edition of the Oxford English Dictionary, I tried there. No luck. Some time afterwards, I upgraded to the Second Edition. Still no luck.

Years went by, and one day I was reading The Lyre of Orpheus, by Robertson Davies. The unnamed Dean of the music school describes the brilliant doctoral student Hulda Schnakenburg:

"Oh, she's a foul-mouthed, cornaptious slut, but underneath she is all untouched wonderment."
"Aha," I said. "So this is what they were reading that time."

More years went by, the oceans rose and receded, the continents shifted a bit, and the Internet crawled out of the sea. I returned to the problem of "cornaptious". I tried a Google book search. It found one use only, from The Lyre of Orpheus. The trail was still cold.

But wait! It also had a suggestion: "Did you mean: carnaptious", asked Google.

Ho! Fifty-six hits for "carnaptious", all from books about Scots and Irish. And the OED does list "carnaptious". "Sc. and Irish dial." it says. It means bad-tempered or quarrelsome. Had Davies spelled it correctly, we would have found it right away, because "carnaptious" does appear in Webster's Second.

So that's that then. A twenty-year-old spelling error cleared up by Google Books.

[ Addendum 20080228: The Dean's name is Wintersen. Geraint Powell, not the Dean, calls Hulda Schnakenburg a cornaptious slut. ]


Thu, 31 Jan 2008

Unnecessary imprecision
This article contains the following sentence:

McCain has won all of the state's 57 delegates, and the last primary before voters in more than 20 states head to the polls next Tuesday.
Why "more than 20 states"? Why not just say "23 states", which is shorter and conveys more information?

I'm not trying to pick on CTV here. A Google News search finds 42,000 instances of "more than 20", many of which could presumably be replaced with "26" or whatever. Well, I had originally written "most of which", but then I looked at some examples, and found that the situation is better than I thought it would be. Here are the first ten matches:

  1. Australian Stocks Complete Worst Month in More Than 20 Years
  2. It said the US air force committed more than 20 cases of aerial espionage by U-2 strategic espionage planes this month.
  3. Farmland prices have climbed more than 20% over the past year in many Midwestern states...
  4. "We have had record-breaking growth in our monthly shipments, as much as more than 20 percent improvements per month," said Christopher Larkins, President...
  5. More than 20 people, including a district officer, were injured when two bombs exploded outside a stadium in the town yesterday...
  6. By a vote of 14-7, the Senate Finance Committee last night voted to deliver $500 tax rebates to more than 20 million American senior citizens...
  7. 9 killed, more than 20 injured in bus accident
  8. While Tuesday's results may not lock up the nomination for either candidate, Democrats will have their say in more than 20 states...
  9. Facing the potential anointment of his rival, John McCain, Romney has less than a week to convince voters in more than 20 states that...
  10. More than 20 Aberdeen citizens qualified for elections as April ...
#1 may be legitimate, if the previous worst month was less than 21 years ago. Similarly #6 is legitimate if the number of senior citizens is close to 20 million, say around 20,400,000, particularly since the number may not be known with high precision.

#2 may be legitimate, if the number of cases of aerial espionage is not known with certitude, or if the anonymous source really did say "more than 20". Similarly #4 is entirely off the hook since it is a quotation.

#3 may be legitimate if the increase in the price of farmland is uncertain and close to 20%. #5 is probably a loser. #7 is definitely a loser: it was the headline of an article that began "Nine people were killed and at least 22 injured when...". The headline could certainly have been "9 killed, 22 injured in bus accident".

#8 and #9 are losers, but they are the same example with which I began the article, so they don't count. #10 is a loser.

So I have, of eight examples (disregarding #8 and #9) three certain or near-certain failures (#5, #7, and #10), one certain non-failure (#4), and four cases to which I am willing to extend the benefit of the doubt. This is not as bad as I feared. I like when things turn out better than I thought they would.

But I really wonder what is going on with all these instances of "more than 20 states". Is it just sloppy writing? Or is there some benefit that I am failing to appreciate?


Sun, 06 Jan 2008

A while back I looked up "zillion" in Wikipedia, which is an alias for the Wikipedia article about "Indefinite and fictitious numbers". The article includes a large number of synonyms for "zillion", such as bajillion, kajillion, gazillion, and so forth. For some reason the word "squillion" caught my eye, and I noticed that the citation was from Terry Pratchett: "And you owe me a million billion trillion zillion squillion dollars." This suggested to me that "squillion" might be a nonce-word, one made up on the spot by Pratchett for that one sentence, in which case it should not be in the Wikipedia article.

Google book search is a good way to answer questions like that, because if "squillion" is widely used, you will find a lot of examples of it. And indeed it is widely used, and I did find a lot of examples of it. So there was no need to remove it from the article.

One of the Google hits was from the Cormac Ó Cuilleanáin translation of Giovanni Boccaccio's Decameron. The Decameron is a great classic of Italian Renaissance literature, probably the greatest classic that Italian has, after Dante's Divine Comedy. It was written around 1350. In this particular chapter (the tenth story on the sixth day, if you want to look it up) Guccio, a priest, is trying to seduce a hideous kitchen-maid:

He sat himself down by the fire—although this was August—and struck up a conversation with the wench in question (Nuta by name), informing her that he was by rights a member of the gentry and had more than a squillion florins in the bank, not counting those he had to give to other people...

The kitchen-maid, by the way, is described as having "a pair of tits like two baskets of manure".

This was amusing, and as I had never read the Decameron, I wanted to read more, and learn how it turned out. But the Google excerpt was limited, so I asked the library to get me a copy of that version of the Decameron. Of course they have many copies on the shelf, but not that particular translation. So I asked the interlibrary loan people for it, and they got it for me.

When it arrived, I was rather dismayed. The ILL people get the book from the most convenient place, and that means that it often comes from the Drexel library, up the street, or the Temple library, across town, or the West Chester Community College library, or Lehigh University, about an hour away in Bethlehem. (Steel Bethlehem, of course, not Jesus Bethlehem.) The farthest I had ever gotten a book from was an extremely obscure quilting manual that Lorrie asked for; it eventually arrived from the Sno-Isles regional library system of Marysville, Washington.

But this copy of the Decameron came from the Sloman library of the University of Essex. I was so shocked that I had to look it up online to make sure that it was not Essex, New Jersey, or something like that. It was not. It was East Saxony. I was upset because I felt that the trouble and effort had been wasted. If I had known that the nearest available copy of Cormac Ó Cuilleanáin's translation was in Essex, I would have been happy to take a different version that was on the shelf. And then to top it off, I had hardly begun to read it before it came due and had to be sent back to Essex.

So I went to the library and got another Decameron, this one translated by Mark Musa and Peter Bondanella. Here is the corresponding passage:

Although it was still August, he took a seat near the fire and began to talk with the girl, whose name was Nuta, telling her that he was a gentleman by procuration, that he had more than a thousand hundreds of florins (not counting those he had to give away to others), ...

And there is a footnote on "thousand hundreds" explaining "Guccio invents this amount, as well as the previous phrase 'by procuration,' in order to impress his lady." By the way, in this version, Nuta has "a pair of tits that looked like two clumps of cowshit".

Anyway, I think I liked "squillions" better than "thousand hundreds", although I suppose "thousand hundreds" is probably a more literal translation.

Well, I can find this out. Of course, one can find the Decameron online in Italian; the copyright expired about five hundred years ago. Here it is in Italian, courtesy of Brown University:

E ancora che d'agosto fosse, postosi presso al fuoco a sedere, cominciò con costei, che Nuta aveva nome, a entrare in parole e dirle che egli era gentile uomo per procuratore e che egli aveva de' fiorini piú di millantanove, senza quegli che egli aveva a dare altrui,...
I think the word that is being translated here is "millantanove", although I can't be entirely sure, because I don't know Italian. Once again, though, I am surprised at how easy it is to read a passage in an unintelligible foreign language when I already know what it is going to say. (I wrote about this back in April 2006, and it occurs to me now that that would be a fun topic for an article.)

The 1903 translation that Brown University provides is "more florins than could be reckoned", which does not seem to me to capture the flavor of the original, and does not seem to be a literal translation either. "Millantanove" seems to me to be a made-up word resembling "mille" = "thousand". But as I said, I don't know Italian.

Nuta in this version has "a pair of breasts that shewed as two buckets of muck". Feh. The Italian is "con un paio di poppe che parean due ceston da letame". The operative phrase here seems to be "ceston da letame". I don't know what those words mean, but, happily, Italian Wikipedia has an article about letame, and as the picture makes clear, it is indeed manure.

Oh, did you want this article to have a point? Too bad.

I recommend the Decameron. It is funny and salacious. There are a lot of stories about women cheating on their husbands, and then getting away with it through some clever trick, and then everyone who hears the story laughs and admires the cleverness of the ladies. (The counterpoint to this is that there are a number of stories of wife-beating, in which everyone who hears the story laughs and admires the wisdom of the husbands. I don't like that so much.)

There are farcical stories of bed-swapping and wife-swapping, and one story about an abbess who comes out of her cell to berate a nun for having her lover in to visit, but the abbess is wearing a pair of men's trousers on her head instead of her wimple. Oops.

This reminds me of high school, when I was talking to one of my friends who had opted to study French. This friend told me that studying French is fun, because when you get to the third year and start reading real French literature, you read that great classic of French literature, La Vie de Gargantua et de Pantagruel. If you have not read this master treasure of French culture, I should explain that the first chapter is mainly taken up with Gargantua and Pantagruel having a discussion about what is the best sort of thing to wipe your ass with, and it goes on from there.

I took Latin, and in third-year Latin we read the orations of Cicero against Catiline. Fun stuff, but not the sort of thing that has you rushing to translate the next word.

I was going to write an article about symmetries of the dodecahedron, and an interesting problem suggested to me by these balloon displays that I saw at the local Mazda dealership, but eh, this was a lot easier.

Gargantua and Pantagruel eventually agree that the answer is a live goose.

[ Addendum 20080201: More about 'millantanove'. ]


Sat, 05 Jan 2008

Pepys' footballs explained
Walt Mankowski wrote to me with the explanation of Samuel Pepys' footballs: They are not clods of mud, as I guessed, nor horse droppings, as another correspondent suggested, but... footballs.

Walt found a reference in Montague Shearman's 1887 book on the history of football in England that specifically mentions this. Folks were playing football in the street, and because of this, Pepys took his coach to Sir Philip Warwicke's, rather than walking.

I didn't ask, but I presume Walt found this by doing some straightforward Google search for "pepys footballs" or something of the sort. For some reason, this did not even occur to me. Once Big Dictionary failed me, I was stumped. Perhaps this marks me as a member of the pre-Internet generation. I imagined this morning that this episode would be repeated, with my daughter Katara in place of Walt. "Oh, Daddy! You're so old-fashioned. Just use a Google search."

Anyway, inspired by Walt's example, or by what I imagined Walt's example to be, I did the search myself, and found the Shearman reference, as well as the following discussion in William Carew Hazlitt's Faiths and Folklore of 1905:

Mission, writing about 1690, says: "In winter foot-ball is a useful and charming exercise. It is a leather ball about as big as one's head, fill'd with wind. This is kick'd about from one to t'other in the streets, by him that can get at it, and that is all the art of it."
This book looks like it would be good reading in general. [ Addendum 20080106: This is not the William Hazlitt, but his grandson. Thank you, Wikipedia. ]

Thanks very much, Walt.


Fri, 04 Jan 2008

The diary of Samuel Pepys for Tuesday, 3 January 1664/5 says:

Up, and by coach to Sir Ph. Warwicke's, the streete being full of footballs, it being a great frost, and found him and Mr. Coventry walking in St. James's Parke.
"The street being full of footballs?" Huh? I tried looking in the Big Dictionary, and it was no help at all.

My best guess is that it's big chunks of frozen mud that you have to kick out of the way. Do any gentle readers know for sure?

The Diary of Samuel Pepys has a syndication feed you can subscribe to. You get a diary entry every day or so, with all the names and places linked to a glossary. It's fun reading.

[ Addendum 20080105: The answer. ]


Sun, 16 Sep 2007

Thank you very much for that bulletin
I'm about to move house, and so I'm going through a lot of old stuff and throwing it away. I just unearthed the decorations from my office door circa 1994. I want to record one of these here before I throw it away and forget about it. It's a clipping from the front page of the New York Times from 11 April, 1992. It is noteworthy for its headline, which is only one column wide, but at the very top of page A1, above the fold. It says:


Sometimes good articles get bad headlines. Often the headlines are tacked on just before press time by careless editors. Was this a good article afflicted with a banal headline? Perhaps they meant there was internecine squabbling among the diplomats charged with the negotiations?

No. If you read the article it turned out that it was about how darn hard it was to end the war when folks kept shooting at each other, dad gum it.

I hear that the headline the following week was DOG BITES MAN, but I don't have a clipping of that.

[ Addendum 20200507: Here's a thumbnail image. ]


Wed, 16 May 2007

Moziz Addums
Last July at a porch sale I obtained a facsimile copy of Housekeeping in Old Virginia, by M.C. Tyree, originally published in 1879. I had been trying to understand the purpose of ironing. Ironing makes the clothes look nice, but it must have also served some important purpose, essential for life, that I don't now understand. In the Laura Ingalls Wilder Little House books, Laura recounts a common saying that scheduled the week's work:

Wash on Monday
Iron on Tuesday
Mend on Wednesday
Churn on Thursday
Clean on Friday
Bake on Saturday
Rest on Sunday

You bake on Saturday so that you have fresh bread for Sunday dinner. You wash on Monday because washing is backbreaking labor and you want to do it right after your day of rest. You iron the following day before the washed clothes are dirty again. But why iron at all? If you don't wash the clothes or clean the house, you'll get sick and die. If you don't bake, you won't have any bread, and you'll starve. But ironing? In my mind it was categorized with dusting, as something people with nice houses in the city might do, but not something that Ma Ingalls, three miles from the nearest neighbor, would concern herself with.

But no. Ironing, and starching with the water from boiled potatoes, was so important that it got a whole day to itself, putting it on par with essential activities like cleaning and baking. But why?

A few months later, I figured it out. In this era of tumble-drying and permanent press, I had forgotten what happens to fabrics that are air dried, and did not understand until I was on a trip and tried to air-dry a cotton bath towel. Air-dried fabrics come out not merely wrinkled but corrugated, like an accordion, or a washboard, and are unusable. Ironing was truly a necessity.

Anyway, I was at this porch sale, and I hoped that this 1879 housekeeping book might provide the answer to the ironing riddle. It turned out to be a cookbook. There is plenty to say about this cookbook anyway. It comes recommended by many notable ladies, including Mrs. R.B. Hayes. (Her husband was President of the United States.) She is quoted on the flyleaf as being "very much pleased" with the cookbook.

Some of the recipes are profoundly unhelpful. For example, p.106 has:

Boiled salmon. After the fish has been cleaned and washed, dry it and sew it up in a cloth; lay in a fish-kettle, cover with warm water, and simmer until done and tender.

Just how long do I simmer it? Oh, until it is "done" and "tender". All right, I will just open up the fish kettle and poke it to see. . . except that it is sewed up in a cloth. Hmmm.

You'd think that if I'm supposed to simmer this fish that has been sewn up in a cloth, the author of the recipe might advise me on how long until it is "done". "Until tender" is a bit of a puzzle too. In my experience, fish become firmer and less tender the longer you simmer them. Well, I have a theory about this. The recipe is attributed to "Mrs. S.T.", and consulting the index of contributors, I see that it is short for "Mrs. Samuel Tyree", presumably the editor's mother-in-law. Having a little joke at her expense, perhaps?

There are a lot of other interesting points, which may appear here later. For example, did you know that the most convenient size hog for household use is one of 150 to 200 pounds? And the cookbook contains recipes not only for tomato catsup, but also pepper catsup, mushroom catsup, and walnut catsup.

But the real reason I brought all this up is that pages 253–254 have the following item, attributed to "Moziz Addums":

Resipee for cukin kon-feel Pees. Gether your pees 'bout sun-down. The folrin day, 'bout leven o'clock, gowge out your pees with your thum nale, like gowgin out a man's ey-ball at a kote house. Rense your pees, parbile them, then fry 'erm with some several slices uv streekd middlin, incouragin uv the gravy to seep out and intermarry with your pees. When modritly brown, but not scorcht, empty intoo a dish. Mash 'em gently with a spune, mix with raw tomarters sprinkled with a little brown shugar and the immortal dish ar quite ready. Eat a hepe. Eat mo and mo. It is good for your genral helth uv mind and body. It fattens you up, makes you sassy, goes throo and throo your very soul. But why don't you eat? Eat on. By Jings. Eat. Stop! Never, while thar is a pee in the dish.

This was apparently inserted for humorous effect. Around the time the cookbook was written, there was quite a vogue for dialectal humor of this type, most of which has been justly forgotten. Probably the best-remembered practitioner of this brand of humor was Josh Billings, who I bet you haven't heard of anyway. Tremendously popular at the time, almost as much so as Mark Twain, his work is little-read today; the joke is no longer funny. The exceptionally racist example above is in many ways typical of the genre.

One aspect of this that is puzzling to us today (other than the obvious "why was this considered funny?") is that it's not clear exactly what was supposed to be going on. Is the idea that Moziz Addums wrote this down herself, or is this a transcript by a literate person of a recipe dictated by Moziz Addums? Neither theory makes sense. Where do the misspellings come from? In the former theory, they are Moziz Addums' own misspellings. But then we must imagine someone literate enough to spell "intermarry" and "immortal" correctly, but who does not know how to spell "of".

In the other theory, the recipe is a transcript, and the misspellings have been used by the anonymous, literate transcriber to indicate Moziz Addums' unusual or dialectal pronunciations, as with "tomarters", perhaps. But "uv" is the standard (indeed, the only) pronunciation of "of", which wrecks this interpretation. (Spelling "of" as "uv" was the signature of Petroleum V. Nasby, another one of those forgotten dialectal humorists.) And why did the transcriber misspell "peas" as "pees"?

So what we have here is something that nobody could possibly have written or said, except as an inept parody of someone else's speech. I like my parody to be rather less artificial.

All of this analysis would be spoilsportish if the joke were actually funny. E.B. White famously said that "Analyzing humor is like dissecting a frog. Few people are interested and the frog dies of it." Here, at least, the frog had already been dead for a hundred years before I got to it.

[ Addendum 20100810: In case you were wondering, "kon-feel pees" are actually "cornfield peas", that is, peas that have been planted in between the rows of corn in a cornfield. ]


Tue, 15 May 2007

Ambiguous words and dictionary hacks
A Mexican gentleman of my acquaintance, Marco Antonio Manzo, was complaining to me (on IRC) that what makes English hard is the large number of ambiguous words. For example, English has the word "free" where Spanish distinguishes "gratis" (free like free beer) from "libre" (free like free speech).

I said I was surprised that he thought that was unique to English, and said that probably Spanish had just as many "ambiguous" words, but that he just hadn't noticed them. I couldn't think of any Spanish examples offhand, but I knew some German ones: in English, "suit" can mean a lawsuit, a suit of clothes, or a suit of playing cards. German has different words for all of these. In German, the suit of a playing card is its "Farbe", its color. So German distinguishes between suit of clothes and suit of playing cards, which English does not, but fails to distinguish between colors of paint and suit of playing cards, which English does.

Every language has these mismatches. Korean has two words for "thin", one meaning thin like paper and the other meaning thin like string. Korean distinguishes father's sister ("komo") from mother's sister ("imo") where English has only "aunt".

Anyway, Sr. Manzo then went to lunch, and I wanted to find some examples of concepts distinguished by English but not by Spanish. I did this with a dictionary hack.

A dictionary hack is when you take a plain text dictionary and do some sort of rough-and-ready processing on it to get an 80% solution to some problem. The oldest dictionary hack I know of is the old Unix rhyming dictionary hack:

        rev /usr/dict/words | sort | rev > rhyming.txt
This takes the Unix word list and turns it into a semblance of a rhyming dictionary. It's not an especially accurate semblance, but you can't beat the price.

     ugh           Marlborough     choreograph            Guelph      Wabash
     Hugh          Scarborough     lithograph             Adolph      cash
     McHugh        thorough        electrocardiograph     Randolph    dash
     Pugh          trough          electroencephalograph  Rudolph     leash
     laugh         sough           nomograph              triumph     gash
     bough         tough           tomograph              lymph       hash
     cough         tanh            seismograph            nymph       lash
     dough         Penh            phonograph             philosoph   clash
     sourdough     sinh            chronograph            Christoph   eyelash
     hough         oh              polarograph            homeomorph  flash
     though        pharaoh         spectrograph           isomorph    backlash
     although      Shiloh          Addressograph          polymorph   whiplash
     McCullough    pooh            chromatograph          glyph       splash
     furlough      graph           autograph              anaglyph    slash
     slough        paragraph       epitaph                petroglyph  mash
     enough        telegraph       staph                  myrrh       smash
     rough         radiotelegraph  aleph                  ash         gnash
     through       calligraph      Joseph                 Nash        Monash
     breakthrough  epigraph        caliph                 bash        rash
     borough       mimeograph      Ralph                  abash       brash
It figures out that "clash" rhymes with "lash" and "backlash", but not that "myrrh" rhymes with "purr" or "her" or "sir". You can, of course, do better, by using a text file that has two columns, one for orthography and one for pronunciation, and sorting it by reverse pronunciation. But like I said, you won't beat the price.
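
For what it's worth, the same hack can be sketched in Python. This is only an illustration: `rhyme_order` mimics the `rev | sort | rev` pipeline, and `rhyme_order_by_sound` is the two-column variant that sorts by reversed pronunciation. Both function names and the phoneme spellings in the sample are made up for this sketch; they are not taken from any real pronouncing dictionary.

```python
def rhyme_order(words):
    """Order words by reversed spelling, like `rev | sort | rev`."""
    return sorted(words, key=lambda w: w[::-1])


def rhyme_order_by_sound(entries):
    """Order (spelling, phonemes) pairs by reversed phoneme sequence.

    The phoneme tuples are hypothetical; a real pronouncing
    dictionary would supply its own symbols.
    """
    ordered = sorted(entries, key=lambda e: tuple(reversed(e[1])))
    return [spelling for spelling, _ in ordered]


if __name__ == "__main__":
    print(rhyme_order(["clash", "myrrh", "lash", "backlash"]))
    sample = [
        ("myrrh", ("M", "ER")),
        ("clash", ("K", "L", "AE", "SH")),
        ("purr", ("P", "ER")),
        ("lash", ("L", "AE", "SH")),
        ("her", ("HH", "ER")),
    ]
    print(rhyme_order_by_sound(sample))
```

Sorted by spelling, "myrrh" has no rhyming neighbors; sorted by sound, it lands between "her" and "purr".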

But I digress. Last week I pulled an excellent dictionary hack. I found the Internet Dictionary Project's English-Spanish lexicon file on the web with a quick Google search; it looks like this:

        a	un, uno, una[Article]
        aardvark	cerdo hormiguero
        aardvark	oso hormiguero[Noun]
        aardvarks	cerdos hormigueros
        aardvarks	osos hormigueros 
        ab	prefijo que indica separacio/n
        aback	hacia atras
        aback	hacia atr´s,take aback, desconcertar. En facha.
        aback	por sopresa, desprevenidamente, de improviso
        aback	atra/s[Adverb]
        abacterial	abacteriano, sin bacterias
        abacus	a/baco
        abacuses	a/bacos
        abaft	A popa (towards stern)/En popa (in stern)
        abaft	detra/s de[Adverb]
        abalone	abulo/n
        abalone	oreja de mar (molusco)[Noun]
        abalone	oreja de mar[Noun]
        abalones	abulones
        abalones	orejas de mar (moluscos)[Noun]
        abalones	orejas de mar[Noun]
        abandon	abandonar
        abandon	darse por vencido[Verb]
        abandon	dejar
        abandon	desamparar, desertar, renunciar, evacuar, repudiar
        abandon	renunciar a[Verb]
        abandon	abandono[Noun]
        abandoned	abandonado
        abandoned	dejado
Then I did:

        sort -k 2 idengspa.txt |
        perl -nle '($ecur, $scur) = split /\s+/, $_, 2; 
                print "$eprev $ecur $scur" 
                        if $sprev eq $scur && 
                           substr($eprev, 0, 1) ne substr($ecur, 0, 1); 
                        ($eprev, $sprev) = ($ecur, $scur)'

The sort sorts the lexicon into Spanish order instead of English order. The Perl thing comes out looking a lot more complicated than it ought. It just says to look for and print consecutive items that have the same Spanish, but whose English begins with different letters. The condition on the English is to filter out items where the Spanish is the same and the English is almost the same, such as:

blond blonde rubio
cake cakes tarta
oceanographic oceanographical oceanografico[Adjective]
palaces palazzi palacios[Noun]
talc talcum talco
taxi taxicab taxi

It does filter out possible items of interest, such as:

carefree careless sin cuidado

But since the goal is just to produce some examples, and this cheap hack was never going to generate an exhaustive list anyway, that is all right.
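
Here is the same idea as a rough Python sketch, which may be easier to follow than the one-liner. It is an illustration rather than the command above: the function name `shared_translations` is invented, and it assumes each line is an English headword, some whitespace, and then the Spanish gloss, as in the sample. Unlike the one-liner, it trims trailing whitespace from the gloss.

```python
def shared_translations(lines):
    """Yield (english1, english2, spanish) for adjacent entries that
    share a Spanish gloss but whose English words start differently."""
    entries = []
    for line in lines:
        parts = line.split(None, 1)  # headword, then the rest of the line
        if len(parts) == 2:
            entries.append((parts[0], parts[1].strip()))

    # Put the lexicon into Spanish order so equal glosses are adjacent.
    entries.sort(key=lambda e: e[1])

    prev_en, prev_es = None, None
    for en, es in entries:
        # Same Spanish, but English words begin with different letters.
        if es == prev_es and en[0] != prev_en[0]:
            yield prev_en, en, es
        prev_en, prev_es = en, es


if __name__ == "__main__":
    demo = [
        "cake\ttarta",
        "cakes\ttarta",
        "clock\tel reloj",
        "high\talto",
        "tall\talto",
        "watch\tel reloj",
    ]
    for en1, en2, es in shared_translations(demo):
        print(en1, en2, es)
```

The first-letter test is what drops "cake cakes tarta", and it is also what drops "carefree careless sin cuidado".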

The output is:

        at letter a
        actions stock acciones[Noun]
        accredit certify acreditar
        around thereabout alrededor
        high tall alto
        comrade pal amigo[Noun]
        antecedents backgrounds antecedentes
        (...complete output...)
A lot of these are useless, genuine synonyms. It would be silly to suggest that Spanish fails to preserve the English distinction between "marry" and "wed", between "ale" and "beer", between "desire" and "yearn", or between "vest" and "waistcoat". But some good possibilities remain.

Of these, some probably fail for reasons that only a Spanish-speaker would be able to supply. For instance, is "el pastel" really the best translation of both "cake" and "pie"? If so, it is an example of the type I want. But perhaps it's just a poor translation; perhaps Spanish does have this distinction; say maybe "torta" for "cake" and "empanada" for "pie". (That's what Google suggests, anyway.)

Another kind of failure arises because of idioms. The output:

        exactly o'clock en punto
is of this type. It's not that Spanish fails to distinguish between the concepts of "exactly" and "o'clock"; it's that "en punto" (which means "on the point of") is used idiomatically to mean both of those things: some phrase like "en punto tres" ("on the point of three") means "exactly three" and so, by analogy, "three o'clock". I don't know just what the correct Spanish phrases are, but I can guess that they'll be something like this.

Still, some of the outputs are suggestive:

high tall alto
low small bajo[Adjective]
babble fumble balbucear[Verb]
jealous zealous celoso
contest debate debate[Noun]
forlorn stranded desamparado[Adjective]
docile meek do/cil[Adjective]
picture square el cuadro
fourth room el cuarto
collar neck el cuello
idiom language el idioma[Noun]
clock watch el reloj
floor ground el suelo
ceiling roof el techo
knife razor la navaja
feather pen la pluma
cloudy foggy nublado

I put some of these to Sr. Manzo, and he agreed that some were indeed ambiguous in Spanish. I wouldn't have known what to suggest without the dictionary hack.


Mon, 14 May 2007

Bryan and his posse
Today upon the arrival of a coworker and his associates, I said "Oh, here comes Bryan and his posse". My use of "posse" here drew some comment. I realized I was not completely sure what "posse" meant. I mostly knew it from old West contexts: the Big Dictionary has quotes like this one, from 1901:

A pitched battle was fought at Rockhill, Missouri, between the Sheriff's posse and the miners on strike.
I first ran across the word in J.D. Fitzgerald's Great Brain books. At least in old West contexts, the word refers to a gang of men assembled by some authority such as a sheriff or a marshal, to perform some task, such as searching for a lost person, apprehending an outlaw, or blasting some striking miners. This much was clear to me before.

From the context and orthography, I guessed that it was from Spanish. But no, it's not. It's Latin! "Posse" is the Latin verb "to be able", akin to English "possible" and ultimately to "potent" and related words. I'd guessed something like this, supposing English "posse" was akin to some Spanish derivative of the Latin. But it isn't; it's direct from Latin: "posse" in English is short for posse comitatus, "force of the county".

The Big Dictionary has citations for "posse comitatus" back to 1576:

Mr. Sheryve meaneth in person to repayre thither & with force to bryng hym from Aylesham, Whomsoever he fyndeth to denye the samet & suerly will with Posse Comitatus fetch hym from this new erected pryson to morrow.

"Sheryve" is "Sheriff". (If you have trouble understanding this, try reading it aloud. English spelling has changed more than its pronunciation has since 1576.)

I had heard the phrase before in connection with the Posse Comitatus Act of U.S. law. This law, passed in 1878, is intended to prohibit the use of the U.S. armed forces as Posse Comitatus—that is, as civilian law enforcement. Here the use is obviously Latin, and I hadn't connected it before with the sheriff's posse. But they are one and the same.


Mon, 04 Dec 2006

A couple of weeks ago I was over at a friend's house, and was trying to explain to her two-year-old daughter which way to turn the knob on her Etch-a-Sketch. But I couldn't tell her to turn it clockwise, because she can't tell time yet, and has no idea which way is clockwise.

It occurs to me now that I may not be giving her enough credit; she may know very well which way the clock hands go, even though she can't tell time yet. Two-year-olds are a lot smarter than most people give them credit for.

Anyway, I then began to wonder what "clockwise" and "counterclockwise" were called before there were clocks with hands that went around clockwise. But I knew the answer to that one: "widdershins" is counterclockwise; "deasil" is clockwise.

Or so I thought. This turns out not to be the answer. "Deasil" is only cited by the big dictionary back to 1771, which postdates clocks by several centuries. "Widdershins" is cited back to 1545. "Clockwise" and "counter-clockwise" are only cited back to 1888! And a full-text search for "clockwise" in the big dictionary turns up nothing else. So the question of what word people used in 1500 is still a mystery to me.

That got me thinking about how asymmetric the two words "deasil" and "widdershins" are; they have nothing to do with each other. You'd expect a matched set, like "clockwise" and "counterclockwise", or maybe something based on "left" and "right" or some other pair like that. But no. "Widdershins" means "the away direction". I thought "deasil" had something to do with the sun, or the day, but apparently not; the "dea" part is akin to dexter, the right hand, and the "sil" part is obscure. Whereas the "shins" part of "widdershins" does have something to do with the sun, at least by association. That is, it is not related historically to the sun, except that some of the people using the word "widdershins" were apparently thinking it was actually "widdersun". What a mess. And the words have nothing to do with each other anyway, as you can see from the histories above; "widdershins" is 250 years older than "deasil".

The OED also lists "sunways", but the earliest citation is the same as the one for deasil.

Anyway, I did not know any of this at the time, and imagined that "deasil" meant "in the direction of the sun's motion". Which it is; the sun goes clockwise through the sky, coming up on the left, rising to its twelve-o'-clock apex, and then descending on the right, the way the hands of a clock do. (Perhaps that's why the early clockmakers decided to make the hands of the clock go that way in the first place. Or perhaps it's because of the (closely related) reason that that's the direction that the shadow on a sundial moves.)

And then it hit me that in the southern hemisphere, the sun goes the other way: instead of coming up on the left, and going down on the right, the way clock hands do, it comes up on the right and goes down on the left. Wowzers! How bizarre.

I'm a bit sad that I figured this out before actually visiting the southern hemisphere and seeing it for myself, because I think I would have been totally freaked out on that first morning in New Zealand (or wherever). But now I'm forewarned that the sun goes the wrong way down there and it won't seem so bizarre when I do see it for the first time.


Mon, 27 Nov 2006

Baseball team nicknames, again
Some addenda to my recent article about baseball team nicknames.

Several people wrote to complain that I mismatched the cities and the nicknames in this sentence:

The American League [has] the Boston Royals, the Kansas City Tigers, the Detroit Indians, the Oakland Orioles...

My apologies for the error. It should have been the Boston Tigers, the Kansas City Indians, the Detroit Orioles, and the Oakland Royals.

Phil Varner reminded me that the Chicago Bulls are in fact a "local color" name; they are named in honor of the Chicago stockyards.

This raises a larger point, brought up by Dave Vasilevsky: My classification of names into two categories conflates some issues. Some names are purely generic, like the Boston Red Sox, and can be transplanted anywhere. Other names are immovable, like the Philadelphia Phillies. In between, we have a category of names, like the Bulls, which, although easily transportable, are in fact local references.

The Milwaukee Brewers are a good baseball example. The Brewers were named in honor of the local German culture and after Milwaukee's renown as a world center of brewing. Nobody would deny that this is a "local color" type name. But the fact remains that many cities have breweries, and the name "Brewers" would work well in many places. The Philadelphia Brewers wouldn't be a silly name, for example. The only place in the U.S. that I can think of offhand that fails as a home for the Brewers is Utah; the Utah Brewers would be a bad joke. (This brings us full circle to the observation about the Utah Jazz that inspired the original article.)

The Baltimore Orioles are another example. I cited them as an example of a generic and easily transportable name. But the Baltimore Oriole is in fact a "local color" type name; the Baltimore Oriole is named after Lord Baltimore, and is the state bird of Maryland. (Thanks again to Dave Vasilevsky and to Phil Gregory for pointing this out.)

Or consider the Seattle Mariners. The name is supposed to suggest the great port of Seattle, and was apparently chosen for that reason. (I have confirmed that the earlier Seattle team, the Seattle Pilots, was so-called for the same reason.) But the name is transportable to many other places: it's easy to imagine alternate universes with the New York Mariners, the Brooklyn Mariners, the San Francisco Mariners, or the Boston Mariners. Or even all five.

And similarly, although in the previous article I classed the New York Yankees with the "local color" names, based on the absurdity of the Selma or the Charleston Yankees, the truth is that the Boston Yankees only sounds strange because it didn't actually happen that way.

I thought about getting into a tremendous cross-check of all 870 name-city combinations, but decided it was too much work. Then I thought about just classing the names into three groups, and decided that the issue is too complex to do that. For example, consider the Florida Marlins. Local color, certainly. But immovable? Well, almost. The Toronto Marlins or the Kansas City Marlins would be jokes, but the Tampa Bay Marlins certainly wouldn't be. And how far afield should I look? I want to class the Braves as completely generic, but consideration of the well-known class AA Bavarian League Munich Braves makes it clear that "Braves" is not completely generic.

So in ranking by genericity, I think I'd separate the names into the following tiers:

  1. Pirates, Cubs, Reds, Cardinals, Giants, Red Sox, Blue Jays, White Sox, Tigers, Royals, Athletics
  2. Braves, Mets, Dodgers, Orioles, Yankees, Indians, Angels, Mariners, Nationals, Brewers
  3. Marlins, Astros, Diamondbacks, Rockies, Padres, Devil Rays, Twins
  4. Phillies, Rangers

The Texas Rangers are a bit of an odd case. Rangers ought to be movable—but the name loses so much if you do. You can't even move the name to Arlington (the Arlington Rangers?), and the Rangers already play in Arlington. So I gave them the benefit of the doubt and put them in group 4.

Readers shouldn't take this classification as an endorsement of the Phillies' nickname, which I think is silly. I would have preferred the Philadelphia Brewers. Or even the Philadelphia Cheese Steaks. Maybe they didn't need the extra fat, but wouldn't it have been great if the 1993 Phillies had been the 1993 Cheese Steaks instead? Doesn't John Kruk belong on a team called the Cheese Steaks?

Another oddity, although not from baseball: In a certain sense, the Montreal Canadiens have an extremely generic name. And yet it's clearly not generic at all!


Fri, 24 Nov 2006

Etymological oddity
Sometimes you find words that seem like they must be related, and then it turns out to be a complete coincidence.

Consider pen and pencil.

Pen is from French penne, a long feather or quill pen, akin to Italian penne (the hollow, ribbed pasta), and ultimately to the word feather itself.

Pencil is from French pincel, a paintbrush, from Latin peniculus, also a brush, from penis, a tail, which is also the source of the English word penis.

A couple of weeks ago someone edited the Wikipedia article on "false cognates" to point out that day and diary are not cognate. "No way," I said, "it's some dumbass putting dumbassery into Wikipedia again." But when I checked the big dictionary, I found that it was true. They are totally unrelated. Diary is akin to Spanish dia, Latin dies, and other similar words, as one would expect. Day, however, is "In no way related to L. dies..." and is akin to Sanskrit dah = "to burn", Lithuanian sagas = "hot season", and so forth.


Wed, 22 Nov 2006

Baseball team nicknames
Lorrie and I were in the car, and she noticed another car with a Detroit Pistons sticker. She remarked that "Pistons" was a good name for a basketball team, and particularly for one from Detroit. I agreed. But then she mentioned the Utah Jazz, a terrible mismatch, and asked me how that happened to be. Even if you don't know, you can probably guess: They used to be the New Orleans Jazz, and the team moved to Utah. They should have changed the name to the Teetotalers or the Salt Flats or something, but they didn't, so now we have the Utah Jazz. I hear that next month they're playing the Miami Fightin' Irish.

That got us thinking about how some sports team names travel, and others don't. Jazz didn't. The Miami Heat could trade cities or names with the Phoenix Suns and nobody would notice. But consider the Chicago Bulls. They could pick up and move anywhere, anywhere at all, and the name would still be fine, just fine. Kansas City Bulls? Fine. Honolulu Bulls? Fine. Marsaxlokk Bulls? Fine.

We can distinguish two categories of names: the "generic" names, like "Bulls", and the "local color" names, like "Pistons". But I know more about baseball, so I spent more time thinking about baseball team names.

In the National League, we have the generic Braves, Cardinals, Cubs, Giants, Pirates, and Reds, who could be based anywhere, and in some cases were. The Braves moved from Boston to Milwaukee to Atlanta, although to escape from Boston they first had to change their name from the Beaneaters. The New York Giants didn't need to change their name when they moved to San Francisco, and they won't need to change their name when they move to Jyväskylä next year. (I hear that the Jyväskylä city council offered them a domed stadium and they couldn't bear to say no.)

On the other hand, the Florida Marlins, Arizona Diamondbacks, and Colorado Rockies are clearly named after features of local importance. If the Marlins were to move to Wyoming, or the Rockies to Nebraska, they would have to change their names, or turn into bad jokes. Then again, the Jazz didn't change their name when they moved to Utah.

The New York Mets are actually the "Metropolitans", so that has at least an attempt at a local connection. The Washington Nationals ditto, although the old name of the Washington Senators was better. At least in that one way. Who could root for a team called the Washington Senators? (From what I gather, not many people could.)

The Nationals replaced the hapless Montreal Expos, whose name wasn't very good, but was locally related: they were named for the 1967 Montreal World's Fair. Advice: If you're naming a baseball team, don't choose an event that will close after a year, and especially don't choose one that has already closed.

The Houston Astros, and their Astrodome filled with Astroturf, are named to recall the NASA manned space center, which opened there in 1961. The Philadelphia club is called the Phillies, which is not very clever, but is completely immovable. Boston Phillies, anyone? Pittsburgh Phillies? New York Phillies? No? I didn't think so.

I don't know why the San Diego Padres are named that, but there is plenty of Spanish religious history in the San Diego area, so I am confident in putting them in the "local color" column. Milwaukee is indeed full of Brewers; there are a lot of Germans up there, brewing up lager. (Are they back in the National League again? They seem to switch leagues every thirty years.)

That leaves just the Los Angeles Dodgers, who are a bit of an odd case. The team, as you know, was originally the Brooklyn Dodgers. The "Dodgers" nickname, as you probably didn't know, is short for "Trolley Dodgers". The Los Angeles Trolley Dodgers is almost as bad a joke as the Nebraska Rockies. Fortunately, the "Trolley" part was lost a long time ago, and we can now imagine that the team is the Los Angeles Traffic Dodgers. So much for the National League; we have six generic names out of 16, counting the Traffic Dodgers in the "local color" group, and ignoring the defunct Expos.

The American League does not do so well. They have the Boston Royals, the Kansas City Tigers, the Detroit Indians, the Oakland Orioles, and three teams that are named after sox: the Red, the White, and the Athletics.

Then there are the Blue Jays. They were originally owned by Labatt, a Canadian brewer of beer, and were so-named to remind visitors to the park of their flagship brand, Labatt's Blue. I might have a harder time deciding which group to put them in, if it weren't for the (1944-1945) Philadelphia Blue Jays. If the name is generic enough to be transplanted from Toronto to Philadelphia, it is generic. I have no idea what name the Toronto club could choose if they wanted to avail themselves of the "local color" option rather than the "generic" option; it's tempting to make a cruel joke and suggest that the name most evocative of Toronto would be the Toronto Generics. But no, that's unfair. They could always call their baseball club the Toronto Hockey Fans.

Anyway, moving on, we have the New York Yankees, which is not the least generic possible name, but clearly qualifies as "local color" once you pause to think about the Charleston Yankees, the Shreveport Yankees, and the Selma Yankees. The Tampa Bay Devil Rays are clearly "local color". The Minnesota Twins play in the Twin Cities of Minneapolis and St. Paul. The California, Anaheim, or Los Angeles Angels, whatever they're called this week, are evidently named for the city of Los Angeles. I would ridicule the Los Angeles Angels for having a redundant name, but as an adherent of the Philadelphia Phillies, I am living in a glass house.

The Texas Rangers are named for the famous Texas Rangers. I don't know exactly why the Seattle club is named Mariners; I wouldn't have considered Seattle to be an unusually maritime city, but their previous team was the Seattle Pilots, so the folks in Seattle must think of themselves so, and I'm willing to go along with it.

The tally for the American League is therefore eight generic, six local color. The total for Major League Baseball as a whole is 14 generic names out of 30.

This is a lot better than the Japanese Baseball League, which has a bunch of teams with names like the Lions, Tigers, Dragons, Giants, and Fighters. They make up for this somewhat in the names of the teams' corporate sponsors, so, for example, the Nippon Ham Fighters. They are sponsored by Nippon Ham, which does not make it any less funny. And the Yakult Swallows, which, if you interpret it as a noun phrase, sounds just a little bit like a gay porn flick set in Uzbekistan.

Incidentally, my favorite team name is the Wilmington Blue Rocks. The Blue Rocks' mascot is, alas, not a rock but a moose. Sometimes I dream of a team from Lansing, Michigan, called the Lansing Boils, but I know it will remain an unfulfilled fantasy.

[ Warning for non-Americans: Almost, but not quite everything in this article is the truth. Marsaxlokk does not actually have a Major League baseball club yet; however, they do have a class-A affiliate in the Mediterranean league, called the Marsaxlokk Moghzaskops. Also, the Giants are not scheduled to move to Jyväskylä until after the 2008 season. ]

[ Addendum 20061127: There is a followup article to this one. ]

[ Addendum 20230425: I can't believe it took me this long to realize it, but the Los Angeles Lakers is just as strange a mismatch as the Utah Jazz. There are no lakes near Los Angeles. That name itself tells you what happened: the team was originally located in Minneapolis. ]


Sat, 07 Oct 2006

Bone names
Names of bones are usually Latin. They come in two types. One type is descriptive. The auditory ossicles (that's Latin for "little bones for hearing") are named in English the hammer, anvil, and stirrup, and their formal, Latin names are the malleus ("hammer"), incus ("anvil"), and stapes ("stirrup").

The fibula is the small bone in the lower leg; it's named for the Latin fibula, which is a kind of Roman safety pin. The other leg bone, the tibia, is much bigger; that's the frame of the pin, and the fibula makes the thin sharp part.

The kneecap is the patella, which is a "little pan". The big, flat parietal bone in the skull is from paries, which is a wall or partition. The clavicle, or collarbone, is a little key.

"Pelvis" is Latin for "basin". The pelvis is made of four bones: the sacrum, the coccyx, and the left and right os innominata. Sacrum is short for os sacrum, "the sacred bone", but I don't know why it was called that. Coccyx is a cuckoo bird, because it looks like a cuckoo's beak. Os innominatum means "nameless bone": they gave up on the name because it doesn't look like anything. (See illustration to right.)

On the other hand, some names are not descriptive: they're just the Latin words for the part of the body that they are. For example, the thighbone is called the femur, which is Latin for "thigh". The big lower arm bone is the ulna, Latin for "elbow". The upper arm bone is the humerus, which is Latin for "shoulder". (Actually, Latin is umerus, but classical words beginning in "u" often acquire an initial "h" when they come into English.) The leg bone corresponding to the ulna is the tibia, which is Latin for "tibia". It also means "flute", but I think the flute meaning is secondary—they made flutes out of hollowed-out tibias.

Some of the nondescriptive names are descriptive in Latin, but not in English. The vertebrae in English are so called after Latin vertebra, which means the vertebra. But the Latin word is ultimately from the verb vertere, which means to turn. (Like in "avert" ("turn away") and "revert" ("turn back").) The jawbone, or "mandible", is so-called after mandibula, which means "mandible". But the Latin word is ultimately from mandere, which means to chew.

The cranium is Greek, not Latin; kranion (or κρανιον, I suppose) is Greek for "skull". Sternum, the breastbone, is Greek for "chest"; carpus, the wrist, is Greek for "wrist"; tarsus, the ankle, is Greek for "instep". The zygomatic bone of the face is yoke-shaped; ζυγος ("zugos") is Greek for "yoke".

The hyoid bone is the only bone that is not attached to any other bone. (It's located in the throat, and supports the base of the tongue.) It's called the "hyoid" bone because it's shaped like the letter "U". This used to puzzle me, but the way to understand this is to think of it as the "U-oid" bone, which makes sense, and then to remember two things. First, that classical words beginning in "u" often acquire an initial "h" when they come into English, as with "humerus". And second, classical Greek "u" always turns into "y" in Latin. You can see this if you look at the shape of the Greek letter capital upsilon, which looks like this: Υ. Greek αβυσσος ("abussos" = "without a bottom") becomes English "abyss"; Greek ανωνυμος ("anonumos") becomes English "anonymous"; Greek υπο ("hupo"; there's supposed to be a diacritical mark on the υ indicating the "h-" sound, but I don't know how to type it) becomes "hypo-" in words like "hypothermia" and "hypodermic". So "U-oid" becomes "hy-oid".

(Other parts of the body named for letters of the alphabet are the sigmoid ("S-shaped") flexure of the colon and the deltoid ("Δ-shaped") muscle in the arm. The optic chiasm is the place in the head where the optic nerves cross; "chiasm" is Greek for a crossing-place, and is so-called after the Greek letter Χ.)

The German word for "auditory ossicles" is Gehörknöchelchen. Gehör is "for hearing". Knochen is "bones"; Knöchelchen is "little bones". So the German word, like the Latin phrase "auditory ossicles", means "little bones for hearing".


Sun, 09 Jul 2006

Phrasal verbs
My mom teaches English to visiting foreign students, and last time I met her she was telling me about phrasal verbs. A phrasal verb is a verb that incorporates a preposition. Examples include "speed up", "try out", "come across", "go off", "turn down". The prepositional part is uninflected, so "turns down", "turned down", "turning down", not *"turn downs", *"turn downed", *"turn downing". My mom says she uses a book that has a list of all of them; there are several hundred. She was complaining specifically about "go off", which has an unusually peculiar meaning: when the alarm clock goes off in the morning, it actually goes on.

This reminded me that "slow up" and "slow down" are synonymous. And there is "speed up", but no "speed down". And you cannot understand "stand down" by analogy with "stand up", "sit up", and "sit down". And you also cannot understand "nose job" by analogy with "hand job". But I digress.

One of the things about the phrasal verbs that gives the foreign students so much trouble is that the verbs don't all obey the same rules. For example, some are separable and some not. Consider "turned down". I can turn down the thermostat, but I can also turn the thermostat down. And I can try out my new game, and I can also try my new game out. And I can stand up my blind date, and I can stand my blind date up. But while I can come across a fountain in the park, I can't *come a fountain across in the park. And while I can go off to Chicago, I can't *go to Chicago off. There's no way to know which of these work and which not, except just by memorizing which are allowed and which not.

And sometimes the separable ones can't be unseparated. I can give back the map, and I can give the map back, and I can give it back, but I can't *give back it. I can hold up the line, and I can hold the line up, and I can hold us up, but I can't *hold up us. I don't know what the rule is exactly, and I don't want to go to the library again to get the Cambridge Grammar, because last time I did that I dropped it on my toe.

I hadn't realized any of this until I read this article about them, but when I did, I had a sudden flash of insight. I had not realized before what was going on when someone set up us the bomb. "Set up" is separable: I can set up the bomb, or set the bomb up, or someone can set us up. But "us", as noted above, is not deseparable, so you cannot have *set up us. But I think I understand the mistake better now than I did before; it seems less like a complete freak and more like a member of a common type of error.


Wed, 05 Apr 2006

TeX and the long S
It just occurs to me, reading today's article, that the final sentence is one of the strangest I've written in quite a while. It says:

stock TeX does not have any way to make a long medial s.

This is a strange thing to say because TeX was principally designed as a mathematical typesetting system, and one of the most common of all mathematical notations is the integral sign:

$$\int_a^b f'(x) dx = f(b) - f(a)$$

And the integral sign $\int$ is nothing more than an old-style long s; the 's' is for 'sum'.

Strange or not, the substance of my remark is correct, since standard TeX's fonts do not provide a long s in a size suitable for use in running text in place of a regular s.
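
The limitation is one of fonts rather than of the typesetting engine, so a Unicode-aware engine can sidestep it. The following is a minimal sketch under stated assumptions, not a description of how the article was produced: it assumes XeLaTeX or LuaLaTeX with the fontspec package, and it assumes that the font in use contains a glyph for U+017F (LATIN SMALL LETTER LONG S). If the font lacks that glyph, the character will simply be missing from the output.

```latex
% Sketch only: compile with xelatex or lualatex, not stock TeX.
\documentclass{article}
\usepackage{fontspec} % assumption: the default font has a glyph for U+017F
\begin{document}

% The long s is typed directly as the Unicode character U+017F.
The ſecond Experiment, was made, to ſhew a Way \ldots

% The integral sign, for comparison with the long s above.
$\int_a^b f'(x)\,dx = f(b) - f(a)$

\end{document}
```

Typing the character directly keeps the source readable and leaves the choice between long and short s to the person transcribing, which matters because the long s is used only in initial and medial positions.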


On baroque long S
Jokes about the long medial 's' are easy to make. Stan Freberg's album Stan Freberg Presents: The United States of America, Volume I: The Early Years has a scene in which John Adams or Benjamin Franklin or one of those guys is reading Thomas Jefferson's draft of the Declaration of Independence: "'Life, liberty, and the purfuit of happinefs'? Tom, all your s's look like f's!"

A story by Frances Warfield, appropriately titled "Fpafm", gets probably as much juice out of the joke as there is to be got. I believe the copyright has expired, so here it is, in its entirety:


Fpafm
by Frances Warfield

I ordered ham and eggs, as I always do on the diner, and then, as I always do, looked around for pamphlets. There was one handy, "Echoes of Colonial Days," it was called, "being a little fouvenir iffued from time to time, for the benefit of the guefts of The Baltimore & Ohio Railroad Company as a reminder of the pleafant moments fpent..." Involuntarily, my lips began to move. I reached for a pencil. But the man across from me already had his pencil out. He had written:

"Oh, fay can you fee?"

I said, "Fing Fomething Fimple."

"Filly, ifn't it?" he said, and kept on writing.

I wrote: "Fing a Fong of Fixpence."

"Oh, ftop the fongs," he said, "Too eafy." He wrote: "The Courtfhip of Miles Fandifh," "I fee a fquirrel," "I undereftimate ftatefmanfhip," "My fifter feems fuperfenfitive," and, seeing that I did not appreciate the last one, which he evidently thought very fine, he wrote: "Forry to fee you fo ftupid."

I ate my lunch grouchily. How could I help it if he was in practice and I was not? He had probably taken this train before.

"Pafs the falt," I said.

"Pleafe pafs the falt," he triumphed.

I paid no attention. "Waiter!" I said. The waiter did not budge.

"You muft fpeak the language," said the man opposite me. "Fay, Fteward!"

The waiter jumped to attention. "Fir?" he said.

"Pleafe fill the faltcellar."

"The faltcellar fhall be replenifhed inftantly," replied the waiter, with a superior gleam in his eyes.

I smiled and my companion unbent a little. "Let's try for hard ones," he invited.

"Farcafm," he said.


"Fubfiftence," he scored.


"S's inside now," he ruled.

"Perfuafive," I said instantly.




"Nonfenfe," I finished. "Fon of a fpeckled fea monfter."

"Ftep-fon of a poifonous fnake!" he cried.

"You don't fay fo!" I retorted.

"I do fay fo," he replied, getting up and leaving the diner.

"Fool!" I called after him, fniffling.

Well, fo much for that.

Reading Baroque scientific papers, you see a lot of long-medial-s. Opening to a random page of the Philosophical Experiments and Observations of Robert Hooke, for example, we have:

The ſecond Experiment, was made, to ſhew a Way, how to find the true and comparative Expanſion of any metal, when melted, and ſo to compare it both with the Expanſion of the ſame metal, when ſolid, and likewiſe with the Expanſion of any other, either fluid or ſolid Body.

As I read more of this sort of thing, I went through several phases. At first I just found it confusing. Then later I started to get good at reading the words with f's instead of s's and it became funny. ("Fhew! Folid! Hee hee!") Then it stopped being funny, although I still noticed it and found it quaint and charming. Also a constant reminder of how learned and scholarly I am, to be reading this old stuff. (Yes, I really do think this way. Pathetic, isn't it? And you are an enabler of this pathetic behavior.) Then eventually I didn't notice it any more, except in a few startling cases, such as when Dr. Hooke wrote on the tendency of ice to incorporate air bubbles while freezing, and said " the ſame time it may not be ſaid to ſuck it in".

What hasn't happened, however: it hasn't become completely transparent. The long s really does look a lot like an f, so much so that I can find it confusing when the context doesn't help me out. The fact that these books are always facsimiles and that the originals were printed on coarse paper and the ink has smudged, does not make it any easier to tell when one is looking at an s and when at an f. So far, the most difficult instance I have encountered involved a reference to "the Learned Dr. Voſſius". Or was it Voffius? Or Vofſius? Or was it Voſfius? Well, I found out later it was indeed Vossius; this is Dr. Gerhard Johann Voss (1577-1649), Latinized to "Vossius". But I was only able to be sure because I encountered the name somewhere else with the short s's.

This typographic detail raises a question of scholarly ethics that I don't know how to answer. In an earlier article, I needed to show how 17th-century writers referred to dates early in the year, which in common nomenclature occurred during one year, but which legally were part of the preceding year. Simply quoting one of these writers wasn't enough, because the date was disambiguated typographically, with the digit for the legal year directly above the digit for the conventional year. So I programmed TeX to demonstrate the typography:

 To this I W.D. shall add another Remark I find in the minutes of the {\it Royal Society\/}, {\it February\/} 20. $167^8_9$, {\it viz.\/}$\ldots$

But this raised another problem: to what degree should I reproduce the original typography? There is a scale here of which substitutions are more or less permissible:

  1. Most permissible is to replace the original 17th-century font with a modern one.

  2. Slightly less permissible would be to reduce the heavy 17th-century usage of italic face, in Royal Society for example, replacing it with roman typefaces.

  3. Slightly less permissible still would be to replace the 17th-century capitalization conventions with 20th-century conventions. For example, in C20 we would not capitalize "Remark".

  4. Then can I replace obsolete 17th-century contractions such as "consider'd" with 20th-century equivalents such as "considered"? If that is acceptable, then what about "'tis"? Can I replace "3dly" with "thirdly"?

  5. Can I replace obsolete Baroque spellings such as "plaister", "fatt", and "it self" with "plaster", "fat", and "itself"?

  6. Can I replace obsolete Baroquisms such as "strow'd" in "strow'd on Ice" with "strewn", or "stopple" with "stopper"?

  7. At the bottom of the list, I could just rewrite the whole thing in a modern style and pass it off as what Derham actually wrote.

It seems to me that replacing the long medial s's with short ones is toward the top of this scale. By doing this, I am not changing the spelling, because a long medial s is still an s; I am just replacing one s with another, and this is akin to changing the font. And anyway, my choice was forced, because stock TeX does not have any way to make a long medial s.

[ Addendum 20060405 ]


Sun, 12 Mar 2006

Naomi Wolf and Big Ethel
Aaron Swartz has done a text search of The Beauty Myth and concluded that Wolf never intended Big Ethel to serve as an example of intelligence, contrary to what I asserted in my previous article. M. Swartz says:

Judging from a search on Amazon, the only time Ethel is mentioned is in the context of noting that an attractive woman is often paired with an unattractive one: "... Veronica and Ethel in Riverdale; ... and so forth. Male culture seems happiest to imagine two women together when they are defined as being one winner and one loser in the beauty myth." (59f)

I still question the aptness of the example, since, again, the principal case in which two women are imagined together in Archie comics is not Veronica and Ethel, but Veronica and Betty, both of whom are portrayed as "winners". Betty and Veronica are major characters; Ethel is not. But the error isn't nearly as serious as the one I said Wolf had made.

The most serious error here is mine: I should have considered and discussed the possibility that my friend was misquoting Wolf. That I didn't do this was unfair to Wolf and entirely my fault. Since I haven't read the book myself, I should have realized what shaky ground I was on, and taken pains to point this out. And yet other possibilities are:

  • That my friend didn't misquote Wolf at all, and I misunderstood her at the time, or
  • that my friend correctly quoted Wolf and I understood her at the time, but my memory of the episode (which occurred around 1993) is faulty.

I took Vallely to task for poor research and for failing to pick up a dictionary to confirm some of his assertions. Had I taken my own advice, I would have checked to see what Wolf said before commenting on it. My disclaimer in the original article that I had not read the book relieves me of only part of the responsibility for this failure.

[Other articles in category /lang/etym] permanent link

On saying too much, or, bad things come in threes
Long ago, I had a conversation with a woman who had recently read Naomi Wolf's book The Beauty Myth. She was extolling the book, which I had not read, and mentioned that Wolf had an extensive discussion of the popular dichotomy between beauty and intelligence. She told me that Wolf had cited Archie comics as containing an example of this dichotomy, in the characters of Veronica and Big Ethel.

I had been nodding and agreeing up to that point. But at the mention of Big Ethel I was quite startled, and said that that spoiled the argument for me, and made me doubt the conclusion. I now had doubts about what had seemed so plausible a moment before.

Veronica is indeed one half of a contrasting pair in Archie comics. But Veronica and Big Ethel? No. Veronica is not complementary to Big Ethel. The counterpart of Veronica is Betty. The contrast is not between beauty and brains but between rich and poor, and between their derived properties, spoiled and sweet. A good point could be made about Veronica and Betty, but it was not the point that Wolf wanted to make; her citation of Veronica and Big Ethel as exemplifying the opposition of beauty and intelligence was just bizarre. Big Ethel, to my knowledge, has never been portrayed as unusually intelligent. She is characterized by homeliness and by her embarrassing and unrequited attraction to Jughead, not by intelligence.

Why would this make me doubt the conclusion of Wolf's argument? Because I had been fully ready to believe the conclusion, that our culture manufactures a division between attractiveness and intelligence for women, and makes them choose one or the other. I had imagined that it would be easy to produce examples demonstrating the point. But the example Wolf chose was completely inept. And, as I said at the time, "Naomi Wolf is very smart, and has studied this closely and thought about it for a long time. If that is the best example that she can come up with, then perhaps I'm wrong, and there really aren't as many examples as I thought there would be." Without the example, I would have agreed with the conclusion. With the example, intended to support the conclusion, I wasn't so sure.

Now, I come to the real point of this note. Paul Vallely has written an article for The Independent on "How Islamic inventors changed the world". He lists twenty of the most influential contributions of the Muslim world, including the discovery of coffee, inoculation, and the fountain pen. I am not so clear on the history of the technology here. Some of it I know is correct; some is plausible; some is extremely dubious. (The crank, not invented before 1206? Please.) But the whole article is spoiled for me, except as a topic of derision, because of three errors.

Item #1 concerns the discovery of the coffee bean. One might expect this to have been discovered in prehistoric times by local Ethiopians, long before the founding of Islam. But I'm in no position to argue with it, and I was ready to give Vallely the benefit of the doubt.

Item #2 on Vallely's list was more worrying. It says "Ibn al-Haitham....set up the first Camera Obscura (from the Arab word qamara for a dark or private room)." It may or may not be true that "qamara" is an "Arab word" (by which I suppose Vallely means an "Arabic word") for "chamber", but it is certainly true that this word, if it exists, is not the source of the English word "camera". I don't know from "qamara", but "camera obscura" is Latin for "dark chamber". "Camera" means "chamber" in Latin and has for thousands of years. The two words, in fact, are etymologically the same, which is why they have almost the same spelling. It is for this reason that the part of a legal hearing held in the judge's private chambers is said to be "in camera".

There might be an Arabic word "qamara", for all I know. If there is, it might be derived from the Latin. (The Latin word is not derived from Arabic, either; it is from Greek καμαρα, which refers to anything with an arched cover.) Two things are sure: The English word "camera" is not derived from Arabic, and Vallely did not bother to pick up a dictionary before he said that it was.

Anyone can make a mistake. But I started to get excited when I read item 3, which is about the game of chess. Vallely says "The word rook comes from the Persian rukh, which means chariot." This is true, sort of, but it is off in a subtle way. The rooks or castles of modern chess did start out as chariots. (Moving castles around never did make much sense.) And "rook" is indeed from Persian rukh. But rukh doesn't exactly mean a chariot. It means a chariot in the game of chess. The Persian word for a chariot outside of chess was different. (I don't remember what it was.) Saying that rukh is the Persian word for chariot is like saying that "rook" is the English word for castle.

I was only on item 3 and had already encountered one serious error of etymology and one other item which, although it wasn't exactly an error, was peculiar. I considered that I wouldn't really have enough material for a blog post, unless Vallely made at least one more serious mistake. But there were still 17 of 20 items left. So I read on. Would Vallely escape?

No, or I would not have written this article. Item 17 says "The modern cheque comes from the Arabic saqq, a written vow to pay for goods when they were delivered...". But no. The correct etymology is fascinating and bizarre. "Cheque" is derived from Norman French "exchequer", which was roughly the equivalent of the treasury and internal revenue department in England starting around 1300. Why was the internal revenue department called the exchequer? Because it was named after the chessboard, which was also called "exchequer".

What do chessboards have to do with internal revenue? Ah, I am glad you wondered. Hindu-Arabic numerals had not yet become popular in Europe; numbers were still recorded using Roman numerals. It is extremely difficult to calculate efficiently with Roman numerals. How, then, did the internal revenue department calculate taxes owed and amounts payable?

They used an abacus. But it wasn't an abacus like modern Chinese or Japanese abacuses, with beads strung on wires. A medieval European abacus was a table with a raised edge and a grid of squares ruled on it. The columns of squares represented ones, tens, hundreds, and so on. You would put metal counters, called jettons, on the squares to represent numbers. Three jettons on a "hundred" square represented three hundred; four jettons on the square to its right represented forty. Each row of squares recorded a separate numeral. To add two numerals together, just take the jettons from one row, move them to the other row, and then resolve the carrying appropriately: Ten jettons on a square can be removed and replaced with a single jetton on the square to the left.
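The procedure is simple enough to sketch in a few lines of Python. This is only an illustration of the description above, not a reconstruction of any historical practice: the names `add_rows`, `resolve_carries`, and `value` are made up for the example, and I am assuming that a row can be modeled as a list of jetton counts running from the leftmost (largest) column to the rightmost (ones) column.

```python
def resolve_carries(row):
    """Replace every ten jettons on a square with one jetton on the square to its left."""
    squares = list(row)
    # Work from the ones column (rightmost) toward the left.
    for i in range(len(squares) - 1, 0, -1):
        carry, squares[i] = divmod(squares[i], 10)
        squares[i - 1] += carry
    # If the leftmost square overflows, open a new column to its left.
    while squares and squares[0] >= 10:
        carry, squares[0] = divmod(squares[0], 10)
        squares.insert(0, carry)
    return squares


def add_rows(a, b):
    """Add two rows by sliding the jettons of one onto the other, then carrying."""
    width = max(len(a), len(b))
    # Pad the shorter row with empty squares on the left so the columns line up.
    a = [0] * (width - len(a)) + list(a)
    b = [0] * (width - len(b)) + list(b)
    merged = [x + y for x, y in zip(a, b)]
    return resolve_carries(merged)


def value(row):
    """Read a row back as an ordinary integer."""
    total = 0
    for count in row:
        total = total * 10 + count
    return total


# Three jettons on the hundreds square and four on the tens square: 340.
print(add_rows([3, 4, 0], [2, 8, 5]))
```

Merging the two rows first gives 5, 12, and 5 jettons on the three squares. The twelve jettons on the tens square are then resolved into two there and one more on the hundreds square. Nothing in this requires positional numerals on paper, which is the point.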

The internal revenue department, the "exchequer", got its name from these counting-boards covered with ruled squares like chessboards.

(The word "exchequer" meaning a chessboard was derived directly from the name of the game: Old French eschecs, Medieval Latin scacci, and so on, all from shah, which means "king" in Persian. The word "checkered" is also closely related.)

So, in summary: the game is "chess", or eschek in French; the board is therefore exchequer, and since the counting-tables of the treasury department look like chessboards, the treasury department itself becomes known as the exchequer. The treasury department, like all treasury departments, issues notes promising to pay certain sums at certain times, and these notes are called "exchequer notes" or just "exchequers", later shortened (by the English) to "cheques" or (by Americans) to "checks". Arabic saqq, if there is such a word, does not come into it. Once again, it is clear that Vallely's research was shoddy.

While I was writing up this article, yet another serious error came to light. Item 11 says "The windmill was invented in 634 for a Persian caliph...". Now, I am not very knowledgeable about history, and my historical education is very poor. But that was so peculiar that it startled even me. 634 seemed to me much too early for any clever inventions to be attributed to Muslims. Then I looked it up, and so it was. Muhammad himself had only died in 632.

As for the Persian caliph Vallely mentions, he did not exist. The caliphs are the successors of Muhammad, so of course there was one in 634: the first one, in fact. Abu Bakr reigned from the death of the Prophet in 632 until his own death in 634; he was succeeded by `Umar. Neither was Persian. They were both Arabs, as you would expect of Muslim leaders in 634. There were no Persian caliphs in 634.

My own ignorance of Islam and its history is vast and deep, but at least I had a vague idea that 634 was extremely early. Vallely could have looked up the date of the founding of the caliphate as easily as I did. Why didn't he? Well, perhaps it was just a typo, and should have said 834 or 934. In that case it's just poor editing and inattention. But perhaps it was a genuine factual error, in which case Vallely was not only not paying attention, but is apparently even less familiar with Islamic history than I am, difficult as that is to achieve. In which case we have this article about the twenty greatest contributions of Islam written by a guy who literally does not know the first thing about Islam.

And so this article, which I hoped to enjoy, was spoiled by a series of errors. I am very sympathetic to the idea that the brilliant history of Islamic science and engineering has been neglected by European scholarship. One of my very first blog posts was about the Islamic use of algebra to solve complex probate problems. Just last week I was reading about al-Biruni's invention, around 1000 years ago, of an improved method for measuring the size of the earth, a topic that Vallely treats as item 18. But after reading Vallely's article, I worried a bit that the case might have been overstated. Perhaps the contributions of Muslims are not as large as I had thought?

Fortunately, there was an alternative: the conclusion is correct, and the inept support from the author speaks only to the author's ineptness, not to the validity of the conclusion. I did not have that alternative with Naomi Wolf, who is not inept. (Also, see this addendum.)

With only cursory attention, I found three major errors of fact in this one short article. How many more did I miss, I wonder? Did Abbas ibn Firnas really invent a working parachute, as Vallely says? Maybe it was someone else. Maybe there was no parachute. Maybe there was, but it didn't work. Maybe the whole thing is a propaganda invention by someone who wants to promote Islam, and has suckered Vallely into repeating fiction. Maybe all of these. Someone knows the truth, but it isn't me, and I can't trust Vallely.

Were the Turks vaccinating people eighty years before the Europeans, or did Vallely swallow a tall tale? I don't know, and I can't trust Vallely.

People sometimes joke "I am stupider for having read this," but I really believe this was the case here. The article is worse than useless, because it has polluted my brain with a lot of unreliable non-information. I will have to be careful not to think that quilted fabrics were first brought to Europe by the crusaders, who got them from the Muslims. My real fear is that the "fact" will remain in my brain for years, long after I have forgotten how unreliable Vallely is, and that I will bring it out again as real information, which it is not. True or not, it is too unreliable to be information.

The best I can hope for now is that I will forget everything Vallely says, and meet the true parts again somewhere else in the future. In the meantime, I am worse off for having read it.

[ Addendum 20200204: Thirteen years later, it occurred to me to wonder: Why does Arabic chess have chariots anyway? ]

[Other articles in category /lang/etym] permanent link

Thu, 02 Feb 2006

Petard corrections
Eric Cholet has written in to mention that he is familiar with the fried choux pastry that I mentioned yesterday, but under the name pets de nonne, not pets de soeurs, as I said. (Nonne, of course, is "nun". The word soeur is literally "sister", but in this context means "nun".) I had cited On Food and Cooking as mentioning pets de soeurs, but it agrees with Eric, not with me.

It appears, though, that many people do use the name pets de soeurs to refer to these fritters, and some people also use it to refer to a kind of soda-raised cinnamon roll. Citations to various cookbooks are available through the usual searches.

Eric also points out that petard is the current word for a firecracker, and also now refers to a doobie. I was already aware of this because pictures of those things appeared when I did a Google image search for petard. Thank you, Eric.

[Other articles in category /lang/etym] permanent link

Tue, 31 Jan 2006

A petard is a Renaissance-era bomb, basically a big firecracker: a box or small barrel of gunpowder with a fuse attached. Those hissing black exploding spheres that you see in Daffy Duck cartoons are petards. Outside of cartoons, you are most likely to encounter the petard in the phrase "hoist with his own petard", which is from Hamlet. Rosencrantz and Guildenstern are being sent to England with the warrant for Hamlet's death; Hamlet alters the warrant to contain R&G's names instead of his own. "Hoist", of course, means "raised", and Hamlet is saying that it is amusing to see someone screw up his own petard and blow himself sky-high with it.

This morning I read in On Food and Cooking that there's a kind of fried choux pastry called pets de soeurs ("nuns' farts") because they're so light and delicate. That brought to mind Le Pétomane, the world-famous theatrical fartmaster. Then there was a link on reddit titled "Xmas Petard (cool gif video!)" which got me thinking about petards, and it occurred to me that "petard" was probably akin to pets, because it makes a bang like a fart. And hey, I was right; how delightful.

Another fart-related word is "partridge", so named because its call sounds like a fart.

[ Update 20060202: Some corrections ]

[Other articles in category /lang/etym] permanent link

Thu, 26 Jan 2006

"Farther" vs. "further"
People mostly use "farther" and "further" interchangeably. What's the difference?

I looked it up in the dictionary, and it turns out it's simple. "Farther" means "more far". "Further" means "more forward".

"Further" does often connote "farther", because something that is further out is usually farther away, and so in many cases the two are interchangeable. For example, "Hitherto shalt thou come, but no further" (Job 38:11.)

But now when I see people write things like China Steps Further Back From Democracy (The New York Times, 26 November 1995) or, even worse, Big Pension Plans Fall Further Behind (Washington Post, 7 June 2005) it freaks me out.

Google finds 3.2 million citations for "further back", and 9.5 million for "further behind", so common usage is strongly in favor of this. But a quick check of the OED does not reveal much historical confusion between these two. Of the citations there, I can only find one that rings my alarm bell. ("1821 J. BAILLIE Metr. Leg., Wallace lvi, In the further rear.")

[Other articles in category /lang] permanent link