The Universe of Discourse

Wed, 12 Sep 2018

Language fluency in speech and print

Long ago I worked among the graduate students at the University of Pennsylvania department of Computer and Information Sciences. Among other things, I did system and software support for them, and being about the same age and with many common interests, I socialized with them also.

There was one Chinese-Malaysian graduate student who I thought of as having poor English. But one day, reading one of his emailed support requests, I was struck by how clear and well-composed it was. I suddenly realized I had been wrong. His English was excellent. It was his pronunciation that was not so good. When speaking to him in person, this was all I had perceived. In email, his accent vanished and he spoke English like a well-educated native. When I next met him in person I paid more careful attention and I realized that, indeed, I had not seen past the surface: he spoke the way he wrote, but his accent had blinded me to his excellent grammar and diction.

Once I picked up on this, I started to notice better. There were many examples of the same phenomenon, and also the opposite phenomenon, where someone spoke poorly but I hadn't noticed because their pronunciation was good. But then they would send email and the veil would be lifted. This was even true of native speakers, who can get away with all sorts of mistakes because their pronunciation is so perfect. (I don't mean perfect in the sense of pronouncing things the way the book says you should; I mean in the sense of pronouncing things the way a native speaker does.) I didn't notice this unless I was making an effort to look for it.

I'm not sure I have anything else to say about this, except that it seems to me that when learning a foreign language, one ought to consider whether one will be using it primarily for speech or primarily for writing, and optimize one's study time accordingly. For speech, concentrate on good pronunciation; for writing, focus on grammar and diction.

Hmm, put that way it seems obvious. Also, the sky is sometimes blue.

[Other articles in category /lang] permanent link

Sun, 29 Apr 2018

Lipogrammatic math posts

In August 2011, on a particular famous discussion forum (brought up on this blog again and again) an individual A, notorious for such acts, posts a quasi-philosophical inquiry, incurring unpopularity, antagonism, and many bad marks, although also a surprising quantity of rational discussion, including a thoughtful solution or two.

Many months forward, a distinct party B puts up a substantial bounty on this inquiry, saying:

I would like a complete answer to this question which does not use the letter "e" at any point.

(My apology for any anguish you may go through at this point in my story on account of this quotation and its obvious and blatant faults. My wrongdoing was involuntary, but I had no way to avoid it and still maintain full accuracy.)

By and by, a valiant third individual constructs a brilliant disquisition satisfying this surprising condition and thus obtains B's award.

Now, this month, in our group's accompanying policy board, a fourth collaborator, a guy (or gal, for all I know) I shall call D, and who I think may lack a minimal inclination for fun, finds fault with A's original post and particularly with B's bounty, and complains as follows:

Should we discourage bounties that encourage “clever” but unclear answers?

(Again, I must ask you for absolution. This is a word-for-word quotation.)

A thorough dismissal of OP's complaint, from a fifth author, adds a fully satisfactory finish to our affair.

Wed, 04 Apr 2018

Genealogy of the Saudi royal family

[ Note: None of this is a joke, nothing here is intended humorously, and certainly none of it should be taken as mockery or disparagement. The naming conventions of Saudi royalty are not for me to judge or criticize, and if they cause problems for me, the problems are my own. It is, however, a serious lament. ]

The following innocuous claim appears in Wikipedia's article on Abdullah bin Abdul-Rahman:

He was the seventh son of the Emir of the Second Saudi State, Abdul Rahman bin Faisal.

Yesterday I tried to verify this claim and I was not able to do it.

Somewhere there must be a complete and authoritative pedigree of the entire Saudi royal family, but I could not find it online, perhaps because it is very big. There is a Saudi royal family official web site, and when I found that it does have a page about the family tree, I rejoiced, thinking my search was over. But the tree only lists the descendants of King Abdulaziz Ibn Saud, founder of the modern Saudi state. Abdullah was his half-brother and does not appear there.

Well, no problem, just Google the name, right? Ha!

Problem 1: These princes all have at least twenty kids each. No, seriously. The Wikipedia article on Ibn Saud himself lists twenty-one wives and then gives up, ending with an exhausted “Possibly other wives”. There is a separate article on his descendants that lists 72 children of various sexes, and the following section on grandchildren begins:

Due to the Islamic traditions of polygyny and easy divorce (on the male side), King Abdul Aziz [Ibn Saud] has approximately a thousand grandchildren.

Problem 2: They reuse many of the names. Because of course they do; if wife #12 wants to name her first son the same as the sixth son of wife #2, why not? They don't live in the same house. So among the children of Ibn Saud there are two Abdullahs (“servant of God”), two Badrs (“full moon”), two Fahds (“leopard”), two each of Majid (“majestic”), Mishari (I dunno), Talal (dunno), and Turki (“handsome”). There are three sons named Khalid (“eternal”). There is a Sa'ad and a Saad, which I think are the exact same name (“success”) as spelled by two different Wikipedia editors.

And then they reuse the names intergenerationally. Among Ibn Saud's numerous patrilineal grandsons there are at least six more Fahds, the sons respectively of Mohammed, Badr (the second one), Sultan, Turki (also the second one), Muqrin, and Salman. Abdulaziz Ibn Saud has a grandson also named Abdulaziz, whose name is therefore Abdulaziz bin Talal bin Abdulaziz Al Saud. (The “bin” means “son of”; the feminine form is “bint”.) It appears that the House of Saud does not name sons after their fathers, for which I am grateful.

Ibn Saud's father was Abdul Rahman (this is the Abdul Rahman of Abdullah bin Abdul-Rahman, who is the subject of this article. Remember him?) One of Ibn Saud's sons is also Abdul Rahman, I think probably the first one to be born after the death of his grandfather, and at least two of his patrilineal grandsons are also.

Problem 3: Romanization of Arabic names is done very inconsistently. I mentioned “Saad” and “Sa'ad” before. I find the name Abdul Rahman spelled variously “Abdul Rahman”, “Abdulrahman”, “Abdul-Rahman”, and “Abd al-Rahman”. This makes text searches difficult and unreliable. (The name, by the way, means “Servant of the gracious one”, referring to God.)
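One way to make such searches a little less unreliable is to fold the variant spellings into a single search key before comparing. The following is only a rough sketch: the function name `normalize` and the folding rules are assumptions invented for illustration, tuned to the handful of spellings quoted above, and they are not a general romanization scheme.

```python
import re

def normalize(name):
    """Fold a romanized name into a crude search key (hypothetical rules)."""
    key = name.lower()
    # Treat "abd al-" as another spelling of "abdul" (assumption, covers "Abd al-Rahman").
    key = re.sub(r"\babd[\s-]+al[\s-]*", "abdul", key)
    # Drop spaces, hyphens, and apostrophes, so "Sa'ad" and "Saad" collapse together.
    key = re.sub(r"[\s'’\-]+", "", key)
    return key

variants = ["Abdul Rahman", "Abdulrahman", "Abdul-Rahman", "Abd al-Rahman"]
keys = {normalize(v) for v in variants}
```

With these rules all four spellings end up under one key, and so do “Saad” and “Sa'ad”. Any spelling not anticipated by the rules would still slip through.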

Problem 4: None of these people has a surname. Instead they all go by patronymics. Ibn Saud has six grandsons named Fahd; how do you tell them apart? No problem, their fathers all have different names, so they are Fahd bin Mohammed, Fahd bin Badr, Fahd bin Sultan, Fahd bin Turki, Fahd bin Muqrin, and Fahd bin Salman. But again this confuses text searches terribly.

You can search for “Abdullah bin Abdul-Rahman” but many of the results will be about his descendants Fahd bin Abdullah bin Abdul Rahman, Fahd bin Khalid bin Abdullah bin Abdul Rahman, Fahd bin Muhammad bin Abdullah bin Abdul Rahman, Abdullah bin Bandar bin Abdullah bin Abdul Rahman, Faisal bin Abdullah bin Abdul Rahman, Faisal bin Abdul Rahman bin Abdullah bin Abdul Rahman, etc.
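A search hit about the man himself can at least be told apart from a hit about one of his descendants by splitting the name into its chain of patronymics. This is a minimal sketch; the helper names `patronymic_chain`, `is_person`, and `is_descendant` are made up for the example, and it assumes the spelling of each name has already been made consistent.

```python
import re

def patronymic_chain(full_name):
    """Split 'X bin Y bin Z' (or 'bint') into ['X', 'Y', 'Z']."""
    return [part.strip() for part in re.split(r"\s+bint?\s+", full_name.strip())]

def is_person(full_name, given, father):
    """True if the name starts with the given name followed by the father's name."""
    chain = patronymic_chain(full_name)
    return chain[:2] == [given, father]

def is_descendant(full_name, given, father):
    """True if the person appears further down the chain, as an ancestor."""
    chain = patronymic_chain(full_name)
    return any(chain[i:i + 2] == [given, father] for i in range(1, len(chain) - 1))

hits = [
    "Abdullah bin Abdul Rahman",
    "Fahd bin Abdullah bin Abdul Rahman",
    "Fahd bin Khalid bin Abdullah bin Abdul Rahman",
]
about_him = [h for h in hits if is_person(h, "Abdullah", "Abdul Rahman")]
```

This does nothing about the deeper problem that several different people can share the same two-name chain.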

In combination with the reuse of the same few names, the result is even more confusing. There is Bandar bin Khalid, and Khalid bin Bandar; Fahad bin Khalid and Khalid bin Fahd.

There is Mohammed al Saud (Mohammed of (the house of) Saud) and Mohammed bin Saud (Mohammed the son of Saud).

There are grandsons named Saad bin Faisal, Faisal bin Bandar, Bandar bin Sultan, Sultan bin Fahd, Fahd bin Turki, Turki bin Talal, Talal bin Mansour, Mansour bin Mutaib, Mutaib bin Abdullah, and Abdullah bin Saad. I swear I am not making this up.

Perhaps Abdullah was the seventh son of Abdul Rahman.

Perhaps not.

I surrender.

[Other articles in category /lang] permanent link

Thu, 22 Mar 2018

Does Skaði choose the husband with the best butt?

(Warning: I do not know anything about Old Norse, so everything I say about it should be understood as ill-informed speculation. I welcome corrections.)

In one of my favorite episodes from Norse mythology, the Æsir owe a payment to the Jötunn Skaði in compensation for killing her father. But they know she is very wealthy, and offer her an alternative compensation: one of their men in marriage.

Skaði wants to marry Baldr, because he is extremely handsome. But Baldr is already married. Odin proposes a compromise: the Æsir will line up behind a short curtain, and Skaði will choose her husband. She will marry whomever she picks; if she can pick out Baldr by his legs, she can have him. Skaði agrees, assuming that the beautiful Baldr will have the best legs.

(She chooses wrong. Njörðr has the best legs.)

Thinking on this as an adult, I said to myself “Aha, this is like that horn full of milk that was actually mead. I bet this was also cleaned up in the version I read, and that in the original material, Skaði was actually choosing the husband with the best butt.”

I went to check, and I was wrong. The sources say she was looking only at their feet.

I was going to just quote this:

she should choose for herself a husband from among the Æsir and choose by the feet only, seeing no more of him.

But then I got worried. This is of course not the original source but an English translation; what if it is inaccurate?

Well, there was nothing else to do but ask Snorri about it. He says:

En æsir buðu henni sætt ok yfirbætr ok it fyrsta, at hon skal kjósa sér mann af ásum ok kjósa at fótum ok sjá ekki fleira af.

(Sætt is recompense or settlement; yfirbætr similarly. (Bætr is a cure, as in “I was sick, but I got better”.) The first (fyrsta) part of the settlement is that she “shall choose a man for herself” (skal kjósa sér mann) but choose by the feet (kjósa at fótum) seeing nothing else (sjá ekki fleira af).)

The crucial word here is fótum, which certainly looks like “foot”. (It is the dative form of fótr.) Could it possibly mean the buttocks? I don't think so. It's hard to be 100% certain, because it could be a euphemism — anything could be a euphemism for the buttocks if you paused before saying it and raised one eyebrow. (Did the Norse bards ever do this?) Also the Norse seem to have divided up the leg differently than we do. Many of the words seem to match, which is sometimes helpful but also can be misleading, because many don't. For example, I think leggr, despite its appearance, means just the shank. And I think fótum may not be just the foot itself, but some part of the leg that includes the foot.

But I'm pretty sure fótum is not the butt, at least not canonically. To do this right I would look at all the other instances of fótr to see what I could glean from the usage, but I have other work to do today. So anyway, Skaði probably was looking at their feet, and not at their butts. Oh well.

However! the other part of Skaði's settlement is that the Æsir must make her laugh. In the version I first read, Loki achieves this by tying his beard to a goat's. Nope!

Þá gerði Loki þat, at hann batt um skegg geitar nökkurrar ok öðrum enda um hreðjar sér, ok létu þau ýmsi eftir ok skrækði hvárt tveggja hátt.

Skegg geitar nökkurrar is indeed some goat's beard. But hann batt … ok öðrum enda um hreðjar sér is “he tied … the other end to his own scrotum”.

Tue, 20 Mar 2018

English's -en suffix

In English we can sometimes turn an adjective into a verb by suffixing “-en”. For example:

black → blacken
red → redden
white → whiten
wide → widen

But not

blue → bluen*
green → greenen*
yellow → yellowen*
long → longen*

(Note that I am only looking at -en verbs that are adjective-derived present tenses. This post is not concerned with the many -en verbs that are past participles, such as “smitten” (past participle of “smite”), “spoken” (“speak”), “molten” (“melt”), “sodden” (“seethe”), etc.)

I asked some linguist about this once and they were sure it was purely morphological, something like: black, red, and white end in stop consonants, and blue, green, and yellow don't.

Well, let's see:

Final sound   Example
Stop          Blacken
              Open (?)
Fricative     Coarsen
Nasal         Cleanen*
Vowel         Angrien*
Glide         Betteren*

There are some fine points:

  • “Biggen” used to exist but has fallen out of use
  • Perhaps I should have omitted “strengthen” and “hasten”, which are derived from nouns, not from adjectives
  • I'm not sure whether “closen”, “hotten” and “wetten” are good or bad so I left them off
  • “moisten” and “soften” might belong with the stops instead of the fricatives
  • etc.

but clearly the morphological explanation wins. I'm convinced.

[ Addendum: Wiktionary discusses this suffix, distinguishing it from the etymologically distinct participial “-en”, and says “it is not currently very productive in forming new words, being mostly restricted to monosyllabic bases which end in an obstruent”. ]
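The “ends in an obstruent” rule is easy to try out on the examples above. The sketch below guesses the final sound from spelling alone, which is a crude assumption (English spelling is a poor guide to pronunciation), and the letter classes and the names `final_sound_class` and `predicts_en` are invented for illustration. It also ignores the “monosyllabic” half of Wiktionary's description.

```python
STOPS = set("pbtdkgcq")
FRICATIVES = set("fvszx")
NASALS = set("mn")
GLIDES = set("lr")                       # following the table, which files "better" here
FRICATIVE_DIGRAPHS = ("th", "sh", "gh")  # spelling-based guess, e.g. "rough"

def final_sound_class(adjective):
    """Guess the class of an adjective's final sound from its spelling."""
    w = adjective.lower()
    # Drop a silent final "e" after a consonant: "white" -> "whit", "coarse" -> "coars".
    if len(w) > 2 and w.endswith("e") and w[-2] not in "aeiou":
        w = w[:-1]
    if w.endswith("ng"):
        return "nasal"
    if w.endswith(FRICATIVE_DIGRAPHS):
        return "fricative"
    last = w[-1]
    if last in STOPS:
        return "stop"
    if last in FRICATIVES:
        return "fricative"
    if last in NASALS:
        return "nasal"
    if last in GLIDES:
        return "glide"
    return "vowel"  # a, e, i, o, u, and final w or y as in "yellow", "angry"

def predicts_en(adjective):
    """True if the obstruent rule predicts that an -en verb should exist."""
    return final_sound_class(adjective) in {"stop", "fricative"}

colors_and_sizes = ["black", "red", "white", "wide", "blue", "green", "yellow", "long"]
predicted = [w for w in colors_and_sizes if predicts_en(w)]
```

On the eight adjectives from the top of this article, the rule picks out black, red, white, and wide, and rejects the other four.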

Sun, 07 Jan 2018

The horn of milk

When I was a kid I had a book of “Myths and Legends of the Ages”, by Marion N. French. One of the myths was the story of Thor's ill-fated visit to Utgard. The jötunns of Utgard challenge Thor and Loki to various contests and defeat them all through a combination of talent and guile. In one of these contests, Thor is given a drinking horn and told that even the wimpiest of the jötunns is able to empty it of its contents in three drinks. (The jötunns are lying. The pointy end of the horn has been invisibly connected to the ocean.)

The book specified that the horn was full of milk, and as a sweet and innocent kiddie I did not question this. Decades later it hit me suddenly: no way was the horn filled with milk. When the mighty jötunns of Utgard are sitting around in their hall, they do not hold contests to see who can drink the most milk. Obviously, the horn was full of mead.

The next sentence I wrote in the draft version of this article was:

   In the canonical source material (poetic edda maybe?) the horn is full
   of *mead*. Of course it is.

In my drafts, I often write this sort of bald statement of fact, intending to go back later and check it, and perhaps produce a citation. As the quotation above betrays, I was absolutely certain that when I hunted down the original source it would contradict Ms. French and say mead. But I have now hunted down the canonical source material (in the Prose Edda, it turns out, not the Poetic one) and as far as I can tell it does not say mead!

Here is an extract of an 1880 translation by Rasmus Björn Anderson, provided by WikiSource:

He went into the hall, called his cup-bearer, and requested him to take the sconce-horn that his thanes were wont to drink from. The cup-bearer immediately brought forward the horn and handed it to Thor. Said Utgard-Loke: From this horn it is thought to be well drunk if it is emptied in one draught, some men empty it in two draughts, but there is no drinker so wretched that he cannot exhaust it in three.

For comparison, here is the 1916 translation of Arthur Gilchrist Brodeur:

He went into the hall and called his serving-boy, and bade him bring the sconce-horn which the henchmen were wont to drink off. Straightway the serving-lad came forward with the horn and put it into Thor's hand. Then said Útgarda-Loki: 'It is held that this horn is well drained if it is drunk off in one drink, but some drink it off in two; but no one is so poor a man at drinking that it fails to drain off in three.'

In both cases the following text details Thor's unsuccessful attempts to drain the horn, and Utgard-Loki's patronizing mockery of him after. But neither one mentions at any point what was in the horn.

I thought it would be fun to take a look at the original Old Norse to see if the translators had elided this detail, and if it would look interesting. It was fun and it did look interesting. Here it is, courtesy of Heimskringla.NO:

Útgarða-Loki segir, at þat má vel vera, ok gengr inn í höllina ok kallar skutilsvein sinn, biðr, at hann taki vítishorn þat, er hirðmenn eru vanir at drekka af. Því næst kemr fram skutilsveinn með horninu ok fær Þór í hönd. Þá mælti Útgarða-Loki: "Af horni þessu þykkir þá vel drukkit, ef í einum drykk gengr af, en sumir menn drekka af í tveim drykkjum, en engi er svá lítill drykkjumaðr, at eigi gangi af í þrimr."

This was written in Old Norse around 1220, and I was astounded at how much of it is recognizable, at least when you already know what it is going to say. However, the following examples are all ill-informed speculation, and at least one of my confident claims is likely to be wrong. I hope that some of my Gentle Readers are Icelanders and can correct my more ridiculous errors.

“Höllina” is the hall. “Kallar” is to call in. The horn appears three times, as ‘horninu’, ‘horni’, and in ‘vítishorn’, which is a compound that specifies what kind of horn it is. “Þór í hönd” is “in Thor's hand”. (The ‘Þ’ is pronounced like the /th/ of “Thor”.) “Drekka”, “drukkit”, “drykk”, “drykkjum”, and “drykkjumaðr” are about drinking or draughts; “vel drukkit” is “well-drunk”. You can see the one-two-three in there as “einum-tveim-þrimr”. (Remember that the “þ” is a /th/.) One can almost see English in:

sumir menn drekka af í tveim drykkjum

which says “some men drink it in two drinks”. And “lítill drykkjumaðr” is a little-drinking-person, which I translated above as “wimp”.

It might be tempting to guess that “með horninu” is a mead-horn, but I'm pretty sure it is not; mead is “mjað” or “mjöð”. I'm not sure, but I think “með” here is just “with”, akin to modern German “mit”, so that:

næst kemr fram skutilsveinn með horninu

is something like “next, the skutilsveinn came with the horn”. (The skutilsveinn is something we don't have in English; compare trying to translate “designated hitter” into Old Norse.)

For a laugh, I tried putting this into Google Translate, and I was impressed with the results. It makes a heroic effort, and produces something that does capture some of the sense of the passage. It identifies the language as Icelandic, which while not correct, isn't entirely incorrect either. (The author, Snorri Sturluson, was in fact Icelandic.) Google somehow mistakes the horn for a corner, and it completely fails to get the obsolete term “hirðmenn” (roughly, “henchmen”), mistaking it for herdsmen. The skutilsveinn is one of the hirðmenn.

Anyway there is no mead here, and none in the rest of the story, which details Thor's unsuccessful attempts to drink the ocean. Nor is there any milk, which would be “mjólk”.

So where does that leave us? The jötunns challenge Thor to a drinking contest, and bring him a horn, and even though it was obviously mead, the story does not say what was in the horn.

Because why would they bother to say what was in the horn? It was obviously mead. When the boys crack open a cold one, you do not have to specify what it was that was cold, and nobody should suppose that it was a cold bottle of milk.

I imagine Marion N. French sitting by the fire, listening while Snorri tells the story of Thor and the enchanted drinking horn of Utgard:

“Utgarða-Loki called his skutilsveinn, and requested him to bring the penalty-horn that his hirðmenn were wont to drink from…”

“Excuse me! Excuse me, Mr. Sturluson! Just what were they wont to drink from it?”

“Eh, what's that?”

“What beverage was in the horn?”

“Why, mead, of course. What did you think it was, milk?”

(Merriment ensues, liberally seasoned with patronizing mockery.)

(In preparing this article, I found it helpful to consult Zoëga's Concise Dictionary of Old Icelandic of 1910.)

[ Addendum 2018-01-17: Holy cow, I was so wrong. It was so obviously not mead. I was so, so wrong. Amazingly, unbelievably wrong. ]

[ Addendum 2018-03-22: A followup in which I investigate what organs Skaði looked at when choosing her husband, and what two things Loki tied together to make her laugh. ]

Fri, 05 Jan 2018

Hebrew John Doe

Last month I wrote about the Turkish analog of “Joe Blow”. I got email from Gaal Yahas, who said

I bet you'll get plenty of replies on your last post about translating "John Doe" to different languages.

Sadly no. But M. Yahas did tell me in detail about the Hebrew version, and I did a little additional research.

The Hebrew version of “Joe Blow” / “John Doe” is unequivocally “Ploni Almoni”. This usage goes back at least to the Book of Ruth, approximately 2500 years ago. Ruth's husband has died without leaving an heir, and custom demands that a close relative of her father-in-law should marry her, to keep the property in the family. Boaz takes on this duty, but first meets with another man, who is a closer relative than he:

Then went Boaz up to the gate, and sat him down there: and, behold, the kinsman of whom Boaz spake came by; unto whom he said, Ho, such a one! turn aside, sit down here. And he turned aside, and sat down.

(Ruth 4:1, KJV)

This other relative declines to marry Ruth. He is not named, and is referred to in the Hebrew version as Ploni Almoni, translated here as “such a one”. This article in The Jewish Chronicle discusses the possible etymology of these words, glossing “ploni” as akin to “covered” or “hidden” and “almoni” as akin to “silenced” or “muted”.

Ploni Almoni also appears in the book of Samuel, probably even older than Ruth:

David answered Ahimelek the priest, “The king sent me on a mission and said to me, 'No one is to know anything about the mission I am sending you on.' As for my men, I have told them to meet me at a certain place.”

(1 Samuel 21:2, NIV)

The mission is secret, so David does not reveal the meeting place to Ahimelek. Instead, he refers to it as Ploni Almoni. There is a similar usage at 2 Kings 6:8.

Apparently the use of “Ploni” in Hebrew to mean “some guy” continues through the Talmud and up to the present day. M. Yahas also alerted me to two small but storied streets in Tel Aviv. According to this article from Haaretz:

A wealthy American businessman was buying up chunks of real estate in Tel Aviv. He purchased the two alleyways with the intention of naming them after himself and his wife, even going so far as to put up temporary shingles with the streets’ new names. But he had christened the streets without official permission from the city council.

The mayor was so incensed by the businessman’s chutzpah that he decided to temporarily name the alleyways Simta Almonit and Simta Plonit.

And so they remain, 95 years later.

(M. Yahas explains that “Simta” means “alley” and is feminine, so that Ploni and Almoni take the feminine ‘-it’ ending to agree with it.)

Wikipedia has not one but many articles on this topic and related ones.

My own tiny contribution in this area: my in-laws live in a rather distant and undeveloped neighborhood on the periphery of Seoul, and I once referred to it as 아무데도동 (/amudedo-dong/), approximately “nowhereville”. This is not standard in Korean, but I believe the meaning is clear.

Mon, 18 Dec 2017

Turkish John Doe

A few weeks ago I was writing something about Turkey, and I needed a generic Turkish name, analogous to “John Doe”. I was going to use “Osman Yılmaz”, which I think would have been a decent choice, but I decided it would be more fun to ask a Turkish co-worker what the correct choice would be. I asked Kıvanç Yazan, who kindly allowed himself to be nerdsniped and gave me a great deal of information. In the rest of this article, anything about Turkish that is correct should be credited to him, while any mistakes are surely my own.

M. Yazan informs me that one common choice is “Ali Veli”. Here's a link he gave me to Ekşisözlük, which is the Turkish analog of Urban Dictionary, explaining (in Turkish) the connotations of “John Doe”. The page also mentions “John Smith”, which in turn links to a page about a footballer named Ali Öztürk (in fact two footballers: [1] [2]), which is along the same lines as my “Osman Yılmaz” suggestion.

But M. Yazan told me about a much closer match for “John Doe”. It is:

sarı çizmeli Mehmet Ağa

which translates as “Mehmet Agha with yellow boots”. (‘Sarı’ = ‘yellow’; ‘çizmeli’ = ‘booted’.)

This oddly specific phrase really seems to be what I was looking for. M. Yazan provided several links:

  • Ekşisözlük again
  • The official dictionary of the Turkish government

    Unfortunately I can't find any way to link to the specific entry, but the definition it provides is “kim olduğu, nerede oturduğu bilinmeyen kimse” which means approximately “someone whose identity/place is unknown”.

  • A paper on “Personal Names in Sayings and Idioms”.

    This is in Turkish, but M. Yazan has translated the relevant part as follows:

    At the time when yellow boots were in fashion, a guy from İzmir put "Mehmet Aga" in his account book. When time came to pay the debt, he sent his servant and asked him to find "Mehmet Aga with yellow boots". The helper did find a Mehmet Aga, but it was not the one they were looking for. Then guy gets angry at his servant, to which his helper responded, “Sir, this is a big city, there are lots of people with yellow boots, and lots of people named Mehmet! You should write it in your book one more time!”

Another source I found was this online Turkish-English dictionary which glosses it as “Joe Schmoe”.

Finding online mentions of sarı çizmeli Mehmet Ağa is a little bit tricky, because he is also the title of a song by the very famous Turkish musician Barış Manço, and the references to this song swamp all the other results. This video features Manço's boots and although we cannot see for sure (the recording is in grayscale) I presume that the boots are yellow.

Thanks again, Kıvanç!

[ Addendum: The Turkish word for “in style” is “moda”. I guessed it was a French loanword. Kıvanç tells me I was close: it is from Italian. ]

[ Addendum 20171219: Wikipedia has an impressive list of placeholder names by language that includes Mehmet Ağa. ]

[ Addendum 20180105: The Hebrew version of Mehmet Ağa is at least 2600 years old! ]

Mon, 31 Jul 2017

Sabotaged by Polish orthography

This weekend my family was doing a bookstore event related to Fantastic Beasts and Where to Find Them. One of the movie's characters, Jacob Kowalski, dreams of becoming a baker, and arrives at a bank appointment with a suitcase full of Polish confections, including pączki, a sort of Polish jelly donut. My wife wanted to serve these at the event.

The little tail on the ą in pączki is a diacritical mark called an ogonek, which is Polish for “little tail”. If I understand correctly, this nasalizes the sound of the a so that it is more like /an/, and furthermore in modern Polish the value of this particular letter has changed so that pączki is pronounced something like “pawnch-kee”. (Polish “cz” is approximately like English “ch”.)

I was delegated to travel to Philadelphia's Polish neighborhood to obtain the pączki. This turned out to be more difficult than I expected. The first address I visited was simply wrong. When I did find the bakery I was looking for, it was sold out of pączki. The bakery across the street was closed, so I started walking down Allegheny Avenue looking for the next bakery.

Before I got there, though, I passed a storefront with a sign listing its goods and services in blue capital letters. One of the items was PACZKI. Properly, of course, this should be PĄCZKI but Poles often omit the ogonek, especially when buying blue letter decals in Philadelphia, where large blue ogoneks are often unavailable. But when I went in to ask I immediately realized that I had probably made a mistake. The store seemed to sell toiletries, paper goods, and souvenirs, with no baked goods in sight.

I asked anyway: “Your sign outside says you sell PĄCZKI?”

“No,” replied the storekeeper. “Pach-kee.”

I thought she was correcting my pronunciation. “But I thought the ogonek made it ‘pawnch-kee’?”

“No, not pawnch-kee. Pach-kee. For sending, to Poland.” She pointed at a box.

I had misunderstood the sign. It did not say PĄCZKI, but PACZKI, which I have since learned means “boxes”.

The storekeeper directed me to the deli across the street, where I was able to buy the pączki. I also bought some interesting-looking cold roast pork loin and asked what it was called. A customer told me it was “po-lend-witsa”, and from this I was able to pick out the price label on the deli case, which said “POLEDWICA”.

After my embarrassment about the boxes I was concerned that I didn't understand ogoneks as well as I thought I did. I pointed to the ‘E’. “Shouldn't there be an ogonek on the ‘E’ here?”

“Yes,” he said, and shrugged. They had left it off, just as I had (incorrectly) thought had happened on the PACZKI sign.

I think the only way to win this one would have been to understand enough of the items in blue capital letters to guess from context that it really was PACZKI and not PĄCZKI.

[ Addendum 20170803: A thirty-year-old mystery has been cleared up! When I was a teenager the news was full of the struggles of the Polish workers’ union Solidarity and its charismatic leader, Lech Walesa, later president of Poland. But his name was always pronounced ‘walensa’. Why? Last night I suddenly understood the mysterious ‘n’: the name was actually ‘Walęsa’! ]

[ (Well, not quite. That does explain the mystery ‘n’. But on looking it up, I find that the name is actually ‘Wałęsa’. The ‘W’ is more like English ‘v’ than like English ‘w’, and the ‘ł’ is apparently very much like English ‘w’. So the correct pronunciation of ‘Wałęsa’ is more like ‘va-wen-sa’ than ‘wa-len-sa’. Perhaps the people who pronounced the ę but not the W or the ł were just being pretentious.) ]

[ Addendum 20170803: Maciej Cegłowski says that “paczki” is more like “packages” than like “boxes”; Google translate suggests “parcels”. He would also like me to remind you that “paczki” and “pączki” are plural, the singulars being “paczka” and “pączek”, respectively. Alicja Raszkowska says she loves my use of “ogoneks” (the English plural) in place of the Polish “ogonki”. ]

Thu, 11 May 2017

Zomg lots more anagram stuff

I'm almost done with anagrams. For now, anyway. I think. This article is to mop up the last few leftover anagram-related matters so that I can put the subject to rest.

(Earlier articles: [1] [2] [3] [•] )

Code is available

Almost all the code I wrote for this project is available on Github.

The documentation is not too terrible, I think.

Anagram lists are available

I have also placed my scored anagram lists on my web site. Currently available are:

  • Original file from the 1990s. This contains 23,521 anagram pairs, the results of my original scoring algorithm on a hand-built dictionary that includes the Unix spellcheck dictionary (/usr/dict/words), the Webster's Second International Dictionary word list, and some lexicons copied from a contemporaneous release of WordNet. This file has been in the same place on my web site since 1997 and is certainly older than that.

  • New file from February. Unfortunately I forget what went into this file. Certainly everything in the previous file, and whatever else I had lying around, probably including the Moby Word Lists. It contains 38,333 anagram pairs.

  • Very big listing of Wikipedia article titles. (11 MB compressed) I acquired the current list of article titles from the English Wikipedia; there are around 13,000,000 of these. I scored these along with the other lexicons I had on hand. The results include 1,657,150 anagram pairs. See below for more discussion of this.

!‌!Con talk

On Saturday I gave a talk about the anagram-scoring work at !‌!Con in New York. The talk was not my best work, since I really needed 15 minutes to do a good job and I was unwilling to cut it short enough. (I did go overtime, which I deeply regret.) At least nobody came up to me afterward and complained.

Talk materials are on my web site and I will link other talk-related stuff from there when it becomes available. The video will be available around the end of May, and the text transcript probably before that.

[ Addendum 20170518: The video is available thanks to Confreaks. ]

Both algorithms are exponential

The day after the talk an attendee asked me a very good question: why did I say that one algorithm for scoring anagrams was better than the other, when they are both exponential? (Sorry, I don't remember who you were. If you would like credit, please drop me a note.)

The two algorithms are:

  • A brute-force search to construct all possible mappings from word A to word B, and then calculate the minimum score over all mappings (more details)

  • The two words are converted into a graph; we find the maximum independent set in the graph, and the size of the MIS gives the score (more details)

The answer to this excellent question begins with: just because two problems are both hard doesn't mean they are equally hard. In this case, the MIS algorithm is better for several reasons:

  1. The number of possible mappings from A to B depends on the number of repeated letters in each word. For words of length $n$, in the worst case this is something like $n!$. This quantity is superexponential; it eventually exceeds $c^n$ for all constants $c$. The naïve algorithm for MIS is only exponential, having $c=2$.

  2. The problem size for the mapping algorithm depends on the number of repeated letters in the words. The problem size for the MIS algorithm depends on the number of shared adjacent letter pairs in the two words. This is almost always much smaller.

  3. There appears to be no way to score all the mappings without constructing the mappings and scoring them. In contrast, MIS is well-studied and if you don't like the obvious $2^n$ algorithm you can do something cleverer that takes only $1.22^n$.

  4. Branch-and-bound techniques are much more effective for the MIS problem, and in this particular case we know something about the graph structure, which can be exploited to make them even more effective. For example, when calculating the score for

    chromophotolithograph photochromolithograph

    my MIS implementation notices the matching trailing olithograph parts right away, and can then prune out any part of the MIS search that cannot produce a mapping with fewer than 11 chunks. Doing this in the mapping-generating algorithm is much more troublesome.
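
To get a feel for the growth rates in points 1 and 3, here is a rough Python sketch (Python rather than the Perl used for the project; the name `mapping_count` is invented for illustration). It assumes that a letter-preserving mapping between two anagrams may permute each group of identical letters independently, so the number of mappings is the product of the factorials of the letter multiplicities. That product reaches n! only in the degenerate case where every letter is the same.

```python
from collections import Counter
from math import factorial, prod

def mapping_count(word: str) -> int:
    """Number of letter-preserving mappings from a word onto one of its anagrams.

    Hypothetical helper: each group of k identical letters can be matched
    up in k! ways, independently of the other groups.
    """
    return prod(factorial(k) for k in Counter(word).values())

# Worst-case bound n! next to the exponential bounds 2^n and 1.22^n.
for n in (8, 16, 21):
    print(n, factorial(n), 2 ** n, round(1.22 ** n))

# Actual mapping counts for two words that appear in these articles.
print(mapping_count("acrididae"))
print(mapping_count("cholecystoduodenostomy"))
```

The first loop shows only how quickly the three bounds separate. The real cost of either algorithm depends on the particular words, as point 2 says.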

Stuff that didn't go into the talk

On Wednesday I tried out the talk on Katara and learned that it was around 75% too long. I had violated my own #1 content rule: “Do not begin with a long introduction”. My draft talk started with a tour of all my favorite anagrams, with illustrations. Included were:

  • “Please” and “asleep” and “elapse”.

  • “Spectrum” and “crumpets” ; my wife noticed this while we were at a figure-skating event at the Philadelphia Spectrum, depicted above.

  • “English” and “shingle” ; I came up with this looking at a teabag while at breakfast with my wife's parents. This prompted my mother-in-law to remark that it must be hard to always be thinking about such things—but then she admitted that when she sees long numerals she always checks them for divisibility by 9.

  • “Soupmaster” and “mousetraps”. The picture here is not perfect. I wanted a picture of the Soupmaster restaurant that was at the Liberty Place food court in Philadelphia, but I couldn't find one.

  • I also wanted to show the back end of a Honda Integra and a picture of granite, but I couldn't find a good picture of either one before I deleted them from the talk. (My wife also gets credit for noticing this one.) [ Addendum 20170515: On the road yesterday I was reminded of another one my wife noticed: “Pontiac” / “caption”. ]

Slide #1 defines what anagrams actually are, with an example of “soapstone” / “teaspoons”. I had originally thought I might pander to the left-wing sensibilities of the !‌!Con crowd by using the example “Donald Trump” / “Lord Dampnut” and even made the illustration. I eventually rejected this for a couple of reasons. First, it was misleading because I only intended to discuss single-word anagrams. Second, !‌!Con is supposed to be fun and who wants to hear about Donald Trump?

But the illustration might be useful for someone else, so here it is. Share and enjoy.

After I rejected this I spent some time putting together an alternative, depicting “I am Lord Voldemort” / “Tom Marvolo Riddle”. I am glad I went with the soapstone teaspoons instead.

People Magazine

Clearly one important ingredient in finding good anagrams is that they should have good semantics. I did not make much of an effort in this direction. But it did occur to me that if I found a list of names of well-known people I might get something amusing out of it. For example, it is well known that “Britney Spears” is an anagram of “Presbyterians” which may not be meaningful but at least provides something to mull over.

I had some trouble finding a list of names of well-known people, probably because I do not know where to look, but I did eventually find a list of a few hundred on the People Magazine web site, so I threw it into the mix and was amply rewarded:

Cheryl Burke Huckleberry

I thought Cheryl Burke was sufficiently famous, sufficiently recently, that most people might have heard of her. (Even I know who she is!) But I gave a version of the !‌!Con talk to the Philadelphia Perl Mongers the following Monday and I was the only one in the room who knew. (That version of the talk took around 75 minutes, but we took a lot of time to stroll around and look at the scenery, much of which is in this article.)

I had a struggle finding the right Cheryl Burke picture for the !‌!Con talk. The usual image searches turned up lots of glamour and fashion pictures and swimsuit pictures. I wanted a picture of her actually dancing and for some reason this was not easy to find. The few I found showed her from the back, or were motion blurred. I was glad when I found the one above.


A few days before the !‌!Con talk my original anagram-scoring article hit #1 on Hacker News. Hacker News user Pxtl suggested using the Wikipedia article title list as an input lexicon. The article title list is available for download from the Wikimedia Foundation so you don't have to scrape the pages as Pxtl suggested. There are around 13 million titles and I found all the anagrams and scored them; this took around 25 minutes with my current code.

The results were not exactly disappointing, but neither did they deliver anything as awesomely successful as “cinematographer” / “megachiropteran”. The top scorer by far was “ACEEEFFGHHIILLMMNNOORRSSSTUV”, which is the pseudonym of 17th-century German writer Hans Jakob Christoffel von Grimmelshausen. Obviously, Grimmelshausen constructed his pseudonym by sorting the letters of his name into alphabetical order.

(Robert Hooke famously used the same scheme to claim priority for discovery of his spring law without actually revealing it. He published the statement as “ceiiinosssttuv” and then was able to claim, two years later, that this was an anagram of the actual law, which was “ut tensio, sic vis”. (“As the extension, so the force.”) An attendee of my Monday talk wondered if there is some other Latin phrase that Hooke could have claimed to have intended. Perhaps someone else can take the baton from me on this project.)

Anyway, the next few top scorers demonstrate several different problems:

    21 Abcdefghijklmnopqrstuvwxyz / Qwertyuiopasdfghjklzxcvbnm
    21 Abcdefghijklmnopqrstuvwxyz / Qwertzuiopasdfghjklyxcvbnm
    21 Ashland County Courthouse / Odontorhynchus aculeatus
    21 Daniel Francois Malherbe / Mindenhall Air Force Base

    20 Christine Amongin Aporu / Ethnic groups in Romania
    20 Message force multiplier / Petroleum fiscal regimes

    19 Cholesterol lowering agent / North West Regional College
    19 Louise de Maisonblanche / Schoenobius damienella
    19 Scorpaenodes littoralis / Steroidal spirolactones

The “Qwerty” ones are intrinsically uninteresting and anyway we could have predicted ahead of time that they would be there. And the others are just sort of flat. “Odontorhynchus aculeatus” has the usual problems. One can imagine that there could be some delicious irony in “Daniel Francois Malherbe” / “Mindenhall Air Force Base” but as far as I can tell there isn't any and neither was Louise de Maisonblanche killed by an S. damienella. (It's a moth. Mme de Maisonblanche was actually killed by Variola which is not an anagram of anything interesting.)

Wikipedia article titles include many trivial variations. For example, many people will misspell “Winona Ryder” as “Wynona Rider”, so Wikipedia has pages for both, with the real article at the correct spelling and the incorrect one redirecting to it. The anagram detector cheerfully picks these up although they do not get high scores. Similarly:

  • there are a lot of articles about weasels that have alternate titles about “weasles”
  • there are a lot of articles about the United States or the United Kingdom that have alternate titles about the “Untied States” or the “Untied Kingdom”
  • Articles about the “Center for” something or other with redirects to (or from) the “Centre for” the same thing.
  • There is an article about “Major professional sports leagues in Canada and the United States” with a redirect from “Major professional sports leagues in the United States and Canada”.
  • You get the idea.

The anagram scorer often had quite a bit of trouble with items like these because they are long and full of repeated letter pairs. The older algorithm would have done even worse. If you're still wondering about the difference between two exponential algorithms, some of these would make good example cases to consider.

As I mentioned above you can download the Wikipedia anagrams from my web site and check for yourself. My favorite item so far is:

    18 Atlantis Casino Resort Spa / Carter assassination plot


Some words appear with surprising frequency and I don't know why. As I mentioned above one of the top scorers was “Ethnic groups in Romania” and for some reason Romania appears in the anagram list over and over again:

    20 Christine Amongin Aporu / Ethnic groups in Romania
    17 List of Romanian actors / Social transformation
    15 Imperial Coronation  / Romanian riot police
    14 Rakhine Mountains / Romanians in the UK
    14 Mindanao rasbora / Romanians abroad
    13 Romanian poets / ramosopinnate
    13 Aleuron carinatum / Aromanian culture
    11 Resita Montana / Romanian state
    11 Monte Schiara / The Romaniacs
    11 Monetarianism / Romanian Times
    11 Marion Barnes / Romanian Serb
    11 Maarsen railway station / Romanian State Railways
    11 Eilema androconia / Nicolae de Romania
    11 Ana Maria Norbis / Arabs in Romania

    ( 170 more )

Also I had never thought of this before, but Romania appears in this unexpected context:

    09 Alicia Morton / Clitoromania
    09 Carinito Malo / Clitoromania

(Alicia Morton played Annie in the 1999 film. Carinito Malo is actually Cariñito Malo. I've already discussed the nonequivalence of “n” and “ñ” so I won't beat that horse again.)

Well, this is something I can investigate. For each string of letters, we have here the number of Wikipedia article titles in which the string appears (middle column), the number of anagram pairs in which the string appears (left column; anagrams with score less than 6 are not counted) and the quotient of the two (right column).

            string              pairs titles  ratio
            romania               110  4106  2.7%
            serbia                109  4400  2.5%
            croatia                68  3882  1.8%
            belarus                24  1810  1.3%

            ireland               140 11426  1.2%
            andorra                 7   607  1.2%
            austria                60  5427  1.1%
            russia                137 15944  0.9%

            macedonia              28  3167  0.9%
            france                111 14785  0.8%
            spain                  64  8880  0.7%
            slovenia               18  2833  0.6%

            wales                  47  9438  0.5%
            portugal               17  3737  0.5%
            italy                  21  4353  0.5%
            denmark                19  3698  0.5%

            ukraine                12  2793  0.4%
            england                37  8719  0.4%
            sweden                 11  4233  0.3%
            scotland               16  4945  0.3%

            poland                 22  6400  0.3%
            montenegro              4  1446  0.3%
            germany                16  5733  0.3%
            finland                 6  2234  0.3%

            albania                10  3268  0.3%
            slovakia                3  1549  0.2%
            norway                  9  3619  0.2%
            greece                 10  8307  0.1%

            belgium                 3  2414  0.1%
            switzerland             0  5439  0.0%
            netherlands             1  3522  0.0%
            czechia                 0    75  0.0%

As we see, Romania and Serbia are substantially ahead of the others. I suspect that it is a combination of some lexical property (the interesting part) and the relatively low coverage of those countries in English Wikipedia. That is, I think if we were to identify the lexical component, we might well find that russia has more of it, but scores lower than romania because Russia is much more important. My apologies if I accidentally omitted your favorite European country.

[ Oh, crap, I just realized I left out Bosnia. ]


Another one of the better high scorers turns out to be the delightful:

   16 Lesbian intercourse / Sunrise Celebration

“Lesbian”, like “Romania”, seems to turn up over and over; the next few are:

    11 Lesbian erotica / Oreste Bilancia
    11 Pitane albicollis / Political lesbian
    12 Balearic islands / Radical lesbians
    12 Blaise reaction / Lesbian erotica

    (43 more)

Wikipedia says:

The Blaise reaction is an organic reaction that forms a β-ketoester from the reaction of zinc metal with a α-bromoester and a nitrile.

A hundred points to anyone who can make a genuinely funny joke out of this.

Oreste Bilancia is an Italian silent-film star, and Pitane albicollis is another moth. I did not know there were so many anagrammatic moths. Christian Bale is an anagram of Birthana cleis, yet another moth.

I ran the same sort of analysis on lesbian as on romania, except that since it wasn't clear what to compare it to, I picked a bunch of random words.

    string               pairs titles ratio
    nosehair                 3     3 100.0%
    margarine                4    16  25.0%
    penis                   95   573  16.6%
    weasel                  11   271   4.1%
    phallus                  5   128   3.9%
    lesbian                 26   863   3.0%
    center                 340 23969   1.4%
    flowers                 14  1038   1.3%
    trumpet                  6   487   1.2%
    potato                  10   941   1.1%
    octopus                  4   445   0.9%
    coffee                  12  1531   0.8%

It seems that lesbian appears with unusually high but not remarkably high frequency. The unusual part is its participation in so many anagrams with very high scores. The outstanding item here is penis. (The top two being rare outliers.) But penis still wins even if I throw away anagrams with scores less than 10 (instead of less than 6):

    string              pairs titles ratio
    margarine               1    16   6.2%
    penis                  13   573   2.3%
    lesbian                 8   863   0.9%
    trumpet                 2   487   0.4%
    flowers                 4  1038   0.4%
    center                 69 23969   0.3%
    potato                  2   941   0.2%
    octopus                 1   445   0.2%
    coffee                  1  1531   0.1%
    weasel                  0   271   0.0%
    phallus                 0   128   0.0%
    nosehair                0     3   0.0%

Since I'm sure you are wondering, here are the anagrams of margarine and nosehair:

    07 Nosehair / Rehsonia
    08 Aso Shrine / Nosehairs
    09 Nosehairs / hoariness

    04 Margaret Hines / The Margarines
    07 Magerrain / margarine
    07 Ramiengar / margarine
    08 Rae Ingram / margarine
    11 Erika Armstrong / Stork margarine

I think “Margaret Hines” / “The Margarines” should score more than 4, and that this exposes a defect in my method.

Acrididae graphs 

Here is the graph constructed by the MIS algorithm for the pair “acrididae” / “cidaridae”, which I discussed in an earlier article and also mentioned in my talk.

Each maximum independent set in this graph corresponds to a minimum-chunk mapping between “acrididae” and “cidaridae”. In the earlier article, I claimed:

This one has two maximum independent sets

which is wrong; it has three, yielding three different mappings with five chunks:

My daughter Katara points out that the graphs above resemble grasshoppers. My Gentle Readers will no doubt recall that acrididae is the family of grasshoppers, comprising around 10,000 species. I wanted to find an anagram “grasshopper” / “?????? graph”. There are many anagrams of “eoprs” and “eoprss” but I was not able to find anything good. The best I could do was “spore graphs”.

Thank you, Gentle Readers, for taking this journey with me. I hope nobody walks up to me in the next year to complain that my blog does not feature enough anagram-related material.

Thu, 23 Feb 2017

Miscellaneous notes on anagram scoring

My article on finding the best anagram in English was well-received, and I got a number of interesting comments about it.

  • A couple of people pointed out that this does nothing to address the issue of multiple-word anagrams. For example it will not discover “I, rearrangement servant / Internet anagram server”. True, that is a different problem entirely.

  • Markian Gooley informed me that “megachiropteran / cinematographer” has been long known to Scrabble players, and Ben Zimmer pointed out that A. Ross Eckler, unimpressed by “cholecystoduodenostomy / duodenocholecystostomy”, proposed a method almost identical to mine for scoring anagrams in an article in Word Ways in 1976. M. Eckler also mentioned that the “remarkable” “megachiropteran / cinematographer” had been published in 1927 and that “enumeration / mountaineer” (which I also selected as a good example) appeared in the Saturday Evening Post in 1879!

  • The Hacker News comments were unusually pleasant and interesting. Several people asked “why didn't you just use the Levenshtein distance?” I don't remember that it ever occurred to me, but if it had I would have rejected it right away as being obviously the wrong thing. Remember that my original chunking idea was motivated by the observation that “cholecystoduodenostomy / duodenocholecystostomy” was long but of low quality. Levenshtein distance measures how far every letter has to travel to get to its new place and it seems clear that this would give “cholecystoduodenostomy / duodenocholecystostomy” a high score because most of the letters move a long way.

    Hacker News user tyingq tried it anyway, and reported that it produced a poor outcome. The top-scoring pair by Levenshtein distance is “anatomicophysiologic physiologicoanatomic”, which under the chunking method gets a score of 3. Repeat offender “cholecystoduodenostomy / duodenocholecystostomy” only drops to fourth place.

    A better idea seems to be Levenshtein score per unit of length, suggested by user cooler_ranch.

  • A couple of people complained about my “notaries / senorita” example, rightly observing that “senorita” is properly spelled “señorita”. This bothered me also while I was writing the article. I eventually decided although “notaries” and “señorita” are certainly not anagrams in Spanish (even supposing that “notaries” was a Spanish word, which it isn't) that the spelling of “senorita” without the tilde is a correct alternative in English. (Although I found out later that both the Big Dictionary and American Heritage seem to require the tilde.)

    Hacker News user ggambetta observed that while ‘é’ and ‘e’, and ‘ó’ and ‘o’ feel interchangeable in Spanish, ‘ñ’ and ‘n’ do not. I think this is right. The ‘é’ is an ‘e’, but with a mark on it to show you where the stress is in the word. An ‘ñ’ is not like this. It was originally an abbreviation for ‘nn’, introduced in the 12th century. So I thought it might make sense to allow ‘ñ’ to be exchanged for ‘nn’, at least in some cases.

    (An analogous situation in German, which may be more familiar, is that it might be reasonable to treat ‘ö’ and ‘ü’ as if they were ‘oe’ and ‘ue’. Also note that in former times, “w” and “uu” were considered interchangeable in English anagrams.)

    Unfortunately my Spanish dictionary is small (7,000 words) and of poor quality and I did not find any anagrams of “señorita”. I wish I had something better for you. Also, “señorita” is not one of the cases where it is appropriate to replace “ñ” with “nn”, since it was never spelled “sennorita”.

    I wonder why sometimes this sort of complaint seems to me like useless nitpicking, and other times it seems like a serious problem worthy of serious consideration. I will try to think about this.

  • Mike Morton, who goes by the anagrammatic nickname of “Mr. Machine Tool”, referred me to his Higgledy-piggledy about megachiropteran / cinematographer, which is worth reading.

  • Regarding the maximum independent set algorithm I described yesterday, Shreevatsa R. suggested that it might be conceptually simpler to find the maximum clique in the complement graph. I'm not sure this helps, because the complement graph has a lot more edges than the original. Below right is the complement graph for “acrididae / cidaridae”. I don't think I can pick out the 4-cliques in that graph any more than the independent sets in the graph on the lower-left, and this is an unusually favorable example case for the clique version, because the original graph has an unusually large number of edges.

    But perhaps the cliques might be easier to see if you know what to look for: in the right-hand diagram the four nodes on the left are one clique, and the four on the right are the other, whereas in the left-hand diagram the two independent sets are all mixed together.

  • An earlier version of the original article mentioned the putative 11-pointer “endometritria / intermediator”. The word “endometritria” seemed pretty strange, and I did look into it before I published the article, but not carefully enough. When Philip Cohen wrote to me to question it, I investigated more carefully, and discovered that it had been an error in an early WordNet release, corrected (to “endometria”) in version 1.6. I didn't remember that I had used WordNet's word lists, but I am not surprised to discover that I did.

    A rare printing of Webster's 2¾th American International Lexican includes the word “endometritriostomoscopiotomous” but I suspect that it may be a misprint.

  • Philippe Bruhat wrote to inform me of Alain Chevrier’s book notes / sténo, a collection of thematically related anagrams in French. The full text is available online.

  • Alexandre Muñiz, who has a really delightful blog, and who makes and sells attractive and clever puzzles of his own invention, pointed out that soapstone teaspoons are available. The perfect gift for the anagram-lover in your life! They are not even expensive.

  • Thanks also to Clinton Weir, Simon Tatham, Jon Reeves, Wei-Hwa Huang, and Philip Cohen for their emails about this.
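
For readers who want to try the Levenshtein suggestions from the Hacker News discussion above, here is a minimal Python sketch. It uses the textbook dynamic-programming definition (the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other). The function names are made up for this example, and dividing by the longer word's length is only one guess at what “per unit of length” might mean.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance, computed one row of the DP table at a time."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute, or free match
        prev = cur
    return prev[-1]

def levenshtein_per_letter(a: str, b: str) -> float:
    """Hypothetical normalization: edit distance divided by the longer length."""
    return levenshtein(a, b) / max(len(a), len(b), 1)

for pair in [("cholecystoduodenostomy", "duodenocholecystostomy"),
             ("megachiropteran", "cinematographer")]:
    print(pair, levenshtein(*pair), round(levenshtein_per_letter(*pair), 2))
```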

[ Addendum 20170507: Slides from my !!Con 2017 talk are now available. ]

[ Addendum 20170511: A large amount of miscellaneous related material ]

Wed, 22 Feb 2017

Moore's law beats a better algorithm

Yesterday I wrote about the project I did in the early 1990s to find the best anagrams. The idea is to give each pair of anagram words a score, which is the number of chunks into which you have to divide one word in order to rearrange the chunks to form the other word. This was motivated by the observation that while “cholecysto-duodeno-stomy” and “duodeno-cholecysto-stomy” are very long words that are anagrams of one another, they are not interesting because they require so few chunks that the anagram is obvious. A shorter but much more interesting example is “aspired / diapers”, where the letters get all mixed up.

I wrote:

One could do this with a clever algorithm, if one were available. There is a clever algorithm, based on finding maximum independent sets in a certain graph. I did not find this algorithm at the time; nor did I try. Instead, I used a brute-force search.

I wrote about the brute-force search yesterday. Today I am going to discuss the clever algorithm. (The paper is Avraham Goldstein, Petr Kolman, Jie Zheng “Minimum Common String Partition Problem: Hardness and Approximations”, The Electronic Journal of Combinatorics, 12 (2005).)

The plan is to convert a pair of anagrams into a graph that expresses the constraints on how the letters can move around when one turns into the other. Shown below is the graph for comparing acrididae (grasshoppers) with cidaridae (sea urchins):

The “2,4” node at the top means that the letters ri at position 2 in acrididae match the letters ri at position 4 in cidaridae; the “3,1” node is for the match between the first id and the first id. The two nodes are connected by an edge to show that the two matchings are incompatible: if you map the ri to the ri, you cannot also map the first id to the first id; instead you have to map the first id to the second one, represented by the node “3,5”, which is not connected to “2,4”. A maximum independent set in this graph is a maximum selection of compatible matchings in the words, which corresponds to a division into the minimum number of chunks.

Usually the graph is much less complicated than this. For simple cases it is empty and the maximum independent set is trivial. This one has two maximum independent sets, one (3,1; 5,5; 6,6; 7,7) corresponding to the obvious minimum splitting:

and the other (2,4; 3,5; 5,1; 6,2) to this other equally-good splitting:

[ Addendum 20170511: It actually has three maximum independent sets. ]
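
The construction can be sketched in a few lines of Python. This is a reading of the description above, not the code on GitHub: the names `match_nodes`, `compatible`, and `maximum_independent_sets` are invented here. The compatibility rule is an assumption inferred from the “2,4” / “3,1” example: two matches conflict when together they send one letter position to two different places, or two positions to the same place. Positions are counted from 0, as in the node labels.

```python
from itertools import combinations

def match_nodes(a, b):
    """All (i, j) where the letter pair at a[i:i+2] equals the pair at b[j:j+2]."""
    return [(i, j)
            for i in range(len(a) - 1)
            for j in range(len(b) - 1)
            if a[i:i + 2] == b[j:j + 2]]

def compatible(p, q):
    """True if both matches can belong to one letter-to-letter mapping (assumed rule)."""
    forward, backward = {}, {}
    for i, j in (p, q):
        for d in (0, 1):
            # setdefault records the assignment, or returns the earlier one
            if forward.setdefault(i + d, j + d) != j + d:
                return False
            if backward.setdefault(j + d, i + d) != i + d:
                return False
    return True

def maximum_independent_sets(nodes):
    """Brute force: try the largest subsets first; return every independent one."""
    for size in range(len(nodes), 0, -1):
        found = [s for s in combinations(nodes, size)
                 if all(compatible(p, q) for p, q in combinations(s, 2))]
        if found:
            return found
    return [()]

a, b = "acrididae", "cidaridae"
nodes = match_nodes(a, b)
sets = maximum_independent_sets(nodes)
print(len(nodes), "nodes;", len(sets), "maximum independent sets")
print("score:", len(a) - len(sets[0]))
```

Under these assumptions the sketch finds 8 nodes and three maximum independent sets of size 4 for this pair, including the two listed above, which gives a score of 9 − 4 = 5.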

In an earlier draft of yesterday's post, I wrote:

I should probably do this over again, because my listing seems to be incomplete. For example, it omits “spectrum / crumpets” which would have scored 5, because the Webster's Second list contains crumpet but not crumpets.

I was going to leave it at that, but then I did do it over again, and this time around I implemented the “good” algorithm. It was not that hard. The code is on GitHub if you would like to see it.

To solve the maximum independent set instances, I used a guided brute-force search. Maximum independent set is NP-complete, and so the best known algorithm for it runs in exponential time. But the instances in which we are interested here are small enough that this doesn't matter. The example graph above has 8 nodes, so one needs to check at most 256 possible sets to see which is the maximum independent set.

I collated together all the dictionaries I had handy. (I didn't know yet about SCOWL.) These totaled 275,954 words, which is somewhat more than Webster's Second by itself. One of the new dictionaries did contain crumpets so the result does include “spectrum / crumpets”.

The old scored anagram list that I made in the 1990s contained 23,521 pairs. The new one contains 38,333. Unfortunately most of the new stuff is of poor quality, as one would expect. Most of the new words that were missing from my dictionary the first time around are obscure. Perhaps some people would enjoy discovering that “basiparachromatin” and “Marsipobranchiata” are anagrams, but I find it of very limited appeal.

But the new stuff is not all junk. It includes:

10 antiparticles paternalistic
10 nectarines transience
10 obscurantist subtractions

11 colonialists oscillations
11 derailments streamlined

which I think are pretty good.

I wasn't sure how long the old program had taken to run back in the early nineties, but I was sure it had been at least a couple of hours. The new program processes the 275,954 inputs in about 3.5 seconds. I wished I knew how much of this was due to Moore's law and how much to the improved algorithm, but as I said, the old code was long lost.

But then just as I was finishing up the article, I found the old brute-force code that I thought I had lost! I ran it on the same input, and instead of 3.5 seconds it took just over 4 seconds. So almost all of the gain since the 1990s was from Moore's law, and hardly any was from the “improved” algorithm.

I had written in the earlier article:

In 2016 [ the brute force algorithm ] would probably still [ run ] quicker than implementing the maximum independent set algorithm.

which turned out to be completely true, since implementing the maximum independent set algorithm took me a couple of hours. (Although most of that was building out a graph library because I didn't want to look for one on CPAN.)

But hey, at least the new program is only twice as much code!

38333 anagrams, scored

[ Addendum: The program had a minor bug: it would disregard capitalization when deciding if two words were anagrams, but then compute the scores with capitals and lowercase letters distinct. So for example Chaenolobus was considered an anagram of unchoosable, but then the Ch in Chaenolobus would not be matched to the ch in unchoosable, resulting in a score of 11 instead of 10. I have corrected the program and the output. Thanks to Philip Cohen for pointing this out. ]

[ Addendum 20170223: More about this ]

[ Addendum 20170507: Slides from my !!Con 2017 talk are now available. ]

[ Addendum 20170511: A large amount of miscellaneous related material ]

Tue, 21 Feb 2017

I found the best anagram in English

I planned to publish this last week sometime but then I wrote a line of code with three errors and that took over the blog.

A few years ago I mentioned in passing that in the 1990s I had constructed a listing of all the anagrams in Webster's Second International dictionary. (The Webster's headword list was available online.)

This was easy to do, even at the time, when the word list itself, at 2.5 megabytes, was a file of significant size. Perl and its cousins were not yet common; in those days I used Awk. But the task is not very different in any reasonable language:

  my %anagrams;

  # Process word list
  while (my $word = <>) {
    chomp $word;
    my $sorted = join "", sort split //, $word;  # normal form
    push @{$anagrams{$sorted}}, $word;
  }

  # Print each set of two or more anagrams
  for my $words (values %anagrams) {
      print "@$words\n" if @$words > 1;
  }

The key technique is to reduce each word to a normal form so that two words have the same normal form if and only if they are anagrams of one another. In this case we do this by sorting the letters into alphabetical order, so that both megalodon and moonglade become adeglmnoo.

Then we insert the words into a (hash | associative array | dictionary), keyed by their normal forms, and two or more words are anagrams if they fall into the same hash bucket. (There is some discussion of this technique in Higher-Order Perl pages 218–219 and elsewhere.)
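To make the normal-form idea concrete in another language, here is a minimal sketch in Python rather than Perl. The function name `anagram_sets` is made up for illustration, and the sketch takes a list of words instead of reading a word list from standard input.

```python
from collections import defaultdict

def anagram_sets(words):
    """Group words by normal form; keep only groups of two or more."""
    buckets = defaultdict(list)
    for word in words:
        # Normal form: the word's letters in alphabetical order,
        # so "megalodon" and "moonglade" both become "adeglmnoo".
        normal_form = "".join(sorted(word))
        buckets[normal_form].append(word)
    return {key: group for key, group in buckets.items() if len(group) > 1}

print(anagram_sets(["megalodon", "moonglade", "aal", "ala", "zebra"]))
```

Words that share a dictionary key are anagrams of one another; "zebra" lands in a bucket by itself and is dropped.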

(The thing you do not want to do is to compute every permutation of the letters of each word, looking for permutations that appear in the word list. That is akin to sorting a list by computing every permutation of the list and looking for the one that is sorted. I wouldn't have mentioned this, but someone on StackExchange actually asked this question.)

Anyway, I digress. This article is about how I was unhappy with the results of the simple procedure above. From the Webster's Second list, which contains about 234,000 words, it finds about 14,000 anagram sets (some with more than two words), consisting of 46,351 pairs of anagrams. The list starts with

aal ala

and ends with

zolotink zolotnik

which exemplify the problems with this simple approach: many of the 46,351 anagrams are obvious, uninteresting or even trivial. There must be good ones in the list, but how to find them?

I looked in the list to find the longest anagrams, but they were also disappointing:

cholecystoduodenostomy duodenocholecystostomy

(Webster's Second contains a large amount of scientific and medical jargon. A cholecystoduodenostomy is a surgical operation to create a channel between the gall bladder (cholecysto-) and the duodenum (duodeno-). A duodenocholecystostomy is the same thing.)

This example made clear at least one of the problems with boring anagrams: it's not that they are too short, it's that they are too simple. Cholecystoduodenostomy and duodenocholecystostomy are 22 letters long, but the anagrammatic relation between them is obvious: chop cholecystoduodenostomy into three parts:

cholecysto duodeno stomy

and rearrange the first two:

duodeno cholecysto stomy

and there you have it.

This gave me the idea to score a pair of anagrams according to how many chunks one had to be cut into in order to rearrange it to make the other one. On this plan, the “cholecystoduodenostomy / duodenocholecystostomy” pair would score 3, just barely above the minimum possible score of 2. Something even a tiny bit more interesting, say “abler / blare” would score higher, in this case 4. Even if this strategy didn't lead me directly to the most interesting anagrams, it would be a big step in the right direction, allowing me to eliminate the least interesting.

This rule would judge both “aal / ala” and “zolotink / zolotnik” as being uninteresting (scores 2 and 4 respectively), which is a good outcome. Note that some other boring-anagram problems can be seen as special cases of this one. For example, short anagrams never need to be cut into many parts: no four-letter anagrams can score higher than 4. The trivial anagramming of a word to itself always scores 1, and nontrivial anagrams always score more than this.

So what we need to do is: for each anagram pair, say acrididae (grasshoppers) and cidaridae (sea urchins), find the smallest number of chunks into which we can chop acrididae so that the chunks can be rearranged into cidaridae.

One could do this with a clever algorithm, if one were available. There is a clever algorithm, based on finding maximum independent sets in a certain graph. (More about this tomorrow.) I did not find this algorithm at the time; nor did I try. Instead, I used a brute-force search. Or rather, I used a very small amount of cleverness to reduce the search space, and then used brute-force search to search the reduced space.

Let's consider an example, scoring the anagram “abscise / scabies”. You do not have to consider every possible permutation of abscise. Rather, there are only two possible mappings from the letters of abscise to the letters of scabies. You know that the C must map to the C, the A must map to the A, and so forth. The only question is whether the first S of abscise maps to the first or to the second S of scabies. The first mapping gives us:

ab sc i s e

which is five chunks, and the second gives us

ab s c i s e

which is six chunks, because the S and the C no longer go to adjoining positions. So the minimum number of chunks is 5, and this anagram pair gets a score of 5.
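The brute-force search described here can be sketched roughly as follows. This is a reconstruction in Python from the description, not the lost original program; the name `chunk_score` is hypothetical. It tries every way of matching repeated letters and counts, for each mapping, the chunks of adjacent letters that stay adjacent and in order.

```python
from collections import defaultdict
from itertools import permutations, product

def chunk_score(source, target):
    """Smallest number of chunks source must be cut into to rearrange it as target."""
    if sorted(source) != sorted(target):
        raise ValueError("words are not anagrams")
    src_pos, tgt_pos = defaultdict(list), defaultdict(list)
    for i, ch in enumerate(source):
        src_pos[ch].append(i)
    for j, ch in enumerate(target):
        tgt_pos[ch].append(j)
    letters = list(src_pos)
    best = len(source)
    # For each letter, try every assignment of its occurrences in source
    # to its occurrences in target.
    for choice in product(*(permutations(tgt_pos[ch]) for ch in letters)):
        mapping = [0] * len(source)
        for ch, perm in zip(letters, choice):
            for i, j in zip(src_pos[ch], perm):
                mapping[i] = j
        # A new chunk starts wherever two neighbouring letters of source
        # do not land in neighbouring positions of target.
        chunks = 1 + sum(
            1 for k in range(1, len(mapping)) if mapping[k] != mapping[k - 1] + 1
        )
        best = min(best, chunks)
    return best

print(chunk_score("abscise", "scabies"))
```

On the examples in this article the sketch gives 5 for “abscise / scabies”, 4 for “abler / blare”, and 2 for “aal / ala”.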

To fully analyze cholecystoduodenostomy by this method required considering 7680 mappings. (120 ways to map the five O's × 2 ways to map the two C's × 2 ways to map the two D's, etc.) In the 1990s this took a while, but not prohibitively long, and it worked well enough that I did not bother to try to find a better algorithm. In 2016 it would probably still run quicker than implementing the maximum independent set algorithm. Unfortunately I have lost the code that I wrote then so I can't compare.
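The count of mappings is the product of the factorials of the letter multiplicities. A small Python check of that arithmetic, with a made-up function name:

```python
from collections import Counter
from math import factorial, prod

def count_mappings(word):
    """Number of ways to match up repeated letters between a word and its anagram."""
    return prod(factorial(n) for n in Counter(word).values())

# Five O's give 5! = 120; C, D, E, S, T and Y each occur twice, giving 2**6 = 64.
print(count_mappings("cholecystoduodenostomy"))  # 120 * 64 = 7680
```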

Assigning scores in this way produced a scored anagram list which began

2 aal ala

and ended

4 zolotink zolotnik

and somewhere in the middle was

3 cholecystoduodenostomy duodenocholecystostomy

all poor scores. But sorted by score, there were treasures at the end, and the clear winner was

14 cinematographer megachiropteran

I declare this the single best anagram in English. It is 15 letters long, and the only letters that stay together are the E and the R. “Cinematographer” is as familiar as a 15-letter word can be, and “megachiropteran” means a giant bat. GIANT BAT! DEATH FROM ABOVE!!!

And there is no serious competition. There was another 14-pointer, but both its words are Webster's Second jargon that nobody knows:

14 rotundifoliate titanofluoride

There are no score 13 pairs, and the score 12 pairs are all obscure. So this is the winner, and a deserving winner it is.

I think there is something in the list to make everyone happy. If you are the type of person who enjoys anagrams, the list rewards casual browsing. A few examples:

7 admirer married
7 admires sidearm

8 negativism timesaving
8 peripatetic precipitate
8 scepters respects
8 shortened threnodes
8 soapstone teaspoons

9 earringed grenadier
9 excitation intoxicate
9 integrals triangles
9 ivoriness revisions
9 masculine calumnies

10 coprophagist topographics
10 chuprassie haruspices
10 citronella interlocal

11 clitoridean directional
11 dispensable piebaldness

“Clitoridean / directional” has been one of my favorites for years. But my favorite of all, although it scores only 6, is

6 yttrious touristy

I think I might love it just because the word yttrious is so delightful. (What a debt we owe to Ytterby, Sweden!)

I also rather like

5 notaries senorita

which shows that even some of the low-scorers can be worth looking at. Clearly my chunk score is not the end of the story, because “notaries / senorita” should score better than “abets / baste” (which is boring) or “Acephali / Phacelia” (whatever those are), also 5-pointers. The length of the words should be worth something, and the familiarity of the words should be worth even more.

Here are the results:

38333 anagrams, scored

In former times there was a restaurant in Philadelphia named “Soupmaster”. My best unassisted anagram discovery was noticing that this is an anagram of “mousetraps”.

[ Addendum 20170222: There is a followup article comparing the two algorithms I wrote for computing scores. ]

[ Addendum 20170222: An earlier version of this article mentioned the putative 11-pointer “endometritria / intermediator”. The word “endometritria” seemed pretty strange, and I did look into it before I published the article, but not carefully enough. When Philip Cohen wrote to me to question it, I investigated more carefully, and discovered that it had been an error in an early WordNet release, corrected (to “endometria”) in version 1.6. I didn't remember that I had used WordNet's word lists, but I am not surprised to discover that I did. ]

[ Addendum 20170223: More about this ]

[ Addendum 20170507: Slides from my !!Con 2017 talk are now available. ]

[ Addendum 20170511: A large amount of miscellaneous related material ]


Mon, 30 Jan 2017

Digit symbols in the Parshvanatha magic square

In last month's article about the magic square at the Parshvanatha temple, shown at right, I said:

It has come to my attention that the digit symbols in the magic square are not too different from the current forms of the digit symbols in the Gujarati script. The temple is not very close to Gujarat or to the area in which Gujarati is common, so I guess that the digit symbols in Indian languages have evolved in the past thousand years, with the Gujarati versions remaining closest to the ancient forms, or else perhaps Gujarati was spoken more widely a thousand years ago. I would be interested to hear about this from someone who knows.

Shreevatsa R. replied in detail, and his reply was so excellent that, finding no way to improve it by adding or taking away, I begged his permission to republish it without change, which he generously granted.

Am sending this email to say:

  1. Why it shouldn't be surprising if the temple had Gujarati numerals
  2. Why the numerals aren't Gujarati numerals :-)

The Parshvanatha temple is located in the current state of Madhya Pradesh. Here is the location of the temple within a map of the state:

And here you can see that the above state of Madhya Pradesh (14 in the image below) is adjacent to the state of Gujarat (7):

The states of India are (sort of) organized along linguistic lines, and neighbouring states often have overlap or similarities in their languages. So a priori it shouldn't be too surprising if the language is that of a neighbouring state.

But, as you rightly say, the location of the Parshvanatha temple is actually quite far from the state (7) where Gujarat is spoken; it's closer to 27 in the above map (state named Uttar Pradesh).

Well, the Parshvanatha temple is believed to have been built "during the reign of the Chandela king Dhanga", and the Chandela kings were feudatories (though just beginning to assert sovereignty at the time) of the Gurjara-Pratihara kings, and "Gurjara" is where the name of the language of "Gujarati" comes from. So it's possible that they used the "official" language of the reigning kings, as with colonies. In fact the green area of the Gurjara-Pratihara kings in this map covers the location of the Parshvanatha temple:

But actually this is not a very convincing argument, because the link between Gurjara-Pratiharas and modern Gujarati is not too strong (at least I couldn't find it in a few minutes on Wikipedia :P)

So moving on...

Are the numerals really similar to Gujarati numerals? These are the numbers 1 to 16 from your blog post, ordered according to the usual order:

These are the numerals in a few current Indic scripts (as linked from your blog post):

Look at the first two rows above. Perhaps because of my familiarity with Devanagari, I cannot really see any big difference between the Devanagari and Gujarati symbols except for the 9: the differences are as minor as variation between fonts. (To see how much the symbols can change because of font variation, one can go to Google Fonts' Devanagari page and Google Fonts' Gujarati page and click on one of the sample texts and enter "० १ २ ३ ४ ५ ६ ७ ८ ९" and "૦ ૧ ૨ ૩ ૪ ૫ ૬ ૭ ૮ ૯" respectively, then "Apply to all fonts". Some fonts are bad, though.)

(In fact, even the Gurmukhi and Tibetan are somewhat recognizable, for someone who can read Devanagari.)

So if we decide that the Parshvanatha temple's symbols are actually closer not to modern Gujarati but to modern Devanagari (e.g. the "3" has a tail in the temple symbols which is present in Devanagari but missing in Gujarati), then the mystery disappears: Devanagari is still the script used in the state of Madhya Pradesh (and Uttar Pradesh, etc: it's the script used for Hindi, Marathi, Nepali, Sanskrit, and many other languages).

Finally, for the complete answer, we can turn to history.

The Parshvanatha temple was built during 950 to 970 CE. Languages: Modern Gujarati dates from 1800, Middle Gujarati from ~1500 to 1800, Old Gujarati from ~1100 to 1500. So the temple is older than the earliest language called "Gujarati". (Similarly, modern Hindi is even more recent.) Turning to scripts instead: see under Brahmic scripts.

So at the time the temple was built, neither Gujarati script nor Devanagari proper existed. The article on the Gujarati script traces its origin to the Devanagari script, which itself is a descendant of Nagari script.

At right are the symbols from the Nagari script, which I think are closer in many respects to the temple symbols.

So overall, if we trace the numerals in (a subset of) the family tree of scripts:

Brahmi > Gupta > Nagari > Devanagari > Gujarati

we'll find that the symbols of the temple are somewhere between the "Nagari" and "Devanagari" forms. (Most of the temple digits are the same as in the "Nagari" example above, except for the 5 which is closer to the Devanagari form.)

BTW, your post was about the numerals, but from being able to read modern Devanagari, I can also read some of the words above the square: the first line ends with ".. putra śrī devasarmma" (...पुत्र श्री देव‍सर्म्म) (Devasharma, son of...), and these words have the top bar which is missing in Gujarati script.


Sun, 20 Mar 2016

Technical jargon failure modes

Technical jargon is its own thing, intended for easy communication between trained practitioners of some art, but not necessarily between anyone else.

Jargon can be somewhat transparent, like the chemical jargon term “alcohol”. “Alcohol” refers to a large class of related chemical compounds, of which the simplest examples are methyl alcohol (traditionally called “wood alcohol”) and ethyl alcohol (the kind that you get in your martini). The extension of “alcohol” to the larger class is suggestive and helpful. Someone who doesn't understand the chemical jargon usage of “alcohol” can pick it up by analogy, and even if they don't they will probably have something like the right idea. A similar example is “aldehyde”. An outsider who hears this for the first time might reasonably ask “does that have something to do with formaldehyde?” and the reasonable answer is “yes indeed, formaldehyde is the simplest example of an aldehyde compound.” Again the common term is adapted to refer to the members of a larger but related class.

An opposite sort of adaptation is found in the term “bug”. The common term is extremely broad, encompassing all sorts of terrestrial arthropods, including mosquitoes, ladybugs, flies, dragonflies, spiders, and even isopods (“pillbugs”) and centipedes and so forth. It should be clear that this category is too large and heterogeneous to be scientifically useful, and the technical use of “bug” is much more restricted. But it does include many creatures commonly referred to as bugs, such as bed bugs, waterbugs, various plant bugs, and many other flat-bodied crawling insects.

Mathematics jargon often wanders in different directions. Some mathematical terms are completely opaque. Nobody hearing the term “cobordism” or “simplicial complex” or “locally compact manifold” for the first time will think for an instant that they have any idea what it means, and this is perfect, because they will be perfectly correct. Other mathematical terms are paradoxically so transparent seeming that they reveal their opacity by being obviously too good to be true. If you hear a mathematician mention a “field” it will take no more than a moment to realize that it can have nothing to do with fields of grain or track-and-field sports. (A field is a collection of things that are number-like, in the sense of having addition, subtraction, multiplication, and division that behave pretty much the way one would expect those operations to behave.) And some mathematical jargon is fairly transparent. The non-mathematician's idea of “line”, “ball”, and “cube” is not in any way inconsistent with what the mathematician has in mind, although the full technical meaning of those terms is pregnant with ramifications and connotations that are invisible to non-mathematicians.

But mathematical jargon sometimes goes to some bad places. The term “group” is so generic that it could mean anything, and outsiders often imagine that it means something like what mathematicians call a “set”. (It actually means a family of objects that behave like the family of symmetries of some other object.)

This last is not too terrible, as jargon failures go. There is a worse kind of jargon failure I would like to contrast with “bug”. There the problem, if there is a problem, is that entomologists use the common term “bug” much more restrictively than one expects. An entomologist will well-actually you to explain that a millipede is not actually a bug, but we are used to technicians using technical terms in more restrictive ways than we expect. At least you can feel fairly confident that if you ask for examples of bugs (“true bugs”, in the jargon), they will all be things you would consider bugs, and the entomologist will not proceed to rattle off a list that includes bats, lobsters, potatoes, or the Trans-Siberian Railroad. This is an acceptable state of affairs.

Unacceptable, however, is the botanical use of the term “berry”:

It is one thing to adopt a jargon term that is completely orthogonal to common usage, as with “fruit”, where the technical term simply has no relation at all to the common meaning. That is bad enough. But to adopt the term “berry” for a class of fruits that excludes nearly everything that is commonly called a “berry” is an offense against common sense.

This has been on my mind a long time, but I am writing about it now because I think I have found, at last, an even more offensive example.

  • Stonehenge is so-called because it is a place of hanging stones: “henge” is cognate with “hang”.

  • In 1932 archaeologists adapted the name “Stonehenge” to create the word “henge” as a generic term for a family of ancient monuments that are similar to Stonehenge.

  • Therefore, if there were only one thing in the whole world that ought to be an example of a henge, it should be Stonehenge.

  • However, Stonehenge is not, itself, a henge.

  • Stonehenge is not a henge.


Stonehenge is not a henge. … Technically, [henges] are earthwork enclosures in which a ditch was dug to make a bank, which was thrown up on the outside edge of the ditch.

— Michael Pitts, Hengeworld, pp. 26–28.

“Henge” may just be the most ineptly coined item of technical jargon in history.

[ Addendum 20161103: Zimbabwe's Great Dyke is not actually a dyke. ]


Sat, 26 Apr 2014

My brush with Oulipo

Last night I gave a talk for the New York Perl Mongers, and got to see a number of people that I like but don't often see. Among these was Michael Fischer, who told me a story about myself that I had completely forgotten, but which I think will be of general interest.

The front end of the story is this: Michael first met me at some conference, shortly after the publication of Higher-Order Perl, and people were coming up to me and presenting me with copies of the book to sign. In many cases these were people who had helped me edit the book, or who had reported printing errors; for some of those people I would find the error in the text that they had reported, circle it, and write a thank-you note on the same page. Michael did not have a copy of my book, but for some reason he had with him a copy of Oulipo Compendium, and he presented this to me to sign instead.

Oulipo is a society of writers, founded in 1960, who pursue “constrained writing”. Perhaps the best-known example is the lipogrammatic novel La Disparition, written in 1969 by Oulipo member Georges Perec, entirely without the use of the letter e. Another possibly well-known example is the Exercises in Style of Raymond Queneau, which retells the same vapid anecdote in 99 different styles. The book that Michael put in front of me to sign is a compendium of anecdotes, examples of Oulipan work, and other Oulipalia.

What Michael did not realize, however, was that the gods of fate were handing me an opportunity. He says that I glared at him for a moment, then flipped through the pages, found the place in the book where I was mentioned, circled it, and signed that.

The other half of that story is how I happened to be mentioned in Oulipo Compendium.

Back in the early 1990s I did a few text processing projects which would be trivial now, but which were unusual at the time, in a small way. For example, I constructed a concordance of the King James Bible, listing, for each word, the number of every verse in which it appeared. This was a significant effort at the time; the Bible was sufficiently large (around five megabytes) that I normally kept the files compressed to save space. This project was surprisingly popular, and I received frequent email from strangers asking for copies of the concordance.

Another project, less popular but still interesting, was an anagram dictionary. The word list from Webster's Second International dictionary was available, and it was an easy matter to locate all the anagrams in it, and compile a file. Unlike the Bible concordance, which I considered inferior to simply running grep, I still have the anagram dictionary. It begins:

aal ala
aam ama
Aarhus (See `arusha')
Aaronic (See `Nicarao')
Aaronite aeration
Aaru aura

And ends:

zoosporic sporozoic
zootype ozotype
zyga gazy
zygal glazy

The cross-references are to save space. When two words are anagrams of one another, both are listed in both places. But when three or more words are anagrams, the words are listed in one place, with cross-references in the other places, so for example:

Ateles teasel stelae saltee sealet
saltee (See `Ateles')
sealet (See `Ateles')
stelae (See `Ateles')
teasel (See `Ateles')

saves 52 characters over the unabbreviated version. Even with this optimization, the complete anagram dictionary was around 750 kilobytes, a significant amount of space in 1991. A few years later I generated an improved version, which dispensed with the abbreviation, by that time unnecessary, and which attempted, successfully I thought, to score the anagrams according to interestingness. But I digress.
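The 52-character figure can be checked with a short Python sketch. It assumes that in the unabbreviated version each of the five words gets its own line listing the whole set, and that each line ends in a newline; the function names are invented for illustration.

```python
def full_listing(words):
    """Unabbreviated: every word gets a line listing the whole anagram set."""
    return [" ".join([w] + [x for x in words if x != w]) for w in words]

def abbreviated_listing(words):
    """Abbreviated: list the set once, under its first word, and cross-reference the rest."""
    head = words[0]
    lines = [" ".join(words)]
    lines += [f"{w} (See `{head}')" for w in sorted(words[1:])]
    return lines

def size(lines):
    return sum(len(line) + 1 for line in lines)  # +1 for each newline

words = ["Ateles", "teasel", "stelae", "saltee", "sealet"]
saved = size(full_listing(words)) - size(abbreviated_listing(words))
print(saved)  # 175 - 123 = 52
```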

One day in August of 1994, I received a query about the anagram dictionary, including a question about whether it could be used in a certain way. I replied in detail, explaining what I had done, how it could be used, and what could be done instead, and the result was a reply from Harry Mathews, another well-known member of the Oulipo, of which I had not heard before. Mr. Mathews, correctly recognizing that I would be interested, explained what he was really after:

A poetic procedure created by the late Georges Perec falls into the latter category. According to this procedure, only the 11 commonest letters in the language can be used, and all have to be used before any of them can be used again. A poem therefore consists of a series of 11 multi-word anagrams of, in French, the letters e s a r t i n u l o c (a c e i l n o r s t). Perec discovered only one one-word anagram for the letter-group, "ulcerations", which was adopted as a generic name for the procedure.

Mathews wanted, not exactly an anagram dictionary, but a list of words acceptable for the English version of "ulcerations". They should contain only the letters a d e h i l n o r s t, at most once each. In particular, he wanted a word containing precisely these eleven letters, to use as the translation of "ulcerations".

Producing the requisite list was much easier than producing the anagram dictionary itself, so I quickly did it and sent it back; it looked like this:

a A a
d D d
e E e
h H h
i I i
l L l
n N n
o O o
r R r
s S s
t T t
ad ad da
ae ae ea
ah Ah ah ha
lost lost lots slot
nors sorn
nort torn tron
nost snot
orst sort
adehl heald
adehn henad
adehr derah
adehs Hades deash sadhe shade
deilnorst nostriled
ehilnorst nosethirl
adehilnort threnodial
adehilnrst disenthral
aehilnorst hortensial

The leftmost column is the alphabetical list of letters. This is so that if you find yourself needing to use the letters 'a d e h s' at some point in your poem, you can jump to that part of the list and immediately locate the words containing exactly those letters. (It provides somewhat less help for discovering the shorter words that contain only some of those letters, but there is a limit to how much can be done with static files.)
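A list in this format could be produced along the following lines. This is a Python sketch written from the description above, not the program I actually used; `letter_index` and `ALLOWED` are hypothetical names. It assumes that capitalization is ignored when forming the key, as the “adehs Hades ...” line suggests.

```python
from collections import defaultdict

ALLOWED = set("adehilnorst")

def letter_index(words, allowed=ALLOWED):
    """Index words that use only allowed letters, each at most once,
    keyed by their letters in alphabetical order."""
    buckets = defaultdict(list)
    for word in words:
        key = "".join(sorted(word.lower()))
        no_repeats = len(set(key)) == len(key)
        if no_repeats and set(key) <= allowed:
            buckets[key].append(word)
    # Shorter keys first, then alphabetically, as in the list above.
    ordered_keys = sorted(buckets, key=lambda k: (len(k), k))
    return [(key, sorted(buckets[key])) for key in ordered_keys]

sample = ["slot", "lots", "hello", "shade", "Hades", "lost", "tool"]
for key, group in letter_index(sample):
    print(key, *group)
```

Here "hello" and "tool" are rejected because they repeat a letter, and the remaining words fall under the keys "lost" and "adehs".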

As can be seen at the end of the list, there were three words that each used ten of the eleven required letters: “hortensial”, “threnodial”, “disenthral”, but none with all eleven. However, Mathews replied:

You have found the solution to my immediate problem: "threnodial" may only have 10 letters, but the 11th letter is "s". So, as an adjectival noun, "threnodials" becomes the one and only generic name for English "Ulcerations". It is not only less harsh a word than the French one but a sorrowfully appropriate one, since the form is naturally associated with Georges Perec, who died 12 years ago at 46 to the lasting consternation of us all.

(A threnody is a hymn of mourning.)

A few years later, the Oulipo Compendium appeared, edited by Mathews, and the article on Threnodials mentions my assistance. And so it was that when Michael Fischer handed me a copy, I was able to open it up to the place where I was mentioned.

[ Addendum 20140428: Thanks to Philippe Bruhat for some corrections: neither Perec nor Mathews was a founding member of Oulipo. ]

[ Addendum 20170205: To my consternation, Harry Mathews died on January 25. There was nobody like him, and the world is a smaller and poorer place. ]

[ Addendum 20170909: I should have mentioned that my appearance in Oulipo Compendium was brought to my attention by Robin Houston. Thank you M. Houston! ]


Tue, 03 Jan 2012

Eta-reduction in Haskell and English
The other day Katara and I were putting together a model, and she asked what a certain small green part was for. I said "It's a thing for connecting a thing to another thing."

Katara objected that this was a completely unhelpful explanation, but I disagreed. I would have agreed that it was an excessively verbose explanation, but she didn't argue that point.

Later, it occurred to me that Haskell has a syntax for eliding unnecessary variables in cases like this. In Haskell, one can abbreviate the expression

        λx → λy → x + y
to just (+). (Perl users may find it helpful to know that the Perl equivalent of the expression above is sub { my ($x) = @_; return sub { my ($y) = @_; return $x + $y } }.) This is an example of a general transformation called η-reduction. In general, for any function f, λx → f x is a function that takes an argument x and returns f x. But that's exactly what f does. So we can replace the longer version with the shorter version, and that's η-reduction, or we can go the other way, which is η-expansion.
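The same transformation can be illustrated in Python, used here only as a neutral example language. A lambda that does nothing but pass its argument along to a function can be replaced by the function itself:

```python
words = ["thing", "connector"]

verbose = list(map(lambda w: len(w), words))  # eta-expanded: wraps len in a lambda
concise = list(map(len, words))               # eta-reduced: just len

print(verbose == concise)
```

Both calls produce the same list of lengths; the wrapper adds nothing but words.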

Anyway, once I thought of this it occurred to me that, just like the longer expression could be reduced to (+), my original explanation that the small green part was "a thing for connecting a thing to another thing" could be η-reduced to "a connector".

Perhaps if I had said that in the first place Katara would not have complained.

Happy new year, all readers.


Wed, 20 May 2009

No flimping
Advance disclaimer: I am not a linguist, have never studied linguistics, and am sure to get some of the details wrong in this article. Caveat lector.

There is a standard example in linguistics that is attached to the word "flimp". The idea it labels is that certain grammatical operations are restricted in the way they behave, and cannot reach deeply into grammatical structures and rearrange them.

For instance, you can ask "What did you use to see the girl on the hill in the blue dress?" and I can reply "I used a telescope to see the girl on the hill in the blue dress". Here "the girl on the hill in the blue dress" is operating as a single component, which could, in principle, be arbitrarily long. ("The girl on the hill that was fought over in the war between the two countries that have been at war since the time your mother saw that monkey climb the steeple of the church...") This component can be extracted whole from one sentence and made the object of a new sentence, or the subject of some other sentence.

But certain other structures are not transportable. For example, in "Bill left all his money to Fred and someone", one can reach down as far as "Fred and someone" and ask "What did Bill leave to Fred and someone?" but one cannot reach all the way down to "someone" and ask "Who did Bill leave all his money to Fred and?"

Under certain linguistic theories of syntax, analogous constraints rule out the existence of certain words. "Flimped" is the hypothetical nonexistent word which, under these theories, cannot exist. To flimp is to kiss a girl who is allergic to. For example, to flimp coconuts is to kiss a girl who is allergic to coconuts. (The grammatical failure in the last sentence but one illustrates the syntactic problem that supposedly rules out the word "flimped".)

I am not making this up; for more details (from someone who, unlike me, may know what he is talking about) see Word meaning and Montague grammar by David Dowty, p. 236. Dowty cites the earlier sources, from 1969–1973, that proposed this theory in the first place. The "flimped" example above is exactly the same as Dowty's, and I believe it is the standard one.

Dowty provides a similar, but different example: there is not, and under this theory there cannot be, a verb "to thork" which means "to lend your uncle and", so that "John thorked Harry ten dollars" would mean "John lent his uncle and Harry ten dollars".

I had these examples knocking around in my head for many years. I used to work for the University of Pennsylvania Computer and Information Sciences department, and from my frequent contacts with various cognitive-science types I acquired a lot of odds and ends of linguistic and computational folklore. Michael Niv told me this one sometime around 1992.

The "flimp" thing rattled around my head, surfacing every few months or so, until last week, when I thought of a counterexample: Wank.

The verb "to wank to" means "to rub one's genitals while considering", and so seems to provide a counterexample to the theory that says that verbs of this type are illegal in English.

When I went to investigate, I found that the theory had pretty much been refuted anyway. The Dowty book (published 1979) produced another example: "to cuckold" is "to have sexual intercourse with the woman who is married to".

Some Reddit person recently complained that one of my blog posts had no point. Eat this, Reddit person.


Fri, 08 May 2009

Most annoying phrase known to man?
I have been wasting time, those precious minutes of my life that will never return, by eliminating the odious phrase "known to man" from Wikipedia articles. It is satisfying, in much the same way as doing the crossword puzzle, or popping bubble wrap.

In the past I have gone on search-and-destroy missions against certain specific phrases, for example "It should be noted that...", which can nearly always be replaced with "" with no loss of meaning. But "known to man" is more fun.

One pleasant property of this phrase is that one can sidestep the issue of whether "man" is gender-neutral. People on both sides of this argument can still agree that "known to man" is best replaced with "known". For example:

  • The only albino gorilla known to man...
  • The most reactive and electronegative substance known to man...
  • Copper and iron were known to man well before the copper age and iron age...
In examples like these, "to man" is superfluous, and one can delete it with no regret.
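For what it is worth, the mechanical part of this edit is a one-line substitution. The following Python sketch is only an illustration of that, not a tool I used on Wikipedia, and the function name is made up. Direct quotations and deliberate uses of the phrase still need a human to look at them.

```python
import re

# \b keeps the pattern from matching inside "unknown to man" or "known to mankind".
PHRASE = re.compile(r"\bknown to man\b")

def trim_known_to_man(text):
    """Replace "known to man" with "known"."""
    return PHRASE.sub("known", text)

print(trim_known_to_man("The only albino gorilla known to man..."))
```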

As a pleonasm and a cliché, "known to man" is a signpost to prose that has been written by someone who was not thinking about what they were saying, and so one often finds it amid other prose that is pleonastic and clichéd. For example:

Diamond ... is one of the hardest naturally occurring material known (another harder substance known today is the man-made substance aggregated diamond nanorods which is still not the hardest substance known to man).
Which I trimmed to say:

Diamond ... is one of the hardest naturally-occurring materials known. (Some artificial substances, such as aggregated diamond nanorods, are harder.)
Many people ridicule Strunk and White's fatuous advice to "omit needless words"—if you knew which words were needless, you wouldn't need the advice—but all editors know that beginning writers will use ten words where five will do. The passage above is a good example.

Can "known to man" always be improved by replacement with "known"? I might have said so yesterday, but I mentioned the issue to Yaakov Sloman, who pointed out that the original use was meant to suggest a contrast not with female knowledge but with divine knowledge, an important point that completely escaped my atheist self. In light of this observation, it was easy to come up with a counterexample: "His acts descended to a depth of evil previously unknown to man" partakes of the theological connotations very nicely, I think, and so loses some of its force if it is truncated to "... previously unknown". I suppose that many similar examples appear in the work of H. P. Lovecraft.

It would be nice if some of the Wikipedia examples were of this type, but so far I haven't found any. The only cases so far that I haven't changed are all direct quotations, including several from the introductory narration of The Twilight Zone, which asserts that "There is a fifth dimension beyond that which is known to man...". I like when things turn out better than I expected, but this wasn't one of those times. Instead, there was one example that was even worse than I expected. Bad writing it may be, but the wrongness of "known to man" is at least arguable in most cases. (An argument I don't want to make today, although if I did, I might suggest that "titanium dioxide is the best whitening agent known to man" be rewritten as "titanium dioxide is the best whitening agent known to persons of both sexes with at least nine and a half inches of savage, throbbing cockmeat.") But one of the examples I corrected was risibly inept, in an unusual way:

Wonder Woman's Amazon training also gave her limited telepathy, profound scientific knowledge, and the ability to speak every language known to man.
I have difficulty imagining that the training imparted to Diana, crown princess of the exclusively female population of Paradise Island, would be limited to languages known to man.

Earle Martin drew my attention to the Wikipedia article on "The hardest metal known to man". I did not dare to change this.

[ Addendum 20090515: There is a followup article. ]

[Other articles in category /lang] permanent link

Sun, 15 Feb 2009

Stupid crap, presented by Plato
Yesterday I posted:

"She is not 'your' girlfriend," said this knucklehead. "She does not belong to you."
Through pure happenstance, I discovered last night that there is an account of this same bit of equivocation in Plato's Euthydemus. In this dialogue, Socrates tells of a sophist named Dionysodorus, who is so clever that he can refute any proposition, whether true or false. Here Dionysodorus demonstrates that Ctesippus's father is a dog:

You say that you have a dog.

Yes, a villain of a one, said Ctesippus.

And he has puppies?

Yes, and they are very like himself.

And the dog is the father of them?

Yes, he said, I certainly saw him and the mother of the puppies come together.

And is he not yours?

To be sure he is.

Then he is a father, and he is yours; ergo, he is your father, and the puppies are your brothers.

So my knuckleheaded interlocutor was not even being original.

I gratefully acknowledge the gift of Thomas Guest. Thank you very much!

[Other articles in category /lang] permanent link

Fri, 31 Oct 2008

A proposed correction to an inconsistency in English orthography
English contains exactly zero homophones of "zero", if one ignores the trivial homophone "zero", as is usually done.

English also contains exactly one homophone of "one", namely "won".

English does indeed contain two homophones of "two": "too" and "to".

However, the expected homophones of "three" are missing. I propose to rectify this inconsistency. This is sure to make English orthography more consistent and therefore easier for beginners to learn.

I suggest the following:

I also suggest the founding of a well-funded institute with the following mission:

  1. Determine the meanings of these three new homophones
  2. Conduct a public education campaign to establish them in common use
  3. Lobby politicians to promote these new words by legislation, educational standards, public funding, or whatever other means are appropriate
  4. Investigate the obvious sequel issues: "four" has only "for" and "fore" as homophones; what should be done about this?
Obviously, the director of this institute should be a thoughtful, far-seeing individual who will not allow his good judgement to be clouded by the generous salary. I refer, of course, to myself.

Happy Halloween. All Hail Discordia.

[ Addendum 20081106: Some readers inexplicably had nothing better to do than to respond to this ridiculous article. ]

[Other articles in category /lang] permanent link

Wed, 14 May 2008

More artificial Finnish
Several Finns wrote to me to explain in some detail what was wrong with the artificial Finnish in yesterday's article. As I surmised, the words "ssän" and "kkeen" are lexically illegal in Finnish. There were a number of similar problems. For example, my sample output included the non-word "t". I don't know how this could have happened, since the input probably didn't include anything like that, and the Markov process I used to generate it shouldn't have done so. But the code is lost, so I suppose I'll never know.

Of the various comments I received, perhaps the most interesting was from Ilmari Vacklin. ("Vacklin", huh? If my program had generated "Vacklin", the Finns would have been all over the error.) M. Vacklin pointed out that a number of words in my sample output violated the Finnish rules of vowel harmony.

(M. Vacklin also suggested that my article must have been inspired by this comic, but it wasn't. I venture to guess that the Internet is full of places that point out that you can manufacture pseudo-Finnish by stringing together a lot of k's and a's and t's; it's not that hard to figure out. Maybe this would be a good place to mention the word "saippuakauppias", the Finnish term for a soap-dealer, which was in the Guinness Book of World Records as the longest commonly-used palindromic word in any language.)

Anyway, back to vowel harmony. Vowel harmony is a phenomenon found in certain languages, including Finnish. These languages class vowels into two antithetical groups. Vowels from one group never appear in the same word as vowels from the other group. When one has a prefix or a suffix that normally has a group A vowel, and one wants to join it to a word with group B vowels, the vowel in the suffix changes to match. This happens a lot in Finnish, which has a zillion suffixes. In many languages, including Finnish, there is also a third group of vowels which are "neutral" and can be mixed with either group A or with group B.

Modern Korean does not have vowel harmony, mostly, but Middle Korean did have it, up until the early 16th century. The Korean alphabet was invented around 1443, and the notation for the vowels reflected the vowel harmony:

[ Addendum 20080517: The following paragraph about vowel harmony contains significant errors of fact. I got the groups wrong. ]

The first four vowels in this illustration, with the vertical lines, were incompatible with the second four vowels, the ones with the horizontal lines. The last two vowels were neutral, as was another one, not shown here, which was written as a single dot and which has since fallen out of use. Incidentally, vowel harmony is an unusual feature of languages, and its presence in Korean has led some people to suggest that it might be distantly related to Turkish.

The vowel harmony thing is interesting in this context for the following reason. My pseudo-Finnish was generated by a Markov process: each letter was selected at random so as to make the overall frequency of the output match that of real Finnish. Similarly, the overall frequency of two- and three-letter sequences in pseudo-Finnish should match that in real Finnish. Is this enough to generate plausible (although nonsensical) Finnish text? For English, we might say maybe. But for Finnish the answer is no, because this process does not respect the vowel harmony rules. The Markov process doesn't remember, by the time it gets to the end of a long word, whether it is generating a word in vowel category A or B, and so it doesn't know which vowels it should be generating. It will inevitably generate words with mixed vowels, which is forbidden. This problem does not come up in the generation of pseudo-English.
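To make the complaint concrete, here is a minimal sketch of a checker that flags words mixing the two vowel groups. It assumes the usual textbook grouping for Finnish, which the article does not spell out: back vowels a, o, u; front vowels ä, ö, y; with e and i neutral. The function names are made up for illustration.

```python
# Assumed grouping (not given in the article): the standard Finnish one.
BACK = set("aou")
FRONT = set("äöy")  # e and i are neutral and may appear with either group


def violates_harmony(word):
    """True if the word contains both a back vowel and a front vowel."""
    letters = set(word.lower())
    return bool(letters & BACK) and bool(letters & FRONT)


def mixed_words(text):
    """Return the words of `text` that mix back and front vowels."""
    # Replace anything that is not a letter with a space, then split.
    words = "".join(ch if ch.isalpha() else " " for ch in text).split()
    return [w for w in words if violates_harmony(w)]


print(mixed_words("suurinteen taisiä välin"))
```

A word flagged by this check is not necessarily wrong: compound words and loanwords in real Finnish can mix the groups, so the check is only meaningful for single, native word stems.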

None of that was what I was planning to write about, however. What I wanted to do was to present samples of pseudo-Finnish generated with various tunings of the Markov process.

The basic model is this: you choose a number N, say 2, and then you look at some input text. For each different sequence of N characters, you count how many times that sequence is followed by "a", how many times it is followed by "b", and so on.

Then you start generating text at random. You pick a sequence of N characters arbitrarily to start, and then you generate the next character according to the probabilities that you calculated. Then you look at the last N characters (the last N-1 from before, plus the new one) and repeat. You keep doing that until you get tired.
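The original program is lost, so the following is only a sketch of the model as just described, not a reconstruction of it. The names `build_table` and `generate` are invented here, and restarting from an arbitrary context on a dead end is my own assumption about how to handle a sequence that appears only at the very end of the input.

```python
import random
from collections import Counter, defaultdict


def build_table(text, n):
    """Map each n-character sequence to a Counter of the characters that follow it."""
    table = defaultdict(Counter)
    for i in range(len(text) - n):
        table[text[i:i + n]][text[i + n]] += 1
    return table


def generate(table, n, length, rng=random):
    """Generate `length` characters, choosing each one in proportion to the counts."""
    out = rng.choice(sorted(table))          # arbitrary starting sequence
    while len(out) < length:
        # Look up the last n characters (the empty string when n is 0).
        followers = table.get(out[len(out) - n:])
        if not followers:
            # Dead end: this sequence was never followed by anything. Restart.
            out += rng.choice(sorted(table))
            continue
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out += rng.choices(chars, weights=weights)[0]
    return out[:length]


if __name__ == "__main__":
    sample = "in the beginning god created the heaven and the earth. " * 3
    print(generate(build_table(sample, 2), 2, 60))
```

With N=0 the table has a single key, the empty string, and the process degenerates to picking letters independently according to their overall frequency.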

For example, suppose we have N=2. Then we have a big table whose keys are 2-character strings like "ab", and then associated with each such string, a table that looks something like this:
r 54.52
a 15.89
i 10.41
o 7.95
l 4.11
e 3.01
u 1.10
space 0.82
: 0.55
t 0.55
, 0.27
. 0.27
b 0.27
s 0.27
So in the input to this process, "ab" was followed by "r" more than 54% of the time, by "a" about 16% of the time, and so on. And when generating the output, every time our process happens to generate "ab", it will follow by generating an "r" 54.52% of the time, an "a" 15.89% of the time, and so on.
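A table like the one above can be computed directly from the input. This is a small sketch, with a hypothetical function name, that counts what follows a given sequence and converts the counts to percentages:

```python
from collections import Counter


def follower_percentages(text, context):
    """Return (character, percent) pairs for what follows `context`, most common first."""
    n = len(context)
    counts = Counter(
        text[i + n]
        for i in range(len(text) - n)
        if text[i:i + n] == context
    )
    total = sum(counts.values())
    return [(ch, 100.0 * k / total) for ch, k in counts.most_common()]


if __name__ == "__main__":
    for ch, pct in follower_percentages("abracadabra abba", "ab"):
        print(f"{ch} {pct:.2f}")
```

If the sequence never occurs in the input, the result is simply an empty list.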

Whether to count capital letters as the same as lowercase, and what to do about punctuation and spaces and so forth, are up to the designer.

Here, as examples, are some samples of pseudo-English, generated with various N. The input text was the book of Genesis, which is not entirely typical. In each case, I deleted the initial N characters and the final partial word, cleaned up the capitalization by hand, and appended a final period.

N=0: Lt per f idd et oblcs hs hae:uso ar w aaolt y tndh rl ohn n synenihbdrha,spegn.
N=1: Cachand t wim, heheethas anevem blsant ims, andofan, ieahrn anthaye s, lso iveeti alll t tand, w.
N=2: Ged hich callochbarthe of th to tre said nothem, and rin ing of brom. My and he behou spend the.
N=3: Sack one eved of and refor ther of the hand he will there that in the ful, when it up unto rangers.
It should be clear that the quality improves as one increases the N parameter. The N=3 sample has mostly real words, and the few nonsense ones it contains ("eved", "ful") are completely plausible English. N=2, on the other hand, is mostly nonsense, although it's mostly plausible nonsense. Even "callochbarthe" is almost plausible. (The unfortunate "chb" in the middle is just bad luck. It occurs because Genesis 36 mentions Baalhanan the son of Achbor.) The N=1 sample is recognizably bogus; no English word looks like "ieahrn", and the triple "l" in "alll" is nearly impossible. (I did once write to Jesse Sheidlower, an editor of the Big Dictionary, to ask his advice about whether "ballless" should be hyphenated.)

I have prepared samples of pseudo-Finnish of various qualities. The input here was a bunch of text I copied out of Finnish Wikipedia. (Where else? If you need Finnish text in 1988, you get it from the Usenet group; if you need Finnish text in 2008, you get it from Finnish Wikipedia.) I did a little bit of manual cleanup, as with the English, but not too much.

Vtnnstäklun so so rl sieesjo.Aiijesjeäyuiotiannorin traäl.N vpojanti jonn oteaanlskmt enhksaiaaiiv oenlulniavas. Rottlatutsenynöisu iikännam e lavantkektann eaagla admikkosulssmpnrtinrkudilsorirumlshsmoti,anlosa anuioessydshln.Atierisllsjnlu e.Itatlosyhi vnko ättr otneän akho smalloailäi jiaat kajvtaopnasneilstio tntin einteaonaiimotn:r apoya oruasnainttotne wknaiossäelaäinoev aobrs,vteorlokynv. Aevsrikhanä tp s s oälnlke rvmi il ynae nara ign ssm lkimttbhineaatismäi tst lli ahaltineshne kr keöunv ah s itenh s .Ia pa elstpnanmnuiksriil anaalnttt mr ti.Ooa ka eee eiiei,tnees äusee a nanhetv.Iopkijeatatits,i l eklbiik suössmap tioaotaktdiir rkeaviohiesotkeagarihv nnadvö jlape öt kaeakmjkhykoto tnt iunnuyknnelu rutliie.Leva eiriaösnaj,rk oyumtsle,iioa,aspa aeiaä wsuinn eta y tvati klssviutkuaktmlpnheomi.T akapskushhnuksnhnnheaaaaussitseminmpnamäiaä pät.Kaaaabl unnionuhnpa iaes,outka.Cväinvkshvrnlteeoea rmi re suodmpr autlysa tnliaanäass. Srs rnvrtsita kmidusvjn tii.
Ava pän svun kerekent lsita batävomenasttenerga kovosuujalules rma punntäni rtraliksainoi van eukällä. Enäkukänesinntampalä ttan kolpäsäkyönsllvitivenestakkesenelussivaliite kuuksä kttteni einsuekeita kuterissalietäkilpöikalit ojatäjä pinsin atollukole idoitenn kkaorhjajasteden en vuolynkoiverojaa hta puon ehalan vaivä ihoshäositi. Hde setua tämpitydi makta jasyn sää oinncgrkai jeeten. Ljalanekikeri toiskkksypohoin ta yö atenesällväkeesaatituuun. Paait pukata tuon ktusumitttan zagaleskli va kkanäsin siikutytowhenttvosa veste eten vunovivä. Vorytellkeeni stan jä taa eka kaine ja kurenntonsin kyn o nta ja. Aisst urksetaka. Hotimivaa ta mppussternallai ja. Hdä on koraleerermohtydelen on jon. Rgienon kulinoilisälsa ja holälimmpa vitin, kukausoompremänn ra, palestollebilsen kaalesta, oina. Blilullaushoingiötideispaanoksiton, mulurklimi kermalli pota atebau lmomarymin kypa hta vanon tin kela vanaspoita s kulitekkäjen jäleetuolpan, veesalekäilin oii. Häreli. Ymialisstermimpriekaksst on.
Omaalis onino osa josa hormastaaraktse tyi altäänä tyntellevääostoidesenä, la siä vuansilliana inöön akalkuulukempellys kisä nen myöhelyaminenkiemostamahti omuonsa onite oni kusissa. Kungin sykynteillalkaai ellahasiteisuunnaja eroniemmin javai musuuasinä, sittan tusuovatkryt tormon vuolisenitiivansaliuotkietjuuta sensa. Kutumppalvinen. Vaikintolat hän ja kilkuossa osa koiseuvo keyhdysvisakeemppolowistoisijouliuodosijolasissän muoli ogro soluksi valuksasverix intetormon patlantaan et muiksen paiettaatulun kan vuomesyklees ovain pun. Sesva sa hänerittämpiraun tyi vuoden sälisen sän yhtiit, set tämpiraalletä. Senssaikanoje leemp:tabeten ain raa olliukettyi su. Solulukuuttellerrotolit hee säkinessa hän sekketäärinenvaikeihakti umallailuksin sestunno klossi ilunuta. Klettisaa osen vua vuola, jani ja hinangia en ta kaineemonimien polin barkiviäliukkuta joseseva. Ebb rautta onistärään on ml jokoulistä oheksi anoton allysvallelsiliineuvoja kutuko ala ulkietutablohitkain. Ituno.
Ävivät mena osakeyhti yhdysvalmiininäkin rakenne tuliitä hermoni ja umpirauhastui liin baryshnikoneja. Ain viljelukuullisää olisäke spesideksyylikoliittu latvia. Helsina hän solukeskuksen kannumme, peri palkin vieskeinä sisään on orgaan poikanssisäätelukauno klee laisenäläinen tavastui kauno on länteen muttava hän voimista kilometsästymistettäjän lehtiöiksitoreisö. Sitoutuvat mukalle. Ainettiin sisäke suomaihin, jouluun. Verenkilpalveli valtaineen opisteri poli ohjasionee rakennuttikolan aivastisenäläistuu kehittisetoja, rajahormaailmanajan kulkopuolesti kuluu mooliitoutuvat ovat olle. Ainen yhdysvaltai valiolähtiöiksi vasta, S. Muidentilaisteri jotka verenkirovin verenkiehumistä nelle väliaivoittynyt baleviiliukoisiin maailmestavarasta, jokakuudessa laisu. Sai rakeyhti yhtiö eli gluksessa. Ebbin, ja linnosakkeen hormonien I hallistehtiin kilpirasvua jaajana hormaailusta kunnetteluskäyttöön suomalaivat yhdysvalmistämistammonit veteet olimistuvatta. Hormon oli rautta.
Before anyone objects to the non-word "ml" in the N=2 sample, let me explain that this is the standard abbreviation for "millilitra". The "i" in the N=3 sample was a puzzle, since Marko Heiskanen assures me that Finnish has no one-letter words. But it appears in my sample in connection with Sukselaisen I hallitus, whatever that is, so I capitalized it.

I must say that I found "yhdysvalmistämistammonit" rather far-fetched, even in Finnish. But then I discovered that "yhdeksänkymmenvuotiaaksi" and "yhdysvalloissakaan" are genuine, so who am I to judge?

[ Addendum 20080601: Some additional notes. ]

[Other articles in category /lang] permanent link

Mon, 12 May 2008

Artificial Finnish

By 1988 or 1989 I had read in several places, most recently in J. R. Pierce's Symbols, Signals, and Noise, that if you compile a table of the relative frequencies of three-letter sequences (trigraphs) in English text, and then generate random text with the same trigraph frequencies, the result cannot be distinguished from meaningful English text except by people who actually know English. Examples were provided, containing weird but legitimate-sounding words like "deamy" and "grocid", and the claim seemed plausible. But since I did actually know English, I could not properly evaluate it.

But around that time the Internet was just beginning to get into full swing. The Finnish government was investing a lot of money in networking infrastructure, and a lot of people in Finland were starting to appear on the Internet.

I have a funny story about that: Around the same time, a colleague named Marc Edgar approached me in the computer lab to ask if I knew of any Internet-based medium he could use to chat with his friend at the University of Oulu. I thought at first that he was putting me on (and maybe he was) because in 1989 the University of Oulu was just about the only place in the world where a large number of people were accessible via internet chat, IRC having been invented there the previous autumn.

A new set of Finnish-language newsgroups had recently appeared on Usenet, and people posted to them in Finnish. So I had access to an unlimited supply of computer-readable Finnish text, something which would have been unthinkable a few years before, and I could do the experiment in Finnish.

I wrote up the program, which is not at all difficult, gathered Finnish news articles, and produced the following sample:

Uttavalon estaa ain pahalukselle? Min omatunu selle menneet hy, toista. Palveljen alh tkö an välin oli ei alkohol pisten jol elenin. Että, ille, ittavaikki oli nim tor taisuuristä usein an sie a in sittä asia krista sillo si mien loinullun, herror os; riitä heitä suurinteen palve in kuk usemma. Tomalle, äs nto tai sattia yksin taisiä isiäk isuuri illää hetorista. Varsi kaikenlaineet ja pu distoja paikelmai en tulissa sai itsi mielim ssän jon sn ässäksi; yksen kos oihin! Jehovat oli kukahdol ten on teistä vak kkiasian aa itse ee eik tse sani olin mutta todistanut t llisivat oisessa sittä on raaj a vaisen opinen. Ihmisillee stajan opea tajat ja jumalang, sitten per sa ollut aantutta että voinen opeten. Ettuj, jon käs iv telijoitalikantaminun hä seen jälki yl nilla, kkeen, vaaraajil tuneitteistamaan same?

In those days, the world was 7-bit, and Finnish text was posted in a Finnish national variant of ASCII that caused words like "tkö an välin" to look like "tk| an v{lin". The presence of the curly braces heightened the apparent similarity, because that was all you could see at first glance.
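For the curious, the substitution can be sketched as a translation table. Only the two pairs visible in the example above (ä as "{" and ö as "|") come from this article; the other four are my recollection of the Finnish/Swedish 7-bit variant and should be treated as an assumption. The function names are made up.

```python
# ä -> { and ö -> | are taken from the example in the text.
# The remaining four pairs are assumed from the Finnish/Swedish 7-bit variant.
PAIRS = {"ä": "{", "ö": "|", "å": "}", "Ä": "[", "Ö": "\\", "Å": "]"}

TO_SEVEN_BIT = str.maketrans(PAIRS)
FROM_SEVEN_BIT = str.maketrans({v: k for k, v in PAIRS.items()})


def to_seven_bit(text):
    """Show Finnish text as it would have looked on a plain ASCII terminal."""
    return text.translate(TO_SEVEN_BIT)


def from_seven_bit(text):
    """Undo the substitution. Genuine braces and brackets are converted too."""
    return text.translate(FROM_SEVEN_BIT)


print(to_seven_bit("tkö an välin"))
```

The reverse direction is lossy in practice: a real "{" or "|" in the text cannot be told apart from a letter, which is part of why this scheme was eventually abandoned.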

At the time I was pleased, but now I think I see some defects. There are some vowelless words, such as "sn" and "t", which I think don't occur in Finnish. Some other words look defective: "ssän" and "kkeen", for example. Also, my input sample wasn't big enough, so once the program generated "alk" it was stuck doing the rest of "alkohol". Still, I think this could pass for Finnish if the reader wasn't paying much attention. I was satisfied with the results of the experiment, and was willing to believe that randomly-constructed English really did look enough like English to fool a non-English-speaking observer.

[ Addendum 20080514: There is a followup to this article. ]

[ Addendum 20080601: Some additional notes. ]

[Other articles in category /lang] permanent link

Tue, 04 Mar 2008

"Boolean" or "boolean"?
In a recent article I wrote:

... a logical negation function ... takes a boolean argument and returns a boolean result.
I worried for some time about whether to capitalize "boolean" here. But writing "Boolean" felt strange enough that I didn't actually try it to see how it looked on the page.

I looked at the Big Dictionary, and all the citations were capitalized. But the most recent one was from 1964, so that was not much help.

Then I tried Google search for "boolean capitalized". The first hit was a helpful article by Eric Lippert. M. Lippert starts by pointing out that "Boolean" means "pertaining to George Boole", and so should be capitalized. That much I knew already.

But then he pointed out a countervailing consideration:

English writers do not usually capitalize the eponyms "shrapnel" (Henry Shrapnel, 1761-1842), "diesel" (Rudolf Diesel, 1858-1913), "saxophone" (Adolphe Sax, 1814-1894), "baud" (Emile Baudot, 1845-1903), "ampere" (Andre Ampere, 1775-1836), "chauvinist" (Nicolas Chauvin, 1790-?), "nicotine" (Jean Nicot, 1530-1600) or "teddy bear" (Theodore Roosevelt, 1858-1916).
Isn't that a great paragraph? I just had to quote the whole thing.

Lippert concluded that the tendency is to capitalize an eponym when it is an adjective, but not when it is a noun. (Except when it isn't that way; consider "diesel engine". English is what it is.)

I went back to my example to see if that was why I resisted capitalizing "Boolean":

... takes a boolean argument and returns a boolean result.
Hmm, no, that wasn't it. I was using "boolean" as an adjective in both places. Wasn't I?

Something seemed wrong. I tried changing the example:

... takes an integer argument and returns an integer result.
Aha! Notice "integer", not "integral". "Integral" would have been acceptable also, but that isn't analogous to the expression I intended. I wasn't using "boolean" as an adjective to modify "argument" and "result". I was using it as a noun to denote a certain kind of data, as part of a noun phrase. So it is a noun, and that's why I didn't want to capitalize it.

I would have been happy to have written "takes a boolean and returns a boolean", and I think that's the controlling criterion.

Sorry, George.

[Other articles in category /lang] permanent link

Mon, 18 Feb 2008

Once I was visiting my grandparents while home from college. We were in the dining room, and they were talking about a book they were reading, in which the author had used a word they did not know: cornaptious. I didn't know it either, and got up from the table to look it up in their Webster's Second International Dictionary. (My grandfather, who was for his whole life both cantankerous and a professional editor, loathed the permissive and descriptivist Third International. The out-of-print Second International Edition was a prized Christmas present that in those days was hard to find.)

Webster's came up with nothing. Nothing but "corniculate", anyway, which didn't appear to be related. At that point we had exhausted our meager resources. That's what things were like in those days.

The episode stuck with me, though, and a few years later when I became the possessor of the First Edition of the Oxford English Dictionary, I tried there. No luck. Some time afterwards, I upgraded to the Second Edition. Still no luck.

Years went by, and one day I was reading The Lyre of Orpheus, by Robertson Davies. The unnamed Dean of the music school describes the brilliant doctoral student Hulda Schnakenburg:

"Oh, she's a foul-mouthed, cornaptious slut, but underneath she is all untouched wonderment."
"Aha," I said. "So this is what they were reading that time."

More years went by, the oceans rose and receded, the continents shifted a bit, and the Internet crawled out of the sea. I returned to the problem of "cornaptious". I tried a Google book search. It found one use only, from The Lyre of Orpheus. The trail was still cold.

But wait! It also had a suggestion: "Did you mean: carnaptious", asked Google.

Ho! Fifty-six hits for "carnaptious", all from books about Scots and Irish. And the OED does list "carnaptious". "Sc. and Irish dial." it says. It means bad-tempered or quarrelsome. Had Davies spelled it correctly, we would have found it right away, because "carnaptious" does appear in Webster's Second.

So that's that then. A twenty-year-old spelling error cleared up by Google Books.

[ Addendum 20080228: The Dean's name is Wintersen. Geraint Powell, not the Dean, calls Hulda Schnakenburg a cornaptious slut. ]

[Other articles in category /lang] permanent link

Thu, 31 Jan 2008

Unnecessary imprecision
This article contains the following sentence:

McCain has won all of the state's 57 delegates, and the last primary before voters in more than 20 states head to the polls next Tuesday.
Why "more than 20 states"? Why not just say "23 states", which is shorter and conveys more information?

I'm not trying to pick on CTV here. A Google News search finds 42,000 instances of "more than 20", many of which could presumably be replaced with "26" or whatever. Well, I had originally written "most of which", but then I looked at some examples, and found that the situation is better than I thought it would be. Here are the first ten matches:

  1. Australian Stocks Complete Worst Month in More Than 20 Years
  2. It said the US air force committed more than 20 cases of aerial espionage by U-2 strategic espionage planes this month.
  3. Farmland prices have climbed more than 20% over the past year in many Midwestern states...
  4. "We have had record-breaking growth in our monthly shipments, as much as more than 20 percent improvements per month," said Christopher Larkins, President...
  5. More than 20 people, including a district officer, were injured when two bombs exploded outside a stadium in the town yesterday...
  6. By a vote of 14-7, the Senate Finance Committee last night voted to deliver $500 tax rebates to more than 20 million American senior citizens...
  7. 9 killed, more than 20 injured in bus accident
  8. While Tuesday's results may not lock up the nomination for either candidate, Democrats will have their say in more than 20 states...
  9. Facing the potential anointment of his rival, John McCain, Romney has less than a week to convince voters in more than 20 states that...
  10. More than 20 Aberdeen citizens qualified for elections as April ...
#1 may be legitimate, if the previous worst month was less than 21 years ago. Similarly #6 is legitimate if the number of senior citizens is close to 20 million, say around 20,400,000, particularly since the number may not be known with high precision.

#2 may be legitimate, if the number of cases of aerial espionage is not known with certitude, or if the anonymous source really did say "more than 20". Similarly #4 is entirely off the hook since it is a quotation.

#3 may be legitimate if the increase in farmland prices is uncertain and close to 20%. #5 is probably a loser. #7 is definitely a loser: it was the headline of an article that began "Nine people were killed and at least 22 injured when...". The headline could certainly have been "9 killed, 22 injured in bus accident".

#8 and #9 are losers, but they are the same example with which I began the article, so they don't count. #10 is a loser.

So I have, of eight examples (disregarding #8 and #9) three certain or near-certain failures (#5, #7, and #10), one certain non-failure (#4), and four cases to which I am willing to extend the benefit of the doubt. This is not as bad as I feared. I like when things turn out better than I thought they would.

But I really wonder what is going on with all these instances of "more than 20 states". Is it just sloppy writing? Or is there some benefit that I am failing to appreciate?

[Other articles in category /lang] permanent link

Sun, 06 Jan 2008

A while back I looked up "zillion" in Wikipedia, which is an alias for the Wikipedia article about "Indefinite and fictitious numbers". The article includes a large number of synonyms for "zillion", such as bajillion, kajillion, gazillion, and so forth. For some reason the word "squillion" caught my eye, and I noticed that the citation was from Terry Pratchett: "And you owe me a million billion trillion zillion squillion dollars." This suggested to me that "squillion" might be a nonce-word, one made up on the spot by Pratchett for that one sentence, in which case it should not be in the Wikipedia article.

Google book search is a good way to answer questions like that, because if "squillion" is widely used, you will find a lot of examples of it. And indeed it is widely used, and I did find a lot of examples of it. So there was no need to remove it from the article.

One of the Google hits was from the Cormac Ó Cuilleanáin translation of Giovanni Boccaccio's Decameron. The Decameron is a great classic of Italian Renaissance literature, probably the greatest classic that Italian has, after Dante's Divine Comedy. It was written around 1350. In this particular chapter (the tenth story on the sixth day, if you want to look it up) Guccio, a priest, is trying to seduce a hideous kitchen-maid:

He sat himself down by the fire—although this was August—and struck up a conversation with the wench in question (Nuta by name), informing her that he was by rights a member of the gentry and had more than a squillion florins in the bank, not counting those he had to give to other people...

The kitchen-maid, by the way, is described as having "a pair of tits like two baskets of manure".

This was amusing, and as I had never read the Decameron, I wanted to read more, and learn how it turned out. But the Google excerpt was limited, so I asked the library to get me a copy of that version of the Decameron. Of course they have many copies on the shelf, but not that particular translation. So I asked the interlibrary loan people for it, and they got it for me.

When it arrived, I was rather dismayed. The ILL people get the book from the most convenient place, and that means that it often comes from the Drexel library, up the street, or the Temple library, across town, or the West Chester Community College library, or Lehigh University, about an hour away in Bethlehem. (Steel Bethlehem, of course, not Jesus Bethlehem.) The farthest I had ever gotten a book from was an extremely obscure quilting manual that Lorrie asked for; it eventually arrived from the Sno-Isles regional library system of Marysville, Washington.

But this copy of the Decameron came from the Sloman library of the University of Essex. I was so shocked that I had to look it up online to make sure that it was not Essex, New Jersey, or something like that. It was not. It was East Saxony. I was upset because I felt that the trouble and effort had been wasted. If I had known that the nearest available copy of Cormac Ó Cuilleanáin's translation was in Essex, I would have been happy to take a different version that was on the shelf. And then to top it off, I had hardly begun to read it before it came due and had to be sent back to Essex.

So I went to the library and got another Decameron, this one translated by Mark Musa and Peter Bondanella. Here is the corresponding passage:

Although it was still August, he took a seat near the fire and began to talk with the girl, whose name was Nuta, telling her that he was a gentleman by procuration, that he had more than a thousand hundreds of florins (not counting those he had to give away to others), ...

And there is a footnote on "thousand hundreds" explaining "Guccio invents this amount, as well as the previous phrase 'by procuration,' in order to impress his lady." By the way, in this version, Nuta has "a pair of tits that looked like two clumps of cowshit".

Anyway, I think I liked "squillions" better than "thousand hundreds", although I suppose "thousand hundreds" is probably a more literal translation.

Well, I can find this out. Of course, one can find the Decameron online in Italian; the copyright expired about five hundred years ago. Here it is in Italian, courtesy of Brown University:

E ancora che d'agosto fosse, postosi presso al fuoco a sedere, cominciò con costei, che Nuta aveva nome, a entrare in parole e dirle che egli era gentile uomo per procuratore e che egli aveva de' fiorini piú di millantanove, senza quegli che egli aveva a dare altrui,...
I think the word that is being translated here is "millantanove", although I can't be entirely sure, because I don't know Italian. Once again, though, I am surprised at how easy it is to read a passage in an unintelligible foreign language when I already know what it is going to say. (I wrote about this back in April 2006, and it occurs to me now that that would be a fun topic for an article.)

The 1903 translation that Brown University provides is "more florins than could be reckoned", which does not seem to me to capture the flavor of the original, and does not seem to be a literal translation either. "Millantanove" seems to me to be a made-up word resembling "mille" = "thousand". But as I said, I don't know Italian.

Nuta in this version has "a pair of breasts that shewed as two buckets of muck". Feh. The Italian is "con un paio di poppe che parean due ceston da letame". The operative phrase here seems to be "ceston da letame". I don't know what those words mean, but, happily, Italian Wikipedia has an article about letame, and as the picture makes clear, it is indeed manure.

Oh, did you want this article to have a point? Too bad.

I recommend the Decameron. It is funny and salacious. There are a lot of stories about women cheating on their husbands, and then getting away with it through some clever trick, and then everyone who hears the story laughs and admires the cleverness of the ladies. (The counterpoint to this is that there are a number of stories of wife-beating, in which everyone who hears the story laughs and admires the wisdom of the husbands. I don't like that so much.)

There are farcical stories of bed-swapping and wife-swapping, and one story about an abbess who comes out of her cell to berate a nun for having her lover in to visit, but the abbess is wearing a pair of men's trousers on her head instead of her wimple. Oops.

This reminds me of when I was in high school, I was talking to one of my friends, who opted to study French, and this friend told me studying French is fun, because when you get to the third year and start reading real French literature, you read that great classic of French Literature, La Vie de Gargantua et de Pantagruel. If you have not read this master treasure of French culture, I should explain that the first chapter is mainly taken up with Gargantua and Pantagruel having a discussion about what is the best sort of thing to wipe your ass with, and it goes on from there.

I took Latin, and in third-year Latin we read the orations of Cicero against Catiline. Fun stuff, but not the sort of thing that has you rushing to translate the next word.

I was going to write an article about symmetries of the dodecahedron, and an interesting problem suggested to me by these balloon displays that I saw at the local Mazda dealership, but eh, this was a lot easier.

Gargantua and Pantagruel eventually agree that the answer is a live goose.

[ Addendum 20080201: More about 'millantanove'. ]

[Other articles in category /lang] permanent link

Sat, 05 Jan 2008

Pepys' footballs explained
Walt Mankowski wrote to me with the explanation of Samuel Pepys' footballs: They are not clods of mud, as I guessed, nor horse droppings, as another correspondent suggested, but... footballs.

Walt found a reference in Montague Shearman's 1887 book on the history of football in England that specifically mentions this. Folks were playing football in the street, and because of this, Pepys took his coach to Sir Philip Warwicke's, rather than walking.

I didn't ask, but I presume Walt found this by doing some straightforward Google search for "pepys footballs" or something of the sort. For some reason, this did not even occur to me. Once Big Dictionary failed me, I was stumped. Perhaps this marks me as a member of the pre-Internet generation. I imagined this morning that this episode would be repeated, with my daughter Katara in place of Walt. "Oh, Daddy! You're so old-fashioned. Just use a Google search."

Anyway, inspired by Walt's example, or by what I imagined Walt's example to be, I did the search myself, and found the Shearman reference, as well as the following discussion in William Carew Hazlitt's Faiths and Folklore of 1905:

Mission, writing about 1690, says: "In winter foot-ball is a useful and charming exercise. It is a leather ball about as big as one's head, fill'd with wind. This is kick'd about from one to t'other in the streets, by him that can get at it, and that is all the art of it."
This book looks like it would be good reading in general. [ Addendum 20080106: This is not the William Hazlitt, but his grandson. Thank you, Wikipedia. ]

Thanks very much, Walt.

[Other articles in category /lang] permanent link

Fri, 04 Jan 2008

The diary of Samuel Pepys for Tuesday, 3 January 1664/5 says:

Up, and by coach to Sir Ph. Warwicke's, the streete being full of footballs, it being a great frost, and found him and Mr. Coventry walking in St. James's Parke.
"The street being full of footballs?" Huh? I tried looking in the Big Dictionary, and it was no help at all.

My best guess is that it's big chunks of frozen mud that you have to kick out of the way. Do any gentle readers know for sure?

The Diary of Samuel Pepys has a syndication feed you can subscribe to. You get a diary entry every day or so, with all the names and places linked to a glossary. It's fun reading.

[ Addendum 20080105: The answer. ]

[Other articles in category /lang] permanent link

Sun, 16 Sep 2007

Thank you very much for that bulletin
I'm about to move house, and so I'm going through a lot of old stuff and throwing it away. I just unearthed the decorations from my office door circa 1994. I want to record one of these here before I throw it away and forget about it. It's a clipping from the front page of the New York Times from 11 April, 1992. It is noteworthy for its headline, which is only one column wide, but at the very top of page A1, above the fold. It says:


Sometimes good articles get bad headlines. Often the headlines are tacked on just before press time by careless editors. Was this a good article afflicted with a banal headline? Perhaps they meant there was internecine squabbling among the diplomats charged with the negotiations?

No. If you read the article it turned out that it was about how darn hard it was to end the war when folks kept shooting at each other, dad gum it.

I hear that the headline the following week was DOG BITES MAN, but I don't have a clipping of that.

[Other articles in category /lang] permanent link

Wed, 16 May 2007

Moziz Addums
Last July at a porch sale I obtained a facsimile copy of Housekeeping in Old Virginia, by M.C. Tyree, originally published in 1879. I had been trying to understand the purpose of ironing. Ironing makes the clothes look nice, but it must have also served some important purpose, essential for life, that I don't now understand. In the Laura Ingalls Wilder Little House books, Laura recounts a common saying that scheduled the week's work:

Wash on Monday
Iron on Tuesday
Mend on Wednesday
Churn on Thursday
Clean on Friday
Bake on Saturday
Rest on Sunday

You bake on Saturday so that you have fresh bread for Sunday dinner. You wash on Monday because washing is backbreaking labor and you want to do it right after your day of rest. You iron the following day before the washed clothes are dirty again. But why iron at all? If you don't wash the clothes or clean the house, you'll get sick and die. If you don't bake, you won't have any bread, and you'll starve. But ironing? In my mind it was categorized with dusting, as something people with nice houses in the city might do, but not something that Ma Ingalls, three miles from the nearest neighbor, would concern herself with.

But no. Ironing, and starching with the water from boiled potatoes, was so important that it got a whole day to itself, putting it on par with essential activities like cleaning and baking. But why?

A few months later, I figured it out. In this era of tumble-drying and permanent press, I had forgotten what happens to fabrics that are air dried, and did not understand until I was on a trip and tried to air-dry a cotton bath towel. Air-dried fabrics come out not merely wrinkled but corrugated, like an accordion, or a washboard, and are unusable. Ironing was truly a necessity.

Anyway, I was at this porch sale, and I hoped that this 1879 housekeeping book might provide the answer to the ironing riddle. It turned out to be a cookbook. There is plenty to say about this cookbook anyway. It comes recommended by many notable ladies, including Mrs. R.B. Hayes. (Her husband was President of the United States.) She is quoted on the flyleaf as being "very much pleased" with the cookbook.

Some of the recipes are profoundly unhelpful. For example, p.106 has:

Boiled salmon. After the fish has been cleaned and washed, dry it and sew it up in a cloth; lay in a fish-kettle, cover with warm water, and simmer until done and tender.

Just how long do I simmer it? Oh, until it is "done" and "tender". All right, I will just open up the fish kettle and poke it to see. . . except that it is sewed up in a cloth. Hmmm.

You'd think that if I'm supposed to simmer this fish that has been sewn up in a cloth, the author of the recipe might advise me on how long until it is "done". "Until tender" is a bit of a puzzle too. In my experience, fish become firmer and less tender the longer you simmer them. Well, I have a theory about this. The recipe is attributed to "Mrs. S.T.", and consulting the index of contributors, I see that it is short for "Mrs. Samuel Tyree", presumably the editor's mother-in-law. Having a little joke at her expense, perhaps?

There are a lot of other interesting points, which may appear here later. For example, did you know that the most convenient size hog for household use is one of 150 to 200 pounds? And the cookbook contains recipes not only for tomato catsup, but also pepper catsup, mushroom catsup, and walnut catsup.

But the real reason I brought all this up is that pages 253–254 have the following item, attributed to "Moziz Addums":

Resipee for cukin kon-feel Pees. Gether your pees 'bout sun-down. The folrin day, 'bout leven o'clock, gowge out your pees with your thum nale, like gowgin out a man's ey-ball at a kote house. Rense your pees, parbile them, then fry 'erm with some several slices uv streekd middlin, incouragin uv the gravy to seep out and intermarry with your pees. When modritly brown, but not scorcht, empty intoo a dish. Mash 'em gently with a spune, mix with raw tomarters sprinkled with a little brown shugar and the immortal dish ar quite ready. Eat a hepe. Eat mo and mo. It is good for your genral helth uv mind and body. It fattens you up, makes you sassy, goes throo and throo your very soul. But why don't you eat? Eat on. By Jings. Eat. Stop! Never, while thar is a pee in the dish.

This was apparently inserted for humorous effect. Around the time the cookbook was written, there was quite a vogue for dialectal humor of this type, most of which has been justly forgotten. Probably the best-remembered practitioner of this brand of humor was Josh Billings, who I bet you haven't heard of anyway. Tremendously popular at the time, almost as much so as Mark Twain, his work is little-read today; the joke is no longer funny. The exceptionally racist example above is in many ways typical of the genre.

One aspect of this that is puzzling to us today (other than the obvious "why was this considered funny?") is that it's not clear exactly what was supposed to be going on. Is the idea that Moziz Addums wrote this down herself, or is this a transcript by a literate person of a recipe dictated by Moziz Addums? Neither theory makes sense. Where do the misspellings come from? In the former theory, they are Moziz Addums' own misspellings. But then we must imagine someone literate enough to spell "intermarry" and "immortal" correctly, but who does not know how to spell "of".

In the other theory, the recipe is a transcript, and the misspellings have been used by the anonymous, literate transcriber to indicate Moziz Addums' unusual or dialectal pronunciations, as with "tomarters", perhaps. But "uv" is the standard (indeed, the only) pronunciation of "of", which wrecks this interpretation. (Spelling "of" as "uv" was the signature of Petroleum V. Nasby, another one of those forgotten dialectal humorists.) And why did the transcriber misspell "peas" as "pees"?

So what we have here is something that nobody could possibly have written or said, except as an inept parody of someone else's speech. I like my parody to be rather less artificial.

All of this analysis would be spoilsportish if the joke were actually funny. E.B. White famously said that "Analyzing humor is like dissecting a frog. Few people are interested and the frog dies of it." Here, at least, the frog had already been dead for a hundred years before I got to it.

[ Addendum 20100810: In case you were wondering, "kon-feel pees" are actually "cornfield peas", that is, peas that have been planted in between the rows of corn in a cornfield. ]

[Other articles in category /lang] permanent link

Tue, 15 May 2007

Ambiguous words and dictionary hacks
A Mexican gentleman of my acquaintance, Marco Antonio Manzo, was complaining to me (on IRC) that what makes English hard is the large number of ambiguous words. For example, English has the word "free" where Spanish distinguishes "gratis" (free like free beer) from "libre" (free like free speech).

I said I was surprised that he thought that was unique to English, and said that probably Spanish had just as many "ambiguous" words, but that he just hadn't noticed them. I couldn't think of any Spanish examples offhand, but I knew some German ones: in English, "suit" can mean a lawsuit, a suit of clothes, or a suit of playing cards. German has different words for all of these. In German, the suit of a playing card is its "Farbe", its color. So German distinguishes between suit of clothes and suit of playing cards, which English does not, but fails to distinguish between colors of paint and suits of playing cards, which English does.

Every language has these mismatches. Korean has two words for "thin", one meaning thin like paper and the other meaning thin like string. Korean distinguishes father's sister ("komo") from mother's sister ("imo") where English has only "aunt".

Anyway, Sr. Manzo then went to lunch, and I wanted to find some examples of concepts distinguished by English but not by Spanish. I did this with a dictionary hack.

A dictionary hack is when you take a plain text dictionary and do some sort of rough-and-ready processing on it to get an 80% solution to some problem. The oldest dictionary hack I know of is the old Unix rhyming dictionary hack:

        rev /usr/dict/words | sort | rev > rhyming.txt
This takes the Unix word list and turns it into a semblance of a rhyming dictionary. It's not an especially accurate semblance, but you can't beat the price.

     ugh           Marlborough     choreograph            Guelph      Wabash
     Hugh          Scarborough     lithograph             Adolph      cash
     McHugh        thorough        electrocardiograph     Randolph    dash
     Pugh          trough          electroencephalograph  Rudolph     leash
     laugh         sough           nomograph              triumph     gash
     bough         tough           tomograph              lymph       hash
     cough         tanh            seismograph            nymph       lash
     dough         Penh            phonograph             philosoph   clash
     sourdough     sinh            chronograph            Christoph   eyelash
     hough         oh              polarograph            homeomorph  flash
     though        pharaoh         spectrograph           isomorph    backlash
     although      Shiloh          Addressograph          polymorph   whiplash
     McCullough    pooh            chromatograph          glyph       splash
     furlough      graph           autograph              anaglyph    slash
     slough        paragraph       epitaph                petroglyph  mash
     enough        telegraph       staph                  myrrh       smash
     rough         radiotelegraph  aleph                  ash         gnash
     through       calligraph      Joseph                 Nash        Monash
     breakthrough  epigraph        caliph                 bash        rash
     borough       mimeograph      Ralph                  abash       brash
It figures out that "clash" rhymes with "lash" and "backlash", but not that "myrrh" rhymes with "purr" or "her" or "sir". You can, of course, do better by using a text file that has two columns, one for orthography and one for pronunciation, and sorting it by reverse pronunciation. But like I said, you won't beat the price.
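As a rough illustration, both versions can be sketched in Python rather than shell. The function names here are invented for the example, and the two-column format (a word plus a space-separated string of phoneme symbols) is an assumption, not the layout of any particular pronouncing dictionary.

```python
def rhyme_sort(words):
    """Order words by reversed spelling, like `rev | sort | rev`."""
    return sorted(words, key=lambda w: w[::-1])


def rhyme_sort_by_pronunciation(entries):
    """Order (word, pronunciation) pairs by reversed phoneme sequence.

    Each pronunciation is a space-separated string of phoneme symbols.
    This format is hypothetical, assumed only for illustration.
    """
    def key(entry):
        _, pron = entry
        # Compare the last phoneme first, then the one before it, and so on.
        return tuple(reversed(pron.split()))

    return [word for word, _ in sorted(entries, key=key)]


if __name__ == "__main__":
    print(rhyme_sort(["clash", "purr", "lash", "myrrh", "backlash"]))

    # Made-up, ARPAbet-style transcriptions for the demonstration.
    sample = [
        ("clash", "K L AE SH"),
        ("sir", "S ER"),
        ("myrrh", "M ER"),
        ("lash", "L AE SH"),
        ("purr", "P ER"),
        ("her", "HH ER"),
    ]
    print(rhyme_sort_by_pronunciation(sample))
```

Sorted by spelling, "myrrh" and "purr" end up at opposite ends of the list. Sorted by the made-up pronunciations, "her", "myrrh", "purr", and "sir" come out next to each other, followed by "lash" and "clash".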

But I digress. Last week I pulled an excellent dictionary hack. I found the Internet Dictionary Project's English-Spanish lexicon file on the web with a quick Google search; it looks like this:

        a	un, uno, una[Article]
        aardvark	cerdo hormiguero
        aardvark	oso hormiguero[Noun]
        aardvarks	cerdos hormigueros
        aardvarks	osos hormigueros 
        ab	prefijo que indica separacio/n
        aback	hacia atras
        aback	hacia atr´s,take aback, desconcertar. En facha.
        aback	por sopresa, desprevenidamente, de improviso
        aback	atra/s[Adverb]
        abacterial	abacteriano, sin bacterias
        abacus	a/baco
        abacuses	a/bacos
        abaft	A popa (towards stern)/En popa (in stern)
        abaft	detra/s de[Adverb]
        abalone	abulo/n
        abalone	oreja de mar (molusco)[Noun]
        abalone	oreja de mar[Noun]
        abalones	abulones
        abalones	orejas de mar (moluscos)[Noun]
        abalones	orejas de mar[Noun]
        abandon	abandonar
        abandon	darse por vencido[Verb]
        abandon	dejar
        abandon	desamparar, desertar, renunciar, evacuar, repudiar
        abandon	renunciar a[Verb]
        abandon	abandono[Noun]
        abandoned	abandonado
        abandoned	dejado
Then I did:

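        # -k2 is the modern spelling of the obsolete "sort +1", which many
        # current sort implementations no longer accept: sort from field 2 on
        sort -k2 idengspa.txt |
        perl -nle '($ecur, $scur) = split /\s+/, $_, 2;
                print "$eprev $ecur $scur"
                        if $sprev eq $scur &&
                           substr($eprev, 0, 1) ne substr($ecur, 0, 1);
                        ($eprev, $sprev) = ($ecur, $scur)'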

The sort sorts the lexicon into Spanish order instead of English order. The Perl thing comes out looking a lot more complicated than it ought. It just says to look for and print consecutive items that have the same Spanish, but whose English begins with different letters. The condition on the English is to filter out items where the Spanish is the same and the English is almost the same, such as:

blond blonde rubio
cake cakes tarta
oceanographic oceanographical oceanografico[Adjective]
palaces palazzi palacios[Noun]
talc talcum talco
taxi taxicab taxi

It does filter out possible items of interest, such as:

carefree careless sin cuidado

But since the goal is just to produce some examples, and this cheap hack was never going to generate an exhaustive list anyway, that is all right.
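For readers who find the one-liner hard to follow, here is a minimal sketch of the same logic in Python. It is a restatement for illustration, not the code that was actually run; the function name `same_spanish_pairs` is invented, and it assumes each input line is an English headword, whitespace, and then the Spanish gloss, as in the sample above.

```python
def same_spanish_pairs(lines):
    """Find consecutive entries with identical Spanish but different English.

    Returns (english1, english2, spanish) triples. Pairs whose English
    words start with the same letter are skipped, as in the Perl version.
    """
    entries = []
    for line in lines:
        parts = line.rstrip("\n").split(None, 1)  # English, then the rest
        if len(parts) == 2:
            entries.append((parts[0], parts[1]))

    # Sort into Spanish order; ties fall back to English order.
    entries.sort(key=lambda e: (e[1], e[0]))

    results = []
    prev_eng, prev_spa = None, None
    for eng, spa in entries:
        if prev_spa == spa and prev_eng[0] != eng[0]:
            results.append((prev_eng, eng, spa))
        prev_eng, prev_spa = eng, spa
    return results


if __name__ == "__main__":
    sample = [
        "blond\trubio",
        "blonde\trubio",
        "cake\ttarta",
        "cakes\ttarta",
        "clock\tel reloj",
        "high\talto",
        "tall\talto",
        "watch\tel reloj",
    ]
    for eng1, eng2, spa in same_spanish_pairs(sample):
        print(eng1, eng2, spa)
```

On this small sample it reports "high tall alto" and "clock watch el reloj", and drops the blond/blonde and cake/cakes pairs because their English words share a first letter.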

The output is:

        at letter a
        actions stock acciones[Noun]
        accredit certify acreditar
        around thereabout alrededor
        high tall alto
        comrade pal amigo[Noun]
        antecedents backgrounds antecedentes
        (...complete output...)
A lot of these are useless, genuine synonyms. It would be silly to suggest that Spanish fails to preserve the English distinction between "marry" and "wed", between "ale" and "beer", between "desire" and "yearn", or between "vest" and "waistcoat". But some good possibilities remain.

Of these, some probably fail for reasons that only a Spanish-speaker would be able to supply. For instance, is "el pastel" really the best translation of both "cake" and "pie"? If so, it is an example of the type I want. But perhaps it's just a poor translation; perhaps Spanish does have this distinction; say maybe "torta" for "cake" and "empanada" for "pie". (That's what Google suggests, anyway.)

Another kind of failure arises because of idioms. The output:

        exactly o'clock en punto
is of this type. It's not that Spanish fails to distinguish between the concepts of "exactly" and "o'clock"; it's that "en punto" (which means "on the point of") is used idiomatically to mean both of those things: some phrase like "en punto tres" ("on the point of three") means "exactly three" and so, by analogy, "three o'clock". I don't know just what the correct Spanish phrases are, but I can guess that they'll be something like this.

Still, some of the outputs are suggestive:

high tall alto
low small bajo[Adjective]
babble fumble balbucear[Verb]
jealous zealous celoso
contest debate debate[Noun]
forlorn stranded desamparado[Adjective]
docile meek do/cil[Adjective]
picture square el cuadro
fourth room el cuarto
collar neck el cuello
idiom language el idioma[Noun]
clock watch el reloj
floor ground el suelo
ceiling roof el techo
knife razor la navaja
feather pen la pluma
cloudy foggy nublado

I put some of these to Sr. Manzo, and he agreed that some were indeed ambiguous in Spanish. I wouldn't have known what to suggest without the dictionary hack.

[Other articles in category /lang] permanent link

Mon, 14 May 2007

Bryan and his posse
Today upon the arrival of a coworker and his associates, I said "Oh, here comes Bryan and his posse". My use of "posse" here drew some comment. I realized I was not completely sure what "posse" meant. I mostly knew it from old West contexts: the Big Dictionary has quotes like this one, from 1901:

A pitched battle was [fought at] Rockhill, Missouri, between the Sheriff's posse and the miners on strike.
I first ran across the word in J.D. Fitzgerald's Great Brain books. At least in old West contexts, the word refers to a gang of men assembled by some authority such as a sheriff or a marshal, to perform some task, such as searching for a lost person, apprehending an outlaw, or blasting some striking miners. This much was clear to me before.

From the context and orthography, I guessed that it was from Spanish. But no, it's not. It's Latin! "Posse" is the Latin verb "to be able", akin to English "possible" and ultimately to "potent" and related words. I'd guessed something like this, supposing English "posse" was akin to some Spanish derivative of the Latin. But it isn't; it's direct from Latin: "posse" in English is short for posse comitatus, "force of the county".

The Big Dictionary has citations for "posse comitatus" back to 1576:

Mr. Sheryve meaneth in person to repayre thither & with force to bryng hym from Aylesham, Whomsoever he fyndeth to denye the samet & suerly will with Posse Comitatus fetch hym from this new erected pryson to morrow.

"Sheryve" is "Sheriff". (If you have trouble understanding this, try reading it aloud. English spelling has changed more than its pronunciation since 1576.)

I had heard the phrase before in connection with the Posse Comitatus Act of U.S. law. This law, passed in 1878, is intended to prohibit the use of the U.S. armed forces as Posse Comitatus—that is, as civilian law enforcement. Here the use is obviously Latin, and I hadn't connected it before with the sheriff's posse. But they are one and the same.

[Other articles in category /lang/etym] permanent link

Mon, 04 Dec 2006

A couple of weeks ago I was over at a friend's house, and was trying to explain to her two-year-old daughter which way to turn the knob on her Etch-a-Sketch. But I couldn't tell her to turn it clockwise, because she can't tell time yet, and has no idea which way is clockwise.

It occurs to me now that I may not be giving her enough credit; she may know very well which way the clock hands go, even though she can't tell time yet. Two-year-olds are a lot smarter than most people give them credit for.

Anyway, I then began to wonder what "clockwise" and "counterclockwise" were called before there were clocks with hands that went around clockwise. But I knew the answer to that one: "widdershins" is counterclockwise; "deasil" is clockwise.

Or so I thought. This turns out not to be the answer. "Deasil" is only cited by the big dictionary back to 1771, which postdates clocks by several centuries. "Widdershins" is cited back to 1545. "Clockwise" and "counter-clockwise" are only cited back to 1888! And a full-text search for "clockwise" in the big dictionary turns up nothing else. So the question of what word people used in 1500 is still a mystery to me.

That got me thinking about how asymmetric the two words "deasil" and "widdershins" are; they have nothing to do with each other. You'd expect a matched set, like "clockwise" and "counterclockwise", or maybe something based on "left" and "right" or some other pair like that. But no. "Widdershins" means "the away direction". I thought "deasil" had something to do with the sun, or the day, but apparently not; the "dea" part is akin to dexter, the right hand, and the "sil" part is obscure. Whereas the "shins" part of "widdershins" does have something to do with the sun, at least by association. That is, it is not related historically to the sun, except that some of the people using the word "widdershins" were apparently thinking it was actually "widdersun". What a mess. And the words have nothing to do with each other anyway, as you can see from the histories above; "widdershins" is more than two centuries older than "deasil".

The OED also lists "sunways", but the earliest citation is the same as the one for deasil.

Anyway, I did not know any of this at the time, and imagined that "deasil" meant "in the direction of the sun's motion". Which it is; the sun goes clockwise through the sky, coming up on the left, rising to its twelve-o'-clock apex, and then descending on the right, the way the hands of a clock do. (Perhaps that's why the early clockmakers decided to make the hands of the clock go that way in the first place. Or perhaps it's because of the (closely related) reason that that's the direction that the shadow on a sundial moves.)

And then it hit me that in the southern hemisphere, the sun goes the other way: instead of coming up on the left, and going down on the right, the way clock hands do, it comes up on the right and goes down on the left. Wowzers! How bizarre.

I'm a bit sad that I figured this out before actually visiting the southern hemisphere and seeing it for myself, because I think I would have been totally freaked out on that first morning in New Zealand (or wherever). But now I'm forewarned that the sun goes the wrong way down there and it won't seem so bizarre when I do see it for the first time.

[Other articles in category /lang] permanent link

Mon, 27 Nov 2006

Baseball team nicknames, again
Some addenda to my recent article about baseball team nicknames.

Several people wrote to complain that I mismatched the cities and the nicknames in this sentence:

The American League [has] the Boston Royals, the Kansas City Tigers, the Detroit Indians, the Oakland Orioles...

My apologies for the error. It should have been the Boston Tigers, the Kansas City Indians, the Detroit Orioles, and the Oakland Royals.

Phil Varner reminded me that the Chicago Bulls are in fact a "local color" name; they are named in honor of the Chicago stockyards.

This raises a larger point, brought up by Dave Vasilevsky: My classification of names into two categories conflates some issues. Some names are purely generic, like the Boston Red Sox, and can be transplanted anywhere. Other names are immovable, like the Philadelphia Phillies. In between, we have a category of names, like the Bulls, which, although easily transportable, are in fact local references.

The Milwaukee Brewers are a good baseball example. The Brewers were named in honor of the local German culture and after Milwaukee's renown as a world center of brewing. Nobody would deny that this is a "local color" type name. But the fact remains that many cities have breweries, and the name "Brewers" would work well in many places. The Philadelphia Brewers wouldn't be a silly name, for example. The only place in the U.S. that I can think of offhand that fails as a home for the Brewers is Utah; the Utah Brewers would be a bad joke. (This brings us full circle to the observation about the Utah Jazz that inspired the original article.)

The Baltimore Orioles are another example. I cited them as an example of a generic and easily transportable name. But the Baltimore Oriole is in fact a "local color" type name; the Baltimore Oriole is named after Lord Baltimore, and is the state bird of Maryland. (Thanks again to Dave Vasilevsky and to Phil Gregory for pointing this out.)

Or consider the Seattle Mariners. The name is supposed to suggest the great port of Seattle, and was apparently chosen for that reason. (I have confirmed that the earlier Seattle team, the Seattle Pilots, was so-called for the same reason.) But the name is transportable to many other places: it's easy to imagine alternate universes with the New York Mariners, the Brooklyn Mariners, the San Francisco Mariners, or the Boston Mariners. Or even all five.

And similarly, although in the previous article I classed the New York Yankees with the "local color" names, based on the absurdity of the Selma or the Charleston Yankees, the truth is that the Boston Yankees only sounds strange because it didn't actually happen that way.

I thought about getting into a tremendous cross-check of all 870 name-city combinations, but decided it was too much work. Then I thought about just classing the names into three groups, and decided that the issue is too complex to do that. For example, consider the Florida Marlins. Local color, certainly. But immovable? Well, almost. The Toronto Marlins or the Kansas City Marlins would be jokes, but the Tampa Bay Marlins certainly wouldn't be. And how far afield should I look? I want to class the Braves as completely generic, but consideration of the well-known class AA Bavarian League Munich Braves makes it clear that "Braves" is not completely generic.

So in ranking by genericity, I think I'd separate the names into the following tiers:

  1. Pirates, Cubs, Reds, Cardinals, Giants, Red Sox, Blue Jays, White Sox, Tigers, Royals, Athletics
  2. Braves, Mets, Dodgers, Orioles, Yankees, Indians, Angels, Mariners, Nationals, Brewers
  3. Marlins, Astros, Diamondbacks, Rockies, Padres, Devil Rays, Twins
  4. Phillies, Rangers
The Texas Rangers are a bit of an odd case. Rangers ought to be movable—but the name loses so much if you do. You can't even move the name to Arlington (the Arlington Rangers?), and the Rangers already play in Arlington. So I gave them the benefit of the doubt and put them in group 4.

Readers shouldn't take this classification as an endorsement of the Phillies' nickname, which I think is silly. I would have preferred the Philadelphia Brewers. Or even the Philadelphia Cheese Steaks. Maybe they didn't need the extra fat, but wouldn't it have been great if the 1993 Phillies had been the 1993 Cheese Steaks instead? Doesn't John Kruk belong on a team called the Cheese Steaks?

Another oddity, although not from baseball: In a certain sense, the Montreal Canadiens have an extremely generic name. And yet it's clearly not generic at all!

[Other articles in category /lang] permanent link

Fri, 24 Nov 2006

Etymological oddity
Sometimes you find words that seem like they must be related, and then it turns out to be a complete coincidence.

Consider pen and pencil.

Pen is from French penne, a long feather or quill pen, akin to Italian penne (the hollow, ribbed pasta), and ultimately to the word feather itself.

Pencil is from French pincel, a paintbrush, from Latin peniculus, also a brush, from penis, a tail, which is also the source of the English word penis.

A couple of weeks ago someone edited the Wikipedia article on "false cognates" to point out that day and diary are not cognate. "No way," I said, "it's some dumbass putting dumbassery into Wikipedia again." But when I checked the big dictionary, I found that it was true. They are totally unrelated. Diary is akin to Spanish dia, Latin dies, and other similar words, as one would expect. Day, however, is "In no way related to L. dies..." and is akin to Sanskrit dah = "to burn", Lithuanian sagas = "hot season", and so forth.

[Other articles in category /lang/etym] permanent link

Wed, 22 Nov 2006

Baseball team nicknames
Lorrie and I were in the car, and she noticed another car with a Detroit Pistons sticker. She remarked that "Pistons" was a good name for a basketball team, and particularly for one from Detroit. I agreed. But then she mentioned the Utah Jazz, a terrible mismatch, and asked me how that happened to be. Even if you don't know, you can probably guess: They used to be the New Orleans Jazz, and the team moved to Utah. They should have changed the name to the Teetotalers or the Salt Flats or something, but they didn't, so now we have the Utah Jazz. I hear that next month they're playing the Miami Fightin' Irish.

That got us thinking about how some sports team names travel, and others don't. Jazz didn't. The Miami Heat could trade cities or names with the Phoenix Suns and nobody would notice. But consider the Chicago Bulls. They could pick up and move anywhere, anywhere at all, and the name would still be fine, just fine. Kansas City Bulls? Fine. Honolulu Bulls? Fine. Marsaxlokk Bulls? Fine.

We can distinguish two categories of names: the "generic" names, like "Bulls", and the "local color" names, like "Pistons". But I know more about baseball, so I spent more time thinking about baseball team names.

In the National League, we have the generic Braves, Cardinals, Cubs, Giants, Pirates, and Reds, who could be based anywhere, and in some cases were. The Braves moved from Boston to Milwaukee to Atlanta, although to escape from Boston they first had to change their name from the Beaneaters. The New York Giants didn't need to change their name when they moved to San Francisco, and they won't need to change their name when they move to Jyväskylä next year. (I hear that the Jyväskylä city council offered them a domed stadium and they couldn't bear to say no.)

On the other hand, the Florida Marlins, Arizona Diamondbacks, and Colorado Rockies are clearly named after features of local importance. If the Marlins were to move to Wyoming, or the Rockies to Nebraska, they would have to change their names, or turn into bad jokes. Then again, the Jazz didn't change their name when they moved to Utah.

The New York Mets are actually the "Metropolitans", so that has at least an attempt at a local connection. The Washington Nationals ditto, although the old name of the Washington Senators was better. At least in that one way. Who could root for a team called the Washington Senators? (From what I gather, not many people could.)

The Nationals replaced the hapless Montreal Expos, whose name wasn't very good, but was locally related: they were named for the 1967 Montreal World's Fair. Advice: If you're naming a baseball team, don't choose an event that will close after a year, and especially don't choose one that has already closed.

The Houston Astros, and their Astrodome filled with Astroturf, are named to recall the NASA manned space center, which opened there in 1961. The Philadelphia club is called the Phillies, which is not very clever, but is completely immovable. Boston Phillies, anyone? Pittsburgh Phillies? New York Phillies? No? I didn't think so.

I don't know why the San Diego Padres are named that, but there is plenty of Spanish religious history in the San Diego area, so I am confident in putting them in the "local color" column. Milwaukee is indeed full of Brewers; there are a lot of Germans up there, brewing up lager. (Are they back in the National League again? They seem to switch leagues every thirty years.)

That leaves just the Los Angeles Dodgers, who are a bit of an odd case. The team, as you know, was originally the Brooklyn Dodgers. The "Dodgers" nickname, as you probably didn't know, is short for "Trolley Dodgers". The Los Angeles Trolley Dodgers is almost as bad a joke as the Nebraska Rockies. Fortunately, the "Trolley" part was lost a long time ago, and we can now imagine that the team is the Los Angeles Traffic Dodgers. So much for the National League; we have six generic names out of 16, counting the Traffic Dodgers in the "local color" group, and ignoring the defunct Expos.

The American League does not do so well. They have the Boston Royals, the Kansas City Tigers, the Detroit Indians, the Oakland Orioles, and three teams that are named after sox: the Red, the White, and the Athletics.

Then there are the Blue Jays. They were originally owned by Labatt, a Canadian brewer of beer, and were so-named to remind visitors to the park of their flagship brand, Labatt's Blue. I might have a harder time deciding which group to put them in, if it weren't for the (1944-1945) Philadelphia Blue Jays. If the name is generic enough to be transplanted from Toronto to Philadelphia, it is generic. I have no idea what name the Toronto club could choose if they wanted to avail themselves of the "local color" option rather than the "generic" option; it's tempting to make a cruel joke and suggest that the name most evocative of Toronto would be the Toronto Generics. But no, that's unfair. They could always call their baseball club the Toronto Hockey Fans.

Anyway, moving on, we have the New York Yankees, which is not the least generic possible name, but clearly qualifies as "local color" once you pause to think about the Charleston Yankees, the Shreveport Yankees, and the Selma Yankees. The Tampa Bay Devil Rays are clearly "local color". The Minnesota Twins play in the Twin Cities of Minneapolis and St. Paul. The California, Anaheim, or Los Angeles Angels, whatever they're called this week, are evidently named for the city of Los Angeles. I would ridicule the Los Angeles Angels for having a redundant name, but as an adherent of the Philadelphia Phillies, I am living in a glass house.

The Texas Rangers are named for the famous Texas Rangers. I don't know exactly why the Seattle club is named Mariners; I wouldn't have considered Seattle to be an unusually maritime city, but their previous team was the Seattle Pilots, so the folks in Seattle must think of themselves so, and I'm willing to go along with it.

The tally for the American League is therefore eight generic, six local color. The total for Major League Baseball as a whole is 14 generic names out of 30.

This is a lot better than the Japanese Baseball League, which has a bunch of teams with names like the Lions, Tigers, Dragons, Giants, and Fighters. They make up for this somewhat in the names of the teams' corporate sponsors, so, for example, the Nippon Ham Fighters. They are sponsored by Nippon Ham, which does not make it any less funny. And the Yakult Swallows, which, if you interpret it as a noun phrase, sounds just a little bit like a gay porn flick set in Uzbekistan.

Incidentally, my favorite team name is the Wilmington Blue Rocks. The Blue Rocks' mascot is, alas, not a rock but a moose. Sometimes I dream of a team from Lansing, Michigan, called the Lansing Boils, but I know it will remain an unfulfilled fantasy.

[ Warning for non-Americans: Almost, but not quite everything in this article is the truth. Marsaxlokk does not actually have a Major League baseball club yet; however, they do have a class-A affiliate in the Mediterranean league, called the Marsaxlokk Moghzaskops. Also, the Giants are not scheduled to move to Jyväskylä until after the 2008 season. ]

[ Addendum 20061127: There is a followup article to this one. ]

[Other articles in category /lang] permanent link

Sat, 07 Oct 2006

Bone names
Names of bones are usually Latin. They come in two types. One type is descriptive. The auditory ossicles (that's Latin for "little bones for hearing") are named in English the hammer, anvil, and stirrup, and their formal, Latin names are the malleus ("hammer"), incus ("anvil"), and stapes ("stirrup").

The fibula is the small bone in the lower leg; it's named for the Latin fibula, which is a kind of Roman safety pin. The other leg bone, the tibia, is much bigger; that's the frame of the pin, and the fibula makes the thin sharp part.

The kneecap is the patella, which is a "little pan". The big, flat parietal bone in the skull is from paries, which is a wall or partition. The clavicle, or collarbone, is a little key.

"Pelvis" is Latin for "basin". The pelvis is made of four bones: the sacrum, the coccyx, and the left and right os innominata. Sacrum is short for os sacrum, "the sacred bone", but I don't know why it was called that. Coccyx is a cuckoo bird, because it looks like a cuckoo's beak. Os innominatum means "nameless bone": they gave up on the name because it doesn't look like anything. (See illustration to right.)

On the other hand, some names are not descriptive: they're just the Latin words for the part of the body that they are. For example, the thighbone is called the femur, which is Latin for "thigh". The big lower arm bone is the ulna, Latin for "elbow". The upper arm bone is the humerus, which is Latin for "shoulder". (Actually, Latin is umerus, but classical words beginning in "u" often acquire an initial "h" when they come into English.) The leg bone corresponding to the ulna is the tibia, which is Latin for "tibia". It also means "flute", but I think the flute meaning is secondary—they made flutes out of hollowed-out tibias.

Some of the nondescriptive names are descriptive in Latin, but not in English. The vertebrae in English are so called after Latin vertebra, which means a vertebra. But the Latin word is ultimately from the verb vertere, which means to turn. (Like in "avert" ("turn away") and "revert" ("turn back").) The jawbone, or "mandible", is so-called after mandibula, which means "mandible". But the Latin word is ultimately from mandere, which means to chew.

The cranium is Greek, not Latin; kranion (or κρανιον, I suppose) is Greek for "skull". Sternum, the breastbone, is Greek for "chest"; carpus, the wrist, is Greek for "wrist"; tarsus, the ankle, is Greek for "instep". The zygomatic bone of the face is yoke-shaped; ζυγος ("zugos") is Greek for "yoke".

The hyoid bone is the only bone that is not attached to any other bone. (It's located in the throat, and supports the base of the tongue.) It's called the "hyoid" bone because it's shaped like the letter "U". This used to puzzle me, but the way to understand this is to think of it as the "U-oid" bone, which makes sense, and then to remember two things. First, that classical words beginning in "u" often acquire an initial "h" when they come into English, as "humerus". And second, classical Greek "u" always turns into "y" in Latin. You can see this if you look at the shape of the Greek letter capital upsilon, which looks like this: Υ. Greek αβυσσος ("abussos" = "without a bottom") becomes English "abyss"; Greek ανωνυμος ("anonumos") becomes English "anonymous"; Greek υπος ("hupos"; there's supposed to be a diacritical mark on the υ indicating the "h-" sound, but I don't know how to type it) becomes "hypo-" in words like "hypothermia" and "hypodermic". So "U-oid" becomes "hy-oid".

(Other parts of the body named for letters of the alphabet are the sigmoid ("S-shaped") flexure of the colon and the deltoid ("Δ-shaped") muscle in the arm. The optic chiasm is the place in the head where the optic nerves cross; "chiasm" is Greek for a crossing-place, and is so-called after the Greek letter Χ.)

The German word for "auditory ossicles" is Gehörknöchelchen. Gehör is "for hearing". Knochen is "bones"; Knöchelchen is "little bones". So the German word, like the Latin phrase "auditory ossicles", means "little bones for hearing".

[Other articles in category /lang/etym] permanent link

Mon, 10 Jul 2006

Phrasal verbs
My mom teaches English to visiting foreign students, and last time I met her she was telling me about phrasal verbs. A phrasal verb is a verb that incorporates a preposition. Examples include "speed up", "try out", "come across", "go off", "turn down". The prepositional part is uninflected, so "turns down", "turned down", "turning down", not *"turn downs", *"turn downed", *"turn downing". My mom says she uses a book that has a list of all of them; there are several hundred. She was complaining specifically about "go off", which has an unusually peculiar meaning: when the alarm clock goes off in the morning, it actually goes on.

This reminded me that "slow up" and "slow down" are synonymous. And there is "speed up", but no "speed down". And you cannot understand "stand down" by analogy with "stand up", "sit up", and "sit down". And you also cannot understand "nose job" by analogy with "hand job". But I digress.

One of the things about the phrasal verbs that gives the foreign students so much trouble is that the verbs don't all obey the same rules. For example, some are separable and some not. Consider "turned down". I can turn down the thermostat, but I can also turn the thermostat down. And I can try out my new game, and I can also try my new game out. And I can stand up my blind date, and I can stand my blind date up. But while I can come across a fountain in the park, I can't *come a fountain across in the park. And while I can go off to Chicago, I can't *go to Chicago off. There's no way to know which of these work and which not, except just by memorizing which are allowed and which not.

And sometimes the separable ones can't be unseparated. I can give back the map, and I can give the map back, and I can give it back, but I can't *give back it. I can hold up the line, and I can hold the line up, and I can hold us up, but I can't *hold up us. I don't know what the rule is exactly, and I don't want to go to the library again to get the Cambridge Grammar, because last time I did that I dropped it on my toe.
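The pattern in these examples can be written down as a small sketch. This is only a guess at the rule based on the sentences above, not the real grammar: the `SEPARABLE` table, the `PRONOUNS` set, and the function name `allowed_orders` are all made up for illustration, and the table holds only verbs mentioned in this article.

```python
# Hypothetical lookup table built from the examples in this article only.
# True means the particle may be moved after the object.
SEPARABLE = {
    ("turn", "down"): True,
    ("try", "out"): True,
    ("stand", "up"): True,
    ("give", "back"): True,
    ("hold", "up"): True,
    ("set", "up"): True,
    ("come", "across"): False,
}

# Assumed list of object pronouns; the article only tests "it" and "us".
PRONOUNS = {"me", "you", "him", "her", "it", "us", "them"}


def allowed_orders(verb, particle, obj):
    """Return the word orders that the examples above suggest are acceptable."""
    joined = f"{verb} {particle} {obj}"   # e.g. "give back the map"
    split = f"{verb} {obj} {particle}"    # e.g. "give the map back"
    if not SEPARABLE[(verb, particle)]:
        # Inseparable verbs keep the particle next to the verb.
        return [joined]
    if obj in PRONOUNS:
        # A pronoun object of a separable verb must go in the middle.
        return [split]
    return [joined, split]


print(allowed_orders("give", "back", "the map"))
print(allowed_orders("give", "back", "it"))
print(allowed_orders("come", "across", "a fountain"))
```

The table still has to be memorized verb by verb, which is exactly the students' complaint; the only part that looks like a rule is the pronoun case.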

I hadn't realized any of this until I read this article about them, but when I did, I had a sudden flash of insight. I had not realized before what was going on when someone set up us the bomb. "Set up" is separable: I can set up the bomb, or set the bomb up, or someone can set us up. But "us", as noted above, is not deseparable, so you cannot have *set up us. But I think I understand the mistake better now than I did before; it seems less like a complete freak and more like a member of a common type of error.

[Other articles in category /lang] permanent link

Wed, 05 Apr 2006

TeX and the long S
It just occurs to me, reading today's article, that the final sentence is one of the strangest I've written in quite a while. It says:

stock TeX does not have any way to make a long medial s.

This is a strange thing to say because TeX was principally designed as a mathematical typesetting system, and one of the most common of all mathematical notations is the integral sign:

$$\int_a^b f'(x) dx = f(b) - f(a)$$

And the integral sign $$\int$$ is nothing more than an old-style long s; the 's' is for 'sum'.

Strange or not, the substance of my remark is correct, since standard TeX's fonts do not provide a long s in a size suitable for use in running text in place of a regular s.

[Other articles in category /lang] permanent link

On baroque long S
Jokes about the long medial 's' are easy to make. Stan Freberg's album Stan Freberg Presents: The United States of America, Volume I: The Early Years has a scene in which John Adams or Benjamin Franklin or one of those guys is reading Thomas Jefferson's draft of the Declaration of Independence: "'Life, liberty, and the purfuit of happinefs'? Tom, all your s's look like f's!"

A story by Frances Warfield, appropriately titled "Fpafm", gets probably as much juice out of the joke as there is to be got. I believe the copyright has expired, so here it is, in its entirety:


by Frances Warfield

I ordered ham and eggs, as I always do on the diner, and then, as I always do, looked around for pamphlets. There was one handy, "Echoes of Colonial Days," it was called, "being a little fouvenir iffued from time to time, for the benefit of the guefts of The Baltimore & Ohio Railroad Company as a reminder of the pleafant moments fpent..." Involuntarily, my lips began to move. I reached for a pencil. But the man across from me already had his pencil out. He had written:

"Oh, fay can you fee?"

I said, "Fing Fomething Fimple."

"Filly, ifn't it?" he said, and kept on writing.

I wrote: "Fing a Fong of Fixpence."

"Oh, ftop the fongs," he said, "Too eafy." He wrote: "The Courtfhip of Miles Fandifh," "I fee a fquirrel," "I undereftimate ftatefmanfhip," "My fifter feems fuperfenfitive," and, seeing that I did not appreciate the last one, which he evidently thought very fine, he wrote: "Forry to fee you fo ftupid."

I ate my lunch grouchily. How could I help it if he was in practice and I was not? He had probably taken this train before.

"Pafs the falt," I said.

"Pleafe pafs the falt," he triumphed.

I paid no attention. "Waiter!" I said. The waiter did not budge.

"You muft fpeak the language," said the man opposite me. "Fay, Fteward!"

The waiter jumped to attention. "Fir?" he said.

"Pleafe fill the faltcellar."

"The faltcellar fhall be replenifhed inftantly," replied the waiter, with a superior gleam in his eyes.

I smiled and my companion unbent a little. "Let's try for hard ones," he invited.

"Farcafm," he said.


"Fubfiftence," he scored.


"S's inside now," he ruled.

"Perfuafive," I said instantly.




"Nonfenfe," I finished. "Fon of a fpeckled fea monfter."

"Ftep-fon of a poifonous fnake!" he cried.

"You don't fay fo!" I retorted.

"I do fay fo," he replied, getting up and leaving the diner.

"Fool!" I called after him, fniffiling.

Well, fo much for that.

Reading Baroque scientific papers, you see a lot of long-medial-s. Opening to a random page of the Philosophical Experiments and Observations of Robert Hooke, for example, we have:

The ſecond Experiment, was made, to ſhew a Way, how to find the true and comparative Expanſion of any metal, when melted, and ſo to compare it both with the Expanſion of the ſame metal, when ſolid, and likewiſe with the Expanſion of any other, either fluid or ſolid Body.

As I read more of this sort of thing, I went through several phases. At first I just found it confusing. Then later I started to get good at reading the words with f's instead of s's and it became funny. ("Fhew! Folid! Hee hee!") Then it stopped being funny, although I still noticed it and found it quaint and charming. Also a constant reminder of how learned and scholarly I am, to be reading this old stuff. (Yes, I really do think this way. Pathetic, isn't it? And you are an enabler of this pathetic behavior.) Then eventually I didn't notice it any more, except in a few startling cases, such as when Dr. Hooke wrote on the tendency of ice to incorporate air bubbles while freezing, and said " the ſame time it may not be ſaid to ſuck it in".

What hasn't happened, however: it hasn't become completely transparent. The long s really does look a lot like an f, so much so that I can find it confusing when the context doesn't help me out. The fact that these books are always facsimiles and that the originals were printed on coarse paper and the ink has smudged, does not make it any easier to tell when one is looking at an s and when at an f. So far, the most difficult instance I have encountered involved a reference to "the Learned Dr. Voſſius". Or was it Voffius? Or Vofſius? Or was it Voſfius? Well, I found out later it was indeed Vossius; this is Dr. Gerhard Johann Voss (1577-1649), Latinized to "Vossius". But I was only able to be sure because I encountered the name somewhere else with the short s's.

This typographic detail raises a question of scholarly ethics that I don't know how to answer. In an earlier article, I needed to show how 17th-century writers referred to dates early in the year, which in common nomenclature occurred during one year, but which legally were part of the preceding year. Simply quoting one of these writers wasn't enough, because the date was disambiguated typographically, with the digit for the legal year directly above the digit for the conventional year. So I programmed TeX to demonstrate the typography:

 To this I W.D. shall add another Remark I find in the minutes of the {\it Royal Society\/}, {\it February\/} 20. $167^8_9$, {\it viz.\/}$\ldots$

But this raised another problem: to what degree should I reproduce the original typography? There is a scale here of which substitutions are more or less permissible:

  1. Most permissible is to replace the original 17th-century font with a modern one.

  2. Slightly less permissible would be to reduce the heavy 17th-century usage of italic face, in Royal Society for example, replacing it with roman typefaces.

  3. Slightly less permissible still would be to replace the 17th-century capitalization conventions with 20th-century conventions. For example, in C20 we would not capitalize "Remark".

  4. Then can I replace obsolete 17th-century contractions such as "consider'd" with 20th-century equivalents such as "considered"? If that is acceptable, then what about "'tis"? Can I replace "3dly" with "thirdly"?

  5. Can I replace obsolete Baroque spellings such as "plaister", "fatt", and "it self" with "plaster", "fat", and "itself"?

  6. Can I replace obsolete Baroquisms such as "strow'd" in "strow'd on Ice" with "strewn", or "stopple" with "stopper"?

  7. At the bottom of the list, I could just rewrite the whole thing in a modern style and pass it off as what Derham actually wrote.

It seems to me that replacing the long medial s's with short ones is toward the top of this scale. By doing this, I am not changing the spelling, because a long medial s is still an s; I am just replacing one s with another, and this is akin to changing the font. And anyway, my choice was forced, because stock TeX does not have any way to make a long medial s.
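For text that is already in electronic form, the same substitution is easy to sketch in Python. This is an illustration only, not what I did for the article: the function name `modernize_long_s` is invented, and it assumes the long s is encoded as the Unicode character U+017F. It touches nothing else, so the capitalization, spelling, and contractions further down the scale are left alone.

```python
# U+017F is LATIN SMALL LETTER LONG S.
LONG_S = "\u017f"


def modernize_long_s(text):
    """Replace every long s with an ordinary s and change nothing else."""
    # Unicode's NFKC normalization also maps U+017F to a plain "s", but it
    # rewrites other compatibility characters too, so a targeted replace
    # is used here.
    return text.replace(LONG_S, "s")


hooke = "the Expan\u017fion of the \u017fame metal, when \u017folid"
print(modernize_long_s(hooke))
```

Going the other way is harder, because you would need to know which s's were medial and which were final.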

[ Addendum 20060405 ]

[Other articles in category /lang] permanent link

Sun, 12 Mar 2006

Naomi Wolf and Big Ethel
Aaron Swartz has done a text search of The Beauty Myth and concluded that Wolf never intended Big Ethel to serve as an example of intelligence, contrary to what I asserted in my previous article. M. Swartz says:

Judging from a search on Amazon, the only time Ethel is mentioned is in the context of noting that an attractive woman is often paired with an unattractive one: "... Veronica and Ethel in Riverdale; ... and so forth. Male culture seems happiest to imagine two women together when they are defined as being one winner and one loser in the beauty myth." (59f)

I still question the aptness of the example, since, again, the principal case in which two women are imagined together in Archie comics is not Veronica and Ethel, but Veronica and Betty, both of whom are portrayed as "winners". Betty and Veronica are major characters; Ethel is not. But the error isn't nearly as serious as the one I said Wolf had made.

The most serious error here is mine: I should have considered and discussed the possibility that my friend was misquoting Wolf. That I didn't do this was unfair to Wolf and entirely my fault. Since I haven't read the book myself, I should have realized what shaky ground I was on, and taken pains to point this out. And yet other possibilities are:

  • That my friend didn't misquote Wolf at all, and I misunderstood her at the time, or
  • that my friend correctly quoted Wolf and I understood her at the time, but my memory of the episode (which occurred around 1993) is faulty.

I took Vallely to task for poor research and for failing to pick up a dictionary to confirm some of his assertions. Had I taken my own advice, I would have checked to see what Wolf said before commenting on it. My disclaimer in the original article that I had not read the book relieves me of only part of the responsibility for this failure.

[Other articles in category /lang/etym] permanent link

On saying too much, or, bad things come in threes
Long ago, I had a conversation with a woman who had recently read Naomi Wolf's book The Beauty Myth. She was extolling the book, which I had not read, and mentioned that Wolf had an extensive discussion of the popular dichotomy between beauty and intelligence. She told me that Wolf had cited Archie comics as containing an example of this dichotomy, in the characters of Veronica and Big Ethel.

I had been nodding and agreeing up to that point. But at the mention of Big Ethel I was quite startled, and said that that spoiled the argument for me, and made me doubt the conclusion. I now had doubts about what had seemed so plausible a moment before.

Veronica is indeed one half of a contrasting pair in Archie comics. But Veronica and Big Ethel? No. Veronica is not complementary to Big Ethel. The counterpart of Veronica is Betty. The contrast is not between beauty and brains but between rich and poor, and between their derived properties, spoiled and sweet. A good point could be made about Veronica and Betty, but it was not the point that Wolf wanted to make; her citation of Veronica and Big Ethel as exemplifying the opposition of beauty and intelligence was just bizarre. Big Ethel, to my knowledge, has never been portrayed as unusually intelligent. She is characterized by homeliness and by her embarrassing and unrequited attraction to Jughead, not by intelligence.

Why would this make me doubt the conclusion of Wolf's argument? Because I had been fully ready to believe the conclusion, that our culture manufactures a division between attractiveness and intelligence for women, and makes them choose one or the other. I had imagined that it would be easy to produce examples demonstrating the point. But the example Wolf chose was completely inept. And, as I said at the time, "Naomi Wolf is very smart, and has studied this closely and thought about it for a long time. If that is the best example that she can come up with, then perhaps I'm wrong, and there really aren't as many examples as I thought there would be." Without the example, I would have agreed with the conclusion. With the example, intended to support the conclusion, I wasn't so sure.

Now, I come to the real point of this note. Paul Vallely has written an article for The Independent on "How Islamic inventors changed the world". He lists twenty of the most influential contributions of the Muslim world, including the discovery of coffee, inoculation, and the fountain pen. I am not so clear on the history of the technology here. Some of it I know is correct; some is plausible; some is extremely dubious. (The crank, not invented before 1206? Please.) But the whole article is spoiled for me, except as a topic of derision, because of three errors.

Item #1 concerns the discovery of the coffee bean. One might expect this to have been discovered in prehistoric times by local Ethiopians, long before the founding of Islam. But I'm in no position to argue with it, and I was ready to give Vallely the benefit of the doubt.

Item #2 on Vallely's list was more worrying. It says "Ibn al-Haitham....set up the first Camera Obscura (from the Arab word qamara for a dark or private room)." It may or may not be true that "qamara" is an "Arab word" (by which I suppose Vallely means an "Arabic word") for "chamber", but it is certainly true that this word, if it exists, is not the source of the English word "camera". I don't know from "qamara", but "camera obscura" is Latin for "dark chamber". "Camera" means "chamber" in Latin and has for thousands of years. The two words, in fact, are etymologically the same, which is why they have almost the same spelling. It is for this reason that the part of a legal hearing held in the judge's private chambers is said to be "in camera".

There might be an Arabic word "qamara", for all I know. If there is, it might be derived from the Latin. (The Latin word is not derived from Arabic, either; it is from Greek καμαρα, which refers to anything with an arched cover.) Two things are sure: The English word "camera" is not derived from Arabic, and Vallely did not bother to pick up a dictionary before he said that it was.

Anyone can make a mistake. But I started to get excited when I read item 3, which is about the game of chess. Vallely says "The word rook comes from the Persian rukh, which means chariot." This is true, sort of, but it is off in a subtle way. The rooks or castles of modern chess did start out as chariots. (Moving castles around never did make much sense.) And "rook" is indeed from Persian rukh. But rukh doesn't exactly mean a chariot. It means a chariot in the game of chess. The Persian word for a chariot outside of chess was different. (I don't remember what it was.) Saying that rukh is the Persian word for chariot is like saying that "rook" is the English word for castle.

I was only on item 3 and had already encountered one serious error of etymology and one other item which, although it wasn't exactly an error, was peculiar. I considered that I wouldn't really have enough material for a blog post, unless Vallely made at least one more serious mistake. But there were still 17 of 20 items left. So I read on. Would Vallely escape?

No, or I would not have written this article. Item 17 says "The modern cheque comes from the Arabic saqq, a written vow to pay for goods when they were delivered...". But no. The correct etymology is fascinating and bizarre. "Cheque" is derived from Norman French "exchequer", which was roughly the equivalent of the treasury and internal revenue department in England starting around 1300. Why was the internal revenue department called the exchequer? Because it was named after the chessboard, which was also called "exchequer".

What do chessboards have to do with internal revenue? Ah, I am glad you wondered. Hindu-Arabic numerals had not yet become popular in Europe; numbers were still recorded using Roman numerals. It is extremely difficult to calculate efficiently with Roman numerals. How, then, did the internal revenue department calculate taxes owed and amounts payable?

They used an abacus. But it wasn't an abacus like modern Chinese or Japanese abacuses, with beads strung on wires. A medieval European abacus was a table with a raised edge and a grid of squares ruled on it. The columns of squares represented ones, tens, hundreds, and so on. You would put metal counters, called jettons, on the squares to represent numbers. Three jettons on a "hundred" square represented three hundred; four jettons on the square to its right represented forty. Each row of squares recorded a separate numeral. To add two numerals together, just take the jettons from one row, move them to the other row, and then resolve the carrying appropriately: Ten jettons on a square can be removed and replaced with a single jetton on the square to the left.
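The addition procedure just described can be sketched in a few lines of Python. This is a toy model under my own assumptions, not a reconstruction of any real exchequer practice: a row of the board is a list of jetton counts with the leftmost column worth the most, the board is six columns wide, and the names `to_row`, `add_rows`, and `row_value` are invented for the example.

```python
def to_row(n, columns=6):
    """Lay out a non-negative integer as jetton counts, one count per column."""
    digits = [int(d) for d in str(n).zfill(columns)]
    if len(digits) > columns:
        raise ValueError("number does not fit on the board")
    return digits


def add_rows(row_a, row_b):
    """Move the jettons of row_b onto row_a, then resolve the carries."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same number of columns")
    merged = [a + b for a, b in zip(row_a, row_b)]
    # Work from the rightmost (ones) column toward the left: every ten
    # jettons on a square become one jetton on the square to its left.
    for i in range(len(merged) - 1, 0, -1):
        carry, merged[i] = divmod(merged[i], 10)
        merged[i - 1] += carry
    if merged[0] >= 10:
        raise ValueError("sum does not fit on the board")
    return merged


def row_value(row):
    """Read a row of jetton counts back as an ordinary integer."""
    value = 0
    for count in row:
        value = value * 10 + count
    return value


total = add_rows(to_row(340), to_row(785))
print(total, row_value(total))
```

Here 340 is three jettons on the hundreds square and four on the tens square, as in the example above. Adding 785 piles ten jettons on the hundreds square and twelve on the tens square before the carries are resolved.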

The internal revenue department, the "exchequer", got its name from these counting-boards covered with ruled squares like chessboards.

(The word "exchequer" meaning a chessboard was derived directly from the name of the game: Old French eschecs, Medieval Latin scacci, and so on, all from shah, which means "king" in Persian. The word "checkered" is also closely related.)

So, in summary: the game is "chess", or eschek in French; the board is therefore exchequer, and since the counting-tables of the treasury department look like chessboards, the treasury department itself becomes known as the exchequer. The treasury department, like all treasury departments, issues notes promising to pay certain sums at certain times, and these notes are called "exchequer notes" or just "exchequers", later shortened (by the English) to "cheques" or (by Americans) to "checks". Arabic saqq, if there is such a word, does not come into it. Once again, it is clear that Vallely's research was shoddy.

While I was writing up this article, yet another serious error came to light. Item 11 says "The windmill was invented in 634 for a Persian caliph...". Now, I am not very knowledgeable about history, and my historical education is very poor. But that was so peculiar that it startled even me. 634 seemed to me much too early for any clever inventions to be attributed to Muslims. Then I looked it up, and so it was. Muhammad himself had only died in 632.

As for the Persian caliph Vallely mentions, he did not exist. The caliphs are the successors of Muhammad, so of course there was one in 634 (the first one, in fact). Abu Bakr reigned from the death of the Prophet in 632 until his own death in 634; he was succeeded by `Umar. Neither was Persian. They were both Arabs, as you would expect of Muslim leaders in 634. There were no Persian caliphs in 634.

My own ignorance of Islam and its history is vast and deep, but at least I had a vague idea that 634 was extremely early. Vallely could have looked up the date of the founding of the caliphate as easily as I did. Why didn't he? Well, perhaps it was just a typo, and should have said 834 or 934. In that case it's just poor editing and inattention. But perhaps it was a genuine factual error, in which case Vallely was not only not paying attention, but is apparently even less familiar with Islamic history than I am, difficult as that is to achieve. In which case we have this article about the twenty greatest contributions of Islam written by a guy who literally does not know the first thing about Islam.

And so this article, which I hoped to enjoy, was spoiled by a series of errors. I am very sympathetic to the idea that the brilliant history of Islamic science and engineering has been neglected by European scholarship. One of my very first blog posts was about the Islamic use of algebra to solve complex probate problems. Just last week I was reading about al-Biruni's invention, around 1000 years ago, of an improved method for measuring the size of the earth, a topic that Vallely treats as item 18. But after reading Vallely's article, I worried a bit that the case might have been overstated. Perhaps the contributions of Muslims are not as large as I had thought?

Fortunately, there was an alternative: the conclusion is correct, and the inept support from the author speaks only to the author's ineptness, not to the validity of the conclusion. I did not have that alternative with Naomi Wolf, who is not inept. (Also, see this addendum.)

With only cursory attention, I found three major errors of fact in this one short article. How many more did I miss, I wonder? Did Abbas ibn Firnas really invent a working parachute, as Vallely says? Maybe it was someone else. Maybe there was no parachute. Maybe there was, but it didn't work. Maybe the whole thing is a propaganda invention by someone who wants to promote Islam, and has suckered Vallely into repeating fiction. Maybe all of these. Someone knows the truth, but it isn't me, and I can't trust Vallely.

Were the Turks vaccinating people eighty years before the Europeans, or did Vallely swallow a tall tale? I don't know, and I can't trust Vallely.

People sometimes joke "I am stupider for having read this," but I really believe this was the case here. The article is worse than useless, because it has polluted my brain with a lot of unreliable non-information. I will have to be careful not to think that quilted fabrics were first brought to Europe by the crusaders, who got them from the Muslims. My real fear is that the "fact" will remain in my brain for years, long after I have forgotten how unreliable Vallely is, and that I will bring it out again as real information, which it is not. True or not, it is too unreliable to be information.

The best I can hope for now is that I will forget everything Vallely says, and meet the true parts again somewhere else in the future. In the meantime, I am worse off for having read it.

[Other articles in category /lang/etym] permanent link

Thu, 02 Feb 2006

Petard corrections
Eric Cholet has written in to mention that he is familiar with the fried choux pastry that I mentioned yesterday, but under the name pets de nonne, not pets de soeurs, as I said. (Nonne, of course, is "nun". The word soeur is literally "sister", but in this context means "nun".) I had cited On Food and Cooking as mentioning pets de soeurs, but it agrees with Eric, not with me.

It appears, though, that many people do use the name pets de soeurs to refer to these fritters, and some people also use it to refer to a kind of soda-raised cinnamon roll. Citations to various cookbooks are available through the usual searches.

Eric also points out that petard is the current word for a firecracker, and also now refers to a doobie. I was already aware of this because pictures of those things appeared when I did a Google image search for petard. Thank you, Eric.

[Other articles in category /lang/etym] permanent link

Tue, 31 Jan 2006

A petard is a Renaissance-era bomb, basically a big firecracker: a box or small barrel of gunpowder with a fuse attached. Those hissing black exploding spheres that you see in Daffy Duck cartoons are petards. Outside of cartoons, you are most likely to encounter the petard in the phrase "hoist with his own petard", which is from Hamlet. Rosencrantz and Guildenstern are being sent to England with the warrant for Hamlet's death; Hamlet alters the warrant to contain R&G's names instead of his own. "Hoist", of course, means "raised", and Hamlet is saying that it is amusing to see someone screw up his own petard and blow himself sky-high with it.

This morning I read in On Food and Cooking that there's a kind of fried choux pastry called pets de soeurs ("nuns' farts") because they're so light and delicate. That brought to mind Le Pétomane, the world-famous theatrical fartmaster. Then there was a link on reddit titled "Xmas Petard (cool gif video!)" which got me thinking about petards, and it occurred to me that "petard" was probably akin to pets, because it makes a bang like a fart. And hey, I was right; how delightful.

Another fart-related word is "partridge", so named because its call sounds like a fart.

[Other articles in category /lang/etym] permanent link

Thu, 26 Jan 2006

"Farther" vs. "further"
People mostly use "farther" and "further" interchangeably. What's the difference?

I looked it up in the dictionary, and it turns out it's simple. "Farther" means "more far". "Further" means "more forward".

"Further" does often connote "farther", because something that is further out is usually farther away, and so in many cases the two are interchangeable. For example, "Hitherto shalt thou come, but no further" (Job 38:11.)

But now when I see people write things like China Steps Further Back From Democracy (The New York Times, 26 November 1995) or, even worse, Big Pension Plans Fall Further Behind (Washington Post, 7 June 2005) it freaks me out.

Google finds 3.2 million citations for "further back", and 9.5 million for "further behind", so common usage is strongly in favor of this. But a quick check of the OED does not reveal much historical confusion between these two. Of the citations there, I can only find one that rings my alarm bell. ("1821 J. BAILLIE Metr. Leg., Wallace lvi, In the further rear.")

[Other articles in category /lang] permanent link