A couple of people pointed out that this does nothing to address the
issue of multiple-word anagrams. For example, it will not discover
“I, rearrangement servant / Internet anagram server”. True, that is a
different problem entirely.
Markian Gooley informed me that “megachiropteran / cinematographer” has long been known to
Scrabble players, and
Ben Zimmer pointed out
that A. Ross Eckler, unimpressed by “cholecystoduodenostomy / duodenocholecystostomy”, proposed a
method almost identical to mine for scoring anagrams in
an article in Word Ways in 1976.
Mr. Eckler also mentioned that the “remarkable” “megachiropteran / cinematographer” had been
published in 1927 and that “enumeration /
mountaineer” (which I also selected as a good example)
appeared in the Saturday Evening Post in 1879!
The
Hacker News comments
were unusually pleasant and interesting. Several people asked “why
didn't you just use the
Levenshtein distance?”
I don't remember that it ever occurred to me, but if it had, I would
have rejected it right away as being obviously the wrong thing.
Remember that my original chunking idea was motivated by the
observation that “cholecystoduodenostomy / duodenocholecystostomy” was long but of low quality.
Levenshtein distance measures how far every letter has to travel to
get to its new place and it seems clear that this would give “cholecystoduodenostomy
/ duodenocholecystostomy” a high score because most of the letters move a long way.
Hacker News user
tyingq
tried it
anyway, and reported that it produced a poor
outcome.
The top-scoring pair by Levenshtein distance is
“anatomicophysiologic / physiologicoanatomic”, which under the
chunking method gets a score of 3. Repeat offender “cholecystoduodenostomy / duodenocholecystostomy”
only drops to fourth place.
A better idea seems to be Levenshtein score per unit of length,
suggested by lobste.rs user cooler_ranch.
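For anyone who wants to repeat the experiment, here is a minimal sketch of both measures. Strictly speaking, Levenshtein distance counts the fewest single-letter insertions, deletions, and substitutions needed to turn one word into the other. The function names are made up for illustration, this is not the code that tyingq or cooler_ranch ran, and dividing by the length of the longer word is only one plausible reading of “per unit of length”.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def levenshtein_per_letter(a: str, b: str) -> float:
    """Assumed normalization: distance divided by the longer word's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    pairs = [
        ("cholecystoduodenostomy", "duodenocholecystostomy"),
        ("anatomicophysiologic", "physiologicoanatomic"),
        ("megachiropteran", "cinematographer"),
    ]
    for a, b in pairs:
        print(a, b, levenshtein(a, b), round(levenshtein_per_letter(a, b), 2))
```

Since the two words of an anagram pair always have the same length, it makes no difference here which word's length is used as the divisor.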
A couple of people complained about my “notaries / senorita”
example, rightly observing that “senorita” is properly spelled
“señorita”. This bothered me also while I was writing the article.
I eventually decided that although “notaries” and “señorita” are
certainly not anagrams in Spanish (even supposing that “notaries”
were a Spanish word, which it isn't), the spelling of “senorita”
without the tilde is a correct alternative in English. (Although I
found out later that both the Big Dictionary and American Heritage
seem to require the tilde.)
Hacker News user
ggambetta
observed
that while ‘é’ and ‘e’, and ‘ó’ and ‘o’ feel interchangeable in
Spanish, ‘ñ’ and ‘n’ do not. I think this is right. The ‘é’ is an
‘e’, but with a mark on it to show you where the stress is in the
word. An ‘ñ’ is not like this. It was originally an abbreviation
for ‘nn’, introduced by medieval scribes. So I thought it might
make sense to allow ‘ñ’ to be exchanged for ‘nn’, at least in some
cases.
(An analogous situation in German, which may be more familiar, is
that it might be reasonable to treat ‘ö’ and ‘ü’ as if they were
‘oe’ and ‘ue’. Also note that in former times, “w” and “uu” were
considered interchangeable in English anagrams.)
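The substitution idea can be sketched in a few lines. This is only an illustration under stated assumptions: the tables `EXPANSIONS` and `STRESS_MARKS` and the function `anagram_key` are hypothetical names, and the sketch expands every ‘ñ’ unconditionally, which ignores the “at least in some cases” caveat.

```python
import unicodedata

# Hypothetical table: letters that expand to two letters.
EXPANSIONS = {"ñ": "nn", "ö": "oe", "ü": "ue"}
# Hypothetical table: stress-marked vowels treated as the plain vowel.
STRESS_MARKS = {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u"}


def anagram_key(word: str) -> str:
    """Sorted letters of word after applying the substitutions above.
    Two words are anagrams under these rules when their keys are equal."""
    # NFC makes sure 'n' + combining tilde is seen as the single letter 'ñ'.
    word = unicodedata.normalize("NFC", word.lower())
    letters = []
    for ch in word:
        letters.extend(EXPANSIONS.get(ch, STRESS_MARKS.get(ch, ch)))
    return "".join(sorted(letters))


if __name__ == "__main__":
    print(anagram_key("señorita") == anagram_key("notaries"))
    print(anagram_key("senorita") == anagram_key("notaries"))
```

Under this rule “señorita” expands to nine letters, so it can no longer match the eight-letter “notaries”, while the tilde-less “senorita” still does.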
Unfortunately my Spanish dictionary is small (7,000 words) and of
poor quality and I did not find any anagrams of “señorita”. I wish
I had something better for you. Also, “señorita” is not one of the
cases where it is appropriate to replace “ñ” with “nn”, since it was
never spelled “sennorita”.
I wonder why sometimes this sort of complaint seems to me like
useless nitpicking, and other times it seems like a serious problem
worthy of serious consideration. I will try to think about this.
Mike Morton, who goes by the anagrammatic nickname of “Mr. Machine
Tool”, referred me to his Higgledy-piggledy about
megachiropteran / cinematographer,
which is worth reading.
Regarding the
maximum independent set algorithm I described yesterday,
Shreevatsa R. suggested
that it might be conceptually simpler to find the maximum clique in
the complement graph. I'm not sure this helps, because the
complement graph has a lot more edges than the original. Below
right is the complement graph for “acrididae / cidaridae”. I don't
think I can pick out the 4-cliques in that graph any more than the
independent sets in the graph on the lower-left, and this is an
unusually favorable example case for the clique version, because the
original graph has an unusually large number of edges.
But perhaps the cliques might be easier to see if you know what to
look for: in the right-hand diagram the four nodes on the left are
one clique, and the four on the right are the other, whereas in the
left-hand diagram the two independent sets are all mixed together.
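The equivalence behind this suggestion is easy to check by brute force. The sketch below uses a made-up five-node path graph, not the “acrididae / cidaridae” graph from the diagrams, and the helper names are invented for illustration. It finds the largest independent sets of a graph and the largest cliques of its complement and shows that they are the same sets of nodes.

```python
from itertools import combinations


def complement(nodes, edges):
    """Edges of the complement graph: every pair of nodes NOT joined in edges."""
    edge_set = {frozenset(e) for e in edges}
    return [(u, v) for u, v in combinations(nodes, 2)
            if frozenset((u, v)) not in edge_set]


def largest_subsets(nodes, edges, want_adjacent):
    """All largest node subsets whose pairs are all adjacent (cliques,
    want_adjacent=True) or all non-adjacent (independent sets, False).
    Exhaustive search, so only suitable for tiny graphs."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(len(nodes), 0, -1):
        found = [set(c) for c in combinations(nodes, size)
                 if all((frozenset(p) in edge_set) == want_adjacent
                        for p in combinations(c, 2))]
        if found:
            return found
    return []


if __name__ == "__main__":
    # Hypothetical example graph: the path a - b - c - d - e.
    nodes = ["a", "b", "c", "d", "e"]
    edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
    co_edges = complement(nodes, edges)
    print(len(edges), "edges in the graph,", len(co_edges), "in its complement")
    print(largest_subsets(nodes, edges, want_adjacent=False))
    print(largest_subsets(nodes, co_edges, want_adjacent=True))
```

Even in this tiny example the complement has more edges than the original graph (six against four), which is the cost of switching to the clique formulation.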
An earlier version of the original article mentioned the putative
11-pointer “endometritria / intermediator”. The word
“endometritria” seemed pretty strange, and I did look into it before
I published the article, but not carefully enough. When Philip
Cohen wrote to me to question it, I investigated more carefully, and
discovered that it had been an error in an early
WordNet release, corrected (to
“endometria”) in version 1.6. I didn't remember that I had used
WordNet's word lists, but I am not surprised to discover that I did.
A rare printing of Webster's 2¾th American International Lexican
includes the word “endometritriostomoscopiotomous” but I suspect that
it may be a misprint.
Philippe Bruhat wrote to inform me of Alain Chevrier’s book
notes / sténo,
a collection of thematically related anagrams in French.
The full text is available online.
Alexandre Muñiz, who has a really delightful
blog, and who makes and sells
attractive and clever puzzles of his own
invention, pointed out that soapstone
teaspoons
are available. The perfect gift for the anagram-lover in your life!
They are not even expensive.
Thanks also to Clinton Weir, Simon Tatham, Jon Reeves, Wei-Hwa
Huang, and Philip Cohen for their emails about this.