Tue, 15 May 2007
Ambiguous words and dictionary hacks
I said I was surprised that he thought that was unique to English, and said that probably Spanish had just as many "ambiguous" words, but that he just hadn't noticed them. I couldn't think of any Spanish examples offhand, but I knew some German ones: in English, "suit" can mean a lawsuit, a suit of clothes, or a suit of playing cards. German has different words for all of these. In German, the suit of a playing card is its "Farbe", its color. So German distinguishes between suit of clothes and suit of playing cards, which English does not, but fails to distinguish between colors of paint and suit of playing cards, which English does.
Every language has these mismatches. Korean has two words for "thin", one meaning thin like paper and the other meaning thin like string. Korean distinguishes father's sister ("komo") from mother's sister ("imo") where English has only "aunt".
Anyway, Sr. Manzo then went to lunch, and I wanted to find some examples of concepts distinguished by English but not by Spanish. I did this with a dictionary hack.
A dictionary hack is when you take a plain text dictionary and do some sort of rough-and-ready processing on it to get an 80% solution to some problem. The oldest dictionary hack I know of is the old Unix rhyming dictionary hack:
rev /usr/dict/words | sort | rev > rhyming.txt
This takes the Unix word list and turns it into a semblance of a rhyming dictionary. It's not an especially accurate semblance, but you can't beat the price.
...
ugh           Marlborough    choreograph            Guelph      Wabash
Hugh          Scarborough    lithograph             Adolph      cash
McHugh        thorough       electrocardiograph     Randolph    dash
Pugh          trough         electroencephalograph  Rudolph     leash
laugh         sough          nomograph              triumph     gash
bough         tough          tomograph              lymph       hash
cough         tanh           seismograph            nymph       lash
dough         Penh           phonograph             philosoph   clash
sourdough     sinh           chronograph            Christoph   eyelash
hough         oh             polarograph            homeomorph  flash
though        pharaoh        spectrograph           isomorph    backlash
although      Shiloh         Addressograph          polymorph   whiplash
McCullough    pooh           chromatograph          glyph       splash
furlough      graph          autograph              anaglyph    slash
slough        paragraph      epitaph                petroglyph  mash
enough        telegraph      staph                  myrrh       smash
rough         radiotelegrap  aleph                  ash         gnash
through       calligraph     Joseph                 Nash        Monash
breakthrough  epigraph       caliph                 bash        rash
borough       mimeograph     Ralph                  abash       brash
...
It figures out that "clash" rhymes with "lash" and "backlash", but not that "myrrh" rhymes with "purr" or "her" or "sir". You can, of course, do better by using a text file that has two columns, one for orthography and one for pronunciation, and sorting it by reverse pronunciation. But like I said, you won't beat the price.
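For readers who would rather see the idea outside a shell pipeline, here is a minimal Python sketch of both variants: sorting by reversed spelling, and sorting by reversed pronunciation. The function names and the tiny pronunciation table are made up for illustration; the phoneme lists are hand-written in a simplified ARPAbet-like style and do not come from any real pronunciation file. Python also sorts by code point rather than by locale, so the order may differ slightly from what `sort` gives.

```python
# Hypothetical sample data: hand-written, simplified ARPAbet-style phonemes,
# for illustration only (not taken from any real pronunciation file).
SAMPLE_PRONUNCIATIONS = {
    "clash": ["K", "L", "AE", "SH"],
    "lash": ["L", "AE", "SH"],
    "backlash": ["B", "AE", "K", "L", "AE", "SH"],
    "myrrh": ["M", "ER"],
    "purr": ["P", "ER"],
    "her": ["HH", "ER"],
    "sir": ["S", "ER"],
}


def rhyming_order(words):
    """Sort words by reversed spelling, in the spirit of `rev | sort | rev`."""
    return sorted(words, key=lambda word: word[::-1])


def rhyming_order_by_sound(pronunciations):
    """Sort words by reversed phoneme list, so shared final sounds group together.

    `pronunciations` maps each word to a list of phoneme symbols.
    """
    return sorted(pronunciations, key=lambda word: pronunciations[word][::-1])
```

With this sample, `rhyming_order` keeps "lash", "clash", and "backlash" together but leaves "myrrh" far from "purr", while `rhyming_order_by_sound` puts "her", "myrrh", "purr", and "sir" next to each other.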
But I digress. Last week I pulled an excellent dictionary hack. I found the Internet Dictionary Project's English-Spanish lexicon file on the web with a quick Google search; it looks like this:
a	un, uno, una[Article]
aardvark	cerdo hormiguero
aardvark	oso hormiguero[Noun]
aardvarks	cerdos hormigueros
aardvarks	osos hormigueros
ab	prefijo que indica separacio/n
aback	hacia atras
aback	hacia atr´s,take aback, desconcertar. En facha.
aback	por sopresa, desprevenidamente, de improviso
aback	atra/s[Adverb]
abacterial	abacteriano, sin bacterias
abacus	a/baco
abacuses	a/bacos
abaft	A popa (towards stern)/En popa (in stern)
abaft	detra/s de[Adverb]
abalone	abulo/n
abalone	oreja de mar (molusco)[Noun]
abalone	oreja de mar[Noun]
abalones	abulones
abalones	orejas de mar (moluscos)[Noun]
abalones	orejas de mar[Noun]
abandon	abandonar
abandon	darse por vencido[Verb]
abandon	dejar
abandon	desamparar, desertar, renunciar, evacuar, repudiar
abandon	renunciar a[Verb]
abandon	abandono[Noun]
abandoned	abandonado
abandoned	dejado
...
Then I did:
sort -k 2 idengspa.txt | perl -nle '($ecur, $scur) = split /\s+/, $_, 2; print "$eprev $ecur $scur" if $sprev eq $scur && substr($eprev, 0, 1) ne substr($ecur, 0, 1); ($eprev, $sprev) = ($ecur, $scur)'
The sort sorts the lexicon into Spanish order instead of English order. The Perl thing comes out looking a lot more complicated than it ought to. It just looks for and prints consecutive items that have the same Spanish, but whose English begins with different letters. The condition on the English is to filter out items where the Spanish is the same and the English is almost the same, such as:
It does filter out possible items of interest, such as:
But since the goal is just to produce some examples, and this cheap hack was never going to generate an exhaustive list anyway, that is all right.
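The same sort-then-compare idea can be sketched in Python. This is only an approximation of the one-liner, not a drop-in replacement: the name `shared_translations` and the sample list are hypothetical, the sample rows assume an "English, whitespace, Spanish" layout, and the "towering" row is invented purely to exercise the first-letter filter. Sorting here is by code point and keeps input order on ties, which may not match `sort` exactly.

```python
# Hypothetical sample in an assumed "English<TAB>Spanish" layout. The high/tall
# and actions/stock rows mirror pairs that appear in the post's output; the
# "towering" row is invented to show the same-first-letter filter at work.
SAMPLE_LEXICON = [
    "actions\tacciones[Noun]",
    "high\talto",
    "stock\tacciones[Noun]",
    "tall\talto",
    "towering\talto",
    "abandon\tdejar",
]


def shared_translations(lines):
    """Return (english1, english2, spanish) for adjacent entries that share
    a Spanish translation but whose English words start with different letters."""
    entries = []
    for line in lines:
        parts = line.rstrip("\n").split(None, 1)
        if len(parts) == 2:  # skip blank lines and lines with no translation
            entries.append((parts[0], parts[1]))

    # Same job as the sort step: bring identical Spanish strings together.
    entries.sort(key=lambda entry: entry[1])

    pairs = []
    prev_english = prev_spanish = None
    for english, spanish in entries:
        # Only consecutive entries are compared, exactly as in the one-liner.
        if spanish == prev_spanish and english[0] != prev_english[0]:
            pairs.append((prev_english, english, spanish))
        prev_english, prev_spanish = english, spanish
    return pairs
```

On the sample, "tall" and "towering" share "alto" but are dropped because they begin with the same letter, which is the same kind of collateral loss described above.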
The output of the pipeline begins:
at letter a
actions stock acciones[Noun]
accredit certify acreditar
around thereabout alrededor
high tall alto
comrade pal amigo[Noun]
antecedents backgrounds antecedentes
...
A lot of these are useless: they are genuine synonyms. It would be silly to suggest that Spanish fails to preserve the English distinction between "marry" and "wed", between "ale" and "beer", between "desire" and "yearn", or between "vest" and "waistcoat". But some good possibilities remain.
Of these, some probably fail for reasons that only a Spanish-speaker would be able to supply. For instance, is "el pastel" really the best translation of both "cake" and "pie"? If so, it is an example of the type I want. But perhaps it's just a poor translation; perhaps Spanish does have this distinction; say maybe "torta" for "cake" and "empanada" for "pie". (That's what Google suggests, anyway.)
Another kind of failure arises because of idioms. The output:
exactly o'clock en punto
is of this type. It's not that Spanish fails to distinguish between the concepts of "exactly" and "o'clock"; it's that "en punto" (which means "on the point of") is used idiomatically to mean both of those things: some phrase like "en punto tres" ("on the point of three") means "exactly three" and so, by analogy, "three o'clock". I don't know just what the correct Spanish phrases are, but I can guess that they'll be something like this.
Still, some of the outputs are suggestive:
I put some of these to Sr. Manzo, and he agreed that some were indeed ambiguous in Spanish. I wouldn't have known what to suggest without the dictionary hack.