Tue, 15 May 2007
Ambiguous words and dictionary hacks
I said I was surprised that he thought that was unique to English, and said that probably Spanish had just as many "ambiguous" words, but that he just hadn't noticed them. I couldn't think of any Spanish examples offhand, but I knew some German ones: in English, "suit" can mean a lawsuit, a suit of clothes, or a suit of playing cards. German has different words for all of these. In German, the suit of a playing card is its "Farbe", its color. So German distinguishes between a suit of clothes and a suit of playing cards, which English does not, but fails to distinguish between colors of paint and suits of playing cards, which English does.
Every language has these mismatches. Korean has two words for "thin", one meaning thin like paper and the other meaning thin like string. Korean distinguishes father's sister ("komo") from mother's sister ("imo") where English has only "aunt".
Anyway, Sr. Manzo then went to lunch, and I wanted to find some examples of concepts distinguished by English but not by Spanish. I did this with a dictionary hack.
A dictionary hack is when you take a plain text dictionary and do some sort of rough-and-ready processing on it to get an 80% solution to some problem. The oldest dictionary hack I know of is the Unix rhyming dictionary hack:
rev /usr/dict/words | sort | rev > rhyming.txt
This takes the Unix word list and turns it into a semblance of a
rhyming dictionary. It's not an especially accurate semblance, but
you can't beat the price.
...
ugh Marlborough choreograph Guelph Wabash
Hugh Scarborough lithograph Adolph cash
McHugh thorough electrocardiograph Randolph dash
Pugh trough electroencephalograph Rudolph leash
laugh sough nomograph triumph gash
bough tough tomograph lymph hash
cough tanh seismograph nymph lash
dough Penh phonograph philosoph clash
sourdough sinh chronograph Christoph eyelash
hough oh polarograph homeomorph flash
though pharaoh spectrograph isomorph backlash
although Shiloh Addressograph polymorph whiplash
McCullough pooh chromatograph glyph splash
furlough graph autograph anaglyph slash
slough paragraph epitaph petroglyph mash
enough telegraph staph myrrh smash
rough radiotelegrap aleph ash gnash
through calligraph Joseph Nash Monash
breakthrough epigraph caliph bash rash
borough mimeograph Ralph abash brash
...
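For anyone without rev handy, here is a minimal Python sketch of the same trick: sort the words by their reversed spelling. The function name rhyming_order and the sample words are made up for illustration, and Python's default string ordering will not match every locale's sort exactly.

```python
def rhyming_order(words):
    """Sort words by reversed spelling, roughly like `rev | sort | rev`."""
    # Reversing each word makes the sort compare endings first,
    # so words that end alike land next to each other.
    return sorted(words, key=lambda word: word[::-1])


if __name__ == "__main__":
    # Hypothetical sample; the real hack reads the Unix word list.
    sample = ["clash", "lash", "backlash", "myrrh", "purr", "dash"]
    for word in rhyming_order(sample):
        print(word)
```

On the sample words this puts "dash", "lash", "clash", and "backlash" next to one another and leaves "myrrh" and "purr" at opposite ends of the list.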
It figures out that "clash" rhymes with "lash" and "backlash", but not
that "myrrh" rhymes with "purr" or "her" or "sir". You can, of
course, do better by using a text file that has two columns, one for
orthography and one for pronunciation, and sorting it by reverse
pronunciation. But like I said, you won't beat the price.
But I digress. Last week I pulled an excellent dictionary hack. I found the Internet Dictionary Project's English-Spanish lexicon file on the web with a quick Google search; it looks like this:
a un, uno, una[Article]
aardvark cerdo hormiguero
aardvark oso hormiguero[Noun]
aardvarks cerdos hormigueros
aardvarks osos hormigueros
ab prefijo que indica separacio/n
aback hacia atras
aback hacia atr´s,take aback, desconcertar. En facha.
aback por sopresa, desprevenidamente, de improviso
aback atra/s[Adverb]
abacterial abacteriano, sin bacterias
abacus a/baco
abacuses a/bacos
abaft A popa (towards stern)/En popa (in stern)
abaft detra/s de[Adverb]
abalone abulo/n
abalone oreja de mar (molusco)[Noun]
abalone oreja de mar[Noun]
abalones abulones
abalones orejas de mar (moluscos)[Noun]
abalones orejas de mar[Noun]
abandon abandonar
abandon darse por vencido[Verb]
abandon dejar
abandon desamparar, desertar, renunciar, evacuar, repudiar
abandon renunciar a[Verb]
abandon abandono[Noun]
abandoned abandonado
abandoned dejado
...
Then I did:
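sort -k 2 idengspa.txt |
  perl -nle '# $ecur is the English headword, $scur the Spanish gloss
    ($ecur, $scur) = split /\s+/, $_, 2;
    # print pairs with the same Spanish but different English initials
    print "$eprev $ecur $scur"
      if $sprev eq $scur &&
         substr($eprev, 0, 1) ne substr($ecur, 0, 1);
    ($eprev, $sprev) = ($ecur, $scur)'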
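Written out longhand, the pipeline above comes to something like the following Python sketch. It is an approximation under assumptions: like the Perl, it splits each line at the first run of whitespace, so a multi-word English headword would be mis-split; the name shared_translations is hypothetical; and the sample lines are hand-picked for illustration (the car/coach pair is invented to show the filter at work) rather than read from the lexicon file.

```python
def shared_translations(lines):
    """Yield (english_prev, english_cur, spanish) for consecutive entries
    that share a Spanish gloss but whose English words begin with
    different letters."""
    entries = []
    for line in lines:
        parts = line.split(None, 1)  # English headword, then the rest
        if len(parts) == 2:
            entries.append((parts[0], parts[1].strip()))
    # Put the lexicon into Spanish order instead of English order.
    entries.sort(key=lambda entry: entry[1])
    prev_eng, prev_spa = None, None
    for eng, spa in entries:
        # Same Spanish, different English initial: a candidate pair.
        if prev_spa == spa and prev_eng[:1] != eng[:1]:
            yield prev_eng, eng, spa
        prev_eng, prev_spa = eng, spa


if __name__ == "__main__":
    # Hypothetical sample lines in "English<TAB>Spanish" form.
    sample = [
        "cake\tpastel",
        "high\talto",
        "car\tcoche",
        "pie\tpastel",
        "coach\tcoche",
        "tall\talto",
    ]
    for prev, cur, spa in shared_translations(sample):
        print(prev, cur, spa)
```

On the sample this reports high/tall and cake/pie, and drops car/coach because both English words begin with "c".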
The sort puts the lexicon into Spanish order instead of English order. The Perl thing comes out looking a lot more complicated than it ought. It just says to look for and print consecutive items that have the same Spanish, but whose English begins with different letters. The condition on the English is to filter out items where the Spanish is the same and the English is almost the same, such as:
It does filter out possible items of interest, such as:
But since the goal is just to produce some examples, and this cheap hack was never going to generate an exhaustive list anyway, that is all right. The output is:
at letter a
actions stock acciones[Noun]
accredit certify acreditar
around thereabout alrededor
high tall alto
comrade pal amigo[Noun]
antecedents backgrounds antecedentes
(...complete output...)
A lot of these are useless, genuine synonyms. It would be silly to
suggest that Spanish fails to preserve the English distinction between
"marry" and "wed", between "ale" and "beer", between "desire" and
"yearn", or between "vest" and "waistcoat". But some good
possibilities remain.
Of these, some probably fail for reasons that only a Spanish-speaker would be able to supply. For instance, is "el pastel" really the best translation of both "cake" and "pie"? If so, it is an example of the type I want. But perhaps it's just a poor translation; perhaps Spanish does have this distinction, with, say, "torta" for "cake" and "empanada" for "pie". (That's what Google suggests, anyway.)
Another kind of failure arises because of idioms. The output:
exactly o'clock en punto
is of this type. It's not that Spanish fails to distinguish between
the concepts of "exactly" and "o'clock"; it's that "en punto" (which
means "on the point of") is used idiomatically to mean both of those
things: some phrase like "en punto tres" ("on the point of three")
means "exactly three" and so, by analogy, "three o'clock". I don't
know just what the correct Spanish phrases are, but I can guess that
they'll be something like this.
Still, some of the outputs are suggestive:
I put some of these to Sr. Manzo, and he agreed that some were indeed ambiguous in Spanish. I wouldn't have known what to suggest without the dictionary hack.