Tue, 15 May 2007

Ambiguous words and dictionary hacks
A Mexican gentleman of my acquaintance, Marco Antonio Manzo, was complaining to me (on IRC) that what makes English hard is the large number of ambiguous words. For example, English has the word "free" where Spanish distinguishes "gratis" (free like free beer) from "libre" (free like free speech).

I said I was surprised that he thought this was unique to English, and that Spanish probably had just as many "ambiguous" words that he simply hadn't noticed. I couldn't think of any Spanish examples offhand, but I knew some German ones: in English, "suit" can mean a lawsuit, a suit of clothes, or a suit of playing cards. German has different words for all of these. In German, the suit of a playing card is its "Farbe", its color. So German distinguishes between a suit of clothes and a suit of playing cards, which English does not, but fails to distinguish between a color of paint and a suit of playing cards, which English does.

Every language has these mismatches. Korean has two words for "thin", one meaning thin like paper and the other meaning thin like string. Korean distinguishes father's sister ("komo") from mother's sister ("imo") where English has only "aunt".

Anyway, Sr. Manzo then went to lunch, and I wanted to find some examples of concepts distinguished by English but not by Spanish. I did this with a dictionary hack.

A dictionary hack is when you take a plain text dictionary and do some sort of rough-and-ready processing on it to get an 80% solution to some problem. The oldest dictionary hack I know of is the old Unix rhyming dictionary hack:

        rev /usr/dict/words | sort | rev > rhyming.txt

This takes the Unix word list and turns it into a semblance of a rhyming dictionary. It's not an especially accurate semblance, but you can't beat the price.

     ugh	      Marlborough   choreograph	            Guelph        Wabash   
     Hugh	      Scarborough   lithograph	            Adolph        cash     
     McHugh	      thorough	    electrocardiograph      Randolph      dash     
     Pugh	      trough	    electroencephalograph   Rudolph       leash    
     laugh	      sough	    nomograph	            triumph       gash     
     bough	      tough	    tomograph	            lymph         hash     
     cough	      tanh	    seismograph	            nymph         lash     
     dough	      Penh	    phonograph	            philosoph     clash    
     sourdough        sinh	    chronograph	            Christoph     eyelash  
     hough	      oh	    polarograph	            homeomorph    flash    
     though	      pharaoh	    spectrograph            isomorph      backlash 
     although         Shiloh	    Addressograph           polymorph     whiplash 
     McCullough       pooh	    chromatograph           glyph         splash   
     furlough         graph	    autograph	            anaglyph      slash    
     slough	      paragraph	    epitaph	            petroglyph    mash     
     enough	      telegraph	    staph	            myrrh         smash    
     rough	      radiotelegraph aleph	            ash           gnash    
     through	      calligraph    Joseph	            Nash          Monash   
     breakthrough     epigraph	    caliph	            bash          rash     
     borough	      mimeograph    Ralph	            abash         brash    

It figures out that "clash" rhymes with "lash" and "backlash", but not that "myrrh" rhymes with "purr" or "her" or "sir". You can, of course, do better by using a text file that has two columns, one for orthography and one for pronunciation, and sorting it by reverse pronunciation. But like I said, you won't beat the price.
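
For what it's worth, here is a minimal sketch of that two-column version. It assumes a hypothetical tab-separated file with the spelling first and the pronunciation last on each line; the function name rhyme_by_sound and the phoneme notation mentioned below are made up for illustration. Because the pronunciation is the last column, the same rev-sort-rev trick ends up ordering the lines by reversed pronunciation rather than by reversed spelling:

```shell
# Sketch only: expects "word<TAB>pronunciation" lines on standard input.
# rhyme_by_sound is a hypothetical name, not a standard utility.
rhyme_by_sound() {
    # Reversing each line puts the reversed pronunciation first, so a plain
    # sort groups lines whose pronunciations end the same way; the second
    # rev restores the original text. LC_ALL=C keeps the ordering byte-wise.
    rev | LC_ALL=C sort | rev
}
```

Fed the four lines "myrrh<TAB>M ER", "purr<TAB>P ER", "clash<TAB>K L AE SH", and "lash<TAB>L AE SH", this puts "myrrh" next to "purr", which the spelling-only version cannot do.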

But I digress. Last week I pulled an excellent dictionary hack. I found the Internet Dictionary Project's English-Spanish lexicon file on the web with a quick Google search; it looks like this:

        a	un, uno, una[Article]
        aardvark	cerdo hormiguero
        aardvark	oso hormiguero[Noun]
        aardvarks	cerdos hormigueros
        aardvarks	osos hormigueros 
        ab	prefijo que indica separacio/n
        aback	hacia atras
        aback	hacia atr´s,take aback, desconcertar. En facha.
        aback	por sopresa, desprevenidamente, de improviso
        aback	atra/s[Adverb]
        abacterial	abacteriano, sin bacterias
        abacus	a/baco
        abacuses	a/bacos
        abaft	A popa (towards stern)/En popa (in stern)
        abaft	detra/s de[Adverb]
        abalone	abulo/n
        abalone	oreja de mar (molusco)[Noun]
        abalone	oreja de mar[Noun]
        abalones	abulones
        abalones	orejas de mar (moluscos)[Noun]
        abalones	orejas de mar[Noun]
        abandon	abandonar
        abandon	darse por vencido[Verb]
        abandon	dejar
        abandon	desamparar, desertar, renunciar, evacuar, repudiar
        abandon	renunciar a[Verb]
        abandon	abandono[Noun]
        abandoned	abandonado
        abandoned	dejado

Then I did:

        sort -k 2 idengspa.txt |
        perl -nle '($ecur, $scur) = split /\s+/, $_, 2; 
                print "$eprev $ecur $scur" 
                        if $sprev eq $scur && 
                           substr($eprev, 0, 1) ne substr($ecur, 0, 1); 
                        ($eprev, $sprev) = ($ecur, $scur)'

The sort sorts the lexicon into Spanish order instead of English order. The Perl thing comes out looking a lot more complicated than it ought to. It just says to look for and print consecutive items that have the same Spanish, but whose English begins with different letters. The condition on the English is to filter out items where the Spanish is the same and the English is almost the same, such as:

blond blonde rubio
cake cakes tarta
oceanographic oceanographical oceanografico[Adjective]
palaces palazzi palacios[Noun]
talc talcum talco
taxi taxicab taxi

It does filter out possible items of interest, such as:

carefree careless sin cuidado

But since the goal is just to produce some examples, and this cheap hack was never going to generate an exhaustive list anyway, that is all right.
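
The same logic can be spelled out a little more explicitly in awk. This is only a sketch, not the pipeline that produced the output in this article: it assumes the lexicon is tab-separated (English, tab, Spanish), and the helper name shared_translations and its "prefix" mode are hypothetical. The default "letter" mode applies the first-letter test described above; "prefix" mode instead discards a pair only when one English word begins with the other, which would let "carefree careless" through while still dropping "cake cakes".

```shell
# Sketch only: expects "english<TAB>spanish" lines on standard input.
# shared_translations and its optional "prefix" argument are hypothetical.
shared_translations() {
    mode=${1:-letter}    # "letter" = first-letter test, "prefix" = looser test
    # Sort on the Spanish column so identical translations become adjacent.
    LC_ALL=C sort -t "$(printf '\t')" -k 2 |
    awk -F '\t' -v mode="$mode" '
        # near(a, b): is this English pair to be treated as almost the same?
        function near(a, b) {
            if (mode == "prefix") return index(a, b) == 1 || index(b, a) == 1
            return substr(a, 1, 1) == substr(b, 1, 1)
        }
        # Same Spanish as the previous line, different enough English: print.
        NR > 1 && $2 == sprev && !near(eprev, $1) { print eprev, $1, $2 }
        # Remember this line for comparison with the next one.
        { eprev = $1; sprev = $2 }
    '
}
```

Something like "shared_translations < idengspa.txt" would give the first-letter behavior, and "shared_translations prefix < idengspa.txt" the looser one. The prefix test is looser, so it also lets through near-duplicates such as "palaces palazzi".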

The output of the Perl hack begins:

        at letter a
        actions stock acciones[Noun]
        accredit certify acreditar
        around thereabout alrededor
        high tall alto
        comrade pal amigo[Noun]
        antecedents backgrounds antecedentes
        (...complete output...)

A lot of these are useless: they are genuine synonyms. It would be silly to suggest that Spanish fails to preserve the English distinction between "marry" and "wed", between "ale" and "beer", between "desire" and "yearn", or between "vest" and "waistcoat". But some good possibilities remain.

Of these, some probably fail for reasons that only a Spanish-speaker would be able to supply. For instance, is "el pastel" really the best translation of both "cake" and "pie"? If so, it is an example of the type I want. But perhaps it's just a poor translation; perhaps Spanish does have this distinction, with, say, "torta" for "cake" and "empanada" for "pie". (That's what Google suggests, anyway.)

Another kind of failure arises because of idioms. The output:

        exactly o'clock en punto
is of this type. It's not that Spanish fails to distinguish between the concepts of "exactly" and "o'clock"; it's that "en punto" (which means "on the point of") is used idiomatically to mean both of those things: some phrase like "en punto tres" ("on the point of three") means "exactly three" and so, by analogy, "three o'clock". I don't know just what the correct Spanish phrases are, but I can guess that they'll be something like this.

Still, some of the outputs are suggestive:

high tall alto
low small bajo[Adjective]
babble fumble balbucear[Verb]
jealous zealous celoso
contest debate debate[Noun]
forlorn stranded desamparado[Adjective]
docile meek do/cil[Adjective]
picture square el cuadro
fourth room el cuarto
collar neck el cuello
idiom language el idioma[Noun]
clock watch el reloj
floor ground el suelo
ceiling roof el techo
knife razor la navaja
feather pen la pluma
cloudy foggy nublado

I put some of these to Sr. Manzo, and he agreed that some were indeed ambiguous in Spanish. I wouldn't have known what to suggest without the dictionary hack.

