The Universe of Discourse


Wed, 14 May 2008

More artificial Finnish
Several Finns wrote to me to explain in some detail what was wrong with the artificial Finnish in yesterday's article. As I surmised, the words "ssän" and "kkeen" are lexically illegal in Finnish. There were a number of similar problems. For example, my sample output included the non-word "t". I don't know how this could have happened, since the input probably didn't include anything like that, and the Markov process I used to generate it shouldn't have done so. But the code is lost, so I suppose I'll never know.

Of the various comments I received, perhaps the most interesting was from Ilmari Vacklin. ("Vacklin", huh? If my program had generated "Vacklin", the Finns would have been all over the error.) M. Vacklin pointed out that a number of words in my sample output violated the Finnish rules of vowel harmony.

(M. Vacklin also suggested that my article must have been inspired by this comic, but it wasn't. I venture to guess that the Internet is full of places that point out that you can manufacture pseudo-Finnish by stringing together a lot of k's and a's and t's; it's not that hard to figure out. Maybe this would be a good place to mention the word "saippuakauppias", the Finnish term for a soap-dealer, which was in the Guinness Book of World Records as the longest commonly-used palindromic word in any language.)

Anyway, back to vowel harmony. Vowel harmony is a phenomenon found in certain languages, including Finnish. These languages class vowels into two antithetical groups. Vowels from one group never appear in the same word as vowels from the other group. When one has a prefix or a suffix that normally has a group A vowel, and one wants to join it to a word with group B vowels, the vowel in the suffix changes to match. This happens a lot in Finnish, which has a zillion suffixes. In many languages, including Finnish, there is also a third group of vowels which are "neutral" and can be mixed with either group A or with group B.

Modern Korean does not have vowel harmony, mostly, but Middle Korean did have it, up until the early 16th century. The Korean alphabet was invented around 1443, and the notation for the vowels reflected the vowel harmony:

[ Addendum 20080517: The following paragraph about vowel harmony contains significant errors of fact. I got the groups wrong. ]

The first four vowels in this illustration, with the vertical lines, were incompatible with the second four vowels, the ones with the horizontal lines. The last two vowels were neutral, as was another one, not shown here, which was written as a single dot and which has since fallen out of use. Incidentally, vowel harmony is an unusual feature of languages, and its presence in Korean has led some people to suggest that it might be distantly related to Turkish.

The vowel harmony thing is interesting in this context for the following reason. My pseudo-Finnish was generated by a Markov process: each letter was selected at random so as to make the overall frequency of the output match that of real Finnish. Similarly, the overall frequency of two- and three-letter sequences in pseudo-Finnish should match that in real Finnish. Is this enough to generate plausible (although nonsensical) Finnish text? For English, we might say maybe. But for Finnish the answer is no, because this process does not respect the vowel harmony rules. The Markov process doesn't remember, by the time it gets to the end of a long word, whether it is generating a word in vowel category A or B, and so it doesn't know which vowels it whould be generating. It will inevitably generate words with mixed vowels, which is forbidden. This problem does not come up in the generation of pseudo-English.

None of that was what I was planning to write about, however. What I wanted to do was to present samples of pseudo-Finnish generated with various tunings of the Markov process.

The basic model is this: you choose a number N, say 2, and then you look at some input text. For each different sequence of N characters, you count how many times that sequence is followed by "a", how many times it is followed by "b", and so on.

Then you start generating text at random. You pick a sequence of N characters arbitrarily to start, and then you generate the next character according to the probabilities that you calculated. Then you look at the last N characters (the last N-1 from before, plus the new one) and repeat. You keep doing that until you get tired.

For example, suppose we have N=2. Then we have a big table whose keys are 2-character strings like "ab", and then associated with each such string, a table that looks something like this:
r 54.52
a 15.89
i 10.41
o 7.95
l 4.11
e 3.01
u 1.10
space 0.82
: 0.55
t 0.55
, 0.27
. 0.27
b 0.27
s 0.27
So in the input to this process, "ab" was followed by "r" more than 54% of the time, by "a" about 16% of the time, and so on. And when generating the output, every time our process happens to generate "ab", it will follow by generating an "r" 54.52% of the time, an "a" 15.89% of the time, and so on.

Whether to count capital letters as the same as lowercase, and what to do about punctuation and spaces and so forth, are up to the designer.

Here, as examples, are some samples of pseudo-English, generated with various N. The input text was the book of Genesis, which is not entirely typical. In each case, I deleted the initial N characters and the final partial word, cleaned up the capitalization by hand, and appended a final period.

N=0
Lt per f idd et oblcs hs hae:uso ar w aaolt y tndh rl ohn otuhrthpboleel.ee n synenihbdrha,spegn.
N=1
Cachand t wim, heheethas anevem blsant ims, andofan, ieahrn anthaye s, lso iveeti alll t tand, w.
N=2
Ged hich callochbarthe of th to tre said nothem, and rin ing of brom. My and he behou spend the.
N=3
Sack one eved of and refor ther of the hand he will there that in the ful, when it up unto rangers.
It should be clear that the quality improves as one increases the N parameter. The N=3 sample has mostly real words, and the few nonsense ones it contains ("eved", "ful") are completely plausible English. N=2, on the other hand, is mostly nonsense, although it's mostly plausible nonsense. Even "callochbarthe" is almost plausible. (The unfortunate "chb" in the middle is just bad luck. It occurs because Genesis 36 mentions Baalhanan the son of Achbor.) The N=1 sample is recognizably bogus; no English word looks like "ieahrn", and the triple "l" in "alll" is nearly impossible. (I did once write to Jesse Sheidlower, an editor of the Big Dictionary, to ask his advice about whether "ballless" should be hyphenated.)

I have prepared samples of pseudo-Finnish of various qualities. The input here was a bunch of text I copied out of Finnish Wikipedia. (Where else? If you need Finnish text in 1988, you get it from the Usenet fi.talk group; if you need Finnish text in 2008, you get it from Finnish Wikipedia.) I did a little bit of manual cleanup, as with the English, but not too much.

N=0
Vtnnstäklun so so rl sieesjo.Aiijesjeäyuiotiannorin traäl.N vpojanti jonn oteaanlskmt enhksaiaaiiv oenlulniavas. Rottlatutsenynöisu iikännam e lavantkektann eaagla admikkosulssmpnrtinrkudilsorirumlshsmoti,anlosa anuioessydshln.Atierisllsjnlu e.Itatlosyhi vnko ättr otneän akho smalloailäi jiaat kajvtaopnasneilstio tntin einteaonaiimotn:r apoya oruasnainttotne wknaiossäelaäinoev aobrs,vteorlokynv. Aevsrikhanä tp s s oälnlke rvmi il ynae nara ign ssm lkimttbhineaatismäi tst lli ahaltineshne kr keöunv ah s itenh s .Ia pa elstpnanmnuiksriil anaalnttt mr ti.Ooa ka eee eiiei,tnees äusee a nanhetv.Iopkijeatatits,i l eklbiik suössmap tioaotaktdiir rkeaviohiesotkeagarihv nnadvö jlape öt kaeakmjkhykoto tnt iunnuyknnelu rutliie.Leva eiriaösnaj,rk oyumtsle,iioa,aspa aeiaä wsuinn eta y tvati klssviutkuaktmlpnheomi.T akapskushhnuksnhnnheaaaaussitseminmpnamäiaä pät.Kaaaabl unnionuhnpa iaes,outka.Cväinvkshvrnlteeoea rmi re suodmpr autlysa tnliaanäass. Srs rnvrtsita kmidusvjn tii.
N=1
Ava pän svun kerekent lsita batävomenasttenerga kovosuujalules rma punntäni rtraliksainoi van eukällä. Enäkukänesinntampalä ttan kolpäsäkyönsllvitivenestakkesenelussivaliite kuuksä kttteni einsuekeita kuterissalietäkilpöikalit ojatäjä pinsin atollukole idoitenn kkaorhjajasteden en vuolynkoiverojaa hta puon ehalan vaivä ihoshäositi. Hde setua tämpitydi makta jasyn sää oinncgrkai jeeten. Ljalanekikeri toiskkksypohoin ta yö atenesällväkeesaatituuun. Paait pukata tuon ktusumitttan zagaleskli va kkanäsin siikutytowhenttvosa veste eten vunovivä. Vorytellkeeni stan jä taa eka kaine ja kurenntonsin kyn o nta ja. Aisst urksetaka. Hotimivaa ta mppussternallai ja. Hdä on koraleerermohtydelen on jon. Rgienon kulinoilisälsa ja holälimmpa vitin, kukausoompremänn ra, palestollebilsen kaalesta, oina. Blilullaushoingiötideispaanoksiton, mulurklimi kermalli pota atebau lmomarymin kypa hta vanon tin kela vanaspoita s kulitekkäjen jäleetuolpan, veesalekäilin oii. Häreli. Ymialisstermimpriekaksst on.
N=2
Omaalis onino osa josa hormastaaraktse tyi altäänä tyntellevääostoidesenä, la siä vuansilliana inöön akalkuulukempellys kisä nen myöhelyaminenkiemostamahti omuonsa onite oni kusissa. Kungin sykynteillalkaai ellahasiteisuunnaja eroniemmin javai musuuasinä, sittan tusuovatkryt tormon vuolisenitiivansaliuotkietjuuta sensa. Kutumppalvinen. Vaikintolat hän ja kilkuossa osa koiseuvo keyhdysvisakeemppolowistoisijouliuodosijolasissän muoli ogro soluksi valuksasverix intetormon patlantaan et muiksen paiettaatulun kan vuomesyklees ovain pun. Sesva sa hänerittämpiraun tyi vuoden sälisen sän yhtiit, set tämpiraalletä. Senssaikanoje leemp:tabeten ain raa olliukettyi su. Solulukuuttellerrotolit hee säkinessa hän sekketäärinenvaikeihakti umallailuksin sestunno klossi ilunuta. Klettisaa osen vua vuola, jani ja hinangia en ta kaineemonimien polin barkiviäliukkuta joseseva. Ebb rautta onistärään on ml jokoulistä oheksi anoton allysvallelsiliineuvoja kutuko ala ulkietutablohitkain. Ituno.
N=3
Ävivät mena osakeyhti yhdysvalmiininäkin rakenne tuliitä hermoni ja umpirauhastui liin baryshnikoneja. Ain viljelukuullisää olisäke spesideksyylikoliittu latvia. Helsina hän solukeskuksen kannumme, peri palkin vieskeinä sisään on orgaan poikanssisäätelukauno klee laisenäläinen tavastui kauno on länteen muttava hän voimista kilometsästymistettäjän lehtiöiksitoreisö. Sitoutuvat mukalle. Ainettiin sisäke suomaihin, jouluun. Verenkilpalveli valtaineen opisteri poli ohjasionee rakennuttikolan aivastisenäläistuu kehittisetoja, rajahormaailmanajan kulkopuolesti kuluu mooliitoutuvat ovat olle. Ainen yhdysvaltai valiolähtiöiksi vasta, S. Muidentilaisteri jotka verenkirovin verenkiehumistä nelle väliaivoittynyt baleviiliukoisiin maailmestavarasta, jokakuudessa laisu. Sai rakeyhti yhtiö eli gluksessa. Ebbin, ja linnosakkeen hormonien I hallistehtiin kilpirasvua jaajana hormaailusta kunnetteluskäyttöön suomalaivat yhdysvalmistämistammonit veteet olimistuvatta. Hormon oli rautta.
Before anyone objects to the non-word "ml" in the N=2 sample, let me explain that this is the standard abbreviation for "millilitra". The "i" in the N=3 sample was a puzzle, since Marko Heiskanen assures me that Finnish has no one-letter words. But it appears in my sample in connection with Sukselaisen I hallitus, whatever that is, so I capitalized it.

I must say that I found "yhdysvalmistämistammonit" rather far-fetched, even in Finnish. But then I discovered that "yhdeksänkymmenvuotiaaksi" and "yhdysvalloissakaan" are genuine, so who am I to judge?

[ Addendum 20080601: Some additional notes. ]


[Other articles in category /lang] permanent link