More artificial Finnish
Several Finns wrote to me to explain in some detail what was wrong
with the artificial Finnish in yesterday's article. As I
surmised, the words "ssän" and "kkeen" are lexically illegal in
Finnish. There were a number of similar problems. For example, my
sample output included the non-word "t". I don't know how this could
have happened, since the input probably didn't include anything like
that, and the Markov process I used to generate it shouldn't have done
so. But the code is lost, so I suppose I'll never know.
Of the various comments I received, perhaps the most interesting was
from Ilmari Vacklin. ("Vacklin", huh? If my program had generated
"Vacklin", the Finns would have been all over the error.)
M. Vacklin pointed out that a number of words in my sample output
violated the Finnish rules of vowel harmony.
(M. Vacklin also suggested that my article must have been inspired
by this
comic, but it wasn't. I venture to guess that the Internet is
full of places that point out that you can manufacture pseudo-Finnish
by stringing together a lot of k's and a's and t's; it's not that hard
to figure out. Maybe this would be a good place to mention the word
"saippuakauppias", the Finnish term for a soap-dealer, which was in
the Guinness Book of World Records as the longest commonly-used
palindromic word in any language.)
Anyway, back to vowel harmony.
Vowel harmony is a phenomenon found in certain languages, including
Finnish. These languages class vowels into two antithetical groups. Vowels
from one group never appear in the same word as vowels from the other
group. When one has a prefix or a suffix that normally has a group A
vowel, and one wants to join it to a word with group B vowels, the
vowel in the suffix changes to match. This happens a lot in Finnish,
which has a zillion suffixes. In many languages, including Finnish,
there is also a third group of vowels which are "neutral" and can be
mixed with either group A or with group B.
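The rule is simple enough to check mechanically. Here is a minimal Python sketch, assuming the standard Finnish grouping (back vowels a, o, u; front vowels ä, ö, y; neutral e and i), which comes from ordinary Finnish grammar rather than from this article. The function name `violates_harmony` is made up for illustration.

```python
# Standard Finnish vowel classes (an assumption brought in from Finnish
# grammar; the article itself does not list them).
BACK = set("aou")
FRONT = set("äöy")
# "e" and "i" are neutral, so they are simply not listed in either group.


def violates_harmony(word: str) -> bool:
    """Return True if the word contains both back and front vowels."""
    letters = set(word.lower())
    return bool(letters & BACK) and bool(letters & FRONT)


# "talossa" and "kylässä" show the same suffix in its two forms, -ssa / -ssä.
for w in ["talossa", "kylässä", "yhdysvalmistämistammonit"]:
    print(w, violates_harmony(w))
```

This is deliberately crude: real Finnish compounds and loanwords can legitimately mix the two groups, because harmony applies within each part of a compound, so a checker like this will flag some genuine words.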
Modern Korean does not have vowel harmony,
mostly, but Middle Korean did have it, up until the early 16th century.
The Korean alphabet was invented around 1443, and the notation for the
vowels reflected the vowel harmony:
[ Addendum 20080517: The following paragraph about vowel harmony
contains significant errors of fact. I got the groups wrong. ]
The first four vowels in this illustration, with the vertical lines,
were incompatible with the second four vowels, the ones with the
horizontal lines. The last two vowels were neutral, as was another
one, not shown here, which was written as a single dot and which has
since fallen out of use. Incidentally, vowel harmony is an unusual
feature of languages, and its presence in Korean has led some people
to suggest that Korean might be distantly related to Turkish.
The vowel harmony thing is interesting in this context for the
following reason. My pseudo-Finnish was generated by a Markov
process: each letter was selected at random so as to make the overall
letter frequencies of the output match those of real Finnish. Similarly, the
overall frequencies of two- and three-letter sequences in pseudo-Finnish
should match those in real Finnish. Is this enough to generate
plausible (although nonsensical) Finnish text? For English, we might
say maybe. But for Finnish the answer is no, because this process
does not respect the vowel harmony rules. The Markov process doesn't
remember, by the time it gets to the end of a long word, whether it is
generating a word in vowel category A or B, and so it doesn't know
which vowels it should be generating. It will inevitably generate
words with mixed vowels, which is forbidden. This problem does not
come up in the generation of pseudo-English.
None of that was what I was planning to write about, however. What I
wanted to do was to present samples of pseudo-Finnish generated with
various tunings of the Markov process.
The basic model is this: you choose a number N, say 2, and then
you look at some input text. For each different sequence of N
characters, you count how many times that sequence is followed by "a",
how many times it is followed by "b", and so on.
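The counting step can be sketched in a few lines of Python. This is not the lost original code, only an illustration; the name `build_table` and the toy input string are my own.

```python
from collections import Counter, defaultdict


def build_table(text: str, n: int) -> dict:
    """Map each n-character sequence to a Counter of the characters
    that follow it in the input text."""
    table = defaultdict(Counter)
    # Stop n characters before the end so every context has a follower.
    for i in range(len(text) - n):
        context = text[i:i + n]
        follower = text[i + n]
        table[context][follower] += 1
    return table


table = build_table("abracadabra", 2)
print(table["ab"])   # both occurrences of "ab" are followed by "r"
print(table["ra"])   # only the first "ra" has a follower, "c"
```

With N=0 every context is the empty string, so the table degenerates to a plain letter-frequency count.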
Then you start generating text at random. You pick a sequence of
N characters arbitrarily to start, and then you generate the next
character according to the probabilities that you calculated. Then
you look at the last N characters (the last N-1 from before,
plus the new one) and repeat. You keep doing that until you get
tired.
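The generation loop just described might look like the following Python sketch. The function `generate` and the tiny hand-made table `TOY` are hypothetical, chosen so that the example runs on its own. A real table would come from counting over the input text.

```python
import random


def generate(table, n, length, start, rng):
    """Extend `start` one character at a time until it has `length`
    characters, picking each character according to the follower
    counts recorded for the last n characters."""
    out = start
    while len(out) < length:
        context = out[len(out) - n:]      # last n characters ("" when n == 0)
        followers = table.get(context)
        if not followers:                 # context never seen: give up early
            break
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out += rng.choices(chars, weights=weights)[0]
    return out


# Toy N=1 table: after "a" comes "b" three times out of four, and so on.
TOY = {"a": {"b": 3, "n": 1}, "b": {"a": 4}, "n": {"a": 1}}

print(generate(TOY, 1, 30, "a", random.Random(0)))
```

Here "getting tired" is a fixed output length. The early exit covers a context that never appeared in the input, which can happen for the last few characters of the source text.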
For example, suppose we have N=2. Then we have a big table
whose keys are 2-character strings like "ab", and then associated with
each such string, a table that looks something like this:
after "ab" | % of the time
r | 54.52
a | 15.89
i | 10.41
o | 7.95
l | 4.11
e | 3.01
u | 1.10
space | 0.82
: | 0.55
t | 0.55
, | 0.27
. | 0.27
b | 0.27
s | 0.27
So in the input to this process, "ab" was followed by "r" more than
54% of the time, by "a" about 16% of the time, and so on. And when generating the output, every time our process happens to generate "ab", it will follow it with an "r" 54.52% of the time, an "a" 15.89% of the time, and so on.
Whether to
count capital letters as the same as lowercase, and what to do about
punctuation and spaces and so forth, are up to the designer.
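To make the "ab" row concrete, here is a small Python sketch that uses those percentages as sampling weights. The percentages are the ones from the table above. The names `AB_ROW` and `next_char` are made up, and the seed is arbitrary.

```python
import random
from collections import Counter

# The followers of "ab" and their percentages, copied from the table.
AB_ROW = {
    "r": 54.52, "a": 15.89, "i": 10.41, "o": 7.95, "l": 4.11,
    "e": 3.01, "u": 1.10, " ": 0.82, ":": 0.55, "t": 0.55,
    ",": 0.27, ".": 0.27, "b": 0.27, "s": 0.27,
}


def next_char(row, rng):
    """Pick one follower at random, weighted by its percentage."""
    chars = list(row)
    return rng.choices(chars, weights=[row[c] for c in chars])[0]


# Draw many followers and compare the observed shares with the table.
rng = random.Random(2008)
draws = Counter(next_char(AB_ROW, rng) for _ in range(10_000))
for ch, count in draws.most_common(3):
    print(repr(ch), f"{count / 100:.2f}%")
```

Over many draws the observed shares should settle close to the table's figures. The weights need not sum to exactly 100, since `random.choices` normalizes them.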
Here, as examples, are some samples of pseudo-English, generated with
various N. The input text was the book of Genesis, which is
not entirely typical. In each case, I deleted the initial N
characters and the final partial word, cleaned up the capitalization by hand, and appended a
final period.
- N=0
- Lt per f idd et oblcs hs hae:uso ar w aaolt y tndh rl ohn
otuhrthpboleel.ee n synenihbdrha,spegn.
- N=1
- Cachand t wim, heheethas anevem blsant ims, andofan, ieahrn anthaye s, lso iveeti alll t tand, w.
- N=2
- Ged hich callochbarthe of th to tre said nothem, and rin ing of brom. My and he behou spend the.
- N=3
- Sack one eved of and refor ther of the hand he will there that in the ful, when it up unto rangers.
It should be clear that the quality improves as one increases the
N parameter. The N=3 sample has mostly real words, and
the few nonsense ones it contains ("eved", "ful") are completely
plausible English. N=2, on the other hand, is mostly nonsense,
although it's mostly plausible nonsense. Even "callochbarthe" is
almost plausible. (The unfortunate "chb" in the middle is just bad
luck. It occurs because Genesis 36 mentions Baalhanan the son of
Achbor.) The N=1 sample is recognizably bogus; no English word
looks like "ieahrn", and the triple "l" in "alll" is nearly
impossible. (I did once write to Jesse Sheidlower, an editor of the Big Dictionary, to ask his advice about
whether "ballless" should be hyphenated.)
I have prepared samples of pseudo-Finnish of various
qualities. The input here was a bunch of text I copied out of Finnish
Wikipedia. (Where else? If you need Finnish text in 1988, you get it
from the Usenet fi.talk group; if you need Finnish text in
2008, you get it from Finnish Wikipedia.) I did a little bit of manual cleanup, as with the English, but not too much.
- N=0
- Vtnnstäklun so so rl sieesjo.Aiijesjeäyuiotiannorin traäl.N vpojanti jonn oteaanlskmt enhksaiaaiiv oenlulniavas. Rottlatutsenynöisu iikännam e lavantkektann eaagla admikkosulssmpnrtinrkudilsorirumlshsmoti,anlosa anuioessydshln.Atierisllsjnlu e.Itatlosyhi vnko ättr otneän akho smalloailäi jiaat kajvtaopnasneilstio tntin einteaonaiimotn:r apoya oruasnainttotne wknaiossäelaäinoev aobrs,vteorlokynv. Aevsrikhanä tp s s oälnlke rvmi il ynae nara ign ssm lkimttbhineaatismäi tst lli ahaltineshne kr keöunv ah s itenh s .Ia pa elstpnanmnuiksriil anaalnttt mr ti.Ooa ka eee eiiei,tnees äusee a nanhetv.Iopkijeatatits,i l eklbiik suössmap tioaotaktdiir rkeaviohiesotkeagarihv nnadvö jlape öt kaeakmjkhykoto tnt iunnuyknnelu rutliie.Leva eiriaösnaj,rk oyumtsle,iioa,aspa aeiaä wsuinn eta y tvati klssviutkuaktmlpnheomi.T akapskushhnuksnhnnheaaaaussitseminmpnamäiaä pät.Kaaaabl unnionuhnpa iaes,outka.Cväinvkshvrnlteeoea rmi re suodmpr autlysa tnliaanäass. Srs rnvrtsita kmidusvjn tii.
- N=1
- Ava pän svun kerekent lsita batävomenasttenerga kovosuujalules rma punntäni rtraliksainoi van eukällä. Enäkukänesinntampalä ttan kolpäsäkyönsllvitivenestakkesenelussivaliite kuuksä kttteni einsuekeita kuterissalietäkilpöikalit ojatäjä pinsin atollukole idoitenn kkaorhjajasteden en vuolynkoiverojaa hta puon ehalan vaivä ihoshäositi. Hde setua tämpitydi makta jasyn sää oinncgrkai jeeten. Ljalanekikeri toiskkksypohoin ta yö atenesällväkeesaatituuun. Paait pukata tuon ktusumitttan zagaleskli va kkanäsin siikutytowhenttvosa veste eten vunovivä. Vorytellkeeni stan jä taa eka kaine ja kurenntonsin kyn o nta ja. Aisst urksetaka. Hotimivaa ta mppussternallai ja. Hdä on koraleerermohtydelen on jon. Rgienon kulinoilisälsa ja holälimmpa vitin, kukausoompremänn ra, palestollebilsen kaalesta, oina. Blilullaushoingiötideispaanoksiton, mulurklimi kermalli pota atebau lmomarymin kypa hta vanon tin kela vanaspoita s kulitekkäjen jäleetuolpan, veesalekäilin oii. Häreli. Ymialisstermimpriekaksst on.
- N=2
- Omaalis onino osa josa hormastaaraktse tyi altäänä tyntellevääostoidesenä, la siä vuansilliana inöön akalkuulukempellys kisä nen myöhelyaminenkiemostamahti omuonsa onite oni kusissa. Kungin sykynteillalkaai ellahasiteisuunnaja eroniemmin javai musuuasinä, sittan tusuovatkryt tormon vuolisenitiivansaliuotkietjuuta sensa. Kutumppalvinen. Vaikintolat hän ja kilkuossa osa koiseuvo keyhdysvisakeemppolowistoisijouliuodosijolasissän muoli ogro soluksi valuksasverix intetormon patlantaan et muiksen paiettaatulun kan vuomesyklees ovain pun. Sesva sa hänerittämpiraun tyi vuoden sälisen sän yhtiit, set tämpiraalletä. Senssaikanoje leemp:tabeten ain raa olliukettyi su. Solulukuuttellerrotolit hee säkinessa hän sekketäärinenvaikeihakti umallailuksin sestunno klossi ilunuta. Klettisaa osen vua vuola, jani ja hinangia en ta kaineemonimien polin barkiviäliukkuta joseseva. Ebb rautta onistärään on ml jokoulistä oheksi anoton allysvallelsiliineuvoja kutuko ala ulkietutablohitkain. Ituno.
- N=3
- Ävivät mena osakeyhti yhdysvalmiininäkin rakenne tuliitä hermoni ja umpirauhastui liin baryshnikoneja. Ain viljelukuullisää olisäke spesideksyylikoliittu latvia. Helsina hän solukeskuksen kannumme, peri palkin vieskeinä sisään on orgaan poikanssisäätelukauno klee laisenäläinen tavastui kauno on länteen muttava hän voimista kilometsästymistettäjän lehtiöiksitoreisö. Sitoutuvat mukalle. Ainettiin sisäke suomaihin, jouluun. Verenkilpalveli valtaineen opisteri poli ohjasionee rakennuttikolan aivastisenäläistuu kehittisetoja, rajahormaailmanajan kulkopuolesti kuluu mooliitoutuvat ovat olle. Ainen yhdysvaltai valiolähtiöiksi vasta, S. Muidentilaisteri jotka verenkirovin verenkiehumistä nelle väliaivoittynyt baleviiliukoisiin maailmestavarasta, jokakuudessa laisu. Sai rakeyhti yhtiö eli gluksessa. Ebbin, ja linnosakkeen hormonien I hallistehtiin kilpirasvua jaajana hormaailusta kunnetteluskäyttöön suomalaivat yhdysvalmistämistammonit veteet olimistuvatta. Hormon oli rautta.
Before anyone objects to the non-word "ml" in the N=2 sample,
let me explain that this is the standard abbreviation for
"millilitra". The "i" in the N=3 sample was a puzzle, since
Marko Heiskanen assures me that Finnish has no one-letter words. But
it appears in my sample in connection with Sukselaisen
I hallitus, whatever that is, so I capitalized it.
I must say that I found "yhdysvalmistämistammonit" rather far-fetched,
even in Finnish. But then I discovered that
"yhdeksänkymmenvuotiaaksi" and "yhdysvalloissakaan" are genuine, so
who am I to judge?
[ Addendum 20080601: Some
additional notes. ]