The Universe of Discourse
           
Mon, 29 Oct 2007

Undefined behavior in Perl and other languages
Miles Gould wrote what I thought was an interesting article on implementation-defined languages, and cited Perl as an example. One of his points was that a language that is defined by its implementation, as Perl is, rather than by a standards document, cannot have any "undefined behavior".

Undefined behavior

For people unfamiliar with this concept, I should explain briefly. The C standard is full of places that say "if the program contains x, the behavior is undefined", which really means "C programs do not contain x, so If the program contains x, it is not written in C, and, as this standard only defines the meaning of programs in C, it has nothing to say about the meaning of your program." There are around a couple of hundred of these phrases, and a larger number of places where it is implied.

For example, everyone knows that it means when you write x = 4;, but what does it mean if you write 4 = x;? According to clause 6.3.2.1[#1], it means nothing, and this is not a C program. The non-guarantee in this case is extremely strong. The C compiler, upon encountering this locution, is allowed to abort and spontaneously erase all your files, and in doing so it is not violating the requirements of the standard, because the standard does not require any particular behavior in this case.

The memorable phrase that the comp.lang.c folks use is that using that construction might cause demons to fly out of your nose.

[ Addendum 20071030: I am informed that I misread the standard here, and that the behavior of this particular line is not undefined, but requires a compiler diagnostic. Perhaps a better example would have been x = *(char *)0. ]

I mentioned this in passing in one of my recent articles about a C program I wrote:

        unsigned strinc(char *s) 
        {
          char *p = strchr(s, '\0') - 1;
          while (p >= s && *p == 'A' + colors - 1) *p-- = 'A';
          if (p < s) return 0;
          (*p)++;
          return 1;
        }
Here the pointer p starts at the end of the string s, and the loop might stop when p points to the position just before s. Except no, that is forbidden, and the program might at that moment cause demons to fly out of your nose. You are allowed to have a pointer that points to the position just after an object, but not one that points just before.

Well anyway, I seem to have digressed. My point was that M. Gould says that one advantage of languages like Perl that are defined wholly by their (one) implementation is that you never have "undefined behavior". If you want to know what some locution does, you type it in and see what it does. Poof, instant definition.

Although I think this is a sound point, it occurred to me that that is not entirely correct. The manual is a specification of sorts, and even if the implementation does X in situation Y, the manual might say "The implementation does X in situation Y, but this is unsupported and may change without warning in the future." Then what you have is not so different from Y being undefined behavior. Because the manual is (presumably) a statement of official policy from the maintainers, and, as a communiqué from the people with the ultimate authority to define the future meaning of the language, it has some of the same status that a formal specification would.

Perl: the static variable hack

Such disclaimers do appear in the Perl documentation. Probably the most significant example of this is the static variable hack. For various implementation reasons, the locution my $static if 0 has a strange and interesting effect:

  sub foo {
    my $static = 42 if 0;
    print "static is now $static\n";
    $static++;
  }

  foo() for 1..5;
This makes $static behave as a "static" variable, and persist from call to call of foo(). Without the ... if 0, the code would print "static is now 42" five times. But with ... if 0, it prints:

        static is now 
        static is now 1
        static is now 2
        static is now 3
        static is now 4
This was never an intentional feature. It arose accidentally, and then people discovered it and started using it. Since the behavior was the result of a strange quirk of the implementation, caused by the surprising interaction of several internal details, it was officially decided by the support group that this behavior would not be supported in future versions. The manual was amended to say that this behavior was explicitly undefined, and might change in the future. It can be used in one-off programs, but not in any important program, one that might have a long life and need to be run under several different versions of Perl. Programs that use pointers that point outside the bounds of allocated storage in C are in a similar position. It might work on today's system, with today's compiler, today, but you can't do that in any larger context.

Having the "undefined behavior" be determined by the manual, instead of by a language standard, has its drawbacks. The language standard is fretted over by experts for months. When the C standard says that behavior is undefined, it is because someone like Clive Feather or Doug Gwyn or P.J. Plauger, someone who knows more about C than you ever will, knows that there is some machine somewhere on which the behavior is unsupported and unsupportable. When the Perl manual says that some behavior is undefined, you might be hearing from the Perl equivalent of Doug Gwyn, someone like Nick Clark or Chip Salzenberg or Gurusamy Sarathy. Or you might be hearing from a mere nervous-nellie who got their patch into the manual on a night when the release manager had stayed up too late.

Perl: modifying a hash in a loop

Here is an example of this that has bothered me for a long time. One can use the each() operator to loop lazily over the contents of a hash:

  while (my $key = each %hash) {
    # do something with $key and $hash{$key}
  }
What happens if you modify the hash in the middle of the loop? For various implementation reasons, the manual forbids this.

For example, suppose the loop code adds a new key to the hash. The hash might overflow as a result, and this would trigger a reorganization that would move everything around, destroying the ordering information. The subsequent calls to each() would continue from the same element of the hash, but in the new order, making it likely that the loop would visit some keys more than once, or some not at all. So the prohibition in that case makes sense: The each() operator normally guarantees to produce each key exactly once, and adding elements to a hash in the middle of the loop might cause that guarantee to be broken in an unpredictable way. Moreover, there is no obvious way to fix this without potentially wrecking the performance of hashes.

But the manual also forbids deleting keys inside the loop, and there the issue does not come up, because in Perl, hashes are never reorganized as the result of a deletion. The behavior is easily described: Deleting a key that has already been visited will not affect the each() loop, and deleting one that has not yet been visited will just cause it to be skipped when the time comes.

Some people might find this general case confusing, I suppose. But the following code also runs afoul of the "do not modify a hash inside of an each loop" prohibition, and I don't think anyone would find it confusing:

  while (my $key = each %hash) {
    delete $hash{$key} if is_bad($hash{$key});
  }
Here we want to delete all the bad items from the hash. We do this by scanning the hash and deleting the current item whenever it is bad. Since each key is deleted only after it is scanned by each, we should expect this to visit every key in the hash, as indeed it does. And this appears to be a useful thing to write. The only alternative is to make two passes, constructing a list of bad keys on the first pass, and deleting them on the second pass. The code would be more complicated and the time and memory performance would be much worse.

There is a potential implementation problem, though. The way that each() works is to take the current item and follow a "next" pointer from it to find the next item. (I am omitting some unimportant details here.) But if we have deleted the current item, the implementation cannot follow the "next" pointer. So what happens?

In fact, the implementation has always contained a bunch of code, written by Larry Wall, to ensure that deleting the current key will work properly, and that it will not spoil the each(). This is nontrivial. When you delete an item, the delete() operator looks to see if it is the current item of an each() loop, and if so, it marks the item with a special flag instead of deleting it. Later on, the next time each() is invoked, it sees the flag and deletes the item after following the "next" pointer.

So the implementation takes some pains to make this work. But someone came along later and forbade all modifications of a hash inside an each loop, throwing the baby out with the bathwater. Larry and perl paid a price for this feature, in performance and memory and code size, and I think it was a feature well bought. But then someone patched the manual and spoiled the value of the feature. (Some years later, I patched the manual again to add an exception for this case. Score!)

Perl: modifying an array in a loop

Another example is the question of what happens when you modify an array inside a loop over the array, as with:

  @a = (1..3);
  for (@a) {
    print;
    push @a, $_ + 3 if $_ % 2 == 1;
  }
(This prints 12346.) The internals are simple, and the semantics are well-defined by the implementation, and straightforward, but the manual has the heebie-jeebies about it, and most of the Perl community is extremely superstitious about this, claiming that it is "entirely unpredictable". I would like to support this with a quotation from the manual, but I can't find it in the enormous and disorganized mass that is the Perl documentation.

[ Addendum: Tom Boutell found it. The perlsyn page says "If any part of LIST is an array, foreach will get very confused if you add or remove elements within the loop body, for example with splice. So don't do that." ]

The behavior, for the record, is quite straightforward: On the first iteration, the loop processes the first element in the array. On the second iteration, the loop processes the second element in the array, whatever that element is at the time the second iteration starts, whether or not that was the second element before. On the third iteration, the loop processes the third element in the array, whatever it is at that moment. And so the loop continues, terminating the first time it is called upon to process an element that is past the end of the array. We might imagine the following pseudocode:

        index = 0;     
        while (index < array.length()) {
          process element array[index];
          index += 1;
        }
There is nothing subtle or difficult about this, and claims that the behavior is "entirely unpredictable" are probably superstitious confessions of ignorance and fear.

Let's try to predict the "entirely unpredictable" behavior of the example above:

  @a = (1..3);
  for (@a) {
    print;
    push @a, $_ + 3 if $_ % 2 == 1;
  }
Initially the array contains (1, 2, 3), and so the first iteration processes the first element, which is 1. This prints 1, and, since 1 is odd, pushes 4 onto the end of the array.

The array now contains (1, 2, 3, 4), and the loop processes the second element, which is 2. 2 is printed. The loop then processes the third element, printing 3 and pushing 6 onto the end. The array now contains (1, 2, 3, 4, 6).

On the fourth iteration, the fourth element (4) is printed, and on the fifth iteration, the fifth element (6) is printed. That is the last element, so the loop is finished. What was so hard about that?

Haskell: n+k patterns

My blog was recently inserted into the feed for planet.haskell.org, and of course I immediately started my first streak of posting code-heavy articles about C and Perl. This is distressing not just because the articles were off-topic for Planet Haskell—I wouldn't give the matter two thoughts if I were posting my usual mix of abstract math and stuff—but it's so off-topic that it feels weird to see it sitting there on the front page of Planet Haskell. So I thought I'd make an effort to talk about Haskell, as a friendly attempt to promote good relations between tribes. I'm not sure what tribe I'm in, actually, but what the heck. I thought about Haskell a bit, and a Haskell example came to mind.

Here is a definition of the factorial function in Haskell:

        fact 0 = 1
        fact n = n * fact (n-1)
I don't need to explain this to anyone, right?

Okay, now here is another definition:

        fact 0     = 1
        fact (n+1) = (n+1) * fact n
Also fine, and indeed this is legal Haskell. The pattern n+1 is allowed to match an integer that is at least 1, say 7, and doing so binds n to the value 6. This is by a rather peculiar special case in the specification of Haskell's pattern-matcher. (It is section 3.17.2#8 of Haskell 98 Language and Libraries: The Revised Report, should you want to look it up.) This peculiar special case is known sometimes as a "successor pattern" but more often as an "n+k pattern".

The spec explicitly deprecates this feature:

Many people feel that n+k patterns should not be used. These patterns may be removed or changed in future versions of Haskell.

(Page 33.) One wonders why they put it in at all, if they were going to go ahead and tell you not to use it. The Haskell committee is usually smarter than this.

I have a vague recollection that there was an argument between people who wanted to use Haskell as a language for teaching undergraduate programming, and those who didn't care about that, and that this was the compromise result. Like many compromises, it is inferior to both of the alternatives that it interpolates between. Putting the feature in complicates the syntax and the semantics of the language, disrupts its conceptual purity, and bloats the spec—see the Perlesque yikkity-yak on pages 57–58 about how x + 1 = ... binds a meaning to +, but (x + 1) = ... binds a meaning to x. Such complication is worth while only if there is a corresponding payoff in terms of increased functionality and usability in the language. In this case, the payoff is a feature that can only be used in one-off programs. Serious programs must avoid it, since the patterns "may be removed or changed in future versions of Haskell". The Haskell committee purchased this feature at a certain cost, and it is debatable whether they got their money's worth. I'm not sure which side of that issue I fall on. But having purchased the feature, the committee then threw it in the garbage, squandering their sunk costs. Oh well. Not even the Haskell committee is perfect.

I think it might be worth pointing out that the version of the program with the n+k pattern is technically superior to the other version. Given a negative integer argument, the first version recurses forever, possibly taking a long time to fail and perhaps taking out the rest of the system on which it is running. But the n+k version fails immediately, because the n+1 pattern will only match an integer that is at least 1.

XML screws up

The "nasal demons" of the C standard are a joke, but a serious one. The C standard defines what C compilers must do when presented with C programs; it does not define what they do when presented with other inputs, nor what other software does when presented with C programs. The authors of C standard clearly understood the standard's role in the world.

Earlier versions of the XML standard were less clear. There was a particularly laughable clause in the first edition of the XML 1,0 standard:

XML documents may, and should, begin with an XML declaration which specifies the version of XML being used. For example, the following is a complete XML document, well-formed but not valid:

<?xml version="1.0"?>
<greeting>Hello, world!</greeting>

...

The version number "1.0" should be used to indicate conformance to this version of this specification; it is an error for a document to use the value "1.0" if it does not conform to this version of this specification.

(Emphasis is mine.) The XML 1.0 spec is just a document. It has no power, except to declare that certain files are XML 1.0 and certain files are not. A file that complies with the requirements of the spec is XML 1.0; all other files are not XML 1.0. But in the emphasized clause, the spec says that certain behavior "is an error" if it is exhibited by documents that do not conform to the spec. That is, it is declaring certain non-XML-1.0 documents "erroneous". But within the meaning of the spec, "erroneous" simply means that the documents are not XML 1.0. So the clause is completely redundant. Documents that do not conform to the spec are erroneous by definition, whether or not they use the value "1.0".

It's as if the Catholic Church issued an edict forbidding all rabbis from wearing cassocks, on pain of excommunication.

I am happy to discover that this dumb error has been removed from the most recent edition of the XML 1.0 spec.


[Other articles in category /prog/perl] permanent link

Sat, 27 Oct 2007

Where's that blog?
I haven't posted in a couple of weeks, and I was wondering why. So I took a look at the test version of the blog, which displays all the unpublished articles as well as the published ones, and the reason was obvious: In the past ten days I've written seven articles that are unfinished or that didn't work. Usually only about a third of my articles flop; this month a whole bunch flopped in a row. What can I say? Sometimes the muse delivers, and sometimes she doesn't.

I said a while back that I would try to publish more regularly, and not wait until every article was perfect. But I don't want to publish the unfinished articles yet. So I thought instead I'd publish a short summary of what I've been thinking about lately.

I hope to get at least one or two of these done by the end of the month.

Simplified Poker

I recently played a computer poker game that uses a 24-card deck, with only the nine through ace of each suit. This changes the game drastically. For example, a flush is less likely than a four of a kind. (The game uses the standard hand rankings anyway.) It is very easy to compute optimal strategies for this game, because there are so few possible hands (42,504) that you can brute-force all the calculations with a computer.

This got me thinking again of something I started writing up last year and never finished: The game of "Simplified Poker", which was an attempt to do for Poker what the λ-calculus does for computation: the simplest possible model that nevertheless captures all the essential features of the original. Simplified Poker is played with an infinite deck in which half the cards are kings and half are jacks. Each hand contains only two cards. Nevertheless, bluffing is still possible.

Order
What is the Name of this Book?
What is the Name of this Book?
with kickback
no kickback

The Annoying Boxes Puzzle

This is a logic puzzle in which you deduce which box contains the treasure, but with a twist. I thought it up many years ago, and then in the course of trying to write up an explanation about five years ago, I consulted Raymond Smullyan's book What is the Name of This Book? in order to get a citation to prove a certain fact about the form that such puzzles usually take. In doing so, I discovered that Smullyan actually presented the annoying boxes puzzle (in slightly different form) in that book!

It's primarily waiting for me to take a photograph to accompany the puzzle.

Undefined behavior

I have a pretty interesting article on the concept of "undefined behavior", which is a big deal in the C world, but which means something rather different, and is much less important, in Perl.

[ Addendum 20071029: This is ready now. ]

Order
Tootle
Tootle
with kickback
no kickback

Tootle

My daughter Iris has become interested in the book Tootle, by Gertrude Crampton, which is the third-best-selling hardback children's book of all time. A few years back I wrote some brief literary criticism of Tootle, which I included when I wrote the Wikipedia article about the book. This criticism was quite rightly deleted later on, as uncited original research. It needs a new home, and that home is obviously here.

Periodicity without Fourier Series

Suppose I have tabulated the number of blog posts I made every day for two years. I want to find if there is any discernible periodicity to this data. Do I tend to post in 26-day cycles, for example?

One way to do this is to take the Fourier transform of the data. For various reasons, I don't like this technique, and I'm trying to invent something new. I think I have what I want, although it took several tries to find it. Unfortunately, the blog posting data shows no periodicity whatsoever.

Emacs and auto-mode-alist

The elisp code I've been using for the past fifteen years to set the default mode for Perl editing in Emacs broke last week. My search for a replacement turned up some very bizarre advice on IRC.

Van der Waerden's problem

Also still pending is the rest of my van der Waerden problem series. I have written about four programs so far, and I have two to go.


[Other articles in category /meta] permanent link

Sun, 14 Oct 2007

Van der Waerden's problem: programs 3 and 4
In this series of articles I'm analyzing five versions of a program that I wrote around 1988, and then another program that does the same thing that I wrote last month without referring to the 1988 code. (I said before that it was four versions, but apparently I'm not so good at counting to five.)

If you don't remember what the program does, here's an explanation.

Here is program 1, which was an earlier attempt to do the same thing. Here's program 2.

Program 3

Complete source code for this version.

I said of the previous program:

The problem is all in the implementation. You see, this program actually constructs the entire tree in memory.

Somewhere along the line it dawned on me that constructing the tree was unnecessary, so I took that machinery out, and the result was version 3.

Consequently, this program is easy to explain once you have seen the previous version: almost all I have to do is list the stuff that I took out.

Since this program does not construct a tree of node structures, it omits the definition of the node structure and the macro for manufacturing nodes. Since it gets rid of the node allocation, it also gets rid of the memory leak of the previous version, and so omits the customized memory allocation functions Malloc and Free that performed memory tracking.

The previous program had a compiled-in limit on the number of colors it would handle, because at the time I didn't know how to do a dynamic array. In this program, I got rid of the node structures, so there was no array of node structures, so no need for a limit on the number of node structures in the array. And all the code that enforced the limit is gone.

The apchk function, which checks to see if a string is good, remains unchanged from the previous version.

The makenodes function, which was the principal function in the previous program, remains, but has lost a lot of code. It is simpler to call, too; the node argument is gone:

        makenodes(maxlen,"");
I got rid of the silly !howfar test in favor of a more easily-understood howfar == 0 test. There are lots of times when ! is appropriate, but testing whether a non-negative integer has reached zero is not one of them. I was going to comment earlier about what a novice error this is, and I'm glad to see that I fixed it.

The main use of apchk in the previous program had if (!apchk(...)) { ... }. That was okay, because apchk returns a Boolean result. But the negation is annoying. It suggests that apchk's return value is backward. (Instead of returning true for a bad string, it should return true for a good string.) This is not very much a big deal, and I only brought it up so that I could diffidently confess that these days I would probably have done:

        #define unless(c)       if(!(c))
        ...
        unless (is_bad(...)) {
        }
There are a lot of stories of doofus Pascal programmers who do:

        #define begin {
        #define end }
and Fortran programmers who do:

        #define GT >
        #define GE >=
        #define LT <
        #define LE <=
and I find, to my shame, that I have become one of them. Anyone seeing #define unless(c) if(!(c)) would snort and say "Oh, this was obviously written by a Perl programmer."

But at least I was a C programmer first.

Actually I was a Fortran programmer first. But I was never a big enough doofus to #define GE >=.

The big flaw in the current program is the string argument to makenodes. Each call to makenodes copies this string so that it can append a character to the end. I discussed this at some length in the previous article, so I don't want to make too much of it now; I'll just say that a better technique would have reused the string buffer from call to call. This obviously saves a little memory, and since most of the contents of the string doesn't change, it also saves a lot of time.

This might be worth seeing, since it seems to me now to be a marvel of wasted code:

    ls = strlen(s);
    newarg = STRING(ls + 1);
    if (!newarg) 
      {
      fprintf(stderr,"Couldn't get %d bytes for newarg in makenodes\n",ls+2);
      fprintf(stderr,"Total get was %d.\n",gotten);
      fprintf(stderr,"P\n L\n  O\n   P\n    !\n");
      abort();
      }
    strcpy(newarg,s);
    newarg[ls+1] = '\0';
    newarg[ls] = 'A' + i;
    makenodes(howfar-1,newarg);
    free(newarg);
The repeated strlen, for example, when ls could be calculated as maxlen - howfar. The excessively verbose failure message, which should be inside the STRING macro anyway. (The code that maintains gotten has gone away with the debugging allocation routines, so the second fprintf is superfluous.) And why did I think abort was the right thing to call on an out-of-memory condition?

Oh well, you live and learn.

Program 4

Complete source code for this version.

The fourth version of the program is even more trimmed-down. In this version of the program I did get the idea to reuse the string buffer instead of copying the string on every recursive call. But I also got an even better idea, and eliminated the recursive call. The makenodes function is now down to one argument, which tells it how deep a tree to search.

        void
        makenodes(maxdepth)
        int maxdepth;
        {
        int apchk(), depth = 0;
        char curlet, *curstring = STRING(maxdepth);

        curstring[0] = '\0';
        curlet = 'A';

        while (depth >= 0)
          {
          while (curlet <= 'A' - 1 + colors)
            {
        #ifdef DIAG
            printf("%s makenoding with string %s%c, depth %d.\n",
                TABS+12-depth,curstring,curlet,depth);
        #endif
            if (apchk(curstring,curlet))
              curlet++;
            else
              if (depth < maxdepth)
                {
                curstring[depth] = curlet;
                curstring[depth+1] = '\0';
                depth += 1;
                curlet = 'A';
                }
              else
                {
                printf("%s%c\n",curstring,curlet);
                curlet++;
                }
            }
          depth -= 1;
          curlet = curstring[depth] + 1;
          curstring[depth] = '\0';
          }
        }
This is a better job all around, and not very different from what I wrote last month to do the same thing. I was going to title this series of articles "I have become a better programmer!", and now that I see this version, I'm glad I didn't, because there's no evidence here that I am much better. This version of the program gets a solid A from my older self.

The value depth scans forward in the string when the search is going well, and is decremented again when the search needs to backtrack. If depth == maxdepth, a witness of the desired length has been found, and is printed out.

The curlet ("current letter") variable tracks which branch of the current tree node we are "recursing" down. After the function recurses down, by incrementing depth, curlet is set to 'A' to visit the first sub-node of the new current node. The curstring buffer tracks the path through the tree to the current node. When the function needs to backtrack, it restores the state of curlet from the last character in the buffer and then trims that character off the end of the path.

I'd only want to make two changes to this code. One would be to make depth a pointer into the curstring buffer instead of an index into it. Then again, the compiler may well have optimized it into one anyway. But it would also allow me to eliminate curlet in favor of just using *depth everywhere.

The other change would address a more serious defect: the contents of curstring are kept properly zero-terminated at all times, whenever depth is advanced or retracted. This zero-termination is unnecessary, since curstring is never used as a string except when depth == maxdepth. When printfing curstring, I could have used something like:

        printf("%.*s%c\n",curstring,maxlen,curlet);
which prints exactly maxlen characters from the buffer, regardless of whether it is zero-terminated.

It would, however, have required that I know about %.*s, which I'm sure I did not. Was %.*s even available in 1988? I forget, and my copy of K&R First Edition is in a box somewhere since my recent move. Anyway, if %.*s was unavailable for whatever reason, the code could have had a single curstring[maxdepth] = 0 up front, which would have been quite sufficient for the one printf it needed to do.

Coming next: one very different program to solve the same problem, and a comparison with last month's effort.


[Other articles in category /prog] permanent link

Fri, 12 Oct 2007

The square of the Catalan sequence
Yesterday I went to a talk by Val Tannen about his work on "provenance semirings".

The idea is that when you calculate derived data in a database, such as a view or a selection, you can simultaneously calculate exactly which input tuples contributed to each output tuple's presence in the output. Each input tuple is annotated with an identifier that says who was responsible for putting it there, and the output annotations are polynomials in these identifiers. (The complete paper is here.)

A simple example may make this a bit clearer. Suppose we have the following table R:
R
a a
a b
a c
b c
c e
d e
We'll write R(p, q) when the tuple (p, q) appears in this table. Now consider the join of R with itself. That is, consider the relation S where S(x, z) is true whenever both R(x, y) and R(y, z) are true:

S
a a
a b
a c
a e
b e
Now suppose you discover that the R(a, b) information is untrustworthy. What tuples of S are untrustworthy?

If you annotate the tuples of R with identifiers like this:

R 
a a u
a b v
a c w
b c x
c e y
d e z
then the algorithm in the paper calculates polynomials for the tuples of S like this:
S 
a a u2
a b uv
a c uw + xv
a e wy
b e xy
If you decide that R(a, b) is no good, you assign the value 0 to v, which reduces the S table to:

S 
a a u2
a b 0
a c uw
a e wy
b e xy
So we see that tuple S(a, b) is no good any more, but S(a, c) is still okay, because it can be derived from u and w, which we still trust.

This assignment of polynomials generalizes a lot of earlier work on tuple annotation. For example, suppose each tuple in R is annotated with a probability of being correct. You can propagate the probabilities to S just by substituting the appropriate numbers for the variables in the polynomials. Or suppose each tuple in R might appear multiple times and is annotated with the number of times it appears. Then ditto.

If your queries are recursive, then the polynomials might be infinite. For example, suppose you are calculating the transitive closure T of relation R. This is like the previous example, except that instead of having S(x, z) = R(x, y) and R(y, z), we have T(x, z) = R(x, z) or (T(x, y) and R(y, z)). This is a recursive equation, so we need to do a fixpoint solution for it, using certain well-known techniques. The result in this example is:

T 
a a u+
a b u*v
a c u*(vx+w)
a e u*(vx+w)y
b c x
b e xy
d e z
In such a case there might be an infinite number of paths through R to derive the provenance of a certain tuple of T. In this example, R contains a loop, namely R(a, a), so there are an infinite number of derivations of some of the tuples in T, because you can go around the loop as many times as you like. u+ here is an abbreviation for the infinite polynomial u + u2 + u3 + ...; u* here is an abbreviation for 1 + u+.

1 a
2 (a + b)
3 ((a + b) + c)
(a + (b + c))
4 (((a + b) + c) + d)
((a + (b + c)) + d)
((a + b) + (c + d))
(a + ((b + c) + d))
(a + (b + (c + d)))
5 ((((a + b) + c) + d) + e)
(((a + (b + c)) + d) + e)
(((a + b) + (c + d)) + e)
(((a + b) + c) + (d + e))
((a + ((b + c) + d)) + e)
((a + (b + (c + d))) + e)
((a + (b + c)) + (d + e))
((a + b) + ((c + d) + e))
((a + b) + (c + (d + e)))
(a + (((b + c) + d) + e))
(a + ((b + (c + d)) + e))
(a + ((b + c) + (d + e)))
(a + (b + ((c + d) + e)))
(a + (b + (c + (d + e))))
In one example in the paper, the method produces a recursive relation of the form V = s + V2, which can be solved by the same well-known techniques to come up with an (infinite) polynomial for V, namely V = 1 + s + 2s2 + 5s3 + 14s4 + ... . Mathematicians will recognize the sequence 1, 1, 2, 5, 14, ... as the Catalan numbers, which come up almost as often as the better-known Fibonacci numbers. For example, the Catalan numbers count the number of binary trees with n nodes; they also count the number of ways of parenthesizing an expression with n terms, as shown in the table at right.

Anyway, in his talk, Val referred to the sequence as "bizarre", and I had to jump in to point out that it was not at all bizarre, it was the Catalan numbers, which are just what you would expect from a relation like V = s + V2, blah blah, and he cut me off, because of course he knows all about the Catalan numbers. He only called them bizarre as a rhetorical flourish, meant to echo the presumed puzzlement of the undergraduates in the room.

(I never know how much of what kind of math to expect from computer science professors. Sometimes they know things I don't expect at all, and sometimes they don't know things that I expect everyone to know.

(Once I was discussing the algorithm used by ENIAC for computing square roots with a professor, and the professor told me that at the beginning of the program there was a loop, which accumulated a total, each time accumulating the contents of a register that was incremented by 2 each time through the loop, and he did not know what was going on there. I instantly guessed that what was happening was that the register contained the numbers 1, 3, 5, 7, ..., incremented by 2 each time, and so the accumulator contained the totals 1, 4, 9, 16, 25, ..., and so the loop was calculating an initial estimate of the size of the square root. If you count the number of increment-by-2's, then when the accumulator exceeds the radicand, the count contains the integer part of the root.

(This was indeed what was going on, and the professor seemed to think it was a surprising insight. I am not relating this boastfully, because I truly don't think it was a particularly inspired guess.

(Now that I think about it, maybe the answer here is that computer science professors know more about math than I expect, and less about computation.)

Anyway, I digress, and the whole article up to now was not really what I wanted to discuss anyway. What I wanted to discuss was that when I started blathering about Catalan numbers, Val said that if I knew so much about Catalan numbers, I should calculate the coefficient of the x59 term in V2, which also appeared as one of the annotations in his example.

So that's the puzzle, what is the coefficient of the x59 term in V2, where V = 1 + s + 2s2 + 5s3 + 14s4 + ... ?

After I had thought about this for a couple of minutes, I realized that it was going to be much simpler than it first appeared, for two reasons.

The first thing that occurred to me was that the definition of multiplication of polynomials is that the coefficient of the xn term in the product of A and B is Σaibn-i. When A=B, this reduces to Σaian-i. Now, it just so happens that the Catalan numbers obey the relation cn+1 = &Sigma cicn-i, which is exactly the same form. Since the coefficients of V are the ci, the coefficients of V2 are going to have the form Σcicn-i, which is just the Catalan numbers again, but shifted up by one place.

The next thing I thought was that the Catalan numbers have a pretty simple generating function f(x). This just means that you pretend that the sequence V is a Taylor series, and figure out what function it is the Taylor series of, and use that as a shorthand for the whole series, ignoring all questions of convergence and other such analytic fusspottery. If V is the Taylor series for f(x), then V2 is the Taylor series for f(x)2. And if f has a compact representation, say as sin(x) or something, it might be much easier to square than the original V was. Since I knew in this case that the generating function is simple, this seemed likely to win. In fact the generating function of V is not sin(x) but (1-√(1-4x))/2x. When you square this, you get almost the same thing back, which matches my prediction from the previous paragraph. This would have given me the right answer, but before I actually finished that calculation, I had an "oho" moment.

The generating function is known to satisfy the relation f(x) = 1 + xf(x)2. This relation is where the (1-√(1-4x))/2x thing comes from in the first place; it is the function that satisfies that relation. (You can see this relation prefigured in the equation that Val had, with V = s + V2. There the notation is a bit different, though.) We can just rearrange the terms here, putting the f(x)2 by itself, and get f(x)2 = (f(x)-1)/x.

Now we are pretty much done, because f(x) = V = 1 + x + 2x2 + 5x3 + 14x4 + ... , so f(x)-1 = x + 2x2 + 5x3 + 14x4 + ..., and (f(x)-1)/x = 1 + 2x + 5x2 + 14x3 + ... . Lo and behold, the terms are the Catalan numbers again.

So the answer is that the coefficient of the x59 term is just c(60), calculation of which is left as an exercise for the reader.

I don't know what the point of all that was, but I thought it was fun how the hairy-looking problem seemed likely to be simple when I looked at it a little more carefully, and then how it did turn out to be quite simple.

This blog has had a recurring dialogue between subtle technique and the sawed-off shotgun method, and I often favor the sawed-off shotgun method. Often programmers' big problem is that they are very clever and learned, and so they want to be clever and learned all the time, even when being a knucklehead would work better. But I think this example provides some balance, because it shows a big win for the clever, learned method, which does produce a lot more understanding.

Order
Higher-Order Perl
Higher-Order Perl
with kickback
no kickback
Then again, it really doesn't take long to whip up a program to multiply infinite polynomials. I did it in chapter 6 of Higher-Order Perl, and here it is again in Haskell:

        data Poly a = P [a] deriving Show

        instance (Eq a) => Eq (Poly a) 
                where (P x) == (P y) = (x == y)

        polySum x [] = x
        polySum [] y = y
        polySum (x:xs) (y:ys) = (x+y) : (polySum xs ys)

        polyTimes  [] _ = []
        polyTimes  _ [] = []
        polyTimes  (x:xs) (y:ys) = (x*y) : more
                          where
                            more = (polySum (polySum (map (x *) ys) (map (* y) xs))
                                    (0 : (polyTimes xs ys)))

        instance (Num a) => Num (Poly a) 
          where (P x) + (P y) = P (polySum x y)
                (P x) * (P y) = P (polyTimes x y)


[Other articles in category /math] permanent link

Tue, 09 Oct 2007

Relatively prime polynomials over Z2
Last week Wikipedia was having a discussion on whether the subject of "mathematical quilting" was notable enough to deserve an article. I remembered that there had been a mathematical quilt on the cover of some journal I read last year, and I went to the Penn math library to try to find it again. While I was there, I discovered that the June 2007 issue of Mathematics Magazine had a cover story about the probability that two randomly-selected polynomials over Z2 are relatively prime. ("The Probability of Relatively Prime Polynomials", Arthur T. Benjamin and Curtis D Bennett, page 196).

Polynomials over Z2 are one of my favorite subjects, and the answer to the question turned out to be beautiful. So I thought I'd write about it here.

First, what does it mean for two polynomials to be relatively prime? It's analogous to the corresponding definition for integers. For any numbers a and b, there is always some number d such that both a and b are multiples of d. (d = 1 is always a solution.) The greatest such number is called the greatest common divisor or GCD of a and b. The GCD of two numbers might be 1, or it might be some larger number. If it's 1, we say that the two numbers are relatively prime (to each other). For example, the GCD of 100 and 28 is 4, so 100 and 28 are not relatively prime. But the GCD of 100 and 27 is 1, so 100 and 27 are relatively prime. One can prove theorems like these: If p is prime, then either a is a multiple of p, or a is relatively prime to p, but not both. And the equation ap + bq = 1 has a solution (in integers) if and only if p and q are relatively prime.

The definition for polynomials is just the same. Take two polynomials over some variable x, say p and q. There is some polynomial d such that both p and q are multiples of d; d(x) = 1 is one such. When the only solutions are trivial polynomials like 1, we say that the polynomials are relatively prime. For example, consider x2 + 2x + 1 and x2 - 1. Both are multiples of x+1, so they are not relatively prime. But x2 + 2x + 1 is relatively prime to x2 - 2x + 1. And one can prove theorems that are analogous to the ones that work in the integers. The analog of "prime integer" is "irreducible polynomial". If p is irreducible, then either a is a multiple of p, or a is relatively prime to p, but not both. And the equation a(x)p(x) + b(x)q(x) = 1 has a solution for polynomials a and b if and only if p and q are relatively prime.

One uses Euclid's algorithm to calculate the GCD of two integers. Euclid's algorithm is simple: To calculate the GCD of a and b, just subtract the smaller from the larger, repeatedly, until one of the numbers becomes 0. Then the other is the GCD. One can use an entirely analogous algorithm to calculate the GCD of two polynomials. Two polynomials are relatively prime just when their GCD, as calculated by Euclid's algorithm, has degree 0.

Anyway, that was more introduction than I wanted to give. The article in Mathematics Magazine concerned polynomials over Z2, which means that the coefficients are in the field Z2, which is just like the regular integers, except that 1+1=0. As I explained in the earlier article, this implies that a=-a for all a, so there are no negatives and subtraction is the same as addition. I like this field a lot, because subtraction blows. Do you have trouble because you're always dropping minus signs here and there? You'll like Z2; there are no minus signs.

Here is a table that shows which pairs of polynomials over Z2 are relatively prime. If you read this blog through some crappy aggregator, you are really missing out, because the table is awesome, and you can't see it properly. Check out the real thing.

 a0a1a2a3a4a5a6a7a8a9b0b1b2b3b4b5b6b7b8b9c0c1c2c3c4c5c6c7c8c9d0d1
0   [a0]                                                                
1   [a1]                                                                
x   [a2]                                                                
x + 1   [a3]                                                                
x2   [a4]                                                                
x2 + 1   [a5]                                                                
x2 + x   [a6]                                                                
x2 + x + 1   [a7]                                                                
x3   [a8]                                                                
x3 + 1   [a9]                                                                
x3 + x   [b0]                                                                
x3 + x + 1   [b1]                                                                
x3 + x2   [b2]                                                                
x3 + x2 + 1   [b3]                                                                
x3 + x2 + x   [b4]                                                                
x3 + x2 + x + 1   [b5]                                                                
x4   [b6]                                                                
x4 + 1   [b7]                                                                
x4 + x   [b8]                                                                
x4 + x + 1   [b9]                                                                
x4 + x2   [c0]                                                                
x4 + x2 + 1   [c1]                                                                
x4 + x2 + x   [c2]                                                                
x4 + x2 + x + 1   [c3]                                                                
x4 + x3   [c4]                                                                
x4 + x3 + 1   [c5]                                                                
x4 + x3 + x   [c6]                                                                
x4 + x3 + x + 1   [c7]                                                                
x4 + x3 + x2   [c8]                                                                
x4 + x3 + x2 + 1   [c9]                                                                
x4 + x3 + x2 + x   [d0]                                                                
x4 + x3 + x2 + x + 1   [d1]