The Universe of Discourse

Mon, 29 Oct 2007

Undefined behavior in Perl and other languages
Miles Gould wrote what I thought was an interesting article on implementation-defined languages, and cited Perl as an example. One of his points was that a language that is defined by its implementation, as Perl is, rather than by a standards document, cannot have any "undefined behavior".

## Undefined behavior

For people unfamiliar with this concept, I should explain briefly. The C standard is full of places that say "if the program contains x, the behavior is undefined", which really means "C programs do not contain x, so if the program contains x, it is not written in C, and, as this standard only defines the meaning of programs in C, it has nothing to say about the meaning of your program." There are a couple of hundred of these phrases, and a larger number of places where it is implied.

For example, everyone knows what it means when you write x = 4;, but what does it mean if you write 4 = x;? According to clause 6.3.2.1[#1], it means nothing, and this is not a C program. The non-guarantee in this case is extremely strong. The C compiler, upon encountering this locution, is allowed to abort and spontaneously erase all your files, and in doing so it is not violating the requirements of the standard, because the standard does not require any particular behavior in this case.

The memorable phrase that the comp.lang.c folks use is that using that construction might cause demons to fly out of your nose.

[ Addendum 20071030: I am informed that I misread the standard here, and that the behavior of this particular line is not undefined, but requires a compiler diagnostic. Perhaps a better example would have been x = *(char *)0. ]

I mentioned this in passing in one of my recent articles about a C program I wrote:

        unsigned strinc(char *s)
        {
          char *p = strchr(s, '\0') - 1;
          while (p >= s && *p == 'A' + colors - 1) *p-- = 'A';
          if (p < s) return 0;
          (*p)++;
          return 1;
        }

Here the pointer p starts at the end of the string s, and the loop might stop when p points to the position just before s. Except no, that is forbidden, and the program might at that moment cause demons to fly out of your nose. You are allowed to have a pointer that points to the position just after an object, but not one that points just before.
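One way to sidestep the problem is to count positions with an index instead of walking a pointer off the front of the string. The following is only a sketch of that idea, not the code from the original program: the name strinc_indexed is made up for illustration, and colors is passed as a parameter here on the assumption that the original reads it from a global.

```c
#include <string.h>

/* Odometer-style increment over the letters 'A' .. 'A' + colors - 1.
   Returns 1 on success, 0 if every position rolled over.
   `colors` is a parameter here (an assumption; the original reads a global). */
unsigned strinc_indexed(char *s, int colors)
{
    size_t i = strlen(s);                 /* index one past the last character */
    char last = (char)('A' + colors - 1); /* highest letter in use */

    while (i > 0 && s[i - 1] == last) {
        s[i - 1] = 'A';                   /* roll over and carry to the left */
        i--;
    }
    if (i == 0)
        return 0;                         /* carried off the front of the string */
    s[i - 1]++;
    return 1;
}
```

The index i never goes below zero, so no pointer to the position before s is ever formed, and the empty string is handled without any special case.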

Well anyway, I seem to have digressed. My point was that M. Gould says that one advantage of languages like Perl that are defined wholly by their (one) implementation is that you never have "undefined behavior". If you want to know what some locution does, you type it in and see what it does. Poof, instant definition.

Although I think this is a sound point, it occurred to me that that is not entirely correct. The manual is a specification of sorts, and even if the implementation does X in situation Y, the manual might say "The implementation does X in situation Y, but this is unsupported and may change without warning in the future." Then what you have is not so different from Y being undefined behavior. The manual is (presumably) a statement of official policy from the maintainers, and, as a communiqué from the people with the ultimate authority to define the future meaning of the language, it has some of the same status that a formal specification would.

## Perl: the static variable hack

Such disclaimers do appear in the Perl documentation. Probably the most significant example of this is the static variable hack. For various implementation reasons, the locution my $static if 0 has a strange and interesting effect:

        sub foo {
          my $static = 42 if 0;
          print "static is now $static\n";
          $static++;
        }

        foo() for 1..5;

This makes $static behave as a "static" variable, and persist from call to call of foo(). Without the ... if 0, the code would print "static is now 42" five times. But with ... if 0, it prints:

        static is now
        static is now 1
        static is now 2
        static is now 3
        static is now 4

This was never an intentional feature. It arose accidentally, and then people discovered it and started using it. Since the behavior was the result of a strange quirk of the implementation, caused by the surprising interaction of several internal details, it was officially decided by the support group that this behavior would not be supported in future versions. The manual was amended to say that this behavior was explicitly undefined, and might change in the future. It can be used in one-off programs, but not in any important program, one that might have a long life and need to be run under several different versions of Perl.

Programs that use pointers that point outside the bounds of allocated storage in C are in a similar position. It might work on today's system, with today's compiler, today, but you can't do that in any larger context.

Having the "undefined behavior" be determined by the manual, instead of by a language standard, has its drawbacks. The language standard is fretted over by experts for months. When the C standard says that behavior is undefined, it is because someone like Clive Feather or Doug Gwyn or P.J. Plauger, someone who knows more about C than you ever will, knows that there is some machine somewhere on which the behavior is unsupported and unsupportable. When the Perl manual says that some behavior is undefined, you might be hearing from the Perl equivalent of Doug Gwyn, someone like Nick Clark or Chip Salzenberg or Gurusamy Sarathy. Or you might be hearing from a mere nervous-nellie who got their patch into the manual on a night when the release manager had stayed up too late.

## Perl: modifying a hash in a loop

Here is an example of this that has bothered me for a long time. One can use the each() operator to loop lazily over the contents of a hash:

        while (my $key = each %hash) {
          # do something with $key and $hash{$key}
        }

What happens if you modify the hash in the middle of the loop? For various implementation reasons, the manual forbids this.

For example, suppose the loop code adds a new key to the hash. The hash might overflow as a result, and this would trigger a reorganization that would move everything around, destroying the ordering information. The subsequent calls to each() would continue from the same element of the hash, but in the new order, making it likely that the loop would visit some keys more than once, or some not at all. So the prohibition in that case makes sense: The each() operator normally guarantees to produce each key exactly once, and adding elements to a hash in the middle of the loop might cause that guarantee to be broken in an unpredictable way. Moreover, there is no obvious way to fix this without potentially wrecking the performance of hashes.

But the manual also forbids deleting keys inside the loop, and there the issue does not come up, because in Perl, hashes are never reorganized as the result of a deletion. The behavior is easily described: Deleting a key that has already been visited will not affect the each() loop, and deleting one that has not yet been visited will just cause it to be skipped when the time comes.

Some people might find this general case confusing, I suppose. But the following code also runs afoul of the "do not modify a hash inside of an each loop" prohibition, and I don't think anyone would find it confusing:

        while (my $key = each %hash) {
          delete $hash{$key} if is_bad($hash{$key});
        }

Here we want to delete all the bad items from the hash. We do this by scanning the hash and deleting the current item whenever it is bad. Since each key is deleted only after it is scanned by each, we should expect this to visit every key in the hash, as indeed it does. And this appears to be a useful thing to write. The only alternative is to make two passes, constructing a list of bad keys on the first pass, and deleting them on the second pass. The code would be more complicated and the time and memory performance would be much worse.

There is a potential implementation problem, though. The way that each() works is to take the current item and follow a "next" pointer from it to find the next item. (I am omitting some unimportant details here.) But if we have deleted the current item, the implementation cannot follow the "next" pointer. So what happens?

In fact, the implementation has always contained a bunch of code, written by Larry Wall, to ensure that deleting the current key will work properly, and that it will not spoil the each(). This is nontrivial. When you delete an item, the delete() operator looks to see if it is the current item of an each() loop, and if so, it marks the item with a special flag instead of deleting it. Later on, the next time each() is invoked, it sees the flag and deletes the item after following the "next" pointer.
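The flag-and-defer trick is easy to model in miniature. What follows is a toy sketch in C, not Perl's actual hash code: a single linked chain stands in for the whole hash, and every name in it (struct entry, struct table, table_each, table_delete, the doomed flag) is invented for illustration.

```c
#include <stdlib.h>
#include <string.h>

struct entry {
    const char   *key;
    int           value;
    int           doomed;   /* deletion requested while this was the current item */
    struct entry *next;
};

struct table {
    struct entry *head;     /* one chain standing in for the real bucket array */
    struct entry *cur;      /* item most recently returned by table_each() */
};

/* Really remove e from the chain and release it. */
static void unlink_entry(struct table *t, struct entry *e)
{
    struct entry **pp = &t->head;
    while (*pp && *pp != e)
        pp = &(*pp)->next;
    if (*pp) {
        *pp = e->next;
        free(e);
    }
}

/* Add a key at the front of the chain; returns 0 if out of memory. */
int table_insert(struct table *t, const char *key, int value)
{
    struct entry *e = malloc(sizeof *e);
    if (!e)
        return 0;
    e->key = key;
    e->value = value;
    e->doomed = 0;
    e->next = t->head;
    t->head = e;
    return 1;
}

/* Return the next item, or NULL at the end (which also resets the iterator). */
struct entry *table_each(struct table *t)
{
    struct entry *prev = t->cur;
    struct entry *next = prev ? prev->next : t->head;
    if (prev && prev->doomed)
        unlink_entry(t, prev);   /* the postponed deletion happens here */
    t->cur = next;
    return next;
}

/* Delete a key; if it is the iterator's current item, only flag it. */
void table_delete(struct table *t, const char *key)
{
    struct entry *e = t->head;
    while (e && strcmp(e->key, key) != 0)
        e = e->next;
    if (!e)
        return;
    if (e == t->cur)
        e->doomed = 1;
    else
        unlink_entry(t, e);
}

/* Count the items that are really present. */
int table_count(const struct table *t)
{
    int n = 0;
    for (const struct entry *e = t->head; e; e = e->next)
        if (!e->doomed)
            n++;
    return n;
}
```

A loop that calls table_each() and deletes the item it was just handed visits every item exactly once, because the "next" pointer of the flagged item is still intact when the iterator comes back for it.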

So the implementation takes some pains to make this work. But someone came along later and forbade all modifications of a hash inside an each loop, throwing the baby out with the bathwater. Larry and perl paid a price for this feature, in performance and memory and code size, and I think it was a feature well bought. But then someone patched the manual and spoiled the value of the feature. (Some years later, I patched the manual again to add an exception for this case. Score!)

## Perl: modifying an array in a loop

Another example is the question of what happens when you modify an array inside a loop over the array, as with:

        @a = (1..3);
        for (@a) {
          print;
          push @a, $_ + 3 if $_ % 2 == 1;
        }

(This prints 12346.) The internals are simple, and the semantics are well-defined by the implementation, and straightforward, but the manual has the heebie-jeebies about it, and most of the Perl community is extremely superstitious about this, claiming that it is "entirely unpredictable". I would like to support this with a quotation from the manual, but I can't find it in the enormous and disorganized mass that is the Perl documentation.

[ Addendum: Tom Boutell found it. The perlsyn page says "If any part of LIST is an array, foreach will get very confused if you add or remove elements within the loop body, for example with splice. So don't do that." ]

The behavior, for the record, is quite straightforward: On the first iteration, the loop processes the first element in the array. On the second iteration, the loop processes the second element in the array, whatever that element is at the time the second iteration starts, whether or not that was the second element before. On the third iteration, the loop processes the third element in the array, whatever it is at that moment. And so the loop continues, terminating the first time it is called upon to process an element that is past the end of the array. We might imagine the following pseudocode:

        index = 0;
        while (index < array.length()) {
          process element array[index];
          index += 1;
        }

There is nothing subtle or difficult about this, and claims that the behavior is "entirely unpredictable" are probably superstitious confessions of ignorance and fear.

Let's try to predict the "entirely unpredictable" behavior of the example above:

        @a = (1..3);
        for (@a) {
          print;
          push @a, $_ + 3 if $_ % 2 == 1;
        }

Initially the array contains (1, 2, 3), and so the first iteration processes the first element, which is 1. This prints 1, and, since 1 is odd, pushes 4 onto the end of the array.

The array now contains (1, 2, 3, 4), and the loop processes the second element, which is 2. 2 is printed. The loop then processes the third element, printing 3 and pushing 6 onto the end. The array now contains (1, 2, 3, 4, 6).

On the fourth iteration, the fourth element (4) is printed, and on the fifth iteration, the fifth element (6) is printed. That is the last element, so the loop is finished. What was so hard about that?
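The same walk can be replayed mechanically. Here is a rough C model of the index-based loop described above, under a few assumptions of my own: the array has a fixed capacity cap, "printing" is modeled by recording each element in a second array seen, and the name loop_and_push is made up. It models only the iteration rule, not anything about Perl's internals.

```c
#include <stddef.h>

/* Walk a[0 .. *len - 1] by index, re-reading the length on every pass.
   Each visited element is recorded in seen[]; an odd element pushes
   (element + 3) onto the end, as in the Perl example.
   cap is the capacity of both arrays. Returns the number of elements visited. */
size_t loop_and_push(int *a, size_t *len, size_t cap, int *seen)
{
    size_t index = 0;
    while (index < *len) {              /* length is re-read on every pass */
        int x = a[index];
        seen[index] = x;                /* "print" the element */
        if (x % 2 == 1 && *len < cap)   /* odd: push x + 3, if there is room */
            a[(*len)++] = x + 3;
        index += 1;
    }
    return index;
}
```

Starting from the array (1, 2, 3) with room to grow, this visits 1, 2, 3, 4, 6 in that order and leaves the array with five elements, which agrees with the prediction above.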

My blog was recently inserted into the feed for planet.haskell.org, and of course I immediately started my first streak of posting code-heavy articles about C and Perl. This is distressing not just because the articles were off-topic for Planet Haskell—I wouldn't give the matter two thoughts if I were posting my usual mix of abstract math and stuff—but it's so off-topic that it feels weird to see it sitting there on the front page of Planet Haskell. So I thought I'd make an effort to talk about Haskell, as a friendly attempt to promote good relations between tribes. I'm not sure what tribe I'm in, actually, but what the heck. I thought about Haskell a bit, and a Haskell example came to mind.

Here is a definition of the factorial function in Haskell:

        fact 0 = 1
        fact n = n * fact (n-1)

I don't need to explain this to anyone, right?

Okay, now here is another definition:

        fact 0     = 1
        fact (n+1) = (n+1) * fact n

Also fine, and indeed this is legal Haskell. The pattern n+1 is allowed to match an integer that is at least 1, say 7, and doing so binds n to the value 6. This is by a rather peculiar special case in the specification of Haskell's pattern-matcher. (It is section 3.17.2#8 of Haskell 98 Language and Libraries: The Revised Report, should you want to look it up.) This peculiar special case is known sometimes as a "successor pattern" but more often as an "n+k pattern".

The spec explicitly deprecates this feature:

Many people feel that n+k patterns should not be used. These patterns may be removed or changed in future versions of Haskell.

(Page 33.) One wonders why they put it in at all, if they were going to go ahead and tell you not to use it. The Haskell committee is usually smarter than this.

I have a vague recollection that there was an argument between people who wanted to use Haskell as a language for teaching undergraduate programming, and those who didn't care about that, and that this was the compromise result. Like many compromises, it is inferior to both of the alternatives that it interpolates between. Putting the feature in complicates the syntax and the semantics of the language, disrupts its conceptual purity, and bloats the spec—see the Perlesque yikkity-yak on pages 57–58 about how x + 1 = ... binds a meaning to +, but (x + 1) = ... binds a meaning to x. Such complication is worth while only if there is a corresponding payoff in terms of increased functionality and usability in the language. In this case, the payoff is a feature that can only be used in one-off programs. Serious programs must avoid it, since the patterns "may be removed or changed in future versions of Haskell". The Haskell committee purchased this feature at a certain cost, and it is debatable whether they got their money's worth. I'm not sure which side of that issue I fall on. But having purchased the feature, the committee then threw it in the garbage, squandering their sunk costs. Oh well. Not even the Haskell committee is perfect.

I think it might be worth pointing out that the version of the program with the n+k pattern is technically superior to the other version. Given a negative integer argument, the first version recurses forever, possibly taking a long time to fail and perhaps taking out the rest of the system on which it is running. But the n+k version fails immediately, because the n+1 pattern will only match an integer that is at least 1.

## XML screws up

The "nasal demons" of the C standard are a joke, but a serious one. The C standard defines what C compilers must do when presented with C programs; it does not define what they do when presented with other inputs, nor what other software does when presented with C programs. The authors of the C standard clearly understood the standard's role in the world.

Earlier versions of the XML standard were less clear. There was a particularly laughable clause in the first edition of the XML 1.0 standard:

XML documents may, and should, begin with an XML declaration which specifies the version of XML being used. For example, the following is a complete XML document, well-formed but not valid:

<?xml version="1.0"?>
<greeting>Hello, world!</greeting>

...

The version number "1.0" should be used to indicate conformance to this version of this specification; *it is an error for a document to use the value "1.0" if it does not conform to this version of this specification.*

(Emphasis is mine.) The XML 1.0 spec is just a document. It has no power, except to declare that certain files are XML 1.0 and certain files are not. A file that complies with the requirements of the spec is XML 1.0; all other files are not XML 1.0. But in the emphasized clause, the spec says that certain behavior "is an error" if it is exhibited by documents that do not conform to the spec. That is, it is declaring certain non-XML-1.0 documents "erroneous". But within the meaning of the spec, "erroneous" simply means that the documents are not XML 1.0. So the clause is completely redundant. Documents that do not conform to the spec are erroneous by definition, whether or not they use the value "1.0".

It's as if the Catholic Church issued an edict forbidding all rabbis from wearing cassocks, on pain of excommunication.

I am happy to discover that this dumb error has been removed from the most recent edition of the XML 1.0 spec.

Sat, 27 Oct 2007

Where's that blog?
I haven't posted in a couple of weeks, and I was wondering why. So I took a look at the test version of the blog, which displays all the unpublished articles as well as the published ones, and the reason was obvious: In the past ten days I've written seven articles that are unfinished or that didn't work. Usually only about a third of my articles flop; this month a whole bunch flopped in a row. What can I say? Sometimes the muse delivers, and sometimes she doesn't.

I said a while back that I would try to publish more regularly, and not wait until every article was perfect. But I don't want to publish the unfinished articles yet. So I thought instead I'd publish a short summary of what I've been thinking about lately.

I hope to get at least one or two of these done by the end of the month.

### Simplified Poker

I recently played a computer poker game that uses a 24-card deck, with only the nine through ace of each suit. This changes the game drastically. For example, a flush is less likely than a four of a kind. (The game uses the standard hand rankings anyway.) It is very easy to compute optimal strategies for this game, because there are so few possible hands (42,504) that you can brute-force all the calculations with a computer.
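The hand counts are small enough to check directly. This is a quick sketch of the arithmetic, with helper names of my own choosing; the flush count here includes straight flushes.

```c
/* Binomial coefficient C(n, k). Each intermediate value is itself a
   binomial coefficient, so the division is always exact. */
unsigned long choose(unsigned n, unsigned k)
{
    unsigned long result = 1;
    if (k > n)
        return 0;
    for (unsigned i = 1; i <= k; i++)
        result = result * (n - k + i) / i;
    return result;
}

/* Five-card hands from the 24-card deck: six ranks in each of four suits. */
unsigned long total_hands(void) { return choose(24, 5); }

/* Five cards of one suit: pick the suit, then 5 of its 6 cards. */
unsigned long flush_hands(void) { return 4 * choose(6, 5); }

/* Four of a kind: pick the rank, take all four, then any of the other 20 cards. */
unsigned long quads_hands(void) { return 6 * choose(4, 4) * 20; }
```

With these definitions total_hands() is 42,504, flush_hands() is 24, and quads_hands() is 120, so in this deck a flush really is the rarer hand.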

This got me thinking again of something I started writing up last year and never finished: The game of "Simplified Poker", which was an attempt to do for Poker what the λ-calculus does for computation: the simplest possible model that nevertheless captures all the essential features of the original. Simplified Poker is played with an infinite deck in which half the cards are kings and half are jacks. Each hand contains only two cards. Nevertheless, bluffing is still possible.


### The Annoying Boxes Puzzle

This is a logic puzzle in which you deduce which box contains the treasure, but with a twist. I thought it up many years ago, and then in the course of trying to write up an explanation about five years ago, I consulted Raymond Smullyan's book What is the Name of This Book? in order to get a citation to prove a certain fact about the form that such puzzles usually take. In doing so, I discovered that Smullyan actually presented the annoying boxes puzzle (in slightly different form) in that book!

It's primarily waiting for me to take a photograph to accompany the puzzle.

### Undefined behavior

I have a pretty interesting article on the concept of "undefined behavior", which is a big deal in the C world, but which means something rather different, and is much less important, in Perl.


### Tootle

My daughter Iris has become interested in the book Tootle, by Gertrude Crampton, which is the third-best-selling hardback children's book of all time. A few years back I wrote some brief literary criticism of Tootle, which I included when I wrote the Wikipedia article about the book. This criticism was quite rightly deleted later on, as uncited original research. It needs a new home, and that home is obviously here.

### Periodicity without Fourier Series

Suppose I have tabulated the number of blog posts I made every day for two years. I want to find if there is any discernible periodicity to this data. Do I tend to post in 26-day cycles, for example?

One way to do this is to take the Fourier transform of the data. For various reasons, I don't like this technique, and I'm trying to invent something new. I think I have what I want, although it took several tries to find it. Unfortunately, the blog posting data shows no periodicity whatsoever.

### Emacs and auto-mode-alist

The elisp code I've been using for the past fifteen years to set the default mode for Perl editing in Emacs broke last week. My search for a replacement turned up some very bizarre advice on IRC.

### Van der Waerden's problem

Also still pending is the rest of my van der Waerden problem series. I have written about four programs so far, and I have two to go.

Mon, 15 Oct 2007

Van der Waerden's problem: programs 3 and 4
In this series of articles I'm analyzing five versions of a program that I wrote around 1988, and then another program that does the same thing that I wrote last month without referring to the 1988 code. (I said before that it was four versions, but apparently I'm not so good at counting to five.)

If you don't remember what the program does, here's an explanation.

Here is program 1, which was an earlier attempt to do the same thing. Here's program 2.

## Program 3

Complete source code for this version.

I said of the previous program:

The problem is all in the implementation. You see, this program actually constructs the entire tree in memory.

Somewhere along the line it dawned on me that constructing the tree was unnecessary, so I took that machinery out, and the result was version 3.

Consequently, this program is easy to explain once you have seen the previous version: almost all I have to do is list the stuff that I took out.

Since this program does not construct a tree of node structures, it omits the definition of the node structure and the macro for manufacturing nodes. Since it gets rid of the node allocation, it also gets rid of the memory leak of the previous version, and so omits the customized memory allocation functions Malloc and Free that performed memory tracking.

The previous program had a compiled-in limit on the number of colors it would handle, because at the time I didn't know how to do a dynamic array. In this program, I got rid of the node structures, so there was no array of node structures, so no need for a limit on the number of node structures in the array. And all the code that enforced the limit is gone.

The apchk function, which checks to see if a string is good, remains unchanged from the previous version.

The makenodes function, which was the principal function in the previous program, remains, but has lost a lot of code. It is simpler to call, too; the node argument is gone:

        makenodes(maxlen,"");

I got rid of the silly !howfar test in favor of a more easily-understood howfar == 0 test. There are lots of times when ! is appropriate, but testing whether a non-negative integer has reached zero is not one of them. I was going to comment earlier about what a novice error this is, and I'm glad to see that I fixed it.

The main use of apchk in the previous program had if (!apchk(...)) { ... }. That was okay, because apchk returns a Boolean result. But the negation is annoying. It suggests that apchk's return value is backward. (Instead of returning true for a bad string, it should return true for a good string.) This is not a very big deal, and I only brought it up so that I could diffidently confess that these days I would probably have done:

        #define unless(c)       if(!(c))
        ...
        }

There are a lot of stories of doofus Pascal programmers who do:

        #define begin {
        #define end }

and Fortran programmers who do:

        #define GT >
        #define GE >=
        #define LT <
        #define LE <=

and I find, to my shame, that I have become one of them. Anyone seeing #define unless(c) if(!(c)) would snort and say "Oh, this was obviously written by a Perl programmer."

But at least I was a C programmer first.

Actually I was a Fortran programmer first. But I was never a big enough doofus to #define GE >=.

The big flaw in the current program is the string argument to makenodes. Each call to makenodes copies this string so that it can append a character to the end. I discussed this at some length in the previous article, so I don't want to make too much of it now; I'll just say that a better technique would have reused the string buffer from call to call. This obviously saves a little memory, and since most of the contents of the string doesn't change, it also saves a lot of time.

This might be worth seeing, since it seems to me now to be a marvel of wasted code:

        ls = strlen(s);
        newarg = STRING(ls + 1);
        if (!newarg)
        {
          fprintf(stderr,"Couldn't get %d bytes for newarg in makenodes\n",ls+2);
          fprintf(stderr,"Total get was %d.\n",gotten);
          fprintf(stderr,"P\n L\n  O\n   P\n    !\n");
          abort();
        }
        strcpy(newarg,s);
        newarg[ls+1] = '\0';
        newarg[ls] = 'A' + i;
        makenodes(howfar-1,newarg);
        free(newarg);

The repeated strlen, for example, when ls could be calculated as maxlen - howfar. The excessively verbose failure message, which should be inside the STRING macro anyway. (The code that maintains gotten has gone away with the debugging allocation routines, so the second fprintf is superfluous.) And why did I think abort was the right thing to call on an out-of-memory condition?

Oh well, you live and learn.
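For comparison, the buffer-reuse technique mentioned above might look something like the following. This is a sketch only: it enumerates every string of the given length and omits the apchk pruning, and the names enumerate, count_strings and visited are hypothetical, as is passing colors as a parameter.

```c
#include <stddef.h>

static unsigned long visited;   /* hypothetical counter of complete strings */

/* Fill buf[depth .. maxlen - 1] in every possible way. Each level of the
   recursion overwrites one character of the shared buffer in place, so
   nothing is copied or allocated per call. */
static void enumerate(char *buf, int depth, int maxlen, int colors)
{
    if (depth == maxlen) {
        visited++;              /* buf now holds one complete string */
        return;
    }
    for (int i = 0; i < colors; i++) {
        buf[depth] = (char)('A' + i);
        enumerate(buf, depth + 1, maxlen, colors);
    }
}

/* Count the strings of length maxlen over `colors` letters.
   Returns 0 if maxlen does not fit in the fixed-size buffer. */
unsigned long count_strings(int maxlen, int colors)
{
    char buf[64];
    if (maxlen < 0 || maxlen >= (int)sizeof buf)
        return 0;
    buf[maxlen] = '\0';         /* terminate once, up front */
    visited = 0;
    enumerate(buf, 0, maxlen, colors);
    return visited;
}
```

The prefix shared by sibling calls is written once and left alone, which is where the time saving comes from.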

## Program 4

Complete source code for this version.

The fourth version of the program is even more trimmed-down. In this version of the program I did get the idea to reuse the string buffer instead of copying the string on every recursive call. But I also got an even better idea, and eliminated the recursive call. The makenodes function is now down to one argument, which tells it how deep a tree to search.

        void
        makenodes(maxdepth)
        int maxdepth;
        {
          int apchk(), depth = 0;
          char curlet, *curstring = STRING(maxdepth);

          curstring[0] = '\0';
          curlet = 'A';

          while (depth >= 0)
          {
            while (curlet <= 'A' - 1 + colors)
            {
        #ifdef DIAG
              printf("%s makenoding with string %s%c, depth %d.\n",
                     TABS+12-depth,curstring,curlet,depth);
        #endif
              if (apchk(curstring,curlet))
                curlet++;
              else
                if (depth < maxdepth)
                {
                  curstring[depth] = curlet;
                  curstring[depth+1] = '\0';
                  depth += 1;
                  curlet = 'A';
                }
                else
                {
                  printf("%s%c\n",curstring,curlet);
                  curlet++;
                }
            }
            depth -= 1;
            curlet = curstring[depth] + 1;
            curstring[depth] = '\0';
          }
        }

This is a better job all around, and not very different from what I wrote last month to do the same thing. I was going to title this series of articles "I have become a better programmer!", and now that I see this version, I'm glad I didn't, because there's no evidence here that I am much better. This version of the program gets a solid A from my older self.

The value depth scans forward in the string when the search is going well, and is decremented again when the search needs to backtrack. If depth == maxdepth, a witness of the desired length has been found, and is printed out.

The curlet ("current letter") variable tracks which branch of the current tree node we are "recursing" down. After the function recurses down, by incrementing depth, curlet is set to 'A' to visit the first sub-node of the new current node. The curstring buffer tracks the path through the tree to the current node. When the function needs to backtrack, it restores the state of curlet from the last character in the buffer and then trims that character off the end of the path.

I'd only want to make two changes to this code. One would be to make depth a pointer into the curstring buffer instead of an index into it. Then again, the compiler may well have optimized it into one anyway. But it would also allow me to eliminate curlet in favor of just using *depth everywhere.

The other change would address a more serious defect: the contents of curstring are kept properly zero-terminated at all times, whenever depth is advanced or retracted. This zero-termination is unnecessary, since curstring is never used as a string except when depth == maxdepth. When printfing curstring, I could have used something like:

        printf("%.*s%c\n",maxlen,curstring,curlet);

which prints exactly maxlen characters from the buffer, regardless of whether it is zero-terminated.

It would, however, have required that I know about %.*s, which I'm sure I did not. Was %.*s even available in 1988? I forget, and my copy of K&R First Edition is in a box somewhere since my recent move. Anyway, if %.*s was unavailable for whatever reason, the code could have had a single curstring[maxdepth] = 0 up front, which would have been quite sufficient for the one printf it needed to do.
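To see that %.*s really does not need the terminator, here is a small self-contained sketch. The wrapper name format_witness is made up, and it uses snprintf into a caller-supplied buffer so that the result can be inspected; the point is only the order of the arguments and the bounded read.

```c
#include <stdio.h>

/* Format the first `len` bytes of `buf`, followed by the character `last`,
   into `out`. `buf` need not be zero-terminated as long as it holds at
   least `len` bytes. Returns what snprintf returns. */
int format_witness(char *out, size_t outsize,
                   const char *buf, int len, char last)
{
    /* The precision argument (an int) comes before the string it limits. */
    return snprintf(out, outsize, "%.*s%c", len, buf, last);
}
```

Called with a four-byte buffer holding A, B, A, B and no terminator, a length of 3 and a final letter C, this produces the string ABAC and never looks at the fourth byte or beyond.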

Coming next: one very different program to solve the same problem, and a comparison with last month's effort.

Fri, 12 Oct 2007

The square of the Catalan sequence
Yesterday I went to a talk by Val Tannen about his work on "provenance semirings".

The idea is that when you calculate derived data in a database, such as a view or a selection, you can simultaneously calculate exactly which input tuples contributed to each output tuple's presence in the output. Each input tuple is annotated with an identifier that says who was responsible for putting it there, and the output annotations are polynomials in these identifiers. (The complete paper is here.)

A simple example may make this a bit clearer. Suppose we have the following table R:

        R
        a   a
        a   b
        a   c
        b   c
        c   e
        d   e

We'll write R(p, q) when the tuple (p, q) appears in this table. Now consider the join of R with itself. That is, consider the relation S where S(x, z) is true whenever both R(x, y) and R(y, z) are true:

        S
        a   a
        a   b
        a   c
        a   e
        b   e

Now suppose you discover that the R(a, b) information is untrustworthy. What tuples of S are untrustworthy?

If you annotate the tuples of R with identifiers like this:

        R
        a   a   u
        a   b   v
        a   c   w
        b   c   x
        c   e   y
        d   e   z

then the algorithm in the paper calculates polynomials for the tuples of S like this:

        S
        a   a   u²
        a   b   uv
        a   c   uw + xv
        a   e   wy
        b   e   xy

If you decide that R(a, b) is no good, you assign the value 0 to v, which reduces the S table to:

        S
        a   a   u²
        a   b   0
        a   c   uw
        a   e   wy
        b   e   xy

So we see that tuple S(a, b) is no good any more, but S(a, c) is still okay, because it can be derived from u and w, which we still trust.

This assignment of polynomials generalizes a lot of earlier work on tuple annotation. For example, suppose each tuple in R is annotated with a probability of being correct. You can propagate the probabilities to S just by substituting the appropriate numbers for the variables in the polynomials. Or suppose each tuple in R might appear multiple times and is annotated with the number of times it appears. Then ditto.
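Substituting numbers for the variables can be tried out directly on the example. The sketch below does not manipulate polynomials symbolically; it assumes the annotations have already been replaced by integers (multiplicities, or 0 for a distrusted tuple) and combines them with ordinary + and *. The names NODES, join_self and example_r are invented for this illustration, and the nodes a through e are represented by the indices 0 through 4.

```c
#define NODES 5   /* a, b, c, d, e as indices 0..4 */

/* s[x][z] = sum over y of r[x][y] * r[y][z]: the join of R with itself,
   with annotations combined by ordinary + and *. */
void join_self(long r[NODES][NODES], long s[NODES][NODES])
{
    for (int x = 0; x < NODES; x++)
        for (int z = 0; z < NODES; z++) {
            s[x][z] = 0;
            for (int y = 0; y < NODES; y++)
                s[x][z] += r[x][y] * r[y][z];
        }
}

/* Fill r with the example table, given values for u, v, w, x, y, z. */
void example_r(long r[NODES][NODES],
               long u, long v, long w, long x, long y, long z)
{
    for (int i = 0; i < NODES; i++)
        for (int j = 0; j < NODES; j++)
            r[i][j] = 0;
    r[0][0] = u;   /* R(a, a) */
    r[0][1] = v;   /* R(a, b) */
    r[0][2] = w;   /* R(a, c) */
    r[1][2] = x;   /* R(b, c) */
    r[2][4] = y;   /* R(c, e) */
    r[3][4] = z;   /* R(d, e) */
}
```

With every annotation set to 1 except v = 0, the entry for S(a, b) comes out 0 while S(a, c) stays nonzero, matching the polynomial table. With u, v, w, x, y, z set to 2, 3, 5, 7, 11, 13, the entry for S(a, c) is 2·5 + 7·3 = 31, which is what uw + xv gives.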

If your queries are recursive, then the polynomials might be infinite. For example, suppose you are calculating the transitive closure T of relation R. This is like the previous example, except that instead of having S(x, z) = R(x, y) and R(y, z), we have T(x, z) = R(x, z) or (T(x, y) and R(y, z)). This is a recursive equation, so we need to do a fixpoint solution for it, using certain well-known techniques. The result in this example is:

        T
        a   a   u⁺
        a   b   u*v
        a   c   u*(vx+w)
        a   e   u*(vx+w)y
        b   c   x
        b   e   xy
        d   e   z

In such a case there might be an infinite number of paths through R to derive the provenance of a certain tuple of T. In this example, R contains a loop, namely R(a, a), so there are an infinite number of derivations of some of the tuples in T, because you can go around the loop as many times as you like. u⁺ here is an abbreviation for the infinite polynomial u + u² + u³ + ...; u* here is an abbreviation for 1 + u⁺.

        1  a
        2  (a + b)
        3  ((a + b) + c)  (a + (b + c))
        4  (((a + b) + c) + d)  ((a + (b + c)) + d)  ((a + b) + (c + d))
           (a + ((b + c) + d))  (a + (b + (c + d)))
        5  ((((a + b) + c) + d) + e)  (((a + (b + c)) + d) + e)
           (((a + b) + (c + d)) + e)  (((a + b) + c) + (d + e))
           ((a + ((b + c) + d)) + e)  ((a + (b + (c + d))) + e)
           ((a + (b + c)) + (d + e))  ((a + b) + ((c + d) + e))
           ((a + b) + (c + (d + e)))  (a + (((b + c) + d) + e))
           (a + ((b + (c + d)) + e))  (a + ((b + c) + (d + e)))
           (a + (b + ((c + d) + e)))  (a + (b + (c + (d + e))))

In one example in the paper, the method produces a recursive relation of the form $V = s + V^2$, which can be solved by the same well-known techniques to come up with an (infinite) polynomial for V, namely $V = 1 + s + 2s^2 + 5s^3 + 14s^4 + \cdots$. Mathematicians will recognize the sequence 1, 1, 2, 5, 14, ... as the Catalan numbers, which come up almost as often as the better-known Fibonacci numbers. For example, the Catalan numbers count the number of binary trees with n nodes; they also count the number of ways of parenthesizing an expression with n + 1 terms, as shown in the table above.
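
The counts in the parenthesization table (1, 1, 2, 5, 14 for one through five terms) can be reproduced by brute force. The following Python sketch is an illustration only; the function name `parenthesizations` is made up for the purpose.

```python
def parenthesizations(terms):
    """Return every full parenthesization of the sum of `terms`, as strings."""
    if len(terms) == 1:
        return [terms[0]]
    results = []
    # Split the terms into a nonempty left part and a nonempty right part,
    # and combine every parenthesization of one with every one of the other.
    for split in range(1, len(terms)):
        for left in parenthesizations(terms[:split]):
            for right in parenthesizations(terms[split:]):
                results.append(f"({left} + {right})")
    return results

# Number of parenthesizations of 1, 2, 3, 4, and 5 terms.
counts = [len(parenthesizations("abcde"[:n])) for n in range(1, 6)]
```

The recursion has the same shape as the Catalan recurrence: a parenthesization of a sum is a pair of parenthesizations of its two halves.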

Anyway, in his talk, Val referred to the sequence as "bizarre", and I had to jump in to point out that it was not at all bizarre, it was the Catalan numbers, which are just what you would expect from a relation like $V = s + V^2$, blah blah, and he cut me off, because of course he knows all about the Catalan numbers. He only called them bizarre as a rhetorical flourish, meant to echo the presumed puzzlement of the undergraduates in the room.

(I never know how much of what kind of math to expect from computer science professors. Sometimes they know things I don't expect at all, and sometimes they don't know things that I expect everyone to know.

(This was indeed what was going on, and the professor seemed to think it was a surprising insight. I am not relating this boastfully, because I truly don't think it was a particularly inspired guess.

(Now that I think about it, maybe the answer here is that computer science professors know more about math than I expect, and less about computation.)

Anyway, I digress, and the whole article up to now was not really what I wanted to discuss anyway. What I wanted to discuss was that when I started blathering about Catalan numbers, Val said that if I knew so much about Catalan numbers, I should calculate the coefficient of the $x^{59}$ term in $V^2$, which also appeared as one of the annotations in his example.

So that's the puzzle: what is the coefficient of the $x^{59}$ term in $V^2$, where $V = 1 + s + 2s^2 + 5s^3 + 14s^4 + \cdots$?

After I had thought about this for a couple of minutes, I realized that it was going to be much simpler than it first appeared, for two reasons.

The first thing that occurred to me was that the definition of multiplication of polynomials is that the coefficient of the $x^n$ term in the product of A and B is $\sum_i a_i b_{n-i}$. When A = B, this reduces to $\sum_i a_i a_{n-i}$. Now, it just so happens that the Catalan numbers obey the relation $c_{n+1} = \sum_i c_i c_{n-i}$, which is exactly the same form. Since the coefficients of V are the $c_i$, the coefficients of $V^2$ are going to have the form $\sum_i c_i c_{n-i}$, which is just the Catalan numbers again, but shifted up by one place.

The next thing I thought was that the Catalan numbers have a pretty simple generating function f(x). This just means that you pretend that the sequence V is a Taylor series, and figure out what function it is the Taylor series of, and use that as a shorthand for the whole series, ignoring all questions of convergence and other such analytic fusspottery. If V is the Taylor series for f(x), then $V^2$ is the Taylor series for $f(x)^2$. And if f has a compact representation, say as sin(x) or something, it might be much easier to square than the original V was. Since I knew in this case that the generating function is simple, this seemed likely to win. In fact the generating function of V is not sin(x) but $(1-\sqrt{1-4x})/2x$. When you square this, you get almost the same thing back, which matches my prediction from the previous paragraph. This would have given me the right answer, but before I actually finished that calculation, I had an "oho" moment.

The generating function is known to satisfy the relation $f(x) = 1 + x f(x)^2$. This relation is where the $(1-\sqrt{1-4x})/2x$ thing comes from in the first place; it is the function that satisfies that relation. (You can see this relation prefigured in the equation that Val had, with $V = s + V^2$. There the notation is a bit different, though.) We can just rearrange the terms here, putting the $f(x)^2$ by itself, and get $f(x)^2 = (f(x)-1)/x$.

Now we are pretty much done, because $f(x) = V = 1 + x + 2x^2 + 5x^3 + 14x^4 + \cdots$, so $f(x) - 1 = x + 2x^2 + 5x^3 + 14x^4 + \cdots$, and $(f(x)-1)/x = 1 + 2x + 5x^2 + 14x^3 + \cdots$. Lo and behold, the terms are the Catalan numbers again.

So the answer is that the coefficient of the $x^{59}$ term is just $c_{60}$, calculation of which is left as an exercise for the reader.
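
The shift by one place is easy to confirm numerically. The Python sketch below squares the series by brute force and compares the result with the Catalan numbers themselves. It relies on the standard closed form C(2n, n)/(n + 1) for the nth Catalan number, which is not derived in this article, and the function names are hypothetical.

```python
from math import comb

def catalan(n):
    """Closed form for the nth Catalan number: C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

def square_coefficient(coeffs, n):
    """Coefficient of x^n in the square of the series with these coefficients."""
    return sum(coeffs[i] * coeffs[n - i] for i in range(n + 1))

N = 60
c = [catalan(n) for n in range(N + 1)]                  # 1, 1, 2, 5, 14, ...
shifted = [square_coefficient(c, n) for n in range(N)]  # coefficients of V^2
answer = shifted[59]                                    # coefficient of x^59
```

Here `shifted` comes out equal to `c[1:]`, so `answer` is the 60th Catalan number, in agreement with the generating-function argument.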

I don't know what the point of all that was, but I thought it was fun how the hairy-looking problem seemed likely to be simple when I looked at it a little more carefully, and then how it did turn out to be quite simple.

This blog has had a recurring dialogue between subtle technique and the sawed-off shotgun method, and I often favor the sawed-off shotgun method. Often programmers' big problem is that they are very clever and learned, and so they want to be clever and learned all the time, even when being a knucklehead would work better. But I think this example provides some balance, because it shows a big win for the clever, learned method, which does produce a lot more understanding.

Then again, it really doesn't take long to whip up a program to multiply infinite polynomials. I did it in chapter 6 of Higher-Order Perl, and here it is again in Haskell:

        data Poly a = P [a] deriving Show

        instance (Eq a) => Eq (Poly a)
            where (P x) == (P y) = (x == y)

        polySum x [] = x
        polySum [] y = y
        polySum (x:xs) (y:ys) = (x+y) : (polySum xs ys)

        polyTimes  [] _ = []
        polyTimes  _ [] = []
        polyTimes  (x:xs) (y:ys) = (x*y) : more
            where
              more = (polySum (polySum (map (x *) ys) (map (* y) xs))
                              (0 : (polyTimes xs ys)))

        instance (Num a) => Num (Poly a)
            where (P x) + (P y) = P (polySum x y)
                  (P x) * (P y) = P (polyTimes x y)



Tue, 09 Oct 2007

Relatively prime polynomials over $Z_2$
Last week Wikipedia was having a discussion on whether the subject of "mathematical quilting" was notable enough to deserve an article. I remembered that there had been a mathematical quilt on the cover of some journal I read last year, and I went to the Penn math library to try to find it again. While I was there, I discovered that the June 2007 issue of Mathematics Magazine had a cover story about the probability that two randomly-selected polynomials over $Z_2$ are relatively prime. ("The Probability of Relatively Prime Polynomials", Arthur T. Benjamin and Curtis D. Bennett, page 196).

Polynomials over $Z_2$ are one of my favorite subjects, and the answer to the question turned out to be beautiful. So I thought I'd write about it here.

First, what does it mean for two polynomials to be relatively prime? It's analogous to the corresponding definition for integers. For any numbers a and b, there is always some number d such that both a and b are multiples of d. (d = 1 is always a solution.) The greatest such number is called the greatest common divisor or GCD of a and b. The GCD of two numbers might be 1, or it might be some larger number. If it's 1, we say that the two numbers are relatively prime (to each other). For example, the GCD of 100 and 28 is 4, so 100 and 28 are not relatively prime. But the GCD of 100 and 27 is 1, so 100 and 27 are relatively prime. One can prove theorems like these: If p is prime, then either a is a multiple of p, or a is relatively prime to p, but not both. And the equation ap + bq = 1 has a solution (in integers) if and only if p and q are relatively prime.

The definition for polynomials is just the same. Take two polynomials over some variable x, say p and q. There is some polynomial d such that both p and q are multiples of d; d(x) = 1 is one such. When the only solutions are trivial polynomials like 1, we say that the polynomials are relatively prime. For example, consider $x^2 + 2x + 1$ and $x^2 - 1$. Both are multiples of x+1, so they are not relatively prime. But $x^2 + 2x + 1$ is relatively prime to $x^2 - 2x + 1$. And one can prove theorems that are analogous to the ones that work in the integers. The analog of "prime integer" is "irreducible polynomial". If p is irreducible, then either a is a multiple of p, or a is relatively prime to p, but not both. And the equation a(x)p(x) + b(x)q(x) = 1 has a solution for polynomials a and b if and only if p and q are relatively prime.

One uses Euclid's algorithm to calculate the GCD of two integers. Euclid's algorithm is simple: To calculate the GCD of a and b, just subtract the smaller from the larger, repeatedly, until one of the numbers becomes 0. Then the other is the GCD. One can use an entirely analogous algorithm to calculate the GCD of two polynomials. Two polynomials are relatively prime just when their GCD, as calculated by Euclid's algorithm, has degree 0.
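
The subtractive form of Euclid's algorithm described above is short enough to write out in full. This Python sketch handles nonnegative integers only, and the function name is hypothetical.

```python
def gcd_by_subtraction(a, b):
    """GCD of two nonnegative integers by repeated subtraction."""
    while a != 0 and b != 0:
        # Subtract the smaller from the larger.
        if a >= b:
            a -= b
        else:
            b -= a
    # One of the numbers is now 0, so the sum is the other one.
    return a + b
```

On the examples from earlier in the article, it gives 4 for 100 and 28, and 1 for 100 and 27.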

Anyway, that was more introduction than I wanted to give. The article in Mathematics Magazine concerned polynomials over $Z_2$, which means that the coefficients are in the field $Z_2$, which is just like the regular integers, except that 1+1=0. As I explained in the earlier article, this implies that a=-a for all a, so there are no negatives and subtraction is the same as addition. I like this field a lot, because subtraction blows. Do you have trouble because you're always dropping minus signs here and there? You'll like $Z_2$; there are no minus signs.

Here is a table that shows which pairs of polynomials over $Z_2$ are relatively prime. If you read this blog through some crappy aggregator, you are really missing out, because the table is awesome, and you can't see it properly. Check out the real thing.

[Table: a 32 × 32 grid of pink and white squares whose rows and columns are indexed by the polynomials of degree at most 4 over $Z_2$, in the order 0, 1, x, x + 1, $x^2$, $x^2 + 1$, $x^2 + x$, $x^2 + x + 1$, $x^3$, ..., $x^4 + x^3 + x^2 + x + 1$.]

A pink square means that the polynomials are relatively prime; a white square means that they are not. Another version of this table appeared on the cover of Mathematics Magazine. It's shown at right.

The thin black lines in the diagram above divide the polynomials of different degrees. Suppose you pick two degrees, say 2 and 2, and look at the corresponding black box in the diagram:

[Table: the 4 × 4 box for the degree-2 polynomials $x^2$, $x^2 + 1$, $x^2 + x$, and $x^2 + x + 1$.]

You will see that each box contains exactly half pink and half white squares. (8 pink and 8 white in that case.) That is, exactly half the possible pairs of degree-2 polynomials are relatively prime. And in general, if you pick a random degree-a polynomial and a random degree-b polynomial, where a and b are not both zero, the polynomials will be relatively prime exactly half the time.
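
The half-and-half claim can be checked by brute force. The Python sketch below is an illustration under assumptions of my own: it encodes a polynomial over $Z_2$ as an integer bitmask and uses the division form of Euclid's algorithm, where each subtraction is an XOR. It only looks at degrees of at least 1, where "degree exactly d" is unambiguous. The function names are hypothetical.

```python
def gcd_z2(p, q):
    """GCD of two polynomials over Z_2, each encoded as an int bitmask.

    Bit k is the coefficient of x^k, so 0b101 means x^2 + 1.
    Subtraction over Z_2 is XOR, so each step cancels the leading term
    of the higher-degree polynomial against a shifted copy of the other.
    """
    while q:
        if p.bit_length() < q.bit_length():
            p, q = q, p
        p ^= q << (p.bit_length() - q.bit_length())
    return p

def coprime_count(deg_a, deg_b):
    """Return (relatively prime pairs, all pairs) for the given exact degrees."""
    polys_a = range(1 << deg_a, 1 << (deg_a + 1))
    polys_b = range(1 << deg_b, 1 << (deg_b + 1))
    coprime = sum(gcd_z2(p, q) == 1 for p in polys_a for q in polys_b)
    return coprime, len(polys_a) * len(polys_b)
```

For degrees 2 and 2 this counts 8 relatively prime pairs out of 16, matching the box described above.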

The proof of this is delightful. If you run Euclid's algorithm on two relatively prime polynomials over $Z_2$, you get a series of intermediate results, terminating in the constant 1. Given the intermediate results and the number of steps, you can run the algorithm backward and find the original polynomials. If you run the algorithm backward starting from 0 instead of from 1, for the same number of steps, you get two non-relatively-prime polynomials of the same degrees instead. This establishes a one-to-one correspondence between pairs of relatively prime polynomials and pairs of non-relatively-prime polynomials of the same degrees. End of proof. (See the paper for complete details.)

You can use basically the same proof to show that the probability that two randomly-selected polynomials over $Z_p$ are relatively prime is 1-1/p. The argument is the same: Euclid's algorithm could produce a series of intermediate results terminating in 0, in which case the polynomials are not relatively prime, or it could produce the same series of intermediate results terminating in something else, in which case they are relatively prime. The paper comes to an analogous conclusion about monic polynomials over Z.
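
The 1-1/p figure can also be spot-checked for small primes. This Python sketch is illustrative only: the coefficient-list representation (lowest degree first, no trailing zeros) and every function name are assumptions of mine, and it counts polynomials of exact degree at least 1 rather than following the paper's setup.

```python
from itertools import product

def poly_mod(a, b, p):
    """Remainder of a divided by b over Z_p (b must be nonzero)."""
    a = list(a)
    inv_lead = pow(b[-1], -1, p)          # inverse of b's leading coefficient
    while len(a) >= len(b):
        factor = a[-1] * inv_lead % p
        shift = len(a) - len(b)
        for i, coeff in enumerate(b):
            a[shift + i] = (a[shift + i] - factor * coeff) % p
        while a and a[-1] == 0:           # strip the cancelled leading terms
            a.pop()
    return a

def coprime(a, b, p):
    """True when the GCD of a and b over Z_p is a nonzero constant."""
    a, b = list(a), list(b)
    while b:
        a, b = b, poly_mod(a, b, p)
    return len(a) == 1

def polys_of_degree(deg, p):
    """Every polynomial over Z_p whose degree is exactly deg."""
    for low in product(range(p), repeat=deg):
        for lead in range(1, p):
            yield list(low) + [lead]

def coprime_count(deg_a, deg_b, p):
    """Return (relatively prime pairs, all pairs) for the given exact degrees."""
    pairs = [(a, b) for a in polys_of_degree(deg_a, p)
                    for b in polys_of_degree(deg_b, p)]
    return sum(coprime(a, b, p) for a, b in pairs), len(pairs)
```

For p = 3 and degrees 1 and 1, for instance, 24 of the 36 pairs are relatively prime, which is 1 - 1/3.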

Some folks I showed the diagram to observed that it looks like a quilt pattern. My wife did actually make a quilt that tabulates the GCD function for integers, which I mentioned in the Wikipedia discussion of the notability of the Mathematical Quilting article. That seems to have brought us back to where the article started, so I'll end here.

[ Puzzle: The (11,12) white squares in the picture are connected to the others via row and column 13, which doesn't appear. Suppose the quilt were extended to cover the entire quarter-infinite plane. Would the white area be connected? ]

Mon, 08 Oct 2007

Reduces your risk of auto theft by 400%.

Fri, 05 Oct 2007

Van der Waerden's problem: program 2
In this series of articles I'm going to analyze four versions of a program that I wrote around 1988, and then another program that does the same thing that I wrote last month without referring to the 1988 code.

If you don't remember what the program does, here's an explanation.

Here is program 1, which was an earlier attempt to do the same thing.

## Program 2

In yesterday's article I wrote about a crappy program to search for "good" strings in van der Waerden's problem. It was crappy because it searched the entire space of all $3^{27}$ strings, with no pruning.

I can't remember whether I expected this to be practical at the time. Did I really think it would work? Well, there was some sense to it. It does work just fine for the $2^9$ case. I think probably my idea was to do the simplest thing that could possibly work, and get as much information out of it as I could. On my current machine, this method proves that V(3,3) > 19 by finding a witness (RRBRRBBYYRRBRRBBYYB) in under 10 seconds. If we estimate that the computer I had then was 10,000 times slower, then I could have produced the same result in about 28 hours. I was at college, and there was plenty of free computing power available, so running a program for 28 hours was easily done. While I was waiting for it to finish, I could work on a better program.

Excerpts of the better program follow. The complete source code is here.

The idea behind this program is that the strings of length less than V form a tree, with the empty string as the root, and the children of string s are obtained from s by appending a single character to the end of s. If the string at a node is bad, so will be all the strings under it, and we can prune the entire branch at that node. This leaves us with a tree of all the good strings. The ones farthest from the root will be the witnesses we seek for the values of V(n, C), and we can find these by doing depth-first search on the tree.
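
The pruned search can be sketched compactly in Python. This is not a translation of the 1988 C program: it walks the tree implicitly through recursion, it rechecks each whole string rather than only the new progressions, and the names `is_good` and `deepest_good_strings` are hypothetical.

```python
def is_good(s):
    """True if no three equally spaced positions of s hold the same character."""
    n = len(s)
    return not any(s[i] == s[i + d] == s[i + 2 * d]
                   for d in range(1, n)
                   for i in range(n - 2 * d))

def deepest_good_strings(colors):
    """Depth-first search of the tree of good strings; return the deepest nodes."""
    best = [""]

    def visit(s):
        nonlocal best
        if len(s) > len(best[0]):
            best = [s]
        elif len(s) == len(best[0]) and s not in best:
            best.append(s)
        for ch in colors:
            child = s + ch
            if is_good(child):   # a bad child is pruned with everything below it
                visit(child)

    visit("")
    return best

witnesses = deepest_good_strings("RB")
```

With two colors the deepest good strings have length 8, and RRBBRRBB is one of them.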

There is nothing wrong with this idea in principle; that's the way my current program works too. The problem is all in the implementation. You see, this program actually constructs the entire tree in memory:

    #define NEWN		((struct tree *) Malloc(sizeof(struct tree)));\
                        printf("*")
    struct tree {
      char bad;
      struct tree *away[MAXCOLORS];
    } *root;

struct tree is a tree node structure. It represents a string s, and has a flag to record whether s is bad. It also has pointers to its subnodes, which will represent strings sA, sB, and so on.

MAXCOLORS is a compiled-in limit on the number of different symbols the strings can contain, an upper bound on C. Apparently I didn't know the standard technique for avoiding this inflexibility. You declare the array as having length 1, but then when you allocate the structure, you allocate enough space for the array you are actually planning to use. Even though the declared size of the array is 1, you are allowed to refer to node->away[37] as long as there is actually enough space in the allocated chunk. The implementation would look like this:

        struct tree {
          char bad;
          struct tree *away[1];
        } ;

        struct tree *make_tree_node(char bad, unsigned n_subnodes)
        {
          struct tree *t;
          unsigned i;

          t =  malloc(sizeof(struct tree)
                      + (n_subnodes-1) * sizeof(struct tree *));

          if (t == NULL) return NULL;

          t->bad = bad;
          for (i=0; i < n_subnodes; i++) t->away[i] = NULL;

          return t;
        }

(Note for those who are not advanced C programmers: I give you my solemn word of honor that I am not doing anything dodgy or bizarre here; it is a standard, widely-used, supported technique, guaranteed to work everywhere.)

(As before, this code is in a pink box to indicate that it is not actually part of the program I am discussing.)

Another thing I notice is that the NEWN macro is very weird. Note that it may not work as expected in a context like this:

        for(i=0; i<10; i++)
          s[i] = NEWN;

This allocates ten nodes but prints only one star, because it expands to:

        for(i=0; i<10; i++)
          s[i] = ((struct tree *) Malloc(sizeof(struct tree)));
        printf("*");

and the for loop does not control the printf. The usual fix for multiline macros like this is to wrap them in do...while(0), but that is not appropriate here. Had I been writing this today, I would have made NEWN a function, not a macro. Clevermacroitis is a common disorder of beginning C programmers, and I was no exception.

The main business of the program is in the makenodes function; the main routine does some argument processing and then calls makenodes. The arguments to the makenodes function are the current tree node, the current string that that node represents, and an integer howfar that says how deep a tree to construct under the current node.

There's a base case, for when nothing needs to be constructed:

    if (!howfar)
      {
        for (i=0; i<colors; i++)
          n->away[i] = NULL;
        return;
      }

But in general the function calls itself recursively:

    for (i=0; i<colors; i++)
      {
        n->away[i] = NEWN;
        if (apchk(s,'A'+i))
          {
          }
        else
        ...

Recall that apchk checks a string for an arithmetic progression of equal characters. That is, it checks to see if a string is good or bad. If the string is bad, the function prunes the tree at the current node, and doesn't recurse further.

Unlike the one in the previous program, this apchk doesn't bother checking all the possible arithmetic progressions. It only checks the new ones: that is, the ones involving the last character. That's why it has two arguments. One is the old string s and the other is the new symbol that we want to append to s.
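
The incremental check can be written out as a Python sketch. It mirrors the idea of the two-argument apchk (the existing string plus the one new symbol) but is not a line-by-line translation, and the name `completes_progression` is hypothetical.

```python
def completes_progression(old, new_char):
    """Would appending new_char to old create a three-term arithmetic
    progression of equal characters ending at the new position?

    Only progressions that involve the new character are examined, so
    the running time is proportional to len(old).
    """
    end = len(old)                       # index the new character would occupy
    for first in range(end % 2, end - 1, 2):
        middle = (first + end) // 2      # first and end have the same parity
        if old[first] == old[middle] == new_char:
            return True
    return False
```

Only starting positions with the same parity as the new position are tried, since otherwise the midpoint would not be a whole number.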

If s would still be good with symbol 'A'+i appended to the end, the function recurses:

        ...
        else
          {
            ls = strlen(s);
            newarg = STRING(ls + 1);
            strcpy(newarg,s);
            newarg[ls+1] = '\0';
            newarg[ls] = 'A' + i;
            makenodes(n->away[i],howfar-1,newarg);
            Free(newarg,ls+2);
            Free(n->away[i],sizeof(struct tree));
          }
      }
    }

The entire string is copied here into a new buffer. A better technique would have been to allocate a single buffer back up in main, and to reuse that buffer over again on each call to makenodes. It would have looked something like this:

        char *s = String(maxlen);
        memset(s, 0, maxlen+1);
        makenodes(s, s, maxlen);

        void
        makenodes(char *start, char *end, unsigned howfar)
        {
          ...
          for (i=0; i<colors; i++) {
            *end = 'A' + i;
            makenodes(start, end+1, howfar-1);
          }
          *end = '\0';
          ...
        }

This would have saved a lot of consing, ahem, I mean a lot of mallocing. Also a lot of string copying. We could avoid the end pointer by using start+maxlen-howfar instead, but this way is easier to understand.

I was thinking this afternoon how it's interesting the way I wrote this. It's written the way it would have been done, had I been using a functional programming language. In a functional language, you would never mutate the same string for each function call; you always copy the old structure and construct a new one, just as I did in this program. This is why C programmers abominate functional languages.

Had I been writing makenodes today, I would probably have eliminated the other argument. Instead of passing it a node and having it fill in the children, I would have had it construct and return a complete node. The recursive call would then have looked like this:

        struct tree *new = NEWN;
        ...
        for (i=0; i<colors; i++) {
          new->away[i] = makenodes(...);
          ...
        }
        return new;

One thing I left out of all this was the diagnostic printfs; you can see them in the complete code if you want. But there's one I thought was worth mentioning anyway:

    #define TABS	"                                        "
    ....

    #ifdef DIAG
      printf("%s makenoding with string %s, depth %d.\n",
             TABS+12-maxlen+howfar,s,maxlen-howfar);
    #endif

The interesting thing here is the TABS+12-maxlen+howfar argument, which indents the display depending on how far the recursion has progressed. In Perl, which has nonaddressable strings, I usually do something like this:

        my $TABS = " " x ($maxlen - $howfar);
        print $TABS, "....";

The TABS trick here is pretty clever, and I'm a bit surprised that I thought of it in 1988, when I had been programming in C for only about a year. It makes an interesting contrast to my failure to reuse the string buffer in makenodes earlier.

(Peeking ahead, I see that in the next version of the program, I did reuse the string buffer in this way.)

TABS is actually forty spaces, not tabs. I suspect I used tabs when I tested it with V(2, 3), where maxlen was only 9, and then changed it to spaces for calculating V(3, 3), where maxlen was 27.

The apchk function checks to see if a string is good. Actually it gets a string, qq, and a character, q, and checks to see if the concatenation of qq and q would be good. This reduces its running time to $O(|qq|)$ rather than $O(|qq|^2)$.

        int
        apchk(qq,q)
        char *qq ,q;
        {
          int lqq, f, s, t;

          t = lqq = strlen(qq);
          if (lqq < 2) return NO;

          for (f=lqq % 2; f <= lqq - 2; f += 2)
            {
              s = (f + t) / 2;
              if ((qq[f] == qq[s]) && (qq[s] == q))
                return YES;
            }
          return NO;
        }

It's funny that it didn't occur to me to include an extra parameter to avoid the strlen, or to use q instead of qq[s] in the first == test. Also, as in the previous program, I seem unaware of the relative precedences of && and ==. This is probably a hangover from my experience with Pascal, where the parentheses are required.

It seems I hadn't learned yet that predicate functions like apchk should be named something like is_bad, so that you can understand code like if (is_bad(s)) { ... } without having to study the code of is_bad to figure out what it returns.

I was going to write that I hated this function, and that I could do it a lot better now. But then I tried to replace it, and wasn't as successful as I expected I would be. My replacement was:

        unsigned
        {
          size_t qql = strlen(qq);
          char *f = qq + qql%2;
          char *s = f + qql/2;
          while (f < s) {
            if (*f == q && *s == q) return 1;
            f += 2; s += 1;
          }
          return 0;
        }

I could simplify the initializations of f and s, which are the parts I dislike most here, by making the pointers move backward instead of forward, but then the termination test becomes more complicated:

        unsigned
        {
          char *s = strchr(qq, '\0')-1;
          char *f = s-1;
          while (1) {
            if (*f == q && *s == q) return 1;
            if (f - qq < 2) break;
            f -= 2; s -= 1;
          }
          return 0;
        }

Anyway, I thought I could improve it, but I'm not sure I did. On the one hand, I like the f -= 2; s -= 1;, which I think is pretty clear. On the other hand, s = (f + t) / 2 is pretty clear too; s is midway between f and t. I'm willing to give teenage Dominus a passing grade on this one.

Someone probably wants to replace the while loop here with a for loop. That person is not me.

The Malloc and Free functions track memory usage and were presumably introduced when I discovered that my program used up way too much memory and crashed—I think I remember that the original version omitted the calls to free. They aren't particularly noteworthy, except perhaps for this bit, in Malloc:

        if (p == NULL)
          {
            fprintf(stderr,"Couldn't get %d bytes.\n",c);
            fprintf(stderr,"Total get was %d.\n",gotten);
            fprintf(stderr,"P\n L\n  O\n   P\n    !\n");
            abort();
          }

Plop!

It strikes me as odd that I was using void in 1988 (this is before the C90 standard) but still K&R-style function declarations. I don't know what to make of that.

## Behavior

This program works, almost. On my current machine, it can find the length-26 witnesses for V(3, 3) in no time. (In 1988, it took several days to run on a Sequent Balance 21000.) The major problem is that it gobbles memory: the if (!howfar) base case in makenodes forgets to release the memory that was allocated for the new node. I wonder if the Malloc and Free functions were written in an unsuccessful attempt to track this down.

Sometime after I wrote this program, while I was waiting for it to complete, it occurred to me that it never actually used the tree for anything, and I could take it out.

I have this idea that one of the principal symptoms of novice programmers is that they take the data structures too literally, and always want to represent data the way it will appear when it's printed out. I haven't developed the idea well enough to write an article about it, but I hope it will show up here sometime in the next three years. This program, which constructs an entirely unnecessary tree structure, may be one of the examples of this idea.

I'll show the third version sometime in the next few days, I hope.

[ Addendum 20071014: Here is part 3. ]

Thu, 04 Oct 2007

The world's worst macro preprocessor: postmortem
I see that the world's worst macro processor, subject of a previous article, is a little over a year old. A year ago I said that it was a huge success. I think it's time for a postmortem analysis.

My overall assessment is that it has been a huge success, and that if I were doing it over I would do it the same way.

A recent article contained a bunch of red and blue dots:

Well, clearly you can do four: . And then you can add another red one on the end: . And then another that could be either red or blue: . And then the next can be either color, say blue: .

I typed this using these macros:

        #define R* <span style="color: red">&bull;</span>
        #define B* <span style="color: blue">&bull;</span>
        #define Y* <span style="color: yellow">&bull;</span>

Without the macro processor, I would have had to suffer a lot. Then, a little while later, I needed to prepare this display:

••••••••••••••••••••••••••
••••••••••••••••••••••••••
••••••••••••••••••••••••••
••••••••••••••••••••••••••
••••••••••••••••••••••••••
••••••••••••••••••••••••••
••••••••••••••••••••••••••
••••••••••••••••••••••••••

No problem; the lines just look like R*R*B*B*R*R*B*Y*B*Y*Y*R*Y*R*R*B*R*B*B*Y*R*Y*Y*B*Y*B*.

Some time later I realized that this display would be totally illegible to the blind, the color-blind, and people using text-only browsers. So I just changed the macros:

        #define R* <span style="color: red">R</span>
        #define B* <span style="color: blue">B</span>
        #define Y* <span style="color: yellow">Y</span>

Problem solved. The row of dots instantly becomes R R B B R B B. And a good thing, too, because I discovered afterward that a lot of aggregators, like bloglines and feedburner, discard the color information.

I find that I've used the macro feature 114 times so far. The most common use has been:

   #define ^2 <sup>2</sup>

But I also have files with:

      #define r2 &radic;2

That last one appears in three files. Clearly, making the macros local to files was a good decision.

Those uses are pretty typical. A less typical one is:

      #define <OVL> <span style="text-decoration: overline">
      #define </OVL> </span>

This is the sort of thing that you can get away with on a one-time basis, but which you wouldn't want to make a convention of. Since the purpose of the macro processor is to enable such hacks for the duration of a single article, it's all good.

I did run into at least one problem: I was writing an article in which I had defined ^i to abbreviate <sup><i>i</i></sup>. And then several paragraphs later I had a TeX formula that contained the ^i sequence in its TeX meaning. This was being replaced with a bunch of HTML, which was then passed to TeX, which then produced the wrong output.

One can solve this by reordering the plugins. If I had put the TeX plugin before the macro plugin, the problem would have gone away, because the TeX plugin would have replaced the TeX formula with an image element before the macro plugin ever saw the ^i.

This approach has many drawbacks. One is that it would no longer have been possible to use Blosxom macros in a TeX formula. I wasn't willing to foreclose this possibility, and I also wasn't sure that I hadn't done it somewhere. If I had, the TeX formula that depended on the macro expansion would have broken. And this is a risk whenever you move the macro plugin: if you move it from before plugin X to after plugin X, you have to worry that maybe something in some article depended on the text passed to X having been macro-processed.

When I installed the macro processor, I placed it first in plugin order for precisely this reason. Moving the macro substitution later would have required me to remember which plugins would be affected by the macro substitutions and which not. With the macro processing first, the question has a simple answer: all of them are affected.

Also, I didn't ever want to have to worry that some macro definition might mangle the output of some plugin. What if you are hacking on some plugin, and you change it to return <span style="Foo"> instead of <span style="foo">, and then discover that three articles you wrote back in 1997 are now totally garbled because they contained #define Foo >WUGGA<? It's just too unpredictable. Having the macro processing occur first means that you can always see in the original article file just what might be macro-replaced.

So I didn't reorder the plugins.

Another way to solve the TeX ^i problem would have been to do something like this:

        #define ^i <sup><i>i</i></sup>
        #define ^*i ^i

with the idea that I could write ^*i in the TeX formula, and the macro processor would replace it with ^i after it was done replacing all the ^i's.

At present the macro processor does not define any order to macro replacements, but it does guarantee to replace each string only once. That is, the results of macro replacement are not themselves searched for macro replacement. This limits the power of the macro system, but I think that is a good thing. One of the powers that is thus proscribed is the power to get stuck in an infinite loop.
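
The single-substitution rule is easy to sketch in Python with one pass of a regular expression. This is an assumption about how such a processor could be built, not the code of the actual Blosxom plugin, and the function name `expand` is hypothetical.

```python
import re

def expand(text, macros):
    """Replace each macro occurrence once, in a single left-to-right pass.

    Replacement text is copied straight to the output and is never
    searched again, so the expansion cannot loop.
    """
    if not macros:
        return text
    pattern = re.compile("|".join(re.escape(name) for name in macros))
    # A function is used as the replacement so the macro body is taken literally.
    return pattern.sub(lambda m: macros[m.group(0)], text)

macros = {"^i": "<sup><i>i</i></sup>", "^*i": "^i"}
result = expand("x^i and $x^*i$", macros)
```

Here the `^i` produced from `^*i` is left alone, which is exactly what makes the `^*i` escape work; under repeated substitution it would be expanded a second time.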

It occurs to me now that although I call it the world's worst macro system, perhaps that doesn't give me enough credit for doing good design that might not have been obvious. I had forgotten about my choice of single-substitution behavior, but looking back on it a year later, I feel pleased with myself for it, and imagine that a lot of people would have made the wrong choice instead.

(A brief digression: unlimited, repeated substitution is a bad move here because it is complex—much more complex than it appears. A macro system with single substitution is nothing much, but a macro system with repeated substitution is a programming language. The semantics of the λ-calculus is nothing more than simple substitution, repeated as necessary, and the λ-calculus is a maximally complex computational engine. Term-rewriting systems are a more obvious theoretical example, and TeX is a better-known practical example of this phenomenon. I was sure I did not want my macro system to be a programming language, so I avoided repeated substitution.)

Because each input text is substituted at most once, the processor's refusal to define the order of the replacements is not something you have to think about, as long as your macros are prefix-unique. (That is, as long as none is a prefix of another.) So you shouldn't define:

        #define foo   bar
        #define fool  idiot

because then you don't know if foolish turns into barlish or idiotish. This is not a big deal in practice.
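The single-substitution rule and the prefix problem can both be seen in a small sketch. This is not the blog's actual macro processor; `struct macro` and `expand_once` are hypothetical names, and the sketch assumes that when two names match at the same position, the one listed first in the table wins (the processor described above promises no particular order).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical macro table entry; the names are illustrative only. */
struct macro {
    const char *name;   /* text to look for */
    const char *body;   /* text to substitute */
};

/*
 * Expand macros in `in`, writing the result to `out` (capacity `outsz`).
 * Each input position is examined once, and replacement text is copied
 * straight to the output without being rescanned, so expansion always
 * terminates. When two names match at the same position, the one listed
 * first in the table wins; that is an assumption of this sketch.
 * Returns 0 on success, -1 if the output buffer is too small.
 */
int expand_once(const char *in, const struct macro *tab, size_t ntab,
                char *out, size_t outsz)
{
    size_t used = 0;

    while (*in) {
        const char *piece = in;   /* default: copy one input character */
        size_t piece_len = 1;
        size_t advance = 1;

        for (size_t i = 0; i < ntab; i++) {
            size_t n = strlen(tab[i].name);
            if (n > 0 && strncmp(in, tab[i].name, n) == 0) {
                piece = tab[i].body;
                piece_len = strlen(tab[i].body);
                advance = n;
                break;
            }
        }
        if (used + piece_len + 1 > outsz)
            return -1;
        memcpy(out + used, piece, piece_len);
        used += piece_len;
        in += advance;            /* skip the matched name, never the body */
    }
    if (outsz == 0)
        return -1;
    out[used] = '\0';
    return 0;
}
```

With the table listed as foo, fool, the input foolish comes out as barlish; with the two entries swapped it comes out as idiotish. A table containing both ^i and ^*i turns ^*i into a literal ^i, because the replacement text is never searched again.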

Well, anyway, I did not solve the problem with #define ^*i ^i. I took a much worse solution, which was to hack a #undefall directive into the macro processor. In my original article, I boasted that the macro processor "has exactly one feature". Now it has two, and it's not an improvement. I disliked the new feature at the time, and now that I'm reviewing the decision, I think I'm going to take it out.

I see that I did use the double-macro solution elsewhere. In the article about Gödel and the U.S. Constitution, I macroed an abbreviation for the umlaut:

        #define Godel G&ouml;del

But this sequence also occurred in the URLs in the link elements, and the substitution broke the links. I should probably have changed this to:

        #define Go:del G&ouml;del
        #define GODEL Godel

and then used GODEL in the URLs. Oh well, whatever works, I guess.

Perhaps my favorite use so far is in an (unfinished) article about prosopagnosia. I got tired of writing about prosopagnosia and prosopagnosiacs, so

        #define PAa prosopagnosia
        #define PAic prosopagnosiac

Note that with these definitions, I get PAa's and PAics for free. I could use PAac instead of defining PAic, but that would prevent me from deciding later that prosopagnosiac should be spelled "prosopagnosic".

Wed, 03 Oct 2007

Van der Waerden's problem: program 1
In this series of articles I'm going to analyze four versions of a program that I wrote around 1988, and then another program that does the same thing that I wrote last month without referring to the 1988 code.

If you don't remember what the program does, here's an explanation.

## Program 1

I'm going to discuss the program a bit at a time. The complete program is here.

This program does an unpruned exhaustive search of the string space. Since for V(3, 3) the string space contains $3^{27}$ = 7,625,597,484,987 strings, it takes a pretty long time to finish. I quickly realized that I was wasting my time with this program.

The program is invoked with a length argument and an optional colors argument, which defaults to 2. It then looks for good strings of the specified length, printing those it finds. If there are none, one then knows that V(3, colors) > length. Otherwise, one knows that V(3, colors) ≤ length, and has witness strings to prove it.

I don't want to spend a lot of time on matters of style, because there are plenty of C programming style guides you can read if you care for that. But already on lines 4–5 we have something I wouldn't write today:

        #define NO	0
        #define YES	!NO

Oh well.

The program wants to iterate through all $C^n$ strings. How does it know when it's done? It's not easy to make a program as slow as this one even slower, but I found a way to do it.

        last = STRING(length);
        stuff(last,'A' - 1 + colors);

        for (i=0; i<colors; i++)
          last[i] = 'A' + i;

        for (; strcmp(seq,last); strinc(seq))
          ...

It manufactures the string ABCDDDDDDDDD....D and compares the current string to that one every time through the loop. A much simpler method is to detect completion while incrementing the target string. The function that does the increment looks like this:

        void
        strinc(s)
        char *s;
        {
          int i;

          for (i= length - 1; i>=0; i--)
          {
            if (s[i] != 'A' - 1 + colors)
            {
              s[i]++;
              return;
            }
            s[i] = 'A';
          }
          return;
        }

Had I been writing it today, it would have looked more like this:

        unsigned strinc(char *s)
        {
          char *p = strchr(s, '\0') - 1;
          while (p >= s && *p == 'A' + colors - 1) *p-- = 'A';
          if (p < s) return 0;
          (*p)++;
          return 1;
        }

(This code is in a pink box to show that it is not actually part of the program I am discussing in this article.)

The function returns true on success and false on failure. A false return can be taken by the caller as the signal to terminate the program.

This replacement function invokes undefined behavior, because there is no guarantee that p is allowed to run off the beginning of the string in the way that it does. But there is no need to check the strings in lexicographic order. Instead of scanning the strings in the order AAA, AAB, ABA, ABB, BAA, etc., one can scan them in reverse lexicographic order: AAA, BAA, ABA, BBA, AAB, etc. Then instead of running off the beginning of the string, p runs off the end, which is allowed. This fixes the undefined behavior problem and also eliminates the call to strchr that finds the end of the string. This is likely to produce a significant speedup:

        unsigned strinc(char *s)
        {
          while (*s == 'A' + colors - 1) *s++ = 'A';
          if (!*s) return 0;
          (*s)++;
          return 1;
        }

Here we're depending on the optimizer to avoid recomputing the value of 'A' + colors - 1 every time through the loop.
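The caller's side of this arrangement might look roughly like the sketch below. It is not part of any of the programs discussed here: `strinc_rev`, `has_ap3`, and `count_good` are hypothetical names, the color count is passed as a parameter rather than read from a global, and the fixed 64-byte buffer is an assumption made to keep the example short.

```c
#include <stddef.h>
#include <string.h>

/* Reverse-lexicographic increment, as in the version above, but with the
 * color count passed as a parameter instead of a global. */
static unsigned strinc_rev(char *s, int colors)
{
    const char top = (char)('A' + colors - 1);

    while (*s == top) *s++ = 'A';
    if (!*s) return 0;          /* wrapped around: every string was visited */
    (*s)++;
    return 1;
}

/* Returns 1 if q contains three evenly spaced equal characters. */
static int has_ap3(const char *q, size_t len)
{
    for (size_t d = 1; 2 * d < len; d++)
        for (size_t f = 0; f + 2 * d < len; f++)
            if (q[f] == q[f + d] && q[f + d] == q[f + 2 * d])
                return 1;
    return 0;
}

/* Visit all colors^len strings and count the good ones.
 * Assumes 0 < len < 64 and 1 <= colors <= 26. */
unsigned long count_good(size_t len, int colors)
{
    char seq[64];
    unsigned long good = 0;

    memset(seq, 'A', len);
    seq[len] = '\0';
    do {
        if (!has_ap3(seq, len))
            good++;
    } while (strinc_rev(seq, colors));   /* a false return ends the search */
    return good;
}
```

The loop needs no sentinel string and no strcmp: the increment function itself reports when the search space is exhausted.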

The heart of the program is the apchk() function, which checks whether a string q contains an arithmetic progression of length 3:

        int
        apchk(q)
        char *q;
        {
          int f, s, t;

          for (f=0; f <= length - 3; f++)
            for (s=f+1; s <= length - 2; s++)
            {
              t = s+s-f;
              if (t >= length) break;
              if ((q[f] == q[s]) && (q[s] == q[t])) return YES;
            }
          return NO;
        }

I hesitate to say that this is the biggest waste of time in the whole program, since after all it is a program whose job is to examine 7,625,597,484,987 strings. But look. 2/3 of the calls to this function are asking it to check a string that differs from the previous string in the final character only. Nevertheless, it still checks all 49 possible arithmetic progressions, even the ones that didn't change.
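To make the waste concrete: when only the final character has changed, the only progressions whose status can change are those that end at the final position. A check restricted to those might look like the sketch below. The name `ap3_ending_at` is hypothetical, and this is only one possible approach, not necessarily what the later programs in the series do.

```c
#include <stddef.h>

/* Returns 1 if some three-term progression ending at index `last`
 * consists of three equal characters. Progressions that do not
 * include q[last] are not examined. */
int ap3_ending_at(const char *q, size_t last)
{
    for (size_t d = 1; 2 * d <= last; d++)
        if (q[last] == q[last - d] && q[last - d] == q[last - 2 * d])
            return 1;
    return 0;
}
```

This examines at most last/2 progressions instead of all of them, but it is only valid when the rest of the string is already known to be good.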

The t ≥ length test is superfluous, or if it isn't, it should be.

Also notice that I wasn't sure of the precedence in the final test.

It didn't take me long to figure out that this program was not going to finish in time. I wrote a series of others, which I hope to post here in coming days. The next one sucks too, but in a completely different way.

[ Addendum 20071005: Here is part 2. ]

[ Addendum 20071014: Here is part 3. ]

Tue, 02 Oct 2007

Van der Waerden's problem
In this series of articles I'm going to analyze four versions of a program that I wrote around 1988, and then another program that does the same thing that I wrote last month without referring to the 1988 code.

First I'll explain what the programs are about.

## Van der Waerden's problem

Color each of a row of dots red or blue, so that no three evenly-spaced dots are the same color. (That is, if dots n and n+i are the same color, dot n+2i must be a different color.) How many dots can you do?

Well, clearly you can do four: R R B B. And then you can add another red one on the end: R R B B R. And then another that could be either red or blue: R R B B R B. And then the next can be either color, say blue: R R B B R B B.

But now you are at the end, because if you make the next dot red, then dots 2, 5, and 8 will all be red (R R B B R B B R), and if you make the next dot blue then dots 6, 7, and 8 will be blue (R R B B R B B B).

But maybe we made a mistake somewhere earlier, and if the first seven dots were colored differently, we could have made a row of more than 7 that obeyed the no-three-evenly-spaced-dots requirement. In fact, this is so: R R B B R R B B is an example.

But this is the end of the line. Any coloring of a row of 9 dots contains three evenly-spaced dots of the same color. (I don't know a good way to prove this, short of an enumeration of all 512 possible arrangements of dots. Well, of course it is sufficient to enumerate the 256 that begin with R, but that is pretty much the same thing.)
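The enumeration is small enough to sketch in a few lines. The version below represents a row of dots as a bitmask, which is a choice made for this example only; `mask_has_ap3` and `count_good_rows` are hypothetical names.

```c
/* Bit i of `mask` is the color of dot i (0 = red, 1 = blue).
 * Returns 1 if dots 0..n-1 include three evenly spaced dots of one color. */
static int mask_has_ap3(unsigned mask, unsigned n)
{
    for (unsigned d = 1; 2 * d < n; d++)
        for (unsigned f = 0; f + 2 * d < n; f++) {
            unsigned a = (mask >> f) & 1u;
            unsigned b = (mask >> (f + d)) & 1u;
            unsigned c = (mask >> (f + 2 * d)) & 1u;
            if (a == b && b == c)
                return 1;
        }
    return 0;
}

/* Count the two-color rows of n dots (n < 16 assumed) that have no three
 * evenly spaced dots of the same color. */
unsigned count_good_rows(unsigned n)
{
    unsigned good = 0;

    for (unsigned mask = 0; mask < (1u << n); mask++)
        if (!mask_has_ap3(mask, n))
            good++;
    return good;
}
```

Calling count_good_rows(9) walks all 512 arrangements and returns 0, while count_good_rows(8) returns a nonzero count.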

Van der Waerden's theorem says that for any number of colors, say C, a sufficiently-long row of colored dots will contain n evenly-spaced same-color dots for any n. Or, put another way, if you partition the integers into C disjoint classes, at least one class will contain arbitrarily long arithmetic progressions.

The proof of van der Waerden's theorem works by taking C and n and producing a number V such that a row of V dots, colored with C colors, is guaranteed to contain n evenly-spaced dots of a single color. The smallest such V is denoted V(n, C). For example V(3, 2) is 9, because any row of 9 dots of 2 colors is guaranteed to contain 3 evenly-spaced dots of the same color, but this is not true of every such row of only 8 dots.

Van der Waerden's theorem does not tell you what V(n, C) actually is; it provides only an upper bound. And here's the funny thing about van der Waerden's theorem: the upper bound is incredibly bad.

For V(3, 2), the theorem tells you only that V(3, 2) ≤ 325. That is, it tells you that any row of 325 red and blue dots must contain three evenly spaced dots of the same color. This is true, but oh, so sloppy, since the same is true of any row of 9 dots.

For V(3, 3), the question is how many red, yellow, and blue dots do you need to guarantee three evenly-spaced same-colored dots. The theorem helpfully suggests that:

$$V(3,3) \leq 7(2\cdot3^7+1)(2\cdot3^{7(2\cdot3^7+1)}+1)$$

This is approximately $5.79\cdot10^{14613}$. But what is the actual value of V(3, 3)? It's 27. Urgggh.

In fact, there is a rather large cash prize available to be won by the first person who comes up with a general upper bound for V(n, C) that is smaller than a tower of 2's of height n. (That's $2^{2^{2^{\cdot^{\cdot^{\cdot}}}}}$ with n 2's.)

In the rest of this series, a string which does not contain three evenly-spaced equal symbols will be called good, and one which does contain three such symbols will be called bad. Then a special case of van der Waerden's theorem, with n=3, says that, for any fixed number of symbols, all sufficiently long strings are bad.

In college I wanted to investigate this a little more. In particular, I wanted to calculate V(3, 3). These days you can just look it up on Wikipedia, but in those benighted times such information was hard to come by. I also wanted to construct the longest possible good strings, witnesses of length V(3, 3)-1. Although I did not know it at the time, V(3, 3) = 27, so a witness should have length 26. It turns out that there are exactly 48 witnesses of length 26. Here are the 1/6 of them that begin with RB or RRB:

RRBBRRBYBYYRYRRBRBBYRYYBYB
RRBBYRRYRYBBYYBBYRYRRYBBRR
RRBYBRRYRYBBYYBBYRYRRBYBRR
RBRRBRBYYBBYYBRBRRBYYRRYRY
RBRBBRRYBBYBYRRYYRRYBYBBYR
RBRBBRRYBBYBYRRYYRRYBYBBYB
RBRBBYBRRYRYYBYBBRBRYYRRYY
RBYYBYBRRBBRRBYBYYBRRYYRYR


The rest of the witnesses may be obtained by permuting the colors in these eight.
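Permuting the colors cannot turn a good string into a bad one, since it never changes which positions hold equal symbols. A small sketch of the check and the recoloring follows; `is_good` and `recolor` are hypothetical names, not functions from the programs in this series.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if s has no three evenly spaced equal characters. */
int is_good(const char *s)
{
    size_t len = strlen(s);

    for (size_t d = 1; 2 * d < len; d++)
        for (size_t f = 0; f + 2 * d < len; f++)
            if (s[f] == s[f + d] && s[f + d] == s[f + 2 * d])
                return 0;
    return 1;
}

/* Rewrite s in place, sending R, B, Y to to[0], to[1], to[2].
 * `to` is expected to be a permutation of "RBY"; any other
 * character in s is left unchanged. */
void recolor(char *s, const char to[3])
{
    for (; *s; s++) {
        if (*s == 'R') *s = to[0];
        else if (*s == 'B') *s = to[1];
        else if (*s == 'Y') *s = to[2];
    }
}
```

For example, recoloring the first witness above with "BYR" (R becomes B, B becomes Y, Y becomes R) gives BBYYBBYRYRRBRBBYBYYRBRRYRY, and is_good accepts both strings.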

I wrote a series of C programs around 1988 to exhaustively search for good strings. Last month I was in a meeting and I decided to write the program again for some reason. I wrote a much better program. This series of articles will compare the five programs. I will post the first one tomorrow.

[ Addendum 20071003: Here is part 1. ]

[ Addendum 20071005: Here is part 2. ]

[ Addendum 20071005: I made a mistake in the expression I gave for the upper bound on V(3,3) and left out a factor of 7 in the exponent on the last 3. I had said that the upper bound was around $10^{2092}$, but actually it is more like the seventh power of this. ]

[ Addendum 20071014: Here is part 3. ]