The Universe of Discourse

Mon, 29 Oct 2007

Undefined behavior in Perl and other languages
Miles Gould wrote what I thought was an interesting article on implementation-defined languages, and cited Perl as an example. One of his points was that a language that is defined by its implementation, as Perl is, rather than by a standards document, cannot have any "undefined behavior".

## Undefined behavior

For people unfamiliar with this concept, I should explain briefly. The C standard is full of places that say "if the program contains x, the behavior is undefined", which really means "C programs do not contain x, so If the program contains x, it is not written in C, and, as this standard only defines the meaning of programs in C, it has nothing to say about the meaning of your program." There are around a couple of hundred of these phrases, and a larger number of places where it is implied.

For example, everyone knows that it means when you write x = 4;, but what does it mean if you write 4 = x;? According to clause 6.3.2.1[#1], it means nothing, and this is not a C program. The non-guarantee in this case is extremely strong. The C compiler, upon encountering this locution, is allowed to abort and spontaneously erase all your files, and in doing so it is not violating the requirements of the standard, because the standard does not require any particular behavior in this case.

The memorable phrase that the comp.lang.c folks use is that using that construction might cause demons to fly out of your nose.

[ Addendum 20071030: I am informed that I misread the standard here, and that the behavior of this particular line is not undefined, but requires a compiler diagnostic. Perhaps a better example would have been x = *(char *)0. ]

I mentioned this in passing in one of my recent articles about a C program I wrote:

        unsigned strinc(char *s)
{
char *p = strchr(s, '\0') - 1;
while (p >= s && *p == 'A' + colors - 1) *p-- = 'A';
if (p < s) return 0;
(*p)++;
return 1;
}

Here the pointer p starts at the end of the string s, and the loop might stop when p points to the position just before s. Except no, that is forbidden, and the program might at that moment cause demons to fly out of your nose. You are allowed to have a pointer that points to the position just after an object, but not one that points just before.

Well anyway, I seem to have digressed. My point was that M. Gould says that one advantage of languages like Perl that are defined wholly by their (one) implementation is that you never have "undefined behavior". If you want to know what some locution does, you type it in and see what it does. Poof, instant definition.

Although I think this is a sound point, it occurred to me that that is not entirely correct. The manual is a specification of sorts, and even if the implementation does X in situation Y, the manual might say "The implementation does X in situation Y, but this is unsupported and may change without warning in the future." Then what you have is not so different from Y being undefined behavior. Because the manual is (presumably) a statement of official policy from the maintainers, and, as a communiqué from the people with the ultimate authority to define the future meaning of the language, it has some of the same status that a formal specification would.

## Perl: the static variable hack

Such disclaimers do appear in the Perl documentation. Probably the most significant example of this is the static variable hack. For various implementation reasons, the locution my $static if 0 has a strange and interesting effect:  sub foo { my$static = 42 if 0;
print "static is now $static\n";$static++;
}

foo() for 1..5;

This makes $static behave as a "static" variable, and persist from call to call of foo(). Without the ... if 0, the code would print "static is now 42" five times. But with ... if 0, it prints:  static is now static is now 1 static is now 2 static is now 3 static is now 4  This was never an intentional feature. It arose accidentally, and then people discovered it and started using it. Since the behavior was the result of a strange quirk of the implementation, caused by the surprising interaction of several internal details, it was officially decided by the support group that this behavior would not be supported in future versions. The manual was amended to say that this behavior was explicitly undefined, and might change in the future. It can be used in one-off programs, but not in any important program, one that might have a long life and need to be run under several different versions of Perl. Programs that use pointers that point outside the bounds of allocated storage in C are in a similar position. It might work on today's system, with today's compiler, today, but you can't do that in any larger context. Having the "undefined behavior" be determined by the manual, instead of by a language standard, has its drawbacks. The language standard is fretted over by experts for months. When the C standard says that behavior is undefined, it is because someone like Clive Feather or Doug Gwyn or P.J. Plauger, someone who knows more about C than you ever will, knows that there is some machine somewhere on which the behavior is unsupported and unsupportable. When the Perl manual says that some behavior is undefined, you might be hearing from the Perl equivalent of Doug Gwyn, someone like Nick Clark or Chip Salzenberg or Gurusamy Sarathy. Or you might be hearing from a mere nervous-nellie who got their patch into the manual on a night when the release manager had stayed up too late. ## Perl: modifying a hash in a loop Here is an example of this that has bothered me for a long time. One can use the each() operator to loop lazily over the contents of a hash:  while (my$key = each %hash) {
# do something with $key and$hash{$key} }  What happens if you modify the hash in the middle of the loop? For various implementation reasons, the manual forbids this. For example, suppose the loop code adds a new key to the hash. The hash might overflow as a result, and this would trigger a reorganization that would move everything around, destroying the ordering information. The subsequent calls to each() would continue from the same element of the hash, but in the new order, making it likely that the loop would visit some keys more than once, or some not at all. So the prohibition in that case makes sense: The each() operator normally guarantees to produce each key exactly once, and adding elements to a hash in the middle of the loop might cause that guarantee to be broken in an unpredictable way. Moreover, there is no obvious way to fix this without potentially wrecking the performance of hashes. But the manual also forbids deleting keys inside the loop, and there the issue does not come up, because in Perl, hashes are never reorganized as the result of a deletion. The behavior is easily described: Deleting a key that has already been visited will not affect the each() loop, and deleting one that has not yet been visited will just cause it to be skipped when the time comes. Some people might find this general case confusing, I suppose. But the following code also runs afoul of the "do not modify a hash inside of an each loop" prohibition, and I don't think anyone would find it confusing:  while (my$key = each %hash) {
delete $hash{$key} if is_bad($hash{$key});
}

Here we want to delete all the bad items from the hash. We do this by scanning the hash and deleting the current item whenever it is bad. Since each key is deleted only after it is scanned by each, we should expect this to visit every key in the hash, as indeed it does. And this appears to be a useful thing to write. The only alternative is to make two passes, constructing a list of bad keys on the first pass, and deleting them on the second pass. The code would be more complicated and the time and memory performance would be much worse.

There is a potential implementation problem, though. The way that each() works is to take the current item and follow a "next" pointer from it to find the next item. (I am omitting some unimportant details here.) But if we have deleted the current item, the implementation cannot follow the "next" pointer. So what happens?

In fact, the implementation has always contained a bunch of code, written by Larry Wall, to ensure that deleting the current key will work properly, and that it will not spoil the each(). This is nontrivial. When you delete an item, the delete() operator looks to see if it is the current item of an each() loop, and if so, it marks the item with a special flag instead of deleting it. Later on, the next time each() is invoked, it sees the flag and deletes the item after following the "next" pointer.

So the implementation takes some pains to make this work. But someone came along later and forbade all modifications of a hash inside an each loop, throwing the baby out with the bathwater. Larry and perl paid a price for this feature, in performance and memory and code size, and I think it was a feature well bought. But then someone patched the manual and spoiled the value of the feature. (Some years later, I patched the manual again to add an exception for this case. Score!)

## Perl: modifying an array in a loop

Another example is the question of what happens when you modify an array inside a loop over the array, as with:

  @a = (1..3);
for (@a) {
print;
push @a, $_ + 3 if$_ % 2 == 1;
}

(This prints 12346.) The internals are simple, and the semantics are well-defined by the implementation, and straightforward, but the manual has the heebie-jeebies about it, and most of the Perl community is extremely superstitious about this, claiming that it is "entirely unpredictable". I would like to support this with a quotation from the manual, but I can't find it in the enormous and disorganized mass that is the Perl documentation.

[ Addendum: Tom Boutell found it. The perlsyn page says "If any part of LIST is an array, foreach will get very confused if you add or remove elements within the loop body, for example with splice. So don't do that." ]

The behavior, for the record, is quite straightforward: On the first iteration, the loop processes the first element in the array. On the second iteration, the loop processes the second element in the array, whatever that element is at the time the second iteration starts, whether or not that was the second element before. On the third iteration, the loop processes the third element in the array, whatever it is at that moment. And so the loop continues, terminating the first time it is called upon to process an element that is past the end of the array. We might imagine the following pseudocode:

        index = 0;
while (index < array.length()) {
process element array[index];
index += 1;
}

There is nothing subtle or difficult about this, and claims that the behavior is "entirely unpredictable" are probably superstitious confessions of ignorance and fear.

Let's try to predict the "entirely unpredictable" behavior of the example above:

  @a = (1..3);
for (@a) {
print;
push @a, $_ + 3 if$_ % 2 == 1;
}

Initially the array contains (1, 2, 3), and so the first iteration processes the first element, which is 1. This prints 1, and, since 1 is odd, pushes 4 onto the end of the array.

The array now contains (1, 2, 3, 4), and the loop processes the second element, which is 2. 2 is printed. The loop then processes the third element, printing 3 and pushing 6 onto the end. The array now contains (1, 2, 3, 4, 6).

On the fourth iteration, the fourth element (4) is printed, and on the fifth iteration, the fifth element (6) is printed. That is the last element, so the loop is finished. What was so hard about that?

## Haskell: n+k patterns

My blog was recently inserted into the feed for planet.haskell.org, and of course I immediately started my first streak of posting code-heavy articles about C and Perl. This is distressing not just because the articles were off-topic for Planet Haskell—I wouldn't give the matter two thoughts if I were posting my usual mix of abstract math and stuff—but it's so off-topic that it feels weird to see it sitting there on the front page of Planet Haskell. So I thought I'd make an effort to talk about Haskell, as a friendly attempt to promote good relations between tribes. I'm not sure what tribe I'm in, actually, but what the heck. I thought about Haskell a bit, and a Haskell example came to mind.

Here is a definition of the factorial function in Haskell:

        fact 0 = 1
fact n = n * fact (n-1)

I don't need to explain this to anyone, right?

Okay, now here is another definition:

        fact 0     = 1
fact (n+1) = (n+1) * fact n

Also fine, and indeed this is legal Haskell. The pattern n+1 is allowed to match an integer that is at least 1, say 7, and doing so binds n to the value 6. This is by a rather peculiar special case in the specification of Haskell's pattern-matcher. (It is section 3.17.2#8 of Haskell 98 Language and Libraries: The Revised Report, should you want to look it up.) This peculiar special case is known sometimes as a "successor pattern" but more often as an "n+k pattern".

The spec explicitly deprecates this feature:

Many people feel that n+k patterns should not be used. These patterns may be removed or changed in future versions of Haskell.

(Page 33.) One wonders why they put it in at all, if they were going to go ahead and tell you not to use it. The Haskell committee is usually smarter than this.

I have a vague recollection that there was an argument between people who wanted to use Haskell as a language for teaching undergraduate programming, and those who didn't care about that, and that this was the compromise result. Like many compromises, it is inferior to both of the alternatives that it interpolates between. Putting the feature in complicates the syntax and the semantics of the language, disrupts its conceptual purity, and bloats the spec—see the Perlesque yikkity-yak on pages 57–58 about how x + 1 = ... binds a meaning to +, but (x + 1) = ... binds a meaning to x. Such complication is worth while only if there is a corresponding payoff in terms of increased functionality and usability in the language. In this case, the payoff is a feature that can only be used in one-off programs. Serious programs must avoid it, since the patterns "may be removed or changed in future versions of Haskell". The Haskell committee purchased this feature at a certain cost, and it is debatable whether they got their money's worth. I'm not sure which side of that issue I fall on. But having purchased the feature, the committee then threw it in the garbage, squandering their sunk costs. Oh well. Not even the Haskell committee is perfect.

I think it might be worth pointing out that the version of the program with the n+k pattern is technically superior to the other version. Given a negative integer argument, the first version recurses forever, possibly taking a long time to fail and perhaps taking out the rest of the system on which it is running. But the n+k version fails immediately, because the n+1 pattern will only match an integer that is at least 1.

## XML screws up

The "nasal demons" of the C standard are a joke, but a serious one. The C standard defines what C compilers must do when presented with C programs; it does not define what they do when presented with other inputs, nor what other software does when presented with C programs. The authors of C standard clearly understood the standard's role in the world.

Earlier versions of the XML standard were less clear. There was a particularly laughable clause in the first edition of the XML 1,0 standard:

XML documents may, and should, begin with an XML declaration which specifies the version of XML being used. For example, the following is a complete XML document, well-formed but not valid:

<?xml version="1.0"?>
<greeting>Hello, world!</greeting>

...

The version number "1.0" should be used to indicate conformance to this version of this specification; it is an error for a document to use the value "1.0" if it does not conform to this version of this specification.

(Emphasis is mine.) The XML 1.0 spec is just a document. It has no power, except to declare that certain files are XML 1.0 and certain files are not. A file that complies with the requirements of the spec is XML 1.0; all other files are not XML 1.0. But in the emphasized clause, the spec says that certain behavior "is an error" if it is exhibited by documents that do not conform to the spec. That is, it is declaring certain non-XML-1.0 documents "erroneous". But within the meaning of the spec, "erroneous" simply means that the documents are not XML 1.0. So the clause is completely redundant. Documents that do not conform to the spec are erroneous by definition, whether or not they use the value "1.0".

It's as if the Catholic Church issued an edict forbidding all rabbis from wearing cassocks, on pain of excommunication.

I am happy to discover that this dumb error has been removed from the most recent edition of the XML 1.0 spec.