The Universe of Discourse


Fri, 07 Sep 2007

Families of scalars
I'm supposedly in the midst of writing a book about fixing common errors in Perl programs, and my canonical example is the family of scalar variables. For instance, code like this:

     if ($FORM{'h01'}) {$checked01 = " CHECKED "}
     if ($FORM{'h02'}) {$checked02 = " CHECKED "}
     if ($FORM{'h03'}) {$checked03 = " CHECKED "}
     if ($FORM{'h04'}) {$checked04 = " CHECKED "}
     if ($FORM{'h05'}) {$checked05 = " CHECKED "}
     if ($FORM{'h06'}) {$checked06 = " CHECKED "}
(I did not make this up; I got it from here.) The flag here is the family $checked01, $checked02, etc. Such code is almost always improved by replacing the family with an array, and the repeated code with a loop:

        $checked[$_] = $FORM{"h$_"} for "01" .. "06";
Actually in this particular case a better solution was to eliminated the checked variables entirely, but that is not what I was planning to discuss. Rather, I planned to discuss a recent instance in which I wrote some code with a family of variables myself, and the fix was somewhat different.

The program I was working on was a digester for the qmail logs, translating them into a semblance of human-readable format. (This is not a criticism; log files need not be human-readable; they need to be easy to translate, scan, and digest.) The program scans the log, gathering information about each message and all the attempts to deliver it to each of its recipient addresses. Each delivery can be local or remote.

Normally the program prints information about each message and all its deliveries. I was adding options to the program to allow the user to specify that only local deliveries or only remote deliveries were of interest.

The first thing I did was to add the option-processing code:

    ...
  } elsif ($arg eq "--local-only" || $arg eq '-L') {
    $local_only = 1;
  } elsif ($arg eq "--remote-only" || $arg eq '-R') {
    $remote_only = 1;
As you see, this is where I made my mistake, and introduced a (two-member) family of variables. The conventional fix says that this should have been something like $do_only{local} and $do_only{remote}. But I didn't notice my mistake right away.

Later on, when processing a message, I wanted to the program to scan its deliveries, and skip all processing and display of the message unless some of its deliveries were of the interesting type:

  if ($local_only || $remote_only) {
        ...
  }
I had vague misgivings at this point about the test, which seemed redundant, but I pressed on anyway, and found myself in minor trouble. Counting the number of local or remote deliveries was complicated:
  if ($local_only || $remote_only) {
    my $n_local_deliveries = 
      grep $msg->{del}{$_}{lr} eq "local", keys %{$msg->{del}};
    my $n_remote_deliveries = 
      grep $msg->{del}{$_}{lr} eq "remote", keys %{$msg->{del}};
    ...
  }
There is a duplication of code here. Also, there is a waste of CPU time, since the program never needs to have both numbers available. This latter waste could be avoided at the expense of complicating the code, by using something like $n_remote_deliveries = keys(%{$msg->{del}}) - $n_local_deliveries, but that is not a good solution.

Also, the complete logic for skipping the report was excessively complicated:

  if ($local_only || $remote_only) {
    my $n_local_deliveries = 
      grep $msg->{del}{$_}{lr} eq "local", keys %{$msg->{del}};
    my $n_remote_deliveries = 
      grep $msg->{del}{$_}{lr} eq "remote", keys %{$msg->{del}};

    return if $local_only  && $local_deliveries == 0
           || $remote_only && $remote_deliveries == 0;

  }
I could have saved the wasted CPU time (and the repeated tests of the flags) by rewriting the code like this:

  if ($local_only) {
    return unless
      grep $msg->{del}{$_}{lr} eq "local", keys %{$msg->{del}};
  } elsif ($remote_only) {
    return unless
      grep $msg->{del}{$_}{lr} eq "remote", keys %{$msg->{del}};
  }
but that is not addressing the real problem, which was the family of variables, $local_only and $remote_only, which inevitably lead to duplicated code, as they did here.

Such variables are related by a convention in the programmer's mind, and nowhere else. The language itself is as unaware of the relationship as if the variables had been named $number_of_nosehairs_on_typical_goat and $fusion_point_of_platinum. A cardinal rule of programming is to make such conventional relationships explicit, because then the programming system can give you some assistance in dealing with them. (Also because then they are apparent to the maintenance programmer, who does not have to understand the convention.) Here, the program was unable to associate $local_only with the string "local" and $remote_only with "remote", and I had to make up the lack by writing additional code.

For families of variables, the remedy is often to make the relationship explicit by using an aggregate variable, such as an array or a hash, something like this:

  if (%use_only) {
    my ($only_these) = keys %use_only;
    return unless
      grep $msg->{del}{$_}{lr} eq $only_these, keys %{$msg->{del}};
  }
Here the relationship is explicit because $use_only{"remote"} indicates an interest in remote deliveries and $use_only{"local"} indicates an interest in local deliveries, and the program can examine the key in the hash to determine what to look for in the {lr} data.

But in this case the alternatives are disjoint, so the %use_only hash will never contain more than one element. The tipoff is the bizarre ($only_these) = keys ... line. Since the hash is really storing a single scalar, it can be replaced with a scalar variable:

  } elsif ($arg eq "--local-only" || $arg eq '-L') {
    $only_these = "local";
  } elsif ($arg eq "--remote-only" || $arg eq '-R') {
    $only_these = "remote";
Then the logic for skipping uninteresting messages becomes:

  if ($only_these) {
    return unless
      grep $msg->{del}{$_}{lr} eq $only_these, keys %{$msg->{del}};
  }
Ahh, better.

A long time ago I started to suspect that flag variables themselves are a generally bad practice, and are best avoided, and I think this example is evidence in favor of that theory. I had a conversation about this yesterday with Aristotle Pagaltzis, who is very thoughtful about this sort of thing. One of our conclusions was that although the flag variable can be useful to avoid computing the same boolean value more than once, if it is worth having, it is because your program uses it repeatedly, and so it is probably testing the same boolean value more than once, and so it is likely that the program logic would be simplified if one could merge the blocks that would have been controlled by those multiple tests into one place, thus keeping related code together, and eliminating the repeated tests.


[Other articles in category /prs] permanent link

Thu, 15 Jun 2006

Just put it in the damn object!
The Perl Program Repair Shop book, despite outward appearances, is coming along. A few months ago in Pittsburgh I gave a talk about someone's module that I didn't think was very well-written, for various reasons. Up to that point, the book had been about a lot of small-scale stuff: repeated code, unnecessary variables, making two passes over a data structure when only one was necessary, C-style for loops, and such. But unlike the previous examples, this one had been written by someone who was actually competent. So the problems I found were competent-programmer examples, rather than incompetent-programmer examples. I need some more chapters of that kind of examples, or the book will not be of much interest to competent programmers.

What kind of mistakes do competent programmers make? They make a lot of errors of object-oriented design.

This module's purpose was to emulate some Java library that lets you register an "observer" of an object. Let's say that the "observer" object is a Guard, and the object it is observing is an "Alarm". The idea is that the Alarm object registers the Guard as being an observer of the Alarm, and then whenever the Alarm calls a notify_observers method, the notify method in the Guard object is called back. Actually the Java people didn't make the names that sensible; instead of notify_observers calling the notify method in the observers, it calls the update method. Why wasn't notify_observers called update_observers, then? I dunno. It's Java. You want me to explain Java?

Okay, so for each Alarm A you need somewhere to store the list of observers of A. Where do you put that?

The author of the module put it in a global hash. I don't think it's immediately clear that this was a mistake. But I do think it was a mistake. Big problems arise; the module had a lot of bugs, mostly related to garbage collection. As a result of putting the Guard objects into this hash, they are never garbage collected. Well, not quite. The author used a weak reference in the hash, so the objects there are garbage collected.

Weak references are one of those technical solutions that fits really well into the formula "A programmer had a problem. So he used weak references. Then he had two problems." (Non-greedy regexes are another example. Some people say Perl itself is an example.) Weak references do solve a couple of very specific problems, mostly having to do with caching. For anything else, they turn out to be a bigger problem than the thing you were trying to solve. In this case, the weak references cause this delightful problem:

  
        my $alarm = Alarm->new();
        $alarm->add_observer(Guard->new);
        $alarm->notify_observers("I like pie!");


        Failed to send observation from 'Alarm=HASH(0x8113f74)' to '':
        Can't call method "update" on an undefined value at
        lib/Class/Observable.pm line 95.

See, the Guard that is observing the Alarm is immediately garbage-collected. We could prevent the fatal error here, but that wouldn't solve the problem, which is that that Guard should not be garbage collected as long as the Alarm is still extant, because it is supposed to be watching the Alarm.

This was only one of several problems caused by this design, to store the observers in a global hash. Some of these could possibly be avoided if Perl were a better language, but others I think not.

My suggestion here was the following advice: Whenever you're trying to store information about an object, there is a right place to store it and a wrong place to store it. The right place: inside the object. The wrong place: anywhere else.

Rationale: What is an object? It's a place where you store all the pertinent information about some entity. So here we have some pertinent information about this entity, the alarm, namely the list of guards who are watching it. Where should we store this information? Well, we have this data structure, the Alarm object, which is supposed to store all the pertinent information about the alarm. The list of guards is part of this, so that's where it should go. Entia non sunt multiplicanda praeter necessitatem.

Now, this choice may be obvious, but it has a fair share of problems too, which is why I think the other programmer might not have made it. The add_observers method in the base class of Alarm is called upon to add the Guard to the list of observers of the Alarm. If the list of observers is stored in the Alarm object, add_observersmust make some assumptions about how the Alarm is implemented. But it turns out that once you start thinking about it, the problems turn out to have fairly simple solutions. For example, the base class needs to know what list to append the Guard to. It could assume that the list will be stored in $alarm->{Observers}, but that's iffy: it assumes that the Alarm will be made from a hash, it assumes that that hash key is not used for something else, and so on. So instead, you give Alarm a method that returns a reference to the array in which the Guard is supposed to place itself. Normally, that method is inherited from the base class itself, and returns a reference to $self->{Observers}, but the Alarm class can override that if it needs to. It can even get back the original global-variable implementation by overriding the method to return a reference to an element in a global hash.

To summarize, both designs appeared to have significant technical problems, but the problems in the design I suggested turned out to be a lot easier to solve well then the problems that arose from the design that the author chose. At least in this example, the advice that object data should be stored in the object turned out pretty well.

(Cynical advice to people wishing to become famous experts: phrase your advice so that it sounds inevitable. Anyone wishing to argue against advice like object data should be stored in the object will have an uphill battle, and risks looking like an idiot.)

Now today I was in the shower, thinking about this other piece of software I wrote recently, in which I was having, surprise, a garbage collection problem. The software is a module that is intended to provide lightweight access to flat-file databases, which is something that has been inexplicably overlooked by CPAN.

You use the module like this:

        my $db = FlatFile->new(FILE => ..., FIELDS => [...], ...);

        my @red = $db->lookup(color => "red");

        $red[0]->set_color("blue");  # paint it blue

        $red[1]->delete;
The elements of @red are Record objects, and the Record methods like set_color and delete need to call back to the FlatFile object that represents the database, to inform it that records have been modified or deleted. So each Record object needs a pointer back to the database to which it belongs.

The way I implemented this was to have a class variable in the Record class, $Record::DB, which contained the database object. But then the database object is never garbage-collected, which means it is never finalized, which means that changes to the database are not automatically written to the disk when the database object becomes inaccessible, which means that you have to call $db->flush to make sure the changes are written out, which opens the possibility that you will forget, and the module will bugger your database with a badger.

What I realized in the shower was that I had better take my own advice, that object data should be stored in the object. If the source database is a pertinent piece of information about a record, then I had better store the source database as a piece of member data in the record object, and not try this stupid thing involving a global variable.

If the advice is good, this will be a better design. I haven't done it yet, so I'm not sure. But it certainly looks like a better design.

I want to say I don't know why I used the global variable in the first place, but I do know: my original design for the record object was that it had nothing in it but the data from the record; no metadata at all. The data members were named after the fields in the database, so there was no namespace left over for metadata. That didn't work out too well, and although I resisted putting metadata into the record objects long enough to screw myself regarding where to store the owning database and some other stuff, I eventually got backed into a corner and had to redesign the object to have space for metadata. But the effects of original misdesign persisted, because I was still storing a bunch of the object's metadata in stupid global variables, until I took that shower.

That's two successes so far for my advice, which is two good signs. Now all I need is a good counterexample. Every piece of advice in this book is going to have a counterexample. That's the big problem with computer programming advice books: no counterexamples. But that's another article for another day.


[Other articles in category /prs] permanent link

Mon, 13 Feb 2006

Accidental Features
Last week I was away in Chicago as part of my Red Flags world tour, about which more in an upcoming entry. I mentioned before that I always learn something surprising from giving these classes, and this time was no exception. I was reviewing a medium-sized piece of code for a client and I encountered the following astonishing construction:

        for $i qw(foo bar baz) {
          # do something with $i
        }
If you're not familiar with Perl, you aren't going to see what is wrong with this. You may not see it for a while even if you are familiar with Perl.

Normally, the syntax for this sort of construction is:

        for $VAR (LIST) {
          ...
        }
where the parentheses around the LIST are not optional. And the qw(...) construction generates a list. So it should be:
        for $i (qw(foo bar baz)) {
          # do something with $i
        }
Leave off the outer parentheses, and you should get a syntax error. So when I saw this code, I said, "Oh, that's a syntax error."

Except, of course, it wasn't. The code I was reviewing was in production on the client's web site, and was processing millions of dollars of orders. If it had bee a syntax error, they would have known right away. And indeed, it worked when I tried it.

But it's not documented anywhere, and I have never seen it before. It is certainly little-known. I believe it is an accidental feature.

This is not the first time that someone has discovered an unintended but useful feature in Perl. Another example that comes to mind is the s/.../.../ee trick. Normally, you give s/.../.../ a pattern and a replacement string, and it performs a substitution, locating some string that matches the pattern and replacing it with the replacement. For example, s/\d+/carrots/ locates a a numeral (a sequence of digits) and replaces it with carrots. But you might want to do something tricky; say you want to replace each numeral with its double. Then you can use s/(\d+)/$1 * 2/e. The $1 is replaced with the digits, so that if the digits are 123, you get 123 * 2, and the trailing e means that this is a Perl expression, rather than a literal string, and should be evaluated. Without the e, 123 becomes 123 * 2; with e, it becomes 246. (Other modifiers were permitted: for example, i made the pattern match case-insensitive.)

The implemenation was to scan the modifiers left-to-right, and to call Perl's expression evaluator if an e was seen. The accidental feature was that the implementation therefore allowed one to put two es to force two evaluations. The typical use of this was to write something like s/(\$\w+)/$1/ee, which finds $foo and replaces it with the value of the variable $foo. The first e evaluates $1 to $foo, and the second evaluates $foo to its value. I believe this was discovered by Tom Christiansen. It's rarely used now, since there are more straightforward ways to accomplish the same thing, but it still appears in three different places in the documentation.

I don't think it says anything good about Perl that this sort of accidental feature shows up in the language. I can't think of anything comparable in any other language. (Except APL.) C has the 4[ptr] syntax, but that's not a feature; it's just a joke. The for $i qw(...) { ... } syntax in Perl seems genuinely valuable.

Accordingly, I dropped a note to the Perl development mailing list, asking if anyone had ever noticed this before, and whether people thought we should commit to supporting it in the future. (It works in all versions of Perl that support qw().) I was surprised that there was no reply. So I suppose that the next step is to prepare a documentation patch describing it and some tests in the test suite for it. If everyone is still asleep when I post those, they may go in without comment, and once a feature is put in, it is very difficult to remove it.


[Other articles in category /prs] permanent link

Wed, 25 Jan 2006

Red Flags world tour: New York City
My wife came up with a brilliant plan to help me make regular progress on my current book. The idea of the book is that I show how to take typical programs and repair and refurbish them. The result usually has between one-third and one-half less code, is usually a little faster, and sometimes has fewer bugs. Lorrie's idea was that I should schedule a series of talks for Perl Mongers groups. Before each talk, I would solicit groups to send me code to review; then I'd write up and deliver the talk, and then afterward I could turn the talk notes into a book chapter. The talks provide built-in, inflexible deadlines, and I love giving talks, so the plan will help keep me happy while writing the book.

The first of these talks was on Monday, in my home town of New York.

'It makes no odds whether a man has a thousand pound, or nothing, there. Particular in New York, I'm told, where Ned landed.'

'New York, was it?' asked Martin, thoughtfully.

'Yes,' said Bill. 'New York. I know that, because he sent word home that it brought Old York to his mind, quite wivid, in consequence of being so exactly unlike it in every respect.'

(Charles Dickens, Martin Chuzzlewit, about which more in some future entry, perhaps.)

The New Yorkers gave me a wonderful welcome, and generously paid my expenses afterward. The only major hitch was that I accidentally wrote my talk about a submission that had come from London. Oops! I must be more careful in the future.

Each time I look at a new program it teaches me something new. Some people, perhaps, seem to be able to reason from general principles to specifics: if you tell them that common code in both branches of a conditional can be factored out, they will immediately see what you mean. Or so they would have you believe; I have my doubts. Anyway, whether they are telling the truth or not, I have almost none of that ability myself. I frequently tell people that I have very little capacity for abstract thought. They sometimes think I'm joking, but I'm not. What I mean is that I can't identify, remember, or understand general principles except as generalizations of specific examples. Whenever I want to study some problem, my approach is always to select a few typical-seeming examples and study them minutely to try to understand what they might have in common. Some people seem to be able to go from abstract properties to conclusions; I can only go from examples.

So my approach to understanding how to improve programs is to collect a bunch of programs, repair them, take notes, and see what sorts of repairs come up frequently, what techniques seem to apply to multiple programs, what techniques work on one program and fail on another, and why, and so on. Probably someone smarter than me would come up with a brilliant general theory about what makes bad programs bad, but that's not how my brain works. My brain is good at coming up with a body of technique. It's a limitation, but it's not all bad.


The goal of generalization had become so fashionable that a generation of mathematicians had become unable to relish beauty in the particular, to enjoy the challenge of solving quantitative problems, or to appreciate the value of technique.

(Ronald L. Graham, Donald E. Knuth, Oren Patashnik, Concrete Mathematics.)

So anyway, here's something I learned from this program. I have this idea now that you should generally avoid the Perl . (string concatenation) operator, because there's almost always a better alternative. The typical use of the . operator looks like this:

          $html = "<a href='".$url."'>".$hot_text."</a>";
It's hard to see here what is code and what is data. You pretty much have to run the Perl lexer algorithm in your head. But Perl has another notation for concatenating strings: "$a$b" concatenates strings $a and $b. If you use this interpolation notation to rewrite the example above, it gets much easier to read:

          $html = "<a href='$url'>$hot_text</a>";
So when I do these classes, I always suggest that whenever you're going to use the . operator, you try writing it as an interpolation too and see which you like better.

This frequently brings on a question about what to do in cases like this:

          $tmpfilealrt = "alert_$daynum" . "_$day" . "_$mon.log" ;
Here you can't eliminate the . operators in this way, because you would get:
          $tmpfilealrt = "alert_$daynum_$day_$mon.log" ;
This fails because it wants to interpolate $daynum_ and $day_, rather than $daynum and $day. Perl has an escape hatch for this situation:
          $tmpfilealrt = "alert_${daynum}_${day}_$mon.log" ;
But it's not clear to me that that is an improvement on the version that used the . operator. The punctuation is only slightly reduced, and you've used an obscure notation that a lot of people won't recognize and that is visually similar to, but entirely unconnected with, hash notation.

Anyway, when this question would come up, I'd discuss it, and say that yeah, in that case it didn't seem to me that the . operator was inferior to the alternatives. But since my review of the program I talked about in New York on Monday, I know a better alternative. The author of that program wrote it like this:

          $tmpfilealrt = "alert_$daynum\_$day\_$mon.log" ;
When I saw it, I said "Duh! Why didn't I think of that?"


[Other articles in category /prs] permanent link