The Universe of Disco


Fri, 07 Sep 2007

Families of scalars
I'm supposedly in the midst of writing a book about fixing common errors in Perl programs, and my canonical example is the family of scalar variables. For instance, code like this:

     if ($FORM{'h01'}) {$checked01 = " CHECKED "}
     if ($FORM{'h02'}) {$checked02 = " CHECKED "}
     if ($FORM{'h03'}) {$checked03 = " CHECKED "}
     if ($FORM{'h04'}) {$checked04 = " CHECKED "}
     if ($FORM{'h05'}) {$checked05 = " CHECKED "}
     if ($FORM{'h06'}) {$checked06 = " CHECKED "}
(I did not make this up; I got it from here.) The flag here is the family $checked01, $checked02, etc. Such code is almost always improved by replacing the family with an array, and the repeated code with a loop:

        $checked[$_] = $FORM{"h$_"} for "01" .. "06";
Actually in this particular case a better solution was to eliminated the checked variables entirely, but that is not what I was planning to discuss. Rather, I planned to discuss a recent instance in which I wrote some code with a family of variables myself, and the fix was somewhat different.

The program I was working on was a digester for the qmail logs, translating them into a semblance of human-readable format. (This is not a criticism; log files need not be human-readable; they need to be easy to translate, scan, and digest.) The program scans the log, gathering information about each message and all the attempts to deliver it to each of its recipient addresses. Each delivery can be local or remote.

Normally the program prints information about each message and all its deliveries. I was adding options to the program to allow the user to specify that only local deliveries or only remote deliveries were of interest.

The first thing I did was to add the option-processing code:

    ...
  } elsif ($arg eq "--local-only" || $arg eq '-L') {
    $local_only = 1;
  } elsif ($arg eq "--remote-only" || $arg eq '-R') {
    $remote_only = 1;
As you see, this is where I made my mistake, and introduced a (two-member) family of variables. The conventional fix says that this should have been something like $do_only{local} and $do_only{remote}. But I didn't notice my mistake right away.

Later on, when processing a message, I wanted to the program to scan its deliveries, and skip all processing and display of the message unless some of its deliveries were of the interesting type:

  if ($local_only || $remote_only) {
        ...
  }
I had vague misgivings at this point about the test, which seemed redundant, but I pressed on anyway, and found myself in minor trouble. Counting the number of local or remote deliveries was complicated:
  if ($local_only || $remote_only) {
    my $n_local_deliveries = 
      grep $msg->{del}{$_}{lr} eq "local", keys %{$msg->{del}};
    my $n_remote_deliveries = 
      grep $msg->{del}{$_}{lr} eq "remote", keys %{$msg->{del}};
    ...
  }
There is a duplication of code here. Also, there is a waste of CPU time, since the program never needs to have both numbers available. This latter waste could be avoided at the expense of complicating the code, by using something like $n_remote_deliveries = keys(%{$msg->{del}}) - $n_local_deliveries, but that is not a good solution.

Also, the complete logic for skipping the report was excessively complicated:

  if ($local_only || $remote_only) {
    my $n_local_deliveries = 
      grep $msg->{del}{$_}{lr} eq "local", keys %{$msg->{del}};
    my $n_remote_deliveries = 
      grep $msg->{del}{$_}{lr} eq "remote", keys %{$msg->{del}};

    return if $local_only  && $local_deliveries == 0
           || $remote_only && $remote_deliveries == 0;

  }
I could have saved the wasted CPU time (and the repeated tests of the flags) by rewriting the code like this:

  if ($local_only) {
    return unless
      grep $msg->{del}{$_}{lr} eq "local", keys %{$msg->{del}};
  } elsif ($remote_only) {
    return unless
      grep $msg->{del}{$_}{lr} eq "remote", keys %{$msg->{del}};
  }
but that is not addressing the real problem, which was the family of variables, $local_only and $remote_only, which inevitably lead to duplicated code, as they did here.

Such variables are related by a convention in the programmer's mind, and nowhere else. The language itself is as unaware of the relationship as if the variables had been named $number_of_nosehairs_on_typical_goat and $fusion_point_of_platinum. A cardinal rule of programming is to make such conventional relationships explicit, because then the programming system can give you some assistance in dealing with them. (Also because then they are apparent to the maintenance programmer, who does not have to understand the convention.) Here, the program was unable to associate $local_only with the string "local" and $remote_only with "remote", and I had to make up the lack by writing additional code.

For families of variables, the remedy is often to make the relationship explicit by using an aggregate variable, such as an array or a hash, something like this:

  if (%use_only) {
    my ($only_these) = keys %use_only;
    return unless
      grep $msg->{del}{$_}{lr} eq $only_these, keys %{$msg->{del}};
  }
Here the relationship is explicit because $use_only{"remote"} indicates an interest in remote deliveries and $use_only{"local"} indicates an interest in local deliveries, and the program can examine the key in the hash to determine what to look for in the {lr} data.

But in this case the alternatives are disjoint, so the %use_only hash will never contain more than one element. The tipoff is the bizarre ($only_these) = keys ... line. Since the hash is really storing a single scalar, it can be replaced with a scalar variable:

  } elsif ($arg eq "--local-only" || $arg eq '-L') {
    $only_these = "local";
  } elsif ($arg eq "--remote-only" || $arg eq '-R') {
    $only_these = "remote";
Then the logic for skipping uninteresting messages becomes:

  if ($only_these) {
    return unless
      grep $msg->{del}{$_}{lr} eq $only_these, keys %{$msg->{del}};
  }
Ahh, better.

A long time ago I started to suspect that flag variables themselves are a generally bad practice, and are best avoided, and I think this example is evidence in favor of that theory. I had a conversation about this yesterday with Aristotle Pagaltzis, who is very thoughtful about this sort of thing. One of our conclusions was that although the flag variable can be useful to avoid computing the same boolean value more than once, if it is worth having, it is because your program uses it repeatedly, and so it is probably testing the same boolean value more than once, and so it is likely that the program logic would be simplified if one could merge the blocks that would have been controlled by those multiple tests into one place, thus keeping related code together, and eliminating the repeated tests.


[Other articles in category /prs] permanent link