The Universe of Discourse
           
Mon, 30 Apr 2007

Excessive precision
You sometimes read news articles that say that some object is 98.42 feet tall, and it is clear what happened was that the object was originally reported to be 30 meters tall, and some knucklehead translated 30 meters to 98.42 feet, instead of to 100 feet as they should have.

Finding a real example for you was easy: I just did Google search for "62.14 miles", and got this little jewel:

Tsunami waves can be up to 62.14 miles long! They can also be about three feet high in the middle of the ocean. Because of its strong underwater energetic force, the tsunami can rise up to 90 feet, in extreme cases, when they hit the shore! Tsunami waves act like shallow water waves because they are so long. Because it is so long, it can last an hour. In the Pacific Ocean, a tsunami moves 60.96 feet a second, passing through water that is around 1219.2 feet deep.

(library.thinkquest.org.)

The 60.96 feet per second is actually 100 km/hr, but I'm not sure what's going on with the 1219.2 feet deep. Is it 1/5 nautical mile? But that would be strange. [ Addendum 20070428: the explanation.]

Here's another delightful example:

The MiniC.A.T. is very cost-efficient to operate. According to MDI, it costs less than one dollar per 62.14 miles... Given the absence of combustion and the fact that the MiniC.A.T. runs on vegetable oil, oil changes are only necessary every 31,068 miles.

(eesi.org.)

(I should add that many of the hits for "62.14 miles" were perfectly legitimate. Many concerned 100-km bicycle races, or the conditions for winning the X-prize. In both cases the distance is in fact 62.14 miles, not 62.13 or 62.15, and the precision is warranted. But I digress.)

(Long ago there was a parody of the New York Times which included a parody sports section that announced "FOOTBALL TO GO METRIC". The article revealed that after the change, the end zones would be placed 91.44 meters apart...)

Anyway, similar knuckleheadedness occurs in the well-known value of 98.6 degrees Fahrenheit for normal human body temperature. Human body temperature varies from individual to individual, and by a couple of degrees over the course of the day, so citing the "normal" temperature to a tenth of a degree is ridiculous. The same thing happened here as with the 62.14-mile tsunami. Normal human body temperature was determined to be around 37 degrees Celsius, and then some knucklehead translated 37°C to 98.6°F instead of to 98°F.

When our daughter was on the way, Lorrie and I took a bunch of classes on baby care. Several of these emphasized that the maximum safe spacing for the bars of a crib, rails of a banister, etc., was two and three-eighths inches. I was skeptical, and at one of these classes I was foolish enough to ask if that precision were really required: was two and one-half inches significantly less safe? How about two and seven-sixteenths inches? The answer was immediate and unequivocal: two and one-half inches was too far apart for safety; two and three-eighths inches is the maximum safe distance.

All the baby care books say the same thing. (For example...)

But two and three-eighths inches is 6.0325 cm, so draw your own conclusion about what happened here.

[ Addendum 20070430: 60.96 feet per second is nothing like 100 km/hr, and I have no idea why I said it was. The 60.96 feet per second appears to be a backwards conversion of 200 m/s to ft/s, multiplying by 3.048 instead of dividing. As Scott turner noted a few days ago, a similar error occurs in the conversion of meters to feet in the "1219.2 feet deep" clause. ]


[Other articles in category /physics] permanent link

1219.2 feet
In Thursday's article, I quoted an article about tsunamis that asserted that:

In the Pacific Ocean, a tsunami moves 60.96 feet a second, passing through water that is around 1219.2 feet deep.
"60.96 feet a second" is an inept conversion of 100 km/h to imperial units, but I wasn't able to similarly identify 1219.2 feet.

Scott Turner has solved the puzzle. 1219.2 feet is an inept conversion of 4000 meters to imperial units, obtained by multiplying 4000 by 0.3048, because there are 0.3048 meters in a foot.

Thank you, M. Turner.

[ Addendum 20070430: 60.96 feet per second is nothing like 100 km/hr, and I have no idea why I said it was. The 60.96 feet per second is a backwards conversion of 200 m/s. ]


[Other articles in category /physics] permanent link

Tue, 24 Apr 2007

favicon.ico
I'm getting hundreds of failed requests for favicon.ico. I need to come up with a good-looking icon for the blog. Since the blog isn't about Higher-Order Perl, I don't want to use the HOP cover or the HOP quilt block icon .

I have not had any good ideas. So I am asking LazyNet.

The best suggestion will win a free one-year subscription to The Universe of Discourse blog. If the winner supplies an icon that I can actually use, the prize will be increased to a free two-year subscription.

Any suggestions?

[ Addendum 20061120: David Eppstein responded immediately with , which I will explain in a future article that summarizes the suggestions. M. Eppstein's suggestion is good enough that I am comfortable installing it right away. But don't let that stop you from coming up with your own suggestion. ]

[ Addendum 20061120: Neil Kandalgaonkar has contributed . Thank you, Neil! ]


[Other articles in category /meta] permanent link

Tue, 17 Apr 2007

Who farted?
The software that generates these web pages from my blog entries is Blosxom. When I decided I wanted to try blogging, I did a web search for something like "simplest possible blogging software", and found a page that discussed Bryar. So I located the Bryar manual. The first sentence in the Bryar manual says "Bryar is a piece of blog production software, similar in style to (but considerably more complex than) Rael Dornfest's blosxom." So I dropped Bryar and got Blosxom. This was a smashing success: It was running within ten minutes. Now, two months later, I'm thinking about moving everything to Bryar, because Blosxom really sucks. . But for me it was a huge success, and I probably wouldn't have started a blog if I hadn't found it. It sucks in the best possible way: because it's drastically under-designed and excessively lightweight. That is much better than sucking because it is drastically over-designed and excessively heavyweight. It also sucks in some less ambivalent ways, but I'm not here (today) to criticize Blosxom, so let's move on.

Until the big move to Bryar, I've been writing plugins for Blosxom. When I was shopping for blog software, one of my requirements that that the thing be small and simple enough to be hackable, since I was sure I would inevitably want to do something that couldn't be accomplished any other way. With Blosxom, that happened a lot sooner than it should have (the plugin interface is inadequate), but the fallback position of hacking the base code worked quite well, because it is fairly small and fairly simple.

Most recently, I had to hack the Blosxom core to get the menu you can see on the left side of the page, with the titles of the recent posts. This should have been possible with a plugin. You need a callback (story) in the plugin that is invoked once per article, to accumulate a list of title-URL pairs, and then you need another callback (stories_done) that is invoked once after all the articles have been scanned, to generate the menu and set up a template variable for insertion into the template. Then, when Blosxom fills the template, it can insert the menu that was set up by the plugin.

With stock Blosxom, however, this is impossible. The first problem you encounter is that there is no stories_done callback. This is only a minor problem, because you can just have a global variable that holds the complete menu so far at all times; each time story is invoked, it can throw away the incomplete menu that it generated last time and replace it with a revised version. After the final call to story, the complete menu really is complete:

        sub story {
          my ($pkg, $path, $filename, $story_ref, $title_ref, $body_ref,
                 $dir, $date) = @_;
          return unless $dir eq "" && $date eq "";
          return unless $blosxom::flavour eq "html";
          my $link = qq{<a href="$blosxom::url$path/$filename.$blosxom::flavour">} .
                     qq{<span class=menuitem>$$title_ref</span></a>};
          push @menu, $link;
          $menu = join "&nbsp;/ ", @menu;
        }
That strategy is wasteful of CPU time, but not enough to notice. You could also fix it by hacking the base code to add a stories_done callback, but that strategy is wasteful of programmer time.

But it turns out that this doesn't work, and I don't think there is a reasonable way to get what I wanted without hacking the base code. (Blosxom being what it is, hacking the base code is a reasonable solution.) This is because of a really bad architecture decision inside of Blosxom. The page is made up of three independent components, called the "head", the "body", and the "foot". There are separate templates of each of these. And when Blosxom generates a page, it does so in this sequence:

  1. Fill "head" template and append result to output
  2. Process articles
  3. Fill "body" template and append result to output
  4. Fill "foot" template and append result to output
This means that at the time the "head" template is filled, Blosxom has not yet seen the articles. So if you want the article's title to appear in the HTML <title> element, you are out of luck, because the article has not yet been read at the time that the <title> element is generated. And if you want a menu of article titles on the left-hand side of the page, you are out of luck, because the article has not yet been read at the time that the left-hand side of the page was generated.

So I had to go in and hack the Blosxom core to make it fractionally less spiky:

  1. Process articles
  2. Fill "head" template and append result to output
  3. Fill "body" template and append result to output
  4. Fill "foot" template and append result to output
It's tempting to finish this article right now with a long explanation of the philosophical mistakes of this component of Blosxom's original design, how it should have been this:

  1. Process articles
  2. Fill and output templates:
    1. Fill "head" template and append result to output
    2. Fill "body" template and append result to output
    3. Fill "foot" template and append result to output
Or an analysis of Blosxom's confusion between structure and presentation, and so forth. Why three templates? Why not one? The one template, as distributed with the package, could simply have been:

        #include head
        #include body
        #include foot
But if I were to finish the article with that discussion, then the first half of the article would have been relevant and to the point, and as regular readers know, We Don't Do That Here. No, there must come a point about a third of the way through each article at which I say "But anyway, the real point of this note is...".

Anyway, the real point of this note is to discuss was the debugging technique I used to fix Blosxom after I made this core change, which broke the output. The menu was showing up where I wanted, but now all the date headers (like the "Sat, 18 Mar 2006" one just above) appeared at the very top of the page body, before all the articles.

The way Blosxom generates output is by appending it to a variable $output, which eventually accumulates all the output, and is printed once, right at the very end. I had found the code that generated the "head" part of the output; this code ended with $output .= $head to append the head part to the complete output. I moved this section down, past the (much more complicated) section that scanned the articles themselves, so that the "head" template, when filled, would have access to information (such as menus) accumulated while scanning the articles.

But apparently some other part of the program was inserting the date headers into the output while scanning the articles. I had to find this. The right thing to do would have been just to search the code for $output. This was not sure to work: there might have been dozens of appearances of $output, making it a difficult task to determine which one was the responsible appearance. Also, any of the plugins, not all of which were written by me, could have been modifying $output, so it would not have been enough just to search the base code.

Still, I should have started by searching for $output, under the look-under-the-lamppost principle. (If you lose your wallet in a dark street start by looking under the lamppost. The wallet might not be there, but you will not waste much time on the search.) If I had looked under the lamppost, I would have found the lost wallet:

      $curdate ne $date and $curdate = $date and $output .= $date;
That's not what I did. Daunted by the theoretical difficulties, I got out a big hammer. The hammer is of some technical interest, and for many problems it is not too big, so there is some value in presenting it.

Perl has a feature called tie that allows a variable to be enchanted so that accesses to it are handled by programmer-defined methods instead of by Perl's usual internal processes. When a tied variable $foo is stored into, a STORE method is called, and passed the value to be stored; it is the responsibility of STORE to put this value somewhere that it can be accessed again later by its counterpart, the FETCH method.

There are a lot of problems with this feature. It has an unusually strong risk of creating incomprehensible code by being misused, since when you see something like $this = $that you have no way to know whether $this and $that are tied, and that what you think is an assignment expression is actually STORE($this, FETCH($that)), and so is invoking two hook functions that could have completely arbitrary effects. Another, related problem is that the compiler can't know that either, which tends to disable the possibility of performing just about any compile-time optimization you could possibly think of. Perhaps you would like to turn this:

        for (1 .. 1000000) { $this = $that }
into just this:

        $this = $that;
Too bad; they are not the same, because $that might be tied to a FETCH function that will open the pod bay doors the 142,857th time it is called. The unoptimized code opens the pod bay doors; the "optimized" code does not.

But tie is really useful for certain kinds of debugging. In this case the question is "who was responsible for inserting those date headers into $output?" We can answer the question by tieing $output and supplying a STORE callback that issues a report whenever $output is modified. The code looks like this:

        package who_farted;

        sub TIESCALAR {
          my ($package, $fh) = @_;
          my $value = "";
          bless { value => \$value, fh => $fh } ;
        }
This is the constructor; it manufactures an object to which the FETCH and STORE methods can be directed. It is the responsibility of this object to track the value of $output, because Perl won't be tracking the value in the usual way. So the object contains a "value" member, holding the value that is stored in $object. It also contains a filehandle for printing diagnostics. The FETCH method simply returns the current stored value:

        sub FETCH {
          my $self = shift;
          ${$self->{value}};
        }
The basic STORE method is simple:

        sub STORE {
          my ($self, $val) = @_;
          my $fh = $self->{fh};
          print $fh "Someone stored '$val' into \$output\n";
          ${$self->{value}} = $val;
        }
It issues a diagnostic message about the new value, and stores the new value into the object so that FETCH can get it later. But the diagnostic here is not useful; all it says is that "someone" stored the value; we need to know who, and where their code is. The function st() generates a stack trace:

        sub st {
          my @stack;
          my $spack = __PACKAGE__;
          my $N = 1;
          while (my @c = caller($N)) {
            my (undef, $file, $line, $sub) = @c;
            next if $sub =~ /^\Q$spack\E::/o;
            push @stack, "$sub ($file:$line)";
          } continue { $N++ }
          @stack;
        }
Perl's built-in caller() function returns information about the stack frame N levels up. For clarity, st() omits information about frames in the who_farted class itself. (That's the next if... line.)

The real STORE dumps the stack trace, and takes some pains to announce whether the value of $output was entirely overwritten or appended to:

        sub STORE {
          my ($self, $val) = @_;
          my $old = $ {$self->{value}};
          my $olen = length($old);
          my ($act, $what) = ("set to", $val);
          if (substr($val, 0, $olen) eq $old) {
            ($act, $what) = ("appended", substr($val, $olen));
          }
          $what =~ tr/\n/ /;
          $what =~ s/\s+$//;
          my $fh = $self->{fh};
          print $fh "var $act '$what'\n";
          print $fh "  $_\n" for st();
          print $fh "\n";
          ${$self->{value}} = $val;
        }
To use this, you just tie $output at the top of the base code:

        open my($DIAGNOSTIC), ">", "/tmp/who-farted" or die $!;
        tie $output, 'who_farted', $DIAGNOSTIC;
This worked well enough. The stack trace informed me that the modification of interest was occurring in the blosxom::generate function at line 290 of blosxom.cgi. That was the precise location of the lost wallet. It was, as I said, a big hammer to use to squash a mosquito of a problem---but the mosquito is dead.

A somewhat more useful version of this technique comes in handy in situations where you have some program, say a CGI program, that is generating the wrong output; maybe there is a "1" somewhere in the middle of the output and you don't know what part of the program printed "1" to STDOUT. You can adapt the technique to watch STDOUT instead of a variable. It's simpler in some ways, because STDOUT is written to but never read from, so you can eliminate the FETCH method and the data store:

        package who_farted;

        sub rig_fh {
          my ($handle, $diagnostic) = shift;
          my $mode = shift || "<";
          open my($aux_handle), "$mode&=", $handle or die $!;
          tie *$handle, __PACKAGE__, $aux_handle, $diagnostic;
        }

        sub TIEHANDLE {
          my ($package, $aux_handle, $diagnostic) = @_;
          bless [$aux_handle, $diagnostic] => $package;
        }

        sub PRINT {
          my ($aux_handle, $diagnostic) = @$self;
          print $aux_handle @_;
          my $str = join("", @_);
          print $diagnostic "$str:\n";
          print $diagnostic "  $_\n" for st();
          print $diagnostic "\n";
        }
To use this, you put something like rig_fh(\*STDOUT, $DIAGNOSTIC, ">") in the main code. The only tricky part is that some part of the code (rig_fh here) must manufacture a new, untied handle that points to the same place that STDOUT did before it was tied, so that the output can actually be printed.

Something it might be worth pointing out about the code I showed here is that it uses the very rare one-argument form of bless:

        package who_farted;

        sub TIESCALAR {
          my ($package, $fh) = @_;
          my $value = "";
          bless { value => \$value, fh => $fh } ;
        }
Normally the second argument should be $package, so that the object is created in the appropriate package; the default is to create it in package who_farted. Aren't these the same? Yes, unless someone has tried to subclass who_farted and inherit the TIESCALAR method. So the only thing I have really gained here is to render the TIESCALAR method uninheritable. <sarcasm>Gosh, what a tremendous benefit</sarcasm>. Why did I do it this way? In regular OO-style code, writing a method that cannot be inherited is completely idiotic. You very rarely have a chance to write a constructor method in a class that you are sure will never be inherited. This seemed like such a case. I decided to take advantage of this non-feature of Perl since I didn't know when the opportunity would come by again.

"Take advantage of" is the wrong phrase here, because, as I said, there is not actually any advantage to doing it this way. And although I was sure that who_farted would never be inherited, I might have been wrong; I have been wrong about such things before. A smart programmer requires only ten years to learn that you should not do things the wrong way, even if you are sure it will not matter. So I violated this rule and potentially sabotaged my future maintenance efforts (on the day when the class is subclassed) and I got nothing, absolutely nothing in return.

Yes. It is completely pointless. A little Dada to brighten your day. Anyone who cannot imagine a lobster-handed train conductor sleeping in a pile of celestial globes is an idiot.


[Other articles in category /oops] permanent link

Sun, 15 Apr 2007

Happy birthday Leonhard Euler
Leonhard Euler, one of the greatest and most prolific mathematicians ever to walk the earth, was born 300 years ago today in Basel, Switzerland.

Euler named the constant e (not for himself; he used vowels for constants and had already used a for something else), and discovered the astonishing formula $e^{ix} = \cos x + i \sin x$, which is known as Euler's formula. A special case of this formula is the Euler identity: $e^{i\pi} + 1 = 0$.

Order
Visual Complex Analysis
Visual Complex Analysis
with kickback
no kickback
I never really understood what was going on there until last year, when I read the utterly brilliant book Visual Complex Analysis, by Tristan Needham. This was certainly the best math book I read in all of 2006, and probably the best one I've read in the past five years. (Many thanks to Dan Schmidt for rcommending it.)

The brief explanantion is something like this: the exponential function ect is exactly the function that satisfies the differential equation df/dt = cf(t). That is, it is the function that describes the motion of a particle whose velocity is proportional to its position at all times.

Imagine a particle moving on the real line. If its velocity is proportional to its position, it will speed away from the origin at an exponentially increasing rate. Or, if the proportionality constant is negative, it will rapidly approach the origin, getting closer (but never quite reaching it) at an exponentially increasing rate.

Now, suppose we consider a particle moving on the complex plane instead of on the real line, again with velocity proportional to position. If the proportionality constant is real, the particle will speed away from the origin (or towards it, if the constant is negative), as before. But what if the proportionality constant is imaginary?

A proportionality constant of i means that the velocity of the particle is at right angles to the position, because multiplication by i in the complex plane corresponds to a counterclockwise rotation by 90°, as always. In this case, the path of the particle is a circle, and so its position as a function of t is described by something like cos t + i sin t. But this function must satisfy the differential equation also, with c = i, and we have Euler's formula.

Another famous and important formula named after Euler is also called Euler's formula, and states that for any simply-connected polyhedron with F faces, E edges, and V vertices, F - E + V = 2. For example, the cube has 6 faces, 12 edges, and 8 vertices, and indeed 6 - 12 + 8 = 2. The formula also holds for all planar graphs and is the fundamental result of planar graph theory.

Spheres in this case behave like planes, and graphs that cover spheres also satisfy F - E + V = 2. One then wonders whether the theorem holds for more complex surfaces, such as tori; this is equivalent to asking about polyhedra that have a single hole. In this case, the the theorem is a little different, and the identity becomes F - E + V = 0.

It turns out that every surface S has a value χ(S), called the Euler characteristic, such that graphs on the surface all satisfy F - E + V = χ(S).

Euler also discovered that the sum of the first n terms of the harmonic series, 1 + 1/2 + 1/3 + ... + 1/n, is approximately log n. We might like to say that it becomes arbitrarily close to log n, as so many things do, but it does not. It is always a bit larger than log n, and you cannot make it as close as you want. The more terms you take, the closer the sum gets to log n + γ, where γ is approximately 0.577216. This γ is Euler's constant:

$$\gamma = \lim_{n\rightarrow\infty}\left({\sum_{i=1}^n {1\over i} - \ln n}\right)$$

This is one of those numbers that shows up all over the place, and is easy to calculate, but is a big fat mystery. Is it rational? Everyone would be shocked if it were, but nobody knows for sure.

The Euler totient function φ(x) counts the number of integers less than x that have no divisors in common with x. It is of tremendous importance in combinmatorics and number theory. One of the most fundamental and astonishing facts about the totient function is Euler's theorem: aφ(n) - 1 is a multiple of n whenever a and n have no divisors in common. For example, since &phi(9) = 6, a6 - 1 is a multiple of 9, except when a is divisible by 3:

16 - 1= 9.
26 - 1= 9.
46 - 1= 455·9.
56 - 1= 1736·9.
76 - 1= 13072·9.

Euler's solution in 1736 of the "bridges of Königsberg" problem is often said to have begun the study of topology. It is also the source of the term "Eulerian path".

Wikipedia lists forty more items that are merely named for Euler. The list of topics that he discovered, invented, or contributed to would be far too large to actually construct.

Happy birthday, Leonhard Euler.


[Other articles in category /anniversary] permanent link

Thu, 12 Apr 2007

A security problem in a CGI program: addenda

Shell-less piping in Perl

In my previous article, I said:

Unfortunately, there is no easy way to avoid the shell when running a command that is attached to the parent process via a pipe. Perl provides open "| command arg arg arg...", which is what I used, and which is analogous to [system STRING], involving the shell. But it provides nothing analogous to [system ARGLIST], which avoids the shell. If it did, then I probably would have used it, writing something like this:

        open M, "|", $MAILER, "-fnobody\@plover.com", $addre;
and the whole problem would have been avoided.

Several people wrote to point out that, as of Perl 5.8.0, Perl does provide this, with a syntax almost identical to what I proposed:

        open M, "|-", $MAILER, "-fnobody\@plover.com", $addre;
Why didn't I use this? The program was written in late 2002, and Perl 5.8.0 was released in July 2002, so I expect it's just that I wasn't familiar with the new feature yet. Why didn't I mention it in the original article? As I said, I just got back from Asia, and I am still terribly jetlagged.

(Jet lag when travelling with a toddler is worse than normal jet lag, because nobody else can get over the jet lag until the toddler does.)

Jeff Weisberg also pointed out that even prior to 5.8.0, you can write:

        open(F, "|-") || exec("program", "arg", "arg", "arg");
Why didn't I use this construction? I have run out of excuses. Perhaps I was jetlagged in 2002 also.

RFC 822

John Berthels wrote to point out that my proposed fix, which rejects all inputs containing spaces, also rejects some RFC822-valid addresses. Someone whose address was actually something like "Mark Dominus"@example.com would be unable to use the web form to subscribe to the mailing list.

Quite so. Such addresses are extremely rare, and people who use them are expected to figure out how to subscribe by email, rather than using the web form.

qmail

Nobody has expressed confusion on this point, but I want to expliticly state that, in my opinion, the security problem I described was entirely my fault, and was not due to any deficiency in the qmail mail system, or in its qmail-inject or qmail-queue components.

Moreover, since I have previously been paid to give classes at large conferences on how to avoid exactly this sort of problem, I deserve whatever scorn and ridicule comes my way because of this.

Thanks to everyone who wrote in.


[Other articles in category /oops] permanent link

A security problem in a CGI program
<sarcasm>No! Who could possibly have predicted such a thing?</sarcasm>

I was away in Asia, and when I got back I noticed some oddities in my mail logs. Specifically, Yahoo! was rejecting mail.plover.com's outgoing email. In the course of investigating the mail logs, I discovered the reason why: mail.plover.com had been relaying a ton of outgoing spam.

It took me a little while to track down the problem. It was a mailing list subscription form on perl.plover.com:

your address
perl-qotw perl-qotw-discuss perl-qotw-discuss-digest

The form took the input email address and used it to manufacture an email message, requesting that that address be subscribed to the indicated lists:

        my $MAILER = '/var/qmail/bin/qmail-inject';
        ...

        for (@lists) {
          next unless m|^perl-qotw[\-a-z]*|;
          open M, "|$MAILER" or next;
          print M "From: nobody\@plover.com\n";
          print M "To: $_-subscribe-$addre\@plover.com\n";
          print M "\nRequested by $ENV{REMOTE_ADDR} at ", scalar(localtime), "\n";
          close M or next;
          push @DONE, $_;
        }
The message was delivered to the list management software, which interpreted it as a request to subscribe, and generated an appropriate confirmation reply. In theory, this doesn't open any new security holes, because a malicious remote user could also forge an identical message to the list management software without using the form.

The problem is the interpolated $addre variable. The value of this variable is essentially the address from the form. Interpolating user input into a string like this is always fraught with peril. Daniel J. Bernstein has one of the most succinct explanantions of this that I have ever seen:

The essence of user interfaces is parsing: converting an unstructured sequence of commands, in a format usually determined more by psychology than by solid engineering, into structured data.

When another programmer wants to talk to a user interface, he has to quote: convert his structured data into an unstructured sequence of commands that the parser will, he hopes, convert back into the original structured data.

This situation is a recipe for disaster. The parser often has bugs: it fails to handle some inputs according to the documented interface. The quoter often has bugs: it produces outputs that do not have the right meaning. Only on rare joyous occasions does it happen that the parser and the quoter both misinterpret the interface in the same way.

When the original data is controlled by a malicious user, many of these bugs translate into security holes.

In this case, I interpolated user data without quoting, and suffered the consequences.

The malicious remote user supplied an address of the following form:

        into9507@plover.com
        Content-Transfer-Encoding: quoted-printable
        Content-Type: text/plain
        From: Alwin Bestor <Morgan@plover.com>
        Subject: Enhance your love life with this medical marvel
        bcc: hapless@example.com,recipients@example.com,of@example.net,
             spam@example.com,...

        Attention fellow males..

        If you'd like to have stronger, harder and larger erections,
        more power and intense orgasms, increased stamina and
        ejaculatory control, and MUCH more,...

        .
(Yes, my system was used to send out penis enlargement spam. Oh, the embarrassment.)

The address contained many lines of data, separated by CRNL, and a complete message header. Interpolated into the subscription message, the bcc: line caused the qmail-inject user ineterface program to add all the "bcc" addresses to the outbound recipient list.

Several thoughts occur to me about this.

User interfaces and programmatic interfaces

The problem would probably not have occurred had I used the qmail-queue progam, which provides a programmatic interface, rather than qmail-inject, which provides a user interface. I originally selected qmail-inject for convenience: it automatically generates Date and Message-ID fields, for example. The qmail-queue program does not try to parse the recipient information from the message header; it takes recipient information in an out-of-band channel.

Perl piped open is deficient

Perl's system and exec functions have two modes. One mode looks like this:

        system "command arg arg arg...";
If the argument string contains shell metacharacters or certain other constructions, Perl uses the shell to execute the command; otherwise it forks and execs the command directly. The shell is the cause of all sorts of parsing-quoting problems, and is best avoided in programs like this one. But Perl provides an alternative:

        system "command", "arg", "arg", "arg"...;
Here Perl never uses the shell; it always forks and execs the command directly. Thus, system "cat *" prints the contents of all the files in the current working directory, but system "cat", "*" prints only the contents of the file named "*", if there is one.

qmail-inject has an option to take the envelope information from an out-of-band channel: you can supply it in the command-line arguments. I did not use this option in the original program, because I did not want to pass the user input through the Unix shell, which is what Perl's open FH, "| command args..." construction would have required.

Unfortunately, there is no easy way to avoid the shell when running a command that is attached to the parent process via a pipe. Perl provides open "| command arg arg arg...", which is what I used, and which is analogous to the first construction, involving the shell. But it provides nothing analogous to the second construction, which avoids the shell. If it did, then I probably would have used it, writing something like this:

        open M, "|", $MAILER, "-fnobody\@plover.com", $addre;
and the whole problem would have been avoided.

A better choice would have been to set up the pipe myself and use Perl's exec function to execute qmail-inject, which bypasses the shell. The qmail-inject program would always have received exactly one receipient address argument. In the event of an attack like the one above, it would have tried to send mail to into9507@plover.com^M^JContent-Transfer-Encoding:..., which would simply have bounced.

Why didn't I do this? Pure laziness.

qmail-queue more vulnerable than qmail-inject in this instance

Rather strangely, an partial attack is feasible with qmail-queue, even though it provides a (mostly) non-parsing interface. The addresses to qmail-queue are supplied to it on file descriptor 1 in the form:

        Taddress1@example.com^@Taddress2@example.com^@Taddress3@example.com^@...^@
If my program were to talk to qmail-queue instead of to qmail-inject, the program would have contained code that looked like this:

        print QMAIL_QUEUE_ENVELOPE "T$addre\0";
qmail-queue parses only to the extent of dividing up its input at the ^@ characters. But even this little bit of parsing is a problem. By supplying an appropriately-formed address string, a malicious user could still have forced my program to send mail to many addresses.

But still the recipient addresses would have been out of the content of the message. If the malicious user is unable to affect the content of the message body, the program is not useful for spamming.

But using qmail-queue, my program would have had to generate the To field itself, and so it would have had to insert the user-supplied address into the content of the message. This would have opened the whole can of worms again.

My program attacked specifically

I think some human put real time into attacking this particular program. There are bots that scour the web for email submission forms, and then try to send spam through them. Those bots don't successfully attack this program, because the recipient address is hard-wired. Also, the program refuses to send email unless at least one of the checkboxes is checked, and form-spam bots don't typically check boxes. Someone had to try some experiments to get the input format just so. I have logs of the experiments.

A couple of days after the exploit was discovered, a botnet started using it to send spam; 42 different IP addresses sent requests. I fixed the problem last night around 22:30; there were about 320 more requests, and by 09:00 this morning the attempts to send spam stopped.

Perl's "taint" feature would not have prevented this

Perl offers a feature that is designed specifically for detecting and preventing exactly this sort of problem. It tracks which data are possibly under control of a malicious user, and whether they are used in unsafe operations. Unsafe operations include most file and process operations.

One of my first thoughts was that I should have enabled the tainting feature to begin with. However, it would not have helped in this case. Although the user-supplied address would have been flagged as "tainted" and so untrustworthy; by extension, the email message string into which it was interpolated would have been tainted. But Perl does not consider writing a tainted string to a pipe to be an "unsafe" operation and so would not have signalled a failure.

The fix

The short-range fix to the problem was simple:

        my $addr = param('addr');
        $addr =~ s/^\s+//;
        $addr =~ s/\s+$//;

        ...

        if ($addr =~ /\s/) {
          sleep 45 + rand(45);
          print p("Sorry, addresses are not allowed to contain spaces");
          fin();
        }
Addresses are not allowed to contain whitespace, except leading or trailing whitespace, which is ignored. Since whitespace inside an address is unlikely to be an innocent mistake, the program waits before responding, to slow down the attacker.

Summary

Dammit.

[ Addendum 20070412: There is a followup article to this one. ]


[Other articles in category /oops] permanent link

Pick's theorem
In a recent article, I discussed Hero's formula for the area of a triangle in terms of its sides, and I said it was an oddity that didn't seem like any other formula in geometry. Pick's theorem is another such oddity, although not at all like Hero's.

Pick's theorem concerns the area of so-called "lattice polygons". These are simply polygons whose vertices all lie at points whose coordinates are integers. Such points are called "lattice points". Here it is:

Let P be a lattice polygon. Let b be the number of lattice points that lie on the edges of the polygon, and i be the number of lattice points inside the polygon. Then the area of the polygon is exactly b/2 + i - 1.

I think this should be at least a bit surprising. It implies that every lattice polygon has an area that is an integer multiple of ½, which I would not have thought was obvious.

Some examples now. Pick's theorem is obviously true for rectangles:

In the example above, b = 14 and i = 6, so Pick's theorem says that the area is 14/2 + 6 - 1 = 12, which is correct. In general, an m×n rectangle has (m-1)(n-1) lattice points inside, and 2m + 2n on the edges, so by Pick's theorem its area is (2m + 2n)/2 + (mn - m - n + 1) - 1 = mn.

We can cut the rectangle in half, and it still works:

b is now 8 and i is 3, so Pick's theorem predicts an area of 8/2 + 3 - 1 = 6, which is still correct. This works even when the diagonal cuts through some lattice points:

Here b is still 8, but i is only 1, so Pick predicts an area of 8/2 + 1 - 1 = 4.

It works for figures without right angles:

Here b=5 and i=0, for a total area of 3/2. We can check the area manually as follows:

The entire square has area 9. Regions A, B, and C each have area 1, and D has area 4½. That leaves X = 1½ as Pick's theorem says.

It works for more complicated figures too. Here we have b=7 and i=5 for a total area of 7½:


It works for non-convex polygons:

It does not, however, work for polygons with holes. The figure below is the same as the first triangle in this article (which had an area of of 6) except that it has a hole of size ½ chopped out of it. It also has three fewer interior points and three more boundary points than the original triangle, for a total net loss of 3 - 3/2 = 3/2.

To fix Pick's theorem for non-simply-connected polygons, you need to say that each hole adds an extra -1 to the total area.

The proof of Pick's theorem isn't hard. You start by proving it that it holds for all triangles that have two sides parallel to the x and y axes. Then you prove it for all triangles, using a subtraction argument like the one I used above for triangle X. Finally, you use induction to prove it for more complicated regions, which can be built up from triangles.

But the funny thing about Pick's theorem is that you can guess it even without a proof. Suppose someone told you that there was a formula for the size of a polygon in terms of the number of lattice points on the boundary and in the interior. Well, each interior point is surrounded by a square of area 1, which is typically inside the polygon:

So each such point should contribute about 1 to the area. Of course, not all do; some will contribute a little less:

But there will also be parts of the polygon that are not near any interior points:

The points outside the polygon whose squares are partly inside (which count less than they should) will tend to balance out the contributions of the points inside whose squares are partly outside (which count more than they should.)

The squares around a point on the edge (but not the vertex) of the polygon will always be half inside, half outside, so that such a point will contribute exactly half of a square to the total:

The vertices are a little funny. They also contribute about ½, for the same reason that the edge points do. But convex vertices contribute rather less than ½:

While concave vertices contribute a bit more:

So we need to adjust the contribution of the vertices away from the ideal of ½ each. The adjustment is positive when angle is more than 180°, in which case the path at the vertex turns clockwise, and negative when the angle is less than 180°, when the path turns counterclockwise. But any polygon makes exactly one complete counterclockwise turn, so the total adjustment is exactly one square's-worth, or -1, and this is where the fudge factor of -1 comes from. (I've known about Pick's theorem for twenty years, but I never knew where the -1 came from until just now.) Viewed in this light, it makes perfect sense that a hole in the polygon should subtract an extra unit of space, since it's adding an extra complete turn.

I made the diagrams for this article with a picture-drawing tool, originally designed at Bell Labs, called pic. The source code files for the illustrations were all named things like pick.pic, and the command to compile them to PostScript was pic pick.pic. This was really confusing.


[Other articles in category /math] permanent link

Hero's formula
On of my favorite search engine queries of last month was:

        this mathematician's best known work is the formula for the
        area of a triangle in terms of the lengths of it's sides. who
        is this and when did they live?
This is the kind of question that always trips me up in Jeopardy. (And doesn't that first sentence sounds just like a Jeopardy question? "I'll take triangles for $600, Alex. . .") The ears hear "mathematician", "triangle", "formula", and the mouth says "Who was Pythagoras!" without any conscious intervention. But the answer here is almost certainly Hero of Alexandria.

Hero's formula, or Heron's formula, has always seemed to me like something of an oddity. It doesn't look like any other formula in geometry. Usually to get anything useful from a triangle you need to involve the angles, and express the results in terms of trigonometric functions. This will be obvious if you pause to remember what the phrase "trigonometric function" means. The few rare cases in which you can avoid trigonometry (there's that word again) usually involve special cases, such as right triangles.

But Hero's formula expresses the area of a triangle in terms of the lengths of its sides, with no angles and no trigonometric functions. If we were trying to discover such a formula, we might proceed trigonometrically, and then hope we could somehow eliminate the trigonometry by the end. So we might proceed as follows:

The magnitude of the cross product a×b is the area of the parallelogram with sides a and b, so the area of the triangle is half that. And |a × b| = |a||b| sin θ, where θ is the angle between the two vectors. So if a and b are sides of a triangle and the included angle is θ, the area A is ab/2 · sin θ.

Now we want θ is in terms of the third side c. The law of cosines generalizes the Pythagorean theorem to non-right triangles: c2 = a2 + b2 - 2ab cos θ, where θ is the included angle between sides a and b. (In a right triangle, θ = 90°, and the cosine term drops out.) So we have cos θ = (a2 + b2 - c2) / 2ab. But we want sine instead of cosine, so convert cosine to sine using sin θ = √(1-cos2&theta);, which yields:

$$\sin\theta = {1\over 2ab}\sqrt{4a^2b^2 - {(a^2 + b^2 - c^2)}^2}$$

Expanding out the right-hand side, and remembering that what we really want is A = ab/2 sin θ. we get:

$$A = {1\over4}\sqrt{2a^2b^2 + 2a^2c^2 + 2b^2c^2 - a^4 - b^4 - c^4}$$.

Now, that might satisfy anyone, but Hero's answer is better. Hero says: Let p be the "semiperimeter" of the triangle; that is, p = (a+b+c)/2. Then the area of the triangle is just √(p(p-a)(p-b)(p-c)). This is "Hero's formula".

Hero's formula is simple and easy to remember, which (2a2b2 + 2a2c2 + 2b2c2 - a4 - b4- c4)/16 is not. If you expand out Hero's formula, you find that that the two formulas are the same, as of course they must be. But if you have only the complicated fourth-degree polynomial, how would you get the idea that putting it in terms of p will simplify it? There is some technique or insight here that I am missing.

Even though such technique is the indispensable tool of the working mathematician, mathematical writing customarily scorns such technique,and has numerous phrases for disparaging it or kicking it under the carpet. For example, having gotten the fourth-degree polynomial, we might say "obviously, this is equivalent to Hero's formula", or "it is left to the student to prove that the two formulations are equivalent." Or we could just skip direct from the polynomial to Hero's formula, and say what Wikipedia says in the same situation: "Here the simple algebra in the last step was omitted."

Indeed, if you are trying to prove that Hero's formula is true, it's tedious but straightforward to grovel over the algebra and grind out that the two formulations are the same. But all this ignores the real problem, which is that nobody looking at the trigonometric formulation would know to guess Hero's formula if they had not seen it before. A proof serves two purposes. it is supposed to persuade you that the theorem is true. But more importantly, it is supposed to help you understand why it is true, and to give you some insight that may help you solve similar problems. The proof-by-handwaving-away-complex-and-unexpected-algebra technique serves the first purpose, but not the second. And Hero did not discover the formula using this approach anyway, since he did not take a trig class when he was in high school.

Another way to proceed is to drop a perpendicular from the vertex opposite side c. If the length of the perpendicular is h, the area of the triangle is hc/2. But if the foot of the perpendicular divides c into x + y, we also have x2 + h2 = a2 and y2 + h2 = b2. Grovelling over the algebra again yields something equivalent to Hero's formula.

I don't know a good proof of Hero's formula. I haven't seen Hero's, about which MathWorld says:

Heron's proof is ingenious but extremely convoluted, bringing together a sequence of apparently unrelated geometric identities and relying on the properties of cyclic quadrilaterals and right triangles.

A "cyclic quadrilateral" is a quadrilateral whose vertices all lie on a circle. Any triangle can be considered a degenerate cyclic quadrilateral, which happens to have two of its vertices in the same place. Indeed, there's a formula, called "Bretschneider's formula", just like Hero's, but for the area of a cyclic quadrilateral: A = √((p-a)(p-b)(p-c)(p-d)), where a, b, c, d are the lengths of the sides and p is the semiperimeter. If you consider that a triangle is just a cyclic quadrilateral one of whose sides has length 0, you put d = 0 in Bretschneider's formula and you get out Hero's formula.


[Other articles in category /math] permanent link