The Universe of Disco


Mon, 29 Oct 2018

A long story about many bugs

Warning: Long and possibly dull.

I spent a big chunk of today fixing a bug that should have been easy but that just went deeper and deeper. If you look over in the left sidebar there you'll se a sub-menu titled “subtopics” with a per-category count of the number of articles in each section of this blog. (Unless you're using a small display, where the whole sidebar is suppressed.) That menu was at least a year out of date. I wanted to fix it.

The blog software I use is the wonderfully terrible Blosxom. It has a plugin system, and the topic menu was generated by a plugin that I wrote some time ago. When the topic plugin starts up it opens two Berkeley DB files. Each is a simple key-value mapping. One maps topic names to article counts. The other is just a set of article IDs for the articles that have already been counted. These key-value mappings are exposed in Perl as hash variables.

When I regenerate the static site, the topic plugin has a subroutine, story, that is called for each article in each generated page. The business end of the subroutine looks something like this:

      sub story {
        # ... acquire arguments ..

        if ( $Seen{ $article_id } ) {
          return;
        } else {
          $topic_count{ $article_topic }++;
          $Seen{ $article_id } = 1;
        }
      }

The reason the menu wasn't being updated is that at some point in the past, I changed the way story plugins were called. Out of the box, Blosxom passes story a list of five arguments, like this:

       my ($pkg, $path, $filename, $story_ref, $title_ref) = @_;

Over the years I had extended this to eight or nine, and I felt it was getting unwieldy, so at some point I changed it to pass a hash, like this:

         my %args = (
              category   => $path, # directory of this story
              filename   => $fn,   # filename of story, without suffix
              ...
         )
         $entries = $plugin->story(\%args);

When I made this conversion, I had to convert all the plugins. I missed converting topic. So instead of getting the eight or nine arguments it expected, it got two: the plugin itself, and the hash. Then it used the hash as the key into the databases, which by now were full of thousands of entries for things like HASH(0x436c1d) because that is what Perl silently and uselessly does if you try to use a hash as if it were a string.

Anyway, this was easily fixed, or should have been easily fixed. All I needed to do was convert the plugin to use the new calling convention. Ha!

One thing all my plugins do when they start up is write a diagnostic log, something like this:

    sub start {
      open F, ">", "/tmp/topic.$>";
      print F "Writing to $blosxom::plugin_state_dir/topics\n";
    }

Then whenever the plugin has something to announce it just does print F. For example, when the plugin increments the count for a topic, it inserts a message like this:

    print F "'$article_id' is item $topic_count{$article_topic} in topic $article_topic.\n";

If the article has already been seen, it remains silent.

Later I can look in /tmp/topic.119 or whatever to see what it said. When I'm debugging a plugin, I can open an Emacs buffer on this file and put it in auto-revert mode so that Emacs always displays the current contents of the file.

Blosxom has an option to generate pages on demand for a web browser, and I use this for testing. https://blog.plover.com/PATH is the static version of the article, served from a pre-generated static file. But https://blog.plover.com/test/PATH calls Blosxom as a CGI script to generate the article on the fly and send it to the browser. So I visited https://blog.plover.com/test/2018/, which should generate a page with all the articles from 2018, to see what the plugin put in the file. I should have seen it inserting a lot of HASH(0x436c1d) garbage:

    'lang/etym/Arabic-2' is article 1 in topic HASH(0x22c501b)
    'addenda/200801' is article 1 in topic HASH(0x5300aa2)
    'games/poker-24' is article 1 in topic HASH(0x4634a79)
    'brain/pills' is article 1 in topic HASH(0x1a9f6ab)
    'lang/long-s' is article 1 in topic HASH(0x29489be)
    'google-roundup/200602' is article 1 in topic HASH(0x360e6f5)
    'prog/van-der-waerden-1' is article 1 in topic HASH(0x3f2a6dd)
    'math/math-se-gods' is article 1 in topic HASH(0x412b105)
    'math/pow-sqrt-2' is article 1 in topic HASH(0x23ebfe4)
    'aliens/dd/p22' is article 1 in topic HASH(0x878748)

I didn't see this. I saw the startup message and nothing else. I did a bunch of very typical debugging, such as having the plugin print a message every time story was called:

    sub story {
        print F "Calling 'story' (@_)\n";
        ...
    }

Nothing. But I knew that story was being called. Was I maybe editing the wrong file on disk? No, because I could introduce a syntax error and the browser would happily report the resulting 500 Server Error. Fortunately, somewhere along the way I changed

      open F, ">", "/tmp/topic.$>";

to

      open F, ">>", "/tmp/topic.$>";

and discovered that each time I loaded the page, the plugin was run exactly twice. When I had had >, the second run would immediately overwrite the diagnostics from the first run.

But why was the plugin being run twice? This took quite a while to track down. At first I suspected that Blosxom was doing it, either on purpose or by accident. My instance of Blosxom is a hideous Frankenstein monster that has been cut up and reassembled and hacked and patched dozens of times since 2006 and it is full of unpleasant surprises. But the problem turned out to be quite different. Looking at the Apache server logs I saw that the browser was actually making two requests, not one:

    100.14.199.174 - mjd [28/Oct/2018:18:00:49 +0000] "GET /test/2018/ HTTP/1.1" 200 213417 "-" ...
    100.14.199.174 - mjd [28/Oct/2018:18:00:57 +0000] "GET /test/2018/BLOGIMGREF/horseshoe-curve-small.mp4 HTTP/1.1" 200 623 ...

Since the second request was for a nonexistent article, the story callback wasn't invoked in the second run. So I would see the startup message, but I didn't see any messages from the story callback. They had been there in the first run for the first request, but that output was immediately overwritten on the second request.

BLOGIMGREF is a tag that I include in image URLs, that expands to whatever is the appropriate URL for the images for the particular article it's in. This expansion is done by a different plugin, called path2, and apparently in this case it wasn't being expanded. The place it was being used was easy enough to find; it looked like this:

     <video width="480" height="270" controls>
       <source src="BLOGIMGREF/horseshoe-curve-small.mp4" type="video/mp4">
     </video>

So I dug down into the path2 plugin to find out why BLOGIMGREF wasn't being replaced by the correct URL prefix, which should have been in a different domain entirely.

This took a very long time to track down, and I think it was totally not my fault. When I first wrote path2 I just had it do a straight text substitution. But at some point I had improved this to use a real HTML parser, supplied by the Perl HTML::TreeBuilder module. This would parse the article body and return a tree of HTML::Element objects, which the plugin would then filter, looking for img and a elements. The plugin would look for the magic tags and replace them with the right URLs.

This magic tag was not in an img or an a element, so the plugin wasn't finding it. I needed to tell the plugin to look in source elements also. Easy fix! Except it didn't work.

Then began a tedious ten-year odyssey through the HTML::TreeBuilder and HTML::Element modules to find out why it hadn't worked. It took a long time because I'm good at debugging. When you lose your wallet, you look in the most likely places first, and I know from many years of experience what the most likely places are — usually in my misunderstanding of the calling convention of some library I didn't write, or my misunderstanding of what it was supposed to do; sometimes in my own code. The downside of this is that when the wallet is in an unlikely place it takes a really long time to find it.

The end result this time was that it wasn't in any of the usual places. It was 100% not my fault: HTML::TreeBuilder has a bug in its parser. For some reason it completely ignores source elements:

    perl -MHTML::TreeBuilder -e '$z = q{<source src="/media/horseshoe-curve-small.mp4" type="video/mp4"/>}; HTML::TreeBuilder->new->parse($z)->eof->elementify()->dump(\*STDERR)' 

The output is:

    <html> @0 (IMPLICIT)
      <head> @0.0 (IMPLICIT)
      <body> @0.1 (IMPLICIT)

No trace of the source element. I reported the bug, commented out the source element in the article, and moved on. (The article was unpublished, in part because I could never get the video to play properly in the browser. I had been tearing my hair about over it, but now I knew why! The BLOGIMGREF in the URL was not being replaced! Because of a bug in the HTML parser!)

With that fixed I went back to finish the work on the topic plugin. Now that the diagnostics were no longer being overwritten by the bogus request for /test/2018/BLOGIMGREF/horseshoe-curve-small.mp4, I expected to see the HASH(0x436c1d) garbage. I did, and I fixed that. Then I expected the 'article' is article 17 in topic prog lines to go away. They were only printed for new articles that hadn't been seen before, and by this time every article should have been in the %Seen database.

But no, every article on the page, every article from 2018, was being processed every time I rebuilt the page. And the topic counts were going up, up, up.

This also took a long time to track down, because again the cause was so unlikely. I must have been desperate because I finally found it by doing something like this:


            if ( $Seen{ $article_id } ) {
              return;
            } else {
              $topic_count{ $article_topic }++;
              $Seen{ $article_id } = 1;
              die "WTF!!" unless $Seen{ $article_id };
            }

Yep, it died. Either Berkeley DB, or Perl's BerkeleyDB module, was just flat-out not working. Both of them are ancient, and this kind of shocking bug should have been shaken out 20 years go. WTF, indeed,

I fixed this by discarding the entire database and rebuilding it. I needed to clean out the HASH(0x436c1d) crap anyway.

I am sick of DB files. I am never using them again. I have been bitten too many times. From now on I am doing the smart thing, by which I mean the dumb thing, the worse-is-better thing: I will read a plain text file into memory, modify it, and write out the modified version whem I am done. It will be simple to debug the code and simple to modify the database.

Well, that sucked. Usually this sort of thing is all my fault, but this time I was only maybe 10% responsible.

At least it's working again.

[ Addendum: I learned that discarding the source element is a ⸢feature⸣ of HTML::Parser. It has a list of valid HTML4 tags and by default it ignores any element that isn't one. The maintainer won't change the default to HTML5 because that might break backward compatibility for people who are depending on this behavior. ]


[Other articles in category /prog/bug] permanent link