A long story about many bugs
Warning: Long and possibly dull.
I spent a big chunk of today fixing a bug that should have been easy
but that just went deeper and deeper. If you look over in the left
sidebar there you'll se a sub-menu titled “subtopics” with a
per-category count of the number of articles in each section of this
blog. (Unless you're using a small display, where the whole sidebar
is suppressed.) That menu was at least a year out of date. I wanted
to fix it.
The blog software I use is the wonderfully terrible
Blosxom. It has a plugin system,
and the topic menu was generated by a plugin that I wrote some time
ago. When the topic plugin starts up it opens two Berkeley
DB files. Each is a simple
key-value mapping. One maps topic names to article counts. The other
is just a set of article IDs for the articles that have already been
counted. These key-value mappings are exposed in Perl as hash
variables.
When I regenerate the static site, the topic plugin has a
subroutine, story , that is called for each article in each generated
page. The business end of the subroutine looks something like this:
sub story {
# ... acquire arguments ..
if ( $Seen{ $article_id } ) {
return;
} else {
$topic_count{ $article_topic }++;
$Seen{ $article_id } = 1;
}
}
The reason the menu wasn't being updated is that at some point in the
past, I changed the way story plugins were called. Out of the box,
Blosxom passes story a list of five arguments, like this:
my ($pkg, $path, $filename, $story_ref, $title_ref) = @_;
Over the years I had extended this to eight or nine, and I felt it was
getting unwieldy, so at some point I changed it to pass a hash, like
this:
my %args = (
category => $path, # directory of this story
filename => $fn, # filename of story, without suffix
...
)
$entries = $plugin->story(\%args);
When I made this conversion, I had to convert all the plugins. I
missed converting topic . So instead of getting the eight or nine
arguments it expected, it got two: the plugin itself, and the hash.
Then it used the hash as the key into the databases, which by now
were full of thousands of entries for things like HASH(0x436c1d)
because that is what Perl silently and uselessly does if you try to
use a hash as if it were a string.
Anyway, this was easily fixed, or should have been easily fixed.
All I needed to do was convert the plugin to use the new calling
convention. Ha!
One thing all my plugins do when they start up is write a diagnostic
log, something like this:
sub start {
open F, ">", "/tmp/topic.$>";
print F "Writing to $blosxom::plugin_state_dir/topics\n";
}
Then whenever the plugin has something to announce it just does
print F . For example, when the plugin increments the count for a
topic, it inserts a message like this:
print F "'$article_id' is item $topic_count{$article_topic} in topic $article_topic.\n";
If the article has already been seen, it remains silent.
Later I can look in /tmp/topic.119 or whatever to see what it said.
When I'm debugging a plugin, I can open an Emacs buffer on this file
and put it in auto-revert mode so that Emacs always displays the
current contents of the file.
Blosxom has an option to generate pages on demand for a web browser,
and I use this for testing. https://blog.plover.com/PATH is the
static version of the article, served from a pre-generated static
file. But https://blog.plover.com/test/PATH calls Blosxom as a CGI
script to generate the article on the fly and send it to the browser.
So I visited https://blog.plover.com/test/2018/ , which should
generate a page with all the articles from 2018, to see what the
plugin put in the file. I should have seen it inserting a lot of
HASH(0x436c1d) garbage:
'lang/etym/Arabic-2' is article 1 in topic HASH(0x22c501b)
'addenda/200801' is article 1 in topic HASH(0x5300aa2)
'games/poker-24' is article 1 in topic HASH(0x4634a79)
'brain/pills' is article 1 in topic HASH(0x1a9f6ab)
'lang/long-s' is article 1 in topic HASH(0x29489be)
'google-roundup/200602' is article 1 in topic HASH(0x360e6f5)
'prog/van-der-waerden-1' is article 1 in topic HASH(0x3f2a6dd)
'math/math-se-gods' is article 1 in topic HASH(0x412b105)
'math/pow-sqrt-2' is article 1 in topic HASH(0x23ebfe4)
'aliens/dd/p22' is article 1 in topic HASH(0x878748)
I didn't see this. I saw the startup message and nothing else. I did
a bunch of very typical debugging, such as having the plugin print a
message every time story was called:
sub story {
print F "Calling 'story' (@_)\n";
...
}
Nothing. But I knew that story was being called. Was I maybe
editing the wrong file on disk? No, because I could introduce a
syntax error and the browser would happily report the resulting 500
Server Error . Fortunately, somewhere along the way I changed
open F, ">", "/tmp/topic.$>";
to
open F, ">>", "/tmp/topic.$>";
and discovered that each time I loaded the page, the plugin was run
exactly twice. When I had had > , the second run would immediately
overwrite the diagnostics from the first run.
But why was the plugin being run twice? This took quite a while to
track down. At first I suspected that Blosxom was doing it, either on
purpose or by accident. My instance of Blosxom is a hideous
Frankenstein monster that has been cut up and reassembled and hacked
and patched dozens of times since 2006 and it is full of unpleasant
surprises. But the problem turned out to be quite different. Looking
at the Apache server logs I saw that the browser was actually making
two requests, not one:
100.14.199.174 - mjd [28/Oct/2018:18:00:49 +0000] "GET /test/2018/ HTTP/1.1" 200 213417 "-" ...
100.14.199.174 - mjd [28/Oct/2018:18:00:57 +0000] "GET /test/2018/BLOGIMGREF/horseshoe-curve-small.mp4 HTTP/1.1" 200 623 ...
Since the second request was for a nonexistent article, the story
callback wasn't invoked in the second run. So I would see the startup
message, but I didn't see any messages from the story callback.
They had been there in the first run for the first request, but that
output was immediately overwritten on the second request.
BLOGIMGREF is a tag that I include in image URLs, that expands to
whatever is the appropriate URL for the images for the particular
article it's in. This expansion is done by a different plugin, called
path2 , and apparently in this case it wasn't being expanded. The
place it was being used was easy enough to find; it looked like this:
<video width="480" height="270" controls>
<source src="BLOGIMGREF/horseshoe-curve-small.mp4" type="video/mp4">
</video>
So I dug down into the path2 plugin to find out why BLOGIMGREF
wasn't being replaced by the correct URL prefix, which should have
been in a different domain entirely.
This took a very long time to track down, and I think it was totally
not my fault. When I first wrote path2 I just had it do a straight
text substitution. But at some point I had improved this to use a real
HTML parser, supplied by the Perl HTML::TreeBuilder module. This
would parse the article body and return a tree of HTML::Element
objects, which the plugin would then filter, looking for img and a
elements. The plugin would look for the magic tags and replace them
with the right URLs.
This magic tag was not in an img or an a element, so the plugin
wasn't finding it. I needed to tell the plugin to look in source
elements also. Easy fix! Except it didn't work.
Then began a tedious ten-year odyssey through the HTML::TreeBuilder
and HTML::Element modules to find out why it hadn't worked. It took
a long time because I'm good at debugging. When you lose your wallet,
you look in the most likely places first, and I know from many years
of experience what the most likely places are — usually in my
misunderstanding of the calling convention of some library I didn't
write, or my misunderstanding of what it was supposed to do; sometimes
in my own code. The downside of this is that when the wallet is in
an unlikely place it takes a really long time to find it.
The end result this time was that it wasn't in any of the usual
places. It was 100% not my fault: HTML::TreeBuilder has a bug in
its parser. For
some reason it completely ignores source elements:
perl -MHTML::TreeBuilder -e '$z = q{<source src="/media/horseshoe-curve-small.mp4" type="video/mp4"/>}; HTML::TreeBuilder->new->parse($z)->eof->elementify()->dump(\*STDERR)'
The output is:
<html> @0 (IMPLICIT)
<head> @0.0 (IMPLICIT)
<body> @0.1 (IMPLICIT)
No trace of the source element. I reported the bug, commented out
the source element in the article, and moved on. (The article was
unpublished, in part because I could never get the video to play
properly in the browser. I had been tearing my hair about over it,
but now I knew why! The BLOGIMGREF in the URL was not being
replaced! Because of a bug in the HTML parser!)
With that fixed I went back to finish the work on the topic plugin.
Now that the diagnostics were no longer being overwritten by the bogus
request for /test/2018/BLOGIMGREF/horseshoe-curve-small.mp4 , I
expected to see the HASH(0x436c1d) garbage. I did, and I fixed
that. Then I expected the 'article' is article 17 in topic prog
lines to go away. They were only printed for new articles that hadn't
been seen before, and by this time every article should have been in
the %Seen database.
But no, every article on the page, every article from 2018, was being
processed every time I rebuilt the page. And the topic counts were
going up, up, up.
This also took a long time to track down, because again the cause was
so unlikely. I must have been desperate because I finally found it by
doing something like this:
if ( $Seen{ $article_id } ) {
return;
} else {
$topic_count{ $article_topic }++;
$Seen{ $article_id } = 1;
die "WTF!!" unless $Seen{ $article_id };
}
Yep, it died. Either Berkeley DB, or Perl's BerkeleyDB module, was
just flat-out not working. Both of them are ancient, and this kind of
shocking bug should have been shaken out 20 years go. WTF, indeed,
I fixed this by discarding the entire database and rebuilding it. I
needed to clean out the HASH(0x436c1d) crap anyway.
I am sick of DB files. I am never using them again. I have been
bitten too many times. From now on I am doing the smart thing, by
which I mean the dumb thing, the worse-is-better thing: I will
read a plain text file into memory, modify it, and write out the
modified version whem I am done. It will be simple to debug the code
and simple to modify the database.
Well, that sucked. Usually this sort of thing is all my fault, but
this time I was only maybe 10% responsible.
At least it's working again.
[ Addendum: I learned that discarding the source element is a
⸢feature⸣ of HTML::Parser . It has a list of valid HTML4 tags and by
default it ignores any element that isn't one. The maintainer won't
change the default to HTML5 because that might break backward
compatibility for people who are depending on this behavior. ]
[Other articles in category /prog/bug]
permanent link
|