The Universe of Discourse


Fri, 29 Nov 2024

A complex bug with a ⸢simple⸣ fix

Last month I did a fairly complex piece of systems programming that worked surprisingly well. But it had one big bug that took me a day to track down.

One reason I find the bug interesting is that it exemplifies the sort of challenges that come up in systems programming. The essence of systems programming is that your program is dealing with the state of a complex world, with many independent agents it can't control, all changing things around. In ordinary programming, one can often write a program that puts down a wrench and then picks it up again without looking. In systems programming, the program may have to be prepared for the possibility that someone else has come along and moved the wrench.

The other reason the bug is interesting is that although it was a big bug, fixing it required only a tiny change. I often struggle to communicate to nonprogrammers just how finicky and fussy programming is. Nonprogrammers, even people who have taken a programming class or two, are used to being harassed by crappy UIs (or by the compiler) about missing punctuation marks and trivially malformed inputs, and they think they understand how fussy programming is. But they usually do not. The issue is much deeper, and I think this is a great example that will help communicate the point.

The job of my program, called sync-spam, was to move several weeks of accumulated email from system S to system T. Each message was probably spam, but its owner had not confirmed that yet, and the message was not yet old enough to be thrown away without confirmation.

The probably-spam messages were stored on system S in a directory hierarchy with paths like this:

    /spam/2024-10-18/…

where 2024-10-18 was the date the message had been received. Every message system S had received on October 18 was somewhere under /spam/2024-10-18.

One directory, the one for the current date, was "active", and new messages were constantly being written to it by some other programs not directly related to mine. The directories for the older dates never changed. Once sync-spam had dealt with the backlog of old messages, it would continue to run, checking periodically for new messages in the active directory.

The sync-spam program had a database that recorded, for each message, whether it had successfully sent that message from S to T, so that it wouldn't try to send the same message again.

The program worked like this:

  • Repeat forever:
    1. Scan the top-level spam directory for the available dates
    2. For each date D:
      1. Scan the directory for D and find the messages in it. Add to the database any messages not already recorded there.
      2. Query the database for the list of messages for date D that have not yet been sent to T
      3. For each such message:
        1. Attempt to send the message
        2. If the attempt was successful, record that in the database
    3. Wait some appropriate amount of time and continue.

Okay, very good. The program would first attempt to deal with all the accumulated messages in roughly chronological order, processing the large backlog. Let's say that on November 1 it got around to scanning the active 2024-11-01 directory for the first time. There are many messages, and scanning takes several minutes, so by the time it finishes scanning, some new messages will be in the active directory that it hasn't seen. That's okay. The program will attempt to send the messages that it has seen. The next time it comes around to 2024-11-01 it will re-scan the directory and find the new messages that have appeared since the last time around.

But scanning a date directory takes several minutes, so we would prefer not to do it if we don't have to. Since only the active directory ever changes, if the program is running on November 1, it can be sure that none of the directories from October will ever change again, so there is no point in its rescanning them. In fact, once we have located the messages in a date directory and recorded them in the database, there is no point in scanning it again unless it is the active directory, the one for today's date.

So sync-spam had an elaboration that made it much more efficient. It was able to put a mark on a date directory that meant "I have completely scanned this directory and I know it will not change again". The algorithm was just as I said above, except with these elaborations.

  • Repeat forever:
    1. Scan the top-level spam directory for the available dates
    2. For each date D:
        • If the directory for D is marked as having already been scanned, we already know exactly what messages are in it, since they are already recorded in the database.
        • Otherwise:
          1. Scan the directory for D and find the messages in it. Add to the database any messages not already recorded there.
          2. If D is not today's date, mark the directory for D as having been scanned completely, because we need not scan it again.
      1. Query the database for the list of messages for date D that have not yet been sent to T
      2. For each such message:
        1. Attempt to send the message
        2. If the attempt was successful, record that in the database
    3. Wait some appropriate amount of time and continue.

It's important not to mark the active directory as having been completely scanned, because new messages are continually being deposited into it until the end of the day.
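
In code, the middle of the loop looked something like this sketch. The mark_this_date_fully_scanned and current_date method names are the real ones (they appear again below); the other helpers are invented for illustration:

    # Sketch of the per-date step.  Only mark_this_date_fully_scanned and
    # current_date are the program's real method names; the other helpers
    # here are hypothetical.
    for my $date (@available_dates) {
        my $date_dir = "/spam/$date";
        unless ($self->is_marked_fully_scanned($date_dir)) {
            $self->scan_and_record($date_dir);    # slow: takes several minutes
            if ($date ne $self->current_date) {
                $self->mark_this_date_fully_scanned($date_dir);
            }
        }
        for my $message ($self->unsent_messages_for($date)) {
            $self->record_as_sent($message) if $self->try_to_send($message);
        }
    }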

I implemented this, we started it up, and it looked good. For several days it processed the backlog of unsent messages from September and October, and it successfully sent most of them. It eventually caught up to the active directory for the current date, 2024-11-01, scanned it, and sent most of the messages. Then it went back and started over again with the earliest date, attempting to send any messages that it hadn't sent the first time.

But a couple of days later, we noticed that something was wrong. Directories 2024-11-02 and 2024-11-03 had been created and were well-stocked with the messages that had been received on those dates. The program had found the directories for those dates and had marked them as having been scanned, but there were no messages from those dates in its database.

Now why do you suppose that is?

(Spoilers will follow the horizontal line.)

I investigated this in two ways. First, I made sync-spam's logging more detailed and looked at the results. While I was waiting for more logs to accumulate, I built a little tool that would generate a small, simulated spam directory on my local machine, and then I ran sync-spam against the simulated messages, to make sure it was doing what I expected.

In the end, though, neither of these led directly to my solving the problem; I just had a sudden inspiration. This is very unusual for me. Still, I probably wouldn't have had the sudden inspiration if the information from the logging and the debugging hadn't been percolating around my head. Fortune favors the prepared mind.

----------------------------------------------------------------

The problem was this: some other agent was creating the 2024-11-02 directory a bit prematurely, say at 11:55 PM on November 1.

Then sync-spam came along in the last minutes of November 1 and started its main loop. It scanned the spam directory for available dates, and found 2024-11-02. It processed the unsent messages from the directories for earlier dates, then looked at 2024-11-02 for the first time. And then, at around 11:58, following the elaborated algorithm above, it would:

  1. Scan the directory for 2024-11-02 and find the messages in it. Add to the database any messages not already recorded there.

There weren't any yet, because it was still 11:58 on November 1.

  2. If 2024-11-02 is not today's date, mark the directory as having been scanned completely, because we need not scan it again.

Since the 2024-11-02 directory was not the one for today's date — it was still 11:58 on November 1 — sync-spam recorded that it had scanned that directory completely and need not scan it again.

Five minutes later, at 00:03 on November 2, there would be new messages in the 2024-11-02 directory, which was now the active one, but sync-spam wouldn't look for them, because it had already marked 2024-11-02 as having been scanned completely.

This complex problem in this large program was completely fixed by changing:

        if ($date ne $self->current_date) {
          $self->mark_this_date_fully_scanned($date_dir);
        }

to:

        if ($date lt $self->current_date) {
          $self->mark_this_date_fully_scanned($date_dir);
        }

(ne and lt are Perl-speak for "not equal to" and "less than".)
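
The fix works because ISO 8601 dates compare chronologically as plain strings, so lt means "strictly in the past":

    # Lexicographic order on ISO 8601 date strings is chronological order:
    print "earlier\n" if "2024-11-01" lt "2024-11-02";   # prints "earlier"
    print "earlier\n" if "2024-11-02" lt "2024-11-02";   # prints nothing

With ne, the prematurely created 2024-11-02 directory qualified as "not today" and got marked. With lt, a directory is marked only once its date is strictly in the past, so a directory created prematurely for a future date is left unmarked and gets rescanned later.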

Many organizations have their own version of a certain legend, which tells how a famous person from the past was once called out of retirement to solve a technical problem that nobody else could understand. I first heard the General Electric version of the legend, in which Charles Proteus Steinmetz was called out of retirement to figure out why a large complex of electrical equipment was not working.

In the story, Steinmetz walked around the room, looking briefly at each of the large complicated machines. Then, without a word, he took a piece of chalk from his pocket, marked one of the panels, and departed. When the puzzled engineers removed that panel, they found a failed component, and when that component was replaced, the problem was solved.

Steinmetz's consulting bill for $10,000 arrived the following week. Shocked, the bean-counters replied that $10,000 seemed an exorbitant fee for making a single chalk mark, and, hoping to embarrass him into reducing the fee, asked him to itemize the bill.

Steinmetz returned the itemized bill:

    One chalk mark               $     1.00
    Knowing where to put it      $ 9,999.00
                                 ----------
    TOTAL                        $10,000.00

This felt like one of those times. Any day when I can feel a connection with Charles Proteus Steinmetz is a good day.

This episode also makes me think of the following variation on an old joke:

A: Ask me what is the most difficult thing about systems programming.

B: Okay, what is the most difficult thing ab—

A: TIMING!


[Other articles in category /prog/bug] permanent link

Wed, 23 Sep 2020

The mystery of the malformed command-line flags

Today a user came to tell me that their command

  greenlight submit branch-name --require-review-by skordokott

failed, saying:

    ** 
    ** unexpected extra argument 'branch-name' to 'submit' command
    **

This is surprising. The command looks correct. The branch name is required. The --require-review-by option can be supplied any number of times (including none), and each use must be given a value. Here it is given once, and the provided value appears to be skordokott.

The greenlight command is a crappy shell script that pre-validates the arguments before sending them over the network to the real server. I guessed that the crappy shell script parser wanted the branch name last, even though the server itself would have been happy to take the arguments in either order. I suggested that the user try:

  greenlight submit --require-review-by skordokott branch-name 

But it still didn't work:

    ** 
    ** unexpected extra argument '--require-review-by' to 'submit' command
    **

I dug into the script and discovered the problem, which was not actually a programming error. The crappy shell script was behaving correctly!

I had written up release notes for the --require-review-by feature. The user had clipboard-copied the option string out of the release notes and pasted it into the shell. So why didn't it work?

In an earlier draft of the release notes, when they were displayed as an HTML page, there would be bad line breaks:

blah blah blah be sure to use the -
-require-review-by option…

or:

blah blah blah the new --
require-review-by feature is…

No problem, I can fix it! I just changed the pair of hyphens (- U+002D) at the beginning of --require-review-by to Unicode nonbreaking hyphens (‑ U+2011). Bad line breaks begone!

But then this hapless user clipboard-copied the option string out of the release notes, including its U+2011 characters. The parser in the script was (correctly) looking for U+002D characters, and didn't recognize --require-review-by as an option flag.

One lesson learned: people will copy-paste stuff out of documentation, and I should be prepared for that.

There are several places to address this. I made the error message more transparent; formerly it would complain only about the first argument, which was confusing because it was the one argument that wasn't superfluous. Now it will say something like

    ** 
    ** extra branch name '--require-review-by' in 'submit' command
    ** 
    ** 
    ** extra branch name 'skordokott' in 'submit' command
    ** 

which is more descriptive of what it actually doesn't like.

I could change the nonbreaking hyphens in the release notes back to regular hyphens and just accept the bad line breaks. But I don't want to. Typography is important.

One idea I'm toying with is to have the shell script silently replace all nonbreaking hyphens with regular ones before any further processing. It's a hack, but it seems like it might be a harmless one.
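
If the script is bash, the replacement might look like this sketch (the substitution pattern contains a literal U+2011):

    # Sketch: normalize U+2011 (nonbreaking hyphen) to U+002D (ordinary
    # hyphen) in every argument before parsing.  Assumes bash and a
    # UTF-8 locale.
    args=()
    for arg in "$@"; do
        args+=("${arg//‑/-}")
    done
    set -- "${args[@]}"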

So many weird things can go wrong. This computer stuff is really complicated. I don't know how anyone gets anything done.

[ Addendum: A reader suggests that I could have fixed the line breaks with CSS. But the release notes were being presented as a Slack “Post”, which is essentially a WYSIWYG editor for creating shared documents. It presents the document in a canned HTML style, and as far as I know there's no way to change the CSS it uses. Similarly, there's no way to insert raw HTML elements, so no way to change the style per-element. ]


[Other articles in category /prog/bug] permanent link

Tue, 21 May 2019

Super-obscure bug in my code

Say $dt is a Perl DateTime object.

You are allowed to say

  $dt->add( days => 2 )
  $dt->subtract( days => 2 )

Today Jeff Boes pointed out that I had written a program that used

  $dt->add({ days => 2 })

which as far as I can tell is not documented to work. But it did work. (I wrote it in 2016 and would surely have noticed by now if it hadn't.) Jeff told me he noticed when he copied my code and got a warning. When I tried it, no warning.

It turns out that

  $dt->add({ days => 2 })
  $dt->subtract({ days => 2 })

both work, except that:

  1. The subtract call produces a warning (add doesn't! and Jeff had changed my add to subtract)

  2. If you included an end_of_month => $mode parameter in the arguments to subtract, it would get lost.

Also, the working-ness of what I wrote is a lucky fluke. It is undocumented (I think) and works only because of a quirk of the implementation. ->add passes its arguments to DateTime::Duration->new, which passes them to Params::Validate::validate. The latter is documented to accept either form. But its use by DateTime::Duration is an undocumented implementation detail.
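
The Params::Validate half of this is easy to see in isolation. It is documented to accept the arguments either as a flat list of key-value pairs or as a single hashref; here is a sketch, independent of DateTime:

    use Params::Validate qw(validate);

    sub demo {
        my %p = validate(@_, { days => 1 });   # "days" is a required parameter
        return $p{days};
    }

    demo(days => 2);       # flat key-value list: returns 2
    demo({ days => 2 });   # single hashref: also returns 2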

->subtract works the same way, except that it does a little bit of preprocessing on the arguments before calling DateTime::Duration->new. That's where the warning comes from, and why end_of_month won't work with the hashref form.

(All this is as of version 1.27. The current version is 1.51. Matthew Horsfall points out that 1.51 does not raise a warning, because of a different change to the same interface.)

This computer stuff is amazingly complicated. I don't know how anyone gets anything done.


[Other articles in category /prog/bug] permanent link

Tue, 04 Dec 2018

I figured out that context manager bug!

A couple of days ago I described a strange bug in my “Greenlight” project that was causing Git to fail unpredictably, saying:

    fatal: this operation must be run in a work tree

The problem seemed to go away when I changed

    with env_var("GIT_DIR", self.repo_dir):
        with env_var("GIT_WORK_TREE", self.work_dir):
            result = subprocess.run(command, ...)

to

    with env_var("GIT_DIR", self.repo_dir, "GIT_WORK_TREE", self.work_dir):
        result = subprocess.run(command, ...)

but I didn't understand why. I said:

This was so unexpected that I wondered if the real problem was nondeterministic and if some of the debugging messages had somehow perturbed it. But I removed everything but the context manager change and ran another test, which succeeded. By then I was five and a half hours into the debugging and I didn't have any energy left to actually understand what the problem had been. I still don't know.

The problem re-manifested again today, and this time I was able to track it down and fix it. The context manager code I mentioned above was not the issue.

That subprocess.run call is made inside a git_util object which, as you can see in the tiny excerpt above, has a self.work_dir attribute that tells it where to find the working tree. Just before running a Git command, the git_util object installs self.work_dir into the environment to tell Git where the working tree is.

The git_util object is originally manufactured by Greenlight itself, which sets the work_dir attribute to a path that contains the current process ID number. Just before the process exits, Greenlight destroys the working tree. This way, concurrent processes never try to use the same working tree, which would be a mess.

When Greenlight needs to operate on the repository, it uses its git_util object directly. It also creates a submission object to represent the submitted branch, and it installs the git_util object into the submission object, so that the submission object can also operate on the repository. For example, the submission object may ask its git_util object if it needs to be rebased onto some other branch, and if so to please do it. So:

  • Greenlight has a submission.
  • submission.git is the git_util object that deals with Git.
  • submission.git.work_dir is the path to the per-process temporary working tree.

Greenlight's main purpose is to track these submission objects, and it has a database of them. To save time when writing the initial implementation, instead of using a real database, I had Greenlight use Python's “pickle” feature to pickle the list of submissions.

Someone would submit a branch, and Greenlight would pickle the submission. The submission contained its git_util object, and that got pickled along with the rest. Then Greenlight would exit and, just before doing so, it would destroy its temporary working tree.

Then later, when someone else wanted to approve the submission for publication, Greenlight would set up a different working tree with its new process ID, and unpickle the submission. But the submission's git.work_dir had been pickled with the old path, which no longer existed.

The context manager was working just fine. It was setting GIT_WORK_TREE to the work_dir value in the git_util object. But the object was obsolete and its work_dir value pointed to a directory that had been destroyed!
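
The whole failure pattern fits in a few lines of Python. This is a boiled-down sketch; the class and names are made up:

    import os, pickle, tempfile

    class GitUtil:
        def __init__(self, work_dir):
            self.work_dir = work_dir   # per-process temporary working tree

    # "Process 1": create a working tree and pickle an object that refers to it.
    work_dir = tempfile.mkdtemp(prefix="worktree-%d-" % os.getpid())
    blob = pickle.dumps(GitUtil(work_dir))
    os.rmdir(work_dir)                  # process exits; the tree is destroyed

    # "Process 2": the unpickled object still holds the old, dead path.
    stale = pickle.loads(blob)
    print(os.path.isdir(stale.work_dir))   # False: the directory is gone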

Adding to the confusion:

  1. Greenlight's own git_util object was always fresh and had the right path in it, so Git commands run directly by Greenlight all worked properly.

  2. Any new submission objects created by Greenlight would have the right path, so Git commands run by fresh submissions also worked properly.

  3. Greenlight doesn't always destroy the working tree when it exits. If it exits abnormally, it leaves the working tree intact, for a later autopsy. And the unpickled submission would work perfectly if the working tree still existed, and it would be impossible to reproduce the problem!

Toward the end of the previous article, I said:

I suspect I'm being sabotaged somewhere by Python's weird implicit ideas of scope and variable duration, but I don't know. Yet.

For the record, then: The issue was indeed one of variable duration. But Python's weird implicit ideas were, in this instance, completely blameless. Instead the issue was caused by a software component even more complex and more poorly understood: “Dominus”.

This computer stuff is amazingly complicated. I don't know how anyone gets anything done.


[Other articles in category /prog/bug] permanent link

Sun, 02 Dec 2018

Another day, another bug. No, four bugs.

I'm working on a large and wonderful project called “Greenlight”. It's a Git branch merging service that implements the following workflow:

  1. Submitter submits a branch to Greenlight (greenlight submit my-topic-branch)
  2. Greenlight analyzes the branch to decide if it changes anything that requires review and signoff
  3. If so, it contacts the authorized reviewers, who then inform Greenlight that they approve the changes (greenlight approve 03a46dc1)
  4. Greenlight merges the branch to master and publishes the result to the central repository

Of course, there are many details elided here.

Multiple instances of Greenlight share a local repository, but to avoid confusion each has its own working tree. In Git you can configure these by setting GIT_DIR and GIT_WORK_TREE environment variables, respectively. When Greenlight needs to run a Git command, it does so like this:

    with env_var("GIT_DIR", self.repo_dir):
        with env_var("GIT_WORK_TREE", self.work_dir):
            result = subprocess.run(command, ...)

The env_var here is a Python context manager that saves the old environment, sets the new environment variable, and then when the body of the block is complete, it restores the environment to the way it was. This worked in testing every time.

But the first time a beta tester ran the approve command, Greenlight threw a fatal exception. It was trying to run git checkout --quiet --detach, and this was failing, with Git saying

fatal: this operation must be run in a work tree

Where was the GIT_WORK_TREE setting going? I still don't know. But in the course of trying to track the problem down, I changed the code above to:

    with env_var("GIT_DIR", self.repo_dir, "GIT_WORK_TREE", self.work_dir):
        result = subprocess.run(command, ...)

and the problem, whatever it was, no longer manifested.

But this revealed a second bug: Greenlight no longer failed in the approval phase. It went ahead and merged the branch, and then tried to publish the merge with git push origin .... But the push was rejected.

This is because the origin repository had an update hook that ran on every push, which performed the same review analysis that Greenlight was performing; one of Greenlight's main purposes is to be a replacement for this hook. To avoid tying up the main repository for too long, this hook had a two-minute timeout, after which it would die and reject the push. This had only happened very rarely in the past, usually when someone was inadvertently trying to push a malformed branch. For example, they might have rebased all of master onto their topic branch. In this case, however, the branch really was legitimately enormous; it contained over 2900 commits.

“Oh, right,” I said. “I forgot to add the exception to the hook that tells it that it can immediately approve anything pushed by Greenlight.” The hook can assume that if the push comes from Greenlight, it has already been checked and authorized.

Pushes are happening via SSH, and Greenlight has its own SSH identity, which is passed to the hook itself in the GL_USERNAME variable. Modifying the hook was easy: I just added:

      if environ["GL_USERNAME"] == 'greenlight':
          exit(0)

This didn't work. My first idea was that Greenlight's public SSH key had not been installed in the authorized_keys file in the right place. When I grepped for greenlight in the authorized_keys file, there were no matches. The key was actually there, but in GitLab the authorized_keys file doesn't have actual usernames in it. It has internal userids, which are then mapped to GL_USERNAME variables by some other entity. So I chased that wild goose for a while. Eventually I determined that the key was in the right place, but that the name of the Greenlight identity on the receiving side was not greenlight but bot-greenlight, which I had forgotten.

So I changed the exception to say:

      if environ["GL_USERNAME"] == 'bot-greenlight':
          exit(0)

and it still didn't work. I eventually discovered that when Greenlight did the push, the GL_USERNAME was actually set to mjd.

“Oh, right,” I said. “I forgot to have Greenlight use its own SSH credentials in the ssh connection.”

The way you do this is to write a little wrapper program that obtains the correct credentials and runs ssh, and then you set GIT_SSH to point to the wrapper. It looks like this:

    #!/usr/bin/env bash

    export -n SSH_CLIENT SSH_TTY SSH_AUTH_SOCK SSH_CONNECTION
    exec /usr/bin/ssh -i "$HOME/.ssh/identity" "$@"

But wait, why hadn't I noticed this before? Because, apparently, every single person who had alpha-tested Greenlight had had their own credentials stored in ssh-agent, and every single one had had agent-forwarding enabled, so that when Greenlight tried to use ssh to connect to the Git repository, SSH duly forwarded their credentials along and the pushes succeeded. Amazing.

With these changes, the publication went through. I committed the changes to the SSH credential stuff, and some other unrelated changes, and I looked at what was left to see what had actually fixed the original bug. Every change but one was to add diagnostic messages and logging. The fix for the original bug had been to replace the nested context managers with a single context manager. This was so unexpected that I wondered if the real problem was nondeterministic and if some of the debugging messages had somehow perturbed it. But I removed everything but the context manager change and ran another test, which succeeded. By then I was five and a half hours into the debugging and I didn't have any energy left to actually understand what the problem had been. I still don't know.

If you'd like to play along at home, the context manager looks like this, and did not change during the debugging process:

    import os
    from contextlib import contextmanager

    @contextmanager
    def env_var(*args):

        # Save old values of environment variables in `old`
        # A saved value of `None` means that the variable was not there before
        old = {}
        for i in range(len(args)//2):
            (key, value) = (args[2*i : 2*i+2])
            old[key] = None
            if key in os.environ:
                old[key] = os.environ[str(key)]

            if value is None: os.environ.pop(str(key), "dummy")
            else:
                os.environ[str(key)] = str(value)

        yield

        # Undo changes from versions saved in `old`
        for (key, value) in old.items():
            if value is None: os.environ.pop(str(key), "dummy")
            else:             os.environ[str(key)] = value
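
One hazard worth noting in passing: if the body of the with block raises an exception, the code after the yield never runs and the environment is never restored. A more defensive version (a sketch, not the code that was running) would wrap the restore in try/finally:

    # Sketch: same behavior, but an exception in the with-body cannot
    # skip the environment restore.  Assumes the keys are strings.
    import os
    from contextlib import contextmanager

    @contextmanager
    def env_var(*args):
        old = {}
        try:
            for i in range(len(args) // 2):
                (key, value) = args[2*i : 2*i + 2]
                old[key] = os.environ.get(key)
                if value is None: os.environ.pop(key, None)
                else:             os.environ[key] = str(value)
            yield
        finally:
            for (key, value) in old.items():
                if value is None: os.environ.pop(key, None)
                else:             os.environ[key] = value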

I suspect I'm being sabotaged somewhere by Python's weird implicit ideas of scope and variable duration, but I don't know. Yet.

This computer stuff is amazingly complicated. I don't know how anyone gets anything done.

[ Addendum 20181204: I figured it out. ]


[Other articles in category /prog/bug] permanent link

Mon, 29 Oct 2018

A long story about many bugs

Warning: Long and possibly dull.

I spent a big chunk of today fixing a bug that should have been easy but that just went deeper and deeper. If you look over in the left sidebar there you'll see a sub-menu titled “subtopics” with a per-category count of the number of articles in each section of this blog. (Unless you're using a small display, where the whole sidebar is suppressed.) That menu was at least a year out of date. I wanted to fix it.

The blog software I use is the wonderfully terrible Blosxom. It has a plugin system, and the topic menu was generated by a plugin that I wrote some time ago. When the topic plugin starts up it opens two Berkeley DB files. Each is a simple key-value mapping. One maps topic names to article counts. The other is just a set of article IDs for the articles that have already been counted. These key-value mappings are exposed in Perl as hash variables.

When I regenerate the static site, the topic plugin has a subroutine, story, that is called for each article in each generated page. The business end of the subroutine looks something like this:

      sub story {
        # ... acquire arguments ..

        if ( $Seen{ $article_id } ) {
          return;
        } else {
          $topic_count{ $article_topic }++;
          $Seen{ $article_id } = 1;
        }
      }

The reason the menu wasn't being updated is that at some point in the past, I changed the way story plugins were called. Out of the box, Blosxom passes story a list of five arguments, like this:

       my ($pkg, $path, $filename, $story_ref, $title_ref) = @_;

Over the years I had extended this to eight or nine, and I felt it was getting unwieldy, so at some point I changed it to pass a hash, like this:

         my %args = (
              category   => $path, # directory of this story
              filename   => $fn,   # filename of story, without suffix
              ...
         );
         $entries = $plugin->story(\%args);

When I made this conversion, I had to convert all the plugins. I missed converting topic. So instead of getting the eight or nine arguments it expected, it got two: the plugin itself, and a reference to the hash. Then it used that hash reference as the key into the databases, which by now were full of thousands of entries for things like HASH(0x436c1d), because that is what Perl silently and uselessly does if you try to use a hash reference as if it were a string.
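
The stringification is easy to reproduce:

    # A hash reference interpolated into a string becomes "HASH(0x...)",
    # and the same thing happens when it is used as a hash key:
    my %args = ( category => 'prog' );
    my $ref  = \%args;
    print "$ref\n";    # prints something like HASH(0x55f0c2a3b6c8)
    my %db;
    $db{$ref} = 1;     # the key is the useless string HASH(0x55f0c2a3b6c8)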

Anyway, this was easily fixed, or should have been easily fixed. All I needed to do was convert the plugin to use the new calling convention. Ha!

One thing all my plugins do when they start up is write a diagnostic log, something like this:

    sub start {
      open F, ">", "/tmp/topic.$>";
      print F "Writing to $blosxom::plugin_state_dir/topics\n";
    }

Then whenever the plugin has something to announce it just does print F. For example, when the plugin increments the count for a topic, it inserts a message like this:

    print F "'$article_id' is article $topic_count{$article_topic} in topic $article_topic.\n";

If the article has already been seen, it remains silent.

Later I can look in /tmp/topic.119 or whatever to see what it said. When I'm debugging a plugin, I can open an Emacs buffer on this file and put it in auto-revert mode so that Emacs always displays the current contents of the file.

Blosxom has an option to generate pages on demand for a web browser, and I use this for testing. https://blog.plover.com/PATH is the static version of the article, served from a pre-generated static file. But https://blog.plover.com/test/PATH calls Blosxom as a CGI script to generate the article on the fly and send it to the browser. So I visited https://blog.plover.com/test/2018/, which should generate a page with all the articles from 2018, to see what the plugin put in the file. I should have seen it inserting a lot of HASH(0x436c1d) garbage:

    'lang/etym/Arabic-2' is article 1 in topic HASH(0x22c501b)
    'addenda/200801' is article 1 in topic HASH(0x5300aa2)
    'games/poker-24' is article 1 in topic HASH(0x4634a79)
    'brain/pills' is article 1 in topic HASH(0x1a9f6ab)
    'lang/long-s' is article 1 in topic HASH(0x29489be)
    'google-roundup/200602' is article 1 in topic HASH(0x360e6f5)
    'prog/van-der-waerden-1' is article 1 in topic HASH(0x3f2a6dd)
    'math/math-se-gods' is article 1 in topic HASH(0x412b105)
    'math/pow-sqrt-2' is article 1 in topic HASH(0x23ebfe4)
    'aliens/dd/p22' is article 1 in topic HASH(0x878748)

I didn't see this. I saw the startup message and nothing else. I did a bunch of very typical debugging, such as having the plugin print a message every time story was called:

    sub story {
        print F "Calling 'story' (@_)\n";
        ...
    }

Nothing. But I knew that story was being called. Was I maybe editing the wrong file on disk? No, because I could introduce a syntax error and the browser would happily report the resulting 500 Server Error. Fortunately, somewhere along the way I changed

      open F, ">", "/tmp/topic.$>";

to

      open F, ">>", "/tmp/topic.$>";

and discovered that each time I loaded the page, the plugin was run exactly twice. When I had had >, the second run would immediately overwrite the diagnostics from the first run.

But why was the plugin being run twice? This took quite a while to track down. At first I suspected that Blosxom was doing it, either on purpose or by accident. My instance of Blosxom is a hideous Frankenstein monster that has been cut up and reassembled and hacked and patched dozens of times since 2006 and it is full of unpleasant surprises. But the problem turned out to be quite different. Looking at the Apache server logs I saw that the browser was actually making two requests, not one:

    100.14.199.174 - mjd [28/Oct/2018:18:00:49 +0000] "GET /test/2018/ HTTP/1.1" 200 213417 "-" ...
    100.14.199.174 - mjd [28/Oct/2018:18:00:57 +0000] "GET /test/2018/BLOGIMGREF/horseshoe-curve-small.mp4 HTTP/1.1" 200 623 ...

Since the second request was for a nonexistent article, the story callback wasn't invoked in the second run. So I would see the startup message, but I didn't see any messages from the story callback. They had been there in the first run for the first request, but that output was immediately overwritten on the second request.

BLOGIMGREF is a tag that I include in image URLs, that expands to whatever is the appropriate URL for the images for the particular article it's in. This expansion is done by a different plugin, called path2, and apparently in this case it wasn't being expanded. The place it was being used was easy enough to find; it looked like this:

     <video width="480" height="270" controls>
       <source src="BLOGIMGREF/horseshoe-curve-small.mp4" type="video/mp4">
     </video>

So I dug down into the path2 plugin to find out why BLOGIMGREF wasn't being replaced by the correct URL prefix, which should have been in a different domain entirely.

This took a very long time to track down, and I think it was totally not my fault. When I first wrote path2 I just had it do a straight text substitution. But at some point I had improved this to use a real HTML parser, supplied by the Perl HTML::TreeBuilder module. This would parse the article body and return a tree of HTML::Element objects, which the plugin would then filter, looking for img and a elements. The plugin would look for the magic tags and replace them with the right URLs.

This magic tag was not in an img or an a element, so the plugin wasn't finding it. I needed to tell the plugin to look in source elements also. Easy fix! Except it didn't work.

Then began a tedious ten-year odyssey through the HTML::TreeBuilder and HTML::Element modules to find out why it hadn't worked. It took a long time because I'm good at debugging. When you lose your wallet, you look in the most likely places first, and I know from many years of experience what the most likely places are — usually in my misunderstanding of the calling convention of some library I didn't write, or my misunderstanding of what it was supposed to do; sometimes in my own code. The downside of this is that when the wallet is in an unlikely place it takes a really long time to find it.

The end result this time was that it wasn't in any of the usual places. It was 100% not my fault: HTML::TreeBuilder has a bug in its parser. For some reason it completely ignores source elements:

    perl -MHTML::TreeBuilder -e '$z = q{<source src="/media/horseshoe-curve-small.mp4" type="video/mp4"/>}; HTML::TreeBuilder->new->parse($z)->eof->elementify()->dump(\*STDERR)' 

The output is:

    <html> @0 (IMPLICIT)
      <head> @0.0 (IMPLICIT)
      <body> @0.1 (IMPLICIT)

No trace of the source element. I reported the bug, commented out the source element in the article, and moved on. (The article was unpublished, in part because I could never get the video to play properly in the browser. I had been tearing my hair out over it, but now I knew why! The BLOGIMGREF in the URL was not being replaced! Because of a bug in the HTML parser!)

With that fixed I went back to finish the work on the topic plugin. Now that the diagnostics were no longer being overwritten by the bogus request for /test/2018/BLOGIMGREF/horseshoe-curve-small.mp4, I expected to see the HASH(0x436c1d) garbage. I did, and I fixed that. Then I expected the 'article' is article 17 in topic prog lines to go away. They were only printed for new articles that hadn't been seen before, and by this time every article should have been in the %Seen database.

But no, every article on the page, every article from 2018, was being processed every time I rebuilt the page. And the topic counts were going up, up, up.

This also took a long time to track down, because again the cause was so unlikely. I must have been desperate because I finally found it by doing something like this:


            if ( $Seen{ $article_id } ) {
              return;
            } else {
              $topic_count{ $article_topic }++;
              $Seen{ $article_id } = 1;
              die "WTF!!" unless $Seen{ $article_id };
            }

Yep, it died. Either Berkeley DB, or Perl's BerkeleyDB module, was just flat-out not working. Both of them are ancient, and this kind of shocking bug should have been shaken out 20 years ago. WTF, indeed.

I fixed this by discarding the entire database and rebuilding it. I needed to clean out the HASH(0x436c1d) crap anyway.

I am sick of DB files. I am never using them again. I have been bitten too many times. From now on I am doing the smart thing, by which I mean the dumb thing, the worse-is-better thing: I will read a plain text file into memory, modify it, and write out the modified version when I am done. It will be simple to debug the code and simple to modify the database.
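
Something in this vein, say. This is a sketch of the sort of thing I mean, not code from the plugin:

    # Sketch: a topic-count "database" as a plain text file of
    # "topic count" lines, read fully, modified in memory, written back.
    sub load_counts {
        my ($file) = @_;
        my %count;
        if (open my $fh, "<", $file) {
            while (<$fh>) {
                my ($topic, $n) = split;
                $count{$topic} = $n;
            }
        }
        return \%count;
    }

    sub save_counts {
        my ($file, $count) = @_;
        open my $fh, ">", "$file.new" or die "can't write $file.new: $!";
        print $fh "$_ $count->{$_}\n" for sort keys %$count;
        close $fh or die "close failed: $!";
        rename "$file.new", $file or die "rename failed: $!";
    }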

Well, that sucked. Usually this sort of thing is all my fault, but this time I was only maybe 10% responsible.

At least it's working again.

[ Addendum: I learned that discarding the source element is a ⸢feature⸣ of HTML::Parser. It has a list of valid HTML4 tags and by default it ignores any element that isn't one. The maintainer won't change the default to HTML5 because that might break backward compatibility for people who are depending on this behavior. ]


[Other articles in category /prog/bug] permanent link

Wed, 21 Feb 2007

A bug in HTML generation

A few days ago I hacked on the TeX plugin I wrote for Blosxom so that it would put the TeX source code into the ALT attributes of the image elements it generated.

But then I started to see requests in the HTTP error log for URLs like this:

    /pictures/blog/tex/total-die-rolls.gif$${6/choose%20k}k!{N!/over%20/prod%20{i!}^{n_i}{n_i}!}/qquad%20/hbox{/rm%20where%20$k%20=%20/sum%20n_i$}$$.gif

Someone must be referring people to these incorrect URLs, and it is presumably me. The HTML version of the blog looked okay, so I checked the RSS and Atom files, and found that, indeed, they were malformed. Instead of <img src="foo.gif" alt="$TeX$">, they contained codes for <img src="foo.gif$TeX$">.

I tracked down and fixed the problem. Usually when I get a bug like this, I ask myself what I could learn from it. This one is unusual. I can't think of much. Here's the bug.

The <img> element is generated by a function called imglink. The arguments to imglink are the filename that contains the image (for use in the SRC attribute) and the text for the ALT attribute. The ALT text is optional. If it is omitted, the function tries to locate the TeX source code and fetch it. If this attempt fails, it continues anyway, and omits the ALT attribute. Then it generates and returns the HTML:


        sub imglink {
          my $file = shift;
          ...

          my $alt = shift || fetch_tex($file);

          ...
          $alt = qq{alt="$alt"} if $alt;

          qq{<img $alt border=0 src="$url">};
        }

This function is called from several places in the plugin. Sometimes the TeX source code is available at the place from which the call comes, and the code has return imglink($file, $tex); sometimes it isn't and the code has return imglink($file) and hopes that the imglink function can retrieve the TeX.

One such place is the branch that handles generation of tags for every type of output except HTML. When generating the HTML output, the plugin actually tries to run TeX and generate the resulting image file. For other types of output, it assumes that the image file is already prepared, and just calls imglink to refer to an image that it presumes already exists:


  return imglink($file, $tex) unless $blosxom::flavour eq "html";

The bug was that I had written this instead:

  return imglink($file. $tex) unless $blosxom::flavour eq "html";

The . here is a string concatenation operator.
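
So the function received a single argument (the values here are made up for illustration):

    my $file = "foo.gif";
    my $tex  = '$$x^2$$';

    imglink($file, $tex);    # what I meant: two arguments
    imglink($file . $tex);   # what I wrote: one argument, "foo.gif$$x^2$$"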

It's a bit surprising that I don't make more errors like this than I do. I am a very inaccurate typist.

Stronger type checking would not have saved me here. Both arguments are strings, concatenation of strings is perfectly well-defined, and the imglink function was designed and implemented to accept either one or two arguments.

The function did note the omission of the $tex argument, attempted to locate the TeX source code for the bizarrely-named file, and failed, but I had opted to have it recover and continue silently. I still think that was the right design. But I need to think about that some more.

The only lesson I have been able to extract from this so far is that I need a way of previewing the RSS and Atom outputs before publishing them. I do preview the HTML output, but in this case it was perfectly correct.


[Other articles in category /prog/bug] permanent link