The Universe of Discourse


Wed, 05 Dec 2018

I figured out that context manager bug!

A couple of days ago I described a strange bug in my “Greenlight” project that was causing Git to fail unpredictably, saying:

    fatal: this operation must be run in a work tree

The problem seemed to go away when I changed

    with env_var("GIT_DIR", self.repo_dir):
        with env_var("GIT_WORK_TREE", self.work_dir):
            result = subprocess.run(command, ...)

to

    with env_var("GIT_DIR", self.repo_dir, "GIT_WORK_TREE", self.work_dir):
        result = subprocess.run(command, ...)

but I didn't understand why. I said:

This was so unexpected that I wondered if the real problem was nondeterministic and if some of the debugging messages had somehow perturbed it. But I removed everything but the context manager change and ran another test, which succeeded. By then I was five and half hours into the debugging and I didn't have any energy left to actually understand what the problem had been. I still don't know.

The problem re-manifested again today, and this time I was able to track it down and fix it. The context manager code I mentioned above was not the issue.

That subprocess.run call is made inside a git_util object which, as you can see in the tiny excerpt above, has a self.work_dir attribute that tells it where to find the working tree. Just before running a Git command, the git_util object installs self.work_dir into the environment to tell Git where the working tree is.

The git_util object is originally manufactured by Greenlight itself, which sets the work_dir attribute to a path that contains the current process ID number. Just before the process exits, Greenlight destroys the working tree. This way, concurrent processes never try to use the same working tree, which would be a mess.

When Greenlight needs to operate on the repository, it uses its git_util object directly. It also creates a submission object to represent the submitted branch, and it installs the git_util object into the submission object, so that the submission object can also operate on the repository. For example, the submission object may ask its git_util object if it needs to be rebased onto some other branch, and if so to please do it. So:

  • Greenlight has a submission.
  • submission.git is the git_util object that deals with Git.
  • submission.git.work_dir is the path to the per-process temporary working tree.

Greenlight's main purpose is to track these submission objects, and it has a database of them. To save time when writing the initial implementation, instead of using a real database, I had Greenlight use Python's “pickle” feature to pickle the list of submissions.

Someone would submit a branch, and Greenlight would pickle the submission. The submission contained its git_util object, and that got pickled along with the rest. Then Greenlight would exit and, just before doing so, it would destroy its temporary working tree.

Then later, when someone else wanted to approve the submission for publication, Greenlight would set up a different working tree with its new process ID, and unpickle the submission. But the submission's git.work_dir had been pickled with the old path, which no longer existed.

The context manager was working just fine. It was setting GIT_WORK_TREE to the work_dir value in the git_util object. But the object was obsolete and its work_dir value pointed to a directory that had been destroyed!

Adding to the confusion:

  1. Greenlight's own git_util object was always fresh and had the right path in it, so Git commands run directly by Greenlight all worked properly.

  2. Any new submission objects created by Greenlight would have the right path, so Git commands run by fresh submissions also worked properly.

  3. Greenlight doesn't always destroy the working tree when it exits. If it exits abnormally, it leaves the working tree intact, for a later autopsy. And the unpickled submission would work perfectly if the working tree still existed, and it would be impossible to reproduce the problem!

Toward the end of the previous article, I said:

I suspect I'm being sabotaged somewhere by Python's weird implicit ideas of scope and variable duration, but I don't know. Yet.

For the record, then: The issue was indeed one of variable duration. But Python's weird implicit ideas were, in this instance, completely blameless. Instead the issue was cause by a software component even more complex and more poorly understood: “Dominus”.

This computer stuff is amazingly complicated. I don't know how anyone gets anything done.


[Other articles in category /prog/bug] permanent link

Sun, 02 Dec 2018

Another day, another bug. No, four bugs.

I'm working on a large and wonderful project called “Greenlight”. It's a Git branch merging service that implements the following workflow:

  1. Submitter submits a branch to Greenlight (greenlight submit my-topic-branch)
  2. Greenlight analyzes the branch to decide if it changes anything that requires review and signoff
  3. If so, it contacts the authorized reviewers, who then inform Greenlight that they approve the changes (greenlight approve 03a46dc1)
  4. Greenlight merges the branch to master and publishes the result to the central repository

Of course, there are many details elided here.

Multiple instances of Greenlight share a local repository, but to avoid confusion each has its own working tree. In Git you can configure these by setting GIT_DIR and GIT_WORK_TREE environment variables, respectively. When Greenlight needs to run a Git command, it does so like this:

    with env_var("GIT_DIR", self.repo_dir):
        with env_var("GIT_WORK_TREE", self.work_dir):
            result = subprocess.run(command, ...)

The env_var here is a Python context manager that saves the old environment, sets the new environment variable, and then when the body of the block is complete, it restores the environment to the way it was. This worked in testing every time.

But the first time a beta tester ran the approve command, Greenlight threw a fatal exception. It was trying to run git checkout --quiet --detach, and this was failing, with Git saying

fatal: this operation must be run in a work tree

Where was the GIT_WORK_TREE setting going? I still don't know. But in the course of trying to track the problem down, I changed the code above to:

    with env_var("GIT_DIR", self.repo_dir, "GIT_WORK_TREE", self.work_dir):
        result = subprocess.run(command, ...)

and the problem, whatever it was, no longer manifested.

But this revealed a second bug: Greenlight no longer failed in the approval phase. It went ahead and merged the branch, and then tried to publish the merge with git push origin .... But the push was rejected.

This is because the origin repository had an update hook that ran on every push, which performed the same review analysis that Greenlight was performing; one of Greenlight's main purposes is to be a replacement for this hook. To avoid tying up the main repository for too long, this hook had a two-minute timeout, after which it would die and reject the push. This had only happened very rarely in the past, usually when someone was inadvertently trying to push a malformed branch. For example, they might have rebased all of master onto their topic branch. In this case, however, the branch really was legitimately enormous; it contained over 2900 commits.

“Oh, right,” I said. “I forgot to add the exception to the hook that tells it that it can immediately approve anything pushed by Greenlight.” The hook can assume that if the push comes from Greenlight, it has already been checked and authorized.

Pushes are happening via SSH, and Greenlight has its own SSH identity, which is passed to the hook itself in the GL_USERNAME variable. Modifying the hook was easy: I just added:

      if environ["GL_USERNAME"] == 'greenlight':
            exit(0)

This didn't work. My first idea was that Greenlight's public SSH key had not been installed in the authorized_keys file in the right place. When I grepped for greenlight in the authorized_keys file, there were no matches. The key was actually there, but in Gitlab the authorized_keys file doesn't have actual usernames in it. It has internal userids, which are then mapped to GL_USERNAME variables by some other entity. So I chased that wild goose for a while. Eventually I determined that the key was in the right place, but that the name of the Greenlight identity on the receiving side was not greenlight but bot-greenlight, which I had forgotten.

So I changed the exception to say:

      if environ["GL_USERNAME"] == 'bot-greenlight':
            exit(0)

and it still didn't work. I eventually discovered that when Greenlight did the push, the GL_USERNAME was actually set to mjd.

“Oh, right,” I said. “I forgot to have Greenlight use its own SSH credentials in the ssh connection.”

The way you do this is to write a little wrapper program that obtains the correct credentials and runs ssh, and then you set GIT_SSH to point to the wrapper. It looks like this:

    #!/usr/bin/env bash

    export -n SSH_CLIENT SSH_TTY SSH_AUTH_SOCK SSH_CONNECTION
    exec /usr/bin/ssh -i $HOME/.ssh/identity  "$@"

But wait, why hadn't I noticed this before? Because, apparently, every single person who had alpha-tested Greenlight had had their own credentials stored in ssh-agent, and every single one had had agent-forwarding enabled, so that when Greenlight tried to use ssh to connect to the Git repository, SSH duly forwarded their credentials along and the pushes succeeded. Amazing.

With these changes, the publication went through. I committed the changes to the SSH credential stuff, and some other unrelated changes, and I looked at what was left to see what had actually fixed the original bug. Every change but one was to add diagnostic messages and logging. The fix for the original bug had been to replace the nested context managers with a single context manager. This was so unexpected that I wondered if the real problem was nondeterministic and if some of the debugging messages had somehow perturbed it. But I removed everything but the context manager change and ran another test, which succeeded. By then I was five and half hours into the debugging and I didn't have any energy left to actually understand what the problem had been. I still don't know.

If you'd like to play along at home, the context manager looks like this, and did not change during the debugging process:

    from contextlib import contextmanager

    @contextmanager
    def env_var(*args):

        # Save old values of environment variables in `old`
        # A saved value of `None` means that the variable was not there before
        old = {}
        for i in range(len(args)//2):
            (key, value) = (args[2*i : 2*i+2])
            old[key] = None
            if key in os.environ:
                old[key] = os.environ[str(key)]

            if value is None: os.environ.pop(str(key), "dummy")
            else:
                os.environ[str(key)] = str(value)

        yield

        # Undo changes from versions saved in `old`
        for (key, value) in old.items():
            if value is None: os.environ.pop(str(key), "dummy")
            else:             os.environ[str(key)] = value

I suspect I'm being sabotaged somewhere by Python's weird implicit ideas of scope and variable duration, but I don't know. Yet.

This computer stuff is amazingly complicated. I don't know how anyone gets anything done.

[ Addendum 20181204: I figured it out. ]


[Other articles in category /prog/bug] permanent link

Mon, 29 Oct 2018

A long story about many bugs

Warning: Long and possibly dull.

I spent a big chunk of today fixing a bug that should have been easy but that just went deeper and deeper. If you look over in the left sidebar there you'll se a sub-menu titled “subtopics” with a per-category count of the number of articles in each section of this blog. (Unless you're using a small display, where the whole sidebar is suppressed.) That menu was at least a year out of date. I wanted to fix it.

The blog software I use is the wonderfully terrible Blosxom. It has a plugin system, and the topic menu was generated by a plugin that I wrote some time ago. When the topic plugin starts up it opens two Berkeley DB files. Each is a simple key-value mapping. One maps topic names to article counts. The other is just a set of article IDs for the articles that have already been counted. These key-value mappings are exposed in Perl as hash variables.

When I regenerate the static site, the topic plugin has a subroutine, story, that is called for each article in each generated page. The business end of the subroutine looks something like this:

      sub story {
        # ... acquire arguments ..

        if ( $Seen{ $article_id } ) {
          return;
        } else {
          $topic_count{ $article_topic }++;
          $Seen{ $article_id } = 1;
        }
      }

The reason the menu wasn't being updated is that at some point in the past, I changed the way story plugins were called. Out of the box, Blosxom passes story a list of five arguments, like this:

       my ($pkg, $path, $filename, $story_ref, $title_ref) = @_;

Over the years I had extended this to eight or nine, and I felt it was getting unwieldy, so at some point I changed it to pass a hash, like this:

         my %args = (
              category   => $path, # directory of this story
              filename   => $fn,   # filename of story, without suffix
              ...
         )
         $entries = $plugin->story(\%args);

When I made this conversion, I had to convert all the plugins. I missed converting topic. So instead of getting the eight or nine arguments it expected, it got two: the plugin itself, and the hash. Then it used the hash as the key into the databases, which by now were full of thousands of entries for things like HASH(0x436c1d) because that is what Perl silently and uselessly does if you try to use a hash as if it were a string.

Anyway, this was easily fixed, or should have been easily fixed. All I needed to do was convert the plugin to use the new calling convention. Ha!

One thing all my plugins do when they start up is write a diagnostic log, something like this:

    sub start {
      open F, ">", "/tmp/topic.$>";
      print F "Writing to $blosxom::plugin_state_dir/topics\n";
    }

Then whenever the plugin has something to announce it just does print F. For example, when the plugin increments the count for a topic, it inserts a message like this:

    print F "'$article_id' is item $topic_count{$article_topic} in topic $article_topic.\n";

If the article has already been seen, it remains silent.

Later I can look in /tmp/topic.119 or whatever to see what it said. When I'm debugging a plugin, I can open an Emacs buffer on this file and put it in auto-revert mode so that Emacs always displays the current contents of the file.

Blosxom has an option to generate pages on demand for a web browser, and I use this for testing. https://blog.plover.com/PATH is the static version of the article, served from a pre-generated static file. But https://blog.plover.com/test/PATH calls Blosxom as a CGI script to generate the article on the fly and send it to the browser. So I visited https://blog.plover.com/test/2018/, which should generate a page with all the articles from 2018, to see what the plugin put in the file. I should have seen it inserting a lot of HASH(0x436c1d) garbage:

    'lang/etym/Arabic-2' is article 1 in topic HASH(0x22c501b)
    'addenda/200801' is article 1 in topic HASH(0x5300aa2)
    'games/poker-24' is article 1 in topic HASH(0x4634a79)
    'brain/pills' is article 1 in topic HASH(0x1a9f6ab)
    'lang/long-s' is article 1 in topic HASH(0x29489be)
    'google-roundup/200602' is article 1 in topic HASH(0x360e6f5)
    'prog/van-der-waerden-1' is article 1 in topic HASH(0x3f2a6dd)
    'math/math-se-gods' is article 1 in topic HASH(0x412b105)
    'math/pow-sqrt-2' is article 1 in topic HASH(0x23ebfe4)
    'aliens/dd/p22' is article 1 in topic HASH(0x878748)

I didn't see this. I saw the startup message and nothing else. I did a bunch of very typical debugging, such as having the plugin print a message every time story was called:

    sub story {
        print F "Calling 'story' (@_)\n";
        ...
    }

Nothing. But I knew that story was being called. Was I maybe editing the wrong file on disk? No, because I could introduce a syntax error and the browser would happily report the resulting 500 Server Error. Fortunately, somewhere along the way I changed

      open F, ">", "/tmp/topic.$>";

to

      open F, ">>", "/tmp/topic.$>";

and discovered that each time I loaded the page, the plugin was run exactly twice. When I had had >, the second run would immediately overwrite the diagnostics from the first run.

But why was the plugin being run twice? This took quite a while to track down. At first I suspected that Blosxom was doing it, either on purpose or by accident. My instance of Blosxom is a hideous Frankenstein monster that has been cut up and reassembled and hacked and patched dozens of times since 2006 and it is full of unpleasant surprises. But the problem turned out to be quite different. Looking at the Apache server logs I saw that the browser was actually making two requests, not one:

    100.14.199.174 - mjd [28/Oct/2018:18:00:49 +0000] "GET /test/2018/ HTTP/1.1" 200 213417 "-" ...
    100.14.199.174 - mjd [28/Oct/2018:18:00:57 +0000] "GET /test/2018/BLOGIMGREF/horseshoe-curve-small.mp4 HTTP/1.1" 200 623 ...

Since the second request was for a nonexistent article, the story callback wasn't invoked in the second run. So I would see the startup message, but I didn't see any messages from the story callback. They had been there in the first run for the first request, but that output was immediately overwritten on the second request.

BLOGIMGREF is a tag that I include in image URLs, that expands to whatever is the appropriate URL for the images for the particular article it's in. This expansion is done by a different plugin, called path2, and apparently in this case it wasn't being expanded. The place it was being used was easy enough to find; it looked like this:

     <video width="480" height="270" controls>
       <source src="BLOGIMGREF/horseshoe-curve-small.mp4" type="video/mp4">
     </video>

So I dug down into the path2 plugin to find out why BLOGIMGREF wasn't being replaced by the correct URL prefix, which should have been in a different domain entirely.

This took a very long time to track down, and I think it was totally not my fault. When I first wrote path2 I just had it do a straight text substitution. But at some point I had improved this to use a real HTML parser, supplied by the Perl HTML::TreeBuilder module. This would parse the article body and return a tree of HTML::Element objects, which the plugin would then filter, looking for img and a elements. The plugin would look for the magic tags and replace them with the right URLs.

This magic tag was not in an img or an a element, so the plugin wasn't finding it. I needed to tell the plugin to look in source elements also. Easy fix! Except it didn't work.

Then began a tedious ten-year odyssey through the HTML::TreeBuilder and HTML::Element modules to find out why it hadn't worked. It took a long time because I'm good at debugging. When you lose your wallet, you look in the most likely places first, and I know from many years of experience what the most likely places are — usually in my misunderstanding of the calling convention of some library I didn't write, or my misunderstanding of what it was supposed to do; sometimes in my own code. The downside of this is that when the wallet is in an unlikely place it takes a really long time to find it.

The end result this time was that it wasn't in any of the usual places. It was 100% not my fault: HTML::TreeBuilder has a bug in its parser. For some reason it completely ignores source elements:

    perl -MHTML::TreeBuilder -e '$z = q{<source src="/media/horseshoe-curve-small.mp4" type="video/mp4"/>}; HTML::TreeBuilder->new->parse($z)->eof->elementify()->dump(\*STDERR)' 

The output is:

    <html> @0 (IMPLICIT)
      <head> @0.0 (IMPLICIT)
      <body> @0.1 (IMPLICIT)

No trace of the source element. I reported the bug, commented out the source element in the article, and moved on. (The article was unpublished, in part because I could never get the video to play properly in the browser. I had been tearing my hair about over it, but now I knew why! The BLOGIMGREF in the URL was not being replaced! Because of a bug in the HTML parser!)

With that fixed I went back to finish the work on the topic plugin. Now that the diagnostics were no longer being overwritten by the bogus request for /test/2018/BLOGIMGREF/horseshoe-curve-small.mp4, I expected to see the HASH(0x436c1d) garbage. I did, and I fixed that. Then I expected the 'article' is article 17 in topic prog lines to go away. They were only printed for new articles that hadn't been seen before, and by this time every article should have been in the %Seen database.

But no, every article on the page, every article from 2018, was being processed every time I rebuilt the page. And the topic counts were going up, up, up.

This also took a long time to track down, because again the cause was so unlikely. I must have been desperate because I finally found it by doing something like this:


            if ( $Seen{ $article_id } ) {
              return;
            } else {
              $topic_count{ $article_topic }++;
              $Seen{ $article_id } = 1;
              die "WTF!!" unless $Seen{ $article_id };
            }

Yep, it died. Either Berkeley DB, or Perl's BerkeleyDB module, was just flat-out not working. Both of them are ancient, and this kind of shocking bug should have been shaken out 20 years go. WTF, indeed,

I fixed this by discarding the entire database and rebuilding it. I needed to clean out the HASH(0x436c1d) crap anyway.

I am sick of DB files. I am never using them again. I have been bitten too many times. From now on I am doing the smart thing, by which I mean the dumb thing, the worse-is-better thing: I will read a plain text file into memory, modify it, and write out the modified version whem I am done. It will be simple to debug the code and simple to modify the database.

Well, that sucked. Usually this sort of thing is all my fault, but this time I was only maybe 10% responsible.

At least it's working again.

[ Addendum: I learned that discarding the source element is a ⸢feature⸣ of HTML::Parser. It has a list of valid HTML4 tags and by default it ignores any element that isn't one. The maintainer won't change the default to HTML5 because that might break backward compatibility for people who are depending on this behavior. ]


[Other articles in category /prog/bug] permanent link

Wed, 21 Feb 2007

A bug in HTML generation
A few days ago I hacked on the TeX plugin I wrote for Blosxom so that it would put the TeX source code into the ALT attributes of the image elements it generated.

But then I started to see requests in the HTTP error log for URLs like this:

    /pictures/blog/tex/total-die-rolls.gif$${6/choose%20k}k!{N!/over%20/prod%20{i!}^{n_i}{n_i}!}/qquad%20/hbox{/rm%20where%20$k%20=%20/sum%20n_i$}$$.gif
Someone must be referring people to these incorrect URLs, and it is presumably me. The HTML version of the blog looked okay, so I checked the RSS and Atom files, and found that, indeed, they were malformed. Instead of <img src="foo.gif" alt="$TeX$">, they contained codes for <img src="foo.gif$TeX$">.

I tracked down and fixed the problem. Usually when I get a bug like this, I ask myself what I could learn from it. This one is unusual. I can't think of much. Here's the bug.

The <img> element is generated by a function called imglink. The arguments to imglink are the filename that contains the image (for use in the SRC attribute) and the text for the ALT attribute. The ALT text is optional. If it is omitted, the function tries to locate the TeX source code and fetch it. If this attempt fails, it continues anyway, and omits the ALT attribute. Then it generates and returns the HTML:

        sub imglink {
          my $file = shift;
          ...

          my $alt = shift || fetch_tex($file);

          ...
          $alt = qq{alt="$alt"} if $alt;

          qq{<img $alt border=0 src="$url">};
        }
This function is called from several places in the plugin. Sometimes the TeX source code is available at the place from which the call comes, and the code has return imglink($file, $tex); sometimes it isn't and the code has return imglink($file) and hopes that the imglink function can retrieve the TeX.

One such place is the branch that handles generation of tags for every type of output except HTML. When generating the HTML output, the plugin actually tries to run TeX and generate the resulting image file. For other types of output, it assumes that the image file is already prepared, and just calls imglink to refer to an image that it presumes already exists:

  return imglink($file, $tex) unless $blosxom::flavour eq "html";
The bug was that I had written this instead:

  return imglink($file. $tex) unless $blosxom::flavour eq "html";
The . here is a string concatenation operator.

It's a bit surprising that I don't make more errors like this than I do. I am a very inaccurate typist.

Stronger type checking would not have saved me here. Both arguments are strings, concatenation of strings is perfectly well-defined, and the imglink function was designed and implemented to accept either one or two arguments.

The function did note the omission of the $tex argument, attempted to locate the TeX source code for the bizarrely-named file, and failed, but I had opted to have it recover and continue silently. I still think that was the right design. But I need to think about that some more.

The only lesson I have been able to extract from this so far is that I need a way of previewing the RSS and Atom outputs before publishing them. I do preview the HTML output, but in this case it was perfectly correct.


[Other articles in category /prog/bug] permanent link