A complex bug with a ⸢simple⸣ fix
Last month I did a fairly complex piece of systems programming that
worked surprisingly well. But it had one big bug that took me a day
to track down.
One reason I find the bug interesting is that it exemplifies the sort
of challenges that come up in systems programming. The essence of
systems programming is that your program is dealing with the state of
a complex world, with many independent agents it can't control, all
changing things around. Often one can write a program that puts down
a wrench and then picks it up again without looking. In systems
programming, the program may have to be prepared for the possibility
that someone else has come along and moved the wrench.
The other reason the bug is interesting is that although it was a big
bug, fixing it required only a tiny change. I often struggle to
communicate to nonprogrammers just how finicky and fussy programming
is. Nonprogrammers, even people who have taken a programming class or
two, are used to being harassed by crappy UIs (or by the compiler)
about missing punctuation marks and trivially malformed inputs, and
they think they understand how fussy programming is. But they usually
do not. The issue is much deeper, and I think this is a great example
that will help communicate the point.
The job of my program, called sync-spam, was to move several weeks of
accumulated email from system S to system T. Each message was
probably spam, but its owner had not confirmed that yet, and the
message was not yet old enough to be thrown away without confirmation.
The probably-spam messages were stored on system S in
a directory hierarchy with paths like this:
/spam/2024-10-18/…
where 2024-10-18 was the date the message had been received. Every
message system S had received on October 18 was somewhere under
/spam/2024-10-18.
One directory, the one for the current date, was "active", and new
messages were constantly being written to it by some other programs
not directly related to mine. The directories for the older dates
never changed. Once sync-spam had dealt with the backlog of old messages, it
would continue to run, checking periodically for new messages in the
active directory.
The sync-spam program had a database that recorded, for each message, whether
it had successfully sent that message from S to T, so that it
wouldn't try to send the same message again.
The program worked like this:
- Repeat forever:
  - Scan the top-level spam directory for the available dates
  - For each date D:
    - Scan the directory for D and find the messages in it. Add to the database any messages not already recorded there.
    - Query the database for the list of messages for date D that have not yet been sent to T
    - For each such message:
      - Attempt to send the message
      - If the attempt was successful, record that in the database
  - Wait some appropriate amount of time and continue.
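Here's a minimal sketch of that loop, in Python rather than the Perl the real program was written in; every helper name (scan_dates, scan_directory, and so on) is an invented stand-in, not the real sync-spam code:

import time

def sync_spam_loop(db, send_to_t, scan_dates, scan_directory):
    # Hypothetical sketch of the main loop described above.
    while True:
        for date in sorted(scan_dates("/spam")):
            # Scan the date directory; record any messages not yet known.
            for msg in scan_directory(f"/spam/{date}"):
                db.add_if_missing(msg)
            # Attempt to send whatever the database says is still unsent.
            for msg in db.unsent_messages(date):
                if send_to_t(msg):
                    db.mark_sent(msg)
        time.sleep(600)  # wait some appropriate amount of time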
Okay, very good. The program would first attempt to deal with all the
accumulated messages in roughly chronological order, processing the
large backlog. Let's say that on November 1 it got around to scanning
the active 2024-11-01 directory for the first time. There are many
messages, and scanning takes several minutes, so by the time it
finishes scanning, some new messages will be in the active directory that
it hasn't seen. That's okay. The program will attempt to send the
messages that it has seen. The next time it comes around to
2024-11-01 it will re-scan the directory and find the new messages
that have appeared since the last time around.
But scanning a date directory takes several minutes, so we would
prefer not to do it if we don't have to. Since only the active
directory ever changes, if the program is running on November 1, it
can be sure that none of the directories from October will ever change
again, so there is no point in its rescanning them. In fact, once we
have located the messages in a date directory and recorded them in the
database, there is no point in scanning it again unless it is the
active directory, the one for today's date.
So sync-spam had an elaboration that made it much more efficient. It was
able to put a mark on a date directory that meant "I have completely
scanned this directory and I know it will not change again". The
algorithm was just as I said above, except with these elaborations.
- Repeat forever:
  - Scan the top-level spam directory for the available dates
  - For each date D:
    - If the directory for D is marked as having already been scanned, we already know exactly what messages are in it, since they are already recorded in the database.
    - Otherwise:
      - Scan the directory for D and find the messages in it. Add to the database any messages not already recorded there.
      - If D is not today's date, mark the directory for D as having been scanned completely, because we need not scan it again.
    - Query the database for the list of messages for date D that have not yet been sent to T
    - For each such message:
      - Attempt to send the message
      - If the attempt was successful, record that in the database
  - Wait some appropriate amount of time and continue.
It's important to not mark the active directory as having been
completely scanned, because new messages are continually being
deposited into it until the end of the day.
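In the same hypothetical Python dress as before, the per-date step now looks something like this; note the comparison that decides whether to set the mark:

def process_date(db, date, today, scan_directory, is_marked, set_mark):
    # Hypothetical sketch of the per-date step with the scan-skipping mark.
    if not is_marked(date):
        for msg in scan_directory(f"/spam/{date}"):  # slow: several minutes
            db.add_if_missing(msg)
        if date != today:    # don't mark the active directory...
            set_mark(date)   # ...any other directory will never change again
    # Either way the database now lists every known message for this
    # date, and sending proceeds from the database alone.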
I implemented this, we started it up, and it looked good. For several
days it processed the backlog of unsent messages from
September and October, and it successfully sent most of them. It
eventually caught up to the active directory for the current date, 2024-11-01, scanned
it, and sent most of the messages. Then it went back and started over
again with the earliest date, attempting to send any messages that it
hadn't sent the first time.
But a couple of days later, we noticed that something was wrong.
Directories 2024-11-02 and 2024-11-03 had been created and were
well-stocked with the messages that had been received on those dates. The
program had found the directories for those dates and had marked them as having been
scanned, but there were no messages from those dates in its database.
Now why do you suppose that is?
(Spoilers will follow the horizontal line.)
I investigated this in two ways. First, I made sync-spam's logging more
detailed and looked at the results. While I was waiting for more logs
to accumulate, I built a little tool that would generate a small,
simulated spam directory on my local machine, and then I ran sync-spam
against the simulated messages, to make sure it was doing what I
expected.
In the end, though, neither of these led directly to my solving the
problem; I just had a sudden inspiration. This is very unusual for
me. Still, I probably wouldn't have had the sudden inspiration if the
information from the logging and the debugging hadn't been percolating
around my head. Fortune favors the prepared mind.
------------------------------------------------------------------------

The problem was this: some other agent was creating the 2024-11-02
directory a bit prematurely, say at 11:55 PM on November 1.
Then sync-spam came along in the last minutes of November 1 and started its
main loop. It scanned the spam directory for available dates, and
found 2024-11-02. It processed the unsent messages from the
directories for earlier dates, then looked at 2024-11-02 for the
first time. And then, at around 11:58, following the algorithm above, it would:
- Scan the directory for 2024-11-02 and find the messages in it. Add to the database any messages not already recorded there. (There weren't any yet, because it was still 11:58 on November 1.)
- If 2024-11-02 is not today's date, mark the directory as having been scanned completely, because we need not scan it again.
Since the 2024-11-02 directory was not the one for today's date —
it was still 11:58 on November 1 — sync-spam recorded that it had
scanned that directory completely and need
not scan it again.
Five minutes later, at 00:03 on November 2, there would be new
messages in the 2024-11-02 directory, which was now the active directory,
but sync-spam wouldn't look for them, because it had already marked 2024-11-02
as having been scanned completely.
This complex problem in this large program was completely fixed by changing:
if ($date ne $self->current_date) {
    $self->mark_this_date_fully_scanned($date_dir);
}
to:
if ($date lt $self->current_date) {
    $self->mark_this_date_fully_scanned($date_dir);
}
(ne and lt are Perl-speak for "not equal to" and "less than".)
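The reason a string comparison suffices: the directory names are ISO 8601 dates, which sort chronologically as strings. Here is the distinction replayed in Python, for a prematurely created directory seen just before midnight:

date, today = "2024-11-02", "2024-11-01"

assert (date != today) is True   # ne: the premature directory gets marked
assert (date < today) is False   # lt: only strictly-past directories get marked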
Many organizations have their own version of a certain legend,
which tells how a famous
person from the past was once called out of retirement to solve a
technical problem that nobody else could understand. I first heard
the General Electric version of the legend, in which
Charles Proteus Steinmetz
was called out of retirement to figure out why a large complex of
electrical equipment was not working.
In the story, Steinmetz walked around the room, looking briefly at
each of the large complicated machines. Then, without a word, he took
a piece of chalk from his pocket, marked one of the panels, and
departed. When the puzzled engineers removed that panel, they found a
failed component, and when that component was replaced, the problem
was solved.
Steinmetz's consulting bill for $10,000 arrived the following week.
Shocked, the bean-counters replied that $10,000 seemed an exorbitant fee
for making a single chalk mark, and, hoping to embarrass him into
reducing the fee, asked him to itemize the bill.
Steinmetz returned the itemized bill:
One chalk mark ......................... $1.00
Knowing where to put it ............ $9,999.00
TOTAL ............................. $10,000.00
This felt like one of those times. Any day when I can feel a
connection with Charles Proteus Steinmetz is a good day.
This episode also makes me think of the following variation on an old joke:
A: Ask me what is the most difficult thing about systems programming.
B: Okay, what is the most difficult thing ab—
A: TIMING!
The mystery of the malformed command-line flags
Today a user came to tell me that their command
greenlight submit branch-name --require-review-by skordokott
failed, saying:
**
** unexpected extra argument 'branch-name' to 'submit' command
**
This is surprising. The command looks correct. The branch name is
required. The --require-review-by option can be supplied any
number of times (including none) and each must have a value provided.
Here it is given once and the provided value appears to be
skordokott.
The greenlight command is a crappy shell script that pre-validates
the arguments before sending them over the network to the real server.
I guessed that the crappy shell script parser wanted the branch name
last, even though the server itself would have been happy to take the
arguments in either order. I suggested that the user try:
greenlight submit --require-review-by skordokott branch-name
But it still didn't work:
**
** unexpected extra argument '--require-review-by' to 'submit' command
**
I dug into the script and discovered the problem, which was not
actually a programming error. The crappy shell script was behaving
correctly!
I had written up release notes for the --require-review-by feature.
The user had clipboard-copied the option string out of
the release notes and pasted it into the shell. So why didn't it work?
In an earlier draft of the release notes, when they were displayed as
an HTML page, there would be bad line breaks:
blah blah blah be sure to use the -
-require-review-by option…
or:
blah blah blah the new --
require-review-by feature is…
No problem, I can fix it! I just changed the pair of hyphens (- U+002D)
at the beginning of --require-review-by to Unicode nonbreaking
hyphens (‑ U+2011). Bad line breaks begone!
But then this hapless user clipboard-copied the option string out of
the release notes, including its U+2011 characters. The parser in the
script was (correctly) looking for U+002D characters, and didn't
recognize --require-review-by as an option flag.
One lesson learned: people will copy-paste stuff out of documentation,
and I should be prepared for that.
There are several places to address this. I made the error message
more transparent; formerly it would complain only about the first
argument, which was confusing because it was the one argument that
wasn't superfluous. Now it will say something like
**
** extra branch name '--require-review-by' in 'submit' command
**

or

**
** extra branch name 'skordokott' in 'submit' command
**
which is more descriptive of what it actually doesn't like.
I could change the nonbreaking hyphens in the release notes back to
regular hyphens and just accept the bad line breaks. But I don't want
to. Typography is important.
One idea I'm toying with is to have the shell script silently replace
all nonbreaking hyphens with regular ones before any further
processing. It's a hack, but it seems like it might be a harmless
one.
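The sketch below shows the shape of that hack in Python; greenlight itself is a shell script, so the real version would be a line of sed or similar, but the idea is the same:

import sys

def normalize_hyphens(args):
    # Replace nonbreaking hyphens (U+2011), which users paste in from
    # the release notes, with the ordinary hyphen-minus (U+002D) that
    # the option parser expects.
    return [a.replace("\u2011", "\u002d") for a in args]

args = normalize_hyphens(sys.argv[1:])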
So many weird things can go wrong. This computer stuff is really
complicated. I don't know how anyone gets anything done.
[ Addendum: A reader suggests that I could have fixed the line breaks with CSS. But the release notes were being presented as a Slack “Post”,
which is essentially a WYSIWYG editor for creating shared documents.
It presents the document in a canned HTML style, and as far as I know
there's no way to change the CSS it uses. Similarly, there's no way to insert raw
HTML elements, so no way to change the style per-element. ]
Super-obscure bug in my code
Say $dt is a Perl DateTime
object.
You are allowed to say
$dt->add( days => 2 )
$dt->subtract( days => 2 )
Today Jeff Boes pointed out that I had written a program that used
$dt->add({ days => 2 })
which as far as I can tell is not documented to work. But it did
work. (I wrote it in 2016 and would surely have noticed by now if it
hadn't.) Jeff told me he noticed when he copied my code and got a
warning. When I tried it, no warning.
It turns out that
$dt->add({ days => 2 })
$dt->subtract({ days => 2 })
both work, except that:
- The subtract call produces a warning (add doesn't! and Jeff had changed my add to subtract)
- If you included an end_of_month => $mode parameter in the arguments to subtract, it would get lost.
Also, the working-ness of what I wrote is a lucky fluke. It is
undocumented (I think) and works only because of a quirk of the
implementation. ->add passes its arguments to
DateTime::Duration->new , which passes them to
Params::Validate::validate . The latter is documented to accept
either form. But its use by DateTime::Duration is an undocumented
implementation detail.
->subtract works the same way, except that it does a little bit of
preprocessing on the arguments before calling
DateTime::Duration->new . That's where the warning comes from, and
why end_of_month won't work with the hashref form.
(All this is as of version 1.27. The current version is 1.51.
Matthew Horsfall points out that 1.51 does not raise a warning,
because of a different change to the same interface.)
This computer stuff is amazingly complicated. I don't know how anyone
gets anything done.
I figured out that context manager bug!
A couple of days ago I described a strange
bug in my “Greenlight” project
that was causing Git to fail unpredictably, saying:
fatal: this operation must be run in a work tree
The problem seemed to go away when I changed
with env_var("GIT_DIR", self.repo_dir):
with env_var("GIT_WORK_TREE", self.work_dir):
result = subprocess.run(command, ...)
to
with env_var("GIT_DIR", self.repo_dir, "GIT_WORK_TREE", self.work_dir):
result = subprocess.run(command, ...)
but I didn't understand why. I said:
This was so unexpected that I wondered if the real problem was
nondeterministic and if some of the debugging messages had somehow
perturbed it. But I removed everything but the context manager
change and ran another test, which succeeded. By then I was five
and a half hours into the debugging and I didn't have any energy left
to actually understand what the problem had been. I still don't
know.
The problem manifested itself again today, and this time I was able to
track it down and fix it. The context manager code I mentioned above
was not the issue.
That subprocess.run call is made inside a git_util object which,
as you can see in the tiny excerpt above, has a self.work_dir
attribute that tells it where to find the working tree. Just before
running a Git command, the git_util object installs self.work_dir
into the environment to tell Git where the working tree is.
The git_util object is originally manufactured by Greenlight itself,
which sets the work_dir attribute to a path that contains the
current process ID number. Just before the process exits, Greenlight
destroys the working tree. This way, concurrent processes never try
to use the same working tree, which would be a mess.
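I'll illustrate the idea with a hypothetical sketch (the path and names are invented, not Greenlight's actual code):

import atexit
import os
import shutil

# One working tree per process: the PID in the path keeps concurrent
# Greenlight processes apart, and the exit hook destroys the tree
# when the process finishes normally.
work_dir = f"/tmp/greenlight-worktree-{os.getpid()}"
os.makedirs(work_dir, exist_ok=True)
atexit.register(shutil.rmtree, work_dir, ignore_errors=True)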
When Greenlight needs to operate on the repository, it uses its
git_util object directly. It also creates a submission object to
represent the submitted branch, and it installs the git_util object
into the submission object, so that the submission object can also
operate on the repository. For example, the submission object may ask
its git_util object if it needs to be rebased onto some other
branch, and if so to please do it. So:
- Greenlight has a submission.
- submission.git is the git_util object that deals with Git.
- submission.git.work_dir is the path to the per-process temporary working tree.
Greenlight's main purpose is to track these submission objects, and it
has a database of them. To save time when writing the initial
implementation, instead of using a real database, I had Greenlight use
Python's “pickle”
feature to pickle
the list of submissions.
Someone would submit a branch, and Greenlight would pickle the
submission. The submission contained its git_util object, and that
got pickled along with the rest. Then Greenlight would exit and, just
before doing so, it would destroy its temporary working tree.
Then later, when someone else wanted to approve the submission for
publication, Greenlight would set up a different working tree with its
new process ID, and unpickle the submission. But the submission's
git.work_dir had been pickled with the old path, which no longer
existed.
The context manager was working just fine. It was setting
GIT_WORK_TREE to the work_dir value in the git_util object. But
the object was obsolete and its work_dir value pointed to a
directory that had been destroyed!
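The failure is easy to reproduce in miniature. This hypothetical sketch isn't Greenlight's actual class, but it shows how pickling freezes the path computed by the creating process:

import os
import pickle

class GitUtil:
    def __init__(self):
        # Computed from the PID of the process that creates the object.
        self.work_dir = f"/tmp/greenlight-worktree-{os.getpid()}"

blob = pickle.dumps(GitUtil())

# Imagine this running days later, in a different process: the
# unpickled object still carries the first process's path, even though
# that directory was destroyed when the first process exited.
stale = pickle.loads(blob)
print(stale.work_dir)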
Adding to the confusion:
- Greenlight's own git_util object was always fresh and had the right path in it, so Git commands run directly by Greenlight all worked properly.
- Any new submission objects created by Greenlight would have the right path, so Git commands run by fresh submissions also worked properly.
- Greenlight doesn't always destroy the working tree when it exits. If it exits abnormally, it leaves the working tree intact, for a later autopsy. And the unpickled submission would work perfectly if the working tree still existed, and it would be impossible to reproduce the problem!
Toward the end of the previous article, I said:
I suspect I'm being sabotaged somewhere by Python's weird implicit
ideas of scope and variable duration, but I don't know. Yet.
For the record, then: The issue was indeed one of variable duration.
But Python's weird implicit ideas were, in this instance, completely
blameless. Instead the issue was caused by a software component even
more complex and more poorly understood: “Dominus”.
This computer stuff is amazingly complicated. I don't know how anyone
gets anything done.
Another day, another bug. No, four bugs.
I'm working on a large and wonderful project called “Greenlight”.
It's a Git branch merging service that implements the following
workflow:
- Submitter submits a branch to Greenlight (greenlight submit my-topic-branch)
- Greenlight analyzes the branch to decide if it changes anything that requires review and signoff
- If so, it contacts the authorized reviewers, who then inform Greenlight that they approve the changes (greenlight approve 03a46dc1)
- Greenlight merges the branch to master and publishes the result to the central repository
Of course, there are many details elided here.
Multiple instances of Greenlight share a local repository, but to avoid
confusion each has its own working tree. In Git you can configure
these by setting GIT_DIR and GIT_WORK_TREE environment variables,
respectively. When Greenlight needs to run a Git command, it does so
like this:
with env_var("GIT_DIR", self.repo_dir):
with env_var("GIT_WORK_TREE", self.work_dir):
result = subprocess.run(command, ...)
The env_var here is a Python context manager that saves the old
environment, sets the new environment variable, and then when the body
of the block is complete, it restores the environment to the way it
was. This worked in testing every time.
But the first time a beta tester ran the approve command, Greenlight
threw a fatal exception. It was trying to run git checkout --quiet
--detach , and this was failing, with Git saying
fatal: this operation must be run in a work tree
Where was the GIT_WORK_TREE setting going? I still don't know. But
in the course of trying to track the problem down, I changed the code
above to:
with env_var("GIT_DIR", self.repo_dir, "GIT_WORK_TREE", self.work_dir):
result = subprocess.run(command, ...)
and the problem, whatever it was, no longer manifested.
But this revealed a second bug: Greenlight no longer failed in the
approval phase. It went ahead and merged the branch, and then tried
to publish the merge with git push origin ... . But the push was
rejected.
This is because the origin repository had an update hook that ran
on every push, which performed the same review analysis that Greenlight
was performing; one of Greenlight's main purposes is to be a
replacement for this hook. To avoid tying up the main repository for
too long, this hook had a two-minute timeout, after which it would die
and reject the push. This had only happened very rarely in the past,
usually when someone was inadvertently trying to push a malformed
branch. For example, they might have rebased all of master onto
their topic branch. In this case, however, the branch really was
legitimately enormous; it contained over 2900 commits.
“Oh, right,” I said. “I forgot to add the exception to the hook that
tells it that it can immediately approve anything pushed by
Greenlight.” The hook can assume that if the push comes from
Greenlight, it has already been checked and authorized.
Pushes are happening via SSH, and Greenlight has its own SSH identity,
which is passed to the hook itself in the GL_USERNAME variable.
Modifying the hook was easy: I just added:
if environ["GL_USERNAME"] == 'greenlight':
exit(0)
This didn't work. My first idea was that Greenlight's public SSH key
had not been installed in the authorized_keys file in the right
place. When I grepped for greenlight in the authorized_keys file,
there were no matches. The key was actually there, but in Gitlab the
authorized_keys file doesn't have actual usernames in it. It has
internal userids, which are then mapped to GL_USERNAME variables by
some other entity. So I chased that wild goose for a while.
Eventually I determined that the key was in the right place, but
that the name of the Greenlight identity on the receiving side was not
greenlight but bot-greenlight , which I had forgotten.
So I changed the exception to say:
if environ["GL_USERNAME"] == 'bot-greenlight':
exit(0)
and it still didn't work. I eventually discovered that when
Greenlight did the push, the GL_USERNAME was actually set to mjd .
“Oh, right,” I said. “I forgot to have Greenlight use its own
SSH credentials in the ssh connection.”
The way you do this is to write a little wrapper program that obtains
the correct credentials and runs ssh , and then you set GIT_SSH to
point to the wrapper. It looks like this:
#!/usr/bin/env bash
export -n SSH_CLIENT SSH_TTY SSH_AUTH_SOCK SSH_CONNECTION
exec /usr/bin/ssh -i "$HOME/.ssh/identity" "$@"
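Then Git has to be told to use the wrapper, which happens through the environment. In Python that would look something like this (the wrapper path is invented for illustration):

import os
import subprocess

# GIT_SSH tells Git to run this program instead of plain ssh.
env = dict(os.environ, GIT_SSH="/usr/local/libexec/greenlight-ssh")
subprocess.run(["git", "push", "origin", "master"], env=env, check=True)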
But wait, why hadn't I noticed this before? Because, apparently,
every single person who had alpha-tested Greenlight had had their own
credentials stored in ssh-agent , and every single one had had
agent-forwarding enabled, so that when Greenlight tried to use ssh
to connect to the Git repository, SSH duly forwarded their credentials
along and the pushes succeeded. Amazing.
With these changes, the publication went through. I committed the
changes to the SSH credential stuff, and some other unrelated changes,
and I looked at what was left to see what had actually fixed the
original bug. Every change but one was to add diagnostic messages and
logging. The fix for the original bug had been to replace the nested
context managers with a single context manager. This was so
unexpected that I wondered if the real problem was nondeterministic
and if some of the debugging messages had somehow perturbed it. But I
removed everything but the context manager change and ran another
test, which succeeded. By then I was five and a half hours into the
debugging and I didn't have any energy left to actually understand
what the problem had been. I still don't know.
If you'd like to play along at home, the context manager looks like
this, and did not change during the debugging process:
from contextlib import contextmanager
import os

@contextmanager
def env_var(*args):
    # Save old values of environment variables in `old`
    # A saved value of `None` means that the variable was not there before
    old = {}
    for i in range(len(args)//2):
        (key, value) = (args[2*i : 2*i+2])
        old[key] = None
        if key in os.environ:
            old[key] = os.environ[str(key)]
        if value is None: os.environ.pop(str(key), "dummy")
        else:
            os.environ[str(key)] = str(value)
    yield
    # Undo changes from versions saved in `old`
    for (key, value) in old.items():
        if value is None: os.environ.pop(str(key), "dummy")
        else: os.environ[str(key)] = value
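If you want to poke at it, a quick exercise (mine, not part of the original debugging; it assumes TEST_VAR isn't already set) behaves as intended:

import os

with env_var("TEST_VAR", "alpha"):
    assert os.environ["TEST_VAR"] == "alpha"   # set inside the block
assert "TEST_VAR" not in os.environ            # restored afterward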
I suspect I'm being sabotaged somewhere by Python's weird implicit
ideas of scope and variable duration, but I don't know. Yet.
This computer stuff is amazingly complicated. I don't know how anyone
gets anything done.
[ Addendum 20181204: I figured it
out. ]
A long story about many bugs
Warning: Long and possibly dull.
I spent a big chunk of today fixing a bug that should have been easy
but that just went deeper and deeper. If you look over in the left
sidebar there you'll see a sub-menu titled “subtopics” with a
per-category count of the number of articles in each section of this
blog. (Unless you're using a small display, where the whole sidebar
is suppressed.) That menu was at least a year out of date. I wanted
to fix it.
The blog software I use is the wonderfully terrible
Blosxom. It has a plugin system,
and the topic menu was generated by a plugin that I wrote some time
ago. When the topic plugin starts up it opens two Berkeley
DB files. Each is a simple
key-value mapping. One maps topic names to article counts. The other
is just a set of article IDs for the articles that have already been
counted. These key-value mappings are exposed in Perl as hash
variables.
When I regenerate the static site, the topic plugin has a
subroutine, story, that is called for each article in each generated
page. The business end of the subroutine looks something like this:
sub story {
    # ... acquire arguments ...
    if ( $Seen{ $article_id } ) {
        return;
    } else {
        $topic_count{ $article_topic }++;
        $Seen{ $article_id } = 1;
    }
}
The reason the menu wasn't being updated is that at some point in the
past, I changed the way story plugins were called. Out of the box,
Blosxom passes story a list of five arguments, like this:
my ($pkg, $path, $filename, $story_ref, $title_ref) = @_;
Over the years I had extended this to eight or nine, and I felt it was
getting unwieldy, so at some point I changed it to pass a hash, like
this:
my %args = (
    category => $path,   # directory of this story
    filename => $fn,     # filename of story, without suffix
    ...
);
$entries = $plugin->story(\%args);
When I made this conversion, I had to convert all the plugins. I
missed converting topic. So instead of getting the eight or nine
arguments it expected, it got two: the plugin itself, and the hash.
Then it used the hash as the key into the databases, which by now
were full of thousands of entries for things like HASH(0x436c1d)
because that is what Perl silently and uselessly does if you try to
use a hash as if it were a string.
Anyway, this was easily fixed, or should have been easily fixed.
All I needed to do was convert the plugin to use the new calling
convention. Ha!
One thing all my plugins do when they start up is write a diagnostic
log, something like this:
sub start {
    open F, ">", "/tmp/topic.$>";
    print F "Writing to $blosxom::plugin_state_dir/topics\n";
}
Then whenever the plugin has something to announce it just does
print F . For example, when the plugin increments the count for a
topic, it inserts a message like this:
print F "'$article_id' is item $topic_count{$article_topic} in topic $article_topic.\n";
If the article has already been seen, it remains silent.
Later I can look in /tmp/topic.119 or whatever to see what it said.
When I'm debugging a plugin, I can open an Emacs buffer on this file
and put it in auto-revert mode so that Emacs always displays the
current contents of the file.
Blosxom has an option to generate pages on demand for a web browser,
and I use this for testing. https://blog.plover.com/PATH is the
static version of the article, served from a pre-generated static
file. But https://blog.plover.com/test/PATH calls Blosxom as a CGI
script to generate the article on the fly and send it to the browser.
So I visited https://blog.plover.com/test/2018/ , which should
generate a page with all the articles from 2018, to see what the
plugin put in the file. I should have seen it inserting a lot of
HASH(0x436c1d) garbage:
'lang/etym/Arabic-2' is article 1 in topic HASH(0x22c501b)
'addenda/200801' is article 1 in topic HASH(0x5300aa2)
'games/poker-24' is article 1 in topic HASH(0x4634a79)
'brain/pills' is article 1 in topic HASH(0x1a9f6ab)
'lang/long-s' is article 1 in topic HASH(0x29489be)
'google-roundup/200602' is article 1 in topic HASH(0x360e6f5)
'prog/van-der-waerden-1' is article 1 in topic HASH(0x3f2a6dd)
'math/math-se-gods' is article 1 in topic HASH(0x412b105)
'math/pow-sqrt-2' is article 1 in topic HASH(0x23ebfe4)
'aliens/dd/p22' is article 1 in topic HASH(0x878748)
I didn't see this. I saw the startup message and nothing else. I did
a bunch of very typical debugging, such as having the plugin print a
message every time story was called:
sub story {
    print F "Calling 'story' (@_)\n";
    ...
}
Nothing. But I knew that story was being called. Was I maybe
editing the wrong file on disk? No, because I could introduce a
syntax error and the browser would happily report the resulting 500
Server Error. Fortunately, somewhere along the way I changed
open F, ">", "/tmp/topic.$>";
to
open F, ">>", "/tmp/topic.$>";
and discovered that each time I loaded the page, the plugin was run
exactly twice. When I had had >, the second run would immediately
overwrite the diagnostics from the first run.
But why was the plugin being run twice? This took quite a while to
track down. At first I suspected that Blosxom was doing it, either on
purpose or by accident. My instance of Blosxom is a hideous
Frankenstein monster that has been cut up and reassembled and hacked
and patched dozens of times since 2006 and it is full of unpleasant
surprises. But the problem turned out to be quite different. Looking
at the Apache server logs I saw that the browser was actually making
two requests, not one:
100.14.199.174 - mjd [28/Oct/2018:18:00:49 +0000] "GET /test/2018/ HTTP/1.1" 200 213417 "-" ...
100.14.199.174 - mjd [28/Oct/2018:18:00:57 +0000] "GET /test/2018/BLOGIMGREF/horseshoe-curve-small.mp4 HTTP/1.1" 200 623 ...
Since the second request was for a nonexistent article, the story
callback wasn't invoked in the second run. So I would see the startup
message, but I didn't see any messages from the story callback.
They had been there in the first run for the first request, but that
output was immediately overwritten on the second request.
BLOGIMGREF is a tag that I include in image URLs, that expands to
whatever is the appropriate URL for the images for the particular
article it's in. This expansion is done by a different plugin, called
path2, and apparently in this case it wasn't being expanded. The
place it was being used was easy enough to find; it looked like this:
<video width="480" height="270" controls>
<source src="BLOGIMGREF/horseshoe-curve-small.mp4" type="video/mp4">
</video>
So I dug down into the path2 plugin to find out why BLOGIMGREF
wasn't being replaced by the correct URL prefix, which should have
been in a different domain entirely.
This took a very long time to track down, and I think it was totally
not my fault. When I first wrote path2 I just had it do a straight
text substitution. But at some point I had improved this to use a real
HTML parser, supplied by the Perl HTML::TreeBuilder module. This
would parse the article body and return a tree of HTML::Element
objects, which the plugin would then filter, looking for img and a
elements. The plugin would look for the magic tags and replace them
with the right URLs.
This magic tag was not in an img or an a element, so the plugin
wasn't finding it. I needed to tell the plugin to look in source
elements also. Easy fix! Except it didn't work.
Then began a tedious ten-year odyssey through the HTML::TreeBuilder
and HTML::Element modules to find out why it hadn't worked. It took
a long time because I'm good at debugging. When you lose your wallet,
you look in the most likely places first, and I know from many years
of experience what the most likely places are — usually in my
misunderstanding of the calling convention of some library I didn't
write, or my misunderstanding of what it was supposed to do; sometimes
in my own code. The downside of this is that when the wallet is in
an unlikely place it takes a really long time to find it.
The end result this time was that it wasn't in any of the usual
places. It was 100% not my fault: HTML::TreeBuilder has a bug in
its parser. For
some reason it completely ignores source elements:
perl -MHTML::TreeBuilder -e '$z = q{<source src="/media/horseshoe-curve-small.mp4" type="video/mp4"/>}; HTML::TreeBuilder->new->parse($z)->eof->elementify()->dump(\*STDERR)'
The output is:
<html> @0 (IMPLICIT)
<head> @0.0 (IMPLICIT)
<body> @0.1 (IMPLICIT)
No trace of the source element. I reported the bug, commented out
the source element in the article, and moved on. (The article was
unpublished, in part because I could never get the video to play
properly in the browser. I had been tearing my hair about over it,
but now I knew why! The BLOGIMGREF in the URL was not being
replaced! Because of a bug in the HTML parser!)
With that fixed I went back to finish the work on the topic plugin.
Now that the diagnostics were no longer being overwritten by the bogus
request for /test/2018/BLOGIMGREF/horseshoe-curve-small.mp4 , I
expected to see the HASH(0x436c1d) garbage. I did, and I fixed
that. Then I expected the 'article' is article 17 in topic prog
lines to go away. They were only printed for new articles that hadn't
been seen before, and by this time every article should have been in
the %Seen database.
But no, every article on the page, every article from 2018, was being
processed every time I rebuilt the page. And the topic counts were
going up, up, up.
This also took a long time to track down, because again the cause was
so unlikely. I must have been desperate because I finally found it by
doing something like this:
if ( $Seen{ $article_id } ) {
    return;
} else {
    $topic_count{ $article_topic }++;
    $Seen{ $article_id } = 1;
    die "WTF!!" unless $Seen{ $article_id };
}
Yep, it died. Either Berkeley DB, or Perl's BerkeleyDB module, was
just flat-out not working. Both of them are ancient, and this kind of
shocking bug should have been shaken out 20 years ago. WTF, indeed.
I fixed this by discarding the entire database and rebuilding it. I
needed to clean out the HASH(0x436c1d) crap anyway.
I am sick of DB files. I am never using them again. I have been
bitten too many times. From now on I am doing the smart thing, by
which I mean the dumb thing, the worse-is-better thing: I will
read a plain text file into memory, modify it, and write out the
modified version when I am done. It will be simple to debug the code
and simple to modify the database.
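For concreteness, here's the sort of thing I mean, sketched in Python rather than the plugin's Perl:

import json
import os

def load_topics(path):
    # Read the whole topic database into memory; no file means empty.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_topics(path, topics):
    # Write the modified version to a temporary file, then rename it
    # into place so a crash mid-write can't truncate the database.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(topics, f, indent=2)
    os.replace(tmp, path)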
Well, that sucked. Usually this sort of thing is all my fault, but
this time I was only maybe 10% responsible.
At least it's working again.
[ Addendum: I learned that discarding the source element is a
⸢feature⸣ of HTML::Parser . It has a list of valid HTML4 tags and by
default it ignores any element that isn't one. The maintainer won't
change the default to HTML5 because that might break backward
compatibility for people who are depending on this behavior. ]
A bug in HTML generation
A few days ago I hacked on the TeX plugin I wrote for Blosxom so that
it would put the TeX source code into the ALT attributes of the image
elements it generated.
But then I started to see requests in the HTTP error log
for URLs like this:
/pictures/blog/tex/total-die-rolls.gif$${6/choose%20k}k!{N!/over%20/prod%20{i!}^{n_i}{n_i}!}/qquad%20/hbox{/rm%20where%20$k%20=%20/sum%20n_i$}$$.gif
Someone must be referring people to these incorrect URLs, and it is
presumably me. The HTML version of the blog looked okay, so I checked
the RSS and Atom files, and found that, indeed, they were malformed.
Instead of <img src="foo.gif" alt="$TeX$">, they
contained codes for <img src="foo.gif$TeX$">.
I tracked down and fixed the problem. Usually when I get a bug like
this, I ask myself what I could learn from it. This one is unusual.
I can't think of much. Here's the bug.
The <img> element is generated by a function
called imglink. The arguments to imglink are the
filename that contains the image (for use in the SRC attribute) and
the text for the ALT attribute. The ALT text is optional. If it is
omitted, the function tries to locate the TeX source code and fetch
it. If this attempt fails, it continues anyway, and omits the ALT
attribute. Then it generates and returns the HTML:
sub imglink {
    my $file = shift;
    ...
    my $alt = shift || fetch_tex($file);
    ...
    $alt = qq{alt="$alt"} if $alt;
    qq{<img $alt border=0 src="$url">};
}
This function is called from several places in the plugin. Sometimes
the TeX source code is available at the place from which the call
comes, and the code has return imglink($file, $tex);
sometimes it isn't and the code has
return imglink($file) and hopes that the imglink
function can retrieve the TeX.
One such place is the branch that handles generation of tags for every
type of output except HTML. When generating the HTML output, the
plugin actually tries to run TeX and generate the resulting image
file. For other types of output, it assumes that the image file is
already prepared, and just calls imglink to refer to an image
that it presumes already exists:
return imglink($file, $tex) unless $blosxom::flavour eq "html";
The bug was that I had written this instead:
return imglink($file. $tex) unless $blosxom::flavour eq "html";
The . here is a string concatenation operator.
It's a bit surprising that I don't make more errors like this than I
do. I am a very inaccurate typist.
Stronger type checking would not have saved me here. Both arguments
are strings, concatenation of strings is perfectly well-defined, and
the imglink function was designed and implemented to accept
either one or two arguments.
The function did note the omission of the $tex argument,
attempted to locate the TeX source code for the bizarrely-named file,
and failed, but I had opted to have it recover and continue silently.
I still think that was the right design. But I need to think about
that some more.
The only lesson I have been able to extract from this so far is that I
need a way of previewing the RSS and Atom outputs before publishing
them. I do preview the HTML output, but in this case it was
perfectly correct.