Fri, 29 Nov 2024
A complex bug with a ⸢simple⸣ fix
Last month I did a fairly complex piece of systems programming that worked surprisingly well. But it had one big bug that took me a day to track down.

One reason I find the bug interesting is that it exemplifies the sort of challenges that come up in systems programming. The essence of systems programming is that your program is dealing with the state of a complex world, with many independent agents it can't control, all changing things around. Often one can write a program that puts down a wrench and then picks it up again without looking. In systems programming, the program may have to be prepared for the possibility that someone else has come along and moved the wrench.

The other reason the bug is interesting is that although it was a big bug, fixing it required only a tiny change. I often struggle to communicate to nonprogrammers just how finicky and fussy programming is. Nonprogrammers, even people who have taken a programming class or two, are used to being harassed by crappy UIs (or by the compiler) about missing punctuation marks and trivially malformed inputs, and they think they understand how fussy programming is. But they usually do not. The issue is much deeper, and I think this is a great example that will help communicate the point.

The job of my program, called The probably-spam messages were stored on system S in a directory hierarchy with paths like this:
where One directory, the one for the current date, was "active", and new
messages were constantly being written to it by some other programs
not directly related to mine. The directories for the older dates
never changed. Once The The program worked like this:
Okay, very good. The program would first attempt to deal with all the
accumulated messages in roughly chronological order, processing the
large backlog. Let's say that on November 1 it got around to scanning
the active But scanning a date directory takes several minutes, so we would prefer not to do it if we don't have to. Since only the active directory ever changes, if the program is running on November 1, it can be sure that none of the directories from October will ever change again, so there is no point in its rescanning them. In fact, once we have located the messages in a date directory and recorded them in the database, there is no point in scanning it again unless it is the active directory, the one for today's date. So
It's important to not mark the active directory as having been completely scanned, because new messages are continually being deposited into it until the end of the day. I implemented this, we started it up, and it looked good. For several
days it processed the backlog of unsent messages from
September and October, and it successfully sent most of them. It
eventually caught up to the active directory for the current date, But a couple of days later, we noticed that something was wrong.
Directories Now why do you suppose that is? (Spoilers will follow the horizontal line.) I investigated this in two ways. First, I made In the end, though, neither of these led directly to my solving the problem; I just had a sudden inspiration. This is very unusual for me. Still, I probably wouldn't have had the sudden inspiration if the information from the logging and the debugging hadn't been percolating around my head. Fortune favors the prepared mind. The problem was this: some other agent was creating the Then
There weren't any yet, because it was still 23:58 on November 1.
Since the Five minutes later, at 00:03 on November 2, there would be new
messages in the This complex problem in this large program was completely fixed by changing:
if ($date ne $self->current_date) {
    $self->mark_this_date_fully_scanned($date_dir);
}
to:
if ($date lt $self->current_date) {
    $self->mark_this_date_fully_scanned($date_dir);
}
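
To see why one operator matters so much, here is a minimal sketch of the two tests side by side. In Perl, `ne` is string "not equal" and `lt` is string "less than". The sketch assumes that dates are zero-padded `YYYY-MM-DD` strings, so that string order agrees with calendar order; that format and the helper names `old_check` and `new_check` are hypothetical and are not taken from the real program.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the two versions of the test.
# Assumes dates are zero-padded YYYY-MM-DD strings, so that
# string comparison agrees with chronological order.
sub old_check { my ($date, $today) = @_; return $date ne $today ? 1 : 0 }
sub new_check { my ($date, $today) = @_; return $date lt $today ? 1 : 0 }

my $today = '2024-11-01';    # the program's idea of the current date

# Yesterday, today, and a directory for tomorrow that some other
# agent has created a little early.
for my $date ('2024-10-31', '2024-11-01', '2024-11-02') {
    printf "%s  old (ne): %-3s  new (lt): %s\n",
        $date,
        (old_check($date, $today) ? 'yes' : 'no'),
        (new_check($date, $today) ? 'yes' : 'no');
}
```

Both versions agree about yesterday (mark it as fully scanned) and about today (do not). They differ only on tomorrow's directory: the `ne` version marks the still-empty directory as fully scanned, so it is never looked at again, while the `lt` version leaves it alone until its date is in the past.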
Many organizations have their own version of a certain legend, which tells how a famous person from the past was once called out of retirement to solve a technical problem that nobody else could understand. I first heard the General Electric version of the legend, in which Charles Proteus Steinmetz was called out of retirement to figure out why a large complex of electrical equipment was not working.

In the story, Steinmetz walked around the room, looking briefly at each of the large complicated machines. Then, without a word, he took a piece of chalk from his pocket, marked one of the panels, and departed. When the puzzled engineers removed that panel, they found a failed component, and when that component was replaced, the problem was solved.

Steinmetz's consulting bill for $10,000 arrived the following week. Shocked, the bean-counters replied that $10,000 seemed an exorbitant fee for making a single chalk mark, and, hoping to embarrass him into reducing the fee, asked him to itemize the bill. Steinmetz returned the itemized bill:
This felt like one of those times. Any day when I can feel a connection with Charles Proteus Steinmetz is a good day. This episode also makes me think of the following variation on an old joke: