The Universe of Discourse


Sat, 08 Jul 2006

A programmer had a problem...
A while back, I wrote an article in which I mentioned a programmer who had a problem, tried to solve it with weak references, and, as a result, had two problems. I said that weak references work unusually well in that little formula.

Yesterday I was about to make the same mistake. I had a problem, and weak references seemed like the solution. Fortunately, it was time to go home, which is a two-mile walk. Taking a two-mile walk is a great way to fix mistakes, especially the ones you haven't made yet. On this particular walk, I came to my senses and avoided the weak references.

The problem concerns the following classes and methods. You have a database object $db. You can call @rec = $db->lookup, which may return some record objects that represent records. You then call methods on the records, say $rec[3]->get_color, to extract data from them, or $rec[3]->set_color("purple"), to modify the data in the records. The updating is done in-memory only, and a later call to $db->flush writes all the updates back to the database.

The database object needs to store the changes that have been made but not yet written out. The easy way to do this is to have it store a change log of the modified record objects. So set_color first makes its change to the target record object, and then calls an internal _update method on the original database object to attach the record to the change log. Later on, flush will process this array, writing out the indicated changes.

In order for set_color to know which database to direct the _update call to, each record object must have a pointer back to the database that created it. This is convenient for other purposes too. Fine. But then if the record object is stored in the change log inside the database object, we now have a reference loop: the database contains a change log with a pointer to the record, which contains a pointer back to the database itself. This means that neither the database nor the record will ever be garbage collected. (This problem is common in complex Perl programs, and would simply vanish if Perl had even a slightly less awful garbage collector. Improvement is unlikely to occur before the release of Perl 6, now scheduled for October 28, 2073.)

My first reaction when faced with a problem like this one is to gurgle contentedly in my sleep, turn over, and pull the blankets over my head. This strategy is the primary contributor to my success as a programmer; it is somewhat superior to the typical programmer's response, which is to swing into action, overthink the problem, and come up with an elaborate solution. Aron Nimzovitch once said that the problem chess novices have is the irrepressible urge to always be doing something. Programmers are similar. They are all very bright people, very good at solving problems, and they solve problems all the time, even the ones that don't need to be solved.

I seem to be digressing. How unusual. In any case, this problem really did have to be solved. One wants the database object to flush out its pending changes at the time it becomes inacessible. If the object is never garbage collected, then the programmer must always remember to flush out the changes manually. Miss one call to flush, and your updates are lost. This is unacceptable. The primary purpose of a database is to record the updates. So I had to take my head out from under the covers, like it or not.

I thought about several solutions, and even tried one out, but it was too complicated and got me into a horrible tar pit, so I threw it away and started over. (That is another superior strategy that programmers don't exercise as often as they should. As Erik Naggum says, they will drive a hundred miles through a forest, stopping every five feet to cut down another tree, instead of pausing to wonder if maybe they shouldn't have driven off the road in the first place.)

Then I got the bright idea to use weak references, which seemed like just the thing. That's what weak references are for: breaking dependency loops so that things that need to be garbage collected can be. Fortunately, it was time to go, so I walked home instead of diving into the chyme-filled swimming pool of weak references.

With the weak references, you need to decide which reference to weaken. There is a reference to the record object, in the change log inside the database object. And there is a reference to the database object, in the record object. Which do you weaken?

If you weaken the reference to the record, you get a disaster:

        {
          my ($rec) = $db->lookup(...);
          $rec->set_color("purple");
        }
        $db->flush;
When the block is exited, the last strong reference to the record goes away, and the modified record evaporates, leaving nothing inside the database object. The flush method can see by the lingering ghost that there was something there it was supposed to deal with, but it no longer knows what. So that choice is doomed.

What if you weaken the reference inside the record, the one that points back to the database? That is hardly any better:

        my $rec;
        {
          my $db = FlatFile->new(...);
          ($rec) = $db->lookup(...);
        }
        $rec->set_color("purple");
We would like the database object to hang around as long as there are still some extant records from it. But because we weakened the references from the records to the database, it doesn't; it evaporates at the end of the block, leaving the record orphaned. The set_color method then fails, because the database to which it is supposed to write changes has evaporated.

Conclusion: I've heard it before, and it wasn't funny the first time.

On the walk home, I realized something else: actually storing the database data inside the record objects is a bad move. The general advice under which this is a bad move is something like Don't store the same data in two places. The specific problems in this instance are exemplified by this:

        my ($a) = $db->lookup(unique_id => "142857");
        my ($b) = $db->lookup(unique_id => "142857");
        $a->set_color("red");
        $b->set_color("purple");
        $a->color eq "purple";  # True or false?
Since $a and $b represent the same record, the answer should be true. But in the implementation I had (and still have, actually; I haven't fixed this yet) it is false. The set_color method on $b updates the data that is cached in object $b, but has no idea that it should also update the data cached in $a.

To work properly, $a and $b should be identical objects. One way to do this is to store an object in memory for every record in the database, and hand out these preconstructed objects as needed; then both calls to lookup return the same object. This is time- and memory-intensive. Another way to do this is to cache the record objects as they are constructed, and arrange for lookup to return the cached objects when appropriate. This is more complicated.

A simpler solution is not to store the data in memory at all. Record objects are always created as needed, but contain nothing but a database handle and some sort of locator information that says how to get the record data, should it be asked for. ("Any problem can be solved by another layer of indirection," they say, although it's not really true. Still, there are several classes of problems that can be solved by adding another layer of indirection, and this particular object identity problem could serve well as an exemplar of one of those classes.) Then modifications don't go into the record objects themselves. Instead, they go into the database object as an instruction to modify a certain record in a certain way.

This solution, however, presupposes that there is a good way to build locator information for a flat file and update it as needed. Fortunately, there is. I did a really good job of solving this problem a few years ago when I wrote the Tie::File module. It represents a text file as a Perl array, so a record locator can simply be an index into the array, and a record object then becomes something like:

        {
          db => $db,
          recno => 37,
        }
The change log inside the database object looks something like:

        { 0 => no change,
          1 => no change,
          2 => "color" field was set to "purple",
          3 => no change,
          4 => "size" field was set to "unusually large",
          ...
        }        
This happily gets rid of the garbage collection problem I had been trying to solve in the first place.

Using Tie::File also eliminates a lot of I/O issues that I had solved before, and gets all the I/O code out of the database module. I had already been thinking about getting rid of the explicit I/O and having the database module depend on Tie::File, and when I recognized the lurking record object identity problem, I was convinced that it had to happen sooner rather than later. Having done it, I'm really pleased with the outcome.


[Other articles in category /prog] permanent link