The Universe of Discourse

Fri, 07 Jul 2006

On design
I'm writing this Perl module called FlatFile, which is supposed to provide lightweight simple access to flat-file databases, such as the Unix password file. An interesting design issue came up, and since I think that understanding is usually best served by minuscule examination of specific examples, that's what I'm going to do.

The basic usage of the module is as follows: You create a database object that represents the entire database:

        my $db = FlatFile->new(FILE => "/etc/passwd", 
                               FIELDS => ['username', 'password', 'uid', 'gid',
                                          'gecos', 'homedir', 'shell'],
                               FIELDSEP => ':',
                              ) or die ...;
Then you can do queries on the database:

        my @roots = $db->lookup(uid => 0);
This returns a list of Record objects. (Actually it returns a list of FlatFile::Record::A objects, where FlatFile::Record::A is a dynamically-generated class that was manufactured at the time you did the new call, and which inherits from FlatFile::Record, but we can ignore that here.) Once we have the Record objects, we can query them or modify them:

        for my $root (@roots) {
          if ($root->username eq 'root') {
          } else {
This loops over the records that were selected in the earlier call and examines the username field in each one. if the username is root, the program sets the shell in the record to /bin/false; otherwise it deletes the record entirely.

Since lookup returns all the matching records, there is the question of what this should do:

        my $root = $db->lookup(uid => 0);
Here we have provided enough room for at most one root user. What if there is more than one?

Every Perl function needs to make a decision about this issue. The function could be called in list context or in scalar context, and you need to choose the two behaviors sensibly. Here are some possibilities for what lookup might do if called in scalar context:

  1. die unconditionally

  2. return the number of matching records, analogous to the builtin grep function or the @array syntax

  3. return the single matching record, if there is only one, and die if there is more than one.

  4. return the first matching record, and discard the others

  5. return a reference to an array of all matching records

  6. return an iterator object which can be used to access all the matching records

There are probably some other reasonable possibilities.

How to decide on the best behavior? This is the kind of problem that I really enjoy. What will people expect? What will they want? What do they need?

Two important criteria are:

  1. Difficulty: Whatever I provide should be something that's not easy to get any other way.

  2. Usefulness: Whatever I provide should be something that people will use a lot.

The difficulty criterion argues strongly against behavior #5 (return an array), because it's too much like the current list context behavior. No matter what the method does in scalar context, no matter what design decision I make, the programmer will always be able to get behavior #5 very easily:

        my $ref = [ $db->lookup(...) ];
Or they can subclass the Record module and add a new one-line method that does the same:
        sub lookup_ref {
          my $self = shift;
          [ $self->lookup(@_) ];
Similarly, behavior #2 (return a count) is so easy to get that supporting it directly would probably not be a good use of my code or my precious interface space:

        my $N_recs = () = $db->lookup(...);
I had originally planned to do #3 (require that the query produce a single record, on pain of death), and here's why: in my first forays into programming with this module, I frequently found myself writing things like my $rec = $db->lookup(...) without meaning to, and in spite of the fact that I had documented the behavior in scalar context as being undefined. I kept doing it unintentionally in cases where I expected only one record to be returned. So each time I wrote this code, I was putting in an implicit assumption that there would be only one match. I would have been quite surprised in each case if there had actually been multiple matches. That's the sort of assumption that you might like to have automatically checked.

I ran the question by the folks on IRC, and reaction against this design was generally negative. Folks said that it's not the module's job to try to discern the programmer's intention and enforce this inference by committing suicide.

I can certainly get behind that point of view. I once wrote an article complaining bitterly about modules that call die. I said it was like when you're having tea and crumpets on your 112-piece Spode china set, and you accidentally chip the teacup, and the butler comes running in, crying "Don't worry, Master! I'll take care of that for you!" and then he whips out a hammer and smashes all 112 pieces of china to tiny bits.

I don't think the point applies here, though. I had mentioned it in connection with the Text::ParseWords module, which would throw an exception if the input string was unparseable, hardly an uncommon occurrence, and one that was entirely unavoidable: if I knew that the string would be unparseable, I wouldn't be calling Text::ParseWords to parse it.

Folks on IRC said that when the method might call die, you have to wrap every call to it in an exception handler, which I certainly agree is a pain in the ass. But in this example, you do not have to do that. Here, to prevent the function from dying is very easy: just call it in list context; then it will never die. If what you want is behavior #4, to have it discard all the records but the first one, that is easy to get, regardless of the design I adopt for scalar context behavior:

        my ($rec) = $db->lookup(...);
This argues against #4 (return the first matching record) in the same way that we argued against #2 and #5 already: it's so very easy to do already, maybe we don't need an even easier way to do it. But if so, couldn't the programmer just:

        sub lookup_first {
          my $self = shift;
          my ($rec) = $self->lookup(@_);
          return $rec;
A counterargument in favor of #4 might be based on the usefulness criterion: perhaps this behavior is so commonly wanted that we really do need an even easier way to do it.

I was almost persuaded by the strong opinion in favor of #4, but then Roderick Schertler spoke up in favor of #3, for basically the reasons I set forth. I consider M. Schertler to have higher-than-normal reliability on matters of this type, so his opinion counterbalances several of the counteropinions on the other side. #3 is not too difficult to get, but still scores higher than most of the others on the difficulty scale. There doesn't seem to be a trivial inline expression of it, as there was with #2, #4, and #5. You would have to actually write a method, or else do something nasty like:

        (my ($rec) = $db->lookup(...)) < 2 or die ...;
What about the other proposed behaviors? #1 (unconditional fatality) is simple, but both criteria seem to argue against it. It does, however, have the benefit of being a good temporary solution since it is easy to change without breaking backward compatibility. Were I to adopt it, it would be very unlikely (although not impossible) that anyone would write a program that would depend on that behavior; I would then be able to change it later on.

#6 (return an iterator object) is very tempting, because it is the only one that scores high on the difficulty criterion scale: it is difficult or impossible to do this any other way, so by providing it, I am providing a real service to users of the module, rather than yet another way to do the same thing. The module's user cannot implement a good iterator interface as a wrapper around lookup, because lookup always searches the entire database before it returns, and allocates enough memory to store every returned record, whereas a good iterator interface will search only as far as is necessary to find the next matching record, and will store only one record at a time.

This performance argument would be more important if we expected the databases to be very large. But since this is a module for manipulating plain text files, we can expect that they will not be too big, and perhaps the time and memory costs of searching them will be relatively small, so perhaps this design will score fairly low on the usefulness scale.

I still haven't made up my mind, although writing this article has pushed me strongly toward #6. I would be glad to receive email on the matter.

[Other articles in category /prog] permanent link