The Universe of Discourse


Thu, 20 Dec 2007

Another trivial utility: accumulate
As usual, whenever I write one of these things, I wonder why it took me so long to get off my butt and put in the five minutes of work that were actually required. I've wanted something like this for years. It's called accumulate. It reads an input of this form:

        k1 v1
        k1 v2
        k2 v3
        k1 v4
        k2 v5
        k3 v6
and writes it out in this format:

        k1 v1 v2 v4
        k2 v3 v5
        k3 v6
I wanted it this time because I had a bunch of files that included some duplicates, and wanted to get rid of the duplicates. So:

        md5sum * | accumulate | perl -lane 'unlink @F[2..$#F]'
(Incidentally, people sometimes argue that Perl's .. operator should count backwards when the left operand exceeds the right one. These people are wrong. There is only one argument that needs to be made to refute this idea; maybe it is the only argument that can be made. And examples of it abound. The code above is one such example.)
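
Spelled out: when a file has no duplicates, its line from accumulate has only two fields, so @F[2..$#F] had better be the empty list; a backwards-counting .. would hand the file's only copy to unlink. A quick sketch, with a made-up checksum and file name:

        # A file with no duplicates produces an accumulate line like
        #     d41d8cd98f00b204e9800998ecf8427e lonely.txt
        # so after -a splits it, @F has only two fields.
        my @F = ('d41d8cd98f00b204e9800998ecf8427e', 'lonely.txt');

        # With Perl's actual semantics, 2 .. $#F is 2 .. 1, which is the
        # empty list, so the slice is empty and nothing is unlinked.
        my @doomed = @F[2 .. $#F];              # ()

        # If .. counted backwards, 2 .. 1 would be (2, 1), the slice would
        # include 'lonely.txt', and the one-liner would delete the only copy.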

I'm afraid of insulting you by showing the source code for accumulate, because of course it is so very trivial, and you could write it in five minutes, as I did. But who knows; maybe seeing the source has some value:

        #!/usr/bin/perl

        use Getopt::Std;

        # -k and -v select the key and value columns (1-based; defaults 1 and 2)
        my %opt = (k => 1, v => 2);
        getopts('k:v:', \%opt) or usage();
        for (qw(k v)) {
          # convert 1-based column numbers to 0-based array indices;
          # negative numbers already count from the end, so leave them alone
          $opt{$_} -= 1 if $opt{$_} > 0;
        }

        my %K;
        while (<>) {
          chomp;
          my @F = split;
          push @{$K{$F[$opt{k}]}}, $F[$opt{v}];   # group each value under its key
        }

        # output comes out in hash order, not input order
        for my $k (keys %K) {
          print "$k @{$K{$k}}\n";
        }

        sub usage {
          # minimal usage message for bad options
          die "usage: accumulate [-k keycol] [-v valcol] [file ...]\n";
        }
It's tempting to add a -F option to specify a field separator other than whitespace, or an option to change the output format, or blah blah blah, but I managed to restrain myself, mostly.

Several years ago I wrote a conference tutorial about the contents of my ~/bin directory. The clearest conclusion that emerged from my analysis was that the utilities I write have too many features that I don't use. The second-clearest was that I waste too much time writing custom argument-parsing code instead of using Getopt::Std. I've tried to learn from this. One thing I found later is that a good way to sublimate the urge to put in some feature is to put in the option to enable it, and to document it, but to leave the feature itself unimplemented. This might work for you too if you have the same problem.

I did put in -k and -v options to control which input columns are accumulated. These default to the first and second columns, naturally. Maybe this was a waste of time, since it occurs to me now that accumulate -k k -v v could be replaced by cut -fk,v | accumulate, if only cut didn't suck quite so badly. Of course one could use awk '{ print $k, $v }' | accumulate (with the column numbers filled in for k and v) to escape cut's suckage. And some solution of this type obviates the need for accumulate's putative -F option also.
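For concreteness, with made-up column numbers and a made-up file name: to collect the values in column 1 under the keys in column 3, either of these would do:

        # ("somefile" here is whatever file holds the data)
        accumulate -k 3 -v 1 somefile

        perl -lane 'print "@F[2,0]"' somefile | accumulate
Well, I digress.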

The accumulate program itself reminds me of a much more ambitious project I worked on for a while between 1998 and 2001, as does the yucky line:

          push @{$K{$F[$opt{k}]}}, $F[$opt{v}];
The ambitious project was tentatively named "twingler".

Beginning Perl programmers often have trouble with compound data structures because Perl's syntax for the nested structures is so horrendous. Suppose, for example, that you have a reference to a two-dimensional array $aref, and you want to produce a hash, such that each value in the array appears as a key in the hash, associated with a list of strings in the form "m,n" indicating where in the array that value appeared. Well, of course it is obviously nothing more than:

        for my $a1 (0 .. $#$aref) {
          for my $a2 (0 .. $#{$aref->[$a1]}) {
            push @{$hash{$aref->[$a1][$a2]}}, "$a1,$a2";
          }
        }
Obviously. <sarcasm>Geez, a child could see that.</sarcasm>
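
In case the effect is not quite as obvious as all that, here is the same loop run on a couple of toy rows:

        my $aref = [ [ 'A', 'B' ],
                     [ 'C', 'B' ] ];
        my %hash;
        for my $a1 (0 .. $#$aref) {
          for my $a2 (0 .. $#{$aref->[$a1]}) {
            push @{$hash{$aref->[$a1][$a2]}}, "$a1,$a2";
          }
        }
        # %hash is now ( A => ["0,0"],  B => ["0,1", "1,1"],  C => ["1,0"] )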

The idea of twingler was that you would specify the transformation you wanted declaratively, and it would then write the appropriate Perl code to perform the transformation. The interesting part of this project is figuring out the language for specifying the transformation. It must be complex enough to be able to express most of the interesting transformations that people commonly want, but if it isn't at the same time much simpler than Perl itself, it isn't worth using. Nobody will see any point in learning a new declarative language for expressing Perl data transformations unless it is itself simpler to use than just writing the Perl would have been.

There are some hard problems here: What do people need? What subset of this can be expressed simply? How can we design a simple, limited language that people can use to express their needs? Can the language actually be compiled to Perl?

I had to face similar sorts of problems when I was writing linogram, but in the case of linogram I was more successful. I tinkered with twingler for some time and made several pages of (typed) notes but never came up with anything I was really happy with.

[ Addendum 20150508: I dumped all my Twingler notes on the blog last year. ]

At one point I abandoned the idea of a declarative language, in favor of just having the program take a sample input and a corresponding sample output, and deduce the appropriate transformation from there. For example, you would put in:

        [ [ A, B ],
          [ C, B ],
          [ D, E ] ]
and
        { B => [A, C],
          E => [D],
        }
and it would generate:
        for my $a1 (@$input) {
          my ($e1, $e2) = @$a1;
          push @{$output{$e2}}, $e1;
        }
And then presumably you could eyeball this, and if what you really wanted was @{$a1}[0, -1] instead of @$a1 you could tinker it into the form you needed without too much extra trouble. This is much nicer from a user-experience point of view, but at the same time it seems more difficult to implement.
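
The tinkered version would be nothing more than this (my rendering of the tweak, not anything the program produced):

        for my $a1 (@$input) {
          my ($e1, $e2) = @{$a1}[0, -1];   # first and last elements of each row
          push @{$output{$e2}}, $e1;
        }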

I had some ideas. One idea was to have it generate a bunch of expressions for mapping single elements from the input to the output, and then to try to unify those expressions.
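For the sample above, the single-element mappings might have looked something like this (a guess at the flavor of the thing, nothing more):

        # For the sample input and output above: where each element must land.
        $output{ $input->[0][1] }[0] = $input->[0][0];   # A goes into B's list, slot 0
        $output{ $input->[1][1] }[1] = $input->[1][0];   # C goes into B's list, slot 1
        $output{ $input->[2][1] }[0] = $input->[2][0];   # D goes into E's list, slot 0
        # With luck, unifying these three would yield the general rule: push the
        # first element of each row onto the list keyed by the second element.
But as I said, I never did figure it out.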

It's a shame, because it would have been pretty cool if I had gotten it to work.

The MIT CS grad students' handbook used to say something about how you always need to have several projects going on at once, because two-thirds of all research projects end in failure. The people you see who seem to have one success after another actually have three projects going on all the time, and you only see the successes. This is a nice example of that.


[Other articles in category /prog] permanent link