# The Universe of Discourse

Fri, 21 Mar 2008

z-commands
The gzip distribution includes a command called zcat. Its command-line arguments can include any number of filenames, compressed or not, and it prints out the contents, uncompressing them on the fly if necessary. Sometime later a zgrep command appeared, which was similar but which also performed a grep search.

But for anything else, you either need to uncompress the files, or build a special tool. I have a utility that scans the web logs of blog.plover.com, and extracts a report about new referrers. The historical web logs are normally kept compressed, so I recently built in support for decompression. This is quite easy in Perl. Normally one scans a sequence of input files something like this:

        while (<>) {
          ... do something with $_ ...
        }

The <> operator implicitly scans all the lines in all the files named in the command-line arguments, opening a new file each time the previous one is exhausted. To decompress the files on the fly, one can preprocess the command-line arguments:

        for (@ARGV) {
          if (/\.gz$/) {
            $_ = "gzip -dc $_ |";
          }
        }

        while (<>) {
          ... do something with $_ ...
        }

The for loop scans the command-line arguments, replacing each one that has the form foo.gz with gzip -dc foo.gz |. Perl's magic open semantics treat filenames specially if they end with a pipe symbol: a pipe to a command is opened instead. Of course, anyone can think of half a dozen ways in which this can go wrong. But Larry Wall's skill in making such tradeoffs has been a large factor in Perl's success.

But it bothered me to have to make this kind of change in every program that wanted to handle compressed files. We have zcat and zgrep; where are zcut, zpr, zrev, zwc, zcol, zbc, zsed, zawk, and so on? Echh.

But after I got to thinking about it, I decided that I could write a single z utility that would do a lot of the same things. Instead of this:

        zsed -e 's/:.*//' * | ...

where the * matches some files that have .gz suffixes and some that haven't, one would write:

        z sed -e 's/:.*//' * | ...

and it would Just Work. That's the idea, anyway.

If sed were written in Perl, z would have an easy job. It could rely on Perl's magic open, and simply preprocess the arguments before running sed:

        # hypothetical implementation of z
        #
        my $command = shift;
        for (@ARGV) {
          if (/\.gz$/) {
            $_ = "gzip -dc $_ |";
          }
        }
        exec $command, @ARGV;
        die "Couldn't run command '$command': $!\n";

But sed is not written in Perl, and has no magic open. So I have to play a trickier trick:

        for my $file (@ARGV) {
          if ($file =~ /\.gz$/) {
            unless (open($fhs[@fhs], "-|", "gzip", "-cd", $file)) {
              warn "Couldn't open file '$file': $!; skipping\n";
              next;
            }
            my $fd = fileno $fhs[-1];
            # $file aliases the current element of @ARGV, so this rewrites the argument
            $file = "/proc/self/fd/$fd";
          }
        }
        # warn "running $command @ARGV\n";
        exec $command, @ARGV;
        die "Couldn't run command '$command': $!\n";

This is a stripped-down version to illustrate the idea. For various reasons that I explained yesterday, it does not actually work. The complete, working source code is here.

The idea, as before, is that the program preprocesses the command-line arguments. But instead of replacing the arguments with pipe commands, which are not supported by open(2), the program sets up the pipes itself, and then directs the command to take its input from the pipes by specifying the appropriate items from /proc/self/fd.
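Here is a minimal sketch of that idea. It is not the complete program mentioned above, only an illustration under some assumptions: a Linux-style /proc/self/fd, gzip on the PATH, and a made-up subroutine name, rewrite_args. One thing any such program has to deal with is that Perl sets the close-on-exec flag on descriptors numbered above $^F (2 by default), so the pipes would be closed at the moment of exec. The sketch clears that flag with fcntl. Whether this is among the reasons the stripped-down version fails is my assumption, not something stated here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(F_GETFD F_SETFD FD_CLOEXEC);

my @fhs;    # keeps the pipe handles open until exec

# Hypothetical helper: replace each argument that names an existing .gz
# file with a /proc/self/fd path that reads from a "gzip -cd" pipe.
# All other arguments are passed through unchanged.
sub rewrite_args {
    my @args = @_;
    for my $arg (@args) {
        next unless $arg =~ /\.gz$/ && -f $arg;
        my $fh;
        unless (open($fh, "-|", "gzip", "-cd", $arg)) {
            warn "Couldn't open file '$arg': $!; skipping\n";
            next;
        }
        # Clear close-on-exec so the descriptor survives the exec.
        my $flags = fcntl($fh, F_GETFD, 0);
        defined $flags or die "Couldn't read fd flags for '$arg': $!\n";
        fcntl($fh, F_SETFD, $flags & ~FD_CLOEXEC)
            or die "Couldn't clear close-on-exec for '$arg': $!\n";
        push @fhs, $fh;
        # Assumes a Linux-style /proc; $arg aliases the element of @args.
        $arg = "/proc/self/fd/" . fileno($fh);
    }
    return @args;
}

# Do nothing when run without arguments, so the subroutine can be
# loaded and tried on its own.
if (@ARGV) {
    my $command = shift @ARGV;
    my @args    = rewrite_args(@ARGV);
    exec { $command } $command, @args;    # indirect form never uses a shell
    die "Couldn't run command '$command': $!\n";
}
```

Saved as z, this would be run as z grep immediately *, with the caveat that anything ending in .gz that is not an existing file (a pattern, say) is left alone.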

The trick depends crucially on having /proc/self/fd, or /dev/fd, or something of the sort, because otherwise there's no way to trick the command into reading from a pipe when it thinks it is opening a file. (Actually there is at least one other way, involving FIFOs, which I plan to discuss tomorrow.) Most modern systems do have /proc/self/fd. That feature postdates my earliest involvement with Unix, so it isn't a ready part of my mental apparatus as perhaps it ought to be. But this utility seems to me like a sort of canonical application of /proc/self/fd, in the sense that, if you couldn't think what /proc/self/fd might be good for, then you could read this example and afterwards have a pretty clear idea.
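Since the trick stands or falls on one of these directories being present, a program that uses it might check at startup. The following is only a sketch; the names fd_dir and fd_path are invented for illustration, and the list of candidate directories is an assumption that covers just the two places named above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first descriptor directory that exists
# on this system, or undef if neither candidate is present.
sub fd_dir {
    for my $dir ("/proc/self/fd", "/dev/fd") {
        return $dir if -d $dir;
    }
    return undef;
}

# Hypothetical helper: the filename under which another program could
# open the given filehandle, or undef if no descriptor directory exists.
sub fd_path {
    my ($fh) = @_;
    my $dir = fd_dir();
    return undef unless defined $dir;
    return "$dir/" . fileno($fh);
}
```

A caller would fall back to some other strategy, or give up with an error message, when fd_dir returns undef.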

The z utility has a number of flaws. Principally, the original filenames are gone. Here's a typical run with regular zgrep:

        % zgrep immediately *
        ctime.blog:we want to update.  It is immediately copied into a register, and
        qmail-throttle.blog.gz:program continues immediately, possibly posting its message.  (It
        struct-inode.blog:is a symbolic link, its inode is returned immediately; iname() would
        sync.blog:and reports success back to the process immediately, even though the

But here's the same thing with z:

        % z grep immediately *
        ctime.blog:we want to update.  It is immediately copied into a register, and
        /proc/self/fd/5:program continues immediately, possibly posting its message.  (It
        struct-inode.blog:is a symbolic link, its inode is returned immediately; iname() would
        sync.blog:and reports success back to the process immediately, even though the

The problem is even more glaring in the case of commands like wc:

        % z wc *
         411    2611   16988 ctime.blog
          71     358    2351 /proc/self/fd/3
         121     725    5053 /proc/self/fd/4
          51     380    2381 files-talk.blog
          48     145     885 find-uniq.pl
         288    2159   12829 /proc/self/fd/5
          95     665    4337 ssh-agent-revisted.blog
         221     941    6733 struct-inode.blog
         106     555    3976 sync-2.blog
         115     793    4904 sync.blog
         124     624    4208 /proc/self/fd/6
        1651    9956   64645 total


So perhaps z will not turn out to be useful enough to be more than a curiosity. But I'm not sure yet.

This is article #300 on my blog. Thanks for reading.