Fri, 21 Mar 2008
z-commands
But for anything else, you either need to uncompress the files, or build a special tool. I have a utility that scans the web logs of blog.plover.com, and extracts a report about new referrers. The historical web logs are normally kept compressed, so I recently built in support for decompression. This is quite easy in Perl. Normally one scans a sequence of input files something like this:
    while (<>) {
      ... do something with $_ ...
    }

The <> operator implicitly scans all the lines in all the files named in the command-line arguments, opening a new file each time the previous one is exhausted. To decompress the files on the fly, one can preprocess the command-line arguments:
    for (@ARGV) {
      if (/\.gz$/) {
        $_ = "gzip -dc $_ |";
      }
    }

    while (<>) {
      ... do something with $_ ...
    }

The for loop scans the command-line arguments, replacing each one that has the form foo.gz with gzip -dc foo.gz |. Perl's magic open semantics treat filenames specially if they end with a pipe symbol: a pipe to a command is opened instead. Of course, anyone can think of half a dozen ways in which this can go wrong. But Larry Wall's skill in making such tradeoffs has been a large factor in Perl's success.

But it bothered me to have to make this kind of change in every program that wanted to handle compressed files. We have zcat and zgrep; where are zcut, zpr, zrev, zwc, zcol, zbc, zsed, zawk, and so on? Echh.

But after I got to thinking about it, I decided that I could write a single z utility that would do a lot of the same things. Instead of this:
    zsed -e 's/:.*//' * | ...

where the * matches some files that have .gz suffixes and some that haven't, one would write:
    z sed -e 's/:.*//' * | ...

and it would Just Work. That's the idea, anyway.

If sed were written in Perl, z would have an easy job. It could rely on Perl's magic open, and simply preprocess the arguments before running sed:
    # hypothetical implementation of z
    #
    my $command = shift;
    for (@ARGV) {
      if (/\.gz$/) {
        $_ = "gzip -dc $_ |";
      }
    }
    exec $command, @ARGV;
    die "Couldn't run command '$command': $!\n";

But sed is not written in Perl, and has no magic open. So I have to play a trickier trick:
    for my $file (@ARGV) {
      if ($file =~ /\.gz$/) {
        unless (open($fhs[@fhs], "-|", "gzip", "-cd", $file)) {
          warn "Couldn't open file '$file': $!; skipping\n";
          next;
        }
        my $fd = fileno $fhs[-1];
        # $file aliases the @ARGV element, so this rewrites the argument in place
        $file = "/proc/self/fd/$fd";
      }
    }
    # warn "running $command @ARGV\n";
    exec $command, @ARGV;
    die "Couldn't run command '$command': $!\n";

This is a stripped-down version to illustrate the idea. For various reasons that I explained yesterday, it does not actually work. The complete, working source code is here.

The idea, as before, is that the program preprocesses the command-line arguments. But instead of replacing the arguments with pipe commands, which are not supported by open(2), the program sets up the pipes itself, and then directs the command to take its input from the pipes by specifying the appropriate items from /proc/self/fd.

The trick depends crucially on having /proc/self/fd, or /dev/fd, or something of the sort, because otherwise there's no way to trick the command into reading from a pipe when it thinks it is opening a file. (Actually there is at least one other way, involving FIFOs, which I plan to discuss tomorrow.) Most modern systems do have /proc/self/fd. That feature postdates my earliest involvement with Unix, so it isn't a ready part of my mental apparatus as perhaps it ought to be. But this utility seems to me like a sort of canonical application of /proc/self/fd, in the sense that, if you couldn't think what /proc/self/fd might be good for, then you could read this example and afterwards have a pretty clear idea.

The z utility has a number of flaws. Principally, the original filenames are gone. Here's a typical run with regular zgrep:
    % zgrep immediately *
    ctime.blog:we want to update. It is immediately copied into a register, and
    env-2.blog.gz:All five people who wrote to me about this immediately said "oh, yes,
    qmail-throttle.blog.gz:program continues immediately, possibly posting its message. (It
    struct-inode.blog:is a symbolic link, its inode is returned immediately; iname() would
    sync.blog:and reports success back to the process immediately, even though the

But here's the same thing with z:
    % z grep immediately *
    ctime.blog:we want to update. It is immediately copied into a register, and
    /proc/self/fd/3:All five people who wrote to me about this immediately said "oh, yes,
    /proc/self/fd/5:program continues immediately, possibly posting its message. (It
    struct-inode.blog:is a symbolic link, its inode is returned immediately; iname() would
    sync.blog:and reports success back to the process immediately, even though the

The problem is even more glaring in the case of commands like wc:
    % z wc *
       411  2611 16988 ctime.blog
        71   358  2351 /proc/self/fd/3
       121   725  5053 /proc/self/fd/4
        51   380  2381 files-talk.blog
        48   145   885 find-uniq.pl
       288  2159 12829 /proc/self/fd/5
        95   665  4337 ssh-agent-revisted.blog
       221   941  6733 struct-inode.blog
       106   555  3976 sync-2.blog
       115   793  4904 sync.blog
       124   624  4208 /proc/self/fd/6
      1651  9956 64645 total

So perhaps z will not turn out to be useful enough to be more than a curiosity. But I'm not sure yet.

This is article #300 on my blog. Thanks for reading.

[ Addendum 20080322: There is a followup to this article. ]
[ Addendum 20080325: Another followup. ]
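For readers who want to experiment with the /proc/self/fd trick, here is a minimal sketch of how it might be made to run. It is not the complete source linked above, and it may well differ from it. It assumes a Linux-style /proc/self/fd and a gzip on the PATH. The helper name rewrite_args is made up for this illustration. One thing a version like this has to deal with is that Perl marks file descriptors numbered above $^F (normally 2) as close-on-exec, so the pipes would vanish at the moment of the exec; the sketch clears that flag with fcntl. Whether that is how the linked source handles it is an assumption on my part.

```perl
#!/usr/bin/perl
# Sketch only: a z-like wrapper that keeps its gzip pipes open across exec.
# Assumes /proc/self/fd exists (e.g. Linux) and that gzip is on the PATH.
use strict;
use warnings;
use Fcntl qw(F_GETFD F_SETFD FD_CLOEXEC);

# rewrite_args (hypothetical helper): replace each argument ending in .gz
# with a /proc/self/fd/N path that reads from a "gzip -cd" pipe.
# Returns the rewritten arguments and the pipe handles; the caller must
# keep the handles alive until the exec.
sub rewrite_args {
    my @args = @_;
    my @fhs;
    for my $arg (@args) {
        next unless $arg =~ /\.gz$/;
        my $fh;
        unless (open($fh, "-|", "gzip", "-cd", $arg)) {
            warn "Couldn't open file '$arg': $!; skipping\n";
            next;
        }
        # Perl sets close-on-exec on descriptors above $^F; clear it so
        # the exec'ed command inherits this pipe.
        my $flags = fcntl($fh, F_GETFD, 0);
        defined $flags or die "fcntl F_GETFD failed: $!\n";
        fcntl($fh, F_SETFD, $flags & ~FD_CLOEXEC)
            or die "fcntl F_SETFD failed: $!\n";
        push @fhs, $fh;
        $arg = "/proc/self/fd/" . fileno($fh);
    }
    return (\@args, \@fhs);
}

# Act as z only when given a command; with no arguments the script does
# nothing, so rewrite_args can be tried on its own.
if (@ARGV) {
    my $command = shift @ARGV;
    my ($args, $fhs) = rewrite_args(@ARGV);   # $fhs keeps the pipes open
    exec $command, @$args;
    die "Couldn't run command '$command': $!\n";
}
```

Saved under a hypothetical name such as z.pl, it would be invoked as perl z.pl wc -l notes.txt archive.gz. It shares the flaw described above: the command sees /proc/self/fd/N in place of the original filename.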