The Universe of Discourse


Thu, 12 Jul 2007

Another useful utility
Every couple of years I get a good idea for a simple utility that will make my life easier. Last time it was the following triviality, which I call f:

	#!/usr/bin/perl

	my $field = shift or usage();
	$field -= 1 if $field > 0;
	$|=1;

	while (<>) {
		chomp;
		my @f = split;
		print $f[$field], "\n";
	}

	sub usage {
		print STDERR "$0 fieldnumber\n"; 
		exit 1;
	}
I got tired of writing awk '{print $11}' when I wanted to extract the 11th field of some stream of data in a Unix pipeline, which is something I do about six thousand times a day. So I wrote this tiny thing. It was probably the most useful piece of software I wrote in that calendar year, and as you can see from the length, it certainly had the best cost-to-benefit ratio. I use it every day.

The point here is that you can replace awk '{print $11}' with just f 11. For example, f 11 access_log finds out the referrer URLs from my Apache httpd log. I also frequently use f -1, which prints the last field in each line. ls -l | grep '^l' | f -1 prints out the targets of all the symbolic links in the current directory.

Programs like this won't win me any prizes, but they certainly are useful.

Anyway, today's post was inspired by another similarly tiny utility that I expect will be similarly useful that I just finished. It's called runN:

	#!/usr/bin/perl

	use Getopt::Std;
	my %opt;
	getopts('r:n:c:v', \%opt) or usage();
	$opt{n} or usage();
	$opt{c} or usage();

	@ARGV = shuffle(@ARGV) if $opt{r};

	my $N = $opt{n};
	my %pid;
	while (@ARGV) {
	  if (keys(%pid) < $N) {
	    $pid{spawn($opt{c}, split /\s+/, shift @ARGV)} = 1;
	  } else {
	    delete $pid{wait()};
	  }
	}

	1 while wait() >= 0;

	sub spawn {
	  my $pid = fork;
	  die "fork: $!" unless defined $pid;
	  return $pid if $pid;
	  exec @_;
	  die "exec: $!";
	}
You can tell I just finished it because the shuffle() and usage() functions are unimplemented.

The idea is that you execute the program like this:

	runN -n 3 -c foo arg1 arg2 arg3 arg4...
and it runs the commands foo arg1, foo arg2, foo arg3, foo arg4, etc., simultaneously, but with no more than 3 running at a time.

The -n option says how many commands to run simultaneously; after running that many the main control waits until one has exited before starting another.

If I had implemented shuffle(), then -r would run the commands in random order, instead of in the order specified. Probably I should get rid of -c and just have the program take the first argument as the command name, so that the invocation above would become runN -n 3 foo arg1 arg2 arg3 arg4.... The -v flag, had I implemented it, would put the program into verbose mode.

I find that it's best to defer the implementation of features like -r and -v until I actually need them, which might be never. In the past I've done post-analyses of the contents of ~mjd/bin, and what I found was that my tendency was to implement a lot more features than I needed or used.

In the original implementation, the -n is mandatory, because I couldn't immediately think of a reasonable default. The only obvious choice is 1, but since the point of the program was to run programs concurrently, 1 is not reasonable. But it occurs to me now that if I let -n default to 1, then this command would replace many of my current invocations of:

	for i in ...; do
	  cmd $i
	done
which I do quite a lot. Typing runN cmd ... would be a lot quicker and easier. As I've written before, when a feature you put in turns out to have unanticipated uses, it's a sign of a good, modular design.

The code itself makes me happy for two reasons. One is that the program worked properly on the first try, which does not happen very often for me. When I was in elementary school, my teachers always complained that although I was very bright, I made a lot of careless mistakes because I was not methodical enough. They tried hard to fix this personality flaw. They did not succeed.

The other thing I like about the code is that it's so very brief. Not to say that it is any briefer than it should be; I think it's just about perfect. One of the recurring themes of my study of programming for the last few years is that beginner programmers use way more code than is necessary, just like beginning writers use way too many words. The process and concurrency management turned out to be a lot easier than I thought they would be: the default Unix behavior was just exactly what I needed. I am particularly pleased with delete $pid{wait()}. Sometimes these things just come together.

The 1 while wait() >= 0 line is a non-obfuscated version of something I wrote in my prize-winning obfuscated program, of all places. Sometimes the line between the sublime and the ridiculous is very fine indeed.

Despite my wariness of adding unnecessary features, there is at least one that I will put in before I deploy this to ~mjd/bin and start using it. I'll implement usage(), since experience has shown that I tend to forget how to invoke these things, and reading the usage message is a quicker way to figure it out than is rereading the source code. In the past, usage messages have been good investments.

I'm tempted to replace the cut-rate use of split here with something more robust. The problem I foresee is that I might want to run a command with an argument that contains a space. Consider:

	runN -n 2 -c ls foo bar "-l baz"
This runs ls foo, then ls bar, then ls -l baz. Without the split() or something like it, the third command would be equivalent to ls "-l baz" and would fail with something like -l baz: no such file or directory. (Actually it tries to interpret the space as an option flag, and fails for that reason instead.) So I put the split in to enable this usage. (Maybe this was a you-ain't-gonna-need-it moment; I'm not sure.) But this design makes it difficult or impossible to apply the command to an argument with a space in it. Suppose I'm trying to do ls on three directories, one of which is called old stuff. The natural thing to try is:

	runN -n 2 -c ls foo bar "old stuff"
But the third command turns into ls old stuff and produces:

	ls: old: No such file or directory
	ls: stuff: No such file or directory
If the split() were omitted, it would just work, but then the ls -l baz example above would fail. If the split() were replaced by the correct logic, I would be able to get what I wanted by writing something like this:

	runN -n 2 -c ls foo bar "'old stuff'"
But as it is this just produces another error:

	ls: 'old: No such file or directory
	ls: stuff': No such file or directory
Perl comes standard with a library called ShellWords that is probably close to what I want here. I didn't use it because I wasn't sure I'd actually need it—only time will tell—and because shell parsing is very complicated and error-prone, more so when it is done synthetically rather than by the shell, and even more so when it is done multiple times; you end up with horrible monstrosities like this:

	s='q=`echo "$s" | sed -e '"'"'s/'"'"'"'"'"'"'"'"'/'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'"'/g'"'"'`; echo "s='"'"'"$q"'"'"'"; echo $s'
	q=`echo "$s" | sed -e 's/'"'"'/'"'"'"'"'"'"'"'"'/g'`; echo "s='"$q"'"; echo $s
So my fear was that by introducing a double set of shell-like interpretation, I'd be opening a horrible can of escape character worms and weird errors, and my hope was that if I ignored the issue the problems might be simpler, and might never arise in practice. We'll see.

[ Addendum 20080712: Aaron Crane wrote a thoughtful followup. Thank you, M. Crane. ]


[Other articles in category /prog] permanent link