Benchmarking shell pipelines and the Unix “tools” philosophy
Sometimes I look through the HTTP referrer logs to see if anyone is
talking about my blog. I use the f 11 command to extract the
referrer field from the log files, count up the number of occurrences
of each referring URL, then discard the ones that are internal
referrers from elsewhere on my blog. It looks like this:
f 11 access.2020-01-0* | count | grep -v plover
(I've discussed f before. The f 11 just
prints the eleventh field of each line. It is
essentially shorthand for awk '{print $11}' or perl -lane 'print
$F[10]'. The count utility is even simpler; it counts the number
of occurrences of each distinct line in its input, and emits a report
sorted from least to most frequent, essentially a trivial wrapper
around sort | uniq -c | sort -n. Civilization advances by extending the number of
important operations which we can perform without thinking about
them.)
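The descriptions above are complete enough to reconstruct both
utilities. Here is a minimal sketch of each as a shell function; this
is my reconstruction from the descriptions, not the author's actual
code:

```shell
# f N [files...]: print the Nth whitespace-separated field of each line
# (reconstruction of the utility described above, not the original)
f() { n="$1"; shift; awk -v n="$n" '{ print $n }' "$@"; }

# count: tally distinct input lines, least frequent first
# (a trivial wrapper around sort | uniq -c | sort -n, as described)
count() { sort | uniq -c | sort -n; }
```

For example, printf 'a\nb\na\n' | count reports that b appeared once
and a appeared twice, with the most frequent line last.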
This has obvious defects, but it works well enough. But every time I
used it, I wondered: is it faster to do the grep before the count,
or after? I didn't ever notice a difference. But I still wanted to
know.
After years of idly wondering this, I have finally looked into it.
The point of this article is that the investigation produced the
following pipeline, which I think is a great example of the Unix
“tools” philosophy:
for i in $(seq 20); do
TIME="%U+%S" time \
sh -c 'f 11 access.2020-01-0* | grep -v plover | count > /dev/null' \
2>&1 | bc -l ;
done | addup
I typed this on the command line, with no backslashes or newlines, so
it actually looked like this:
for i in $(seq 20); do TIME="%U+%S" time sh -c 'f 11 access.2020-01-0* | grep -v plover |count > /dev/null' 2>&1 | bc -l ; done | addup
Okay, what's going on here? The pipeline I actually want to analyze,
with f | grep | count, is
there in the middle, and I've already explained it, so let's elide it:
for i in $(seq 20); do
TIME="%U+%S" time \
sh -c '¿SOMETHING? > /dev/null' 2>&1 | bc -l ;
done | addup
Continuing to work from inside to out, we're going to use time to
actually do the timings. The time command is standard. It runs a
program, asks the kernel how long the program took, then prints
a report.
The time command will only time a single process (plus its
subprocesses, a crucial fact that is inexplicably omitted from the man
page). The ¿SOMETHING? includes a pipeline, which must be set up by
a shell, so instead of timing the pipeline directly we have time run
sh -c '...', which starts a shell and tells it to run the pipeline
we're interested in. We tell the shell to throw away the output of
the pipeline, with > /dev/null, so that the output doesn't get mixed
up with time's own report.
The default format for the report printed by time is intended for human consumption. We can
supply an alternative format in the $TIME variable. The format I'm using here is %U+%S, which comes out as something
like 0.25+0.37, where 0.25 is the user CPU time and 0.37 is the
system CPU time. I didn't see a format specifier that would emit the
sum of these directly. So instead I had it emit them with a + in
between, and then piped the result through the bc command, which performs the requested arithmetic
and emits the result. We need the -l flag on bc
because otherwise it stupidly does integer arithmetic. The time command emits its report to
standard error, so I use 2>&1
to redirect the standard error into the pipe.
[ Addendum 20200108: We don't actually need -l here; I was mistaken. ]
Collapsing the details I just discussed, we have:
for i in $(seq 20); do
(run once and emit the total CPU time)
done | addup
seq is a utility I invented no later than 1993, and it has since
become standard in most Unix systems. (As with
netcat , I am not claiming to be the
first or only person to have invented this, only to have invented it
independently.) There are many variations of seq , but
the main use case is that seq 20 prints
1
2
3
…
19
20
Here we don't actually care about the output (we never actually use
$i ) but it's a convenient way to get the for loop to run twenty
times. The output of the for loop is the twenty total CPU
times that were emitted by the twenty invocations of bc. (Did you know that
you can pipe the output of a loop?) These twenty lines of output are
passed into addup, which I wrote no later than 2011. (Why did it
take me so long to do this?) It reads a list of numbers and prints the
sum.
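addup itself is a personal utility that isn't shown in the post; an
equivalent one-liner (my sketch of the behavior just described, not
the author's script) is a bare awk accumulator:

```shell
# A stand-in for addup: read one number per line, print the total
# (sketch of the behavior described above, not the author's version)
addup() { awk '{ s += $1 } END { print s }'; }
```

So printf '1.5\n2.25\n' | addup prints 3.75.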
All together, the command runs and prints a single number like 5.17,
indicating that the twenty runs of the pipeline took 5.17 CPU-seconds
total. I can do this a few times for the original pipeline, with
count before grep, get times between 4.77 and 5.78, and then try
again with the grep before the count, producing times between 4.32
and 5.14. The difference is large enough to detect but too small to
notice.
(To do this right we also need to test a null command, say
sh -c 'sleep 0.1 < /dev/null'
because we might learn that 95% of the reported time is spent
in running the shell, so the actual difference between the two pipelines is
twenty times as large as we thought. I did this; it turns out that the time
spent to run the shell is insignificant.)
What to learn from all this? On the one hand, Unix wins: it's
supposed to be quick and easy to assemble small tools to do whatever it is
you're trying to do. When time wouldn't do the arithmetic I needed
it to, I sent its output to a generic arithmetic-doing utility. When
I needed to count to twenty, I had a utility for doing that; if I
hadn't there are any number of easy workarounds. The
shell provided the I/O redirection and control flow I needed.
On the other hand, gosh, what a weird mishmash of stuff I had to
remember or look up. The -l flag for bc. The fact that I needed
bc at all because time won't report total CPU time. The $TIME
variable that controls its report format. The bizarro 2>&1 syntax
for redirecting standard error into a pipe. The sh -c trick to get
time to execute a pipeline. The missing documentation of the core
functionality of time.
Was it a win overall? What if Unix had less compositionality but I
could use it with less memorized trivia? Would that be an
improvement?
I don't know. I rather suspect that there's no way to
actually reach that hypothetical universe. The bizarre mishmash of
weirdness exists because so many different people invented so many
tools over such a long period. And they wouldn't have done any of
that inventing if the compositionality hadn't been there. I think we
don't actually get to make
a choice between an incoherent mess of composable paraphernalia and a
coherent, well-designed but noncompositional system. Rather, we get a
choice between an incoherent but useful mess and an incomplete, limited
noncompositional system.
(Notes to self: (1) In connection with Parse::RecDescent, you once wrote
about open versus closed systems. This is another point in that
discussion. (2) Open systems tend to evolve into messes. But closed
systems tend not to evolve at all, and die. (3) Closed systems are
centralized and hierarchical; open systems, when they succeed, are
decentralized and organic. (4) If you are looking for another
example of a successful incoherent mess of composable paraphernalia,
consider Git.)
[ Addendum: Add this to the list of “weird mishmash of trivia”: There are two
time commands. One, which I discussed above, is a separate
executable, usually in /usr/bin/time . The other is built into the
shell. They are incompatible. Which was I actually using? I would
have been pretty confused if I had accidentally gotten the built-in
one, which ignores $TIME and uses a $TIMEFORMAT that is
interpreted in a completely different way. I was fortunate, and got
the one I intended to get. But it took me quite a while to understand
why I had! The appearance of the TIME=… assignment at the start of
the shell command disabled the shell's special builtin treatment of the
keyword time, so it really did use /usr/bin/time. This computer stuff
is amazingly complicated. I don't know how anyone gets anything done. ]
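[ The keyword-versus-binary distinction can be checked directly in
bash; this is a sketch, and the path of the external time varies by
system:

```shell
# Bare `time` at the start of a pipeline is a bash reserved word:
bash -c 'type -t time'     # prints: keyword
# A reserved word is only recognized as the *first* word of the
# command, so a leading TIME=... assignment demotes `time` to an
# ordinary command name; the shell then looks it up in $PATH and
# finds the external binary, e.g. /usr/bin/time, if it is installed.
```

]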
[ Addenda 20200104: (1) Perl's module ecosystem is another example of a
successful incoherent mess of composable paraphernalia. (2) Of the
seven trivia I included in my “weird mishmash”, five were related to
the time command. Is this a reflection on time, or is it just
because time was central to this particular example? ]
[ Addendum 20200104: And, of course, this is exactly what Richard
Gabriel was thinking about in Worse is
Better. Like Gabriel, I'm not sure. ]