The Universe of Discourse

Tue, 21 Apr 2015

Another use for strace (isatty)

(This is a followup to an earlier article describing an interesting use of strace.)

A while back I was writing a talk about Unix internals and I wanted to discuss how the ls command does a different display when talking to a terminal than otherwise:

ls to a terminal

ls not to a terminal

How does ls know when it is talking to a terminal? I expect that is uses the standard POSIX function isatty. But how does isatty find out?

I had written down my guess. Had I been programming in C, without isatty, I would have written something like this:

    @statinfo = stat STDOUT;
    if (    $statinfo[2] & 0060000 == 0020000
        && ($statinfo[6] & 0xff) == 5) { say "Terminal" }
    else { say "Not a terminal" }

(This is Perl, written as if it were C.) It uses fstat (exposed in Perl as stat) to get the mode bits ($statinfo[2]) of the inode attached to STDOUT, and then it masks out the bits the determine if the inode is a character device file. If so, $statinfo[6] is the major and minor device numbers; if the major number (low byte) is equal to the magic number 5, the device is a terminal device. On my current computers the magic number is actually 136. Obviously this magic number is nonportable. You may hear people claim that those bit operations are also nonportable. I believe that claim is mistaken.

The analogous code using isatty is:

    use POSIX 'isatty';
    if (isatty(STDOUT)) { say "Terminal" } 
    else { say "Not a terminal" }

Is isatty doing what I wrote above? Or something else?

Let's use strace to find out. Here's our test script:

    % perl -MPOSIX=isatty -le 'print STDERR isatty(STDOUT) ? "terminal" : "nonterminal"'
    % perl -MPOSIX=isatty -le 'print STDERR isatty(STDOUT) ? "terminal" : "nonterminal"' > /dev/null

Now we use strace:

    % strace -o /tmp/isatty perl -MPOSIX=isatty -le 'print STDERR isatty(STDOUT) ? "terminal" : "nonterminal"' > /dev/null
    % less /tmp/isatty

We expect to see a long startup as Perl gets loaded and initialized, then whatever isatty is doing, the write of nonterminal, and then a short teardown, so we start searching at the end and quickly discover, a couple of screens up:

    ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7ffea6840a58) = -1 ENOTTY (Inappropriate ioctl for device)
    write(2, "nonterminal", 11)             = 11
    write(2, "\n", 1)                       = 1

My guess about fstat was totally wrong! The actual method is that isatty makes an ioctl call; this is a device-driver-specific command. The TCGETS parameter says what command is, in this case “get the terminal configuration”. If you do this on a non-device, or a non-terminal device, the call fails with the error ENOTTY. When the ioctl call fails, you know you don't have a terminal. If you do have a terminal, the TCGETS command has no effects, because it is a passive read of the terminal state. Here's the successful call:

    ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
    write(2, "terminal", 8)                 = 8
    write(2, "\n", 1)                       = 1

The B38400 opost… stuff is the terminal configuration; 38400 is the baud rate.

(In the past the explanatory text for ENOTTY was the mystifying “Not a typewriter”, even more mystifying because it tended to pop up when you didn't expect it. Apparently Linux has revised the message to the possibly less mystifying “Inappropriate ioctl for device”.

(SNDCTL_TMR_TIMEBASE is mentioned because apparently someone decided to give their SNDCTL_TMR_TIMEBASE operation, whatever that is, the same numeric code as TCGETS, and strace isn't sure which one is being requested. It's possible that if we figured out which device was expecting SNDCTL_TMR_TIMEBASE, and redirected standard output to that device, that isatty would erroneously claim that it was a terminal.)

[ Addendum 20150415: Paul Bolle has found that the SNDCTL_TMR_TIMEBASE pertains to the old and possibly deprecated OSS (Open Sound System) It is conceivable that isatty would yield the wrong answer when pointed at the OSS /dev/dsp or /dev/audio device or similar. If anyone is running OSS and willing to give it a try, please contact me at ]

[Other articles in category /Unix] permanent link

Sun, 19 Apr 2015

Another use for strace (groff)

The marvelous Julia Evans is always looking for ways to express her love of strace and now has written a zine about it. I don't use strace that often (not as often as I should, perhaps) but every once in a while a problem comes up for which it's not only just the right thing to use but the only thing to use. This was one of those times.

I sometimes use the ancient Unix drawing language pic. Pic has many good features, but is unfortunately coupled too closely to the Roff family of formatters (troff, nroff, and the GNU project version, groff). It only produces Roff output, and not anything more generally useful like SVG or even a bitmap. I need raw images to inline into my HTML pages. In the past I have produced these with a jury-rigged pipeline of groff, to produce PostScript, and then GNU Ghostscript (gs) to translate the PostScript to a PPM bitmap, some PPM utilities to crop and scale the result, and finally ppmtogif or whatever. This has some drawbacks. For example, gs requires that I set a paper size, and its largest paper size is A0. This means that large drawings go off the edge of the “paper” and gs discards the out-of-bounds portions. So yesterday I looked into eliminating gs. Specifically I wanted to see if I could get groff to produce the bitmap directly.

GNU groff has a -Tdevice option that specifies the "output" device; some choices are -Tps for postscript output and -Tpdf for PDF output. So I thought perhaps there would be a -Tppm or something like that. A search of the manual did not suggest anything so useful, but did mention -TX100, which had something to do with 100-DPI X window system graphics. But when I tried this groff only said:

    groff: can't find `DESC' file
    groff:fatal error: invalid device `X100`

The groff -h command said only -Tdev use device dev. So what devices are actually available?

strace to the rescue! I did:

    % strace -o /tmp/gr groff -Tfpuzhpx

and then a search for fpuzhpx in the output file tells me exactly where groff is searching for device definitions:

    % grep fpuzhpx /tmp/gr
    execve("/usr/bin/groff", ["groff", "-Tfpuzhpx"], [/* 80 vars */]) = 0
    open("/usr/share/groff/site-font/devfpuzhpx/DESC", O_RDONLY) = -1 ENOENT (No such file or directory)
    open("/usr/share/groff/1.22.2/font/devfpuzhpx/DESC", O_RDONLY) = -1 ENOENT (No such file or directory)
    open("/usr/lib/font/devfpuzhpx/DESC", O_RDONLY) = -1 ENOENT (No such file or directory)

I could then examine those three directories to see if they existed, and if so find out what was in them.

Without strace here, I would be reduced to groveling over the source, which in this case is likely to mean trawling through the autoconf output, and that is something that nobody wants to do.

Addendum 20150421: another article about strace. ]

[ Addendum 20150424: I did figure out how to prevent gs from cropping my output. You can use the flag -p-P48i,48i to groff to set the page size to 48 inches (48i) by 48 inches. The flag is passed to grops, and then resulting PostScript file contains

   %%DocumentMedia: Default 3456 3456 0 () ()

which instructs gs to pretend the paper size is that big. If it's not big enough, increase 48i to 120i or whatever. ]

[Other articles in category /Unix] permanent link

Fri, 17 Feb 2012

It came from... the HOLD SPACE
Since 2002, I've given a talk almost every December for the Philadelphia Linux Users' Group. It seems like most of their talks are about the newest and best developments in Linux applications, which is a topic I don't know much about. So I've usually gone the other way, talking about the oldest and worst stuff. I gave a couple of pretty good talks about how files work, for example, and what's in the inode structure.

I recently posted about my work on Zach Holman's spark program, which culminated in a ridiculous workaround for the shell's lack of fractional arithmetic. That work inspired me to do a talk about all the awful crap we had to deal with before we had Perl. (And the other 'P' languages that occupy a similar solution space.) Complete materials are here. I hope you check them out, because i think they are fun. This post is a bunch of miscellaneous notes about the talk.

One example of awful crap we had to deal with before Perl etc. were invented was that some people used to write 'sed scripts', although I am really not sure how they did it. I tried once, without much success, and then for this talk I tried again, and again did not have much success.

"The hold space" is a sed-ism. The basic model of sed is that it reads the next line of data into the 'pattern space', then applies a bunch of transformations to it, and then prints it out. If you need to save this line for later examination, or for emitting later on instead, you can hold it in the 'hold space'. Use of the hold space is what distinguishes sed experts from mere sed nobodies like me. So I planned to talk about the hold space, and then I got the happy idea to analogize the Hold Space to the Twilight Zone, or maybe the Phantom Zone, a place where you stick naughty data when you don't want it to escape. I never feel like audiences appreciate the work I put into this sort of thing; when I'm giving the talk it always sounds too much like a private joke. Explaining it just feels like everyone is sitting through my explanation of a private joke.

The little guy to the right is known as hallucigenia. It is a creature so peculiar that when the paleontologists first saw the fossils, they could not even agree on which side was uppermost. It has nothing to do with Unix, but I put it on the slide to illustrate "alien horrors from the dawn of time".

Between slides 9 and 10 (about the ed line editor) I did a quick demo of editing with ed. You will just have to imagine this. I first learned to program with a line editor like ed, on a teletypewriter just like the one on slide 8. Modern editors are much better. But it used to be that Unix sysadmins were expected to know at least a little ed, because if your system got into some horrible state where it couldn't mount the /usr partition, you wouldn't be able to run /usr/bin/vi or /usr/local/bin/emacs, but you would still be able to use /bin/ed to fix /etc/fstab or whatever else was broken. Knowing ed saved my bacon several times.

(Speaking of teletypewriters, ours had an attachment for punching paper tape, which you can see on the left side of the picture. The punched chads fell into a plastic chad box (which is missing in the picture), and when I was about three I spilled the chad box. Chad was everywhere, and it was nearly impossible to pick up. There were still chads stuck in the cracks in the floorboards when we moved out three years later. That's why, when the contested election of 2000 came around, I was one of the few people in North America who was not bemused to learn that there was a name for the little punched-out bits.)

Anyway, back to ed. ed has one and only one diagnostic: if you do something it didn't like, it prints ?. This explains the ancient joke on slide 10, which first appeared circa 1982 in the 4.2BSD fortune program.

I really wanted to present a tour de force of sed mastery, but as slides 24–26 say, I was not clever enough. I tried really hard and just could not do it. If anyone wants to fix my not-quite-good-enough sed script, I will be quite grateful.

On slide 28 I called awk a monster. This was a slip-up; awk is not a monster and that is why it does not otherwise appear in this talk. There is nothing really wrong with awk, other than being a little old, a little tired, and a little underpowered.

If you are interested in the details of the classify program, described on slide 29, the sources are still available from the comp.sources.unix archive. People often say "Why don't you just use diff for that?" so I may as well answer that here: You use diff if you have two files and you want to see how they differ. You use classify if you have 59 files, of which 36 are identical, 17 more are also identical to each other but different from the first 36, and the remaining 6 are all weirdos, and you want to know which is which. These days you would probably just use md5sum FILES | accumulate, and in hindsight that's probably how I should have implemented classify. We didn't have md5sum but we had something like it, or I could have made a checksum program. The accumulate utility is trivial.

Several people have asked me to clarify my claim to have invented netcat. It seems that a similar program with the same name is attributed to someone called "Hobbit". Here is the clarification: In 1991 I wrote a program with the functionality I described and called it "netcat". You would run netcat hostname port and it would open a network socket to the indicated address, and transfer data from standard input into the socket, and data from the socket to standard output. I still have the source code; the copyright notice at the top says "21 October 1991". Wikipedia says that the same-named program by the other guy was released on 20 March 1996. I do not claim that the other guy stole it from me, got the idea from me, or ever heard of my version. I do not claim to be the first or only person to have invented this program. I only claim to have invented mine independently.

My own current version of the spark program is on GitHub, but I think Zach Holman's current version is probably simpler and better now.

[Other articles in category /Unix] permanent link

Wed, 11 Jan 2012

Where should usage messages go?
Last week John Speno complained about Unix commands which, when used incorrectly, print usage messages to standard error instead of to standard output. The problem here is that if the usage message is long, it might scroll off the screen, and it's a pain when you try to pipe it through a pager with command | pager and discover that the usage output has gone to stderr, missed the pager, and scrolled off the screen anyway.

Countervailing against this, though, is the usual argument for stderr: if you had run the command in a pipeline, and it wrote its error output to stdout instead of to stderr, then the error message would have gotten lost, and would possibly have caused havoc further down the pipeline. I considered this argument to be the controlling one, but I ran a quick and informal survey to see if I was in the minority.

After 15 people had answered the survey, Ron Echeverri pointed out that although it makes sense for the usage message to go to stderr when the command is used erroneously, it also makes sense for it to go to stdout if the message is specifically requested, say by the addition of a --help flag, since in that case the message is not erroneous. So I added a second question to the survey to ask about where the message should go in such a case.

83 people answered the first question, "When a command is misused, should it deliver its usage message to standard output or to standard error?". 62 (75%) agreed that the message should go to stderr; 11 (13%) said it should go to stdout. 10 indicated that they preferred a more complicated policy, of which 4 were essentially (or exactly) what M. Echeverri suggested; this brings the total in favor of stderr to 66 (80%). The others were:

  1. stdout, if it is a tty; stderr otherwise
  2. stdout, if it is a pipe; stderr otherwise
  3. A very long response that suggested syslog.
  4. stderr, unless an empty stdout would cause problems
  5. It depends, but the survey omitted the option of printing directly on the console
  6. It depends
I think #2 must have been trying to articulate #1, but (a) got it backwards and (b) missed. #3 seemed to be answering a different question than the one that was asked; syslog may make sense for general diagnostics, but to use it for usage messages seems peculiar. #5 also seems strange to me, since my idea of "console" is the line printer hardwired to the back of the mainframe down in the machine room; I think the writer might have meant "terminal".

68 people answered the second question, "Where should the command send the output when the user specifically requests usage information?". (15 people took the survey before I added this question.) 50 (74%) said the output should go to stdout, 12 (18%) to the user's default pager and then to stdout, and 5 (7%) to stderr. One person (The same as #5 above) said "it depends".

Thanks to everyone who participated.

[Other articles in category /Unix] permanent link

Tue, 25 Mar 2008

The "z" command: output filtering
My last few articles ([1] [2] [p] [p-2]) have been about this z program. The first part of this article is a summary of that discussion, which you can skip if you remember it.

The idea of z is that you can do:

        z grep pattern files...
and it does approximately the same as:
        zgrep pattern files...
or you could do:
        z sed script files...
and it would do the same as:
        zsed script files...
if there were a zsed command, although there isn't.

Much of the discussion has concerned a problem with the implementation, which is that the names of the original compressed files are not available to the command, due to the legerdemain z must perform in order to make the uncompressed data available to the command. The problem is especially apparent with wc:

        % z wc *  
            411    2611   16988
             71     358    2351 /proc/self/fd/3
            121     725    5053 /proc/self/fd/4
             51     380    2381
             48     145     885
            288    2159   12829 /proc/self/fd/5
             95     665    4337
            221     941    6733
            106     555    3976
            115     793    4904
            124     624    4208 /proc/self/fd/6
           1651    9956   64645 total

Here /proc/self/fd/3 and the rest should have been names ending in .gz, such as

Another possible solution

At the time I wrote the first article, it occurred to me briefly that it would be possible to have z capture the output of the command and attempt to translate /proc/self/fd/3 back to or whatever is appropriate, because although the subcommand does not know the original filenames, z itself does. The code would look something like this. Instead of ending by execing the command, as the original version of z did:
  exec $command, @ARGV;
  die "Couldn't run '$command': $!.\n";
this revised version of z, which we might call zz, would end with the code to translate back to the original filenames:

  open my($out), "-|", $command, @ARGV
    or die "Couldn't run '$command': $!.\n";
  while (<$out>) {
Here @old is an array that translates from file descriptors back to the original filename.

At the time, I thought of doing this, and my immediate thought was "well, that is so obviously a terrible idea that it is not worth even mentioning", so I left it out. But since then at least five people have written to me to suggest it, so it appears that it is not obviously a terrible idea. I had to think a little deeper about why I thought it was a terrible idea.

Really the question is why I think this is a more terrible idea than the original z program was in the first place. Because one could say that z is garbling the output of its command, and the filtering code in zz is only un-garbling it. But I think this isn't the right way to look at it.

The output of the command has a certain format, a certain structure. We don't know ahead of time what that structure is, but it can be described for any particular command. For instance, the output of wc is always a sequence of lines where each line has four whitespace-separated fields, of which the first three are numerals and the last is a filename, and then a final total line at the end.

Similarly, the output of tar is a file in a complicated binary format, one which is documented somewhere and which is intelligible to other instances of the tar command that are trying to decode it.

The original behavior of z may alter the content of the command output to some extent, replacing some filenames with others. But it cannot disrupt the structure or the format of the file, ever. This is because the output of z tar is the output of tar, unmodified. The z program tampers with the arguments it gives to tar, but having done that it runs tar and lets tar do what it wants, and tar then must produce a tar-format output, possibly not the one it would have normally produced—the content might be a little different—but a properly-formatted one for sure. In particular, any program written to deal properly with the output of tar will still work with the output of z tar. The output might not have the same meaning, but we can say very particularly what the extent of the differences might be: if the output mentions filenames, then some of these might have changed from the true filenames to filenames of the form /proc/self/fd/37.

With zz, we cannot make any such guarantee. The output of zz tar zc foo.gz, for example, might be in proper .tar.gz format. But suppose the output of tar zc foo.gz creates compressed binary output that just happens to contain the byte sequence 2f70 726f 632f 7365 6c66 2f66 642f 33? (That is, "/proc/self/fd/3".) Then zz will silently replace these 15 bytes with the six bytes 666f 6f2e 677a.

What if the original sequence was understood as part of a sequence of 2-byte integers? The result is not even properly aligned. What if that initial 2f was a count? The resulting count (66) is much too long. The result would be utterly garbled and unintelligible to tar zx. What the tar command will do with a garbled input is not well-defined: it might dump core, or it might write out random garbage data, or overwrite essential files in the filesystem. We are into nasal demon territory. With the original z, we never get anywhere near the nasal demons.

I suppose the short summary here is that z treats its command as a black box, while zz pretends to understand what comes out of it. But zz's understanding is a false pretense. My experience says that programs should not screw around with things they don't understand, and this is why I instantly rejected the idea when I thought of it before.

One correspondent argued that the garbling is very unlikely, and proposed various techniques to make it even less likely, mostly by rewriting the input filenames to various long random strings. But I felt then that this was missing the point, and I still do. He says it is unlikely, but he doesn't know that it is unlikely, and indeed the unlikeliness depends on the format of the output of the command, which is precisely the unknown here. In my view, the difference between z and zz is that the changes that z makes are bounded, because you can describe them briefly, as I did above, and the changes that zz makes are unbounded, because there is no limit to what could happen as a result.

On the other hand, this correspondent made a good point that if the output of zz is not consumed by anything other than human eyeballs, there may be no real problem. And for some particular commands, such as wc, there is never any problem at all. So perhaps it's a good idea to add a command-line option to z to enable the zz behavior. I did this in my version, and I'm going to try it out and see how it goes.

Complete modified source code is available. (Diffs from previous version.)

[Other articles in category /Unix] permanent link

Sat, 22 Mar 2008

The "z" command: alternative implementations
In yesterday's article I discussed a possibly-useful utility program named z, which has a flaw. To jog your memory, here is a demonstration:

        % z grep immediately *  want to update.  It is immediately copied into a register, and
        /proc/self/fd/3:All five people who wrote to me about this immediately said "oh, yes,
        /proc/self/fd/5:program continues immediately, possibly posting its message.  (It a symbolic link, its inode is returned immediately; iname() would reports success back to the process immediately, even though the
For a detailed discussion, see the previous article.

Fixing this flaw seems difficult-to-impossible. As I said earlier, the trick is to fool the command into reading from a pipe when it thinks it is opening a file, and this is precisely what /proc/self/fd is for. But there is an older, even more widely-implemented Unix feature that does the same thing, namely the FIFO. So an alternative implementation creates one FIFO for each compressed file, with a gzip process writing to the FIFO, and tells the command to read from the FIFO. Since we have some limited control over the name of the FIFO, we can ameliorate the missing-filename problem to some extent. Say, for example, we create the FIFOs in /tmp/PID. Then the broken zgrep example above might look like this instead:

        % z grep immediately *  want to update.  It is immediately copied into a register, and
        /tmp/7516/ five people who wrote to me about this immediately said "oh, yes,
        /tmp/7516/ continues immediately, possibly posting its message.  (It a symbolic link, its inode is returned immediately; iname() would reports success back to the process immediately, even though the
The output is an improvement, but it is not completely solved, and the cost is that the process and file management are much more complicated. In fact, the cost is so high that you have to wonder if it might not be simpler to replace z with a shell script that copies the data to a temporary directory, uncompresses the files, and runs the command on the uncompressed files, perhaps something along these lines:

        mkdir $DIR

        cp -p "$@" $DIR

        cd $DIR
        gzip -d *
        $COMMAND *
This has problems too, but my point is that if you are willing to accept a crappy, semi-working solution along the lines of the FIFO one, simpler ones are at hand. You can compare the FIFO version directly with the shell script, and I think the FIFO version loses. The z implementation I have is a solution in a different direction, and different tradeoffs, and so might be preferable to it in a number of ways.

But as I said, I don't know yet.

[ Addendum 20080325: Several people suggested a fix that I had considered so unwise that I didn't even mention it. But after receiving the suggestion repeatedly, I wrote an article about it. ]

[Other articles in category /Unix] permanent link

Fri, 21 Mar 2008

The gzip distribution includes a command called zcat. Its command-line arguments can include any number of filenames, compressed or not, and it prints out the contents, uncompressing them on the fly if necessary. Sometime later a zgrep command appeared, which was similar but which also performed a grep search.

But for anything else, you either need to uncompress the files, or build a special tool. I have a utility that scans the web logs of, and extracts a report about new referrers. The historical web logs are normally kept compressed, so I recently built in support for decompression. This is quite easy in Perl. Normally one scans a sequence of input files something like this:

        while (<>) {
          ... do something with $_ ...
The <> operator implicitly scans all the lines in all the files named in the command-line arguments, opening a new file each time the previous one is exhausted.

To decompress the files on the fly, one can preprocess the command-line arguments:

        for (@ARGV) {
          if (/\.gz$/) {
            $_ = "gzip -dc $_ |";

        while (<>) {
          ... do something with $_ ...
The for loop scans the command-line arguments, replacing each one that has the form foo.gz with gzip -dc foo.gz |. Perl's magic open semantics treat filenames specially if they end with a pipe symbol: a pipe to a command is opened instead. Of course, anyone can think of half a dozen ways in which this can go wrong. But Larry Wall's skill in making such tradeoffs has been a large factor in Perl's success.

But it bothered me to have to make this kind of change in every program that wanted to handle compressed files. We have zcat and zgrep; where are zcut, zpr, zrev, zwc, zcol, zbc, zsed, zawk, and so on? Echh.

But after I got to thinking about it, I decided that I could write a single z utility that would do a lot of the same things. Instead of this:

        zsed -e 's/:.*//' * | ...
where the * matches some files that have .gz suffixes and some that haven't, one would write:

        z sed -e 's/:.*//' * | ...
and it would Just Work. That's the idea, anyway.

If sed were written in Perl, z would have an easy job. It could rely on Perl's magic open, and simply preprocess the arguments before running sed:

        # hypothetical implementation of z
        my $command = shift;
        for (@ARGV) {
          if (/\.gz$/) {
            $_ = "gzip -dc $_ |";
        exec $command, @ARGV;
        die "Couldn't run command '$command': $!\n";
But sed is not written in Perl, and has no magic open. So I have to play a trickier trick:

        for my $file (@ARGV) {
          if ($file =~ /\.gz$/) {
            unless (open($fhs[@fhs], "-|", "gzip", "-cd", $file)) {
              warn "Couldn't open file '$file': $!; skipping\n";
            my $fd = fileno $fhs[-1];
            $_ = "/proc/self/fd/$fd";

        # warn "running $command @ARGV\n";
        exec $command, @ARGV;
        die "Couldn't run command '$command': $!\n";
This is a stripped-down version to illustrate the idea. For various reasons that I explained yesterday, it does not actually work. The complete, working source code is here.

The idea, as before, is that the program preprocesses the command-line arguments. But instead of replacing the arguments with pipe commands, which are not supported by open(2), the program sets up the pipes itself, and then directs the command to take its input from the pipes by specifying the appropriate items from /proc/self/fd.

The trick depends crucially on having /proc/self/fd, or /dev/fd, or something of the sort, because otherwise there's no way to trick the command into reading from a pipe when it thinks it is opening a file. (Actually there is at least one other way, involving FIFOs, which I plan to discuss tomorrow.) Most modern systems do have /proc/self/fd. That feature postdates my earliest involvement with Unix, so it isn't a ready part of my mental apparatus as perhaps it ought to be. But this utility seems to me like a sort of canonical application of /proc/self/fd, in the sense that, if you couldn't think what /proc/self/fd might be good for, then you could read this example and afterwards have a pretty clear idea.

The z utility has a number of flaws. Principally, the original filenames are gone. Here's a typical run with regular zgrep:

        % zgrep immediately *  want to update.  It is immediately copied into a register, and five people who wrote to me about this immediately said "oh, yes, continues immediately, possibly posting its message.  (It a symbolic link, its inode is returned immediately; iname() would reports success back to the process immediately, even though the
But here's the same thing with z:

        % z grep immediately *  want to update.  It is immediately copied into a register, and
        /proc/self/fd/3:All five people who wrote to me about this immediately said "oh, yes,
        /proc/self/fd/5:program continues immediately, possibly posting its message.  (It a symbolic link, its inode is returned immediately; iname() would reports success back to the process immediately, even though the
The problem is even more glaring in the case of commands like wc:

        % z wc *  
            411    2611   16988
             71     358    2351 /proc/self/fd/3
            121     725    5053 /proc/self/fd/4
             51     380    2381
             48     145     885
            288    2159   12829 /proc/self/fd/5
             95     665    4337
            221     941    6733
            106     555    3976
            115     793    4904
            124     624    4208 /proc/self/fd/6
           1651    9956   64645 total

So perhaps z will not turn out to be useful enough to be more than a curiosity. But I'm not sure yet.

This is article #300 on my blog. Thanks for reading.

[ Addendum 20080322: There is a followup to this article. ]

[ Addendum 20080325: Another followup. ]

[Other articles in category /Unix] permanent link

Thu, 06 Mar 2008

Throttling qmail
This may well turn out to be another oops. Sometimes when I screw around with the mail system, it's a big win, and sometimes it's a big lose. I don't know yet how this will turn out.

Since I moved house, I have all sorts of internet-related problems that I didn't have before. I used to do business with a small ISP, and I ran my own web server, my own mail service, and so on. When something was wrong, or I needed them to do something, I called or emailed and they did it. Everything was fine.

Since moving, my ISP is Verizon. I have great respect for Verizon as a provider of telephone services. They have been doing it for over a hundred years, and they are good at it. Maybe in a hundred years they will be good at providing computer network services too. Maybe it will take less than a hundred years. But I'm not as young as I once was, and whenever that glorious day comes, I don't suppose I'll be around to see it.

One of the unexpected problems that arose when I switched ISPs was that Verizon helpfully blocks incoming access to port 80. I had moved my blog to outside hosting anyway, because the blog was consuming too much bandwidth, so I moved the other web services to the same place. There are still some things that don't work, but I'm dealing with them as I have time.

Another problem was that a lot of sites now rejected my SMTP connections. My address was in a different netblock. A Verizon DSL netblock. Remote SMTP servers assume that anybody who is dumb enough to sign up with Verizon is also too dumb to run their own MTA. So any mail coming from a DSL connection in Verizonland must be spam, probably generated by some Trojan software on some infected Windows box.

The solution here (short of getting rid of Verizon) is to relay the mail through Verizon's SMTP relay service. sends to, and lets forward the mail to its final destination. Fine.

But but but.

If my machine sends more than X messages per Y time, will assume that has been taken over by a Trojan spam generator, and cut off access. All outgoing mail will be rejected with a permanent failure.

So what happens if someone sends a message to one of the 500-subscriber email lists that I host here? generates 500 outgoing messages, sends the first hundred or so through Verizon. Then Verizon cuts off my mail service. The mailing list detects 400 bounce messages, and unsubscribes 400 subscribers. If any mail comes in for another mailing list before Verizon lifts my ban, every outgoing message will bounce and every subscriber will be unsubscribed.

One solution is to get a better mail provider. Lorrie has an Earthlink account that comes with outbound mail relay service. But they do the same thing for the same reason. My Dreamhost subscription comes with an outbound mail relay service. But they do the same thing for the same reason. My account comes with an unlimited outbound mail relay service. But they require SASL authentication. If there's a SASL patch for qmail, I haven't been able to find it. I could implement it myself, I suppose, but I don't wanna.

So far there are at least five solutions that are on the "eh, maybe, if I have to" list:

  • Get a non-suck ISP
  • Find a better mail relay service
  • Hack SASL into qmail and send mail through
  • Do some skanky thing with serialmail
  • Get rid of qmail in favor of postfix, which presumably supports SASL
(Yeah, I know the Postfix weenies in the audience are shaking their heads sadly and wondering when the scales will fall from my eyes. They show up at my door every Sunday morning in their starched white shirts and their pictures of DJB with horns and a pointy tail...)

It also occurred to me in the shower this morning that the old ISP might be willing to sell me mail relaying and nothing else, for a small fee. That might be worth pursuing. It's gotta be easier than turning qmail-remote into a SASL mail client.

The serialmail thing is worth a couple of sentences, because there's an autoresponder on the qmail-users mailing-list that replies with "Use serialmail. This is discussed in the archives." whenever someone says the word "throttle". The serialmail suite, also written by Daniel J. Bernstein, takes a maildir-format directory and posts every message in it to some remote server, one message at a time. Say you want to run qmail on your laptop. Then you arrange to have qmail deliver all its mail into a maildir, and then when your laptop is connected to the network, you run serialmail, and it delivers the mail from the maildir to your mail relay host. serialmail is good for some throttling problems. You can run serialmail under control of a daemon that will cut off its network connection after it has written a certain amount of data, for example. But there seems to be no easy way to do what I want with serialmail, because it always wants to deliver all the messages from the maildir, and I want it to deliver one message.

There have been some people on the qmail-users mailing-list asking for something close to what I want, and sometimes the answer was "qmail was designed to deliver mail as quickly and efficiently as possible, so it won't do what you want." This is a variation of "Our software doesn't do what you want, so I'll tell you that you shouldn't want to do it." That's another rant for another day. Anyway, I shouldn't badmouth qmail-users mailing-list, because the archives did get me what I wanted. It's only a stopgap solution, and it might turn out to be a big mistake, but so far it seems okay, and so at last I am coming to the point of this article.

I hacked qmail to support outbound message rate throttling. Following a suggestion of Richard Lyons from the qmail-users mailing-list, it was much easier to do than I had initially thought.

Here's how it works. Whenever qmail wants to try to deliver a message to a remote address, it runs a program called qmail-remote. qmail-remote is responsible for looking up the MX records for the host, contacting the right server, conducting the SMTP conversation, and returning a status code back to the main component. Rather than hacking directly on qmail-remote, I've replaced it with a wrapper. The real qmail-remote is now in qmail-remote-real. The qmail-remote program is now written in Perl. It maintains a log file recording the times at which the last few messages were sent. When it runs, it reads the log file, and a policy file that says how quickly it is allowed to send messages. If it is okay to send another message, the Perl program appends the current time to the log file and invokes the real qmail-remote. Otherwise, it sleeps for a while and checks again.

The program is not strictly correct. It has some race conditions. Suppose the policy limits qmail to sending 8 messages per minute. Suppose 7 messages have been sent in the last minute. Then six instances of qmail-remote might all run at once, decide that it is OK to send a message, and send one. Then 13 messages have been sent in the last minute, which exceeds the policy limit. So far this has not been much of a problem. It's happened twice in the last few hours that the system sent 9 messages in a minute instead of 8. If it worries me too much, I can tell qmail to run only one qmail-remote at a time, instead of 10. On a normal qmail system, qmail speeds up outbound delivery by running multiple qmail-remote processes concurrently. On my crippled system, speeding up outbound delivery is just what I'm trying to avoid. Running at most one qmail-remote at a time will cure all race conditions. If I were doing the project over, I think I'd take out all the file locking and such, and just run one qmail-remote. But I didn't think of it in time, and for now I think I'll live with the race conditions and see what happens.

So let's see? What else is interesting about this program? I made at least one error, and almost made at least one more.

The almost-error was this: The original design for the program was something like:

  1. do
    • lock the history file, read it, and unlock it
    until it's time to send a message
  2. lock the history file, update it, and unlock it
  3. send the message
This is a classic mistake in writing programs that run concurrently and update a file. The problem is that process A update the file after process B reads but before B updates it. Then B's update will destroy A's.

One way to fix this is to have the processes append to the history file, but never remove anything from it. That is clearly not a sustainable strategy. Someone must remove expired entries from the history file.

Another fix is to have the read and the update in the same critical section:

  1. lock the history file
  2. do
    • read the history file
    until it's time to send a message
  3. update the history file and unlock it
  4. send the message
But that loop could take a long time, during which no other qmail-remote process can make progress. I had decided that I wanted to try to retain the concurrency, and so I wasn't willing to accept this.

Cleaning the history file could be done by a separate process that periodically locks the file and rewrites it. But instead, I have the qmail-remote processes to it on the fly:

  1. do
    • lock the history file, read it, and unlock it
    until it's time to send a message
  2. lock the history file, read it, update it, and unlock it
  3. send the message
I'm happy that I didn't actually make this mistake. I only thought about it.

Here's a mistake that I did make. This is the block of code that sleeps until it's time to send the message:

          while (@last >= $msgs) {
            my $oldest = $last[0];
            my $age = time() - $oldest;
            my $zzz = $time - $age + int(rand(3));
            $zzz = 1 if $zzz < 1;
       #    Log("Sleeping for $zzz secs");
            sleep $zzz;
            shift @last while $last[0] < time() - $time;
The throttling policy is expressed by two numbers, $msgs and $time, and the program tries to send no more than $msgs messages per $time seconds. The @last array contains a list of Unix epoch timestamps of the times at which the messages of the last $time seconds were sent. So the loop condition checks to see if fewer than $msgs messages were sent in the last $time seconds. If not, the program continues immediately, possibly posting its message. (It rereads the history file first, in case some other messages have been posted while it was asleep.)

Otherwise the program will sleep for a while. The first three lines in the loop calculate how long to sleep for. It sleeps until the time the oldest message in the history will fall off the queue, possibly plus a second or two. Then the crucial line:

            shift @last while $last[0] < time() - $time;
which discards the expired items from the history. Finally, the call to load_policy() checks to see if the policy has changed, and the loop repeats if necessary.

The bug is in this crucial line. if @last becomes empty, this line turns into an infinite busy-loop. It should have been:

            shift @last while @last && $last[0] < time() - $time;
Whoops. I noticed this this morning when my system's load was around 12, and eight or nine qmail-remote processes were collectively eating 100% of the CPU. I would have noticed sooner, but outbound deliveries hadn't come to a complete halt yet.

Incidentally, there's another potential problem here arising from the concurrency. A process will complete the sleep loop in at most $time+3 seconds. But then it will go back and reread the history file, and it may have to repeat the loop. This could go on indefinitely if the system is busy. I can't think of a good way to fix this without getting rid of the concurrent qmail-remote processes.

Here's the code. I hereby place it in the public domain. It was written between 1 AM and 3 AM last night, so don't expect too much.

[Other articles in category /Unix] permanent link

Sat, 08 Dec 2007

Corrections about sync(2)
I made some errors in today's post about sync and fsync.

Most important, I said that "the sync() system call marks all the kernel buffers as dirty". This is totally wrong, and doesn't even make sense. Dirty buffers are those with data that needs to be written out. Marking a non-dirty buffer as dirty is a waste of time, since nothing has changed in the buffer, but it will now be rewritten anyway. What sync() does is schedule all the dirty buffers to be written as soon as possible.

On some recent systems, sync() actually waits for all the dirty buffers to be written, and a bunch of people tried to correct me about this. But my original article was right: historically, it was not so, and even today it's not universally true. In former times, sync() would schedule the buffers for writing, and then return before the data was actually written.

I said that one of the duties of init was to call sync() every thirty seconds, but this was mistaken. That duty actually fell to a separate program, known as update. While discussing this with one of the readers who wrote to correct me, I looked up the source for Version 7 Unix, to make sure I was right, and it's so short I thought I might as well show it here:

         * Update the file system every 30 seconds.
         * For cache benefit, open certain system directories.

        #include <signal.h>

        char *fillst[] = {

                char **f;

                for(f = fillst; *f; f++)
                        open(*f, 0);

                signal(SIGALRM, dosync);
The program is so simple I don't have much more to say about it. It initially invokes dosync(), which calls sync() and then schedules another call to dosync() in 30 seconds. Note that the 0 in the second argument to open had not yet been changed to O_RDONLY. The pause() call is equivalent to sleep(0): it causes the process to relinquish its time slice whenever it is active.

In various systems more recent than V7, the program was known by various names, but it was update for a very long time.

Several people wrote to correct me about the:

        # sync
        # sync
        # sync
        # halt
thing, some saying that I had the reason wrong, or that it did not make sense, or that only two syncs were used, rather than three. But I had it right. People did use three, and they did it for the reason I said, whether that makes sense or not. (Some of the people who miscorrected me were unaware that sync() would finish and exit before the data was actually written.) But for example, see this old Usenet thread for a discussion of the topic that confirms what I said.

Nobody disputed my contention that Linus was suffering from the promptings of the Evil One when he tried to change the semantics of fsync(), and nobody seems to know the proper name of the false god of false efficiency. I'll give this some thought and see what I can come up with.

Thanks to Tony Finch, Dmitry Kim, and Stefan O'Rear for discussion of these points.

[Other articles in category /Unix] permanent link

Dirty, dirty buffers!
One side issue that arose during my talk on Monday about inodes was the write-buffering normally done by Unix kernels. I wrote a pretty long note to the PLUG mailing list about it, and I thought I'd repost it here.

When your process asks the kernel to write data:

        int bytes_written = write(file_descriptor,
the kernel normally copies the data from your buffer into a kernel buffer, and then, instead of writing out the data to disk, it marks its buffer as "dirty" (that is, as needing to be written eventually), and reports success back to the process immediately, even though the dirty buffer has not yet been written, and the data is not yet on the disk.

Normally, the kernel writes out the dirty buffer in due time, and the data makes it to the disk, and you are happy because your process got to go ahead and do some more work without having to wait for the disk, which could take milliseconds. ("A long time", as I so quaintly called it in the talk.) If some other process reads the data before it is written, that is okay, because the kernel can give it the updated data out of the buffer.

But if there is a catastrophe, say a power failure, then you see the bad side of this asynchronous writing technique, because the data, which your process thought had been written, and which the kernel reported as having been written, has actually been lost.

There are a number of mechanisms in place to deal with this. The oldest is the sync() system call, which marks all the kernel buffers as dirty. All Unix systems run a program called init, and one of init's principal duties is to call sync() every thirty seconds or so, to make sure that the kernel buffers get flushed to disk at least every thirty seconds, and so that no crash will lose more than about thirty seconds' worth of data.

(There is also a command-line program sync which just does a sync() call and then exits, and old-time Unix sysadmins are in the habit of halting the system with:

        # sync
        # sync
        # sync
        # halt
because the second and third syncs give the kernel time to actually write out the buffers that were marked dirty by the first sync. Although I suspect that few of them know why they do this. I swear I am not making this up.)

But for really crucial data, sync() is not enough, because, although it marks the kernel buffers as dirty, it still does not actually write the data to the disk.

So there is also an fsync() call; I forget when this was introduced. The process gives fsync() a file descriptor, and the call demands that the kernel actually write the associated dirty buffers to disk, and does not return until they have been. And since, unlike write(), it actually waits for the data to go to the disk, a successful return from fsync() indicates that the data is truly safe.

The mail delivery agent will use this when it is writing your email to your mailbox, to make sure that no mail is lost.

Some systems have an O_SYNC flag than the process can supply when it opens the file for writing:

        int fd = open("blookus", O_WRONLY | O_SYNC);
This sets the O_SYNC flag in the kernel file pointer structure, which means that whenever data is written to this file pointer, the kernel, contrary to its usual practice, will implicitly fsync() the descriptor.

Well, that's not what I wanted to write about here. What I meant to discuss was...

No, wait. That is what I wanted to write about. How about that?

Anyway, there's an interesting question that arises in connection with fsync(): suppose you fsync() a file. That guarantees that the data will be written. But does it also guarantee that the mtime and the file extent of the file will be updated? That is, does it guarantee that the file's inode will be written?

On most systems, yes. But on some versions of Linux's ext2 filesystem, no. Linus himself broke this as a sacrifice to the false god of efficiency, a very bad decision in my opinion, for reasons that should be obvious to everyone but those in the thrall of Mammon. (Mammon's not right here. What is the proper name of the false god of efficiency?)

Sanity eventually prevailed. Recent versions of Linux have an fsync() call, which updates both the data and the inode, and a fdatasync() call, which only guarantees to update the data.

[ Addendum 20071208: Some of this is wrong. I posted corrections. ]

[Other articles in category /Unix] permanent link

Thu, 06 Dec 2007

What's a File?
Almost every December since 2001 I have given a talk to the local Linux users' group on some aspect of Unix internals. My first talk was on the internals of the ext2 filesystem. This year I was under a lot of deadline pressure at work, so I decided I would give the 2001 talk again, maybe with a few revisions.

Actually I was under so much deadline pressure that I did not have time to revise the talk. I arrived at the user group meeting without a certain idea of what talk I was going to give.

Fortunately, the meeting structure is to have a Q&A and discussion period before the invited speaker gives his talk. The Q&A period always lasts about an hour. In that hour before I had to speak, I wrote a new talk called What's a File?. It mostly concerns the Unix "inode" structure, and what the kernel uses it for. It uses the output of the well-known ls -l command as a jumping-off point, since most of the ls -l information comes from the inode.

Then I talk about how files are opened and permissions are checked, how the filesystem is organized, how the kernel reads and writes data, how directories are structured, how it's possible to have one file with two names, how symbolic links work, and what that mysterious field is in the ls -l output between the permissions and the owner.

The talk was quite successful, much more so than I would have expected, given how quickly I wrote it and my complete inability to edit or revise it. Of course, it does help that I know this material backwards and forwards and standing on my head, and also that I could reuse all the diagrams and illustrations from the 2001 version of the talk.

I would not, however, recommend this technique.

As my talks have gotten better over the years, I find that less and less of the talk material is captured in the slides, and so the slides become less and less representative of the talk itself. But I put them online anyway, and here they are.

Here's a .tgz file in case you want to download it all at once.

[Other articles in category /Unix] permanent link

Sat, 27 Jan 2007

Software archaeology
For appropriate values of "everyone", everyone knows that Unix files do not record any sort of "creation time". A fairly frequently asked question in Unix programming forums, and other related forums, such as Perl programming forums, is how to get the creation date of a file; the answer is that you cannot do that because it is not there.

This lack is exacerbated by several unfortunate facts: creation times are available on Windows systems; the Unix inode contains three timestamps, one of which is called the "ctime", and the "c" is suggestive of the wrong thing; Perl's built-in stat function overloads the return value to return the Windows creation time in the same position (on Windows) as it returns the ctime (on Unix).

So we see questions like this one, which appeared this week on the Philadelphia Linux Users' Group mailing list:

How does one check and change ctime?
And when questioned as to why he or she wanted to do this, this person replied:

We are looking to change the creation time. From what I understand, ctime is the closest thing to creation time.
There is something about this reply that irritates me, but I'm not quite sure what it is. Several responses come to mind: "Close" is not sufficient in system programming; the ctime is not "close" to a creation time, in any sense; before you go trying to change the thing, you ought to do a minimal amount of research to find out what it is. It is a perfect example of the Wrong Question, on the same order as that poor slob all those years ago who wanted to know how to tell if a file was a hard link or a soft link.

But anyway, that got me thinking about ctimes in general, and I did some research into the history and semantics of the thing, and made some rather surprising discoveries.

One good reference for the broad outlines of early Unix is the paper that Dennis Ritchie and Ken Thompson published in Communications of the ACM in 1974. This was updated in 1978, but the part I'm quoting wasn't revised and is current to 1974. Here is what it has to say about the relevant parts of the inode structure:


... The entry found thereby (the file's i-node) contains the description of the file:
... time of creation, last use, and last modification
An error? I don't think so. Here is corroborating evidence, the stat man page from the first edition of Unix, from 1971:

NAME        stat -- get file status
SYNOPSIS    sys      stat; name; buf / stat = 18.
DESCRIPTION name points to a null-terminated string naming a file; buf is the
            address of a 34(10) byte buffer into which information is placed
            concerning the file. It is unnecessary to have any permissions at all
            with respect to the file, but all directories leading to the file
            must be readable.
            After stat, buf has the following format:
            buf, +1             i-number
            +2, +3              flags (see below)
            +4                  number of links
            +5                     user ID of owner size in bytes
            +6,+7            size in bytes
            +8,+9            first indirect block or contents block
            +22,+23             eighth indirect block or contents block
            +24,+25,+26,+27 creation time
            +28,+29, +30,+31 modification time
                +32,+33         unused
(Dennis Ritchie provides the Unix first edition manual; the stat page is in section 2.1.)

Now how about that?

When did the ctime change from being called a "creation time" to a "change time"? Did the semantics change too, or was the "creation time" description a misnomer? If I can't find out, I might write to Ritchie to ask. But this is, of course, a last resort.

In the meantime, I do have the source code for the fifth edition kernel, but it appears that, around that time (1975 or so), there was no creation time. At least, I can't find one.

The inode operations inside the kernel are defined to operate on struct inodes:

	struct inode {
		char    i_flag;
		char    i_count;
		int     i_dev;
		int     i_number;
		int     i_mode;
		char    i_nlink;
		char    i_uid;
		char    i_gid;
		char    i_size0;
		char    *i_size1;
		int     i_addr[8];
		int     i_lastr;
	} inode[NINODE];
The i_lastr field is what we would now call the atime. (I suppose it stands for "last read".) The mtime and ctime are not there, because they are not stored in the in-memory copy of the inode. They are fetched directly from the disk when needed.

We can see an example of this in the stat1 function, which is the backend for the stat and fstat system calls:

     1	stat1(ip, ub)
     2	int *ip;
     3	{
     4	        register i, *bp, *cp;
     6	        iupdat(ip, time);
     7	        bp = bread(ip->i_dev, ldiv(ip->i_number+31, 16));
     8	        cp = bp->b_addr + 32*lrem(ip->i_number+31, 16) + 24;
     9	        ip = &(ip->i_dev);
    10	        for(i=0; i<14; i++) {
    11	                suword(ub, *ip++);
    12	                ub =+ 2;
    13	        }
    14	        for(i=0; i<4; i++) {
    15	                suword(ub, *cp++);
    16	                ub =+ 2;
    17	        }
    18	        brelse(bp);
    19	}
ub is the user buffer into which the stat data will be deposited. ip is the inode structure from which most of this data will be copied. The suword utility copies a two-byte unsigned integer ("short unsigned word") from source to destination. This is done starting at the i_dev field (line 9), which effectively skips the two earlier fields, i_flag and i_count, which are internal kernel matters that are none of the user's business.

14 words are copied from the inode structure starting from this position, including the device and i-number fields, the mode, the link count, and so on, up through the addresses of the data or indirect blocks. (In modern Unixes, the stat call omits these addresses.) Then four words are copied out of the cp buffer, which has been read from the inode actually on the disk; these eight bytes are at position 24 in the inode, and ought to contain the mtime and the ctime. The question is, which is which? This simple question turns out to have a surprisingly complicated answer.

When an inode is modified, the IUPD flag is set in the i_flag member. For example, here is chmod, which modifies the inode but not the underlying data. On a modern unix system, we would expect this to update the ctime, but not the mtime. Let's see what it does in version 5:

     1	chmod()
     2	{
     3	        register *ip;
     5	        if ((ip = owner()) == NULL)
     6	                return;
     7	        ip->i_mode =& ~07777;
     8	        if (u.u_uid)
     9	                u.u_arg[1] =& ~ISVTX;
    10	        ip->i_mode =| u.u_arg[1]&07777;
    11	        ip->i_flag =| IUPD;
    12	        iput(ip);
    13	}
Line 10 is the important one; it sets the mode on the in-memory copy of the inode to the argument supplied by the user. Then line 11 sets the IUPD flag to indicate that the inode has been modified. Line 12 calls iput, whose principal job is to maintain the kernel's internal reference count of the number of file descriptors that are attached to this inode. When this number reaches zero, the inode is written back to disk, and discarded from the kernel's open file table. The iupdat function, called from iput, is the one that actually writes the modified inode back to the disk:

     1	iupdat(p, tm)
     2	int *p;
     3	int *tm;
     4	{
     5		register *ip1, *ip2, *rp;
     6		int *bp, i;
     8		rp = p;
     9		if((rp->i_flag&(IUPD|IACC)) != 0) {
    10			if(getfs(rp->i_dev)->s_ronly)
    11				return;
    12			i = rp->i_number+31;
    13			bp = bread(rp->i_dev, ldiv(i,16));
    14			ip1 = bp->b_addr + 32*lrem(i, 16);
    15			ip2 = &rp->i_mode;
    16			while(ip2 < &rp->i_addr[8])
    17				*ip1++ = *ip2++;
    18			if(rp->i_flag&IACC) {
    19				*ip1++ = time[0];
    20				*ip1++ = time[1];
    21			} else
    22				ip1 =+ 2;
    23			if(rp->i_flag&IUPD) {
    24				*ip1++ = *tm++;
    25				*ip1++ = *tm;
    26			}
    27			bwrite(bp);
    28		}
    29	}
What is going on here? p is the in-memory copy of the inode we want to update. It is immediately copied into a register, and called by the alias rp thereafter. tm is the time that the kernel should write into the mtime field of the inode. Usually this is the current time, but the smdate system call ("set modified date") supplies it from the user instead.

Lines 16–17 copy the mode, link count, uid, gid, "size", and "addr" fields from the in-memory copy of the inode into the block buffer that will be written back to the disk. Lines 18–22 update the atime if the IACC flag is set, or skip it if not. Then, if the IUPD flag is set, lines 24–25 write the tm value into the next slot in the buffer, where the mtime is stored. The bwrite call on line 27 commits the data to the disk; this results in a call into the appropriate device driver code.

There is no sign of updating the ctime field, but recall that we started this search by looking at what the chmod call does; it sets IUPD, which eventually results in the updating of the mtime field. So the mtime field is not really an mtime field as we now know it; it is doing the job that is now done by the ctime field. And in fact, the dump command predicates its decision about whether to dump a file on the contents of the mtime field. Which is really the ctime field. So functionally, dump is doing the same thing it does now.

It's possible that I missed it, but I cannot find the advertised creation time anywhere. The logical place to look is in the maknode function, which allocates new inodes. The maknode function calls ialloc to get an unused inode from the device, and this initializes its mode (as specified by the user), its link count (to 1), and its uid and gid (to the current process's uid and gid). It does not set a creation time. The ialloc function is fairly complicated, but as far as I can tell it is not setting any creation time either.

Working it from the other end, asking who might look at the ctime field, we have the find command, which has a -mtime option, but no -ctime option. The dump command, as noted before, uses the mtime. Several commands perform stat calls and declare structs to hold the result. For example, pr, which prints files with nice pagination, declares a struct inode, which is the inode as returned by stat, as opposed to the inode as used internally by the kernel—what we would call a struct stat now. There was no /usr/include in the fifth edition, so the pr command contains its own declaration of the struct inode. It looks like this:

struct inode {
        int dev;
        int atime[2];
        int mtime[2];
No sign of the ctime, which would have been after the mtime field. (Of course, it could be there anyway, unmentioned in the declaration, since it is last.) And similarly, the ls command has:

struct ibuf {
	int	idev;
	int	inum;
	char	*iatime[2];
	char	*imtime[2];
A couple of commands have extremely misleading declarations. Here's the struct inode from the prof command, which prints profiling reports:

struct inode {
        int     idev;
        int ctime[2];
        int mtime[2];
        int fill;
The atime field has erroneously been called ctime here, but it seems that since prof does not use the atime, nobody noticed the bug. And there's a mystery fill field at the end, as if prof is expecting one more field, but doesn't know what it will be for. The declaration of ibuf in the ln command has similar oddities.

So the creation time advertised by the CACM paper (1974) and the version 1 manual (1971) seems to have disappeared by the time of version 5 (1975), if indeed it ever existed.

But there was some schizophrenia in the version 5 system about whether there was a third date in addition to the atime and the mtime. The stat call copied it into the stat buffer, and some commands assumed that it would be there, although they weren't sure what it would be for, and none of them seem look at it. It's quite possible that there was at one time a creation date, which had been eliminated by the time of the fifth edition, leaving behind the vestigial remains we saw in commands like ln and prof and in the code of the stat1 function.

Lions' Commentary on Unix 6th Edition
Lions' Commentary on Unix 6th Edition
with kickback
no kickback
Functionally, the version 5 mtime is actually what we would now call the ctime: it is updated by operations like chmod that in modern Unix will update the ctime but not the mtime. A quick scan of the Lions Book suggests that it was the same way in version 6 as well. I imagine that the ctime-mtime distinction arose in version 7, because that was the last version before the BSD/AT&T fork, and nearly everything common to those two great branches of the Unix tree was in version 7.

Oh, what the hell; I have the version 7 source code; I may as well look at it. Yes, by this time the /usr/include/sys/stat.h file had been invented, and does indeed include all three times in the struct stat. So the mtime (as we now know it) appears to have been introduced in v7.

One sometimes hears that early Unix had atime and mtime, and that ctime was introduced later. But actually, it appears that early Unix had atime and ctime, and it was the mtime that was introduced later. The confusion arises because in those days the ctime was called "mtime".

Addendum: It occurs to me now that the version 5 mtime is not precisely like the modern ctime, because it can be set via the smdate call, which is analogous to the modern utime call. The modern ctime cannot be set at all.

The C Programming Language
The C Programming Language
with kickback
no kickback
(Minor trivium: line 22 of iupdat is ip1 =+ 2. In modern C, we would write ip1 += 2. The =+ and =- operators had turned out to be a mistake, because people would write i=-1, intending i = -1, but the compiler would understand it as i =- 1, producing subtle bugs. The spellings of the operators were changed to avoid these bugs. The change from =+ to += was complete by the time K&R first edition was published in 1978: K&R mentions the old-style operators and says that the are obsolete. In spite of this, the Sun compiler I used in 1987 would still produce a warning for i=-1, despite interpreting it as i = -1. I believe this was because it was PCC-derived, and all PCC compilers emitted this warning. In the fifth edition code, we can see the obsolete form still in use.)

(Totally peripheral addendum: Google search for dmr puts Dennis M. Ritchie in fourth position, not the first. Is this grave insult to our community to be tolerated? I think not! It must be avenged! With fire and steel!)

[ Addendum 20070127: Unix source code prior to the fifth edition is lost. The manuals for the third and fourth editions are available from the Unix Heritage Society. The manual for the third edition (February 1973) mentions the creation time, but by the fourth edition (November 1973) the stat(2) man page no longer mentions a creation time. In v4, the two dates in the stat structure are called actime (modern atime) and modtime (modern mtime/ctime). ]

[Other articles in category /Unix] permanent link

Fri, 26 Jan 2007

Environmental manipulations
Unix is full of little utility programs that run some other program in a slightly modified environment. For example, the nohup command:


nohup COMMAND [ARG]...


Run COMMAND, ignoring hangup signals.

The nohup basically does signal(NOHUP, SIG_IGN) before calling execvp(COMMAND, ARGV) to execute the command.

Similarly, there is a chroot command, run as chroot new-root-directory command args..., which runs the specified command with its default root inode set to somewhere else. And there is a nice command, run as nice nice-value-adjustment command args..., which runs the specified command with its "nice" value changed. And there is an env environment-settings command args... which runs the specified command with new variables installed into the environment. The standard sudo command could also be considered to be of this type.

I have also found it useful to write trivial commands called indir, which runs a command after chdir-ing to a new directory, and stopafter, which runs a command after setting the alarm timer to a specified amount, and, just today, with-umask, which runs a command after setting the umask to a particular value.

I could probably have avoided indir and with-umask. Instead of indir DIR COMMAND, I could use sh -c 'cd DIR; exec COMMAND', for example. But indir avoids an extra layer of horrible shell quotes, which can be convenient.

Today it occurred to me to wonder if this proliferation of commands was really the best way to solve the problem. The sh -c '...' method solves it partly, for those parts of the process user area to which correspond shell builtin commands. This includes the working directory, umask, and environment variables, but not the signal table, the alarm timer, or the root directory.

There is no standardized interface to all of these things at any level. At the system call level, the working directory is changed by the chdir system call, the root directory by chroot, the alarm timer by alarm, the signal table by a bunch of OS-dependent nonsense like signal or sigaction, the nice value by setpriority, environment variables by a potentially complex bunch of memory manipulation and pointer banging, and so on.

Since there's no single interface for controlling all these things, we might get a win by making an abstraction layer for dealing with them. One place to put this abstraction layer is at the system level, and might look something like this:

	/* declares USERAREA_* constants,
                     int  userarea_set(int, ...)
	        and void *userarea_get(int) 
	#include <sys/userarea.h>

	userarea_set(USERAREA_NICE, 12);
	userarea_set(USERAREA_CWD, "/tmp");
	userarea_set(USERAREA_UMASK, 0022);
This has several drawbacks. One is that it requires kernel hacking. A subitem of this is that it will never become widespread, and that if you can't (or don't want to) replace your kernel, it cannot be made to work for you. Another is that it does not work for the environment variables, which are not really administered by the kernel. Another is that it does not fully solve the original problem, which is to obviate the plethora of nice, nohup, sudo, and env commands. You would still have to write a command to replace them. I had thought of another drawback, but forgot it while I was writing the last two sentences.

You can also put the abstraction layer at the C library level. This has fewer drawbacks. It no longer requires kernel hacking, and can provide a method for modifying the environment. But you still need to write the command that uses the library.

We may as well put the abstraction layer at the Unix command level. This means writing a command in some language, like Perl or C, which offers a shell-level interface to manipulating the process environment, perhaps something like this:

	newenv nice=12 cwd=/tmp signal=HUP:IGNORE umask=0022 -- command args...
Then newenv has a giant dispatch table inside it to process the settings accordingly:

	nice => sub { setpriority(PRIO_PROCESS, $$, $_) },
	cwd  => sub { chdir($_) },
	signal => sub {
		    my ($name, $result) = split /:/;
		    $SIG{$name} = $result;
	umask => sub { umask(oct($_)) },
One question to ask is whether something like this already exists. Another is, if not, whether it's because there's some reason why it's a bad idea, or because there's a simpler solution, or just because nobody has done it yet.

[Other articles in category /Unix] permanent link

Fri, 05 Jan 2007

ssh-agent, revisited
My recent article about reusing ssh-agent processes attracted a lot of mail, most of it very interesting.

  1. A number of people missed an important piece of context: since the article was filed in 'oops' section of my blog, it was intended as a description of a mistake I had made. The mistake in this case being to work really hard on the first solution I thought of, rather than to back up at early signs of trouble, and scout around for a better and simpler solution. I need to find a way to point out the "oops" label more clearly, and at the top of the article instead of at the bottom.

  2. Several people pointed out other good solutions to my problem. For example, Adam Sampson and Robert Loomans pointed out that versions of ssh-agent support a -a option, which orders the process to use a particular path for its Unix domain socket, rather than making up a path, as it does by default. You can then use something like ssh-agent -a $HOME/.ssh/agent when you first start the agent, and then you always know where to find the socket.

  3. An even simpler solution is as follows: My principal difficulty was in determining the correct value for the SSH_AGENT_PID variable. But it turns out that I don't need this; it is only used for ssh-agent -k, which kills the existing ssh-agent process. For authentication, it is only necessary to have SSH_AUTH_SOCK set. The appropriate value for this variable is readily determined by scanning /tmp, as I noted in the original article. Thanks to Aristotle Pagaltzis and Adam Turoff for pointing this out.

  4. Several people pointed me to the keychain project. This program is a front-end to ssh-agent. It contains functions to check for a running agent, and to start one if there is none yet, and to save the environment settings to a file, as I did manually in my article.

  5. A number of people suggested that I should just run ssh-agent from my X session manager. This suggests that they did not read the article carefully; I already do this. Processes running on my home machine, B, all inherit the ssh-agent settings from the session manager process. The question is what to do when I remote login from a different machine, say A, and want the login shell, which was not started under X, to acquire the same settings.

    Other machines trust B, but not A, so credential forwarding is not the solution here either.

  6. After extracting the ssh-agent process's file descriptor table with ls -l /proc/pid/fd, and getting:

            lrwx------    1 mjd      users          64 Dec 12 23:34 3 -> socket:[711505562]

    I concluded that the identifying information, 711505562, was useless. Aaron Crane corrected me on this; you can find it listed in /proc/net/unix, which gives the pathname in the filesystem:

        % grep 711505562 /proc/net/unix 
        ce030540: 00000002 00000000 00010000 0001 01 711505562 /tmp/ssh-tNT31655/agent.31655

    I had suggested that the kernel probably maintained no direct mapping from the socket i-number to the filesystem path, and that obtaining this information would require difficult grovelling of the kernel data structures. But apparently to whatever extent that is true, it is irrelevant, since the /proc/net/unix driver has already been written to do it.

  7. Saving the socket information in a file solves another problem I had. Suppose I want some automated process, say the cron job that makes my offsite network backups, to get access to SSH credentials. I can store the credentials in an ssh-agent process, and save the variable settings to a file. The backup process can then reinstate the settings from the file, and will thenceforward have the credentials for the remote login.

  8. Finally, I should add that since implementing this scheme for the first time on 21 November, I have started exactly zero new ssh-agent processes, so I consider it a rousing success.
Thanks to everyone who wrote in on this matter.

[Other articles in category /Unix] permanent link

Fri, 09 Dec 2005

Elimination of "f" system calls

Michael C. Toren:
> 1) Open a file descriptor pointing to the current working directory.
> 2) Create a temporary directory within the jail, and chroot() to it.
> 3) Using fchdir(), change the working directory to the file descriptor
> saved from step 1.

Oho, I hadn't seen that before. The chroot() in step 2 is required to avoid the special case in the Kernel that checks to see if you are doing ".." in the current root directory. But because you chrooted() yourself somewhere else, the special case isn't exercised.

Older systems don't have fchdir(), which is a fairly recent addition.

With the proliferation of "f" calls in recent years (fchdir, fchmod, fchown, fstat, fsync, etc.) I wonder what would be the result if the Unix system interface were redesigned to eliminate the non-"f" versions of the calls entirely. Instead, there would be a generic function, which we might call "iname", which transforms a path name to an "inode" structure:

        struct inode * iname (const char *path);

Unix kernels already contain a function with this name that does this job.

The system calls that formerly accepted path names are changed to require an inode structure. So instead of

        fd = open("dir/file", ...)

one now has

        fd = open(iname("dir/file"), ...)

(There are some minor language and usability issues here: what if iname() returns NULL? Ignore those; I want to discuss OS issues, not language issues.)

There would be a function, analogous to iname(), that also returned an inode structure, but which took an open file descriptor instead of a path name:

        struct inode * inode(int fd);

This is essentially equivalent to the fstat() function we have now.

chown() and fchown() would merge to become a single call that accepted an inode structure; instead of:

        chown("dir/file", owner) 
        fchown(fd, owner)

one would have:

        chown(iname("dir/file"), owner)
        chown(inode(fd), owner)

Similarly, instead of:


one would have:


stat() and fstat() would not only merge but would disappear entirely; the struct inode can do everything that the struct stat can do. This code:

        stat(&statbuf, "dir/file");
        fstat(&statbuf, fd);

turns into this:

        statbuf = iname("dir/file"));
        statbuf = inode(fd);

There are some security implications to this idea. There needs to be protection against counterfeiting an inode structure. For example, consider a world-readable file in a secret, nonsearchable directory. Suppose the file happens to have i-number 123456. If it's possible to do this, then security has failed:

        struct inode I;
        I.inumber = 123456;
        fd = open(I, O_RDWR);

It should be impossible for anyone to manufacture the struct inode that represents the secret file without actually using iname() somewhere along the line. A simple way to arrange this would be to have the kernel cryptographically sign each struct inode. This can be done inexpensively.

This still has some access implications. Consider a world-readable file in a world-searchable directory. Process A iname()s the file, obtaining its struct inode. The search permissions on the directory are then removed. Process A can still open the file. This is analogous to a similar situation in standard Unix in which process A opens the file before the permissions are changed, and can still read and write it afterwards. So that's not a big change. What might be a big change is that A can dump the struct inode to a file and the a different process can read it back again, evading the increased access protections on the directory. The cryptographic signature technique can fix this problem by restricting struct inodes to be used by a single process.

Whether this is worth doing I don't know. My main idea in thinking it up was to avoid the increasing duplication of system calls. Does Unix need an "fsymlink" call? Does it need three different ones?

        symlink(oldpath, newpath);
        fsymlink1(fd, newpath);
        fsymlink2(oldpath, fd);
        fsymlink3(oldfile_fd, newdir_fd);

Perhaps not this week, but who knows what the future holds? With the iname() / inode() style, these are all a single call:

        symlink(iname(oldpath), iname(newpath));
        symlink(inode(fd), iname(newpath));
        symlink(iname(oldpath), inode(fd));
        symlink(inode(oldfile_fd), inode(newdir_fd));

This also fixes some of the proliferation in the system call interface between calls that work on symlinks and calls that work through symlinks. For example, stat() and lstat(), and chown() and lchown(). On normal files, each pair is the same. But on a symlink, stat() stats the pointed-to file while lstat() stats the symlink itself; similarly chown() changes the owner of the pointed-to file while lchown() changes the owner of the symlink itself. But where's lchmod()? What about llink()? There's no way to make a hard link to a symbolic link! With the inode() / iname() technique above, you only need one extra call to handle all possible operations on a symbolic link:

        lchown(path, owner);
        llink(path, newpath);


        chown(liname(path), owner);
        link(liname(path), iname(newpath));

where liname() is just like iname(), except that if the resulting file is a symbolic link, its inode is returned immediately; iname() would have read the target of the symbolic link and called itself recursively to resolve the target.

It also seems to me that this interface might make it easier to communicate open files from one process to another. Some unix systems offer a experimental features for passing file descriptors around; this system only requires that the struct inode be communicated directly to the receiving process.

[Other articles in category /Unix] permanent link