# The Universe of Discourse

Fri, 03 Jan 2020

Sometimes I look through the HTTP referrer logs to see if anyone is talking about my blog. I use the f 11 command to extract the referrer field from the log files, count up the number of occurrences of each referring URL, then discard the ones that are internal referrers from elsewhere on my blog. It looks like this:

    f 11 access.2020-01-0* | count | grep -v plover


(I've discussed f before. The f 11 just prints the eleventh field of each line. It is essentially shorthand for awk '{print $11}' or perl -lane 'print$F[10]'. The count utility is even simpler; it counts the number of occurrences of each distinct line in its input, and emits a report sorted from least to most frequent, essentially a trivial wrapper around sort | uniq -c | sort -n. Civilization advances by extending the number of important operations which we can perform without thinking about them.)
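For readers who don't have these utilities, here is a minimal sketch of what they might look like as shell functions, based only on the descriptions above. The function bodies are my assumption; the real f and count may handle options and edge cases differently.

```shell
# Hypothetical stand-ins for f and count; the real utilities may differ.

# f N [FILE...]: print the Nth whitespace-separated field of each input line.
# The body runs in a subshell so that n does not leak into the caller.
f() (
  n=$1
  shift
  awk -v n="$n" '{ print $n }' "$@"
)

# count: tally identical lines and report them from least to most frequent.
count() {
  sort | uniq -c | sort -n
}

# Example: which value in the second column is most common?
printf 'a x\nb y\nc x\n' | f 2 | count
```

The last line of the report is the most frequent value, which is usually the one I care about.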

This has obvious defects, but it works well enough. But every time I used it, I wondered: is it faster to do the grep before the count, or after? I didn't ever notice a difference. But I still wanted to know.

After years of idly wondering this, I have finally looked into it. The point of this article is that the investigation produced the following pipeline, which I think is a great example of the Unix “tools” philosophy:

    for i in $(seq 20); do
      TIME="%U+%S" time \
         sh -c 'f 11 access.2020-01-0* | grep -v plover | count > /dev/null' \
           2>&1 | bc -l ;
    done | addup

I typed this on the command line, with no backslashes or newlines, so it actually looked like this:

    for i in $(seq 20); do TIME="%U+%S" time sh -c 'f 11 access.2020-01-0* | grep -v plover |count > /dev/null' 2>&1 | bc -l ; done | addup


Okay, what's going on here? The pipeline I actually want to analyze, with f | grep| count, is there in the middle, and I've already explained it, so let's elide it:

    for i in $(seq 20); do
      TIME="%U+%S" time \
         sh -c '¿SOMETHING? > /dev/null' 2>&1 | bc -l ;
    done | addup

Continuing to work from inside to out, we're going to use time to actually do the timings. The time command is standard. It runs a program, asks the kernel how long the program took, then prints a report.

The time command will only time a single process (plus its subprocesses, a crucial fact that is inexplicably omitted from the man page). The ¿SOMETHING? includes a pipeline, which must be set up by the shell, so we're actually timing a shell command sh -c '...' which tells time to run the shell and instruct it to run the pipeline we're interested in. We tell the shell to throw away the output of the pipeline, with > /dev/null, so that the output doesn't get mixed up with time's own report.

The default format for the report printed by time is intended for human consumption. We can supply an alternative format in the $TIME variable. The format I'm using here is %U+%S, which comes out as something like 0.25+0.37, where 0.25 is the user CPU time and 0.37 is the system CPU time. I didn't see a format specifier that would emit the sum of these directly. So instead I had it emit them with a + in between, and then piped the result through the bc command, which performs the requested arithmetic and emits the result. We need the -l flag on bc because otherwise it stupidly does integer arithmetic.

The time command emits its report to standard error, so I use 2>&1 to redirect the standard error into the pipe.

[ Addendum 20200108: We don't actually need -l here; I was mistaken. ]

Collapsing the details I just discussed, we have:

    for i in $(seq 20); do
      (run once and emit the total CPU time)
    done | addup

seq is a utility I invented no later than 1993 which has since become standard in most Unix systems. (As with netcat, I am not claiming to be the first or only person to have invented this, only to have invented it independently.) There are many variations of seq, but the main use case is that seq 20 prints

    1
    2
    3
    …
    19
    20

Here we don't actually care about the output (we never actually use $i) but it's a convenient way to get the for loop to run twenty times. The output of the for loop is the twenty total CPU times that were emitted by the twenty invocations of bc. (Did you know that you can pipe the output of a loop?) These twenty lines of output are passed into addup, which I wrote no later than 2011. (Why did it take me so long to do this?) It reads a list of numbers and prints the sum.
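A minimal sketch of something like addup, assuming one number per input line. This is a guess at the behavior from the one-sentence description above, not the actual source.

```shell
# Hypothetical stand-in for addup: sum the first field of every input line.
addup() {
  awk '{ total += $1 } END { print total + 0 }'
}

# A loop's combined output can be piped just like that of any other command.
for i in $(seq 3); do echo "$i.5"; done | addup
```

The `+ 0` in the END block only makes sure that empty input prints 0 instead of an empty line.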

All together, the command runs and prints a single number like 5.17, indicating that the twenty runs of the pipeline took 5.17 CPU-seconds total. I can do this a few times for the original pipeline, with count before grep, get times between 4.77 and 5.78, and then try again with the grep before the count, producing times between 4.32 and 5.14. The difference is large enough to detect but too small to notice.

(To do this right we also need to test a null command, say

    sh -c 'sleep 0.1 < /dev/null'


because we might learn that 95% of the reported time is spent in running the shell, so the actual difference between the two pipelines is twenty times as large as we thought. I did this; it turns out that the time spent to run the shell is insignificant.)

What to learn from all this? On the one hand, Unix wins: it's supposed to be quick and easy to assemble small tools to do whatever it is you're trying to do. When time wouldn't do the arithmetic I needed it to, I sent its output to a generic arithmetic-doing utility. When I needed to count to twenty, I had a utility for doing that; if I hadn't there are any number of easy workarounds. The shell provided the I/O redirection and control flow I needed.

On the other hand, gosh, what a weird mishmash of stuff I had to remember or look up. The -l flag for bc. The fact that I needed bc at all because time won't report total CPU time. The $TIME variable that controls its report format. The bizarro 2>&1 syntax for redirecting standard error into a pipe. The sh -c trick to get time to execute a pipeline. The missing documentation of the core functionality of time.

Was it a win overall? What if Unix had less compositionality but I could use it with less memorized trivia? Would that be an improvement?

I don't know. I rather suspect that there's no way to actually reach that hypothetical universe. The bizarre mishmash of weirdness exists because so many different people invented so many tools over such a long period. And they wouldn't have done any of that inventing if the compositionality hadn't been there. I think we don't actually get to make a choice between an incoherent mess of composable paraphernalia and a coherent, well-designed but noncompositional system. Rather, we get a choice between an incoherent but useful mess and an incomplete, limited noncompositional system.

(Notes to self: (1) In connection with Parse::RecDescent, you once wrote about open versus closed systems. This is another point in that discussion. (2) Open systems tend to evolve into messes. But closed systems tend not to evolve at all, and die. (3) Closed systems are centralized and hierarchical; open systems, when they succeed, are decentralized and organic. (4) If you are looking for another example of a successful incoherent mess of composable paraphernalia, consider Git.)

[ Addendum: Add this to the list of “weird mishmash of trivia”: There are two time commands. One, which I discussed above, is a separate executable, usually in /usr/bin/time. The other is built into the shell. They are incompatible. Which was I actually using?
I would have been pretty confused if I had accidentally gotten the built-in one, which ignores $TIME and uses a $TIMEFORMAT that is interpreted in a completely different way. I was fortunate, and got the one I intended to get. But it took me quite a while to understand why I had! The appearance of the TIME=… assignment at the start of the shell command disabled the shell's special builtin treatment of the keyword time, so it really did use /usr/bin/time. This computer stuff is amazingly complicated. I don't know how anyone gets anything done. ]

[ Addenda 20200104: (1) Perl's module ecosystem is another example of a successful incoherent mess of composable paraphernalia. (2) Of the seven trivia I included in my “weird mishmash”, five were related to the time command. Is this a reflection on time, or is it just because time was central to this particular example? ]

[ Addendum 20200104: And, of course, this is exactly what Richard Gabriel was thinking about in Worse is Better. Like Gabriel, I'm not sure. ]

Thu, 08 Nov 2018

Yesterday I wanted to reconfigure the sshd on a remote machine. Although I'd never done sshd itself, I've done this kind of thing a zillion times before. It looks like this: there is a configuration file (in this case /etc/ssh/sshd_config) that you modify. But this doesn't change the running server; you have to notify the server that it should reread the file. One way would be by killing the server and starting a new one. This would interrupt service, so instead you can send the server a different signal (in this case SIGHUP) that tells it to reload its configuration without exiting. Simple enough.

Except, it didn't work. I added:

    Match User mjd
        ForceCommand echo "I like pie!"

and signalled the server, then made a new connection to see if it would print I like pie! instead of starting a shell. It started a shell. Okay, I've never used Match or ForceCommand before, maybe I don't understand how they work, I'll try something simpler.
I added:

    PrintMotd yes

which seemed straightforward enough, and I put some text into /etc/motd, but when I connected it didn't print the motd.

I tried a couple of other things but none of them seemed to work. Okay, maybe the sshd is not getting the signal, or something? I hunted up the logs, but there was a report like what I expected:

    sshd[1210]: Received SIGHUP; restarting.

This was a head-scratcher. Was I modifying the wrong file? It seemed hardly possible, but I don't administer this machine so who knows? I tried lsof -p 1210 to see if maybe sshd had some other config file open, but it doesn't keep the file open after it reads it, so that was no help.

Eventually I hit upon the answer, and I wish I had some useful piece of advice here for my future self about how to figure this out. But I don't because the answer just struck me all of a sudden. (It's nice when that happens, but I feel a bit cheated afterward: I solved the problem this time, but I didn't learn anything, so how does it help me for next time? I put in the toil, but I didn't get the full payoff.)

“Aha,” I said. “I bet it's because my connection is multiplexed.”

Normally when you make an ssh connection to a remote machine, it calls up the server, exchanges credentials, each side authenticates the other, and they negotiate an encryption key. Then the server forks, the child starts up a login shell and mediates between the shell and the network, encrypting in one direction and decrypting in the other. All that negotiation and authentication takes time.

There is a “multiplexing” option you can use instead. The handshaking process still occurs as usual for the first connection. But once the connection succeeds, there's no need to start all over again to make a second connection. You can tell ssh to multiplex several virtual connections over its one real connection.
To make a new virtual connection, you run ssh in the same way, but instead of contacting the remote server as before, it contacts the local ssh client that's already running and requests a new virtual connection. The client, already connected to the remote server, tells the server to allocate a new virtual connection and to start up a new shell session for it. The server doesn't even have to fork; it just has to allocate another pseudo-tty and run a shell in it. This is a lot faster.

I had my local ssh client configured to use a virtual connection if that was possible. So my subsequent ssh commands weren't going through the reconfigured parent server. They were all going through the child server that had been forked hours before when I started my first connection. It wasn't affected by reconfiguration of the parent server, from which it was now separate.

I verified this by telling ssh to make a new connection without trying to reuse the existing virtual connection:

    ssh -o ControlPath=none -o ControlMaster=no ...

This time I saw the MOTD and when I reinstated that Match command I got I like pie! instead of a shell.

(It occurs to me now that I could have tried to SIGHUP the child server process that my connections were going through, and that would probably have reconfigured any future virtual connections through that process, but I didn't think of it at the time.)

Then I went home for the day, feeling pretty darn clever, right up until I discovered, partway through writing this article, that I can't log in because all I get is I like pie! instead of a shell.

Mon, 21 May 2018

In yesterday's article I described a simple and useful feature that could have been added to the standard I/O library, to allow an environment variable to override the default buffering behavior. This would allow the invoker of a program to request that the program change its buffering behavior even if the program itself didn't provide an option specifically for doing that.
Simon Tatham directed me to the GNU Coreutils stdbuf command which does something of this sort. It is rather like the pseudo-tty-pipe program I described, but instead of using the pseudo-tty hack I suggested, it works by forcing the child program to dynamically load a custom replacement for stdio. There appears to be a very similar command in FreeBSD.

Roderick Schertler pointed out that Dan Bernstein wrote a utility program, pty, in 1990, atop which my pseudo-tty-pipe program could easily be built; or maybe its ptybandage utility is exactly what I wanted. Jonathan de Boyne Pollard has a page explaining it in detail, and related packages.

A later version of pty is still available. Here's M. Bernstein's blurb about it:

> ptyget is a universal pseudo-terminal interface. It is designed to be used by any program that needs a pty.
>
> ptyget can also serve as a wrapper to improve the behavior of existing programs. For example, ptybandage telnet is like telnet but can be put into a pipeline. nobuf grep is like grep but won't block-buffer if it's redirected.
>
> Previous pty-allocating programs (rlogind, telnetd, sshd, xterm, screen, emacs, expect, etc.) have caused dozens of security problems. There are two fundamental reasons for this. First, these programs are installed setuid root so that they can allocate ptys; this turns every little bug in hundreds of thousands of lines of code into a potential security hole. Second, these programs are not careful enough to protect the pty from access by other users.
>
> ptyget solves both of these problems. All the privileged code is in one tiny program. This program guarantees that one user can't touch another user's pty.
>
> ptyget is a complete rewrite of pty 4.0, my previous pty-allocating package. pty 4.0's session management features have been split off into a separate package, sess.

Leonardo Taccari informed me that NetBSD's stdio actually has the environment variable feature I was asking for!
Christos Zoulas suggested adding stdbuf similar to the GNU and FreeBSD implementations, but the NetBSD people observed, as I did, that it would be simpler to just control stdio directly with an environment variable, and did it. Here's the relevant part of the NetBSD setbuf(3) man page:

> The default buffer settings can be overwritten per descriptor (STDBUFn) where n is the numeric value of the file descriptor represented by the stream, or for all descriptors (STDBUF). The environment variable value is a letter followed by an optional numeric value indicating the size of the buffer. Valid sizes range from 0B to 1MB. Valid letters are:
>
> * U unbuffered
> * L line buffered
> * F fully buffered

Here's the discussion from the NetBSD tech-userlevel mailing list. The actual patch looks almost exactly the way I imagined it would.

Finally, Mariusz Ceier pointed out that there is an ancient bug report in glibc suggesting essentially the same environment variable mechanism that I suggested and that was adopted in NetBSD. The suggestion was firmly and summarily rejected. (“Hell, no … this is a terrible idea.”) Interesting wrinkle: the bug report was submitted by Pádraig Brady, who subsequently wrote the stdbuf command I described above.

Thank you, Gentle Readers!

Sun, 20 May 2018

Some Unix commands, such as grep, will have a command-line flag to say that you want to turn off the buffering that is normally done in the standard I/O library. Some just try to guess what you probably want. Every command is a little different and if the command you want doesn't have the flag you need, you are basically out of luck.

Maybe I should explain the putative use case here. You have some command (or pipeline) X that will produce dribbles of data at uncertain intervals. If you run it at the terminal, you see each dribble timely, as it appears. But if you put X into a pipeline, say with

    X | tee ...

or

    X | grep ...

then the dribbles are buffered and only come out of X when an entire block is ready to be written, and the dribbles could be very old before the downstream part of the pipeline, including yourself, sees them. Because this is happening in user space inside of X, there is not a damn thing anyone farther downstream can do about it. The only escape is if X has some mode in which it turns off standard I/O buffering. Since standard I/O buffering is on by default, there is a good chance that the author of X did not think to affirmatively add this feature.

Note that adding the --unbuffered flag to the downstream grep does not solve the problem; grep will produce its own output timely, but it's still getting its input from X after a long delay.

One could imagine a program which would interpose a pseudo-tty, and make X think it is writing to a terminal, and then the standard I/O library would stay in line-buffered mode by default. Instead of running

    X | tee some-file | ...

or whatever, one would do

    pseudo-tty-pipe -c X | tee some-file | ...

which allocates a pseudo-tty device, attaches standard output to it, and forks. The child runs X, which dribbles timely into the pseudo-tty while the parent runs a read loop to remove dribbles from the master end of the TTY and copy them timely into the pipe. This would work. Although tee itself also has no --unbuffered flag so you might even have to:

    pseudo-tty-pipe -c X | pseudo-tty-pipe -c 'tee some-file' | ...

I don't think such a program exists, and anyway, this is all ridiculous, a ridiculous abuse of the standard I/O library's buffering behavior: we want line buffering, the library will only give it to us if the process is attached to a TTY device, so we fake up a TTY just to fool stdio into giving us what we want. And why? Simply because stdio has no way to explicitly say what we want.

But it could easily expose this behavior as a controllable feature.
Currently there is a branch in the library that says how to set up a buffering mode when a stream is opened for the first time:

* if the stream is for writing, and is attached to descriptor 2, it should be unbuffered; otherwise …
* if the stream is for writing, and connects descriptor 1 to a terminal device, it should be line-buffered; otherwise …
* if the moon is waxing …
* otherwise, the stream should be block-buffered

To this, I propose a simple change, to be inserted right at the beginning:

If the environment variable STDIO_BUF is set to "line", streams default to line buffering. If it's set to "none", streams default to no buffering. If it's set to "block", streams default to block buffered. If it's anything else, or unset, it is ignored.

Now instead of this:

    pseudo-tty-pipe --from X | tee some-file | ...

you write this:

    STDIO_BUF=line X | tee some-file | ...

Problem solved.

Or maybe you would like to do this:

    export STDIO_BUF=line

which then affects every program in every pipeline in the rest of the session:

    X | tee some-file | ...

Control is global if you want it, and per-process if you want it.

This feature would cost around 20 lines of C code in the standard I/O library and would impose only an insignificant run-time cost. It would effectively add an --unbuffered flag to every program in the universe, retroactively, and the flag would be the same for every program. You would not have to remember that in mysql the magic option is -n and that in GNU grep it is --line-buffered and that for jq it is --unbuffered and that Python scripts can be unbuffered by supplying the -u flag and that in tee you are just SOL, etc. Setting STDIO_BUF=line would Just Work.

Programming languages would all get this for free also. Python already has PYTHONUNBUFFERED but in other languages you have to do something or other; in Perl you use some horrible Perl-4-ism like

    { my $ofh = select OUTPUT; $|++; select $ofh }


This proposal would fix every programming language everywhere. The Perl code would become:

    $ENV{STDIO_BUF} = 'line';

and every other language would be similarly simple:

    /* In C */
    putenv("STDIO_BUF=line");

[ Addendum 20180521: Mariusz Ceier corrects me, pointing out that this will not work for the process’ own standard streams, as they are pre-opened before the process gets a chance to set the variable. ]

It's easy to think of elaborations on this: STDIO_BUF=1:line might mean that only standard output gets line-buffering by default, everything else is up to the library.

This is an easy thing to do. I have wanted this for twenty years. How is it possible that it hasn't been in the GNU/Linux standard library for that long?

[ Addendum 20180521: it turns out there is quite a lot to say about the state of the art here. In particular, NetBSD has the feature very much as I described it. ]

Tue, 02 Jan 2018

I was on vacation last week and I didn't bring my computer, which has been a good choice in the past. But I did bring my phone, and I spent some quiet time writing various parts of around 20 blog posts on the phone. I composed these in my phone's Google Docs app, which seemed at the time like a reasonable choice.

But when I got back I found that it wasn't as easy as I had expected to get the documents back out. What I really wanted was Markdown. HTML would have been acceptable, since Blosxom accepts that also. I could download a single document in one of several formats, including HTML and ODF, but I had twenty and didn't want to do them one at a time. Google has a bulk download feature, to download a zip file of an entire folder, but upon unzipping I found that all twenty documents had been converted to Microsoft's docx format and I didn't know a good way to handle these. I could not find an option for a bulk download in any other format.

Several tools will compose in Markdown and then export to Google docs, but the only option I found for translating from Google docs to Markdown was Renato Mangini's Google Apps script.
I would have had to add the script to each of the 20 files, then run it, and the output appears in email, so for this task, it was even less like what I wanted.

The right answer turned out to be: Accept Google's bulk download of docx files and then use Pandoc to convert the docx to Markdown:

    for i in *.docx; do
      echo -n "$i ? ";
      read j; mv -i "$i" $j.docx;
      pandoc --extract-media . -t markdown -o "$(suf "$j" mkdn)" "$j.docx";
    done

The read is because I had given the files Unix-unfriendly names like Polyominoes as orthogonal polygons.docx and I wanted to give them shorter names like orthogonal-polyominoes.docx.

The suf command is a little utility that performs the very common task of removing or changing the suffix of a filename. The suf "$j" mkdn command means that if $j is something like foo.docx it should turn into foo.mkdn. Here's the tiny source code:

    #!/usr/bin/perl
    #
    # Usage: suf FILENAME [suffix]
    #
    # If filename ends with a suffix, the suffix is replaced with the given suffix
    # otherwise, the given suffix is appended
    #
    # For example:
    #   suf foo.bar baz => foo.baz
    #   suf foo baz     => foo.baz
    #   suf foo.bar     => foo
    #   suf foo         => foo

    @ARGV == 2 or @ARGV == 1 or usage();
    my ($file, $suf) = @ARGV;
    $file =~ s/\.[^.]*$//;
    if (defined $suf) {
      print "$file.$suf\n";
    } else {
      print "$file\n";
    }

    sub usage {
      print STDERR "Usage: suf filename [newsuffix]\n";
      exit 1;
    }

Often, I feel that I have written too much code, but not this time.

Some people might be tempted to add bells and whistles to this: what if the suffix is not delimited by a dot character? What if I only want to change certain suffixes? What if my foot swells up? What if the moon falls out of the sky? Blah blah blah. No, for that we can break out sed.

Next time I go on vacation I will know better and I will not use Google Docs. I don't know yet what instead. StackEdit maybe.

[ Addendum 20180108: Eric Roode pointed out that the program above has a genuine bug: if given a filename like a.b/c.d it truncates the entire b/c.d instead of just the d. The current version fixes this. ]

Sun, 02 Apr 2017

A Unix system administrator of my acquaintance once got curious about what people were putting into /dev/null. I think he also may have had some notion that it would contain secrets or other interesting material that people wanted thrown away. Both of these ideas are stupid, but what he did next was even more stupid: he decided to replace /dev/null with a plain file so that he could examine its contents.

The root filesystem quickly filled up and the admin had to be called back from dinner to fix it. But he found that he couldn't fix it: to create a Unix device file you use the mknod command, and its arguments are the major and minor device numbers of the device to create. Our friend didn't remember the correct minor device number. The ls -l command will tell you the numbers of a device file but he had removed /dev/null so he couldn't use that.

Having no other system of the same type with an intact device file to check, he was forced to restore /dev/null from the tape backups.

Thu, 28 Jul 2016

Yesterday I wrote about how I was trying to control the KDE screenlocker's timeout from a shell script and all the fun stuff I learned along the way.
Then after I published the article I discovered that my solution didn't work. But today I fixed it and it does work.

### What didn't work

I had written this script:

    timeout=${1:-3600}
    perl -i -lpe 's/^Enabled=.*/Enabled=False/' $HOME/.kde/share/config/kscreensaverrc
    qdbus org.freedesktop.ScreenSaver /MainApplication reparseConfiguration
    sleep $timeout
    perl -i -lpe 's/^Enabled=.*/Enabled=True/' $HOME/.kde/share/config/kscreensaverrc
    qdbus org.freedesktop.ScreenSaver /MainApplication reparseConfiguration

The strategy was: use perl to rewrite the screen locker's configuration file, and then use qdbus to send a D-Bus message to the screen locker to order it to load the updated configuration.

This didn't work. The System Settings app would see the changed configuration, and report what I expected, but the screen saver itself was still behaving according to the old configuration. Maybe the qdbus command was wrong or maybe the whole theory was bad.

### More strace

For want of anything else to do (when all you have is a hammer…), I went back to using strace to see what else I could dig up, and tried

    strace -ff -o /tmp/ss/s /usr/bin/systemsettings

which tells strace to write separate files for each process or thread. I had a fantasy that by splitting the trace for each process into a separate file, I might solve the mysterious problem of the missing string data. This didn't come true, unfortunately.

I then ran tail -f on each of the output files, and used systemsettings to update the screen locker configuration, looking to see which of the trace files changed. I didn't get too much out of this. A great deal of the trace was concerned with X protocol traffic between the application and the display server.
But I did notice this portion, which I found extremely suggestive, even with the filenames missing:

    3106  open(0x2bb57a8, O_RDWR|O_CREAT|O_CLOEXEC, 0666) = 18
    3106  fcntl(18, F_SETFD, FD_CLOEXEC) = 0
    3106  chmod(0x2bb57a8, 0600) = 0
    3106  fstat(18, {...}) = 0
    3106  write(18, 0x2bb5838, 178) = 178
    3106  fstat(18, {...}) = 0
    3106  close(18) = 0
    3106  rename(0x2bb5578, 0x2bb4e48) = 0
    3106  unlink(0x2b82848) = 0

You may recall that my theory was that when I click the “Apply” button in System Settings, it writes out a new version of $HOME/.kde/share/config/kscreensaverrc and then orders the screen locker to reload the configuration. Even with no filenames, this part of the trace looked to me like the replacement of the configuration file: a new file is created, then written, then closed, and then the rename replaces the old file with the new one. If I had been thinking about it a little harder, I might have thought to check if the return value of the write call, 178 bytes, matched the length of the file. (It does.) The unlink at the end is deleting the semaphore file that System Settings created to prevent a second process from trying to update the same file at the same time.
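The create-write-close-rename pattern in that trace is a common way to replace a file without ever exposing a half-written version. Here is a minimal shell sketch of the same idea. The function name and the temporary-file naming scheme are my own inventions for illustration, not what KDE does, and the semaphore file is left out.

```shell
# replace_atomically TARGET: read new contents from stdin, write them to a
# temporary file beside TARGET, then rename the temporary file over TARGET.
replace_atomically() {
  target=$1
  tmp="$target.$$.new"      # hypothetical naming scheme, not KDE's
  cat > "$tmp" &&
    chmod 600 "$tmp" &&
    mv -f "$tmp" "$target"  # a rename when both paths are on one filesystem
}

# Example run in a scratch directory.
dir=$(mktemp -d)
printf '[ScreenSaver]\nEnabled=true\n' > "$dir/kscreensaverrc"
printf '[ScreenSaver]\nEnabled=false\n' | replace_atomically "$dir/kscreensaverrc"
cat "$dir/kscreensaverrc"
```

Because the temporary file lives in the same directory as the target, the final mv can be a plain rename, so a reader sees either the old file or the new one and never a partial write.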

Supposing that this was the trace of the configuration update, the next section should be the secret sauce that tells the screen locker to look at the new configuration file. It looked like this:

    3106  sendmsg(5, 0x7ffcf37e53b0, MSG_NOSIGNAL) = 168
    3106  poll([?] 0x7ffcf37e5490, 1, 25000) = 1
    3106  recvmsg(5, 0x7ffcf37e5390, MSG_CMSG_CLOEXEC) = 90
    3106  recvmsg(5, 0x7ffcf37e5390, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
    3106  sendmsg(5, 0x7ffcf37e5770, MSG_NOSIGNAL) = 278
    3106  sendmsg(5, 0x7ffcf37e5740, MSG_NOSIGNAL) = 128


There is very little to go on here, but none of it is inconsistent with the theory that this is the secret sauce, or even with the more advanced theory that it is the secret sauce and that the secret sauce is a D-Bus request. But without seeing the contents of the messages, I seemed to be at a dead end.

### Thrashing

Browsing random pages about the KDE screen locker, I learned that the lock screen configuration component could be run separately from the rest of System Settings. You use

    kcmshell4 --list


to get a list of available components, and then

    kcmshell4 screensaver


to run the screensaver component. I started running strace on this command instead of on the entire System Settings app, with the idea that if nothing else, the trace would be smaller and perhaps simpler, and for some reason the missing strings appeared. That suggestive block of code above turned out to be updating the configuration file, just as I had suspected:

    open("/home/mjd/.kde/share/config/kscreensaverrcQ13893.new", O_RDWR|O_CREAT|O_CLOEXEC, 0666) = 19
    fcntl(19, F_SETFD, FD_CLOEXEC)          = 0
    chmod("/home/mjd/.kde/share/config/kscreensaverrcQ13893.new", 0600) = 0
    fstat(19, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
    write(19, "[ScreenSaver]\nActionBottomLeft=0\nActionBottomRight=0\nActionTopLeft=0\nActionTopRight=2\nEnabled=true\nLegacySaverEnabled=false\nPlasmaEnabled=false\nSaver=krandom.desktop\nTimeout=60\n", 177) = 177
    fstat(19, {st_mode=S_IFREG|0600, st_size=177, ...}) = 0
    close(19)                               = 0
    rename("/home/mjd/.kde/share/config/kscreensaverrcQ13893.new", "/home/mjd/.kde/share/config/kscreensaverrc") = 0


And the following secret sauce was revealed as:

    sendmsg(7, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\30\0\0\0\v\0\0\0\177\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\f\0\0\0GetNameOwner\0\0\0\0\10\1g\0\1s\0\0", 144}, {"\23\0\0\0org.kde.screensaver\0", 24}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 168
    sendmsg(7, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\1\1\206\0\0\0\f\0\0\0\177\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\10\0\0\0AddMatch\0\0\0\0\0\0\0\0\10\1g\0\1s\0\0", 144}, {"\201\0\0\0type='signal',sender='org.freedesktop.DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',arg0='org.kde.screensaver'\0", 134}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 278
    sendmsg(7, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\0\0\0\0\r\0\0\0j\0\0\0\1\1o\0\f\0\0\0/ScreenSaver\0\0\0\0\6\1s\0\23\0\0\0org.kde.screensaver\0\0\0\0\0\2\1s\0\23\0\0\0org.kde.screensaver\0\0\0\0\0\3\1s\0\t\0\0\0configure\0\0\0\0\0\0\0", 128}, {"", 0}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 128
    sendmsg(7, {msg_name(0)=NULL,
    msg_iov(2)=[{"l\1\1\1\206\0\0\0\16\0\0\0\177\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\v\0\0\0RemoveMatch\0\0\0\0\0\10\1g\0\1s\0\0",
    144},
    {"\201\0\0\0type='signal',sender='org.freedesktop.DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',arg0='org.kde.screensaver'\0",
    134}]


(I had to give strace the -s 256 flag to tell it not to truncate the string data to 32 characters.)

### Binary gibberish

A lot of this is illegible, but it is clear, from the frequent mentions of DBus, and from the names of D-Bus objects and methods, that these are D-Bus requests, as theorized. Much of it is binary gibberish that we can only read if we understand the D-Bus line protocol, but the object and method names are visible. For example, consider this long string:

    interface='org.freedesktop.DBus',member='NameOwnerChanged',arg0='org.kde.screensaver'


With qdbus I could confirm that there was a service named org.freedesktop.DBus with an object named / that supported a NameOwnerChanged method which expected three QString arguments. Presumably the first of these was org.kde.screensaver and the others were hiding in the other 134 characters that strace didn't expand. So I may not understand the whole thing, but I could see that I was on the right track.
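
Picking the readable names out of one of these strings can be done mechanically. Here is a minimal shell sketch of that idea; it is not something strace or D-Bus provides, and the function name readable_parts and the four-character cutoff are assumptions made for illustration. It replaces strace's octal and C-style escapes with spaces and keeps only the runs long enough to look like names.

```shell
# Hypothetical helper: read strace-escaped string data on stdin and print
# the printable runs of four or more characters, one per line.
readable_parts() {
    # Turn \NNN octal escapes and \t-style escapes into spaces,
    # split on spaces, then drop the short leftovers (type codes like "o" or "s").
    sed -e 's/\\[0-7]\{1,3\}/ /g' -e 's/\\[abfnrtv]/ /g' |
        tr -s ' ' '\n' |
        awk 'length($0) >= 4'
}
```

Fed a piece of the third sendmsg payload, for example with `printf '%s\n' '\1\1o\0\f\0\0\0/ScreenSaver\0\0\0\0\6\1s\0\23\0\0\0org.kde.screensaver\0\0\0\0\0\3\1s\0\t\0\0\0configure\0\0\0' | readable_parts`, this prints /ScreenSaver, org.kde.screensaver, and configure on separate lines. A name shorter than the cutoff would be lost, so the cutoff is a judgment call.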

That third line was the key:

    sendmsg(7, {msg_name(0)=NULL,
    msg_iov(2)=[{"… /ScreenSaver … org.kde.screensaver … org.kde.screensaver … configure …", 128}, {"", 0}],
    msg_controllen=0,
    msg_flags=0},
    MSG_NOSIGNAL) = 128


Huh, it seems to be asking the screensaver to configure itself. Just like I thought it should. But there was no configure method, so what does that configure refer to, and how can I do the same thing?

But org.kde.screensaver was not quite the same path I had been using to talk to the screen locker—I had been using org.freedesktop.ScreenSaver, so I had qdbus list the methods at this new path, and there was a configure method.

When I tested

    qdbus org.kde.screensaver /ScreenSaver configure


I found that this made the screen locker take note of the updated configuration. So, problem solved!

(As far as I can tell, org.kde.screensaver and org.freedesktop.ScreenSaver are completely identical. They each have a configure method, but I had overlooked it—several times in a row—earlier when I had gone over the method catalog for org.freedesktop.ScreenSaver.)

The working script is almost identical to what I had yesterday:

    timeout=${1:-3600}
    perl -i -lpe 's/^Enabled=.*/Enabled=False/' $HOME/.kde/share/config/kscreensaverrc
    qdbus org.freedesktop.ScreenSaver /ScreenSaver configure
    sleep $timeout
    perl -i -lpe 's/^Enabled=.*/Enabled=True/' $HOME/.kde/share/config/kscreensaverrc
    qdbus org.freedesktop.ScreenSaver /ScreenSaver configure
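
The two halves of that script differ only in the value written, so they could be factored into a pair of functions. What follows is a minimal sketch under stated assumptions, not the script actually in use: the names set_locker_enabled and notify_locker are made up, sed with a temporary file stands in for perl -i, and the D-Bus call is skipped when qdbus is not installed.

```shell
# Hypothetical helper: rewrite the Enabled= line of a kscreensaverrc-style file.
# $1 is the new value (True or False); $2 is the path of the config file.
set_locker_enabled() {
    value=$1
    rc=$2
    tmp="$rc.tmp.$$"
    # Write the edited copy to a temporary file, then rename it into place.
    sed "s/^Enabled=.*/Enabled=$value/" "$rc" > "$tmp" && mv "$tmp" "$rc"
}

# Hypothetical helper: ask the running screen locker to reread its
# configuration, using the same qdbus call as the script above.
notify_locker() {
    if command -v qdbus >/dev/null 2>&1; then
        qdbus org.freedesktop.ScreenSaver /ScreenSaver configure
    fi
}
```

With these, disabling the locker would be `set_locker_enabled False "$HOME/.kde/share/config/kscreensaverrc" && notify_locker`, and re-enabling it is the same line with True. Only the Enabled= line is touched; every other setting in the file passes through unchanged.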


That's not a bad way to fail, as failures go: I had a correct idea about what was going on, my plan about how to solve my problem would have worked, but I was tripped up by a trivium; I was calling MainApplication.reparseConfiguration when I should have been calling ScreenSaver.configure.

What if I hadn't been able to get strace to disgorge the internals of the D-Bus messages? I think I would have gotten the answer anyway. One way to have gotten there would have been to notice the configure method documented in the method catalog printed out by qdbus. I certainly looked at these catalogs enough times, and they are not very large. I don't know why I never noticed it on my own. But I might also have had the idea of spying on the network traffic through the D-Bus socket, which is under /tmp somewhere.

I was also starting to tinker with dbus-send, which is like qdbus but more powerful, and can post signals, which I think qdbus can't do, and with gdbus, another D-Bus introspector. I would have kept getting more familiar with these tools and this would have led somewhere useful.

Or had I taken just a little longer to solve this, I would have followed up on Sumana Harihareswara’s suggestion to look at Bustle, which is a utility that logs and traces D-Bus requests. It would certainly have solved my problem, because it makes perfectly clear that clicking that apply button invoked the configure method.

I still wish I knew why strace hadn't been able to print out those strings though.

Wed, 27 Jul 2016

Lately I've started watching stuff on Netflix. Every time I do this, the screen locker kicks in sixty seconds in, and I have to unlock it, pause the video, and adjust the system settings to turn off the automatic screen locker. I can live with this.

But when the show is over, I often forget to re-enable the automatic screen locker, and that I can't live with. So I wanted to write a shell script:

    #!/bin/sh
    auto-screen-locker disable
    sleep 3600
    auto-screen-locker enable


Then I'll run the script in the background before I start watching, or at least after the first time I unlock the screen, and if I forget to re-enable the automatic locker, the script will do it for me.

The question is: how to write auto-screen-locker?

### strace

My first idea was: maybe there is actually an auto-screen-locker command, or a system-settings command, or something like that, which was being run by the System Settings app when I adjusted the screen locker from System Settings, and all I needed to do was to find out what that command was and to run it myself.

So I tried running System Settings under strace -f and then looking at the trace to see if it was execing anything suggestive.

It wasn't, and the trace was 93,000 lines long and frightening. Halfway through, it stopped recording filenames and started recording their string addresses instead, which meant I could see a lot of calls to execve but not what was being execed. I got sidetracked trying to understand why this had happened, and I never did figure it out. It had something to do with a call to clone, which is like fork, but different in a way I might understand once I read the man page.

The first thing the cloned process did was to call set_robust_list, which I had never heard of, and when I looked for its man page I found to my surprise that there was one. It begins:

    NAME
           get_robust_list, set_robust_list - get/set list of robust futexes


And then I felt like an ass because, of course, everyone knows all about the robust futex list, duh, how silly of me to have forgotten ha ha just kidding WTF is a futex? Are the robust kind better than regular wimpy futexes?

It turns out that Ingo Molnár wrote a lovely explanation of robust futexes, which are actually very interesting. In all seriousness, do check it out.

I seem to have digressed. This whole section can be summarized in one sentence:

strace was no help and took me a long way down a wacky rabbit hole.

Sorry, Julia!

### Stack Exchange

The next thing I tried was a Google search for kde screen locker. The second or third link I followed was to this StackExchange question, “What is the screen locking mechanism under KDE?” It wasn't exactly what I was looking for but it was suggestive and pointed me in the right direction. The crucial point in the answer was a mention of

    qdbus org.freedesktop.ScreenSaver /ScreenSaver Lock


When I saw this, it was like a new section of my brain coming on line. So many things that had been obscure suddenly became clear. Things I had wondered for years. Things like “What are these horrible

   Object::connect: No such signal org::freedesktop::UPower::DeviceAdded(QDBusObjectPath)


messages that KDE apps are always spewing into my terminal?” But now the light was on.

KDE is built atop a toolkit called Qt, and Qt provides an interprocess communication mechanism called “D-Bus”. The qdbus command, which I had not seen before, is apparently for sending queries and commands on the D-Bus. The arguments identify the recipient and the message you are sending. If you know the secret name of the correct demon, and you send it the correct secret command, it will do your bidding. (The mystery message above probably has something to do with the app using an invalid secret name as a D-Bus address.)

Often these sorts of address hierarchies work well in theory and then fail utterly because there is no way to learn the secret names. The X Window System has always had a feature called “resources” by which almost every aspect of every application can be individually customized. If you are running xweasel and want just the frame of just the error panel of just the output window to be teal blue, you can do that… if you can find out the secret names of the xweasel program, its output window, its error panel, and its frame. Then you combine these into a secret X resource name, incant a certain command to load the new resource setting into the X server, and the next time you run xweasel the one frame, and only the one frame, will be blue.

In theory these secret names are documented somewhere, maybe. In practice, they are not documented anywhere. You can only extract them from the source, and not only from the source of xweasel itself but from the source of the entire widget toolkit that xweasel is linked with. Good luck, sucker.

### D-Bus has a directory

However! The authors of Qt did not forget to include a directory mechanism in D-Bus. If you run

    qdbus


you get a list of all the addressable services, which you can grep for suggestive items, including org.freedesktop.ScreenSaver. Then if you run

    qdbus org.freedesktop.ScreenSaver


you get a list of all the objects provided by the org.freedesktop.ScreenSaver service; there are only seven. So you pick a likely-seeming one, say /ScreenSaver, and run

    qdbus org.freedesktop.ScreenSaver /ScreenSaver


and get a list of all the methods that can be called on this object, and their argument types and return value types. And you see for example

    method void org.freedesktop.ScreenSaver.Lock()


and say “I wonder if that will lock the screen when I invoke it?” And then you try it:

    qdbus org.freedesktop.ScreenSaver /ScreenSaver Lock


and it does.
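
Method catalogs like this are easy to skim past, so it can help to boil one down to bare method names before reading it. The following is a minimal sketch, assuming the listing has the "method TYPE NAME(ARGS)" shape shown above; the function name method_names is made up for illustration.

```shell
# Hypothetical helper: read a `qdbus SERVICE OBJECT` listing on stdin and
# print just the method names, without return types, interface prefixes,
# or argument lists.
method_names() {
    # Keep only "method ..." lines and cut them down to the (possibly dotted)
    # name before the argument list, then strip any interface prefix.
    sed -n 's/^method [^ ]* \(.*\)(.*$/\1/p' | sed 's/.*\.//'
}
```

A typical use would be `qdbus org.freedesktop.ScreenSaver /ScreenSaver | method_names | sort`, which leaves one short name per line to look over.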

That was the most important thing I learned today, that I can go wandering around in the qdbus hierarchy looking for treasure. I don't yet know exactly what I'll find, but I bet there's a lot of good stuff.

When I was first learning Unix I used to wander around in the filesystem looking at all the files, and I learned a lot that way also.

• “Hey, look at all the stuff in /etc! Huh, I wonder what's in /etc/passwd?”

• “Hey, /etc/protocols has a catalog of protocol numbers. I wonder what that's for?”

• “Hey, there are a bunch of files in /usr/spool/mail named after users and the one with my name has my mail in it!”

• “Hey, the manuals are all under /usr/man. I could grep them!”

Later I learned (by browsing in /usr/man/man7) that there was a hier(7) man page that listed points of interest, including some I had overlooked.

### The right secret names

Everything after this point was pure fun of the “what happens if I turn this knob” variety. I tinkered around with the /ScreenSaver methods a bit (there are twenty) but none of them seemed to be quite what I wanted. There is a

    method uint Inhibit(QString application_name, QString reason_for_inhibit)


method which someone should be calling, because that's evidently what you call if you are a program playing a video and you want to inhibit the screen locker. But the unknown someone was delinquent and it wasn't what I needed for this problem.

Then I moved on to the /MainApplication object and found

    method void org.kde.KApplication.reparseConfiguration()


which wasn't quite what I was looking for either, but it might do: I could perhaps modify the configuration and then invoke this method. I dimly remembered that KDE keeps configuration files under $HOME/.kde, so I ls -la-ed that and quickly found share/config/kscreensaverrc, which looked plausible from the outside, and more plausible when I saw what was in it:

    Enabled=True
    Timeout=60

among other things. I hand-edited the file to change the 60 to 243, ran

    qdbus org.freedesktop.ScreenSaver /MainApplication reparseConfiguration

and then opened up the System Settings app. Sure enough, the System Settings app now reported that the lock timeout setting was “4 minutes”. And changing Enabled=True to Enabled=False and back made the System Settings app report that the locker was enabled or disabled.

### The answer

So the script I wanted turned out to be:

    timeout=${1:-3600}
    perl -i -lpe 's/^Enabled=.*/Enabled=False/' $HOME/.kde/share/config/kscreensaverrc
    qdbus org.freedesktop.ScreenSaver /MainApplication reparseConfiguration
    sleep $timeout
    && ($statinfo[6] & 0xff) == 5) { say "Terminal" } else { say "Not a terminal" }

(This is Perl, written as if it were C.) It uses fstat (exposed in Perl as stat) to get the mode bits ($statinfo[2]) of the inode attached to STDOUT, and then it masks out the bits that determine if the inode is a character device file. If so, $statinfo[6] is the major and minor device numbers; if the major number (low byte) is equal to the magic number 5, the device is a terminal device. On my current computers the magic number is actually 136. Obviously this magic number is nonportable. You may hear people claim that those bit operations are also nonportable. I believe that claim is mistaken.

The analogous code using isatty is:

    use POSIX 'isatty';
    if (isatty(STDOUT)) { say "Terminal" }
    else { say "Not a terminal" }

Is isatty doing what I wrote above? Or something else?

Let's use strace to find out. Here's our test script:

    % perl -MPOSIX=isatty -le 'print STDERR isatty(STDOUT) ? "terminal" : "nonterminal"'
    terminal
    % perl -MPOSIX=isatty -le 'print STDERR isatty(STDOUT) ? "terminal" : "nonterminal"' > /dev/null
    nonterminal

Now we use strace:

    % strace -o /tmp/isatty perl -MPOSIX=isatty -le 'print STDERR isatty(STDOUT) ? "terminal" : "nonterminal"' > /dev/null
    nonterminal
    % less /tmp/isatty

We expect to see a long startup as Perl gets loaded and initialized, then whatever isatty is doing, the write of nonterminal, and then a short teardown, so we start searching at the end and quickly discover, a couple of screens up:

    ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7ffea6840a58) = -1 ENOTTY (Inappropriate ioctl for device)
    write(2, "nonterminal", 11)             = 11
    write(2, "\n", 1)                       = 1

My guess about fstat was totally wrong! The actual method is that isatty makes an ioctl call; this is a device-driver-specific command. The TCGETS parameter says what the command is, in this case “get the terminal configuration”.
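
The same question can be asked from the shell without any Perl, using the standard test -t operator. This is a minimal sketch; the function name describe_fd is made up, and nothing here depends on how a particular shell implements -t internally.

```shell
# Hypothetical helper: report whether file descriptor number $1 is
# attached to a terminal, using the POSIX test operator -t.
describe_fd() {
    if [ -t "$1" ]; then
        echo terminal
    else
        echo nonterminal
    fi
}
```

Running `describe_fd 0 < /dev/null` prints nonterminal, because standard input has been redirected away from the terminal. Running `describe_fd 0` at an interactive prompt should print terminal.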
If you do this on a non-device, or a non-terminal device, the call fails with the error ENOTTY. When the ioctl call fails, you know you don't have a terminal. If you do have a terminal, the TCGETS command has no effects, because it is a passive read of the terminal state. Here's the successful call:

    ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
    write(2, "terminal", 8)                 = 8
    write(2, "\n", 1)                       = 1

The B38400 opost… stuff is the terminal configuration; 38400 is the baud rate.

(In the past the explanatory text for ENOTTY was the mystifying “Not a typewriter”, even more mystifying because it tended to pop up when you didn't expect it. Apparently Linux has revised the message to the possibly less mystifying “Inappropriate ioctl for device”.)

(SNDCTL_TMR_TIMEBASE is mentioned because apparently someone decided to give their SNDCTL_TMR_TIMEBASE operation, whatever that is, the same numeric code as TCGETS, and strace isn't sure which one is being requested. It's possible that if we figured out which device was expecting SNDCTL_TMR_TIMEBASE, and redirected standard output to that device, that isatty would erroneously claim that it was a terminal.)

[ Addendum 20150415: Paul Bolle has found that the SNDCTL_TMR_TIMEBASE pertains to the old and possibly deprecated OSS (Open Sound System). It is conceivable that isatty would yield the wrong answer when pointed at the OSS /dev/dsp or /dev/audio device or similar. If anyone is running OSS and willing to give it a try, please contact me at mjd@plover.com. ]

[ Addendum 20191201: Thanks to Hacker News user jwilk for pointing out that strace is now able to distinguish TCGETS from SNDCTL_TMR_TIMEBASE. ]

Sun, 19 Apr 2015

The marvelous Julia Evans is always looking for ways to express her love of strace and now has written a zine about it.
I don't use strace that often (not as often as I should, perhaps) but every once in a while a problem comes up for which it's not only just the right thing to use but the only thing to use. This was one of those times.

I sometimes use the ancient Unix drawing language pic. Pic has many good features, but is unfortunately coupled too closely to the Roff family of formatters (troff, nroff, and the GNU project version, groff). It only produces Roff output, and not anything more generally useful like SVG or even a bitmap. I need raw images to inline into my HTML pages. In the past I have produced these with a jury-rigged pipeline of groff, to produce PostScript, and then GNU Ghostscript (gs) to translate the PostScript to a PPM bitmap, some PPM utilities to crop and scale the result, and finally ppmtogif or whatever. This has some drawbacks. For example, gs requires that I set a paper size, and its largest paper size is A0. This means that large drawings go off the edge of the “paper” and gs discards the out-of-bounds portions.

So yesterday I looked into eliminating gs. Specifically I wanted to see if I could get groff to produce the bitmap directly. GNU groff has a -Tdevice option that specifies the "output" device; some choices are -Tps for postscript output and -Tpdf for PDF output. So I thought perhaps there would be a -Tppm or something like that. A search of the manual did not suggest anything so useful, but did mention -TX100, which had something to do with 100-DPI X window system graphics. But when I tried this groff only said:

    groff: can't find `DESC' file
    groff:fatal error: invalid device X100

The groff -h command said only -Tdev use device dev. So what devices are actually available?

strace to the rescue!
I did:

    % strace -o /tmp/gr groff -Tfpuzhpx

and then a search for fpuzhpx in the output file tells me exactly where groff is searching for device definitions:

    % grep fpuzhpx /tmp/gr
    execve("/usr/bin/groff", ["groff", "-Tfpuzhpx"], [/* 80 vars */]) = 0
    open("/usr/share/groff/site-font/devfpuzhpx/DESC", O_RDONLY) = -1 ENOENT (No such file or directory)
    open("/usr/share/groff/1.22.2/font/devfpuzhpx/DESC", O_RDONLY) = -1 ENOENT (No such file or directory)
    open("/usr/lib/font/devfpuzhpx/DESC", O_RDONLY) = -1 ENOENT (No such file or directory)

I could then examine those three directories to see if they existed, and if so find out what was in them.

Without strace here, I would be reduced to groveling over the source, which in this case is likely to mean trawling through the autoconf output, and that is something that nobody wants to do.

[ Addendum 20150424: I did figure out how to prevent gs from cropping my output. You can use the flag -P-p48i,48i to groff to set the page size to 48 inches (48i) by 48 inches. The flag is passed to grops, and the resulting PostScript file contains

    %%DocumentMedia: Default 3456 3456 0 () ()

which instructs gs to pretend the paper size is that big. If it's not big enough, increase 48i to 120i or whatever. ]

Fri, 17 Feb 2012

It came from... the HOLD SPACE

Since 2002, I've given a talk almost every December for the Philadelphia Linux Users' Group. It seems like most of their talks are about the newest and best developments in Linux applications, which is a topic I don't know much about. So I've usually gone the other way, talking about the oldest and worst stuff. I gave a couple of pretty good talks about how files work, for example, and what's in the inode structure.

I recently posted about my work on Zach Holman's spark program, which culminated in a ridiculous workaround for the shell's lack of fractional arithmetic. That work inspired me to do a talk about all the awful crap we had to deal with before we had Perl.
(And the other 'P' languages that occupy a similar solution space.) Complete materials are here. I hope you check them out, because I think they are fun.

This post is a bunch of miscellaneous notes about the talk.

One example of awful crap we had to deal with before Perl etc. were invented was that some people used to write 'sed scripts', although I am really not sure how they did it. I tried once, without much success, and then for this talk I tried again, and again did not have much success.

"The hold space" is a sed-ism. The basic model of sed is that it reads the next line of data into the 'pattern space', then applies a bunch of transformations to it, and then prints it out. If you need to save this line for later examination, or for emitting later on instead, you can hold it in the 'hold space'. Use of the hold space is what distinguishes sed experts from mere sed nobodies like me.

So I planned to talk about the hold space, and then I got the happy idea to analogize the Hold Space to the Twilight Zone, or maybe the Phantom Zone, a place where you stick naughty data when you don't want it to escape. I never feel like audiences appreciate the work I put into this sort of thing; when I'm giving the talk it always sounds too much like a private joke. Explaining it just feels like everyone is sitting through my explanation of a private joke.

The little guy to the right is known as hallucigenia. It is a creature so peculiar that when the paleontologists first saw the fossils, they could not even agree on which side was uppermost. It has nothing to do with Unix, but I put it on the slide to illustrate "alien horrors from the dawn of time".

Between slides 9 and 10 (about the ed line editor) I did a quick demo of editing with ed. You will just have to imagine this. I first learned to program with a line editor like ed, on a teletypewriter just like the one on slide 8. Modern editors are much better.
But it used to be that Unix sysadmins were expected to know at least a little ed, because if your system got into some horrible state where it couldn't mount the /usr partition, you wouldn't be able to run /usr/bin/vi or /usr/local/bin/emacs, but you would still be able to use /bin/ed to fix /etc/fstab or whatever else was broken. Knowing ed saved my bacon several times.

(Speaking of teletypewriters, ours had an attachment for punching paper tape, which you can see on the left side of the picture. The punched chads fell into a plastic chad box (which is missing in the picture), and when I was about three I spilled the chad box. Chad was everywhere, and it was nearly impossible to pick up. There were still chads stuck in the cracks in the floorboards when we moved out three years later. That's why, when the contested election of 2000 came around, I was one of the few people in North America who was not bemused to learn that there was a name for the little punched-out bits.)

Anyway, back to ed. ed has one and only one diagnostic: if you do something it didn't like, it prints ?. This explains the ancient joke on slide 10, which first appeared circa 1982 in the 4.2BSD fortune program.

I really wanted to present a tour de force of sed mastery, but as slides 24–26 say, I was not clever enough. I tried really hard and just could not do it. If anyone wants to fix my not-quite-good-enough sed script, I will be quite grateful.

On slide 28 I called awk a monster. This was a slip-up; awk is not a monster and that is why it does not otherwise appear in this talk. There is nothing really wrong with awk, other than being a little old, a little tired, and a little underpowered.

If you are interested in the details of the classify program, described on slide 29, the sources are still available from the comp.sources.unix archive. People often say "Why don't you just use diff for that?"
so I may as well answer that here: You use diff if you have two files and you want to see how they differ. You use classify if you have 59 files, of which 36 are identical, 17 more are also identical to each other but different from the first 36, and the remaining 6 are all weirdos, and you want to know which is which.

These days you would probably just use md5sum FILES | accumulate, and in hindsight that's probably how I should have implemented classify. We didn't have md5sum but we had something like it, or I could have made a checksum program. The accumulate utility is trivial.

Several people have asked me to clarify my claim to have invented netcat. It seems that a similar program with the same name is attributed to someone called "Hobbit". Here is the clarification:

In 1991 I wrote a program with the functionality I described and called it "netcat". You would run netcat hostname port and it would open a network socket to the indicated address, and transfer data from standard input into the socket, and data from the socket to standard output. I still have the source code; the copyright notice at the top says "21 October 1991". Wikipedia says that the same-named program by the other guy was released on 20 March 1996.

I do not claim that the other guy stole it from me, got the idea from me, or ever heard of my version. I do not claim to be the first or only person to have invented this program. I only claim to have invented mine independently.

My own current version of the spark program is on GitHub, but I think Zach Holman's current version is probably simpler and better now.

[ Addendum 20170325: I have revised this talk a couple of times since this blog article was written. Links to particular slides go to the 2011 versions, but the current version is from 2017. There are only minor changes. For example, I removed awk from the list of “monsters”. ]

Wed, 11 Jan 2012

Where should usage messages go?
Last week John Speno complained about Unix commands which, when used incorrectly, print usage messages to standard error instead of to standard output. The problem here is that if the usage message is long, it might scroll off the screen, and it's a pain when you try to pipe it through a pager with command | pager and discover that the usage output has gone to stderr, missed the pager, and scrolled off the screen anyway.

Countervailing against this, though, is the usual argument for stderr: if you had run the command in a pipeline, and it wrote its error output to stdout instead of to stderr, then the error message would have gotten lost, and would possibly have caused havoc further down the pipeline. I considered this argument to be the controlling one, but I ran a quick and informal survey to see if I was in the minority.

After 15 people had answered the survey, Ron Echeverri pointed out that although it makes sense for the usage message to go to stderr when the command is used erroneously, it also makes sense for it to go to stdout if the message is specifically requested, say by the addition of a --help flag, since in that case the message is not erroneous. So I added a second question to the survey to ask about where the message should go in such a case.

83 people answered the first question, "When a command is misused, should it deliver its usage message to standard output or to standard error?". 62 (75%) agreed that the message should go to stderr; 11 (13%) said it should go to stdout. 10 indicated that they preferred a more complicated policy, of which 4 were essentially (or exactly) what M. Echeverri suggested; this brings the total in favor of stderr to 66 (80%). The others were:

1. stdout, if it is a tty; stderr otherwise
2. stdout, if it is a pipe; stderr otherwise
3. A very long response that suggested syslog.
4. stderr, unless an empty stdout would cause problems
5. It depends, but the survey omitted the option of printing directly on the console
6. It depends

I think #2 must have been trying to articulate #1, but (a) got it backwards and (b) missed. #3 seemed to be answering a different question than the one that was asked; syslog may make sense for general diagnostics, but to use it for usage messages seems peculiar. #5 also seems strange to me, since my idea of "console" is the line printer hardwired to the back of the mainframe down in the machine room; I think the writer might have meant "terminal".

68 people answered the second question, "Where should the command send the output when the user specifically requests usage information?". (15 people took the survey before I added this question.) 50 (74%) said the output should go to stdout, 12 (18%) to the user's default pager and then to stdout, and 5 (7%) to stderr. One person (the same as #5 above) said "it depends".

Thanks to everyone who participated.

Tue, 25 Mar 2008

The "z" command: output filtering

My last few articles ([1] [2] [p] [p-2]) have been about this z program. The first part of this article is a summary of that discussion, which you can skip if you remember it.

The idea of z is that you can do:

    z grep pattern files...

and it does approximately the same as:

    zgrep pattern files...

or you could do:

    z sed script files...

and it would do the same as:

    zsed script files...

if there were a zsed command, although there isn't.

Much of the discussion has concerned a problem with the implementation, which is that the names of the original compressed files are not available to the command, due to the legerdemain z must perform in order to make the uncompressed data available to the command.
The problem is especially apparent with wc:

    % z wc *
       411  2611 16988 ctime.blog
        71   358  2351 /proc/self/fd/3
       121   725  5053 /proc/self/fd/4
        51   380  2381 files-talk.blog
        48   145   885 find-uniq.pl
       288  2159 12829 /proc/self/fd/5
        95   665  4337 ssh-agent-revisted.blog
       221   941  6733 struct-inode.blog
       106   555  3976 sync-2.blog
       115   793  4904 sync.blog
       124   624  4208 /proc/self/fd/6
      1651  9956 64645 total

Here /proc/self/fd/3 and the rest should have been names ending in .gz, such as env-2.blog.gz.

### Another possible solution

At the time I wrote the first article, it occurred to me briefly that it would be possible to have z capture the output of the command and attempt to translate /proc/self/fd/3 back to env-2.blog.gz or whatever is appropriate, because although the subcommand does not know the original filenames, z itself does.

The code would look something like this. Instead of ending by execing the command, as the original version of z did:

    exec $command, @ARGV;
    die "Couldn't run '$command': $!.\n";

this revised version of z, which we might call zz, would end with the code to translate back to the original filenames:

    open my($out), "-|", $command, @ARGV
      or die "Couldn't run '$command': $!.\n";
    while (<$out>) {
      s{/proc/self/fd/(\d+)}{$old[$1]}g;
      print;
    }

Here @old is an array that translates from file descriptors back to the original filename.

At the time, I thought of doing this, and my immediate thought was "well, that is so obviously a terrible idea that it is not worth even mentioning", so I left it out. But since then at least five people have written to me to suggest it, so it appears that it is not obviously a terrible idea.

I had to think a little deeper about why I thought it was a terrible idea. Really the question is why I think this is a more terrible idea than the original z program was in the first place. Because one could say that z is garbling the output of its command, and the filtering code in zz is only un-garbling it.

But I think this isn't the right way to look at it. The output of the command has a certain format, a certain structure. We don't know ahead of time what that structure is, but it can be described for any particular command. For instance, the output of wc is always a sequence of lines where each line has four whitespace-separated fields, of which the first three are numerals and the last is a filename, and then a final total line at the end. Similarly, the output of tar is a file in a complicated binary format, one which is documented somewhere and which is intelligible to other instances of the tar command that are trying to decode it.

The original behavior of z may alter the content of the command output to some extent, replacing some filenames with others. But it cannot disrupt the structure or the format of the file, ever. This is because the output of z tar is the output of tar, unmodified. The z program tampers with the arguments it gives to tar, but having done that it runs tar and lets tar do what it wants, and tar then must produce a tar-format output, possibly not the one it would have normally produced (the content might be a little different) but a properly-formatted one for sure.
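
For line-oriented output such as that of wc, the translating step can be tried out on its own as a shell filter. This is a minimal sketch and not the zz code itself: the function name restore_names is made up, the mapping is passed as hypothetical N=NAME arguments rather than taken from z's @old array, and names containing spaces or backslashes are assumed not to occur. It is subject to the same objection as zz: it rewrites every match, whatever the format of the output.

```shell
# Hypothetical filter: copy stdin to stdout, replacing /proc/self/fd/N with
# the name given for N on the command line (arguments of the form N=NAME).
restore_names() {
    awk -v map="$*" '
    BEGIN {
        # Build the table from descriptor number to original name.
        n = split(map, pairs, " ")
        for (i = 1; i <= n; i++) {
            eq = index(pairs[i], "=")
            name[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
        }
    }
    {
        out = ""
        rest = $0
        # 14 is the length of the fixed prefix "/proc/self/fd/".
        while (match(rest, /\/proc\/self\/fd\/[0-9]+/)) {
            fd = substr(rest, RSTART + 14, RLENGTH - 14)
            hit = substr(rest, RSTART, RLENGTH)
            # Descriptors with no known name are passed through unchanged.
            out = out substr(rest, 1, RSTART - 1) ((fd in name) ? name[fd] : hit)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}
```

For example, `z wc * | restore_names 3=env-2.blog.gz` would turn the /proc/self/fd/3 line of the wc listing back into an env-2.blog.gz line and leave the other lines alone.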
In particular, any program written to deal properly with the output of tar will still work with the output of z tar. The output might not have the same meaning, but we can say very particularly what the extent of the differences might be: if the output mentions filenames, then some of these might have changed from the true filenames to filenames of the form /proc/self/fd/37. With zz, we cannot make any such guarantee. The output of zz tar zc foo.gz, for example, might be in proper .tar.gz format. But suppose the output of tar zc foo.gz creates compressed binary output that just happens to contain the byte sequence 2f70 726f 632f 7365 6c66 2f66 642f 33? (That is, "/proc/self/fd/3".) Then zz will silently replace these 15 bytes with the six bytes 666f 6f2e 677a. What if the original sequence was understood as part of a sequence of 2-byte integers? The result is not even properly aligned. What if that initial 2f was a count? The resulting count (66) is much too long. The result would be utterly garbled and unintelligible to tar zx. What the tar command will do with a garbled input is not well-defined: it might dump core, or it might write out random garbage data, or overwrite essential files in the filesystem. We are into nasal demon territory. With the original z, we never get anywhere near the nasal demons. I suppose the short summary here is that z treats its command as a black box, while zz pretends to understand what comes out of it. But zz's understanding is a false pretense. My experience says that programs should not screw around with things they don't understand, and this is why I instantly rejected the idea when I thought of it before. One correspondent argued that the garbling is very unlikely, and proposed various techniques to make it even less likely, mostly by rewriting the input filenames to various long random strings. But I felt then that this was missing the point, and I still do. 
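To make the garbling concrete, here is a small Python sketch. The record format (a 2-byte length followed by a payload) is invented for the illustration and is far simpler than anything tar or gzip produces, and zz_filter is a hypothetical stand-in for the filtering loop shown earlier. The point is only that a blind substitution changes the byte count, so any length recorded elsewhere in the output no longer matches.

```python
import struct

FD_NAME = b"/proc/self/fd/3"   # 15 bytes
REAL_NAME = b"foo.gz"          # 6 bytes


def pack_record(payload):
    """Encode one record: a 2-byte big-endian length, then the payload."""
    return struct.pack(">H", len(payload)) + payload


def unpack_records(data):
    """Decode concatenated records; raise ValueError if the framing is broken."""
    records = []
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated length field")
        (length,) = struct.unpack_from(">H", data, pos)
        pos += 2
        if pos + length > len(data):
            raise ValueError("record runs past the end of the data")
        records.append(data[pos:pos + length])
        pos += length
    return records


def zz_filter(data):
    """Blindly translate the descriptor path back to a filename, as zz would."""
    return data.replace(FD_NAME, REAL_NAME)


# The middle record just happens to contain the magic byte sequence.
original = (pack_record(b"header")
            + pack_record(b"xx" + FD_NAME + b"yy")
            + pack_record(b"end"))
garbled = zz_filter(original)

print(unpack_records(original))      # decodes into three records
try:
    unpack_records(garbled)
except ValueError as err:
    print("garbled output no longer parses:", err)
```

The filtered data is nine bytes shorter than the original, but the length field of the middle record still claims the old size, so the decoder runs off the end. A less fortunate layout could decode without any error and silently yield the wrong records.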
The correspondent says it is unlikely, but he doesn't know that it is unlikely, and indeed the unlikeliness depends on the format of the output of the command, which is precisely the unknown here. In my view, the difference between z and zz is that the changes that z makes are bounded, because you can describe them briefly, as I did above, and the changes that zz makes are unbounded, because there is no limit to what could happen as a result.

On the other hand, this correspondent made a good point that if the output of zz is not consumed by anything other than human eyeballs, there may be no real problem. And for some particular commands, such as wc, there is never any problem at all. So perhaps it's a good idea to add a command-line option to z to enable the zz behavior. I did this in my version, and I'm going to try it out and see how it goes.

Sat, 22 Mar 2008

The "z" command: alternative implementations

In yesterday's article I discussed a possibly-useful utility program named z, which has a flaw. To jog your memory, here is a demonstration:

    % z grep immediately *
    ctime.blog:we want to update. It is immediately copied into a register, and
    /proc/self/fd/3:All five people who wrote to me about this immediately said "oh, yes,
    /proc/self/fd/5:program continues immediately, possibly posting its message. (It
    struct-inode.blog:is a symbolic link, its inode is returned immediately; iname() would
    sync.blog:and reports success back to the process immediately, even though the

For a detailed discussion, see the previous article.

Fixing this flaw seems difficult-to-impossible. As I said earlier, the trick is to fool the command into reading from a pipe when it thinks it is opening a file, and this is precisely what /proc/self/fd is for. But there is an older, even more widely-implemented Unix feature that does the same thing, namely the FIFO.
So an alternative implementation creates one FIFO for each compressed file, with a gzip process writing to the FIFO, and tells the command to read from the FIFO. Since we have some limited control over the name of the FIFO, we can ameliorate the missing-filename problem to some extent. Say, for example, we create the FIFOs in /tmp/PID. Then the broken zgrep example above might look like this instead:

    % z grep immediately *
    ctime.blog:we want to update. It is immediately copied into a register, and
    /tmp/7516/env-2.blog.gz:All five people who wrote to me about this immediately said "oh, yes,
    /tmp/7516/qmail-throttle.blog.gz:program continues immediately, possibly posting its message. (It
    struct-inode.blog:is a symbolic link, its inode is returned immediately; iname() would
    sync.blog:and reports success back to the process immediately, even though the

The output is an improvement, but the problem is not completely solved, and the cost is that the process and file management are much more complicated. In fact, the cost is so high that you have to wonder if it might not be simpler to replace z with a shell script that copies the data to a temporary directory, uncompresses the files, and runs the command on the uncompressed files, perhaps something along these lines:

    #!/bin/sh

    DIR=/tmp/$$
    mkdir $DIR
    COMMAND=$1
    shift
    cp -p "$@" $DIR
    cd $DIR
    gzip -d *
    $COMMAND *

This has problems too, but my point is that if you are willing to accept a crappy, semi-working solution along the lines of the FIFO one, simpler ones are at hand. You can compare the FIFO version directly with the shell script, and I think the FIFO version loses.

The z implementation I have is a solution in a different direction, with different tradeoffs, and so might be preferable to it in a number of ways. But as I said, I don't know yet.

[ Addendum 20080325: Several people suggested a fix that I had considered so unwise that I didn't even mention it.
But after receiving the suggestion repeatedly, I wrote an article about it. ]

Fri, 21 Mar 2008

z-commands

The gzip distribution includes a command called zcat. Its command-line arguments can include any number of filenames, compressed or not, and it prints out the contents, uncompressing them on the fly if necessary. Sometime later a zgrep command appeared, which was similar but which also performed a grep search. But for anything else, you either need to uncompress the files, or build a special tool.

I have a utility that scans the web logs of blog.plover.com, and extracts a report about new referrers. The historical web logs are normally kept compressed, so I recently built in support for decompression. This is quite easy in Perl. Normally one scans a sequence of input files something like this:

    while (<>) {
      ... do something with $_ ...
    }

The <> operator implicitly scans all the lines in all the files named in the command-line arguments, opening a new file each time the previous one is exhausted. To decompress the files on the fly, one can preprocess the command-line arguments:

    for (@ARGV) {
      if (/\.gz$/) {
        $_ = "gzip -dc $_ |";
      }
    }
    while (<>) {
      ... do something with $_ ...
    }

The for loop scans the command-line arguments, replacing each one that has the form foo.gz with gzip -dc foo.gz |. Perl's magic open semantics treat filenames specially if they end with a pipe symbol: a pipe to a command is opened instead. Of course, anyone can think of half a dozen ways in which this can go wrong. But Larry Wall's skill in making such tradeoffs has been a large factor in Perl's success.

But it bothered me to have to make this kind of change in every program that wanted to handle compressed files. We have zcat and zgrep; where are zcut, zpr, zrev, zwc, zcol, zbc, zsed, zawk, and so on? Echh.

But after I got to thinking about it, I decided that I could write a single z utility that would do a lot of the same things. Instead of this:

    zsed -e 's/:.*//' * | ...
where the * matches some files that have .gz suffixes and some that haven't, one would write:

    z sed -e 's/:.*//' * | ...

and it would Just Work. That's the idea, anyway.

If sed were written in Perl, z would have an easy job. It could rely on Perl's magic open, and simply preprocess the arguments before running sed:

    # hypothetical implementation of z
    #
    my $command = shift;
    for (@ARGV) {
      if (/\.gz$/) {
        $_ = "gzip -dc $_ |";
      }
    }
    exec $command, @ARGV;
    die "Couldn't run command '$command': $!\n";

But sed is not written in Perl, and has no magic open. So I have to play a trickier trick:

    for my $file (@ARGV) {
      if ($file =~ /\.gz$/) {
        unless (open($fhs[@fhs], "-|", "gzip", "-cd", $file)) {
          warn "Couldn't open file '$file': $!; skipping\n";
          next;
        }
        my $fd = fileno $fhs[-1];
        # $file is an alias for the @ARGV element, so this rewrites the argument
        $file = "/proc/self/fd/$fd";
      }
    }
    # warn "running $command @ARGV\n";
    exec $command, @ARGV;
    die "Couldn't run command '$command': $!\n";

This is a stripped-down version to illustrate the idea. For various reasons that I explained yesterday, it does not actually work. The complete, working source code is here.

The idea, as before, is that the program preprocesses the command-line arguments. But instead of replacing the arguments with pipe commands, which are not supported by open(2), the program sets up the pipes itself, and then directs the command to take its input from the pipes by specifying the appropriate items from /proc/self/fd.

The trick depends crucially on having /proc/self/fd, or /dev/fd, or something of the sort, because otherwise there's no way to trick the command into reading from a pipe when it thinks it is opening a file. (Actually there is at least one other way, involving FIFOs, which I plan to discuss tomorrow.)

Most modern systems do have /proc/self/fd. That feature postdates my earliest involvement with Unix, so it isn't a ready part of my mental apparatus as perhaps it ought to be.
But this utility seems to me like a sort of canonical application of /proc/self/fd, in the sense that, if you couldn't think what /proc/self/fd might be good for, then you could read this example and afterwards have a pretty clear idea.

The z utility has a number of flaws. Principally, the original filenames are gone. Here's a typical run with regular zgrep:

    % zgrep immediately *
    ctime.blog:we want to update. It is immediately copied into a register, and
    env-2.blog.gz:All five people who wrote to me about this immediately said "oh, yes,
    qmail-throttle.blog.gz:program continues immediately, possibly posting its message. (It
    struct-inode.blog:is a symbolic link, its inode is returned immediately; iname() would
    sync.blog:and reports success back to the process immediately, even though the

But here's the same thing with z:

    % z grep immediately *
    ctime.blog:we want to update. It is immediately copied into a register, and
    /proc/self/fd/3:All five people who wrote to me about this immediately said "oh, yes,
    /proc/self/fd/5:program continues immediately, possibly posting its message. (It
    struct-inode.blog:is a symbolic link, its inode is returned immediately; iname() would
    sync.blog:and reports success back to the process immediately, even though the

The problem is even more glaring in the case of commands like wc:

    % z wc *
      411  2611  16988 ctime.blog
       71   358   2351 /proc/self/fd/3
      121   725   5053 /proc/self/fd/4
       51   380   2381 files-talk.blog
       48   145    885 find-uniq.pl
      288  2159  12829 /proc/self/fd/5
       95   665   4337 ssh-agent-revisted.blog
      221   941   6733 struct-inode.blog
      106   555   3976 sync-2.blog
      115   793   4904 sync.blog
      124   624   4208 /proc/self/fd/6
     1651  9956  64645 total

So perhaps z will not turn out to be useful enough to be more than a curiosity. But I'm not sure yet.

This is article #300 on my blog. Thanks for reading.

[ Addendum 20080322: There is a followup to this article. ]

[ Addendum 20080325: Another followup.
] Thu, 06 Mar 2008 Throttling qmail This may well turn out to be another oops. Sometimes when I screw around with the mail system, it's a big win, and sometimes it's a big lose. I don't know yet how this will turn out. Since I moved house, I have all sorts of internet-related problems that I didn't have before. I used to do business with a small ISP, and I ran my own web server, my own mail service, and so on. When something was wrong, or I needed them to do something, I called or emailed and they did it. Everything was fine. Since moving, my ISP is Verizon. I have great respect for Verizon as a provider of telephone services. They have been doing it for over a hundred years, and they are good at it. Maybe in a hundred years they will be good at providing computer network services too. Maybe it will take less than a hundred years. But I'm not as young as I once was, and whenever that glorious day comes, I don't suppose I'll be around to see it. One of the unexpected problems that arose when I switched ISPs was that Verizon helpfully blocks incoming access to port 80. I had moved my blog to outside hosting anyway, because the blog was consuming too much bandwidth, so I moved the other plover.com web services to the same place. There are still some things that don't work, but I'm dealing with them as I have time. Another problem was that a lot of sites now rejected my SMTP connections. My address was in a different netblock. A Verizon DSL netblock. Remote SMTP servers assume that anybody who is dumb enough to sign up with Verizon is also too dumb to run their own MTA. So any mail coming from a DSL connection in Verizonland must be spam, probably generated by some Trojan software on some infected Windows box. The solution here (short of getting rid of Verizon) is to relay the mail through Verizon's SMTP relay service. mail.plover.com sends to outgoing.verizon.net, and lets outgoing.verizon.net forward the mail to its final destination. Fine. But but but. 
If my machine sends more than X messages per Y time, outgoing.verizon.net will assume that mail.plover.com has been taken over by a Trojan spam generator, and cut off access. All outgoing mail will be rejected with a permanent failure.

So what happens if someone sends a message to one of the 500-subscriber email lists that I host here? mail.plover.com generates 500 outgoing messages, sends the first hundred or so through Verizon. Then Verizon cuts off my mail service. The mailing list detects 400 bounce messages, and unsubscribes 400 subscribers. If any mail comes in for another mailing list before Verizon lifts my ban, every outgoing message will bounce and every subscriber will be unsubscribed.

One solution is to get a better mail provider. Lorrie has an Earthlink account that comes with outbound mail relay service. But they do the same thing for the same reason. My Dreamhost subscription comes with an outbound mail relay service. But they do the same thing for the same reason. My Pobox.com account comes with an unlimited outbound mail relay service. But they require SASL authentication. If there's a SASL patch for qmail, I haven't been able to find it. I could implement it myself, I suppose, but I don't wanna.

So far there are at least five solutions that are on the "eh, maybe, if I have to" list:

• Get a non-suck ISP
• Find a better mail relay service
• Hack SASL into qmail and send mail through Pobox.com
• Do some skanky thing with serialmail
• Get rid of qmail in favor of postfix, which presumably supports SASL

(Yeah, I know the Postfix weenies in the audience are shaking their heads sadly and wondering when the scales will fall from my eyes. They show up at my door every Sunday morning in their starched white shirts and their pictures of DJB with horns and a pointy tail...)

It also occurred to me in the shower this morning that the old ISP might be willing to sell me mail relaying and nothing else, for a small fee. That might be worth pursuing.
It's gotta be easier than turning qmail-remote into a SASL mail client. The serialmail thing is worth a couple of sentences, because there's an autoresponder on the qmail-users mailing-list that replies with "Use serialmail. This is discussed in the archives." whenever someone says the word "throttle". The serialmail suite, also written by Daniel J. Bernstein, takes a maildir-format directory and posts every message in it to some remote server, one message at a time. Say you want to run qmail on your laptop. Then you arrange to have qmail deliver all its mail into a maildir, and then when your laptop is connected to the network, you run serialmail, and it delivers the mail from the maildir to your mail relay host. serialmail is good for some throttling problems. You can run serialmail under control of a daemon that will cut off its network connection after it has written a certain amount of data, for example. But there seems to be no easy way to do what I want with serialmail, because it always wants to deliver all the messages from the maildir, and I want it to deliver one message. There have been some people on the qmail-users mailing-list asking for something close to what I want, and sometimes the answer was "qmail was designed to deliver mail as quickly and efficiently as possible, so it won't do what you want." This is a variation of "Our software doesn't do what you want, so I'll tell you that you shouldn't want to do it." That's another rant for another day. Anyway, I shouldn't badmouth qmail-users mailing-list, because the archives did get me what I wanted. It's only a stopgap solution, and it might turn out to be a big mistake, but so far it seems okay, and so at last I am coming to the point of this article. I hacked qmail to support outbound message rate throttling. Following a suggestion of Richard Lyons from the qmail-users mailing-list, it was much easier to do than I had initially thought. Here's how it works. 
Whenever qmail wants to try to deliver a message to a remote address, it runs a program called qmail-remote. qmail-remote is responsible for looking up the MX records for the host, contacting the right server, conducting the SMTP conversation, and returning a status code back to the main component. Rather than hacking directly on qmail-remote, I've replaced it with a wrapper. The real qmail-remote is now in qmail-remote-real. The qmail-remote program is now written in Perl. It maintains a log file recording the times at which the last few messages were sent. When it runs, it reads the log file, and a policy file that says how quickly it is allowed to send messages. If it is okay to send another message, the Perl program appends the current time to the log file and invokes the real qmail-remote. Otherwise, it sleeps for a while and checks again. The program is not strictly correct. It has some race conditions. Suppose the policy limits qmail to sending 8 messages per minute. Suppose 7 messages have been sent in the last minute. Then six instances of qmail-remote might all run at once, decide that it is OK to send a message, and send one. Then 13 messages have been sent in the last minute, which exceeds the policy limit. So far this has not been much of a problem. It's happened twice in the last few hours that the system sent 9 messages in a minute instead of 8. If it worries me too much, I can tell qmail to run only one qmail-remote at a time, instead of 10. On a normal qmail system, qmail speeds up outbound delivery by running multiple qmail-remote processes concurrently. On my crippled system, speeding up outbound delivery is just what I'm trying to avoid. Running at most one qmail-remote at a time will cure all race conditions. If I were doing the project over, I think I'd take out all the file locking and such, and just run one qmail-remote. But I didn't think of it in time, and for now I think I'll live with the race conditions and see what happens. 
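The throttling rule and the race described above can be sketched in a few lines. This is a Python illustration, not the Perl wrapper itself: the names prune, try_send, and wait_time are made up for the example, timestamps are passed in explicitly instead of being read from a log file, and the file locking and the call to the real qmail-remote are left out.

```python
def prune(history, now, window):
    """Keep only the send times that fall within the last `window` seconds."""
    return [t for t in history if t >= now - window]


def try_send(history, now, msgs, window):
    """Decide whether one more message may go out at time `now`.

    Returns (allowed, new_history); the send is recorded only if allowed.
    """
    recent = prune(history, now, window)
    if len(recent) < msgs:
        return True, recent + [now]
    return False, recent


def wait_time(history, now, window):
    """Seconds until the oldest recorded send expires (0 if history is empty)."""
    recent = prune(history, now, window)
    if not recent:
        return 0
    return max(1, recent[0] + window - now)


# Policy: at most 8 messages per 60 seconds, one attempt per second.
history = []
outcomes = []
for t in range(1000, 1010):
    allowed, history = try_send(history, t, msgs=8, window=60)
    outcomes.append(allowed)
print(outcomes)

# The race: six wrappers all read the same 7-entry history before any of
# them writes its own entry back, so all six decide it is OK to send.
stale = list(range(1000, 1007))
decisions = [try_send(stale, 1010, msgs=8, window=60)[0] for _ in range(6)]
print(decisions, "->", len(stale) + sum(decisions), "messages in the window")
```

With a single caller, the first eight attempts are allowed and the next two are refused. In the second half every one of the six callers sees only seven entries, which is how 13 messages can end up inside a window that is supposed to hold 8.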
So let's see? What else is interesting about this program? I made at least one error, and almost made at least one more.

The almost-error was this: The original design for the program was something like:

1. do
   • lock the history file, read it, and unlock it
   until it's time to send a message
2. lock the history file, update it, and unlock it
3. send the message

This is a classic mistake in writing programs that run concurrently and update a file. The problem is that process A might update the file after process B reads it but before B updates it. Then B's update will destroy A's.

One way to fix this is to have the processes append to the history file, but never remove anything from it. That is clearly not a sustainable strategy. Someone must remove expired entries from the history file.

Another fix is to have the read and the update in the same critical section:

1. lock the history file
2. do
   • read the history file
   until it's time to send a message
3. update the history file and unlock it
4. send the message

But that loop could take a long time, during which no other qmail-remote process can make progress. I had decided that I wanted to try to retain the concurrency, and so I wasn't willing to accept this.

Cleaning the history file could be done by a separate process that periodically locks the file and rewrites it. But instead, I have the qmail-remote processes do it on the fly:

1. do
   • lock the history file, read it, and unlock it
   until it's time to send a message
2. lock the history file, read it, update it, and unlock it
3. send the message

I'm happy that I didn't actually make this mistake. I only thought about it.

Here's a mistake that I did make.
This is the block of code that sleeps until it's time to send the message:

    while (@last >= $msgs) {
      my $oldest = $last[0];
      my $age = time() - $oldest;
      my $zzz = $time - $age + int(rand(3));
      $zzz = 1 if $zzz < 1;
    #  Log("Sleeping for $zzz secs");
      sleep $zzz;
      shift @last while $last[0] < time() - $time;
      load_policy();
    }

The throttling policy is expressed by two numbers, $msgs and $time, and the program tries to send no more than $msgs messages per $time seconds. The @last array contains a list of Unix epoch timestamps of the times at which the messages of the last $time seconds were sent. So the loop condition checks whether $msgs or more messages were sent in the last $time seconds. If not, the program continues immediately, possibly posting its message. (It rereads the history file first, in case some other messages have been posted while it was asleep.)

Otherwise the program will sleep for a while. The first three lines in the loop calculate how long to sleep for. It sleeps until the time the oldest message in the history will fall off the queue, possibly plus a second or two. Then the crucial line:

    shift @last while $last[0] < time() - $time;

which discards the expired items from the history. Finally, the call to load_policy() checks to see if the policy has changed, and the loop repeats if necessary.

The bug is in this crucial line. If @last becomes empty, this line turns into an infinite busy-loop. It should have been:

    shift @last while @last && $last[0] < time() - $time;

Whoops. I noticed this this morning when my system's load was around 12, and eight or nine qmail-remote processes were collectively eating 100% of the CPU. I would have noticed sooner, but outbound deliveries hadn't come to a complete halt yet.

Incidentally, there's another potential problem here arising from the concurrency. A process will complete the sleep loop in at most $time+3 seconds. But then it will go back and reread the history file, and it may have to repeat the loop.
This could go on indefinitely if the system is busy. I can't think of a good way to fix this without getting rid of the concurrent qmail-remote processes.

Here's the code. I hereby place it in the public domain. It was written between 1 AM and 3 AM last night, so don't expect too much.

Sat, 08 Dec 2007

Corrections about sync(2)

I made some errors in today's post about sync and fsync.

Most important, I said that "the sync() system call marks all the kernel buffers as dirty". This is totally wrong, and doesn't even make sense. Dirty buffers are those with data that needs to be written out. Marking a non-dirty buffer as dirty is a waste of time, since nothing has changed in the buffer, but it will now be rewritten anyway. What sync() does is schedule all the dirty buffers to be written as soon as possible.

On some recent systems, sync() actually waits for all the dirty buffers to be written, and a bunch of people tried to correct me about this. But my original article was right: historically, it was not so, and even today it's not universally true. In former times, sync() would schedule the buffers for writing, and then return before the data was actually written.

I said that one of the duties of init was to call sync() every thirty seconds, but this was mistaken. That duty actually fell to a separate program, known as update. While discussing this with one of the readers who wrote to correct me, I looked up the source for Version 7 Unix, to make sure I was right, and it's so short I thought I might as well show it here:

    /*
     *      Update the file system every 30 seconds.
     *      For cache benefit, open certain system directories.
     */

    #include <signal.h>

    char *fillst[] = {
            "/bin",
            "/usr",
            "/usr/bin",
            0,
    };

    main()
    {
            char **f;

            if(fork())
                    exit(0);
            close(0);
            close(1);
            close(2);
            for(f = fillst; *f; f++)
                    open(*f, 0);
            dosync();
            for(;;)
                    pause();
    }

    dosync()
    {
            sync();
            signal(SIGALRM, dosync);
            alarm(30);
    }

The program is so simple I don't have much more to say about it.
It initially invokes dosync(), which calls sync() and then schedules another call to dosync() in 30 seconds. Note that the 0 in the second argument to open had not yet been changed to O_RDONLY. The pause() call suspends the process until a signal arrives, so the program sleeps between alarms instead of spinning in the for loop.

In various systems more recent than V7, the program was known by various names, but it was update for a very long time.

Several people wrote to correct me about the:

    # sync
    # sync
    # sync
    # halt

thing, some saying that I had the reason wrong, or that it did not make sense, or that only two syncs were used, rather than three. But I had it right. People did use three, and they did it for the reason I said, whether that makes sense or not. (Some of the people who miscorrected me were unaware that sync() would finish and exit before the data was actually written.) For example, see this old Usenet thread for a discussion of the topic that confirms what I said.

Nobody disputed my contention that Linus was suffering from the promptings of the Evil One when he tried to change the semantics of fsync(), and nobody seems to know the proper name of the false god of false efficiency. I'll give this some thought and see what I can come up with.

Thanks to Tony Finch, Dmitry Kim, and Stefan O'Rear for discussion of these points.

Dirty, dirty buffers!

One side issue that arose during my talk on Monday about inodes was the write-buffering normally done by Unix kernels. I wrote a pretty long note to the PLUG mailing list about it, and I thought I'd repost it here.
When your process asks the kernel to write data:

    int bytes_written = write(file_descriptor, buffer, n_bytes);

the kernel normally copies the data from your buffer into a kernel buffer, and then, instead of writing out the data to disk, it marks its buffer as "dirty" (that is, as needing to be written eventually), and reports success back to the process immediately, even though the dirty buffer has not yet been written, and the data is not yet on the disk.

Normally, the kernel writes out the dirty buffer in due time, and the data makes it to the disk, and you are happy because your process got to go ahead and do some more work without having to wait for the disk, which could take milliseconds. ("A long time", as I so quaintly called it in the talk.) If some other process reads the data before it is written, that is okay, because the kernel can give it the updated data out of the buffer.

But if there is a catastrophe, say a power failure, then you see the bad side of this asynchronous writing technique, because the data, which your process thought had been written, and which the kernel reported as having been written, has actually been lost.

There are a number of mechanisms in place to deal with this. The oldest is the sync() system call, which marks all the kernel buffers as dirty. All Unix systems run a program called init, and one of init's principal duties is to call sync() every thirty seconds or so, to make sure that the kernel buffers get flushed to disk at least every thirty seconds, and so that no crash will lose more than about thirty seconds' worth of data.

(There is also a command-line program sync which just does a sync() call and then exits, and old-time Unix sysadmins are in the habit of halting the system with:

    # sync
    # sync
    # sync
    # halt

because the second and third syncs give the kernel time to actually write out the buffers that were marked dirty by the first sync. Although I suspect that few of them know why they do this.
I swear I am not making this up.)

But for really crucial data, sync() is not enough, because, although it marks the kernel buffers as dirty, it still does not actually write the data to the disk.

So there is also an fsync() call; I forget when this was introduced. The process gives fsync() a file descriptor, and the call demands that the kernel actually write the associated dirty buffers to disk, and does not return until they have been. And since, unlike write(), it actually waits for the data to go to the disk, a successful return from fsync() indicates that the data is truly safe. The mail delivery agent will use this when it is writing your email to your mailbox, to make sure that no mail is lost.

Some systems have an O_SYNC flag that the process can supply when it opens the file for writing:

    int fd = open("blookus", O_WRONLY | O_SYNC);

This sets the O_SYNC flag in the kernel file pointer structure, which means that whenever data is written to this file pointer, the kernel, contrary to its usual practice, will implicitly fsync() the descriptor.

Well, that's not what I wanted to write about here. What I meant to discuss was... No, wait. That is what I wanted to write about. How about that?

Anyway, there's an interesting question that arises in connection with fsync(): suppose you fsync() a file. That guarantees that the data will be written. But does it also guarantee that the mtime and the file extent of the file will be updated? That is, does it guarantee that the file's inode will be written?

On most systems, yes. But on some versions of Linux's ext2 filesystem, no. Linus himself broke this as a sacrifice to the false god of efficiency, a very bad decision in my opinion, for reasons that should be obvious to everyone but those in the thrall of Mammon. (Mammon's not right here. What is the proper name of the false god of efficiency?)

Sanity eventually prevailed.
Recent versions of Linux have an fsync() call, which updates both the data and the inode, and a fdatasync() call, which only guarantees to update the data. [ Addendum 20071208: Some of this is wrong. I posted corrections. ] Thu, 06 Dec 2007 What's a File? Almost every December since 2001 I have given a talk to the local Linux users' group on some aspect of Unix internals. My first talk was on the internals of the ext2 filesystem. This year I was under a lot of deadline pressure at work, so I decided I would give the 2001 talk again, maybe with a few revisions. Actually I was under so much deadline pressure that I did not have time to revise the talk. I arrived at the user group meeting without a certain idea of what talk I was going to give. Fortunately, the meeting structure is to have a Q&A and discussion period before the invited speaker gives his talk. The Q&A period always lasts about an hour. In that hour before I had to speak, I wrote a new talk called What's a File?. It mostly concerns the Unix "inode" structure, and what the kernel uses it for. It uses the output of the well-known ls -l command as a jumping-off point, since most of the ls -l information comes from the inode. Then I talk about how files are opened and permissions are checked, how the filesystem is organized, how the kernel reads and writes data, how directories are structured, how it's possible to have one file with two names, how symbolic links work, and what that mysterious field is in the ls -l output between the permissions and the owner. The talk was quite successful, much more so than I would have expected, given how quickly I wrote it and my complete inability to edit or revise it. Of course, it does help that I know this material backwards and forwards and standing on my head, and also that I could reuse all the diagrams and illustrations from the 2001 version of the talk. I would not, however, recommend this technique. 
As my talks have gotten better over the years, I find that less and less of the talk material is captured in the slides, and so the slides become less and less representative of the talk itself. But I put them online anyway, and here they are.

Sat, 27 Jan 2007

Software archaeology

For appropriate values of "everyone", everyone knows that Unix files do not record any sort of "creation time". A fairly frequently asked question in Unix programming forums, and other related forums, such as Perl programming forums, is how to get the creation date of a file; the answer is that you cannot do that because it is not there.

This lack is exacerbated by several unfortunate facts: creation times are available on Windows systems; the Unix inode contains three timestamps, one of which is called the "ctime", and the "c" is suggestive of the wrong thing; Perl's built-in stat function overloads the return value to return the Windows creation time in the same position (on Windows) as it returns the ctime (on Unix).

So we see questions like this one, which appeared this week on the Philadelphia Linux Users' Group mailing list:

> How does one check and change ctime?

And when questioned as to why he or she wanted to do this, this person replied:

> We are looking to change the creation time. From what I understand, ctime is the closest thing to creation time.

There is something about this reply that irritates me, but I'm not quite sure what it is. Several responses come to mind: "Close" is not sufficient in system programming; the ctime is not "close" to a creation time, in any sense; before you go trying to change the thing, you ought to do a minimal amount of research to find out what it is.

It is a perfect example of the Wrong Question, on the same order as that poor slob all those years ago who wanted to know how to tell if a file was a hard link or a soft link.
But anyway, that got me thinking about ctimes in general, and I did some research into the history and semantics of the thing, and made some rather surprising discoveries.

One good reference for the broad outlines of early Unix is the paper that Dennis Ritchie and Ken Thompson published in Communications of the ACM in 1974. This was updated in 1978, but the part I'm quoting wasn't revised and is current to 1974. Here is what it has to say about the relevant parts of the inode structure:

> ### IV. IMPLEMENTATION OF THE FILE SYSTEM
>
> ... The entry found thereby (the file's i-node) contains the description of the file: ... time of creation, last use, and last modification

An error? I don't think so. Here is corroborating evidence, the stat man page from the first edition of Unix, from 1971:

        NAME            stat -- get file status

        SYNOPSIS        sys stat; name; buf / stat = 18.

        DESCRIPTION     name points to a null-terminated string naming a file;
                        buf is the address of a 34(10) byte buffer into which
                        information is placed concerning the file. It is
                        unnecessary to have any permissions at all with respect
                        to the file, but all directories leading to the file
                        must be readable.

                        After stat, buf has the following format:

                        buf, +1             i-number
                        +2, +3              flags (see below)
                        +4                  number of links
                        +5                  user ID of owner
                        +6,+7               size in bytes
                        +8,+9               first indirect block or contents block
                        ...
                        +22,+23             eighth indirect block or contents block
                        +24,+25,+26,+27     creation time
                        +28,+29, +30,+31    modification time
                        +32,+33             unused

(Dennis Ritchie provides the Unix first edition manual; the stat page is in section 2.1.)

Now how about that?

When did the ctime change from being called a "creation time" to a "change time"? Did the semantics change too, or was the "creation time" description a misnomer? If I can't find out, I might write to Ritchie to ask. But this is, of course, a last resort.
In the meantime, I do have the source code for the fifth edition kernel, but it appears that, around that time (1975 or so), there was no creation time. At least, I can't find one.

The inode operations inside the kernel are defined to operate on struct inodes:

        struct inode {
                char i_flag;
                char i_count;
                int i_dev;
                int i_number;
                int i_mode;
                char i_nlink;
                char i_uid;
                char i_gid;
                char i_size0;
                char *i_size1;
                int i_addr[8];
                int i_lastr;
        } inode[NINODE];

The i_lastr field is what we would now call the atime. (I suppose it stands for "last read".) The mtime and ctime are not there, because they are not stored in the in-memory copy of the inode. They are fetched directly from the disk when needed.

We can see an example of this in the stat1 function, which is the backend for the stat and fstat system calls:

         1  stat1(ip, ub)
         2  int *ip;
         3  {
         4          register i, *bp, *cp;
         5
         6          iupdat(ip, time);
         7          bp = bread(ip->i_dev, ldiv(ip->i_number+31, 16));
         8          cp = bp->b_addr + 32*lrem(ip->i_number+31, 16) + 24;
         9          ip = &(ip->i_dev);
        10          for(i=0; i<14; i++) {
        11                  suword(ub, *ip++);
        12                  ub =+ 2;
        13          }
        14          for(i=0; i<4; i++) {
        15                  suword(ub, *cp++);
        16                  ub =+ 2;
        17          }
        18          brelse(bp);
        19  }

ub is the user buffer into which the stat data will be deposited. ip is the inode structure from which most of this data will be copied. The suword utility copies a two-byte unsigned integer ("short unsigned word") from source to destination. This is done starting at the i_dev field (line 9), which effectively skips the two earlier fields, i_flag and i_count, which are internal kernel matters that are none of the user's business. 14 words are copied from the inode structure starting from this position, including the device and i-number fields, the mode, the link count, and so on, up through the addresses of the data or indirect blocks. (In modern Unixes, the stat call omits these addresses.)
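The arithmetic on lines 7 and 8 of stat1 is what locates the on-disk inode. As a sketch in modern C, and reading the constants 16 and 32 as "sixteen 32-byte inodes per block" (my interpretation of the code, not something the source states), the computation looks like this; `locate_inode` and `struct inode_location` are names invented for the illustration.

```c
/* Where an on-disk inode lives, following lines 7-8 of stat1:
 *   block  = (i_number + 31) / 16
 *   offset = 32 * ((i_number + 31) % 16)   bytes into that block
 * (Hypothetical type and function names.) */
struct inode_location {
    unsigned block;
    unsigned offset;
};

struct inode_location locate_inode(unsigned i_number)
{
    struct inode_location loc;
    unsigned i = i_number + 31;

    loc.block = i / 16;           /* ldiv(i, 16) in the kernel source */
    loc.offset = 32 * (i % 16);   /* 32*lrem(i, 16) in the kernel source */
    return loc;
}
```

Under this reading, i-number 1 lands at offset 0 of block 2, and i-number 17 at offset 0 of block 3. The extra + 24 on line 8 then points cp at the last eight bytes of the 32-byte slot, which are exactly the four words copied by the second loop.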
Then four words are copied out of the cp buffer, which has been read from the inode actually on the disk; these eight bytes are at position 24 in the inode, and ought to contain the mtime and the ctime. The question is, which is which? This simple question turns out to have a surprisingly complicated answer.

When an inode is modified, the IUPD flag is set in the i_flag member. For example, here is chmod, which modifies the inode but not the underlying data. On a modern unix system, we would expect this to update the ctime, but not the mtime. Let's see what it does in version 5:

         1  chmod()
         2  {
         3          register *ip;
         4
         5          if ((ip = owner()) == NULL)
         6                  return;
         7          ip->i_mode =& ~07777;
         8          if (u.u_uid)
         9                  u.u_arg[1] =& ~ISVTX;
        10          ip->i_mode =| u.u_arg[1]&07777;
        11          ip->i_flag =| IUPD;
        12          iput(ip);
        13  }

Line 10 is the important one; it sets the mode on the in-memory copy of the inode to the argument supplied by the user. Then line 11 sets the IUPD flag to indicate that the inode has been modified. Line 12 calls iput, whose principal job is to maintain the kernel's internal reference count of the number of file descriptors that are attached to this inode. When this number reaches zero, the inode is written back to disk, and discarded from the kernel's open file table. The iupdat function, called from iput, is the one that actually writes the modified inode back to the disk:

         1  iupdat(p, tm)
         2  int *p;
         3  int *tm;
         4  {
         5          register *ip1, *ip2, *rp;
         6          int *bp, i;
         7
         8          rp = p;
         9          if((rp->i_flag&(IUPD|IACC)) != 0) {
        10                  if(getfs(rp->i_dev)->s_ronly)
        11                          return;
        12                  i = rp->i_number+31;
        13                  bp = bread(rp->i_dev, ldiv(i,16));
        14                  ip1 = bp->b_addr + 32*lrem(i, 16);
        15                  ip2 = &rp->i_mode;
        16                  while(ip2 < &rp->i_addr[8])
        17                          *ip1++ = *ip2++;
        18                  if(rp->i_flag&IACC) {
        19                          *ip1++ = time[0];
        20                          *ip1++ = time[1];
        21                  } else
        22                          ip1 =+ 2;
        23                  if(rp->i_flag&IUPD) {
        24                          *ip1++ = *tm++;
        25                          *ip1++ = *tm;
        26                  }
        27                  bwrite(bp);
        28          }
        29  }

What is going on here? p is the in-memory copy of the inode we want to update.
It is immediately copied into a register, and called by the alias rp thereafter. tm is the time that the kernel should write into the mtime field of the inode. Usually this is the current time, but the smdate system call ("set modified date") supplies it from the user instead. Lines 16–17 copy the mode, link count, uid, gid, "size", and "addr" fields from the in-memory copy of the inode into the block buffer that will be written back to the disk. Lines 18–22 update the atime if the IACC flag is set, or skip it if not. Then, if the IUPD flag is set, lines 24–25 write the tm value into the next slot in the buffer, where the mtime is stored. The bwrite call on line 27 commits the data to the disk; this results in a call into the appropriate device driver code. There is no sign of updating the ctime field, but recall that we started this search by looking at what the chmod call does; it sets IUPD, which eventually results in the updating of the mtime field. So the mtime field is not really an mtime field as we now know it; it is doing the job that is now done by the ctime field. And in fact, the dump command predicates its decision about whether to dump a file on the contents of the mtime field. Which is really the ctime field. So functionally, dump is doing the same thing it does now. It's possible that I missed it, but I cannot find the advertised creation time anywhere. The logical place to look is in the maknode function, which allocates new inodes. The maknode function calls ialloc to get an unused inode from the device, and this initializes its mode (as specified by the user), its link count (to 1), and its uid and gid (to the current process's uid and gid). It does not set a creation time. The ialloc function is fairly complicated, but as far as I can tell it is not setting any creation time either. Working it from the other end, asking who might look at the ctime field, we have the find command, which has a -mtime option, but no -ctime option. 
The dump command, as noted before, uses the mtime.

Several commands perform stat calls and declare structs to hold the result. For example, pr, which prints files with nice pagination, declares a struct inode, which is the inode as returned by stat, as opposed to the inode as used internally by the kernel (what we would call a struct stat now). There was no /usr/include in the fifth edition, so the pr command contains its own declaration of the struct inode. It looks like this:

        struct inode {
                int dev;
                ...
                int atime[2];
                int mtime[2];
        };

No sign of the ctime, which would have been after the mtime field. (Of course, it could be there anyway, unmentioned in the declaration, since it is last.) And similarly, the ls command has:

        struct ibuf {
                int idev;
                int inum;
                ...
                char *iatime[2];
                char *imtime[2];
        };

A couple of commands have extremely misleading declarations. Here's the struct inode from the prof command, which prints profiling reports:

        struct inode {
                int idev;
                ...
                int ctime[2];
                int mtime[2];
                int fill;
        };

The atime field has erroneously been called ctime here, but it seems that since prof does not use the atime, nobody noticed the bug. And there's a mystery fill field at the end, as if prof is expecting one more field, but doesn't know what it will be for. The declaration of ibuf in the ln command has similar oddities.

So the creation time advertised by the CACM paper (1974) and the version 1 manual (1971) seems to have disappeared by the time of version 5 (1975), if indeed it ever existed. But there was some schizophrenia in the version 5 system about whether there was a third date in addition to the atime and the mtime. The stat call copied it into the stat buffer, and some commands assumed that it would be there, although they weren't sure what it would be for, and none of them seem to look at it.
It's quite possible that there was at one time a creation date, which had been eliminated by the time of the fifth edition, leaving behind the vestigial remains we saw in commands like ln and prof and in the code of the stat1 function.

Functionally, the version 5 mtime is actually what we would now call the ctime: it is updated by operations like chmod that in modern Unix will update the ctime but not the mtime. A quick scan of the Lions Book suggests that it was the same way in version 6 as well. I imagine that the ctime-mtime distinction arose in version 7, because that was the last version before the BSD/AT&T fork, and nearly everything common to those two great branches of the Unix tree was in version 7.

Oh, what the hell; I have the version 7 source code; I may as well look at it. Yes, by this time the /usr/include/sys/stat.h file had been invented, and does indeed include all three times in the struct stat. So the mtime (as we now know it) appears to have been introduced in v7.

One sometimes hears that early Unix had atime and mtime, and that ctime was introduced later. But actually, it appears that early Unix had atime and ctime, and it was the mtime that was introduced later. The confusion arises because in those days the ctime was called "mtime".

Addendum: It occurs to me now that the version 5 mtime is not precisely like the modern ctime, because it can be set via the smdate call, which is analogous to the modern utime call. The modern ctime cannot be set at all.

(Minor trivium: line 22 of iupdat is ip1 =+ 2. In modern C, we would write ip1 += 2. The =+ and =- operators had turned out to be a mistake, because people would write i=-1, intending i = -1, but the compiler would understand it as i =- 1, producing subtle bugs. The spellings of the operators were changed to avoid these bugs.
The change from =+ to += was complete by the time K&R first edition was published in 1978: K&R mentions the old-style operators and says that they are obsolete. In spite of this, the Sun compiler I used in 1987 would still produce a warning for i=-1, despite interpreting it as i = -1. I believe this was because it was PCC-derived, and all PCC compilers emitted this warning. In the fifth edition code, we can see the obsolete form still in use.)

(Totally peripheral addendum: Google search for dmr puts Dennis M. Ritchie in fourth position, not the first. Is this grave insult to our community to be tolerated? I think not! It must be avenged! With fire and steel!)

[ Addendum 20070127: Unix source code prior to the fifth edition is lost. The manuals for the third and fourth editions are available from the Unix Heritage Society. The manual for the third edition (February 1973) mentions the creation time, but by the fourth edition (November 1973) the stat(2) man page no longer mentions a creation time. In v4, the two dates in the stat structure are called actime (modern atime) and modtime (modern mtime/ctime). ]

Fri, 26 Jan 2007

Environmental manipulations

Unix is full of little utility programs that run some other program in a slightly modified environment. For example, the nohup command:

> ### SYNOPSIS
>
> nohup COMMAND [ARG]...
>
> ### DESCRIPTION
>
> Run COMMAND, ignoring hangup signals.

The nohup command basically does signal(SIGHUP, SIG_IGN) before calling execvp(COMMAND, ARGV) to execute the command.

Similarly, there is a chroot command, run as chroot new-root-directory command args..., which runs the specified command with its default root inode set to somewhere else. And there is a nice command, run as nice nice-value-adjustment command args..., which runs the specified command with its "nice" value changed. And there is an env environment-settings command args... which runs the specified command with new variables installed into the environment.
The standard sudo command could also be considered to be of this type.

I have also found it useful to write trivial commands called indir, which runs a command after chdir-ing to a new directory, and stopafter, which runs a command after setting the alarm timer to a specified amount, and, just today, with-umask, which runs a command after setting the umask to a particular value.

I could probably have avoided indir and with-umask. Instead of indir DIR COMMAND, I could use sh -c 'cd DIR; exec COMMAND', for example. But indir avoids an extra layer of horrible shell quotes, which can be convenient.

Today it occurred to me to wonder if this proliferation of commands was really the best way to solve the problem. The sh -c '...' method solves it partly, for those parts of the process user area that correspond to shell builtin commands. This includes the working directory, umask, and environment variables, but not the signal table, the alarm timer, or the root directory.

There is no standardized interface to all of these things at any level. At the system call level, the working directory is changed by the chdir system call, the root directory by chroot, the alarm timer by alarm, the signal table by a bunch of OS-dependent nonsense like signal or sigaction, the nice value by setpriority, environment variables by a potentially complex bunch of memory manipulation and pointer banging, and so on.

Since there's no single interface for controlling all these things, we might get a win by making an abstraction layer for dealing with them. One place to put this abstraction layer is at the system level, where it might look something like this:

        /* declares USERAREA_* constants,
           int userarea_set(int, ...)
           and void *userarea_get(int)
        */
        #include <sys/userarea.h>

        userarea_set(USERAREA_NICE, 12);
        userarea_set(USERAREA_CWD, "/tmp");
        userarea_set(USERAREA_SIGNAL, SIGHUP, SIG_IGN);
        userarea_set(USERAREA_UMASK, 0022);
        ...

This has several drawbacks. One is that it requires kernel hacking.
A subitem of this is that it will never become widespread, and that if you can't (or don't want to) replace your kernel, it cannot be made to work for you. Another is that it does not work for the environment variables, which are not really administered by the kernel. Another is that it does not fully solve the original problem, which is to obviate the plethora of nice, nohup, sudo, and env commands. You would still have to write a command to replace them. I had thought of another drawback, but forgot it while I was writing the last two sentences.

You can also put the abstraction layer at the C library level. This has fewer drawbacks. It no longer requires kernel hacking, and can provide a method for modifying the environment. But you still need to write the command that uses the library.

We may as well put the abstraction layer at the Unix command level. This means writing a command in some language, like Perl or C, which offers a shell-level interface to manipulating the process environment, perhaps something like this:

        newenv nice=12 cwd=/tmp signal=HUP:IGNORE umask=0022 -- command args...

Then newenv has a giant dispatch table inside it to process the settings accordingly:

        ...
        nice => sub { setpriority(PRIO_PROCESS, $$, $_) },
        cwd  => sub { chdir($_) },
        signal => sub {
          my ($name, $result) = split /:/;
          $SIG{$name} = $result;
        },
        umask => sub { umask(oct($_)) },
        ...

One question to ask is whether something like this already exists. Another is, if not, whether it's because there's some reason why it's a bad idea, or because there's a simpler solution, or just because nobody has done it yet.

Fri, 05 Jan 2007

ssh-agent, revisited

My recent article about reusing ssh-agent processes attracted a lot of mail, most of it very interesting.

1. A number of people missed an important piece of context: since the article was filed in the 'oops' section of my blog, it was intended as a description of a mistake I had made. The mistake in this case being to work really hard on the first solution I thought of, rather than to back up at early signs of trouble, and scout around for a better and simpler solution. I need to find a way to point out the "oops" label more clearly, and at the top of the article instead of at the bottom.

2. Several people pointed out other good solutions to my problem. For example, Adam Sampson and Robert Loomans pointed out that versions of ssh-agent support a -a option, which orders the process to use a particular path for its Unix domain socket, rather than making up a path, as it does by default. You can then use something like ssh-agent -a $HOME/.ssh/agent when you first start the agent, and then you always know where to find the socket.

3. An even simpler solution is as follows: My principal difficulty was in determining the correct value for the SSH_AGENT_PID variable. But it turns out that I don't need this; it is only used for ssh-agent -k, which kills the existing ssh-agent process. For authentication, it is only necessary to have SSH_AUTH_SOCK set. The appropriate value for this variable is readily determined by scanning /tmp, as I noted in the original article. Thanks to Aristotle Pagaltzis and Adam Turoff for pointing this out.

4. Several people pointed me to the keychain project. This program is a front-end to ssh-agent. It contains functions to check for a running agent, and to start one if there is none yet, and to save the environment settings to a file, as I did manually in my article.

5. A number of people suggested that I should just run ssh-agent from my X session manager. This suggests that they did not read the article carefully; I already do this. Processes running on my home machine, B, all inherit the ssh-agent settings from the session manager process. The question is what to do when I remote login from a different machine, say A, and want the login shell, which was not started under X, to acquire the same settings.

Other machines trust B, but not A, so credential forwarding is not the solution here either.

6. After extracting the ssh-agent process's file descriptor table with ls -l /proc/pid/fd, and getting:

        lrwx------    1 mjd      users          64 Dec 12 23:34 3 -> socket:[711505562]


I concluded that the identifying information, 711505562, was useless. Aaron Crane corrected me on this; you can find it listed in /proc/net/unix, which gives the pathname in the filesystem:

        % grep 711505562 /proc/net/unix
        ce030540: 00000002 00000000 00010000 0001 01 711505562 /tmp/ssh-tNT31655/agent.31655


I had suggested that the kernel probably maintained no direct mapping from the socket i-number to the filesystem path, and that obtaining this information would require difficult grovelling of the kernel data structures. But apparently to whatever extent that is true, it is irrelevant, since the /proc/net/unix driver has already been written to do it.

7. Saving the socket information in a file solves another problem I had. Suppose I want some automated process, say the cron job that makes my offsite network backups, to get access to SSH credentials. I can store the credentials in an ssh-agent process, and save the variable settings to a file. The backup process can then reinstate the settings from the file, and will thenceforward have the credentials for the remote login.

8. Finally, I should add that since implementing this scheme for the first time on 21 November, I have started exactly zero new ssh-agent processes, so I consider it a rousing success.
Thanks to everyone who wrote in on this matter.
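The lookup from item 6 can also be done from a program instead of with grep. The following is a rough sketch, assuming the column layout shown in the sample line above (six leading fields, then the inode number, then the path); the function name `socket_path_for_inode` is invented for the illustration, and it reads from any stream so that it is not tied to /proc.

```c
#include <stdio.h>

/* Scan a stream in /proc/net/unix format for the socket with the
 * given inode number and copy its pathname into buf.
 * Returns 1 if a matching line with a path was found, 0 otherwise.
 * (Hypothetical helper; assumes whitespace-separated columns with
 * the inode in the seventh column and the path in the eighth.) */
int socket_path_for_inode(FILE *f, unsigned long inode,
                          char *buf, size_t buflen)
{
    char line[512];
    char path[256];
    unsigned long ino;

    while (fgets(line, sizeof line, f) != NULL) {
        /* Skip six columns, then read the inode and the path.  The
         * header line and unnamed sockets fail to match and are
         * skipped. */
        if (sscanf(line, "%*s %*s %*s %*s %*s %*s %lu %255s",
                   &ino, path) == 2 && ino == inode) {
            snprintf(buf, buflen, "%s", path);
            return 1;
        }
    }
    return 0;
}
```

On a Linux system one would pass it the result of `fopen("/proc/net/unix", "r")` together with the number taken from the `socket:[...]` link in /proc/pid/fd.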

Fri, 09 Dec 2005

Michael C. Toren:
> 1) Open a file descriptor pointing to the current working directory.
>
> 2) Create a temporary directory within the jail, and chroot() to it.
>
> 3) Using fchdir(), change the working directory to the file descriptor
> saved from step 1.

Oho, I hadn't seen that before. The chroot() in step 2 is required to avoid the special case in the kernel that checks to see if you are doing ".." in the current root directory. But because you chroot()ed yourself somewhere else, the special case isn't exercised.

Older systems don't have fchdir(), which is a fairly recent addition.
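The descriptor trick in steps 1 and 3 is useful on its own, with no chroot() involved, as a way to wander off to another directory and reliably come back. A minimal sketch of just that part follows; the names `save_cwd` and `restore_cwd` are invented for the example, and nothing here requires privileges.

```c
#include <fcntl.h>
#include <unistd.h>

/* Step 1: remember the current directory as an open descriptor.
 * Returns the descriptor, or -1 on failure.  (Hypothetical name.) */
int save_cwd(void)
{
    return open(".", O_RDONLY);
}

/* Step 3: return to the remembered directory and release the
 * descriptor.  Returns 0 on success, -1 on failure.  This works even
 * if the directory has been renamed in the meantime, because the
 * descriptor refers to the directory itself and not to its path. */
int restore_cwd(int fd)
{
    int result = fchdir(fd);

    close(fd);
    return result;
}
```

A caller does `fd = save_cwd()`, then any number of chdir() calls, then `restore_cwd(fd)`.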

With the proliferation of "f" calls in recent years (fchdir, fchmod, fchown, fstat, fsync, etc.) I wonder what would be the result if the Unix system interface were redesigned to eliminate the non-"f" versions of the calls entirely. Instead, there would be a generic function, which we might call "iname", which transforms a path name to an "inode" structure:

        struct inode * iname (const char *path);


Unix kernels already contain a function with this name that does this job.

The system calls that formerly accepted path names are changed to require an inode structure. So instead of

        fd = open("dir/file", ...)


one now has

        fd = open(iname("dir/file"), ...)


(There are some minor language and usability issues here: what if iname() returns NULL? Ignore those; I want to discuss OS issues, not language issues.)

There would be a function, analogous to iname(), that also returned an inode structure, but which took an open file descriptor instead of a path name:

        struct inode * inode(int fd);


This is essentially equivalent to the fstat() function we have now.

chown() and fchown() would merge to become a single call that accepted an inode structure; instead of:

        chown("dir/file", owner)
        fchown(fd, owner)


one would have:

        chown(iname("dir/file"), owner)
        chown(inode(fd), owner)


chdir() and fchdir() would merge in the same way; instead of:

        chdir(path);
        fchdir(fd);


one would have:

        chdir(iname(path));
        chdir(inode(fd));


stat() and fstat() would not only merge but would disappear entirely; the struct inode can do everything that the struct stat can do. This code:

        stat("dir/file", &statbuf);
        fstat(fd, &statbuf);


turns into this:

        statbuf = iname("dir/file");
        statbuf = inode(fd);


There are some security implications to this idea. There needs to be protection against counterfeiting an inode structure. For example, consider a world-readable file in a secret, nonsearchable directory. Suppose the file happens to have i-number 123456. If it's possible to do this, then security has failed:

        struct inode I;
        I.inumber = 123456;
        fd = open(I, O_RDWR);


It should be impossible for anyone to manufacture the struct inode that represents the secret file without actually using iname() somewhere along the line. A simple way to arrange this would be to have the kernel cryptographically sign each struct inode. This can be done inexpensively.

This still has some access implications. Consider a world-readable file in a world-searchable directory. Process A iname()s the file, obtaining its struct inode. The search permissions on the directory are then removed. Process A can still open the file. This is analogous to a similar situation in standard Unix in which process A opens the file before the permissions are changed, and can still read and write it afterwards. So that's not a big change. What might be a big change is that A can dump the struct inode to a file and then a different process can read it back again, evading the increased access protections on the directory. The cryptographic signature technique can fix this problem by restricting struct inodes to be used by a single process.

Whether this is worth doing I don't know. My main idea in thinking it up was to avoid the increasing duplication of system calls. Does Unix need an "fsymlink" call? Does it need three different ones?

        symlink(oldpath, newpath);


Perhaps not this week, but who knows what the future holds? With the iname() / inode() style, these are all a single call:

        symlink(iname(oldpath), iname(newpath));


This also fixes some of the proliferation in the system call interface between calls that work on symlinks and calls that work through symlinks. For example, stat() and lstat(), and chown() and lchown(). On normal files, each pair is the same. But on a symlink, stat() stats the pointed-to file while lstat() stats the symlink itself; similarly chown() changes the owner of the pointed-to file while lchown() changes the owner of the symlink itself. But where's lchmod()? What about llink()? There's no way to make a hard link to a symbolic link! With the inode() / iname() technique above, you only need one extra call to handle all possible operations on a symbolic link:

        lstat(path);
        lchown(path, owner);

        stat(liname(path));