The shell and its crappy handling of whitespace
I'm about thirty-five years into Unix shell programming now, and I
continue to despise it. The shell's treatment of whitespace is
a constant problem. The fact that
for i in *.jpg; do
cp $i /tmp
done
doesn't work is a constant pain. The problem here is that if one of
the filenames is bite me.jpg then the cp command will turn into
cp bite me.jpg /tmp
and fail, saying
cp: cannot stat 'bite': No such file or directory
cp: cannot stat 'me.jpg': No such file or directory
or worse there is a file named bite that is copied even though
you did not want to copy it, maybe overwriting /tmp/bite that you
wanted to keep.
To make it work properly you have to say
for i in *.jpg; do
cp "$i" /tmp
done
with the quotes around the $i.
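To watch the splitting happen without putting any files at risk, here is a minimal sketch. The count_args function is a hypothetical helper invented for this illustration, not a standard command; all it does is report how many arguments the shell handed it.

```shell
#!/bin/sh
# count_args is a hypothetical helper: it prints the number of
# arguments the shell actually passed to it.
count_args() {
    echo "$#"
}

i='bite me.jpg'
count_args $i      # unquoted: the shell splits the value on the space
count_args "$i"    # quoted: the filename stays in one piece
```

With the default IFS, the unquoted call reports 2 and the quoted call reports 1, which is the whole problem in miniature.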
Now suppose I have a command
that strips off the suffix from a filename. For example,
suf foo.html
simply prints foo to standard output. Suppose I want to change the
names of all the .jpeg files to the corresponding names with .jpg
instead. I can do it like this:
for i in *.jpeg; do
mv $i $(suf $i).jpg
done
Ha ha, no, some of the files might have spaces in their names.
I have to write:
for i in *.jpeg; do
mv "$i" $(suf "$i").jpg # two sets of quotes
done
Ha ha, no, fooled you, the output of suf will also have spaces. I
have to write:
for i in *.jpeg; do
mv "$i" "$(suf "$i")".jpg # three sets of quotes
done
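For readers who don't have a suf command, the same renaming can be sketched with the POSIX ${i%.jpeg} suffix-removal expansion standing in for it. This is only an illustration under that assumption: jpg_name is a hypothetical helper name, and the loop runs in a scratch directory made by mktemp rather than on real files. Note that it does nothing to reduce the quoting burden.

```shell
#!/bin/sh
# jpg_name is a hypothetical helper: print the name a .jpeg file would
# have with a .jpg suffix instead. ${1%.jpeg} is the POSIX expansion
# that removes the shortest matching suffix.
jpg_name() {
    printf '%s\n' "${1%.jpeg}.jpg"
}

# Try it in a scratch directory so that no real files are touched.
dir=$(mktemp -d) || exit 1
touch "$dir/bite me.jpeg" "$dir/plain.jpeg"
for i in "$dir"/*.jpeg; do
    mv "$i" "$(jpg_name "$i")"   # the quotes are still required
done
ls "$dir"
rm -r "$dir"
```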
At this point it's almost worth breaking out a real language and using
something like this:
ls *.jpeg | perl -nle '($z = $_) =~ s/\.jpeg$/.jpg/; rename $_ => $z'
I think what bugs me most about this problem in the shell is that it's
so uncharacteristic of the Bell Labs people to have made such an
unforced error. They got so many things right, why not this?
It's not even a hard choice! 99% of the time you don't want your
strings implicitly split on spaces, why would you?
For example they got the behavior of for i in *.jpeg right; if one of those
files is bite me.jpeg the loop still runs only once for that file.
And the shell
doesn't have this behavior for any other sort of special character.
If you have a file named foo|bar and a variable z='foo|bar' then
ls $z doesn't try to pipe the output of ls foo into the bar
command, it just tries to list the file foo|bar like you wanted. But
if z='foo bar' then ls $z wants to list files foo and bar .
How did the Bell Labs wizards get everything right except the
spaces?
Even if it was a simple or reasonable choice to make in the beginning,
at some point around 1979 Steve Bourne had a clear opportunity to
realize he had made a mistake. He introduced $* and must shortly
thereafter have discovered that it wasn't useful. This should have
gotten him thinking.
$* is literally useless. It is the variable that is supposed to
contain the arguments to the current shell. So you can write a shell
script:
#!/bin/sh
# “yell”
echo "I am about to run '$*' now!!1!"
exec $*
and then run it:
$ yell date
I am about to run 'date' now!!1!
Wed Apr 2 15:10:54 EST 1980
except that doesn't work because $* is useless:
$ ls *.jpg
bite me.jpg
$ yell ls *.jpg
I am about to run 'ls bite me.jpg' now!!1!
ls: cannot access 'bite': No such file or directory
ls: cannot access 'me.jpg': No such file or directory
Oh, I see what went wrong, it thinks it got three arguments, instead
of two, because the elements of $* got auto-split. I needed to use
quotes around $*. Let's fix it:
#!/bin/sh
echo "I am about to run '$*' now!!1!"
exec "$*"
$ yell ls *.jpg
yell: 3: exec: ls bite me.jpg: not found
No, the quotes disabled all the splitting so that now I got one
argument that happens to contain two spaces.
This cannot be made to work. You have to fix the shell itself.
Having realized that $* is useless, Bourne added a workaround to the
shell, a unique special case with special handling. He added a $@ variable which
is identical to $* in all ways but one: when it is in
double-quotes. Whereas $* expands to
$1 $2 $3 $4 …
and "$*" expands to
"$1 $2 $3 $4 …"
"$@" expands to
"$1" "$2" "$3" "$4" …
so that inside of yell ls *.jpg, an exec "$@" will turn into
exec "ls" "bite me.jpg" and do what you wanted exec $* to do in
the first place.
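The three expansions can be compared side by side without running exec at all. In this sketch the function names (count_args, via_star and so on) are hypothetical, made up for the illustration; each wrapper forwards its own arguments a different way and the helper reports how many arrived.

```shell
#!/bin/sh
# count_args is a hypothetical helper that prints its argument count.
count_args() { echo "$#"; }

# Each wrapper forwards its own arguments in a different way.
via_star()        { count_args $*; }     # split on whitespace
via_quoted_star() { count_args "$*"; }   # joined into one string
via_quoted_at()   { count_args "$@"; }   # passed through unchanged

via_star        ls 'bite me.jpg'
via_quoted_star ls 'bite me.jpg'
via_quoted_at   ls 'bite me.jpg'
```

Given the two arguments ls and bite me.jpg, the three calls report 3, 1, and 2 respectively. Only "$@" delivers the two arguments that were actually passed.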
I deeply regret that, at the moment that Steve Bourne coded up this
weird special case, he didn't instead stop and think that maybe
something deeper was wrong. But he didn't and here we are. Larry
Wall once said something about how too many programmers have a
problem, think of a simple solution, and implement the solution, and
what they really need to be doing is thinking of three solutions and
then choosing the best one. I sure wish that had happened here.
Anyway, having to use quotes everywhere is a pain, but usually it
works around the whitespace problems, and it is not much worse than a
million other things we have to do to make our programs work in this
programming language hell of our own making. But sometimes this
isn't an adequate solution.
One of my favorite trivial programs is called lastdl . All it does
is produce the name of the file most recently written in
$HOME/Downloads , something like this:
#!/bin/sh
cd $HOME/Downloads
echo $HOME/Downloads/"$(ls -t | head -1)"
Many programs stick files into that directory, often copied from the
web or from my phone, and often with long and difficult names like
e15c0366ecececa5770e6b798807c5cc.jpg or
2023_3_20230310_120000_PARTIALPAYMENT_3028707_01226.PDF or
gov.uscourts.nysd.590045.212.0.pdf that I do not want to type or
even autocomplete. No problem, I just do
rm $(lastdl)
or
okular $(lastdl)
or
mv $(lastdl) /tmp/receipt.pdf
except ha ha, no I don't, because none of those works reliably, they
all fail if the difficult filename happens to contain spaces, as it
often does. Instead I need to type
rm "$(lastdl)"
okular "$(lastdl)"
mv "$(lastdl)" /tmp/receipt.pdf
which in a command so short and throwaway is a noticeable cost,
a cost extorted by the shell in return for nothing.
And every time I do it I am angry with Steve Bourne all over again.
There is really no good way out in general. For lastdl there is a decent
workaround, but it is somewhat fishy. After my lastdl command finds the
filename, it renames it to a version with no spaces and then prints
the new filename:
#!/bin/sh
# This is not the real code
# and I did not test it
cd "$HOME/Downloads" || exit 1
fns="$(ls -t | head -n 1)" # those stupid quotes again
fnd="$(printf '%s' "$fns" | tr ' \t' '__')" # two sets of stupid quotes this time
[ "$fns" = "$fnd" ] || mv "$fns" "$fnd" # and again
echo "$HOME/Downloads/$fnd"
The actual script
is somewhat more reliable, and is written in Python, because shell
programming sucks.
[ Addendum 20230731: Drew DeVault has written a reply article about
how the rc shell does not have these problems.
rc was designed in the late 1980s by Tom Duff of Bell Labs, and I
was a satisfied user (of the Byron Rakitzis clone) for many years. Definitely give it a look. ]
[ Addendum 20230806: Chris Siebenmann also discusses rc . ]
[Other articles in category /Unix]
There is a Unix error device
Yesterday I discussed /dev/full and asked why there wasn't a
generalization of it, and laid out some very 1990s suggestions that I
have had in the back of my mind since the 1990s. I ended by
acknowledging that there was probably a more modern solution in user
space:
Eh, probably the right solution these days is to LD_PRELOAD a
complete mock filesystem library that has any hooks you want in it.
Carl Witty suggested that there is a more modern solution in userspace,
FUSE,
and Leah Neukirchen filled in the details:
UnreliableFS is a
FUSE-based fault injection filesystem that allows to change
fault-injections in runtime using simple configuration file.
Also, Dave Vasilevsky suggested that something like this could be done
with the device mapper.
I think the real takeaway from this is that I had not accepted the
hard truth that all Unix is Linux now, and non-Linux Unix is dead.
Thanks everyone who sent suggestions.
[ Addendum: Leah Neukirchen informs me that FUSE also runs on FreeBSD,
OpenBSD and macOS, and reminds me that there are a great many macOS
systems. I should face the hard truth that my knowledge of Unix
systems' capabilities is at least fifteen years out of date. ]
Why no Unix error device?
Suppose you're writing some program that does file I/O. You'd like to
include a unit test to make sure it properly handles the error when
the disk fills up and the write can't complete. This is tough to
simulate. The test itself obviously can't (or at least shouldn't)
actually fill the disk.
A while back some Unix systems introduced a device called
/dev/full. Reading from /dev/full returns zero (NUL) bytes, just like
/dev/zero. But all attempts to write to /dev/full fail with
ENOSPC, the system error that indicates a full disk. You can set up
your tests to try to write to /dev/full and make sure they fail
gracefully.
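As a rough sketch of such a test from the shell: the write_fails function below is a hypothetical helper, not part of any test framework. It pipes a line through cat into the named file and reports whether the write failed. The sketch assumes /dev/full exists, which is not true on every Unix, so it checks for the device first.

```shell
#!/bin/sh
# write_fails is a hypothetical helper. It returns 0 (success) when
# writing to the named file FAILS, and 1 when the write goes through.
write_fails() {
    if printf 'some data\n' | cat > "$1" 2>/dev/null; then
        return 1
    else
        return 0
    fi
}

if [ -c /dev/full ]; then
    if write_fails /dev/full; then
        echo "write to /dev/full failed, as it would on a full disk"
    else
        echo "write to /dev/full unexpectedly succeeded"
    fi
else
    echo "no /dev/full on this system"
fi
```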
That's fun, but why not generalize it? Suppose there was a
/dev/error device:
#include <sys/errdev.h>
error = open("/dev/error", O_RDWR);
ioctl(error, ERRDEV_SET, 23);
The device driver would remember the number 23 from this ioctl call,
and the next time the process tried to read or write the error
descriptor, the request would fail and set errno to 23, whatever
that is. Of course you wouldn't hardwire the 23, you'd actually do
#include <sys/errno.h>
ioctl(error, ERRDEV_SET, EBUSY);
and then the next I/O attempt would fail with EBUSY .
Well, that's the way I always imagined it, but now that I think about
it a little more, you don't need this to be a device driver. It
would be better if instead of an ioctl it was an fcntl that you
could do on any file descriptor at all.
Big drawback: the most common I/O errors are probably EACCES and
ENOENT, failures in the open, not in the actual I/O. This idea
doesn't address that at all. But maybe some variation would work
there. Maybe for those we go back to the original idea, have a
/dev/openerror, and after you do ioctl(dev_openerror, ERRDEV_SET,
EACCES), the next call to open fails with EACCES. That might
be useful.
There are some security concerns with the fcntl version of the idea.
Suppose I write a malicious program that opens some file descriptor,
dups it to standard input, does fcntl(0, ERRDEV_SET,
ESOMEWEIRDERROR), then execs the target program t. Hapless t tries
to read standard input, gets ESOMEWEIRDERROR, and then does
something unexpected that it wasn't supposed to do. This particular attack is
easily foiled: exec should reset all the file descriptor saved-error
states. But there might be something more subtle that I haven't
thought of, and in OS security there usually is.
Eh, probably the right solution these days is to LD_PRELOAD a
complete mock filesystem library that has any hooks you want in it. I
don't know what the security implications of LD_PRELOAD are but I
have to believe that someone figured them all out by now.
[ Addendum 20220314: Better solutions exist. ]
A one-character omission caused my Python program to hang (not)
I just ran into a weird and annoying program behavior. I was
writing a Python program, and when I ran it, it seemed to hang.
Worried that it was stuck in some sort of resource-intensive loop I
interrupted it, and then I got what looked like an error message from
the interpreter. I tried this several more times, with the same
result; I tried putting exit(0) near the top of the program to
figure out where the slowdown was, but the behavior didn't change.
The real problem was the first line, which said:
#/usr/bin/env python3
when it should have been:
#!/usr/bin/env python3
Without that magic #! at the beginning, the file is processed not by
Python but by the shell, and the first thing the shell saw was
import re
which tells it to run the import command.
I didn't even know there was an import command. It runs an X client
that waits for the user to click on a window, and then writes a dump
of the window contents to a file. That's why my program seemed to
hang; it was waiting for the click.
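A check for this mistake is easy to sketch. The has_shebang function here is a hypothetical helper, not an existing tool; it reads the first line of a file and succeeds only if that line begins with the two characters #!. The two sample files are written to a scratch directory.

```shell
#!/bin/sh
# has_shebang is a hypothetical helper: succeed only if the first
# line of the named file begins with "#!".
has_shebang() {
    first_line=
    IFS= read -r first_line < "$1" || :
    case $first_line in
        '#!'*) return 0 ;;
        *)     return 1 ;;
    esac
}

dir=$(mktemp -d) || exit 1
printf '%s\n' '#/usr/bin/env python3' 'import re' > "$dir/broken.py"
printf '%s\n' '#!/usr/bin/env python3' 'import re' > "$dir/fixed.py"
for script in "$dir/broken.py" "$dir/fixed.py"; do
    if has_shebang "$script"; then
        echo "${script##*/}: starts with #!"
    else
        echo "${script##*/}: no #!, so a shell may end up interpreting it"
    fi
done
rm -r "$dir"
```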
I might have picked up on this sooner if I had actually looked at the
error messages:
./license-plate-game.py: line 9: dictionary: command not found
./license-plate-game.py: line 10: syntax error near unexpected token `('
./license-plate-game.py: line 10: `words = read_dictionary(dictionary)'
In particular, dictionary: command not found is the shell giving
itself away. But I was so worried about the supposedly resource-bound
program crashing my session that I didn't look at the actual output,
and assumed it was Python-related syntax errors.
I don't remember making this mistake before but it seems like it would
be an easy mistake to make. It might serve as a good example when
explaining to nontechnical people how finicky and exacting programming
can be. I think it wouldn't be hard to understand what happened.
This computer stuff is amazingly complicated. I don't know how anyone
gets anything done.
Benchmarking shell pipelines and the Unix “tools” philosophy
Sometimes I look through the HTTP referrer logs to see if anyone is
talking about my blog. I use the f 11 command to extract the
referrer field from the log files, count up the number of occurrences
of each referring URL, then discard the ones that are internal
referrers from elsewhere on my blog. It looks like this:
f 11 access.2020-01-0* | count | grep -v plover
(I've discussed f before. The f 11 just
prints the eleventh field of each line. It is
essentially shorthand for awk '{print $11}' or perl -lane 'print
$F[10]' . The count utility is even simpler; it counts the number
of occurrences of each distinct line in its input, and emits a report
sorted from least to most frequent, essentially a trivial wrapper
around sort | uniq -c | sort -n . Civilization advances by extending the number of
important operations which we can perform without thinking about
them.)
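For anyone who wants to follow along without the real utilities, here are stand-ins built only from the equivalents named above. They are sketches under that assumption, not the actual f and count programs, which may well handle more cases.

```shell
#!/bin/sh
# Stand-ins only: built from the equivalents named in the text, not
# the real f and count utilities.

# f N [FILE...]: print the Nth whitespace-separated field of each line.
f() {
    n=$1
    shift
    awk -v n="$n" '{ print $n }' "$@"
}

# count: report each distinct input line with its number of
# occurrences, least frequent first.
count() {
    sort | uniq -c | sort -n
}

printf '%s\n' 'a x' 'b y' 'c x' 'd x' | f 2 | count
```

The sample input has one y and three x in its second field, so the line for x comes out last. The exact column padding is whatever uniq -c produces on your system.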
This has obvious defects, but it works well enough. But every time I
used it, I wondered: is it faster to do the grep before the count ,
or after? I didn't ever notice a difference. But I still wanted to
know.
After years of idly wondering this, I have finally looked into it.
The point of this article is that the investigation produced the
following pipeline, which I think is a great example of the Unix
“tools” philosophy:
for i in $(seq 20); do
TIME="%U+%S" time \
sh -c 'f 11 access.2020-01-0* | grep -v plover | count > /dev/null' \
2>&1 | bc -l ;
done | addup
I typed this on the command line, with no backslashes or newlines, so
it actually looked like this:
for i in $(seq 20); do TIME="%U+%S" time sh -c 'f 11 access.2020-01-0* | grep -v plover |count > /dev/null' 2>&1 | bc -l ; done | addup
Okay, what's going on here? The pipeline I actually want to analyze,
with f | grep | count, is
there in the middle, and I've already explained it, so let's elide it:
for i in $(seq 20); do
TIME="%U+%S" time \
sh -c '¿SOMETHING? > /dev/null' 2>&1 | bc -l ;
done | addup
Continuing to work from inside to out, we're going to use time to
actually do the timings. The time command is standard. It runs a
program, asks the kernel how long the program took, then prints
a report.
The time command will only time a single process (plus its
subprocesses, a crucial fact that is inexplicably omitted from the man
page). The ¿SOMETHING? includes a pipeline, which must be set up by
the shell, so we're actually timing a shell command sh -c '...'
which tells time to run the shell and instruct it to run the pipeline we're
interested in. We tell the shell to throw away the output of
the pipeline, with > /dev/null , so that the output doesn't get mixed
up with time 's own report.
The default format for the report printed by time is intended for human consumption. We can
supply an alternative format in the $TIME variable. The format I'm using here is %U+%S , which comes out as something
like 0.25+0.37 , where 0.25 is the user CPU time and 0.37 is the
system CPU time. I didn't see a format specifier that would emit the
sum of these directly. So instead I had it emit them with a + in
between, and then piped the result through the bc command, which performs the requested arithmetic
and emits the result. We need the -l flag on bc
because otherwise it stupidly does integer arithmetic. The time command emits its report to
standard error, so I use 2>&1
to redirect the standard error into the pipe.
[ Addendum 20200108: We don't actually need -l here; I was mistaken. ]
Collapsing the details I just discussed, we have:
for i in $(seq 20); do
(run once and emit the total CPU time)
done | addup
seq is a utility I invented no later than 1993 which has since
become standard in most Unix systems. (As with
netcat , I am not claiming to be the
first or only person to have invented this, only to have invented it
independently.) There are many variations of seq , but
the main use case is that seq 20 prints
1
2
3
…
19
20
Here we don't actually care about the output (we never actually use
$i ) but it's a convenient way to get the for loop to run twenty
times. The output of the for loop is the twenty total CPU
times that were emitted by the twenty invocations of bc . (Did you know that
you can pipe the output of a loop?) These twenty lines of output are
passed into addup , which I wrote no later than 2011. (Why did it
take me so long to do this?) It reads a list of numbers and prints the
sum.
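The real addup isn't shown here, so the following is only a stand-in with the same described behavior, written with awk. It also demonstrates piping the output of a loop.

```shell
#!/bin/sh
# A stand-in for addup, not the real program: read one number per
# line and print the sum.
addup() {
    awk '{ total += $1 } END { print total + 0 }'
}

# The output of a whole loop can be piped, just like any command.
for i in 1 2 3; do
    echo "$i"
done | addup

printf '%s\n' 0.25 0.37 0.5 | addup
```

The loop prints 1, 2, and 3, so the first sum is 6; the second is 1.12.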
All together, the command runs and prints a single number like 5.17 ,
indicating that the twenty runs of the pipeline took 5.17 CPU-seconds
total. I can do this a few times for the original pipeline, with
count before grep , get times between 4.77 and 5.78, and then try
again with the grep before the count , producing times between 4.32
and 5.14. The difference is large enough to detect but too small to
notice.
(To do this right we also need to test a null command, say
sh -c 'sleep 0.1 < /dev/null'
because we might learn that 95% of the reported time is spent
in running the shell, so the actual difference between the two pipelines is
twenty times as large as we thought. I did this; it turns out that the time
spent to run the shell is insignificant.)
What to learn from all this? On the one hand, Unix wins: it's
supposed to be quick and easy to assemble small tools to do whatever it is
you're trying to do. When time wouldn't do the arithmetic I needed
it to, I sent its output to a generic arithmetic-doing utility. When
I needed to count to twenty, I had a utility for doing that; if I
hadn't, there are any number of easy workarounds. The
shell provided the I/O redirection and control flow I needed.
On the other hand, gosh, what a weird mishmash of stuff I had to
remember or look up. The -l flag for bc . The fact that I needed
bc at all because time won't report total CPU time. The $TIME
variable that controls its report format. The bizarro 2>&1 syntax
for redirecting standard error into a pipe. The sh -c trick to get
time to execute a pipeline. The missing documentation of the core
functionality of time .
Was it a win overall? What if Unix had less compositionality but I
could use it with less memorized trivia? Would that be an
improvement?
I don't know. I rather suspect that there's no way to
actually reach that hypothetical universe. The bizarre mishmash of
weirdness exists because so many different people invented so many
tools over such a long period. And they wouldn't have done any of
that inventing if the compositionality hadn't been there. I think we
don't actually get to make
a choice between an incoherent mess of composable paraphernalia and a
coherent, well-designed but noncompositional system. Rather, we get a
choice between an incoherent but useful mess and an incomplete, limited
noncompositional system.
(Notes to self: (1) In connection with Parse::RecDescent , you once wrote
about open versus closed systems. This is another point in that
discussion. (2) Open systems tend to evolve into messes. But closed
systems tend not to evolve at all, and die. (3) Closed systems are
centralized and hierarchical; open systems, when they succeed, are
decentralized and organic. (4) If you are looking for another
example of a successful incoherent mess of composable paraphernalia,
consider Git.)
[ Addendum: Add this to the list of “weird mishmash of trivia”: There are two
time commands. One, which I discussed above, is a separate
executable, usually in /usr/bin/time . The other is built into the
shell. They are incompatible. Which was I actually using? I would
have been pretty confused if I had accidentally gotten the built-in
one, which ignores $TIME and uses a $TIMEFORMAT that is
interpreted in a completely different way. I was fortunate, and got
the one I intended to get. But it took me quite a while to understand
why I had! The appearance of the TIME=… assignment at the start of
the shell command disabled the shell's special builtin treatment of the
keyword time , so it really did use /usr/bin/time . This computer stuff
is amazingly complicated. I don't know how anyone gets anything done. ]
[ Addenda 20200104: (1) Perl's module ecosystem is another example of a
successful incoherent mess of composable paraphernalia. (2) Of the
seven trivia I included in my “weird mishmash”, five were related to
the time command. Is this a reflection on time , or is it just
because time was central to this particular example? ]
[ Addendum 20200104: And, of course, this is exactly what Richard
Gabriel was thinking about in Worse is
Better. Like Gabriel, I'm not sure. ]
How not to reconfigure your sshd
Yesterday I wanted to reconfigure the sshd on a remote machine.
Although I'd never done sshd itself, I've done this kind of thing
a zillion times before. It looks like this: there is a configuration
file (in this case /etc/ssh/sshd_config) that you modify. But this
doesn't change the running server; you have to notify the server that
it should reread the file. One way would be by killing the server and
starting a new one. This would interrupt service, so instead you can
send the server a different signal (in this case SIGHUP ) that tells
it to reload its configuration without exiting. Simple enough.
Except, it didn't work. I added:
Match User mjd
ForceCommand echo "I like pie!"
and signalled the server, then made a new connection to see if it
would print I like pie! instead of starting a shell. It started a
shell. Okay, I've never used Match or ForceCommand before, maybe
I don't understand how they work, I'll try something simpler. I
added:
PrintMotd yes
which seemed straightforward enough, and I put some text into
/etc/motd , but when I connected it didn't print the motd.
I tried a couple of other things but none of them seemed to work.
Okay, maybe the sshd is not getting the signal, or something? I
hunted up the logs, but there was a report like what I expected:
sshd[1210]: Received SIGHUP; restarting.
This was a head-scratcher. Was I modifying the wrong file? It seemed
hardly possible, but I don't administer this machine so who knows? I
tried lsof -p 1210 to see if maybe sshd had some other config file
open, but it doesn't keep the file open after it reads it, so that was
no help.
Eventually I hit upon the answer, and I wish I had some useful piece
of advice here for my future self about how to figure this out. But I
don't because the answer just struck me all of a sudden.
(It's nice when that happens, but I feel a bit cheated afterward: I
solved the problem this time, but I didn't learn anything, so how
does it help me for next time? I put in the toil, but I didn't get
the full payoff.)
“Aha,” I said. “I bet it's because my connection is multiplexed.”
Normally when you make an ssh connection to a remote machine, it
calls up the server, exchanges credentials, each side authenticates
the other, and they negotiate an encryption key. Then the server
forks, the child starts up a login shell and mediates between the
shell and the network, encrypting in one direction and decrypting in
the other. All that negotiation and authentication takes time.
There is a “multiplexing” option you can use instead. The handshaking
process still occurs as usual for the first connection. But once the
connection succeeds, there's no need to start all over again to make a
second connection. You can tell ssh to multiplex several virtual
connections over its one real connection. To make a new virtual
connection, you run ssh in the same way, but instead of contacting
the remote server as before, it contacts the local ssh client that's
already running and requests a new virtual connection. The client,
already connected to the remote server, tells the server to allocate a
new virtual connection and to start up a new shell session for it.
The server doesn't even have to fork; it just has to allocate another
pseudo-tty and run a shell in it. This is a lot faster.
I had my local ssh client configured to use a virtual connection if
that was possible. So my subsequent ssh commands weren't going
through the reconfigured parent server. They were all going through
the child server that had been forked hours before when I started my
first connection. It wasn't affected by reconfiguration of the parent
server, from which it was now separate.
I verified this by telling ssh to make a new connection without
trying to reuse the existing virtual connection:
ssh -o ControlPath=none -o ControlMaster=no ...
This time I saw the MOTD and when I reinstated that Match command I
got I like pie! instead of a shell.
(It occurs to me now that I could have tried to SIGHUP the child
server process that my connections were going through, and that would
probably have reconfigured any future virtual connections through that
process, but I didn't think of it at the time.)
Then I went home for the day, feeling pretty darn clever, right up
until I discovered, partway through writing this article, that I can't
log in because all I get is I like pie! instead of a shell.
More about disabling standard I/O buffering
In yesterday's article I
described a simple and useful feature that could have been added to
the standard I/O library, to allow an environment variable to override
the default buffering behavior. This would allow the invoker of a
program to request that the program change its buffering behavior even
if the program itself didn't provide an option specifically for doing
that.
Simon Tatham directed me to the GNU Coreutils stdbuf command,
which does something of this sort. It is rather like the
pseudo-tty-pipe program I described, but instead of using the
pseudo-tty hack I suggested, it works by forcing the child program to
dynamically load a custom replacement for stdio.
There appears to be a very similar command in FreeBSD.
[ Addendum 20240820: This description is not accurate; see below. ]
Roderick Schertler pointed out that Dan Bernstein wrote a utility
program, pty, in 1990, atop which my pseudo-tty-pipe program could
easily be built; or maybe its ptybandage utility is exactly what I
wanted. Jonathan de Boyne Pollard has a page explaining it in
detail, and related packages.
A later version of pty is still available.
Here's M. Bernstein's blurb about it:
ptyget is a universal pseudo-terminal interface. It is designed to be
used by any program that needs a pty.
ptyget can also serve as a wrapper to improve the behavior of existing
programs. For example, ptybandage telnet is like telnet but can
be put into a pipeline. nobuf grep is like grep but won't
block-buffer if it's redirected.
Previous pty-allocating programs — rlogind, telnetd, sshd, xterm,
screen, emacs, expect, etc. — have caused dozens of security problems.
There are two fundamental reasons for this. First, these programs are
installed setuid root so that they can allocate ptys; this turns every
little bug in hundreds of thousands of lines of code into a potential
security hole. Second, these programs are not careful enough to protect
the pty from access by other users.
ptyget solves both of these problems. All the privileged code is in one
tiny program. This program guarantees that one user can't touch another
user's pty.
ptyget is a complete rewrite of pty 4.0, my previous pty-allocating
package. pty 4.0's session management features have been split off into
a separate package, sess.
Leonardo Taccari informed me that NetBSD's stdio actually
has the environment variable feature I was asking for! Christos
Zoulas suggested adding stdbuf similar to the GNU and FreeBSD
implementations, but the NetBSD people observed, as I did, that it
would be simpler to just control stdio directly with an environment
variable, and did it. Here's the relevant part of the NetBSD setbuf(3) man
page:
The default buffer settings can be overwritten per descriptor
(STDBUFn) where n is the numeric value of the file descriptor
represented by the stream, or for all descriptors (STDBUF). The
environment variable value is a letter followed by an optional
numeric value indicating the size of the buffer. Valid sizes range
from 0B to 1MB. Valid letters are:
U unbuffered
L line buffered
F fully buffered
Here's the discussion from the NetBSD tech-userlevel mailing list.
The actual patch looks almost exactly the way I imagined it would.
Finally, Mariusz Ceier pointed out that there is an ancient bug report in glibc
suggesting essentially the same environment variable mechanism that I
suggested and that was adopted in NetBSD. The suggestion was firmly
and summarily rejected. (“Hell, no … this is a terrible idea.”)
Interesting wrinkle: the bug report was submitted by Pádraig Brady,
who subsequently wrote the stdbuf command I described above.
Thank you, Gentle Readers!
Addenda
20240820
nabijaczleweli has pointed out that my explanation of the GNU stdbuf
command above is not accurate. I said "it works by forcing the child
program to dynamically load a custom replacement for stdio". It's
less heavyweight than that. Instead, it arranges to load a dynamic library
that runs before the rest of the program starts, examines the
environment, and simply calls setvbuf as needed on the three
standard streams.
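In practice stdbuf is used as a prefix on the command whose buffering you want to change. The sketch below wraps it in a hypothetical line_buffered function (the name is made up for this illustration) that falls back to running the command unchanged on systems where stdbuf is not installed. The -oL option, which requests line-buffered standard output, is the real GNU option.

```shell
#!/bin/sh
# line_buffered is a hypothetical wrapper name. stdbuf -oL asks for
# line-buffered standard output; if stdbuf is not installed, the
# command is simply run unchanged.
line_buffered() {
    if command -v stdbuf >/dev/null 2>&1; then
        stdbuf -oL "$@"
    else
        "$@"
    fi
}

line_buffered printf '%s\n' 'a dribble of data' | cat
```

Since the mechanism works by calling setvbuf at startup, it can only help with programs that use stdio for their output and do not then set their own buffering mode.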
Proposal for turning off standard I/O buffering
Some Unix commands, such as grep , will have a command-line flag to
say that you want to turn off the buffering that is normally done in
the standard I/O library. Some just try to guess what you probably
want. Every command is a little different and if the command you want
doesn't have the flag you need, you are basically out of luck.
Maybe I should explain the putative use case here. You have some
command (or pipeline) X that will produce dribbles of data at
uncertain intervals. If you run it at the terminal, you see each
dribble timely, as it appears. But if you put X into a pipeline,
say with
X | tee ...
or
X | grep ...
then the dribbles are buffered and only come out of X when an entire
block is ready to be written, and the dribbles could be very old
before the downstream part of the pipeline, including yourself, sees
them. Because this is happening in user space inside of X , there is
not a damn thing anyone farther downstream can do about it. The only
escape is if X has some mode in which it turns off standard I/O
buffering. Since standard I/O buffering is on by default, there is a
good chance that the author of X did not think to affirmatively add
this feature.
Note that adding the --unbuffered flag to the downstream grep does
not solve the problem; grep will produce its own output timely, but
it's still getting its input from X after a long delay.
One could imagine a program which would interpose a pseudo-tty, and
make X think it is writing to a terminal, and then the standard I/O
library would stay in line-buffered mode by default. Instead of
running
X | tee some-file | ...
or whatever, one would do
pseudo-tty-pipe -c X | tee some-file | ...
which allocates a pseudo-tty device, attaches standard output to it,
and forks. The child runs X , which dribbles timely into the
pseudo-tty while the parent runs a read loop to remove dribbles from
the master end of the TTY and copy them timely into the pipe. This
would work. But tee itself has no --unbuffered flag either,
so you might even have to write:
pseudo-tty-pipe -c X | pseudo-tty-pipe -c 'tee some-file' | ...
I don't think such a program exists, and anyway, this is all
ridiculous, a ridiculous abuse of the standard I/O library's buffering
behavior: we want line buffering, the library will only give it to us
if the process is attached to a TTY device, so we fake up a TTY just
to fool stdio into giving us what we want. And why? Simply because
stdio has no way to explicitly say what we want.
But it could easily expose this behavior as a controllable
feature. Currently there is a branch in the library that says how to
set up a buffering mode when a stream is opened for the first time:
if the stream is for writing, and is attached to descriptor 2,
it should be unbuffered; otherwise …
if the stream is for writing, and connects descriptor 1 to a
terminal device, it should be line-buffered; otherwise …
if the moon is waxing …
…
otherwise, the stream should be block-buffered
To this, I propose a simple change, to be inserted right at the beginning:
If the environment variable STDIO_BUF is set to "line" , streams
default to line buffering. If it's set to "none" , streams default
to no buffering. If it's set to "block" , streams default to block
buffered. If it's anything else, or unset, it is ignored.
Now instead of this:
pseudo-tty-pipe -c X | tee some-file | ...
you write this:
STDIO_BUF=line X | tee some-file | ...
Problem solved.
Or maybe you would like to do this:
export STDIO_BUF=line
which then affects every program in every pipeline in the rest of
the session:
X | tee some-file | ...
Control is global if you want it, and per-process if you want it.
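Until something like STDIO_BUF exists in the library itself, the GNU
coreutils stdbuf command can stand in for it, at least for dynamically
linked programs that leave stdio's default buffering alone. Here is a
rough sketch of a wrapper that reads the proposed variable and
translates it into stdbuf options. The function names are made up for
illustration, the 4096-byte block size is an arbitrary choice, and the
whole thing assumes stdbuf is installed:

```shell
# Translate a proposed STDIO_BUF value into a stdbuf(1) output option.
# Prints nothing for unset or unrecognized values, which the proposal
# says should be ignored.
stdio_buf_flag() {
    case "${1-}" in
        line)  printf '%s\n' '-oL' ;;     # line buffering
        none)  printf '%s\n' '-o0' ;;     # no buffering
        block) printf '%s\n' '-o4096' ;;  # block buffering (size is arbitrary)
        *)     ;;                         # anything else: ignore
    esac
}

# Run a command with the output buffering requested in $STDIO_BUF, if any.
with_stdio_buf() {
    flag=$(stdio_buf_flag "${STDIO_BUF-}")
    if [ -n "$flag" ]; then
        stdbuf "$flag" "$@"
    else
        "$@"
    fi
}
```

After export STDIO_BUF=line , one would write with_stdio_buf X | tee
some-file | ... , which is still clumsier than the real feature would
be, since every command has to be wrapped by hand.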
This feature would cost around 20 lines of C code in the standard I/O
library and would impose only an insignificant run-time cost. It would
effectively add an --unbuffered flag to every program in the
universe, retroactively, and the flag would be the same for every
program. You would not have to remember that in mysql the magic
option is -n and that in GNU grep it is --line-buffered and that
for jq it is --unbuffered and that Python scripts can be
unbuffered by supplying the -u flag and that in tee you are just
SOL, etc. Setting STDIO_BUF=line would Just Work.
Programming languages would all get this for free also. Python
already has PYTHONUNBUFFERED but in other languages
you have to do something or other; in Perl you use some
horrible Perl-4-ism like
{ my $ofh = select OUTPUT; $|++; select $ofh }
This proposal would fix every programming language everywhere. The
Perl code would become:
$ENV{STDIO_BUF} = 'line';
and every other language would be similarly simple:
/* In C */
putenv("STDIO_BUF=line");
[ Addendum 20180521: Mariusz Ceier corrects me,
pointing out that this will not work for the process’ own standard
streams, as they are pre-opened before the process gets a chance to
set the variable. ]
It's easy to think of elaborations on this: STDIO_BUF=1:line might
mean that only standard output gets line-buffering by default,
everything else is up to the library.
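To show how little parsing that elaboration would need, here is a
sketch in shell rather than in the C the library would actually use.
The function name and the "all" placeholder are my own inventions, not
part of any existing interface:

```shell
# parse_stdio_buf VALUE
# Accepts "MODE" or "FD:MODE" and prints "FD MODE", with FD printed as
# "all" when no descriptor was given. Returns 1 for anything unrecognized,
# which a caller would then ignore.
parse_stdio_buf() {
    case "${1-}" in
        *:*) fd=${1%%:*}; mode=${1#*:} ;;
        *)   fd=all;      mode=${1-} ;;
    esac
    case "$fd" in
        all) ;;
        ''|*[!0-9]*) return 1 ;;   # descriptor must be a non-empty number
    esac
    case "$mode" in
        line|none|block) printf '%s %s\n' "$fd" "$mode" ;;
        *) return 1 ;;
    esac
}
```

With this, parse_stdio_buf 1:line prints "1 line" and parse_stdio_buf
line prints "all line".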
This is an easy thing to do. I have wanted this for twenty years.
How is it possible that it hasn't been in the GNU/Linux standard
library for that long?
[ Addendum 20180521: it turns out there is quite a lot to say about
the state of the art here. In
particular, NetBSD has the feature very much as I described it. ]
[Other articles in category /Unix]
Converting Google Docs to Markdown
I was on vacation last week and I didn't bring my computer, which has
been a good choice in the past. But I did bring my phone, and I spent
some quiet time writing various parts of around 20 blog posts on the
phone. I composed these in my phone's Google Docs app, which seemed
at the time like a reasonable choice.
But when I got back I found that it wasn't as easy as I had expected
to get the documents back out. What I really wanted was Markdown.
HTML would have been acceptable, since Blosxom accepts that also. I
could download a single document in one of several formats, including
HTML and ODF, but I had twenty and didn't want to do them one at a
time. Google has a bulk download feature, to download a zip file of
an entire folder, but upon unzipping I found that all twenty documents
had been converted to Microsoft's docx format and I didn't know a
good way to handle these. I could not find an option for a bulk
download in any other format.
Several tools will compose in Markdown and then export to Google
Docs, but the only option I found for translating from Google Docs
to Markdown was Renato Mangini's Google Apps
script.
I would have had to add the script to each of the 20 files, then run
it, and the output appears in email, so for this task, it was even
less like what I wanted.
The right answer turned out to be: Accept Google's bulk download of
docx files and then use Pandoc to convert the
docx to Markdown:
for i in *.docx; do
echo -n "$i ? ";
read -r j; mv -i "$i" "$j.docx";
pandoc --extract-media . -t markdown -o "$(suf "$j" mkdn)" "$j.docx";
done
The read is because I had given the files Unix-unfriendly names like
Polyominoes as orthogonal polygons.docx and I wanted to give them
shorter names like orthogonal-polyominoes.docx .
The suf command is a little utility that performs the very common
task of removing or changing the suffix of a filename. The suf "$j"
mkdn command means that if $j is something like foo.docx it
should turn into foo.mkdn . Here's the tiny source code:
#!/usr/bin/perl
#
# Usage: suf FILENAME [suffix]
#
# If filename ends with a suffix, the suffix is replaced with the given suffix
# otherwise, the given suffix is appended
#
# For example:
# suf foo.bar baz => foo.baz
# suf foo baz => foo.baz
# suf foo.bar => foo
# suf foo => foo
@ARGV == 2 or @ARGV == 1 or usage();
my ($file, $suf) = @ARGV;
$file =~ s/\.[^.]*$//;
if (defined $suf) {
print "$file.$suf\n";
} else {
print "$file\n";
}
sub usage {
print STDERR "Usage: suf filename [newsuffix]\n";
exit 1;
}
Often, I feel that I have written too much code, but not this time.
Some people might be tempted to add bells and whistles to this: what
if the suffix is not delimited by a dot character? What if I only
want to change certain suffixes? What if my foot swells up? What if
the moon falls out of the sky? Blah blah blah. No, for that we can
break out sed .
Next time I go on vacation I will know better and I will not use
Google Docs. I don't know yet what I will use instead.
StackEdit maybe.
[ Addendum 20180108: Eric Roode pointed out that the program above has
a genuine bug: if given a filename like a.b/c.d it truncates the
entire b/c.d instead of just the d . The current version fixes
this. ]
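For comparison, here is a sketch of the same utility as a shell
function, using only parameter expansion. It is not the author's
corrected version, which isn't shown here. The name suf_sh is made up,
and the sketch handles the a.b/c.d case by only looking for a suffix in
the last path component:

```shell
# suf_sh FILENAME [SUFFIX]
# Strip the suffix from the last path component of FILENAME; if SUFFIX
# is given, append it in place of whatever was stripped.
suf_sh() {
    [ "$#" -eq 1 ] || [ "$#" -eq 2 ] || {
        echo "Usage: suf_sh filename [newsuffix]" >&2
        return 1
    }
    file=$1
    base=${file##*/}       # last path component, e.g. c.d
    dir=${file%"$base"}    # everything before it, e.g. a.b/
    case $base in
        *.*) base=${base%.*} ;;   # drop the final .suffix if there is one
    esac
    if [ "$#" -eq 2 ]; then
        printf '%s\n' "$dir$base.$2"
    else
        printf '%s\n' "$dir$base"
    fi
}
```

Note that POSIX sh has no local variables, so file , base , and dir
leak into the caller's shell.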
[Other articles in category /Unix]
/dev/null Follies
A Unix system administrator of my acquaintance once got curious about
what people were putting into /dev/null . I think he also may have
had some notion that it would contain secrets or other interesting
material that people wanted thrown away. Both of these ideas are
stupid, but what he did next was even more stupid: he decided to
replace /dev/null with a plain file so that he could examine its
contents.
The root filesystem quickly filled up and the admin had to be called
back from dinner to fix it. But he found that he couldn't fix it: to
create a Unix device file you use the mknod command, and its
arguments are the major and minor device numbers of the device to
create. Our friend didn't remember the correct minor device
number. The ls -l command will tell you the numbers of a device file
but he had removed /dev/null so he couldn't use that.
Having no other system of the same type with an intact device file to
check, he was forced to restore /dev/null from the tape backups.
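The story doesn't say what kind of system this was, and device numbers
differ from one Unix to another. On Linux, at least, /dev/null is
conventionally character device major 1, minor 3. The sketch below
(the function name is made up) reads the numbers off an intact device
file with ls , which is what the admin wished he could do:

```shell
# dev_numbers DEVICE
# Print "MAJOR MINOR" for a device file, parsed from `ls -ln` output,
# where the size column is replaced by "MAJOR, MINOR" for devices.
dev_numbers() {
    ls -ln -- "$1" | awk '{ sub(/,$/, "", $5); print $5, $6 }'
}

# Recreating the device would then be something like this (Linux
# numbers assumed; needs root, so it is left commented out):
#   mknod -m 666 /dev/null c 1 3
```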
[Other articles in category /Unix]
Controlling the KDE screen locking works now
Yesterday
I wrote about how I was trying to control the KDE screenlocker's timeout from a shell script
and all the fun stuff I learned along the way. Then after I published
the article I discovered that my solution didn't work. But today I
fixed it and it does work.
What didn't work
I had written this script:
timeout=${1:-3600}
perl -i -lpe 's/^Enabled=.*/Enabled=False/' $HOME/.kde/share/config/kscreensaverrc
qdbus org.freedesktop.ScreenSaver /MainApplication reparseConfiguration
sleep $timeout
perl -i -lpe 's/^Enabled=.*/Enabled=True/' $HOME/.kde/share/config/kscreensaverrc
qdbus org.freedesktop.ScreenSaver /MainApplication reparseConfiguration
The strategy was: use perl to rewrite the screen locker's
configuration file, and then use qdbus to send a D-Bus message to
the screen locker to order it to load the updated configuration.
This didn't work. The System Settings app would see the changed
configuration, and report what I expected, but the screen saver itself
was still behaving according to the old configuration. Maybe the
qdbus command was wrong or maybe the whole theory was bad.
More strace
For want of anything else to do (when all you have is a hammer…), I
went back to using strace to see what else I could dig up, and tried
strace -ff -o /tmp/ss/s /usr/bin/systemsettings
which tells strace to write separate files for each process or
thread.
I had a fantasy that by splitting the trace for each process into a
separate file, I might solve the mysterious problem of the missing
string data. This didn't come true, unfortunately.
I then ran tail -f on each of the output files, and used
systemsettings to update the screen locker configuration, looking to
see which of the trace files changed. I didn't get too much out
of this. A great deal of the trace was concerned with X protocol
traffic between the application and the display server. But I did
notice this portion, which I found extremely suggestive, even with the
filenames missing:
3106 open(0x2bb57a8, O_RDWR|O_CREAT|O_CLOEXEC, 0666) = 18
3106 fcntl(18, F_SETFD, FD_CLOEXEC) = 0
3106 chmod(0x2bb57a8, 0600) = 0
3106 fstat(18, {...}) = 0
3106 write(18, 0x2bb5838, 178) = 178
3106 fstat(18, {...}) = 0
3106 close(18) = 0
3106 rename(0x2bb5578, 0x2bb4e48) = 0
3106 unlink(0x2b82848) = 0
You may recall that my theory was that when I click the “Apply” button
in System Settings, it writes out a new version of
$HOME/.kde/share/config/kscreensaverrc and then orders the screen
locker to reload the configuration. Even with no filenames, this part
of the trace looked to me like the replacement of the configuration
file: a new file is created, then written, then closed, and then the
rename replaces the old file with the new one. If I had been
thinking about it a little harder, I might have thought to check if
the return value of the write call, 178 bytes, matched the length of
the file. (It does.) The unlink at the end is deleting the
semaphore file that System Settings created to prevent a second
process from trying to update the same file at the same time.
Supposing that this was the trace of the configuration update, the
next section should be the secret sauce that tells the screen locker
to look at the new configuration file. It looked like this:
3106 sendmsg(5, 0x7ffcf37e53b0, MSG_NOSIGNAL) = 168
3106 poll([?] 0x7ffcf37e5490, 1, 25000) = 1
3106 recvmsg(5, 0x7ffcf37e5390, MSG_CMSG_CLOEXEC) = 90
3106 recvmsg(5, 0x7ffcf37e5390, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
3106 sendmsg(5, 0x7ffcf37e5770, MSG_NOSIGNAL) = 278
3106 sendmsg(5, 0x7ffcf37e5740, MSG_NOSIGNAL) = 128
There is very little to go on here, but none of it is inconsistent
with the theory that this is the secret sauce, or even with the more
advanced theory that it is the secret sauce and that the secret sauce
is a D-Bus request. But without seeing the contents of the messages,
I seemed to be at a dead end.
Thrashing
Browsing random pages about the KDE screen locker, I learned that the
lock screen configuration component could be run separately from the
rest of System Settings. You use
kcmshell4 --list
to get a list of available components, and then
kcmshell4 screensaver
to run the screensaver component. I started running strace on this
command instead of on the entire System Settings app, with the idea
that if nothing else, the trace would be smaller and perhaps simpler,
and for some reason the missing strings appeared. That suggestive
block of code above turned out to be updating the configuration file, just
as I had suspected:
open("/home/mjd/.kde/share/config/kscreensaverrcQ13893.new", O_RDWR|O_CREAT|O_CLOEXEC, 0666) = 19
fcntl(19, F_SETFD, FD_CLOEXEC) = 0
chmod("/home/mjd/.kde/share/config/kscreensaverrcQ13893.new", 0600) = 0
fstat(19, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
write(19, "[ScreenSaver]\nActionBottomLeft=0\nActionBottomRight=0\nActionTopLeft=0\nActionTopRight=2\nEnabled=true\nLegacySaverEnabled=false\nPlasmaEnabled=false\nSaver=krandom.desktop\nTimeout=60\n", 177) = 177
fstat(19, {st_mode=S_IFREG|0600, st_size=177, ...}) = 0
close(19) = 0
rename("/home/mjd/.kde/share/config/kscreensaverrcQ13893.new", "/home/mjd/.kde/share/config/kscreensaverrc") = 0
unlink("/home/mjd/.kde/share/config/kscreensaverrc.lock") = 0
And the following secret sauce was revealed as:
sendmsg(7, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\30\0\0\0\v\0\0\0\177\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\f\0\0\0GetNameOwner\0\0\0\0\10\1g\0\1s\0\0", 144}, {"\23\0\0\0org.kde.screensaver\0", 24}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 168
sendmsg(7, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\1\1\206\0\0\0\f\0\0\0\177\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\10\0\0\0AddMatch\0\0\0\0\0\0\0\0\10\1g\0\1s\0\0", 144}, {"\201\0\0\0type='signal',sender='org.freedesktop.DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',arg0='org.kde.screensaver'\0", 134}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 278
sendmsg(7, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\0\0\0\0\r\0\0\0j\0\0\0\1\1o\0\f\0\0\0/ScreenSaver\0\0\0\0\6\1s\0\23\0\0\0org.kde.screensaver\0\0\0\0\0\2\1s\0\23\0\0\0org.kde.screensaver\0\0\0\0\0\3\1s\0\t\0\0\0configure\0\0\0\0\0\0\0", 128}, {"", 0}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 128
sendmsg(7, {msg_name(0)=NULL,
msg_iov(2)=[{"l\1\1\1\206\0\0\0\16\0\0\0\177\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\v\0\0\0RemoveMatch\0\0\0\0\0\10\1g\0\1s\0\0",
144},
{"\201\0\0\0type='signal',sender='org.freedesktop.DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',arg0='org.kde.screensaver'\0",
134}]
(I had to give strace the -s 256 flag to tell it not to
truncate the string data to 32 characters.)
Binary gibberish
A lot of this is illegible, but it is clear, from the frequent
mentions of DBus , and from the names of D-Bus objects and methods,
that these are D-Bus requests, as theorized. Much of it is binary
gibberish that we can only read if we understand the D-Bus line
protocol, but the object and method names are visible. For example,
consider this long string:
interface='org.freedesktop.DBus',member='NameOwnerChanged',arg0='org.kde.screensaver'
With qdbus I could confirm that there was a service named
org.freedesktop.DBus with an object named / that supported a
NameOwnerChanged method which expected three QString arguments.
Presumably the first of these was org.kde.screensaver and the others
are hiding in the other 134 characters that strace didn't expand.
So I may not understand the whole thing, but I could see that I was on
the right track.
That third line was the key:
sendmsg(7, {msg_name(0)=NULL,
msg_iov(2)=[{"… /ScreenSaver … org.kde.screensaver … org.kde.screensaver … configure …", 128}, {"", 0}],
msg_controllen=0,
msg_flags=0},
MSG_NOSIGNAL) = 128
Huh, it seems to be asking the screensaver to configure itself. Just
like I thought it should. But there was no configure method, so what
does that configure refer to, and how can I do the same thing?
But org.kde.screensaver was not quite the same path I had been using
to talk to the screen locker—I had been using
org.freedesktop.ScreenSaver , so I had qdbus list the methods at
this new path, and there was a configure method.
When I tested
qdbus org.kde.screensaver /ScreenSaver configure
I found that this made the screen locker take note of the updated
configuration. So, problem solved!
(As far as I can tell, org.kde.screensaver and
org.freedesktop.ScreenSaver are completely identical. They each
have a configure method, but I had overlooked it—several times in a
row—earlier when I had gone over the method catalog for
org.freedesktop.ScreenSaver .)
The working script is almost identical to what I had yesterday:
timeout=${1:-3600}
perl -i -lpe 's/^Enabled=.*/Enabled=False/' $HOME/.kde/share/config/kscreensaverrc
qdbus org.freedesktop.ScreenSaver /ScreenSaver configure
sleep $timeout
perl -i -lpe 's/^Enabled=.*/Enabled=True/' $HOME/.kde/share/config/kscreensaverrc
qdbus org.freedesktop.ScreenSaver /ScreenSaver configure
That's not a bad way to fail, as failures go: I had a correct idea
about what was going on, my plan about how to solve my problem would
have worked, but I was tripped up by a trivium; I was calling
MainApplication.reparseConfiguration when I should have been calling
ScreenSaver.configure .
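One weakness of the script as written is that if it is interrupted
during the sleep, the locker stays disabled, which is the very thing it
was supposed to prevent. Here is a sketch of a more defensive variant.
The helper names are made up, it uses sed and a temporary file where
the original uses perl -i , and it sends the configure message to
org.kde.screensaver , the name tested above:

```shell
rc="$HOME/.kde/share/config/kscreensaverrc"

# set_enabled FILE VALUE
# Rewrite the Enabled= line of FILE to VALUE (True or False).
set_enabled() {
    tmp=$(mktemp) || return 1
    sed "s/^Enabled=.*/Enabled=$2/" "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Ask the running screen locker to reread its configuration.
reload_locker() {
    qdbus org.kde.screensaver /ScreenSaver configure
}

# pause_locker [SECONDS]
# Disable the locker for SECONDS (default 3600), and re-enable it when
# the script exits, even if it is interrupted first.
pause_locker() {
    trap 'set_enabled "$rc" True; reload_locker' EXIT
    trap 'exit 1' INT TERM
    set_enabled "$rc" False && reload_locker
    sleep "${1:-3600}"
}
```

To use it as a script, end the file with pause_locker "$@" . The traps
are installed for the whole shell, so this is meant to be run as a
script and not sourced into an interactive session.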
What if I hadn't been able to get strace to disgorge the internals
of the D-Bus messages? I think I would have gotten the answer anyway.
One way to have gotten there would have been to notice the configure
method documented in the method catalog printed out by qdbus . I
certainly looked at these catalogs enough times, and they are not very
large. I don't know why I never noticed it on my own. But I might
also have had the idea of spying on the network traffic through the
D-Bus socket, which is under /tmp somewhere.
I was also starting to tinker with dbus-send , which is like qdbus
but more powerful, and can post signals, which I think qdbus can't
do, and with gdbus , another D-Bus introspector. I would have kept
getting more familiar with these tools and this would have led
somewhere useful.
Or had I taken just a little longer to solve this, I would have
followed up on Sumana Harihareswara’s suggestion to look at
Bustle, which is
a utility that logs and traces D-Bus requests. It would certainly
have solved my problem, because it makes perfectly clear that clicking
that apply button invoked the configure method.
I still wish I knew why strace hadn't been able to print out those
strings, though.
[Other articles in category /Unix]
Controlling KDE screen locking from a shell script
Lately I've started watching stuff on Netflix. Every time I do this,
the screen locker kicks in sixty seconds in, and I have to unlock it,
pause the video, and adjust the system settings to turn off the
automatic screen locker. I can live with this.
But when the show is over, I often forget to re-enable the automatic
screen locker, and that I can't live with. So I wanted to write a
shell script:
#!/bin/sh
auto-screen-locker disable
sleep 3600
auto-screen-locker enable
Then I'll run the script in the background before I start watching, or
at least after the first time I unlock the screen, and if I forget to
re-enable the automatic locker, the script will do it for me.
The question is: how to write auto-screen-locker ?
strace
My first idea was: maybe there is actually an auto-screen-locker
command, or a system-settings command, or something like that, which
was being run by the System Settings app when I adjusted the screen
locker from System Settings, and all I needed to do was to find out
what that command was and to run it myself.
So I tried running System Settings under strace -f and then looking
at the trace to see if it was exec ing anything suggestive.
It wasn't, and the trace was 93,000 lines long and frightening. Halfway
through, it stopped recording filenames and started recording their
string addresses instead, which meant I could see a lot of calls to
execve but not what was being execed. I got sidetracked trying to
understand why this had happened, and I never did figure it
out—something to do with a call to clone , which is like fork , but
different in a way I might understand once I read the man page.
The first thing the cloned process did was to call set_robust_list ,
which I had never heard of, and when I looked for its man page I found
to my surprise that there was one. It begins:
NAME
get_robust_list, set_robust_list - get/set list of robust futexes
And then I felt like an ass because, of course, everyone knows all
about the robust futex list, duh, how silly of me to have forgotten ha
ha just kidding WTF is a futex? Are the robust kind better than
regular wimpy futexes?
It turns out that Ingo Molnár wrote a lovely explanation of robust
futexes
which are actually very interesting. In all seriousness, do check it
out.
I seem to have digressed. This whole section can be summarized in
one sentence:
strace was no help and took me a long way down a wacky rabbit hole.
Sorry, Julia!
Stack Exchange
The next thing I tried was a Google search for kde screen locker . The
second or third link I followed was to this StackExchange question,
“What is the screen locking mechanism under
KDE?”
It wasn't exactly what I was looking for but it was suggestive and
pointed me in the right direction. The crucial point in the answer
was a mention of
qdbus org.freedesktop.ScreenSaver /ScreenSaver Lock
When I saw this, it was like a new section of my brain coming on line.
So many things that had been obscure suddenly became clear. Things I
had wondered for years. Things like “What are these horrible
Object::connect: No such signal org::freedesktop::UPower::DeviceAdded(QDBusObjectPath)
messages that KDE apps are always spewing into my terminal?” But now
the light was on.
KDE is built atop a toolkit called Qt, and Qt provides an interprocess
communication mechanism called “D-Bus”. The qdbus command, which I
had not seen before, is apparently for sending queries and commands on
the D-Bus. The arguments identify the recipient and the message you
are sending. If you know the secret name of the correct demon, and
you send it the correct secret command, it will do your bidding. (The
mystery message above probably has something to do with the app
using an invalid secret name as a D-Bus address.)
Often these sorts of address hierarchies work well in theory and then
fail utterly because there is no way to learn the secret names. The X
Window System has always had a feature called “resources” by which
almost every aspect of every application can be individually
customized. If you are running xweasel and want just the frame of
just the error panel of just the output window to be teal blue, you
can do that… if you can find out the secret names of the xweasel
program, its output window, its error panel, and its frame. Then you
combine these into a secret X resource name, incant a certain command
to load the new resource setting into the X server, and the next time
you run xweasel the one frame, and only the one frame, will be blue.
In theory these secret names are documented somewhere, maybe. In
practice, they are not documented anywhere. You can only extract them
from the source, and not only from the source of xweasel itself but
from the source of the entire widget toolkit that xweasel is linked
with. Good luck, sucker.
D-Bus has a directory
However! The authors of Qt did not forget to include a directory
mechanism in D-Bus. If you run
qdbus
you get a list of all the addressable services, which you can grep for
suggestive items, including org.freedesktop.ScreenSaver . Then if
you run
qdbus org.freedesktop.ScreenSaver
you get a list of all the objects provided by the
org.freedesktop.ScreenSaver service; there are only seven. So you
pick a likely-seeming one, say /ScreenSaver , and run
qdbus org.freedesktop.ScreenSaver /ScreenSaver
and get a list of all the methods that can be called on this object,
and their argument types and return value types. And you see for
example
method void org.freedesktop.ScreenSaver.Lock()
and say “I wonder if that will lock the screen when I invoke it?” And
then you try it:
qdbus org.freedesktop.ScreenSaver /ScreenSaver Lock
and it does.
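The three-step walk above (list services, list a service's objects,
list an object's methods) is easy to automate. Here is a sketch that
searches every object of one service for methods matching a pattern.
The function name and the QDBUS override variable are made up, and it
assumes qdbus prints one object path per line, as it did in the
listings described here:

```shell
# Name of the qdbus executable; overridable so the sketch can be tried
# with a stand-in command.
QDBUS=${QDBUS:-qdbus}

# find_methods SERVICE PATTERN
# Walk every object of SERVICE and print "OBJECT: METHOD" for each
# method description that matches PATTERN, ignoring case.
find_methods() {
    "$QDBUS" "$1" | while IFS= read -r obj; do
        "$QDBUS" "$1" "$obj" </dev/null | grep -i -- "$2" |
            while IFS= read -r method; do
                printf '%s: %s\n' "$obj" "$method"
            done
    done
}
```

For example, find_methods org.freedesktop.ScreenSaver lock would list
every method with "lock" in its description, along with the object
that provides it.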
That was the most important thing I learned today, that I can go
wandering around in the qdbus hierarchy looking for treasure. I
don't yet know exactly what I'll find, but I bet there's a lot of good stuff.
When I was first learning Unix I used to wander around in the
filesystem looking at all the files, and I learned a lot that way
also.
“Hey, look at all the stuff in /etc ! Huh, I wonder what's in
/etc/passwd ?”
“Hey, /etc/protocols has a catalog of protocol numbers. I wonder
what that's for?”
“Hey, there are a bunch of files in /usr/spool/mail named after
users and the one with my name has my mail in it!”
“Hey, the manuals are all under /usr/man . I could grep them!”
Later I learned (by browsing in /usr/man/man7 ) that there was a
hier(7) man page that listed points of interest, including some I
had overlooked.
The right secret names
Everything after this point was pure fun of the “what happens if I
turn this knob” variety. I tinkered around with the /ScreenSaver
methods a bit (there are twenty) but none of them seemed to be quite
what I wanted. There is a
method uint Inhibit(QString application_name, QString reason_for_inhibit)
method which someone should be calling, because that's evidently
what you call if you are a program playing a video and you want to
inhibit the screen locker. But the unknown someone was delinquent and
it wasn't what I needed for this problem.
Then I moved on to the /MainApplication object and found
method void org.kde.KApplication.reparseConfiguration()
which wasn't quite what I was looking for either, but it might do: I
could perhaps modify the configuration and then invoke this method. I
dimly remembered that KDE keeps configuration files under
$HOME/.kde , so I ls -la -ed that and quickly found
share/config/kscreensaverrc , which looked plausible from the
outside, and more plausible when I saw what was in it:
Enabled=True
Timeout=60
among other things. I hand-edited the file to change the 60 to
243 , ran
qdbus org.freedesktop.ScreenSaver /MainApplication reparseConfiguration
and then opened up the System Settings app. Sure enough, the System
Settings app now reported that the lock timeout setting was “4
minutes”. And changing Enabled=True to Enabled=False and back
made the System Settings app report that the locker was enabled or
disabled.
The answer
So the script I wanted turned out to be:
timeout=${1:-3600}
perl -i -lpe 's/^Enabled=.*/Enabled=False/' $HOME/.kde/share/config/kscreensaverrc
qdbus org.freedesktop.ScreenSaver /MainApplication reparseConfiguration
sleep $timeout
perl -i -lpe 's/^Enabled=.*/Enabled=True/' $HOME/.kde/share/config/kscreensaverrc
qdbus org.freedesktop.ScreenSaver /MainApplication reparseConfiguration
Problem solved, but as so often happens, the journey was more
important than the destination.
I am greatly looking forward to exploring the D-Bus hierarchy and
sending all sorts of inappropriate messages to the wrong objects.
Just before he gets his ass kicked by Saruman, that insufferable
know-it-all Gandalf says “He who breaks a thing to find out what it is
has left the path of wisdom.” If I had been Saruman, I would have
kicked his ass at that point too.
Addendum
Right after I posted this, I started watching Netflix. The screen
locker cut in after sixty seconds. “Aha!” I said. “I'll run my new
script!”
I did, and went back to watching. Sixty seconds later, the screen
locker cut in again. My script doesn't work! The System Settings app
says the locker has been disabled, but it's mistaken. Probably it's only
reporting the contents of the configuration file that I edited, and
the secret sauce is still missing. The System Settings app does
something to update the state of the locker when I click that “Apply”
button, and I thought that my qdbus command was doing the same
thing, but it seems that it isn't.
I'll figure this out, but maybe not today. Good night all!
[ Addendum 20160728: I figured it out the next day ]
[ Addendum 20160729: It has come to my attention that there is
actually a program called xweasel . ]
[Other articles in category /Unix]
Another use for strace (isatty)
(This is a followup to an earlier article describing an interesting use of strace .)
A while back I was writing a talk about Unix internals and I wanted to
discuss how the ls command does a different display when talking to
a terminal than otherwise:
ls to a terminal
ls not to a terminal
How does ls know when it is talking to a terminal? I expect that it
uses the standard POSIX function isatty . But how does isatty find
out?
I had written down my guess. Had I been programming in C, without
isatty , I would have written something like this:
@statinfo = stat STDOUT;
if ( ($statinfo[2] & 0060000) == 0020000
&& ($statinfo[6] & 0xff) == 5) { say "Terminal" }
else { say "Not a terminal" }
(This is Perl, written as if it were C.) It uses fstat (exposed in
Perl as stat ) to get the mode bits ($statinfo[2] ) of the inode
attached to STDOUT , and then it masks out the bits that determine if
the inode is a character device file. If so, $statinfo[6] is the
major and minor device numbers; if the major number (low byte) is
equal to the magic number 5, the device is a terminal device. On my
current computers the magic number is actually 136. Obviously this
magic number is nonportable. You may hear people claim that those bit
operations are also nonportable. I believe that claim is mistaken.
The analogous code using isatty is:
use POSIX 'isatty';
if (isatty(STDOUT)) { say "Terminal" }
else { say "Not a terminal" }
Is isatty doing what I wrote above? Or something else?
Let's use strace to find out. Here's our test script:
% perl -MPOSIX=isatty -le 'print STDERR isatty(STDOUT) ? "terminal" : "nonterminal"'
terminal
% perl -MPOSIX=isatty -le 'print STDERR isatty(STDOUT) ? "terminal" : "nonterminal"' > /dev/null
nonterminal
Now we use strace :
% strace -o /tmp/isatty perl -MPOSIX=isatty -le 'print STDERR isatty(STDOUT) ? "terminal" : "nonterminal"' > /dev/null
nonterminal
% less /tmp/isatty
We expect to see a long startup as Perl gets loaded and initialized,
then whatever isatty is doing, the write of nonterminal , and then
a short teardown, so we start searching at the end and quickly
discover, a couple of screens up:
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7ffea6840a58) = -1 ENOTTY (Inappropriate ioctl for device)
write(2, "nonterminal", 11) = 11
write(2, "\n", 1) = 1
My guess about fstat was totally wrong! The actual method is that
isatty makes an ioctl call; this is a device-driver-specific
command. The TCGETS parameter says what the command is, in this case
“get the terminal configuration”. If you do this on a non-device, or
a non-terminal device, the call fails with the error ENOTTY . When
the ioctl call fails, you know you don't have a terminal. If you do
have a terminal, the TCGETS command has no effect, because it is a
passive read of the terminal state. Here's the successful call:
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
write(2, "terminal", 8) = 8
write(2, "\n", 1) = 1
The B38400 opost… stuff is the terminal configuration; 38400 is the baud rate.
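The shell exposes the same question through test -t , which POSIX
defines as true when the given file descriptor is open and associated
with a terminal. On typical systems I would expect it to go through
isatty , though I have not traced it. A small sketch, with a made-up
function name:

```shell
# fd_kind [FD]
# Report whether file descriptor FD (default 1, standard output) is
# attached to a terminal.
fd_kind() {
    if [ -t "${1:-1}" ]; then
        echo terminal
    else
        echo nonterminal
    fi
}
```

Running fd_kind 1 | cat prints nonterminal, because standard output is
then a pipe. Running the function under strace should show whether the
shell makes the same ioctl call that Perl did.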
(In the past the explanatory text for ENOTTY was the mystifying “Not
a typewriter”, even more mystifying because it tended to pop up when
you didn't expect it. Apparently Linux has revised the message to the
possibly less mystifying “Inappropriate ioctl for device”.)
(SNDCTL_TMR_TIMEBASE is mentioned because apparently someone decided
to give their SNDCTL_TMR_TIMEBASE operation, whatever that is, the
same numeric code as TCGETS , and strace isn't sure which one is
being requested. It's possible that if we figured out which device was
expecting SNDCTL_TMR_TIMEBASE , and redirected standard output to
that device, isatty would erroneously claim that it was a
terminal.)
[ Addendum 20150415: Paul Bolle has found that
SNDCTL_TMR_TIMEBASE pertains to the old and possibly deprecated OSS
(Open Sound System).
It is conceivable that isatty would yield the wrong answer when
pointed at the OSS /dev/dsp or /dev/audio device or similar. If
anyone is running OSS and willing to give it a try, please contact me at mjd@plover.com . ]
[ Addendum 20191201: Thanks to Hacker News user
jwilk for pointing
out that strace is
now able to distinguish TCGETS from SNDCTL_TMR_TIMEBASE . ]
[Other articles in category /Unix]
permanent link
Another use for strace (groff)
The marvelous Julia Evans is always looking for ways to express her
love of strace and now has written a zine about
it. I don't use
strace that often (not as often as I should, perhaps) but every once
in a while a problem comes up for which it's not only just the right
thing to use but the only thing to use. This was one of those
times.
I sometimes use the ancient Unix drawing language
pic . Pic has many
good features, but is unfortunately coupled too closely to the Roff
family of formatters (troff , nroff , and the GNU project version,
groff ). It only produces Roff output, and not anything more
generally useful like SVG or even a bitmap. I need raw images to
inline into my HTML pages. In the past I have produced these with a
jury-rigged pipeline of groff , to produce PostScript, and then GNU
Ghostscript (gs ) to translate the PostScript to a PPM
bitmap, some PPM utilities to crop and
scale the result, and finally ppmtogif or whatever. This has some
drawbacks. For example, gs requires that I set a paper size, and
its largest paper size is A0. This means that large drawings go off
the edge of the “paper” and gs discards the out-of-bounds portions.
So yesterday I looked into eliminating gs . Specifically I wanted to
see if I could get groff to produce the bitmap directly.
GNU groff has a -T device option that specifies the "output"
device; some choices are -Tps for PostScript output and -Tpdf for
PDF output. So I thought perhaps there would be a -Tppm or
something like that. A search of the manual did not suggest anything
so useful, but did mention -TX100 , which had something to do with
100-DPI X window system graphics. But when I tried this groff only said:
groff: can't find `DESC' file
groff:fatal error: invalid device `X100`
The groff -h command said only -Tdev use device dev . So what
devices are actually available?
strace to the rescue! I did:
% strace -o /tmp/gr groff -Tfpuzhpx
and then a search for fpuzhpx in the output file tells me exactly
where groff is searching for device definitions:
% grep fpuzhpx /tmp/gr
execve("/usr/bin/groff", ["groff", "-Tfpuzhpx"], [/* 80 vars */]) = 0
open("/usr/share/groff/site-font/devfpuzhpx/DESC", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/groff/1.22.2/font/devfpuzhpx/DESC", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/font/devfpuzhpx/DESC", O_RDONLY) = -1 ENOENT (No such file or directory)
I could then examine those three directories to see if they existed,
and if so find out what was in them.
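The strace output suggests that a device named NAME is usable when some search directory contains devNAME/DESC. Under that assumption, a small loop can list the available devices. This is only a sketch: list_groff_devices is a made-up helper, not part of groff, and the directories to search are whatever strace reports on your system.

```sh
# list_groff_devices DIR...
# Hypothetical helper: print NAME for every DIR/devNAME/DESC that exists.
list_groff_devices() {
    for dir in "$@"; do
        for desc in "$dir"/dev*/DESC; do
            [ -f "$desc" ] || continue   # unmatched glob or missing directory
            device=${desc%/DESC}         # strip the trailing /DESC
            device=${device##*/}         # keep the last component: devNAME
            printf '%s\n' "${device#dev}"
        done
    done
}
```

With the three directories from the trace above, the call would be list_groff_devices /usr/share/groff/site-font /usr/share/groff/1.22.2/font /usr/lib/font.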
Without strace here, I would be reduced to groveling over the
source, which in this case is likely to mean trawling through the
autoconf output, and that is something that nobody wants to do.
[ Addendum 20150421: another article about strace . ]
[ Addendum 20150424: I did figure out how to prevent gs from
cropping my output. You can use the flag -P-p48i,48i to groff to
set the page size to 48 inches (48i ) by 48 inches. The -P flag
passes -p48i,48i to grops , and the resulting PostScript file contains
%%DocumentMedia: Default 3456 3456 0 () ()
(3456 points is 48 inches at 72 points per inch),
which instructs gs to pretend the paper size is that big. If it's
not big enough, increase 48i to 120i or whatever. ]
It came from... the HOLD SPACE
Since 2002, I've given a talk almost every December for the Philadelphia Linux Users'
Group. It seems like most of their talks are about the newest and
best developments in Linux applications, which is a topic I don't know
much about. So I've usually gone the other way, talking about the
oldest and worst stuff. I gave a couple of pretty good talks about
how files work, for example, and what's in the inode structure.
I recently posted about my work on Zach Holman's
spark program, which culminated in a ridiculous workaround
for the shell's lack of fractional arithmetic.
That work inspired me to do a talk about all the awful
crap we had to deal with before we had Perl. (And the other 'P'
languages that occupy a similar solution space.) Complete materials are
here. I hope you check them out, because I think they are fun.
This post is a bunch of miscellaneous notes about the talk.
One example of awful crap we had to deal with before Perl etc. were
invented was that some
people used to write 'sed scripts', although I am really not sure how
they did it. I tried once, without much success, and then for this
talk I tried again, and again did not have much success.
"The hold space" is a sed-ism. The basic model of sed is that it
reads the next line of data into the 'pattern space', then applies a
bunch of transformations to it, and then prints it out. If you need
to save this line for later examination, or for emitting later on
instead, you can hold it in the 'hold space'. Use of the hold space
is what distinguishes sed experts from mere sed nobodies like me. So
I planned to talk about the hold space, and then I got the happy idea
to analogize the Hold Space to the Twilight Zone, or maybe the Phantom
Zone, a place where you stick naughty data when you don't want it to
escape. I never feel like audiences appreciate the work I put into
this sort of thing; when I'm giving the talk it always sounds too much
like a private joke. Explaining it just feels like everyone is sitting
through my explanation of a private joke.
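For a taste of what the hold space is for, here is one classic idiom: printing the input with its lines in reverse order. It is a sketch for illustration only, not one of the scripts from the talk, and the function name reverse_lines is made up.

```sh
# reverse_lines: print standard input with its lines in reverse order.
#   1!G  on every line except the first, append the hold space to the
#        pattern space (current line, newline, everything seen so far)
#   h    copy the pattern space back into the hold space
#   $p   on the last line, print the accumulated pattern space
reverse_lines() {
    sed -n '1!G;h;$p'
}
```

The hold space acts as the only variable the script has: each line is stacked on top of the lines saved before it, and nothing is printed until the end.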
The little guy to the right is known
as hallucigenia. It is a creature so peculiar that when
the paleontologists first saw the fossils, they could not even agree
on which side was uppermost. It has nothing to do with Unix, but I
put it on the slide to illustrate "alien horrors from the dawn of
time".
Between slides 9 and 10 (about the ed line editor) I did a
quick demo of editing with ed. You will just have to imagine
this. I first learned to program with a line editor like ed,
on a teletypewriter just like the one on slide 8.
Modern editors are much better. But it used to
be that Unix sysadmins were expected to know at least a little ed,
because if your system got into some horrible state where it couldn't
mount the /usr partition, you wouldn't be able to run
/usr/bin/vi or /usr/local/bin/emacs, but you would
still be able to use /bin/ed to fix /etc/fstab or
whatever else was broken. Knowing ed saved my bacon several
times.
(Speaking of
teletypewriters, ours had an attachment for punching paper tape, which
you can see on the left side of the picture. The punched chads fell
into a plastic chad box (which is missing in the picture), and when I
was about three I spilled the chad box. Chad was everywhere, and it
was nearly impossible to pick up. There were still chads stuck
in the cracks in the floorboards when we moved out three years
later. That's why, when the contested election of 2000 came around, I
was one of the few people in North America who was not bemused to
learn that there was a name for the little punched-out bits.)
Anyway, back to ed. ed has one and only one diagnostic: if you do
something it doesn't like, it prints ?. This explains the
ancient joke on slide
10, which first appeared circa 1982 in the 4.2BSD fortune
program.
I really wanted to present a tour de force of sed mastery,
but as slides
24–26 say, I was not clever enough. I tried really hard and just
could not do it. If anyone wants to fix my not-quite-good-enough
sed script, I will be quite grateful.
On slide
28 I called awk a monster. This was a slip-up;
awk is not a monster and that is why it does not otherwise
appear in this talk. There is nothing really wrong with awk,
other than being a little old, a little tired, and a little
underpowered.
If you are interested in the details of the classify program,
described on slide 29,
the sources are still available from the
comp.sources.unix archive. People often say "Why don't
you just use diff for that?" so I may as well answer that
here: You use diff if you have two files and you want to see
how they differ. You use classify if you have 59 files, of
which 36 are identical, 17 more are also identical to each other but
different from the first 36, and the remaining 6 are all weirdos, and
you want to know which is which. These days you would probably just
use md5sum FILES | accumulate, and in
hindsight that's probably how I should have implemented
classify. We didn't have md5sum but we had
something like it, or I could have made a checksum program. The
accumulate utility is trivial.
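The md5sum approach can be sketched in a few lines of shell. This is an illustration under assumptions, not the classify program or the accumulate utility: the function name classify_by_checksum is made up, an awk script stands in for the accumulate step, and filenames containing newlines or backslashes are not handled, because md5sum escapes them in its output.

```sh
# classify_by_checksum FILE...
# Print one line per distinct file content: the MD5 checksum followed by
# the names of all the files that have that checksum.
classify_by_checksum() {
    md5sum -- "$@" | LC_ALL=C sort | awk '
        {
            sum  = $1               # 32 hex digits
            name = substr($0, 35)   # skip the digest and the 2-character separator
        }
        sum != prev {               # first file of a new group
            if (NR > 1) printf "\n"
            printf "%s", sum
            prev = sum
        }
        { printf " %s", name }
        END { if (NR > 0) printf "\n" }'
}
```

Sorting puts equal checksums next to each other, so the awk script only has to notice when the checksum changes. With the 59 files from the example, the output would have eight lines: two long ones and six weirdos.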
Several people have asked me to clarify my claim to have invented
netcat. It seems that a similar program with the same name is
attributed to someone called "Hobbit". Here is the clarification: In
1991 I wrote a program with the functionality I described and called
it "netcat". You would run
netcat hostname port and it would
open a network socket to the indicated address, and transfer data from
standard input into the socket, and data from the socket to standard
output. I still have the source code; the copyright notice at the top
says "21 October 1991". Wikipedia says that the
same-named program by the other guy was released on 20 March 1996. I
do not claim that the other guy stole it from me, got the idea from
me, or ever heard of my version. I do not claim to be the first or
only person to have invented this program. I only claim to have
invented mine independently.
My own current version of the spark program is on GitHub, but I think
Zach Holman's current
version is probably simpler and better now.
[ Addendum 20170325: I have revised this talk a couple of times since this blog article was written. Links to particular slides go to the 2011 versions, but the current version is from 2017. There are only minor changes. For example, I removed awk from the list of “monsters”. ]
Where should usage messages go?
Last week John Speno complained about Unix commands which, when
used incorrectly, print usage messages to standard error instead of to
standard output. The problem here is that if the usage message is
long, it might scroll off the screen, and it's a pain when you try to
pipe it through a pager with command | pager and discover
that the usage output has gone to stderr, missed the pager, and
scrolled off the screen anyway.
Countervailing against this, though, is the usual argument for stderr:
if you had run the command in a pipeline, and it wrote its error
output to stdout instead of to stderr, then the error message would
have gotten lost, and would possibly have caused havoc further down
the pipeline. I considered this argument to be the controlling one,
but I ran a quick and informal survey to see if I was in the
minority.
After 15 people had answered the survey, Ron Echeverri pointed out
that although it makes sense for the usage message to go to stderr
when the command is used erroneously, it also makes sense for it to go
to stdout if the message is specifically requested, say by the
addition of a --help flag, since in that case the message is
not erroneous. So I added a second question to the survey to ask
about where the message should go in such a case.
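The policy M. Echeverri describes is easy to state in code. The sketch below uses a made-up command called frob; the exit status of 2 for misuse is a common convention that is assumed here, not something the survey asked about.

```sh
# "frob" is a hypothetical command used only to illustrate the policy.
usage() {
    echo 'usage: frob [--help] FILE...'
}

frob() {
    case ${1-} in
        --help)
            usage            # requested: ordinary output, on stdout
            return 0 ;;
        '')
            usage >&2        # misuse: a diagnostic, on stderr
            return 2 ;;
    esac
    for file in "$@"; do
        echo "frobbing $file"
    done
}
```

With this arrangement frob --help | pager pages the message as expected, while a mistaken frob in the middle of a pipeline still complains on the terminal instead of feeding its usage text to the next command.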
83 people answered the first question, "When a command is misused,
should it deliver its usage message to standard output or to standard
error?". 62 (75%) agreed that the message should go to stderr; 11 (13%)
said it should go to stdout. 10 indicated that they preferred a more
complicated policy, of which 4 were essentially (or exactly) what
M. Echeverri suggested; this brings the total in favor of stderr
to 66 (80%). The others were:
1. stdout, if it is a tty; stderr otherwise
2. stdout, if it is a pipe; stderr otherwise
3. A very long response that suggested syslog.
4. stderr, unless an empty stdout would cause problems
5. It depends, but the survey omitted the option of printing directly
   on the console
6. It depends
I think #2 must have been trying to articulate #1, but (a) got it
backwards and (b) missed. #3 seemed to be answering a different
question than the one that was asked; syslog may make sense
for general diagnostics, but to use it for usage messages seems
peculiar. #5 also seems strange to me, since my idea of "console" is
the line printer hardwired to the back of the mainframe down in the
machine room; I think the writer might have meant "terminal".
68 people answered the second question, "Where should the command
send the output when the user specifically requests usage
information?". (15 people took the survey before I added this
question.) 50 (74%) said the output should go to stdout, 12 (18%) to
the user's default pager and then to stdout, and 5 (7%) to stderr.
One person (the same as #5 above) said "it depends".
Thanks to everyone who participated.
The "z" command: output filtering
My last few articles
([1]
[2]
[p]
[p-2]) have been about
this z program. The first part of
this article is a summary of that discussion, which you can
skip if you remember it.
The idea of z
is that you can do:
z grep pattern files...
and it does approximately the same as:
zgrep pattern files...
or you could do:
z sed script files...
and it would do the same as:
zsed script files...
if there were a zsed command, although there isn't.
Much of the discussion has concerned a problem with the implementation,
which is that the names of the original compressed files are not
available to the command, due to the legerdemain z must
perform in order to make the uncompressed data available to the
command. The problem is especially apparent with
wc:
% z wc *
411 2611 16988 ctime.blog
71 358 2351 /proc/self/fd/3
121 725 5053 /proc/self/fd/4
51 380 2381 files-talk.blog
48 145 885 find-uniq.pl
288 2159 12829 /proc/self/fd/5
95 665 4337 ssh-agent-revisted.blog
221 941 6733 struct-inode.blog
106 555 3976 sync-2.blog
115 793 4904 sync.blog
124 624 4208 /proc/self/fd/6
1651 9956 64645 total
Here /proc/self/fd/3 and the rest should have been names
ending in .gz, such as env-2.blog.gz.
At the time I wrote the first article, it occurred to me briefly that
it would be possible to have z capture the output of the
command and attempt to translate /proc/self/fd/3 back to
env-2.blog.gz or whatever is appropriate, because although
the subcommand does not know the original filenames, z itself
does. The code would look something like this. Instead of ending by
execing the command, as the original version of z
did:
exec $command, @ARGV;
die "Couldn't run '$command': $!.\n";
this revised version of z, which we might call zz,
would end with the code to translate back to the original
filenames:
open my($out), "-|", $command, @ARGV
or die "Couldn't run '$command': $!.\n";
while (<$out>) {
s{/proc/self/fd/(\d+)}{$old[$1]}g;
print;
}
Here @old is an array that translates from file descriptors
back to the original filename.
At the time, I thought of doing this, and my immediate thought was
"well, that is so obviously a terrible idea that it is not worth even
mentioning", so I left it out. But since then at least five people
have written to me to suggest it, so it appears that it is not
obviously a terrible idea. I had to think a little deeper about why I
thought it was a terrible idea.
Really the question is why I think this is a more terrible idea than
the original z program was in the first place. Because one
could say that z is garbling the output of its command, and
the filtering code in zz is only un-garbling it. But I think
this isn't the right way to look at it.
The output of the command has a certain format, a certain structure.
We don't know ahead of time what that structure is, but it can be
described for any particular command. For instance, the output of
wc is always a sequence of lines where each line has four
whitespace-separated fields, of which the first three are numerals and
the last is a filename, and then a final total line at the end.
Similarly, the output of tar is a file in a complicated
binary format, one which is documented somewhere and which is
intelligible to other instances of the tar command that are
trying to decode it.
The original behavior of z may alter the content of the
command output to some extent, replacing some filenames with others.
But it cannot disrupt the structure or the format of the file, ever.
This is because the output of z tar is the output of
tar, unmodified. The z program tampers with the
arguments it gives to tar, but having done that it runs
tar and lets tar do what it wants, and tar
then must produce a tar-format output, possibly not the one
it would have normally produced—the content might be a little
different—but a properly-formatted one for sure. In particular, any
program written to deal properly with the output of tar will
still work with the output of z tar. The output might not
have the same meaning, but we can say very particularly what the
extent of the differences might be: if the output mentions filenames,
then some of these might have changed from the true filenames to
filenames of the form /proc/self/fd/37.
With zz, we cannot make any such guarantee. The
output of zz tar zc foo.gz, for example, might
be in proper .tar.gz format. But suppose
tar zc foo.gz produces compressed binary output that
just happens to contain the byte sequence 2f70 726f 632f 7365
6c66 2f66 642f 33? (That is, "/proc/self/fd/3".) Then
zz will silently replace these 15 bytes with the six bytes
666f 6f2e 677a (that is, "foo.gz").
What if the original sequence was understood as part of a sequence of 2-byte
integers? The result is not even properly aligned. What if that
initial 2f was a count? The resulting count (66) is much too large.
The result would be utterly garbled and unintelligible to
tar zx. What the tar command will do with a
garbled input is not well-defined: it might dump core, or it might
write out random garbage data, or overwrite essential files in the
filesystem. We are into nasal demon
territory. With the original z, we never get anywhere near
the nasal demons.
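The byte counts are easy to check. The sketch below applies the same kind of substitution with sed to a short sample and measures the result; rewrite_fd_name and byte_count are made-up helper names, and the mapping from /proc/self/fd/3 to foo.gz is hard-wired from the example instead of being looked up the way zz would.

```sh
# rewrite_fd_name: hypothetical stand-in for the substitution zz performs,
# hard-wired to the single example mapping from the text.
rewrite_fd_name() {
    sed 's|/proc/self/fd/3|foo.gz|g'
}

# byte_count: print the number of bytes on standard input.
byte_count() {
    wc -c | tr -d ' '
}

sample='AB/proc/self/fd/3CD'
before=$(printf '%s\n' "$sample" | byte_count)
after=$(printf '%s\n' "$sample" | rewrite_fd_name | byte_count)
echo "before=$before after=$after shrink=$((before - after))"
# prints: before=20 after=11 shrink=9
```

Everything after the replaced name has moved nine bytes closer to the start of the stream, which is exactly the kind of shift a binary format cannot tolerate.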
I suppose the short summary here is that z treats its command
as a black box, while zz pretends to understand what comes
out of it. But zz's understanding is a false pretense.
My experience says that programs should not screw around with things
they don't understand, and this is why I instantly rejected the idea
when I thought of it before.
One correspondent argued that the garbling is very unlikely, and
proposed various techniques to make it even less likely, mostly by
rewriting the input filenames to various long random strings. But I
felt then that this was missing the point, and I still do. He
says it is unlikely, but he doesn't know that it is
unlikely, and indeed the unlikeliness depends on the format of the
output of the command, which is precisely the unknown here. In my
view, the difference between z and zz is that the
changes that z makes are bounded, because you can describe them
briefly, as I did above, and the changes that zz makes are
unbounded, because there is no limit to what could happen as a
result.
On the other hand, this correspondent made a good point that if the
output of zz is not consumed by anything other than human
eyeballs, there may be no real problem. And for some particular
commands, such as wc, there is never any problem at all. So
perhaps it's a good idea to add a command-line option to z to
enable the zz behavior. I did this in my version, and I'm
going to try it out and see how it goes.
Complete modified source code is
available. (Diffs from previous version.)
The "z" command: alternative implementations
In yesterday's article I discussed a
possibly-useful utility program named z, which has a flaw.
To jog your memory, here is a demonstration:
% z grep immediately *
ctime.blog:we want to update. It is immediately copied into a register, and
/proc/self/fd/3:All five people who wrote to me about this immediately said "oh, yes,
/proc/self/fd/5:program continues immediately, possibly posting its message. (It
struct-inode.blog:is a symbolic link, its inode is returned immediately; iname() would
sync.blog:and reports success back to the process immediately, even though the
For a detailed discussion, see the
previous article.
Fixing this flaw seems difficult-to-impossible. As I said earlier,
the trick is to fool the command into reading from a pipe when it
thinks it is opening a file, and this is precisely what
/proc/self/fd is for. But there is an older, even more
widely-implemented Unix feature that does the same thing, namely the
FIFO. So an alternative implementation creates one FIFO for each
compressed file, with a gzip process writing to the FIFO, and
tells the command to read from the FIFO. Since we have some limited
control over the name of the FIFO, we can ameliorate the
missing-filename problem to some extent. Say, for example, we create
the FIFOs in /tmp/PID. Then the broken zgrep
example above might look like this instead:
% z grep immediately *
ctime.blog:we want to update. It is immediately copied into a register, and
/tmp/7516/env-2.blog.gz:All five people who wrote to me about this immediately said "oh, yes,
/tmp/7516/qmail-throttle.blog.gz:program continues immediately, possibly posting its message. (It
struct-inode.blog:is a symbolic link, its inode is returned immediately; iname() would
sync.blog:and reports success back to the process immediately, even though the
The output is an improvement, but the problem is not completely solved, and the
cost is that the process and file management are much more
complicated. In fact, the cost is so high that you have to wonder if
it might not be simpler to replace z with a shell script that
copies the data to a temporary directory, uncompresses the files, and
runs the command on the uncompressed files, perhaps something along
these lines:
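#!/bin/sh
# Copy the files to a scratch directory, uncompress the copies,
# and run the command on them there.
DIR=/tmp/$$
mkdir "$DIR" || exit 1
COMMAND=$1
shift
cp -p "$@" "$DIR" || exit 1
cd "$DIR" || exit 1
gzip -d *
"$COMMAND" *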
This has problems too, but my point is that if you are willing to
accept a crappy, semi-working solution along the lines of the FIFO
one, simpler ones are at hand. You can compare the FIFO version
directly with the shell script, and I think the FIFO version loses.
The z implementation I have is a
solution in a different direction, with different tradeoffs, and so
might be preferable to it in a number of ways.
But as I said, I don't know yet.
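For comparison, the FIFO idea described earlier might be sketched like this. It is a rough illustration and not the z program: the function name z_fifo and its z_ variables are made up, two compressed files with the same basename would collide, and if the command never opens one of the FIFOs the corresponding gzip stays blocked and the final wait never returns.

```sh
# z_fifo COMMAND ARG...
# Hypothetical sketch: every argument ending in .gz is replaced by a FIFO
# of the same basename in a private directory, fed by a background gzip -dc.
z_fifo() {
    z_command=$1
    shift
    z_dir=$(mktemp -d) || return 1
    z_left=$#
    while [ "$z_left" -gt 0 ]; do
        z_arg=$1
        shift
        z_left=$((z_left - 1))
        case $z_arg in
            *.gz)
                z_path=$z_dir/${z_arg##*/}
                mkfifo "$z_path" || return 1
                # Opening the FIFO for writing blocks until COMMAND opens it.
                gzip -dc -- "$z_arg" > "$z_path" &
                set -- "$@" "$z_path" ;;
            *)
                set -- "$@" "$z_arg" ;;
        esac
    done
    "$z_command" "$@"
    z_status=$?
    wait                 # reap the gzip processes
    rm -rf "$z_dir"
    return "$z_status"
}
```

Even this small version needs a temporary directory, background processes, and cleanup, which gives some idea of why the process and file management count as a cost.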
[ Addendum 20080325: Several people suggested a fix that I had
considered so unwise that I didn't even mention it. But after
receiving the suggestion repeatedly, I
wrote an article about it. ]
z-commands
The gzip distribution includes a command called zcat. Its
command-line arguments can include any number of filenames, compressed
or not, and it prints out the contents, uncompressing them on the fly
if necessary. Sometime later a zgrep command appeared, which
was similar but which also performed a grep search.
But for anything else, you either need to uncompress the files, or
build a special tool. I have a utility that scans the web logs of
blog.plover.com, and extracts a report about new referrers.
The historical web logs are normally kept compressed, so I recently
built in support for decompression. This is quite easy in Perl.
Normally one scans a sequence of input files something like this:
while (<>) {
... do something with $_ ...
}
The <> operator implicitly scans all the lines in all the
files named in the command-line arguments, opening a new file each
time the previous one is exhausted.
To decompress the files on the fly, one can preprocess the
command-line arguments:
for (@ARGV) {
if (/\.gz$/) {
$_ = "gzip -dc $_ |";
}
}
while (<>) {
... do something with $_ ...
}
The for loop scans the command-line arguments, replacing each
one that has the form foo.gz with gzip -dc foo.gz |.
Perl's magic open semantics treat filenames specially if they end with
a pipe symbol: a pipe to a command is opened instead. Of course, anyone can
think of half a dozen ways in which this can go wrong. But Larry
Wall's skill in making such tradeoffs has been a large factor in Perl's
success.
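The same decompress-on-the-fly idea can be sketched in shell for programs that read standard input. The helper name cat_maybe_gz is made up for this example. Because each filename is handed to gzip as a single quoted argument instead of being pasted into a command string, names containing spaces survive, which is one of the ways the Perl version above can go wrong.

```sh
# cat_maybe_gz FILE...
# Hypothetical helper: write the contents of each FILE to standard output,
# uncompressing the ones whose names end in .gz.
cat_maybe_gz() {
    for file in "$@"; do
        case $file in
            *.gz) gzip -dc -- "$file" ;;
            *)    cat -- "$file" ;;
        esac
    done
}
```

A log scanner could then be run as cat_maybe_gz LOGFILES | scanner, at the price of the scanner no longer knowing which file each line came from.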
But it bothered me to have to make this kind of change in every
program that wanted to handle compressed files. We have
zcat and zgrep; where are zcut,
zpr, zrev, zwc, zcol,
zbc, zsed, zawk, and so on? Echh.
But after I got to thinking about it, I decided that I could write a
single z utility that would do a lot of the same things.
Instead of this:
zsed -e 's/:.*//' * | ...
where the * matches some files that have .gz
suffixes and some that haven't, one would write:
z sed -e 's/:.*//' * | ...
and it would Just Work. That's the idea, anyway.
If sed were written in Perl, z would have an easy
job. It could rely on Perl's magic open, and simply preprocess the
arguments before running sed:
# hypothetical implementation of z
#
my $command = shift;
for (@ARGV) {
if (/\.gz$/) {
$_ = "gzip -dc $_ |";
}
}
exec $command, @ARGV;
die "Couldn't run command '$command': $!\n";
But sed is not written in Perl, and has no magic open. So I
have to play a trickier trick:
for my $file (@ARGV) {
if ($file =~ /\.gz$/) {
unless (open($fhs[@fhs], "-|", "gzip", "-cd", $file)) {
warn "Couldn't open file '$file': $!; skipping\n";
next;
}
my $fd = fileno $fhs[-1];
$file = "/proc/self/fd/$fd";   # $file aliases the @ARGV element
}
}
# warn "running $command @ARGV\n";
exec $command, @ARGV;
die "Couldn't run command '$command': $!\n";
This is a stripped-down version to illustrate the idea. For various
reasons that I explained
yesterday, it does not actually work.
The
complete, working source code is here.
The idea, as before, is that the program preprocesses the command-line
arguments. But instead of replacing the arguments with pipe commands, which
are not supported by open(2), the program
sets up the pipes itself, and then directs the command to take its
input from the pipes by specifying the appropriate items from
/proc/self/fd.
The trick depends crucially on having /proc/self/fd, or
/dev/fd, or something of the sort, because otherwise there's
no way to trick the command into reading from a pipe when it thinks it
is opening a file. (Actually there is at least one other way,
involving FIFOs, which I plan to discuss tomorrow.) Most modern
systems do have /proc/self/fd. That feature postdates my
earliest involvement with Unix, so it isn't a ready part of my mental
apparatus as perhaps it ought to be. But this utility seems to me
like a sort of canonical application of /proc/self/fd, in the
sense that, if you couldn't think what /proc/self/fd might be
good for, then you could read this example and afterwards have a
pretty clear idea.
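The trick can be tried from the shell without any Perl. In the sketch below gzip writes into a pipe, the pipe is duplicated onto descriptor 3, and wc is told to open /dev/fd/3 as if it were an ordinary file. It assumes a system that provides /dev/fd, and count_lines_gz is a made-up name.

```sh
# count_lines_gz FILE.gz
# Hypothetical demonstration: wc reads the uncompressed data through
# /dev/fd/3, which is really the read end of the pipe from gzip.
count_lines_gz() {
    gzip -dc -- "$1" | wc -l /dev/fd/3 3<&0
}
```

The output names /dev/fd/3 and not the compressed file, because that is the only name wc was ever given.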
The z utility has a number of flaws. Principally, the
original filenames are gone. Here's a typical run with regular
zgrep:
% zgrep immediately *
ctime.blog:we want to update. It is immediately copied into a register, and
env-2.blog.gz:All five people who wrote to me about this immediately said "oh, yes,
qmail-throttle.blog.gz:program continues immediately, possibly posting its message. (It
struct-inode.blog:is a symbolic link, its inode is returned immediately; iname() would
sync.blog:and reports success back to the process immediately, even though the
But here's the same thing with z:
% z grep immediately *
ctime.blog:we want to update. It is immediately copied into a register, and
/proc/self/fd/3:All five people who wrote to me about this immediately said "oh, yes,
/proc/self/fd/5:program continues immediately, possibly posting its message. (It
struct-inode.blog:is a symbolic link, its inode is returned immediately; iname() would
sync.blog:and reports success back to the process immediately, even though the
The problem is even more glaring in the case of commands like
wc:
% z wc *
411 2611 16988 ctime.blog
71 358 2351 /proc/self/fd/3
121 725 5053 /proc/self/fd/4
51 380 2381 files-talk.blog
48 145 885 find-uniq.pl
288 2159 12829 /proc/self/fd/5
95 665 4337 ssh-agent-revisted.blog
221 941 6733 struct-inode.blog
106 555 3976 sync-2.blog
115 793 4904 sync.blog
124 624 4208 /proc/self/fd/6
1651 9956 64645 total
So perhaps z will not turn out to be useful enough to be more
than a curiosity. But I'm not sure yet.
This is article #300 on my blog. Thanks for reading.
[ Addendum 20080322: There is a
followup to this article. ]
[ Addendum 20080325: Another
followup. ]
Throttling qmail
This may well turn out to be another oops. Sometimes when I screw
around with the mail system, it's a big win, and sometimes it's a big
lose. I don't know yet how this will turn out.
Since I moved house, I have all sorts of internet-related problems
that I didn't have before. I used to do business with a small ISP,
and I ran my own web server, my own mail service, and so on. When
something was wrong, or I needed them to do something, I called or
emailed and they did it. Everything was fine.
Since moving, my ISP is Verizon. I have great respect for Verizon as a
provider of telephone services. They have been doing it for over a
hundred years, and they are good at it. Maybe in a hundred years they
will be good at providing computer network services too. Maybe it
will take less than a hundred years. But I'm not as young as I once
was, and whenever that glorious day comes, I don't suppose I'll be around
to see it.
One of the unexpected problems that arose when I switched ISPs was
that Verizon helpfully blocks incoming access to port 80. I had moved
my blog to outside hosting anyway, because the blog was consuming too
much bandwidth, so I moved the other plover.com web services to the
same place. There are still some things that don't work, but I'm
dealing with them as I have time.
Another problem was that a lot of sites now rejected my SMTP
connections. My address was in a different netblock. A Verizon DSL
netblock. Remote SMTP servers assume that anybody who is dumb enough
to sign up with Verizon is also too dumb to run their own MTA. So any
mail coming from a DSL connection in Verizonland must be spam,
probably generated by some Trojan software on some infected Windows
box.
The solution here (short of getting rid of Verizon) is to relay the
mail through Verizon's SMTP relay service. mail.plover.com
sends to outgoing.verizon.net, and
lets outgoing.verizon.net forward the mail to its final
destination. Fine.
But but but.
If my machine sends more than X messages per
Y time, outgoing.verizon.net will assume that
mail.plover.com has been taken over by a Trojan spam
generator, and cut off access. All outgoing mail will be rejected with a
permanent failure.
So what happens if someone sends a message to one of the
500-subscriber email lists that I host here? mail.plover.com
generates 500 outgoing messages, sends the first hundred or so through
Verizon. Then Verizon cuts off my mail service. The mailing list
detects 400 bounce messages, and unsubscribes 400 subscribers. If any
mail comes in for another mailing list before Verizon lifts my ban,
every outgoing message will bounce and every subscriber
will be unsubscribed.
One solution is to get a better mail provider. Lorrie has an
Earthlink account that comes with outbound mail relay service. But
they do the same thing for the same reason. My Dreamhost subscription
comes with an outbound mail relay service. But they do the same thing
for the same reason. My Pobox.com account comes with an
unlimited outbound mail relay service. But they require SASL
authentication. If there's a SASL patch for qmail, I haven't been able
to find it. I could implement it myself, I suppose, but I don't
wanna.
So far there are at least five solutions that are on the "eh, maybe,
if I have to" list:
- Get a non-suck ISP
- Find a better mail relay service
- Hack SASL into qmail and send mail through Pobox.com
- Do some skanky thing with serialmail
- Get rid of qmail in favor of postfix, which presumably supports SASL
(Yeah, I know the Postfix weenies in the audience are shaking their
heads sadly and wondering when the scales will fall from my eyes.
They show up at my door every Sunday morning in their starched white
shirts and their pictures of DJB with horns and a pointy tail...)
It also occurred to me in the shower this morning that the old ISP might be
willing to sell me mail relaying and nothing else, for a small fee.
That might be worth pursuing. It's gotta be easier than turning
qmail-remote into a SASL mail client.
The serialmail thing is worth a couple of sentences, because there's an
autoresponder on the qmail-users mailing-list that replies with "Use serialmail. This is discussed
in the archives." whenever someone says the word "throttle". The serialmail
suite, also written by Daniel J. Bernstein, takes a
maildir-format directory and posts every message in it to some remote
server, one message at a time. Say you want to run qmail on your laptop.
Then you arrange to have qmail deliver all its mail into a maildir, and
then when your laptop is connected to the network, you run serialmail, and it
delivers the mail from the maildir to your mail relay host. serialmail is
good for some throttling problems. You can run serialmail under control of a
daemon that will cut off its network connection after it has written a
certain amount of data, for example. But there seems to be no easy
way to do what I want with serialmail, because it always wants to deliver
all the messages from the maildir, and I want it to deliver
one message.
There have been some people on the qmail-users mailing-list asking for something close to
what I want, and sometimes the answer was "qmail was designed to deliver
mail as quickly and efficiently as possible, so it won't do what you
want." This is a variation of "Our software doesn't do what you want,
so I'll tell you that you shouldn't want to do it." That's another
rant for another day. Anyway, I shouldn't badmouth the qmail-users mailing list, because the
archives did get me what I wanted. It's only a stopgap solution, and
it might turn out to be a big mistake, but so far it seems okay, and
so at last I am coming to the point of this article.
I hacked qmail to support outbound message rate throttling. Thanks to a
suggestion from Richard Lyons on the qmail-users mailing list, it was much
easier to do than I had initially thought.
Here's how it works. Whenever qmail wants to try to deliver a message to
a remote address, it runs a program called qmail-remote. qmail-remote is responsible for
looking up the MX records for the host, contacting the right server,
conducting the SMTP conversation, and returning a status code back to
the main component. Rather than hacking directly on qmail-remote, I've
replaced it with a wrapper. The real qmail-remote is now in
qmail-remote-real. The qmail-remote program is now written in Perl.
It maintains a log file recording the times at which the last few
messages were sent. When it runs, it reads the log file, and a policy
file that says how quickly it is allowed to send messages. If it is
okay to send another message, the Perl program appends the current
time to the log file and invokes the real qmail-remote. Otherwise, it sleeps
for a while and checks again.
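The decision the wrapper has to make can be sketched in a few lines of C. This is not the Perl program itself: the function names and the in-memory array of send times are made up for illustration, and reading the log and policy files is left out.

```c
#include <stddef.h>
#include <time.h>

/* Count how many of the n recorded send times fall within the last
 * `window` seconds before `now`. */
size_t sent_in_window(const time_t *sent, size_t n,
                      time_t now, time_t window)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (sent[i] >= now - window)
            count++;
    return count;
}

/* Return 1 if another message may go out now, 0 if the caller should
 * sleep for a while and check again. */
int may_send(const time_t *sent, size_t n,
             time_t now, time_t window, size_t max_msgs)
{
    return sent_in_window(sent, n, now, window) < max_msgs;
}
```

A caller would append the current time to the log and run the real qmail-remote when may_send() returns 1, and sleep otherwise.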
The program is not strictly correct. It has some race conditions.
Suppose the policy limits qmail to sending 8 messages per minute. Suppose
7 messages have been sent in the last minute. Then six instances of
qmail-remote might all run at once, decide that it is OK to send a message, and send
one. Then 13 messages have been sent in the last minute, which
exceeds the policy limit. So far this has not been much of a
problem. It's happened twice in the last few hours that the system
sent 9 messages in a minute instead of 8. If it worries me too much,
I can tell qmail to run only one qmail-remote at a time, instead of 10. On a normal
qmail system, qmail speeds up outbound delivery by running multiple qmail-remote
processes concurrently. On my crippled system, speeding up outbound
delivery is just what I'm trying to avoid. Running at most one qmail-remote at
a time will cure all race conditions. If I were doing the project
over, I think I'd take out all the file locking and such, and just run
one qmail-remote. But I didn't think of it in time, and for now I think I'll
live with the race conditions and see what happens.
So let's see? What else is interesting about this program? I made
at least one error, and almost made at least one more.
The almost-error was this: The original design for the program was
something like:
- do
    - lock the history file, read it, and unlock it
  until it's time to send a message
- lock the history file, update it, and unlock it
- send the message
This is a classic mistake in writing programs that run concurrently
and update a file. The problem is that process A might
update the file after process B reads it but before B
updates it. Then B's update will destroy A's.
One way to fix this is to have the processes append to the history
file, but never remove anything from it. That is clearly not a
sustainable strategy. Someone must remove expired entries from the
history file.
Another fix is to have the read and the update in the same critical
section:
- lock the history file
- do
    - read it
  until it's time to send a message
- update the history file and unlock it
- send the message
But that loop could take a long time, during which no other qmail-remote process
can make progress. I had decided that I wanted to try to retain the
concurrency, and so I wasn't willing to accept this.
Cleaning the history file could be done by a separate process that
periodically locks the file and rewrites it. But instead, I have the qmail-remote
processes do it on the fly:
- do
    - lock the history file, read it, and unlock it
  until it's time to send a message
- lock the history file, read it, update it, and unlock it
- send the message
I'm happy that I didn't actually make this mistake. I only thought
about it.
Here's a mistake that I did make. This is the block of code
that sleeps until it's time to send the message:
while (@last >= $msgs) {
    my $oldest = $last[0];
    my $age = time() - $oldest;
    my $zzz = $time - $age + int(rand(3));
    $zzz = 1 if $zzz < 1;
    # Log("Sleeping for $zzz secs");
    sleep $zzz;
    shift @last while $last[0] < time() - $time;
    load_policy();
}
The throttling
policy is expressed by two numbers, $msgs and $time,
and the program tries to send no more than $msgs messages per
$time seconds. The @last array contains a list of
Unix epoch timestamps of the times at which the messages of the last
$time seconds were sent.
So the loop condition checks to see if at least $msgs
messages were sent in the last $time seconds. If not, the
program continues immediately, possibly posting its message. (It
rereads the history file first, in case some other messages have been
posted while it was asleep.)
Otherwise the program will sleep for a while. The first three lines
in the loop calculate how long to sleep for. It sleeps until the time
the oldest message in the history will fall off the queue, possibly
plus a second or two. Then the crucial line:
shift @last while $last[0] < time() - $time;
which discards the expired items from the history. Finally, the call
to load_policy() checks to see if the policy has changed, and
the loop repeats if necessary.
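The bug is in this crucial line. If @last becomes empty, $last[0] is
undefined, which compares as 0 and so is always less than time() - $time,
while shift on an empty array does nothing. So
this line turns into an infinite busy-loop. It should have been: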
shift @last while @last && $last[0] < time() - $time;
Whoops. I noticed this this morning when my system's load was around
12, and eight or nine qmail-remote processes were collectively eating 100% of
the CPU. I would have noticed sooner, but outbound deliveries hadn't
come to a complete halt yet.
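For comparison, here is the same pruning step as a C sketch, with the bound check made explicit. The function name and the array representation are hypothetical. In C the missing check would read past the end of the array rather than spin, but the guard plays the same role as the `@last &&` in the corrected Perl line.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Drop leading timestamps older than `cutoff` from an array sorted
 * oldest first, and return the new length.  The `drop < n` test is
 * what stops the scan once the history is empty. */
size_t prune_expired(time_t *sent, size_t n, time_t cutoff)
{
    size_t drop = 0;
    while (drop < n && sent[drop] < cutoff)
        drop++;
    memmove(sent, sent + drop, (n - drop) * sizeof sent[0]);
    return n - drop;
}
```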
Incidentally, there's another potential problem here arising from the
concurrency. A process will complete the sleep loop in at most
$time+3 seconds. But then it will go back and reread the history
file, and it may have to repeat the loop. This could go on
indefinitely if the system is busy. I can't think of a good way to
fix this without getting rid of the concurrent qmail-remote processes.
Here's the code. I
hereby place it in the public domain. It was written between 1 AM and
3 AM last night, so don't expect too much.
Corrections about sync(2)
I made some errors in today's post
about sync and fsync.
Most important, I said that "the sync() system call marks all
the kernel buffers as dirty". This is totally wrong, and doesn't
even make sense. Dirty buffers are those with data that needs to be
written out. Marking a non-dirty buffer as dirty is a waste of time,
since nothing has changed in the buffer, but it will now be rewritten
anyway. What sync() does is schedule all the dirty
buffers to be written as soon as possible.
On some recent systems, sync() actually waits for all the
dirty buffers to be written, and a bunch of people tried to correct me
about this. But my original article was right: historically, it was
not so, and even today it's not universally true. In former times,
sync() would schedule the buffers for writing, and then
return before the data was actually written.
I said that one of the duties of init was to call
sync() every thirty seconds, but this was mistaken. That
duty actually fell to a separate program, known as update.
While discussing this with one of the readers who wrote to correct me,
I looked up the source for Version 7 Unix, to make sure I was
right, and it's so short I thought I might as well show it here:
/*
 * Update the file system every 30 seconds.
 * For cache benefit, open certain system directories.
 */
#include <signal.h>
char *fillst[] = {
    "/bin",
    "/usr",
    "/usr/bin",
    0,
};
main()
{
    char **f;
    if(fork())
        exit(0);
    close(0);
    close(1);
    close(2);
    for(f = fillst; *f; f++)
        open(*f, 0);
    dosync();
    for(;;)
        pause();
}
dosync()
{
    sync();
    signal(SIGALRM, dosync);
    alarm(30);
}
The program is so simple I don't have much more to say about it. It
initially invokes dosync(), which calls sync() and
then schedules another call to dosync() in 30 seconds. Note
that the 0 in the second argument to open had not
yet been changed to O_RDONLY. The pause() call suspends the
process until a signal arrives; here that is the SIGALRM scheduled by
alarm(30), which runs dosync() again before the loop returns to pause().
In various systems more recent than V7, the program was known by
various names, but it was update for a very long time.
Several people wrote to correct me about the:
# sync
# sync
# sync
# halt
thing, some saying that I had the reason wrong, or that it did not make
sense, or that only two syncs were used, rather than three.
But I had it right. People did use three, and they did it for the
reason I said, whether that makes sense or not. (Some of the people
who miscorrected me were unaware that sync() would finish and
exit before the data was actually written.) But for example, see this
old Usenet thread for a discussion of the topic that confirms what
I said.
Nobody disputed my contention that Linus was suffering from the
promptings of the Evil One when he tried to change the semantics of
fsync(), and nobody seems to know the proper name of the
false god of false efficiency. I'll give this some thought and see
what I can come up with.
Thanks to Tony Finch, Dmitry Kim, and Stefan O'Rear for discussion of
these points.
Dirty, dirty buffers!
One side issue that arose during my talk on Monday about
inodes was the write-buffering normally done by Unix kernels. I
wrote a pretty long note to the PLUG mailing list about it, and
I thought I'd repost it here.
When your process asks the kernel to write data:
int bytes_written = write(file_descriptor,
                          buffer,
                          n_bytes);
the kernel normally copies the data from your buffer into a kernel
buffer, and then, instead of writing out the data to disk, it marks
its buffer as "dirty" (that is, as needing to be written eventually),
and reports success back to the process immediately, even though the
dirty buffer has not yet been written, and the data is not yet on the disk.
Normally, the kernel writes out the dirty buffer in due time,
and the data makes it to the disk, and you are happy because your
process got to go ahead and do some more work without having to wait
for the disk, which could take milliseconds. ("A long time", as I so
quaintly called it in the talk.) If some other process reads the data
before it is written, that is okay, because the kernel can give it the
updated data out of the buffer.
But if there is a catastrophe, say a power failure, then you see the
bad side of this asynchronous writing technique, because the data,
which your process thought had been written, and which the kernel
reported as having been written, has actually been lost.
There are a number of mechanisms in place to deal with this. The
oldest is the sync() system call, which marks all the kernel
buffers as dirty. All Unix systems run a program called
init, and one of init's principal duties is to call
sync() every thirty seconds or so, to make sure that the
kernel buffers get flushed to disk at least every thirty seconds, and
so that no crash will lose more than about thirty seconds' worth of data.
(There is also a command-line program sync which just does a sync()
call and then exits, and old-time Unix sysadmins are in the habit of
halting the system with:
# sync
# sync
# sync
# halt
because the second and third syncs give the kernel time to actually
write out the
buffers that were marked dirty by the first sync. Although I
suspect that few of them know why they do this. I swear I am not
making this up.)
But for really crucial data, sync() is not enough, because, although
it marks the kernel buffers as dirty, it still does not actually
write the data to the disk.
So there is also an fsync() call; I forget when this was
introduced. The process gives fsync() a file descriptor, and
the call demands that the kernel actually write the associated
dirty buffers to disk, and does not return until they have been. And
since,
unlike write(), it actually waits for the data to go to the
disk, a successful return from fsync() indicates that the
data is truly safe.
The mail delivery agent will use this when it is writing your email to
your mailbox, to make sure that no mail is lost.
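As a rough illustration of what such a program does, here is a minimal C sketch that writes a buffer and reports success only after fsync() has returned. The function name and the file-creation flags are choices made for this example, and a real delivery agent would do more (retry on interruption, write to a temporary file, and so on).

```c
#include <fcntl.h>
#include <unistd.h>

/* Write len bytes to path and return 0 only once fsync() has reported
 * success; return -1 on any failure. */
int write_durably(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* may write less than asked */
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    if (fsync(fd) != 0) {                /* wait for the dirty buffers */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The important part is that the return value of fsync() is checked: a successful write() alone says only that the kernel has accepted the data.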
Some systems have an O_SYNC flag that the process can supply
when it opens the file for writing:
int fd = open("blookus", O_WRONLY | O_SYNC);
This sets the O_SYNC flag in the kernel file pointer
structure, which means that whenever data is
written to this file pointer, the kernel, contrary to its usual
practice, will implicitly fsync() the descriptor.
Well, that's not what I wanted to write about here. What I meant to
discuss was...
No, wait. That is what I wanted to write about. How about that?
Anyway,
there's an interesting question that arises in connection with fsync(): suppose you
fsync() a file. That guarantees that the data will be written. But
does it also guarantee that the mtime and the file extent of the file
will be updated? That is, does it guarantee that the file's inode
will be written?
On most systems, yes. But on some versions of Linux's ext2
filesystem, no. Linus himself broke this as a sacrifice to the false
god of efficiency, a very bad decision in my opinion, for reasons that
should be obvious to everyone but those in the thrall of Mammon.
(Mammon's not right here. What is the proper name of the false god
of efficiency?)
Sanity eventually prevailed. Recent versions of Linux have an
fsync() call, which updates both the data and the inode, and
a fdatasync() call, which only guarantees to update the
data.
[ Addendum 20071208: Some of this is wrong. I posted corrections. ]
What's a File?
Almost every December since 2001 I have given a talk to the
local Linux users' group on some aspect of Unix internals. My
first talk was on the
internals of the ext2 filesystem. This year I was under
a lot of deadline pressure at work, so I decided I would give the 2001
talk again, maybe with a few revisions.
Actually I was under so much deadline pressure that I did not
have time to revise the talk. I arrived at the user group meeting
without a certain idea of what talk I was going to give.
Fortunately, the meeting structure is to have a Q&A and discussion
period before the invited speaker gives his talk. The Q&A period
always lasts about an hour. In that hour before I had to speak, I
wrote a new talk called What's a File?. It
mostly concerns the Unix "inode" structure, and what the kernel uses
it for. It uses the output of the well-known ls -l command
as a jumping-off point, since most of the ls -l information
comes from the inode.
Then I talk about how files are opened and permissions are checked,
how the filesystem is organized, how the kernel reads and writes data,
how directories are structured, how it's possible to have one file
with two names, how symbolic links work, and what that mysterious
field is in the ls -l output between the permissions and the
owner.
The talk was quite successful, much more so than I would have
expected, given how quickly I wrote it and my complete inability to
edit or revise it. Of course, it does help that I know this material
backwards and forwards and standing on my head, and also that I could
reuse all the diagrams and illustrations from the 2001 version of the
talk.
I would not, however, recommend this technique.
As my talks have gotten better over the years, I find that less and
less of the talk material is captured in the slides, and so the slides
become less and less representative of the talk itself. But I put
them online anyway,
and here they are.
Here's
a .tgz file in case you want to download it all at
once.
Software archaeology
For appropriate values of "everyone", everyone knows that Unix files
do not record any sort of "creation time". A fairly frequently asked
question in Unix programming forums, and other related forums, such as
Perl programming forums, is how to get the creation date of a file;
the answer is that you cannot do that because it is not there.
This lack is exacerbated by several unfortunate facts: creation times
are available on Windows systems; the Unix inode contains three
timestamps, one of which is called the "ctime", and the "c" is
suggestive of the wrong thing; Perl's built-in stat function
overloads the return value to return the Windows creation time in the
same position (on Windows) as it returns the ctime (on Unix).
So we see questions like this one, which appeared this week on the
Philadelphia Linux Users' Group mailing list:
How does one check and change ctime?
And when questioned as to why he or she wanted to do this, this person
replied:
We are looking to change the creation time. From what I understand,
ctime is the closest thing to creation time.
There is something about this reply that irritates me, but I'm not
quite sure what it is. Several responses come to mind: "Close" is
not sufficient in system programming; the ctime is not "close" to a
creation time, in any sense; before you go trying to change the thing,
you ought to do a minimal amount of research to find out what it is.
It is a perfect example of the Wrong Question, on the same order as
that poor slob all those years ago who wanted to know how to tell if a
file was a hard link or a soft link.
But anyway, that got me thinking about ctimes in general, and I did
some research into the history and semantics of the thing, and made
some rather surprising discoveries.
One good reference for the broad outlines of early Unix is the paper that
Dennis Ritchie and Ken Thompson published in Communications of
the ACM in 1974. This was updated in 1978, but the part
I'm quoting wasn't revised and is current to 1974. Here is what it
has to say about the relevant parts of the inode structure:
IV. IMPLEMENTATION OF THE FILE SYSTEM
... The entry found thereby (the file's i-node) contains the
description of the file:
...
time of creation, last use, and last modification
An error? I don't think so. Here is corroborating evidence, the
stat man page from the first edition of Unix, from 1971:
NAME stat -- get file status
SYNOPSIS sys stat; name; buf / stat = 18.
DESCRIPTION name points to a null-terminated string naming a file; buf is the
address of a 34(10) byte buffer into which information is placed
concerning the file. It is unnecessary to have any permissions at all
with respect to the file, but all directories leading to the file
must be readable.
After stat, buf has the following format:
buf, +1             i-number
+2, +3              flags (see below)
+4                  number of links
+5                  user ID of owner
+6, +7              size in bytes
+8, +9              first indirect block or contents block
...
+22, +23            eighth indirect block or contents block
+24, +25, +26, +27  creation time
+28, +29, +30, +31  modification time
+32, +33            unused
(Dennis Ritchie provides the Unix
first edition manual; the stat page is in section
2.1.)
Now how about that?
When did the ctime change from being called a "creation time" to a
"change time"? Did the semantics change too, or was the "creation
time" description a misnomer? If I can't find out, I might write to
Ritchie to ask. But this is, of course, a last resort.
In the meantime, I do have the source code for the fifth edition
kernel, but it appears that, around that time (1975 or so), there was
no creation time. At least, I can't find one.
The inode operations inside the kernel are defined to operate on struct
inodes:
struct inode {
    char i_flag;
    char i_count;
    int i_dev;
    int i_number;
    int i_mode;
    char i_nlink;
    char i_uid;
    char i_gid;
    char i_size0;
    char *i_size1;
    int i_addr[8];
    int i_lastr;
} inode[NINODE];
The i_lastr field is what we would now call the atime. (I
suppose it stands for "last read".) The mtime and ctime are not
there, because they are not stored in the in-memory copy of the inode.
They are fetched directly from the disk when needed.
We can see an example of this in the stat1 function, which is
the backend for the stat and fstat system calls:
 1  stat1(ip, ub)
 2  int *ip;
 3  {
 4      register i, *bp, *cp;
 5
 6      iupdat(ip, time);
 7      bp = bread(ip->i_dev, ldiv(ip->i_number+31, 16));
 8      cp = bp->b_addr + 32*lrem(ip->i_number+31, 16) + 24;
 9      ip = &(ip->i_dev);
10      for(i=0; i<14; i++) {
11          suword(ub, *ip++);
12          ub =+ 2;
13      }
14      for(i=0; i<4; i++) {
15          suword(ub, *cp++);
16          ub =+ 2;
17      }
18      brelse(bp);
19  }
ub is the user buffer into which the stat data will be
deposited. ip is the inode structure from which most of
this data will be copied. The
suword routine stores one two-byte word at an address in the user's
address space (the name is generally read as "store user word"). This is done starting at
the i_dev field (line 9), which effectively skips the two
earlier fields, i_flag and i_count, which are
internal kernel matters that are none of the user's business.
14 words are copied from the inode structure starting from this
position, including the device and i-number fields, the mode, the link
count, and so on, up through the addresses of the data or indirect
blocks. (In modern Unixes, the stat call omits these addresses.)
Then four words are copied out of the cp buffer, which has been
read from the inode actually on the disk; these eight bytes are at
position 24 in the inode, and ought to contain the mtime and the
ctime. The question is, which is which? This simple question turns
out to have a surprisingly complicated answer.
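The address arithmetic on lines 7 and 8 of stat1 is easier to follow when written out separately. The C sketch below only restates the constants visible in that code (32-byte on-disk inodes, 16 to a block, timestamps at byte 24); the struct and function names are made up for illustration.

```c
/* Geometry read off the constants in stat1. */
enum { INODE_SIZE = 32, INODES_PER_BLOCK = 16, TIMES_OFFSET = 24 };

struct inode_loc {
    unsigned block;    /* disk block holding the inode */
    unsigned offset;   /* byte offset of the inode within that block */
};

/* Same computation as ldiv(i_number+31, 16) and 32*lrem(i_number+31, 16). */
struct inode_loc locate_inode(unsigned inumber)
{
    struct inode_loc loc;
    loc.block  = (inumber + 31) / INODES_PER_BLOCK;
    loc.offset = INODE_SIZE * ((inumber + 31) % INODES_PER_BLOCK);
    return loc;
}

/* Byte offset, within the block, of the eight timestamp bytes. */
unsigned times_offset(unsigned inumber)
{
    return locate_inode(inumber).offset + TIMES_OFFSET;
}
```

With this arithmetic, inode 1 sits at the start of block 2, and inode 17 at the start of block 3.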
When an inode is modified, the IUPD flag is
set in the i_flag member. For example, here is
chmod, which modifies the inode but not the underlying data.
On a modern Unix system, we would expect this to update the ctime, but
not the mtime. Let's see what it does in version 5:
 1  chmod()
 2  {
 3      register *ip;
 4
 5      if ((ip = owner()) == NULL)
 6          return;
 7      ip->i_mode =& ~07777;
 8      if (u.u_uid)
 9          u.u_arg[1] =& ~ISVTX;
10      ip->i_mode =| u.u_arg[1]&07777;
11      ip->i_flag =| IUPD;
12      iput(ip);
13  }
Line 10 is the important one; it sets the mode on the in-memory copy
of the inode to the argument supplied by the user. Then line 11 sets
the IUPD flag to indicate that the inode has been modified.
Line 12 calls iput, whose principal job is to maintain the
kernel's internal reference count of the number of file descriptors
that are attached to this inode. When this number reaches zero, the
inode is written back to disk, and discarded from the kernel's open
file table. The iupdat function, called from iput,
is the one that actually writes the modified inode back to the
disk:
 1  iupdat(p, tm)
 2  int *p;
 3  int *tm;
 4  {
 5      register *ip1, *ip2, *rp;
 6      int *bp, i;
 7
 8      rp = p;
 9      if((rp->i_flag&(IUPD|IACC)) != 0) {
10          if(getfs(rp->i_dev)->s_ronly)
11              return;
12          i = rp->i_number+31;
13          bp = bread(rp->i_dev, ldiv(i,16));
14          ip1 = bp->b_addr + 32*lrem(i, 16);
15          ip2 = &rp->i_mode;
16          while(ip2 < &rp->i_addr[8])
17              *ip1++ = *ip2++;
18          if(rp->i_flag&IACC) {
19              *ip1++ = time[0];
20              *ip1++ = time[1];
21          } else
22              ip1 =+ 2;
23          if(rp->i_flag&IUPD) {
24              *ip1++ = *tm++;
25              *ip1++ = *tm;
26          }
27          bwrite(bp);
28      }
29  }
What is going on here? p is the in-memory copy of the inode
we want to update. It is immediately copied into a register, and
called by the alias rp thereafter. tm is the time
that the kernel should write into the mtime field of the inode.
Usually this is the current time, but the smdate system call
("set modified date") supplies it from the user instead.
Lines 16–17 copy the mode, link count, uid, gid, "size", and
"addr" fields from the in-memory copy of the inode into the block
buffer that will be written back to the disk. Lines 18–22
update the atime if the IACC flag is set, or skip it if not.
Then, if the IUPD flag is set, lines 24–25 write the
tm value into the next slot in the buffer, where the mtime is
stored. The bwrite call on line 27 commits the data to the
disk; this results in a call into the appropriate device driver
code.
There is no sign of updating the ctime field, but recall that we
started this search by looking at what the chmod call does;
it sets IUPD, which eventually results in the updating of the
mtime field. So the mtime field is not really an mtime field as we
now know it; it is doing the job that is now done by the ctime field.
And in fact, the dump command predicates its decision about
whether to dump a file on the contents of the mtime field. Which is
really the ctime field. So functionally, dump is doing the
same thing it does now.
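The timestamp logic at the end of iupdat boils down to two independent flag tests, which the following C sketch isolates. The flag values, the struct, and the function name are placeholders invented for the example; only the control flow is taken from the listing.

```c
#include <stdint.h>

/* Flag values are placeholders chosen for this sketch. */
enum { IUPD = 1 << 1, IACC = 1 << 2 };

struct disk_times {
    uint32_t accessed;   /* the slot written when IACC is set */
    uint32_t updated;    /* the slot written when IUPD is set */
};

/* Mirror the tail of iupdat: each timestamp slot is rewritten only if
 * the matching flag is set; otherwise it keeps its old value. */
void update_times(struct disk_times *t, int flags, uint32_t now, uint32_t tm)
{
    if (flags & IACC)
        t->accessed = now;
    if (flags & IUPD)
        t->updated = tm;
}
```

Since chmod sets only IUPD, a mode change by itself advances the second slot, the one the fifth edition calls the mtime.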
It's possible that I missed it, but I cannot find the advertised
creation time anywhere. The logical place to look is in the
maknode function, which allocates new inodes. The
maknode function calls ialloc to get an unused inode
from the device, and this initializes its mode (as specified by the
user), its link count (to 1), and its uid and gid (to the current
process's uid and gid). It does not set a creation time. The
ialloc function is fairly complicated, but as far as I can
tell it is not setting any creation time either.
Working it from the other end, asking who might look at the
ctime field, we have the find command, which has a
-mtime option, but no -ctime option. The
dump command, as noted before, uses the mtime. Several
commands perform stat calls and declare structs to hold the
result. For example, pr, which prints files with nice
pagination, declares a struct inode, which is the inode as
returned by stat, as opposed to the inode as used internally
by the kernel—what we would call a struct stat now.
There was no /usr/include in the fifth edition, so the
pr command contains its own declaration of the struct
inode. It looks like this:
struct inode {
    int dev;
    ...
    int atime[2];
    int mtime[2];
};
No sign of the ctime, which would have been after the mtime
field. (Of course, it could be there anyway, unmentioned in the
declaration, since it is last.) And similarly, the ls command
has:
struct ibuf {
    int idev;
    int inum;
    ...
    char *iatime[2];
    char *imtime[2];
};
A couple of commands have extremely misleading declarations. Here's
the struct inode from the prof command, which prints
profiling reports:
struct inode {
    int idev;
    ...
    int ctime[2];
    int mtime[2];
    int fill;
};
The atime field has erroneously been called ctime here, but
it seems that since prof does not use the atime, nobody
noticed the bug. And there's a mystery fill field at the
end, as if prof is expecting one more field, but doesn't know
what it will be for. The declaration of ibuf in the
ln command has similar oddities.
So the creation time advertised by the CACM paper (1974)
and the version 1 manual (1971) seems to have disappeared by the time
of version 5 (1975), if indeed it ever existed.
But there was some schizophrenia in the version 5 system about whether
there was a third date in addition to the atime and the mtime. The
stat call copied it into the stat buffer, and some commands
assumed that it would be there, although they weren't sure what it
would be for, and none of them seem to look at it. It's quite possible
that there was at one time a creation date, which had been eliminated
by the time of the fifth edition, leaving behind the vestigial remains
we saw in commands like ln and prof and in the code
of the stat1 function.
Functionally, the version 5 mtime is actually what
we would now call the ctime: it is updated by operations like
chmod that in modern Unix will update the ctime but not the
mtime. A quick scan of the Lions Book suggests that it was the same
way in version 6 as well. I imagine that the ctime-mtime distinction
arose in version 7, because that was the last version before the
BSD/AT&T fork, and nearly everything common to those two great
branches of the Unix tree was in version 7.
Oh, what the hell; I have the version 7 source code; I may as well
look at it. Yes, by this time the /usr/include/sys/stat.h
file had been invented, and does indeed include all three times in the
struct stat. So the mtime (as we now know it) appears to
have been introduced in v7.
One sometimes hears that early Unix had atime and mtime, and that
ctime was introduced later. But actually, it appears that early Unix
had atime and ctime, and it was the mtime that was introduced later.
The confusion arises because in those days the ctime was called
"mtime".
Addendum: It occurs to me now that the version 5 mtime is not
precisely like the modern ctime, because it can be set via the
smdate call, which is analogous to the modern utime
call. The modern ctime cannot be set at all.
(Minor trivium: line 22 of iupdat is ip1 =+ 2. In
modern C, we would write ip1 += 2. The =+ and
=- operators had turned out to be a mistake, because people
would write i=-1, intending i = -1, but the compiler
would understand it as i =- 1, producing subtle bugs. The
spellings of the operators were changed to avoid these bugs. The
change from =+ to += was complete by the time
K&R first edition was published in 1978: K&R mentions the
old-style operators and says that they are obsolete. In spite
of this, the Sun compiler I used in 1987 would still produce a warning
for i=-1, despite interpreting it as i = -1. I
believe this was because it was PCC-derived, and all PCC compilers
emitted this warning.
In the fifth edition code, we can see the obsolete form still in use.)
(Totally peripheral addendum: Google search for
dmr puts Dennis M. Ritchie in fourth position, not
the first. Is this grave insult to our community to be tolerated? I
think not! It must be avenged! With fire and steel!)
[ Addendum 20070127: Unix source code prior to the fifth edition is
lost. The manuals for the third and fourth editions are
available from the Unix Heritage
Society. The manual for the third edition (February 1973) mentions the
creation time, but by the fourth edition (November 1973) the
stat(2) man page no longer mentions a creation time. In
v4, the two dates in the stat structure are called actime
(modern atime) and modtime (modern mtime/ctime). ]
Environmental manipulations
Unix is full of little utility programs that run some other program in
a slightly modified environment. For example, the nohup
command:
SYNOPSIS
nohup COMMAND [ARG]...
DESCRIPTION
Run COMMAND, ignoring hangup signals.
The nohup command basically does signal(SIGHUP, SIG_IGN)
before calling execvp(COMMAND, ARGV) to execute the
command.
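As a rough sketch of that idea, and not the source of any real nohup, the core might look like this in C. The function names ignore_hangup and nohup_exec are invented for the example.

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Arrange for SIGHUP to be ignored.  An ignored signal stays ignored
 * across exec, which is what makes the trick work.
 * Returns 0 on success, -1 on failure. */
int ignore_hangup(void)
{
    return signal(SIGHUP, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/* Ignore SIGHUP, then replace this process with the command in argv.
 * Returns (with -1) only if something failed. */
int nohup_exec(char *const argv[])
{
    if (ignore_hangup() == -1)
        return -1;
    execvp(argv[0], argv);
    perror(argv[0]);    /* reached only when execvp fails */
    return -1;
}
```

A main that calls nohup_exec(argv + 1) would complete the skeleton. The real command does somewhat more, such as redirecting output to nohup.out when standard output is a terminal.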
Similarly, there is a chroot command, run as chroot
new-root-directory command args..., which
runs the specified command with its default root inode set to
somewhere else. And there is a nice command, run as nice
nice-value-adjustment command args..., which
runs the specified command with its "nice" value changed. And there
is an env environment-settings command
args... which runs the specified command with new
variables installed into the environment. The standard sudo
command could also be considered to be of this type.
I have also found it useful to write trivial commands called
indir, which runs a command after chdir-ing to a new
directory, and stopafter, which runs a command after setting
the alarm timer to a specified amount, and, just today,
with-umask, which runs a command after setting the umask to a
particular value.
I could probably have avoided indir and with-umask.
Instead of indir DIR COMMAND, I could use sh -c 'cd DIR;
exec COMMAND', for example. But indir avoids an extra
layer of horrible shell quotes, which can be convenient.
Today it occurred to me to wonder if this proliferation of commands
was really the best way to solve the problem. The sh -c
'...' method solves it partly, for those parts of the process
user area that have corresponding shell builtin commands. This includes
the working directory, umask, and environment variables, but not the
signal table, the alarm timer, or the root directory.
There is no standardized interface to all of these things at any
level. At the system call level, the working directory is changed by
the chdir system call, the root directory by chroot,
the alarm timer by alarm, the signal table by a bunch of
OS-dependent nonsense like signal or sigaction, the
nice value by setpriority, environment variables by a
potentially complex bunch of memory manipulation and pointer banging,
and so on.
Since there's no single interface for controlling all these things, we
might get a win by making an abstraction layer for dealing with them.
One place to put this abstraction layer is at the system level, and
might look something like this:
/* declares USERAREA_* constants,
int userarea_set(int, ...)
and void *userarea_get(int)
*/
#include <sys/userarea.h>
userarea_set(USERAREA_NICE, 12);
userarea_set(USERAREA_CWD, "/tmp");
userarea_set(USERAREA_SIGNAL, SIGHUP, SIG_IGN);
userarea_set(USERAREA_UMASK, 0022);
...
This has several drawbacks. One is that it requires kernel hacking.
A subitem of this is that it will never become widespread, and that if
you can't (or don't want to) replace your kernel, it cannot be made to
work for you. Another is that it does not work for the environment
variables, which are not really administered by the kernel. Another
is that it does not fully solve the original problem, which is to
obviate the plethora of nice, nohup, sudo,
and env commands. You would still have to write a command
to replace them. I had thought of another drawback, but forgot it while I
was writing the last two sentences.
You can also put the abstraction layer at the C library level. This
has fewer drawbacks. It no longer requires kernel hacking, and can
provide a method for modifying the environment. But you still need to
write the command that uses the library.
We may as well put the abstraction layer at the Unix command level.
This means writing a command in some language, like Perl or C, which
offers a shell-level interface to manipulating the process
environment, perhaps something like this:
newenv nice=12 cwd=/tmp signal=HUP:IGNORE umask=0022 -- command args...
Then newenv has a giant dispatch table inside it to process
the settings accordingly:
...
nice => sub { setpriority(PRIO_PROCESS, $$, $_) },
cwd => sub { chdir($_) },
signal => sub {
my ($name, $result) = split /:/;
$SIG{$name} = $result;
},
umask => sub { umask(oct($_)) },
...
One question to ask is whether something like this already exists.
Another is, if not, whether it's because there's some reason why it's
a bad idea, or because there's a simpler solution, or just because
nobody has done it yet.
ssh-agent, revisited
My recent article about reusing
ssh-agent processes attracted a lot of mail, most of it very
interesting.
- A number of people missed an important piece of context: since the
article was filed in the 'oops' section of my
blog, it was intended as a description of a mistake I had made.
The mistake in this case being to work really hard on the first
solution I thought of, rather than to back up at early signs of
trouble, and scout around for a better and simpler solution. I need
to find a way to point out the "oops" label more clearly, and at the
top of the article instead of at the bottom.
- Several people pointed out other good solutions to my problem. For
example, Adam Sampson and Robert Loomans pointed out that versions of
ssh-agent support a -a option, which orders the process to
use a particular path for its Unix domain socket, rather than making
up a path, as it does by default. You can then use something like
ssh-agent -a $HOME/.ssh/agent when you first start the agent,
and then you always know where to find the socket.
- An even simpler solution is as follows: My principal difficulty was in
determining the correct value for the SSH_AGENT_PID variable.
But it turns out that I don't need this; it is only used for
ssh-agent -k, which kills the existing ssh-agent process.
For authentication, it is only necessary to have
SSH_AUTH_SOCK set. The appropriate value for this variable
is readily determined by scanning /tmp, as I noted in the
original article. Thanks to Aristotle Pagaltzis and Adam Turoff for
pointing this out.
- Several people pointed me to the keychain
project. This program is a front-end to ssh-agent. It contains
functions to check for a running agent, and to start one if there is
none yet, and to save the environment settings to a file, as I did
manually in my article.
- A number of people suggested that I should just run ssh-agent from my
X session manager. This suggests that they did not read the article
carefully; I already do this. Processes running on my home machine,
B, all inherit the ssh-agent settings from the session manager
process. The question is what to do when I remote login from a
different machine, say A, and want the login shell, which was
not started under X, to acquire the same settings.
Other machines trust B, but not A, so credential
forwarding is not the solution here either.
- After extracting the ssh-agent process's file descriptor
table with ls -l /proc/pid/fd, and getting:
lrwx------ 1 mjd users 64 Dec 12 23:34 3 -> socket:[711505562]
I concluded that the identifying information, 711505562, was useless.
Aaron Crane corrected me on this; you can find it listed in
/proc/net/unix, which gives the pathname in the
filesystem:
% grep 711505562 /proc/net/unix
ce030540: 00000002 00000000 00010000 0001 01 711505562 /tmp/ssh-tNT31655/agent.31655
I had suggested that the kernel probably maintained no direct mapping
from the socket i-number to the filesystem path, and that obtaining
this information would require difficult grovelling of the kernel data
structures. But apparently to whatever extent that is true, it is
irrelevant, since the /proc/net/unix driver has already been
written to do it.
- Saving the socket information in a file solves another problem I
had. Suppose I want some automated process, say the cron job that
makes my offsite network backups, to get access to SSH credentials. I
can store the credentials in an ssh-agent process, and save the
variable settings to a file. The backup process can then reinstate
the settings from the file, and will thenceforward have the
credentials for the remote login.
- Finally, I should add that since implementing this scheme for the
first time on 21 November, I have started exactly zero new ssh-agent
processes, so I consider it a rousing success.
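The /proc/net/unix lookup that Aaron Crane described can also be done from a program. The following C sketch assumes Linux and takes the column layout from the sample line quoted above, with the i-number in the seventh field and the pathname in the eighth; the function names are made up for the example.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of /proc/net/unix.  If its inode column equals `inode`
 * and the line carries a pathname, copy the pathname into `out` and
 * return 1; otherwise return 0.  A pathname containing whitespace would
 * be cut short at the first space. */
int socket_path_from_line(const char *line, unsigned long inode,
                          char *out, size_t outlen)
{
    unsigned long found;
    char path[512];

    /* Skip six fields, then read the inode number and the path. */
    if (sscanf(line, "%*s %*s %*s %*s %*s %*s %lu %511s", &found, path) != 2)
        return 0;
    if (found != inode || strlen(path) >= outlen)
        return 0;
    strcpy(out, path);
    return 1;
}

/* Scan /proc/net/unix for the socket with this inode number.
 * Returns 1 and fills `out` if found, 0 otherwise. */
int socket_path(unsigned long inode, char *out, size_t outlen)
{
    char line[1024];
    int ok = 0;
    FILE *fp = fopen("/proc/net/unix", "r");

    if (fp == NULL)
        return 0;
    while (!ok && fgets(line, sizeof line, fp) != NULL)
        ok = socket_path_from_line(line, inode, out, outlen);
    fclose(fp);
    return ok;
}
```

Given the i-number from the ls -l /proc/pid/fd output, socket_path would fill in the value to use for SSH_AUTH_SOCK.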
Thanks to everyone who wrote in on this matter.
Elimination of "f" system calls
Michael C. Toren:
> 1) Open a file descriptor pointing to the current working directory.
>
> 2) Create a temporary directory within the jail, and chroot() to it.
>
> 3) Using fchdir(), change the working directory to the file descriptor
> saved from step 1.
Oho, I hadn't seen that before. The chroot() in step 2 is required to
avoid the special case in the kernel that checks to see if you are
doing ".." in the current root directory. But because you chrooted()
yourself somewhere else, the special case isn't exercised.
Older systems don't have fchdir(), which is a fairly recent addition.
With the proliferation of "f" calls in recent years (fchdir, fchmod,
fchown, fstat, fsync, etc.) I wonder what would be the result if the
Unix system interface were redesigned to eliminate the non-"f"
versions of the calls entirely. Instead, there would be a generic
function, which we might call "iname", which transforms a path name to
an "inode" structure:
struct inode * iname (const char *path);
Unix kernels already contain a function with this name that does this
job.
The system calls that formerly accepted path names are changed to require
an inode structure. So instead of
fd = open("dir/file", ...)
one now has
fd = open(iname("dir/file"), ...)
(There are some minor language and usability issues here: what if
iname() returns NULL? Ignore those; I want to discuss OS issues, not
language issues.)
There would be a function, analogous to iname(), that also returned an
inode structure, but which took an open file descriptor instead of a
path name:
struct inode * inode(int fd);
This is essentially equivalent to the fstat() function we have now.
chown() and fchown() would merge to become a single call that accepted
an inode structure; instead of:
chown("dir/file", owner)
fchown(fd, owner)
one would have:
chown(iname("dir/file"), owner)
chown(inode(fd), owner)
Similarly, instead of:
chdir(path);
fchdir(fd);
one would have:
chdir(iname(path));
chdir(inode(fd));
stat() and fstat() would not only merge but would disappear entirely;
the struct inode can do everything that the struct stat can do. This
code:
stat("dir/file", &statbuf);
fstat(fd, &statbuf);
turns into this:
statbuf = iname("dir/file");
statbuf = inode(fd);
There are some security implications to this idea. There needs to be
protection against counterfeiting an inode structure. For example,
consider a world-readable file in a secret, nonsearchable directory.
Suppose the file happens to have i-number 123456. If it's possible to
do this, then security has failed:
struct inode I;
I.inumber = 123456;
fd = open(&I, O_RDWR);
It should be impossible for anyone to manufacture the struct inode
that represents the secret file without actually using iname()
somewhere along the line. A simple way to arrange this would be to
have the kernel cryptographically sign each struct inode. This can be
done inexpensively.
This still has some access implications. Consider a
world-readable file in a world-searchable directory. Process A
iname()s the file, obtaining its struct inode. The search
permissions on the directory are then removed. Process A can still
open the file. This is analogous to a similar situation in standard
Unix in which process A opens the file before the permissions are
changed, and can still read and write it afterwards. So that's not a
big change. What might be a big change is that A can dump the struct
inode to a file and then a different process can read it back again,
evading the increased access protections on the directory. The
cryptographic signature technique can fix this problem by restricting
struct inodes to be used by a single process.
Whether this is worth doing I don't know. My main idea in thinking it
up was to avoid the increasing duplication of system calls. Does
Unix need an "fsymlink" call? Does it need three different ones?
symlink(oldpath, newpath);
fsymlink1(fd, newpath);
fsymlink2(oldpath, fd);
fsymlink3(oldfile_fd, newdir_fd);
Perhaps not this week, but who knows what the future holds? With the
iname() / inode() style, these are all a single call:
symlink(iname(oldpath), iname(newpath));
symlink(inode(fd), iname(newpath));
symlink(iname(oldpath), inode(fd));
symlink(inode(oldfile_fd), inode(newdir_fd));
This also fixes some of the proliferation in the system call interface
between calls that work on symlinks and calls that work through
symlinks. For example, stat() and lstat(), and chown() and lchown().
On normal files, each pair is the same. But on a symlink, stat() stats
the pointed-to file while lstat() stats the symlink itself; similarly
chown() changes the owner of the pointed-to file while lchown()
changes the owner of the symlink itself. But where's lchmod()? What
about llink()? There's no way to make a hard link to a symbolic
link! With the inode() / iname() technique above, you only need one
extra call to handle all possible operations on a symbolic link:
lstat(path);
lchown(path, owner);
llink(path, newpath);
becomes:
stat(liname(path));
chown(liname(path), owner);
link(liname(path), iname(newpath));
where liname() is just like iname(), except that if the resulting file
is a symbolic link, its inode is returned immediately; iname() would
have read the target of the symbolic link and called itself
recursively to resolve the target.
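In terms of today's calls, the difference between iname() and liname() is the difference between stat() and lstat(). Here is a small sketch of a single lookup routine with the choice made explicit; the name lookup and the enum are invented for the example.

```c
#include <sys/stat.h>

/* What to do when the last component of the path is a symbolic link. */
enum final_link { FOLLOW_LINK, KEEP_LINK };

/* One routine standing in for both iname() and liname(): fills in *st
 * for the path and returns 0, or returns -1 on failure.
 * KEEP_LINK reports on the link itself; FOLLOW_LINK on its target. */
int lookup(const char *path, enum final_link mode, struct stat *st)
{
    return mode == KEEP_LINK ? lstat(path, st) : stat(path, st);
}
```

For a symbolic link that points at a directory, lookup with KEEP_LINK should report a link (S_ISLNK), and with FOLLOW_LINK a directory (S_ISDIR).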
It also seems to me that this interface might make it easier to
communicate open files from one process to another. Some Unix systems
offer experimental features for passing file descriptors around;
this system only requires that the struct inode be communicated
directly to the receiving process.
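For comparison, the descriptor-passing facility found on current systems is the SCM_RIGHTS control message, sent over a Unix-domain socket. The sketch below shows how much ceremony it involves; it is not the struct inode scheme, and the function names send_fd and recv_fd are made up for the example.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send descriptor fd over the Unix-domain socket sock.
 * Returns 0 on success, -1 on failure. */
int send_fd(int sock, int fd)
{
    char byte = 0;      /* one byte of ordinary data rides along */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {             /* the union keeps buf aligned for a cmsghdr */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cm;

    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof fd);

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cm;
    int fd;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL || cm->cmsg_level != SOL_SOCKET || cm->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cm), sizeof fd);
    return fd;
}
```

The receiver ends up with a new descriptor number that refers to the same open file as the sender's.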