Don't tug on that, you never know what it might be attached to
This is a story about a very interesting bug that I tracked down
yesterday. It was causing a bad effect very far from where the bug
actually was.
emacsclient
The emacs text editor comes with a separate utility, called
emacsclient , which can communicate with the main editor process and
tell it to open files for editing. You have your main emacs
running. Then somewhere else you run the command
emacsclient some-files...
and it sends the main emacs a message that you want to edit
some-files . Emacs gets the message and pops up new windows for editing
those files. When you're done editing some-files you tell Emacs, by
typing C-# or something, and it communicates back to emacsclient that
the editing is done, and emacsclient exits.
This was more important in the olden days when Emacs was big and
bloated and took a long time to start up. (They used to joke that
“Emacs” was an abbreviation for “Eight Megs And Constantly Swapping”.
Eight megs!) But even today it's still useful, say from shell scripts
that need to run an editor.
Here's the reason I was running it. I have a very nice shell script,
called also , that does something like this:
- Interpret command-line arguments as patterns
- Find files matching those patterns
- Present a menu of the files
- Wait for me to select files of interest
- Run emacsclient on the selected files
It is essentially a wrapper around
menupick ,
a menu-picking utility I wrote which has seen use as a component of
several other tools.
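The steps listed above could be sketched roughly as follows. This is not the real also script: the matching rule (case-insensitive substring match on the path) is a guess, and PICKER and EDITOR_CMD are made-up variables standing in for menupick and emacsclient, whose actual interfaces are not shown here.

```shell
#!/bin/sh
# Hypothetical sketch of an also-style wrapper; not the author's script.

# Print files under the current directory whose path contains any of the
# given patterns. Assumption: case-insensitive substring matching.
find_candidates() {
    for pattern in "$@"; do
        find . -type f | grep -i -F -- "$pattern"
    done | sort -u
}

# Filter the candidates through a picker, then hand the chosen files to
# the editor command. PICKER and EDITOR_CMD are invented for this sketch;
# the default picker, cat, simply selects every candidate.
also_sketch() {
    find_candidates "$@" | ${PICKER:-cat} | {
        set --
        while IFS= read -r file; do
            set -- "$@" "$file"
        done
        if [ "$#" -gt 0 ]; then
            "${EDITOR_CMD:-emacsclient}" "$@"
        fi
    }
}
```

With the defaults, also_sketch Wizard would pass every matching file straight to emacsclient; a real picker would sit in the middle and narrow the list first.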
I can type
also Wizard
in the shell and get a menu of the files related to the wizard, select
the ones I actually want to edit, and they show up in Emacs. This is
more convenient than using Emacs itself to find and open them. I use
it many times a day.
Or rather, I did until this week, when it suddenly stopped working.
Everything ran fine until the execution of emacsclient , which would
fail, saying:
emacsclient: can't find socket; have you started the server?
(A socket is a facility that enables interprocess communication, in
this case between emacs and emacsclient .)
This message is familiar. It usually means that I have forgotten to
tell Emacs to start listening for emacsclient , by running M-x
server-start . (I should have Emacs do this when it starts up, but I
don't. Why not? I'm not sure.) So the first time it happened I went
to Emacs and ran M-x server-start . Emacs announced that it had
started the server, so I reran also . And the same thing happened.
emacsclient: can't find socket; have you started the server?
Finding the socket
So the first question is: why can't emacsclient find the socket?
And this resolves naturally into two subquestions: where is the
socket, and where is emacsclient looking?
The second one is easily answered; I ran strace emacsclient (hi
Julia!) and saw that the last interesting thing emacsclient did
before emitting the error message was
stat("/mnt/tmp/emacs2017/server", 0x7ffd90ec4d40) = -1 ENOENT (No such file or directory)
which means it's looking for the socket at /mnt/tmp/emacs2017/server
but didn't find it there.
The question of where Emacs actually put the socket file was a little
trickier. I did not run Emacs under strace because I felt sure that
the output would be voluminous and it would be tedious to grovel over
it.
I don't remember exactly how I figured this out, but I think
I made an educated guess, something like: emacsclient
is looking in /mnt/tmp ; this seems unusual. I would expect the
socket to be under /tmp . Maybe it is under /tmp ? So I looked
under /tmp and there it was, in /tmp/emacs2017/server :
srwx------ 1 mjd mjd 0 Jun 27 11:43 /tmp/emacs2017/server
(The s at the beginning there means that the file is a “Unix-domain
socket”. A socket is an endpoint for interprocess communication. The
most familiar sort is a TCP socket, which has a TCP address, and which
enables communication over the internet. But since ancient times Unix
has also supported Unix-domain sockets, which enable communication
between two processes on the same machine. Instead of TCP addresses,
such sockets are addressed using paths in the filesystem, in this case
/tmp/emacs2017/server . When the server creates such a socket, it
appears in the filesystem as a special type of file, as here.)
I confirmed that this was the correct file by typing M-x
server-force-delete in Emacs; this immediately caused
/tmp/emacs2017/server to disappear. Similarly M-x server-start
made it reappear.
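A script can make the same check without reading ls output, because the shell's test command has a -S operator that is true only for sockets. Here is a minimal sketch; the function name socket_status is made up for illustration.

```shell
# Classify a path: a socket, some other kind of file, or nothing at all.
# [ -S ] is true only for sockets, the files ls -l marks with a leading "s".
socket_status() {
    if [ -S "$1" ]; then
        echo "socket"
    elif [ -e "$1" ]; then
        echo "exists, but is not a socket"
    else
        echo "missing"
    fi
}
```

Running socket_status /tmp/emacs2017/server before and after M-x server-force-delete should flip between "socket" and "missing".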
Why the disagreement?
Now the question is: Why is emacsclient looking for the socket under
/mnt/tmp when Emacs is putting it in /tmp ? They used to
rendezvous properly; what has gone wrong? I recalled that there was
some environment variable for controlling where temporary files are
put, so I did
env | grep mnt
to see if anything relevant turned up. And sure enough there was:
TMPDIR=/mnt/tmp
When programs want to create temporary files and directories, they normally do it in /tmp . But
if there is a TMPDIR setting, they use that directory instead. This
explained why emacsclient was looking for
/mnt/tmp/emacs2017/server . And the explanation for why Emacs itself
was creating the socket in /tmp seemed clear: Emacs was failing to
honor the TMPDIR setting.
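The lookup rule can be sketched in a few lines of shell. The path layout (temporary directory, then emacs plus the numeric user ID, then server) is inferred from the strace and ls output above rather than from the Emacs source, and the function name is made up.

```shell
# Compute the socket path the way the observations above suggest:
# <temporary directory>/emacs<uid>/server, where the temporary directory
# is $TMPDIR if that is set and /tmp otherwise.
emacs_socket_path() {
    printf '%s/emacs%s/server\n' "${TMPDIR:-/tmp}" "$(id -u)"
}

# Same user, two environments, two different answers.
with_tmpdir=$(TMPDIR=/mnt/tmp; emacs_socket_path)
without_tmpdir=$(unset TMPDIR; emacs_socket_path)
echo "TMPDIR=/mnt/tmp: $with_tmpdir"
echo "TMPDIR unset:    $without_tmpdir"
```

Two processes that apply this rule can only rendezvous if they agree on TMPDIR.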
With this clear explanation in hand, I began to report the bug in
Emacs, using M-x report-emacs-bug . (The folks in the #emacs IRC
channel on Freenode suggested this. I had a bad
experience last time I tried
#emacs , and then people mocked me for even trying to get useful
information out of IRC. But this time it went pretty well.)
Emacs popped up a buffer with full version information and invited me
to write down the steps to reproduce the problem. So I wrote down
% export TMPDIR=/mnt/tmp
% emacs
and as I did that I ran those commands in the shell.
Then I wrote
In Emacs:
M-x getenv TMPDIR
(emacs claims there is no such variable)
and I did that in Emacs also. But instead of claiming there was no
such variable, Emacs cheerfully informed me that the value of TMPDIR
was /mnt/tmp .
(There is an important lesson here! To submit a bug report, you find
a minimal demonstration. But then you also try the minimal
demonstration exactly as you reported it. Because of what just
happened! Had I sent off that bug report, I would have wasted
everyone else's time, and even worse, I would have looked like a
fool.)
My minimal demonstration did not demonstrate. Something else was
going on.
Why no TMPDIR ?
This was a head-scratcher. All I could think of was that
emacsclient and Emacs were somehow getting different environments,
one with the TMPDIR setting and one without. Maybe I had run them
from different shells, and only one of the shells had the setting?
I got on a sidetrack at this point to find out why TMPDIR was set in
the first place; I didn't think I had set it. I looked for it in
/etc/profile , which holds the default Bash startup instructions, but it
wasn't there. But I also noticed an /etc/profile.d which seemed
relevant. (I saw later that the /etc/profile contained instructions
to load everything under /etc/profile.d .) And when I grepped for
TMPDIR in the profile.d files, I found that it was being set by
/etc/profile.d/ziprecruiter_environment.sh , which the sysadmins had
installed. So that mystery at least was cleared up.
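That search amounts to a recursive grep. A small sketch, with a made-up function name; it does a plain substring match on NAME=, so it can over-report (it would also flag a variable called MYTMPDIR, for instance).

```shell
# Print the files under the given paths that contain an assignment to the
# named variable. Unreadable files are skipped silently.
find_var_setters() {
    name=$1
    shift
    grep -rl -- "${name}=" "$@" 2>/dev/null
}

# Typical use:
#   find_var_setters TMPDIR /etc/profile /etc/profile.d
```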
That got me on a second sidetrack, looking through our Git history for
recent changes involving TMPDIR . There weren't any, so that was a
dead end.
I was still puzzled about why Emacs sometimes got the TMPDIR setting
and sometimes not. That's when I realized that my original Emacs
process, the one that had failed to rendezvous with emacsclient ,
had not been started in the usual way. Instead of simply running
emacs , I had run
git re-edit
which invokes Git, which then runs
/home/mjd/bin/git-re-edit
which is a Perl program I wrote that does a bunch of stuff to figure
out which files I was editing recently and then execs emacs to edit
them some more. So there are several programs here that could be
tampering with the environment and removing the TMPDIR setting.
To more accurately point the finger of blame, I put some diagnostics
into the git-re-edit program to have it print out the value of
TMPDIR . Indeed, git-re-edit reported that TMPDIR was unset.
Clearly, the culprit was Git, which must have been removing TMPDIR
from the environment before invoking my Perl program.
Who is stripping the environment?
To confirm this conclusion, I created a tiny shell script,
/home/mjd/bin/git-env , which simply printed out the environment, and
then I ran git env , which tells Git to find git-env and run it.
If the environment it printed were to omit TMPDIR , I would know Git
was the culprit. But TMPDIR was in the output.
So I created a Perl version of git-env , called git-perlenv , which
did the same thing, and I ran it via git perlenv . And this time
TMPDIR was not in the output. I ran diff on the outputs of git
env and git perlenv and they were identical—except that git
perlenv was missing TMPDIR .
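The same comparison can be reproduced in miniature without Git or Perl. In this sketch, env -u TMPDIR stands in for whatever is stripping the variable, and missing_vars is a made-up helper that names the variables present in one environment dump but absent from the other.

```shell
# Print the names of variables that appear in the first env dump but not
# in the second. Both arguments are files holding the output of env.
missing_vars() {
    sort "$1" > "$1.sorted"
    sort "$2" > "$2.sorted"
    comm -23 "$1.sorted" "$2.sorted" | cut -d= -f1
    rm -f "$1.sorted" "$2.sorted"
}

workdir=$(mktemp -d)
# Stand-ins for the two dumps: one child sees TMPDIR, the other has it
# removed before it runs.
TMPDIR=/mnt/tmp env > "$workdir/with"
TMPDIR=/mnt/tmp env -u TMPDIR env > "$workdir/without"
missing=$(missing_vars "$workdir/with" "$workdir/without")
echo "$missing"
rm -rf "$workdir"
```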
So it was Perl's fault! And I verified this by running perl
/home/mjd/bin/git-re-edit directly, without involving Git at all.
The diagnostics I had put in reported that TMPDIR was unset.
WTF Perl?
At this point I tried getting rid of git-re-edit itself, and ran the
one-line program
perl -le 'print $ENV{TMPDIR}'
which simply runs Perl and tells it to print out the value of the
TMPDIR environment variable. It should print /mnt/tmp , but instead
it printed the empty string. This is a smoking gun, and Perl no
longer has anywhere to hide.
The mystery is not cleared up, however. Why was Perl doing this?
Surely not a bug; someone else would have noticed such an obvious bug
sometime in the past 25 years. And it only failed for TMPDIR , not
for other variables. For example
FOO=bar perl -le 'print $ENV{FOO}'
printed out bar as one would expect. This was weird: how could
Perl's environment handling be broken for just the TMPDIR variable?
At this point I got Rik Signes and Frew Schmidt to look at it with
me. They confirmed that the problem was not in Perl generally, but
just in this Perl. Perl on other systems did not display this
behavior.
I looked in the output of perl -V , which says what version of Perl
you are using and which patches have been applied, and wasted a lot of
time looking into
CVE-2016-2381,
which seemed relevant. But it turned out to be a red herring.
Working around the problem, 1.
While all this was going on I was looking for a workaround. Finding
one is at least as important as actually tracking down the problem
because ultimately I am paid to do something other than figure out why
Perl is losing TMPDIR . Having a workaround in hand means that when
I get sick and tired of looking into the underlying problem I can
abandon it instantly instead of having to push onward.
The first workaround I found was to not use the Unix-domain socket.
Emacs has an option to use a TCP socket instead, which is useful on
systems that do not support Unix-domain sockets, such as non-Unix
systems. (I am told that some do still exist.)
You set the server-use-tcp variable to a true value, and when you
start the server, Emacs creates a TCP socket and writes a description
of it into a “server file”, usually ~/.emacs.d/server/server . Then
when you run emacsclient you tell it to connect to the socket that
is described in the file, with
emacsclient --server-file=~/.emacs.d/server/server
or by setting the EMACS_SERVER_FILE environment variable. I tried
this, and it worked, once I figured out the thing about
server-use-tcp and what a “server file” was. (I had misunderstood
at first, and thought that “server file” meant the Unix-domain socket
itself, and I tried to get emacsclient to use the right one by
setting EMACS_SERVER_FILE , which didn't work at all. The resulting
error message was obscure enough to lead me to IRC to ask about it.)
Working around the problem, 2.
I spent quite a while looking for an environment variable analogous to
EMACS_SERVER_FILE to tell emacsclient where the Unix-domain socket
was. But while there is a --socket-name command-line argument to
control this, there is inexplicably no environment variable. I hacked
my also command (responsible for running emacsclient ) to look for
an environment variable named EMACS_SERVER_SOCKET , and to pass its
value to emacsclient --socket-name if there was one. (It probably
would have been better to write a wrapper for emacsclient , but I
didn't.) Then I put
EMACS_SERVER_SOCKET=$TMPDIR/emacs$(id -u)/server
in my Bash profile, which effectively solved the problem. This set
EMACS_SERVER_SOCKET to /mnt/tmp/emacs2017/server whenever I
started a new shell. When I ran also it would notice the setting
and pass it along to emacsclient with --socket-name , to tell
emacsclient to look in the right place. Having set this up I could
forget all about the original problem if I wanted to.
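The change to also might look something like this sketch. It is not the actual code: the function name is invented, and EMACSCLIENT is a hypothetical override added only so the logic can be exercised without a running Emacs. The --socket-name option and the EMACS_SERVER_SOCKET variable are the ones described above.

```shell
# Run emacsclient, adding --socket-name when EMACS_SERVER_SOCKET is set.
# EMACSCLIENT is a made-up override for the command to run.
run_emacsclient() {
    if [ -n "${EMACS_SERVER_SOCKET:-}" ]; then
        "${EMACSCLIENT:-emacsclient}" --socket-name="$EMACS_SERVER_SOCKET" "$@"
    else
        "${EMACSCLIENT:-emacsclient}" "$@"
    fi
}
```

When the variable is unset the function behaves exactly like plain emacsclient, so it stays harmless on machines that don't need the workaround.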
But but but WHY?
But why was Perl removing TMPDIR from the environment? I didn't
figure out the answer to this; Frew took it to the #p5p IRC channel
on perl.org , where the answer was eventually tracked down by Matthew
Horsfall and Zefram.
The answer turned out to be quite subtle. One of the classic attacks
that can be mounted against a process with elevated privileges is as
follows. Suppose you know that the program is going to write to a
temporary file. So you set TMPDIR beforehand and trick it into
writing in the wrong place, possibly overwriting or destroying
something important.
When a program is loaded into a process, the dynamic loader does the
loading. To protect against this attack, the loader checks to see if
the program it is going to run has elevated privileges, say because it
is setuid, and if so it sanitizes the process’ environment to prevent
the attack. Among other things, it removes TMPDIR from the
environment.
I hadn't thought of exactly this, but I had thought of something like
it: If Perl detects that it is running setuid, it enables
a secure mode which, among other things, sanitizes the environment.
For example, it ignores the PERL5LIB environment variable that
normally tells it where to look for loadable modules, and instead
loads modules only from a few compiled-in trustworthy directories. I
had checked early on to see if this was causing the TMPDIR problem,
but the perl executable was not setuid and Perl was not running in
secure mode.
But Linux supports a feature called “capabilities”, which is a sort of
partial superuser privilege. You can give a program some of the
superuser's capabilities without giving away the keys to the whole
kingdom. Our systems were configured to give perl one extra
capability, of binding to low-numbered TCP ports, which is normally
permitted only to the superuser. And when the dynamic loader ran
perl , it saw this additional capability and removed TMPDIR from
the environment for safety.
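Had I known to look, a check along these lines would have pointed at the culprit much sooner. It is only a sketch: privilege_report is a made-up name, and it assumes the getcap utility from the libcap tools, which may not be installed or on the PATH. It reports the setuid and setgid bits and then prints whatever getcap says about the file.

```shell
# Report the ways an executable might be running with extra privileges:
# the setuid bit, the setgid bit, and any file capabilities.
privilege_report() {
    file=$1
    if [ -u "$file" ]; then
        echo "setuid"
    fi
    if [ -g "$file" ]; then
        echo "setgid"
    fi
    # getcap prints nothing for a file with no capabilities. Skip the
    # check quietly if the tool is not available.
    if command -v getcap >/dev/null 2>&1; then
        getcap "$file" 2>/dev/null
    fi
    return 0
}

# Typical use:
#   privilege_report "$(command -v perl)"
```

On an ordinary executable this prints nothing at all; any output is a hint that the loader may be sanitizing that program's environment.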
This is why Emacs had the TMPDIR setting when run from the command
line, but not when run via git-re-edit .
Until this came up, I had not even been aware that the “capabilities”
feature existed.
A red herring
There was one more delightful confusion on the way to this happy
ending. When Frew found out that it was just the Perl on my
development machine that was misbehaving, he tried logging into his
own, nearly identical development machine to see if it misbehaved in
the same way. It did, but when he ran a system update to update Perl,
the problem went away. He told me this would fix the problem on my
machine. But I reported that I had updated my system a few hours
before, so there was nothing to update!
The elevated capabilities theory explained this also. When Frew
updated his system, the new Perl was installed without the elevated
capability feature, so the dynamic loader did not remove TMPDIR from
the environment.
When I had updated my system earlier, the same thing happened. But
as soon as the update was complete, I reloaded my system configuration, which
reinstated the capability setting. Frew hadn't done this.
Summary
- The system configuration gave perl a special capability
- so the dynamic loader sanitized its environment
- so that when perl ran emacs ,
- the Emacs process didn't have the TMPDIR environment setting
- which caused Emacs to create its listening socket in the usual place
- but because emacsclient did get the setting, it looked in the wrong place
Conclusion
This computer stuff is amazingly complicated. I don't know how anyone
gets anything done.
[ Addendum 20160709: Frew Schmidt has written up the same
incident,
but covers different ground than I do. ]
[ Addendum 20160709: A Hacker News comment asks what changed to cause
the problem? Why was Perl losing TMPDIR this week but not the week
before? Frew and I don't know! ]