The Universe of Discourse


Fri, 01 Jul 2016

Don't tug on that, you never know what it might be attached to

This is a story about a very interesting bug that I tracked down yesterday. It was causing a bad effect very far from where the bug actually was.

emacsclient

The emacs text editor comes with a separate utility, called emacsclient, which can communicate with the main editor process and tell it to open files for editing. You have your main emacs running. Then somewhere else you run the command

     emacsclient some-files...

and it sends the main emacs a message that you want to edit some-files. Emacs gets the message and pops up new windows for editing those files. When you're done editing some-files you tell Emacs, by typing C-# or something, it it communicates back to emacsclient that the editing is done, and emacsclient exits.

This was more important in the olden days when Emacs was big and bloated and took a long time to start up. (They used to joke that “Emacs” was an abbreviation for “Eight Megs And Constantly Swapping”. Eight megs!) But even today it's still useful, say from shell scripts that need to run an editor.

Here's the reason I was running it. I have a very nice shell script, called also, that does something like this:

  • Interpret command-line arguments as patterns
  • Find files matching those patterns
  • Present a menu of the files
  • Wait for me to select files of interest
  • Run emacsclient on the selected files

It is essentially a wrapper around menupick, a menu-picking utility I wrote which has seen use as a component of several other tools. I can type

    also Wizard

in the shell and get a menu of the files related to the wizard, select the ones I actually want to edit, and they show up in Emacs. This is more convenient than using Emacs itself to find and open them. I use it many times a day.

Or rather, I did until this week, when it suddenly stopped working. Everything ran fine until the execution of emacsclient, which would fail, saying:

 emacsclient: can't find socket; have you started the server?

(A socket is a facility that enables interprocess communication, in this case between emacs and emacsclient.)

This message is familiar. It usually means that I have forgotten to tell Emacs to start listening for emacsclient, by running M-x server-start. (I should have Emacs do this when it starts up, but I don't. Why not? I'm not sure.) So the first time it happened I went to Emacs and ran M-x server-start. Emacs announced that it had started the server, so I reran also. And the same thing happened.

 emacsclient: can't find socket; have you started the server?

Finding the socket

So the first question is: why can't emacsclient find the socket? And this resolves naturally into two subquestions: where is the socket, and where is emacsclient looking?

The second one is easily answered; I ran strace emacsclient (hi Julia!) and saw that the last interesting thing emacsclient did before emitting the error message was

    stat("/mnt/tmp/emacs2017/server", 0x7ffd90ec4d40) = -1 ENOENT (No such file or directory)

which means it's looking for the socket at /mnt/tmp/emacs2017/server but didn't find it there.

The question of where Emacs actually put the socket file was a little trickier. I did not run Emacs under strace because I felt sure that the output would be voluminous and it would be tedious to grovel over it.

I don't exactly remember now how I figured this out, but I think now that I probably made an educated guess, something like: emacsclient is looking in /mnt/tmp; this seems unusual. I would expect the socket to be under /tmp. Maybe it is under /tmp? So I looked under /tmp and there it was, in /tmp/emacs2017/server:

    srwx------ 1 mjd mjd 0 Jun 27 11:43 /tmp/emacs2017/server

(The s at the beginning there means that the file is a “Unix-domain socket”. A socket is an endpoint for interprocess communication. The most familiar sort is a TCP socket, which has a TCP address, and which enables communication over the internet. But since ancient times Unix has also supported Unix-domain sockets, which enable communication between two processes on the same machine. Instead of TCP addresses, such sockets are addressed using paths in the filesystem, in this case /tmp/emacs2017/server. When the server creates such a socket, it appears in the filesystem as a special type of file, as here.)

I confirmed that this was the correct file by typing M-x server-force-delete in Emacs; this immediately caused /tmp/emacs2017/server to disappear. Similarly M-x server-start made it reappear.

Why the disagreement?

Now the question is: Why is emacsclient looking for the socket under /mnt/tmp when Emacs is putting it in /tmp? They used to rendezvous properly; what has gone wrong? I recalled that there was some environment variable for controlling where temporary files are put, so I did

       env | grep mnt

to see if anything relevant turned up. And sure enough there was:

       TMPDIR=/mnt/tmp

When programs want to create tmporary files and directories, they normally do it in /tmp. But if there is a TMPDIR setting, they use that directory instead. This explained why emacsclient was looking for /mnt/tmp/emacs2017/socket. And the explanation for why Emacs itself was creating the socket in /tmp seemed clear: Emacs was failing to honor the TMPDIR setting.

With this clear explanation in hand, I began to report the bug in Emacs, using M-x report-emacs-bug. (The folks in the #emacs IRC channel on Freenode suggested this. I had a bad experience last time I tried #emacs, and then people mocked me for even trying to get useful information out of IRC. But this time it went pretty well.)

Emacs popped up a buffer with full version information and invited me to write down the steps to reproduce the problem. So I wrote down

     % export TMPDIR=/mnt/tmp
     % emacs

and as I did that I ran those commands in the shell.

Then I wrote

     In Emacs:
     M-x getenv TMPDIR
     (emacs claims there is no such variable)

and I did that in Emacs also. But instead of claiming there was no such variable, Emacs cheerfully informed me that the value of TMPDIR was /mnt/tmp.

(There is an important lesson here! To submit a bug report, you find a minimal demonstration. But then you also try the minimal demonstration exactly as you reported it. Because of what just happened! Had I sent off that bug report, I would have wasted everyone else's time, and even worse, I would have looked like a fool.)

My minimal demonstration did not demonstrate. Something else was going on.

Why no TMPDIR?

This was a head-scratcher. All I could think of was that emacsclient and Emacs were somehow getting different environments, one with the TMPDIR setting and one without. Maybe I had run them from different shells, and only one of the shells had the setting?

I got on a sidetrack at this point to find out why TMPDIR was set in the first place; I didn't think I had set it. I looked for it in /etc/profile, which is the default Bash startup instructions, but it wasn't there. But I also noticed an /etc/profile.d which seemed relevant. (I saw later that the /etc/profile contained instructions to load everything under /etc/profile.d.) And when I grepped for TMPDIR in the profile.d files, I found that it was being set by /etc/profile.d/ziprecruiter_environment.sh, which the sysadmins had installed. So that mystery at least was cleared up.

That got me on a second sidetrack, looking through our Git history for recent changes involving TMPDIR. There weren't any, so that was a dead end.

I was still puzzled about why Emacs sometimes got the TMPDIR setting and sometimes not. That's when I realized that my original Emacs process, the one that had failed to rendezvous with emacsclient, had not been started in the usual way. Instead of simply running emacs, I had run

    git re-edit

which invokes Git, which then runs

    /home/mjd/bin/git-re-edit

which is a Perl program I wrote that does a bunch of stuff to figure out which files I was editing recently and then execs emacs to edit them some more. So there are several programs here that could be tampering with the environment and removing the TMPDIR setting.

To more accurately point the finger of blame, I put some diagnostics into the git-re-edit program to have it print out the value of TMPDIR. Indeed, git-re-edit reported that TMPDIR was unset. Clearly, the culprit was Git, which must have been removing TMPDIR from the environment before invoking my Perl program.

Who is stripping the environment?

To confirm this conclusion, I created a tiny shell script, /home/mjd/bin/git-env, which simply printed out the environment, and then I ran git env, which tells Git to find git-env and run it. If the environment it printed were to omit TMPDIR, I would know Git was the culprit. But TMPDIR was in the output.

So I created a Perl version of git-env, called git-perlenv, which did the same thing, and I ran it via git perlenv. And this time TMPDIR was not in the output. I ran diff on the outputs of git env and git perlenv and they were identical—except that git perlenv was missing TMPDIR.

So it was Perl's fault! And I verified this by running perl /home/mjd/bin/git-re-edit directly, without involving Git at all. The diagnostics I had put in reported that TMPDIR was unset.

WTF Perl?

At this point I tried getting rid of get-re-edit itself, and ran the one-line program

    perl -le 'print $ENV{TMPDIR}'

which simply runs Perl and tells it to print out the value of the TMPDIR environment variable. It should print /mnt/tmp, but instead it printed the empty string. This is a smoking gun, and Perl no longer has anywhere to hide.

The mystery is not cleared up, however. Why was Perl doing this? Surely not a bug; someone else would have noticed such an obvious bug sometime in the past 25 years. And it only failed for TMPDIR, not for other variables. For example

    FOO=bar perl -le 'print $ENV{FOO}'

printed out bar as one would expect. This was weird: how could Perl's environment handling be broken for just the TMPDIR variable?

At this point I got Rik Signes and Frew Schmidt to look at it with me. They confirmed that the problem was not in Perl generally, but just in this Perl. Perl on other systems did not display this behavior.

I looked in the output of perl -V, which says what version of Perl you are using and which patches have been applied, and wasted a lot of time looking into CVE-2016-2381, which seemed relevant. But it turned out to be a red herring.

Working around the problem, 1.

While all this was going on I was looking for a workaround. Finding one is at least as important as actually tracking down the problem because ultimately I am paid to do something other than figure out why Perl is losing TMPDIR. Having a workaround in hand means that when I get sick and tired of looking into the underlying problem I can abandon it instantly instead of having to push onward.

The first workaround I found was to not use the Unix-domain socket. Emacs has an option to use a TCP socket instead, which is useful on systems that do not support Unix-domain sockets, such as non-Unix systems. (I am told that some do still exist.)

You set the server-use-tcp variable to a true value, and when you start the server, Emacs creates a TCP socket and writes a description of it into a “server file”, usually ~/.emacs.d/server/server. Then when you run emacsclient you tell it to connect to the socket that is described in the file, with

    emacsclient --server-file=~/.emacs.d/server/server

or by setting the EMACS_SERVER_FILE environment variable. I tried this, and it worked, once I figured out the thing about server-use-tcp and what a “server file” was. (I had misunderstood at first, and thought that “server file” meant the Unix-domain socket itself, and I tried to get emacsclient to use the right one by setting EMACS_SERVER_FILE, which didn't work at all. The resulting error message was obscure enough to lead me to IRC to ask about it.)

Working around the problem, 2.

I spent quite a while looking for an environment variable analogous to EMACS_SERVER_FILE to tell emacsclient where the Unix-domain socket was. But while there is a --socket-name command-line argument to control this, there is inexplicably no environment variable. I hacked my also command (responsible for running emacsclient) to look for an environment variable named EMACS_SERVER_SOCKET, and to pass its value to emacsclient --socket-name if there was one. (It probably would have been better to write a wrapper for emacsclient, but I didn't.) Then I put

    EMACS_SERVER_SOCKET=$TMPDIR/emacs$(id -u)/server

in my Bash profile, which effectively solved the problem. This set EMACS_SERVER_SOCKET to /mnt/tmp/emacs2017/server whenever I started a new shell. When I ran also it would notice the setting and pass it along to emacsclient with --socket-name, to tell emacsclient to look in the right place. Having set this up I could forget all about the original problem if I wanted to.

But but but WHY?

But why was Perl removing TMPDIR from the environment? I didn't figure out the answer to this; Frew took it to the #p5p IRC channel on perl.org, where the answer was eventually tracked down by Matthew Horsfall and Zefrem.

The answer turned out to be quite subtle. One of the classic attacks that can be mounted against a process with elevated privileges is as follows. Suppose you know that the program is going to write to a temporary file. So you set TMPDIR beforehand and trick it into writing in the wrong place, possibly overwriting or destroying something important.

When a program is loaded into a process, the dynamic loader does the loading. To protect against this attack, the loader checks to see if the program it is going to run has elevated privileges, say because it is setuid, and if so it sanitizes the process’ environment to prevent the attack. Among other things, it removes TMPDIR from the environment.

I hadn't thought of exactly this, but I had thought of something like it: If Perl detects that it is running setuid, it enables a secure mode which, among other things, sanitizes the environment. For example, it ignores the PERL5LIB environment variable that normally tells it where to look for loadable modules, and instead loads modules only from a few compiled-in trustworthy directories. I had checked early on to see if this was causing the TMPDIR problem, but the perl executable was not setuid and Perl was not running in secure mode.

But Linux supports a feature called “capabilities”, which is a sort of partial superuser privilege. You can give a program some of the superuser's capabilities without giving away the keys to the whole kingdom. Our systems were configured to give perl one extra capability, of binding to low-numbered TCP ports, which is normally permitted only to the superuser. And when the dynamic loader ran perl, it saw this additional capability and removed TMPDIR from the environment for safety.

This is why Emacs had the TMPDIR setting when run from the command line, but not when run via git-re-edit.

Until this came up, I had not even been aware that the “capabilities” feature existed.

A red herring

There was one more delightful confusion on the way to this happy ending. When Frew found out that it was just the Perl on my development machine that was misbehaving, he tried logging into his own, nearly identical development machine to see if it misbehaved in the same way. It did, but when he ran a system update to update Perl, the problem went away. He told me this would fix the problem on my machine. But I reported that I had updated my system a few hours before, so there was nothing to update!

The elevated capabilities theory explained this also. When Frew updated his system, the new Perl was installed without the elevated capability feature, so the dynamic loader did not remove TMPDIR from the environment.

When I had updated my system earlier, the same thing happened. But as soon as the update was complete, I reloaded my system configuration, which reinstated the capability setting. Frew hadn't done this.

Summary

  • The system configuration gave perl a special capability
  • so the dynamic loader sanitized its environment
  • so that when perl ran emacs,
  • the Emacs process didn't have the TMPDIR environment setting
  • which caused Emacs to create its listening socket in the usual place
  • but because emacsclient did get the setting, it looked in the wrong place

Conclusion

This computer stuff is amazingly complicated. I don't know how anyone gets anything done.

[ Addendum 20160709: Frew Schmidt has written up the same incident, but covers different ground than I do. ]

[ Addendum 20160709: A Hacker News comment asks what changed to cause the problem? Why was Perl losing TMPDIR this week but not the week before? Frew and I don't know! ]


[Other articles in category /tech] permanent link