The Universe of Discourse
           
Tue, 12 Dec 2006

ssh-agent
When you do a remote login with ssh, the program needs to use your secret key to respond to the remote system's challenge. The secret key is itself kept scrambled on the disk, to protect it from being stolen, and requires a passphrase to be unscrambled. So the remote login process prompts you for the passphrase, unscrambles the secret key from the disk, and then responds to the challenge.

Typing the passphrase every time you want to do a remote login or run a remote process is a nuisance, so the ssh suite comes with a utility program called "ssh-agent". ssh-agent unscrambles the secret key and remembers it; it then provides the key to any process that can contact it through a certain network socket.

The process for login is now:

        plover% eval `ssh-agent`
        Agent pid 23918
        plover% ssh-add
        Need passphrase for /home/mjd/.ssh/identity
        Enter passphrase for mjd@plover: (supply passphrase here)
        plover% ssh remote-system-1
        (no input required here)
        ...
        plover% ssh remote-system-2
        (no input required here either)
        ...
The important thing here is that once ssh-agent is started up and supplied with the passphrase, which need be done only once, ssh itself can contact the agent through the socket and get the key as many times as needed.

How does ssh know where to contact the agent process? ssh-agent prints an output something like this one:

        SSH_AUTH_SOCK=/tmp/ssh-GdT23917/agent.23917; export SSH_AUTH_SOCK;
        SSH_AGENT_PID=23918; export SSH_AGENT_PID;
        echo Agent pid 23918;
The eval `...` command tells the shell to execute this output as code; this installs two variables into the shell's environment, whence they are inherited by programs run from the shell, such as ssh. When ssh runs, it looks for the SSH_AUTH_SOCK variable. Finding it, it connects to the specified socket, in /tmp/ssh-GdT23917/agent.23917, and asks the agent for the required private key.

Now, suppose I log in to my home machine, say from work. The new login shell does not have any environment settings pertaining to ssh-agent. I can, of course, run a new agent for the new login shell. But there is probably an agent process running somewhere on the machine already. If I run another every time I log in, the agent processes will proliferate. Also, running a new agent requires that the new agent be supplied with the private key, which would require that I type the passphrase, which is long.

Clearly, it would be more convenient if I could tell the new shell to contact the existing agent, which is already running. I can write a command which finds an existing agent and manufactures the appropriate environment settings for contacting that agent. Or, if it fails to find an already-running agent, it can just run a new one, just as if ssh-agent had been invoked directly. The program is called ssh-findagent.

I put a surprising amount of work into this program. My first idea was that it can look through /tmp for the sockets. For each socket, it can try to connect to the socket; if it fails, the agent that used to be listening on the other end is dead, and ssh-findagent should try a different socket. If none of the sockets are live, ssh-findagent should run the real ssh-agent. But if there is an agent listening on the other end, ssh-findagent can print out the appropriate environment settings. The socket's filename has the form /tmp/ssh-XXXPID/agent.PID; this is the appropriate value for the SSH_AUTH_SOCK variable.

The SSH_AGENT_PID variable turned out to be harder. I thought I would just get it from the SSH_AUTH_SOCK value with a pattern match. Wrong. The PID in the SSH_AUTH_SOCK value and in the filename is not actually the pid of the agent process. It is the pid of the parent of the agent process, which is typically, but not always, 1 less than the true pid. There is no reliable way to calculate the agent's pid from the SSH_AUTH_SOCK filename. (The mismatch seems unavoidable to me, and the reasons why seem interesting, but I think I'll save them for another blog article.)

After that I had the idea of having ssh-findagent scan the process table looking for agents. But this has exactly the reverse problem: You can find the agent processes in the process table, but you can't find out which socket files they are attached to.

My next thought was that maybe linux supports the getpeerpid system call, which takes an open file descriptor and returns the pid of the peer process on other end of the descriptor. Maybe it does, but probably not, because support for this system call is very rare, considering that I just made it up. And anyway, even if it were supported, it would be highly nonportable.

I had a brief hope that from a pid P, I could have ssh-findagent look in /proc/P/fd/* and examine the process file descriptor table for useful information about which process held open which file. Nope. The information is there, but it is not useful. Here is a listing of /proc/31656/fd, where process 31656 is an agent process:

        lrwx------    1 mjd      users          64 Dec 12 23:34 3 -> socket:[711505562]
File descriptors 0, 1, and 2 are missing. This is to be expected; it means that the agent process has closed stdin, stdout, and stderr. All that remains is descriptor 3, clearly connected to the socket. Which file represents the socket in the filesystem? No telling. All we have here is 711595562, which is an index into some table of sockets inside the kernel. The kernel probably doesn't know the filename either. Filenames appear in directories, where they are associated with file ID numbers (called "i-numbers" in Unix jargon.) The file ID numbers can be looked up in the kernel "inode" table to find out what kind of files they are, and, in this case, what socket the file represents. So the kernel can follow the pointers from the filesystem to the inode table to the open socket table, or from the process's open file table to the open socket table, but not from the process to the filesystem.

Not easily, anyway; there is a program called lsof which grovels over the kernel data structures and can associate a process with the names of the files it has open, or a filename with the ID numbers of the processes that have it open. My last attempt at getting ssh-findagent to work was to have it run lsof over the possible socket files, getting a listing of all processes that had any of the socket files open, and, for each process, which file. When it found a process-file pair in the output of lsof, it could emit the appropriate environment settings.

The final code was quite simple, since lsof did all the heavy lifting. Here it is:

        #!/usr/bin/perl

        open LSOF, "lsof /tmp/ssh-*/agent.* |"
            or die "couldn't run lsof: $!\n";
        <LSOF>;  # discard header
        my $line = <LSOF>;
        if ($line) {
          my ($pid, $file) = (split /\s+/, $line)[1,-1];
          print "SSH_AUTH_SOCK=$file; export SSH_AUTH_SOCK;
        SSH_AGENT_PID=$pid; export SSH_AGENT_PID;
        echo Agent pid $pid;
        ";
          exit 0;
        } else {
          print "echo starting new agent\n";
          exec "ssh-agent";
          die "Couldn't run new ssh-agent: $!\n";
        }
I complained about the mismatch between the agent's pid and the pid in the socket filename on IRC, and someone there mentioned offhandedly how they solve the same problem.

It's hard to figure out what the settings should be. So instead of trying to figure them out, one should save them to a file when they're first generated, then load them from the file as required.

Oh, yeah. A file. Duh.

It looks like this: Instead of loading the environment settings from ssh-agent directly into the shell, with eval `ssh-agent`, you should save the settings into a file:

        plover% ssh-agent > ~/.ssh/agent-env
Then tell the shell to load the settings from the file:

        plover% . ~/.ssh/agent-env
        Agent pid 23918
        plover% ssh remote-system-1
        (no input required here)
The . command reads the file and acquires the settings. And whenever I start a fresh login shell, it has no settings, but I can easily acquire them again with . ~/.ssh/agent-env, without running another agent process. I wonder why I didn't think of this before I started screwing around with sockets and getpeerpid and /proc/*/fd and lsof and all that rigmarole.

Oh well, I'm not yet a perfect sage.

[ Addendum 20070105: I have revisited this subject and the reader's responses. ]


[Other articles in category /oops] permanent link