# The Universe of Discourse

Wed, 05 Dec 2018

A couple of days ago I described a strange bug in my “Greenlight” project that was causing Git to fail unpredictably, saying:

fatal: this operation must be run in a work tree

The problem seemed to go away when I changed

with env_var("GIT_DIR", self.repo_dir):
with env_var("GIT_WORK_TREE", self.work_dir):
result = subprocess.run(command, ...)

to

with env_var("GIT_DIR", self.repo_dir, "GIT_WORK_TREE", self.work_dir):
result = subprocess.run(command, ...)

but I didn't understand why. I said:

This was so unexpected that I wondered if the real problem was nondeterministic and if some of the debugging messages had somehow perturbed it. But I removed everything but the context manager change and ran another test, which succeeded. By then I was five and half hours into the debugging and I didn't have any energy left to actually understand what the problem had been. I still don't know.

The problem re-manifested again today, and this time I was able to track it down and fix it. The context manager code I mentioned above was not the issue.

That subprocess.run call is made inside a git_util object which, as you can see in the tiny excerpt above, has a self.work_dir attribute that tells it where to find the working tree. Just before running a Git command, the git_util object installs self.work_dir into the environment to tell Git where the working tree is.

The git_util object is originally manufactured by Greenlight itself, which sets the work_dir attribute to a path that contains the current process ID number. Just before the process exits, Greenlight destroys the working tree. This way, concurrent processes never try to use the same working tree, which would be a mess.

When Greenlight needs to operate on the repository, it uses its git_util object directly. It also creates a submission object to represent the submitted branch, and it installs the git_util object into the submission object, so that the submission object can also operate on the repository. For example, the submission object may ask its git_util object if it needs to be rebased onto some other branch, and if so to please do it. So:

• Greenlight has a submission.
• submission.git is the git_util object that deals with Git.
• submission.git.work_dir is the path to the per-process temporary working tree.

Greenlight's main purpose is to track these submission objects, and it has a database of them. To save time when writing the initial implementation, instead of using a real database, I had Greenlight use Python's “pickle” feature to pickle the list of submissions.

Someone would submit a branch, and Greenlight would pickle the submission. The submission contained its git_util object, and that got pickled along with the rest. Then Greenlight would exit and, just before doing so, it would destroy its temporary working tree.

Then later, when someone else wanted to approve the submission for publication, Greenlight would set up a different working tree with its new process ID, and unpickle the submission. But the submission's git.work_dir had been pickled with the old path, which no longer existed.

The context manager was working just fine. It was setting GIT_WORK_TREE to the work_dir value in the git_util object. But the object was obsolete and its work_dir value pointed to a directory that had been destroyed!

Adding to the confusion:

1. Greenlight's own git_util object was always fresh and had the right path in it, so Git commands run directly by Greenlight all worked properly.

2. Any new submission objects created by Greenlight would have the right path, so Git commands run by fresh submissions also worked properly.

3. Greenlight doesn't always destroy the working tree when it exits. If it exits abnormally, it leaves the working tree intact, for a later autopsy. And the unpickled submission would work perfectly if the working tree still existed, and it would be impossible to reproduce the problem!

Toward the end of the previous article, I said:

I suspect I'm being sabotaged somewhere by Python's weird implicit ideas of scope and variable duration, but I don't know. Yet.

For the record, then: The issue was indeed one of variable duration. But Python's weird implicit ideas were, in this instance, completely blameless. Instead the issue was cause by a software component even more complex and more poorly understood: “Dominus”.

This computer stuff is amazingly complicated. I don't know how anyone gets anything done.

Sun, 02 Dec 2018

There are combinatorial objects called necklaces and bracelets and I can never remember which is which.

Both are finite sequences of things (typically symbols) where the start point is not important. So the bracelet ABCDE is considered to be the same as the bracelets BCDEA, CDEAB, DEABC, and EABCD.

One of the two also disregards the direction you go, so that ABCDE is also considered the same as EDCBA (and so also DCBAE, etc.). But which? I have to look it up every time.

I have finally thought of a mnemonic. In a necklace, the direction is important, because to reverse an actual necklace you have to pull it off over the wearer's head, turn it over, and put it back on. But in a bracelet the direction is not important, because it could be on either wrist, and a bracelet on the left wrist is the same as the reversed bracelet on the right wrist.

Okay, silly, maybe, but I think it's going to work.

I'm working on a large and wonderful project called “Greenlight”. It's a Git branch merging service that implements the following workflow:

1. Submitter submits a branch to Greenlight (greenlight submit my-topic-branch)
2. Greenlight analyzes the branch to decide if it changes anything that requires review and signoff
3. If so, it contacts the authorized reviewers, who then inform Greenlight that they approve the changes (greenlight approve 03a46dc1)
4. Greenlight merges the branch to master and publishes the result to the central repository

Of course, there are many details elided here.

Multiple instances of Greenlight share a local repository, but to avoid confusion each has its own working tree. In Git you can configure these by setting GIT_DIR and GIT_WORK_TREE environment variables, respectively. When Greenlight needs to run a Git command, it does so like this:

with env_var("GIT_DIR", self.repo_dir):
with env_var("GIT_WORK_TREE", self.work_dir):
result = subprocess.run(command, ...)

The env_var here is a Python context manager that saves the old environment, sets the new environment variable, and then when the body of the block is complete, it restores the environment to the way it was. This worked in testing every time.

But the first time a beta tester ran the approve command, Greenlight threw a fatal exception. It was trying to run git checkout --quiet --detach, and this was failing, with Git saying

fatal: this operation must be run in a work tree

Where was the GIT_WORK_TREE setting going? I still don't know. But in the course of trying to track the problem down, I changed the code above to:

with env_var("GIT_DIR", self.repo_dir, "GIT_WORK_TREE", self.work_dir):
result = subprocess.run(command, ...)

and the problem, whatever it was, no longer manifested.

But this revealed a second bug: Greenlight no longer failed in the approval phase. It went ahead and merged the branch, and then tried to publish the merge with git push origin .... But the push was rejected.

This is because the origin repository had an update hook that ran on every push, which performed the same review analysis that Greenlight was performing; one of Greenlight's main purposes is to be a replacement for this hook. To avoid tying up the main repository for too long, this hook had a two-minute timeout, after which it would die and reject the push. This had only happened very rarely in the past, usually when someone was inadvertently trying to push a malformed branch. For example, they might have rebased all of master onto their topic branch. In this case, however, the branch really was legitimately enormous; it contained over 2900 commits.

“Oh, right,” I said. “I forgot to add the exception to the hook that tells it that it can immediately approve anything pushed by Greenlight.” The hook can assume that if the push comes from Greenlight, it has already been checked and authorized.

Pushes are happening via SSH, and Greenlight has its own SSH identity, which is passed to the hook itself in the GL_USERNAME variable. Modifying the hook was easy: I just added:

if environ["GL_USERNAME"] == 'greenlight':
exit(0)

This didn't work. My first idea was that Greenlight's public SSH key had not been installed in the authorized_keys file in the right place. When I grepped for greenlight in the authorized_keys file, there were no matches. The key was actually there, but in Gitlab the authorized_keys file doesn't have actual usernames in it. It has internal userids, which are then mapped to GL_USERNAME variables by some other entity. So I chased that wild goose for a while. Eventually I determined that the key was in the right place, but that the name of the Greenlight identity on the receiving side was not greenlight but bot-greenlight, which I had forgotten.

So I changed the exception to say:

if environ["GL_USERNAME"] == 'bot-greenlight':
exit(0)

and it still didn't work. I eventually discovered that when Greenlight did the push, the GL_USERNAME was actually set to mjd.

“Oh, right,” I said. “I forgot to have Greenlight use its own SSH credentials in the ssh connection.”

The way you do this is to write a little wrapper program that obtains the correct credentials and runs ssh, and then you set GIT_SSH to point to the wrapper. It looks like this:

#!/usr/bin/env bash

export -n SSH_CLIENT SSH_TTY SSH_AUTH_SOCK SSH_CONNECTION
exec /usr/bin/ssh -i $HOME/.ssh/identity "$@"

But wait, why hadn't I noticed this before? Because, apparently, every single person who had alpha-tested Greenlight had had their own credentials stored in ssh-agent, and every single one had had agent-forwarding enabled, so that when Greenlight tried to use ssh to connect to the Git repository, SSH duly forwarded their credentials along and the pushes succeeded. Amazing.

With these changes, the publication went through. I committed the changes to the SSH credential stuff, and some other unrelated changes, and I looked at what was left to see what had actually fixed the original bug. Every change but one was to add diagnostic messages and logging. The fix for the original bug had been to replace the nested context managers with a single context manager. This was so unexpected that I wondered if the real problem was nondeterministic and if some of the debugging messages had somehow perturbed it. But I removed everything but the context manager change and ran another test, which succeeded. By then I was five and half hours into the debugging and I didn't have any energy left to actually understand what the problem had been. I still don't know.

If you'd like to play along at home, the context manager looks like this, and did not change during the debugging process:

from contextlib import contextmanager

@contextmanager
def env_var(*args):

# Save old values of environment variables in old
# A saved value of None means that the variable was not there before
old = {}
for i in range(len(args)//2):
(key, value) = (args[2*i : 2*i+2])
old[key] = None
if key in os.environ:
old[key] = os.environ[str(key)]

if value is None: os.environ.pop(str(key), "dummy")
else:
os.environ[str(key)] = str(value)

yield

# Undo changes from versions saved in old
for (key, value) in old.items():
if value is None: os.environ.pop(str(key), "dummy")
else:             os.environ[str(key)] = value

I suspect I'm being sabotaged somewhere by Python's weird implicit ideas of scope and variable duration, but I don't know. Yet.

This computer stuff is amazingly complicated. I don't know how anyone gets anything done.

[ Addendum 20181204: I figured it out. ]