| 
 
Another day, another bug.  No, four bugs.
 I'm working on a large and wonderful project called “Greenlight”.
It's a Git branch merging service that implements the following
workflow: 
Submitter submits a branch to Greenlight (greenlight submit my-topic-branch)Greenlight analyzes the branch to decide if it changes anything
that requires review and signoffIf so, it contacts the authorized reviewers, who then inform
Greenlight that they approve the changes (greenlight approve 03a46dc1)Greenlight merges the branch to masterand publishes the result
to the central repository Of course, there are many details elided here. Multiple instances of Greenlight share a local repository, but to avoid
confusion each has its own working tree.  In Git you can configure
these by setting GIT_DIRandGIT_WORK_TREEenvironment variables,
respectively.  When Greenlight needs to run a Git command, it does so
like this:     with env_var("GIT_DIR", self.repo_dir):
        with env_var("GIT_WORK_TREE", self.work_dir):
            result = subprocess.run(command, ...)
 The env_varhere is a Python context manager that saves the old
environment, sets the new environment variable, and then when the body
of the block is complete, it restores the environment to the way it
was.  This worked in testing every time. But the first time a beta tester ran the approvecommand, Greenlight
threw a fatal exception.  It was trying to rungit checkout --quiet
--detach, and this was failing, with Git saying fatal: this operation must be run in a work tree
 Where was the GIT_WORK_TREEsetting going?  I still don't know.  But
in the course of trying to track the problem down, I changed the code
above to:     with env_var("GIT_DIR", self.repo_dir, "GIT_WORK_TREE", self.work_dir):
        result = subprocess.run(command, ...)
 and the problem, whatever it was, no longer manifested. But this revealed a second bug:  Greenlight no longer failed in the
approval phase. It went ahead and merged the branch, and then tried
to publish the merge with git push origin ....  But the push was
rejected. This is because the originrepository had anupdatehook that ran
on every push, which performed the same review analysis that Greenlight
was performing; one of Greenlight's main purposes is to be a
replacement for this hook.  To avoid tying up the main repository for
too long, this hook had a two-minute timeout, after which it would die
and reject the push.  This had only happened very rarely in the past,
usually when someone was inadvertently trying to push a malformed
branch.  For example, they might have rebased all ofmasteronto
their topic branch.  In this case, however, the branch really was
legitimately enormous; it contained over 2900 commits. “Oh, right,” I said.  “I forgot to add the exception to the hook that
tells it that it can immediately approve anything pushed by
Greenlight.”  The hook can assume that if the push comes from
Greenlight, it has already been checked and authorized. Pushes are happening via SSH, and Greenlight has its own SSH identity,
which is passed to the hook itself in the GL_USERNAMEvariable.
Modifying the hook was easy: I just added:       if environ["GL_USERNAME"] == 'greenlight':
            exit(0)
 This didn't work.  My first idea was that Greenlight's public SSH key
had not been installed in the authorized_keysfile in the right
place.  When I grepped forgreenlightin theauthorized_keysfile,
there were no matches.  The key was actually there, but in Gitlab theauthorized_keysfile doesn't have actual usernames in it.  It has
internal userids, which are then mapped toGL_USERNAMEvariables by
some other entity.  So I chased that wild goose for a while.
Eventually I determined that the key was in the right place, but
that the name of the Greenlight identity on the receiving side was notgreenlightbutbot-greenlight, which I had forgotten. So I changed the exception to say:       if environ["GL_USERNAME"] == 'bot-greenlight':
            exit(0)
 and it still didn't work.  I eventually discovered that when
Greenlight did the push, the GL_USERNAMEwas actually set tomjd. “Oh, right,” I said.  “I forgot to have Greenlight use its own
SSH credentials in the sshconnection.” The way you do this is to write a little wrapper program that obtains
the correct credentials and runs ssh, and then you setGIT_SSHto
point to the wrapper.  It looks like this:     #!/usr/bin/env bash
    export -n SSH_CLIENT SSH_TTY SSH_AUTH_SOCK SSH_CONNECTION
    exec /usr/bin/ssh -i $HOME/.ssh/identity  "$@"
 But wait, why hadn't I noticed this before?  Because, apparently,
every single person who had alpha-tested Greenlight had had their own
credentials stored in ssh-agent, and every single one had had
agent-forwarding enabled, so that when Greenlight tried to usesshto connect to the Git repository, SSH duly forwarded their credentials
along and the pushes succeeded.  Amazing. With these changes, the publication went through.  I committed the
changes to the SSH credential stuff, and some other unrelated changes,
and I looked at what was left to see what had actually fixed the
original bug.  Every change but one was to add diagnostic messages and
logging.  The fix for the original bug had been to replace the nested
context managers with a single context manager.  This was so
unexpected that I wondered if the real problem was nondeterministic
and if some of the debugging messages had somehow perturbed it.  But I
removed everything but the context manager change and ran another
test, which succeeded.  By then I was five and half hours into the
debugging and I didn't have any energy left to actually understand
what the problem had been.  I still don't know. If you'd like to play along at home, the context manager looks like
this, and did not change during the debugging process:     from contextlib import contextmanager
    @contextmanager
    def env_var(*args):
        # Save old values of environment variables in `old`
        # A saved value of `None` means that the variable was not there before
        old = {}
        for i in range(len(args)//2):
            (key, value) = (args[2*i : 2*i+2])
            old[key] = None
            if key in os.environ:
                old[key] = os.environ[str(key)]
            if value is None: os.environ.pop(str(key), "dummy")
            else:
                os.environ[str(key)] = str(value)
        yield
        # Undo changes from versions saved in `old`
        for (key, value) in old.items():
            if value is None: os.environ.pop(str(key), "dummy")
            else:             os.environ[str(key)] = value
 I suspect I'm being sabotaged somewhere by Python's weird implicit
ideas of scope and variable duration, but I don't know.  Yet. This computer stuff is amazingly complicated.  I don't know how anyone
gets anything done. [ Addendum 20181204: I figured it
out. ] 
 
[Other articles in category /prog/bug] 
permanent link
 
 |