I figured out that context manager bug!
A couple of days ago I described a strange
bug in my “Greenlight” project
that was causing Git to fail unpredictably, saying:
fatal: this operation must be run in a work tree
The problem seemed to go away when I changed
    with env_var("GIT_DIR", self.repo_dir):
        with env_var("GIT_WORK_TREE", self.work_dir):
            result = subprocess.run(command, ...)
to
    with env_var("GIT_DIR", self.repo_dir, "GIT_WORK_TREE", self.work_dir):
        result = subprocess.run(command, ...)
but I didn't understand why. I said:
This was so unexpected that I wondered if the real problem was
nondeterministic and if some of the debugging messages had somehow
perturbed it. But I removed everything but the context manager
change and ran another test, which succeeded. By then I was five
and a half hours into the debugging and I didn't have any energy left
to actually understand what the problem had been. I still don't
know.
The problem manifested again today, and this time I was able to
track it down and fix it. The context manager code I mentioned above
was not the issue.
That subprocess.run call is made inside a git_util object which,
as you can see in the tiny excerpt above, has a self.work_dir
attribute that tells it where to find the working tree. Just before
running a Git command, the git_util object installs self.work_dir
into the environment to tell Git where the working tree is.
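The article never shows the env_var helper itself. As a rough illustration only, a context manager of that shape might look like the sketch below. It assumes the helper takes alternating name/value arguments, as in the two-variable call above, and restores the previous environment on exit. The body is a guess, not Greenlight's actual code.

```python
import os
from contextlib import contextmanager


@contextmanager
def env_var(*pairs):
    """Temporarily set environment variables.

    Hypothetical sketch: arguments are name, value, name, value, ...
    """
    if len(pairs) % 2 != 0:
        raise ValueError("env_var expects name/value pairs")
    names = pairs[0::2]
    values = pairs[1::2]
    # Remember the old values (None means "was not set") so we can restore them.
    saved = {name: os.environ.get(name) for name in names}
    try:
        for name, value in zip(names, values):
            os.environ[name] = str(value)
        yield
    finally:
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old
```

A child process started with subprocess.run inside the with block inherits the modified environment, so whatever path is in work_dir at that moment is what Git sees as GIT_WORK_TREE.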
The git_util object is originally manufactured by Greenlight itself,
which sets the work_dir attribute to a path that contains the
current process ID number. Just before the process exits, Greenlight
destroys the working tree. This way, concurrent processes never try
to use the same working tree, which would be a mess.
When Greenlight needs to operate on the repository, it uses its
git_util object directly. It also creates a submission object to
represent the submitted branch, and it installs the git_util object
into the submission object, so that the submission object can also
operate on the repository. For example, the submission object may ask
its git_util object if it needs to be rebased onto some other
branch, and if so to please do it. So:
- Greenlight has a submission.
- submission.git is the git_util object that deals with Git.
- submission.git.work_dir is the path to the per-process temporary working tree.
Greenlight's main purpose is to track these submission objects, and it
has a database of them. To save time when writing the initial
implementation, instead of using a real database, I had Greenlight use
Python's “pickle” feature to pickle the list of submissions.
Someone would submit a branch, and Greenlight would pickle the
submission. The submission contained its git_util object, and that
got pickled along with the rest. Then Greenlight would exit and, just
before doing so, it would destroy its temporary working tree.
Then later, when someone else wanted to approve the submission for
publication, Greenlight would set up a different working tree with its
new process ID, and unpickle the submission. But the submission's
git.work_dir had been pickled with the old path, which no longer
existed.
The context manager was working just fine. It was setting
GIT_WORK_TREE to the work_dir value in the git_util object. But
the object was obsolete and its work_dir value pointed to a
directory that had been destroyed!
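The failure can be reproduced in miniature. The sketch below uses stand-in GitUtil and Submission classes (hypothetical, far simpler than the real ones) and simulates the two Greenlight runs inside a single process, with temporary directories playing the role of the per-process working trees.

```python
import os
import pickle
import shutil
import tempfile


class GitUtil:
    """Stand-in for git_util: only remembers where the working tree is."""
    def __init__(self, work_dir):
        self.work_dir = work_dir


class Submission:
    """Stand-in for a submission: carries its own GitUtil object."""
    def __init__(self, branch, git):
        self.branch = branch
        self.git = git


# "First run": make a working tree, pickle a submission, then clean up.
old_tree = tempfile.mkdtemp(prefix="greenlight-demo-")
blob = pickle.dumps(Submission("topic-branch", GitUtil(old_tree)))
shutil.rmtree(old_tree)  # the working tree is destroyed at exit

# "Second run": a fresh working tree, then unpickle the old submission.
new_tree = tempfile.mkdtemp(prefix="greenlight-demo-")
fresh_git = GitUtil(new_tree)
restored = pickle.loads(blob)

# The unpickled object still carries the path from the first run.
stale_path = restored.git.work_dir
stale_exists = os.path.isdir(stale_path)
fresh_exists = os.path.isdir(fresh_git.work_dir)
print("stale path is the old tree:", stale_path == old_tree)
print("stale path exists:", stale_exists)
print("fresh path exists:", fresh_exists)

shutil.rmtree(new_tree)  # tidy up the demo directory
```

Nothing in pickle is misbehaving here. It faithfully restores the work_dir string it was given, and that string names a directory that is gone.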
Adding to the confusion:
- Greenlight's own git_util object was always fresh and had the
  right path in it, so Git commands run directly by Greenlight all
  worked properly.
- Any new submission objects created by Greenlight would have the
  right path, so Git commands run by fresh submissions also worked
  properly.
- Greenlight doesn't always destroy the working tree when it exits.
  If it exits abnormally, it leaves the working tree intact, for a
  later autopsy. And the unpickled submission would work perfectly
  if the working tree still existed, and it would be impossible to
  reproduce the problem!
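The article does not say how the bug was actually fixed. One common way to avoid this kind of stale state, shown purely as an assumption, is to leave the per-process object out of the pickle and attach a fresh one after loading. The class names, the load_submission helper, and the paths below are all hypothetical.

```python
import pickle


class GitUtil:
    """Stand-in for git_util: only remembers where the working tree is."""
    def __init__(self, work_dir):
        self.work_dir = work_dir


class Submission:
    """Stand-in for a submission that refuses to pickle its GitUtil."""
    def __init__(self, branch, git):
        self.branch = branch
        self.git = git

    def __getstate__(self):
        # Copy the instance state but drop the per-process GitUtil,
        # so a stale work_dir can never be written to disk.
        state = self.__dict__.copy()
        state["git"] = None
        return state


def load_submission(blob, current_git):
    """Unpickle a submission and attach the current process's GitUtil."""
    submission = pickle.loads(blob)
    submission.git = current_git
    return submission


# Hypothetical paths standing in for two different process IDs.
blob = pickle.dumps(Submission("topic-branch", GitUtil("/tmp/greenlight-1111")))
restored = load_submission(blob, GitUtil("/tmp/greenlight-2222"))
print(restored.git.work_dir)  # /tmp/greenlight-2222
```

The design choice is that anything tied to the lifetime of one process stays out of persistent storage, and the loader supplies it again from the process doing the loading.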
Toward the end of the previous article, I said:
I suspect I'm being sabotaged somewhere by Python's weird implicit
ideas of scope and variable duration, but I don't know. Yet.
For the record, then: The issue was indeed one of variable duration.
But Python's weird implicit ideas were, in this instance, completely
blameless. Instead the issue was caused by a software component even
more complex and more poorly understood: “Dominus”.
This computer stuff is amazingly complicated. I don't know how anyone
gets anything done.