The Universe of Discourse


Sat, 16 Dec 2023

My Git pre-commit hook contained a footgun

The other day I made some changes to a program, but when I ran the tests they failed in a very bizarre way I couldn't understand. After a bit of investigation I still didn't understand. I decided to try to narrow down the scope of possible problems by reverting the code to the unmodified state, then introducing changes from one file at a time.

My plan was: commit all the new work, reset the working directory back to the last good commit, and then start pulling in file changes. So I typed in rapid succession:

git add -u
git commit -m 'broken'
git branch wat
git reset --hard good

So the complete broken code was on the new branch wat.

Then I wanted to pull in the first file from wat. But when I examined wat there were no changes.

Wat.

I looked all around the history and couldn't find the changes. The wat branch was there but it was on the current commit, the one with none of the changes I wanted. I checked in the reflog for the commit and didn't see it.

Eventually I looked back in my terminal history and discovered the problem: I had a Git pre-commit hook which git-commit had attempted to run before it made the new commit. It checks for strings I don't usually intend to commit, such as XXX and the like.

This time one of the files had something like that. My pre-commit hook had printed an error message and exited with a failure status, so git-commit aborted without making the commit. But I had typed the commands in quick succession without paying attention to what they were saying, so I went ahead with the git-reset without even seeing the error message. This wiped out the working tree changes that I had wanted to preserve.

Fortunately the git-add had gone through, so the modified files were in the repository anyway, just hard to find. And even more fortunately, last time this happened to me, I wrote up instructions about what to do. This time around recovery was quicker and easier. I knew I only needed to recover stuff from the last add command, so instead of analyzing every loose object in the repository, I did

find .git/objects -mmin 10 --type f

to locate loose objects that had been modified in the last ten minutes. There were only half a dozen or so. I was able to recover the lost changes without too much trouble.

Looking back at that previous article, I see that it said:

it only took about twenty minutes… suppose that it had taken much longer, say forty minutes instead of twenty, to rescue the lost blobs from the repository. Would that extra twenty minutes have been time wasted? No! … The rescue might have cost twenty extra minutes, but if so it was paid back with forty minutes of additional Git expertise…

To that I would like to add, the time spent writing up the blog article was also well-spent, because it meant that seven years later I didn't have to figure everything out again, I just followed my own instructions from last time.

But there's a lesson here I'm still trying to figure out. Suppose I want to prevent this sort of error in the future. The obvious answer is “stop splatting stuff onto the terminal without paying attention, jackass”, but that strategy wasn't sufficient this time around and I couldn't think of any way to make it more likely to work next time around.

You have to play the hand you're dealt. If I can't fix myself, maybe I can fix the software. I would like to make some changes to the pre-commit hook to make it easier to recover from something like this.

My first idea was that the hook could unconditionally save the staged changes somewhere before it started, and then once it was sure that it would complete it could throw away the saved changes. For example, it might use the stash for this.

(Although, strangely, git-stash does not seem to have an easy way to say “stash the current changes, but without removing them from the working tree”. Maybe git-stash save followed by git-stash apply would do what I wanted? I have not yet experimented with it.)

Rather than using the stash, the hook might just commit everything (with commit -n to prevent infinite loops) and then reset the commit immediately, before doing whatever it was planning to do. Then if it was successful, Git would make a second, permanent commit and we could forget about the one made by the hook. But if something went wrong, the hook's commit would still be in the reflog. This doubles the number of commits you make. That doesn't take much time, because Git commit creation is lightning fast. But it would tend to clutter up the reflog.

Thinking on it now, I wonder if a better approach isn't to turn the pre-commit hook into a post-commit hook. Instead of a pre-commit hook that does this:

  1. Check for errors in staged files
    • If there are errors:
      1. Fix the files (if appropriate)
      2. Print a message
      3. Fail
    • Otherwise:
      1. Exit successfully
      2. (git-commit continues and commits the changes)

How about a post-commit hook that does this:

  1. Check for errors in the files that changed in the current head commit
    • If there are errors:
      1. Soft-reset back to the previous commit
      2. Fix the files (if appropriate)
      3. Print a message
      4. Fail
    • Otherwise:
      1. Exit successfully

Now suppose I ignore the failure, and throw away the staged changes. It's okay, the changes were still committed and the commit is still in the reflog. This seems clearly better than my earlier ideas.

I'll consider it further and report back if I actually do anything about this.

Larry Wall once said that too many programmers will have a problem, think of a solution, and implement it, but it works better if you can think of several solutions, then implement the one you think is best.

That's a lesson I think I have learned. Thanks, Larry.

Addendum

I see that Eric Raymond's version of the jargon file, last revised December 2003, omits “footgun”. Surely this word is not that new? I want to see if it was used on Usenet prior to that update, but Google Groups search is useless for this question. Does anyone have suggestions for how to proceed?


[Other articles in category /prog/git] permanent link

Mon, 27 Feb 2023

I wish people would stop insisting that Git branches are nothing but refs

I periodically write about Git, and sometimes I say something like:

Branches are named sequences of commits

and then a bunch of people show up and say “this is wrong, a branch is nothing but a ref”. This is true, but only in a very limited and unhelpful way. My description is a more useful approximation to the truth.

Git users think about branches and talk about branches. The Git documentation talks about branches and many of the commands mention branches. Pay attention to what experienced users say about branches while using Git, and it will be clear that they do not think of branches simply as just refs. In that sense, branches do exist: they are part of our mental model of how the repository works.

Are you a Git user who wants to argue about this? First ask yourself what we mean when we say “is your topic branch up to date?” “be sure to fetch the dev branch” “what branch did I do that work on?” “is that commit on the main branch or the dev branch?” “Has that work landed on the main branch?” “The history splits in two here, and the left branch is Alice's work but the right branch is Bob's”. None of these can be understood if you think that a branch is nothing but a ref. All of these examples show that when even the most sophisticated Git users talk about branches, they don't simply mean refs; they mean sequences of commits.

Here's an example from the official Git documentation, one of many: “If the upstream branch already contains a change you have made…”. There's no way to understand this if you insist that “branch” here means a ref or a single commit. The current Git documentation contains the word “branch” over 1400 times. Insisting that “a branch is nothing but a ref” is doing people disservice, because they are going to have to unlearn that in order to understand the documentation.

Some unusually dogmatic people might still argue that a branch is nothing but a ref. “All those people who say those things are wrong,” they might say, “even the Git documentation is wrong,” ignoring the fact that they also say those things. No, sorry, that is not the way language works. If someone claims that a true shoe is is really a Javanese dish of fried rice and fish cake, and that anyone who talks about putting shoes on their feet is confused or misguided, well, that person is just being silly.

The reason people say this, the disconnection is that the Git software doesn't have any formal representation of branches. Conceptually, the branch is there; the git commands just don't understand it. This is the most important mismatch between the conceptual model and what the Git software actually does.

Usually when a software model doesn't quite match its domain, we recognize that it's the software that is deficient. We say “the software doesn't represent that concept well” or “the way the software deals with that is kind of a hack”. We have a special technical term for it: it's a “leaky abstraction”. A “leaky abstraction” is when you ought to be able to ignore the underlying implementation, but the implementation doesn't reflect the model well enough, so you have to think about it more than you would like to.

When there's a leaky abstraction we don't normally try to pretend that the software's deficient model is actually correct, and that everyone in the world is confused. So why not just admit what's going on here? We all think about branches and talk about branches, but Git has a leaky abstraction for branches and doesn't handle branches very well. That's all, nothing unusual. Sometimes software isn't perfect.

When the Git software needs to deal with branches, it has to finesse the issue somehow. For some commands, hardly any finesse is required. When you do git log dev to get the history of the dev branch, Git starts at the commit named dev and then works its way back, parent by parent, to all the ancestor commits. For history logs, that's exactly what you want! But Git never has to think of the branch as a single entity; it just thinks of one commit at a time.

When you do git-merge, you might think you're merging two branches, but again Git can finesse the issue. Git has to look at a little bit of history to figure out a merge base, but after that it's not merging two branches at all, it's merging two sets of changes.

In other cases Git uses a ref to indicate the end point of the branch (called the ‘tip’), and sorta infers the start point from context. For example, when you push a branch, you give the software a ref to indicate the end point of the branch, and it infers the start point: the first commit that the remote doesn't have already. When you rebase a branch, you give the software a ref to indicate the end point of the branch, and the software infers the start point, which is the merge-base of the start point and the upstream commit you're rebasing onto. Sometimes this inference goes awry and the software tries to rebase way more than you thought it would: Git's idea of the branch you're rebasing isn't what you expected. That doesn't mean it's right and you're wrong; it's just a miscommunication.

And sometimes the mismatch isn't well-disguised. If I'm looking at some commit that was on a branch that was merged to master long ago, what branch was that exactly? There's no way to know, if the ref was deleted. (You can leave a note in the commit message, but that is not conceptually different from leaving a post-it on your monitor.) The best I can do is to start at the current master, work my way back in history until I find the merge point, then study the other commits that are on the same topic branch to try to figure out what was going on. What if I merged some topic branch into master last week, other work landed after that, and now I want to un-merge the topic? Sorry, Git doesn't do that. And why not? Because the software doesn't always understand branches in the way we might like. Not because the question doesn't make sense, just because the software doesn't always do what we want.

So yeah, the the software isn't as good as we might like. What software is? But to pretend that the software is right, and that all the defects are actually benefits is a little crazy. It's true that Git implements branches as refs, plus also a nebulous implicit part that varies from command to command. But that's an unfortunate implementation detail, not something we should be committed to.

[ Addendum 20230228: Several people have reminded me that the suggestions of the next-to-last paragraph are possible in some other VCSes, such as Mercurial. I meant to mention this, but forgot. Thanks for the reminder. ]


[Other articles in category /prog/git] permanent link

Wed, 06 Jul 2022

Things I wish everyone knew about Git (Part II)

This is a writeup of a talk I gave in December for my previous employer. It's long so I'm publishing it in several parts:

The most important material is in Part I.

It is really hard to lose stuff

A Git repository is an append-only filesystem. You can add snapshots of files and directories, but you can't modify or delete anything. Git commands sometimes purport to modify data. For example git commit --amend suggests that it amends a commit. It doesn't. There is no such thing as amending a commit; commits are immutable.

Rather, it writes a completely new commit, and then kinda turns its back on the old one. But the old commit is still in there, pristine, forever.

In a Git repository you can lose things, in the sense of forgetting where they are. But they can almost always be found again, one way or another, and when you find them they will be exactly the same as they were before. If you git commit --amend and change your mind later, it's not hard to get the old ⸢unamended⸣ commit back if you want it for some reason.

  • If you have the SHA for a file, it will always be the exact same version of the file with the exact same contents.

  • If you have the SHA for a directory (a “tree” in Git jargon) it will always contain the exact same versions of the exact same files with the exact same names.

  • If you have the SHA for a commit, it will always contain the exact same metainformation (description, when made, by whom, etc.) and the exact same snapshot of the entire file tree.

Objects can have other names and descriptions that come and go, but the SHA is forever.

(There's a small qualification to this: if the SHA is the only way to refer to a certain object, if it has no other names, and if you haven't used it for a few months, Git might discard it from the repository entirely.)

But what if you do lose something?

There are many good answers to this question but I think the one to know first is git-reflog, because it covers the great majority of cases.

The git-reflog command means:

“List the SHAs of commits I have visited recently”

When I run git reflog the top of the output says what commits I had checked out at recently, with the top line being the commit I have checked out right now:

    523e9fa1 HEAD@{0}: checkout: moving from dev to pasha
    5c31648d HEAD@{1}: pull: Fast-forward
    07053923 HEAD@{2}: checkout: moving from pr2323 to dev
    ...

The last thing I did was check out the branch named pasha; its tip commit is at 523e9f1a.

Before that, I did git pull and Git updated my local dev branch from the remote one, updating it to 5c31648d.

Before that, I had switched to dev from a different branch, pr2323. At that time, before the pull, dev referred to commit 07053923.

Farther down in the output are some commits I visited last August:

    ...
    58ec94f6 HEAD@{928}: pull --rebase origin dev: checkout 58ec94f6d6cb375e09e29a7a6f904e3b3c552772
    e0cfbaee HEAD@{929}: commit: WIP: model classes for condensedPlate and condensedRNAPlate
    f8d17671 HEAD@{930}: commit: Unskip tests that depend on standard seed data
    31137c90 HEAD@{931}: commit (amend): migrate pedigree tests into test/pedigree
    a4a2431a HEAD@{932}: commit: migrate pedigree tests into test/pedigree
    1fe585cb HEAD@{933}: checkout: moving from LAB-808-dao-transaction-test-mode to LAB-815-pedigree-extensions
    ...

Flux capacitor (magic
time-travel doohickey) from “Back to the Future”

Suppose I'm caught in some horrible Git nightmare. Maybe I deleted the entire test suite or accidentally put my Small Wonder fanfic into a commit message or overwrote the report templates with 150 gigabytes of goat porn. I can go back to how things were before. I look in the reflog for the SHA of the commit just before I made my big blunder, and then:

    git reset --hard 881f53fa

Phew, it was just a bad dream.

(Of course, if my colleagues actually saw the goat porn, it can't fix that.)

I would like to nominate Wile E. Coyote to be the mascot of Git. Because Wile E. is always getting himself into situations like this one:

Wile E., a cartoon coyote has just fired a
shotgun at Bugs Bunny.  For some reason the shotgun has fired
backwards and blown his face off, as Git sometimes does.

But then, in the next scene, he is magically unharmed. That's Git.

Finding old stuff with git-reflog

  • git reflog by itself lists the places that HEAD has been
  • git reflog some-branch lists the places that some-branch has been
  • That HEAD@{1} thing in the reflog output is another way to name that commit if you don't want to use the SHA.
  • You can abbreviate it to just @{1}.
  • The following locutions can be used with any git command that wants you to identify a commit:

    • @{17} (HEAD as it was 17 actions ago)
    • @{18:43} (HEAD as it was at 18:43 today)
    • @{yesterday} (HEAD as it was 24 hours ago)
    • dev@{'3 days ago'} (dev as it was 3 days ago)
    • some-branch@{'Aug 22'} (some-branch as it was last August 22)

    (Use with git-checkout, git-reset, git-show, git-diff, etc.)

  • Also useful:

    git show dev@{'Aug 22'}:path/to/some/file.txt
    

    “Print out that file, as it was on dev, as dev was on August 22”

It's all still in there.

What if you can't find it?

Don't panic! Someone with more experience can probably find it for you. If you have a local Git expert, ask them for help.

And if they are busy and can't help you immediately, the thing you're looking for won't disappear while you wait for them. The repository is append-only. Every version of everything is saved. If they could have found it today, they will still be able to find it tomorrow.

(Git will eventually throw away lost and unused snapshots, but typically not anything you have used in the last 90 days.)

What if you regret something you did?

Don't panic! It can probably put it back the way it was.

Git leaves a trail

When you make a commit, Git prints something like this:

    your-topic-branch 4e86fa23 Rework foozle subsystem

If you need to find that commit again, the SHA 4e86fa23 is in your terminal scrollback.

When you fetch a remote branch, Git prints:

       6e8fab43..bea7535b  dev        -> origin/dev

What commit was origin/dev before the fetch? At 6e8fab43. What commit is it now? bea7535b.

What if you want to look at how it was before? No problem, 6e8fab43 is still there. It's not called origin/dev any more, but the SHA is forever. You can still check it out and look at it:

    git checkout -b how-it-was-before 6e8fab43

What if you want to compare how it was with how it is now?

    git log 6e8fab43..bea7535b
    git show 6e8fab43..bea7535b
    git diff 6e8fab43..bea7535b

Git tries to leave a trail of breadcrumbs in your terminal. It's constantly printing out SHAs that you might want again.

A few things can be lost forever!

After all that talk about how Git will not lose things, I should point out the exceptions. The big exception is that if you have created files or made changes in the working tree, Git is unaware of them until you have added them with git-add. Until then, those changes are in the working tree but not in the repository, and if you discard them Git cannot help you get them back.

Good advice is Commit early and often. If you don't commit, at least add changes with git-add. Files added but not committed are saved in the repository, although they can be hard to find because they haven't been packaged into a commit with a single SHA id.

Some people automate this: they have a process that runs every few minutes and commits the current working tree to a special branch that they look at only in case of disaster.

The dangerous commands are git-reset and git-checkout

which modify the working tree, and so might wipe out changes that aren't in the repository. Git will try to warn you before doing something destructive to your working tree changes.

git-rev-parse

We saw a little while ago that Git's language for talking about commits and files is quite sophisticated:

            my-topic-branch@{'Aug 22'}:path/to/some/file.txt

Where is this language documented? Maybe not where you would expect: it's in the manual for git-rev-parse.

The git rev-parse command is less well-known than it should be. It takes a description of some object and turns it into a SHA. Why is that useful? Maybe not, but

The git-rev-parse man page explains the syntax of the descriptions Git understands.

A good habit is to skim over the manual every few months. You'll pick up something new and useful every time.

My favorite is that if you use the syntax :/foozle you get the most recent commit on the current branch whose message mentions foozle. For example:

    git show :/foozle

or

    git log :/introduce..:/remove

Coming next week (probably), a few miscellaneous matters about using Git more effectively.


[Other articles in category /prog/git] permanent link

Wed, 29 Jun 2022

Things I wish everyone knew about Git (Part I)

This is a writeup of a talk I gave in December for my previous employer. It's long so I'm publishing it in several parts:

How to approach Git; general strategy

Git has an elegant and powerful underlying model based on a few simple concepts:

  1. Commits are immutable snapshots of the repository
  2. Branches are named sequences of commits
  3. Every object has a unique ID, derived from its content

black and white white ink
diagram of the elegant geometry of the floor plan of a cathedral

Built atop this elegant system is a flaming trash pile.

literal dumpster fire

The command set wasn't always well thought out, and then over the years it grew by accretion, with new stuff piled on top of old stuff that couldn't be changed because Backward Compatibility. The commands are non-orthogonal and when two commands perform the same task they often have inconsistent options or are described with different terminology. Even when the individual commands don't conflict with one another, they are often badly-designed and confusing. The documentation is often very poorly written.

What this means

With a lot of software, you can opt to use it at a surface level without understanding it at a deeper level:

“I don't need to know how it works.
I just want to know which commands to run.”

This is often an effective strategy, but

with Git, this does not work.

You can't “just know which commands to run” because the commands do not make sense!

To work effectively with Git, you must have a model of what the repository is like, so that you can formulate questions like “is the repo on this state or that state?” and “the repo is in this state, how do I get it into that state?”. At that point you look around for a command that answers your question, and there are probably several ways to do what you want.

But if you try to understand the commands without the model, you will suffer, because the commands do not make sense.

Just a few examples:

  • git-reset does up to three different things, depending on flags

  • git-checkout is worse

  • The opposite of git-push is not git-pull, it's git-fetch

  • etc.

If you try to understand the commands without a clear idea of the model, you'll be perpetually confused about what is happening and why, and you won't know what questions to ask to find out what is going on.

READ THIS

When I first used Git it drove me almost to tears of rage and frustration. But I did get it under control. I don't love Git, but I use it every day, by choice, and I use it effectively.

The magic key that rescued me was

John Wiegley's
Git From the Bottom Up

Git From the Bottom Up explains the model. I read it. After that I wept no more. I understood what was going on. I knew how to try things out and how to interpret what I saw. Even when I got a surprise, I had a model to fit it into.

You should read it too.

That's the best advice I have. Read Wiegley's explanation. Set aside time to go over it carefully and try out his examples. It fixed me.

If I were going to tell every programmer just one thing about Git, that would be it.

The rest of this series is all downhill from here.

But if I were going to tell everyone just one more thing, it would be:

It is very hard to permanently lose work.
If something seems to have gone wrong, don't panic.
Remain calm and ask an expert.

Many more details about that are in the followup article.


[Other articles in category /prog/git] permanent link

Wed, 31 Dec 1969

Git articles on my blog

I often write about Git but the Git articles are mixed in with everything else. Someday I will rearrange everything. In the meantime I will try to keep a list of links on this page.


[Other articles in category /prog/git] permanent link