The Universe of Disco


Fri, 15 Apr 2016

How to recover lost files added to Git but not committed

A few days ago, I wrote:

If you lose something [in Git], don't panic. There's a good chance that you can find someone who will be able to hunt it down again.

I was not expecting to have a demonstration ready so soon. But today I finished working on a project, I had all the files staged in the index but not committed, and for some reason I no longer remember I chose that moment to do git reset --hard, which throws away the working tree and the staged files. I may have thought I had committed the changes. I hadn't.

If the files had only been in the working tree, there would have been nothing to do but to start over. Git does not track the working tree. But I had added the files to the index. When a file is added to the Git index, Git stores it in the repository. Later on, when the index is committed, Git creates a commit that refers to the files already stored. If you know how to look, you can find the stored files even before they are part of a commit.
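To see this for yourself, here is a minimal sketch in a throwaway repository. The file name notes.txt, its contents, and the demo identity are all made up for the demonstration; the only point is that the blob written by git add should still be readable after git reset --hard has removed the file.

```shell
# Scratch repository; every name here is illustrative, not from my project.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q
git commit -q --allow-empty -m 'initial'

echo 'precious work' > notes.txt
git add notes.txt                     # the blob is written to .git/objects here
id=$(git hash-object notes.txt)       # the ID of that blob

git reset -q --hard                   # notes.txt leaves the index and the working tree
[ -e notes.txt ] || echo 'notes.txt is gone'

recovered=$(git cat-file blob "$id")  # the blob itself is still in the repository
echo "$recovered"
```

In real life you would not have saved the ID beforehand, which is why the rest of this article is about finding it.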

(If they are part of a commit, the problem is much easier. Typically the answer is simply “use git-reflog to find the commit again and check it out”. The git-reflog command is probably the first thing anyone should learn on the path from being a Git beginner to becoming an intermediate Git user.)
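For comparison, the committed case might look something like the following sketch. The repository, file, and commit messages are hypothetical, and HEAD@{1} is the right reflog entry only because the reset was the most recent thing that moved HEAD; in a real rescue you would read the git reflog output to pick the entry.

```shell
# Scratch repository with two commits; all names are illustrative.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q
echo one > f.txt; git add f.txt; git commit -q -m 'first'
echo two > f.txt; git add f.txt; git commit -q -m 'second'

git reset -q --hard HEAD~1        # "lose" the second commit
git reflog                        # the lost commit is still listed here

lost=$(git rev-parse 'HEAD@{1}')  # where HEAD pointed just before the reset
git checkout -q "$lost"           # check it out again (detached HEAD)
cat f.txt
```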

Each file added to the Git index is stored as a “blob object”. Git stores objects in two ways. When it's fetching a lot of objects from a remote repository, it gets a big zip file with an attached table of contents; this is called a pack. Getting objects from a pack can be a pain. Fortunately, not all objects are in packs. When you just use git-add to add a file to the index, Git makes a single object, called a “loose” object. The loose object is basically the file contents with a short header attached, compressed with zlib. At some point Git will decide there are too many loose objects and assemble them into a pack.

To make a loose object from a file, the contents of the file, together with that header, are checksummed with SHA-1, and the checksum is used as the name of the object file in the repository and as an identifier for the object, exactly as Git uses the checksum of a commit as the commit's identifier. If the checksum is 0123456789abcdef0123456789abcdef01234567, the object is stored in

    .git/objects/01/23456789abcdef0123456789abcdef01234567

The pack files are elsewhere, in .git/objects/pack.
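The naming scheme can be worked through by hand. This sketch assumes a repository that uses SHA-1 object IDs (the default) and a sha1sum command on the path; the contents 'hello' are just an example. It builds the header and contents as Git would for a blob, checksums them, and splits the result into the directory and file name. Running git hash-object on a file with the same contents should print the same ID.

```shell
# Illustrative contents; any file contents would do.
content='hello'
size=$(( ${#content} + 1 ))           # +1 for the trailing newline

# Object ID = SHA-1 of the header "blob <size>\0" followed by the contents.
id=$( { printf 'blob %d\0' "$size"; printf '%s\n' "$content"; } |
        sha1sum | cut -d' ' -f1 )

# The first two hex digits name the directory, the remaining 38 the file.
dir=$(printf '%s' "$id" | cut -c1-2)
file=$(printf '%s' "$id" | cut -c3-)
path=".git/objects/$dir/$file"

echo "$id"
echo "$path"
```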

So the first thing I did was to get a list of the loose objects in the repository:

    cd .git/objects
    find ?? -type f  | perl -lpe 's#/##' > /tmp/OBJ

This produces a list of the object IDs of all the loose objects in the repository:

    00f1b6cc1dfc1c8872b6d7cd999820d1e922df4a
    0093a412d3fe23dd9acb9320156f20195040a063
    01f3a6946197d93f8edba2c49d1bb6fc291797b0
    …
    ffd505d2da2e4aac813122d8e469312fd03a3669
    fff732422ed8d82ceff4f406cdc2b12b09d81c2e

There were 500 loose objects in my repository. The goal was to find the eight I wanted.
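If it is not obvious why that find command picks up only loose objects, the following sketch may help. It builds a fake objects directory (the file names are invented, and nothing in it is a real Git object) and runs the same pipeline with sed standing in for the perl one-liner. The ?? glob matches only the two-character fan-out directories, so pack and info are never searched.

```shell
# A fake objects directory, for illustration only.
objects=$(mktemp -d)
mkdir -p "$objects/01" "$objects/ab" "$objects/pack" "$objects/info"
touch "$objects/01/23456789abcdef0123456789abcdef01234567"
touch "$objects/ab/cdef0123456789abcdef0123456789abcdef01"
touch "$objects/pack/pack-example.idx"   # should not appear in the list

# ?? matches "01" and "ab" but not "pack" or "info";
# sed removes the slash so that each line is a full 40-digit ID.
ids=$(cd "$objects" && find ?? -type f | sed 's#/##' | sort)
echo "$ids"
```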

There are several kinds of objects in a Git repository. In addition to blobs, which represent file contents, there are commit objects, which represent commits, and tree objects, which represent directories. These are usually constructed at the time the commit is done. Since my files hadn't been committed, I knew I wasn't interested in these types of objects. The command git cat-file -t will tell you what type an object is. I made a file that related each object to its type:

    for i in $(cat /tmp/OBJ); do
      echo -n "$i ";
      git type $i;
    done > /tmp/OBJTYPE

The git type command is just an alias for git cat-file -t. (Funny thing about that: I created that alias years ago when I first started using Git, thinking it would be useful, but I never used it, and just last week I was wondering why I still bothered to have it around.) The OBJTYPE file output by this loop looks like this:

    00f1b6cc1dfc1c8872b6d7cd999820d1e922df4a blob
    0093a412d3fe23dd9acb9320156f20195040a063 tree
    01f3a6946197d93f8edba2c49d1bb6fc291797b0 commit
    …
    fed6767ff7fa921601299d9a28545aa69364f87b tree
    ffd505d2da2e4aac813122d8e469312fd03a3669 tree
    fff732422ed8d82ceff4f406cdc2b12b09d81c2e blob
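This is not what I ran, but the same table can probably be produced without a loop or an alias. The sketch below uses git cat-file --batch-check, which reads object IDs on standard input and prints the ID, type, and size of each, so a single Git process handles the whole list. The scratch repository and its one file are hypothetical.

```shell
# Scratch repository holding exactly one blob, one tree, and one commit.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q
echo 'some content' > a.txt
git add a.txt
git commit -q -m 'add a.txt'   # creates the tree and commit objects

# List the loose object IDs, then ask Git for "<id> <type> <size>" per line
# and keep only the first two fields.
objtype=$(
  find .git/objects/?? -type f |
    sed -e 's#^\.git/objects/##' -e 's#/##' |
    git cat-file --batch-check |
    awk '{ print $1, $2 }'
)
echo "$objtype"
```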

Then I just grepped out the blob objects:

    grep blob /tmp/OBJTYPE | f 1 > /tmp/OBJBLOB

The f 1 command, which is not a standard tool but a small field extractor, throws away the types and keeps the object IDs (the first field of each line). At this point I had filtered the original 500 objects down to just 108 blobs.
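For anyone without an f command, awk can do the filtering and the field extraction in one step. This is a sketch run against a few of the sample lines shown earlier rather than against a real repository; matching on the second field also rules out any accidental match elsewhere on the line.

```shell
# A few lines in the same format as /tmp/OBJTYPE.
objtype=$(cat <<'EOF'
00f1b6cc1dfc1c8872b6d7cd999820d1e922df4a blob
0093a412d3fe23dd9acb9320156f20195040a063 tree
01f3a6946197d93f8edba2c49d1bb6fc291797b0 commit
fff732422ed8d82ceff4f406cdc2b12b09d81c2e blob
EOF
)

# Keep lines whose type field is exactly "blob" and print only the ID.
blobs=$(printf '%s\n' "$objtype" | awk '$2 == "blob" { print $1 }')
echo "$blobs"
```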

Now it was time to grep through the blobs to find the ones I was looking for. Fortunately, I knew that each of my lost files would contain the string org-service-currency, which was my name for the project I was working on. I couldn't grep the object files directly, because they're compressed, but the command git cat-file disgorges the contents of an object:

    for i in $(cat /tmp/OBJBLOB); do
      git cat-file blob $i |
        grep -q org-service-curr &&
          echo $i
    done > /tmp/MATCHES

The git cat-file blob $i produces the contents of the blob whose ID is in $i. The grep searches the contents for the magic string. Normally grep would print the matching lines, but the -q flag (the q is for “quiet”) disables this and tells grep that it is being used only as part of a test: it yields true if it finds the magic string, and false if not. The && is the test; it runs echo $i to print out the object ID $i only if the grep yields true because its input contained the magic string.
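The quiet test can be tried on its own, without Git. In this sketch the two sample strings are invented, and the if form is just the long-hand spelling of what && does in the loop: the body runs only when grep -q reports success.

```shell
# grep -q prints nothing; only its exit status is used.
hits=$(
  for sample in 'package org-service-currency' 'nothing to see here'; do
    if printf '%s\n' "$sample" | grep -q org-service-curr; then
      echo "match: $sample"
    fi
  done
)
echo "$hits"
```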

So this loop fills the file MATCHES with the list of IDs of the blobs that contain the magic string. This worked, and I found that there were only 18 matching blobs, so I wrote a very similar loop to extract their contents from the repository and save them in a directory:

    mkdir -p /tmp/rescue
    for i in $(cat /tmp/OBJBLOB); do
      git cat-file blob $i |
        grep -q org-service-curr &&
          git cat-file blob $i > /tmp/rescue/$i
    done

Instead of printing out the matching blob ID number, this loop passes it to git cat-file again to extract the contents into a file in /tmp/rescue.
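A different route to a similar result, which I did not take here, is to let Git find the unreachable objects itself. The sketch below is an assumption-laden demonstration in a scratch repository with two invented files. git fsck --lost-found writes the contents of each dangling blob to a file under .git/lost-found/other, named by its object ID, and an ordinary grep -l can then pick out the ones containing the magic string. Unlike the loops above, this only considers objects that nothing refers to any more, which is the situation after a git reset --hard.

```shell
# Scratch repository; file names and contents are illustrative.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q
git commit -q --allow-empty -m 'initial'

echo 'module org-service-currency' > currency.txt
echo 'unrelated scratch notes'     > scratch.txt
git add currency.txt scratch.txt
git reset -q --hard                  # both staged files vanish

# Dump every dangling blob into .git/lost-found/other/<object ID>.
git fsck --lost-found > /dev/null 2>&1

# List the dumped files that contain the magic string.
matches=$(grep -l org-service-curr .git/lost-found/other/*)
for f in $matches; do
  cat "$f"
done
```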

The rest was simple. I made 8 subdirectories under /tmp/rescue representing the 8 different files I was expecting to find. I eyeballed each of the 18 blobs, decided what each one was, and sorted them into the 8 subdirectories. Some of the subdirectories had only 1 blob, some had up to 5. I looked at the blobs in each subdirectory to decide in each case which one I wanted to keep, using diff when it wasn't obvious what the differences were between two versions of the same file. When I found one I liked, I copied it back to its correct place in the working tree.

Finally, I went back to the working tree and added and committed the rescued files.

It seemed longer, but it only took about twenty minutes. To recreate the eight files from scratch might have taken about the same amount of time, or maybe longer (although it never takes as long as I think it will), and would have been tedious.

But let's suppose that it had taken much longer, say forty minutes instead of twenty, to rescue the lost blobs from the repository. Would that extra twenty minutes have been time wasted? No! The twenty minutes spent to recreate the files from scratch is a dead loss. But the forty minutes to rescue the blobs is time spent learning something that might be useful in the future. The Git rescue might have cost twenty extra minutes, but if so it was paid back with forty minutes of additional Git expertise, and time spent to gain expertise is well spent! Spending time to gain expertise is how you become an expert!

Git is a core tool, something I use every day. For a long time I have been prepared for the day when I would try to rescue someone's lost blobs, but until now I had never done it. Now, if that day comes, I will be able to say “Oh, it's no problem, I have done this before!”

So if you lose something in Git, don't panic. There's a good chance that you can find someone who will be able to hunt it down again.

