The Universe of Disco


Mon, 08 Oct 2018

Notes on using git-replace to get rid of giant objects

A couple of years ago someone accidentally committed a 350 megabyte file to our Git repository. Now it's baked in. I wanted to get rid of it. I thought that I might be able to work out a partial but lightweight solution using git-replace.

Summary: It didn't work.

Details

In 2016 a programmer committed a 350 megabyte file to my employer's repo, then in the following commit they removed it again. Of course it's still in there, because someone might check out the one commit where it existed. Everyone who clones the repo gets a copy of the big file. Every copy of the repo takes up an extra 350 megabytes on disk.
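
One way to find such an object and its ID is to list every blob reachable from any ref, sorted by size. This is only a sketch in a throwaway repository: the file names and the 1 MB stand-in for the 350 MB file are made up for illustration.

```shell
#!/bin/sh
# Sketch: list the largest blobs reachable from any ref, biggest first.
set -eu

# Throwaway demo repository; in practice, run the pipeline in your own repo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name "Demo User"

# Commit a "large" file (1 MB stands in for 350 MB), then remove it again.
dd if=/dev/zero of=big.bin bs=1024 count=1024 2>/dev/null
echo "hello" > small.txt
git add big.bin small.txt
git commit -q -m "Add big file by mistake"
git rm -q big.bin
git commit -q -m "Remove big file"

# Every object reachable from any ref, annotated with type and size,
# filtered down to blobs and sorted by size, largest first.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob"' |
  sort -k3,3nr |
  head -n 5)
printf '%s\n' "$largest"
```

The removed file still shows up at the top of the list, because the commit that added it is still part of the history.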

The usual way to fix this is onerous:

  1. Use git-filter-branch to rebuild all the repository history after the bad commit.

  2. Update all the existing refs to point to the analogous rebuilt objects.

  3. Get everyone in the company to update all the refs in their local copies of the repo.
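
Steps 1 and 2 might look roughly like the following sketch, run in a throwaway repository. The file name big.bin and the demo commits are hypothetical. Step 3 is the part no command can do for you.

```shell
#!/bin/sh
# Sketch: rewrite history with git-filter-branch so a file never existed.
set -eu
# Newer Git versions print a deprecation notice and pause; this silences it.
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name "Demo User"

# Recreate the situation: add a big file, remove it, then keep working.
echo "hello" > small.txt
dd if=/dev/zero of=big.bin bs=1024 count=1024 2>/dev/null
git add small.txt big.bin
git commit -q -m "Add big file by mistake"
git rm -q big.bin
git commit -q -m "Remove big file"
echo "more" >> small.txt
git commit -q -a -m "Later work"

before=$(git rev-parse HEAD)

# Rebuild every commit on every ref, dropping big.bin from each index.
# The rewritten refs are updated in place; originals are kept under refs/original/.
git filter-branch -f --index-filter \
  'git rm -q --cached --ignore-unmatch big.bin' -- --all >/dev/null 2>&1

after=$(git rev-parse HEAD)
echo "HEAD before: $before"
echo "HEAD after:  $after"
```

Every commit from the bad one onward gets a new ID, which is exactly why everyone else's clones then have to be updated too.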

I thought I'd tinker around with git-replace to see if there was some way around this, maybe something that someone could do locally on their own repo without requiring everyone else to go along with it.

The git-replace command annotates the Git repository to say that whenever object A is wanted, object B should be used instead. Say that the 350 MB file has an ID of ffff9999ffff9999ffff9999ffff9999ffff9999. I can create a small file that says

 This is a replacement object.  It replaces a very large file
 that was committed by mistake.  To see the commit as it really
 was, use

      git --no-replace-objects show 183a5c7e90b2d4f6183a5c7e90b2d4f6183a5c7e
      git --no-replace-objects checkout 183a5c7e90b2d4f6183a5c7e90b2d4f6183a5c7e

 or similarly.  To see the file itself, use

      git --no-replace-objects show ffff9999ffff9999ffff9999ffff9999ffff9999

I can turn this small file into an object with git-add; say the new small object has ID 1111333311113333111133331111333311113333. I then run:

git replace ffff9999ffff9999ffff9999ffff9999ffff9999 1111333311113333111133331111333311113333

This creates .git/refs/replace/ffff9999ffff9999ffff9999ffff9999ffff9999, which contains the text 1111333311113333111133331111333311113333. Thenceforward, any Git command that tries to access the original object ffff9999 will silently behave as if it were 11113333 instead. For example, git show 183a5c7e will show the diff between that commit and the previous one, as if the user had committed my small file back in 2016 instead of their large one. And checking out that commit will check out the small file instead of the large one.
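
Here is a small sketch of the whole replacement dance in a throwaway repository. It uses git hash-object -w rather than git-add to create the small object, which is a swapped-in shortcut; the file name and sizes are made up.

```shell
#!/bin/sh
# Sketch: replace a large blob with a small one and look at both views.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name "Demo User"

# The "large" file (1 MB stands in for 350 MB).
dd if=/dev/zero of=big.bin bs=1024 count=1024 2>/dev/null
git add big.bin
git commit -q -m "Add big file by mistake"
big_id=$(git rev-parse HEAD:big.bin)

# Write the small replacement blob straight into the object database.
small_id=$(printf 'This is a replacement object.\n' | git hash-object -w --stdin)

# From now on, requests for $big_id are answered with $small_id.
git replace "$big_id" "$small_id"

# Default view: the replacement's content.
shown=$(git cat-file -p "$big_id")
# Unreplaced view: the original is still there at full size.
original_size=$(git --no-replace-objects cat-file -s "$big_id")

echo "with replacement:    $shown"
echo "without replacement: $original_size bytes"
```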

So far this doesn't help much. The checkout is smaller, but nobody was likely to have that commit checked out anyway. The large file is still in the repository, and clones and transfers still clone and transfer it.

The first thing I tried was a wan hope: will git gc discard the replaced object? No, of course not. The original object is still reachable from the tree of the 2016 commit, and reachability traversals such as the one git gc performs ignore replacements, so it will never be garbage-collected. If it had been, you would no longer be able to examine it with the --no-replace-objects commands. So much for following the rules!

Now comes the hacking part: I am going to destroy the actual object. For example, what if I do this:

cp /dev/null .git/objects/ff/ff9999ffff9999ffff9999ffff9999ffff9999

Now the repository is smaller! And maybe Git won't notice, as long as I do not use --no-replace-objects?

Indeed, much normal Git usage doesn't notice. For example, I can make new commits with no trouble, and of course any other operation that doesn't go back as far as 2016 doesn't notice the change. And git-log works just fine even past the bad commit; it only looks at the replacement object and never notices that the bad object is missing.

But some things become wonky. You get an error message when you clone the repo because an object is missing. The replacement refs are local to the repo, and don't get cloned, so clone doesn't know to use the replacement object anyway. In the clone, you can use git replace -f .... to reinstate the replacement, and then all is well unless something tries to look at the missing object. So maybe a user could apply this hack on their own local copy if they are willing to tolerate a little wonkiness…?
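
For an intact repository (not the damaged one above), the fact that clones ignore refs/replace/ can be seen, and worked around, with an explicit refspec. This is a sketch with made-up repository and file names; fetching the refs is an alternative to recreating them by hand with git replace -f.

```shell
#!/bin/sh
# Sketch: replace refs are not cloned by default, but can be fetched explicitly.
set -eu

work=$(mktemp -d)
git init -q "$work/origin"
cd "$work/origin"
git config user.email demo@example.com
git config user.name "Demo User"

# A tiny stand-in for the big file, plus a replacement for it.
echo "pretend this is huge" > big.bin
git add big.bin
git commit -q -m "Add big file by mistake"
big_id=$(git rev-parse HEAD:big.bin)
small_id=$(printf 'This is a replacement object.\n' | git hash-object -w --stdin)
git replace "$big_id" "$small_id"

# A plain clone only brings over branches and tags, not refs/replace/.
cd "$work"
git clone -q "$work/origin" clone
refs_after_clone=$(git -C clone replace -l)

# Ask for the replace refs by name to bring them over.
git -C clone fetch -q origin 'refs/replace/*:refs/replace/*'
refs_after_fetch=$(git -C clone replace -l)

echo "replace refs after clone: [$refs_after_clone]"
echo "replace refs after fetch: [$refs_after_fetch]"
```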

No. Unfortunately, there is a show-stopper: git-gc no longer works in either the parent repo or in the clone:

fatal: unable to read ffff9999ffff9999ffff9999ffff9999ffff9999
error: failed to run repack

and it doesn't create the pack files. It dies, and leaves behind a .git/objects/pack/tmp_pack_XxXxXx that has to be cleaned up by hand.

I think I've reached the end of this road. Oh well, it was worth a look.

[ Addendum 20181009: A lot of people have unfortunately missed the point of this article, and have suggested that I use BFG or reposurgeon. I have a small problem and a large problem. The small problem is how to remove some files from the repository. This is straightforward, and the tools mentioned will help with it. But because of the way Git works, the result is effectively a new repository. The tools will not help with the much larger problem I would have then: How to get 350 developers to migrate to the new repository at the same time. The approach I investigated in this article was an attempt to work around this second, much larger problem. ]

