Notes on using git-replace to get rid of giant objects
A couple of years ago someone accidentally committed a 350 megabyte
file to our Git repository. Now it's baked in. I wanted to get rid
of it. I thought that I might be able to work out a partial but
lightweight solution using git-replace.
Summary: It didn't work.
In 2016 a programmer committed a 350 megabyte file to my employer's
repo, then in the following commit they removed it again. Of course
it's still in there, because someone might check out the one commit
where it existed. Everyone who clones the repo gets a copy of the big
file. Every copy of the repo takes up an extra 350 megabytes on disk.
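Here is a minimal sketch of how one might confirm that a removed file is still carried in the history. It runs in a throwaway repository; the file name big.bin and the 1 MB size are stand-ins I made up for the real 350 MB file.

```shell
set -e
repo=$(mktemp -d)          # throwaway repository for the demonstration
cd "$repo"
git init -q .
# Helper that supplies a committer identity (hypothetical values).
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# Commit a "large" file, then remove it in the following commit.
head -c 1048576 /dev/zero > big.bin
g add big.bin
g commit -q -m "add big file by mistake"
g rm -q big.bin
g commit -q -m "remove big file"

# List every blob reachable from any ref, sorted by size, and keep the largest.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
  awk '$1 == "blob"' | sort -k2 -n | tail -n 1)
echo "$largest"
```

Even though the file is gone from the latest commit, the blob is still listed, because the earlier commit's tree refers to it.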
The usual way to fix this is onerous:
1. Use git-filter-branch to rebuild all the repository history after
   the bad commit.
2. Update all the existing refs to point to the analogous rebuilt
   objects.
3. Get everyone in the company to update all the refs in their local
   copies of the repo.
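The first of those steps, plus the local cleanup it needs, can be sketched roughly as follows. This is only an illustration on a throwaway repository, not the procedure anyone actually ran: big.bin and README are made-up names, and it does nothing about the hard part, which is getting everyone else to move to the rewritten history.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's deprecation notice
repo=$(mktemp -d)
cd "$repo"
git init -q .
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# Build a small history: good commit, bad commit, removal, later work.
echo keep > README
g add README
g commit -q -m "initial"
head -c 1048576 /dev/zero > big.bin
g add big.bin
g commit -q -m "add big file by mistake"
big=$(git rev-parse HEAD:big.bin)
g rm -q big.bin
g commit -q -m "remove big file"
echo more >> README
g commit -q -am "later work"
old_head=$(git rev-parse HEAD)

# Rewrite every commit, dropping big.bin from the index wherever it appears.
g filter-branch -f --index-filter \
  'git rm -q --cached --ignore-unmatch big.bin' -- --all >/dev/null 2>&1
new_head=$(git rev-parse HEAD)

# filter-branch keeps backups under refs/original/. The blob only goes away
# once those backups and the reflog entries are gone and gc has pruned.
git for-each-ref --format='%(refname)' refs/original/ |
  xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$big" 2>/dev/null; then gone=no; else gone=yes; fi
echo "old=$old_head new=$new_head blob_gone=$gone"
```

Every commit from the bad one onward gets a new ID, which is exactly why all the refs, and everyone's clones, have to be updated afterward.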
I thought I'd tinker around with
git-replace to see if there was
some way around this, maybe something that someone could do locally on
their own repo without requiring everyone else to go along with it.
The git-replace command annotates the Git repository to say that
whenever object A is wanted, object B should be used instead. Say
that the 350 MB file has an ID of
ffff9999ffff9999ffff9999ffff9999ffff9999. I can create a small file
that contains the text:

    This is a replacement object. It replaces a very large file
    that was committed by mistake. To see the commit as it really
    was, use

        git --no-replace-objects show 183a5c7e90b2d4f6183a5c7e90b2d4f6183a5c7e
        git --no-replace-objects checkout 183a5c7e90b2d4f6183a5c7e90b2d4f6183a5c7e

    or similarly. To see the file itself, use

        git --no-replace-objects show ffff9999ffff9999ffff9999ffff9999ffff9999

I can turn this small file into an object with git-add; say the new
small object has ID 1111333311113333111133331111333311113333. Then I
run:

    git replace ffff9999ffff9999ffff9999ffff9999ffff9999 1111333311113333111133331111333311113333

Thenceforward, any Git command that tries to access the original
ffff9999 will silently behave as if it were 11113333
instead. For example,
git show 183a5c7e will show the diff between
that commit and the previous, as if the user had committed my small
file back in 2016 instead of their large one. And checking out that
commit will check out the small file instead of the large one.
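The replacement mechanism can be tried out safely on a throwaway repository. The sketch below makes some assumptions of its own: big.bin is a made-up file name, the 1 MB file stands in for the 350 MB one, and the small blob is written with git hash-object -w rather than git-add, which stores the object without staging anything.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# The mistaken commit, containing the big file.
head -c 1048576 /dev/zero > big.bin
g add big.bin
g commit -q -m "add big file by mistake"
bad_commit=$(git rev-parse HEAD)
big=$(git rev-parse HEAD:big.bin)

# Write the small stand-in blob into the object database.
small=$(printf 'This is a replacement object.\n' | git hash-object -w --stdin)

# From now on, requests for $big are answered with $small.
git replace "$big" "$small"

# Reading the file through the bad commit yields the small text...
seen=$(git cat-file -p "$bad_commit:big.bin")
# ...while --no-replace-objects still reaches the original blob.
real_size=$(git --no-replace-objects cat-file -s "$big")
echo "seen: $seen"
echo "real size: $real_size"
```

`git replace -l` lists the objects that currently have replacements, and `git replace -d` removes one again.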
So far this doesn't help much. The checkout is smaller, but nobody
was likely to have that commit checked out anyway. The large file is
still in the repository, and clones and transfers still clone and
transfer it.
The first thing I tried was a wan hope: will git gc discard the
replaced object? No, of course not. The replacement ref
counts as a reference to it, and it will never be garbage-collected.
If it had been, you would no longer be able to examine it with the
--no-replace-objects commands. So much for following the rules!
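A quick way to check this on a throwaway repository is sketched below, again with the made-up big.bin standing in for the real file. One caveat about the setup: here the big blob is also still referenced by the bad commit's tree, and the git-replace documentation says that reachability traversal does not apply replacements, so the sketch only shows that the blob survives, not which reference keeps it alive.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

head -c 1048576 /dev/zero > big.bin
g add big.bin
g commit -q -m "add big file by mistake"
big=$(git rev-parse HEAD:big.bin)
small=$(printf 'This is a replacement object.\n' | git hash-object -w --stdin)
git replace "$big" "$small"

# Collect garbage as aggressively as gc allows.
git gc -q --prune=now

# The original blob is still present after gc.
if git --no-replace-objects cat-file -e "$big"; then
  still_there=yes
else
  still_there=no
fi
# The replacement is recorded as a ref named after the original object.
target=$(git rev-parse "refs/replace/$big")
echo "original still present: $still_there"
echo "refs/replace/$big -> $target"
```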
Now comes the hacking part: I am going to destroy the actual object.
Say, for example, what if I do this:
cp /dev/null .git/objects/ff/ff9999ffff9999ffff9999ffff9999ffff9999
Now the repository is smaller! And maybe Git won't notice, as long as
I do not use --no-replace-objects?
Indeed, much normal Git usage doesn't notice. For example, I can make
new commits with no trouble, and of course any other operation that
doesn't go back as far as 2016 doesn't notice the change. And
git-log works just fine even past the bad commit; it only looks at
the replacement object and never notices that the bad object is
missing.
But some things become wonky. You get an error message when you clone
the repo because an object is missing. The replacement refs are local
to the repo, and don't get cloned, so clone doesn't know to use the
replacement object anyway. In the clone, you can use
git replace -f
.... to reinstate the replacement, and then all is well unless
something tries to look at the missing object. So maybe a user could
apply this hack on their own local copy if they are willing to
tolerate a little wonkiness…?
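That replacement refs do not travel with a clone can be seen on an intact throwaway repository, as in the sketch below. It uses a different technique from the git replace -f command mentioned above: it fetches refs/replace/* from the origin with an explicit refspec. The repository paths and the file name big.bin are made up, and I have not tried this against a repository whose object has been destroyed.

```shell
set -e
src=$(mktemp -d)
dst=$(mktemp -d)
cd "$src"
git init -q .
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

head -c 1048576 /dev/zero > big.bin
g add big.bin
g commit -q -m "add big file by mistake"
big=$(git rev-parse HEAD:big.bin)
small=$(printf 'This is a replacement object.\n' | git hash-object -w --stdin)
git replace "$big" "$small"

# A plain clone copies branches and tags, but not refs/replace/.
git clone -q "$src" "$dst/clone"
cd "$dst/clone"
before=$(git replace -l | wc -l | tr -d ' ')

# Ask for the replacement refs explicitly.
git fetch -q origin 'refs/replace/*:refs/replace/*'
after=$(git replace -l | wc -l | tr -d ' ')
echo "replacements before fetch: $before, after fetch: $after"
```

So each user would have to opt in to the replacement, which fits the idea of a purely local workaround.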
No. Unfortunately, there is a show-stopper:
git-gc no longer
works in either the parent repo or in the clone:
fatal: unable to read ffff9999ffff9999ffff9999ffff9999ffff9999
error: failed to run repack
and it doesn't create the pack files. It dies, and leaves behind a
.git/objects/pack/tmp_pack_XxXxXx that has to be cleaned up by hand.
I think I've reached the end of this road. Oh well, it was worth a look.
[ Addendum 20181009: A lot of people have unfortunately missed the
point of this article, and have suggested that I use
reposurgeon. I have a small
problem and a large problem. The small problem is how to remove some
files from the repository. This is straightforward, and the tools
mentioned will help with it. But because of the way Git works, the
result is effectively a new repository. The tools will not help with
the much larger problem I would have then: How to get 350 developers
to migrate to the new repository at the same time. The approach I
investigated in this article was an attempt to work around this
second, much larger problem. ]