git log --follow enthusiastically tracks empty files
This bug I just found in git log --follow is impressively massive.
Until I worked out what was going on I was really perplexed, and even
considered that my repository might have become corrupted.
I knew I'd written a draft of a blog article about the Watchmen
movie, and I went to find out how long it had been sitting around:
% git log -- movie/Watchmen.blog
commit 934961428feff98fa3cb085e04a0d594b083f597
Author: Mark Dominus <mjd@plover.com>
Date: Fri Feb 3 16:32:25 2012 -0500
    link to Mad Watchmen parody
    also recategorize under movie instead of under book
The log stopped there, and the commit message says clearly that the
article was moved from elsewhere, so I used git-log --follow --stat
to find out how old it really was. The result was spectacularly
weird. It began in the right place:
commit 934961428feff98fa3cb085e04a0d594b083f597
Author: Mark Dominus <mjd@plover.com>
Date: Fri Feb 3 16:32:25 2012 -0500
    link to Mad Watchmen parody
    also recategorize under movie instead of under book
 {book => movie}/Watchmen.blog | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
Okay, it was moved, with slight modifications, from book to movie,
as the message says.
commit 5bf6e946f66e290fc6abf044aa26b9f7cfaaedc4
Author: Mark Jason Dominus (陶敏修) <mjd@plover.com>
Date: Tue Jan 17 20:36:27 2012 -0500
    finally started article about Watchment movie
 book/Watchmen.blog | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)
Okay, the previous month I added some text to it.
Then I skipped to the bottom to see when it first appeared, and the
bottom was completely weird, mentioning a series of completely
unrelated articles:
commit e6779efdc9510374510705b4beb0b4c4b5853a93
Author: mjd <mjd>
Date: Thu May 4 15:21:57 2006 +0000
    First chunk of linear regression article
 prog/maxims/paste-code.notyet => math/linear-regression.notyet | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
commit 9d9038a3358a82616a159493c6bdc91dd03d03f4
Author: mjd <mjd>
Date: Tue May 2 14:16:24 2006 +0000
    maxims directory reorganization
 tech/mercury.notyet => prog/maxims/paste-code.notyet | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
commit 1273c618ed6efa4df75ce97255204251678d04d3
Author: mjd <mjd>
Date: Tue Apr 4 15:32:00 2006 +0000
    Thingy about propagation delay and mercury delay lines
 tech/mercury.notyet | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
(The complete output is available for your perusal.)
The log is showing unrelated files being moved to totally unrelated
places. And also, the log messages do not seem to match up. “First
chunk of linear regression article” should be on some commit that adds
text to math/linear-regression.notyet or
math/linear-regression.blog. But according to the output above,
that file is still empty after that commit. Maybe I added the text in
a later commit? “Maxims directory reorganization” suggests that I
reorganized the contents of prog/maxims, but the stat says
otherwise.
My first thought was: when I imported my blog from CVS to Git, many
years ago, I made a series of mistakes, and mismatched the log
messages to the commits, or worse, and I might have to do it over
again. Despair!
But no, it turns out that git-log is just intensely confused.
Let's look at one of the puzzling commits. Here it is as reported by
git log --follow --stat:
commit 9d9038a3358a82616a159493c6bdc91dd03d03f4
Author: mjd <mjd>
Date: Tue May 2 14:16:24 2006 +0000
    maxims directory reorganization
 tech/mercury.notyet => prog/maxims/paste-code.notyet | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
But if I do git show --stat 9d9038a3, I get a very different
picture, one that makes sense:
% git show --stat 9d9038a3
commit 9d9038a3358a82616a159493c6bdc91dd03d03f4
Author: mjd <mjd>
Date: Tue May 2 14:16:24 2006 +0000
    maxims directory reorganization
 prog/maxims.notyet | 226 -------------------------------------------
 prog/maxims/maxims.notyet | 95 ++++++++++++++++++
 prog/maxims/paste-code.blog | 134 +++++++++++++++++++++++++
 prog/maxims/paste-code.notyet | 0
 4 files changed, 229 insertions(+), 226 deletions(-)
This is easy to understand. The commit message was correct: the
maxims are being reorganized. But git-log --stat, in conjunction
with --follow , has produced a stat that has only a tenuous
connection with reality.
I believe what happened here is this: In 2012 I “finally started
article”. But I didn't create the file at that time. Rather, I
had created the file in 2009 with the intention of putting something
into it later:
% git show --stat 5c8c5e66
commit 5c8c5e66bcd1b5485576348cb5bbca20c37bd330
Author: mjd <mjd>
Date: Tue Jun 23 18:42:31 2009 +0000
    empty file
 book/Watchmen.blog | 0
 book/Watchmen.notyet | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
This commit does appear in the git-log --follow output, but it
looks like this:
commit 5c8c5e66bcd1b5485576348cb5bbca20c37bd330
Author: mjd <mjd>
Date: Tue Jun 23 18:42:31 2009 +0000
    empty file
 wikipedia/mega.notyet => book/Watchmen.blog | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
It appears that Git, having detected that book/Watchmen.blog was
moved to movie/Watchmen.blog in February 2012, is now following
book/Watchmen.blog backward in time. It sees that in January 2012
the file was modified, and was formerly empty, and after that it sees
that in June 2009 the empty file was created. At that time there was
another empty file, wikipedia/mega.notyet. And git-log decides that the
empty file book/Watchmen.blog was copied from the other empty
file.
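Two empty files look like perfect copies of one another because Git identifies file contents by a hash, and every empty file has the same contents. Here is a minimal sketch of that, assuming a SHA-1 repository and that sha1sum from GNU coreutils is available; the variable name empty_blob_id is made up for the example:

```shell
# Git names a blob by hashing the header "blob <size>\0" followed by the
# file contents. An empty file has size 0 and no contents, so every empty
# file in a SHA-1 repository gets the same object ID.
empty_blob_id=$(printf 'blob 0\0' | sha1sum | cut -d' ' -f1)
echo "$empty_blob_id"
# prints e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

With Git installed, git hash-object --stdin </dev/null should print the same ID.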
At this point it has gone completely off the rails, because it is now
following the unrelated empty file wikipedia/mega.notyet. It then
makes more mistakes of the same type. At one point there was an empty
wikipedia/mega.blog file, but commit ff0d744d5 added some text to it
and also created an empty wikipedia/mega.notyet alongside it. The
git-log --follow command has interpreted this as the empty
wikipedia/mega.blog being moved to wikipedia/mega.notyet and a
new wikipedia/mega.blog being created alongside it. It is now following
wikipedia/mega.blog.
Commit ff398402 created the empty file wikipedia/mega.blog fresh,
but git-log --follow interprets the commit as copying
wikipedia/mega.blog from the already-existing empty file
tech/mercury.notyet. Commit 1273c618 created tech/mercury.notyet,
and after that the trail comes to an end, because that was shortly
after I started keeping my blog in revision control; there were no
empty files before that. I suppose that attempting to follow the
history of any file that started out empty is going to lead to the
same place, tech/mercury.notyet.
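To see which files could be mistaken for one another at a given point in history, it may help to list the zero-length files in a commit. The following is only a sketch; the function name list_empty_files is hypothetical, and it assumes file names contain no tabs or newlines:

```shell
# List every zero-length file in a revision (default: HEAD).
# "git ls-tree -r -l" prints one line per file in the form
#   <mode> <type> <object> <size><TAB><path>
list_empty_files() {
    git ls-tree -r -l "${1:-HEAD}" |
    awk -F'\t' '{ split($1, meta, " "); if (meta[4] == "0") print $2 }'
}
```

For example, list_empty_files 5c8c5e66 would show the empty files that existed when book/Watchmen.blog was created.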
On a different machine with a different copy of the repository, the
git-log --follow on this file threads its way through ten
irrelevant files before winding up at tech/mercury.notyet.
There is a --find-renames=... flag to tell Git how conservative to
be when guessing that a file might have been renamed and modified at
the same time. The default is 50%. But even turning it up to 100%
doesn't help with this problem, because in this case the false
positives are files that are actually identical.
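A small throwaway repository is enough to try this out. The sketch below creates two unrelated empty files in separate commits and then follows the second one while demanding exact matches only. All file names and commit messages are made up for the demonstration:

```shell
# Build a scratch repository with two unrelated empty files,
# each created in its own commit.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.com"

: > first.notyet
git add first.notyet
git commit -q -m "create first empty file"

: > second.blog
git add second.blog
git commit -q -m "create second, unrelated empty file"

# -M100% is the short form of --find-renames=100%: exact matches only.
# Empty files are byte-for-byte identical, so this cannot rule them out.
git log --follow -M100% --stat --format='commit %h %s' -- second.blog
```

On an affected version the second commit may be reported as a copy from first.notyet rather than as a plain creation; the exact output depends on the Git version.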
As far as I can tell there is no option to set an absolute threshold
on when two files are considered the same by --follow. Perhaps it
would be enough to tell Git that it should simply not try to follow
files whose size is less than n bytes, for some small n, perhaps
even n=1.
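Until Git has such an option, the idea can be approximated from outside with a wrapper. This is a rough sketch and not a Git feature: the function name follow_until_small is hypothetical, it assumes paths without tabs or unusual characters, and it stops at the first commit in which the followed file is smaller than the minimum size:

```shell
# Print "<commit> <path>" for each commit that git log --follow reports,
# stopping after the first commit where the followed file is smaller
# than min bytes (default 1, that is, empty).
follow_until_small() {
    path=$1
    min=${2:-1}
    git log --follow --name-status --format='commit %H' -- "$path" |
    while IFS="$(printf '\t')" read -r status a b; do
        case $status in
            commit\ *) commit=${status#commit } ;;
            '') ;;                        # blank separator line
            *)  cur=${b:-$a}              # renames and copies list the new path last
                echo "$commit $cur"
                size=$(git cat-file -s "$commit:$cur") || break
                [ "$size" -ge "$min" ] || break ;;
        esac
    done
}
```

Called as follow_until_small movie/Watchmen.blog, this should still report the commit that created the empty file, but stop there instead of wandering off into unrelated empty files.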
The part I don't fully understand is how git-log --follow is
generating its stat outputs. Certainly it's not doing it in the
same way that git show is. Instead it is trying to do something
clever, to highlight the copies and renames it thinks it has found,
and in this case it goes badly wrong.
The problem appears in Git 1.7.11, 2.7.4, and 2.13.0.
[ Addendum 20180912: A followup about my work on a fix for
this. ]