The Universe of Disco


Mon, 10 Sep 2018

git log --follow enthusiastically tracks empty files

This bug I just found in git log --follow is impressively massive. Until I worked out what was going on I was really perplexed, and even considered that my repository might have become corrupted.

I knew I'd written a draft of a blog article about the Watchmen movie, and I went to find out how long it had been sitting around:

    % git log -- movie/Watchmen.blog
    commit 934961428feff98fa3cb085e04a0d594b083f597
    Author: Mark Dominus <mjd@plover.com>
    Date:   Fri Feb 3 16:32:25 2012 -0500

        link to Mad Watchmen parody
        also recategorize under movie instead of under book

The log stopped there, and the commit message says clearly that the article was moved from elsewhere, so I used git-log --follow --stat to find out how old it really was. The result was spectacularly weird. It began in the right place:

    commit 934961428feff98fa3cb085e04a0d594b083f597
    Author: Mark Dominus <mjd@plover.com>
    Date:   Fri Feb 3 16:32:25 2012 -0500

        link to Mad Watchmen parody
        also recategorize under movie instead of under book

     {book => movie}/Watchmen.blog | 8 +++++++-
     1 file changed, 7 insertions(+), 1 deletion(-)

Okay, it was moved, with slight modifications, from book to movie, as the message says.

    commit 5bf6e946f66e290fc6abf044aa26b9f7cfaaedc4
    Author: Mark Jason Dominus (陶敏修) <mjd@plover.com>
    Date:   Tue Jan 17 20:36:27 2012 -0500

        finally started article about Watchment movie

     book/Watchmen.blog | 40 ++++++++++++++++++++++++++++++++++++++++
     1 file changed, 40 insertions(+)

Okay, the previous month I added some text to it.

Then I skipped to the bottom to see when it first appeared, and the bottom was bizarre, mentioning a series of completely unrelated articles:

    commit e6779efdc9510374510705b4beb0b4c4b5853a93
    Author: mjd <mjd>
    Date:   Thu May 4 15:21:57 2006 +0000

        First chunk of linear regression article

     prog/maxims/paste-code.notyet => math/linear-regression.notyet | 0
     1 file changed, 0 insertions(+), 0 deletions(-)

    commit 9d9038a3358a82616a159493c6bdc91dd03d03f4
    Author: mjd <mjd>
    Date:   Tue May 2 14:16:24 2006 +0000

        maxims directory reorganization

     tech/mercury.notyet => prog/maxims/paste-code.notyet | 0
     1 file changed, 0 insertions(+), 0 deletions(-)

    commit 1273c618ed6efa4df75ce97255204251678d04d3
    Author: mjd <mjd>
    Date:   Tue Apr 4 15:32:00 2006 +0000

        Thingy about propagation delay and mercury delay lines

     tech/mercury.notyet | 0
     1 file changed, 0 insertions(+), 0 deletions(-)

(The complete output is available for your perusal.)

The log is showing unrelated files being moved to totally unrelated places. And also, the log messages do not seem to match up. “First chunk of linear regression article” should be on some commit that adds text to math/linear-regression.notyet or math/linear-regression.blog. But according to the output above, that file is still empty after that commit. Maybe I added the text in a later commit? “Maxims directory reorganization” suggests that I reorganized the contents of prog/maxims, but the stat says otherwise.

My first thought was: when I imported my blog from CVS to Git, many years ago, I made a series of mistakes, and mismatched the log messages to the commits, or worse, and I might have to do it over again. Despair!

But no, it turns out that git-log is just intensely confused. Let's look at one of the puzzling commits. Here it is as reported by git log --follow --stat:

    commit 9d9038a3358a82616a159493c6bdc91dd03d03f4
    Author: mjd <mjd>
    Date:   Tue May 2 14:16:24 2006 +0000

        maxims directory reorganization

     tech/mercury.notyet => prog/maxims/paste-code.notyet | 0
     1 file changed, 0 insertions(+), 0 deletions(-)

But if I do git show --stat 9d9038a3, I get a very different picture, one that makes sense:

    % git show --stat 9d9038a3
    commit 9d9038a3358a82616a159493c6bdc91dd03d03f4
    Author: mjd <mjd>
    Date:   Tue May 2 14:16:24 2006 +0000

        maxims directory reorganization

     prog/maxims.notyet            | 226 -------------------------------------------
     prog/maxims/maxims.notyet     |  95 ++++++++++++++++++
     prog/maxims/paste-code.blog   | 134 +++++++++++++++++++++++++
     prog/maxims/paste-code.notyet |   0
     4 files changed, 229 insertions(+), 226 deletions(-)

This is easy to understand. The commit message was correct: the maxims are being reorganized. But git-log --stat, in conjunction with --follow, has produced a stat that has only a tenuous connection with reality.

I believe what happened here is this: In 2012 I “finally started article”. But I didn't create the file at that time. Rather, I had created the file in 2009 with the intention of putting something into it later:

    % git show --stat 5c8c5e66
    commit 5c8c5e66bcd1b5485576348cb5bbca20c37bd330
    Author: mjd <mjd>
    Date:   Tue Jun 23 18:42:31 2009 +0000

        empty file

     book/Watchmen.blog   | 0
     book/Watchmen.notyet | 0
     2 files changed, 0 insertions(+), 0 deletions(-)

This commit does appear in the git-log --follow output, but it looks like this:

    commit 5c8c5e66bcd1b5485576348cb5bbca20c37bd330
    Author: mjd <mjd>
    Date:   Tue Jun 23 18:42:31 2009 +0000

        empty file

     wikipedia/mega.notyet => book/Watchmen.blog | 0
     1 file changed, 0 insertions(+), 0 deletions(-)

It appears that Git, having detected that book/Watchmen.blog was moved to movie/Watchmen.blog in February 2012, is now following book/Watchmen.blog backward in time. It sees that in January 2012 the file was modified, and was formerly empty, and after that it sees that in June 2009 the empty file was created. At that time there was another empty file, wikipedia/mega.notyet. And git-log decides that the empty file book/Watchmen.blog was copied from the other empty file.

At this point it has gone completely off the rails, because it is now following the unrelated empty file wikipedia/mega.notyet. It then makes more mistakes of the same type. At one point there was an empty wikipedia/mega.blog file, but commit ff0d744d5 added some text to it and also created an empty wikipedia/mega.notyet alongside it. The git-log --follow command has interpreted this as the empty wikipedia/mega.blog being moved to wikipedia/mega.notyet and a new wikipedia/mega.blog being created alongside it. It is now following wikipedia/mega.blog.

Commit ff398402 created the empty file wikipedia/mega.blog fresh, but git-log --follow interprets the commit as copying wikipedia/mega.blog from the already-existing empty file tech/mercury.notyet. Commit 1273c618 created tech/mercury.notyet, and after that the trail comes to an end, because that was shortly after I started keeping my blog in revision control; there were no empty files before that. I suppose that attempting to follow the history of any file that started out empty is going to lead to the same place, tech/mercury.notyet. On a different machine with a different copy of the repository, the git-log --follow on this file threads its way through ten irrelevant files before winding up at tech/mercury.notyet.
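The chain of mistaken identities can be imitated with a toy model. The Python sketch below is not Git's rename detection. It is a deliberately simplified stand-in that assumes one rule: when the followed file first appears in a commit, any file in the parent with byte-identical content is taken to be its origin. The function name, the snapshot representation, and the file names a.notyet, b.notyet and c.blog are all hypothetical.

```python
from typing import Dict, List, Optional

Snapshot = Dict[str, bytes]  # path -> file content at one commit (toy representation)

def follow(history: List[Snapshot], path: str) -> List[str]:
    """Walk history from newest to oldest, returning every path followed.

    Toy rule (an assumption, not Git's algorithm): when the followed file
    does not exist in the parent snapshot, treat any parent file with
    byte-identical content as its origin.
    """
    trail = [path]
    for i in range(len(history) - 1, 0, -1):
        child, parent = history[i], history[i - 1]
        if path not in child or path in parent:
            continue  # file untouched or merely modified here: keep following
        # The file first appears in this commit; look for an identical source.
        origin: Optional[str] = next(
            (p for p in sorted(parent) if parent[p] == child[path]), None
        )
        if origin is None:
            break  # genuinely new file: the trail ends
        path = origin
        trail.append(path)
    return trail

# Hypothetical history, oldest first. Three unrelated files each start out empty.
history: List[Snapshot] = [
    {"a.notyet": b""},
    {"a.notyet": b"", "b.notyet": b""},
    {"a.notyet": b"text about a", "b.notyet": b""},
    {"a.notyet": b"text about a", "b.notyet": b"", "c.blog": b""},
    {"a.notyet": b"text about a", "b.notyet": b"", "c.blog": b"text about c"},
]

print(follow(history, "c.blog"))  # ['c.blog', 'b.notyet', 'a.notyet']
```

Even in this tiny model, following c.blog backward ends at the oldest empty file, although the three files never had anything to do with one another.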

There is a --find-renames=... flag to tell Git how conservative to be when guessing that a file might have been renamed and modified at the same time. The default is 50%. But even turning it up to 100% doesn't help with this problem, because in this case the false positives are files that are actually identical.
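One way to see why no similarity percentage can separate these files: Git names file contents by a hash of those contents, so every empty file is the very same blob. The Python sketch below recomputes a blob ID the way Git's SHA-1 object format does, hashing a "blob <size>" header, a NUL byte, and then the contents. The helper name git_blob_id is made up for this illustration and is not part of Git.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)  # "blob <size>" followed by NUL
    return hashlib.sha1(header + content).hexdigest()

# Stand-ins for two of the empty files from the history above.
watchmen = git_blob_id(b"")
mega = git_blob_id(b"")

print(watchmen)          # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(watchmen == mega)  # True
```

Two files with the same blob ID are 100% similar by any measure, so raising the threshold cannot rule them out.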

As far as I can tell there is no option to set an absolute threshold on when two files are considered the same by --follow. Perhaps it would be enough to tell Git that it should simply not try to follow files whose size is less than n bytes, for some small n, perhaps even n = 1.
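To make the proposal concrete, here is a minimal sketch of such a size rule in Python. It is not Git code and does not correspond to any existing Git option. The function name, the min_size parameter, the (path, size) representation, and the non-empty example file with its size are all hypothetical.

```python
from typing import Iterable, List, Tuple

FileInfo = Tuple[str, int]  # (path, size in bytes); hypothetical representation

def rename_candidates(files: Iterable[FileInfo], min_size: int = 1) -> List[str]:
    """Keep only files large enough to be plausible rename or copy sources.

    With the default min_size=1, empty files are never considered.
    """
    return [path for path, size in files if size >= min_size]

files = [
    ("wikipedia/mega.notyet", 0),   # empty, as in the history above
    ("tech/mercury.notyet", 0),     # empty, as in the history above
    ("example/nonempty.blog", 134), # hypothetical non-empty file
]

print(rename_candidates(files))  # ['example/nonempty.blog']
```

With the empty files filtered out before any matching happens, the trail for a file that started out empty would simply end at the commit that created it.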

The part I don't fully understand is how git-log --follow is generating its stat outputs. Certainly it's not doing it in the same way that git show is. Instead it is trying to do something clever, to highlight the copies and renames it thinks it has found, and in this case it goes badly wrong.

The problem appears in Git 1.7.11, 2.7.4, and 2.13.0.

[ Addendum 20180912: A followup about my work on a fix for this. ]

