The Universe of Discourse


Wed, 12 Sep 2018

Perils of hacking on mature software

Yesterday I wrote up an interesting bug in git-log --follow's handling of empty files. Afterward I thought I'd see if I could fix it.

People complain that the trouble of working on mature software like Git is to understand the way the code is structured, its conventions, the accumulated layers of cruft, and where everything is. I think this is a relatively minor difficulty. The hard part is no so much doing what you want, as knowing what you want to do.

My original idea for the fix was this: I can give git log a new option, say --follow-size-threshhold=n. This would disable all copy and rename detection for any files of size less than n bytes. If not specified or configured, n would default to 1, so that the default behavior would disable copy and rename detection of empty files but not of anything else. I was concerned that an integer option was unnecessarily delicate. It might have been sufficient to have a boolean --follow-empty-files flag. But either way the programming would be almost the same and it would be easy to simplify the option later if the Git maintainers wanted it that way

I excavated the code and found where the change needed to go. It's not actually in git-log itself. Git has an internal system for diffing pairs of files, and git-log --follow uses this to decide when two blobs are similar enough for it to switch from following one to the other. So the flag actually needed to be added to git-diff, where I called it --rename-size-threshhold. Then git-log would set that option internally before using the Git diff system to detect renames.

But then I ran into a roadblock. Diff already has an undocumented flag called --rename-empty that tells it to report on renames of empty files in certain contexts — not the context I was interested in unfortunately. The flag is set by default, but it is cleared internally when git-merge is resolving conflicts. The issue it addresses is this: Suppose the merge base has some empty file X. Somewhere along the line X has been removed. In one branch, an unrelated empty file Y has been created, and in the other branch a different unrelated empty file Z has been created. When merging these two branches, Git will detect a merge conflict: was file X moved to location Y or to location Z? This ⸢conflict⸣ is almost certainly spurious, and is is very unlikely that the user will thank us for demanding that they resolve it manually. So git-merge sets --no-rename-empty internally and Git resolves the ⸢conflict⸣ automatically.

(See this commit for further details.)

The roadblock is: how does --rename-empty fit together with my proposed --rename-size-threshhold flag? Should they be the same thing? Or should they be separate options? There appear to be at least three subsystems in Git that try to decide if two similar or identical files (which might have different names, or the same name in different directories) are “the same file” for various purposes. Do we want to control the behavior of these subsystems separately or in unison?

If they should be controlled in unison, should --rename-size-threshhold be demoted to a boolean, or should --rename-empty be promoted to an integer? And if they should be the same, what are the implications for backward compatibility? Should the existing --rename-empty be documented?

If we add new options, how do they interact with the existing and already non-orthogonal flags that do something like this? They include at least the following options of git-diff, git-log, and git-show:

--follow
--find-renames=n 
--find-copies
--find-copies-harder
-l 

Only git-log has --follow and my new feature was conceived as a modification of it, which is why I named it --follow-size-threshhold. But git-log wouldn't be implementing this itself, except to pass the flag into the diff system. Calling it --follow-size-threshhold in git-diff didn't make sense because git-diff doesn't have a --follow option. It needs a different name. But if I do that, then we have git-diff and git-log options with different names that nevertheless do exactly the same thing. Confusing!

Now suppose you would like to configure a default for this option in your .gitconfig. Does it make sense to have both diff.renameSizeThreshhold and log.followSizeThreshhold options? Not really. It would never be useful to set one but not the other. So eliminate log.followSizeThreshhold. But now someone like me who wants to change the behavior of git-log --follow will not know to look in the right place for the option they need.

The thing to do at this point is to come up with some reasonable-seeming proposal and send it to Jeff King, who created the undocumented --rename-empty feature, and who is also a good person to work with. But coming up with a good solution entirely on my own is unlikely.

Doing any particular thing would not be too hard. The hard part is deciding what particular thing to do.


[Other articles in category /prog] permanent link