The Universe of Disco


Sat, 03 Aug 2019

Git wishlist: aggregate changes across non-contiguous commits

(This is actually an essay on the difference between science and engineering.)

My co-worker Lemuel recently asked if there was a way to see all the changes to master from the last week that pertained to a certain ticket. The relevant commit messages all contained the ticket ID, so he knew which commits he wanted; that part is clear. Suppose Lemuel wanted to see the changes introduced in commits C, E, and H, but not those from A, B, D, F, or G.

The closest he could come was git show H E C, which wasn't quite what he wanted. It describes the complete history of the changes, but what he wanted is more analogous to a diff. For comparison, imagine a world in which git diff A H didn't exist, and you were told to use git show A B C D E F G H instead. See the problem? What Lemuel wants is more like diff than like show.

Lemuel's imaginary command would solve another common request: How can I see all the changes that I have landed on master in a certain time interval? Or similarly: how can I add up the git diff --stat line counts for all my commits in a certain interval?

He said:

It just kinda boggles my mind you can't just get a collective diff on command for a given set of commits

I remember that when I was first learning Git, I often felt boggled in this way. Why can't it just…? And there are several sorts of answers, of which one or more might apply in a particular situation:

  1. It surely could, but nobody has done it yet
  2. It perhaps could, but nobody is quite sure how
  3. It maybe could, but what you want is not as clear as you think
  4. It can't, because that is impossible
  5. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question

Often, engineers will go straight to #5, when actually the answer is in a higher tier. Or they go to #4 without asking if maybe, once the desiderata are clarified a bit, it will move from “impossible” to merely “difficult”. These are bad habits.

I replied to Lemuel's (implicit) question here and tried to make it a mixture of 2 and 3, perhaps with a bit of 4:

Each commit is a snapshot of the state of the repo at a particular instant. A diff shows you the difference between two snapshots. When you do git show commit you're looking at the differences between the snapshot at that commit and at its parent.

Now suppose you have commit A with parent B, and commit C with parent D. I come to you and say I want to see the differences in both A and C at that same time. What would you have it do?

If A and B are on a separate branch and are completely unrelated to C and D, it is hard to see what to do here. But it's not impossible. Our hypothetical command could produce the same output as git show A C. Or it could print an error message Can't display changes from unrelated commits A, C and die without any more output. Either of those might be acceptable.

And if A, B, C, D are all related and on the same branch, say with D , then C, then B, then A, the situation is simpler and perhaps we can do better.

If so, very good, because this is probably the most common case by far. Note that Lemuel's request is of this type.

I continued:

Suppose, for example,that C changes some setting from 0 to 1, then B changes it again to be 2, then A changes it a third time, to say 3. What should the diff show?

This is a serious question, not a refutation. Lemuel could quite reasonably reply by saying that it should show 0 changing to 3, the intermediate changes being less important. (“If you wanted to see those, you should have used git show A C.”)

It may be that that wouldn't work well in practice, that you'd find there were common situations where it really didn't tell you what you wanted to know. But that's something we'd have to learn by trying it out.

I was trying really hard to get away from “what you want is stupid” and toward “there are good reasons why this doesn't exist, but perhaps they are surmountable”:

(I'm not trying to start an argument, just to reduce your bogglement by explaining why this may be less well-specified and more complex than you realize.)

I hoped that Lemuel would take up my invitation to continue the discussion and I tried to enocurage him:

I've wanted this too, and I think something like it could work, especially if all the commits are part of the same branch. … Similarly people often want a way to see all the changes made only by a certain person. Your idea would answer that use case also.

Let's consider another example. Suppose some file contains functions X, Y, Z in that order. Commit A removes Y entirely. Commit B adds a new function, YY, between X and Z. Commit C modifies YY to produce YY'. Lemuel asks for the changes introduced by A and C; he is not interested in B. What should happen?

If Y and YY are completely unrelated, and YY just happens to be at the same place in the file, I think we definitely want to show Y being removed by A, and then that C has made a change to an unrelated function. We certainly don't want to show all of YY beind added. But if YY is considered to be a replacement for Y, I'm not as sure. Maybe we can show the same thing? Or maybe we want to pretend that A replaced Y with YY? That seems dicier now than when I first thought about it, so perhaps it's not as big a problem as I thought.

Or maybe it's enough to do the following:

  1. Take all the chunks produced by the diffs in the output of git show .... In fact we can do better: if A, B, and C are a contiguous sequence, with A the parent of B and B the parent of C, then don't use the chunks from git show A B C; use git diff A C.

  2. Sort the chunks by filename.

  3. Merge the chunks that are making changes to the same file:

    • If two chunks don't overlap at all, there's no issue, just keep them as separate chunks.

    • If two chunks overlap and don't conflict, merge them into a single chunk

    • If they overlap and do conflict, just keep them separate but retain the date and commit ID information. (“This change, then this other change.”)

  4. Then output all the chunks in some reasonable order: grouped by file, and if there were unmergeable chunks for the same file, in chronological order.

This is certainly doable.

If there were no conflicts, it would certainly be better than git show ... would have been. Is it enough better to offset whatever weirdness might be introduced by the overlap handling? (We're grouping chunks by filename. What if files are renamed?) We don't know, and it does not even have an objective answer. We would have to try it, and then the result might be that some people like it and use it and other people hate it and refuse to use it. If so, that is a win!


[Other articles in category /prog] permanent link