Git wishlist: aggregate changes across non-contiguous commits
(This is actually an essay on the difference between science and
engineering.)
My co-worker Lemuel recently asked if there was a way to see all the
changes to master from the last week that pertained to a certain
ticket. The relevant commit messages all contained the ticket ID, so
he knew which commits he wanted; that part is clear. Suppose Lemuel
wanted to see the changes introduced in commits C, E, and H, but not
those from A, B, D, F, or G.
The closest he could come was git show H E C, which wasn't quite what
he wanted. It describes the complete history of the changes, but what
he wanted is more analogous to a diff. For comparison, imagine a
world in which git diff A H didn't exist, and you were told to use
git show A B C D E F G H instead. See the problem? What Lemuel wants
is more like diff than like show.
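For reference, the git show approach can be scripted. This is a minimal sketch, assuming the ticket ID appears literally in each commit message; the function names and the ticket ID PROJ-123 in the usage note are hypothetical, and the defaults ("1 week ago", master) just mirror Lemuel's question. It only automates the unsatisfying answer: the output is still a sequence of per-commit diffs, not one collective diff.

```python
import subprocess


def ticket_log_argv(ticket_id, since="1 week ago", branch="master"):
    """Build a `git log` command listing matching commit IDs, oldest first."""
    return [
        "git", "log", "--reverse",
        f"--since={since}",
        "--fixed-strings", f"--grep={ticket_id}",  # match the ID literally
        "--format=%H",                              # print only the commit hash
        branch,
    ]


def commits_for_ticket(ticket_id, **kwargs):
    """Run the command above in the current repository and return the IDs."""
    result = subprocess.run(
        ticket_log_argv(ticket_id, **kwargs),
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def show_commits(commit_ids):
    """Return the `git show` output for the given commits."""
    if not commit_ids:
        return ""  # a bare `git show` would display HEAD instead
    result = subprocess.run(
        ["git", "show", *commit_ids],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Inside a repository, print(show_commits(commits_for_ticket("PROJ-123"))) would print the selected commits one after another.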
Lemuel's imaginary command would solve another common request: How can
I see all the changes that I have landed on master in a certain
time interval? Or similarly: how can I add up the git diff --stat
line counts for all my commits in a certain interval?
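The line-count version of the question can already be approximated today by summing per-commit counts. The sketch below assumes the machine-readable --numstat output of git log (tab-separated added count, deleted count, and path); the function names are made up for illustration. Note that it adds up each commit's counts separately, so a line changed three times is counted three times, which is exactly the ambiguity discussed further on.

```python
import subprocess


def sum_numstat(numstat_output):
    """Total (added, deleted) line counts in `git log --numstat` output."""
    added = deleted = 0
    for line in numstat_output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # blank lines between commits
        a, d, _path = fields
        if a == "-" or d == "-":
            continue  # binary files report "-" instead of counts
        added += int(a)
        deleted += int(d)
    return added, deleted


def my_line_counts(author, since, until="now", branch="master"):
    """Sum the counts for one author's commits in a time interval."""
    result = subprocess.run(
        ["git", "log", f"--author={author}", f"--since={since}",
         f"--until={until}", "--numstat", "--pretty=tformat:", branch],
        check=True, capture_output=True, text=True,
    )
    return sum_numstat(result.stdout)
```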
He said:
It just kinda boggles my mind you can't just get a collective diff on
command for a given set of commits
I remember that when I was first learning Git, I often felt boggled in
this way. Why can't it just…? And there are several sorts of
answers, of which one or more might apply in a particular situation:
1. It surely could, but nobody has done it yet
2. It perhaps could, but nobody is quite sure how
3. It maybe could, but what you want is not as clear as you think
4. It can't, because that is impossible
5. I am not able rightly to apprehend the kind of confusion of ideas
   that could provoke such a question
Often, engineers will go straight to #5, when actually the answer is
in a higher tier. Or they go to #4 without asking if maybe, once the
desiderata are clarified a bit, it will move from “impossible” to
merely “difficult”. These are bad habits.
I replied to Lemuel's (implicit) question here and tried to make it a
mixture of 2 and 3, perhaps with a bit of 4:
Each commit is a snapshot of the state of the repo at a particular
instant. A diff shows you the difference between two snapshots. When
you do git show commit you're looking at the differences between the
snapshot at that commit and at its parent.
Now suppose you have commit A with parent B, and commit C with parent
D. I come to you and say I want to see the differences in both A and
C at the same time. What would you have it do?
If A and B are on a separate branch and are completely unrelated to C
and D, it is hard to see what to do here. But it's not impossible.
Our hypothetical command could produce the same output as git show A
C . Or it could print an error message Can't display changes from
unrelated commits A, C and die without any more output. Either of
those might be acceptable.
And if A, B, C, D are all related and on the same branch, say with D,
then C, then B, then A, the situation is simpler and perhaps we can do
better. If so, very good, because this is probably the most common
case by far. Note that Lemuel's request is of this type.
I continued:
Suppose, for example, that C changes some setting from 0 to 1, then
B changes it again to be 2, then A changes it a third time, to say
3. What should the diff show?
This is a serious question, not a refutation. Lemuel could quite
reasonably reply by saying that it should show 0 changing to 3, the
intermediate changes being less important. (“If you wanted to see
those, you should have used git show A C.”)
It may be that that wouldn't work well in practice, that you'd find
there were common situations where it really didn't tell you what you
wanted to know. But that's something we'd have to learn by trying it
out.
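To make the candidate answers concrete, here is a toy model of the setting example. It is only a sketch: the history is a hypothetical one-line file, commits are plain dictionary entries rather than real Git objects, and net_diff is just one possible reading of the request (compare the snapshot before the earliest selected commit with the snapshot at the latest one).

```python
import difflib

# Hypothetical linear history, oldest first: (commit, file contents).
HISTORY = [
    ("D", ["setting = 0\n"]),
    ("C", ["setting = 1\n"]),
    ("B", ["setting = 2\n"]),
    ("A", ["setting = 3\n"]),
]
SNAPSHOT = dict(HISTORY)
PARENT = {child: parent for (parent, _), (child, _) in zip(HISTORY, HISTORY[1:])}


def changed_lines(old, new):
    """Return only the removed and added lines of a unified diff."""
    lines = list(difflib.unified_diff(old, new, lineterm=""))
    # The first two lines are the "---" and "+++" file headers.
    return [l.rstrip("\n") for l in lines[2:] if l.startswith(("+", "-"))]


def show(commit):
    """Like git show: the commit's snapshot compared with its parent's."""
    return changed_lines(SNAPSHOT[PARENT[commit]], SNAPSHOT[commit])


def net_diff(commits):
    """One candidate 'collective diff': first selected parent vs last snapshot."""
    order = [name for name, _ in HISTORY]
    chosen = sorted(commits, key=order.index)
    return changed_lines(SNAPSHOT[PARENT[chosen[0]]], SNAPSHOT[chosen[-1]])


print(show("C"))             # ['-setting = 0', '+setting = 1']
print(show("A"))             # ['-setting = 2', '+setting = 3']
print(net_diff(["A", "C"]))  # ['-setting = 0', '+setting = 3']
```

In this model net_diff silently folds in B's change as well. Here that is harmless, but it is the kind of behavior that would have to be tried out in practice.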
I was trying really hard to get away from “what you want is stupid”
and toward “there are good reasons why this doesn't exist, but perhaps
they are surmountable”:
(I'm not trying to start an argument, just to reduce your bogglement by
explaining why this may be less well-specified and more complex than
you realize.)
I hoped that Lemuel would take up my invitation to continue the
discussion and I tried to encourage him:
I've wanted this too, and I think something like it could work,
especially if all the commits are part of the same branch. …
Similarly people often want a way to see all the changes made only
by a certain person. Your idea would answer that use case also.
Let's consider another example. Suppose some file contains functions
X, Y, Z in that order. Commit A removes Y entirely. Commit B adds a
new function, YY, between X and Z. Commit C modifies YY to produce
YY'. Lemuel asks for the changes introduced by A and C; he is not
interested in B. What should happen?
If Y and YY are completely unrelated, and YY just happens to be at the
same place in the file, I think we definitely want to show Y being
removed by A, and then that C has made a change to an unrelated
function. We certainly don't want to show all of YY being added. But
if YY is considered to be a replacement for Y, I'm not as sure. Maybe
we can show the same thing? Or maybe we want to pretend that A
replaced Y with YY? That seems less dicey now than when I first thought
about it, so perhaps it's not as big a problem as I thought.
Or maybe it's enough to do the following:
1. From the output of git show ..., take all the chunks produced by
   the diffs. In fact we can do better: if A, B, and C are a
   contiguous sequence, with A the parent of B and B the parent of C,
   then don't use the chunks from git show A B C; use git diff A^ C
   (A^ is A's parent, so that A's own changes are included).
2. Sort the chunks by filename.
3. Merge the chunks that are making changes to the same file:
   - If two chunks don't overlap at all, there's no issue, just keep
     them as separate chunks.
   - If two chunks overlap and don't conflict, merge them into a
     single chunk.
   - If they overlap and do conflict, just keep them separate but
     retain the date and commit ID information. (“This change, then
     this other change.”)
4. Then output all the chunks in some reasonable order: grouped by
   file, and if there were unmergeable chunks for the same file, in
   chronological order.
This is certainly doable.
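A toy version of the sort-and-merge steps might look like the following. Everything here is an assumption made for illustration: a chunk is modeled as a set of (line number, new text) edits, all line numbers are taken to refer to one shared version of the file (real diffs from different commits do not), two chunks "conflict" when they assign different text to the same line, and chained merges are not re-checked against earlier chunks. The Chunk class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    filename: str
    commit: str   # commit ID(s) the chunk came from
    date: str     # ISO date, so string order is chronological order
    edits: tuple  # non-empty ((line_number, new_text), ...), sorted by line

    @property
    def start(self):
        return self.edits[0][0]

    @property
    def end(self):
        return self.edits[-1][0]


def overlaps(a, b):
    """True if both chunks touch the same file and their line ranges meet."""
    return a.filename == b.filename and a.start <= b.end and b.start <= a.end


def conflicts(a, b):
    """Assumed definition: the chunks disagree about some shared line."""
    ea, eb = dict(a.edits), dict(b.edits)
    return any(ea[n] != eb[n] for n in ea.keys() & eb.keys())


def merge(a, b):
    """Combine two compatible chunks, remembering both commit IDs."""
    edits = dict(a.edits)
    edits.update(b.edits)
    return Chunk(a.filename, f"{a.commit}+{b.commit}",
                 max(a.date, b.date), tuple(sorted(edits.items())))


def aggregate(chunks):
    """Group by file, merge what can be merged, keep conflicts separate."""
    result = []
    for chunk in sorted(chunks, key=lambda c: (c.filename, c.date)):
        for i, earlier in enumerate(result):
            if overlaps(earlier, chunk) and not conflicts(earlier, chunk):
                result[i] = merge(earlier, chunk)
                break
        else:
            result.append(chunk)  # disjoint or conflicting: keep as-is
    return sorted(result, key=lambda c: (c.filename, c.date, c.start))
```

Conflicting chunks come out in chronological order with their commit IDs intact, which is the "this change, then this other change" presentation.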
If there were no conflicts, it would certainly be better than git
show ... would have been. Is it enough better to offset whatever
weirdness might be introduced by the overlap handling? (We're
grouping chunks by filename. What if files are renamed?) We don't
know, and the question does not even have an objective answer. We would have to
try it, and then the result might be that some people like it and use
it and other people hate it and refuse to use it. If so, that is a win!