[Solved] Why does output differ under git diff vs. git diff --staged?


You are, I think, being misled. Git doesn’t store changes at all. The whole thing seems very mysterious until you realize that Git just stores everything intact, but does so in a weird way.

What Git stores permanently

First and most important, Git doesn’t exactly store files. It winds up doing so, but that’s because Git stores commits, and each individual commit contains (all!) the files. That is, at some earlier point during development, you—or someone—told Git: Here’s this entire file-tree, some set of folders / directories containing files and sub-directories that contain more files and sub-directories and so on. Make a snapshot of how they all look right now. That snapshot, that entire copy of everything, goes into a new commit.

Next, commits, once made, are mostly permanent, and completely, totally, 100% read-only. You cannot change anything that’s inside a commit. You can just think of them as permanent: the only time a commit can truly go away is if you carefully arrange to make sure that no one—not yourself, nor anyone else—can find it later, using git reset or similar tools.

For many reasons, including not having the repository get enormously fat if you make many commits that keep re-using most of the old versions of most files, the files that are stored inside commits are kept in a special, compressed, Git-only format. Since the files inside commits are frozen, it is safe for commits to share them: if new commit C9 is just like its previous commit C8 except for one file, the two commits share a single stored copy of each identical file.
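If you want to see that sharing for yourself, here is a minimal sketch. It assumes a throwaway repository in a temporary directory; the file contents, the user name and email, and the commit messages C8 and C9 are made up for illustration. It asks Git for the internal object ID of each file as stored in each of the two commits:

```shell
# Scratch repository in a temporary directory (all names here are illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# First commit, standing in for C8: two files.
printf 'hello\n' > README.md
printf 'print("v1")\n' > main.py
git add README.md main.py
git commit -q -m "C8"

# Second commit, standing in for C9: only main.py differs.
printf 'print("v2")\n' > main.py
git add main.py
git commit -q -m "C9"

# Ask for the ID of the stored file object inside each commit.
readme_c8=$(git rev-parse HEAD~1:README.md)
readme_c9=$(git rev-parse HEAD:README.md)
main_c8=$(git rev-parse HEAD~1:main.py)
main_c9=$(git rev-parse HEAD:main.py)

echo "README.md in C8: $readme_c8"
echo "README.md in C9: $readme_c9"   # same ID as in C8: one shared copy
echo "main.py   in C8: $main_c8"
echo "main.py   in C9: $main_c9"     # different ID: a new stored copy
```

The two README.md IDs come out identical, which is Git's way of saying both commits point at the same stored object.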

What Git lets you work with, temporarily

Since you can’t change any commit, at all, ever, Git would be useless if it did not have a way to extract all the files from some commit. Extracting a commit copies all of its files out of the deep-freeze, and then de-compresses the files and turns them back into ordinary, every-day files that you and your computer can work with. These files are copies of what was in that Git commit, but here, in this work area—the work-tree or working tree—they’re useful to you and your computer, and you can change them any way you like.

Git complicates things with its index

Now comes the tricky bit. Other version control systems may stop here: they too have commits, which save the files forever in frozen form, and a work-tree, which lets you work on the files in ordinary form. To make a new commit, those other version control systems slowly, painfully, one by one, take each work-tree file, compress it down to get it ready for freezing, and then check to see if that frozen file will be the same as the old one. If so, they can re-use the old file! If not, they do whatever it takes to save away the new file. This is terribly slow, and there are various ways to speed it up, which they do use in general, but in these non-Git version control systems, after using their “commit” command, you can often get up and go get coffee, or go for a walk or have lunch or something.

Git does something radically different, and this is why git commit is so fast, compared to those other systems. When Git is taking files out of the deep-freeze to put into your work-tree, Git keeps a sort of semi-frozen (“slushy”, if you will) copy of every file, ready to go into the next commit. Initially, these copies all match the frozen commit copy.

These sort-of-slushy copies of files are in what Git calls, variously, the index, the staging area, or the cache, depending on who or what part of Git is doing the calling. The key difference between these index copies of every file, and the frozen copy in the current commit, is that the committed copies really are frozen. They can’t be changed. The index copies are only almost frozen: they can be changed, by writing a new file into the index in place of that old one.
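You can peek at the index copies directly. The following is only a sketch, run in a scratch repository with made-up file contents and identity settings. It uses git ls-files --stage to list the index, and git rev-parse with the :path and HEAD:path forms to get the object ID of the index copy and of the committed copy:

```shell
# Scratch repository in a temporary directory (all names here are illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'line one\n' > README.md
git add README.md
git commit -q -m "initial"

# List what the index holds: mode, object ID, stage number, and path.
git ls-files --stage

# ":README.md" names the index copy; "HEAD:README.md" names the committed copy.
index_before=$(git rev-parse :README.md)
head_before=$(git rev-parse HEAD:README.md)

# Change the work-tree file, then write it into the index in place of the old copy.
printf 'line two\n' >> README.md
git add README.md
index_after=$(git rev-parse :README.md)
head_after=$(git rev-parse HEAD:README.md)

echo "index copy, right after the commit: $index_before"
echo "index copy, after the second add:   $index_after"
echo "HEAD copy (never changes):          $head_after"
```

Right after the commit, the index copy and the HEAD copy have the same ID. After the second git add, only the index copy has moved on.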

What this means, in the end, is that for every file in the commit, you wind up with not two but three active copies, when you tell Git to make that commit be the current commit, using git checkout somebranch. (This checkout selects somebranch as the current branch name and therefore also extracts what Git calls its tip commit to be the current commit. There’s always a name for this current commit: Git calls it HEAD.) Suppose, for instance, that the tip commit of master has two files, named README.md and main.py:

   HEAD           index         work-tree
---------       ---------       ---------
README.md       README.md       README.md
main.py         main.py         main.py

At this point, all three copies of each file match each other. That is, all three README.mds are the same, except in terms of their format: the one in HEAD is frozen and Git-only; the one in the index is semi-frozen and Git-only; and the one in your work-tree is usable and useful to you; but all three represent the same file contents. The same goes for the three copies of main.py.

Now suppose you change one (or both) of the work-tree files. For instance, suppose you change your work-tree README.md. Let’s mark it with a (2) to indicate that it’s different, and mark the old ones with (1) to remember which the old ones were:

    HEAD            index         work-tree
------------    ------------    ------------
README.md(1)    README.md(1)    README.md(2)
main.py(1)      main.py(1)      main.py(1)

You can now ask Git to compare the index copies of every file to the work-tree copies of every file, and you’ll see your change to README.md. Comparing HEAD to the index, on the other hand, shows nothing, because those two still match.

When you run git add, you are really telling Git: Take the work-tree copies of the files I’m adding, and prepare them for freezing. Git will copy the work-tree copy of README.md or main.py (or both) back into the index, Git-ifying the contents, getting them ready for the next freeze. After adding just README.md, for instance:

    HEAD            index         work-tree
------------    ------------    ------------
README.md(1)    README.md(2)    README.md(2)
main.py(1)      main.py(1)      main.py(1)

This time, asking Git to compare the index copy (of everything) to the work-tree copy (of everything) shows nothing! They are the same, after all. To see a difference, you must ask Git to compare the HEAD commit to the index, or the HEAD commit to the work-tree. Either will suffice right now, because right now the index and work-tree match again.
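Here is a small sketch of that flip, assuming a scratch repository with illustrative file contents. It records which file names each comparison reports, once before git add and once after:

```shell
# Scratch repository in a temporary directory (all names here are illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'line one\n' > README.md
printf 'print("hi")\n' > main.py
git add README.md main.py
git commit -q -m "initial"

# Edit only the work-tree copy: HEAD and index hold (1), work-tree holds (2).
printf 'line two\n' >> README.md
unstaged_before=$(git diff --name-only)           # index vs work-tree
staged_before=$(git diff --cached --name-only)    # HEAD vs index

# Copy the work-tree version into the index: index and work-tree both hold (2).
git add README.md
unstaged_after=$(git diff --name-only)
staged_after=$(git diff --cached --name-only)

echo "before git add: index-vs-work-tree='$unstaged_before' HEAD-vs-index='$staged_before'"
echo "after  git add: index-vs-work-tree='$unstaged_after' HEAD-vs-index='$staged_after'"
```

Before the add, README.md shows up only in the index-vs-work-tree comparison. After the add, it shows up only in the HEAD-vs-index comparison.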

Note, however, that you can change the work-tree copy again after you use git add. Suppose you modify README.md one more time, giving:

    HEAD            index         work-tree
------------    ------------    ------------
README.md(1)    README.md(2)    README.md(3)
main.py(1)      main.py(1)      main.py(1)

Now all three copies of main.py match, but all three copies of README.md are different. So now it matters whether you have Git compare HEAD vs index, or HEAD vs work-tree, or index vs work-tree: each will show a different change to README.md.
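To reproduce this three-way situation, you could try something like the sketch below. It assumes a scratch repository, and the three README.md versions are simply one, two, and three lines long so that they are easy to tell apart in the output:

```shell
# Scratch repository in a temporary directory (all names here are illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'line one\n' > README.md
printf 'print("hi")\n' > main.py
git add README.md main.py
git commit -q -m "initial"            # README.md(1) in HEAD, index, and work-tree

printf 'line two\n' >> README.md
git add README.md                     # index now holds README.md(2)
printf 'line three\n' >> README.md    # work-tree now holds README.md(3)

head_vs_index=$(git diff --cached)    # (1) vs (2)
index_vs_worktree=$(git diff)         # (2) vs (3)
head_vs_worktree=$(git diff HEAD)     # (1) vs (3)

echo "--- HEAD vs index ---";      echo "$head_vs_index"
echo "--- index vs work-tree ---"; echo "$index_vs_worktree"
echo "--- HEAD vs work-tree ---";  echo "$head_vs_worktree"
```

The first comparison adds only "line two", the second adds only "line three", and the third adds both. None of them mentions main.py.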

Git makes new commits from the index

When and if you do choose to make a new commit—a new permanent snapshot of all the files as they stand now—Git makes the new commit’s snapshot using the semi-frozen files in the index. All that the commit verb has to do with them is finish the freezing process (which, at a technical level, consists of making tree objects to hold them, but you don’t need to know this). So git commit collects your name, email, the time, your log message, and the current commit’s hash ID, freezes the index, and puts all of those together into a new commit. The new commit becomes the HEAD commit, so that now HEAD refers to the new commit. If the old commit was C8 and the new one is C9, HEAD used to mean C8, but now it means C9.

Once that commit finishes, the HEAD and index copies of every file automatically match. It’s obvious that they must, since the new HEAD was made from the index. So if you make that new commit with the index holding the middle version of README.md, you get:

    HEAD            index         work-tree
------------    ------------    ------------
README.md(2)    README.md(2)    README.md(3)
main.py(1)      main.py(1)      main.py(1)

Note that Git completely ignored the work-tree during this process! There’s a way to tell git commit that it should look at the work-tree and automatically run git add, but let’s leave that for later.
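A quick way to convince yourself that the commit comes from the index, and not from the work-tree, is the following sketch. It assumes a scratch repository in which the staged README.md has two lines and the work-tree README.md has three; the contents and commit messages are illustrative only:

```shell
# Scratch repository in a temporary directory (all names here are illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'line one\n' > README.md
git add README.md
git commit -q -m "initial"

printf 'line two\n' >> README.md
git add README.md                     # index: two lines
printf 'line three\n' >> README.md    # work-tree: three lines

# Commit: the snapshot is taken from the index.
git commit -q -m "second"

committed=$(git show HEAD:README.md)              # contents stored in the new commit
worktree=$(cat README.md)                         # contents of the work-tree file
staged_after=$(git diff --cached --name-only)     # HEAD vs index
unstaged_after=$(git diff --name-only)            # index vs work-tree

echo "committed copy:";  echo "$committed"
echo "work-tree copy:";  echo "$worktree"
echo "still staged: '$staged_after'  still unstaged: '$unstaged_after'"
```

The committed copy has two lines, the work-tree copy still has three, and the only remaining difference is between the index and the work-tree.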

The summary of this particular section is that a good way to think of the index is: The index contains the next commit you propose to make. The git add command means: Update my proposed next commit. This explains why you have to git add all the time.

Git’s diff verb

Because there are these three simultaneous, active copies of each file—one permanent, one proposed for the next commit, and one that you can actually see and work with—Git needs a way to compare these things. The diff verb is how you ask Git to compare two things, and its options are how you select which two things to compare:

  • git diff commit-A commit-B tells Git: Extract the snapshot in commit A to a temporary area; extract the snapshot in commit B to a temporary area, and then compare them and show me what’s different. This is useful in general, but not so much when making a new commit, since it’s about existing, frozen, unchangeable commits.

  • git diff—with no options or commit specifiers at all—tells Git: Compare the index to the work-tree. Git does not look at any actual commit, it just looks at the index—the proposed next commit—and compares to your usable copies of files. Whenever something is different, you could use git add to copy it into the index. So this tells you what you could git add, if you wanted.

  • git diff --cached or git diff --staged—the options have exactly the same meaning—tells Git: Compare the HEAD commit to the index. This time, Git does not look at your work-tree at all. It just finds out what’s different between the current commit and the proposed next commit. That is, this is what would be different if you committed right now.

  • git diff HEAD (or more generally, git diff commit) tells Git: Compare what’s in the commit I named, such as HEAD, to what’s in the work-tree. This time, Git ignores your index, and just goes with the specific commit—such as HEAD—and the contents of the work-tree. This is not as useful as the HEAD-vs-index or index-vs-work-tree comparisons, but you can do it if you want.

There are, of course, more ways you might want to compare any two items, so git diff has a lot of options. But these are the main ones of interest at this point.
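As a sketch of two points from the list above, the commit-to-commit form and the fact that --cached and --staged mean the same thing, you could run something like this in a scratch repository. The two commits and the file contents are made up for illustration, and HEAD~1 here just names the commit before the current one:

```shell
# Scratch repository in a temporary directory (all names here are illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'line one\n' > README.md
git add README.md
git commit -q -m "first"

printf 'line two\n' >> README.md
git add README.md
git commit -q -m "second"

printf 'line three\n' >> README.md
git add README.md                     # staged, but not committed

# Compare two existing commits; index and work-tree are not consulted.
between_commits=$(git diff HEAD~1 HEAD)

# Compare HEAD to the index, spelled both ways.
with_cached=$(git diff --cached)
with_staged=$(git diff --staged)

echo "--- first commit vs second commit ---"; echo "$between_commits"
echo "--- HEAD vs index ---";                 echo "$with_cached"
```

The commit-to-commit diff shows only "line two" being added, even though "line three" is already staged, and the two spellings of the HEAD-vs-index comparison produce the same text.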

git status runs two git diffs

Note that the two most useful git diffs above, when you’re actively developing, are git diff --cached, which tells you what would be different if you committed right now, and git diff with no options, which tells you what else could be different if you ran git add right now. The git status command, which you should use often, runs both of these diffs for you! It runs them with the --name-status flag set, internally, so that instead of showing the actual differences, it just shows the file’s name if the file is changed.[1]

Let’s see that again: git status runs two git diff commands. The first one is git diff --cached, i.e., what’s different between HEAD and the proposed commit. These are changes that are staged for commit. The second is a plain git diff, i.e., what’s different between the index (the proposed commit) and the work-tree. These are changes that are not staged for commit.
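To line the two diffs up against git status, here is a rough sketch in a scratch repository, with illustrative contents, where README.md has one staged change and one further unstaged change. It runs the two comparisons by hand with --name-status, then asks for the short form of git status, whose two status columns stand for the index and the work-tree respectively:

```shell
# Scratch repository in a temporary directory (all names here are illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'line one\n' > README.md
git add README.md
git commit -q -m "initial"

printf 'line two\n' >> README.md
git add README.md                     # one change staged
printf 'line three\n' >> README.md    # one more change, not staged

# The two comparisons, done by hand; each prints a status letter, a tab, and the path.
staged_names=$(git diff --cached --name-status)
unstaged_names=$(git diff --name-status)

# Short status: first column is HEAD vs index, second is index vs work-tree.
short_status=$(git status --short)

echo "staged:   $staged_names"
echo "unstaged: $unstaged_names"
echo "status:   $short_status"

# The long form groups the same information under its two headings.
git status
```

README.md is reported as modified (M) by both comparisons, so the short status carries two M's, and the long form lists the file under both headings.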

So now you know what git status tells you, and when you would want to use git diff with or without --staged to see more than just the names of the files. Remember that the changes that git diff shows you are what Git is figuring out: the files inside the index, or in the work-tree, are full, complete copies. They just may be different from each other and/or different from the full, complete copy in HEAD.


[1] The “status” part of --name-status can instead say that a file is added (is in the index, but not in the HEAD commit), for instance. Or, in some cases, it can say that a file is renamed or has had some other auxiliary change, but let’s not get into this here.
