How big should a commit be?

How big should a commit actually be in Git? The guidelines given by Git itself say that there should be one commit "per logical changeset", but this is pretty vague. Additionally, there's not really much information available on this subject on the internet. There's a lot of information about how to write proper commit messages, but not much on what actually constitutes a commit.

The culture of commits

It seems that the lack of information available on the internet about this subject is largely due to the fact that the culture of commits varies greatly from one company/project to another and even from one individual to another. It's difficult to say why that is - perhaps most people think it's obvious or perhaps most people don't care that much. Hopefully the following will convince you that having a strategy about when to commit is important.

How to commit

The mistake that I tend to find in my own workflow is that I make commits whose messages roughly summarize what changes were made rather than what they were intended to achieve. Often when writing commit messages the "what" seems much more important than the "why". This is not a good commit strategy, however, because the primary purpose of a commit is to make a change to the code base that affects the nature of the program in some way. The commit message should really describe what changed about the nature of the program, which is why the commit was made in the first place. In other words, a commit should be thought of like a patch that might be sent off to someone. You wouldn't send out nine patches, each of which steps through some changes to the code and describes what happened to it at that point. Instead you would send out one patch with a description of the desired feature/bug fix/etc. The patch itself describes what happened to the code - it doesn't need to be repeated in your commit messages. Additionally, the person on the other end shouldn't be forced to keep all nine patches together in order to introduce a single change to the nature of the code.

Reaping the rewards

Code introduced in this way has the advantage that, since each commit changes the nature of the program in some way, it is guaranteed to be compilable/runnable (although perhaps not bug free). This is because a change to the nature of the program, by definition, cannot break the program: the result would be an unrunnable program, which is really not a program at all. When people discuss changes that would be made to a program's nature, the result of that discussion is always something that should compile and run. For example, a commit entitled 'extract function blah' could very well result in a program that does not compile or run (were the references to that function updated?). However, a commit entitled 'fixed bug blah' or 'introduced feature X' will always address a change that should compile and run.

Compilability is important because when you check out a commit, you really want that commit to represent a working state of the program. Without compilable commits you can't use git bisect well, you can't revert to those commits, and code review will be difficult and unintuitive since the semantics are missing.
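To make the bisect point concrete, here is a minimal sketch, assuming a throwaway repository in which every commit leaves a runnable check script behind. The repository, the file names (check.sh, version.txt) and the commit messages are all made up for illustration; only the git commands themselves are real.

```shell
set -eu

# Throwaway repository; every name in here is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example"

# Five commits. Each one leaves check.sh runnable; from the third
# commit onwards the check fails (exit status 1), i.e. a bug crept in.
for i in 1 2 3 4 5; do
  if [ "$i" -ge 3 ]; then status=1; else status=0; fi
  printf '#!/bin/sh\nexit %s\n' "$status" > check.sh
  echo "$i" > version.txt
  git add check.sh version.txt
  git commit -q -m "Change $i"
done

# Bisect between the first commit (known good) and HEAD (known bad),
# letting git run the check at every step it visits.
first=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$first" >/dev/null
git bisect run sh check.sh >/dev/null

# refs/bisect/bad now points at the first bad commit.
culprit=$(git rev-parse refs/bisect/bad)
git bisect reset >/dev/null 2>&1
git log -1 --format=%s "$culprit"
```

This only works because the check can be run at every commit that bisect lands on. If some of those commits were half-finished changes that didn't even run, the script's exit status would tell you nothing about where the bug was introduced.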

The hardships listed above are certainly not the only hardships of mismanaging commits. In general, if your commits do not represent natural changes to your program then almost anything you do with git will be more difficult, because it will involve picking up missing pieces (other commits) lying around in order to make things work properly, whether it be sending off a patch, code review, bisection, reverting, diffing, cherry-picking or any other operation that depends on specific commits rather than a collection of them (i.e. on a branch).

What about feature branches?

If we're writing commits which say 'introduced feature X' then what's the point of having feature branches at all? In truth, many features should be introduced as a single commit. However, feature and topic branches give you a lot of freedom. Even if the feature ends up being a single commit, an advantage of having a branch is that you can easily make temporary (partial) commits that ought to be squashed down or amended later. I personally don't like using stash all that much, so feature branches allow me to isolate my work without running through a list of stashes. I'd simply place a temporary commit at the top of the branch and then resume later by doing a mixed reset back to the commit that was branched off of. With feature/topic branches that won't be pushed till later, you can drop the notion of a commit that needs to compile and represent a whole change as long as you squash those commits back down later. You can use this to drive development in various directions without having to commit to a complete change for every commit (as long as you take care of it later).
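The temporary-commit workflow described above might look something like the following. This is only a sketch under assumptions: the branch names (main, feature-x), the files and the commit messages are invented for the example, and it uses a default (mixed) git reset back to the fork point as one way of collapsing the temporary commits.

```shell
set -eu

# Throwaway repository; branch, file and message names are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example"

echo "hello" > app.txt
git add app.txt
git commit -q -m "Initial commit"
git branch -M main

# Work on a feature branch, parking unfinished work in temporary commits.
git checkout -q -b feature-x
echo "first half" > feature.txt
git add feature.txt
git commit -q -m "WIP: start feature X"
echo "second half" >> feature.txt
git commit -q -am "WIP: more of feature X"

# Later: a mixed reset (the default mode) back to the fork point drops the
# temporary commits but keeps all of their changes in the working tree.
base=$(git merge-base main feature-x)
git reset -q "$base"

# Commit the whole thing as a single natural change.
git add feature.txt
git commit -q -m "Introduce feature X"

count=$(git rev-list --count "$base"..HEAD)
git log --format=%s "$base"..HEAD
```

After this the branch carries one commit on top of main, and the WIP commits are no longer part of its history. An interactive rebase with squashing would be another way to get to the same place.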

Even without the use of temporary commits, feature branches can be useful because often a feature will need supporting changes that really are separate changes, but whose need is driven only by the present feature. Examples include refactoring a design to allow for some new implementation, or extracting some CSS to its own file because the wealth of new CSS required for a feature warrants it. These are commits that might have been made on dev or master in another life if the need had existed, but the need was prompted by the new feature. In this case, switching to dev or master and making the changes there will seem unnecessary without the context of the feature which follows. Thus, there may be any number of commits preceding the actual implementation of the feature which simply prepare the code base for the feature which is about to come, but which constitute natural changes to the program on their own. Note that in the above examples, both commits should be named after their purpose rather than after their code changes. Thus, the commit where the CSS was extracted should have a message like "Clean up noisy CSS in blah", and it would contain not only the CSS extraction but also fixes to any references to the extracted CSS that are no longer valid. The purpose was to clean up the noisy CSS, which constitutes its own change and could reasonably have been applied to dev or master directly if the need had been there earlier.

Why does commit interdependency matter?

All commits depend on other commits anyway. Why does it matter if I need multiple commits to achieve a 'natural change'?

It's true that all commits, in order to apply correctly, depend on the state of the working tree that they're applying onto. However, there's a bit of a difference between having some changes not apply because the underlying code base is different (and merging them in) and having the changes apply correctly but not implement a complete change (possibly one that doesn't even compile). The former situation is much more honest and easier to deal with - it says "that change simply doesn't make sense without some help". The latter pretends to be a change when really it's only part of one.

Why not just use branches?

You might think your basic atomic unit of natural change could be a branch instead of a commit. After all, this would fix the problem of having to pick up all of the commits that were left lying around - just make them all part of the same branch and never talk about them independently (as you would with diff and cherry-pick). You could do this, but it would result in a bad workflow, as operations that function over multiple commits tend to be noisier and harder to do than those functioning over single commits. Git bisect becomes something that can only be used at merge points - really not at all. Your history will be littered with merges since these now represent "commits". You'll have to filter through most of your commits to find anything that's actually meaningful - the merge commits. When you send things off as patches you won't be able to properly use tools like git format-patch, which creates one patch per commit. People who pull from you will have to understand that this is your stupid workflow and will likely become very impatient with you. Truly meaningful merge commits will be harder to find, especially in a large project where lots of real merges happen. These are the reasons that come to mind within 30 seconds of writing. I'm sure there are plenty more.
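As a small illustration of the format-patch point, here is a sketch assuming a throwaway repository with two commits on top of a base. The file names and commit messages are hypothetical; the thing to notice is just that git format-patch writes one patch file per commit, named after each commit's subject line.

```shell
set -eu

# Throwaway repository; file names and messages are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example"

echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial commit"
base=$(git rev-parse HEAD)

# Two natural changes, one commit each.
echo "fix" >> app.txt
git commit -q -am "Fix bug blah"
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Introduce feature X"

# One patch file per commit since the base, written into ./patches.
git format-patch -q -o patches "$base"..HEAD
patch_count=$(ls patches | wc -l | tr -d ' ')
ls patches
```

With commits shaped like this, each file in patches is something you could reasonably send to someone on its own. If the unit of change were the branch instead, every one of those files would be a fragment that only makes sense alongside its siblings.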

Slash commits

Sometimes a commit really achieves more than one thing and these changes can't be separated out into separate natural changes. For instance, suppose I remove some old code and write some new code to replace it, and in the process put that new code in a separate function. This is technically a refactoring, since I've extracted a function, so the commit message might read that I cleaned something up, which is a natural change to the code base and something that would be reasonable to do in its own right. Yet the new code that I wrote, in addition to being cleaner, implemented a new feature. Cleaning up the old code and then making the new feature in two separate commits is unnecessary. Thus, the commit might be entitled 'feature X/cleaned up Y'.

This is reasonable because there is really only one logical changeset that happens to achieve two objectives. This is fine - good even. Two birds with one stone. What's bad is when you have two separate sets of changes which end up in the same commit. In other words, if you really have a slash-type situation then the code that you're committing is a single change which achieves two objectives, not two changes going into the same commit. The latter situation is bad because those two changes could be used in separate contexts. In the former situation, the commit never could have been used in two separate contexts - the changes existing there inextricably cause both things to happen. It's the difference between eating a sandwich AND drinking a coke versus drinking a smoothie. With the smoothie, it would be impossible to have it for lunch without drinking it. Therefore, if you had two features for your code base - to have lunch and to drink a smoothie - then your message might read, "Implement drink smoothie/eat lunch". Compare with "Implement eat sandwich and drink coke".

You know a commit isn't the same thing as a delta, right?

Yes. And in the above I use them pretty much interchangeably. I do this because typing "apply a commit" is easier than typing "apply the change introduced by a commit". Still, most of the relevant operations discussed above do actually treat commits sort of like diffs, so it doesn't really change the flavor. For example, git diff finds a diff, git rebase applies patches onto a new base, git merge deals with two diffs from a common ancestor, etc. Most of the operations performed on commits treat them like deltas; they just aren't stored that way (which is significant).
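If you want to see the snapshot-versus-delta distinction for yourself, the following sketch pokes at a throwaway repository. The file names (a.txt, b.txt) and commit messages are made up for the example. It shows that a commit object records a whole tree, that an untouched file is simply the same blob in both snapshots, and that the delta is something git computes when asked.

```shell
set -eu

# Throwaway repository; file names and messages are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "Example"

echo "unchanged" > a.txt
echo "v1" > b.txt
git add a.txt b.txt
git commit -q -m "Initial commit"
echo "v2" > b.txt
git commit -q -am "Change b"

# The commit object names a complete tree (a snapshot) and a parent.
# There is no diff stored in it.
first_line=$(git cat-file -p HEAD | head -n 1)
git cat-file -p HEAD

# a.txt was not touched, so both snapshots refer to the very same blob.
old_blob=$(git rev-parse HEAD~1:a.txt)
new_blob=$(git rev-parse HEAD:a.txt)

# The delta is computed on demand by comparing the two snapshots.
changed=$(git diff --name-only HEAD~1 HEAD)
echo "$changed"
```

So the second commit still "contains" a.txt even though the diff between the two commits only mentions b.txt.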