Distributed Version Control Systems (DVCSs) are here to stay, steadily replacing the former generation of centralized version control systems. What are the key concepts to keep in mind as an SCM manager or senior developer to transition from your current system to the new generation? Is it just about the "multi-site" capabilities? Do you need to structure your repositories differently? How is it going to affect your day-to-day operations? Do you need a different mindset?
My goal with this article is to answer these questions and describe how to embrace and get the best out of the new generation of DVCSs.
Not Just About Its Distributed Nature
When we think about DVCSs, the first thing that comes to mind is their ability to work disconnected from a central server. A DVCS is about having a local repository on your machine, where you can check in code, branch, and merge locally. You can then push your changes to another server (a central server, a peer system, or whatever configuration you're working with).
The basic operations in a DVCS are illustrated in Figure 1:
Figure 1: Overview of basic check-in and push in DVCS.
The basic approach is:
- You always work against a local repository
- Check-ins go to the local repository (on your own machine)
- Then you "push" the changes to the remote server
- Or "pull" new changes from the remote server
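The four steps above can be sketched with a toy model. This is only an illustration under simplifying assumptions: the `Repo` class and its methods are hypothetical names of mine, a changeset is reduced to a description string, and merging of diverged histories is ignored entirely.

```python
class Repo:
    """Toy repository: an ordered list of changeset descriptions."""

    def __init__(self, name):
        self.name = name
        self.changesets = []

    def check_in(self, description):
        # A check-in only touches this repository; no other repository is involved.
        self.changesets.append(description)

    def push(self, remote):
        # Send the changesets the remote does not have yet.
        missing = [c for c in self.changesets if c not in remote.changesets]
        remote.changesets.extend(missing)
        return len(missing)

    def pull(self, remote):
        # Fetch the changesets this repository does not have yet.
        missing = [c for c in remote.changesets if c not in self.changesets]
        self.changesets.extend(missing)
        return len(missing)


local = Repo("laptop")
server = Repo("server")

local.check_in("add foo.c and bar.c")
local.check_in("add su.c and do.c")
before_push = list(server.changesets)  # the server has seen nothing so far

pushed = local.push(server)            # now both changesets travel together
```

The point of the sketch is that `check_in` never touches `server`: the remote only learns about the work when `push` is called.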
But, as the title says, DVCS is not just about the "D" in distributed. Distribution is the feature getting the flashy headlines, but the key points behind the paradigm shift are the ones I'm about to describe. The new DVCSs come with several important features, which I'll explain in this article:
- Whole repository versus file revisions
- Consistent and immutable configurations
- Vastly improved merge tracking, in terms of both functionality and performance
- Ability to implement well-known (though commonly feared) branching patterns
A File-by-File World View
I used to imagine repository evolution in terms of files evolving separately. Several well-known version control systems taught us to think in the following way: Files evolve separately, with version numbers on their own, and we had the tools to configure them in different ways…even dynamically.
The standard way of looking at file changes, illustrated in Figure 2, might look familiar: The four featured files have different revisions and we can easily check their individual history, diff their different revisions, and also download them to our working areas, which I'll refer to as workspaces from now on (but which are also known as views, working copies, and all sorts of different names that version control creators use to confuse developers).
Figure 2: Standard way of looking at file changes.
At this point, labels come to the rescue: You can group revisions using a label or tag, which is a means to retrieve the file versions as a group at a later point. So, we can create a "version 1.0" label to mark foo.c rev 1, bar.c rev 2, su.c rev 0, and do.c rev 1, as Figure 3 shows.
Figure 3: Using tags or labels to group multiple file revisions.
This option enables us to take a snapshot of the entire structure from time to time to be able to create solid points that we can recover later on and use as baselines, checkpoints, or whatever we want to call them.
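In data-structure terms, a label in this model is little more than a named mapping from each file to one of its revision numbers. The following is a minimal sketch of that idea, not the storage format of any real tool; `apply_label` and `checkout_label` are hypothetical helper names, and the revision numbers are the ones used for the "version 1.0" example.

```python
# All labels known to the (toy) repository: label name -> {file: revision}.
labels = {}


def apply_label(name, revisions):
    """Record which revision of each file belongs to the label."""
    labels[name] = dict(revisions)  # copy, so later edits do not alter the label


def checkout_label(name):
    """Return the file -> revision configuration recorded under the label."""
    return dict(labels[name])


apply_label("version 1.0", {"foo.c": 1, "bar.c": 2, "su.c": 0, "do.c": 1})

# Later on, the whole group can be recovered by name.
workspace = checkout_label("version 1.0")
```

Note that nothing ties the four revisions together except the label itself, which is why snapshots have to be taken explicitly in a file-by-file system.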
This file-based approach is great from a power developer's point of view because you can do things such as download foo.c rev 0, bar.c rev 3, su.c rev 0, and do.c rev 0 to your workspace and work with them. But this flexibility comes at a cost: What exactly has been downloaded? Did this configuration ever exist as such? Maybe it was never "this way" on any developer workspace, so there's no guarantee the files will work together correctly. Should we even bother with this sort of dynamic configuration then? It is a matter of taste, so I won't be entering this discussion now.
One Revision To Rule Them All
In the DVCS world view, things happen slightly differently. Instead of looking at your history file by file, you get used to checking what happened at a repository level. A key concept is the changeset (or commit), which is not unique to DVCS, but plays a different role. A changeset is a set of files that are checked in together. A check-in is similar to a database transaction: If you want to check in four files together, a new changeset grouping the four files will be created only if the check-in succeeds for all four files. Otherwise, the transaction will be rolled back and no new changeset is created. In other words, a changeset is the result of an atomic check-in.
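The all-or-nothing behavior can be sketched in a few lines. This is a simplified illustration, not how any particular DVCS is implemented: `Repository`, `check_in`, and `CheckinError` are hypothetical names, a `None` content stands in for any per-file failure, and I assume the repository starts with an empty root changeset numbered 0.

```python
class CheckinError(Exception):
    """Raised when any file in a check-in cannot be stored."""


class Repository:
    def __init__(self):
        # Assumption: changeset 0 is an empty root changeset.
        self.changesets = [{}]

    def check_in(self, files):
        """Create one changeset from `files` (path -> content), or none at all."""
        staged = {}
        for path, content in files.items():
            if content is None:  # stand-in for any per-file failure
                raise CheckinError("cannot store " + path)
            staged[path] = content
        # Reached only if every file was staged: publish in a single step.
        self.changesets.append(staged)
        return len(self.changesets) - 1


repo = Repository()
first = repo.check_in({"foo.c": "int foo;", "bar.c": "int bar;"})

try:
    # One bad file is enough to reject the whole check-in.
    repo.check_in({"su.c": "int su;", "do.c": None})
    failed = False
except CheckinError:
    failed = True
```

Because files are staged first and the changeset is appended only at the end, the failed check-in leaves no trace of su.c in the history.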
In DVCS, check-ins become first-level players; in fact, they become the centerpiece of the history of a repository. Let's go to our previous example: You create foo.c and bar.c and check them in. Something similar to what is shown in Figure 4 will happen:
Figure 4: A simple check-in.
Suppose "changeset 0" is empty (the root empty changeset); then we've just created changeset 1, containing the two new files we added.
Let's continue and add su.c and do.c, and also make a change on foo.c (Figure 5).
Figure 5: Adding more files.
Now, make another change to foo.c, move bar.c to foobar/bar.c, and also modify it. Finally, modify do.c, too. Then, in a separate check-in, make another change on bar.c (Figure 6).
Figure 6: Even more changes.
As you can see, we've created exactly the same number of file revisions we did in the "file-by-file history" example discussed earlier, but we're not thinking about file history anymore; rather, we are tracking what happened to our project, to our code base as a whole.
The changeset is playing two roles:
- It captures the change made at each step: In changeset 3, I modified three files and also moved one of them
- It captures a full snapshot of the project at each step, check-in after check-in (and this is one of the key points to keep in mind when moving to DVCS)
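Both roles can be seen in a small sketch where each changeset is stored as a change and the full snapshot is derived by replaying the changes in order. This is only one possible model, written under my own assumptions: the `moved` and `written` keys, the placeholder contents, and the `sudo/` directory for su.c and do.c are illustrative choices, and real systems store this information differently.

```python
def apply_changeset(tree, changeset):
    """Return the snapshot obtained by applying one changeset to `tree`."""
    new_tree = dict(tree)  # never mutate the parent snapshot
    for old_path, new_path in changeset.get("moved", []):
        new_tree[new_path] = new_tree.pop(old_path)
    for path, content in changeset.get("written", {}).items():
        new_tree[path] = content  # covers both added and modified files
    return new_tree


# Role 1: each changeset describes what changed in that step.
changesets = [
    {},  # changeset 0: empty root
    {"written": {"foo.c": "foo v1", "bar.c": "bar v1"}},
    {"written": {"sudo/su.c": "su v1", "sudo/do.c": "do v1", "foo.c": "foo v2"}},
    {"moved": [("bar.c", "foobar/bar.c")],
     "written": {"foo.c": "foo v3", "foobar/bar.c": "bar v2", "sudo/do.c": "do v2"}},
]

# Role 2: each changeset also identifies a full snapshot of the project.
snapshots = []
tree = {}
for changeset in changesets:
    tree = apply_changeset(tree, changeset)
    snapshots.append(tree)
```

Here `snapshots[1]` holds only foo.c and bar.c, while `snapshots[3]` holds the four files with bar.c at its new location.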
What is the purpose of the snapshot? Suppose you want to go back to changeset number 1: If you "download" changeset 1, your workspace will look like Figure 7.
Figure 7: Workspace after downloading changeset 1.
Now, if you go to changeset 3, it will look like Figure 8.
Figure 8: Workspace after downloading changeset 3.
And you can even diff the trees at any point and understand what happened between them. For instance, a diff between cset 3 and 1 will tell us:
- bar.c has been moved to foobar/bar.c
- sudo/su.c and sudo/do.c have been added
- foo.c and bar.c were also modified
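A diff like this one can be computed directly from two snapshots. The sketch below assumes each snapshot maps a stable item id to a `(path, content)` pair, so a file can be recognized after a move even when its content also changed; the ids, the `diff_trees` function, and the placeholder contents are hypothetical, not taken from any specific tool.

```python
def diff_trees(old, new):
    """Compare two snapshots mapping item id -> (path, content)."""
    result = {"added": [], "deleted": [], "moved": [], "modified": []}
    for item_id, (path, content) in new.items():
        if item_id not in old:
            result["added"].append(path)
            continue
        old_path, old_content = old[item_id]
        if old_path != path:
            result["moved"].append((old_path, path))
        if old_content != content:
            result["modified"].append(path)
    for item_id, (path, _content) in old.items():
        if item_id not in new:
            result["deleted"].append(path)
    return result


cset1 = {
    1: ("foo.c", "foo v1"),
    2: ("bar.c", "bar v1"),
}
cset3 = {
    1: ("foo.c", "foo v3"),
    2: ("foobar/bar.c", "bar v2"),
    3: ("sudo/su.c", "su v1"),
    4: ("sudo/do.c", "do v2"),
}

changes = diff_trees(cset1, cset3)
```

Tracking items by id rather than by path is a design choice of this sketch: a purely path-based comparison would report bar.c as deleted and foobar/bar.c as added instead of as a move.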
The key concept to keep in mind is that each changeset actually captures the status of the project at a certain point. When a developer checked in changeset 3, they had a certain configuration, a certain "tree," and this configuration is recorded in the repository and can be easily recovered.