Distributed Software Development Explained

Distributed Software Development introduces a huge new world of possibilities on top of well-known SCM best practices


July 03, 2008
URL:http://www.drdobbs.com/parallel/distributed-software-development-explain/208802468

Pablo is a software engineer at Codice Software. He can be reached at [email protected]


You need a team in San Diego and a team in Beijing working on the same code base. You're already looking into the different source-code management (SCM) possibilities to cover scenarios where network connections are not always reliable or fast enough to ensure that your developers aren't stuck waiting for changes to be submitted or retrieved. And, even more important, you're looking for the right way to work in such a scenario. If that's your case, let's see how true Distributed Software Development (DSD) can help.

In a previous article I introduced the basics of DSD, highlighting the differences between central and multi-site software development. I then examined the distributed path from proxy based to true distributed systems.

So at this point, your question is: "Okay, how can I set it up so my teams can seamlessly work even when they're worlds apart?"

While there are several alternatives, in this article I focus on the multi-site family as one of the possible solutions and develop it to reach full distributed scenarios. Granted, what I call "full replication" is almost the same as distributed on the technical side, because the two modes support concurrent changes to happen in different locations. From a process point of view, however, I limit full replication to server-based distributed development while distributed only focuses on scenarios where individual developers are running their own SCM system and pushing and fetching changes back and forth with peers (see Figure 1).

Figure 1:

The Multi-site Deployment

Our requirements are clear: We have to allow our two distant teams to work together on a single code base. VPN-based solutions are not an option due to the possible network unreliability. Hence, Figure 2 depicts a typical solution.

Figure 2: Multi-site development

Site A and Site B each have a server, which lets their respective teams make modifications to the code base. The servers have to be synchronized periodically. The frequency of synchronization varies, depending on the development methodology. Simply put: When single-branch development is chosen, synchronization will be more frequent. However, if heavily branched strategies are set up, synchronization won't happen that often.

The details of setting up a multi-site SCM environment vary from product to product. Once the design is clear, it can be a matter of a few minutes or a really tough process requiring experienced personnel. Fortunately, new solutions greatly simplify the setup and overall replication process compared to older ones.

Once the servers are in place, and considering the original code base is at Site A, a replica has to be created at Site B (Figure 3).

Figure 3: Multi-site development at branch level.

Ownership-based Multi-site

Some systems (ClearCase, for instance) only support ownership-based multi-site. What does this mean? It basically means that concurrent changes can't happen on the same branch at different locations. To prevent this, only one site at a time can be the owner of the branch, and only the owner site can let its users perform changes.

This is an important constraint, but it still allows a powerful multi-site setup. Figure 4 shows how the ownership (or turn) is passed from one server to another, and then changes can happen at the owner site. In Figure 4, the blue squares are changesets (or atomic commits grouping different files and folders checked in together, depending on your SCM slang). Notice that the active and disabled branches are depicted in different colors, depending on when each site has the right to allow its users to modify them.

Figure 4: Ownership-based multi-site

If only one site is able to write on the same branch at a time, how can teams work in parallel?

The answer is straightforward, although not always valid: ownership can be based on the time zone. Suppose you have a team in Sydney and another in London. When the London team goes to sleep, their colleagues in Sydney will be starting their working day. In this scenario, it is simple to switch masterships so that the office that is awake can perform changes.
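The "follow the sun" mastership idea can be sketched in a few lines. This is a toy model, not any product's actual scheduling mechanism: the handover hours and site names are illustrative assumptions.

```python
from datetime import datetime, timezone

# Illustrative handover schedule: London owns the branch during its
# working day, Sydney owns it the rest of the time (hours in UTC).
LONDON_OWNS_UTC = range(8, 18)   # 08:00-17:59 UTC -> London's turn

def branch_owner(now_utc: datetime) -> str:
    """Return which site may write to the branch at this moment."""
    return "London" if now_utc.hour in LONDON_OWNS_UTC else "Sydney"

print(branch_owner(datetime(2008, 7, 3, 10, 0, tzinfo=timezone.utc)))  # London
print(branch_owner(datetime(2008, 7, 3, 22, 0, tzinfo=timezone.utc)))  # Sydney
```

A real setup would, of course, transfer mastership explicitly as part of the synchronization run rather than deriving it from the clock alone.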

Unfortunately for most distributed teams (and fortunately in terms of communication), their working hours overlap, so this solution isn't valid when changes have to be performed simultaneously on the same branch at two sites.

Distributed Change at an Item Level

In ownership-based multi-site setups, an item (file or directory) can't be changed when the site doesn't own the branch. Okay, but why?

Figure 5 shows the evolution of file foo.c at two remote sites.

  1. Step 1 shows how the file has been replicated and exists on both sites, on the same branch.
  2. In step 2, two different developers working at the different sites decide to make modifications to the foo.c file.

They both start at the same revision, revision number 2, and make a different number of changes to fix different bugs.

The developer at Site A creates revisions 3 and 4, while the one at Site B creates another revision. (Note that revision 3 at Site A is not the same as revision 3 at Site B. To keep track of these situations, distributed systems normally identify revisions by some sort of globally unique identifier.)

Figure 5: Ownership-based multi-site. File level study.

If only one developer at a time performed a change, the replication system would just copy the new revision from the source into the destination, and no conflict would happen. But in the depicted situation, if the new revision from Site B were directly copied into the main branch at Site A, the changes made after revision 2 would be lost, and the same would happen if the replication were run from Site A to Site B.

To prevent this situation, mastership or ownership-based replication systems don't let users make simultaneous changes on the same revisions on the same branches.
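The situation in Figure 5 can be detected mechanically: copying is only safe when the destination's branch history is a prefix of the source's (a "fast-forward"). A minimal sketch, under the assumption that a branch history is just an ordered list of revision IDs (illustrative, not any product's actual algorithm):

```python
def is_fast_forward(source: list[str], destination: list[str]) -> bool:
    """Safe to copy: the destination history is a prefix of the source's."""
    return source[:len(destination)] == destination

# Both sites start from the replicated history [r1, r2] on the same branch.
site_a = ["r1", "r2", "r3a", "r4a"]   # Site A added revisions 3 and 4
site_b = ["r1", "r2", "r3b"]          # Site B added its own revision 3

# Neither history is a prefix of the other: blindly copying either way
# would overwrite the other site's changes made after r2.
print(is_fast_forward(site_a, site_b))  # False
print(is_fast_forward(site_b, site_a))  # False

# Had Site B stayed at r2, Site A's revisions could be copied directly.
print(is_fast_forward(site_a, ["r1", "r2"]))  # True
```

Mastership-based systems sidestep the two `False` cases by forbidding one of the sites from writing at all; true distributed systems accept them and merge instead.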

True distributed systems -- those supporting both the full replication and distributed development categories -- implement ways to handle these conflicts. Hence, they support a broader set of development alternatives.

Concurrent Change on Distributed Systems

Let's continue examining an item-based example to understand how distributed systems can manage concurrent changes at different locations and their later reconciliation or merge.

The process I describe primarily focuses on Plastic from Codice Software because, as one of its developers, I'm most familiar with its concepts. But it is important to note that other SCMs supporting concurrent distributed changes implement very similar techniques. They can vary on the exact terminology or strategy, but basically share the same principles.

Figure 6 illustrates how Site B in the previous example can handle a replication coming from Site A when both locations have modified the same file on the same branch.

Figure 6: Concurrent change on DSD.

Revisions 3 and 4 from Site A are pushed into Site B, but instead of being directly plugged into the main branch under revision 2 (which is the parent of revision 3), they're located on a new branch. If revision 3 on Site B didn't exist, the two revisions would be directly plugged in after revision 2, and no fetch branch would be created.

Although the pushed revisions have been re-branched (located at a different branch than their original one), they still preserve their original history as they are linked with their corresponding parent (revision 2 in this case).
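The re-branching decision can be sketched as follows. This is a simplified model under the assumption that a repository is a dict of branch name to revision list; real systems track far more metadata, and the `-fetch` naming is an illustrative convention, not any tool's.

```python
def place_incoming(repo: dict[str, list[str]], branch: str,
                   incoming: list[str], common_parent: str) -> str:
    """Plug pushed revisions into `repo`, re-branching on divergence."""
    local = repo[branch]
    if local[-1] == common_parent:
        # No local changes after the common parent: plug in directly.
        local.extend(incoming)
        return branch
    # The branch moved on locally: keep the incoming revisions on a
    # fetch branch, still parented at `common_parent` so history survives.
    fetch_branch = f"{branch}-fetch"
    repo[fetch_branch] = [common_parent] + incoming
    return fetch_branch

site_b = {"main": ["r1", "r2", "r3b"]}   # Site B changed r2 locally
target = place_incoming(site_b, "main", ["r3a", "r4a"], common_parent="r2")
print(target)                 # main-fetch: the revisions were re-branched
print(site_b["main-fetch"])   # ['r2', 'r3a', 'r4a'] -- parent link preserved
```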

Distributed systems must also correctly preserve the merge tracking information to guarantee replicated content can be correctly merged.

Figure 7: Ownership-based multi-site. File level study, merging.

Once the revisions from Site A have been replicated into Site B and placed on their new fetch branch, a regular merge can happen between revision 4, coming from Site A, and revision 3 at Site B. Because revision parentage and merge tracking are preserved, a three-way merge with correct common-ancestor calculation can happen, ensuring the merge between the local and replicated revisions is correct.

Basically, it can be stated that by using some sort of fetch branch (or changeset, depending on the system), the distributed branch-and-merge problem is reduced to a local one, already supported by a number of systems.

The last step, once the merge has been done, is replicating from Site B back into Site A to get the changes from B into A. Figure 8 details the process.

Figure 8: Ownership-based multi-site. File level study, merging.

Note how the merge link that was created on Site B is now replicated into A and correctly placed, linking the right revisions, to ensure that further merges and the project's evolution stay correct. Revision 5 at Site A is exactly the same as revision 4 at Site B. The histories of the two sites (repositories) are equivalent although not identical.

If mastership-based replication is used, then it can be ensured that the replicated repositories are identical. When concurrent changes are permitted, repositories will end up being equivalent -- although not exact -- copies.

The techniques to manage revision relinking after replication vary from system to system. Plastic and Mercurial, for instance, track revision history, so revision 3 on branch main in one repository is not the same as revision 3 on the same branch in another replicated repository, as the previous examples showed. Systems like Git that don't preserve the exact revision history (it can be mutated, although the right contents are preserved) don't care about this renumbering and only identify revisions by their internal globally unique identifiers ("hashes" in the case of Git), which are correct, but harder to read.
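The hash-based alternative can be sketched in a few lines: make a revision's identifier a digest of its content plus its parent's identifier. Identical histories then hash identically on every replica, while independently created "revision 3"s never collide. This is only the spirit of Git's scheme, not its actual object format:

```python
import hashlib

def revision_id(content: bytes, parent_id: str) -> str:
    """Content-addressed identifier, in the spirit of Git's hashes."""
    return hashlib.sha1(parent_id.encode() + content).hexdigest()

root = revision_id(b"int x;", "")

# The same change replicated to another site hashes identically...
same_change = revision_id(b"long x;", root) == revision_id(b"long x;", root)
print(same_change)  # True

# ...while different concurrent changes get different identifiers.
differs = revision_id(b"long x;", root) != revision_id(b"short x;", root)
print(differs)  # True
```

The trade-off the article mentions is visible here: such identifiers are globally unambiguous but are 40-character hex strings rather than friendly sequential numbers.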

Concurrent Multi-site Development at Branch Level

Let's zoom out again to the branch level and take a look at how branches evolve in the scenarios described so far. When focusing on project evolution, it's easier to understand what happened from a branch perspective than from a file or directory one.

Figure 9: Concurrent multi-site development at branch level.

Figure 9 shows how the two repositories at the two locations evolve in parallel. Changes can be performed at the same time at the two locations, even on the same branch.

From time to time (when a file or directory is modified at the same time on the two locations) a fetch branch is created to solve the merging conflicts.

With concurrent multi-site support or full multi-site, it is easier to implement almost any branching and organizational pattern, especially continuous integration ones. Mastership-based replication makes things a bit more complicated due to the restrictions it imposes.

Distributed Branch Per Task

It's no secret that branch per task is my favorite branching pattern. It enables real parallel development, provides great isolation and added services to developers (like committing changes really often without ever breaking the build), and it ensures the mainline is always clean and stable.

Of course, branch per task is not everyone's favorite, but I bet it will be. Take a look at widely used systems like Subversion which, among others, is responsible for training a huge number of developers in the SCM basics. Since Subversion has taken hold at universities, almost every new software engineer around the world knows how to use it.

But Subversion used to have big trouble with branching (particularly merging), and, in my opinion, that's one of the main reasons why a big number of developers don't like branching at all. ("Subversion can't do it well, so I don't like it.")

But branching patterns, including branch per task, introduce a number of opportunities to push development to the next level. Real parallel development is only possible through proper branching schemes (or "streams", if you prefer a fancier name), so being afraid of branching is not good for software development. Of course, I've traced the roots of the problem to Subversion, but probably CVS is the original culprit.

Fortunately, things are changing. With Subversion 1.5, branching and merging have been greatly improved, and I suspect that in a few years we'll see a move towards branching just because of Subversion. In fact, a number of recent online tutorials and webinars have introduced the basics of an initial branch-per-task support with Subversion.

Branches are not always a way to split development or to fork a codeline. There's much more about branching than that.

In branch-per-task, you associate each issue in your preferred bug, issue, or project management system with a branch. Yes, branches become the greatest and most powerful change containers ever.

Simply put: A new task, a new branch. So simple. Of course you need a tool which lets you easily create branches, track their history, and even evolution (and here we enter the field of streams, although I still prefer to keep the same names and call them just "smart branches").

There are other association mechanisms for changes like changelists (also implemented by Subversion 1.5 and present for years in award-winning tools such as Perforce).

The limits of changelists are clear: They normally live only on the client side, and they can only contain one revision of each item (file or directory). And what's worse, they're not independent: You modify a file to fix a bug and associate it with a changelist, which is in turn associated with a task. Then you jump to the next task and create a new changelist starting at the revisions you've just created. Okay, it's better than nothing, but aren't branches better? Yes, they're better because branch-per-task supports the concept of stable baselines: Whatever change you make, you always start from the last stable baseline instead of from your latest changes. Consider the following example:

Suppose task 10101 is a dangerous change in the data layer and task 10102 is a simple button-color change. If you implement 10101 first, you'll be touching some critical code. Then you implement 10102 in the next changelist, on the same branch (trunk, maybe?). Yes, you run all your tests, but as far as I know no test suite is perfect, and then you're pressed to release a new version. You and your team are confident about the button-color change, but not so sure about the dangerous change in the data layer. Yes, it is passing the tests, but can you afford to skip it until it has been tested more internally? You'd like to, but because you already implemented 10102 after it, both tasks are linked! You'd have to get rid of the changes of 10101, probably running some sort of subtractive merge, and it will be easy unless 10101 is associated with more than one checkin, with other unrelated checkins done in the meantime... if that happens, it will take much longer to get rid of the changes.

Now think about branches: 10101 is in a branch (what about "rocket10101" as its name?) and 10102 is in another branch. You need to release the new version, so you just decide to merge 10102 and the other risk-free tasks into the trunk (or main branch). Easy, clean, and traceability-friendly.
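The release decision in this example can be sketched as simply picking which task branches to merge. This is a toy model (branches as dicts of file contents, a flag marking the risky task), not any tool's actual API:

```python
# Each finished task lives on its own branch; a release is just the
# set of task branches the team decides to merge into main.
finished_tasks = {
    "rocket10101": {"risky": True,  "files": {"datalayer.c": "v2"}},
    "task10102":   {"risky": False, "files": {"button.css": "blue"}},
}

def build_release(main: dict[str, str], tasks: dict,
                  include_risky: bool) -> dict[str, str]:
    """Merge the selected task branches into a copy of main."""
    release = dict(main)
    for name, task in tasks.items():
        if task["risky"] and not include_risky:
            continue  # the risky change simply stays on its own branch
        release.update(task["files"])
    return release

main = {"datalayer.c": "v1", "button.css": "grey"}
release = build_release(main, finished_tasks, include_risky=False)
print(release)  # {'datalayer.c': 'v1', 'button.css': 'blue'}
```

Because 10101 never touched the mainline, excluding it requires no subtractive merge at all: its branch is just left out of this release and merged in the next one.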

Okay, now that you all see that branch per task is the way to go, let's move to the distributed scenario. How would you handle branch-per-task in a DSD environment?

Well, it will be even easier than single-branch distributed development, and it could even be supported to some extent by mastership-based replication systems (yes, unfortunately all the proxy-based systems are out of the game when true replication comes into play).

Figure 10 shows how a stable release has been created at Site A and replicated to Site B.

Then developers at the two sites start creating branches in parallel and making changes. Because the changes are isolated in branches, mastership-based replication systems can still play the game. Of course, the purest scenario won't always happen, and even if developers isolate tasks in branches, they can be working together on the same task branch at more than one site, which is a very good idea.

Figure 10: Distributed branch per task.

At a certain point in time, a new release has to be created. The release building team, located at Site A in this example, will then fetch the finished tasks (branches) from Site B and create a new release. (There are several possibilities here: the two sites could be running continuous integration at their respective locations, even combined with branch per task, which is a very good practice merging the best of the two worlds; there could be more than one build team at different sites; and so on. But the basics stay the same.)

Once the release is fully tested, it will be replicated back to Site B, and a new iteration (or sprint, if you are familiar with Scrum and agile methods) will start.

Figure 11: Distributed branch per task.

Conclusion

Distributed Software Development introduces a huge new world of possibilities on top of well-known SCM best practices. The ability to easily run development in parallel at different sites, together with an understanding of the underlying concepts, will greatly help improve the existing workflows of multi-site teams and companies, opening new doors to more productive ways of working.
