Kevin is a software architect for Nortel Networks and can be reached at [email protected].
The idea of dividing a computationally intensive task among multiple physical machines has been around for as long as computer networks have existed. For many programmers, SETI@home (http://setiathome.ssl.berkeley.edu/) provided an introduction to the power of network computing. In addition to SETI@home, a variety of toolkits are available for distributing and load-balancing units of work over a network. For example, Sun provides its Grid Engine (http://www.sun.com/software/gridware/get.html), and there's a GNU toolkit called "GNU Queue" (http://www.gnuqueue.org/home.html). Large software projects are often faced with loadbuild times that approach 24 hours. Is it possible to reduce the time required to build loads by applying concepts from distributed computing?
Many implementations of make, such as GNU make and Rational's ClearMake (part of its ClearCase toolset; http://www.rational.com/products/clearcase/), can automatically distribute a single loadbuild across different workstations. In typical software development shops, each developer has a powerful workstation on his or her desk, and most of these desktop workstations are relatively idle: developers are writing code, reading e-mail, or performing other tasks that consume little CPU or I/O. There are also many times during the day (and night) when people are not using their computers at all. Why not use this spare horsepower to build loads faster? In this article, I describe my experiences in converting a large software system to use Rational's distributed build technology on a cluster of networked Sun Solaris 2.6 workstations; however, these concepts are applicable to any distributed build implementation on any platform. (ClearMake also supports Windows.)
Correctly written makefiles perform build avoidance, which means that most build times are not excessive since files that are not impacted by a code change are not recompiled. Unfortunately, there are situations where build avoidance is not enough. For example, changing low-level header files might force recompiles of all source files in the system. Also, there are times when you want to build cleanly to ensure you have a stable configuration. These situations are ideal candidates for distributed builds.
Large software projects often have several hundred makefiles that are all invoked from one master makefile. There are two common ways to connect these makefiles:
- Recursive make. With this architecture, your project is divided into a series of directories containing subdirectories. Each directory has a makefile that builds the contents of the directory, and calls the makefile contained in each subdirectory. The traditional disadvantage of this approach is that you cannot represent dependencies between targets in different directories since the build dependencies are described in physically separate makefiles.
- Inclusive make. With this approach, your project is still divided into a series of directories containing subdirectories; however, most directories have an incomplete makefile that describes only the rules needed to build the software contained in that directory. If a directory has subdirectories, the makefiles in those directories are pulled into the parent directory's makefile via make's built-in include command. Since all the rules are present in one (very large) composite makefile, it is possible to represent dependencies between different directories.
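The two architectures can be sketched with minimal makefiles (the directory and file names here are invented for illustration):

```make
# Recursive make: the parent delegates to each subdirectory's own
# makefile. A dependency that crosses directories cannot be stated.
SUBDIRS = libfoo app
all:
	for d in $(SUBDIRS); do $(MAKE) -C $$d all; done

# Inclusive make: the top-level makefile includes each directory's
# rule fragment, so cross-directory dependencies can be stated:
#   include libfoo/rules.mk
#   include app/rules.mk
#   app/main.o: libfoo/libfoo.a    <- spans two directories
```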
The recursive make architecture creates a unique problem for distributed builds. Each recursive call to make starts up a new round of parallel make processes with no knowledge that there are other parallel make processes running on the same machine. This causes an explosion in the number of build processes running on a given node, which greatly reduces build performance and can cause system crashes. I solved this problem by forcing every recursive makefile to always build serially via ClearMake's special .NOTPARALLEL target. This reduces performance somewhat, but most recursive makefiles do not actually perform any compiles (they just invoke other makefiles), so the performance penalty is not significant.
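The fix amounts to one line at the top of every recursive makefile; a minimal sketch:

```make
# Force this makefile's own targets to build serially. The sub-makes
# it launches can still be distributed; what this prevents is a second,
# uncoordinated round of parallel processes on the same node.
.NOTPARALLEL:
```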
Distributed Build Technology
Before starting a distributed build, you must supply ClearMake with a list of workstations in the build pool and the maximum number of concurrent processes to launch across all machines. ClearMake uses the dependency information in your makefile to select targets that can be built on remote machines; for example, targets that do not depend on one another can be built in parallel. Next, ClearMake selects an idle machine from the pool and starts a remote build process using UNIX's rsh command. By checking the CPU load prior to building, ClearMake automatically performs load balancing across the pool of available workstations. So, when your coworker returns from lunch and fires up a CPU-intensive task on his workstation, ClearMake won't launch additional build processes on his machine until his CPU is idle again. Once all the nodes in the build pool are busy or the maximum number of concurrent processes has been reached, ClearMake waits for a build process to finish or for a machine to become free.
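ClearMake's selection logic is internal, but the per-target decision it makes can be approximated in a few lines of shell. This is a sketch only: the `pick_host` function, the idle threshold, and the host:load pairs are invented for illustration (ClearMake actually samples CPU load itself before dispatching over rsh).

```shell
# Pick the host whose load is below the threshold and lowest overall,
# roughly what ClearMake does before launching a remote build process.
pick_host() {
  threshold=$1; shift
  best=""; best_load=9999
  for pair in "$@"; do               # each pair is host:load-percent
    host=${pair%%:*}; load=${pair##*:}
    if [ "$load" -lt "$threshold" ] && [ "$load" -lt "$best_load" ]; then
      best=$host; best_load=$load
    fi
  done
  echo "$best"                       # empty if every host is busy
}

pick_host 50 "build1:80" "build2:10" "build3:30"   # prints build2
```

When `pick_host` returns nothing, every node is above the threshold; that is the point at which ClearMake simply waits for a build process to finish.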
The build pool is relatively easy to configure. Each user must be able to run remote processes on each node; on UNIX systems, this is accomplished through a user's .rhosts file or through the system's hosts.equiv file. (I used .rhosts in my testing.) Obviously, each node in the pool must also be able to access the source code and the build directories. When implementing distributed builds on UNIX systems, you must pay careful attention to the interactive and noninteractive portions of your login shell. The remote builds run noninteractively (under rsh), so all environment variables and paths needed to compile must be set up in the noninteractive portion of your login shell. By the same token, the noninteractive portion of your login shell must not attempt interactive operations. For example, one of my coworkers defined an alias for the UNIX rm command that forced rm to always request confirmation before deleting a file. Unfortunately, this alias was defined in the noninteractive section of his shell initialization file, so when a makefile running on a remote workstation attempted to delete (via rm) a temporary file, it waited indefinitely for someone to acknowledge the removal!
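In a POSIX-style shell initialization file, the split looks roughly like this (the compiler path is an assumption for illustration; csh-family shells use a different but analogous guard):

```shell
# Noninteractive section: everything a remote (rsh) build needs.
PATH="/opt/SUNWspro/bin:$PATH"   # illustrative compiler location
export PATH

# Interactive-only section: $- contains "i" only in interactive
# shells, so rsh-spawned shells skip this block entirely.
case $- in
  *i*) alias rm='rm -i' ;;       # safe here: never runs under rsh
esac
```

Had my coworker's rm alias been inside such a guard, the remote build would never have seen it.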
The machines in the pool are listed in the .bldhosts file, which is stored in your home directory. A series of command flags can also be given in the .bldhosts file. For example, you can set the CPU-idle threshold for each machine using the -idle command. By default, ClearMake works through the machines in the order they are listed in the .bldhosts file; however, you can instruct ClearMake to choose randomly instead by using the -random command. The #include command lets you include another file into the .bldhosts file. Example 1 is a typical .bldhosts file.
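A .bldhosts file along the lines of Example 1 might look like the following. The hostnames are invented, and the exact placement of options varies; consult the ClearMake documentation for the precise syntax.

```
-idle 30
-random
buildhost1
buildhost2
buildhost3
#include /usr/local/etc/site_bldhosts
```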
In general, ClearMake build times are greatly influenced by a Rational network caching technology called "winking" (see the accompanying text box entitled "ClearMake Advantages"). Table 1 identifies the observed performance improvements after distributed build technology was introduced into our environment, both with winking disabled and with winking enabled. Each test used a different number of CPUs and processes, and recorded the number of times that ClearMake waited (because all available CPUs were busy) and the total time required for the build. In other tests, I found that using more build processes than CPUs was counterproductive due to thrashing caused by task switching. Generally, as more CPUs are added to the pool, each CPU contributes less. These diminishing returns are caused by the dependencies in the makefiles: you can't build the entire system in parallel because makefiles contain rules that say one part of the system must be built before another.
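These diminishing returns follow the familiar Amdahl's-law shape. This bound is not taken from the article's measurements; it is the standard result: if a fraction p of the build can run in parallel and the rest is serialized by makefile dependencies, the best possible speedup on n CPUs is

```latex
% Amdahl's law: upper bound on build speedup with n CPUs when a
% fraction p of the work can be parallelized.
S(n) = \frac{1}{(1 - p) + p/n}
```

So even if 90 percent of a build parallelizes perfectly (p = 0.9), ten CPUs yield at most about a 5.3x speedup, not 10x.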
Distributed Build Disadvantages
Although introducing distributed build technology into your organization improves build performance, distributed builds create some unique problems.
- If your network suffers from performance issues, then your builds also suffer. For example, a slow network causes slow builds, and unreliable networks can cause builds to fail. Because the network adds several new points of failure into your build environment, building in a distributed environment will always be less reliable than building on a single workstation.
- You will invest time to convert your existing makefiles. In particular, any errors in your makefiles will become more apparent after distributed builds are introduced. For example, if your makefiles are missing any important dependencies, your distributed builds will fail.
- On UNIX systems it is possible to create files with a "#" character in the filename. Because "#" starts a comment line in many UNIX shell scripting languages, many scripts cannot successfully read files containing a "#" character in the file name. In particular, Rational's distributed build scripts cannot build files whose names contain "#" characters. I solved this problem by renaming all of our files with "#" characters in their names.
- Software licensing should be considered. If you have purchased a fixed number of compiler licenses, distributed builds will cause your license consumption to increase. Before starting this type of project, you should ensure that software licensing is not a concern in your environment and purchase additional licenses if needed.
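Returning to the "#" problem above: the offending files can be located and renamed with a short script. This is a sketch; the replacement character '_' is an arbitrary choice.

```shell
# Rename every file or directory whose name contains '#'.
# find -depth emits children before their parents, so renaming a
# directory never invalidates paths still waiting in the pipeline.
rename_hashes() {
  find "$1" -depth -name '*#*' | while IFS= read -r f; do
    dir=$(dirname "$f")
    base=$(basename "$f")
    mv "$f" "$dir/$(printf '%s' "$base" | tr '#' '_')"
  done
}
```

For example, `rename_hashes src` cleans the entire tree under src. (Filenames containing newlines would defeat the read loop, but such names are rarer still than "#".)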
Idle computers in your organization can be used to improve software developer productivity by reducing overall build times. In my environment, I was able to reduce build times by a factor of three. Distributed build technology is only useful when a build performs a large number of compile operations. If a build only performs a few compiles and a few links, then adding more processors will not improve performance.