Distributed Loadbuilds

Dividing computationally intensive tasks among multiple machines is a technique that has been around for a long time. Kevin uses Rational's ClearMake distributed build technology to put idle computers to work.


July 01, 2003
URL:http://www.drdobbs.com/architecture-and-design/distributed-loadbuilds/184405385

Jul03: Programmer's Toolchest

Kevin is a software architect for Nortel Networks and can be reached at kwsmith@nortelnetworks.com.




The idea of dividing a computationally intensive task between multiple physical machines has been around for as long as computer networks have existed. For many programmers, SETI@home (http://setiathome.ssl.berkeley.edu/) provided an introduction to the power of network computing. In addition to SETI@home, there are a variety of toolkits available for distributing and load-balancing units of work over a network. For example, Sun provides its Grid Engine (http://www.sun.com/software/gridware/get.html), and there's a GNU toolkit called "GNU Queue" (http://www.gnuqueue.org/home.html). Large software projects are often faced with loadbuild times that approach 24 hours. Is it possible to reduce the time required to build loads by applying concepts from distributed computing?

Many implementations of make, such as GNU make and Rational's ClearMake (which is part of its ClearCase toolset; http://www.rational.com/products/clearcase/), automatically distribute a single loadbuild onto different workstations. In typical software development shops, each software developer has a powerful workstation on his or her desk. Most of these desktop workstations are relatively idle: developers are writing code, reading e-mail, or performing other tasks that consume little CPU or I/O resources. There are also many times during the day (and night) when people are not using their computers. Why not use this spare horsepower to build loads faster? In this article, I describe my experiences in converting a large software system to use Rational's distributed build technology on a cluster of networked Sun Solaris 2.6 workstations; however, these concepts are applicable to any distributed build implementation on any platform. (ClearMake also supports Windows.)

Correctly written makefiles perform build avoidance, which means that most build times are not excessive since files that are not impacted by a code change are not recompiled. Unfortunately, there are situations where build avoidance is not enough. For example, changing low-level header files might force recompiles of all source files in the system. Also, there are times when you want to build cleanly to ensure you have a stable configuration. These situations are ideal candidates for distributed builds.
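The build-avoidance decision can be sketched as a timestamp comparison. This is a minimal illustration in Python rather than make; the function name needs_rebuild and the file names in the usage note are hypothetical, and real make implementations may use richer information than modification times.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # A single newer prerequisite is enough to force a rebuild.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

Calling needs_rebuild("a.o", ["a.c", "common.h"]) on every object file shows why a change to a low-level header defeats build avoidance: if every source file lists that header as a prerequisite, every object file is reported as out of date.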

Makefile Architectures

Large software projects often have several hundred makefiles that are all invoked from one master makefile. There are two common ways to connect these makefiles:

The recursive make architecture creates a unique problem for distributed builds. Each recursive call to make starts up a new round of parallel make processes with no knowledge that there are other parallel make processes running on the same machine. This causes an explosion in the number of build processes running on a given node, which greatly reduces build performance and can cause system crashes. I solved this problem by forcing every recursive makefile to always build serially via ClearMake's special .NOTPARALLEL target. This reduces performance somewhat, but most recursive makefiles do not actually perform any compiles (they just invoke other makefiles), so the performance penalty is not significant.
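The size of this explosion can be estimated with a rough worst-case model. The sketch below is my own simplification, not a description of ClearMake internals: it assumes every make invocation may start the same number of parallel jobs, and that every job at a recursive level is itself another make. The function name and parameters are hypothetical.

```python
def worst_case_processes(jobs, depth, serialize_recursive_levels=False):
    """Upper bound on simultaneous leaf build processes on one node.

    jobs  -- parallel jobs each make invocation may start (assumed equal everywhere)
    depth -- nested make levels, counting the leaf makefile that actually compiles
    """
    if jobs < 1 or depth < 1:
        raise ValueError("jobs and depth must be at least 1")
    if serialize_recursive_levels:
        # Recursive makefiles run one child at a time, so only the
        # leaf makefile fans out into parallel compiles.
        return jobs
    # Each level multiplies the number of running makes by `jobs`.
    return jobs ** depth
```

Under these assumptions, four parallel jobs across three nested levels gives worst_case_processes(4, 3) == 64, while serializing the recursive levels brings the bound back to 4.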

Distributed Build Technology

Before starting a distributed build, you must supply ClearMake with a list of workstations in the build pool and the maximum number of concurrent processes to launch across all machines. ClearMake uses the dependency information in your makefile to select targets that can be built on remote machines; for example, targets that do not depend on one another can be built in parallel. Next, ClearMake selects an idle machine from the pool and starts a remote build process using UNIX's rsh command. By checking the CPU load prior to building, ClearMake automatically performs load balancing across the pool of available workstations. So, when your coworker returns from lunch and fires up a CPU-intensive task on his workstation, ClearMake won't launch additional build processes on his machine until his CPU is idle again. Once all the nodes in the build pool are busy or the maximum number of concurrent processes has been reached, ClearMake waits for a build process to finish or for a machine to become free.
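The scheduling loop described above can be sketched in a few lines. This is an illustrative model only, not ClearMake's actual algorithm: the function names, the dictionary shapes, and the simplification that each host runs one build at a time are all assumptions made for the example.

```python
def ready_targets(deps, done, running):
    """Targets not yet built or in progress whose prerequisites are all built."""
    return sorted(
        t for t, prereqs in deps.items()
        if t not in done and t not in running and all(p in done for p in prereqs)
    )


def assign(ready, idle_pct, threshold, running, max_procs):
    """Pair ready targets with sufficiently idle hosts, honoring the process cap.

    idle_pct -- {host: percent of CPU currently idle}
    running  -- {target: host} for builds already in progress
    """
    busy_hosts = set(running.values())
    free_hosts = [
        h for h in idle_pct
        if h not in busy_hosts and idle_pct[h] >= threshold
    ]
    slots = max(0, max_procs - len(running))
    # Targets left unpaired simply wait for the next scheduling round.
    return dict(list(zip(ready, free_hosts))[:slots])
```

With deps = {"a.o": [], "b.o": [], "app": ["a.o", "b.o"]}, the two object files are ready immediately and "app" becomes ready only after both are done. A host whose idle percentage is below the threshold (the coworker back from lunch) is skipped until a later round.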

The build pool is relatively easy to configure. Each user must be able to run remote processes on each node; on UNIX systems, this is accomplished through a user's .rhosts file or through the system's hosts.equiv file. I used .rhosts in my testing. Each node in the pool must also be able to access the source code and the build directories. When implementing distributed builds on UNIX systems, you must pay careful attention to the interactive and noninteractive portions of your login shell. The remote builds run noninteractively (under rsh), so all environment variables and paths needed to compile must be set up in the noninteractive portion of your login shell. Equally, the noninteractive portion of your login shell must not try to perform interactive operations. For example, one of my coworkers defined an alias for the UNIX rm command that forced rm to always request confirmation before deleting a file. Unfortunately, this alias was defined in the noninteractive section of the shell initialization file, so when a makefile running on a remote workstation attempted to delete (via rm) a temporary file, it waited indefinitely for someone to acknowledge the removal!

The machines in the pool are listed in the .bldhosts file, which is stored in your home directory. A series of command flags can also be given in the .bldhosts file. For example, you can set the CPU-free threshold for each machine using the -idle flag. By default, ClearMake starts with the machines listed at the top of the .bldhosts file; however, you can instruct ClearMake to choose randomly instead by using the -random flag. The #include directive lets you include another file in the .bldhosts file. Example 1 is a typical .bldhost file.
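To make the file layout concrete, here is a small Python sketch that reads text in the format of Example 1. The semantics are inferred from that example alone and are assumptions, not documented ClearMake behavior: an -idle line is taken to apply to the host lines that follow it, -random is treated as a file-wide switch, and #include paths are recorded rather than opened. The function name parse_build_hosts is hypothetical.

```python
def parse_build_hosts(text):
    """Parse build-hosts text into (hosts, options, includes).

    Assumed semantics, inferred from Example 1 only:
    - '#include PATH' names another file to read (recorded, not opened)
    - any other line starting with '#' is a comment
    - '-idle N' applies to the host lines that follow it
    - '-random' is a file-wide switch
    """
    hosts, includes = [], []
    options = {"random": False}
    idle = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#include"):
            includes.append(line.split(None, 1)[1])
        elif line.startswith("#"):
            continue
        elif line == "-random":
            options["random"] = True
        elif line.startswith("-idle"):
            idle = int(line.split()[1])
        else:
            hosts.append((line, idle))
    return hosts, options, includes
```

Fed the text of Example 1, this yields four hosts (frodo and sam with a threshold of 75, gandalf and bilbo with 100), the random switch turned on, and one included path.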

In general, ClearMake build times are greatly influenced by a Rational network caching technology called "winking" (see the accompanying text box titled "ClearMake Advantages"). Table 1 identifies the observed performance improvements after distributed build technology was introduced into our environment with winking disabled and with winking enabled. Each test used a different number of CPUs and processes, and recorded the number of times that ClearMake waited (because all available CPUs were busy) and the total time required for the build. In other tests, I found that using more build processes than CPUs was not productive due to thrashing caused by task switching. Generally, as more CPUs are added to the pool, each CPU contributes less. These diminishing returns are caused by the dependencies in the makefiles: you can't build the entire system in parallel because makefiles contain rules that say one part of the system should be built before another part.
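One standard way to reason about these diminishing returns is Amdahl's law, which is brought in here as a general model and is not derived from the measurements in Table 1. The sketch assumes a build can be split into a fraction that parallelizes perfectly and a remainder that must run serially because of makefile dependencies; the fraction used in the usage note is purely hypothetical.

```python
def amdahl_speedup(parallel_fraction, cpus):
    """Ideal speedup when only part of the build can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0 or cpus < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and cpus >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)


def per_cpu_efficiency(parallel_fraction, cpus):
    """Speedup contributed per CPU; this falls as the pool grows."""
    return amdahl_speedup(parallel_fraction, cpus) / cpus
```

For a hypothetical build in which 80 percent of the work is parallelizable, per_cpu_efficiency(0.8, n) decreases as n goes from 2 to 4 to 8, and the speedup can never exceed 5 no matter how many machines join the pool.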

Distributed Build Disadvantages

Although introducing distributed build technology into your organization improves build performance, distributed builds create some unique problems.

Conclusion

Idle computers in your organization can be used to improve software developer productivity by reducing overall build times. In my environment, I was able to reduce build times by a factor of three. Distributed build technology is only useful when a build performs a large number of compile operations. If a build only performs a few compiles and a few links, then adding more processors will not improve performance.

DDJ



-random
# My Team's Machines
-idle 75
frodo
sam

# Sally's Machines
-idle 100
gandalf
bilbo

# Public Machines:
#include /cc/public.hosts

Example 1: Typical .bldhost file.

Jul03: ClearMake Advantages

ClearMake Advantages

By providing additional features, Rational's ClearMake offers several advantages over other implementations of make. These unique features automate some of the mundane tasks associated with building software. Most sophisticated build environments use a variety of hand-crafted scripts and tools to accomplish these tasks.

ClearMake automatically tracks build dependencies in most situations, thereby reducing the need to identify the header files that a particular source file depends upon. ClearMake does this by noting which files (name and version) are read during a particular build operation and by noting the build script (generally a compile line) used to build the resulting object file. ClearMake stores this information in a special file called a "Configuration Record."

Since the Configuration Record for your top-level target (the program that you are building) describes exactly how your program was built, you can reproduce any past build provided you have the Configuration Record for the build. Obviously, if you store the top-level Configuration Records for any binaries that you ship to your customers, you will always know exactly what was delivered to your customers, and can reproduce it if needed.

ClearMake automatically caches copies of build objects on network servers. If the object that you are about to build exactly matches an object in the network cache, ClearMake imports the object from the cache rather than rebuilding it on your workstation. The Configuration Records for the objects on the servers contain enough information to make a correct decision most of the time. This technology, called "winking," can greatly reduce build times by letting you benefit from build operations that someone else has performed. Unfortunately, winking slows down a build if that build exports a large number of objects to the network servers. For this reason, only certain users should be allowed to populate the network caches; this maximizes your winking benefits while minimizing the costs.
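The idea of reusing an object when its build script and input versions match can be sketched as a keyed cache lookup. This is a conceptual model only: an in-memory dictionary stands in for the network cache, the hash-based fingerprint is my assumption rather than the way ClearCase stores or compares Configuration Records, and the names config_key and build_or_wink are hypothetical.

```python
import hashlib


def config_key(build_script, input_versions):
    """Fingerprint a build step from its script and the exact input versions read."""
    h = hashlib.sha256()
    h.update(build_script.encode())
    # Sort so the key does not depend on the order inputs were recorded.
    for name, version in sorted(input_versions.items()):
        h.update(f"\0{name}@{version}".encode())
    return h.hexdigest()


def build_or_wink(cache, build_script, input_versions, build):
    """Reuse a cached object when the fingerprint matches; otherwise build and cache it.

    Returns (object, reused) where reused is True on a cache hit.
    """
    key = config_key(build_script, input_versions)
    if key in cache:
        return cache[key], True
    obj = build()
    cache[key] = obj
    return obj, False
```

A second call with the same compile line and the same file versions returns the cached object without invoking build, while changing any input version or the compile line produces a different key and forces a rebuild. The sketch also populates the cache on every miss, which is the export cost the text recommends limiting to certain users.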

—K.W.S.


Table 1: Performance improvements after using distributed build technology.
