Distcc & Distributed Computing

Daniel shows how distcc and its distributed compilation capabilities can significantly reduce compilation times, while Trevor Marshall tells why Gentoo Linux is a programmer's Linux.


February 01, 2004
URL:http://www.drdobbs.com/distcc-distributed-computing/184401764

February 04:


Distcc is one of the more useful C/C++ tools to come along in a long time. With distcc, you can use a cluster of machines to compile a single GCC/g++ source-code tree, thereby dramatically reducing compilation times. The speed improvement you realize depends on the number of machines you have on your LAN that are available to donate their resources. Two identical machines, for instance, can typically compile about 1.8 times as fast as one machine alone, and four machines will typically be able to compile about 3.5 times as fast as a single machine.
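To put those figures in perspective, it can help to express them as a percentage of ideal linear scaling. The short shell sketch below does that arithmetic with awk. The efficiency function is a hypothetical helper introduced here for illustration; it is not part of distcc.

```shell
# efficiency SPEEDUP MACHINES
# Prints the speedup as a percentage of perfect linear scaling.
efficiency() {
    awk -v s="$1" -v n="$2" 'BEGIN { printf "%.1f\n", s / n * 100 }'
}

efficiency 1.8 2   # two machines at 1.8x
efficiency 3.5 4   # four machines at 3.5x
```

For the numbers quoted above, this works out to roughly 90 percent and 87.5 percent of ideal, which suggests that the overhead grows only slowly as machines are added.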

Originally developed by Martin Pool for the Samba project on Linux (http://distcc.samba.org/), distcc is now supported by FreeBSD, NetBSD, Darwin, Solaris, HP-UX, IRIX, Cygwin, and BSD/OS. Because most people compile Gentoo from source code, there are large numbers of Gentoo users who use distcc to speed up compilation; hence, Gentoo's strong support for distcc. (For more on Gentoo, see the accompanying sidebar entitled "Gentoo Linux.")

To use distcc, you must have:

That's about it. Distcc does not require identical hardware, synchronized system clocks, identical header files on every machine, identical libraries on every machine, kernel patches, or modifications to GCC or make binaries.

Here's how distcc works. First, you need to install distcc on each machine on your LAN that will participate in the distributed compilation. On the machines offering CPU resources to others, you need to run distccd. These machines are the "compile servers."

To use distcc, you need to choose one machine to compile on — that is, the "client." On this machine, you use one of several methods to get your makefiles to call distcc instead of gcc or g++. A machine can be configured to be a client, a compile server, or both.

Once setup is completed, you can compile sources on the client, and distcc intercepts the compiler calls and distributes the work across all the compile servers. The result? Your program compiles much faster, you save a lot of time, and you're happier at the end of the day.

Inside Distcc

On the surface, the theory behind its operation sounds simple. But if you're familiar with the internal workings of C and C++ compilers, distcc raises some interesting questions:

Distcc is able to do all this by doing source-code preprocessing on the client machine. It then sends the preprocessed source — along with all the gcc/g++ command-line options — to the remote machines. On the remote machines, the preprocessed source is compiled into object code, which is then sent back to the client.

Distcc handles linking by doing all of it locally on the client. It recognizes calls to gcc/g++ that are intended to link object code, and performs these linking steps on the client machine. In theory, keeping preprocessing and linking on the client would seem to make distcc less efficient, but in practice it does not make much difference. Linking can't really benefit from being distributed across the network, and preprocessing is generally rather fast. Most of gcc/g++'s CPU time is spent converting preprocessed source code to object code, which is the very work that distcc is able to distribute across the compile servers.
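The split between local and remote work can be pictured with a small shell sketch. This is a deliberately simplified illustration, not distcc's actual logic: it assumes that a call containing -c (compile only) is a candidate for a compile server, and that everything else, including preprocess-only (-E) and link calls, stays on the client. The classify_call function is a hypothetical name used only here.

```shell
# classify_call COMPILER ARGS...
# Simplified stand-in for distcc's decision: prints "remote" for a
# compile-only call (-c present, no -E), and "local" otherwise.
classify_call() {
    mode=local
    for arg in "$@"; do
        case "$arg" in
            -c) mode=remote ;;
            -E) echo local; return 0 ;;   # preprocessing stays on the client
        esac
    done
    echo "$mode"
}

classify_call gcc -O2 -c foo.c -o foo.o   # a compile step
classify_call gcc foo.o bar.o -o app      # a link step
```

The real program examines many more options before deciding, so treat this only as a mental model of why compile steps travel and link steps do not.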

You can execute multiple jobs simultaneously by calling make with the jobserver (-j) command-line option. With -j, most makefiles can be told to execute multiple jobs simultaneously. For example, -j4 tells make to keep four jobs running at all times. When four compilations are running at the same time, there are several jobs available to distribute to the compile servers.

Installation

Installation is fairly straightforward. Once you've downloaded the distcc sources (http://distcc.samba.org/), extract, configure, compile, and install them by performing the following steps:

cat /path/to/distcc-x.y.tar.bz2 | bzip2 -dc | tar xvf -
cd distcc-x.y
./configure --prefix=/usr
make
make install

Distcc and distccd are then installed. If a machine is going to run as a compile server, start distccd (it detaches from your terminal and runs in the background) by typing:

distccd

If your machine is a client, there are three ways to configure the system so that the /usr/bin/distcc executable intercepts compiler calls. Here, I perform the initial setup for the gcc/g++ masquerading option so that it's available later. You only need to set up masquerading on the client machine(s), not the compile servers.

Masquerading

To use masquerading, you first create a directory that contains symbolic links that have the names of the compilers on your system and the distcc program as the link target. Later, you can use this masquerading technique to intercept gcc/g++ calls by inserting your new /usr/lib/distcc/bin directory at the beginning of the shell's executable search path. This stealthily redirects all calls to distcc instead.

Masquerading is set up by performing these configuration steps:

install -d /usr/lib/distcc/bin
cd /usr/lib/distcc/bin
ln -s /usr/bin/distcc gcc
ln -s /usr/bin/distcc cc
ln -s /usr/bin/distcc g++ 
ln -s /usr/bin/distcc c++
ln -s /usr/bin/distcc i486-pc-linux-gnu-gcc
ln -s /usr/bin/distcc i486-pc-linux-gnu-c++
ln -s /usr/bin/distcc i486-pc-linux-gnu-g++

You'll want to replace the i486-pc-linux-gnu with the appropriate host string that matches your installed version of GCC. To see which you should use, type gcc -v and look at the path displayed in the first line of output.
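The same setup can be scripted so that the host string is filled in automatically. The sketch below is one way to do it, under a few assumptions: make_masquerade_dir is a hypothetical helper, the links are created in a scratch directory rather than /usr/lib/distcc/bin so that it can be tried without root access, and the host string is taken from gcc -dumpmachine (falling back to the example value used above if GCC is not installed).

```shell
# make_masquerade_dir DIR TARGET NAME...
# Creates DIR and, inside it, one symbolic link per NAME pointing at TARGET.
make_masquerade_dir() {
    dir=$1; target=$2; shift 2
    mkdir -p "$dir" || return 1
    for name in "$@"; do
        ln -sf "$target" "$dir/$name"
    done
}

# Ask GCC for its host string; fall back to the article's example value.
host=$(gcc -dumpmachine 2>/dev/null) || host=i486-pc-linux-gnu
[ -n "$host" ] || host=i486-pc-linux-gnu

# Scratch directory used as a stand-in for /usr/lib/distcc/bin.
scratch=$(mktemp -d)
make_masquerade_dir "$scratch" /usr/bin/distcc \
    gcc cc g++ c++ "$host-gcc" "$host-c++" "$host-g++"
ls -l "$scratch"
```

Once the listing looks right, the same call can be pointed at /usr/lib/distcc/bin as root.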

Compilation

At this point, you are almost ready to compile something. First, you need to tell distcc the names of the compile servers you'd like it to use. To do this, create a file called /etc/distcc/hosts that stores the information. In it, list all the hostnames or IP addresses of the compile servers. Hostnames should be separated by whitespace. You can use the name "localhost" to refer to the client machine. No distccd daemon needs to be running on the client to refer to "localhost" in /etc/distcc/hosts. To set up the /etc/distcc/hosts file, first create the /etc/distcc directory:

install -d /etc/distcc

Then create the /etc/distcc/hosts file using your text editor and add something like this to it:

localhost
eagle
falcon
emu

which tells distcc to use the local machine first, then distribute any additional jobs to the machines named eagle, falcon, and emu in the listed order. You may want to remove localhost from /etc/distcc/hosts, and set something like this instead:

eagle
falcon
emu

which causes all compilation to happen remotely, thus freeing your client's CPU for preprocessing and linking. Depending on your hardware and network configuration — as well as the number of compile servers you have set up — you may find that this approach works better.
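If you want to switch between the two configurations without retyping the list, the remote-only version can be derived from the full one. The following is a minimal sketch that works on a scratch copy rather than on /etc/distcc/hosts itself; the host names are the example names used above.

```shell
# Scratch file standing in for /etc/distcc/hosts, one host per line.
hosts_file=$(mktemp)
printf '%s\n' localhost eagle falcon emu > "$hosts_file"

# Keep every line except the one that is exactly "localhost".
remote_only=$(grep -vx localhost "$hosts_file")
echo "$remote_only"
```

Redirecting that output into the real hosts file gives the second configuration.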

Next, you need to tweak the local PATH setting so that make finds your masqueraded symbolic links that point to distcc. To do this under bash, type:

export PATH="/usr/lib/distcc/bin:${PATH}"
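Before starting a long build, it is worth confirming that the shell really does resolve gcc to the masquerade directory. The sketch below demonstrates the check with a scratch directory and a stand-in script, both of which are assumptions made so that the example runs on a machine without distcc; on a configured client you would simply run the last command and expect to see /usr/lib/distcc/bin/gcc.

```shell
# Scratch directory and fake "gcc" standing in for the masquerade setup.
scratch=$(mktemp -d)
printf '#!/bin/sh\necho "stand-in for distcc"\n' > "$scratch/gcc"
chmod +x "$scratch/gcc"

# Same technique as above: put the directory first in the search path.
export PATH="$scratch:${PATH}"

# Shows which gcc the shell (and therefore make) will run.
command -v gcc
```

If the path printed is not inside the masquerade directory, some other entry in PATH is taking precedence.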

Now you're ready to compile. Just enter your favorite source tree and type:

make -j5

You'll want to tweak the number after -j to suit the number of machines participating in your compile farm. It's usually optimal to use a -j number that's slightly higher than the number of compile servers you are using.
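As a starting point, the -j value can be computed from the hosts list. The sketch below counts the entries in a hosts file and adds one; suggest_jobs is a hypothetical helper, and "one more than the number of hosts" is only an assumed reading of "slightly higher," so adjust it after watching a real build.

```shell
# suggest_jobs HOSTS_FILE
# Prints the number of whitespace-separated entries in the file, plus one.
suggest_jobs() {
    n=$(wc -w < "$1")
    echo $((n + 1))
}

# Scratch file standing in for /etc/distcc/hosts.
hosts_file=$(mktemp)
printf 'localhost eagle falcon emu\n' > "$hosts_file"

echo "make -j$(suggest_jobs "$hosts_file")"
```

With the four example hosts, this suggests the same -j5 used above.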

While your sources are being compiled, log in to the compile servers and monitor their system load. You should notice an increased load on these boxes as they assist your client box.

Distcc Extras

If GNOME is installed on your client machine, then it's likely that a GNOME distcc monitor was compiled and installed along with distcc and distccd. To run it, type:

distccmon-gnome

You should see a GNOME-based distcc monitor that looks something like Figure 1. By using distccmon-gnome, you can see how much time is spent for each step of the build process on all the machines that are being used for compilation. The information from distccmon-gnome is useful for configuring distcc to perform optimally. For example, if you notice that a disproportionate amount of time is being spent on preprocessing, then you may want to remove "localhost" from DISTCC_HOSTS (or from /etc/distcc/hosts). This way, the client can be devoted to preprocessing and linking, and compilation can be left for the compile servers.

If you don't have GNOME available, you can start the text-based version of distccmon by typing:

distccmon-text 

followed by the refresh interval in seconds:

distccmon-text 1

Other Distcc Use Strategies

Besides using the masquerading method, there are also a couple of other methods that can be employed to get a source tree to use distcc. They're generally not as effective as masquerading, but may be appropriate for some situations.

The first alternate method is to prefix the name of the compiler that is being used with "distcc". This can typically be done as follows:

make CC="distcc gcc" -j5

The second alternate method is to call distcc as the compiler itself. This can be done as follows:

make CC="distcc" -j5

When called this way, distcc looks for cc in the binary search path and uses it for compilation.

For more information on the various options available for distcc and distccd, go to http://distcc.samba.org/ and read the distcc and distccd man pages. In the distcc man page, you can learn how to further refine your DISTCC_HOSTS environment variable for enhanced performance. The distccd man page has a number of security and connection options (such as ssh-based connections).

Distcc In the Real World

It's encouraging to see the positive response that distcc has received. For one, distcc has been integrated into Apple's Xcode developer tools. This lets multiple Apple machines with Xcode use distcc.

In addition, Gentoo Linux (the free software project I lead) has extensive support for distcc. For information on how to use distcc under Gentoo, go to http://www.gentoo.org/doc/en/distcc.xml/. Thanks to the efforts of Lisa Seelye (our resident distcc guru) as well as others, you can expect Gentoo's support for distcc to continue to expand. For example, the current Gentoo Linux installation CDs for the PowerPC can also be used to set up boot-from-CD compile servers.

Conclusion

If you're interested in accelerating compilation even further, take a look at Andrew Tridgell's ccache program (http://ccache.samba.org/). This compiler tool keeps a local cache of all recently compiled sources, which lets you do things like perform a "make clean" in a source tree and still be able to recompile it very quickly. Distcc and ccache also happen to be quite a dynamic duo when used together.
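One common way to combine the two is to let ccache run first and hand any cache misses to distcc. A minimal sketch, assuming both tools are installed and that your ccache version supports the CCACHE_PREFIX setting described in its documentation:

```shell
# Tell ccache to prefix the real compiler command with distcc, so that
# only sources missing from the cache are sent to the compile servers.
export CCACHE_PREFIX="distcc"

# Then build through ccache, for example:
#   make CC="ccache gcc" -j5
```

The make line is left as a comment because the right CC value and -j number depend on your source tree and compile farm.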

Daniel Robbins is the Chief Architect of Gentoo Linux and leader of the Gentoo free software project (http://www.gentoo.org/). He can be contacted at [email protected].



Figure 1: Monitoring a kernel compile process that has been distributed to a very fast AMD64/NForce3 workstation.

Gentoo Linux

By Trevor Marshall

It has been 10 years since Patrick Volkerding announced the Slackware 1.00 Linux distribution (http://www.slackware.org/announce/1.0.php). The succeeding Slackware releases served me well, but for a few years Slackware fell a little behind in the GUI race and the lure of a well-integrated KDE forced me to take a look at Red Hat and SuSE.

When I first booted the SuSE installation CD I became aware of a major problem—new distributions were "optimized" for the most recent CPUs. My aging Toshiba Libretto mini-laptop has a 133-MHz Pentium MMX, but even when I cross-compiled the SuSE kernel specifically for PENTIUM-MMX, utilities and compilers would not run unless they, too, were recompiled. (I ran into the same problem when upgrading my DSL firewall/server, which even today runs a 450-MHz Celeron Mendocino.)

After weeks of chasing my tail, I decided to take the plunge into a source-code-based distribution—one where I had source code for everything, and where I could select the level of optimization needed for each of my machines. I found this in Gentoo Linux (http://www.gentoo.org/), which in some ways is a distribution created primarily for programmers who optimize and reoptimize their code.

In addition to being source based, Gentoo has Portage, a package-handling system inspired by the package tools available within BSD. Whenever an update is required for a component (for example, if a security glitch is found in Apache), you don't need to manually integrate the patches. Two Gentoo commands—emerge sync and emerge -u apache—download a new package source tree, compile it, and bring your system fully up to date.

The first command (emerge sync) merges in a new sync tree, effectively downloading a list of the latest versions of each package in the Gentoo distribution. The second command (emerge -u apache) then downloads and installs the latest version of Apache, together with all the libraries and applications it needs to function.

Portage automatically handles package interdependencies. For instance, it used to take hours to compile a new version of AirSnort or Kismet into SuSE. Their dependence on each other, on Ethereal, and on PCAP usually created a versioning nightmare. But installing the component from the latest Gentoo net-wireless package (http://www.gentoo-portage.com/browseportage.php?category=66/) ensures all dependencies are satisfied, and makes installation and maintenance a breeze.

Portage has a -p switch (the "pretend" function) that lets you test the functions performed by an emerge command. For example, emerge -pu apache lists all the tasks needed to upgrade apache, so that you can manually assure yourself that Portage is going to do all the right things.

Installing Gentoo

When installing Gentoo, you first must keep the target processor in mind—x86, PPC, Sparc, Alpha, AMD64, MIPS, ARM, or IA64. Different targets have different levels of support. I have only used the x86 and PPC versions. Although both were complete and reliable, in this article I'll focus on the x86 installation process.

Gentoo has a database dedicated to cataloging the most recent versions of each available module for each available architecture (http://packages.gentoo.org/). The database is updated daily with a list of all upgraded packages and is worth watching—not so much to keep track of updates, but to look at new applications and features as they are integrated into Gentoo.

Gentoo initially boots from a 96-MB CD (called the "basic LiveCD"), then brings down source for each application package from Internet mirrors. LiveCD is an excellent mini-distribution on its own, and can be used for system maintenance and debugging, performing many of the system disk recovery tasks for which the Knoppix distribution (http://www.knopper.net/knoppix/index-old-en.html) has proven useful. Both LiveCD and Knoppix seem to be able to boot on just about any x86 system, and "magically" recognize all the installed hardware and peripherals. The ISO image for the LiveCD can be downloaded from any of the mirrors using the path gentoo/releases/x86/1.4/livecd/basic/.

Step-by-step instructions (http://www.gentoo.org/doc/en/gentoo-x86-install.xml) make the installation straightforward, although you do have to use your brain cells a little. Gentoo installation puts you in control. There is no master installation program. The Portage packages you select are downloaded from the Internet, compiled, and installed onto your target hard disk. You get to do all the configuration, and you get the gratification when your kernel first boots.

One of the first tasks is to select a kernel version. There's an array of choices (http://packages.gentoo.org/packages/?category=sys-kernel), but "gentoo-sources" is a good place for starting your first system. Then, you need to select how many of the really fundamental utilities you want in place at that first boot. There are three tarballs to choose from. The smallest contains only the kernel and the most basic tools, while tarball level III contains a good assortment of system utilities.

It can take 24 hours to compile the larger applications (such as X, KDE, or GNOME), so the most recent distribution (1.4) has been streamlined with the addition of a number of precompiled executables. The basic LiveCD has therefore swollen to fill a complete disk. A second CD has been added, with precompiled versions of KDE, GNOME, OpenOffice, and Mozilla. Downloading or purchasing the two-CD set saves time.

In addition to GNOME and KDE, most of the compact X11 window managers (http://packages.gentoo.org/packages/?category=x11-wm/) are available for installation. My Libretto laptops only have 32 MB of RAM, which makes trying to run GNOME or KDE frustrating. Consequently, I installed IceWM (http://www.icewm.org/) and ended up with enough remaining physical memory to comfortably run either Mozilla-Firebird (http://packages.gentoo.org/packages/?category=net-www;name=mozilla-firebird/) or Kismet (http://www.kismetwireless.net/).

Once you have created your installation, you can easily maintain it with the Portage utilities. Still, the initial Gentoo setup does exercise your programming abilities, since just about every system parameter is configurable. Not only does the normal Linux kernel configuration require CPU variables to be set manually, but because you compile everything locally, you also have to remember to change the default CPU definitions in /etc/make.conf. The CFLAGS variable passes the default host CPU type to the C/C++ compilers during Linux system and application builds. A list of switches for all the supported processor types, from i386 to SPARC, is available from freehackers.org (http://www.freehackers.org/gentoo/gccflags/flag_gcc3.html).
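As a rough illustration, the relevant lines of /etc/make.conf for a Pentium MMX machine such as the Libretto might look like the fragment below. The specific values are assumptions chosen for this example rather than settings taken from a real installation; -march=pentium-mmx is GCC's name for that CPU type, and the optimization level is a matter of taste.

```shell
# /etc/make.conf fragment (illustrative values only)
CHOST="i586-pc-linux-gnu"              # host string for a Pentium-class x86 system
CFLAGS="-march=pentium-mmx -O2 -pipe"  # CPU type and optimization passed to gcc
CXXFLAGS="${CFLAGS}"                   # use the same flags for g++
```

Check the flag list for your own processor before copying any of these values.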

Gentoo.org has made available a Linux Installation LiveCD (http://store.gentoo.org/index.php?cat=21&action=browse/) optimized for processors such as:

In addition, a PowerPC G3/G4 LiveCD that boots directly into Linux, and which includes KDE, GNOME, and the OpenOffice and KOffice applications, is available. Consequently, there is no longer any need to struggle with CFLAGS during the initial boot, or to cross-compile your Linux executable on another system. Just select a CD with the level of processor optimization you want for any particular target machine, and then install Gentoo with confidence.

Gentoo's BSD influence also shows in the choice of GRUB as the boot loader. If you are more familiar with LILO, you should execute emerge -u lilo, then edit /etc/lilo.conf in the usual way. Being a source-code-based distribution, Gentoo has solid native C and C++ support. Since you rely on these native compilers to make your Gentoo kernel work properly, they need to be complete, with up-to-date revisions of the native libraries (libc6/glibc2).

Gentoo also has excellent support for cross-platform software development. It provides a cc-config wrapper that is able to call any of the available cc compilers, including the cross-compilers for x86, PPC, Sparc, Alpha, MIPS, ARM, and AMD64 targets. Similarly, the gcc-config wrapper changes the active GCC compiler. GCC additionally can produce code for HPPA and IA64 target processors. There are a number of special GCC builds for the Sparc and MIPS kernels, as well as a hardened GCC with transparent and semitransparent -pie -fPIC -fstack-protector support.

Among Gentoo's specialty C/C++ compilers are: tcc (http://fabrice.bellard.free.fr/tcc/), a small (100K) x86 C compiler; ccc, Compaq's enhanced C compiler for the Alpha platform; and icc, Intel's Pentium-optimized C++ compiler for Linux. uclibc is a C library for developing embedded Linux systems. Gentoo also has standard packages for the C++ implementation of the Atlas protocol (used in role playing games at Worldforge) and SIP (http://www.riverbankcomputing.co.uk/sip/), a tool for generating bindings for C++ classes so that they can be used by Python.

Conclusion

For the most part, programmers don't need a Linux distribution with the ability to simultaneously deploy 200 copies across an enterprise. What we need is a distribution where bugs are fixed quickly, and where new tools can be smoothly integrated. Gentoo has proven itself just such a programmer's distribution.


Trevor Marshall is an engineering consultant, specializing in RF and hardware design and Linux internals. He can be contacted at http://www.trevormarshall.com/.
