Distcc & Distributed Computing


February 04:

Gentoo Linux

Distcc is one of the more useful C/C++ tools to come around in a long time. With distcc, you can use a cluster of machines to compile a single GCC/g++ source-code tree, thereby dramatically reducing compilation times. The speed improvement you realize depends on the number of machines you have on your LAN that are available to donate their resources. Two identical machines, for instance, can typically compile about 1.8 times as fast as one machine alone, and four machines will typically be able to compile about 3.5 times faster than a single machine.
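To put those figures in perspective, here is a small sketch that divides each quoted speedup by the number of machines to get a rough per-machine efficiency. The speedups are the ones cited above; the helper name `efficiency` is made up for this illustration and is not part of distcc.

```shell
# Parallel efficiency = speedup / number of machines, as a percentage.
# (efficiency is a hypothetical helper name used only in this sketch.)
efficiency() {
    LC_ALL=C awk -v s="$1" -v n="$2" 'BEGIN { printf "%.1f%%\n", 100 * s / n }'
}

two_machines=$(efficiency 1.8 2)
four_machines=$(efficiency 3.5 4)
echo "2 machines: $two_machines of ideal"
echo "4 machines: $four_machines of ideal"
```

In other words, each added machine contributes a little less than a full machine's worth of work, which is what you would expect once preprocessing, linking, and network transfer are accounted for.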

Originally developed by Martin Pool for the Samba project on Linux (http://distcc.samba.org/), distcc is now supported by FreeBSD, NetBSD, Darwin, Solaris, HP-UX, IRIX, Cygwin, and BSD/OS. Because most people compile Gentoo from source code, many Gentoo users rely on distcc to speed up compilation; hence Gentoo's strong support for distcc. (For more on Gentoo, see the accompanying sidebar entitled "Gentoo Linux.")

To use distcc, you must have:

  • Two or more machines with the same operating system (Linux, FreeBSD, or the like) and architecture (x86, PowerPC, and so on), each with GCC installed.
  • The same minor version of GCC on every machine; for example, all participating machines could have GCC 3.2.x. That said, I still recommend installing exactly the same build of the GNU Compiler Collection on each machine. That way, there is no possibility of things being compiled differently between machines.
  • A LAN connecting the machines, which should sit behind a firewall to prevent intruders from tampering with the distccd (the distcc compile daemon) port.

That's about it. Distcc does not require identical hardware, synchronized system clocks, identical header files on every machine, identical libraries on every machine, kernel patches, or modifications to GCC or make binaries.
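The GCC version requirement is easy to check by hand, but a sketch like the following can help when you have more than a couple of machines. It compares only the major.minor part of two version strings. The sample versions and the helper name `major_minor` are assumptions for illustration; on real machines you would collect the values with `gcc -dumpversion` (and `gcc -dumpmachine` for the architecture).

```shell
# Reduce a GCC version string such as "3.2.3" to its major.minor part ("3.2").
# (major_minor is a hypothetical helper for this sketch.)
major_minor() {
    echo "$1" | cut -d. -f1,2
}

# Sample values standing in for what each machine reports; on a real
# machine you would gather them with: gcc -dumpversion
client_version="3.2.3"
server_version="3.2.1"

if [ "$(major_minor "$client_version")" = "$(major_minor "$server_version")" ]; then
    verdict="compatible"
else
    verdict="mismatch"
fi
echo "client $client_version vs. server $server_version: $verdict"
```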

Here's how distcc works. First, you need to install distcc on each machine on your LAN that will participate in the distributed compilation. On the machines offering CPU resources to others, you need to run distccd. These machines are the "compile servers."

To use distcc, you need to choose one machine to compile on — that is, the "client." On this machine, you use one of several methods to get your makefiles to call distcc instead of gcc or g++. A machine can be configured to be a client, a compile server, or both.

Once setup is completed, you can compile sources on the client, and distcc intercepts the compiler calls and distributes the work across all the compile servers. The result? Your program compiles much faster, you save a lot of time, and you're happier at the end of the day.

Inside Distcc

On the surface, the theory behind distcc's operation sounds simple. But if you're familiar with the internal workings of C and C++ compilers, distcc raises some interesting questions:

  • How exactly does distcc work when the client and the various compile servers have different sets of C/C++ header files?
  • How does distcc manage to link object code when not all libraries may be available on all compile servers?
  • How do you get make to execute several things simultaneously?

Distcc is able to do all this by doing source-code preprocessing on the client machine. It then sends the preprocessed source — along with all the gcc/g++ command-line options — to the remote machines. On the remote machines, the preprocessed source is compiled into object code, which is then sent back to the client.

Distcc handles linking by doing all of it locally on the client, so libraries need to be present only there. Distcc recognizes calls to gcc/g++ that are intended to link object code, and performs these linking steps on the client machine. In theory, this would seem to make distcc less efficient, but in practice it does not make much difference. Linking can't really benefit from being distributed across the network, and preprocessing is generally rather fast. Most of gcc/g++'s CPU time is spent converting preprocessed source code to object code, which is the very work that distcc is able to distribute across the compile servers.
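The split between the three stages can be reproduced by hand with plain gcc. The sketch below is not how distcc is implemented; it only runs the same stages one at a time on a single machine, using a throwaway hello.c in a scratch directory (both assumptions of this example), to show which step is the one that gets shipped out.

```shell
# Work in a scratch directory so nothing in the current tree is touched.
workdir=$(mktemp -d)
cat > "$workdir/hello.c" <<'EOF'
#include <stdio.h>
int main(void) { puts("hello"); return 0; }
EOF

if command -v gcc >/dev/null 2>&1; then
    # 1. Preprocess (client): headers are expanded here, so servers need none.
    gcc -E "$workdir/hello.c" -o "$workdir/hello.i"
    # 2. Compile preprocessed source to object code (the step distcc farms out).
    gcc -c "$workdir/hello.i" -o "$workdir/hello.o"
    # 3. Link (client): libraries are needed only on this machine.
    gcc "$workdir/hello.o" -o "$workdir/hello"
    stages_run=yes
else
    echo "gcc not found; skipping the demonstration"
    stages_run=no
fi
```

The intermediate hello.i file is self-contained, which is why the machine that compiles it does not need a matching set of headers.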

You can execute multiple jobs simultaneously by calling make with the -j (jobs) command-line option. With -j, most makefiles can be told to run several jobs in parallel. For example, -j4 tells make to keep up to four jobs running at once. When four compilations are running at the same time, there are several jobs available to distribute to the compile servers.

Installation

Installation is fairly straightforward. Once you've downloaded the distcc sources (http://distcc.samba.org/), extract, configure, compile, and install them by performing the following steps:

<b>cat /path/to/distcc-x.y.tar.bz2 | bzip2 -dc | tar xvf -
cd distcc-x.y
./configure --prefix=/usr
make
make install</b>

Distcc and distccd are then installed. If a machine is going to run as a compile server, start distccd (it detaches from your terminal and runs in the background) by typing:

<b>distccd</b>
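Before moving on, a quick check that both programs ended up somewhere in your PATH can save some debugging later. This is only a sketch; it reports what it finds and changes nothing.

```shell
# Report whether each distcc program is reachable through PATH.
found_count=0
for tool in distcc distccd; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: $(command -v "$tool")"
        found_count=$((found_count + 1))
    else
        echo "$tool: not found in PATH"
    fi
done
```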

If your machine is a client, there are three ways to configure the system so that the /usr/bin/distcc executable intercepts compiler calls. Here, I perform the initial setup for the gcc/g++ masquerading option so that it's available later. You only need to set up masquerading on the client machine(s), not the compile servers.

Masquerading

To use masquerading, you first create a directory that contains symbolic links that have the names of the compilers on your system and the distcc program as the link target. Later, you can use this masquerading technique to intercept gcc/g++ calls by inserting your new /usr/lib/distcc/bin directory at the beginning of the shell's executable search path. This stealthily redirects all calls to distcc instead.

Masquerading is set up by performing these configuration steps:

<b>install -d /usr/lib/distcc/bin
cd /usr/lib/distcc/bin
ln -s /usr/bin/distcc gcc
ln -s /usr/bin/distcc cc
ln -s /usr/bin/distcc g++ 
ln -s /usr/bin/distcc c++
ln -s /usr/bin/distcc i486-pc-linux-gnu-gcc
ln -s /usr/bin/distcc i486-pc-linux-gnu-c++
ln -s /usr/bin/distcc i486-pc-linux-gnu-g++</b>

You'll want to replace the i486-pc-linux-gnu with the appropriate host string that matches your installed version of GCC. To see which you should use, type gcc -v and look at the path displayed in the first line of output.
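If you would rather not type the links one by one, the same setup can be sketched as a loop. Two things here are assumptions of the example, not part of the article's steps: the LINK_DIR variable (hypothetical, defaulting to a scratch directory so the sketch can be tried without root), and the use of `gcc -dumpmachine` to obtain the host string, which normally matches what `gcc -v` reports but is worth comparing on your system.

```shell
# Directory for the links. The article uses /usr/lib/distcc/bin (needs root);
# this sketch defaults to a scratch directory so it can be tried safely.
# LINK_DIR is a hypothetical override, not something distcc reads.
link_dir=${LINK_DIR:-$(mktemp -d)}
distcc_bin=/usr/bin/distcc

# Ask GCC for its host string; fall back to the article's example if absent.
if command -v gcc >/dev/null 2>&1; then
    host=$(gcc -dumpmachine)
else
    host=i486-pc-linux-gnu
fi

mkdir -p "$link_dir"
for name in gcc cc g++ c++ "$host-gcc" "$host-c++" "$host-g++"; do
    ln -sf "$distcc_bin" "$link_dir/$name"
done
ls -l "$link_dir"
```

Running it with `LINK_DIR=/usr/lib/distcc/bin` as root would produce the same set of links as the commands above.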

Compilation

At this point, you are almost ready to compile something. First, you need to tell distcc the names of the compile servers you'd like it to use. To do this, create a file called /etc/distcc/hosts that stores the information. In it, list all the hostnames or IP addresses of the compile servers, separated by whitespace. You can use the name "localhost" to refer to the client machine. No distccd daemon needs to be running on the client to refer to "localhost" in /etc/distcc/hosts. To set up the /etc/distcc/hosts file, first create the /etc/distcc directory:

<b>install -d /etc/distcc</b>

Then create the /etc/distcc/hosts file using your text editor and add something like this to it:

<b>localhost
eagle
falcon
emu</b>

which tells distcc to use the local machine first, then distribute any additional jobs to the machines named eagle, falcon, and emu in the listed order. You may want to remove localhost from /etc/distcc/hosts, and set something like this instead:

<b>eagle
falcon
emu</b>

which causes all compilation to happen remotely, thus freeing your client's CPU for preprocessing and linking. Depending on your hardware and network configuration — as well as the number of compile servers you have set up — you may find that this approach works better.

Next, you need to tweak the local PATH setting so that make finds your masqueraded symbolic links that point to distcc. To do this under bash, type:

<b>export PATH="/usr/lib/distcc/bin:${PATH}"</b>
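If you put that line in a shell startup file, it is worth guarding against prepending the directory twice. The following sketch does that and then shows which gcc the shell would pick up; the variable name `masq_dir` is just for illustration.

```shell
masq_dir=/usr/lib/distcc/bin

# Prepend only if the directory is not already at the front of PATH.
case "$PATH" in
    "$masq_dir":*) ;;                          # already first; nothing to do
    *) PATH="$masq_dir:$PATH"; export PATH ;;
esac

# Show which gcc the shell would run now (empty if none is found).
echo "gcc resolves to: $(command -v gcc)"
```

If the masquerade links are in place, the reported path should sit inside /usr/lib/distcc/bin.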

Now you're ready to compile. Just enter your favorite source tree and type:

<b>make -j5</b>

You'll want to tweak the number after -j to suit the number of machines participating in your compile farm. It's usually optimal to use a -j number that's slightly higher than the number of compile servers you are using.
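As a starting point, the -j value can be derived from the host list. The sketch below counts the entries in a sample list and adds one, following the "slightly higher" rule of thumb. It writes the sample to a scratch file (point `hosts_file` at /etc/distcc/hosts to use the real one) and assumes the file holds only plain host names, with no comments or per-host options.

```shell
# Sample host list in a scratch file; set hosts_file=/etc/distcc/hosts
# to use the real one instead.
hosts_file=$(mktemp)
printf 'eagle\nfalcon\nemu\n' > "$hosts_file"

# Count whitespace-separated entries, then add one as a starting point.
host_count=$(wc -w < "$hosts_file" | tr -d ' ')
job_count=$((host_count + 1))
echo "make -j$job_count"
```

Treat the result as a first guess and adjust it after watching how busy the servers actually get.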

While your sources are being compiled, log in to the compile servers and monitor their system load. You should notice an increased load on these boxes as they assist your client box.

Distcc Extras

If GNOME is installed on your client machine, then it's likely that a GNOME distcc monitor was compiled and installed along with distcc and distccd. To run it, type:

<b>distccmon-gnome</b>

You should see a GNOME-based distcc monitor that looks something like Figure 1. By using distccmon-gnome, you can see how much time is spent on each step of the build process on all the machines that are being used for compilation. The information from distccmon-gnome is useful for configuring distcc to perform optimally. For example, if you notice that a disproportionate amount of time is being spent on preprocessing, then you may want to remove "localhost" from your host list (/etc/distcc/hosts or the DISTCC_HOSTS environment variable). This way, the client can be devoted to preprocessing and linking, and compilation can be left to the compile servers.

If you don't have GNOME available, you can start the text-based version of distccmon by typing:

<b>distccmon-text</b>

followed by the refresh interval in seconds:

<b>distccmon-text 1</b>

Other Distcc Use Strategies

Besides using the masquerading method, there are also a couple of other methods that can be employed to get a source tree to use distcc. They're generally not as effective as masquerading, but may be appropriate for some situations.

The first alternate method is to prefix the name of the compiler that is being used with "distcc". This can typically be done as follows:

<b>make CC="distcc gcc" -j5</b>

The second alternate method is to call distcc as the compiler itself. This can be done as follows:

<b>make CC="distcc" -j5</b>

When called this way, distcc looks for cc in the binary search path and uses it for compilation.
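Both alternatives rely on make letting a command-line assignment override the makefile's own compiler variable. The sketch below demonstrates that with a throwaway makefile whose only target prints the variables. Note that CXX, the conventional make variable for the C++ compiler, is an addition of this example; the commands above set only CC.

```shell
# Throwaway makefile whose only target prints the compiler variables.
demo_mk=$(mktemp)
printf 'show:\n\t@echo "CC=$(CC) CXX=$(CXX)"\n' > "$demo_mk"

if command -v make >/dev/null 2>&1; then
    # Command-line assignments override the makefile's defaults.
    shown=$(make -s -f "$demo_mk" show CC="distcc gcc" CXX="distcc g++")
else
    shown="make not found"
fi
echo "$shown"
```

This works only for makefiles that actually use $(CC) and $(CXX); a source tree that hard-codes gcc ignores the override, which is one reason masquerading tends to be more effective.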

For more information on the various options available for distcc and distccd, go to http://distcc.samba.org/ and read the distcc and distccd man pages. In the distcc man page, you can learn how to further refine your DISTCC_HOSTS environment variable for enhanced performance. The distccd man page has a number of security and connection options (such as ssh-based connections).

Distcc In the Real World

It's encouraging to see the positive response that distcc has received. For one, distcc has been integrated into Apple's Xcode developer tools. This lets multiple Apple machines with Xcode use distcc.

In addition, Gentoo Linux (the free software project I lead) has extensive support for distcc. For information on how to use distcc under Gentoo, go to http://www.gentoo.org/doc/en/distcc.xml/. Thanks to the efforts of Lisa Seelye (our resident distcc guru) as well as others, you can expect Gentoo's support for distcc to continue to expand. For example, the current Gentoo Linux installation CDs for the PowerPC can also be used to set up boot-from-CD compile servers.

Conclusion

If you're interested in accelerating compilation even further, take a look at Andrew Tridgell's ccache program (http://ccache.samba.org/). This compiler tool keeps a local cache of all recently compiled sources, which lets you do things like perform a "make clean" in a source tree and still be able to recompile it very quickly. Distcc and ccache also happen to be quite a dynamic duo when used together.
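One way to pair the two, sketched here under the assumption that your ccache version supports the CCACHE_PREFIX setting (check its man page), is to have ccache hand cache misses to distcc. The build command is only printed, not run, since running it needs ccache, distcc, and a source tree.

```shell
# Ask ccache to run the real compiler through distcc on a cache miss.
# CCACHE_PREFIX is a ccache setting; confirm it in your version's man page.
CCACHE_PREFIX="distcc"
export CCACHE_PREFIX

# With that in place, a build would be started along these lines
# (printed rather than executed in this sketch):
build_cmd='make CC="ccache gcc" -j5'
echo "$build_cmd"
```

With this arrangement, files already in the cache never touch the network, and only genuine recompilations are distributed.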

Daniel Robbins is the Chief Architect of Gentoo Linux and leader of the Gentoo free software project (http://www.gentoo.org/). He can be contacted at drobbins@gentoo.org.


