
Managing Cluster Computers


Dr. Dobb's Journal, July 2000

Carlo lectures in computer architecture at Monash University (Australia). He can be reached at [email protected] or http://www.csse.monash.edu.au/~carlo/.


One of the most interesting side effects of the commodification of computer hardware has been the emergence of supercomputing clusters. Until recently, commodity processors have lacked the performance to be useful for large-scale computing applications. The pressures of software bloat in consumer operating systems and tools have had the beneficial side effect of changing this.

To put this into perspective, the best starting point is the traditional approach to supercomputing -- the venerable Crays, the lesser Convexes, and the now-obsolete i860 "Cray-on-a-chip." All of these machines relied heavily on vector processing hardware to achieve high performance. Vector processing hardware is designed to operate on arrays of data (vectors) rather than on single operands. Highly pipelined hardware makes this happen quickly, in concert with large amounts of fast memory and I/O hardware.

If the problem you are trying to solve involves a hefty amount of vector processing (or matrix bashing), then vector processing hardware fits the bill. If you have the budget to afford a Cray (or a slice of one), then your problem can be solved in a reasonable amount of machine time.

Alas, not every problem can be manipulated into a form that exploits vector processing, and if your number-crunching problem falls into this category, buying a million-dollar supercomputer will not help at all.

If your problem is one that cannot be easily vectorized, are there alternatives that allow you to compute it quickly? Clustering of commodity hardware may be the solution. In the simplest of terms, a cluster is a group of cheap commodity computers -- tied together with a high-speed network or switch -- used to crunch large problems in parallel. Instead of solving a problem on a single, extremely fast processor, the problem is broken into chunks, each of which is solved on an ordinary processor -- a Pentium, SPARC, Alpha, PowerPC, or Precision Architecture chip, for instance. By gathering together a sufficiently large number of such processors, the total number of computing operations and the total amount of RAM in the cluster can be competitive with a traditional supercomputer, at a small fraction of the cost.

Clusters come under a variety of labels: Cluster of Workstations (COW); Network of Workstations (NOW); and Pile of PCs (PoPC). Their common feature is that they use large numbers of commodity machines, each of which is quite cheap. For tens to hundreds of thousands of dollars, it is possible to amass dozens if not hundreds of processors in a cluster.

Perhaps the most famous cluster architecture is Beowulf (http://www.beowulf.org/), designed by Thomas Sterling and Don Becker in 1994. The first Beowulf was a Linux-based cluster of 16 DX4 processors connected by channel-bonded 10-Mbit/sec Ethernet. It was followed by Beowulfs built around 16 200-MHz P6 processors connected by Fast Ethernet and Fast Ethernet switches. Over the years, dozens of university and research groups have built their own Beowulf clusters, ranging from the original configuration to Avalon, a 19-Gflop cluster of 140 Alpha processors built by Los Alamos National Laboratory for only $150,000. Another such system is the Monash Parallel Parametric Modeling Engine (PPME), a cluster designed specifically for large parametric modeling computations (see Figure 1). Because it supports the concurrent execution of one program on multiple machines, it can achieve performance up to 62 times that of a single computer -- one factor for each of the cluster's 62 CPUs. The current system specifications are:

  • 20 330-MHz Pentium II processors.
  • 30 350-MHz Pentium II processors.
  • 8 500-MHz Pentium III processors.
  • 5.8 GB RAM.
  • 180 GB disk space.
  • Two root nodes, each with a pair of 330/500-MHz Pentium II processors.

The cluster comprises machines at both Clayton and Caulfield campuses. Clayton hosts a server (hathor) and 14 client machines; Caulfield hosts another server (mahar) and 16 client machines. All 32 machines are dual-processor Pentiums. Twenty-six client machines are dual-boot Linux or Windows NT, while the remaining four are running Linux. The client machines are located on a private fast network that is accessible only through the servers. Experiments with the cluster are controlled using the TurboLinux enFuzion program, software that lets users specify an experiment using a GUI-based tool, then perform the work on the nodes of the cluster.

Parallelizing Is the Problem

The big issue with clustering is how to parallelize the problem you are trying to solve. Several approaches exist. One is the parallelizing compiler, which determines, during compilation, which parts of the program can be executed in parallel and splits them off to run concurrently. The difficulty with such compilers is that their effectiveness depends on the nature of the problem and on how well the compiler is designed.

The second approach is to write the application from the outset to run on a cluster, using message-passing libraries to move chunks of data between processors on the cluster as required. This is cumbersome and time consuming, but necessary for some types of problems.

The third approach is parametric computing, which is suitable for problems where a single application must be executed a very large number of times, each time being run with a different set of starting conditions or parameters. Each time the program is run, it computes results for a different scenario, specified by a unique set of starting conditions.

Parametric computing essentially replaces the sequential execution of a large number of jobs on a single, very fast CPU with the parallel execution of the same jobs on a very large number of not-so-fast CPUs. Since each CPU can have modest performance, cheap commodity hardware can be used.
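
To make the contrast concrete, here is a minimal shell sketch; the simulate program, its flag, and the node names are hypothetical, not part of any real toolset. The first loop runs the scenarios one after another on a single machine; the second farms the same scenarios out to cluster nodes so they all run at once.

#!/bin/sh
# Sequential: one fast CPU grinds through every scenario in turn.
for f in 1.0 1.4 1.8 2.6; do
    ./simulate -frequency $f > result.$f
done

# Parametric: the same scenarios dispatched to cluster nodes in parallel.
i=1
for f in 1.0 1.4 1.8 2.6; do
    rsh node$i "cd /tmp/run && ./simulate -frequency $f > result.$f" &
    i=`expr $i + 1`
done
wait    # block until every remote job completes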

The critical issue with parametric computing is having the proper parametric processing tools to organize, run, and manage the jobs in an orderly manner.

TurboLinux's enFuzion

Clearly, it takes more than just CPUs and cables to build cluster computers. Consequently, software development kits have come along to facilitate the implementation and management of Beowulf-like cluster computers. enFuzion from TurboLinux, the toolset I'll examine in this article, is one such approach. Originally developed by David Abramson and others as a research project called "Nimrod" at Griffith University (Australia), enFuzion is designed to operate transparently on a Beowulf cluster alongside other clustering toolsets; any Beowulf cluster can therefore concurrently be an enFuzion cluster. In short, a Beowulf computer running Linux provides an infrastructure for enFuzion to run on. enFuzion has been ported to NT and Linux on the Intel architecture, Solaris on SPARC, Irix on MIPS, AIX on PowerPC, and HP-UX on HP-PA.

More specifically, what enFuzion does is emulate a gaggle of robot users on a single root node machine, each of which will log into one of the many client node machines that form a cluster. Each job is set up to run with a unique, programmed scenario, with an appropriate set of starting conditions.

Each robot user logs into its assigned client node in the cluster, using either Telnet or rsh, and then executes the simulation program from the command line, substituting either command-line arguments or arguments in a setup file with a programmed set of values.

In concept this is a simple idea, but the machinery required to do this in software isn't entirely trivial.

When enFuzion is started, it first logs into the chosen client node, then creates a directory for the job, into which it copies the specified user files, data files, and anything else required for the program to run, including the actual executable binary. If the login and the setup of the run-time environment succeed, enFuzion then consults its run file to determine which command-line arguments and setup-file entries it should change, and changes them. Then the program is started and monitored.

If the program runs successfully to completion, enFuzion collects the files containing the results, and copies them back to the user's directory in the root node, after which it cleans up the client node by deleting the directory and its contents. If any jobs remain to be computed, it repeats this process with another set of starting conditions. If the job fails during execution, enFuzion creates an error directory on the root node, then proceeds to gather all of the user's files on the client node and save them in this directory for later debugging.
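
In shell terms, the cycle just described amounts to something like the following sketch. The directory layout, file names, and error handling are illustrative only, not enFuzion's actual implementation; note, too, that classic rsh does not reliably propagate the remote command's exit status, so a real implementation needs a more careful success test than this.

#!/bin/sh
# One job's life cycle, as seen from the root node (illustrative only).
node=node07; jobname=job42; jobdir=/tmp/enfuzion.$jobname

rsh $node "mkdir $jobdir"                       # create the job directory
rcp mysim rcfile.sub datafile $node:$jobdir/    # ship binary and input files

if rsh $node "cd $jobdir && ./mysim datafile rcfile"; then
    rcp "$node:$jobdir/results.*" .             # success: collect the results
    rsh $node "rm -rf $jobdir"                  # and clean up the client node
else
    mkdir ERRORS.$jobname                       # failure: save everything on the
    rcp "$node:$jobdir/*" ERRORS.$jobname/      # root node for later debugging
fi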

How enFuzion goes about setting up the environment on the client node, and which command-line arguments or simulation program setup file variables it manipulates, is under user control. A scripting language is used for this purpose. Users can either write scripts directly in the simple enFuzion scripting syntax, or use a button-driven windowing tool called the "Preparator" to generate the script without having to touch a text editor.

The starting point for configuring enFuzion to run a simulation is the creation of a plan file, with a .pln suffix, which can be done with a text editor or the Preparator tool. The plan file first specifies each command-line parameter (or setup-file variable) that is to be manipulated during execution. Each can be declared as an integer, floating-point, or string variable, and each is given a range of values through which the simulation steps. enFuzion interprets these values at run time and creates a table of all the specified combinations of the command-line parameters or setup-file variables required by the user.
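
For example, a cut-down plan file declaring just two parameters in the syntax of Listing One might read as follows. If, as Listing One suggests, the default clause selects the values actually swept in a run, enFuzion would build a table of 2 x 3 = 6 jobs, one per combination of weathermodel and frequency (this fragment is illustrative, not from the actual experiment):

parameter weathermodel label "Wx Model" integer select
    anyof 0 1 2 3 default 0 1;
parameter frequency label "Carrier" float select
    anyof 1.0 1.4 1.8 2.6 3.15 3.8 4.5 default 1.0 1.4 1.8;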

The next portion of the plan file contains the scripting commands to set up the environment on the client machine. This typically involves creating directories and copying specified files across, and most closely resembles a shell script. It is followed by scripting commands that invoke the simulation program with the command-line arguments or setup files to be manipulated, including a Substitute command, which tells enFuzion to change these to the values chosen for that specific instance of the job. Finally, the script tells enFuzion how to tidy up when the job is completed, what files to save, and where to save them on the root node.
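
The effect of the Substitute command is easiest to see with a toy example. Assuming, hypothetically, that the template file marks each parameter with the same $name notation the plan file uses, a fragment of rcfile.sub reading:

    carrier_frequency $frequency
    weather_model     $weathermodel

would, for the job where frequency is 1.4 and weathermodel is 2, yield an rcfile containing:

    carrier_frequency 1.4
    weather_model     2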

Listing One is a typical plan file. The parameter lines each define a parameter that is to be varied across jobs during the simulation. Fields define the type of parameter, and the values to be stepped through during the simulation runs.

The nodestart task is executed on every client node at job start time, and essentially involves setting up the run-time environment. The syntax is simple, with either "root:" or "node:" specifying which of the hosts either executes a command, or is the source or destination of a Copy command. The Execute directive is always followed by a standard UNIX command line.

In this instance, the nodestart task first has the node set up symlinks to several binary data files that are NFS mounted from a common server volume, /u/work_vol/, in directory carlo/SIMULATION/. This removes the time overhead of copying 30-MB data files out to the client nodes; the only network load is then produced by the incremental paging of data over the NFS mount.

Next, a number of work and environment files are copied across from the root to the node machines, including the executable binary called "mysim."

The main task is the script for running the simulation proper, then performing housekeeping chores. Several Execute commands are run to gather logging information, after which the Substitute command is applied to the template rcfile.sub to produce the startup file rcfile, in which the parameters are configured for the specific compute run. The values are filled in according to the parameter definitions at the top of the plan file, so every job ends up with a unique instantiation of rcfile.

More debugging information is saved, after which the user simulation is started with "execute time mysim $dataset rcfile." The $dataset is a command-line argument for the mysim simulation program, which also accepts other parameters from the file rcfile.

Once the mysim simulation has completed, a directory is created on the root machine for saving results, with a "root:execute mkdir PLOTS.$jobname" line. Results and various simulation logfiles are then copied across from the client to the root node, and the task is completed.

Using enFuzion

Using enFuzion is straightforward. The plan file must first be crunched into a run-time command file, with a .run suffix, using a tool called the "Generator," which interprets the plan file and generates a lower level syntax for enFuzion to interpret once it is running. The resulting run file contains the table of specific parameter values for the substitutions. The windowed user interface for this is easy to use, and the run file itself is a human-readable, if tedious, text file.

Finally, we fire off the simulation using the "Dispatcher" tool, which can be executed from a command line or another windowed user interface. It is the Dispatcher that actually kicks off the running of the simulations on the client machines. During execution, enFuzion runs a job manager process for each client on the root node, and a node manager process on each client machine.

Porting to enFuzion

Porting most applications to enFuzion is also straightforward. My experience in moving a hefty 34,000-line networking simulation across to the enFuzion environment, after running the code only on a single machine, is a good example. This simulation was originally written in C, and ported to Irix 6.2 and FreeBSD 2.2.5. It was a major part of a research project at Monash University.

The client nodes on the 20-processor cluster we employed were headless, mounted in a 19-inch rack without monitors or keyboards. The first step in this exercise was therefore to modify the code to disable the X11 screen activity monitor; the same consideration would apply in an NT environment. The next step was to get the code to compile and run cleanly on Red Hat Linux, the host operating system on all of our cluster machines. This was followed by several days of intensive regression testing to validate that the Linux port had no bugs.

At that point, we proceeded to write the enFuzion plan scripts, which took several hours to produce and a little longer to debug. It was necessary to set up some test simulations so that the scripts could be fully exercised. To speed this up, we configured the computations to use only a few of the 20 CPUs and ran scenarios that we knew would complete in several hours. The full-scale simulations would take up to several days each on a 333-MHz Pentium II -- rather a long time to wait if you are trying to get something running.

Once the scripts were debugged, the next big effort was planning out the large-scale simulation runs. The project I was working on involved wide area networks (WAN) of mobile nodes and required about 1000 simulations. Each of these simulations took at best 18 hours to run, and at worst, many days. Therefore, as we found, it pays to put the effort into planning.

Finally, the cluster got put to use. Three months were required to complete the total package of simulations, during which time several operating-system upgrades and hardware upgrades were performed, slowing down progress. By the time the final simulations were completed, our modest 20-CPU cluster had grown to no less than 60 CPUs, more than tripling the original performance of the system. Half of the client machines in the cluster were installed on a separate university campus, and connected via a high-speed ATM link. No loss of performance was observed over the link.

The biggest cause of downtime during the project was instability in the hardware and operating system, especially on the root node. Since the root node system is the nucleus of the cluster, it pays to invest the time to get the operating system tuned up and properly configured. This is true regardless of what operating system is employed in the cluster, be it a UNIX variant or NT. If you are planning to set up a cluster, this is one area in which careful planning pays off.

There is no reason why the root node system and client node systems must use the same operating system or even the same machine architecture. If executable binaries of the simulation program are installed on the client nodes, the system can run quite transparently.

The cluster at Monash University uses Linux on the root node, and all client nodes can be booted up into either Linux or NT. In fact, at this time, the Linux installations across the cluster are a mix of Red Hat 5, Red Hat 6, and TurboLinux, which is being progressively installed as the standard.

Therefore, with a minimum of additional effort to install application binaries on the client nodes, something for which an enFuzion script can also be crafted, it is possible to run a cluster with a heterogeneous population of machines. An example would be an instance where a high-performance central server running a commercial UNIX such as Solaris, Irix, or Tru64 (OSF/1) is used as a root node, and low-cost commodity operating systems (such as Linux) are used on the client nodes (or NT if the site is Win32-centric).
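
Since Listing One's echo line shows that enFuzion exposes an $OS variable to the script, one hypothetical way to cope with mixed client nodes is to pre-install one binary per operating system on each node and let the main task pick the right one. The mysim.$OS naming convention here is an assumption for illustration, not a documented enFuzion feature:

task main
    node:execute time mysim.$OS $dataset rcfile
endtask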

Flexibility Is the Key

The flexibility of enFuzion extends further. If the files to be copied between the root node and the client nodes (that is, the data to operate upon and the results to copy back) are modest in size, it becomes entirely feasible to dispense with a high-performance switch. As a consequence, lightly loaded machines across the site can be co-opted into the cluster, simply by setting up an account for the users in question and installing the enFuzion package on the machine. In this manner, clusters can be expanded beyond the central computer room. enFuzion contains facilities to support such an operation, such as configurable parameters that define the times during which jobs may run (see the snippet below). Thus, a machine on a user's desktop can participate in the cluster as a client node overnight, when the performance drain will not be noticed by its owner.
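
The relevant controls appear, commented out, at the top of Listing Two. Uncommented, the following lines give a desktop machine exactly the overnight-only policy just described:

# do not execute cluster jobs 7:30-17:30 Mon-Fri
off day Mon-Fri time 7:30-17:30
# the computer must be free from interactive use for 30 minutes
idle 00:30:00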

In terms of managing the load on machines, enFuzion provides some useful facilities. One of these is an arrangement whereby system administrators can limit the concurrent number of jobs on the client machines via a configuration variable called the "joblimit." A good rule of thumb is to limit them to one per CPU, so as to avoid users contending for machine time. Another useful facility is a pair of configuration files in which users can limit the number of jobs to be run concurrently on any client node and also select which client nodes to run jobs on (see Listing Two).

An interesting problem encountered at Monash during the early operations of the cluster was cluster hogging. With a large number of users trying to run jobs, we quickly learned that once long-running jobs became established and filled up the specified joblimit for that client node, users with short-running jobs had to wait until the long ones completed.

We devised and implemented a fix: a scheduler, invoked by the UNIX cron daemon on the root node, that employs the UNIX renice facility to reduce the scheduling priority of a job on a client node the longer that job has been running. This allowed short jobs to complete quickly, while long-running jobs simply slowed down until the short jobs completed.
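
A minimal sketch of such a scheduler follows; our production version was more elaborate, and the node names, process matching, and aging rate here are illustrative. Run hourly from the root node's crontab, it walks the client nodes and lowers the nice value of each simulation process in proportion to how long it has been running:

#!/bin/sh
# Cron-driven job-aging scheduler (sketch).  Example crontab entry:
#   0 * * * * /root/bin/age_jobs
for node in node01 node02 node03; do
    rsh $node 'ps -eo pid,etime,comm' | awk '/mysim/ {
        hours = 0
        n = split($2, t, "[-:]")            # etime is [[dd-]hh:]mm:ss
        if (n == 4) hours = t[1] * 24 + t[2]
        if (n == 3) hours = t[1]
        nice = hours * 2                    # two nice steps per hour of runtime
        if (nice > 19) nice = 19            # 19 is the weakest priority
        print $1, nice
    }' | while read pid nice; do
        rsh $node "renice $nice -p $pid"    # renice sets an absolute nice value
    done
done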

Optimizing Cluster Performance

In terms of performance-tuning systems for use in a cluster, the standard caveats for tuning any machine apply: the swap space must be adequate, the amount of memory must be large enough to support the number of jobs to be run, and care should be taken in the choice of client hardware if the cluster is being built from scratch. In practice, we found that most of our trouble arose with one particular user's simulation, which demanded almost every megabyte of memory on the client machine at the expense of other jobs.
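
Several of these constraints can also be policed by enFuzion itself, through the node limits file. Uncommenting the following lines from Listing Two, for instance, stops new jobs from starting on a node that is short of memory, scratch space, or CPU headroom:

# 8000Kb of main memory must be available to start a job
memory 8000
# 7.5Mb of temporary space must be available
tmpspace 7.5
# to start new jobs, the load must be below 2.00
busyload 2.00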

Conclusion

As an enFuzion user, I was pleased with the final results, since the task of trying to manually run and manage roughly 1000 individual simulations was simply a nightmare waiting to happen. While I did have to weather some of the ups and downs of commissioning, tuning, and tweaking the cluster into a robust configuration, the time saved clearly justified it: time that would otherwise have been spent fretting over dozens of simulations running on disparate machines or, even worse, waiting for multiple jobs to complete consecutively on some hapless, overworked Pentium, SPARC, or MIPS workstation. Simply checking the enFuzion files with my web browser while 50 or 60 CPUs happily ground away, each on a different scenario of my simulation, was a much nicer way of getting work done. The truth is that without the cluster and enFuzion, I would have had to settle for a much less ambitious simulation strategy -- and done it the hard way.

DDJ

Listing One

parameter bandwidthlimit label "Capacity Lower Bound" float select
     anyof 0.3 1 3 10 30 100 300 1000 3000 default 100 300 1000 3000;
parameter frequency label "Carrier" float select
     anyof 1.0 1.4 1.8 2.6 3.15 3.8 4.5 default 1.0 1.4 1.8 2.6 3.15 3.8 4.5;
parameter weathermodel label "Wx Model" integer select
     anyof 0 1 2 3 default 0;
parameter span label "Span" text select
     anyof "SPAN2" "SPAN1" default "SPAN2";
parameter dataset label "Dataset" text select
    anyof "dataset97-20" "dataset97-21" default "dataset97-20" "dataset97-21";
task nodestart
    node:execute ln -s /u/work_vol/carlo/SIMULATION/freq-table freq-table
    node:execute ln -s /u/work_vol/carlo/SIMULATION/isa-table isa-table
    node:execute ln -s /u/work_vol/carlo/SIMULATION/data-dataset97-20-0 data-dataset97-20-0
    node:execute ln -s /u/work_vol/carlo/SIMULATION/data-dataset97-21-0 data-dataset97-21-0
    copy root:PLOTS node:.
    copy root:BLN node:.
    copy root:aid-dataset* node:.
    copy root:limit-dataset* node:.
    copy root:plotlist-dataset* node:.
    copy root:times-dataset* node:.
    copy root:title-dataset* node:.
    copy root:weather-model* node:.
    copy root:*table node:.
    copy root:mysim node:.
    copy root:rcfile.sub node:.
endtask
task main
    node:execute date
    node:execute echo Executing on os:$OS release:$OSRELEASE machine:$MACHINE
    root:execute df -k .
    node:substitute rcfile.sub rcfile
    node:execute ls -l
    node:execute ls -l BLN
    node:execute ls -l PLOTS
    node:execute cat rcfile
    node:execute time mysim $dataset rcfile
    root:execute mkdir PLOTS.$jobname
    copy node:PLOTS/* root:PLOTS.$jobname
    copy node:usage.log root:TMP/usage.log.$jobname
    copy node:links root:TMP/links.$jobname
    copy node:limits.log root:TMP/limits.log.$jobname
    copy node:journal* root:TMP/
    root:execute df -k .
    copy node:stdout root:TMP/stdout.$jobname
    node:execute date
endtask


Listing Two

#      do not execute Clustor jobs 7:30-17:30 Mon-Fri
#off day Mon-Fri time 7:30-17:30
#      do not execute Clustor jobs on June 30, 1998
#off date 1998/Jun/30
#      allow Clustor job for 30 minutes at lunch time
#on day Mon-Fri time 12:15-12:45
#      allow Clustor jobs on Jan 1, 1998
#on date 1998/1/1
#      the computer must be free from interactive use for 30 minutes
#idle 00:30:00
#       turn on the screen saver option
#screensaver on
#       turn off the screen saver option
#screensaver off
#      to start new jobs, the load must be below 2.00
#busyload 2.00
#      to start new jobs, the CPU usage must be below 10%
#busycpu 10
#       to start new jobs, the Processor Queue Length must be 0
#busyqueue 0
#      to prevent new jobs, the program should return 1
#external busy program "/home/myuser/myprogram" interval 00:01:00
#      7.5Mb of temporary space must be available
#tmpspace 7.5
#      50Mb of working disk space must be available
#diskspace 50
#       8000Kb of main memory must be available to start a job
#memory 8000
#      on this node, run concurrently at most 3 jobs from a single application
#joblimit 3
#      be maximally nice
#priorityoffset 20
#      to stop existing jobs, the load must be above 10.00
#stopload 10.00
# to stop existing jobs, the CPU usage must be above 90%
#stopcpu 90
# to stop existing jobs, the Processor Queue Length must be above 3
#stopqueue 3
#      to stop existing jobs, the program should return 1
#external stop program "/home/myuser/myprogram" interval 00:01:00
#      wait 50 minutes before terminating jobs
#stopaction suspend 00:50:00
#      terminate jobs immediately
#stopaction terminate
#      use signal 3 (SIGQUIT) instead of SIGKILL
#killsignal 3
#      on this computer, run at most 5 concurrent Clustor jobs
#totaljoblimit 5
#      change default node directory
#directory "/tmp/mynodes"
#      monitor mouse usage on /dev/mouse
#mouse "/dev/mouse"
#      check console on /dev/console
#console "/dev/console"


