Bargain-Basement Parallelism

The UNIX fork() system call lets you take advantage of underutilized processors in dual-processor computers.


February 01, 2003

Tim is an associate professor in the Computer Science Department of Eastern Washington University. He can be contacted at [email protected].


Assume you are developing C or C++ programs to run under Linux or another UNIX variant, and you know that they may run on a machine with more than one processor. Sequential programs, of course, use only one of those processors, which probably is sufficient if the machine is otherwise able to keep all processors busy. But if the machine ends up with an idle processor, you may want to capture those processing cycles.

Of course, there are freely available systems such as Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) under which you can rewrite your programs to take advantage of message passing; see the accompanying text box entitled "Message Passing Systems." However, it may happen that your program deals with a number of smaller subproblems that are effectively independent of each other—what the parallel programming literature calls an "embarrassingly parallel" problem. In this case, you may be able to take advantage of the UNIX fork() system call to get multiple copies of your program executing. With only small changes to the original sequential program, you can keep both processors of a dual-processor computer busy and complete your calculations in roughly half the time.

The call that opens up the possibility of such speedup is fork(). If fork() succeeds, there are two copies—the parent process and the child process—of the same program executing at the same time. The child process executes in a copy of the memory used by the parent process: it receives all of the data of the parent, but copied into its own area. However, the processes share any files opened for output. The operating system lets them write to the same file and handles the interleaving—though you should be careful to write only complete, newline-terminated records as single writes.
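
A minimal sketch shows the shape such a program takes (the file name and the output record here are invented for illustration; each process assembles a complete line before emitting it):

#include <fstream>      // ofstream
#include <sstream>      // ostringstream
#include <unistd.h>     // fork, getpid
#include <sys/wait.h>   // wait
using namespace std;

int main()
{  ofstream Fout("Shared.txt", ios::app); // opened BEFORE the fork
   pid_t Child = fork();
   if ( Child < 0 )
      return 1;             // fork failed; no child was created
// Both processes execute from here, each with its own copy of the data.
   ostringstream Line;      // assemble the complete record first...
   Line << (Child == 0 ? "child" : "parent")
        << ": pid " << getpid() << '\n';
   Fout << Line.str() << flush;  // ...then emit it in one piece
   if ( Child != 0 )        // the parent waits for its child
      wait(NULL);
   return 0;
}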

To illustrate, I use a program designed to measure the characteristics of Binary Search Trees by generating large numbers of them from random data (the complete source code for this and other examples is provided electronically; see "Resource Center," page 5). Listing One is C preprocessor code for enabling/disabling the fork processing, currently set to enable the dual-processor mode. A process determines whether it is the child or the parent by examining the value returned by fork(): the parent receives a strictly positive number, the process ID of the child; the child receives zero. (If fork() fails, it returns -1 and no child process is created.)

Listing Two is the preprocessor code surrounding the setup for either sequential or dual-process execution. In this case, I am setting up for only two processes, so the alternating sequence can easily be generated: the child process does the even offset values for N (after Nstep has been doubled), while the parent process does the odd offset values (by stepping ahead by 1 before Nstep is doubled). Prior to pid_t Child = fork(), the program opened an output stream, using ofstream Fout("BSTstats.csv",ios::app);—both parent and child processes write to this output stream.

The working heart of the simulation (Listing Three) does not even know whether it is running as the only process or as a parent or child process. Even on an otherwise unloaded dual-processor computer, though, the operating system does require the use of some processor cycles, slightly slowing down one side of the calculation. If you just want the statistical data, this isn't a problem. However, if you are interested in the time required, you may see that the parent and child run at slightly different speeds.

You can also get data back from the child process or processes via a shared output file. In this case, the parent needs to wait for the child process to terminate, something easily done with the UNIX C Library function wait(). The child process writes its information to a shared file, which the parent process then rewinds and reads.
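
In its simplest form, with a single child, the pattern is only a few lines. This sketch (the file name and the stand-in result are invented for the example) shows the round trip before the multiprocess chain of Listing Four:

#include <fcntl.h>      // open
#include <unistd.h>     // fork, write, lseek, read, close
#include <sys/wait.h>   // wait

int main()
{  long Result = 0;
   int fd = open("/tmp/Result.data", O_RDWR|O_TRUNC|O_CREAT, 0666);
   if ( fork() == 0 )       // child: compute, report back, and quit
   {  long Nhits = 42;      // stands in for the real calculation
      write (fd, &Nhits, sizeof(Nhits));
      _exit(0);
   }
   wait(NULL);              // parent: let the child finish,
   lseek(fd, 0, SEEK_SET);  // rewind the shared file,
   read (fd, &Result, sizeof(Result)); // and read the child's value
   close(fd);
   return 0;
}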

Listing Four is from a program implementing the "world's worst" method of calculating the natural logarithm ln(x): Monte Carlo integration of dt/t from 1 to x. The parent opens a file on a local disk, and each child process writes its information to that file. When the children are finished, the parent reads through the file to process the information. Although in this case only a single long int is being returned, you can see that the object written to the binary file and read back from it could be arbitrarily large.
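
The estimate itself is simple hit-or-miss integration: throw random points into the rectangle [1,x] x [0,1], count those that fall under the curve 1/t, and multiply the fraction of hits by the rectangle's area (x - 1). Here is a minimal sketch of that idea, reconstructed from the description above (the function name and parameters are illustrative, not taken from the electronic listings):

#include <cstdlib>      // rand, RAND_MAX

// Hit-or-miss estimate of ln(x) as the integral of dt/t from 1 to x.
// Requires x >= 1 so that 1/t never exceeds the rectangle height of 1.
double WorstLog ( double x, long Nshots )
{  long Nhits = 0;
   for ( long Shot = 0; Shot < Nshots; Shot++ )
   {  double t = 1.0 + (x - 1.0) * rand() / RAND_MAX; // t in [1, x]
      double y = double(rand()) / RAND_MAX;           // y in [0, 1]
      if ( y < 1.0 / t )        // the point lies under the curve
         Nhits++;
   }
   return (x - 1.0) * Nhits / Nshots;  // fraction of hits times area
}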

Java Note

Java threads, of course, make multiprocessing easily available, but require a bit more programming effort to split the sequential code into its parallel incarnation. Rather than simply executing from the point of a fork() call, a Java thread begins processing within its run() method. Hence, the segment of code to run in parallel needs to be moved into that method. Similarly, data transfer needs to be handled explicitly in constructing the thread rather than through a copied data area. Listing Five, taken from the main() method, comes from the Java analogue of the C++ code examining Binary Search Tree characteristics (Listings One through Three). The BSTrun constructor captures the subproblem parameters into instance fields to drive the simulation in the run() method; see Listing Six. The run() method (Listing Seven), which does all of the work, is simply a copy of the appropriate section of the sequential program.

Since the System.currentTimeMillis() method returns wall-clock time rather than the processor time returned by the C Library function clock(), this Java implementation may show an even larger timing difference between the threads, depending on which processor is doing the necessary operating-system work.
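
The contrast between the two clocks is easy to demonstrate in C++ (the busy loop below merely stands in for real work):

#include <cstdio>       // printf
#include <ctime>        // clock, time, difftime
using namespace std;

int main()
{  clock_t Start  = clock();      // processor time used by THIS process
   time_t  WStart = time(NULL);   // wall-clock time, in whole seconds
   volatile double Sink = 0;      // keep the optimizer honest
   for ( long i = 1; i <= 100000000L; i++ )
      Sink += 1.0 / i;
   printf ("CPU seconds:  %f\n", double(clock()-Start) / CLOCKS_PER_SEC);
   printf ("Wall seconds: %f\n", difftime(time(NULL), WStart));
   return 0;
}

On an unloaded machine the two agree closely; when the processors are busy with other work, wall-clock time grows while processor time does not.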

In Java, information can flow from the threads back to the originating thread because all threads share the same memory; objects are shared through references passed as parameters. For the "world's worst" logarithm calculation, I needed to provide a reference to a mutable integer, since the Java API class Integer is immutable; Listing Eight does this. As in the C example, the size of the object returned could be arbitrarily large in the Java implementation. Indeed, because of the shared memory, it is easier for the threads to return information to the originating thread through such mutable objects.

DDJ

Listing One

/* If a dual-processor UNIX machine, you can use a fork and run two 
experiments in parallel. Comment out the following #define to disable */
#define DUAL_PROCESSOR  /* UNIX system, dual-processor computer */
#ifdef DUAL_PROCESSOR
#include <unistd.h>     /* fork (at least for Linux)  */
#endif

Listing Two

#ifndef DUAL_PROCESSOR
   srand(time(NULL));   // Get a chance start to rand() sequence
#else
// If it IS dual-processor, fork the second process, then start two different
// random sequences and arrange to alternate tree sizes during the simulation.
   pid_t Child = fork();
   if ( Child != 0 )    // I.e., if this is the parent process
      N += Nstep;       // Do the odd-offset ones
   Nstep += Nstep;      // Each process steps by twice as much
// Chance start to rand() sequence, different for parent and child
   srand(time(NULL) + Child);
#endif

Listing Three

for ( ; N <= Nlim ; N += Nstep )  // Use value of N from above
{  clock_t Start = clock();       // Capture the starting tick
   double  CPUsecs,               // Processor time required
           SigmaHt = 0,           // Start summations at zero
           SigmaDp = 0; 
   for (Pass = 0; Pass < Npasses; Pass++)
   {  for ( k = 0; k < N; k++ )   // Fill the tree
         Tree.Insert ( rand() );
      Height   = Tree.Height();   // Capture this tree's data
      AvgDepth = Tree.AvgDepth();
      SigmaHt += Height;          // Collect the statistics
      SigmaDp += AvgDepth;
      Tree.Empty();               // Empty out the tree
   } // end for ( Pass...
   CPUsecs = double(clock()-Start) / CLOCKS_PER_SEC;
// Capture statistics in "comma-separated value" mode for later use.
   Fout << N << ',' << SigmaHt / Npasses << ','
        << SigmaDp / Npasses << ',' << CPUsecs << endl;
} // end for ( N...

Listing Four

   int fd = open("/tmp/Joint.data", O_RDWR|O_TRUNC|O_CREAT, 0666);
   .  .  .    < code segment omitted > 
// Spawn all the child processes as a chain
   for ( Proc = 1; Proc < Nproc; Proc ++ )
   {  Child = fork();
      if ( Child != 0 )      // Parent leaves the loop; its child
         break;              // continues and spawns the NEXT child.
   }
   .  .  .    < code segment omitted > 
   if ( Proc == 1 )   // That is, the first parent in the chain
      wait(NULL);     // wait till everything is finished.
   else
   {  if ( Child != 0 )
         wait(NULL);  // Let the immediate descendent finish
      N_out = write (fd, &Nhits, sizeof(Nhits)); // write out Nhits
      exit(1);        // Bow out so that the next can process
   }
   N_out = lseek(fd, 0, SEEK_SET);  // Rewind the file
   for ( Proc = 1; Proc < Nproc; Proc++ )
   {  N_out = read (fd, &Part, sizeof(Part));    // read into Part
      Nhits += Part;
   }

Listing Five

   BSTrun[] worker;     // Compute-engine threads
     .  .  .    < code segment omitted >
   worker = new BSTrun[nThreads];
   for ( k = 0; k < nThreads; k++ )
   {//Note the threads have differing starting values for n,
    //while they all use a correspondingly coarser nStep.
      worker[k] = new BSTrun(n + k*nStep, nLim, nThreads*nStep, nPasses, fOut);
      worker[k].start();     // Start processing
   }
// Now wait for the workers to finish
   for ( k = 0; k < nThreads; k++ )
      try
      {  worker[k].join(); } // Returns when worker[k] terminates
      catch (InterruptedException e)   // We should never see this
      {  System.err.println (e);  System.exit(-1);  }

Listing Six

class BSTrun extends Thread
{ 
   // Seeded in the constructor, used in run()
   java.util.Random generator;
   // Subproblem description parameters / instance fields
   int n,          // Starting tree size
       nLim,       // Upper limit on tree size
       nStep,      // Increment for tree size
       nPasses;    // Number of trees to build for averaging
   // Output stream for the results
   PrintWriter fOut;
   // Copy all the construction parameters into their fields.
   BSTrun(int n, int nLim, int nStep, int nPasses, PrintWriter fOut)
   {  this.n = n; this.nLim = nLim; this.nStep = nStep;
      this.nPasses = nPasses; this.fOut = fOut;
      // Since n values differ, we have somewhat different seeds.
      generator = new java.util.Random(System.currentTimeMillis()
                  + n * nLim);
   } // end BSTrun constructor

Listing Seven

// Compute engine portion --- copied from sequential code
public void run()
{  BST tree = new BST();
   for ( ; n <= nLim; n += nStep )
   {  long   start = System.currentTimeMillis();
      double elapsed;
      double sigmaHt = 0,  // Total of tree heights
             sigmaDp = 0;  // Total of tree average depths
      int    height;       // Individual tree height
      double avgDepth;     // Individual tree average depth
      String stats;        // Build the formatted output here
      DecimalFormat fmt = new DecimalFormat ("0.000");
      for ( int pass = 0; pass < nPasses; pass++ )
      {  for ( int k = 0; k < n; k++ ) // Fill the tree
            tree.insert ( generator.nextInt() );
         height   = tree.height();     // Capture this tree's data
         avgDepth = tree.avgDepth();
         sigmaHt += height;            // Collect the statistics
         sigmaDp += avgDepth;
         tree.empty();                 // Empty out the tree
      }// end for (pass...
      elapsed = (System.currentTimeMillis()-start) * 1.0e-03;
      stats = "," + fmt.format(sigmaHt / nPasses) + ","
                  + fmt.format(sigmaDp / nPasses) + "," 
                  + fmt.format(elapsed);
      fOut.println (n + stats);  fOut.flush();
   } // end for (n...
} // end run()

Listing Eight

class IntWrapper
{  public int value;      // All values allowed, so direct access
   IntWrapper (int value) // Constructor:  copy over value
   {  this.value = value;  }
} // end class IntWrapper
. . . <code segment omitted>
// The following code segment is taken from the main() method--showing the 
// creation of an array of threads th[], and an array of IntWrapper objects 
// p[] to handle parameter passing. The p[] objects also pass information 
// into the thread--the instance number of the thread.
   LnCalc[]     th = null;
   IntWrapper[] p  = null;
   .  .  .    < code segment omitted >
   th = new LnCalc[nThreads];
   p  = new IntWrapper[nThreads];
   for ( k = 0; k < nThreads; k++ ) // Thread initiation
   {  p[k]  = new IntWrapper(k+1);
      th[k] = new LnCalc(x, nShots, p[k]);
      th[k].start();
   } // end thread initiation and start
   for ( k = 0; k < nThreads; k++ ) // Thread completion
   {  try
      {  th[k].join();  } // Wait for completion of this thread.
      catch (InterruptedException e)
      {  System.out.println (e);  System.exit(-1);  }
      nHits += p[k].value;
   } // end wait for thread termination

Message Passing Systems

If you have access to networked computers that you want to use for parallel processing, you can examine the freely available message passing systems. Parallel Virtual Machine (PVM) is one such system, available at http://www.epm.ornl.gov/pvm/ and http://www.netlib.org/pvm3/index.html. On the other hand, Message Passing Interface (MPI) is technically not a system, but a standardized interface. For information on MPI, see http://www-unix.mcs.anl.gov/mpi/. There are two freely available implementations, MPICH (ftp://ftp.mcs.anl.gov/pub/mpi) and LAM/MPI (http://www.mpi.nd.edu/lam/).
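
For a taste of the message-passing style, here is a minimal sketch against the standard MPI C interface (error handling omitted; the partial result is a stand-in): each process learns its rank, the workers send a value to rank 0, and rank 0 accumulates them.

#include <mpi.h>
#include <stdio.h>

int main ( int argc, char* argv[] )
{  int Rank, Size;
   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &Rank);  // this process's ID
   MPI_Comm_size (MPI_COMM_WORLD, &Size);  // total number of processes
   if ( Rank != 0 )         // workers: send a partial result to rank 0
   {  long Part = Rank;     // stands in for a real partial result
      MPI_Send (&Part, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
   }
   else                     // rank 0: collect from everyone else
   {  long Total = 0, Part;
      for ( int p = 1; p < Size; p++ )
      {  MPI_Recv (&Part, 1, MPI_LONG, p, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
         Total += Part;
      }
      printf ("Total = %ld\n", Total);
   }
   MPI_Finalize();
   return 0;
}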

—T.R.
