Appliances: Adaptable Parallel Computing for Mass Consumption

Coping with demanding requirements for adaptability and performance


June 03, 2008
URL:http://www.drdobbs.com/embedded-systems/appliances-adaptable-parallel-computing/208401799

Steve Reinhardt is Vice President of Joint Research at Interactive Supercomputing.


The appliance metaphor evokes an instrument that does a single function extremely well, often a function that is well understood, is used similarly by many people, and takes considerable expertise to get right. Computing appliances have established several important niches recently. For instance, the NetApp storage servers hide tremendous complexity beneath simple interfaces and make it easy for customers to expand storage to immense sizes without becoming storage experts.

How might the appliance metaphor be useful for high-performance computing (HPC)? For many scientists and engineers, their computing is the essence of their science, and hence they change their models constantly for their custom work. This doesn't match the appliance approach very well. But there are some computing tasks that naturally lend themselves to this metaphor, and whose computing needs are growing rapidly. Two common examples are the initial processing of data coming out of DNA microarrays and the processing of images in biomedical research. Since this work is done nearly identically by many scientists or engineers, groups often step forward to implement common functionality for a community. Some examples include BLAST (the "Basic Local Alignment Search Tool" used for aligning nucleotide and amino acid sequences) and HMMER (profile "hidden Markov models" used for protein sequence analysis) in the genomics world, and SPM ("statistical parametric mapping", used for testing hypotheses about functional imaging data). Commercial companies also make appliances, such as MRI machines, that often include a hardware component as well as significant software expertise.

Appliance developers want to provide excellent results to their users (who often include themselves!), which usually requires adapting quickly to the latest scientific advances and running very fast on modern hardware. These needs are often in direct conflict: the software tools that support the fastest development, such as high-productivity desktop languages like MATLAB, Python, and R, are often not viewed as high performance. Moreover, modern hardware is multi-core -- and soon to be many-core -- so being able to decompose the work to exploit multiple cores is essential for top performance. Some algorithms, notably in imaging, also map well to GPUs (graphics processing units), often with speed-ups on the order of 100X, which researchers who depend on imaging need to remain competitive in their own science. Yet the desktop languages do not have strong support for parallelism or for hardware accelerators.

So how can software appliance developers practically respond to these conflicting demands for both faster adaptability and much higher performance? A new generation of tools is emerging that combines the high productivity of desktop languages such as MATLAB, Python, and R with access to parallelism and accelerators. These include the Parallel Computing Toolbox from The MathWorks, Dynamic Application Virtualization from IBM, and Star-P from Interactive Supercomputing (the company I work for). For example, Star-P bridges the productivity languages to the power of parallel clusters. Because the language differences between Star-P and the desktop language (using the M language of MATLAB for the examples below) are slight, algorithm developers can continue to develop their codes in a familiar environment, yet have access to massive acceleration where their algorithms demand it.

Parallel Execution

Let's look at an image-processing example that does 2D FFTs on each plane of a 3D matrix (images). The original M code looks like this:

% images is an N x N x #images array
for i=1:size(images,3)
   fft_images(:,:,i) = fft2(images(:,:,i));
end

To run this in parallel, the code might look like:

fft_images = ppeval('fft2',images);

where ppeval is essentially a parallel loop over each plane of the input array(s). The Star-P infrastructure takes care of the details of allocating the data to the cores that are active for this session; the user continues to think in the high-productivity M language, but gets the performance advantages of many cores and more memory. In addition to the task parallelism shown in this example, Star-P also supports data parallelism, in which a single operation, such as a matrix multiplication or a sort, is applied to an entire array distributed across the memory of the parallel nodes. Adapting M or Python codes to run in parallel with Star-P typically takes a day or so, depending on the size of the code, and often yields speed-ups of more than 10X, and even 100X on large systems.
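The data-parallel style needs no explicit loop at all. As a minimal sketch (assuming Star-P's *p notation for marking a dimension as distributed on the server; the sizes are purely illustrative), a distributed matrix multiplication might look like:

% the arrays live in the HPC server's memory, spread across the nodes
A = rand(4000, 4000*p);
B = rand(4000, 4000*p);
% a single data-parallel operation runs across all cores in the session
C = A * B;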

Readers familiar with C and Fortran may interject that those languages often deliver much higher performance than the high-productivity languages for a given algorithm. While that has historically been true, new compilation techniques for the high-productivity languages are closing that gap rapidly while preserving their productivity benefits. Developers wanting every last clock cycle of performance will usually be able to make C or Fortran run faster on a single core, but that advantage must be weighed against the higher-level benefit of being able to go parallel simply with the productivity languages and thereby gain better multi-core performance.

Software Acceleration

Developers of existing appliances will often have created, with considerable effort, optimized versions of the key kernels of their algorithms, usually in C or Fortran. While they may value the greater adaptability of the productivity languages, they must sustain the use of those optimized kernels. Star-P supports this need through its SDK interface. In keeping with the FFT example above, assume that a developer has a single-core C-language FFT routine appl_fft that is customized to the appliance's special circumstances. Using this routine from Star-P involves three steps. The first step is plugging the C routine into the Star-P infrastructure on the HPC server system. The wrapper code to do this might look like the following (error-checking details omitted for brevity):

static void appl_fft_wrapper(ppevalc_module_t& module,
    pearg_vector_t const& inargs, pearg_vector_t& outargs)
{
    // create an output argument of the same type and
    // shape as the input argument
    pearg_t outarg(inargs[0].element_type(), inargs[0].size_vector());
    // call appl_fft, telling it to read its input data
    // directly from inargs and to write its result
    // directly into outarg
    starp_double_t const* indata = inargs[0].data<starp_double_t>();
    starp_double_t* outdata = outarg.data<starp_double_t>();
    appl_fft(inargs[0].number_of_elements(), indata, outdata);
    outargs[0] = outarg;
}

The second step calls the wrapper from the M language, via:

function out = fft(in)
out = appl_fft_wrapper(in);

The third step, executed in a set-up part of the appliance, links this wrapper into the productivity language and then places the application-specific FFT routine at the head of the MATLAB search path. It might look like:

pploadpackage('C','/path/to/package.so','fft');
setpath('/path/to/appliance/');

To preserve compatibility with the unaccelerated source, the core algorithm (the ppeval code above) does not change; it simply picks up the new fft function defined here.
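For instance, a call in the core algorithm now dispatches to appl_fft on the server purely through name resolution (a sketch only; the earlier example used fft2, shown here with fft to match the wrapper defined above):

% unchanged in form -- 'fft' now resolves to the appliance-specific wrapper
% because it sits at the head of the search path
fft_images = ppeval('fft', images);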

Hardware Acceleration

Many performance fanatics have been excited about the potential of the various hardware accelerators now on the market, from GPUs such as the NVIDIA GeForce series and those from AMD/ATI, to field-programmable gate arrays (FPGAs) such as the Xilinx Virtex series, to application-specific integrated circuits such as those from ClearSpeed. While the performance potential is clear for well-suited algorithms, the logistics of using the accelerators in real programs can be daunting for most algorithm developers. Here again, this new generation of productivity tools offers dramatic steps forward. For example, NVIDIA GPUs have FFTs implemented in their accompanying scientific libraries. Thus, for the example above, users don't need to change their core algorithms; they just request the GPU FFT routine, typically running across multiple GeForce chips, instead of the standard one running on the general-purpose cores. This might be requested at the beginning of a Star-P session, as follows:

ppsetoption('use_accelerator','NVIDIA');

Again, compatibility with the desktop language is crucial, so this change can be made conditionally in a set-up portion of the code, leaving the rest of the program oblivious to whether it's running on a single core on the desktop, on multiple general-purpose cores in Star-P, or on multiple accelerators via Star-P.
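A minimal sketch of such a conditional set-up (the exist test for the ppsetoption function is an assumed way of detecting that the code is running under Star-P) might be:

% request GPU FFTs only when the Star-P client functions are available;
% in a plain desktop MATLAB session this block is simply skipped
if exist('ppsetoption')
    ppsetoption('use_accelerator','NVIDIA');
end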

Software developers targeting appliances are coping with demanding requirements for adaptability and performance. A new generation of productivity tools, extending existing desktop languages for parallelism, is delivering much higher performance while preserving the productivity of the desktop.
