RapidMind: C++ Meets Multicore

RapidMind is a framework for expressing data-parallel computations from within C++ and executing them on multicore processors.


June 08, 2007
URL:http://www.drdobbs.com/high-performance-computing/parallel/rapidmind-c-meets-multicore/199902702

Stefanus is the founder and chief architect of RapidMind, and Michael is chief scientist at RapidMind. They can be contacted at www.rapidmind.net.


You can't go far in the software industry these days without hearing about the inevitability of multicore processors. Herb Sutter may have put it best in his Dr. Dobb's Journal article when he said, "The free lunch is over" (www.ddj.com/dept/architect/184405990). It is telling that this warning was quoted by Intel's Tim Mattson and James Reinders in a web-based presentation that explained why multicore is a reality, and is here to stay. Attempting to scale processor speed using traditional means such as increased clock frequency and deeper pipelines is running into physical limitations, putting an end to the "free lunch" in performance that developers—and users—have become accustomed to. Luckily, the answer is straightforward: Instead of attempting to make more complex single cores, processor makers are putting multiple simpler cores on the same die.

Multicore designs have been used in other processors for some time. Less restricted by traditional architectural designs, graphics processing units (GPUs) and the Cell Broadband Engine (Cell BE) processor (www.ddj.com/dept/64bit/197801624) from Sony, Toshiba, and IBM have demonstrated tremendous performance improvements by employing massively parallel approaches to processor architecture. These processors provide opportunities for high-performance applications. But the multicore revolution isn't limited to these processors. With multicore designs being adopted by CPU vendors such as AMD and Intel, parallel programming is a necessity for all developers.

The Problem with Multicore

Although multicore clearly provides potential for high performance, software needs to explicitly take advantage of the multiple cores to fulfill this potential. Sadly, no compiler turns your serial algorithms expressed using C++ into perfectly parallelized programs that scale to an arbitrary number of cores. Traditional approaches such as multithreading force you to spend time worrying about thread management, instead of designing scalable parallel algorithms. With the number of threads growing as the number of cores does, dealing with bugs caused by deadlocks and race conditions hampers developer productivity significantly. At four or more cores, the complexity of threading becomes a serious problem.

So are there high-level approaches to parallel computing that map well to multicore architectures? After all, parallel computing is nothing new. Architectures such as the Thinking Machines Connection Machine CM-2 sported as many as 65,536 processors in the late '80s. However, past high-level approaches to parallel computing have often focused on new languages or nonstandard extensions to existing languages. This allows a great deal of flexibility in terms of syntax, but it forces developers down a heavy migration path: they must drop established tools and languages. The lack of good approaches to parallel programming that embrace the existing toolchain was one of the factors behind the sluggish adoption of parallel computing in the past.

Heterogeneous Cores

Parallelization is only half the battle. The move to multicore is an opportunity for chip vendors to embrace other interesting architectural changes in an attempt to reach teraflop performance within this decade. We illustrate this by describing the architecture of the Cell BE processor. Each Cell BE contains a total of nine cores. The Power Processing Unit (PPU) is a fairly traditional CPU core, with a PowerPC instruction set architecture and a reasonably sized cache. The other eight cores are referred to as Synergistic Processing Units (SPUs), and use a nontraditional architecture to achieve high performance in a small chip area. Instead of caches, each SPU sports 256 KB of on-chip, locally addressable memory (the local store), in addition to 128 128-bit registers. The SPUs use a vector instruction set that allows, for example, operating on groups of four floating-point numbers at once. Each SPU also features its own Memory Flow Controller (MFC), which can independently issue DMA transfers between main memory and the SPU local stores. The SPU's predictable, dual-issue pipeline allows complete certainty about the optimality of a particular sequence of assembly instructions. This lets a single Cell BE processor using all eight SPUs perform large matrix multiplications at over 200 Gflops, compared to about 12 Gflops for a single traditional CPU core.

The same features that make the Cell such a high-performance chip also make it hard for compilers for languages such as C/C++ to take advantage of. The assumptions these languages make about memory organization and instruction sets simply do not map well to heterogeneous processors such as the Cell. And if you think this kind of architecture is restricted to the domain of game consoles and special-purpose equipment, consider what vendors such as AMD and Intel are saying about their upcoming architectures. AMD's Fusion project aims to merge GPU-like cores with traditional x86-style cores, which, given the similarities between SPUs and GPU processors, will result in an architecture much like that of the Cell. At its 2006 Developer Forum, Intel showed off an 80-core processor prototype capable of more than a teraflop of compute power. Those 80 cores are not traditional x86 cores; this is an entirely different processor architecture. Intel also recently announced the Larrabee project, an attempt at converging GPU and multicore CPU architectures.

So wouldn't it be nice to be able to program with the familiar tools and languages without compromising on performance?

The RapidMind Platform

The RapidMind Development Platform, developed by RapidMind (which Stefanus cofounded), is just such a framework for expressing data-parallel computations from within C++ and executing them efficiently on multicore processors. It lets you specify any computation you want to leverage multiple cores within existing C++ applications, without changing your compiler or workflow. Figure 1 illustrates the RapidMind architecture.


Figure 1: The RapidMind architecture.

The Developer Edition of the RapidMind platform (download for free at www.rapidmind.net) is currently available for a number of Linux and Windows versions, and is supported on 32- and 64-bit systems.

There are three basic types you need to be aware of when programming with RapidMind: Value, Program, and Array. All three become available by including the platform header:


#include <rapidmind/platform.hpp>

Values

Starting with the Value type, you can declare two values, initialize them, add them, and place the result in a third value:


Value1f x(1.0f), y(2.0f);
Value1f z = x + y;

The 1 and f portions of the type name describe the value. The 1 means that this value contains a single scalar; values can contain any fixed number of scalars, but they usually contain between one and four elements. The f stands for float; values can contain any standard C++ type, as well as some nonstandard ones (most notably, half-precision floating-point types). Values are templated: Value1f is actually a typedef for Value<1, float>, and typedefs for up to four elements of all basic types are provided by the platform.
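
For illustration, here is a minimal sketch of these equivalences. The Value4f typedef name and its four-argument constructor are assumptions based on the naming pattern just described, not code taken from the article:

// Value1f is just a typedef for the general template, so these two
// declarations are equivalent.
Value<1, float> a(1.0f);
Value1f         b(1.0f);

// Assumed: analogous typedefs for other sizes and element types follow
// the same pattern, e.g. Value4f for Value<4, float>, with a matching
// multi-argument constructor.
Value4f color(1.0f, 0.0f, 0.0f, 1.0f);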

Programs

The previous example of using values might not make them seem particularly interesting. The computation, if inserted into a C++ function, would simply happen immediately in the same thread the rest of the function is executing in. Values become more interesting when they are combined with the Program type. This code illustrates a program definition using RapidMind:


Program add_two_numbers = RM_BEGIN {
  In<Value1f> x, y;
  Out<Value1f> z;
  z = x + y;
} RM_END;

This trivial program captures the same computation we performed directly on values above. However, the computation does not execute immediately. Instead, it is stored in the program object add_two_numbers, and can later be used to compute the sum of two numbers. When a program object is defined, every computation on RapidMind types between the RM_BEGIN and RM_END statements is collected and stored within the object. This process happens at runtime and does not require any special preprocessing or compiler modifications. Programs can be defined in any function, but typically a program is defined in a constructor of a class encapsulating some computation.
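
A minimal sketch of that pattern follows. The Adder class is hypothetical, and assigning an RM_BEGIN/RM_END block to an already constructed Program member is an assumption; the article only shows direct initialization:

// Hypothetical class that encapsulates a computation. The program object
// is built once, in the constructor, and kept for later use.
class Adder {
public:
  Adder() {
    // Assumed: a collected RM_BEGIN/RM_END block can be assigned to an
    // existing Program object.
    m_add = RM_BEGIN {
      In<Value1f> x, y;
      Out<Value1f> z;
      z = x + y;
    } RM_END;
  }
  Program m_add;  // applied to data elsewhere, e.g. c = adder.m_add(a, b);
};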

Arrays

Adding two numbers is not a very parallel operation. Adding two arrays of numbers, however, can be parallelized assuming the arrays are large enough. RapidMind lets program objects be called on entire arrays at once:


Array<1, Value1f> a(10000), b(10000), c;
for (int i = 0; i < 10000; i++) {
  a.write_data()[i] = ...;
  b.write_data()[i] = ...;
}
c = add_two_numbers(a, b);


The first line declares three arrays. The arrays a and b are initialized to make space for 10,000 elements each, whereas c is initially empty. Just like Value, Array is a simple class template. The first template parameter specifies the dimensionality of the array (one, two, or three), and the second specifies the type of the elements the array holds.

In the next four lines we simply initialize our input arrays with some data. The write_data() function obtains a plain C++ pointer to the data—in this case, a float*—that can be used to modify the array. A similar read_data() function can be used for read-only accesses. Distinguishing between the two helps the platform understand when it needs to move data around; for example, to do a copy-on-write operation.

The last line of the snippet calls upon our program object, add_two_numbers, to perform the computation. Note how the program object is used just as though it were a C++ function. Now, even though this program takes two values as its input, and produces another value, you can call it on the entire array. This is effectively the same as calling the program once for each entry in the array; for example, by looping through the array. However, by applying the program to entire arrays at once, the parallelism is explicit. The platform knows that this computation can be executed independently for each element it computes, and therefore knows that the computation can be split across an arbitrary number of cores.
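
To bring the results back into ordinary C++ code, the read_data() accessor mentioned above can be used. A minimal sketch; treating its return value as a const float* is an assumption based on the description of write_data():

// Read the results back on the host. read_data() gives read-only access,
// analogous to write_data(); the const float* return type is assumed.
const float* result = c.read_data();
float total = 0.0f;
for (int i = 0; i < 10000; i++)
  total += result[i];   // each entry holds the corresponding a[i] + b[i]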

Backends and Runtime Compilation

When this last line runs, the platform actually does a little more than just execute the program object. First, it decides which "backend" should be responsible for the program's execution. Backends form the connection between the RapidMind platform and a particular piece of target hardware. For example, RapidMind currently ships with backends for the Cell BE, OpenGL-based GPUs, and a fallback backend that generates and compiles C code on the fly. A multicore x86 backend will be released later this year, and other backends are in the works.

Once a suitable backend has been chosen (a process that is generally instantaneous), it is asked to execute the program under the given conditions. The first time this happens, the program object is generally compiled for that particular backend, similar to the way a JIT environment behaves. Once a program has been compiled, it is not recompiled. Compilation can also be forced using the compile() function.
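
For example, an application might compile its programs up front, during initialization, rather than paying the cost on first use. The signature of compile() is not shown here, so the no-argument member call below is an assumption:

// Force compilation during startup so the first real invocation of the
// program does not pay the JIT-style compilation cost. Calling compile()
// with no arguments on the program object is an assumed form.
add_two_numbers.compile();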

This runtime compilation mechanism is powerful, as the generated code is optimized for the exact conditions it is being run under. This is one reason why RapidMind-generated code outperforms plain C or even hand-tuned assembly code in many cases.

A More Interesting Example

The previous program was a trivial example. Program objects typically contain more statements, and include function calls, random-access array reads, control flow, and the like. The following program, which computes a Julia set fractal, is more complicated, but still fairly basic:


// width, height, scale, c, and max_iterations are defined elsewhere in
// the application (see the full source at www.ddj.com/code/).
Program julia2d = RM_BEGIN {
  In<Value2i> index;
  Out<Value1f> shade;
  // Map the pixel index to a point in the complex plane.
  std::complex<Value1f> z(index[0]/(Value1f)width - 0.5f,
                          index[1]/(Value1f)height - 0.5f);
  z *= scale;
  Value1ui i;
  // Iterate z = z*z + c until it diverges or the iteration limit is reached.
  RM_FOR (i = 0, i < max_iterations && std::norm(z) < 4.0f, i++) {
    z = z * z + c;
  } RM_ENDFOR;
  shade = i/(Value1f)max_iterations;
} RM_END;

This program (available at www.ddj.com/code/) takes as its input the location of a pixel in an image, then computes a shade for that particular pixel by iterating a complex arithmetic expression and checking whether it diverges; see Figure 2. To perform the complex arithmetic, we use the standard C++ class std::complex with a RapidMind value type. Because RapidMind value types act like numeric types in C++, this works out of the box. It is often possible to convert a piece of existing C++ code to RapidMind by simply replacing basic C++ types with RapidMind types.

Figure 2: Image produced by sample program.
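
To illustrate that last point, here is a sketch (not taken from the article) of a small scalar computation and its RapidMind counterpart, obtained by simply swapping float for Value1f:

// Plain C++: squared magnitude of a 2D point.
float norm2(float x, float y) {
  return x * x + y * y;
}

// The same computation expressed with RapidMind types: float becomes
// Value1f and the expression is wrapped in a program definition.
Program norm2_program = RM_BEGIN {
  In<Value1f> x, y;
  Out<Value1f> result;
  result = x * x + y * y;   // identical arithmetic, collected at runtime
} RM_END;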

Control Flow

The Julia set program also illustrates control flow. The main computation encapsulated in the program object, a complex multiplication and addition, is repeated until a maximum number of iterations is reached or the norm of the complex number grows greater than four. Because the latter condition depends on a computation being performed in the program object, we cannot simply use a C++ for loop here. The for loop would attempt to evaluate std::norm(z) < 4.0f as a bool, but the value of z is not yet determined because the program is merely being collected, not executed. You can use C++ for loops in program definitions, but their conditions need to be based on non-RapidMind types. For example, it is possible to run a loop that iterates a constant number of times and unrolls the RapidMind computations it contains.
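
For instance, a plain C++ loop whose trip count does not depend on RapidMind values runs during collection, and the RapidMind operations in its body are simply recorded, that is, unrolled, into the program. A minimal sketch:

// The loop condition involves only the plain C++ int i, so the loop runs
// while the program is being collected and its body is unrolled: the
// multiplication is recorded twice, yielding y = x*x*x.
Program cubed = RM_BEGIN {
  In<Value1f> x;
  Out<Value1f> y;
  y = x;
  for (int i = 0; i < 2; i++) {   // ordinary C++ loop, fixed trip count
    y = y * x;
  }
} RM_END;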

RapidMind does allow dynamic control flow in programs using constructs such as RM_FOR, RM_IF, and RM_WHILE. These constructs generally behave like their C++ counterparts, but are collected into the program. By letting you use control constructs, RapidMind exposes a Single Program Multiple Data (SPMD) programming model, as opposed to a Single Instruction Multiple Data (SIMD) model. This is one of the reasons why RapidMind has an explicit Program object. In systems where basic operations would be executed on arrays and those operations collected into program objects implicitly, control flow is difficult or impossible to express, reducing the generality of the programming model as well as posing performance problems.
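
As a sketch of such data-dependent branching: RM_IF is named above, but the RM_ELSE and RM_ENDIF forms used here are assumptions made by analogy with RM_ENDFOR:

// A branch that depends on a RapidMind value, collected into the program.
// RM_ELSE and RM_ENDIF are assumed by analogy with RM_ENDFOR.
Program clamp_to_one = RM_BEGIN {
  In<Value1f> x;
  Out<Value1f> y;
  RM_IF (x > 1.0f) {
    y = Value1f(1.0f);
  } RM_ELSE {
    y = x;
  } RM_ENDIF;
} RM_END;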

The platform provides a number of other operations that let you express almost any kind of computation. In addition to functions and operators that can be used on values, the platform provides operations that affect entire arrays. For example, array contents can be rearranged using "access patterns", and "collective computations" can be performed over an entire array, for instance to find the sum of all its entries. Table 1 lists some of the types and functions.

Types: Value, Array, Program, Bundle, Matrix, Exception, ...
Arithmetic: +, -, *, /, %, ...
Trigonometric & Exponential: sin, atan2, exp10, pow, ...
Geometry: dot, cross, distance, ...
Logical: <, >, &&, ||, cond, all, any, ...
Miscellaneous: lerp, join, cast, ...
Control Flow: RM_FOR, RM_IF, RM_WHILE, RM_DO, ...
Array Access: [], (), read_data, boundary, adopt_array, ...
Access Patterns: take, offset, shift, slice, ...
Virtual Arrays: grid, dice, ...
Collective: reduce, sum, min, max, count, exists, ...

Table 1: Some types and functions.
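
As an example of a collective computation, sum (listed in Table 1) can reduce a whole array to one result. A minimal sketch; treating the result as a single Value1f is an assumption:

// Collective computation over an entire array. sum is listed in Table 1;
// the Value1f result type is assumed.
Array<1, Value1f> data(10000);
// ... fill data via write_data(), as shown earlier ...
Value1f total = sum(data);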

Real Applications

A number of applications written using the RapidMind platform illustrate the performance gains. One such example comes from RTT AG (www.rtt.ag), a provider of automotive visualization software. RTT's RealTrace software was built using RapidMind and is the world's first real-time raytracer for workstations. It lets accurate reflections and refractions be displayed on interactive models; see Figure 3. By leveraging GPUs to perform the complicated computations involved in raytracing, RTT was able to put a product on the market that surpasses the state of the art in raytracing performance. The same code was ported to the Cell BE processor within three weeks and shown as a demonstration in IBM's SIGGRAPH 2006 booth.

Figure 3: RTT's RealTrace application produces real-time, ray-traced images.

With quad-core CPUs now available (and larger numbers of cores on the horizon), we are preparing for the pervasiveness of multicore systems. To this end, we recently demonstrated a prototype of a RapidMind x86 multicore backend on a financial modeling application; see Figure 4. Compared to hand-tuned C code using the Intel C++ Compiler, the RapidMind version of the algorithm was able to achieve twice the performance on a single core, scaling to eight times the performance on four cores, with no additional application effort. The increased performance on a single core may come as a surprise. It stems from the fact that the semantics of the platform let our code generators perform much better analysis and optimization of the application code. We have been careful to design the system so that we are not plagued by issues (such as pointer aliasing) that inhibit effective optimization on C or C++. At the same time, our system integrates so cleanly with C++ that all the modularity constructs of the language are available to the developer, generally at no performance cost.


Figure 4: Performance comparison between a tuned C++ implementation and a RapidMind-enabled version running on an x86 multicore backend prototype.

Likewise, Hewlett-Packard and RapidMind performed comparisons between CPUs and GPUs for several scientific algorithms, including the Fast Fourier Transform (FFT) and the Basic Linear Algebra Subprograms (BLAS) single-precision matrix multiply (SGEMM). The results showed that the RapidMind implementations on a GPU were between 2.4 and 32.2 times as fast as the same algorithms running on a CPU core. For the CPU comparisons, highly tuned numerical libraries were used.
