Inside the VSIPL++ API

VSIPL++ is a C++ API for high-performance computing. A unique feature is its direct support for parallel applications.


September 05, 2006
URL: http://www.drdobbs.com/cpp/inside-the-vsipl-api/192501827

Mark is the founder and Chief Sourcerer of CodeSourcery. He is also the Release Manager for the GNU Compiler Collection.


VSIPL++ is a C++ API specification for high-performance signal- and image-processing applications. You can use VSIPL++ for radar applications, software-defined radios, or similar systems. VSIPL++ programs perform operations (such as signal-processing or linear algebra operations) on vectors, matrices, and tensors.

A unique feature of the VSIPL++ API is that it includes direct support for parallel applications. In this article, I show how to build a simple parallel signal-processing application with just a few lines of code. Even if you are not interested in this specific application area, you will find VSIPL++ interesting as an example of a high-performance API for parallel numerical computing.

The VSIPL++ API was defined by the VSIPL Forum (http://www.vsipl.org), a group of industrial and governmental partners, with sponsorship from the U.S. Department of Defense. The API design goals were "the three Ps"—productivity, performance, and portability. In other words, it should be easy to write programs, the programs should run fast, and it should be possible to move programs from one system to another.

Sourcery VSIPL++ (http://www.codesourcery.com/vsiplplusplus/) is CodeSourcery's optimized implementation of the VSIPL++ API. Sourcery VSIPL++ is available at no charge under the GNU General Public License (GPL) and under a traditional software license for a fee. Because VSIPL++ is an open API, others are free to develop competing implementations of VSIPL++. The examples I present here do not use any special features of Sourcery VSIPL++, and should work with any implementation of the API.

Computation

The example computation I present here (complete source code available in the source code area) is based on the High-Performance Embedded Computing (HPEC) FIRBANK benchmark (http://www.ll.mit.edu/HPECChallenge/). The various HPEC benchmarks are representative computations performed in signal-processing applications.

The computation performed by the benchmark is to filter signals recorded by each of M sensors. Each sensor produces a digital sample every T seconds. After N samples have been collected, a filtering operation is performed to remove unwanted frequencies from the input. For example, if the sensor in question is part of a sonar system, then perhaps only the frequencies generated by submarine propellers are of interest. Other frequencies, like those generated by marine life, would be filtered out of the sensor data. The filters to apply to each sensor may, in general, be different.

Typically, the M sensors are arranged in a straight line. Therefore, a signal arriving from a distant point reaches one end of the sensor array first, then gradually shows up at the other sensors in the array. By comparing the observations of each sensor, you can determine the bearing of the signal. That subsequent processing would be performed after the filtering operation I describe here.

Computationally, then, the input is an M×N matrix. The entry at position (i,t) in the matrix is the sample value recorded by sensor i at time t. The output is also an M×N matrix containing data in the same form as the original input—but with the uninteresting frequencies removed.

The filter is also an M×N matrix. The entry at position (i,j) in the matrix is the amount by which to amplify (or attenuate) the frequency j/NT hertz in the ith sensor's input data.

(The computation described is also used in situations other than the filtering scenario just described. The same computation is called "pulse compression," if each of the rows is a separate set of samples arriving at a single sensor. It's also called "cross-range processing" when used as part of a synthetic-aperture radar computation.)

Implementing in VSIPL++

There are multiple ways to perform this computation in VSIPL++. Here I use fast Fourier transforms (FFTs) to perform the computation. You could use a finite impulse response (FIR) filter instead of FFTs. VSIPL++ provides direct support for that technique, too.

A forward FFT converts a "time domain" representation of a signal (such as the data received by one of the sensors just described) into a "frequency domain" representation. The frequency representation is like the representation for the filter; it is an N-element vector, where the jth element indicates the phase and amplitude of the component of the signal with frequency j/NT hertz. So, after performing the FFT on row i of the input matrix, you have a vector that gives a frequency decomposition of sensor i's input data.

You then multiply the jth element in that vector by the element (i,j) in the filter matrix. For example, if the filter matrix contains the value zero for a given element, then that frequency component is completely eliminated. Finally, you use an inverse FFT to convert back from the frequency domain to the time domain, giving you the desired output.
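
In symbols, the per-row computation just described can be summarized as (with circle notation for element-wise multiplication):

\mathrm{output}_i = \mathcal{F}^{-1}\bigl( \mathcal{F}(\mathrm{input}_i) \circ \mathrm{filter}_i \bigr)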

You begin by declaring three matrices, corresponding to the original input gathered from the sensors, the filters to be applied to each row, and the output:

typedef complex<float> T;   // complex single-precision samples
Matrix<T> inputs(M, N);     // raw data: one row per sensor
Matrix<T> filters(M, N);    // per-sensor frequency responses
Matrix<T> outputs(M, N);    // filtered results

Of course, you have to initialize the inputs and filters matrices. Normally, the input data would be provided directly by sensors, or by other processing. Here, you initialize the input data with random noise. You initialize the first half of each filter with 1s and the second half with 0s, indicating that low frequencies should be kept and high frequencies discarded:

Rand<T> gen(0);                                  // pseudorandom generator, seed 0
inputs = gen.randu(M, N);                        // fill inputs with uniform noise
Domain<2> low_freq(M, Domain<1>(0, 1, N/2));     // all rows, columns 0 .. N/2-1
Domain<2> high_freq(M, Domain<1>(N/2, 1, N/2));  // all rows, columns N/2 .. N-1
filters(low_freq) = 1;                           // pass low frequencies
filters(high_freq) = 0;                          // discard high frequencies

These assignments demonstrate the "data parallel" syntax used throughout VSIPL++. For example, the last assignment assigns the value zero to every element in the portion of the filters matrix given by the high_freq domain.
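
To make the semantics concrete, here is a plain-loop equivalent of that last assignment. This is only an illustration of what the data-parallel statement means; an implementation is free to fuse, vectorize, or distribute the work:

// Loop equivalent of: filters(high_freq) = 0;
for (length_type i = 0; i < M; ++i)        // every row
  for (length_type j = N/2; j < N; ++j)    // columns N/2 through N-1
    filters.put(i, j, T(0));               // zero one element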

You next create "FFT objects." In VSIPL++, FFTs are objects, rather than operators. Typically (as in the example here), an application performs multiple FFTs on inputs with the same size. In that case, it is more computationally efficient to set up the FFT only once, since the set-up involves memory allocation and, potentially, other configuration. In VSIPL++, constructing the FFT object performs this set-up operation. VSIPL++ supports a variety of different kinds of FFT, so FFTs are themselves template classes. The template parameters tell VSIPL++ what kind of FFT is desired:

typedef Fft<const_Vector, T, T, fft_fwd, by_reference> fwd_fft_type;
typedef Fft<const_Vector, T, T, fft_inv, by_reference> inv_fft_type;
fwd_fft_type fwd_fft(N, 1.0f);      // N-point forward FFT, scaling factor 1
inv_fft_type inv_fft(N, 1.0f / N);  // N-point inverse FFT, scaling factor 1/N

The first two lines declare the types of the FFT objects; the last two lines create the objects. The first parameter to the FFT constructor gives the number of elements over which the FFT is performed; the second gives a "scaling factor" applied to each element of the result. For pragmatic computational reasons, the transforms are unnormalized: if you apply a forward and an inverse FFT in sequence, each with a scaling factor of 1, the output is N times larger than the input. Since you want the input and output to have the same scale, you give the inverse FFT a scaling factor of 1/N.
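
In symbols: for an unnormalized transform pair with explicit scaling factors s_f (forward) and s_i (inverse), the round trip satisfies

s_i \, \mathcal{F}^{-1}\bigl( s_f \, \mathcal{F}(x) \bigr) = N \, s_f \, s_i \, x

so choosing s_f = 1 and s_i = 1/N, as the code above does, makes the round trip an identity.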

Then, to actually perform the computation:

for (length_type i = 0; i < M; ++i) {
  fwd_fft(inputs.row(i), outputs.row(i));  // time domain -> frequency domain
  outputs.row(i) *= filters.row(i);        // apply the filter element-wise
  inv_fft(outputs.row(i));                 // back to the time domain, in place
}

Here, you apply the forward FFT to the data gathered by a sensor, moving from the time domain to the frequency domain. (FFT objects can be called like functions, so even though they are objects, they look like operators once created.) Next, you use the *= operator on vectors to perform element-wise multiplication. (There is a separate function for computing the dot product of two vectors.) Finally, you use the inverse FFT to return to the time domain.

This example demonstrates how easy it is to write code with VSIPL++. It takes just a few lines to perform a useful computation, and the code corresponds closely to the mathematical or algorithmic description of the problem. The VSIPL++ code here almost looks like pseudocode, but it is actually a high-performance implementation of the algorithm.

Performance

Of course, there are other high-productivity environments (such as Matlab) for prototyping numerical applications. However, VSIPL++ is designed to provide high performance in addition to a convenient programming style. Because VSIPL++ can be used on workstations, supercomputers, or embedded devices, it's easy to prototype an algorithm on a workstation, then move it to the target device.

The VSIPL++ API lets you manually specify the layout of data. For example, it is usually more efficient to arrange a matrix in row-major order if performing computations along the rows, but more efficient to arrange the matrix in column-major order if performing computations along the columns. VSIPL++ programmers can pick whichever arrangement is most convenient.
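
For example, the ordering is selected through the block type that backs a view. The following sketch uses the standard Dense block with the API's row2_type and col2_type dimension orderings; the header names are those used by Sourcery VSIPL++, and other implementations may differ:

#include <vsip/initfin.hpp>  // library initialization (assumed header names)
#include <vsip/matrix.hpp>   // Matrix view and Dense block
using namespace vsip;

int main(int argc, char** argv)
{
  vsipl init(argc, argv);    // initialize the library before creating views

  typedef complex<float> T;

  // Row-major storage: each row is contiguous in memory, which suits
  // the row-wise FFTs performed in this article's example.
  typedef Dense<2, T, row2_type> row_block_type;
  Matrix<T, row_block_type> by_rows(64, 1024);

  // Column-major storage: each column is contiguous instead, which
  // would suit column-wise computations.
  typedef Dense<2, T, col2_type> col_block_type;
  Matrix<T, col_block_type> by_cols(64, 1024);

  return 0;
}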

Sourcery VSIPL++ uses several additional techniques to obtain good performance. Sourcery VSIPL++ can dispatch operations (like FFTs) to math libraries that have been carefully tuned for the target system. For example, on Intel processors, the Intel Performance Primitives (IPP; http://www.intel.com/cd/software/products/asmo-na/eng/perflib/ipp/index.htm) provide handwritten code for FFTs. If no library is available for a particular operation, Sourcery VSIPL++ falls back to generic routines.

The generic routines for some computations (like the *= operator used to perform element-wise multiplication in the example) use expression templates to perform loop fusion and eliminate temporaries. In the *= case, a line of code such as this (where tmp is a row of outputs):

tmp *= filters.row(i); 

is transformed into code like this:

for (length_type j = 0; j < N; ++j)
  tmp[j] *= filters.row(i)[j];

This technique is even more effective on code such as:

Vector<T> A(N), B(N), C(N);
A = B + cos(C);

which is transformed into code like:

for (length_type i = 0; i < A.size(0); ++i)
  A[i] = B[i] + cos(C[i]);

which can be compiled to very efficient code.
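
The underlying technique is not specific to VSIPL++. Here is a minimal, self-contained sketch of how expression templates achieve this fusion; none of the names come from the library. operator+ and cos build lightweight nodes instead of computing temporary vectors, and the assignment operator then evaluates the whole expression tree in a single loop:

#include <cmath>
#include <cstddef>
#include <vector>

// A node representing "left + right"; nothing is computed until indexed.
template <typename L, typename R>
struct AddExpr {
  const L& left;
  const R& right;
  float operator[](std::size_t i) const { return left[i] + right[i]; }
};

// A node representing "cos(arg)".
template <typename A>
struct CosExpr {
  const A& arg;
  float operator[](std::size_t i) const { return std::cos(arg[i]); }
};

// Build expression nodes instead of computing temporary vectors.
// (A real library constrains these templates so they apply only to
// its own view and expression types.)
template <typename L, typename R>
AddExpr<L, R> operator+(const L& l, const R& r) { return AddExpr<L, R>{l, r}; }

template <typename A>
CosExpr<A> cos(const A& a) { return CosExpr<A>{a}; }

// A vector whose assignment operator evaluates an entire expression
// tree in one fused loop -- no temporaries are ever allocated.
struct Vec {
  std::vector<float> data;
  explicit Vec(std::size_t n) : data(n) {}
  float operator[](std::size_t i) const { return data[i]; }
  float& operator[](std::size_t i) { return data[i]; }
  template <typename E>
  Vec& operator=(const E& expr) {
    for (std::size_t i = 0; i < data.size(); ++i)
      data[i] = expr[i];                  // the single, fused loop
    return *this;
  }
};

int main() {
  Vec A(4), B(4), C(4);
  A = B + cos(C);   // builds AddExpr<Vec, CosExpr<Vec> >, then one loop
}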

Some compilers (the GNU C++ compiler, for instance) have extensions that can be used to explicitly request that Single-Instruction Multiple-Data (SIMD) units on the processor be used to perform several computations at once. Sourcery VSIPL++ uses these extensions when they are available.
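
For instance, with GCC's documented vector_size extension, a few lines suffice to express a four-wide floating-point addition. This sketch is plain GCC C++, not VSIPL++ code:

// A v4sf holds four packed single-precision floats (16 bytes).
typedef float v4sf __attribute__ ((vector_size (16)));

// The + below operates on all four lanes at once; on an SSE-capable
// x86 processor, GCC typically compiles it to a single addps instruction.
v4sf add4(v4sf a, v4sf b)
{
  return a + b;
}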

The net result of these techniques is that the Sourcery VSIPL++ performance for an application is usually within a hair's breadth of the performance attained by directly using the underlying low-level math libraries. So, VSIPL++ users get productivity and portability, without sacrificing performance.

On occasion, Sourcery VSIPL++ is able to achieve better performance than the underlying math libraries, due to the use of loop fusion. For example, in the cosine example, using a library like IPP, you would have to perform the cosine and addition operations as separate passes. As a result, you would get code that looks more like this:

for (length_type i = 0; i < A.size(0); ++i)
  A[i] = B[i];
for (length_type i = 0; i < A.size(0); ++i)
  A[i] += cos(C[i]);

Because there are two separate loops, there is more overhead and an inferior cache access pattern. The loop fusion used by Sourcery VSIPL++ collapses these two loops into a single loop.

Parallel Computation

The computation in this example is "embarrassingly parallel." Each row of the output matrix depends only on the corresponding rows of the input matrix and filter. So, it is possible to compute each row of the output matrix in parallel, if you have enough processors. VSIPL++ makes this easy, using the concepts of "maps" and "blocks."

A map indicates how a particular view (that is, a matrix or vector) should be divided across processors. The map specifies the data layout for each dimension. In the example, you want to keep all the data in each row together on a single processor, but you want to divide the rows among the processors that you have:

typedef Map<Block_dist, Whole_dist> map_type;
map_type map(Block_dist(num_processors()), Whole_dist());

The first line declares that the map you are creating will divide the first (row) dimension into blocks (rows zero through k-1 in one block, rows k through 2k-1 in the next block, and so forth), while the second (column) dimension will not be distributed at all. The second line creates an actual map, with the rows divided across all processors; k is the number of rows divided by the number of processors. For example, with M = 64 rows and 4 processors, k is 16, so processor 0 holds rows 0 through 15, processor 1 holds rows 16 through 31, and so on.

You also have to tell VSIPL++ that your matrices will be distributed. You do this by providing a distributed block type and the previously created map:

typedef Dense<2, T, row2_type, map_type> block_type;
Matrix<T, block_type> inputs(M, N, map);
Matrix<T, block_type> filters(M, N, map);
Matrix<T, block_type> outputs(M, N, map);

This code tells VSIPL++ that the data should be divided across all of the processors, by rows. With these changes—but no changes to the actual algorithm—the program now performs efficiently in parallel!

VSIPL++'s parallelism model is Single-Program Multiple-Data (SPMD) with the owner-computes rule. Under the SPMD model, the same program runs on every processor, but the data stored on each processor differs. The owner-computes rule means that if element (i,j) of a matrix is the target of an assignment, then the processor where that element is located is responsible for computing the value to assign. If that value requires data from other processors, those processors send the required data to the target processor. Of course, sending data is expensive, so good programmers try to arrange the data so that relatively few messages are required. In this example, no messages are required because each matrix is distributed by rows, and the output data in a given row depends only on the input and filter data for that same row.

Each processor executes the loop shown earlier. That sounds wasteful, because each processor iterates over all of the rows in the matrix, even though it holds only some of them. However, because of the owner-computes rule, VSIPL++ recognizes that no work needs to be done for rows located on another processor, and those iterations complete almost instantaneously. Unless there is a huge number of processors, the cost of the FFTs dominates the cost of skipping the unnecessary iterations.

However, VSIPL++ does let you avoid even this small inefficiency: you can ask for the portion of a distributed matrix located on the current processor. You just need to change the main loop to look like:

const length_type local_M = inputs.local().size(0);
for (length_type i = 0; i < local_M; ++i) {
  fwd_fft(inputs.local().row(i), outputs.local().row(i));
  outputs.local().row(i) *= filters.local().row(i);
  inv_fft(outputs.local().row(i));
}

The local method on a matrix provides the local portion of the matrix. The size method provides the length in the indicated dimension. Now the loop iterates over just the rows assigned to the processor executing the code.

Since parallel VSIPL++ code still works on a uniprocessor system, the modified loop runs even if the matrices are not distributed matrices. In fact, on uniprocessor systems, this parallel code runs just as fast as the original serial code.

Sourcery VSIPL++ uses the Message-Passing Interface (MPI) API to implement parallelism. There are implementations of MPI for all standard cluster operating systems, including GNU/Linux. There are also implementations of MPI for embedded systems, such as those produced by Mercury Computer Systems and CSP Inc. However, as with the math libraries discussed earlier, Sourcery VSIPL++ has been designed so that it does not intrinsically require MPI. In the future, we expect Sourcery VSIPL++ to use other parallel-computing frameworks, such as Mercury's Parallel Applications System (PAS) library.

Figure 1 shows the performance of the parallel implementation of the benchmark running on a number of processors. (The processors are part of a GNU/Linux cluster.) As you can see, the application performance scales almost linearly with the number of processors.

Figure 1: Performance of parallel implementation.

Conclusion

The functionality I discuss in this article is present in the VSIPL++ 1.0 API. The VSIPL Forum is already discussing functionality to include in future versions of VSIPL++. Users can expect that future versions of the API will contain substantially more image-processing functionality.

VSIPL++ is also emerging as a research vehicle for high-performance computing. A group at Northeastern University (http://www.ece.neu.edu/groups/rcl/projects/vsipl/vsipl.html) has been working on integrating Field-Programmable Gate Arrays (FPGAs) with VSIPL++. (This work leverages VSIPL++'s user-defined storage mechanism, which can also be used to implement sparse arrays, or other custom storage formats.)

Because VSIPL++ was originally designed for use on embedded systems, Sourcery VSIPL++ does not currently run on Windows. However, we plan on adding Windows support soon. We also plan to extend the profiling and debugging capabilities of our optimized implementation, add more optimizations, and provide higher level computational algorithms not required by the API specification.

DDJ
