Parallel Computation
The computation in this example is "embarrassingly parallel." Each row of the output matrix depends only on the corresponding rows of the input matrix and filter. So, it is possible to compute each row of the output matrix in parallel, if you have enough processors. VSIPL++ makes this easy, using the concepts of "maps" and "blocks."
A map indicates how a particular view (that is, a matrix or vector) should be divided across processors. The map specifies the data layout for each dimension. In the example, you want to keep all the data in each row together on a single processor, but you want to divide the rows among the processors that you have:
typedef Map<Block_dist, Whole_dist> map_type;
map_type map = map_type(Block_dist(num_processors()), Whole_dist());
The first line declares that the map you are creating will divide the first (row) dimension into blocks (rows zero through k-1 will be in one block, rows k through 2k-1 in the next block, and so forth), while the second (column) dimension will not be distributed at all. The second line creates an actual map, with the rows divided across all processors; k will be the number of rows divided by the number of processors.
You also have to tell VSIPL++ that your matrices will be distributed. You do this by providing a distributed block type and the previously created map:
typedef Dense<2, T, row2_type, map_type> block_type;
Matrix<T, block_type> inputs(M, N, map);
Matrix<T, block_type> filters(M, N, map);
Matrix<T, block_type> outputs(M, N, map);
This code tells VSIPL++ that the data should be divided across all of the processors, by rows. With these changes, but no changes to the actual algorithm, the program now performs efficiently in parallel!
VSIPL++'s parallelism model is Single-Program Multiple-Data (SPMD) with the owner-computes rule. Under the SPMD model, the same program runs on every processor, but the data stored on each processor is different. The owner-computes rule means that if element (i,j) of a matrix is the target of an assignment, then the processor where that element is located is responsible for computing the value to assign. If that value requires data from other processors, those processors send the required data to the target processor. Of course, sending data is expensive, so good programmers try to arrange the data so that relatively few messages are required. In this example, no messages are required at all: each matrix is distributed by rows, and the output data in a single row depends only on the input and filter data for that same row.
Each processor executes the loop shown earlier. That may sound wasteful, because each processor loops over all of the rows in the matrix, even though it holds only some of them. However, because of the owner-computes rule, VSIPL++ recognizes that no work needs to be done for rows located on other processors, and those rows are skipped almost instantaneously. Unless there is a huge number of processors, the cost of the FFTs will dominate the cost of skipping the unnecessary computation.
However, VSIPL++ does offer a way to avoid even this small inefficiency: you can ask for the portion of a distributed matrix located on the current processor. You just need to change the main loop to look like:
const length_type local_M = inputs.local().size(0);
for (length_type i = 0; i < local_M; ++i)
{
  fwd_fft(inputs.local().row(i), outputs.local().row(i));
  outputs.local().row(i) *= filters.local().row(i);
  inv_fft(outputs.local().row(i));
}
The local method on a matrix provides the local portion of the matrix. The size method provides the length in the indicated dimension. Now the loop iterates over just the rows assigned to the processor executing the code.
Since parallel VSIPL++ code still works on a uniprocessor system, the modified loop runs even if the matrices are not distributed matrices. In fact, on uniprocessor systems, this parallel code runs just as fast as the original serial code.
Sourcery VSIPL++ uses the Message-Passing Interface (MPI) API to implement parallelism. There are implementations of MPI for all standard cluster operating systems, including GNU/Linux. There are also implementations of MPI for embedded systems, such as those produced by Mercury Computer Systems and CSP Inc. However, as with the math libraries discussed earlier, Sourcery VSIPL++ has been designed so that it does not intrinsically require MPI. In the future, we expect Sourcery VSIPL++ to use other parallel-computing frameworks, such as Mercury's Parallel Applications System (PAS) library.
Figure 1 shows the performance of the parallel implementation of the benchmark running on a number of processors. (The processors are part of a GNU/Linux cluster.) As you can see, the application performance scales almost linearly with the number of processors.