Dr. Dobb's is part of the Informa Tech Division of Informa PLC




Inside the VSIPL++ API


Parallel Computation

The computation in this example is "embarrassingly parallel." Each row of the output matrix depends only on the corresponding rows of the input matrix and filter. So, it is possible to compute each row of the output matrix in parallel, if you have enough processors. VSIPL++ makes this easy, using the concepts of "maps" and "blocks."

A map indicates how a particular view (that is, a matrix or vector) should be divided across processors. The map specifies the data layout for each dimension. In the example, you want to keep all the data in each row together on a single processor, but you want to divide the rows among the processors that you have:

typedef Map<Block_dist, Whole_dist> map_type;
map_type map = map_type(Block_dist(num_processors()), Whole_dist());


The first line declares that the map you are creating will divide the first (row) dimension into blocks (rows zero through k-1 will be in one block, rows k through 2k-1 in the next block, and so forth), while the second (column) dimension will not be distributed at all. The second line creates an actual map, with the rows divided across all processors; k will be the number of rows divided by the number of processors.

You also have to tell VSIPL++ that your matrices will be distributed. You do this by providing a distributed block type and the previously created map:

typedef Dense<2, T, row2_type, map_type> block_type;
Matrix<T, block_type> inputs(M, N, map);
Matrix<T, block_type> filters(M, N, map);
Matrix<T, block_type> outputs(M, N, map);

This code tells VSIPL++ that the data should be divided across all of the processors, by rows. With these changes—but no changes to the actual algorithm—the program now performs efficiently in parallel!

VSIPL++'s parallelism model is Single-Program Multiple-Data (SPMD) with the owner-computes rule. Under the SPMD model, the same program runs on every processor—but the data stored on each processor is different. The owner-computes rule means that if element (i,j) of a matrix is the target of an assignment, then the processor where that element is located is responsible for computing the value to assign. If that value requires data from other processors, those processors send the required data to the target processor. Of course, sending data is expensive, so good programmers try to arrange the data so that relatively few messages are required. In this example, no messages are required because each matrix is distributed by rows, and the output data in a single row depends only on the input and filter data for that row.
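The owner-computes rule can be seen in miniature with a plain C++ simulation (all names here are our own, not VSIPL++ API): P "processors" each run the same loop over all M rows, but each writes only the rows it owns, so each row is computed exactly once and no communication is needed:

```cpp
#include <vector>

// Simulate P SPMD "processors", each running the same loop over all
// M rows but computing only the rows it owns (owner-computes rule).
// Assumes P divides M.
std::vector<int> run_spmd(int M, int P)
{
  std::vector<int> out(M, 0);
  int k = M / P;                        // rows per processor
  for (int rank = 0; rank < P; ++rank)  // each rank stands in for one process
    for (int i = 0; i < M; ++i)
      if (i / k == rank)                // skip rows owned by other processors
        out[i] = i * i;                 // stand-in for the per-row FFT work
  return out;
}
```

Because each out[i] has exactly one owner, the ranks never touch each other's rows—which is why the row-distributed example in the article sends no messages at all.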

Each processor executes the aforementioned loop. That sounds bad, because each processor will loop over all of the rows in the array, even though it holds only some of those rows. However, because of the owner-computes rule, VSIPL++ recognizes that no work needs to be done for the rows located on another processor, and those rows are therefore processed almost instantaneously. Unless there is a huge number of processors, the cost of the FFTs will dominate the cost of skipping the unnecessary computation.

However, VSIPL++ does let you avoid even this relatively small inefficiency: you can ask for the portion of a distributed matrix located on the current processor. You just need to change the main loop to look like:

const length_type local_M = inputs.local().size(0);
for (length_type i = 0; i < local_M; ++i)
{
  fwd_fft(inputs.local().row(i), outputs.local().row(i));
  outputs.local().row(i) *= filters.local().row(i);
  inv_fft(outputs.local().row(i));
}

The local method on a matrix provides the local portion of the matrix. The size method provides the length in the indicated dimension. Now the loop iterates over just the rows assigned to the processor executing the code.

Since parallel VSIPL++ code still works on a uniprocessor system, the modified loop runs correctly even if the matrices are not distributed. In fact, on uniprocessor systems, this parallel code runs just as fast as the original serial code.

Sourcery VSIPL++ uses the Message-Passing Interface (MPI) API to implement parallelism. There are implementations of MPI for all standard cluster operating systems, including GNU/Linux. There are also implementations of MPI for embedded systems, such as those produced by Mercury Computer Systems and CSP Inc. However, as with the math libraries discussed earlier, Sourcery VSIPL++ has been designed so that it does not intrinsically require MPI. In the future, we expect Sourcery VSIPL++ to use other parallel-computing frameworks, such as Mercury's Parallel Applications System (PAS) library.

Figure 1 shows the performance of the parallel implementation of the benchmark running on a number of processors. (The processors are part of a GNU/Linux cluster.) As you can see, the application performance scales almost linearly with the number of processors.

Figure 1: Performance of parallel implementation.

