
OpenMP: A Portable Solution for Threading


Effective Use of Reductions

In large applications, you can often find reduction operations inside hot loops. Loops that reduce a collection of values to a single value are fairly common. Consider the following simple loop, which sums the return values of the integer function func(k), called with the loop index as its input.


sum = 0;
for ( k = 0; k < 100; k++ ){
    sum = sum + func(k); // "func" has no side-effects
}

It looks as though the loop-carried dependence on sum would prevent threading. However, on a dual-core processor system you can perform privatization -- that is, give each thread its own stack variable "temp", allocated from automatic storage -- and partition the loop so that the two halves of the calls are summed in parallel, as shown in the following example.

Thread 0:



temp = 0;
for (k=0; k<50; k++) {
    temp = temp + func(k);
}

lock(&sum);
sum = sum + temp;
unlock(&sum);

Thread 1:


temp = 0;
for (k=50; k<100; k++) {
    temp = temp + func(k);
}

lock(&sum);
sum = sum + temp;
unlock(&sum);

At the synchronization point, the partial sums from each thread are combined to produce the final result. To perform this form of recurrence calculation in parallel, the operation must be mathematically associative and commutative. Notice the tension around the variable sum: it must be shared so that the combined result is visible after the loop, yet every update by a thread must be protected with a lock or critical section to avoid a data race. To solve the problem of both sharing and protecting sum without using a lock inside the threaded loop, OpenMP provides the reduction clause, which efficiently combines certain associative arithmetic reductions of one or more variables in a loop. The following loop uses the reduction clause to generate the correct results.


sum = 0;
#pragma omp parallel for reduction(+:sum)
    for (k = 0; k < 100; k++) {
        sum = sum + func(k);
    }

Given the reduction clause, the compiler creates private copies of the variable sum for each thread, and when the loop completes, it adds the values together and places the result in the original variable sum.

Other reduction operators besides "+" exist. Table 3 lists the C/C++ reduction operators specified in the OpenMP standard, along with the initial values (which are also the mathematical identity values) of the temporary private variables. You can also find a list of Fortran reduction operators, along with their initial values, in the OpenMP specification, Version 2.5.

Table 3: Reduction Operators and Reduction Variable's Initial Value in OpenMP.

  Operator    Initial Value
  +           0
  *           1
  -           0
  &           ~0
  |           0
  ^           0
  &&          1
  ||          0

For each variable specified in a reduction clause, a private copy is created for each thread, as if the private clause were used. Each private copy is then initialized to the operator's initialization value, as listed in Table 3. At the end of the region or loop for which the reduction clause was specified, the original reduction variable is updated by combining its original value with the final value of each private copy, using the specified operator. As you identify opportunities to use the reduction clause for threading, keep the following three points in mind.

  • The value of the original reduction variable becomes undefined when the first thread reaches the region or loop that specifies the reduction clause, and remains undefined until the reduction computation is complete.
  • If the reduction clause is used on a loop to which the nowait clause is also applied, the value of the original reduction variable remains undefined until a barrier synchronization is performed, ensuring that all threads have completed the reduction.
  • The order in which the values are combined is unspecified. Therefore, neither comparing a sequential run with a parallel run, nor comparing two parallel runs, is guaranteed to produce bit-identical results or identical side effects, such as floating-point exceptions.

Minimizing Threading Overhead

Using OpenMP, you can parallelize loops, regions, and sections, or straight-line code blocks, whenever dependences do not forbid their parallel execution. In addition, because OpenMP employs the simple fork-join execution model, it allows the compiler and run-time library to compile and run OpenMP programs efficiently with low threading overhead. However, you can improve your application's performance by reducing threading overhead further.

Table 4 provides measured costs of a set of OpenMP constructs and clauses on a 4-way Intel Xeon processor-based system running at 3.0 GHz with the Intel compiler and runtime library. You can see that the cost of each construct or clause is small; most are less than 7 microseconds. The exception is the schedule(dynamic) clause, which takes 50 microseconds because its default chunk size of 1 is too small. If you use schedule(dynamic,16), the cost drops to 5.0 microseconds. Note that all measured costs are subject to change on a different processor or system configuration. The key point is that no matter how well the compiler and runtime are tuned to minimize the overhead of OpenMP constructs and clauses, you can always find ways to reduce overhead further by using OpenMP more effectively.

Table 4: Measured Cost of OpenMP Constructs and Clauses.

Earlier, you saw how the parallel for pragma could be used to split the iterations of a loop across multiple threads. When the compiler-generated code executes, the iterations of the loop are distributed among threads. At the end of the parallel region, the threads are suspended and wait for the next parallel region, loop, or sections. A suspend or resume operation, while significantly lighter weight than a create or terminate operation, still creates overhead and may be unnecessary when two parallel regions, loops, or sections are adjacent, as shown in the following example.


#pragma omp parallel for
for ( k = 0; k < m; k++ ) {
    fn1(k); fn2(k);
}

#pragma omp parallel for // adds unnecessary overhead
for ( k = 0; k < m; k++ ) {
    fn3(k); fn4(k);
}

The overhead can be removed by entering a parallel region once, then dividing the work within the parallel region. The following code is functionally identical to the preceding code but runs faster, because the overhead of entering a parallel region is performed only once.


#pragma omp parallel
{
    #pragma omp for
    for ( k = 0; k < m; k++ ) {
        fn1(k); fn2(k);
    }

    #pragma omp for
    for ( k = 0; k < m; k++ ) {
        fn3(k); fn4(k);
    }
}

Ideally, all performance-critical portions of the application would be executed within a parallel region. Since very few programs are comprised only of loops, additional constructs are used to handle non-loop code. A work-sharing section is one such construct.

