OpenMP: A Portable Solution for Threading


Interleaving Single-Thread and Multi-Thread Execution

In large real-world applications, a program may consist of both serial and parallel code segments for reasons such as data dependence constraints and I/O operations. Within a parallel region, you will often need to execute something only once, by only one thread, especially because you are making parallel regions as large as possible to reduce overhead. To handle this, OpenMP provides a way to specify that a sequence of code contained within a parallel section should be executed only one time by only one thread. The OpenMP runtime library decides which single thread will do the execution. If need be, you can specify that only the master thread, the thread that actually started the program execution, executes the code, as in the following example.


#pragma omp parallel
{ // every thread calls this function
  int tid = omp_get_thread_num();

  // this loop is divided among the threads
  #pragma omp for nowait
  for ( k = 0; k < 100; k++ ) x[k] = fn1(tid);
  // no implicit barrier at the end of the above loop, so
  // threads do not synchronize here

  #pragma omp master
  y = fn_input_only(); // only the master thread calls this

  // adding an explicit barrier to synchronize all threads
  // to make sure x[0-99] and y are ready for use
  #pragma omp barrier

  // again, this loop is divided among the threads
  #pragma omp for nowait
  for ( k = 0; k < 100; k++ ) x[k] = y + fn2(x[k]);
  // The above loop does not have an implicit barrier, so
  // threads will not wait for each other.

  // One thread (presumably the first one done with the above
  // loop) will continue and execute the following code.
  #pragma omp single
  fn_single_print(y); // only one of the threads calls this

  // The above single construct has an implicit barrier,
  // so all threads synchronize here before printing x[].

  #pragma omp master
  fn_print_array(x); // only the master thread prints x[]
}

As can be seen from the comments in this code, a remarkable amount of synchronization and management of thread execution is available in a comparatively compact lexicon of pragmas. Note that all low-level details are taken care of by the OpenMP compiler and runtime; what you need to focus on is specifying the parallel computation and synchronization behaviors you expect for correctness and performance. In other words, by using the single and master pragmas along with the barrier pragma and nowait clause in a clever way, you should be able to maximize the scope of a parallel region and the overlap of computations to reduce threading overhead effectively, while obeying all data dependence and I/O constraints in your programs.

Data Copy-in and Copy-out

When you parallelize a program, you normally have to deal with copying in the initial value of a private variable to initialize the private copy for each thread in the team, and with copying out the value of the private variable computed in the last iteration/section to the original variable in the master thread at the end of the parallel region. The OpenMP standard provides four clauses for accomplishing these data copy-in and copy-out operations whenever necessary, based on your program and parallelization scheme: firstprivate, lastprivate, copyin, and copyprivate. The following descriptions summarize their semantics:

  • firstprivate provides a way to initialize each thread's private copy of a variable with the value of the variable in the master thread. Normally, temporary private variables start with an undefined value, which saves the performance overhead of the copy.
  • lastprivate provides a way to copy out the value of the private variable computed in the last iteration/section to the copy of the variable in the master thread. Variables can be declared both firstprivate and lastprivate at the same time, as the sketch after this list shows.
  • copyin provides a way to copy the master thread's threadprivate variable to the threadprivate variable of each other member of the team executing the parallel region.
  • copyprivate provides a way to use a private variable to broadcast a value from one member of the team to the other members executing the parallel region. The copyprivate clause is only allowed on the single construct; the broadcast action is completed before any of the threads in the team leave the barrier at the end of the construct.
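
Here is a minimal sketch of the first two clauses in action; the variable t and the loop body are illustrative only. Each thread's private t starts at 5 because of firstprivate, and after the loop t holds the value computed in the sequentially last iteration (k == 99) because of lastprivate:


#include <stdio.h>

int main()
{
    int k, t = 5;

    #pragma omp parallel for firstprivate(t) lastprivate(t)
    for ( k = 0; k < 100; k++ )
        t = t + k;   // each thread starts from its own copy of t == 5

    printf("t = %d\n", t);   // t from the thread that ran k == 99
    return 0;
}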

For a more realistic example, consider the following code, which converts a color image to black and white.


for ( row = 0; row < height; row++ ) {
    for ( col = 0; col < width; col++ ) {
        pGray[col] = (BYTE)
            ( pRGB[col].red * 0.299 +
              pRGB[col].green * 0.587 +
              pRGB[col].blue * 0.114 );
    }
    pGray += GrayStride;
    pRGB += RGBStride;
}

The issue is how to move the pointers pGray and pRGB to the correct place within the bitmap while threading the outer "row" loop. The address computation for each pixel can be done with the following code:


pDestLoc = pGray + col + row * GrayStride;
pSrcLoc = pRGB + col + row * RGBStride;

The above code, however, performs extra math on each pixel for the address computation. Instead, the firstprivate clause can be used to perform the necessary initialization of the pointers pGray and pRGB for each thread. Notice that their initial addresses have to be computed only once per thread, based on the thread's starting "row" number and the pointers' initial values in the master thread; afterwards, pGray and pRGB are induction pointers, updated in the outer loop on every "row" iteration. This is why the bool-type variable doInit is introduced with an initial value of TRUE: it ensures that the per-thread computation of the initial addresses of pGray and pRGB is done only once. The parallelized code follows:


#pragma omp parallel for private(row, col) firstprivate(doInit, pGray, pRGB)
for ( row = 0; row < height; row++ ) {
    // Need this init test to be able to start at an
    // arbitrary point within the image after threading.
    if (doInit == TRUE) {
        doInit = FALSE;
        pRGB += ( row * RGBStride );
        pGray += ( row * GrayStride );
    }
    for ( col = 0; col < width; col++ ) {
        pGray[col] = (BYTE) ( pRGB[col].red * 0.299 +
                              pRGB[col].green * 0.587 +
                              pRGB[col].blue * 0.114 );
    }
    pGray += GrayStride;
    pRGB += RGBStride;
}

If you take a close look at this code, you may find that the four variables GrayStride, RGBStride, height, and width are read-only; that is, no write operation is performed on them in the parallel loop. Thus, you can also specify them on the parallel for loop by adding the clause below:


firstprivate (GrayStride, RGBStride, height, width)

You may get better performance in some cases, because privatizing these loop invariants helps the compiler perform more aggressive registerization and code motion, reducing memory traffic.
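
Putting both pieces together, the loop header from the example above would then read (same directive, extended clause list):


#pragma omp parallel for private(row, col) firstprivate(doInit, pGray, pRGB, GrayStride, RGBStride, height, width)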

Protecting Updates of Shared Variables

The OpenMP standard provides the critical and atomic pragmas for protecting updates of shared variables and avoiding data-race conditions. A code block enclosed by a critical section or an atomic pragma is an area of code that may be entered only when no other thread is executing in it. The following example uses an unnamed critical section.


#pragma omp critical
{
    if ( max < new_value ) max = new_value;
}

Global, or unnamed, critical sections are likely to hurt performance unnecessarily, because every thread competes to enter the same global critical section and execution is serialized across all threads. This is rarely what you want. For this reason, OpenMP offers named critical sections. Named critical sections enable fine-grained synchronization, so only the threads that need to block on a particular section will do so. The following example shows code that improves on the previous example. In practice, named critical sections are used when more than one thread is competing for more than one critical resource.


#pragma omp critical(maxvalue)
{
    if ( max < new_value ) max = new_value;
}
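
For instance, when threads maintain both a running maximum and a running minimum, each bound can get its own named section, so a thread updating the maximum never blocks a thread updating the minimum. This is a minimal sketch; the minvalue name and the min variable are assumptions added here for illustration:


#pragma omp critical(maxvalue)
{
    if ( max < new_value ) max = new_value;
}

#pragma omp critical(minvalue)
{
    if ( min > new_value ) min = new_value;
}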

With named critical sections, applications can have multiple critical sections, and threads can be in more than one critical section at a time. It is important to remember that entering nested critical sections runs the risk of deadlock. The following code example shows a deadlock situation:


NODE *dequeue(NODE *node)
{
    #pragma omp critical (x)
    {
        node = node->next;
    }
    return node;
}

void do_work(NODE *node)
{
    #pragma omp critical (x)
    {
        node->next->data = fn1(node->data);
        node = dequeue(node);
    }
}

The previous code uses dynamically nested critical sections. When the function do_work is called inside a parallel loop, multiple threads compete to enter the outer critical section. The thread that succeeds in entering the outer critical section calls the dequeue function; however, dequeue cannot make any further progress, because its critical section attempts to enter the same named critical section already held in do_work. Thus, the do_work function can never complete. This is a deadlock situation. The simple way to fix the problem in the previous code is to inline the dequeue function in the do_work function as follows:


void do_work(NODE *node)
{
    #pragma omp critical (x)
    {
        node->next->data = fn1(node->data);
        node = node->next;
    }
}

When using multiple critical sections, be very careful to examine critical sections that might be lurking in subroutines. In addition to using critical sections, you can also use the atomic pragma for updating shared variables. When executing code in parallel, it is impossible to know when an operation will be interrupted by the thread scheduler. However, sometimes you may require that a statement in a high-level language complete in its entirety before a thread is suspended. For example, a statement x++ is translated into a sequence of machine instructions such as:


load    reg, [x];
add     reg, 1;
store   [x], reg;

It is possible that the thread is swapped out between two of these machine instructions. The atomic pragma directs the compiler to generate code to ensure that the specific memory storage is updated atomically. The following code example shows a usage of the atomic pragma.


int main()
{ float y[1000];
  int k, idx[8000];  // idx needs one entry per loop iteration

  #pragma omp parallel for shared(y, idx)
  for ( k = 0; k < 8000; k++) {
    idx[k] = k % 1000;
    y[idx[k]] = 8.0;
  }

  #pragma omp parallel for shared(y, idx)
  for ( k = 0; k < 8000; k++) {
    #pragma omp atomic
    y[idx[k]] += 8.0 * (k % 1000);
  }
  return 0;
}

An expression statement that uses the atomic pragma must take one of the following forms:

  • x binop= expr
  • x++
  • ++x
  • x--
  • --x

In the preceding expressions, x is an lvalue expression with scalar type; expr is an expression with scalar type that does not reference the object designated by x; and binop is not an overloaded operator and is one of +, *, -, /, &, ^, |, <<, or >> in the C/C++ language.
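
For instance, each of the statements below is a conforming target for the atomic pragma; counter and delta are illustrative names, not from the example above:


#pragma omp atomic
counter += delta;   // x binop= expr

#pragma omp atomic
counter++;          // x++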

It is worth pointing out that in the preceding code example, the advantage of using the atomic pragma is that it allows updates of two different elements of array y to occur in parallel. If a critical section were used instead, then all updates to elements of array y would be executed serially, though not in a guaranteed order. Furthermore, in general, the OpenMP compiler and runtime library select the most efficient method to implement the atomic pragma given operating system features and hardware capabilities. Thus, whenever possible, you should prefer the atomic pragma over a critical section for avoiding data-race conditions on statements that update a shared memory location.
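
For comparison, here is a sketch of the second loop from the example above rewritten with a critical section instead of the atomic pragma; every update to y is now serialized, even when two threads touch different elements:


#pragma omp parallel for shared(y, idx)
for ( k = 0; k < 8000; k++) {
    #pragma omp critical
    { y[idx[k]] += 8.0 * (k % 1000); }  // one thread at a time for all of y
}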

Intel Taskqueuing Extension to OpenMP

The Intel Taskqueuing extension to OpenMP allows a programmer to parallelize control structures, such as recursive functions, dynamic tree searches, and pointer-chasing while loops, that are beyond the scope of those supported by the current OpenMP model, while still fitting into the framework defined by the OpenMP specification. In particular, the taskqueuing model is a flexible programming model for specifying units of work that are not pre-computed at the start of the work-sharing construct. Take a look at the following example.


void tq_func(LIST *p)
{
  #pragma intel omp parallel taskq shared(p)
  { while (p != NULL) {
      #pragma intel omp task captureprivate(p)
       { tq_work1(p, 70); }
      #pragma intel omp task captureprivate(p)
       { tq_work2(p, 80); }
      p = p->next;
    }
  }
}

The parallel taskq pragma directs the compiler to generate code to create a team of threads and an environment for the while loop to enqueue the units of work specified by the enclosed task pragma. The loop's control structure and the enqueuing are executed by one thread, while the other threads in the team participate in dequeuing the work from the taskq queue and executing it. The captureprivate clause ensures that a private copy of the link pointer p is captured at the time each task is being enqueued, hence preserving the sequential semantics. The taskqueuing execution model is shown in Figure 1.

Figure 1: Taskqueuing Execution Model.

Essentially, for any given program with parallel taskq constructs, a team of threads is created by the runtime library when the main thread encounters a parallel region. The runtime thread scheduler chooses one thread TK to execute initially from all the threads that encounter a taskq pragma. All the other threads wait for work to be put on the task queue. Conceptually, the taskq pragma triggers this sequence of actions:

  1. Causes an empty queue to be created by the chosen thread TK
  2. Enqueues each task that it encounters
  3. Executes the code inside the taskq block as a single thread

The task pragma specifies a unit of work, potentially to be executed by a different thread. When a task pragma is encountered lexically within a taskq block, the code inside the task block is placed on the queue associated with the taskq pragma. The conceptual queue is disbanded when all work enqueued on it finishes and the end of the taskq block is reached. The Intel C++ compiler has been extended throughout its various components to support the taskqueuing model for generating multi-threaded code corresponding to taskqueuing constructs.
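
As an illustration of the recursive control structures mentioned earlier, the sketch below walks a binary tree with the same pragmas. The TREE type, its left/right fields, and the visit function are assumptions added here for illustration; the root call creates the team and the initial queue, and each nested taskq feeds the subtree walks to idle threads:


void tq_tree_walk(TREE *t)
{
  #pragma intel omp taskq
  {
    if (t->left != NULL) {
      #pragma intel omp task captureprivate(t)
      { tq_tree_walk(t->left); }
    }
    if (t->right != NULL) {
      #pragma intel omp task captureprivate(t)
      { tq_tree_walk(t->right); }
    }
  }
  visit(t);  // hypothetical per-node work; runs after both subtrees finish
}

void tq_tree_root(TREE *root)
{
  #pragma intel omp parallel taskq
  {
    #pragma intel omp task captureprivate(root)
    { tq_tree_walk(root); }
  }
}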

