Work-Sharing Sections
The work-sharing sections construct directs the OpenMP compiler and runtime to distribute the identified sections of your application among the threads in the team created for the parallel region. The following example uses a work-sharing for loop and work-sharing sections together within a single parallel region. Because both constructs run in the same parallel region, the overhead of forking or resuming threads a second time for the sections is avoided.
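#pragma omp parallel
{
    /* lastprivate(x) gives each thread its own x inside the loop and
       copies the value from the final iteration (k == m-1) back to the
       shared x, so the sections below read a well-defined value. */
    #pragma omp for lastprivate(x)
    for ( k = 0; k < m; k++ ) {
        x = fn1(k) + fn2(k);
    }
    #pragma omp sections private(y, z)
    {
        #pragma omp section
        { y = sectionA(x); fn7(y); }
        #pragma omp section
        { z = sectionB(x); fn8(z); }
    }
}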
Here, OpenMP first creates a team of threads. Then, the iterations of the loop are divided among the threads. Once the loop is finished, the sections are divided among the threads so that each section is executed exactly once, but in parallel with the other sections. If the program contains more sections than threads, the remaining sections are scheduled as threads finish their previous sections. Unlike the work-sharing loop, the sections construct does not accept a schedule clause. Therefore, OpenMP is in complete control of how, when, and in what order threads are scheduled to execute the sections. You can still control which variables are shared or private by using the private and reduction clauses, just as with the loop construct.
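The use of the reduction clause on sections mentioned above can be sketched as follows. This is a minimal illustration rather than code from the original example: the function names sum_range and sum_in_two_sections are hypothetical, and the combined parallel sections form is used only to keep the sketch short.

```c
#include <stddef.h>

/* Hypothetical helper: sums the elements a[lo] .. a[hi-1]. */
static long sum_range(const int *a, size_t lo, size_t hi)
{
    long s = 0;
    for (size_t i = lo; i < hi; i++)
        s += a[i];
    return s;
}

/* Hypothetical example: each half of the array is summed in its own
   section. reduction(+:total) gives every thread a private copy of
   total, initialized to 0, and adds the copies into the original
   variable at the end of the construct. */
long sum_in_two_sections(const int *a, size_t n)
{
    long total = 0;
    size_t mid = n / 2;

    #pragma omp parallel sections reduction(+:total)
    {
        #pragma omp section
        total += sum_range(a, 0, mid);

        #pragma omp section
        total += sum_range(a, mid, n);
    }
    return total;
}
```

Without the reduction clause, both sections would update the shared total concurrently, which would be a data race. If the code is compiled without OpenMP support, the pragmas are ignored and the two statements simply run one after the other with the same result.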
Performance-oriented Programming
OpenMP provides a set of important pragmas and runtime functions that enable thread synchronization and related actions to facilitate correct parallel programming. Using these pragmas and runtime functions effectively with minimum overhead and thread waiting time is extremely important for achieving optimal performance from your applications.
Barriers are a synchronization mechanism that OpenMP employs to coordinate threads. Threads wait at a barrier until all the threads in the team have reached the same point. You have been using implied barriers without realizing it in the work-sharing for and work-sharing sections constructs. At the end of the parallel, for, sections, and single constructs, an implicit barrier is generated by the compiler or invoked in the runtime library. The barrier causes execution to wait for all threads to finish the work of the loop, sections, or region before any go on to execute additional work. On the for, sections, and single constructs, this barrier can be removed with the nowait clause, as shown in the following code sample.
#pragma omp parallel
{
    #pragma omp for nowait
    for ( k = 0; k < m; k++ ) {
        fn10(k); fn20(k);
    }
    #pragma omp sections private(y, z)
    {
        #pragma omp section
        { y = sectionD(); fn70(y); }
        #pragma omp section
        { z = sectionC(); fn80(z); }
    }
}
In this example, since there is no data dependence between the first work-sharing for loop and the second work-sharing sections code block, the threads that process the first work-sharing for loop continue immediately to the second work-sharing sections without waiting for all threads to finish the first loop. Depending upon your situation, this behavior may be beneficial, because it can make full use of available resources and reduce the amount of time that threads are idle. The nowait clause can also be used with the work-sharing sections construct and the single construct to remove the implicit barrier at the end of each code block.
OpenMP also supports explicit barriers through the barrier pragma, as shown in the following example.
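As a rough sketch of nowait on the single construct, the following hypothetical function lets one thread record a piece of bookkeeping while the other threads proceed directly to the loop. The function name scale_with_header and its parameters are illustrative assumptions, not part of the original example.

```c
/* Hypothetical example: one thread stores the element count in *header
   while the remaining threads start scaling the array right away.
   The loop never touches *header, so skipping the barrier is safe. */
void scale_with_header(double *a, int n, double factor, int *header)
{
    #pragma omp parallel
    {
        /* Executed by exactly one thread; nowait removes the implicit
           barrier, so the other threads do not wait here. */
        #pragma omp single nowait
        *header = n;

        /* The implicit barrier at the end of this loop (and of the
           parallel region) still guarantees that all work, including
           the write to *header, is complete on return. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] *= factor;
    }
}
```

The nowait clause is only appropriate here because the code after the single block does not read what the single block writes; if it did, the barrier would have to stay.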
#pragma omp parallel shared(x, y, z) num_threads(2)
{
    int tid = omp_get_thread_num();
    if (tid == 0) {
        y = fn70(tid);
    }
    else {
        z = fn80(tid);
    }
    #pragma omp barrier
    #pragma omp for
    for ( k = 0; k < 100; k++ ) {
        x[k] = y + z + fn10(k) + fn20(k);
    }
}
In this example, the OpenMP code is to be executed by two threads; one thread writes its result to the variable y, and the other thread writes its result to the variable z. Both y and z are read in the work-sharing for loop; hence, two flow dependencies exist. To obey the data dependence constraints in the code for correct threading, you need to add an explicit barrier pragma right before the work-sharing for loop to guarantee that the values of both y and z are ready to be read. In real applications, the barrier pragma is especially useful when all threads need to finish a task before any more work can be completed, as would be the case, for example, when updating a graphics frame buffer before displaying its contents.
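The frame buffer scenario can be sketched in a simplified form as a two-phase computation. This is an illustrative assumption, not the original author's code: the function fill_then_smooth, the one-dimensional buffers, and the i * i fill pattern are all hypothetical, and the serial fallbacks exist only so that the sketch also builds when OpenMP is not enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP enabled. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical two-phase update: phase 1 fills src, phase 2 computes
   each dst element from src[i] and its neighbors (edges reuse src[i]). */
void fill_then_smooth(int *src, int *dst, int n)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        /* Phase 1: the work is split by hand, not with a work-sharing
           construct, so there is no implicit barrier after this loop. */
        for (int i = tid; i < n; i += nthreads)
            src[i] = i * i;

        /* Every element of src must be written before any thread
           reads a neighbor that another thread produced. */
        #pragma omp barrier

        /* Phase 2: reads src elements written by other threads. */
        #pragma omp for
        for (int i = 0; i < n; i++) {
            int left  = src[i > 0 ? i - 1 : i];
            int right = src[i < n - 1 ? i + 1 : i];
            dst[i] = left + src[i] + right;
        }
    }
}
```

Removing the barrier pragma would let a fast thread begin phase 2 while a slower thread is still writing its part of src, so dst could be computed from stale values.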


