The OpenACC Execution Model

This is the second article in a series explaining how to understand and program with OpenACC. This tutorial will introduce parallel regions and teach how the gang, worker, and vector clauses affect the OpenACC execution model. Example code will implement and compare OpenACC and OpenMP implementations of the Classic Gram-Schmidt (CGS) method, a computational linear algebra technique commonly taught in classrooms and used in scientific computing. The previous article in this series, "Easy GPU Parallelism with OpenACC," provided a quick overview of OpenACC and enough information so the reader could start using OpenACC as quickly as possible.

In a nutshell, OpenACC parallel regions are useful because they let programmers annotate work-sharing loops in a style that is conceptually very similar to OpenMP. Kernels regions instead let the compiler analyze the enclosed loops and automatically generate CUDA-style kernels, while tuning clauses give advanced programmers the ability to express a specific kernel launch configuration in portable, directive-based OpenACC syntax. Engineers at PGI wrote an excellent description of the difference between the OpenACC parallel and kernels constructs and how they map to NVIDIA devices.

Gram-Schmidt orthogonalization is a factorization method that can be termed triangular orthogonalization, meaning a set of triangular matrices is applied to a matrix to produce an orthogonal matrix. Operations on triangular matrices are typically a challenge to express for massively parallel devices like GPUs because the workload can vary on a per-loop basis. This tutorial will demonstrate how OpenACC can be used to succinctly annotate a sequential CGS code to express the variable parallelism of the algorithm. A comparable OpenMP implementation shows the similarity between OpenACC and OpenMP pragma usage. Optimizations will be discussed that increase performance by making better use of the memory subsystems of both GPU and CPU devices. Final benchmarks show that the optimized OpenACC CGS implementation provides up to a 7.9x speedup for a 5000x5000 matrix on an NVIDIA C2070 compared with an optimized OpenMP implementation running on a quad-core 2.53 GHz Xeon E5630 CPU.

The OpenACC Execution Model

From the first tutorial in this series, we know that OpenACC targets a host-directed execution model where the sequential code runs on a conventional processor and computationally intensive parallel pieces of code (kernels) run on an accelerator such as a GPU. This mapping between the host and parallel device(s) reflects a hardware expression of Amdahl’s Law.

Amdahl’s Law is named after the computer architect Gene Amdahl. It is not really a law, but rather an approximation that models the ideal speedup that can happen when sequential (single-threaded) programs are modified to run on parallel hardware. In the ideal case, those sections of code that can be parallelized can have their runtime reduced by a factor of N, where N is the number of parallel processing elements. Large application speedups can occur when the parallel sections of code dominate the runtime of the sequential code. Theoretically, the time taken to complete the serial sections of code (those sections that cannot be parallelized) will eventually dominate the runtime when the number of parallel processing elements, N, is large.
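Amdahl's approximation is easy to express in code. The following is a small sketch (the function name is my own, chosen for illustration): given the parallelizable fraction p of the runtime and N processing elements, the ideal speedup is 1 / ((1 - p) + p/N).

```c
/* Ideal speedup predicted by Amdahl's Law:
   p = fraction of the runtime that can be parallelized,
   N = number of parallel processing elements. */
double amdahl_speedup(double p, double N)
{
    return 1.0 / ((1.0 - p) + p / N);
}
```

For example, with p = 0.95 the speedup approaches 1/(1 - 0.95) = 20x no matter how large N grows, which is exactly the point about serial sections eventually dominating the runtime.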

The benefit of the OpenACC host-directed sequential model is that it can leverage the massive parallelism of one or more accelerator devices where the number of processing elements (PEs) is large, while preserving the ability of the latest generation of sequential processors to accelerate the serial sections of code. In this way, the performance characteristics of both state-of-the-art processors and accelerators can be exploited.

As noted in Part 1, OpenACC programmers must pay close attention to data transfers and usage of OpenACC device(s). High-performance applications generally conform to the following three rules of coprocessor programming:

• Transfer the data across the PCIe bus onto the device and keep it there.
• Give the device enough work to do.
• Focus on data reuse within the coprocessor(s) to avoid memory bandwidth bottlenecks.
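The first and third rules can be followed in OpenACC with a structured data region, which transfers arrays across the PCIe bus once and keeps them resident on the device across multiple kernels. The following is a minimal sketch; the function and array names are my own, not from the CGS example, and without an OpenACC compiler the pragmas are simply ignored and the code runs on the host:

```c
/* Hypothetical sketch: one data region moves a[] and b[] to the
   device once and keeps them resident across two kernel launches,
   instead of paying for a PCIe transfer per kernel. */
void two_pass(const double *a, double *b, int n)
{
    #pragma acc data copyin(a[0:n]) copy(b[0:n])
    {
        #pragma acc kernels loop
        for (int i = 0; i < n; ++i) b[i] += a[i];  /* first pass          */

        #pragma acc kernels loop
        for (int i = 0; i < n; ++i) b[i] *= a[i];  /* reuses device copies */
    }
}
```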

As will be shown in this tutorial, a corollary to these three rules is to avoid dynamic memory allocation and free operations. OpenACC targets two types of compute-intensive parallel sections of code so they can potentially be offloaded to an accelerator device:

• Kernel regions bundle one or several nested loops into a kernel. The OpenACC compiler essentially translates the loops into a kernel that can run in parallel on the accelerator. Current compilers are able to offload loops that are nested three levels deep. When the nesting level exceeds the compiler capability, the outer loop(s) will run sequentially on the host, while the inner loops will run on the accelerator. When possible, ensure that the inner loops perform the largest amounts of work to make best use of the accelerator. Following is an example from Part 1 of a nested set of matrix multiplication loops that was annotated as a kernel region for the OpenACC compiler:
```
// Compute matrix multiplication.
#pragma acc kernels copyin(a,b) copy(c)
for (i = 0; i < SIZE; ++i) {
  for (j = 0; j < SIZE; ++j) {
    for (k = 0; k < SIZE; ++k) {
      c[i][j] += a[i][k] * b[k][j];
    }
  }
}
```

Example 1: A kernel region.

• Parallel regions annotate a set of work-sharing loops for acceleration on the target device. It is up to the compiler to decide how to best express the computation for correct and efficient execution on the target architecture. In a very real sense, the OpenACC parallel construct highlights the value added by the compiler authors and can differentiate OpenACC compilers. Following is a set of work-sharing loops from the CGS example discussed later in this article. Note one loop in the parallel region performs a reduction operation, while the other loop utilizes the reduction result:
```
#pragma acc parallel loop
for (int j = k+1; j < cols; j++) {
  double tmp = 0.;
  for (int i = 0; i < rows; i++) tmp += Q[i][k] * Q[i][j];
  for (int i = 0; i < rows; i++) Q[i][j] -= tmp * Q[i][k];
}
```

Example 2: A parallel region.
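For comparison, the same CGS work-sharing loops can be annotated for OpenMP with an almost identical pragma. The sketch below wraps the loops in a hypothetical function for illustration; the article's full OpenMP implementation may differ:

```c
/* Hypothetical OpenMP analogue of the OpenACC parallel region above:
   each thread handles a subset of the j columns. Without -fopenmp the
   pragma is ignored and the loops run sequentially. */
void orthogonalize_step(int k, int rows, int cols, double Q[rows][cols])
{
    #pragma omp parallel for
    for (int j = k + 1; j < cols; j++) {
        double tmp = 0.;
        for (int i = 0; i < rows; i++) tmp += Q[i][k] * Q[i][j];
        for (int i = 0; i < rows; i++) Q[i][j] -= tmp * Q[i][k];
    }
}
```

Note how closely the OpenMP pragma mirrors the OpenACC one; the per-column reduction into tmp is private to each j iteration in both versions.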

The OpenACC execution model lets the host processor orchestrate the execution of all parallel and kernel accelerated regions of code. This includes allocating memory on the device, initiating data transfer(s), loading executable code on the accelerator, passing arguments used by the parallel region, queuing ready-to-run device code, waiting for completion (unless directed otherwise by the programmer), transferring results back to the host, and deallocating device memory. Most accelerators let the host pipeline a sequence of operations so they can be executed as quickly as possible on the device, one after the other.

From the perspective of the programmer working with a parallel or kernels construct, the OpenACC execution model has three levels: gang, worker, and vector. How these levels map to the underlying hardware depends on the device capabilities and on what the compiler thinks is the best mapping for the problem.

OpenACC targets a general device architecture that assumes a device will contain multiple processing elements (PEs) that run in parallel. Each PE also has the ability to efficiently perform vector-like operations. For NVIDIA GPUs, it is reasonable to think of a PE as a streaming multiprocessor, an OpenACC gang as a thread block, a worker as a warp, and an OpenACC vector element as a CUDA thread.

OpenACC does not provide support for synchronization between gangs or for directly sharing data between gangs. (That said, limited synchronization through the host, or data exchange via device memory, is possible.) Both the CUDA and OpenCL programming models make these same assumptions to preserve high performance and the ability to scale to arbitrarily large numbers of PEs. I refer to this as a "strong scaling execution model" in my talks and articles about NVIDIA GPUs, AMD GPUs, and Intel Xeon Phi. These limitations mean that OpenACC programmers must map the parallelism in their code so that data is shared only among workers within the same gang.

For tuning OpenACC code to a specific device, OpenACC lets the programmer specify that a given loop should map to gangs, workers, and/or vectors as well as the number of gangs, workers, and/or vectors that should be used in the mapping. Note that such tuning operations will be specific to a particular device and might actually reduce performance on another device. In general, it is best to let the compiler decide the best mapping for device portability and to benefit from future compiler improvements. For more information, Mark Harris has a nice tutorial discussing OpenACC tuning using the gang, worker, and vector clauses.

Following are some simple SAXPY examples from Cliff Woolley’s NVIDIA GTC talk, "Profiling and Tuning OpenACC Code," demonstrating how particular OpenACC clauses can be used to map to CUDA threads:

• Use whatever mapping to threads and blocks the compiler chooses.
```
#pragma acc kernels loop
for (int i = 0; i < n; ++i) y[i] += a*x[i];
```

Example 3: Let the compiler choose the mapping.

• Specify that the kernel will use 100 thread blocks, each with 128 threads, where each thread executes one iteration of the loop.
```
#pragma acc kernels loop gang(100), vector(128)
for (int i = 0; i < n; ++i) y[i] += a*x[i];
```

Example 4: Specify gang and vector values for a kernels loop.

• Specify that the following parallel region will utilize 100 thread blocks, each with 128 threads, where each thread executes one iteration of the loop.
```
#pragma acc parallel num_gangs(100), vector_length(128)
{
  #pragma acc loop gang, vector
  for (int i = 0; i < n; ++i) y[i] += a*x[i];
}
```

Example 5: Specify num_gangs and vector length of a parallel region.

The Classic Gram-Schmidt Method

Classic Gram-Schmidt is commonly taught and utilized in scientific computing even though it is known to be numerically unstable due to floating-point rounding errors. To decrease rounding errors, the examples in this article will show how easy it is to mix double- and single-precision computations (hybrid precision) in OpenACC.
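One common hybrid-precision pattern is to store the matrix in single precision (halving memory traffic) while accumulating the dot products that dominate CGS in double precision to limit rounding error. The following sketch is mine, not the article's code, and the function name is illustrative:

```c
/* Hypothetical hybrid-precision sketch: the data is stored in single
   precision (float), but the dot-product accumulation is carried out
   in double precision to reduce rounding error. */
double dot_hybrid(const float *x, const float *y, int n)
{
    double sum = 0.0;  /* double-precision accumulator */
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += (double)x[i] * (double)y[i];
    return sum;
}
```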

Even with hybrid precision, much of the existing literature encourages researchers to avoid CGS and to use the more numerically stable Modified Gram-Schmidt algorithm (MGS). An advantage of the MGS algorithm is that it requires fewer floating-point operations. Per-Olof Persson's 2007 MIT "Introduction to Numerical Methods" lecture, "Gram-Schmidt Orthogonalization," provides a nice synopsis of the conventional reasoning. Conversely, the general availability and high floating-point capability of massively parallel hardware has sparked new thinking indicating that efficient parallel variants of CGS can be a more attractive, and more numerically stable, computational technique than MGS. Following are two references that compare parallel CGS against other Gram-Schmidt variants:
