
Creating and Using Libraries with OpenACC


Table 1 makes it easy to compare the runtimes of simpleMult.c and fastSimpleMult.c on both the quad-core Xeon and the NVIDIA C2050 GPU. Note that the C2050 maintained high performance regardless of how the nested loops of the matrix multiplication were structured. These results strongly indicate that the GPU hardware successfully uses Thread-Level Parallelism (TLP) and hardware scheduling to hide latency and keep its computational units busy. In contrast, the rearranged loop structure strongly benefits the OpenMP code (providing roughly a 20x average speed-up), which means the C2050 OpenACC version runs only 3x faster than the quad-core Xeon when multiplying 1000x1000 matrices.


| Run | OpenACC Conventional Loops (s) | OpenMP Conventional Loops (s) | OpenACC Speed-up | OpenACC Rearranged Loops (s) | OpenMP Rearranged Loops (s) | OpenACC Speed-up |
|---------|----------|--------|-------|----------|---------|------|
| 1       | 0.045108 | 2.9749 | 65.95 | 0.04298  | 0.12139 | 2.82 |
| 2       | 0.043823 | 2.6862 | 61.30 | 0.041681 | 0.13461 | 3.23 |
| 3       | 0.043793 | 2.6802 | 61.20 | 0.041697 | 0.13055 | 3.13 |
| Average |          |        | 62.82 |          |         | 3.06 |

Table 1: Performance of two OpenMP and OpenACC square matrix multiplication loop structures (runtimes in seconds).
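For reference, the two loop structures being compared look roughly like the following. (The exact code appears in simpleMult.c and fastSimpleMult.c earlier in this series; this is a representative sketch, not the original listing.)

// Conventional i,j,k ordering: the inner loop walks down the columns of B,
// a cache-unfriendly access pattern on the host
for (int i = 0; i < size; i++)
  for (int j = 0; j < size; j++) {
    float tmp = 0.0f;
    for (int k = 0; k < size; k++)
      tmp += A[i][k] * B[k][j];
    C[i][j] = tmp;
  }

// Rearranged i,k,j ordering (assumes C has been zeroed first): the inner
// loop now accesses both B and C contiguously, which dramatically helps
// the OpenMP/host version
for (int i = 0; i < size; i++)
  for (int k = 0; k < size; k++)
    for (int j = 0; j < size; j++)
      C[i][j] += A[i][k] * B[k][j];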

It is reasonable to hypothesize that the GPU's performance robustness against variations in nested loop structure is a generic feature of a TLP-based hardware execution model: TLP is designed to reorder computations to hide latency and increase computational efficiency. In short, how to structure nested loops to most efficiently exploit hardware parallelism is an important question in computer science, and one that lies at the heart of high-performance parallel application design.

The process of pragma-based programming has been described as a "negotiation" with the compiler. Finding the right loop structure can be both difficult and rewarding, as even the simple square matrix multiplication example in this tutorial shows. The frustrating part is that it is never clear whether the best performing solution has been found. For example, the OpenMP runtimes reported by simpleMult.c look reasonable, and there is no obvious performance concern because both codes show 100% utilization on all processor cores. Only through comparison against the timings I reported earlier (in effect, regression testing) was a performance issue even identified. (Regression testing should be a core part of any software development project, no matter how simple the project!) Finally, a non-intuitive solution was found only after a human took the time to work out why one set of nested loops ran so much faster than another.

OpenACC programmers have a unique opportunity to investigate the performance of nested loops on both GPU and conventional SMP/vector hardware, and the comparison can benefit application performance on both architectures. Bottom line: if the GPU results look too good, re-examine the nested loops in the OpenMP code, because this might be a sign that the GPU was able to reorder the computation to be more efficient. Alternatively, just run on the GPU.

Time will tell if OpenACC, which can utilize hardware execution models like GPU thread-level parallelism, will make this negotiation process with the compiler easier and more effective. Those interested in this discussion should also investigate applications that exhibit dynamic parallelism (where the parallel work varies by time, grid location, or some other factor), as these problems further complicate the challenges of finding the best performing loop structures.

Conditional Runtime Execution

Conditional runtime execution is possible in OpenACC by calling the acc_get_device_type() or acc_on_device() runtime library methods. The following source code for initSgemm() demonstrates how to initialize the NVIDIA CUBLAS library, but only when the source code is compiled with OpenACC and when the application runs on an NVIDIA device.

#ifdef _OPENACC
#include <openacc.h>
#endif
#include <cublas.h>  // legacy CUBLAS API

// Initialization required when CUBLAS is going to be used on an NVIDIA GPU
void initSgemm()
{
#ifdef _OPENACC
  if (acc_get_device_type() == acc_device_nvidia) {
    cublasInit();
  }
#endif
}

Conditional runtime behavior based on the device type is necessary because OpenACC can run on a variety of device types. Currently, only the host processor and NVIDIA CUDA-enabled GPUs are supported. Section 4 of the OpenACC specification describes how the environment variable ACC_DEVICE_TYPE specifies the default device type at runtime. The device type can also be specified at compile time. The PGI compiler, for example, creates a host-only binary when "-ta=host" is specified on the command line. Similarly, specifying "-ta=host,nvidia" creates a fat binary (referred to as a "Unified Binary" by PGI) that can run on either the host processor or an NVIDIA GPU. Finally, the device type can be specified inside the application with acc_init() or acc_set_device_type(). It is expected that additional OpenACC device types will be added in the future.
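As a minimal sketch of programmatic device selection (selectDevice() is a hypothetical helper for this example, not part of the article's source):

#ifdef _OPENACC
#include <openacc.h>
#endif

// Hypothetical helper: pick the device type before any compute region runs
void selectDevice(int useGPU)
{
#ifdef _OPENACC
  if (useGPU) {
    acc_set_device_type(acc_device_nvidia); // run subsequent regions on the GPU
    acc_init(acc_device_nvidia);            // initialize eagerly so timed code avoids startup cost
  } else {
    acc_set_device_type(acc_device_host);   // fall back to the host processor
  }
#endif
}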

Conditional runtime execution means developers can write reusable OpenACC code that will transparently call optimized libraries such as CUBLAS and CUFFT, or external CUDA and OpenCL methods. To support this capability, OpenACC provides two ways to determine the address of an OpenACC memory region that is present on the device so it can be passed to an external method or library:

  1. The acc_malloc() function returns a pointer to memory allocated on the device. The pointer can be used in OpenACC code or passed to a method that expects a device pointer.
  2. The OpenACC use_device clause. As of the OpenACC specification v1.0, this is the only valid clause that can be used with the host_data pragma construct. The use_device clause is not implemented by the PGI compilers as of the 12.6 release.
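For illustration, here is a minimal sketch of how host_data with use_device is intended to work once implemented (the n-by-n contiguous arrays and the CUBLAS call are assumptions for this example):

#pragma acc data copyin(A[0:n*n], B[0:n*n]) copy(C[0:n*n])
{
#pragma acc host_data use_device(A, B, C)
  {
    // Within this region, A, B, and C evaluate to their device addresses,
    // so they can be passed directly to a CUDA-based library routine
    cublasSgemm('N', 'N', n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
  }
}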

To complete the picture of interacting with external code, the deviceptr(list) clause declares that the pointers in list are device pointers. This clause lets OpenACC developers utilize memory created on a device with CUDA, OpenCL, or an optimized library.

The following code snippet from the doSgemm() method (which will be shown shortly) calls acc_malloc() to allocate device memory at a known location for the transposed At, Bt, and Ct matrices. These pointers contain the address of the matrix on the device, so they can be passed to the CUBLAS sgemm() method while still being accessible to OpenACC. Conditional runtime execution based on device type ensures that the CUBLAS sgemm() method is only called when the binary is utilizing an NVIDIA GPU.

…

#ifdef _OPENACC
  if (acc_get_device_type() == acc_device_nvidia) {

    // Workaround while waiting for the host_data use_device pragma and clause to work:
    // allocate space at a known device pointer location
    float (*restrict At)[size] = (float (*)[size])acc_malloc(sizeof(float)*size*size);
    float (*restrict Bt)[size] = (float (*)[size])acc_malloc(sizeof(float)*size*size);
    float (*restrict Ct)[size] = (float (*)[size])acc_malloc(sizeof(float)*size*size);

#pragma acc data pcopyin(A[0:size][0:size],B[0:size][0:size]) pcopyout(C[0:size][0:size]) deviceptr(At,Bt,Ct)
    {
      // Copy (transposed) data on the device to the known pointer locations,
      // giving CUBLAS the column-major layout it expects
#pragma acc kernels loop
      for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++) {
          At[i][j] = A[j][i];
          Bt[i][j] = B[j][i];
        }

      cublasSgemm('N', 'N', size, size, size, 1.0f, (float*)At, size,
                  (float*)Bt, size, 0.0f, (float*)Ct, size);
      cudaThreadSynchronize(); // a production code would probably not wait for completion here

      // Transpose the result back into C on the device
#pragma acc kernels loop
      for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
          C[i][j] = Ct[j][i];
    }
    acc_free(At); acc_free(Bt); acc_free(Ct);
    return;
  }
#endif

… Call a host based sgemm here …

As will be seen in the complete source listing at the end of this article (matrix-lib.c), the doSgemm() method is structured so that the ACML sgemm() method is called by default when the OpenACC binary is running on the host processor or when the code has not been compiled with OpenACC. It is likely that many production library codes will be structured in a similar manner.
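The overall shape of that structure is sketched below (only doSgemm() and the device test come from the article; the rest is a schematic outline, not the full listing):

void doSgemm(int size, float (*restrict A)[size],
             float (*restrict B)[size], float (*restrict C)[size])
{
#ifdef _OPENACC
  if (acc_get_device_type() == acc_device_nvidia) {
    // GPU path shown above: acc_malloc(), device-side transposes, cublasSgemm()
    return;
  }
#endif
  // Default path: call the optimized host sgemm() (ACML in this article)
}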

Aside from the doSgemm() method, the remainder of the matrix-lib.c test code is straightforward and utilizes concepts already discussed. For timing purposes, separate doMult_acc() and doMult_omp() methods have been created. A checkMult() method has been provided to verify that each of the matrix multiplication methods produces a result that agrees with an optimized sgemm() call.

It is expected that some of the example methods will calculate slightly different results due to round-off error. Floating-point arithmetic is only approximate, which means that even slight variations in the order in which floating-point operations are performed (say, when compiling with different optimization flags or running on different devices) can cause the same code to produce slightly different numerical results. For validation purposes, the checkMult() method calculates the root-mean-square deviation (RMSD) as a measure of similarity. The RMSD value will be small when the matrices are similar.
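As a sketch of what such a similarity check can look like (checkMult()'s actual signature is not shown here, so the rmsd() helper below is an assumption):

#include <math.h>

// Hypothetical helper in the spirit of checkMult(): returns the RMSD
// between two size-by-size result matrices stored contiguously
double rmsd(int size, const float *C1, const float *C2)
{
  double sumSq = 0.0;
  for (long i = 0; i < (long)size * size; i++) {
    double d = (double)C1[i] - (double)C2[i];
    sumSq += d * d;
  }
  return sqrt(sumSq / ((double)size * size));
}

A result is then accepted when the RMSD falls below a small tolerance, rather than demanding bitwise equality.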

