Channels ▼
RSS

Design

Easy GPU Parallelism with OpenACC


The following commands compile the source code for each application with the PGI C and Fortran compilers. These commands assume the source code has been saved to the file name provided in the comment at the beginning of each example.

pgcc -fast -mp -Minfo -Mconcur=allcores matrix-omp.c -o matrix-omp
pgcc -fast -acc -Minfo matrix-acc.c -o matrix-acc-gpu
pgfortran -fast -acc -Minfo matrix-acc.f -o matrix-acc-gpuf

The command line arguments to the PGI C compiler (pgcc) and Fortran compiler (pgfortran) are:

  • -fast: Chooses generally optimal flags for the target platform.
  • -mp: Interpret OpenMP pragmas to explicitly parallelize regions of code for execution by multiple threads on a multi-processor system.
  • -acc: Interpret OpenACC pragmas.
  • -Minfo: Emit useful information to stderr.
  • -Mconcur: Instructs the compiler to enable auto-concurrentization of loops.

The Portland Group also provides a profiling capability that can be enabled via the PGI_ACC_TIME environment variable. By default, profiling is not enabled. Setting PGI_ACC_TIME to a positive integer value enables profiling while a negative value will disable it. The profiling overhead is minimal because the runtime only reports information collected by the GPU hardware performance counters. The wealth of information gathered by the runtime profiler can be seen in the output generated by matrix-acc-gpu after setting PGI_ACC_TIME=1:

rmfarber@bd:~/PGI/example1$ ./matrix-acc-gpu

Accelerator Kernel Timing data
/home/rmfarber/PGI/example1/matrix-acc.c
  main
    21: region entered 1 time
        time(us): total=139658 init=88171 region=51487
                  kernels=43848 data=7049
        w/o init: total=51487 max=51487 min=51487 avg=51487
        25: kernel launched 1 times
            grid: [63x63]  block: [16x16]
            time(us): total=43848 max=43848 min=43848 avg=43848

Example 5: Runtime profile output when PGI_ACC_TIME=1 for matrix-acc-gpu.

This output from the PGI runtime profiling tells us that the application spent 7 milliseconds transferring data and 43 milliseconds computing the matrix multiply kernel.

It is possible to create a timeline plot using the NVIDIA Visual Profiler (nvvp), which runs on Windows, Linux and Mac computers. (The nvvp application was previously known as computeprof.) The timeline is a new feature in the CUDA 4.2 release and is extremely useful!

[Click image to view at full size]
Figure 1: nvvp timeline for matrix-acc-gpu.

Notice that there are:

  • Three host to device data transfers at the start of the computation. These transfers correspond to the copyin() clauses for matrices a and b plus the copy() clause for matrix c.
  • A GPU computation that requires 39.1% of the time for kernel main_24_gpu. A helpful feature of the PGI OpenACC compiler is that it intelligently labels the kernel with the routine name and line number to make these timelines intelligible.
  • A single data transfer back from the device to the host, which was required by the copy clause for matrix c at the end of the kernel.

The visual profiler provides an interactive display of the timeline. A larger screenshot would show the calls to the driver API for the CUDA context setup and the data transfers along with a host of other information. In addition, the nvvp profiler will analyze the application and provide automated suggestions. This requires running the application many times. It is recommended to look at the timeline first as this only requires running the application once.

For example, the following screenshot shows the initial analysis of the timeline shown in Figure 1:

[Click image to view at full size]
Figure 2: Automated analysis performed by the NVIDIA Visual Profiler.

Matrix Multiply Is an Ideal Case

Most computationally oriented scientists and programmers are familiar with BLAS (the Basic Linear Algebra Subprograms) library. BLAS is the de facto programming interface for basic linear algebra.

BLAS is structured according to three different levels with increasing data and runtime requirements.

  1. Level-1: Vector-vector operations that require O(N) data and O(N) work. Examples include taking the inner product of two vectors, or scaling a vector by a constant multiplier.
  2. Level-2: Matrix-vector operations that require O(N2) data and O(N2) work. Examples include matrix-vector multiplication or a single right-hand-side triangular solve.
  3. Level-3 Matrix-vector operations that require O(N2) data and O(N3) work. Examples include dense matrix-matrix multiplication.

The following table describes the amount of work that is performed by each BLAS level assuming that N floating-point values are transferred from the host to the device. This table does not take into account the time required to transfer the data back to the host.

BLAS level Data Work Work per Datum
1 O(N) O(N) O(1)
2 O(N2) O(N2) O(1)
3 O(N2) O(N2) O(N)

Table 2: Work per datum for the three BLAS levels.

Matrix multiply is an ideal example for OpenACC acceleration because the data transfers become less important as the size of the matrices increase. Matrix multiply is a level-3 BLAS operation that performs O(N) work for every floating-point value transferred to the device. The effect of this high computational density can be seen in the following plot of wall clock time on a dedicated system as the problem size increases. Multiplying 1k by 1k square matrices results in a 1.7 speedup of matrix-acc.c over matrix-omp.c when running on an NVIDIA C2050 GPU compared with a 2.65 GHz quad-core Intel Xeon E5630 processor. Increasing the matrix sizes to 11k by 11k shows a 6.4x speedup over OpenMP. This empirically demonstrates the high work per datum runtime behavior of matrix multiply. Similar speedups will occur for other high work-per-datum computations as well.

Figure 3: Runtime behavior by matrix size of OpenACC and OpenMP implementations (lower is better).

The Three Rules of Coprocessor Programming

Matrix multiply is an excellent teaching tool but most real-world calculations do not exhibit such ideal behavior. Instead, the programmer must be creative and pay close attention to data transfers and computational density on the OpenACC device(s). High performance can be achieved when the compute intensive portions of the application conform to the following three rules of high-performance coprocesser programming. If not, expect application performance to be either PCIe or device memory bandwidth limited.

  1. Transfer the data across the PCIe bus onto the device and keep it there.
  2. Give the device enough work to do.
  3. Focus on data reuse within the coprocessor(s) to avoid memory bandwidth bottlenecks.

Using the Create Clause to Allocate on the OpenACC Device

It is easy to create the matrices on the OpenACC device and initialize them. Creating and initializing data on the OpenACC device conforms to the first rule and avoids data transfers. The following example, matrix-acc-create.c demonstrates the use of the create() clause in a kernels region.

/* matrix-acc-create.c */
#define SIZE 1000
float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

int main()
{
  int i,j,k;
  
#pragma acc kernels create(a,b) copyout(c)
  { // start of kernels
    // Initialize matrices.
    for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
	a[i][j] = (float)i + j;
	b[i][j] = (float)i - j;
	c[i][j] = 0.0f;
      }
    }
    
    // Compute matrix multiplication.
    for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
	for (k = 0; k < SIZE; ++k) {
	  c[i][j] += a[i][k] * b[k][j];
	}
      }
    }
  } // end of kernels

  return 0;
}

Example 6: matrix-acc-create.c

Following is the nvvp timeline showing that two kernels are now running on the GPU.

[Click image to view at full size]
Figure 4: High resolution nvvp timeline showing two kernels.

The Visual Profiler shows that only one data transfer occurs at the end of the data region as required by the copyout() clause.

[Click image to view at full size]
Figure 5: lower resolution nvvp timeline showing two kernels and copy at the end of the data region.

Removal of the data transfers speeds the OpenACC performance over the OpenMP version:

  • Speedup with copyin() and copy() clauses over OpenMP: 6.4x
  • Speedup with create() and copyout() clauses over OpenMP: 6.9x


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video