The following commands compile the source code for each application with the PGI C and Fortran compilers. These commands assume the source code has been saved to the file name provided in the comment at the beginning of each example.

`pgcc -fast -mp -Minfo -Mconcur=allcores matrix-omp.c -o matrix-omp`

pgcc -fast -acc -Minfo matrix-acc.c -o matrix-acc-gpu

pgfortran -fast -acc -Minfo matrix-acc.f -o matrix-acc-gpuf

The command line arguments to the PGI C compiler (`pgcc`

) and Fortran compiler (`pgfortran`

) are:

`-fast`

: Chooses generally optimal flags for the target platform.

`-mp`

: Interpret OpenMP pragmas to explicitly parallelize regions of code for execution by multiple threads on a multi-processor system.

`-acc`

: Interpret OpenACC pragmas.

`-Minfo`

: Emit useful information to`stderr`

.

`-Mconcur`

: Instructs the compiler to enable auto-concurrentization of loops.

The Portland Group also provides a profiling capability that can be enabled via the `PGI_ACC_TIME`

environment variable. By default, profiling is not enabled. Setting `PGI_ACC_TIME`

to a positive integer value enables profiling while a negative value will disable it. The profiling overhead is minimal because the runtime only reports information collected by the GPU hardware performance counters. The wealth of information gathered by the runtime profiler can be seen in the output generated by `matrix-acc-gpu`

after setting `PGI_ACC_TIME=1`

:

rmfarber@bd:~/PGI/example1$ ./matrix-acc-gpu Accelerator Kernel Timing data /home/rmfarber/PGI/example1/matrix-acc.c main 21: region entered 1 time time(us): total=139658 init=88171 region=51487 kernels=43848 data=7049 w/o init: total=51487 max=51487 min=51487 avg=51487 25: kernel launched 1 times grid: [63x63] block: [16x16] time(us): total=43848 max=43848 min=43848 avg=43848

**Example 5: Runtime profile output when PGI_ACC_TIME=1 for matrix-acc-gpu.**

This output from the PGI runtime profiling tells us that the application spent 7 milliseconds transferring data and 43 milliseconds computing the matrix multiply kernel.

It is possible to create a timeline plot using the NVIDIA Visual Profiler (nvvp), which runs on Windows, Linux and Mac computers. (The nvvp application was previously known as computeprof.) The timeline is a new feature in the CUDA 4.2 release and is extremely useful!

Notice that there are:

- Three host to device data transfers at the start of the computation. These transfers correspond to the
`copyin()`

clauses for matrices`a`

and`b`

plus the`copy()`

clause for matrix`c`

.

- A GPU computation that requires 39.1% of the time for kernel
`main_24_gpu`

. A helpful feature of the PGI OpenACC compiler is that it intelligently labels the kernel with the routine name and line number to make these timelines intelligible.

- A single data transfer back from the device to the host, which was required by the copy clause for matrix
`c`

at the end of the kernel.

The visual profiler provides an interactive display of the timeline. A larger screenshot would show the calls to the driver API for the CUDA context setup and the data transfers along with a host of other information. In addition, the nvvp profiler will analyze the application and provide automated suggestions. This requires running the application many times. It is recommended to look at the timeline first as this only requires running the application once.

For example, the following screenshot shows the initial analysis of the timeline shown in Figure 1:

### Matrix Multiply Is an Ideal Case

Most computationally oriented scientists and programmers are familiar with BLAS (the Basic Linear Algebra Subprograms) library. BLAS is the de facto programming interface for basic linear algebra.

BLAS is structured according to three different levels with increasing data and runtime requirements.

- Level-1: Vector-vector operations that require
`O(N)`

data and`O(N)`

work. Examples include taking the inner product of two vectors, or scaling a vector by a constant multiplier.

- Level-2: Matrix-vector operations that require
`O(N`

data and^{2})`O(N`

work. Examples include matrix-vector multiplication or a single right-hand-side triangular solve.^{2})

- Level-3 Matrix-vector operations that require
`O(N`

data and^{2})`O(N`

work. Examples include dense matrix-matrix multiplication.^{3})

The following table describes the amount of work that is performed by each BLAS level assuming that `N`

floating-point values are transferred from the host to the device. This table does not take into account the time required to transfer the data back to the host.

BLAS level | Data | Work | Work per Datum |
---|---|---|---|

1 | O(N) | O(N) | O(1) |

2 | O(N^{2}) |
O(N^{2}) |
O(1) |

3 | O(N^{2}) |
O(N^{2}) |
O(N) |

**Table 2: Work per datum for the three BLAS levels**.

Matrix multiply is an ideal example for OpenACC acceleration because the data transfers become less important as the size of the matrices increase. Matrix multiply is a level-3 BLAS operation that performs `O(N)`

work for every floating-point value transferred to the device. The effect of this high computational density can be seen in the following plot of wall clock time on a dedicated system as the problem size increases. Multiplying 1k by 1k square matrices results in a 1.7 speedup of matrix-acc.c over matrix-omp.c when running on an NVIDIA C2050 GPU compared with a 2.65 GHz quad-core Intel Xeon E5630 processor. Increasing the matrix sizes to 11k by 11k shows a 6.4x speedup over OpenMP. This empirically demonstrates the high work per datum runtime behavior of matrix multiply. Similar speedups will occur for other high work-per-datum computations as well.

**Figure 3: Runtime behavior by matrix size of OpenACC and OpenMP implementations (lower is better)**.

### The Three Rules of Coprocessor Programming

Matrix multiply is an excellent teaching tool but most real-world calculations do not exhibit such ideal behavior. Instead, the programmer must be creative and pay close attention to data transfers and computational density on the OpenACC device(s). High performance can be achieved when the compute intensive portions of the application conform to the following three rules of high-performance coprocesser programming. If not, expect application performance to be either PCIe or device memory bandwidth limited.

- Transfer the data across the PCIe bus onto the device and keep it there.

- Give the device enough work to do.

- Focus on data reuse within the coprocessor(s) to avoid memory bandwidth bottlenecks.

### Using the Create Clause to Allocate on the OpenACC Device

It is easy to create the matrices on the OpenACC device and initialize them. Creating and initializing data on the OpenACC device conforms to the first rule and avoids data transfers. The following example, matrix-acc-create.c demonstrates the use of the create() clause in a kernels region.

/* matrix-acc-create.c */ #define SIZE 1000 float a[SIZE][SIZE]; float b[SIZE][SIZE]; float c[SIZE][SIZE]; int main() { int i,j,k; #pragma acc kernels create(a,b) copyout(c) { // start of kernels // Initialize matrices. for (i = 0; i < SIZE; ++i) { for (j = 0; j < SIZE; ++j) { a[i][j] = (float)i + j; b[i][j] = (float)i - j; c[i][j] = 0.0f; } } // Compute matrix multiplication. for (i = 0; i < SIZE; ++i) { for (j = 0; j < SIZE; ++j) { for (k = 0; k < SIZE; ++k) { c[i][j] += a[i][k] * b[k][j]; } } } } // end of kernels return 0; }

**Example 6: matrix-acc-create.c**

Following is the nvvp timeline showing that two kernels are now running on the GPU.

The Visual Profiler shows that only one data transfer occurs at the end of the data region as required by the` copyout()`

clause.

Removal of the data transfers speeds the OpenACC performance over the OpenMP version:

- Speedup with
`copyin()`

and`copy()`

clauses over OpenMP: 6.4x

- Speedup with
`create()`

and`copyout()`

clauses over OpenMP: 6.9x