For teaching purposes, separate versions of the OpenMP and OpenACC gramSchmidt() method will be used in this tutorial. The use of OpenACC in production code will almost certainly mix OpenACC and OpenMP pragmas without the use of a preprocessor conditional.
The call to gramSchmidt() is timed, which works because the host processor must block at the end of the gramSchmidt() method and wait for the Q matrix to be copied off the accelerator.
double startTime = omp_get_wtime();
gramSchmidt(A,rows, cols);
double endTime = omp_get_wtime();
printf("runtime %d %d matrix %g\n",rows, cols, (endTime-startTime));
Example 9: Timing the gramSchmidt runtime.
The remainder of the methods in the test code should be self-explanatory.
The sequential, OpenMP, and OpenACC executables can be built under Linux with the following commands. In this exampile, I'm using the PGI C compiler:
DEF="-D ROWS=1000 -D COLS=1000" pgcc $DEF -O4 -Minfo -mp -Mconcur=allcores cgs-01.c -o cgs_01_omp pgcc $DEF -O4 cgs-01.c -o cgs_01_serial; pgcc $DEF -O4 -Minfo -acc -fast cgs-01.c -o cgs_01_acc
Example 10: Building the Gram-Schmidt test code.
Performance on a 1000x1000 matrix demonstrates a modest performance increase when using an NVIDIA C2070 GPU compared with a quad-core 2.53 GHz Xeon e5630 CPU.
| 1000x1000 | Time (sec) |
| OpenACC | 1.66707 |
| OpenMP (4-cores) | 2.92685 |
| Serial | 9.22192 |
Table 1: 1000x1000 performance number for cgs-01.c.
Examining the timeline output using the NVIDIA Visual Profiler shows that most of the time is spent in the parallel region containing the j loop (gramSchmidt_31_gpu):
Figure 1: Unexpected HtoD and DtoH transfers.
The timeline also shows that there are transfers between the host and device both before and after each call to gramSchmidt_31_gpu.
Closer examination of the timeline shows that the data transfers bracket the loop at line 24.
Figure 2: Timeline showing the HtoD or DtoH transfers bracket gramSchmidt_24_gpu.
In other words, the nvvp profiler is showing that the result of the reduction at line 24 is being brought back to the host so it can calculate sqrt().
#pragma acc parallel reduction(+:tmp)
for(int i=0; i < rows; i++) tmp += (Q[i][k] * Q[i][k]);
tmp = sqrt(tmp);
Example 11: Loop causing PCIe transfers.
The previous timeline (Figure 1) shows that the compiler was smart enough to recognize that the reduction to tmp at line 34 could be performed entirely on the GPU. Hence, the gramSchmidt_31_gpu parallel region does not have any PCIe transfers.
Also note the large amount of time required for cuMemAlloc() and cuMemFree() relative to the calculation:
Figure 3: The cost of dynamic memory operations on the device.
As noted in Chapter 3 of CUDA Application Design and Development, the amount of time consumed by dynamic device memory operations does not vary with the size of the allocated memory region. In other words, the frequent allocation and freeing of scratch memory (demonstrated in this example) can become a performance bottleneck.
The fix is quite easy in OpenACC because a parallel region can be defined around the reduction at line 24. The lesson learned is that mirroring the parallelism of OpenMP code can provide a good starting point, but some modifications might be necessary to make best use of the OpenACC capabilities.
The following is the modified source for the OpenACC version of gramSchmidt(). For demonstration purposes, the loops were annotated with OpenACC pragmas to inform the compiler that these should be vector loops.
// Use the OpenACC implementation if _OPENACC is defined
void gramSchmidt(restrict float Q[][COLS], const int rows, const int cols)
{
#pragma acc data copy(Q[0:rows][0:cols])
for(int k=0; k < cols; k++) {
#pragma acc parallel
{
double tmp = 0.;
#pragma acc loop vector reduction(+:tmp)
for(int i=0; i < rows; i++) tmp += (Q[i][k] * Q[i][k]);
tmp = sqrt(tmp);
#pragma acc loop vector
for(int i=0; i < rows; i++) Q[i][k] /= tmp;
}
#pragma acc parallel loop
for(int j=k+1; j < cols; j++) {
double tmp=0.;
for(int i=0; i < rows; i++) tmp += Q[i][k] * Q[i][j];
for(int i=0; i < rows; i++) Q[i][j] -= tmp * Q[i][k];
}
}
}
Example 12: Code with parallel region that eliminates PCIe transfers.
Rebuilding the executable and examining the Visual Profiler timeline confirms that there are no longer any host/device transfers or memory operations inside the k loop.
Figure 4: Timeline demonstrating no PCIe transfers or dynamic memory operations.
The following table shows that the performance improvement is noticeable, but it does not drastically change the speedup relative to the OpenMP version of the code.
| 1000x1000 | Time (sec) |
| OpenACC (v2) | 1.44388 |
| OpenACC | 1.66707 |
| OpenMP (4-cores) | 2.92685 |
| Serial | 9.22192 |
Table 2: Improved results after eliminating PCIe transfers.
Transpose the Matrix for Better Memory Utilization
The NVIDIA Visual Profiler includes an automated analysis capability. The following results of the kernel memory analysis show that there is poor memory utilization throughout the calculation.
Figure 5: Low memory efficiency reported by nvvp.
Examining the source code with this in mind shows that the Q array is stored in column-major order. A better way to arrange the 2D array is in row-major order so that the device can step through adjacent memory addresses.


