Channels ▼


The OpenACC Execution Model

For teaching purposes, separate versions of the OpenMP and OpenACC gramSchmidt() method will be used in this tutorial. The use of OpenACC in production code will almost certainly mix OpenACC and OpenMP pragmas without the use of a preprocessor conditional.

The call to gramSchmidt() is timed, which works because the host processor must block at the end of the gramSchmidt() method and wait for the Q matrix to be copied off the accelerator.

  double startTime = omp_get_wtime();
  gramSchmidt(A,rows, cols);
  double endTime = omp_get_wtime();
  printf("runtime %d %d matrix %g\n",rows, cols, (endTime-startTime));

Example 9: Timing the gramSchmidt runtime.

The remainder of the methods in the test code should be self-explanatory.

The sequential, OpenMP, and OpenACC executables can be built under Linux with the following commands. In this exampile, I'm using the PGI C compiler:

DEF="-D ROWS=1000 -D COLS=1000"

pgcc $DEF -O4 -Minfo -mp -Mconcur=allcores cgs-01.c -o cgs_01_omp
pgcc $DEF -O4 cgs-01.c -o cgs_01_serial;
pgcc $DEF -O4 -Minfo -acc -fast cgs-01.c -o cgs_01_acc

Example 10: Building the Gram-Schmidt test code.

Performance on a 1000x1000 matrix demonstrates a modest performance increase when using an NVIDIA C2070 GPU compared with a quad-core 2.53 GHz Xeon e5630 CPU.

1000x1000 Time (sec)
OpenACC 1.66707
OpenMP (4-cores) 2.92685
Serial 9.22192

Table 1: 1000x1000 performance number for cgs-01.c.

Examining the timeline output using the NVIDIA Visual Profiler shows that most of the time is spent in the parallel region containing the j loop (gramSchmidt_31_gpu):

Figure 1: Unexpected HtoD and DtoH transfers.

The timeline also shows that there are transfers between the host and device both before and after each call to gramSchmidt_31_gpu.

Closer examination of the timeline shows that the data transfers bracket the loop at line 24.

Figure 2: Timeline showing the HtoD or DtoH transfers bracket gramSchmidt_24_gpu.

In other words, the nvvp profiler is showing that the result of the reduction at line 24 is being brought back to the host so it can calculate sqrt().

#pragma acc parallel reduction(+:tmp)
    for(int i=0; i < rows; i++) tmp +=  (Q[i][k] * Q[i][k]);
    tmp = sqrt(tmp);

Example 11: Loop causing PCIe transfers.

The previous timeline (Figure 1) shows that the compiler was smart enough to recognize that the reduction to tmp at line 34 could be performed entirely on the GPU. Hence, the gramSchmidt_31_gpu parallel region does not have any PCIe transfers.

Also note the large amount of time required for cuMemAlloc() and cuMemFree() relative to the calculation:

Figure 3: The cost of dynamic memory operations on the device.

As noted in Chapter 3 of CUDA Application Design and Development, the amount of time consumed by dynamic device memory operations does not vary with the size of the allocated memory region. In other words, the frequent allocation and freeing of scratch memory (demonstrated in this example) can become a performance bottleneck.

The fix is quite easy in OpenACC because a parallel region can be defined around the reduction at line 24. The lesson learned is that mirroring the parallelism of OpenMP code can provide a good starting point, but some modifications might be necessary to make best use of the OpenACC capabilities.

The following is the modified source for the OpenACC version of gramSchmidt(). For demonstration purposes, the loops were annotated with OpenACC pragmas to inform the compiler that these should be vector loops.

// Use the OpenACC implementation if _OPENACC is defined
void gramSchmidt(restrict float Q[][COLS], const int rows, const int cols) 
#pragma acc data copy(Q[0:rows][0:cols])
  for(int k=0; k < cols; k++) {
#pragma acc parallel
	double tmp = 0.;
#pragma acc loop vector reduction(+:tmp)  
	for(int i=0; i < rows; i++) tmp +=  (Q[i][k] * Q[i][k]);
	tmp = sqrt(tmp);
#pragma acc loop vector 
	for(int i=0; i < rows; i++) Q[i][k] /= tmp;
#pragma acc parallel loop
    for(int j=k+1; j < cols; j++) {
      double tmp=0.;
      for(int i=0; i < rows; i++) tmp += Q[i][k] * Q[i][j];
      for(int i=0; i < rows; i++) Q[i][j] -= tmp * Q[i][k];

Example 12: Code with parallel region that eliminates PCIe transfers.

Rebuilding the executable and examining the Visual Profiler timeline confirms that there are no longer any host/device transfers or memory operations inside the k loop.

Figure 4: Timeline demonstrating no PCIe transfers or dynamic memory operations.

The following table shows that the performance improvement is noticeable, but it does not drastically change the speedup relative to the OpenMP version of the code.

1000x1000 Time (sec)
OpenACC (v2) 1.44388
OpenACC 1.66707
OpenMP (4-cores) 2.92685
Serial 9.22192

Table 2: Improved results after eliminating PCIe transfers.

Transpose the Matrix for Better Memory Utilization

The NVIDIA Visual Profiler includes an automated analysis capability. The following results of the kernel memory analysis show that there is poor memory utilization throughout the calculation.

Figure 5: Low memory efficiency reported by nvvp.

Examining the source code with this in mind shows that the Q array is stored in column-major order. A better way to arrange the 2D array is in row-major order so that the device can step through adjacent memory addresses.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.