Creating and Using Libraries with OpenACC

Good programming practice suggests that the size variable be specified as a const to prevent errors from mistakenly changing the number of columns in the matrix. Some compilers (specifically version 12.6 and earlier versions of the PGI compiler) do not correctly parse a const in the argument list for variables used in multidimensional array declarations. For compatibility reasons, const has been omitted from the example code presented here. It is also important to list the variables used in the multidimensional array declarations first in the calling sequence, because some compilers (like the Intel icc compiler) do not handle forward references to variables within an argument list.
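As a small illustration of the argument-ordering advice above, the sketch below declares the dimension before the array parameter whose type depends on it. The function name diagSum and its body are hypothetical and chosen only for illustration; they are not part of simpleMult.c.

```c
#include <assert.h>

/* Hypothetical helper (not from simpleMult.c): sums the diagonal of a
   square matrix. 'size' is listed before 'A', so the array parameter's
   type, float (*)[size], only refers to a parameter that has already
   been declared. As in the article's code, 'size' is not marked const. */
float diagSum(int size, float (* restrict A)[size])
{
  assert(size > 0);
  float sum = 0.f;
  for (int i = 0; i < size; ++i)
    sum += A[i][i];
  return sum;
}
```

A caller passes the dimension first and the matrix second, for example diagSum(3, M) for a float M[3][3].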

The OpenACC (simpleMult.acc) and OpenMP (simpleMult.omp) binaries of the simpleMult.c example code can be built with the PGI compiler with the following bash script:


echo "------ build for OpenACC ----"
pgcc -acc -O3 -Minfo -fast -I $INC simpleMult.c -o simpleMult.acc

echo "------ build for OpenMP ----"
pgcc -mp=all -O3 -Minfo -fast -I $INC simpleMult.c -o simpleMult.omp

The first three informational messages from the compiler (reported for source line 12) show that the doMult() function uses the conditional data clauses:

     12, Generating present_or_copyout(C[0:size][0:size])
         Generating present_or_copyin(B[0:size][0:size])
         Generating present_or_copyin(A[0:size][0:size])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     15, Loop is parallelizable
     16, Loop is parallelizable
         Accelerator kernel generated
         15, #pragma acc loop gang /* blockIdx.y */
         16, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             CC 1.3 : 22 registers; 72 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 88 constant, 0 local memory bytes
     18, Loop is parallelizable
     29, Loop not vectorized/parallelized: contains call
     59, Generating present_or_copyin(B[0:size][0:size])
         Generating present_or_copyin(A[0:size][0:size])
     60, Loop not vectorized/parallelized: contains call

The runtimes reported by simpleMult.acc on an NVIDIA C2050 (Example 1) and by simpleMult.omp on a quad-core 2.53 GHz Xeon E5630 (Example 2) provide a measure of relative performance. The Ubuntu system monitor showed that all four Xeon cores were fully utilized during the OpenMP run. The PGI-supplied pgcudainit utility was used to ensure that the timings on the NVIDIA C2050 were not affected by the GPU entering a low-power state. (pgcudainit maintains an active CUDA context on the GPU, so it won't power down.) The nvvp timeline was visually checked to ensure that the application performed the minimum number of data transfers (see Figure 1).

  $ ./simpleMult.acc 1000 3
  ./simpleMult.acc runtime 0.045108
  ./simpleMult.acc runtime 0.043823
  ./simpleMult.acc runtime 0.043793

Example 1 : OpenACC performance using a C2050 on a 1000x1000 matrix.

  $ ./simpleMult.omp 1000 3
  ./simpleMult.omp runtime   2.9749
  ./simpleMult.omp runtime   2.6862
  ./simpleMult.omp runtime   2.6802

Example 2 : OpenMP performance on a 2.53 GHz Xeon E5630 on a 1000x1000 matrix.

Surprisingly, Examples 1 and 2 indicate that the OpenACC GPU code is roughly 59x faster than the OpenMP code running on a quad-core Xeon. In comparison, the GPU version of the OpenACC matrix multiplication discussed in my first article was, at best, 6.8x faster than the OpenMP version.

The PGI compiler generated the following output when compiling the source code for matrix-omp.c from Easy GPU Parallelism with OpenACC. The loop interchange message ("Loop interchange produces reordered loop nest: 22,24,23") provides an important clue:

     12, Parallel code generated with block distribution
     13, Generated vector sse code for the loop
     22, Parallel region activated
         Parallel loop activated with static block schedule
         Loop interchange produces reordered loop nest: 22,24,23
     23, Generated an alternate version of the loop
         Generated vector sse code for the loop
         Generated 2 prefetch instructions for the loop
     29, Barrier
         Parallel region terminated

This message indicates that the PGI compiler decided that the loop ordering provided in the matrix-omp.c source code was sub-optimal for the Xeon processor. Instead, it reordered the loops for better performance. In contrast, the simpleMult.c example accumulates a sum in the variable tmp, which prevents loop reordering.
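To make the contrast concrete, here is a minimal sketch of a multiply that accumulates into a scalar, which is how the text above describes simpleMult.c. The function name multAccum and the exact loop body are assumptions for illustration only; the simpleMult.c source is not reproduced in this section, and the OpenACC and OpenMP pragmas are left out to keep the sketch short.

```c
#include <assert.h>

/* Hypothetical sketch of an accumulating matrix multiply (i, j, k order).
   Each C[i][j] is built up in the scalar tmp across the whole k loop, so
   the k loop is tied to the innermost position. Per the discussion above,
   this is the pattern that keeps the compiler from reordering the nest. */
void multAccum(int size, float (* restrict A)[size],
               float (* restrict B)[size], float (* restrict C)[size])
{
  assert(size > 0);
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.f;
      for (int k = 0; k < size; ++k)
        tmp += A[i][k] * B[k][j];  /* B is read down a column (stride: size floats) */
      C[i][j] = tmp;
    }
  }
}
```

In this ordering the innermost loop steps through B one row at a time, whereas an i, k, j ordering lets the innermost loop walk B[k][j] and C[i][j] along contiguous memory.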

The following implementation, fastSimpleMult.c, reorders the loops to match the ordering that the PGI compiler preferred for matrix-omp.c. At first glance, the code looks as though it should run more slowly, because it takes two steps: zeroing out the C matrix and then performing the matrix multiplication.

/* fastSimpleMult.c for OpenACC and OpenMP */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

// A simple square matrix multiply using conditional data clauses
void doMult(int size, float (* restrict A)[size], float (* restrict B)[size],
            float (* restrict C)[size])
{
  // Compute matrix multiplication.
#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) pcopyout(C[0:size][0:size])
  {
    // Step 1: zero the result matrix
#pragma omp parallel for default(none) shared(C,size)
    for (int i = 0; i < size; ++i)
      for (int j = 0; j < size; ++j)
        C[i][j] = 0.f;

    // Step 2: multiply using the i,k,j loop ordering
#pragma omp parallel for default(none) shared(A,B,C,size)
    for (int i = 0; i < size; ++i) {
      for (int k = 0; k < size; ++k) {
        for (int j = 0; j < size; ++j) {
          C[i][j] += A[i][k] * B[k][j];
        }
      }
    }
  }
}

// Fill A and B with random values between 0 and 1
void fill(int size, float (* restrict A)[size], float (* restrict B)[size])
{
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      A[i][j] = random()/(double)RAND_MAX;
      B[i][j] = random()/(double)RAND_MAX;
    }
  }
}

int main(int argc, char *argv[])
{
  if(argc != 3) {
    fprintf(stderr,"Use: %s size nIter\n",argv[0]);
    return -1;
  }

  int size=atoi(argv[1]);
  int nIter=atoi(argv[2]);
  if(nIter <= 0) {
    fprintf(stderr,"%s: Invalid nIter (%d)\n",argv[0],nIter);
    return -1;
  }

  // allocate the square matrices
  float (*restrict A)[size] = malloc(sizeof(float)*size*size);
  float (*restrict B)[size] = malloc(sizeof(float)*size*size);
  float (*restrict C)[size] = malloc(sizeof(float)*size*size);

  fill(size,A,B);

  // Ensure the A and B matrices are present on the device
#pragma acc data pcopyin(A[0:size][0:size],B[0:size][0:size])
  for(int i=0; i < nIter; i++) {
      double startTime = omp_get_wtime();
      doMult(size,A,B,C);
      double endTime = omp_get_wtime();
      printf("%s runtime %8.5g\n",argv[0], (endTime-startTime));
  }

  free(A); free(B); free(C);

  return 0;
}

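However, the following timings show that fastSimpleMult.c is clearly much faster on the Xeon processor: the OpenMP runtime drops from roughly 2.7 seconds to roughly 0.13 seconds, while the OpenACC runtime on the C2050 is essentially unchanged.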

$ ./fastSimpleMult.acc 1000 3
./fastSimpleMult.acc runtime  0.04298
./fastSimpleMult.acc runtime 0.041681
./fastSimpleMult.acc runtime 0.041697
$ ./fastSimpleMult.omp 1000 3
./fastSimpleMult.omp runtime  0.12139
./fastSimpleMult.omp runtime  0.13461
./fastSimpleMult.omp runtime  0.13055
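Reordering loops should not change the answer, so it can be worth confirming that two versions of a multiply agree before comparing their timings. The helper below is a hypothetical addition rather than part of the original sources: the name maxAbsDiff is an assumption, and an acceptable tolerance for single-precision results depends on the matrix size and the data.

```c
#include <assert.h>

/* Hypothetical helper (not from the article's sources): returns the
   largest absolute element-wise difference between two size x size
   matrices, for example the C matrices produced by two loop orderings. */
float maxAbsDiff(int size, float (* restrict X)[size],
                 float (* restrict Y)[size])
{
  assert(size > 0);
  float worst = 0.f;
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float d = X[i][j] - Y[i][j];
      if (d < 0.f) d = -d;        /* absolute value without needing libm */
      if (d > worst) worst = d;
    }
  }
  return worst;
}
```

A result of zero, or a value within a small tolerance, suggests that the reordered loops compute the same product.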
