
Easy GPU Parallelism with OpenACC


Multidimensional Dynamic Arrays

The previous examples used static, globally accessible 2D C arrays. Most applications instead allocate their data structures dynamically, including multidimensional arrays that are frequently passed to functions and subroutines. A particular challenge for C/C++ programmers is that OpenACC transfers occur between contiguous regions of host and device memory. Non-contiguous multidimensional arrays (such as float ** arrays whose rows are allocated separately) are not recommended because each contiguous memory region requires its own transfer.
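As a minimal sketch of the contiguity point, the helper below keeps the m[i][j] syntax on the host while storing all elements in one block. The names alloc_matrix() and free_matrix() are hypothetical and are not part of the article's examples; the row-pointer table is only a host-side convenience, and the single contiguous region starts at m[0].

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: rows*cols floats in ONE contiguous block, plus a
   table of row pointers into that block. m[0] is the start of the block. */
float **alloc_matrix(int rows, int cols)
{
  int i;
  assert(rows > 0 && cols > 0);

  float **m = (float **)malloc((size_t)rows * sizeof *m);
  if (m == NULL) return NULL;

  m[0] = (float *)malloc((size_t)rows * cols * sizeof **m);
  if (m[0] == NULL) { free(m); return NULL; }

  /* Every row pointer aims into the same block, cols elements apart. */
  for (i = 1; i < rows; ++i)
    m[i] = m[0] + (size_t)i * cols;
  return m;
}

/* Release the data block first, then the row-pointer table. */
void free_matrix(float **m)
{
  if (m != NULL) {
    free(m[0]);
    free(m);
  }
}
```

With this layout, a data clause can name the flat pointer (for example, float *data = m[0]; with the range data[0:rows*cols]) and the region is described by a single contiguous range rather than one range per row.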

The following example, matrix-acc-func.c, dynamically allocates the 2D matrices for the test and passes them to doTest(), which performs the matrix initializations and multiplication. This test uses the C99 restrict qualifier to tell the compiler that the matrices do not overlap. For convenience, the 2D shape of the arrays is declared in the function signature to make array accesses straightforward:

int doTest(float a[restrict][SIZE], float b[restrict][SIZE],
	   float c[restrict][SIZE], int size)
{
  …
	c[i][j] = 0.0f;
}   

Example 7: Code snippet for straightforward 2D array indexing.

Of course, the programmer can pass the pointer to the contiguous region of memory and manually calculate the offsets into the multidimensional array as will be demonstrated in the next example.

int doTest(float *restrict a, float *restrict b, float *restrict c, int size)
{
  …
	c[i*size+j] = 0.0f;
}

Example 8: Code snippet demonstrating manual calculation of the 2D array offset.
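The two indexing styles address the same storage. As a small sketch, the function below writes an element through the manually computed offset and reads it back through a 2D view of the same memory. The name roundtrip() and the constant COLS (standing in for SIZE) are assumptions made for illustration only.

```c
#include <assert.h>

#define COLS 4  /* hypothetical fixed row length, standing in for SIZE */

/* Write through the flat offset, then read the same element back through
   a pointer-to-array view. No data is copied; both names alias one block. */
float roundtrip(float *flat, int rows, int i, int j, float value)
{
  assert(i >= 0 && i < rows && j >= 0 && j < COLS);

  float (*grid)[COLS] = (float (*)[COLS])flat;  /* 2D view of flat */

  flat[i * COLS + j] = value;  /* row-major offset: row * row length + column */
  return grid[i][j];           /* same address as &flat[i * COLS + j] */
}
```

The 2D view requires the row length to be known in the type, which is why the 2D form of doTest() relies on a compile-time constant while the flat form can take the size at runtime.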

The matrix-acc-func.c example also demonstrates the use of the OpenACC pragma "#pragma acc loop independent". The independent clause tells the compiler to ignore its own dependency analysis and trust that the programmer knows the loops have no dependencies. Incorrect and non-deterministic program behavior can happen if the programmer is mistaken. Conversely, the OpenACC pragma "#pragma acc loop seq" tells the compiler to generate code that will execute sequentially on the device.

/* matrix-acc-func.c */
#include <stdio.h>
#include <stdlib.h>
#define SIZE 1000

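// restrict qualifies the pointer, so in an array parameter it goes inside
// the first pair of brackets.
int doTest(float a[restrict][SIZE],
	   float b[restrict][SIZE],
	   float c[restrict][SIZE], int size)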
{
  int i,j,k;
  
#pragma acc kernels create(a[0:size][0:size], b[0:size][0:size]) \
  copyout(c[0:size][0:size]) 
  {
    // Initialize matrices.
 #pragma acc loop independent
    for (i = 0; i < size; ++i) {
 #pragma acc loop independent
      for (j = 0; j < size; ++j) {
	a[i][j] = (float)i + j;
	b[i][j] = (float)i - j;
	c[i][j] = 0.0f;
      }
    }
    
    // Compute matrix multiplication.
 #pragma acc loop independent
    for (i = 0; i < size; ++i) {
 #pragma acc loop independent
      for (j = 0; j < size; ++j) {
 #pragma acc loop seq
	for (k = 0; k < size; ++k) {
	  c[i][j] += a[i][k] * b[k][j];
	}
      }
    }
  }
  return 0;
}

int main()
{
  int i,j,k;
  int size=SIZE;
  float *a= (float*)malloc(sizeof(float)*size*size);
  float *b= (float*)malloc(sizeof(float)*size*size);
  float *c= (float*)malloc(sizeof(float)*size*size);


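  // View each contiguous block as an array of SIZE-element rows.
  doTest((float (*)[SIZE])a, (float (*)[SIZE])b, (float (*)[SIZE])c, size);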

  free(a);
  free(b);

  // ****************
  // double-check the OpenACC result sequentially on the host
  // ****************
  float *seq= (float*)malloc(sizeof(float)*size*size);
  // Initialize the seq matrix
  for(i = 0; i < size; ++i) 
    for(j = 0; j < size; ++j) 
      seq[i*size+j] = 0.f;
  
  // Perform the multiplication
  for (i = 0; i < size; ++i) 
    for (j = 0; j < size; ++j) 
      for (k = 0; k < size; ++k) 
	seq[i*size+j] += (i+k) * (k-j);
  
  // check all the OpenACC matrices
  for (i = 0; i < size; ++i)
    for (j = 0; j < size; ++j)
      if(c[i*size+j] != seq[i*size+j]) {
	printf("Error (%d %d) (%g, %g)\n", i,j, c[i*size+j], seq[i*size+j]);
	exit(1);
      }
  free(c);
  free(seq);

  printf("OpenACC matrix multiplication test was successful!\n");
  
  return 0;
}

Example 9: matrix-acc-func.c source code.
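The two loop clauses used in matrix-acc-func.c can be contrasted in a smaller sketch. The functions scatter() and running_sum() below are hypothetical and not part of the article's code. In scatter(), the compiler cannot see that perm[] holds distinct indices, so the programmer asserts it with independent; the result is only correct if that assertion is true. In running_sum(), each iteration reads the value written by the previous one, which is a genuine dependence, so the loop is marked seq.

```c
#include <assert.h>

/* out[perm[i]] = in[i]. The caller promises that perm[] is a permutation
   of 0..n-1, so no two iterations write the same element. */
void scatter(int n, const int *restrict perm,
	     const float *restrict in, float *restrict out)
{
  int i;
  assert(n > 0);

#pragma acc kernels copyin(perm[0:n], in[0:n]) copyout(out[0:n])
  {
#pragma acc loop independent
    for (i = 0; i < n; ++i)
      out[perm[i]] = in[i];
  }
}

/* In-place running sum: x[i] depends on x[i-1], so the iterations must
   execute in order. */
void running_sum(int n, float *restrict x)
{
  int i;
  assert(n > 0);

#pragma acc kernels copy(x[0:n])
  {
#pragma acc loop seq
    for (i = 1; i < n; ++i)
      x[i] += x[i-1];
  }
}
```

A compiler without OpenACC support ignores the pragmas and runs both loops on the host, which gives a convenient reference result when checking a device build.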

Using Data Allocated on the Device

OpenACC also provides the ability to use previously allocated device memory with the deviceptr() clause. The following example, matrix-acc-alloc.c, demonstrates how to allocate memory in main() with the OpenACC runtime method acc_malloc(). The pointer is then passed to doTest(), where it is accessed via deviceptr(). The copyout() clause also specifies the extent of the contiguous region of host memory that receives the result. For timing purposes, this code uses a size specified by the user on the command line.

/* matrix-acc-alloc.c */
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

int doTest(float *restrict a, float *restrict b,
	   float *restrict c, int size)
{
  int i,j,k;
  
  // Array sections in C are written [start:length], so c[0:size*size]
  // covers the whole matrix.
#pragma acc kernels deviceptr(a, b) copyout(c[0:size*size])
  {
    // Initialize matrices.
 #pragma acc loop independent
    for (i = 0; i < size; ++i) {
 #pragma acc loop independent
      for (j = 0; j < size; ++j) {
	a[i*size+j] = (float)i + j;
	b[i*size+j] = (float)i - j;
	c[i*size+j] = 0.0f;
      }
    }
    
    // Compute matrix multiplication.
 #pragma acc loop independent
    for (i = 0; i < size; ++i) {
 #pragma acc loop independent
      for (j = 0; j < size; ++j) {
 #pragma acc loop seq
	for (k = 0; k < size; ++k) {
	  c[i*size+j] += a[i*size+k] * b[k*size+j];
	}
      }
    }
  }
  return 0;
}

int main(int argc, char *argv[])
{
  int i,j,k;
 
  if(argc < 2) {
    fprintf(stderr,"Use: size (for size x size) matrices\n");
    return -1;
  }
  int size=atoi(argv[1]);
  float *a = (float *)acc_malloc(sizeof(float)*size*size);
  float *b = (float *)acc_malloc(sizeof(float)*size*size);
  float *c= (float*)malloc(sizeof(float)*size*size);

  printf("size = %d\n",size);

  doTest(a,b,c, size);

  acc_free(a); 
  acc_free(b);
  free(c);

  printf("OpenACC matrix multiplication test was successful!\n");
  
  return 0;
}

Example 10: Source code for matrix-acc-alloc.c.
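The deviceptr() pattern can be reduced to a short sketch that also checks the allocation result. The helpers scratch_alloc(), scratch_free(), and squares() are hypothetical names introduced here. The sketch assumes the standard _OPENACC macro is defined by OpenACC compilers, and it falls back to host malloc() otherwise so that the same source builds and runs without an accelerator.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper: scratch space for n floats. Device memory under
   OpenACC, ordinary host memory otherwise. */
static float *scratch_alloc(size_t n)
{
#ifdef _OPENACC
  return (float *)acc_malloc(n * sizeof(float));
#else
  return (float *)malloc(n * sizeof(float));
#endif
}

static void scratch_free(float *p)
{
#ifdef _OPENACC
  acc_free(p);
#else
  free(p);
#endif
}

/* Fill out[i] with i*i, staging the values in the scratch buffer.
   Returns 0 on success, -1 if the scratch allocation failed. */
int squares(float *restrict out, int n)
{
  int i;
  assert(n > 0);

  float *scratch = scratch_alloc((size_t)n);
  if (scratch == NULL) return -1;

  /* scratch is already a device address; only out is copied back. */
#pragma acc kernels deviceptr(scratch) copyout(out[0:n])
  {
#pragma acc loop independent
    for (i = 0; i < n; ++i) {
      scratch[i] = (float)i;
      out[i] = scratch[i] * scratch[i];
    }
  }

  scratch_free(scratch);
  return 0;
}
```

The host never dereferences scratch directly. In a device build that pointer is only meaningful inside the kernels region, which is why results are returned through the separate copyout() array.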

Conclusion

OpenACC has been designed to provide OpenMP-style programmers with an easy transition to GPU programming. Following the common-sense adage, "Make your life easy and use the highest level API first," OpenACC provides a natural starting point to transition any C or Fortran code to massive parallelism. For legacy code, OpenACC can be the only viable route to massively parallel coprocessors because it eliminates the need for a total rewrite of the software and because it supports Fortran. As a result, OpenACC opens the door to scalable, massively parallel GPU (or, more generically, coprocessor) acceleration of millions of lines of legacy application code. Currently, OpenACC is supported by compilers that must be purchased from either PGI or CAPS Enterprise. The PGI compiler used in this article is free for evaluation, but continued use after the trial period expires requires a purchased license. As with OpenMP, it is assumed that open-source compilers will eventually provide free OpenACC support.

Profiling and informational compiler messages play a key role in achieving high performance in pragma-based programming. Instead of blindly adding pragmas and then guessing at the impact each might have on an application, developers can use free tools like the NVIDIA Visual Profiler to see what is actually happening at runtime on Windows, Linux, and Mac computers. Being able to see what effect OpenACC pragmas have on runtime behavior greatly speeds the OpenACC learning process as well as application acceleration.

My next article in this series will discuss the OpenACC memory and execution model including the gang and worker clauses plus more sophisticated ways to handle data.


Rob Farber is an analyst who writes frequently on High-Performance Computing hardware topics.

