### Multidimensional Dynamic Arrays

The previous examples used static, globally accessible 2D C arrays. Most applications, however, dynamically allocate all data structures, including multidimensional arrays that are frequently passed to functions and subroutines. A particular challenge for C/C++ programmers is that OpenACC transfers occur between contiguous regions of host and device memory. Non-contiguous multidimensional arrays (such as `float **` arrays) are not recommended because they require a separate transfer for each contiguous memory region.

The following example, matrix-acc-func.c, dynamically allocates the 2D matrices for the test and passes them to `doTest()`, which performs the matrix initialization and multiplication. The test uses the C99 `restrict` keyword to indicate that the matrices do not overlap. For convenience, the 2D shape of the arrays is declared in the function signature to make array accesses straightforward:

```c
int doTest(restrict float a[][SIZE], restrict float b[][SIZE],
           restrict float c[][SIZE], int size)
{
   …
   c[i][j] = 0.0f;
   …
}
```

**Example 7: Code snippet for straightforward 2D array indexing**.

Of course, the programmer can instead pass a pointer to the contiguous region of memory and calculate the offsets into the multidimensional array manually, as the next example demonstrates.

```c
int doTest(restrict float *a, restrict float *b,
           restrict float *c, int size)
{
   …
   c[i*size + j] = 0.0f;
   …
}
```

**Example 8: Code snippet demonstrating manual calculation of the 2D array offset.**

The matrix-acc-func.c example also demonstrates the OpenACC pragma `#pragma acc loop independent`. The `independent` clause tells the compiler to ignore its own dependency analysis and trust that the programmer knows the loop iterations have no dependencies. If the programmer is mistaken, the result is incorrect and potentially non-deterministic program behavior. Conversely, the pragma `#pragma acc loop seq` tells the compiler to generate code that executes the loop sequentially on the device.

```c
/* matrix-acc-func.c */
#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000

void doTest(restrict float a[][SIZE], restrict float b[][SIZE],
            restrict float c[][SIZE], int size)
{
  int i, j, k;

#pragma acc kernels create(a[0:size][0:size], b[0:size][0:size]) \
  copyout(c[0:size][0:size])
  {
    // Initialize matrices.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
        a[i][j] = (float)i + j;
        b[i][j] = (float)i - j;
        c[i][j] = 0.0f;
      }
    }

    // Compute matrix multiplication.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
#pragma acc loop seq
        for (k = 0; k < size; ++k) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }
  }
}

int main()
{
  int i, j, k;
  int size = SIZE;
  float *a = (float *)malloc(sizeof(float) * size * size);
  float *b = (float *)malloc(sizeof(float) * size * size);
  float *c = (float *)malloc(sizeof(float) * size * size);

  // Cast the flat allocations to the 2D array type doTest() expects.
  doTest((float (*)[SIZE])a, (float (*)[SIZE])b, (float (*)[SIZE])c, size);
  free(a);
  free(b);

  // ****************
  // double-check the OpenACC result sequentially on the host
  // ****************
  float *seq = (float *)malloc(sizeof(float) * size * size);

  // Initialize the seq matrix
  for (i = 0; i < size; ++i)
    for (j = 0; j < size; ++j)
      seq[i * SIZE + j] = 0.f;

  // Perform the multiplication
  for (i = 0; i < size; ++i)
    for (j = 0; j < size; ++j)
      for (k = 0; k < size; ++k)
        seq[i * size + j] += (i + k) * (k - j);

  // Check all the OpenACC results
  for (i = 0; i < size; ++i)
    for (j = 0; j < size; ++j)
      if (c[i * size + j] != seq[i * size + j]) {
        printf("Error (%d %d) (%g, %g)\n", i, j,
               c[i * size + j], seq[i * size + j]);
        exit(1);
      }

  free(c);
  free(seq);

  printf("OpenACC matrix multiplication test was successful!\n");
  return 0;
}
```

**Example 9: matrix-acc-func.c source code.**

### Using Data Allocated on the Device

OpenACC also provides the ability to use previously allocated device memory via the `deviceptr()` clause. The following example, matrix-acc-alloc.c, demonstrates how to allocate device memory in `main()` with the OpenACC runtime routine `acc_malloc()`. The pointer is then passed to `doTest()`, where it is accessed via `deviceptr()`. Note that the `copyout()` clause specifies the size of the contiguous region of memory. For timing purposes, this code uses a matrix size specified by the user on the command line.

```c
/* matrix-acc-alloc.c */
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

void doTest(restrict float *a, restrict float *b, restrict float *c, int size)
{
  int i, j, k;

#pragma acc kernels deviceptr(a, b) copyout(c[0:size*size])
  {
    // Initialize matrices.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
        a[i * size + j] = (float)i + j;
        b[i * size + j] = (float)i - j;
        c[i * size + j] = 0.0f;
      }
    }

    // Compute matrix multiplication.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
#pragma acc loop seq
        for (k = 0; k < size; ++k) {
          c[i * size + j] += a[i * size + k] * b[k * size + j];
        }
      }
    }
  }
}

int main(int argc, char *argv[])
{
  if (argc < 2) {
    fprintf(stderr, "Use: size (for size x size matrices)\n");
    return -1;
  }
  int size = atoi(argv[1]);

  // a and b live only on the device; c is transferred back to the host.
  float *a = (float *)acc_malloc(sizeof(float) * size * size);
  float *b = (float *)acc_malloc(sizeof(float) * size * size);
  float *c = (float *)malloc(sizeof(float) * size * size);

  printf("size = %d\n", size);
  doTest(a, b, c, size);

  acc_free(a);
  acc_free(b);
  free(c);

  printf("OpenACC matrix multiplication test was successful!\n");
  return 0;
}
```

**Example 10: Source code for matrix-acc-alloc.c.**

### Conclusion

OpenACC has been designed to give OpenMP-style programmers an easy transition to GPU programming. Following the common-sense adage, "Make your life easy and use the highest-level API first," OpenACC provides a natural starting point for moving any C or Fortran code to massive parallelism. For legacy code, OpenACC may be the only viable route to massively parallel coprocessors because it supports Fortran and eliminates the need for a total rewrite of the software. As a result, OpenACC opens the door to scalable, massively parallel GPU (or, more generically, coprocessor) acceleration of millions of lines of legacy application code. Currently, OpenACC is supported only by commercial compilers from PGI and CAPS Enterprise. The PGI compiler used in this article is free for evaluation, but continued use after the trial period expires requires a purchased license. As with OpenMP, it is expected that open-source compilers will eventually provide free OpenACC support.

Profiling and informational compiler messages play a key role in achieving high performance with pragma-based programming. Instead of blindly adding pragmas and guessing at the impact each might have on an application, developers can use free tools like the NVIDIA Visual Profiler to see what is actually happening at runtime on Windows, Linux, and Mac computers. Being able to observe the effect OpenACC pragmas have on runtime behavior greatly speeds both the OpenACC learning process and application acceleration.

My next article in this series will discuss the OpenACC memory and execution model including the gang and worker clauses plus more sophisticated ways to handle data.

*Rob Farber is an analyst who writes frequently on High-Performance Computing hardware topics.*