### How To Try Out OpenACC

An extended 30-day trial license for the PGI software can be obtained by registering with NVIDIA. The Portland Group also provides a free 15-day OpenACC trial license, which can be obtained by following these three steps:

1. Download any of the available software packages for your operating system.

2. Review the PGI Installation Guide [PDF] or the PGI Visual Fortran Installation Guide [PDF] and configure your environment.

3. Generate the trial license keys. Note that the trial keys, and all executable files compiled using them, will cease operating at the end of the trial period.

The following set of examples multiplies two matrices `a` and `b` and stores the result in matrix `c`. The examples use a basic set of OpenACC data transfer, parallelization, and memory creation/access clauses. A C-language OpenMP matrix multiply is also provided to show the similarity between OpenACC and OpenMP and to allow CPU and GPU performance comparisons. While the PGI matrix multiplication performance is good, please look to the highly optimized BLAS (Basic Linear Algebra Subroutines) packages such as CUBLAS and phiGEMM for production GPU and hybrid CPU + GPU implementations.

Following is our first OpenACC program, matrix-acc-check.c. This simple code creates a static set of square matrices (`a`, `b`, `c`, `seq`), initializes them, and then performs a matrix multiplication on the OpenACC device. The test code then performs the matrix multiplication sequentially on the host processor and double-checks the OpenACC result.

```c
/* matrix-acc-check.c */
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* exit */

#define SIZE 1000
float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];
float seq[SIZE][SIZE];

int main()
{
  int i,j,k;

  // Initialize matrices.
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      a[i][j] = (float)i + j;
      b[i][j] = (float)i - j;
      c[i][j] = 0.0f;
    }
  }

  // Compute matrix multiplication.
#pragma acc kernels copyin(a,b) copy(c)
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      for (k = 0; k < SIZE; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }

  // ****************
  // double-check the OpenACC result sequentially on the host
  // ****************
  // Initialize the seq matrix
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      seq[i][j] = 0.f;

  // Perform the multiplication
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      for (k = 0; k < SIZE; ++k)
        seq[i][j] += a[i][k] * b[k][j];

  // check all the OpenACC matrices
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      if (c[i][j] != seq[i][j]) {
        printf("Error %d %d\n", i,j);
        exit(1);
      }

  printf("OpenACC matrix multiplication test was successful!\n");
  return 0;
}
```

**Example 1: matrix-acc-check.c source code.**

The OpenACC pragma tells the compiler the following:

- `#pragma acc`: This is an OpenACC pragma.
- `kernels`: A kernels region.
  - No jumps are allowed into/out of the kernels region.
  - Loops will be sent to the OpenACC device.
  - The scope of the kernels region code block is denoted by the curly brackets in a C program.
- `copyin()`: Copy the contiguous region of memory from the host to the device.
  - The variables, arrays or subarrays in the list have values in the host memory that need to be copied to the device memory.
  - If a subarray is specified, then only that subarray of the array needs to be copied.
- `copy()`: Copy the contiguous memory region from the host to the device and back again.
  - The variables, arrays or subarrays in the list have values in the host memory that need to be copied to the device memory.
  - If a subarray is specified, then only that subarray of the array needs to be copied.
  - The data is copied to the device memory before entry to the kernels region, and copied back to the host memory when the code block is complete.
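The subarray form mentioned above can be sketched as follows. This is a minimal illustration, not code from the original examples: the function name `scale_prefix` and its parameters are hypothetical. In C, an OpenACC subarray is written `name[start:length]`, so `x[0:n]` names the first `n` elements. A compiler without OpenACC support ignores the pragma and simply runs the loop on the host.

```c
/* Hypothetical routine (not part of the original examples): adds
   factor * x[i] to y[i] for the first n elements only.
   restrict tells the compiler that x and y do not overlap. */
void scale_prefix(const float *restrict x, float *restrict y,
                  int n, float factor)
{
    int i;

    /* x[0:n] and y[0:n] are subarrays: start index 0, length n.
       Only these n elements of x are copied to the device (copyin);
       the same range of y is copied to the device and back (copy). */
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (i = 0; i < n; ++i) {
        y[i] += factor * x[i];
    }
}
```

Calling `scale_prefix(x, y, 2, 2.0f)` on four-element arrays updates only `y[0]` and `y[1]`; the remaining elements are neither transferred nor modified.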

The source code is compiled with the pgcc compiler, and a successful run prints the message shown below:

```
pgcc -acc -fast -Minfo matrix-acc-check.c -o matrix-acc-check
./matrix-acc-check
OpenACC matrix multiplication test was successful!
```
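The check in matrix-acc-check.c compares floats with `!=`, which passes only when the device and the host produce bit-identical results. If a compiler sums the inner products in a different order on the device, small rounding differences are possible. A minimal sketch of a tolerance-based comparison follows; the helper names `nearly_equal` and `count_mismatches`, and the idea of passing a relative tolerance, are illustrative assumptions rather than part of the original program.

```c
/* Hypothetical helpers (not part of the original program). */

/* Absolute value without pulling in the math library. */
float abs_f(float v)
{
    return v < 0.0f ? -v : v;
}

/* Returns 1 when x and y agree to within rel_tol, measured relative to
   the larger magnitude (with a floor of 1.0 so that values near zero
   are compared with an absolute tolerance), otherwise 0. */
int nearly_equal(float x, float y, float rel_tol)
{
    float diff  = abs_f(x - y);
    float ax    = abs_f(x);
    float ay    = abs_f(y);
    float scale = ax > ay ? ax : ay;

    if (scale < 1.0f)
        scale = 1.0f;
    return diff <= rel_tol * scale;
}

/* Counts the elements of two n x n row-major matrices that differ by
   more than the tolerance. A result of 0 means the matrices match. */
int count_mismatches(const float *c, const float *seq, int n, float rel_tol)
{
    int i, bad = 0;

    for (i = 0; i < n * n; ++i)
        if (!nearly_equal(c[i], seq[i], rel_tol))
            ++bad;
    return bad;
}
```

With the static arrays of the example, a call such as `count_mismatches(&c[0][0], &seq[0][0], SIZE, 1e-5f)` could replace the exact comparison loop; the tolerance of `1e-5f` is only a starting point and should be chosen for the data at hand.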

The source code for matrix-acc.c was created by removing the host-side verification code from matrix-acc-check.c to simplify the following discussion.

```c
/* matrix-acc.c */
#define SIZE 1000
float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

int main()
{
  int i,j,k;

  // Initialize matrices.
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      a[i][j] = (float)i + j;
      b[i][j] = (float)i - j;
      c[i][j] = 0.0f;
    }
  }

  // Compute matrix multiplication.
#pragma acc kernels copyin(a,b) copy(c)
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      for (k = 0; k < SIZE; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }
  return 0;
}
```

**Example 2: matrix-acc.c source code.**

Note the similarity between matrix-acc.c and the following OpenMP implementation, matrix-omp.c. Only the pragmas differ: the OpenACC pragma includes copy operations that are not required in the OpenMP implementation.

```c
/* matrix-omp.c */
#define SIZE 1000
float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

int main()
{
  int i,j,k;

  // Initialize matrices.
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      a[i][j] = (float)i + j;
      b[i][j] = (float)i - j;
      c[i][j] = 0.0f;
    }
  }

  // Compute matrix multiplication.
#pragma omp parallel for default(none) shared(a,b,c) private(i,j,k)
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      for (k = 0; k < SIZE; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }
  return 0;
}
```

**Example 3: matrix-omp.c source code.**
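Because only the pragma differs between the OpenACC and OpenMP versions, the two can share one source file. The sketch below is an illustration under stated assumptions, not code from the original examples: the function name `matmul_portable` and the flat row-major layout are hypothetical. It relies on the `_OPENACC` and `_OPENMP` macros, which the respective specifications require a compiler to define when that programming model is enabled. When neither is enabled, the loop nest runs sequentially.

```c
/* Hypothetical function (not part of the original examples): computes
   c += a * b for n x n matrices stored row-major in flat arrays. */
void matmul_portable(int n, const float *restrict a,
                     const float *restrict b, float *restrict c)
{
    int i, j, k;

#if defined(_OPENACC)
    /* OpenACC build: the data clauses describe the host/device copies. */
    #pragma acc kernels copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
#elif defined(_OPENMP)
    /* OpenMP build: threads share host memory, so no copies are needed. */
    #pragma omp parallel for default(none) shared(a, b, c, n) private(i, j, k)
#endif
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            for (k = 0; k < n; ++k) {
                c[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
}
```

Building the same file with the OpenACC flag, with the OpenMP flag, or with neither then gives device, multicore, and sequential variants of the same loop nest, which is convenient for the CPU and GPU comparisons mentioned earlier.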

Fortran programmers will find the corresponding source code in Example 4. Again, the OpenACC pragmas annotate data movement with the `copy()` and `copyin()` clauses. Note that the C-based pragmas know the extent of the code block due to the use of curly brackets, while the Fortran version must explicitly specify the end of the scope of the pragma with `!$acc end …`.

```fortran
! matrix-acc.f
      program example1
      parameter ( n_size=1000 )
      real*4, dimension(:,:) :: a(n_size,n_size)
      real*4, dimension(:,:) :: b(n_size,n_size)
      real*4, dimension(:,:) :: c(n_size,n_size)

! Initialize matrices (values differ from C version)
      do i=1, n_size
         do j=1, n_size
            a(i,j) = i + j
            b(i,j) = i - j
            c(i,j) = 0.
         enddo
      enddo

!$acc data copyin(a,b) copy(c)
!$acc kernels loop
! Compute matrix multiplication.
      do i=1, n_size
         do j=1, n_size
            do k = 1, n_size
               c(i,j) = c(i,j) + a(i,k) * b(k,j)
            enddo
         enddo
      enddo
!$acc end data
      end program example1
```

**Example 4: matrix-acc.f source code.**
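For comparison, the same data-region structure can be written in C, where the closing curly bracket plays the role of the Fortran end directive. This is a sketch under stated assumptions rather than code from the original examples: the function name `matmul_data_region` and the flat row-major layout are hypothetical. A compiler without OpenACC support ignores both pragmas and runs the loops on the host.

```c
/* Hypothetical function (not part of the original examples): computes
   c += a * b for n x n matrices stored row-major in flat arrays. */
void matmul_data_region(int n, const float *restrict a,
                        const float *restrict b, float *restrict c)
{
    int i, j, k;

    /* The data region corresponds to "!$acc data ... !$acc end data":
       a and b are copied to the device on entry, and c is copied in
       on entry and back to the host at the closing bracket. */
    #pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
    {
        /* Combined directive, as in the Fortran "!$acc kernels loop". */
        #pragma acc kernels loop
        for (i = 0; i < n; ++i) {
            for (j = 0; j < n; ++j) {
                for (k = 0; k < n; ++k) {
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
                }
            }
        }
    } /* end of the data region */
}
```

Separating the data region from the kernels region matters once several kernels operate on the same arrays, because the data can then stay on the device between them instead of being copied for each one.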