# The OpenACC Execution Model

The following code copies the `Q` matrix onto the GPU, where it is transposed in parallel into the `Qt` array. The cost of the transpose should be minimal compared with the roughly 2MN² operations of the classical Gram-Schmidt algorithm (where M and N are the number of rows and columns of the matrix). Experimentation also determined that changing the `vector_length` to 128 provided a slight performance increase on an NVIDIA C2070.

```c
// Classical Gram-Schmidt Rob Farber

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#ifdef _OPENACC
#include <accelmath.h>
#endif

#ifndef ROWS
#define ROWS 640
#endif
#ifndef COLS
#define COLS 640
#endif

#ifdef _OPENACC

void gramSchmidt(float Q[restrict][COLS], const int rows, const int cols)
{
  float Qt[cols][rows];

#pragma acc data create(Qt[0:cols][0:rows]) copy(Q[0:rows][0:cols])
  {
    // transpose Q into Qt
#pragma acc parallel loop
    for(int i=0; i < rows; i++)
      for(int j=0; j < cols; j++)
        Qt[j][i] = Q[i][j];

    for(int k=0; k < cols; k++) {

#pragma acc parallel
      {
        double tmp = 0.;

#pragma acc loop vector reduction(+:tmp)
        for(int i=0; i < rows; i++) tmp += Qt[k][i] * Qt[k][i];
        tmp = sqrt(tmp);

#pragma acc loop vector
        for(int i=0; i < rows; i++) Qt[k][i] /= tmp;
      }

#pragma acc parallel loop vector_length(128)
      for(int j=k+1; j < cols; j++) {
        double tmp=0.;
        for(int i=0; i < rows; i++) tmp += Qt[k][i] * Qt[j][i];
        for(int i=0; i < rows; i++) Qt[j][i] -= tmp * Qt[k][i];
      }
    }

#pragma acc parallel loop
    for(int i=0; i < rows; i++)
      for(int j=0; j < cols; j++)
        Q[i][j] = Qt[j][i];
  }
}

#else

// OpenMP/serial code for performance testing
void gramSchmidt(float Q[restrict][COLS], const int rows, const int cols)
{
  float Qt[cols][rows];

#pragma omp parallel for
  for(int i=0; i < rows; i++)
    for(int j=0; j < cols; j++)
      Qt[j][i] = Q[i][j];

  for(int k=0; k < cols; k++) {
    double tmp = 0.;
#pragma omp parallel for reduction(+:tmp)
    for(int i=0; i < rows; i++) tmp += Qt[k][i] * Qt[k][i];
    tmp = sqrt(tmp);

#pragma omp parallel for
    for(int i=0; i < rows; i++) Qt[k][i] /= tmp;

#pragma omp parallel for
    for(int j=k+1; j < cols; j++) {
      double tmp=0.;
      for(int i=0; i < rows; i++) tmp += Qt[k][i] * Qt[j][i];
      for(int i=0; i < rows; i++) Qt[j][i] -= tmp * Qt[k][i];
    }
  }

#pragma omp parallel for
  for(int i=0; i < rows; i++)
    for(int j=0; j < cols; j++)
      Q[i][j] = Qt[j][i];
}
#endif

void printOctave(const char* var, float A[][COLS], int rows, int cols)
{
  //  if((rows*cols) > 1000) return;
  printf("%s=[\n", var);
  for(int i=0; i < rows; i++) {
    for(int j=0; j < cols; j++) printf("%c%e", (j==0)?' ':',', A[i][j]);
    printf(";\n");
  }
  printf("];\n");
}

void checkOctave(float A[restrict][COLS], int rows, int cols)
{
  int found_error=0;
  for(int c1=0; c1 < cols; c1++)
    for(int c2=c1; c2 < cols; c2++) {
      double sum=0.;
      for(int i=0; i < rows; i++) sum += A[i][c1] * A[i][c2];
      if(c1 == c2) { // should be near 1 (unit length)
        if(sum < 0.9) {
          printf("Failed unit length: %d %d %g\n", c1,c2,sum);
          found_error = 1;
          exit(1);
        }
      } else { // should be very small (orthogonal)
        if(sum > 0.1) {
          printf("Failed orthogonal: %d %d %g\n", c1,c2,sum);
          found_error = 1;
          exit(1);
        }
      }
    }
  if(!found_error) printf("Check OK!\n");
}

int main(int argc, char *argv[])
{
  int rows=ROWS;
  int cols=COLS;
  float (*A)[COLS] = (float(*)[COLS])malloc(sizeof(float)*rows*cols);

  // fill matrix A with random numbers
  for(int i=0; i < rows; i++)
    for(int j=0; j < cols; j++)
      A[i][j] = (float)rand() / (float)RAND_MAX;
  printf("Done with init!\n");

  //printOctave("A", A, rows, cols);
  double startTime = omp_get_wtime();
  gramSchmidt(A, rows, cols);
  double endTime = omp_get_wtime();
  printf("runtime %d %d matrix %g\n", rows, cols, (endTime-startTime));
#ifndef NO_CHECK
  checkOctave(A, rows, cols);
  //printOctave("Q", A, rows, cols);
#endif
  free(A);
  return 0;
}
```

Example 14: Source for cgs-02.c.

The speedup of the transposed version is noticeable for all versions (serial, OpenMP, and OpenACC), which highlights how optimizations for OpenACC code can potentially benefit other implementations as well.

| 1000x1000        | Time (sec) | Time with Transpose (sec) |
|------------------|------------|---------------------------|
| OpenACC (v2)     | 1.44388    | 0.227207                  |
| OpenACC          | 1.66707    |                           |
| OpenMP (4-cores) | 2.92685    | 0.75971                   |
| Serial           | 9.22192    | 2.39342                   |

Table 3: Improved results from better memory system utilization

The NVIDIA Visual Profiler analysis shows that memory utilization is now more efficient: the load efficiency increased from 3.1% to 40.5%, while the store efficiency stayed at 12.5%.

Figure 6: Visual Profiler analysis confirming better memory utilization.

Per the second rule of high-performance accelerator programming, kernel startup latency appears to be the gating performance limitation on a 1000x1000 matrix, as can be seen in the `nvvp` timeline below. Note the gaps, and that the total utilization of all the kernels is only 38.5%. The two transpositions take up just 0.4% of the time.

Figure 7: Timeline showing inefficiency due to insufficient work.

The following timeline for a 5000x5000 matrix shows that the additional work results in much higher accelerator efficiency. In this case, the total kernel utilization is 92.3%. Even though the matrix is much larger, the two transpose operations still take a minimal amount of time.

Figure 8: Timeline showing better utilization with larger matrices.

The runtime and speedup compared with the OpenMP version for a 5000x5000 matrix are as follows:

| 5000x5000        | Time (sec) |
|------------------|------------|
| OpenACC          | 11.0075    |
| OpenMP (4-cores) | 87.3252    |
| OpenACC speedup  | 7.9x       |

Table 4: Improved results even with the overhead of two matrix transposes.

Eliminating the use of double-precision does reduce the runtime to 10.4265 seconds on a C2070. It will be interesting to see how the upcoming NVIDIA Kepler K10 and K20 chips accelerate these runtimes both with single-precision and hybrid double-precision.

### Conclusion

The OpenACC execution model lets the programmer exploit both the massive parallelism of the accelerator device(s) and the capabilities of the latest generation of sequential processors. In this way, the full potential of the host and accelerator hardware can be exploited. When required, the OpenACC gang, worker, and vector clauses can be utilized to tune an application to a particular hardware configuration.

The OpenACC and OpenMP implementations of the Classical Gram-Schmidt (CGS) method provided in this tutorial demonstrate how easy it is to work with OpenACC parallel regions, even when the amount of work per loop varies. Parallel regions are useful because they let programmers annotate code in a style that is conceptually very similar to OpenMP. Kernel regions, by contrast, let the compiler generate CUDA-style kernels automatically, and give advanced programmers the ability to express legal CUDA kernel launch configurations using portable, directive-based OpenACC syntax. Conversations with PGI indicate that the maximum performance that can be extracted from GPUs through parallel-region annotation has yet to be reached, which bodes well for the future.

Rob Farber is an analyst who writes frequently on High-Performance Computing hardware topics.
