The following code copies the Q matrix onto the GPU where it is transposed in parallel into the Qt array. The cost of the transpose should be minimal compared with the 2MN2 operations (where M and N specify the number of columns and rows of the matrix) of the classic Gram-Schmidt algorithm. Experimentation also determined that changing the vector_length to 128 provided a slight performance increase on an NVIDIA C2070.
// Classical Gram-Schmidt Rob Farber
#include <stdio.h>
#include <stdlib.h>
#include <accelmath.h>
#include <omp.h>
#ifndef ROWS
#define ROWS 640
#endif
#ifndef COLS
#define COLS 640
#endif
#ifdef _OPENACC
void gramSchmidt(restrict float Q[][COLS], const int rows, const int cols)
{
float Qt[cols][rows];
#pragma acc data create(Qt[cols][rows]) copy(Q[0:rows][0:cols])
{
//transpose Q in Qt
#pragma acc parallel loop
for(int i=0; i < rows; i++)
for(int j=0; j < cols; j++)
Qt[j][i] = Q[i][j];
for(int k=0; k < cols; k++) {
#pragma acc parallel
{
double tmp = 0.;
#pragma acc loop vector reduction(+:tmp)
for(int i=0; i < rows; i++) tmp += (Qt[k][i] * Qt[k][i]);
tmp = sqrt(tmp);
#pragma acc loop vector
for(int i=0; i < rows; i++) Qt[k][i] /= tmp;
}
#pragma acc parallel loop vector_length(128)
for(int j=k+1; j < cols; j++) {
double tmp=0.;
for(int i=0; i < rows; i++) tmp += Qt[k][i] * Qt[j][i];
for(int i=0; i < rows; i++) Qt[j][i] -= tmp * Qt[k][i];
}
}
#pragma acc parallel loop
for(int i=0; i < rows; i++)
for(int j=0; j < cols; j++)
Q[i][j] = Qt[j][i];
}
}
#else
// OMP/serial code for performance testing
void gramSchmidt(restrict float Q[][COLS], const int rows, const int cols)
{
float Qt[cols][rows];
#pragma omp parallel for
for(int i=0; i < rows; i++)
for(int j=0; j < cols; j++)
Qt[j][i] = Q[i][j];
for(int k=0; k < cols; k++) {
double tmp = 0.;
#pragma omp parallel for reduction(+:tmp)
for(int i=0; i < rows; i++) tmp += (Qt[k][i] * Qt[k][i]);
tmp = sqrt(tmp);
#pragma omp parallel for
for(int i=0; i < rows; i++) Qt[k][i] /= tmp;
#pragma omp parallel for reduction(+:tmp)
for(int j=k+1; j < cols; j++) {
double tmp=0.;
for(int i=0; i < rows; i++) tmp += Qt[k][i] * Qt[j][i];
for(int i=0; i < rows; i++) Qt[j][i] -= tmp * Qt[k][i];
}
}
#pragma omp parallel for
for(int i=0; i < rows; i++)
for(int j=0; j < cols; j++)
Q[i][j] = Qt[j][i];
}
#endif
void printOctave(char* var, float A[][COLS], int rows, int cols)
{
// if((rows*cols) > 1000) return;
printf("%s=[\n",var);
for(int i=0; i < rows; i++) {
for(int j=0; j < cols; j++) printf("%c%e", (j==0)?' ':',', A[i][j]);
printf(";");
}
printf("];\n");
}
void checkOctave(restrict float A[][COLS], int rows, int cols)
{
int found_error=0;
for(int c1=0; c1 < cols; c1++)
for(int c2=c1; c2 < cols; c2++) {
double sum=0.;
for(int i=0; i < rows; i++) sum += A[i][c1] * A[i][c2];
if(c1 == c2) { // should be near 1 (unit length)
if(sum < 0.9) {
printf("Failed unit length: %d %d %g\n", c1,c2,sum);
found_error = 1;
exit(1);
}
} else { // should be very small (orthogonal)
if(sum > 0.1) {
printf("Failed orthonogonal %d %d %g\n", c1,c2,sum);
found_error = 1;
exit(1);
}
}
}
if(!found_error) printf("Check OK!\n");
}
int main(int argc, char *argv[])
{
int rows=ROWS;
int cols=COLS;
float (*A)[COLS] = (float(*)[COLS])malloc(sizeof(float)*rows*cols);
// fill matrix A with random numbers
for(int i=0; i < rows; i++)
for(int j=0; j < cols; j++)
A[i][j] = (float)rand()/ (float)RAND_MAX;
printf("Done with init!\n");
//printOctave("A", A, rows, cols);
double startTime = omp_get_wtime();
gramSchmidt(A,rows, cols);
double endTime = omp_get_wtime();
printf("runtime %d %d matrix %g\n",rows, cols, (endTime-startTime));
#ifndef NO_CHECK
checkOctave(A, rows, cols);
//printOctave("Q", A, rows, cols);
#endif
free(A);
}
Example 14: Source for cgs-02.c.
The speedup of the transposed version is noticeable for all versions (serial, OpenMP, and OpenACC), which highlights how optimizations for OpenACC code can potentially benefit other implementations as well.
| 1000x1000 | Time (sec) | Time with Transpose (sec) |
| OpenACC (v2) | 1.44388 | 0.227207 |
| OpenACC | 1.66707 | |
| OpenMP (4-cores) | 2.92685 | 0.75971 |
| Serial | 9.22192 | 2.39342 |
Table 3: Improved results from better memory system utilization
The NVIDIA Visual Profiler analysis shows that the memory utilization is more efficient as it increased from 3.1% load/12.5% store to 40.5% load/12.5% store.
Figure 6: Visual Profiler analysis confirming better memory utilization.
Per the second rule of high performance accelerator programming, the kernel startup latency appears to be the gating performance limitation on a 1000x1000 matrix as can be seen in the nvvp timeline below. Note the gaps and that the total utilization of all the kernels is 38.5%. The two transpositions only take up 0.4% of the time.
Figure 7: Timeline showing inefficiency due to insufficient work.
The following timeline for a 5000x5000 matrix shows that the additional work results in much higher accelerator efficiency. In this case, the total kernel utilization is 92.3%. Even though the matrix is much larger, the two transpose operations still take a minimal amount of time.
Figure 8: Timeline showing better utilization with larger matrices.
The runtime and speedup compared with the OpenMP version for a 5000x5000 matrix is as follows:
| 5000x5000 | Time (sec) |
| OpenACC | 11.0075 |
| OpenMP (4-cores) | 87.3252 |
| OpenACC speedup | 7.9x |
Table 4: Improved results even with the overhead of two matrix transposes.
Eliminating the use of double-precision does reduce the runtime to 10.4265 seconds on a C2070. It will be interesting to see how the upcoming NVIDIA Kepler K10 and K20 chips accelerate these runtimes both with single-precision and hybrid double-precision.
Conclusion
The OpenACC execution model lets the programmer exploit both the massive parallelism of the accelerator device(s) and the capabilities of the latest generation of sequential processors. In this way, the full potential of the host and accelerator hardware can be exploited. When required, the OpenACC gang, worker, and vector clauses can be utilized to tune an application to a particular hardware configuration.
OpenACC and OpenMP implementations of the Classic Gram-Schmidt (CGS) method provided in this tutorial demonstrated how easy it is to work with OpenACC parallel regions, even when the amount of work per loop can vary. Parallel regions are useful because they let programmers annotate code in a style that is conceptually very similar to OpenMP. Kernel regions allow the compiler to automatically generate CUDA-style kernels, which gives advanced programmers the ability to express legal CUDA kernel launch configurations using portable directive-based OpenACC syntax. Conversations with PGI indicate that the ability to extract maximum performance from GPUs using parallel region annotation is still to be demonstrated, which bodes well for the future.
Rob Farber is an analyst who writes frequently on High-Performance Computing hardware topics.


