The OpenACC Execution Model


The following code copies the Q matrix onto the GPU, where it is transposed in parallel into the Qt array. The cost of the transpose should be minimal compared with the 2MN² operations (where M and N are the number of columns and rows of the matrix) performed by the classical Gram-Schmidt algorithm. Experimentation also determined that setting vector_length to 128 provided a slight performance increase on an NVIDIA C2070.

// Classical Gram-Schmidt Rob Farber

#include <stdio.h>
#include <stdlib.h>
#include <math.h>        // sqrt() for the host/OpenMP path
#ifdef _OPENACC
#include <accelmath.h>   // device math routines (PGI)
#endif
#include <omp.h>

#ifndef ROWS
#define ROWS 640
#endif
#ifndef COLS
#define COLS 640
#endif

#ifdef _OPENACC

void gramSchmidt(float Q[restrict][COLS], const int rows, const int cols) 
{
  float Qt[cols][rows];
  
#pragma acc data create(Qt[0:cols][0:rows]) copy(Q[0:rows][0:cols])
  {
    //transpose Q in Qt
#pragma acc parallel loop
    for(int i=0; i < rows; i++) 
      for(int j=0; j < cols; j++) 
	Qt[j][i] = Q[i][j];
    
    for(int k=0; k < cols; k++) {
      
#pragma acc parallel
      {
	double tmp = 0.;
	
#pragma acc loop vector reduction(+:tmp)  
	for(int i=0; i < rows; i++) tmp +=  (Qt[k][i] * Qt[k][i]);
	tmp = sqrt(tmp);
	
#pragma acc loop vector 
	for(int i=0; i < rows; i++) Qt[k][i] /= tmp;
      }
      
#pragma acc parallel loop vector_length(128)
      for(int j=k+1; j < cols; j++) {
	double tmp=0.;
	for(int i=0; i < rows; i++) tmp += Qt[k][i] * Qt[j][i];
	for(int i=0; i < rows; i++) Qt[j][i] -= tmp * Qt[k][i];
      }
    }

#pragma acc parallel loop
    for(int i=0; i < rows; i++) 
      for(int j=0; j < cols; j++) 
	Q[i][j] = Qt[j][i];
  }
}

#else

// OMP/serial code for performance testing
void gramSchmidt(float Q[restrict][COLS], const int rows, const int cols) 
{
    float Qt[cols][rows];

#pragma omp parallel for
    for(int i=0; i < rows; i++) 
      for(int j=0; j < cols; j++) 
	Qt[j][i] = Q[i][j];
    
    for(int k=0; k < cols; k++) {
      double tmp = 0.;
#pragma omp parallel for reduction(+:tmp)
      for(int i=0; i < rows; i++) tmp +=  (Qt[k][i] * Qt[k][i]);
      tmp = sqrt(tmp);
      
#pragma omp parallel for
      for(int i=0; i < rows; i++) Qt[k][i] /= tmp;
      
#pragma omp parallel for // each iteration declares its own local tmp; no reduction needed
      for(int j=k+1; j < cols; j++) {
	double tmp=0.;
	for(int i=0; i < rows; i++) tmp += Qt[k][i] * Qt[j][i];
	for(int i=0; i < rows; i++) Qt[j][i] -= tmp * Qt[k][i];
      }
    }
    
#pragma omp parallel for
    for(int i=0; i < rows; i++) 
      for(int j=0; j < cols; j++) 
	Q[i][j] = Qt[j][i];
}
#endif

void printOctave(char* var, float A[][COLS], int rows, int cols)
{
  //  if((rows*cols) > 1000) return;
  printf("%s=[\n",var);
  for(int i=0; i < rows; i++) { 
    for(int j=0; j < cols; j++) printf("%c%e", (j==0)?' ':',', A[i][j]);
    printf(";"); 
  }
  printf("];\n");
}
void checkOctave(float A[restrict][COLS], int rows, int cols)
{
  int found_error=0;
  for(int c1=0; c1 < cols; c1++)
    for(int c2=c1; c2 < cols; c2++) {
      double sum=0.;
      for(int i=0; i < rows; i++) sum += A[i][c1] * A[i][c2];
      if(c1 == c2) { // should be near 1 (unit length)
	if(sum < 0.9) {
	  printf("Failed unit length: %d %d %g\n", c1,c2,sum);
	  found_error = 1;
	  exit(1);
	}
      } else { // should be very small (orthogonal)
	if(sum > 0.1) {
	  printf("Failed orthonogonal  %d %d %g\n", c1,c2,sum);
	  found_error = 1;
	  exit(1);
	}
      }
  }
  if(!found_error) printf("Check OK!\n");
}

int main(int argc, char *argv[])
{
  int rows=ROWS;
  int cols=COLS;
  float (*A)[COLS] = (float(*)[COLS])malloc(sizeof(float)*rows*cols);

  // fill matrix A with random numbers
  for(int i=0; i < rows; i++)
    for(int j=0; j < cols; j++)
      A[i][j] = (float)rand()/ (float)RAND_MAX;
  printf("Done with init!\n");

  //printOctave("A", A, rows, cols);
  double startTime = omp_get_wtime();
  gramSchmidt(A,rows, cols);
  double endTime = omp_get_wtime();
  printf("runtime %d %d matrix %g\n",rows, cols, (endTime-startTime));
#ifndef NO_CHECK
  checkOctave(A, rows, cols);
  //printOctave("Q", A, rows, cols);
#endif
  free(A);
}

Example 14: Source for cgs-02.c.
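
The source can be built with the PGI compiler used throughout this tutorial. The lines below are a plausible sketch rather than the article's exact commands, and flags vary by compiler version; note that -mp is kept even in the OpenACC build because main() times the run with omp_get_wtime().

pgcc -acc -mp -ta=nvidia -Minfo=accel -fast cgs-02.c -o cgs-02_acc
pgcc -mp -fast cgs-02.c -o cgs-02_omp

The ROWS and COLS macros default to 640 and can be overridden at compile time, for example with -DROWS=5000 -DCOLS=5000 for the larger test case; defining NO_CHECK skips the orthogonality check.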

The speedup from the transposed version is noticeable across all implementations (serial, OpenMP, and OpenACC), which highlights how optimizations made for OpenACC code can benefit the other implementations as well.

1000x1000           Time (sec)   Time with Transpose (sec)
OpenACC (v2)        1.44388      0.227207
OpenACC             1.66707
OpenMP (4-cores)    2.92685      0.75971
Serial              9.22192      2.39342

Table 3: Improved results from better memory system utilization

The NVIDIA Visual Profiler analysis shows that memory utilization is more efficient: load efficiency increased from 3.1% to 40.5%, while store efficiency remained at 12.5%.

Figure 6: Visual Profiler analysis confirming better memory utilization.

Per the second rule of high-performance accelerator programming, kernel startup latency appears to be the gating performance limitation on a 1000x1000 matrix, as can be seen in the nvvp timeline below. Note the gaps, and that the total utilization of all the kernels is only 38.5%. The two transpositions take up only 0.4% of the time.

Figure 7: Timeline showing inefficiency due to insufficient work.
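
A rough back-of-the-envelope estimate shows why (the per-launch cost here is an assumed, typical figure, not a measurement from the article): the k loop issues roughly two kernels per column, so a 1000-column matrix launches on the order of 2,000 kernels. At an assumed launch latency of about 10 microseconds per kernel, that is roughly 20 ms of pure launch overhead, a meaningful fraction of the 0.227-second runtime reported in Table 3.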

The following timeline for a 5000x5000 matrix shows that the additional work results in much higher accelerator efficiency. In this case, the total kernel utilization is 92.3%. Even though the matrix is much larger, the two transpose operations still take a minimal amount of time.

Figure 8: Timeline showing better utilization with larger matrices.

The runtime and speedup compared with the OpenMP version for a 5000x5000 matrix are as follows:

5000x5000           Time (sec)
OpenACC             11.0075
OpenMP (4-cores)    87.3252
OpenACC speedup     7.9x

Table 4: Improved results even with the overhead of two matrix transposes.

Eliminating the use of double precision reduces the runtime further, to 10.4265 seconds on a C2070. It will be interesting to see how the upcoming NVIDIA Kepler K10 and K20 chips accelerate these runtimes, both with single precision and with hybrid double precision.
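
For reference, the single-precision change amounts to switching the accumulators and the square root. The fragment below is a minimal sketch of the normalization step under that assumption; it is not the exact code timed above.

#pragma acc parallel
    {
      float tmp = 0.f;  // single-precision accumulator (sketch; the double
                        // accumulator is safer for ill-conditioned matrices)
#pragma acc loop vector reduction(+:tmp)
      for(int i=0; i < rows; i++) tmp += Qt[k][i] * Qt[k][i];
      tmp = sqrtf(tmp); // single-precision square root
#pragma acc loop vector
      for(int i=0; i < rows; i++) Qt[k][i] /= tmp;
    }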

Conclusion

The OpenACC execution model lets the programmer exploit both the massive parallelism of the accelerator device(s) and the capabilities of the latest generation of sequential processors. In this way, the full potential of the host and accelerator hardware can be exploited. When required, the OpenACC gang, worker, and vector clauses can be utilized to tune an application to a particular hardware configuration.
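
As an illustration of those clauses (the gang count below is a hypothetical placeholder, not a value tuned in this article), the projection loop could be annotated explicitly:

/* Hypothetical tuning sketch: gangs across columns, vectors across rows.
   num_gangs(64) is a placeholder to show the syntax; the article only
   reports that vector_length(128) helped on a C2070. */
#pragma acc parallel loop gang num_gangs(64) vector_length(128)
    for(int j=k+1; j < cols; j++) {
      double tmp=0.;
#pragma acc loop vector reduction(+:tmp)
      for(int i=0; i < rows; i++) tmp += Qt[k][i] * Qt[j][i];
#pragma acc loop vector
      for(int i=0; i < rows; i++) Qt[j][i] -= tmp * Qt[k][i];
    }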

The OpenACC and OpenMP implementations of the classical Gram-Schmidt (CGS) method provided in this tutorial demonstrate how easy it is to work with OpenACC parallel regions, even when the amount of work per loop varies. Parallel regions are useful because they let programmers annotate code in a style that is conceptually very similar to OpenMP. Kernels regions allow the compiler to generate CUDA-style kernels automatically, which gives advanced programmers the ability to express legal CUDA kernel launch configurations using portable, directive-based OpenACC syntax. Conversations with PGI indicate that the ability to extract maximum performance from GPUs through parallel-region annotation has yet to be fully demonstrated; in other words, there is still performance headroom to be claimed, which bodes well for the future.
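
For comparison, a kernels region leaves the launch configuration entirely to the compiler. The fragment below is a sketch of the transpose written that way; it assumes it runs inside the existing data region and is not taken from the article's source.

#pragma acc kernels present(Q, Qt)
    {
#pragma acc loop independent
      for(int i=0; i < rows; i++)
#pragma acc loop independent
        for(int j=0; j < cols; j++)
          Qt[j][i] = Q[i][j];  // compiler chooses the launch configuration
    }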


Rob Farber is an analyst who writes frequently on High-Performance Computing hardware topics.

