CUDA, Supercomputing for the Masses: Part 17


In CUDA, Supercomputing for the Masses: Part 16 of this article series, I discussed the CUDA 3.0 release, a major-revision release that adds enhancements valuable to all CUDA developers, making day-to-day development tasks easier, less error prone, and more consistent.

As mentioned in the previous article, expanded, consistent coverage appears to have been the thrust behind this release, as it fills in several previous gaps and adds must-have capabilities. In a nutshell, this article discusses runtime and driver API compatibility, the new graphics interoperability API, C++ inheritance, and expanded functionality in CUBLAS and CUFFT. Examples are provided that demonstrate:

  • Consistency and interoperability between the CUDA runtime and driver codes through two simple examples that call a runtime kernel and a CUBLAS routine.
  • A C++ example that:
    • Uses both versions of the OpenGL interoperability APIs in drawing a Mandelbrot set. The C++ source code concisely recreates the simplePBO example from Part 15 of this article series using C++ classes.
    • Demonstrates C++ inheritance by deriving a new class from our C++ Mandelbrot example that uses programmable shaders created with Cg. (See the extensive NVIDIA Cg homepage for more information.) While Cg compatibility is not new with the CUDA 3.0 release, mixing Cg shaders with CUDA can open up a vast collection of Cg libraries and existing software!

Be aware that the latest NVIDIA driver must be installed to use the CUDA 3.0 toolkit. As always, the latest released driver can be downloaded from CUDA ZONE and installed for a number of systems. Beta drivers and software can be downloaded from nvdevelopers but registration is required. Ubuntu users might wish to follow one of the many available guides, such as the one at Web Upd8, to see how to install the latest released or beta drivers via the Ubuntu tools.

Mixing CUDA Runtime and Driver API Codes

Previous articles in this series have focused on teaching CUDA with the runtime API (i.e., those methods that start with "cuda" as opposed to "cu") because it is fairly intuitive and not too verbose. Many developers prefer the driver API because it offers more control and lets them make better use of existing code bases. Now programmers can utilize the best characteristics of both APIs.

The following is the source code for a driver mode CUDA program that calls a kernel via the runtime API. Please put this into a file called vectorAddDrv.cu:


/*
 * Driver API code that calls a runtime kernel
 * Vector addition: C = A + B.
 */

// Includes
#include <stdio.h>
#include <stdlib.h>       // exit(), rand()
#include <math.h>         // fabs()
#include <cuda.h>
#include <cutil_inline.h>

// Variables
CUdevice cuDevice;
CUcontext cuContext;
CUmodule cuModule;
CUfunction vecAdd;
float* h_A;
float* h_B;
float* h_C;
CUdeviceptr d_A;
CUdeviceptr d_B;
CUdeviceptr d_C;

// Functions

__global__ void kernel(float* d_a, float* d_b, float* d_c, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  
  if(idx < n)
    d_c[idx] = d_a[idx] + d_b[idx];
}

// Allocates an array with random float entries.
void RandomInit(float* data, int n)
{
  for (int i = 0; i < n; ++i)
    data[i] = rand() / (float)RAND_MAX;
}

void errorExit()
{
  printf("Error exit!\n");
  exit(1);
}

// Host code
int main(int argc, char** argv)
{
  int N = 50000;
  unsigned int size = N * sizeof(float);
  CUresult error;
  
  printf("Vector Addition (Driver API)\n");
  // Initialize
  error = cuInit(0);
  if (error != CUDA_SUCCESS) errorExit();
  
  // Get number of devices supporting CUDA
  int deviceCount = 0;
  error = cuDeviceGetCount(&deviceCount);
  if (error != CUDA_SUCCESS) errorExit();
  if (deviceCount == 0) {
    printf("There is no device supporting CUDA.\n");
    exit(1);
  }
  
  // Get handle for device 0
  error = cuDeviceGet(&cuDevice, 0);
  if (error != CUDA_SUCCESS) errorExit();
  
  // Create context
  error = cuCtxCreate(&cuContext, 0, cuDevice);
  if (error != CUDA_SUCCESS) errorExit();
  
  // Allocate input vectors h_A and h_B in host memory
  h_A = (float*)malloc(size);
  if (h_A == 0) errorExit();
  h_B = (float*)malloc(size);
  if (h_B == 0) errorExit();
  h_C = (float*)malloc(size);
  if (h_C == 0) errorExit();
  
  // Initialize input vectors
  RandomInit(h_A, N);
  RandomInit(h_B, N);
  
  // Allocate vectors in device memory
  error = cuMemAlloc(&d_A, size);
  if (error != CUDA_SUCCESS) errorExit();
  error = cuMemAlloc(&d_B, size);
  if (error != CUDA_SUCCESS) errorExit();
  error = cuMemAlloc(&d_C, size);
  if (error != CUDA_SUCCESS) errorExit();
  
  // Copy vectors from host memory to device memory
  error = cuMemcpyHtoD(d_A, h_A, size);
  if (error != CUDA_SUCCESS) errorExit();
  error = cuMemcpyHtoD(d_B, h_B, size);
  if (error != CUDA_SUCCESS) errorExit();
  
  // Invoke kernel (Runtime API)
  int nThreadsPerBlk=128;
  int nBlks = (N/nThreadsPerBlk) + (((N%nThreadsPerBlk)>0)?1:0);
  kernel<<<nBlks,nThreadsPerBlk>>>((float*)d_A,(float*) d_B,(float*) d_C, N);
  
  // Copy result from device memory to host memory
  // h_C contains the result in host memory
  error = cuMemcpyDtoH(h_C, d_C, size);
  if (error != CUDA_SUCCESS) errorExit();
  
  // Verify result
  int i;
  for (i = 0; i < N; ++i) {
    float sum = h_A[i] + h_B[i];
    if (fabs(h_C[i] - sum) > 1e-7f) {
      printf("Mistake index %d %g %g\n",i,h_C[i],sum);
      break;
    }
  }
  printf("Test %s \n", (i == N) ? "PASSED" : "FAILED");
  return(0);
}

A detailed discussion of the driver API is beyond the scope of this article, as I'm focusing on interoperability and the 3.0 release. Even so, much of this code should look familiar to runtime API developers, as only GPU setup, memory allocation, and data movement are involved. Many of these calls are similar to their runtime API counterparts, so the source code should be easy to follow. The specific point made with this example is that the following runtime CUDA call to kernel() works in the 3.0 release:


  kernel<<<nBlks,nThreadsPerBlk>>>((float*)d_A,(float*) d_B,(float*) d_C, N);

Use nvcc to build the executable with debugging enabled as shown below for Linux:


SDK_PATH=$HOME/NVIDIA_GPU_Computing_SDK/C
INC=$SDK_PATH/common/inc
LIB=$SDK_PATH/lib
nvcc -g -G -I$INC -L$LIB vectorAddDrv.cu -lcutil_x86_64 -lcuda -o vectorAddDrv

Now start the program with the command cuda-gdb:


$ cuda-gdb vectorAddDrv
NVIDIA (R) CUDA Debugger
BETA release
Portions Copyright (C) 2008,2009 NVIDIA Corporation
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
Using host libthread_db library "/lib/libthread_db.so.1".	

Enable memory checking and run the program with the following two commands:


(cuda-gdb) set cuda memcheck on
(cuda-gdb) r

The following illustrative output shows that the code works and apparently used memory correctly. Further, the results passed a validation test on the host, as indicated by the "Test PASSED" message.


Starting program: PATH_TO_EXECUTABLE/vectorAddDrv 
[Thread debugging using libthread_db enabled]
[New process 31619]
Vector Addition (Driver API)
[New Thread 139794699740944 (LWP 31619)]
Test PASSED 

Program exited normally.

Now set a breakpoint at line 109, which is just after the result vector is copied back to the host, and run vectorAddDrv again:


(cuda-gdb) b 109
Breakpoint 1 at 0x40183e: file vectorAdd.cu, line 109.

(cuda-gdb) r
Starting program: /home/rmfarber/DDJ/Part17/vecAdd/vectorAdd 
[Thread debugging using libthread_db enabled]
[New process 31623]
Vector Addition (Driver API)
[New Thread 140441416316688 (LWP 31623)]
[Switching to Thread 140441416316688 (LWP 31623)]

Breakpoint 1, main (argc=1, argv=0x7fff40b55248) at vectorAdd.cu:109
109	  if (error != CUDA_SUCCESS) errorExit();

Use the CUDA-GDB set command to make index three of the h_C vector on the host 1000 times larger:


(cuda-gdb) p h_C[3]
$1 = 1.30907393
(cuda-gdb) set h_C[3] = 1000. * h_C[3]

The program is then allowed to continue. The resulting output shows that the comparison code does find the error caused by our manually modifying the h_C vector:


(cuda-gdb) c
Continuing.
Mistake index 3 1309.07 1.30907
Test FAILED 

Program exited normally.
(cuda-gdb) quit

This example demonstrates that CUDA-GDB in the 3.0 release works with driver API programs. It also shows how straightforward it now is to mix driver and runtime API codes.

Similarly, the following example, blasAddDrv.cu, demonstrates that with CUDA 3.0 it is now possible to call CUBLAS library routines from driver API code. In this case, the previous vectorAddDrv.cu example was adapted to call the SAXPY routine:


  cublasSaxpy(N, 1.0f,(float*) d_A, 1,(float*) d_B, 1);

The following is the complete source code for blasAddDrv.cu. Again, I won't discuss the details of the driver API. See the NVIDIA documentation for more information.


/*
 * Vector addition using SAXPY from a driver API example
 */

// Includes
#include <stdio.h>
#include <stdlib.h>       // exit(), rand()
#include <math.h>         // fabs()
#include <cuda.h>
#include <cutil_inline.h>
#include <cublas.h>


// Variables
CUdevice cuDevice;
CUcontext cuContext;
CUmodule cuModule;
CUfunction vecAdd;
float* h_A;
float* h_B;
float* h_C;
CUdeviceptr d_A;
CUdeviceptr d_B;

// Functions

// Allocates an array with random float entries.
void RandomInit(float* data, int n)
{
  for (int i = 0; i < n; ++i)
    data[i] = rand() / (float)RAND_MAX;
}

void errorExit()
{
  printf("Error exit!\n");
  exit(1);
}

// Host code
int main(int argc, char** argv)
{
  int N = 50000;
  unsigned int size = N * sizeof(float);
  CUresult error;
  cublasStatus status;
  
  printf("Vector Addition (BLAS and Driver APIs)\n");
  // Initialize
  error = cuInit(0);
  if (error != CUDA_SUCCESS) errorExit();
  
  // Get number of devices supporting CUDA
  int deviceCount = 0;
  error = cuDeviceGetCount(&deviceCount);
  if (error != CUDA_SUCCESS) errorExit();
  if (deviceCount == 0) {
    printf("There is no device supporting CUDA.\n");
    exit(1);
  }
  
  // Get handle for device 0
  error = cuDeviceGet(&cuDevice, 0);
  if (error != CUDA_SUCCESS) errorExit();
  
  // Create context
  error = cuCtxCreate(&cuContext, 0, cuDevice);
  if (error != CUDA_SUCCESS) errorExit();
  
  // Allocate input vectors h_A and h_B in host memory
  h_A = (float*)malloc(size);
  if (h_A == 0) errorExit();
  h_B = (float*)malloc(size);
  if (h_B == 0) errorExit();
  h_C = (float*)malloc(size);
  if (h_C == 0) errorExit();
  
  // Initialize input vectors
  RandomInit(h_A, N);
  RandomInit(h_B, N);
  
  // Allocate vectors in device memory
  error = cuMemAlloc(&d_A, size);
  if (error != CUDA_SUCCESS) errorExit();
  error = cuMemAlloc(&d_B, size);
  if (error != CUDA_SUCCESS) errorExit();
  
  // Copy vectors from host memory to device memory
  error = cuMemcpyHtoD(d_A, h_A, size);
  if (error != CUDA_SUCCESS) errorExit();
  error = cuMemcpyHtoD(d_B, h_B, size);
  if (error != CUDA_SUCCESS) errorExit();
  
  // Invoke kernel
  status = cublasInit();
  if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf (stderr, "!!!! CUBLAS initialization error\n");
    return EXIT_FAILURE;
  }
  
  // Call CUBLAS from a driver API code
  cublasSaxpy(N, 1.0f,(float*) d_A, 1,(float*) d_B, 1);
  
  // Copy result from device memory to host memory
  // h_C contains the result in host memory
  error = cuMemcpyDtoH(h_C, d_B, size);
  if (error != CUDA_SUCCESS) errorExit();
  
  // Verify result
  int i;
  for (i = 0; i < N; ++i) {
    float sum = h_A[i] + h_B[i];
    if (fabs(h_C[i] - sum) > 1e-7f)
      break;
  }
  printf("Test %s \n", (i == N) ? "PASSED" : "FAILED");
  return(0);
}

Use nvcc to build the executable. The following simple script builds it on Linux. Note that this program requires the CUBLAS library; linking is specified with the -lcublas command-line option.


SDK_PATH=$HOME/NVIDIA_GPU_Computing_SDK/C
INC=$SDK_PATH/common/inc
LIB=$SDK_PATH/lib
nvcc -I$INC -L$LIB blasAddDrv.cu -lcutil_x86_64 -lcuda -lcublas -o blasAddDrv

Running the program demonstrates that it does indeed work correctly as indicated by the "Test PASSED" message:


$ ./blasAddDrv
Vector Addition (BLAS and Driver APIs)
Test PASSED

Unquestionably, the ability to mix and debug driver and runtime API codes and libraries is valuable. For many, this expanded capability alone will make the CUDA 3.0 release an obvious download choice. As the CUDA library and code base expands, this transparent interoperability will continue to pay dividends in ease of use -- although it is likely that in the future most developers will use this capability without thought for the ease with which it occurs.

