In CUDA, Supercomputing for the Masses: Part 16 of this article series, I discussed the CUDA 3.0 release. CUDA 3.0 is a major revision that adds enhancements valuable to all CUDA developers, making day-to-day development tasks easier, less error prone, and more consistent.
As mentioned in the previous article, expanded and more consistent coverage appears to have been the thrust behind this major release, as it fills in several previous gaps with must-have capabilities. In a nutshell, this article discusses runtime and driver API compatibility, the new graphics interoperability API, C++ inheritance, and expanded functionality in CUBLAS and CUFFT. Examples are provided that demonstrate:
- Consistency and interoperability between the CUDA runtime and driver codes through two simple examples that call a runtime kernel and a CUBLAS routine.
- A C++ example that:
- Uses both versions of the OpenGL interoperability APIs in drawing a Mandelbrot set. The C++ source code concisely recreates the simplePBO example from Part 15 of this article series using C++ classes.
- Demonstrates C++ inheritance by deriving a new class from our C++ Mandelbrot example that uses programmable shaders created with Cg. (See the extensive NVIDIA Cg homepage for more information.) While Cg compatibility is not new with the CUDA 3.0 release, mixing Cg shaders with CUDA opens up a vast collection of Cg libraries and existing software!
Be aware that the latest NVIDIA driver must be installed to use the CUDA 3.0 toolkit. As always, the latest released driver can be downloaded from CUDA ZONE and installed on a number of systems. Beta drivers and software can be downloaded from nvdevelopers, although registration is required. Ubuntu users might wish to follow one of the many available guides, such as the one at Web Upd8, to see how to install the latest released or beta drivers via the Ubuntu tools.
Mixing CUDA Runtime and Driver API Codes
Previous articles in this series have focused on teaching CUDA with the runtime API (i.e., those methods that start with "cuda" as opposed to "cu") because it is fairly intuitive and not too verbose. Many developers prefer the driver API because it offers more control and can make better use of existing code bases. Now programmers can utilize the best characteristics of both APIs.
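To make the naming distinction concrete, here is a minimal sketch (my own illustration, not from the SDK) of the same device allocation performed through each API. The function and variable names such as compareAPIs and d_buf are hypothetical:

#include <cuda.h>          // Driver API ("cu" prefix)
#include <cuda_runtime.h>  // Runtime API ("cuda" prefix)

// Hypothetical illustration: the same 1 MB device allocation via each API.
void compareAPIs()
{
    unsigned int nBytes = 1 << 20;

    // Runtime API: typed pointers; the context is created implicitly.
    float* d_buf;
    cudaMalloc((void**)&d_buf, nBytes);
    cudaFree(d_buf);

    // Driver API: explicit initialization, opaque handles, more control.
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr d_ptr;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&d_ptr, nBytes);
    cuMemFree(d_ptr);
    cuCtxDestroy(ctx);
}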
The following is the source code for a driver API CUDA program that calls a kernel using runtime API launch syntax. Put this into a file called vectorAddDrv.cu:
/*
 * Driver API code that calls a runtime kernel
 * Vector addition: C = A + B.
 */

// Includes
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda.h>
#include <cutil_inline.h>

// Variables
CUdevice cuDevice;
CUcontext cuContext;
CUmodule cuModule;
CUfunction vecAdd;
float* h_A;
float* h_B;
float* h_C;
CUdeviceptr d_A;
CUdeviceptr d_B;
CUdeviceptr d_C;

// Functions
__global__ void kernel(float* d_a, float* d_b, float* d_c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        d_c[idx] = d_a[idx] + d_b[idx];
}

// Allocates an array with random float entries.
void RandomInit(float* data, int n)
{
    for (int i = 0; i < n; ++i)
        data[i] = rand() / (float)RAND_MAX;
}

void errorExit()
{
    printf("Error exit!\n");
    exit(1);
}

// Host code
int main(int argc, char** argv)
{
    int N = 50000;
    unsigned int size = N * sizeof(float);
    CUresult error;

    printf("Vector Addition (Driver API)\n");

    // Initialize the driver API
    error = cuInit(0);
    if (error != CUDA_SUCCESS) errorExit();

    // Get number of devices supporting CUDA
    int deviceCount = 0;
    error = cuDeviceGetCount(&deviceCount);
    if (error != CUDA_SUCCESS) errorExit();
    if (deviceCount == 0) {
        printf("There is no device supporting CUDA.\n");
        exit(1);
    }

    // Get handle for device 0
    error = cuDeviceGet(&cuDevice, 0);
    if (error != CUDA_SUCCESS) errorExit();

    // Create context
    error = cuCtxCreate(&cuContext, 0, cuDevice);
    if (error != CUDA_SUCCESS) errorExit();

    // Allocate input vectors h_A and h_B in host memory
    h_A = (float*)malloc(size);
    if (h_A == 0) errorExit();
    h_B = (float*)malloc(size);
    if (h_B == 0) errorExit();
    h_C = (float*)malloc(size);
    if (h_C == 0) errorExit();

    // Initialize input vectors
    RandomInit(h_A, N);
    RandomInit(h_B, N);

    // Allocate vectors in device memory
    error = cuMemAlloc(&d_A, size);
    if (error != CUDA_SUCCESS) errorExit();
    error = cuMemAlloc(&d_B, size);
    if (error != CUDA_SUCCESS) errorExit();
    error = cuMemAlloc(&d_C, size);
    if (error != CUDA_SUCCESS) errorExit();

    // Copy vectors from host memory to device memory
    error = cuMemcpyHtoD(d_A, h_A, size);
    if (error != CUDA_SUCCESS) errorExit();
    error = cuMemcpyHtoD(d_B, h_B, size);
    if (error != CUDA_SUCCESS) errorExit();

    // Invoke kernel (Runtime API)
    int nThreadsPerBlk = 128;
    int nBlks = (N / nThreadsPerBlk) + (((N % nThreadsPerBlk) > 0) ? 1 : 0);
    kernel<<<nBlks, nThreadsPerBlk>>>((float*)d_A, (float*)d_B, (float*)d_C, N);

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    error = cuMemcpyDtoH(h_C, d_C, size);
    if (error != CUDA_SUCCESS) errorExit();

    // Verify result
    int i;
    for (i = 0; i < N; ++i) {
        float sum = h_A[i] + h_B[i];
        if (fabs(h_C[i] - sum) > 1e-7f) {
            printf("Mistake index %d %g %g\n", i, h_C[i], sum);
            break;
        }
    }
    printf("Test %s \n", (i == N) ? "PASSED" : "FAILED");
    return(0);
}
A detailed discussion of the driver API is beyond the scope of this article, as I'm focusing on interoperability and the 3.0 release. Even so, much of this code should look familiar to runtime API developers, as it only performs GPU setup, memory allocation, and data movement. Many of these calls parallel the runtime API calls, so the source code should be easy to follow. The specific point made with this example is that the following runtime-style CUDA call to kernel() works in the 3.0 release:
kernel<<<nBlks, nThreadsPerBlk>>>((float*)d_A, (float*)d_B, (float*)d_C, N);
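For contrast, consider what a pure driver API launch of the same kernel looked like before this interoperability existed: the kernel had to be compiled separately into a module (e.g., a cubin file) loaded with cuModuleLoad, and each argument marshaled by hand. The following sketch of that CUDA 3.0-era pattern assumes a hypothetical module file vectorAdd.cubin and omits error checking for brevity:

// Sketch of a driver-API-only launch (CUDA 3.0-era calls); assumes the
// kernel was compiled separately into the hypothetical file vectorAdd.cubin.
cuModuleLoad(&cuModule, "vectorAdd.cubin");
cuModuleGetFunction(&vecAdd, cuModule, "kernel");

// Marshal each argument into the parameter block by hand.
int offset = 0;
void* ptr;
ptr = (void*)(size_t)d_A;
cuParamSetv(vecAdd, offset, &ptr, sizeof(ptr)); offset += sizeof(ptr);
ptr = (void*)(size_t)d_B;
cuParamSetv(vecAdd, offset, &ptr, sizeof(ptr)); offset += sizeof(ptr);
ptr = (void*)(size_t)d_C;
cuParamSetv(vecAdd, offset, &ptr, sizeof(ptr)); offset += sizeof(ptr);
cuParamSeti(vecAdd, offset, N); offset += sizeof(int);
cuParamSetSize(vecAdd, offset);

// Set the execution configuration and launch.
cuFuncSetBlockShape(vecAdd, nThreadsPerBlk, 1, 1);
cuLaunchGrid(vecAdd, nBlks, 1);

The single runtime-style line above replaces all of this boilerplate, which is precisely why the new interoperability matters.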
Use nvcc to build the executable with debugging enabled as shown below for Linux:
SDK_PATH=$HOME/NVIDIA_GPU_Computing_SDK/C
INC=$SDK_PATH/common/inc
LIB=$SDK_PATH/lib
nvcc -g -G -I$INC -L$LIB vectorAddDrv.cu -lcutil_x86_64 -lcuda -o vectorAddDrv
Now start the program with the cuda-gdb command:
$ cuda-gdb vectorAddDrv
NVIDIA (R) CUDA Debugger
BETA release
Portions Copyright (C) 2008,2009 NVIDIA Corporation
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
Using host libthread_db library "/lib/libthread_db.so.1".
Enable memory checking and run the program with the following two commands:
(cuda-gdb) set cuda memcheck on
(cuda-gdb) r
The following illustrative output shows that the code runs and apparently uses memory correctly. Further, the results passed a validation test on the host, as indicated by the "Test PASSED" message.
Starting program: PATH_TO_EXECUTABLE/vectorAddDrv
[Thread debugging using libthread_db enabled]
[New process 31619]
Vector Addition (Driver API)
[New Thread 139794699740944 (LWP 31619)]
Test PASSED

Program exited normally.
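What would memcheck actually catch? As a hypothetical illustration (not part of the example above), if the kernel's bounds test were dropped, threads in the final, partially filled block would write past the end of the allocations, and a run under cuda-gdb with memcheck enabled should flag those stray accesses:

// Hypothetical buggy variant of the example kernel: without the idx < n
// guard, the last block's extra threads read and write out of bounds.
__global__ void kernelNoGuard(float* d_a, float* d_b, float* d_c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_c[idx] = d_a[idx] + d_b[idx]; // out of bounds when idx >= n
}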
Now set a breakpoint at line 109, which is just after the result vector is copied back to the host, and run vectorAddDrv again:
(cuda-gdb) b 109
Breakpoint 1 at 0x40183e: file vectorAdd.cu, line 109.
(cuda-gdb) r
Starting program: /home/rmfarber/DDJ/Part17/vecAdd/vectorAdd
[Thread debugging using libthread_db enabled]
[New process 31623]
Vector Addition (Driver API)
[New Thread 140441416316688 (LWP 31623)]
[Switching to Thread 140441416316688 (LWP 31623)]

Breakpoint 1, main (argc=1, argv=0x7fff40b55248) at vectorAdd.cu:109
109         if (error != CUDA_SUCCESS) errorExit();
Use the CUDA-GDB set command to change index three of the h_C vector on the host to be 1000-times larger:
(cuda-gdb) p h_C[3]
$1 = 1.30907393
(cuda-gdb) set h_C[3] = 1000. * h_C[3]
The program is then allowed to continue. The resulting output shows that the comparison code does find the error caused by manually modifying the h_C vector:
(cuda-gdb) c
Continuing.
Mistake index 3 1309.07 1.30907
Test FAILED

Program exited normally.
(cuda-gdb) quit
This example demonstrates that CUDA-GDB in the 3.0 release works with driver API programs. It also shows how straightforward it now is to mix driver and runtime API codes.
Similarly, the following example, blasAddDrv.cu, demonstrates that with CUDA 3.0 it is now possible to call CUBLAS library routines from driver API code. In this case, the previous vectorAddDrv.cu example was adapted to call the cublasSaxpy() routine:
cublasSaxpy(N, 1.0f, (float*)d_A, 1, (float*)d_B, 1);
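SAXPY ("single-precision a times x plus y") computes y = alpha*x + y, so with alpha set to 1.0f the call above adds d_A into d_B element by element, in place. A host-side reference loop, shown purely for illustration, makes the semantics explicit:

// Host-side reference for what cublasSaxpy(N, alpha, x, 1, y, 1) computes;
// both increments are 1, so consecutive elements are used.
for (int i = 0; i < N; ++i)
    y[i] = alpha * x[i] + y[i];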
The following is the complete source code for blasAddDrv.cu. Again, I won't discuss the details of the driver API; see the NVIDIA documentation for more information.
/*
 * Vector addition using SAXPY from a driver API example
 */

// Includes
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda.h>
#include <cutil_inline.h>
#include <cublas.h>

// Variables
CUdevice cuDevice;
CUcontext cuContext;
CUmodule cuModule;
CUfunction vecAdd;
float* h_A;
float* h_B;
float* h_C;
CUdeviceptr d_A;
CUdeviceptr d_B;

// Functions
// Allocates an array with random float entries.
void RandomInit(float* data, int n)
{
    for (int i = 0; i < n; ++i)
        data[i] = rand() / (float)RAND_MAX;
}

void errorExit()
{
    printf("Error exit!\n");
    exit(1);
}

// Host code
int main(int argc, char** argv)
{
    int N = 50000;
    unsigned int size = N * sizeof(float);
    CUresult error;
    int status;

    printf("Vector Addition (BLAS and Driver APIs)\n");

    // Initialize the driver API
    error = cuInit(0);
    if (error != CUDA_SUCCESS) errorExit();

    // Get number of devices supporting CUDA
    int deviceCount = 0;
    error = cuDeviceGetCount(&deviceCount);
    if (error != CUDA_SUCCESS) errorExit();
    if (deviceCount == 0) {
        printf("There is no device supporting CUDA.\n");
        exit(1);
    }

    // Get handle for device 0
    error = cuDeviceGet(&cuDevice, 0);
    if (error != CUDA_SUCCESS) errorExit();

    // Create context
    error = cuCtxCreate(&cuContext, 0, cuDevice);
    if (error != CUDA_SUCCESS) errorExit();

    // Allocate input vectors h_A and h_B in host memory
    h_A = (float*)malloc(size);
    if (h_A == 0) errorExit();
    h_B = (float*)malloc(size);
    if (h_B == 0) errorExit();
    h_C = (float*)malloc(size);
    if (h_C == 0) errorExit();

    // Initialize input vectors
    RandomInit(h_A, N);
    RandomInit(h_B, N);

    // Allocate vectors in device memory
    error = cuMemAlloc(&d_A, size);
    if (error != CUDA_SUCCESS) errorExit();
    error = cuMemAlloc(&d_B, size);
    if (error != CUDA_SUCCESS) errorExit();

    // Copy vectors from host memory to device memory
    error = cuMemcpyHtoD(d_A, h_A, size);
    if (error != CUDA_SUCCESS) errorExit();
    error = cuMemcpyHtoD(d_B, h_B, size);
    if (error != CUDA_SUCCESS) errorExit();

    // Initialize CUBLAS
    status = cublasInit();
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "!!!! CUBLAS initialization error\n");
        return EXIT_FAILURE;
    }

    // Call CUBLAS from a driver API code: d_B = 1.0f * d_A + d_B
    cublasSaxpy(N, 1.0f, (float*)d_A, 1, (float*)d_B, 1);

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    error = cuMemcpyDtoH(h_C, d_B, size);
    if (error != CUDA_SUCCESS) errorExit();

    // Verify result
    int i;
    for (i = 0; i < N; ++i) {
        float sum = h_A[i] + h_B[i];
        if (fabs(h_C[i] - sum) > 1e-7f)
            break;
    }
    printf("Test %s \n", (i == N) ? "PASSED" : "FAILED");
    return(0);
}
Use nvcc to build the executable. The following simple script builds the executable on Linux. Note that this program requires the CUBLAS library; linking is specified with the -lcublas command-line option.
SDK_PATH=$HOME/NVIDIA_GPU_Computing_SDK/C
INC=$SDK_PATH/common/inc
LIB=$SDK_PATH/lib
nvcc -I$INC -L$LIB blasAddDrv.cu -lcutil_x86_64 -lcuda -lcublas -o blasAddDrv
Running the program demonstrates that it does indeed work correctly, as indicated by the "Test PASSED" message:
$ ./blasAddDrv
Vector Addition (BLAS and Driver APIs)
Test PASSED
Unquestionably, the ability to mix and debug driver and runtime API codes and libraries is valuable. For many developers, this expanded capability alone will make the CUDA 3.0 release an obvious download. As the CUDA library and code base expands, this transparent interoperability will continue to pay dividends in ease of use, although it is likely that in the future most developers will use this capability without a thought for the ease with which it occurs.