Tips for Using CUDA-GDB with Large Data
Graphics processors are excellent platforms for working with large amounts of data due to their many-core architecture, massive threading model, and high performance. However, finding errors in these large volumes of data by manual methods can be time consuming and painful.
To make debugging easier, I like to include some simple helper debugging routines in my code that can be called from GDB, CUDA-GDB, or used as part of a test harness for sanity checking through the use of assertions. When running in production, these debugging routines are never called (and, if necessary, they can be eliminated entirely with #ifdef statements), so they incur no memory or processor overhead. The ability to interactively call any of a number of debugging routines from GDB (or CUDA-GDB) provides a convenient and easy way to look for errors while the program is executing -- without having to modify the source code or recompile!
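As a concrete illustration, here is a minimal sketch of such a helper (the DEBUG macro and the checkVector() name are hypothetical, not from the example program): it can be called interactively from the debugger, e.g., "(gdb) p checkVector(v, n)", yet compiles away completely in production builds.

// Hypothetical sketch of an #ifdef-guarded debug helper.
#ifdef DEBUG
float checkVector(const float *v, int n)
{
  float sum = 0.f;                          // one aggregate value for sanity checks
  for (int i = 0; i < n; i++) sum += v[i];
  return sum;
}
#endif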
The function sumOnHost() in the AssignScaleVectorWithError.cu example program below provides one illustration of a helpful debugging routine. In this case, sumOnHost() calculates the floating-point sum of a large vector after moving the vector from device to host. It is easy to imagine how the same idea can be extended to provide useful information about any large data structure. Calculating a sum is useful because it produces a single number that can be used to get a sense of the data, identify NaN (Not a Number) problems, and perform other sanity checks. Many network and disk subsystems use a similar technique, calculating a checksum (or other aggregate measure) to look for data errors.
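The same pattern extends naturally to other sanity checks. The following sketch (the function name is hypothetical, not from the example program) counts NaN values while forming the sum of host-resident data, so a single interactive call reports both the aggregate value and possible data corruption:

#include <math.h>

// Illustrative variant: sum the data while counting NaN values,
// which would otherwise silently poison the sum.
float sumAndCountNaN(const float *v_h, int n, int *nanCount)
{
  float sum = 0.f;
  *nanCount = 0;
  for (int i = 0; i < n; i++) {
    if (isnan(v_h[i])) (*nanCount)++;
    else sum += v_h[i];
  }
  return sum;
}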
Using a sum as a comparative measure to spot data differences can be especially useful when known-working host-based software exists that can be used to compare intermediate results against a CUDA-based kernel. Instead of blindly looking through huge tables of numbers to find a difference, the programmer can leverage the power of CUDA-GDB to quickly isolate the first occurrence of any differences between the legacy host and GPU kernel results and/or intermediate values.
From experience, this is a wonderful time-saving capability. Be aware that some variability will inevitably occur when comparing floating-point results because floating-point is only an approximate representation. Minor differences in how the host or GPU performs arithmetic, and even legitimate variations in the ordering of arithmetic operations (which can be caused by simple changes like using different compiler switches or changing optimization levels), can cause slight variations in the results of even correctly working code.
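One common way to allow for such variability (a sketch; the function name and the 1.0e-5f tolerance are arbitrary illustrative choices) is to compare with a relative error bound rather than testing for exact equality:

#include <math.h>

// Sketch of a relative-error comparison for floating-point results.
int nearlyEqual(float a, float b, float relTol)
{
  float diff  = fabsf(a - b);
  float scale = fmaxf(fabsf(a), fabsf(b));
  return diff <= relTol * scale;   // also true when both values are exactly 0
}

For example, a test harness might use nearlyEqual(hSum, dSum, 1.0e-5f) instead of the exact comparison (hSum != dSum).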
Most Unix-based operating system releases of the CUDA Toolkit now include CUDA-GDB, so you should be able to just type:
cuda-gdb
to start the debugger. (If not, look to the CUDA-GDB release notes and the NVIDIA CUDA forums to see how others have gotten the debugger to work on your specific OS.)
CUDA-GDB accepts the same variety of arguments and options as GDB. Usually CUDA-GDB is started with one argument that specifies the program executable to debug (e.g., cuda-gdb a.out). CUDA-GDB can also be used to debug programs that are already running by adding the process ID of the program (PID) to the command-line (e.g., cuda-gdb a.out pid).
To debug programs in a human-friendly fashion, the compiler needs to generate additional debugging information for CUDA-GDB that describes the data type of each variable or function and the correspondence between source line numbers and addresses in the executable code. To make the compiler generate this information, both the -g and -G options must be specified when you run the nvcc compiler.
The following is the command line to compile the program AssignScaleVectorWithError.cu for debugging:
nvcc -G -g AssignScaleVectorWithError.cu -o AssignScaleVectorWithError
So, what do these command-line options do?
- The -G option specifies that debugging information be generated for the CUDA kernels. It also:
  - Forces -O0 (mostly unoptimized) compilation
  - Spills all variables to local memory (and will probably slow program execution)
- The -g option tells nvcc to generate debugging information for the host code and include symbolic debugging information in the executable.
- Finally, the -o option tells the compiler to write the executable to AssignScaleVectorWithError.
NOTE: It is currently not possible to generate debugging information when compiling with the -cubin option.
The following is the source code for AssignScaleVectorWithError.cu:
#include <stdio.h>
#include <stdlib.h>   // for malloc(), free(), exit()
#include <assert.h>

// A simple example program to illustrate
// debugging with cuda-gdb

// Vector on the device
float *a_d;

// Print a message if a CUDA error occurred
void checkCUDAError(const char *msg)
{
  cudaError_t err = cudaGetLastError();
  if (cudaSuccess != err) {
    fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }
}

// Zero the vector
__global__ void zero(float *v_d, int n)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n) v_d[tid] = 0.f;
}

// Assign the vector with consecutive values starting with zero
__global__ void assign(float *v_d, int n)
{
  int tid = threadIdx.x;
  if (tid < n) v_d[tid] = ((float) tid);
}

// Scale the vector
__global__ void scale(float *v_d, int n, float scaleFactor)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n) v_d[tid] *= scaleFactor;
}

// Move the vector to the host and sum
float sumOnHost(const float *v_d, int n)
{
  float sum = 0.f;
  int i;

  // create space on the host for the device data
  float *v_h = (float*) malloc(n * sizeof(float));
  // check if the malloc succeeded
  assert(v_h != NULL);

  // copy the vector from the device to the host
  cudaMemcpy(v_h, v_d, n * sizeof(float), cudaMemcpyDeviceToHost);

  for (i = 0; i < n; i++) sum += v_h[i];

  // free the vector on host
  free(v_h);
  return (sum);
}

int main()
{
  int nBlocks = 32;
  int blockSize = 256;
  int n = nBlocks * blockSize;
  float scaleFactor = 10.f;

  // create the vector a_d on the device and zero it
  cudaMalloc((void**)&a_d, n * sizeof(float));
  checkCUDAError("Create and zero vector");

  // fill the vector with zeros
  zero<<<nBlocks, blockSize>>>(a_d, n);
  // assign the vector
  assign<<<nBlocks, blockSize>>>(a_d, n);
  // scale the vector by scaleFactor
  scale<<<nBlocks, blockSize>>>(a_d, n, scaleFactor);

  // calculate the sum of the vector on the host
  float dSum = sumOnHost(a_d, n);
  checkCUDAError("calculating dSum");

  // Check if both host and GPU agree on the result
  float hSum = 0.f;
  for (int i = 0; i < n; i++) hSum += ((float) i) * scaleFactor;

  if (hSum != dSum) {
    printf("TEST FAILED!\n");
  } else {
    printf("test succeeded!\n");
  }

  // free the vector on the device
  cudaFree(a_d);
}
In a nutshell, this program creates a vector on the device, a_d, which is filled with zeros by the kernel zero(). The vector a_d is then assigned consecutively increasing values, starting at zero, by the kernel assign(). Finally, the vector a_d is multiplied by a scale factor with the kernel scale(). The host routine sumOnHost() is called to calculate the sum of the values in a_d, which is placed in dSum and compared against the host-generated sum contained in the variable hSum. If the values are the same, we get a message that states the test succeeded. Otherwise, the program indicates the test failed.
As we see below, running the unmodified program generates a failure message, which indicates there is a bug in the code:
$ ./AssignScaleVectorWithError
TEST FAILED!
The following command starts CUDA-GDB so it can be used to debug the program:
$ cuda-gdb AssignScaleVectorWithError
You should see output similar to the following:
NVIDIA (R) CUDA Debugger BETA release
Portions Copyright (C) 2008,2009 NVIDIA Corporation
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
Using host libthread_db library "/lib/libthread_db.so.1".
We use the abbreviated command l (for "list") to look at the lines around line 81 in the source code:
(cuda-gdb) l 81
76          checkCUDAError("Create and zero vector");
77
78          // fill the vector with zeros
79          zero<<<nBlocks, blockSize>>>(a_d, n);
80          // assign the vector
81          assign<<<nBlocks, blockSize>>>(a_d, n);
82          // scale the vector by scaleFactor
83          scale<<<nBlocks, blockSize>>>(a_d, n, scaleFactor);
84
85          // calculate the sum of the vector on the host
Now we set a breakpoint prior to starting execution of the assign() kernel at line 81 with the following command (again using the one-letter abbreviation b, for "break"):
(cuda-gdb) b 81
Breakpoint 1 at 0x40f216: file AssignScaleVectorWithError.cu, line 81.
Breakpoints can also be set symbolically, as shown with the following command that sets a breakpoint whenever the kernel scale() is called:

(cuda-gdb) b scale
Breakpoint 2 at 0x40f4e3: file AssignScaleVectorWithError.cu, line 38.
We now run the program in the debugger using the letter r instead of typing out the full command run. (Note: some of the output, such as the process ID, may differ on your system.)
(cuda-gdb) r
Starting program: /home/XXX/DDJ/Part14/AssignScaleVectorWithError
[Thread debugging using libthread_db enabled]
[New process 16805]
[New Thread 140439601190656 (LWP 16805)]
[Switching to Thread 140439601190656 (LWP 16805)]

Breakpoint 1, main () at AssignScaleVectorWithError.cu:81
81          assign<<<nBlocks, blockSize>>>(a_d, n);
Current language: auto; currently c++
Using the p command (for "print"), we call the host function sumOnHost() with arguments appropriate to move all the data in the GPU array a_d to the host and calculate a sum of the values. As can be seen, the call to the kernel zero() appears to have worked correctly, as the vector seems to be filled with only zero floating-point values:
(cuda-gdb) p sumOnHost(a_d, n)
$1 = 0
We use the next command (abbreviated n) to run the next line of the program. In this case, the program runs the assign() kernel on the GPU. Please note that, unlike normal execution, kernel launches in CUDA-GDB happen synchronously (normally kernels are launched asynchronously). Thus, after typing the next command, control returns only after the assign() kernel runs to completion on the GPU.
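Outside the debugger this synchronous behavior does not hold, so host code that examines kernel output must first make sure the kernel has finished. In this program, the blocking cudaMemcpy() inside sumOnHost() provides that guarantee implicitly, but an explicit barrier can also be used, as in this sketch (cudaThreadSynchronize() is the call from CUDA toolkits of this era; it was later renamed cudaDeviceSynchronize()):

assign<<<nBlocks, blockSize>>>(a_d, n);   // returns immediately outside the debugger
cudaThreadSynchronize();                  // block the host until the kernel completes
checkCUDAError("assign");                 // any launch or execution error is now visible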
As pointed out in Section 4.4 of the CUDA-GDB manual, the debugger supports stepping GPU code at the granularity of a warp. This means that individual threads are not advanced; rather, all the threads within the warp advance. The exception is stepping over a thread barrier call, __syncthreads(), which causes all the threads to advance past the barrier. Additionally, it is not possible to step over a subroutine because the compiler currently inlines this code. Thus, it is only possible to step into a subroutine.
Again we look at the sum that is returned from sumOnHost() with the print command:

(cuda-gdb) n
83          scale<<<nBlocks, blockSize>>>(a_d, n, scaleFactor);
(cuda-gdb) p sumOnHost(a_d, n)
$2 = 32640
In this case, the returned value of 32640 looks suspiciously small to be the sum of all integers in the range [0, nBlocks*blockSize): the expected value is 0 + 1 + ... + 8191 = 33,550,336, while 32640 happens to be the sum of just the integers 0 through 255. We elect to "continue" (abbreviated with c) until we hit the next breakpoint, which happens to be the breakpoint set in the CUDA kernel scale(). (Note: For the moment, we ignore the meaning of the line that describes the "Current CUDA Thread".)
(cuda-gdb) c
Continuing.
[Current CUDA Thread <<<(0,0),(0,0,0)>>>]

Breakpoint 2, scale () at AssignScaleVectorWithError.cu:38
38          int tid = blockIdx.x * blockDim.x + threadIdx.x;
The debugger allows us to examine values in GPU memory while a kernel is running on the GPU. This is the case here, as the breakpoint was set to stop the program inside the scale() kernel.
The address of a_d was passed into this kernel via the v_d function argument. Using the print command (abbreviated p), we can examine successive values of the vector residing in GPU memory by using the GNU concept of an artificial array. As can be seen in the output of the following command, the first 10 values (indicated by the syntax @10 in the command) of the vector were set correctly by the assign() kernel:
(cuda-gdb) p *v_d@10
$3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
However, we see that vector elements with indexes greater than 255 are still set to zero, which indicates there is a problem in the assign() kernel. The following command examines elements 250 through 259 by telling the debugger to offset 250 elements from the start of the v_d pointer with the syntax (v_d+250):
(cuda-gdb) p *(v_d+250)@10
$4 = {250, 251, 252, 253, 254, 255, 0, 0, 0, 0}
We type quit to exit CUDA-GDB and go look at our code:

(cuda-gdb) quit
The program is running.  Exit anyway? (y or n) y
$
It turns out that the problem with assign() is that the variable tid is set only to threadIdx.x. This is incorrect when multiple blocks are used.
Let's verify this by using the CUDA-GDB extensions that allow us to look at individual threads within a warp.
Start CUDA-GDB again:
$ cuda-gdb AssignScaleVectorWithError
Now, let's set a breakpoint at line 31, which is after tid is initialized in the assign() kernel, and run the debugger until the breakpoint is hit:
(cuda-gdb) b 31
Breakpoint 1 at 0x40f4cf: file AssignScaleVectorWithError.cu, line 31.
(cuda-gdb) r
Starting program: /home/XXXX/DDJ/Part14/AssignScaleVectorWithError
[Thread debugging using libthread_db enabled]
[New process 22405]
[New Thread 139839913080576 (LWP 22405)]
[Switching to Thread 139839913080576 (LWP 22405)]
[Current CUDA Thread <<<(0,0),(0,0,0)>>>]

Breakpoint 1, assign () at AssignScaleVectorWithError.cu:31
31          if(tid < n)
Current language: auto; currently c++
Using the "info cuda threads" command, we see the following output:
(cuda-gdb) info cuda threads
<<<(0,0),(0,0,0)>>> ... <<<(31,0),(255,0,0)>>> assign () at AssignScaleVectorWithError.cu:31
CUDA thread information is represented in the following form:

<<<(BX,BY),(TX,TY,TZ)>>>

where BX and BY are the X and Y block indexes, and TX, TY, and TZ are the corresponding thread X, Y, and Z indexes. Thus we can see that the assign() kernel has blocks with indexes ranging from (0,0) to (31,0) and threads within each block ranging from (0,0,0) to (255,0,0). This correctly represents a kernel configured to run on the GPU with 32 blocks, where each block contains 256 threads.
The following line indicates the debugger is currently set to examine the first thread of the first block:
[Current CUDA Thread <<<(0,0),(0,0,0)>>>]
Printing tid shows that it is correctly set to zero for this thread:

(cuda-gdb) p tid
$1 = 0
Using the CUDA thread syntax, we switch to block 31 and thread 255, using an abbreviated syntax to save typing:

(cuda-gdb) thread <<<(31),(255)>>>
Switching to <<<(31,0),(255,0,0)>>> assign () at AssignScaleVectorWithError.cu:31
31          if(tid < n)
Printing the value of the tid variable shows that it is incorrectly set to 255; for this thread it should be 31*256 + 255 = 8191.

(cuda-gdb) p tid
$2 = 255
We now know the assign() kernel incorrectly assigns only threadIdx.x to tid with the following statement:

int tid = threadIdx.x;
Using an editor, change the assignment of the tid index to the following:

int tid = blockIdx.x * blockDim.x + threadIdx.x;
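With this one-line change, the corrected assign() kernel computes a global thread index exactly as zero() and scale() already do:

// Assign the vector with consecutive values starting with zero
__global__ void assign(float *v_d, int n)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global index across all blocks
  if (tid < n) v_d[tid] = ((float) tid);
}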
After saving, recompiling, and running the modified program, we see that the program now reports success:
test succeeded!
Starting CUDA-GDB and repeating the previous debugging steps, we now see that tid in thread <<<(31,0),(255,0,0)>>> correctly contains the value 8191:

(cuda-gdb) thread <<<(31),(255)>>>
Switching to <<<(31,0),(255,0,0)>>> assign () at AssignVector.cu:31
31          if(tid < n)
(cuda-gdb) p tid
$1 = 8191
Additional CUDA-GDB Debugging Extensions and Semantics
CUDA-GDB provides a number of CUDA-specific commands:

- thread -- Display the current host and CUDA thread of focus
- thread <<<(TX,TY,TZ)>>> -- Switch to the CUDA thread at the specified coordinates
- thread <<<(BX,BY),(TX,TY,TZ)>>> -- Switch to the CUDA block and thread at the specified coordinates
- info cuda threads -- Display an overall summary of all CUDA threads that are currently resident on the GPU
- info cuda threads all -- Display a list of each CUDA thread that is currently resident on the GPU (this list can be quite large)
- info cuda state -- Display information about the current CUDA state
Special semantics of the next and step commands:

- Execution is advanced at the warp level; all threads in the same warp as the current CUDA thread will proceed.
- A special case is stepping over the thread barrier call, __syncthreads(), which causes an implicit breakpoint to be set immediately after the barrier. All threads are continued to this breakpoint after the __syncthreads() call completes.
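To make the barrier semantics concrete, here is a minimal illustrative kernel (hypothetical; not part of the example program) marking where stepping over __syncthreads() leaves the threads:

// Illustrative kernel: each block reverses its segment of the vector.
// Assumes blockDim.x == 256 and n is a multiple of the block size,
// as in the example program's launch configuration.
__global__ void reverseEachBlock(float *v_d, int n)
{
  __shared__ float tmp[256];
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n) tmp[threadIdx.x] = v_d[tid];
  __syncthreads();   // stepping over this line in CUDA-GDB advances ALL
                     // threads past the barrier to the statement below
  if (tid < n) v_d[tid] = tmp[blockDim.x - 1 - threadIdx.x];
}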