In CUDA, Supercomputing for the Masses: Part 13 of this article series, I resumed the discussion of "texture memory" begun in Part 11 and covered, among other things, how the recently introduced CUDA Toolkit 2.2 added the ability to write to global memory on the GPU that has a 2D texture bound to it. This installment focuses on debugging techniques and how CUDA-GDB can be used to effectively diagnose and debug CUDA code -- with an emphasis on how to speed the process when looking through large amounts of data, as well as the thread syntax and semantic differences needed to debug kernels while they are running on the GPU.
Upfront, you need to be aware of two important points about the current state of CUDA-GDB:
- It only runs on UNIX-based systems. Microsoft users in particular should look to the recently announced Visual Studio debugger, which I'll cover in a future column.
- X11 cannot be running on the GPU that is used for debugging. Instead, use either a multi-GPU system, or kill X11 and remotely access the single-GPU system via ssh, VNC, or some other method.
This article also provides an admittedly contrived debugging example. I recommend everyone look through it -- regardless of their previous experience with GDB -- for the following reasons:
- Everyone new to CUDA-GDB: Example commands demonstrate how to use the new CUDA thread syntax and highlight semantic changes that allow debugging and stepping through kernels running on CUDA-enabled graphics processors.
- Beginners: This example shows how to set breakpoints (both symbolically and by line number), as well as how to perform other basic debugging operations: running the program, stepping through the source code line-by-line, continuing program execution, and exiting the debugger.
- Experts: GPUs are fast, which generally means that debugging a CUDA application requires digging through large amounts of data to find a problem. Putting a little thought into debugging strategies including the application of debugger helper functions and GDB artificial arrays can really save debugging time (and personal sanity).
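To give a flavor of what the example covers, a CUDA-GDB session combining these ideas might look something like the sketch below. The file name, line number, and variable `d_a` are hypothetical; the `thread <<<...>>>` focus-switching syntax and the `@` artificial-array operator are the CUDA-GDB and GDB features discussed in this article.

```
(cuda-gdb) break main                    # symbolic breakpoint
(cuda-gdb) break bitreverse.cu:15        # breakpoint by line number (hypothetical file)
(cuda-gdb) run
(cuda-gdb) info cuda threads             # list the device threads at this breakpoint
(cuda-gdb) thread <<<(0,0),(5,0,0)>>>    # switch focus to block (0,0), thread (5,0,0)
(cuda-gdb) next                          # step the kernel source line-by-line
(cuda-gdb) print *d_a@10                 # GDB artificial array: first 10 elements of d_a
(cuda-gdb) continue
(cuda-gdb) quit
```

Artificial arrays (the `@` operator) are a stock GDB feature, but they become especially valuable on the GPU, where the data sets being searched for a bug can be very large.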
Essentially, CUDA-GDB is a port of version 6.6 of the GNU GDB debugger. Programmers who have used GDB in the past will find CUDA-GDB already familiar, and should look to this article for CUDA-specific tips. Newcomers to GDB will be able to use this article to begin debugging their software right away, but should look to one of the many extensive tutorials and references on the Internet to find out more about the comprehensive set of features available within this powerful debugging tool. One obvious starting place is the GNU documentation for GDB. Regardless of skill level, all CUDA developers should at least glance over the latest version of the CUDA-GDB documentation, CUDA-GDB: The NVIDIA CUDA Debugger.
Debugging Methods Prior to CUDA-GDB
Prior to the creation of CUDA-GDB, the easiest (and least elegant) method of debugging a CUDA program was to add print statements to the source code and compile to run in the emulator (initiated by passing the -deviceemu flag to the nvcc compiler). Since the emulator runs on the host processor (and not on the GPU), the print statements can be placed anywhere in the code -- including inside kernels -- and all the usual host-based debugging tools can be applied. The drawback is that emulated code does not execute on the actual hardware, so device-specific behavior (such as race conditions among GPU threads) may not reproduce.
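As a minimal sketch of this approach (the kernel name and contents are illustrative, not from the article), a printf() placed inside a kernel is legal under emulation because the "device" code actually runs on the host CPU:

```cuda
// debugPrint.cu -- compile with: nvcc -deviceemu debugPrint.cu
#include <stdio.h>

__global__ void incrementKernel(int *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        a[idx] += 1;
        // Works in emulation mode because this runs on the host processor
        printf("thread %d: a[%d] = %d\n", idx, idx, a[idx]);
    }
}
```

Without -deviceemu, a printf() inside a __global__ function will not compile with the toolkit versions discussed here.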
An alternative method that actually runs the code on the GPU, and thus permits examination of GPU-calculated results, uses cudaMemcpy() to transfer any variables of interest from the GPU to a scratch location on the host. Host-based methods (including GDB and/or print statements) can then be used to examine the information in the scratch location and, hopefully, diagnose the problem. Later in this article, I use this technique to demonstrate a simple helper debugging function for CUDA-GDB to identify an error when writing to a large CUDA vector.
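A bare-bones sketch of the scratch-copy technique looks like the following (buffer names and sizes are hypothetical; the kernel launch is elided):

```cuda
// scratchCopy.cu: copy device results to a host scratch buffer so they
// can be examined with printf() or from a breakpoint in the debugger.
#include <stdio.h>
#include <cuda_runtime.h>

#define N 1024

int main(void)
{
    float *d_data, h_scratch[N];

    cudaMalloc((void **)&d_data, N * sizeof(float));
    // ... launch the kernel(s) that fill d_data ...

    // Transfer the region of interest from the GPU to the host scratch location
    cudaMemcpy(h_scratch, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Examine the data on the host (or set a breakpoint on this loop in GDB)
    for (int i = 0; i < 10; i++)
        printf("h_scratch[%d] = %g\n", i, h_scratch[i]);

    cudaFree(d_data);
    return 0;
}
```

The cost of the extra transfer is irrelevant during debugging, and the scratch copy can simply be deleted once the kernel is verified.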
Mapped Memory and the Importance of Regression Testing
It is worth mentioning at this point that the mapped memory capability introduced in the CUDA 2.2 release provides an important (and convenient) new way to facilitate regression testing when porting legacy application code to CUDA. Without question, regression testing is an essential software practice; its importance in creating and verifying correctly working software cannot be emphasized too strongly.
In the case of legacy software, the developer already has a working code base that can be used for comparison with GPU generated results to help identify errors. Mapped memory (discussed in Part 12, "CUDA 2.2 Changes the Data Movement Paradigm") greatly facilitates this process by transparently maintaining a synchronized version of data between both the host and device memory spaces. With care, the programmer can exploit this transparent synchronization to keep the original software functional throughout the entire porting project. As a result, there will be a known working version that can be used to compare all GPU results and intermediate results.
Essentially, new CUDA kernels can be incorporated into the legacy code without having to think about explicitly moving data off the host and onto the GPU, which allows easy switching between the new CUDA kernel(s) and the corresponding original host code. The new GPU version can then be evaluated on one or more test cases to see if it produces correct results. If an error is identified, then the original host code for that phase of the calculation -- plus intermediate results -- can be used to quickly identify the first appearance of the error in the GPU code. Eventually, enough of the calculation will reside on the GPU that it is no longer necessary to maintain synchronization with the host, and mapping can be disabled or removed -- thus creating a GPU-only version of the legacy code that can run at full speed without any PCI bottlenecks.
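The porting pattern described above can be sketched as follows. The phase names, kernel contents, and tolerance are hypothetical; the point is that the mapped (zero-copy) buffer lets the new kernel and the legacy host routine be compared directly:

```cuda
// regression.cu: compare a new CUDA kernel against known-good legacy code
// using a mapped allocation shared between host and device.
#include <math.h>
#include <stdio.h>
#include <cuda_runtime.h>

#define N 1024

__global__ void gpuPhase(float *a, int n)   // new CUDA version of one phase
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

void hostPhase(float *a, int n)             // original legacy host code
{
    for (int i = 0; i < n; i++) a[i] *= 2.0f;
}

int main(void)
{
    float *h_a, *d_a, ref[N];

    // Mapped (zero-copy) allocation: one buffer visible to host and device
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&h_a, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_a, h_a, 0);

    for (int i = 0; i < N; i++) { h_a[i] = (float)i; ref[i] = (float)i; }

    gpuPhase<<<(N + 255) / 256, 256>>>(d_a, N);  // new CUDA kernel
    cudaThreadSynchronize();                     // ensure the kernel finished
    hostPhase(ref, N);                           // known working legacy code

    // Regression test: flag any GPU result that disagrees with the legacy result
    for (int i = 0; i < N; i++)
        if (fabsf(h_a[i] - ref[i]) > 1e-6f)
            printf("mismatch at %d: %g vs %g\n", i, h_a[i], ref[i]);

    cudaFreeHost(h_a);
    return 0;
}
```

Once the whole calculation has moved to the GPU, the mapped allocation can be replaced with an ordinary cudaMalloc() and the comparison loop removed.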