Jason Sanders is coauthor, with Edward Kandrot, of CUDA by Example: An Introduction to General-Purpose GPU Programming. When he's not putting pen to paper, Jason is a senior software engineer in NVIDIA's CUDA Platform Group, where he helped develop early releases of CUDA system software and contributed to the OpenCL 1.0 specification. On the eve of his book's release, Jason took time to talk with Dr. Dobb's.
Dr. Dobb's: What's more difficult -- parallel programming or writing a book?
Sanders: For me, writing a book was more difficult, mostly because teaching someone else to do something has always been harder for me than actually doing it myself. They say that those who can't do, teach, but in my experience I've never really felt comfortable knowing something until I've had to teach it to someone else.
Dr. Dobb's: Let's start with the book. What's it about?
Sanders: The book, CUDA by Example, is meant to be an introduction to general-purpose programming on the GPU. We wanted the book to be accessible to as wide an audience as possible, so we assume no special knowledge of GPUs, graphics APIs, parallel processing, or CUDA. The book is oriented toward people who have experience writing C or C++, but don't necessarily have any experience with a GPU, with CUDA, or with computer graphics.
We try to use interesting, but relatively simple examples to explore the CUDA C language and many of the CUDA APIs. By the end of the book, you'll know the basics of nearly every feature in CUDA, including shared memory, textures, graphics interoperability, streams, atomics, and using multiple GPUs simultaneously.
For those hesitant to invest the $35, a sample chapter is available for free online.
Dr. Dobb's: CUDA 3.1 was recently announced. How does this version differ from the previous version?
Sanders: Like most of our releases, CUDA 3.1 has a lot of new features. Some of these features are pretty amazing, especially if you've been following GPU computing for the past 8 or 9 years.
There are two features that I'm personally really excited about. As of CUDA 3.1, users can now launch up to 16 concurrent kernels at once on Fermi architecture GPUs. This development is particularly exciting because it adds a whole new dimension of parallel processing to the GPU. Each of these concurrent kernels can itself be a massively data-parallel function, so as of CUDA 3.1, you can run multiple data-parallel kernels in parallel with each other.
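A minimal sketch of how independent kernels might be issued so that the hardware is free to overlap them, assuming the standard CUDA runtime stream API. The kernel `scale` and the helper `run_in_streams` are hypothetical names chosen for illustration, not code from the book. Placing each launch in its own stream only permits concurrency; whether the kernels actually overlap depends on the GPU and on the resources each kernel needs.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)

// Hypothetical data-parallel kernel: multiply every element by a factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launch one kernel per stream, then verify every buffer on the host.
bool run_in_streams(int numStreams, int n) {
    std::vector<cudaStream_t> streams(numStreams);
    std::vector<float *> dev(numStreams, nullptr);
    std::vector<float> host(n, 1.0f);
    size_t bytes = n * sizeof(float);

    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMalloc(&dev[s], bytes));
        CHECK(cudaMemcpy(dev[s], host.data(), bytes, cudaMemcpyHostToDevice));
    }

    // Kernels in different non-default streams have no ordering between them,
    // so the GPU may run them side by side.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int s = 0; s < numStreams; ++s) {
        scale<<<blocks, threads, 0, streams[s]>>>(dev[s], n, float(s + 2));
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    bool ok = true;
    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaMemcpy(host.data(), dev[s], bytes, cudaMemcpyDeviceToHost));
        for (int i = 0; i < n; ++i) ok = ok && (host[i] == float(s + 2));
        CHECK(cudaFree(dev[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    return ok;
}

int main() {
    printf("all streams correct: %s\n", run_in_streams(4, 1024) ? "yes" : "no");
    return 0;
}
```

Each stream here owns its own buffer, so the kernels share no data and need no synchronization among themselves.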
CUDA 3.1 also marks the debut of a printf() that executes directly on the GPU. For those of us who really appreciate quick and dirty printf() debugging, we can finally make use of this "classic" method as of CUDA 3.1. Of course, with CUDA-GDB and Parallel Nsight, CUDA debugging tools have gotten so good that many people no longer yearn for printf(). But honestly, sometimes I'm just too lazy to run a full-featured debugger.
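As a rough illustration of that quick and dirty style, here is a small sketch that calls printf() from inside a kernel. The kernel `hello` and the helper `launch_hello` are hypothetical names. Device-side printf() needs a GPU of compute capability 2.0 or later, the output is buffered until a synchronization point such as `cudaDeviceSynchronize()`, and the order in which threads' lines appear is not guaranteed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)

// Each thread prints its coordinates and records that it ran.
__global__ void hello(int *seen) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d, global id %d\n", blockIdx.x, threadIdx.x, tid);
    seen[tid] = 1;
}

// Launch the kernel and return how many threads reported in.
int launch_hello(int blocks, int threads) {
    int total = blocks * threads;
    int *d_seen = nullptr;
    CHECK(cudaMalloc(&d_seen, total * sizeof(int)));
    CHECK(cudaMemset(d_seen, 0, total * sizeof(int)));

    hello<<<blocks, threads>>>(d_seen);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // buffered printf output is flushed here

    std::vector<int> seen(total);
    CHECK(cudaMemcpy(seen.data(), d_seen, total * sizeof(int), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_seen));

    int count = 0;
    for (int v : seen) count += v;
    return count;
}

int main() {
    printf("threads that ran: %d\n", launch_hello(2, 4));
    return 0;
}
```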
Aside from these two major features, CUDA 3.1 brings a large number of performance enhancements and bug fixes over CUDA 3.0, notably in the CUBLAS and CUFFT libraries. The other changes from 3.0 are described here.
Dr. Dobb's: What's hard about parallel programming?
Sanders: Well, everything that's hard about serial programming is still hard when you move to parallel programming, but there are loads of complexities specific to parallel programming. Parallel programming brings with it the obvious "bookkeeping" problems like mutual exclusion, interthread communication, and the avoidance of deadlocks and race conditions, to name a few.
But often the challenge lies at a higher, more abstract level, namely that of extracting the fundamental parallelism in your problem. This can be especially difficult when you're working with hardware as massively parallel as the CUDA architecture. This is a machine that's designed to run at full efficiency when using thousands of threads to solve a problem. Parallelism at this scale is just not a paradigm in which most of us were trained to think. It can certainly be daunting at first, but fortunately we've found that it doesn't take long to grow accustomed to programming in parallel.
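To make the race-condition problem concrete, here is a small sketch in which thousands of threads increment one shared counter, first with an unprotected read-modify-write and then with `atomicAdd()`. The kernel and helper names (`racy_count`, `atomic_count`, `count_with`) are hypothetical. The racy version typically loses updates, and its result can vary from run to run, so only the atomic result is predictable.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)

// Unprotected read-modify-write: threads can overwrite each other's updates.
__global__ void racy_count(unsigned int *counter) {
    *counter = *counter + 1;
}

// The hardware performs the read-modify-write as one indivisible operation.
__global__ void atomic_count(unsigned int *counter) {
    atomicAdd(counter, 1u);
}

// Run one of the two kernels and return the final counter value.
unsigned int count_with(bool useAtomic, int blocks, int threads) {
    unsigned int *d_counter = nullptr;
    unsigned int result = 0;
    CHECK(cudaMalloc(&d_counter, sizeof(unsigned int)));
    CHECK(cudaMemset(d_counter, 0, sizeof(unsigned int)));

    if (useAtomic) {
        atomic_count<<<blocks, threads>>>(d_counter);
    } else {
        racy_count<<<blocks, threads>>>(d_counter);
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaMemcpy(&result, d_counter, sizeof(unsigned int), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_counter));
    return result;
}

int main() {
    int blocks = 64, threads = 256;  // 16,384 threads in total
    printf("expected: %d\n", blocks * threads);
    printf("racy:     %u\n", count_with(false, blocks, threads));
    printf("atomic:   %u\n", count_with(true, blocks, threads));
    return 0;
}
```

Funneling every thread through a single atomic counter is correct but serializes the updates; it is shown here only to isolate the race, not as a recommended way to count.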
Dr. Dobb's: What's the most unique non-graphics application of a GPU that you've run across?
Sanders: Our users regularly discover applications for CUDA that amaze me! We currently have over 1,100 project submissions on the CUDA Zone website, and frankly, most of the applications are in domains that I have no experience with. In fact, just this week I learned that CUDA is being used to help process the huge amount of imagery that has been collected of the oil spill in the Gulf of Mexico.
As for the most unique application, perhaps this is just my unruly curly hair talking but I was (pleasantly) surprised to hear that CUDA is being used to accelerate research into better detergents for shampoo. I can only hope that one day CUDA will deliver me a world where my hair is much more silky and manageable!
Dr. Dobb's: What's the difference between a GPU and general-purpose (multicore) CPU?
Sanders: I think the differences are most readily understandable when you look at the problems each processor was originally designed to solve.
The processing pipeline in a CPU was designed to handle one complex task as quickly as possible. To this end, CPUs employ a variety of schemes to prevent this pipeline from stalling. Elaborate cache hierarchies, out-of-order execution units, and sophisticated branch prediction hardware all help CPUs offer very low-latency response to single, complex computational tasks.
On the other hand, GPUs were designed to solve thousands (or millions) of very similar, but relatively simple computational tasks (determining the color of each pixel on a computer monitor). To achieve this, GPUs invest a great deal of their silicon real estate in arithmetic units. While the comparatively simple execution units often hit high-latency stalls, there are a sufficient number of threads in flight so that the arithmetic units never actually need to go idle. The disproportionate number of transistors dedicated to ALUs ensures that the entire problem will enjoy a huge computational throughput despite the fact that individual computations may suffer large latencies.
Of course, these lines are being blurred more and more as time passes. Modern CPUs have multiple cores, each of which is capable of handling a couple of threads of execution. Likewise, the current generation of GPUs sports more sophisticated, superscalar execution units, cache hierarchies, and concurrent execution of multiple programs.
Dr. Dobb's: For programmers who avoid C but like the CUDA architecture -- is there hope?
Sanders: There is most definitely hope, even for programmers who wish to maintain their C-free lifestyle!
First, NVIDIA has cooperated with The Portland Group to offer CUDA Fortran. CUDA Fortran includes a Fortran 2003 compiler and tool chain that allows developers to write GPU kernels in Fortran and to invoke them from Fortran applications, exactly the way their CUDA C brethren do with C applications.
Aside from CUDA C, OpenCL, DirectX Compute, and CUDA Fortran, there are still other ways to harness the power of the CUDA architecture. A quick Google search will turn up bindings to the CUDA API available for many development environments. Although none are officially supported by NVIDIA, some of the languages I have seen make use of the CUDA architecture include Python, Java, C#, Perl, Ruby, MATLAB, .NET, and Lua.
Dr. Dobb's: What's 'texture memory'?
Sanders: Texture memory was designed for computer graphics originally, but we've found it to be quite useful for some general-purpose computing applications. Essentially, a texture is a region of read-only memory that has a particular caching behavior that differs from non-texture memory. Specifically, we see friendly cache behavior when we perform several reads that are "near" each other spatially. Unlike traditional CPU caches that cache sequential addresses, the GPU texture memory is optimized for 2D spatial locality. In the texture chapter of CUDA by Example, we use texture memory to speed up a simple heat transfer simulation. Applications involving discrete grids (such as our 2D simulation application) are obvious candidates for texture memory.
There are some other interesting features of texture memory, too. First, packed data formats can be unpacked in a single instruction. Also, 8- and 16-bit integer types can be converted to normalized 32-bit floating-point values (in the range [0.0, 1.0] or [-1.0, 1.0]). And finally, linear interpolation of neighboring values can be performed directly in the GPU's texture hardware, saving precious instructions.
In fact, Dr. Dobb's published a great article on using CUDA's texture memory.
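The hardware interpolation described above can be sketched as follows. This is not the heat transfer example from the book: it assumes the texture object API (`cudaCreateTextureObject`, `tex2D<float>`), which arrived in CUDA releases later than the 3.1 discussed here, and the names `sample` and `sample_midpoint` are hypothetical. A 2x2 texture is sampled at the point where all four texels meet, so linear filtering should return roughly their average.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)

// Fetch one filtered value from the texture at unnormalized coordinates (x, y).
__global__ void sample(cudaTextureObject_t tex, float x, float y, float *out) {
    *out = tex2D<float>(tex, x, y);
}

// Build a 2x2 float texture and sample the point shared by all four texels.
float sample_midpoint() {
    const int W = 2, H = 2;
    float host[H * W] = {0.0f, 10.0f, 20.0f, 30.0f};

    // Textures with 2D locality are backed by a CUDA array, not linear memory.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    CHECK(cudaMallocArray(&arr, &desc, W, H));
    CHECK(cudaMemcpy2DToArray(arr, 0, 0, host, W * sizeof(float),
                              W * sizeof(float), H, cudaMemcpyHostToDevice));

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModeLinear;    // interpolate in texture hardware
    td.readMode = cudaReadModeElementType;   // return the stored floats as-is
    td.normalizedCoords = 0;                 // address texels as [0, W) x [0, H)

    cudaTextureObject_t tex = 0;
    CHECK(cudaCreateTextureObject(&tex, &res, &td, nullptr));

    float *d_out = nullptr;
    float result = 0.0f;
    CHECK(cudaMalloc(&d_out, sizeof(float)));

    // Texel centers sit at 0.5 and 1.5, so (1.0, 1.0) is midway between all four.
    sample<<<1, 1>>>(tex, 1.0f, 1.0f, d_out);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost));

    CHECK(cudaFree(d_out));
    CHECK(cudaDestroyTextureObject(tex));
    CHECK(cudaFreeArray(arr));
    return result;
}

int main() {
    // The average of 0, 10, 20, and 30 is 15; hardware filtering uses
    // limited-precision weights, so treat the result as approximate.
    printf("interpolated value: %f\n", sample_midpoint());
    return 0;
}
```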
Dr. Dobb's: What's your next book about and when will we see it?
Sanders: Would you believe that I don't currently have plans to write another book? If NVIDIA ever started producing fictional novels or screenplays, I might try to bully my way onto one of those projects. But until then, I think I'll probably spend some time writing code again.