In CUDA, Supercomputing for the Masses Part 21, I focused on Fermi and the architectural changes that significantly broadened the types of applications that map well to GPGPU computing yet preserve the application performance of software written for previous generations of CUDA-enabled GPUs. This article addresses the mindset that CUDA is a language for only GPU-based applications.
Recent developments allow CUDA programs to transparently compile and run at full speed on x86 architectures. This advance makes CUDA a viable programming model for all application development, just like OpenMP. The PGI CUDA C/C++ compiler for x86 (from the Portland Group Inc.) is the reason for this recent change in mindset. It is the first native CUDA compiler that can transparently create a binary that will run on an x86 processor. No GPU is required. As a result, programmers now have the ability to use a single source tree of CUDA code to reach those customers who own CUDA-enabled GPUs as or who use x86-based systems.
Figure 1 illustrates the options and target platforms that are currently available to build and run CUDA applications. The various products are discussed next.
Aside from the new CUDA-x86 compiler, the other products require developer or customer intervention to run CUDA on multiple backends. For example:
- nvcc: The freely downloadable nvcc compiler from NVIDIA creates both host and device code. With the use of the
__host__specifiers, a developer can use C++ Thrust functions to run on both host and CUDA-enabled devices. This x86 pathway is represented by the dotted line in Figure 1, as the programmer must explicitly specify use of the host processor. In addition, developers must explicitly check whether a GPU is present and use this information to select the memory space in which the data will reside (that is, GPU or host). The Thrust API also allows CUDA codes to be transparently compiled to run on different backends. The Thrust documentation shows how to use OpenMP to run a Monte Carlo simulation on x86. Note that Thrust is not optimized to create efficient OpenMP code.
- gpuocelot provides a dynamic compilation framework to run CUDA binaries on various backends such as x86, AMD GPUs, and an x86-based PTX emulator. The emulator alone is a valuable tool for finding hot spots and bottlenecks in CUDA codes. The gpuocelot website claims that it "allows CUDA programs to be executed on NVIDIA GPUs, AMD GPUs, and x86-CPUs at full speed without recompilation." I recommend this project even though it is challenging to use. As it matures, Ocelot will provide a pathway for customers to run CUDA binaries on various backends.
- MCUDA is an academic project that translates CUDA to C. It is not currently maintained, but the papers are interesting reading. A follow-up project (FCUDA) provides a CUDA to FPGA translation capability.
- SWAN provides a CUDA-to-OpenCL translation capability. The authors note that Swan is "not a drop in replacement for nvcc. Host code needs to have all kernel invocations and CUDA API calls rewritten." Still, it is an interesting project to bridge the gap between CUDA and OpenCL.
The CUDA-x86 compiler is the first to provide a seamless pathway to create a multi-platform application.
Why It Matters
Using CUDA for all application development may seem like a radical concept to many readers, but in fact, it is the natural extension of the emerging CPU/GPU paradigm of high-speed computing. One of the key benefits of CUDA is that it uses C/C++ and can be adopted easily and it runs on 300+ million GPUs and now all x86 chips. If this still feels like an edgy practice, this video presentation might be helpful.
CUDA works well now at its principal task massively parallel computation as demonstrated by the variety and number of projects that achieve 100x or greater performance in the NVIDIA showcase. See Figure 2.