In CUDA, Supercomputing for the Masses Part 20, I focused on the analysis capability of Parallel Nsight v1.0 coupled with the NVIDIA Tools extension (NVTX) library to illustrate asynchronous I/O, hybrid CPU/GPU computing, and the performance of primitive restart to dramatically accelerate OpenGL rendering in CUDA applications. (Note that Parallel Nsight 1.5 has been released, which is now compatible with Visual Studio 2010 and further refines the Parallel Nsight experience.)
This article will focus on Fermi and the architectural changes that significantly broaden the types of applications that map well to GPGPU computing while maintaining the performance benefits provided by previous generations of CUDA-enabled GPUs. Particular attention will be paid to how the Fermi architecture affects CUDA memory spaces. Also discussed will be how the Fermi architecture moves GPU computing into mainstream 24/7 production computing with error correction and other robustness features.
Fermi is NVIDIA's internal name for the GF100 architecture, which greatly expands on the capabilities of the previous G80 and follow-on GT200 architectures to overcome their computational limitations. Variants of the Fermi architecture are used in the GeForce 400 and Tesla 20-series of products.
GPGPU computing has now permeated all aspects of global computing technology. From ultra-low-power CUDA-enabled GPUs to the largest supercomputers in the world such as China's Tianhe-1A (meaning Milky Way), which can sustain more than a quadrillion floating-point operations per second, GPGPU computing is redefining what is possible on a computer. Review my article for the GPU Source issue of Scientific Computing, "Redefining what is possible," and my GTC presentation "Supercomputing for the Masses: Killer-Apps, Parallel Mappings, Scalability and Application Lifespan" for a more in-depth analysis. Tianhe-1A was recently named the world's fastest supercomputer on the Top500 list.
Aside from creating opportunities throughout science and industry for the developers of GPU software, this ubiquity (as illustrated by NVIDIA's claim of 250+ million CUDA-enabled GPUs sold to date) has driven the evolution of CUDA-enabled GPU architectures so they can efficiently run applications in more problem domains. Further, it has forced GPU hardware designers to harden GPGPU technology against common errors so that many GPGPUs can simultaneously be used in 24/7 production environments to reliably run applications for extended periods measured in days, weeks and months. Examples include rendering farms that create animated movies and supercomputers that run some of the largest physics simulations in the world.
Both as a result of Fermi and the maturation of CUDA and GPU programming in general, the thinking behind how to program GPU technology is changing. Just as the Bebop (Berkeley Benchmarking and OPtimization) group led the way with publications like the Volkov and Demmel paper "Benchmarking GPUs to Tune Dense Linear Algebra" for high performance on earlier GPU architectures, so too are they changing the thinking about occupancy, as discussed in the hyperlink and later in this article.
An overview of Fermi changes
A brief overview of changes made in the GF100 architecture includes:
- A unified 64-bit memory space in which:
- Pointers can refer to local, shared, and global memory locations, and are portable among threads.
- Support for a per-thread stack and recursion.
- Full 32-bit ALU (Arithmetic Logic Unit) integer operations.
- Improved 64-bit data paths in shared memory.
- ECC capability on all global and internal memory and other robustness improvements.
- Fermi's upgraded configurable L1 cache and unified coherent L2 cache across the GPU provide:
- The ability to broadcast read-only cached data from global memory just like constant memory.
- Registers now spill to fast cache rather than global memory, which might speed application performance.
- Accelerated irregular memory accesses within the cache.
- An order of magnitude (10x) faster atomic operations.
- An improved GigaThread engine that:
- Supports concurrent kernel execution (and increased efficiency for unbalanced loads).
- Provides 10x faster context switching (that demonstrates a nearly 5x faster frames per second, FPS, rate on the virtual terrain demo from Part 18).
- Delivers concurrent bi-directional data transfers to/from the GPU across the PCIe bus.
- Numerous streaming-multiprocessor improvements including:
- Dual-dispatch scheduling that allows better utilization of the SFU (Special Function Units), integer and other pipelines.
- Hardware that accelerates small conditional branching and predication.
- Improved speed and accuracy of various math operations.
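The configurable L1 noted above can be tuned per kernel from the host. The following is a minimal sketch, assuming a CUDA 3.2-era runtime; the kernel and function names are illustrative, not from the original article:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: little shared-memory use, neighboring reads
// that benefit from on-chip caching.
__global__ void stencil(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * in[i-1] + 0.5f * in[i] + 0.25f * in[i+1];
}

void launchStencil(float *d_out, const float *d_in, int n)
{
    // Fermi splits each SM's 64 KB of on-chip memory between shared
    // memory and L1 cache (48/16 KB by default, or 16/48 KB). Kernels
    // with irregular or spilled accesses may prefer the larger L1.
    cudaFuncSetCacheConfig(stencil, cudaFuncCachePreferL1);
    stencil<<<(n + 255) / 256, 256>>>(d_out, d_in, n);
}
```

The cache preference is only a hint; the driver may override it if a kernel's shared-memory footprint requires the larger shared-memory configuration.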
All these hardware capabilities translate to a higher-performance, more generalized, CPU-like GPGPU programming experience that can efficiently support a broader range of applications. Kudos to the CUDA software development teams that have leveraged these capabilities to further increase performance and support Fermi-architecture GPGPUs, including:
- Recursive functions.
- Function pointers. (Note: in CUDA 3.2, use __forceinline__ on functions to force inlining again.)
- C++ features such as:
- Virtual functions.
- On-GPU "new" and "delete" operators for dynamically allocated objects on the GPU.
- Try/catch/throw exception handling.
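As a sketch of the first two features, the fragment below shows a recursive device function and a device function pointer; it assumes compute capability 2.0 and a build with -arch=sm_20, and the names (fib, square, apply) are hypothetical:

```cuda
// Fermi's per-thread stack makes device-side recursion legal, and
// function pointers can now be taken and called on the device.
__device__ int fib(int n)                 // recursive device function
{
    return (n < 2) ? n : fib(n - 1) + fib(n - 2);
}

typedef int (*unary_op)(int);             // device function-pointer type
__device__ int square(int x) { return x * x; }

__global__ void apply(int *out, const int *in, int n, unary_op op)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = op(in[i]) + fib(10);
}
```

Note that a device function's address must be taken in device code (for example, stored in a __device__ variable and copied to the host with cudaMemcpyFromSymbol) before it can be passed as the op argument.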
Please see the Fermi Compatibility Guide to understand how Fermi-related changes affect the nvcc compiler command-line arguments for building Fermi CUDA applications.
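As a hedged illustration (the Fermi Compatibility Guide is the authoritative source), a Fermi build typically targets sm_20 directly, or embeds code for both pre-Fermi and Fermi devices in one fat binary:

```
# Build natively for Fermi (compute capability 2.0):
nvcc -arch=sm_20 myapp.cu -o myapp

# Or embed code for both architecture generations in one binary:
nvcc -gencode arch=compute_10,code=sm_10 \
     -gencode arch=compute_20,code=sm_20 myapp.cu -o myapp
```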
Fermi architecture products include the GF100 and variants classified as the GF104/106/108 and the just-released GF110 series. Differences include:
GF100 and GF110:
- Designated as compute capability 2.0 devices.
- Each Streaming Multiprocessor (SM) contains:
- 32 Shader Processors (SP).
- 4 SFU (Special Function Units).
- 4 texture filtering units for every texture address unit or Render Output Unit (ROP).
GF104, GF106, and GF108:
- Designated as compute capability 2.1 devices.
- Each Streaming Multiprocessor (SM) contains:
- 48 Shader Processors (SP).
- 8 SFU (Special Function Units).
- 8 texture filtering units for every texture address unit or Render Output Unit (ROP).
Each complete die contains varying amounts of texture capability, as shown in Table 1:
Various features of the GF100 architecture are available only on the more expensive Tesla series of cards. For consumer products, double precision performance has been limited to a quarter of that of the "full" Fermi architecture. Error checking and correcting memory (ECC) is also disabled on consumer cards.
Vasily Volkov has an excellent set of slides discussing how Fermi follows the trend towards an inverse memory hierarchy, "Programming inverse memory hierarchy: case of stencils on GPUs." The basic idea is that registers and on-chip local memory are fast and can scale with the number of local processors and massive numbers of threads. This suggests that data for tiled/stencil algorithms on such systems should be stored in the upper, now-larger levels of the memory hierarchy, such as registers, instead of the traditionally used lower, smaller levels such as caches or local stores. (His slides "Better Performance at Lower Occupancy" from the 2010 GTC conference are also an excellent source of information.)
Fermi, along with other processors, seems to be following this trend towards an inverse memory hierarchy, as shown in the table created with data from "Programming inverse memory hierarchy: case of stencils on GPUs." Note how the gap in memory hierarchy ratios is decreasing as parallelism increases. Succinctly, a single thread won't see the inverse hierarchy: the inversion is caused by massive parallelism, which motivates the use of large amounts of thread-local data that can scale with the number of processing elements. This also ties into the ILP (Instruction Level Parallelism) discussion in the "Registers and warp scheduling" section of this article; see Table 2.
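The register-blocking idea behind this inversion can be sketched as follows. This is not Volkov's code, just a minimal illustration under the assumption of a simple 1-D three-point stencil: each thread holds several input points in registers and produces several outputs, increasing both instruction-level parallelism and the amount of data kept at the fast top of the hierarchy.

```cuda
// Register blocking: four outputs per thread, all operands held in
// registers (the "large, fast" top of the inverse memory hierarchy).
__global__ void stencil4(float *out, const float *in, int n)
{
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 5 >= n) return;
    // Six loads feed four independent outputs, exposing ILP that the
    // dual-dispatch scheduler can exploit even at low occupancy.
    float a = in[i],     b = in[i + 1], c = in[i + 2],
          d = in[i + 3], e = in[i + 4], f = in[i + 5];
    out[i + 1] = 0.25f * a + 0.5f * b + 0.25f * c;
    out[i + 2] = 0.25f * b + 0.5f * c + 0.25f * d;
    out[i + 3] = 0.25f * c + 0.5f * d + 0.25f * e;
    out[i + 4] = 0.25f * d + 0.5f * e + 0.25f * f;
}
```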