A unified 64-bit address space
The Fermi architecture now supports a 64-bit address space. Providing 40 bits of physical addressing capability, Fermi combines local, shared, and global memory in the same address space, which means pointers work as expected! New GPUs such as the C2050 and C2070 currently support 3 GB and 6 GB of global memory, respectively. It is likely that future GPUs will contain even more onboard memory.
Fermi also significantly improves on previous GPGPU generations when working with 64-bit quantities, both in moving data to and from shared memory and in the speed and accuracy of 64-bit arithmetic.
32-bit integer ALU operations
To support calculations in the larger address space, the integer ALU now supports full 32-bit precision for all instructions, and has been optimized to efficiently support 64-bit and extended precision operations. (Various other instructions are also supported including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.)
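Several of these integer operations are exposed directly to CUDA programmers as device intrinsics. A minimal sketch (the kernel name and output layout are hypothetical, but the intrinsics themselves are part of the CUDA math API):

```cuda
// Sketch: a few 32-bit integer instructions exposed as CUDA intrinsics.
__global__ void bitOps(unsigned int x, int *out)
{
    out[0] = __popc(x);            // population count: number of set bits
    out[1] = (int)__brev(x);       // bit reverse
    out[2] = __clz((int)x);        // count leading zeros
}
```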
In contrast, the GT200 integer ALU was limited to 24 bits of precision. As a result, a 32-bit integer multiply required multiple instructions unless the programmer explicitly called __mul24() (or __umul24() for unsigned operands) to perform the multiplication at 24-bit precision. On Fermi, basic indexing operations such as the calculation of the thread ID (illustrated below) now happen faster.
int tid = blockIdx.x * blockDim.x + threadIdx.x;
Developers with legacy applications should note that the situation has now reversed: __mul24() and __umul24() require multiple instructions on Fermi GPUs! While the performance impact will likely be negligible, legacy codes should use something like the following preprocessor define to get rid of 24-bit multiplications.
#define __mul24(a,b) ((a)*(b))
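A named wrapper macro (MUL24 here is a hypothetical name, not a CUDA identifier) lets the same source pick the fast path on both old and new architectures by testing __CUDA_ARCH__ at compile time:

```cuda
// Hypothetical sketch: keep the 24-bit intrinsic on pre-Fermi devices and
// fall back to an ordinary single-instruction multiply on sm_20 and newer.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
  #define MUL24(a, b) ((a) * (b))        // Fermi: plain 32-bit multiply
#else
  #define MUL24(a, b) __mul24((a), (b))  // GT200 and earlier: 24-bit path
#endif

__global__ void scale(int *data, int k, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        data[tid] = MUL24(data[tid], k);
}
```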
Pointers work as expected
Aside from the larger memory capacity, the big news about the unified address space is that pointers now work the way one expects. Pointers can now be passed between threads, and they will correctly point to the same physical memory location regardless of the thread block that reads or writes the pointer. Earlier architectures utilized offsets relative to a particular memory space, which meant that a pointer was meaningful only in the memory space to which it belonged. Incorporating a unified address space into the Fermi architecture greatly enhances the ability to share objects, data, and calculation results between threads, both within a single application's threads and between application kernels.
Per the NVIDIA Programming Guide section 3.1.6, pointers are now 64-bit. Use -m64 with the 32-bit nvcc compiler to emulate 64-bit pointers and -m32 with the 64-bit nvcc compiler to generate 32-bit pointers.
Support for a per thread stack and recursion
Fermi now provides each thread with its own stack, a fundamental data structure in computer science that is now usable on the GPU. Each thread can push data onto its stack (up to 1 KB per thread, per the CUDA 3.1 release notes) and use pop operations to pull data off the stack in LIFO (Last-In First-Out) fashion.
Incorporating a stack per thread allows functions to recursively call themselves, among other useful capabilities. Recursion is a technique used to break a large problem into increasingly smaller pieces. At some point, the problem becomes trivially small and easily solved, after which the individual pieces are reassembled bit-by-bit to generate the solution to the overall, larger problem. Recursion is an important part of divide-and-conquer algorithms for sorting and other general computer science algorithms.
Certain rendering algorithms like ray tracing also utilize recursion. On previous architectures, developers had to implement a limited-depth state machine or in-line multiple nested function calls to emulate recursion in a limited sense. That is how ray tracing and other recursive algorithms were able to run on GT200 and earlier GPGPU architectures. Now that GPGPUs have a stack, recursion is easy to implement because functions can simply call themselves, which gives CUDA developers a powerful ability to apply recursive programming techniques on the GPU.
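As a concrete illustration, a device function can now call itself directly. The function and kernel names below are hypothetical; the key point is that self-recursion compiles for Fermi-class (sm_20 and newer) targets:

```cuda
// Hypothetical sketch of device-side recursion on Fermi.
// Compile for a Fermi target, e.g.: nvcc -arch=sm_20 recursion.cu

__device__ int fib(int n)            // a device function may now call itself
{
    if (n < 2)
        return n;                    // base case stops the recursion
    return fib(n - 1) + fib(n - 2);  // each call consumes per-thread stack
}

__global__ void fibKernel(int *out, int n)
{
    out[threadIdx.x] = fib(n);       // recursion depth is bounded by the
                                     // ~1 KB per-thread stack
}
```

Deep recursion will exhaust the per-thread stack, so depth still needs to be bounded in practice.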
Fermi's upgraded configurable L1 cache
One of the most anticipated and exciting additions in the Fermi architecture is the addition of a true L1/L2 cache hierarchy.
Each SM has an increased amount (64 KB) of local memory that can be partitioned to favor shared memory or dynamic read/write operations. Note that the L1 cache is designed for spatial and not temporal reuse; it is not an LRU cache like most CPU caches. In addition, the Fermi L1 cache can be deactivated (as discussed here in the forums) by passing the -Xptxas -dlcm=cg command-line argument to nvcc. Even when deactivated, both the stack and local memory still reside in the L1 cache memory.
The beauty in this configurability is that applications that reuse data or have misaligned, unpredictable or irregular memory access patterns can configure the L1 cache as a 48KB dynamic cache (leaving 16 KB for shared memory) while applications that need to share more data amongst threads inside a thread block can assign 48KB as shared memory (leaving 16KB for the cache). In this way, the NVIDIA designers empowered the application developer so they can configure the memory within the SM to achieve the best performance.
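The split can be selected per kernel through the CUDA runtime with cudaFuncSetCacheConfig. A minimal sketch (the kernel name is hypothetical):

```cuda
// Sketch: choosing the L1/shared-memory split per kernel on Fermi.
#include <cuda_runtime.h>

__global__ void myKernel(float *data)
{
    // ... kernel body ...
}

int main()
{
    // Prefer 48 KB shared / 16 KB L1 for kernels that share lots of data
    // among the threads of a block:
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    // Or prefer 48 KB L1 / 16 KB shared for irregular global-memory access:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    myKernel<<<1, 32>>>(0);
    return 0;
}
```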
The L1 cache holds per-thread local data such as register spills and stack accesses. It also caches global memory reads, which can provide a significant performance benefit if the compiler detects that all threads load the same value (see Constant Memory below). Stores to global memory bypass the L1 cache. The latency to shared memory and the L1 cache is 10 - 20 cycles.
Note that the L1 cache is not coherent; use the volatile keyword if threads in other blocks can modify the contents of a memory location. Otherwise, private data (registers, stack, etcetera) can be used without concern.
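A minimal sketch of the volatile pattern (the variable and kernel names are hypothetical):

```cuda
// Sketch: a flag visible across blocks must be declared volatile so every
// access bypasses the per-SM (incoherent) L1 cache.
__device__ volatile int done = 0;    // without volatile, a stale L1 copy
                                     // could be returned to other blocks

__global__ void worker(int *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        done = 1;                    // the write lands in the coherent L2

    // Reads of 'done' in any block go past L1 because it is volatile.
    out[blockIdx.x] = done;          // note: no ordering guarantee without
                                     // explicit synchronization
}
```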
The unified coherent L2 cache
Fermi GPUs also have a 768KB unified L2 cache that provides read/write cache memory that is guaranteed to present a coherent view to all SMs in the GPU. In other words, any thread can modify a value held in the L2 cache. At a later time, any other thread on the GPU can read that particular data address and receive the correct, updated value. Of course, atomic updates must be used to guarantee that the store completes before other threads are allowed read access. (The good news is that the L2 cache and Fermi architecture has increased the speed of atomic operations by roughly an order of magnitude.)
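The canonical example of such a read/update operation is a histogram built with atomicAdd, which Fermi services in the coherent L2 (the kernel name and bin layout here are illustrative):

```cuda
// Sketch: concurrent read-modify-write via an atomic, resolved in the
// coherent L2 cache so all SMs see the updated bin counts.
__global__ void histogram(const int *in, int *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1);   // safe even when many threads hit
                                      // the same bin simultaneously
}
```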
Previous GPGPU architectures had challenges with this very common read/update operation because two separate data paths were utilized — specifically the read-only texture load path and the write-only pixel data output path. To ensure data correctness, older GPU architectures required that all participating caches along the read-path be potentially invalidated and flushed after any thread modified the value of an in-cache memory location. The Fermi architecture eliminated this bottleneck with the L2 cache along with the need for the texture and Raster Output (ROP) caches in earlier generation GPUs.
All data loads and stores go through the L2 cache. It is important to note that all global memory transfers always go through the L2 cache, including CPU/GPU memory copies. This point bears emphasis: host data transfers might unexpectedly affect cache hits and thus application performance.
Wikipedia also notes that, "The quantity of on-board SRAM per ALU actually decreased proportionally compared to the previous G200 generation, despite the increase of the L2 cache from 256kB per 240 ALUs to 768kB per 512 ALUs."
The Impact of Fermi on CUDA memory spaces
The Fermi unified L2 cache and Fermi architectural changes impact CUDA memory spaces in several different ways as is discussed below. Most applications will benefit from these architectural features. The NVIDIA designers recognized that some of these features will decrease the performance of some applications so they gave the programmer the option to turn some features off. In particular, ECC memory and the L1 cache can be disabled as they can adversely impact the performance of some global memory bandwidth limited applications.
David Kirk presented a nice summary of the performance capabilities of a C2050 GPU, which is a GF100 based product:
- 14 multiprocessors.
- Core clock: 1150 MHz.
- Memory clock: 1500 MHz.
- Instruction issue: 515 Ginstr/s.
- FP32: 1030 Gflop/s peak.
- FP64: 515 Gflop/s peak.
- Shared memory: 1030 GB/s aggregate.
- L1: 1030 GB/s aggregate.
- L2: 230 GB/s (not affected by ECC).
- DRAM: 144 GB/s (115 GB/s with ECC).