

CUDA, Supercomputing for the Masses: Part 21

A unified 64-bit address space

The Fermi architecture now supports a 64-bit address space. Providing 40 bits of physical addressing capability, Fermi combines local, shared, and global memory in the same address space, which means pointers work as expected! New GPUs such as the C2050 and C2070 currently support 3 GB and 6 GB of global memory, respectively. It is likely that future GPUs will contain even more onboard memory.

Fermi also significantly improves on previous generations of GPGPUs when working with 64-bit quantities, both in the speed of moving data to and from shared memory and in the speed and accuracy of operations on 64-bit values.

32-bit integer ALU operations

To support calculations in the larger address space, the integer ALU now supports full 32-bit precision for all instructions, and has been optimized to efficiently support 64-bit and extended precision operations. (Various other instructions are also supported including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.)

In contrast, the GT200 integer ALU was limited to 24 bits of precision. This caused multiple instructions to be utilized to perform a 32-bit integer multiply, unless the programmer specified either __mul24() or __umul24() to perform the multiplication using 24 bits. On Fermi, basic indexing operations such as the calculation of the thread ID (illustrated below) now happen faster.

  int tid = blockIdx.x * blockDim.x + threadIdx.x;

Developers with legacy applications should note that the situation has now reversed: __mul24() and __umul24() now require multiple instructions on Fermi GPUs! While the performance impact will likely be negligible, legacy codes can use something like the following preprocessor define to get rid of 24-bit multiplications.

#define __mul24(a,b) ((a)*(b))
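For code that must still compile efficiently for pre-Fermi targets, one hedged variant (a sketch, not an NVIDIA recommendation) guards the redefinition on nvcc's architecture macro, so older GPUs keep the fast 24-bit path:

```cuda
// Redefine the 24-bit multiply intrinsics as plain 32-bit multiplies only
// when compiling device code for Fermi (compute capability 2.0 and up),
// where the intrinsics would otherwise expand to multiple instructions.
// __CUDA_ARCH__ is defined by nvcc only during device-code compilation.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
#define __mul24(a, b)  ((a) * (b))
#define __umul24(a, b) ((a) * (b))
#endif
```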

Pointers work as expected

Aside from the larger memory capacity, the big news about the unified address space is that pointers now work the way one expects. Pointers can now be passed around between threads, and they will correctly point to the same physical memory location regardless of the thread block that reads or writes through them. Earlier architectures utilized offsets relative to a particular memory space, which meant that a pointer was meaningful only in the memory space to which it belonged. Incorporating a unified address space into the Fermi architecture greatly enhances the ability to share objects, data, and calculation results between threads, both within a single kernel and between application kernels.

Per section 3.1.6 of the NVIDIA CUDA Programming Guide, pointers are now 64-bit. Use -m64 with the 32-bit nvcc compiler to generate 64-bit pointers and -m32 with the 64-bit nvcc compiler to generate 32-bit pointers.
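As a sketch of what the unified address space buys (hypothetical kernel and helper names; assumes a Fermi-class GPU, sm_20 or newer), the same __device__ helper can accept a pointer into either shared or global memory:

```cuda
// A generic device helper: on Fermi the compiler resolves whether p points
// into shared or global memory at run time, so one function handles both.
__device__ void scale(float *p, float s) { *p *= s; }

__global__ void scale_kernel(float *gdata)
{
    __shared__ float sdata[256];
    int tid = threadIdx.x;

    sdata[tid] = gdata[tid];
    scale(&sdata[tid], 2.0f);  // pointer into shared memory
    scale(&gdata[tid], 2.0f);  // pointer into global memory
    __syncthreads();
}
```

On pre-Fermi architectures, a single helper like this would have required separate compiled paths per memory space.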

Support for a per-thread stack and recursion

Fermi now provides each thread with a stack, a fundamental data structure in computer science that was previously unavailable on the GPU. Each thread can push data onto its stack (up to 1KB per thread, per the CUDA 3.1 release notes) and use pop operations to pull data off the stack in a LIFO (Last-In First-Out) fashion.

Incorporating a stack per thread allows functions to recursively call themselves, among other useful capabilities. Recursion is a technique used to break a large problem into increasingly smaller pieces. At some point, the problem becomes trivially small and easily solved, after which the individual pieces are reassembled bit-by-bit to generate the solution to the overall, larger problem. Recursion is an important part of divide-and-conquer algorithms for sorting and other general computer science algorithms.

Certain rendering algorithms like ray tracing also utilize recursion. On previous architectures, developers had to emulate recursion in a limited sense by implementing a limited-depth state machine or by in-lining multiple nested function calls, which is how ray tracing and other recursive algorithms were made to run on GT200 and earlier GPGPU architectures. Now that GPGPUs have a stack, recursion is easy to implement, as functions can simply call themselves. This gives CUDA developers a powerful ability to utilize recursive programming techniques to solve problems.
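A minimal sketch of device-side recursion (assumes a Fermi-class GPU compiled with, e.g., nvcc -arch=sm_20; the kernel and function names are illustrative):

```cuda
// A device function may now call itself directly; each call pushes a
// frame onto the calling thread's private stack.
__device__ int factorial(int n)
{
    if (n < 2) return 1;          // base case ends the recursion
    return n * factorial(n - 1);  // recursive call uses the per-thread stack
}

__global__ void fact_kernel(int *out, int n)
{
    out[threadIdx.x] = factorial(n);
}
```

Recursion depth is bounded by the per-thread stack size, so deeply recursive algorithms still need care.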

Fermi's upgraded configurable L1 cache

One of the most anticipated and exciting additions in the Fermi architecture is the addition of a true L1/L2 cache hierarchy.

Each SM has an increased amount (64KB) of on-chip memory that can be partitioned to favor shared memory or L1 cache. Note that the L1 cache is designed for spatial rather than temporal reuse; it is not an LRU cache like most CPU caches. In addition, L1 caching of global loads can be deactivated (as discussed in the NVIDIA forums) with the -Xptxas -dlcm=cg command-line argument to nvcc. Even when deactivated, both the stack and local memory still reside in the L1 cache memory.

The beauty of this configurability is that applications that reuse data or have misaligned, unpredictable, or irregular memory access patterns can configure 48KB as L1 cache (leaving 16KB for shared memory), while applications that need to share more data among the threads of a thread block can assign 48KB as shared memory (leaving 16KB for the cache). In this way, the NVIDIA designers empowered application developers to configure the memory within the SM to achieve the best performance.
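The split can be selected per kernel from the host with the CUDA runtime call cudaFuncSetCacheConfig(). A brief sketch (my_kernel is a hypothetical kernel name):

```cuda
__global__ void my_kernel(float *data) { /* ... */ }

void launch(float *d_data)
{
    // Favor a 48KB L1 cache (16KB shared memory) for irregular,
    // hard-to-coalesce access patterns...
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);

    // ...or favor 48KB of shared memory (16KB L1) when threads in a
    // block share a lot of data:
    // cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);

    my_kernel<<<128, 256>>>(d_data);
}
```

The setting is a hint tied to the named kernel, so different kernels in the same application can use different partitions.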

The L1 cache holds per-thread local data such as register spills and stack accesses. It also caches global memory reads, which can provide a significant performance benefit if the compiler detects that all threads load the same value (see Constant Memory below). Stores to global memory bypass the L1 cache. The latency to shared memory and the L1 cache is 10-20 cycles.

Note that the L1 cache is not coherent; use the volatile keyword if threads in other blocks can modify the contents of a memory location. Otherwise, private data (registers, stack, and so on) can be used without concern.
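A short sketch of the volatile pattern (illustrative names; assumes the producer kernel completes or runs concurrently on another SM):

```cuda
// Because L1 is not coherent, a flag written by one block must be read
// through a volatile-qualified lvalue to be observed by other blocks.
__device__ volatile int flag = 0;

__global__ void producer()
{
    if (threadIdx.x == 0)
        flag = 1;  // store becomes visible via the coherent L2
}

__global__ void consumer(int *out)
{
    // volatile forces each read to go to L2/global memory rather than
    // reusing a possibly stale L1 line.
    while (flag == 0)
        ;
    *out = 1;
}
```

Without the volatile qualifier, the compiler and L1 could legally keep returning the stale value 0.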

The unified coherent L2 cache

Fermi GPUs also have a 768KB unified L2 cache that provides read/write cache memory guaranteed to present a coherent view to all SMs in the GPU. In other words, any thread can modify a value held in the L2 cache, and at a later time any other thread on the GPU can read that particular data address and receive the correct, updated value. Of course, atomic updates must be used to guarantee that the store completes before other threads are allowed read access. (The good news is that the L2 cache and other Fermi architecture changes have increased the speed of atomic operations by roughly an order of magnitude.)
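A sketch of the read/update pattern the coherent L2 accelerates (hypothetical kernel name; any number of blocks may safely update the same counter):

```cuda
// Many blocks atomically accumulate into one global counter; the atomic
// guarantees each update completes before another thread reads the value.
__global__ void count_matches(const int *values, int n, int *count, int key)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] == key)
        atomicAdd(count, 1);  // serviced through the coherent L2 on Fermi
}
```

On GT200-class hardware the same atomicAdd() went all the way to DRAM, which is why Fermi's in-cache atomics are roughly an order of magnitude faster.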

Previous GPGPU architectures had challenges with this very common read/update operation because two separate data paths were utilized: the read-only texture load path and the write-only pixel data output path. To ensure data correctness, older GPU architectures required that all participating caches along the read path be potentially invalidated and flushed after any thread modified the value of an in-cache memory location. The Fermi architecture eliminates this bottleneck with the L2 cache, which also eliminates the need for the separate texture and Raster Output (ROP) caches of earlier-generation GPUs.

All data loads and stores go through the L2 cache. It is important to note that all global memory transfers always go through the L2 cache, including CPU/GPU memory copies. The word always deserves emphasis: host data transfers might unexpectedly affect cache hits and thus application performance.

Wikipedia also notes that, "The quantity of on-board SRAM per ALU actually decreased proportionally compared to the previous G200 generation, despite the increase of the L2 cache from 256kB per 240 ALUs to 768kB per 512 ALUs."

The Impact of Fermi on CUDA memory spaces

The Fermi unified L2 cache and other Fermi architectural changes impact the CUDA memory spaces in several different ways, as discussed below. Most applications will benefit from these architectural features. The NVIDIA designers recognized that a few of them can decrease the performance of certain applications, so they gave the programmer the option to turn those features off. In particular, ECC memory and the L1 cache can be disabled, as they can adversely impact the performance of some global-memory-bandwidth-limited applications.

David Kirk presented a nice summary of the performance capabilities of the C2050 GPU, a GF100-based product:

  • 14 multiprocessors.
  • Clocks:
    • Core: 1150 MHz.
    • Memory: 1500 MHz.
  • Throughputs:
    • Instruction: 515 Ginstr/s:
      • FP32: 1030 Gflops/s peak.
      • FP64: 515 Gflops/s peak.
    • Memory:
      • Shared memory: 1030 GB/s aggregate.
      • L1: 1030 GB/s aggregate.
      • L2: 230 GB/s (not affected by ECC).
      • DRAM: 144 GB/s (115 GB/s with ECC).
