Channels ▼
RSS

Design

CUDA, Supercomputing for the Masses: Part 21


Global Memory

The Fermi architecture makes some important changes in how CUDA programmers need to think about and use global memory.

From a hardware perspective memory requests are issued in groups of 32 threads (as opposed to 16 in previous architectures), which matches the instruction issue width. Thus, the 32 addresses of a warp should ideally address a contiguous, aligned region to fully utilize the cache line(s).

What this means for CUDA developers:

  • 2D/3D thread blocks should be a multiple of 32 "wide".
  • Data should be a multiple of 32 in the fastest-varying dimension.

Most applications will benefit from the L1 cache because it performs coalesced global memory loads and stores in terms of a 128 byte cache line size. Once data is inside the cache, applications can reuse data, perform irregular memory accesses, and spill registers without incurring the dramatic slowdown caused by having to rely on roundtrips to the much slower global memory. For this reason, the L1 cache is a very good thing. Note that L1 cache is limited to 16KB/48KB per SM, which is much smaller then the 128 KB register file. This limits the amount of data that can be spilled plus spilled data will evict. For these reasons, spilling data to L1 may help or hinder application performance.

The advantage to disabling the L1 cache is that it avoids use of the 128 byte cache line, permitting 32 byte data accesses. In some situations, an application may stride through, or access memory in such a way that the 128 byte cache line load will waste most of the global memory bandwidth. For these particular applications, disabling the L1 cache will provide a performance benefit.

[Click image to view at full size]
Figure 1: source NVIDIA slide 41 from http://people.maths.ox.ac.uk/gilesm/cuda/cudaconf_oxford.pdf

In addition, the caching of local memory in the L1 cannot be disabled (Section 5.3.2.2), but programmers can control local memory usage by limiting the amount of variables that the compiler is likely to place in local memory and by controlling register spilling via the __launch_bounds()__ attribute (Section B.17) or the -maxrregcount compiler option. In particular, it is important to note that __launch_bounds__ provides the compiler additional information to reduce register usage. Even when the L1 cache is disabled, the stack is always kept in L1 memory.

Since Fermi provides a unified caches, which eliminated the need for separate texture caches, the compute capability 1.X trick of using texture memory to accelerate irregular memory accesses discussed in Part 13 is now discouraged. However, texture units can provide additional computational capability (albeit with 9-bits of accuracy) that can accelerate some applications. One example is Chapter 7, "Leveraging the Untapped Computation Power of GPUs: Fast Spectral Synthesis Using Texture Interpolation" by Richard Townsend, Karthikeyan Sankaralingam, and Matthew D. Sinclair in the recent GPU CUDA Gems book edited by Wen-mei Hwu.

The Fermi memory subsystem provides for 6 partitions of GDDR5 memory with ECC capability on GF100 hardware. There is no longer a linear mapping between addresses and partitions, so typical access patterns are unlikely to all fall into the same partition. This avoids avoid partition camping (bottlenecking on a subset or even a single controller) as discussed in this thread. In addition, ECC can be turned off at the driver level to gain an additional 20% in memory bandwidth and added memory capacity, which can benefit global memory bandwidth limited applications. The Linux nvidia-smi command added a -e option for controlling ECC. There is a control panel option for Windows.

As discussed next, the Fermi cache hierarchy provides the ability to act like constant memory with the ability to broadcast read-only data.

Constant memory

Constant memory is still an excellent way to store and broadcast read-only data to all the threads on the GPU. Compute capability 2.0 and higher devices allow developers to access global memory with the efficiency of constant memory (e.g., not serializing on the load) under certain conditions where the compiler can recognize and use the LDU (LoaD Uniform) instruction. Specifically, the data must:

  • Reside in global memory.
  • Be read-only in the kernel (programmer can enforce this using the const keyword).
  • Must not depend on the thread ID.


__global__ void kernel( float *g_dst, const float *g_src )
{
g_dst = g_src[0] + g_src[blockIdx.x];
}

Essentially a uniform memory access is the same across a thread block. In this case there is no need for __constant__ declaration and there is no fixed limit to the amount of data as is the case with constant memory. As noted in this thread, constant offset reads from constant memory might still be somewhat faster.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video