In CUDA, Supercomputing for the Masses: Part 10 of this article series on CUDA (short for "Compute Unified Device Architecture"), I examined CUDPP, the "CUDA Data Parallel Primitives Library." In this installment, I revisit local and constant memory and introduce the concept of "texture memory."
Optimizing the performance of CUDA applications most often involves optimizing data accesses, which includes the appropriate use of the various CUDA memory spaces. Texture memory provides a surprising aggregation of capabilities, including the ability to cache global memory (separate from register, global, and shared memory) and dedicated interpolation hardware separate from the thread processors. Texture memory also provides a way to interact with the display capabilities of the GPU. Texture memory is an extensive and evolving topic that will be introduced here and discussed more fully in a dedicated follow-on article.

Part 4 of this series introduced the CUDA memory model and illustrated the various CUDA memory spaces with the schematic in Figure 1. Each of these memory spaces has certain performance characteristics and restrictions.
Appropriate use of these memory spaces can have significant performance implications for CUDA applications. Table 1 summarizes characteristics of the various CUDA memory spaces. Part 5 of this series discusses CUDA memory spaces (with the exception of texture memory) in greater detail and includes performance cautions.
Local memory is a memory abstraction that implies "local in the scope of each thread". It is not an actual hardware component of the multi-processor. In actuality, local memory resides in global memory allocated by the compiler and delivers the same performance as any other global memory region. Normally, automatic variables declared in a kernel reside in registers, which provide very fast access. Unfortunately, the relationship between automatic variables and local memory continues to be a source of confusion for CUDA programmers. The compiler might choose to place automatic variables in local memory when:
- There are too many register variables.
- A structure would consume too much register space.
- The compiler cannot determine whether an array is indexed with constant quantities. Please note that registers are not addressable, so an array must be placed in local memory -- even if it is a two-element array -- when the indexing of the array is not known at compile time.
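The last case is easy to demonstrate. In the following sketch (the kernel and variable names are illustrative, not from any particular application), one automatic array is indexed only with compile-time constants and can live in registers, while a second array is indexed with a runtime value and must be spilled to local memory:

```cuda
// Illustrative kernel: regArray can stay in registers, but localArray
// is indexed with a value unknown at compile time, so the compiler
// must place it in local memory (which resides in global memory).
__global__ void indexExample(int *out, int n)
{
    int regArray[2];                  // constant indices: registers suffice
    regArray[0] = threadIdx.x;
    regArray[1] = blockIdx.x;

    int localArray[2];                // runtime index forces local memory
    int i = threadIdx.x & 1;          // value not known at compile time
    localArray[i] = n;
    localArray[i ^ 1] = 0;

    out[threadIdx.x] = regArray[0] + regArray[1]
                     + localArray[0] + localArray[1];
}
```

Compiling with `nvcc --ptxas-options=-v` reports the local memory ("lmem") consumed by each kernel, which is a convenient way to detect unintended spills.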
For additional discussion, please see "Register/Local Memory Cautions" in Part 5 of this series, "The CUDA Memory Model" in Part 4, and the CUDA programming guide on the NVIDIA CUDA Zone Documentation site.
Constant memory is read-only from kernels and is hardware optimized for the case when all threads read the same location. Amazingly, constant memory provides one cycle of latency when there is a cache hit even though constant memory resides in device memory (DRAM). If threads read from multiple locations, the accesses are serialized. Constant memory is written only by the host (not the device -- it is constant, after all) with
cudaMemcpyToSymbol, and its contents persist across kernel calls within the same application. Up to 64 KB of data can be placed in constant memory, and each multiprocessor has 8 KB of constant cache. Access to data in constant memory can range from one cycle for in-cache data to hundreds of cycles, depending on cache locality. Note that the first access to a location in constant memory will generate a cache miss; the single-cycle latency applies only once the data is resident in the cache.
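A minimal example of the host-write/device-read pattern follows. The `__constant__` array, kernel, and sizes here are illustrative; the point is that the host populates constant memory with cudaMemcpyToSymbol, after which every thread reads the same location and benefits from the broadcast-optimized constant cache:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Declared in constant memory; total __constant__ data is limited to 64 KB.
__constant__ float dCoeff[16];

__global__ void scale(float *data)
{
    // All threads read the same constant location -- the access pattern
    // the constant cache hardware is optimized for.
    data[threadIdx.x] *= dCoeff[0];
}

int main()
{
    float hCoeff[16] = { 2.0f };      // remaining entries default to 0
    // Only the host may write constant memory; kernels can only read it.
    cudaMemcpyToSymbol(dCoeff, hCoeff, sizeof(hCoeff));

    float hData[16] = { 1.0f, 2.0f, 3.0f };
    float *dData;
    cudaMalloc((void**)&dData, sizeof(hData));
    cudaMemcpy(dData, hData, sizeof(hData), cudaMemcpyHostToDevice);

    scale<<<1, 16>>>(dData);          // one block of 16 threads

    cudaMemcpy(hData, dData, sizeof(hData), cudaMemcpyDeviceToHost);
    printf("%g %g %g\n", hData[0], hData[1], hData[2]);
    cudaFree(dData);
    return 0;
}
```

Because the contents of constant memory persist across kernel launches, the cudaMemcpyToSymbol call need only be made once, even if the kernel is invoked many times afterward.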