In Part 12 of this article series ("CUDA, Supercomputing for the Masses"), I took a quick detour to discuss some of the paradigm-changing features of the latest CUDA Toolkit 2.2 release. This article resumes the discussion of texture memory that I began in Part 11 of this series. In addition, this installment covers the new CUDA Toolkit 2.2 texture capability that lets some programs eliminate extra copies by writing to global memory on the GPU that has a 2D texture bound to it.
From a C-programmer's perspective, texture memory provides an unusual combination of cache memory (separate from register, global, and shared memory), local processing capability (separate from the scalar processors), and a way to interact with the display capabilities of the GPU. This article focuses on the cache and local processor capabilities of texture memory, while the next column will discuss how to perform viewable graphics operations with the GPU.
Don't be put off from using texture memory because it is different and has many options. The use of texture memory can improve performance for both bandwidth- and latency-limited programs. For example, some programs can exceed the maximum theoretical memory bandwidth of the underlying global memory through judicious use of the texture memory cache. While the latency of a texture cache reference is generally the same as that of DRAM, there are some special cases that can deliver data with slightly less than 100 cycles of latency. As usual in CUDA, the use of many threads can hide memory access latency regardless of whether texture cache or global memory is being accessed.
For CUDA programmers, the most salient points about using texture memory as a cache are: it is optimized for 2D spatial locality, it is very small (effectively about 8KB per multiprocessor), and it can provide a performance benefit when all the threads in a warp access nearby locations in the texture (as demonstrated in Cache-Efficient Numerical Algorithms using Graphics Hardware). Another tip from the forums is to pack data when you can, because a single float4 texture read is faster than four separate float texture reads.
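To make the packing tip concrete, here is a minimal sketch that contrasts a single float4 fetch with four separate float fetches through 1D textures bound to the same linear device memory; the texture and kernel names are illustrative rather than taken from any particular code base:

#include <cuda_runtime.h>

texture<float4, 1, cudaReadModeElementType> packedTex;   // four values per fetch
texture<float,  1, cudaReadModeElementType> scalarTex;   // one value per fetch

__global__ void sumPacked(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = tex1Dfetch(packedTex, i);              // one texture read
        out[i] = v.x + v.y + v.z + v.w;
    }
}

__global__ void sumScalar(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                                            // four separate texture reads
        out[i] = tex1Dfetch(scalarTex, 4*i)   + tex1Dfetch(scalarTex, 4*i+1)
               + tex1Dfetch(scalarTex, 4*i+2) + tex1Dfetch(scalarTex, 4*i+3);
}

int main()
{
    const int n = 1 << 20;
    float4 *d_in; float *d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float4));
    cudaMalloc((void**)&d_out, n * sizeof(float));

    // The same buffer is bound to both texture references.
    cudaBindTexture(NULL, packedTex, d_in, n * sizeof(float4));
    cudaBindTexture(NULL, scalarTex, d_in, n * sizeof(float4));

    int block = 256, grid = (n + block - 1) / block;
    sumPacked<<<grid, block>>>(d_out, n);
    sumScalar<<<grid, block>>>(d_out, n);
    cudaThreadSynchronize();

    cudaUnbindTexture(packedTex);
    cudaUnbindTexture(scalarTex);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}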
One ingenious mapping of a random-access data structure to texture memory has been implemented by the CUDA-EC software. In the CUDA code, NVIDIA implements a Bloom filter to test for set membership. The CUDA-EC software is available for free download at http://cuda-ec.sourceforge.net/.
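For illustration only, the following device-side sketch shows how a Bloom-filter membership test might fetch its bit array through a 1D texture; the hash function, sizes, and names (bloomTex, bloomContains, and so on) are hypothetical and are not taken from the CUDA-EC source:

#include <cuda_runtime.h>

texture<unsigned int, 1, cudaReadModeElementType> bloomTex;  // bit array viewed as 32-bit words

#define NUM_HASHES   4
#define FILTER_BITS  (1 << 24)          // size of the Bloom filter in bits

// Simple multiplicative hash; a real filter would use independent hash functions.
__device__ unsigned int bloomHash(unsigned int key, unsigned int seed)
{
    return (key * 2654435761u + seed * 0x9e3779b9u) % FILTER_BITS;
}

// Returns nonzero if 'key' *may* be in the set (Bloom filters allow false positives).
__device__ int bloomContains(unsigned int key)
{
    for (unsigned int i = 0; i < NUM_HASHES; ++i) {
        unsigned int bit  = bloomHash(key, i);
        unsigned int word = tex1Dfetch(bloomTex, bit >> 5); // cached read of one word
        if (!((word >> (bit & 31)) & 1u))
            return 0;                                       // definitely not present
    }
    return 1;                                               // probably present
}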
CUDA Toolkit 2.2 introduced the ability to write to global memory on the GPU that has a 2D texture bound to it, when that texture is bound to pitch linear memory. In other words, the data within the texture can be updated within a kernel running on the GPU. This is a very nice feature because it allows many codes to better utilize the caching behavior of texture memory while also eliminating copies. One common example that immediately springs to mind is a calculation that requires two passes through the data: one to calculate a value (such as a mean or maximum) and a second pass to update the data in place. Such calculations are common when changing the data range or calculating probabilities. The use of an updatable texture can potentially speed these types of calculations.
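As a minimal sketch of this pattern, assume a float image stored in pitch linear memory with a 2D texture bound to it; each thread reads its element through the texture cache and then writes a scaled value back to the underlying global memory (the kernel name and the simple scaling operation are illustrative, not from a specific application):

#include <cuda_runtime.h>

texture<float, 2, cudaReadModeElementType> texRef;   // bound to pitch linear memory

__global__ void scaleInPlace(float *data, size_t pitch, int width, int height, float scale)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float v = tex2D(texRef, x, y);                // read through the texture cache
        float *row = (float*)((char*)data + y * pitch);
        row[x] = v * scale;                           // write directly to the bound memory
    }
}

int main()
{
    const int width = 1024, height = 1024;
    float *d_data; size_t pitch;
    cudaMallocPitch((void**)&d_data, &pitch, width * sizeof(float), height);

    // Bind a 2D texture to the pitch linear allocation (new in CUDA 2.2).
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaBindTexture2D(NULL, texRef, d_data, desc, width, height, pitch);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scaleInPlace<<<grid, block>>>(d_data, pitch, width, height, 0.5f);
    cudaThreadSynchronize();

    cudaUnbindTexture(texRef);
    cudaFree(d_data);
    return 0;
}

One caveat: the texture cache is not kept coherent with global memory writes made within the same kernel launch, so the safe pattern is to read a location through the texture before writing it and to start a new kernel launch before reading the updated values through the texture again.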
The cuBLAS library uses texture memory for many of the single-pass calculations (sasum, sdot, etc.). However, comments in the source code indicate that texture memory should not be used for vectors that are short, or for those that are aligned and have unit stride and thus exhibit nicely coalesced behavior. (The source for the cuBLAS and cuFFT libraries is available to those who have signed up as NVIDIA developers.)
Texture cache is part of each TPC, here short for "Thread Processing Cluster" since I am discussing operations in compute mode. (TPC stands for "Texture Processing Cluster" in graphics mode, which I don't address in this article.) Each TPC contains multiple streaming multiprocessors and a single texture cache. It is important to note that in the GTX 200 series the texture cache supports three SMs (Streaming Multiprocessors) per TPC, while the G80/G92 architecture only supports two.
Figure 1 depicts a high-level view of the GeForce GTX 280 GPU in parallel computing mode: a hardware-based thread scheduler at the top manages scheduling threads across the TPCs, which includes the texture caches and memory interface units. The elements indicated as "atomic" refer to the ability to perform atomic read-modify-write operations to memory. For more information, please see the GeForce GTX 200 GPU Technical Brief.

Figure 2 represents a lower-level view of a single TPC. Note that TF stands for "Texture Filtering" and IU is the abbreviation for "Instruction Unit".

Textures are bound to global memory and can provide both cache and some processing capabilities. How the global memory was created dictates some of the capabilities the texture can provide. For this reason, it is important to distinguish between three memory types that can be bound to a texture:
