CUDA, Supercomputing for the Masses: Part 13


In CUDA, Supercomputing for the Masses: Part 12, I took a quick detour to discuss some of the paradigm-changing features of the CUDA Toolkit 2.2 release. This article resumes the discussion of "texture memory" that I began in Part 11 of this series. It also covers the new CUDA Toolkit 2.2 texture capability that lets some programs eliminate extra copies by writing to global memory on the GPU that has a 2D texture bound to it.

From a C-programmer's perspective, texture memory provides an unusual combination of cache memory (separate from register, global, and shared memory), local processing capability (separate from the scalar processors), and a way to interact with the display capabilities of the GPU. This article focuses on the cache and local processor capabilities of texture memory while the next column will discuss how to perform viewable graphic operations with the GPU.

Don't be put off from using texture memory because it is different and has many options. Texture memory can improve performance for both bandwidth-limited and latency-limited programs. For example, some programs can exceed the maximum theoretical bandwidth of the underlying global memory through judicious use of the texture memory cache. While the latency of a texture cache reference is generally the same as that of DRAM, there are some special cases that can deliver data with slightly less than 100 cycles of latency. As usual in CUDA, the use of many threads can hide memory access latency regardless of whether the texture cache or global memory is being accessed.

For CUDA programmers, the most salient points about using texture memory as a cache are that it is optimized for 2D spatial locality, that it is very small (effectively about 8KB per multiprocessor), and that it can provide a performance benefit when all the threads in a warp access nearby locations in the texture (as demonstrated in Cache-Efficient Numerical Algorithms using Graphics Hardware). Another tip from the forums is to pack data together if you can, because a single float4 texture read is faster than four separate float texture reads.
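The packing tip can be sketched as follows. This is a minimal illustration, not code from the forums or from NVIDIA: it assumes the texture object API (cudaCreateTextureObject and tex1Dfetch on a cudaTextureObject_t), which arrived in toolkits much later than 2.2 and needs newer hardware than the GTX 280. The names sumComponents and sumPacked are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Each thread issues ONE float4 fetch through the texture path instead of
// four separate float fetches, then combines the four components.
__global__ void sumComponents(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = tex1Dfetch<float4>(tex, i);
        out[i] = v.x + v.y + v.z + v.w;
    }
}

// Hypothetical host helper: binds a linear float4 buffer to a texture
// object, runs the kernel, and returns the per-element sums.
std::vector<float> sumPacked(const std::vector<float4> &in)
{
    int n = static_cast<int>(in.size());
    if (n == 0) return {};

    float4 *dIn = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float4));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, in.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    // Describe the linear global memory that backs the texture.
    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = dIn;
    res.res.linear.desc = cudaCreateChannelDesc<float4>();
    res.res.linear.sizeInBytes = n * sizeof(float4);

    cudaTextureDesc td;
    memset(&td, 0, sizeof(td));
    td.readMode = cudaReadModeElementType;   // return the stored floats as-is

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    int threads = 128;
    sumComponents<<<(n + threads - 1) / threads, threads>>>(tex, dOut, n);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Calling sumPacked with a vector of float4 values returns one sum per element. Whether the packed fetch actually wins on a given GPU is something to measure rather than assume.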

One ingenious mapping of a random-access data structure to texture memory has been implemented by the CUDA-EC software. In the CUDA code, the CUDA-EC authors implement a Bloom filter to test for set membership. The CUDA-EC software is available for free download at http://cuda-ec.sourceforge.net/.
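To make the idea concrete, here is a small sketch of a Bloom filter whose bit array is read through the texture path. It is not the CUDA-EC implementation: the filter size (kBits), the number of probes (kHashes), the hash function hashKey, and the helpers buildFilter and queryOnGpu are all assumptions made for illustration, and the code uses the texture object API from toolkits later than 2.2. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

constexpr unsigned int kBits = 1u << 16;  // filter size in bits (assumed)
constexpr int kHashes = 3;                // probes per key (assumed)

// Illustrative integer hash, shared by host and device. Not from CUDA-EC.
__host__ __device__ inline unsigned int hashKey(unsigned int key, unsigned int seed)
{
    unsigned int h = key * 2654435761u + seed * 0x9E3779B9u;
    h ^= h >> 15;
    h *= 0x85EBCA6Bu;
    h ^= h >> 13;
    return h % kBits;
}

// Build the bit array on the host: set kHashes bits for every key.
std::vector<unsigned int> buildFilter(const std::vector<unsigned int> &keys)
{
    std::vector<unsigned int> words(kBits / 32, 0u);
    for (unsigned int key : keys)
        for (int s = 0; s < kHashes; ++s) {
            unsigned int b = hashKey(key, s);
            words[b >> 5] |= 1u << (b & 31);
        }
    return words;
}

// Each thread tests one key. The scattered word reads go through the texture.
__global__ void queryFilter(cudaTextureObject_t filter, const unsigned int *keys,
                            int *hit, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned int found = 1u;
    for (int s = 0; s < kHashes; ++s) {
        unsigned int b = hashKey(keys[i], s);
        unsigned int w = tex1Dfetch<unsigned int>(filter, b >> 5);
        found &= (w >> (b & 31)) & 1u;   // a single clear bit means "not present"
    }
    hit[i] = static_cast<int>(found);
}

// Hypothetical host helper: returns 1 for "possibly present", 0 for "absent".
std::vector<int> queryOnGpu(const std::vector<unsigned int> &words,
                            const std::vector<unsigned int> &queries)
{
    int n = static_cast<int>(queries.size());
    if (n == 0) return {};
    unsigned int *dWords = nullptr, *dKeys = nullptr;
    int *dHit = nullptr;
    cudaMalloc(&dWords, words.size() * sizeof(unsigned int));
    cudaMalloc(&dKeys, n * sizeof(unsigned int));
    cudaMalloc(&dHit, n * sizeof(int));
    cudaMemcpy(dWords, words.data(), words.size() * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(dKeys, queries.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = dWords;
    res.res.linear.desc = cudaCreateChannelDesc<unsigned int>();
    res.res.linear.sizeInBytes = words.size() * sizeof(unsigned int);
    cudaTextureDesc td;
    memset(&td, 0, sizeof(td));
    td.readMode = cudaReadModeElementType;
    cudaTextureObject_t filter = 0;
    cudaCreateTextureObject(&filter, &res, &td, nullptr);

    queryFilter<<<(n + 127) / 128, 128>>>(filter, dKeys, dHit, n);

    std::vector<int> hit(n);
    cudaMemcpy(hit.data(), dHit, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaDestroyTextureObject(filter);
    cudaFree(dWords); cudaFree(dKeys); cudaFree(dHit);
    return hit;
}
```

As with any Bloom filter, a result of 1 means only that the key may be in the set, while a result of 0 is definitive. The filter here is built on the host and only queried on the GPU.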

The CUDA Toolkit 2.2 introduced the ability to write to pitch-linear global memory on the GPU that has a 2D texture bound to it. In other words, the data within the texture can be updated within a kernel running on the GPU. This is a very nice feature because it allows many codes to better utilize the caching behavior of texture memory while also eliminating copies. One common example that immediately springs to mind is a calculation that requires two passes through the data: one to calculate a value (such as a mean or maximum) and a second to update the data in place. Such calculations are common when changing the data range or calculating probabilities. The use of an updatable texture can potentially speed up these types of calculations.
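A two-pass calculation of this kind might look like the sketch below, which scales an image by its maximum. Several things here are assumptions rather than the Toolkit 2.2 interface: the code uses the later texture object API with a cudaResourceTypePitch2D resource, the names findMax, scaleInPlace, and normalizeByMax are hypothetical, and the atomicMax trick assumes the data is non-negative so that float bit patterns order like integers. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Pass 1: find the maximum, reading through the texture. Assumes the data
// is non-negative so comparing float bit patterns as ints is valid.
__global__ void findMax(cudaTextureObject_t tex, int w, int h, int *maxBits)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        atomicMax(maxBits, __float_as_int(tex2D<float>(tex, x + 0.5f, y + 0.5f)));
}

// Pass 2: read each element through the texture and write the scaled value
// back into the same pitch-linear memory the texture is bound to. Each
// thread reads only the element it later overwrites.
__global__ void scaleInPlace(cudaTextureObject_t tex, float *data, size_t pitch,
                             int w, int h, const int *maxBits)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float m = __int_as_float(*maxBits);
    float v = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    float *row = reinterpret_cast<float *>(reinterpret_cast<char *>(data) + y * pitch);
    row[x] = (m > 0.0f) ? v / m : 0.0f;
}

// Hypothetical host helper: img is row-major, w columns by h rows.
std::vector<float> normalizeByMax(const std::vector<float> &img, int w, int h)
{
    float *dData = nullptr;
    size_t pitch = 0;
    int *dMax = nullptr;
    cudaMallocPitch(&dData, &pitch, w * sizeof(float), h);
    cudaMemcpy2D(dData, pitch, img.data(), w * sizeof(float),
                 w * sizeof(float), h, cudaMemcpyHostToDevice);
    cudaMalloc(&dMax, sizeof(int));
    cudaMemset(dMax, 0, sizeof(int));        // bit pattern of 0.0f

    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypePitch2D;
    res.res.pitch2D.devPtr = dData;
    res.res.pitch2D.desc = cudaCreateChannelDesc<float>();
    res.res.pitch2D.width = w;
    res.res.pitch2D.height = h;
    res.res.pitch2D.pitchInBytes = pitch;

    cudaTextureDesc td;
    memset(&td, 0, sizeof(td));
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;     // no interpolation
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    findMax<<<grid, block>>>(tex, w, h, dMax);
    scaleInPlace<<<grid, block>>>(tex, dData, pitch, w, h, dMax);

    std::vector<float> out(img.size());
    cudaMemcpy2D(out.data(), w * sizeof(float), dData, pitch,
                 w * sizeof(float), h, cudaMemcpyDeviceToHost);
    cudaDestroyTextureObject(tex);
    cudaFree(dData);
    cudaFree(dMax);
    return out;
}
```

Note that the texture cache is not kept coherent with global memory writes made during the same kernel launch, so this sketch has each thread fetch its element before overwriting it and never reads a location that another thread may already have updated. A later kernel launch sees the updated values through the texture without any intermediate copy.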

The cuBLAS library uses texture memory for many of the single-pass calculations (sasum, sdot, etc.). However, comments in the source code indicate that texture memory should not be used for vectors that are short, or for those that are aligned and have unit stride and thus already have nicely coalesced behavior. (The source for the cuBLAS and cuFFT libraries is available to those who have signed up as NVIDIA developers.)

Texture cache is part of each TPC, here short for "Thread Processing Cluster" since I am discussing operations in compute mode. (TPC stands for "Texture Processing Cluster" in graphics mode, which I don't address in this article.) Each TPC contains multiple streaming multiprocessors (SMs) and a single texture cache. It is important to note that in the GTX 200 series the texture cache serves three SMs per TPC, while in the G80/G92 architecture it serves only two.

Figure 1 depicts a high-level view of the GeForce GTX 280 GPU in parallel computing mode: A hardware-based thread scheduler at the top manages the scheduling of threads across the TPCs, which include the texture caches and memory interface units. The elements indicated as "atomic" refer to the ability to perform atomic read-modify-write operations to memory. For more information, please see the GeForce GTX 200 GPU Technical Brief.

Figure 1: High-Level view of GTX 280 Architecture (Courtesy NVIDIA).

Figure 2 represents a lower-level view of a single TPC. Note that TF stands for "Texture Filtering" and IU is the abbreviation for "Instruction Unit".

Figure 2: Lower-level view of a single GTX 280 TPC (Courtesy NVIDIA).

Textures are bound to global memory and can provide both cache and some processing capabilities. How the global memory was created dictates some of the capabilities the texture can provide. For this reason, it is important to distinguish between three memory types that can be bound to a texture:

Table 1: Distinguishing between memory types.

