Channels ▼


CUDA, Supercomputing for the Masses: Part 13

Linear Memory

It is important to distinguish between "linear memory" created with cudaMalloc() and "pitch linear" memory created with cudaMallocPitch(). In a nutshell, both methods create linear memory but cudaMallocPitch() pads the allocation to get best performance for the memory subsystem of a given piece of hardware. A programmer can create memory with cudaMalloc() and manually set the pitch, but best memory performance may not be achieved. Aside from the ability to update, there is little difference between a texture bound to a 2D CUDA array and pitch linear memory. NVIDIA has indicated there is effectively no difference between textures bound to these two types of memory.

There are two cases to consider when binding the texture to global memory that necessitated distinguishing between pitch linear memory and linear memory:

  • When using the texture only as a cache: In this case programmers might consider binding the texture to linear memory created with cudaMalloc() because the texture unit cache is small and caching the padding added by cudaMallocPitch() would be wasteful.
  • When using the texture to perform some processing: In this case it is important to bind the texture to pitch linear memory created with cudaMallocPitch() so that the texture unit boundary processing works correctly. In other words, don't bind linear memory created with cudaMalloc() (and manually set the pitch) to a texture because unexpected things might happen.

The use of cudaMallocPitch() is generally recommended because it "knows" what pitch is appropriate for a given piece of hardware to get the best performance and is a good way to future-proof your code.

Also note that CUDA arrays are an opaque data storage mechanism, and are composed of elements, each of which has 1, 2 or 4 components that may be signed or unsigned 8-, 16- or 32-bit integers, 16-bit floats (CUDA driver only), or 32-bit floats. You can also use int2 and hiloint2double to use double-precision values. Note that CUDA arrays may get reordered for locality on the GPU.

Binding memory to a texture is quite fast and unlikely to have an appreciable impact on program performance. There are some restrictions and additional caveats:

  • Updates to the memory backing the texture cache are not seen until the next kernel invocation.
    • In other words, a thread can safely read via texture some memory location only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread from the same kernel call.
  • Binding a texture to linear memory effectively prevents utilization of the texture for texture processing. An important note from the documentation (section of the Programming Guide) is that the texture reference fields normalized, addressMode, and filterMode may be modified in host code, but only for texture references bound to CUDA arrays and pitch linear memory! This effectively means that textures bound to linear memory cannot be used to perform texture processing with the texture units.
  • Texture memory cannot bind to mapped memory.

As noted, the texture cache is optimized for 2D spatial locality with streaming behavior when bound to pitch linear memory or CUDA arrays. However, this does not tell us how to order the data to get the best performance out of the texture unit when using it as a cache. There is a good thread on the CUDA Zone forums discussing how to order 3D data for best performance. One of the suggestions is to use a Z-order curve to map multidimensional data to 1D while preserving locality. The challenging question of how to best order your data for cache locality is further complicated because the methods used by the graphics hardware may change at some point in the future to better meet customer needs.

Depending on how the global memory bound to the texture was created, there are several possible ways to fetch from the texture that might also invoke some form of texture processing by the texture.

The simplest way to fetch data from a texture is by using tex1Dfetch() because:

  • Only integer addressing is supported.
  • No additional filtering or addressing modes are provided.

Use of the methods tex1D(), tex2D(), and tex3D() are more complicated because the interpretation of the texture coordinates, what processing occurs during the texture fetch, and the return value delivered by the texture fetch are all controlled by setting the texture reference's mutable (runtime) and immutable (compile time) attributes:

  • Immutable parameters (compile-time)

    • Type: type returned when fetching

      • Basic integer and float types
      • CUDA 1-, 2-, 4-element vectors

    • Dimensionality:

      • Currently 1D, 2D, or 3D

    • Read Mode:
      • cudaReadModeElementType
      • cudaReadModeNormalizedFloat (valid for 8- or 16-bit integers)

        • returns [-1,1] for signed, [0,1] for unsigned

  • Mutable parameters (run-time, only for array-textures and pitch linear-memory)

    • Normalized:

      • non-zero = addressing range [0, 1]

    • Filter Mode:

      • cudaFilterModePoint
      • cudaFilterModeLinear

    • Address Mode:

      • cudaAddressModeClamp
      • cudaAddressModeWrap

For more detailed information, please consult the CUDA Programming Guide.

By default, textures are referenced using floating-point coordinates in the range [0, N) where N is the size of the texture in the dimension corresponding to the coordinate. Specifying normalized texture coordinates will be used implies all references will be in the range [0,1).

The wrap mode specifies what happens for out-of-bounds addressing:

  • Wrap: out-of-bounds coordinates are wrapped (via modulo arithmetic)

    Figure 3: Wrap mode (Courtesy NVIDIA)

  • Clamp: out-of-bounds coordinates are replaced with the closest boundary:

    Figure 4: Clamp mode (Courtesy NVIDIA)

    Linear texture filtering may be done only for textures that are configured to return floating-point data. A texel, short for "texture element", is an element of a texture array. Thus, linear texture filtering performs low-precision (9-bit fixed-fixed point with 8-bits of fractional value) interpolation between neighboring texels. When enabled, the texels surrounding a texture fetch location are read and the return value of the texture fetch is interpolated by the texture hardware based on where the texture coordinates fell between the texels. Simple linear interpolation is performed for one-dimensional textures as can be seen in the following equation from Appendix D.2 of the NVIDIA CUDA Programming Guide 2.2:

    tex(x) = (1- α)T[i] + αT[i +1]
    Equation 1: Filtering mode for a one-dimensional texture.

    Similarly, the dedicated texture hardware will perform bilinear and trilinear filtering for higher-dimensional data. (For more information check out the free online GPU Gems books, and the Wikipedia articles on texture filtering.)

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.