It is important to distinguish between "linear memory" created with cudaMalloc() and "pitch linear" memory created with cudaMallocPitch(). In a nutshell, both methods create linear memory, but cudaMallocPitch() pads the allocation to get the best performance out of the memory subsystem of a given piece of hardware. A programmer can allocate memory with cudaMalloc() and set the pitch manually, but the best memory performance may not be achieved. Aside from the ability to update, there is little difference between a texture bound to a 2D CUDA array and one bound to pitch linear memory; NVIDIA has indicated there is effectively no difference between textures bound to these two types of memory.
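The pitch chosen by the runtime can be seen directly. A minimal sketch (sizes and variable names are illustrative, not from the original text): cudaMallocPitch() returns the row stride in bytes, which is generally larger than the requested row width.

```cuda
// Sketch: letting cudaMallocPitch() choose a hardware-appropriate pitch.
#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    const int width = 1000, height = 1000;   // elements, not bytes
    float *devPtr;
    size_t pitch;                            // row stride in BYTES, chosen by the runtime

    // cudaMallocPitch() pads each row so rows start on well-aligned addresses
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
    printf("requested row size: %zu bytes, pitch: %zu bytes\n",
           width * sizeof(float), pitch);

    // element (row, col) lives at: (float*)((char*)devPtr + row * pitch) + col
    cudaFree(devPtr);
    return 0;
}
```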
There are two cases to consider when binding a texture to global memory that necessitate distinguishing between pitch linear memory and linear memory:
- When using the texture only as a cache: In this case, programmers might consider binding the texture to linear memory created with cudaMalloc(), because the texture unit cache is small and caching the padding added by cudaMallocPitch() would be wasteful.
- When using the texture to perform some processing: In this case, it is important to bind the texture to pitch linear memory created with cudaMallocPitch() so that the texture unit boundary processing works correctly. In other words, don't bind linear memory created with cudaMalloc() (with a manually set pitch) to a texture, because unexpected things might happen.
The use of cudaMallocPitch() is generally recommended because it "knows" what pitch is appropriate for a given piece of hardware to get the best performance, and it is a good way to future-proof your code.
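Binding a texture to a pitch linear allocation can be sketched as follows, using the CUDA 2.2-era texture reference API. The function and variable names are illustrative; width and height are in elements, while the pitch is in bytes as returned by cudaMallocPitch().

```cuda
// Sketch: binding a 2D texture to pitch linear memory so the texture
// unit's boundary processing applies.
#include <cuda_runtime.h>

texture<float, 2, cudaReadModeElementType> texRef;  // must be at file scope

void bindPitchLinear(float *devPtr, size_t pitch, int width, int height)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    // width/height in elements; pitch in bytes from cudaMallocPitch()
    cudaBindTexture2D(NULL, texRef, devPtr, desc, width, height, pitch);
}

__global__ void copy2D(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(texRef, x, y);  // fetched through the texture cache
}
```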
Also note that CUDA arrays are an opaque data storage mechanism. They are composed of elements, each of which has 1, 2, or 4 components that may be signed or unsigned 8-, 16-, or 32-bit integers, 16-bit floats (CUDA driver API only), or 32-bit floats. You can also use __hiloint2double() to reassemble double-precision values from 32-bit halves. Note that CUDA arrays may get reordered for locality on the GPU.
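Since textures cannot return doubles directly, one common workaround is to bind the texture to int2 data and reassemble each double in the kernel. A minimal sketch (the texture and function names are illustrative):

```cuda
// Sketch: fetching double-precision data through a texture bound to int2,
// reassembling each double with __hiloint2double().
#include <cuda_runtime.h>

texture<int2, 1, cudaReadModeElementType> texD;  // bound to the double data viewed as int2

__device__ double fetchDouble(int i)
{
    int2 v = tex1Dfetch(texD, i);        // the two 32-bit halves of the double
    return __hiloint2double(v.y, v.x);   // (hi, lo) -> double
}
```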
Binding memory to a texture is quite fast and unlikely to have an appreciable impact on program performance. There are some restrictions and additional caveats:
- Updates to the memory backing the texture cache are not seen until the next kernel invocation. In other words, a thread can safely read a memory location via texture only if that location was updated by a previous kernel call or memory copy, but not if it was previously updated by the same thread or by another thread within the same kernel call.
- Binding a texture to linear memory effectively prevents utilization of the texture units for texture processing. An important note from the texture reference section of the Programming Guide is that the texture reference field filterMode may be modified in host code, but only for texture references bound to CUDA arrays and pitch linear memory! This effectively means that textures bound to linear memory cannot be used to perform texture processing with the texture units.
- A texture cannot be bound to mapped memory.
As noted, the texture cache is optimized for 2D spatial locality with streaming behavior when bound to pitch linear memory or CUDA arrays. However, this does not tell us how to order the data to get the best performance out of the texture unit when using it as a cache. There is a good thread on the CUDA Zone forums discussing how to order 3D data for best performance. One of the suggestions is to use a Z-order curve to map multidimensional data to 1D while preserving locality. The challenging question of how to best order your data for cache locality is further complicated because the methods used by the graphics hardware may change at some point in the future to better meet customer needs.
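The Z-order (Morton) mapping mentioned above can be sketched with a few lines of bit manipulation. This standard bit-interleaving trick is not from the original text; the function names are illustrative:

```cuda
// Sketch: interleaving the bits of 2D coordinates to form a Z-order
// (Morton) index, mapping 2D data to 1D while preserving locality.

// Spread the low 16 bits of x so that a zero bit separates each original bit.
__host__ __device__ unsigned int part1By1(unsigned int x)
{
    x &= 0x0000ffff;
    x = (x | (x << 8)) & 0x00ff00ff;
    x = (x | (x << 4)) & 0x0f0f0f0f;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

// Interleave x and y bits: ... y1 x1 y0 x0.  Nearby (x, y) pairs land
// at nearby 1D indices, which helps cache locality.
__host__ __device__ unsigned int mortonIndex(unsigned int x, unsigned int y)
{
    return part1By1(x) | (part1By1(y) << 1);
}
```

The same idea extends to 3D by spreading each coordinate's bits two positions apart and interleaving three ways.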
Depending on how the global memory bound to the texture was created, there are several possible ways to fetch from the texture that might also invoke some form of texture processing by the texture unit.
The simplest way to fetch data from a texture is by using tex1Dfetch(), which has the following characteristics:
- Only integer addressing is supported.
- No additional filtering or addressing modes are provided.
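A minimal sketch of this simplest path, for a texture bound to linear memory from cudaMalloc() (names are illustrative):

```cuda
// Sketch: tex1Dfetch() with a plain integer index -- no filtering,
// no addressing modes, just cached reads.
#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> texRef;  // file-scope texture reference

__global__ void copyViaTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);  // integer addressing only
}

// Host side: bind the texture to linear memory before launching.
//   cudaBindTexture(NULL, texRef, devIn, n * sizeof(float));
```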
Use of the methods tex1D(), tex2D(), and tex3D() is more complicated because the interpretation of the texture coordinates, what processing occurs during the texture fetch, and the return value delivered by the texture fetch are all controlled by setting the texture reference's mutable (run-time) and immutable (compile-time) attributes:
- Immutable parameters (compile-time)
  - Type: type returned when fetching
    - Basic integer and float types
    - CUDA 1-, 2-, 4-element vectors
  - Dimensionality:
    - Currently 1D, 2D, or 3D
  - Read Mode:
    - cudaReadModeElementType
    - cudaReadModeNormalizedFloat (valid for 8- or 16-bit integers)
      - returns [-1,1] for signed, [0,1] for unsigned
- Mutable parameters (run-time, only for array-textures and pitch linear memory)
  - Normalized:
    - non-zero = addressing range [0, 1)
  - Filter Mode:
    - cudaFilterModePoint
    - cudaFilterModeLinear
  - Address Mode:
    - cudaAddressModeClamp
    - cudaAddressModeWrap
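Setting the mutable attributes in host code can be sketched as follows (the texture and function names are illustrative):

```cuda
// Sketch: configuring a texture reference's mutable (run-time)
// attributes before binding it to a CUDA array or pitch linear memory.
#include <cuda_runtime.h>

texture<float, 2, cudaReadModeElementType> texRef;

void configureTexture()
{
    texRef.normalized = 1;                         // address with [0, 1) coordinates
    texRef.filterMode = cudaFilterModeLinear;      // interpolate between texels
    texRef.addressMode[0] = cudaAddressModeWrap;   // wrap out-of-bounds x
    texRef.addressMode[1] = cudaAddressModeClamp;  // clamp out-of-bounds y
}
```

Note that wrap addressing is only meaningful with normalized coordinates.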
For more detailed information, please consult the CUDA Programming Guide.
By default, textures are referenced using floating-point coordinates in the range [0, N) where N is the size of the texture in the dimension corresponding to the coordinate. Specifying that normalized texture coordinates will be used implies all references will be in the range [0, 1).
The wrap mode specifies what happens for out-of-bounds addressing:
- Wrap: out-of-bounds coordinates are wrapped (via modulo arithmetic)
- Clamp: out-of-bounds coordinates are replaced with the closest valid coordinate
Linear texture filtering may be done only for textures that are configured to return floating-point data. A texel, short for "texture element," is an element of a texture array. Linear texture filtering performs low-precision interpolation between neighboring texels: the fraction is stored in 9-bit fixed point with 8 bits of fractional value. When enabled, the texels surrounding a texture fetch location are read, and the return value of the texture fetch is interpolated by the texture hardware based on where the texture coordinates fell between the texels. Simple linear interpolation is performed for one-dimensional textures, as can be seen in the following equation from Appendix D.2 of the NVIDIA CUDA Programming Guide 2.2:
tex(x) = (1 − α) T[i] + α T[i + 1]
Equation 1: Filtering mode for a one-dimensional texture.
Similarly, the dedicated texture hardware will perform bilinear and trilinear filtering for higher-dimensional data. (For more information check out the free online GPU Gems books, and the Wikipedia articles on texture filtering.)
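Equation 1 can be emulated on the host to see the effect of the 8-bit fractional precision. This sketch assumes the coordinate conventions of Appendix D of the Programming Guide (texel centers at i + 0.5); the function name and the simple edge clamping are illustrative:

```cuda
// Sketch: host-side emulation of one-dimensional linear filtering.
// alpha is quantized to 8 fractional bits, i.e. steps of 1/256.
#include <math.h>

float tex1DLinearEmulated(const float *T, int n, float x)
{
    float xb    = x - 0.5f;                      // texel centers sit at i + 0.5
    int   i     = (int)floorf(xb);
    float alpha = xb - (float)i;                 // fractional position between texels
    alpha = floorf(alpha * 256.0f) / 256.0f;     // quantize to 8 fractional bits

    if (i < 0) i = 0;                            // simple clamp at the edges
    int j = (i + 1 < n) ? i + 1 : n - 1;
    return (1.0f - alpha) * T[i] + alpha * T[j]; // Equation 1
}
```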