In CUDA, Supercomputing for the Masses: Part 11 of this article series on CUDA, I revisited CUDA memory spaces and introduced the concept of "texture memory". In this installment, I discuss some paradigm-changing features of the just-released CUDA version 2.2 -- namely "mapped" pinned system memory, which allows compute kernels to share host system memory and provides zero-copy support for direct access to host system memory on many newer CUDA-enabled graphics processors. The next article in this series will resume the discussion of texture memory and include information about new CUDA 2.2 features such as the ability to write to global memory on the GPU that has a texture bound to it. (Go here for more on CUDA 2.2.)

Until now, CUDA kernels could not access host memory directly, so programmers have had to structure their applications around a three-step pattern:
- Move data to the GPU.
- Perform calculation on GPU.
- Move result(s) from the GPU to host.
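In code, the traditional pattern looks something like this minimal sketch (the scale kernel, problem size, and launch configuration are purely illustrative):

```c
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale(float *d, int n)      // illustrative kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    float *d;

    for (int i = 0; i < n; i++) h[i] = (float)i;

    cudaMalloc((void **)&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // 1. move data to the GPU
    scale<<<(n + 255) / 256, 256>>>(d, n);            // 2. perform the calculation
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // 3. move the result back

    cudaFree(d);
    free(h);
    return 0;
}
```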
This paradigm has now changed: CUDA 2.2 introduces new APIs that allow host memory to be mapped into device memory via a new function called cudaHostAlloc() (or cuMemHostAlloc() in the CUDA driver API). This new memory type supports the following features:
- "Portable" pinned buffers that are available to all GPUs.
- The use of multiple GPUs will be discussed in a future article.
- "Mapped" pinned buffers that map host memory into the CUDA address space and provide asynchronous transparent access to the data without requiring an explicit programmer initiated copy.
- Integrated GPUs share physical memory with the host processor (as opposed to the on-board fast global memory of discrete GPUs). Mapped pinned buffers act as "zero-copy" buffers on many newer devices (especially integrated graphics processors) because they avoid superfluous copies. When developing code for integrated GPUs, using mapped pinned memory really makes sense.
- For discrete GPUs, mapped pinned memory is only a performance win in certain cases. Since the memory is not cached by the GPU:
- It should be read or written exactly once.
- The global loads and stores that read or write the memory must be coalesced to avoid a 2x-7x PCIe performance penalty.
- At best, it will only deliver PCIe bandwidth performance, but this can be 2x faster than cudaMemcpy because mapped memory is able to exploit the full-duplex capability of the PCIe bus by reading and writing at the same time. A call to cudaMemcpy can only move data in one direction at a time (i.e., half duplex).
Further, a drawback of the current CUDA 2.2 release is that all pinned allocations are mapped into the GPU's 32-bit linear address space, regardless of whether the device pointer is needed or not. (NVIDIA indicates this will be changed to a per-allocation basis in a later release.) The sketch below illustrates the basic mapped-memory pattern.
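Here is a minimal zero-copy sketch using the CUDA 2.2 runtime API; the increment kernel is illustrative, and error checking is omitted for brevity:

```c
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void increment(float *data, int n)   // illustrative kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;   // each mapped location read/written exactly once
}

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) {               // not all GPUs can map host memory
        fprintf(stderr, "Device cannot map host memory\n");
        return 1;
    }

    // Must be called before any CUDA context is created on the device.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 1 << 20;
    float *h_ptr, *d_ptr;

    cudaHostAlloc((void **)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0);

    for (int i = 0; i < n; i++) h_ptr[i] = (float)i;

    // No cudaMemcpy: the kernel accesses host memory directly across PCIe.
    increment<<<(n + 255) / 256, 256>>>(d_ptr, n);
    cudaThreadSynchronize();        // CUDA 2.2-era sync before the host reads h_ptr

    printf("h_ptr[0] = %f\n", h_ptr[0]);   // prints 1.000000
    cudaFreeHost(h_ptr);
    return 0;
}
```

Note how the kernel touches each mapped location exactly once with coalesced accesses, in keeping with the guidelines above.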
- "WC" (write-combined) memory can provide higher performance:
- Since WC memory is neither cached nor cache coherent, greater PCIe performance can be achieved because the memory is not snooped during transfers across the PCI Express bus. NVIDIA notes in their "CUDA 2.2 Pinned Memory APIs" document that WC memory may perform as much as 40% faster on certain PCI Express 2.0 implementations.
- It may increase the host processor's write performance to host memory because individual writes are first combined (via an internal processor write-buffer) so that only a single burst write containing many aggregated individual writes needs to be issued. (Intel claims to have observed actual performance increases of over 10x, but this is not typical.) For more information, please see the Intel publication Write Combining Memory Implementation Guidelines.
- Host-side calculations and applications may run faster because write-combined memory does not pollute the internal processor caches such as the L1 and L2 caches. Because WC does not enforce cache coherency, it can increase host processor efficiency by reducing cache misses and avoiding the overhead of enforcing coherency. Write-combining also avoids cache pollution by utilizing a separate, dedicated internal write-buffer that bypasses and leaves the other internal processor caches untouched.
- WC memory does have drawbacks, and CUDA programmers should not consider a WC memory region as general-purpose memory because it is "weakly-ordered". In other words, reading from a WC memory location may return unexpected -- and incorrect -- data because a previous write to that memory location might have been delayed in order to combine it with other writes. Without programmer-enforced coherency through a "fence" operation, it is possible that a read of WC memory may actually return old or even uninitialized data.
- Unfortunately, enforcing coherent reads from WC memory may incur a performance penalty on some host processor architectures. Happily, processors with the SSE4 instruction set provide a streaming load instruction (MOVNTDQA) that can efficiently read from WC memory. (To check whether SSE4.1 is available, execute the CPUID instruction with EAX==1 and test bit 19 of ECX; a detection sketch appears after this list.) Please see the Intel publication, Increasing Memory Throughput with Intel Streaming SIMD Extensions 4 (Intel SSE4) Streaming Load.
- It is unclear if and when a CUDA programmer needs to take any action (such as using a memory fence) to ensure that the WC memory is in place and ready for use by the host or graphics processor(s). The Intel documentation states that "[a] 'memory fence' instruction should be used to properly ensure consistency between the data producer and data consumer." The CUDA driver does use WC memory internally and must issue a store fence instruction whenever it sends a command to the GPU. For this reason, the NVIDIA documentation notes, "the application may not have to use store fences at all" (emphasis added). A rough rule of thumb that appears to work is to look at the CUDA commands issued prior to referencing WC memory and assume they issue a fence instruction. Otherwise, utilize your compiler's intrinsic operations to issue a store fence instruction and guarantee that every preceding store is globally visible. This is compiler dependent: Linux compilers will probably understand the _mm_sfence intrinsic, and Microsoft's compiler supports it as well (via <intrin.h>). A short fence example follows this list.
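As a concrete illustration of the fence discussion, the following sketch allocates a write-combined staging buffer, fills it on the host, and defensively issues an explicit store fence before handing the data to the GPU (as noted above, the driver's internal fences may make this step redundant):

```c
#include <cuda_runtime.h>
#include <xmmintrin.h>   // _mm_sfence (also available from <intrin.h> with MSVC)

int main(void)
{
    const int n = 1 << 20;
    float *h_wc, *d;

    // Pinned, write-combined host buffer: fast for the CPU to write,
    // but slow to read back without SSE4.1 streaming loads.
    cudaHostAlloc((void **)&h_wc, n * sizeof(float), cudaHostAllocWriteCombined);
    cudaMalloc((void **)&d, n * sizeof(float));

    // Sequential writes combine well in the processor's write-buffer.
    for (int i = 0; i < n; i++) h_wc[i] = (float)i;

    // Defensive store fence: guarantees every preceding store is globally
    // visible before the transfer begins.
    _mm_sfence();

    cudaMemcpy(d, h_wc, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d);
    cudaFreeHost(h_wc);
    return 0;
}
```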
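And here is one way to perform the SSE4.1 check described above, sketched with GCC's __get_cpuid helper (MSVC users would call the __cpuid intrinsic from <intrin.h> instead):

```c
#include <cpuid.h>    // GCC/Clang helper around the CPUID instruction
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    // CPUID with EAX == 1 returns feature bits in ECX and EDX;
    // bit 19 of ECX indicates SSE4.1 (and thus MOVNTDQA) support.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 19)))
        printf("SSE4.1 streaming loads available\n");
    else
        printf("SSE4.1 not available\n");
    return 0;
}
```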
Each of these memory features can be used individually or in any combination -- you can allocate a portable, write-combined buffer, a portable pinned buffer, a write-combined buffer that is neither portable nor mapped, or any other permutation enabled by the flags.
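For example, the following sketch (flag names from the CUDA 2.2 runtime API) combines all three attributes in a single allocation:

```c
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 1 << 20;
    float *buf;

    // Required so the cudaHostAllocMapped flag below can take effect.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // The flags are orthogonal and may be OR'd together in any combination;
    // here: a portable, mapped, write-combined pinned buffer.
    cudaHostAlloc((void **)&buf, bytes,
                  cudaHostAllocPortable |
                  cudaHostAllocMapped   |
                  cudaHostAllocWriteCombined);

    cudaFreeHost(buf);
    return 0;
}
```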
In a nutshell, these new features add convenience and performance while also adding complexity and creating version dependencies on the CUDA driver, the CUDA hardware, and the host processors. Even so, many types of applications can benefit from these new features.