This article highlights aspects related to the support and use of unified, or managed, memory in CUDA 6.x. We also revisit two other CUDA memory transaction topics: zero-copy memory and unified virtual addressing. These two can be regarded as intermediate milestones in a process that has led to the release of managed memory in CUDA 6, with three immediate consequences:
- The host/device heterogeneous computing memory model is simplified by eliminating the need for deep copies when accessing structures of arrays in GPU kernels;
- The programmer has the option to defer the host-device data transfers to the runtime;
- CUDA code becomes cleaner and simpler to develop/debug/maintain.
This discussion does not attempt to exhaustively cover the managed memory topic. We have attempted to highlight the points that will address 80% of CUDA developers' needs, while only touching 20% of the technical aspects associated with the use and support of unified memory in CUDA 6. An explanation of unified memory that does not delve into the zero-copy and unified virtual memory topics but instead concentrates exclusively on its key benefits, including the elimination of explicit deep copies and the implications for C++ code development, is already available online.
This article has a companion PowerPoint file that summarizes the main points and which could be useful for instructional purposes.
GPU Computing: The Physics-Based Simulation Angle
In 2007, soon after the release of CUDA 1.0, it became apparent that many classes of applications in computer-aided engineering (CAE) could benefit from the SIMD computing that GPUs offer. In our case, we were interested in how large collections of components move around and interact with each other in dynamic systems; that is, systems whose configuration changes in time under the influence of external and internal forces. The examples in Figure 1 illustrate two problems that have to do with the motion of granular systems. The goal here is to figure out, for systems with millions of elements, for example, grains in a granular material, how these elements mutually interact and how they move collectively. First of all, this calls for performing collision detection to figure out who is contacting whom. Next, drawing on some rather involved math, we must compute the interaction forces between any two bodies in contact. For the ball floating on the wave of granular material there are about 4 million mutual contacts. This scenario calls for the solution of an optimization problem in which we have to minimize a cost function that depends on four million variables. Approximately one million such optimization problems need to be solved to understand how this wave moves for 10 seconds. GPU computing helps provide a fast solution to each of the 1 million optimization problems. We cannot solve these 1 million problems in parallel since there is a principle of causality at work that makes the second optimization problem depend on the first, the third on the second, etc.
Figure 1: Example of physics-based modeling and simulation enabled by GPU computing. The left image represents a mixing process simulation. The right image captures the dynamics of a light body floating on more than 1 million rigid bodies whose collective motion resembles the propagation of a wave.
Life Before Unified Memory
Three CUDA features shaped the memory transaction landscape prior to the introduction of unified memory support in release 6. They are
cudaMemcpy, zero-copy memory, and support for unified virtual addressing. These CUDA features have been used in implementing a simulation engine called Chrono, which was employed to generate the images in Figure 1.
A staple of CUDA programming since version 1.0,
cudaMemcpy has facilitated the back-and-forth movement of data between host and device. While relatively awkward in syntax, it enabled a simple three-step approach for data processing on the GPU: Data was moved onto the device, it was processed by the device, and results were moved back to the host. This three-step process assumes that memory has been allocated on the device to store the data transferred from the host. The awkwardness stems from the
enum cudaMemcpyKind argument
cudaError_t cudaMemcpy(void * dst, const void * src, size_t count, enum cudaMemcpyKind kind);
The last argument defines the nature of the data transfer: from host to device (
cudaMemcpyHostToDevice), device to host (
cudaMemcpyDeviceToHost), device to device (
cudaMemcpyDeviceToDevice), or host to host (
cudaMemcpyHostToHost). It was the programmer's responsibility to ensure that if, for instance, the destination pointer
dst and the source pointer
src pointed to device and host memory, respectively, then the last argument was
cudaMemcpyHostToDevice. The code would most likely
segfault if, for instance, the same
dst/src pair were used with a
cudaMemcpyDeviceToHost flag. The crash is due to the fact that the host and device each dealt with their own virtual memory space: using a host address that was valid in the context of the host virtual memory led to an attempted device memory access to a location that might not even exist in the virtual memory space of the device.
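The three-step pattern described above, with the kind flag set correctly for each direction, might look like the following sketch (the scale kernel, array size, and launch configuration are made up for illustration):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative kernel: scales each element in place.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    float* h_data = (float*)malloc(bytes);
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    float* d_data;
    cudaMalloc(&d_data, bytes);

    // Step 1: move data onto the device (note the kind flag).
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    // Step 2: process it on the device.
    scale<<<(N + 255) / 256, 256>>>(d_data, N, 2.0f);
    // Step 3: move the results back to the host.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    printf("h_data[0] = %f\n", h_data[0]);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

Swapping the two pointers while keeping cudaMemcpyHostToDevice in step 1 is exactly the kind of mismatch that produces the crash described above.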
Of the four kinds of flags introduced above, the ones that typically saw the highest usage were
cudaMemcpyHostToDevice and
cudaMemcpyDeviceToHost. The bandwidth of the associated transfers depended on the PCIe generation: practically, for PCIe gen 1, 2, and 3, one could count on 3, 6, and 12 GB/s, respectively. The bandwidth of the PCIe transfer could be increased by pinning the host memory allocation used in the transactions to prevent the OS from paging it in and out of memory. Pinned memory on the host side had two additional side benefits:
- It allowed for asynchronous data movement between the host and device,
- It enabled the GPU to use both of its copy engines, one for moving data from the host to the device, the other for moving data from device to the host.
However, relying on a large amount of pinned memory ran the risk of slowing down the entire system, since it curtailed the freedom of the operating system to page out the pinned pages; at the same time, pinning host memory itself took a substantial amount of time. For instance, pinning 5 GB could require on the order of one second, forcing the programmer to choose between two options: spend the time pinning and hope to recoup it through faster PCIe transfers, while running the risk of slowing down the entire system; or skip the pinning and hope that the PCIe data movement was negligible relative to the time the host spent processing this data.
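The asynchronous pattern that pinned memory enables can be sketched as follows: the host allocation is page-locked with cudaHostAlloc, and copies are issued with cudaMemcpyAsync in a stream (the process kernel and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: increments each element.
__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    float* h_data;  // page-locked (pinned) host memory
    float* d_data;
    cudaHostAlloc(&h_data, bytes, cudaHostAllocDefault);
    cudaMalloc(&d_data, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Because h_data is pinned, these copies can proceed
    // asynchronously and overlap with work in other streams,
    // engaging the GPU's copy engines.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    process<<<(N + 255) / 256, 256, 0, stream>>>(d_data, N);
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // wait for the whole pipeline

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);  // pinned memory is released with cudaFreeHost
    return 0;
}
```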
The prototype of the runtime call for page-locked memory allocations on the host is:
cudaError_t cudaHostAlloc(void** pHost, size_t size, unsigned int flags);
The parameter flags, with a value that is some combination of
cudaHostAllocPortable, cudaHostAllocMapped, and
cudaHostAllocWriteCombined, qualifies the nature of the pinned memory; that is, what a programmer can subsequently do with it. Of interest here is
cudaHostAllocMapped, which ensures that the memory allocated on the host is mapped into the CUDA address space. Thereafter, a pointer to this memory can be used within a kernel call or other device function to access host memory directly. This is called zero-copy GPU-CPU interaction, hence the name "zero-copy memory." Note that data is still moved through the PCIe bus. However, unlike with
cudaMemcpy, this data movement is not coordinated by the user and it does not happen in one large transaction. The mapped memory is accessed directly by GPU threads and only data that is read/written is moved upon access over the PCIe bus.
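A sketch of the zero-copy pattern: host memory is allocated with the cudaHostAllocMapped flag, a device-side alias is obtained with cudaHostGetDevicePointer, and the kernel then reads/writes host memory directly over the PCIe bus (the increment kernel and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each access to data travels over the PCIe bus.
__global__ void increment(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int N = 256;

    // Mapped pinned memory must be enabled before the context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int* h_data;  // host pointer to mapped, pinned memory
    int* d_data;  // device-side alias of the same allocation
    cudaHostAlloc(&h_data, N * sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_data, h_data, 0);

    for (int i = 0; i < N; ++i) h_data[i] = i;

    // No cudaMemcpy: the kernel touches host memory directly.
    increment<<<1, N>>>(d_data, N);
    cudaDeviceSynchronize();  // make sure the device writes have landed

    cudaFreeHost(h_data);
    return 0;
}
```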
The zero-copy avenue is recommended when two conditions are met. First, the memory is accessed in a coalesced fashion, as recommended for any global memory access in CUDA. Second, if multiple memory accesses are made, they should display a high degree of spatial and temporal coherence. If either of these conditions is not met, repeated accesses to zero-copy memory will lead to multiple low-bandwidth PCIe excursions. The knee-jerk reaction might be to use a
cudaMemcpy call as soon as the pinned data is used more than once. However, this might not be the winning strategy. Recall that today's cards have a decent amount of cache per thread and one might hit the cache if memory accesses display a certain degree of spatial and/or temporal coherence. A rule of thumb cannot be provided here because the decision depends on the specific code being executed and the underlying hardware. Complicating the decision process even further is the observation that accesses of the host memory, albeit displaying high latency and low bandwidth when not cached, can benefit from the ability of the warp scheduler to hide memory latency with useful execution of warps that are ready to go.
Unified Virtual Addressing (UVA)
Following the introduction of zero-copy memory in CUDA 2.0, CUDA 4.0 brought the host and device memory spaces one step closer by providing, on Fermi and later architectures, full integration of the corresponding virtual spaces. Essentially, no distinction was made between a host pointer and a device pointer. The CUDA runtime could identify where the data was stored based on the value of the pointer. In a unified virtual address (UVA) space setup, the runtime manipulates the pointer and allocation mappings used in device code (through
cudaMalloc), as well as pointers and allocation mappings used in host code (through
cudaHostAlloc) inside a single unified space. An immediate consequence is that the
kind flag in the
cudaMemcpy argument list becomes obsolete and is replaced by a generic
cudaMemcpyDefault. The true benefit of the UVA mechanism becomes apparent when transferring data between two devices. Indeed, if zero-copy provided a degree of convenience for data access operations, UVA took this one step further by improving data transfer tasks. Because all devices plus the host shared the same virtual address space, a data access or data transfer involving two devices was simplified in two respects:
- Straight use on device A of a pointer to access memory on device B,
- No need for staging the data transfer through host memory.
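Under UVA, the runtime infers the transfer direction from the pointer values themselves, so a device-to-device transfer can be written as a plain cudaMemcpy with cudaMemcpyDefault. The sketch below assumes a system with at least two GPUs that support peer access:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;

    float *d0_data, *d1_data;

    cudaSetDevice(0);
    cudaMalloc(&d0_data, bytes);
    cudaDeviceEnablePeerAccess(1, 0);  // let device 0 reach device 1

    cudaSetDevice(1);
    cudaMalloc(&d1_data, bytes);
    cudaDeviceEnablePeerAccess(0, 0);  // and vice versa

    // The runtime identifies both pointers as device allocations and
    // moves the data directly, without staging through host memory.
    cudaMemcpy(d1_data, d0_data, bytes, cudaMemcpyDefault);

    // Likewise, a kernel running on device 1 could dereference d0_data
    // directly once peer access is enabled.

    cudaFree(d1_data);
    cudaSetDevice(0);
    cudaFree(d0_data);
    return 0;
}
```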