Unified Memory in CUDA 6: A Brief Overview

The zero-copy and unified virtual addressing (UVA) features of CUDA are built around a pair of function calls. First, memory is allocated on the host in page-locked (pinned) fashion by means of cudaHostAlloc. Second, cudaMemcpy moves data without staging the transfer through an intermediate host buffer. Together, these two functions open the door to several data access and data transfer options: accessing host data from a device function or kernel, accessing data on one GPU from a different GPU, and moving data between GPUs through peer-to-peer transfers.
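As a minimal sketch of the zero-copy path (illustrative only; the kernel name scaleArray, the array size, and the launch configuration are invented for this example, and error checking is omitted), pinned host memory can be mapped into the device address space and touched directly from a kernel:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void scaleArray(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int N = 1024;
    float* hA = NULL;   // host pointer to page-locked, mapped memory
    float* dA = NULL;   // device alias of the same physical allocation

    // On some configurations, mapped pinned memory must be enabled before the first allocation.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate page-locked host memory that is mapped into the device address space.
    cudaHostAlloc((void**)&hA, N * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < N; i++) hA[i] = 1.0f;

    // Obtain the device pointer that aliases the pinned host buffer (zero-copy access).
    cudaHostGetDevicePointer((void**)&dA, hA, 0);

    scaleArray<<<(N + 255) / 256, 256>>>(dA, 2.0f, N);
    cudaDeviceSynchronize();   // the kernel wrote directly into host memory; wait before reading

    printf("hA[0] = %f\n", hA[0]);   // expect 2.0
    cudaFreeHost(hA);
    return 0;
}

No cudaMemcpy is involved here; on a discrete GPU every kernel access to hA travels over PCIe, which is why zero-copy is best suited to data that is touched only once or twice.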

CUDA 6 adds one extra layer of convenience to the CPU/GPU memory management task with the introduction of unified, or managed, memory (UM). Data is now stored and migrated in a user-transparent fashion so that, under circumstances spelled out shortly, host-side memory operations see the latencies and bandwidths of the host while device-side operations see those of the device. Moreover, the cudaHostAlloc and cudaMemcpy combination is no longer a requirement, which allows for a cleaner and more natural programming style.

The centerpiece of the UM concept and associated programming style is the CUDA runtime call cudaMallocManaged, which allocates memory on the device. As far as the host is concerned, no distinction is made in terms of accessing memory allocated with cudaMallocManaged or through malloc. However, although there is no difference in semantics, the programmer needs to be aware that different processors might experience different access times owing to different latencies and bandwidths. Incidentally, a processor is regarded as an independent execution unit with a dedicated memory management unit (MMU); that is, any GPU or CPU.

Figure 2: Prior to CUDA 6, there was a clear physical and logical separation between the host and device memories. UVA blurred the logical separation while UM went one step further. The elimination of the physical separation has already taken place on NVIDIA's Tegra K1 SoC architecture.

The use of unified memory is best illustrated through a simple example in which a kernel increments the value of each entry in an array of doubles. We first provide a pre-CUDA 6 implementation.

#include <iostream>
#include <cmath>

const int ARRAY_SIZE = 1000;

__global__ void increment(double* aArray, double val, unsigned int sz) {
    unsigned int indx = blockIdx.x * blockDim.x + threadIdx.x;
    if (indx < sz)
        aArray[indx] += val;
}

int main() {
    double* hA;
    double* dA;
    // Separate host and device allocations: the device array shadows the host array.
    hA = (double *)malloc(ARRAY_SIZE * sizeof(double));
    cudaMalloc(&dA, ARRAY_SIZE * sizeof(double));
    for (int i = 0; i < ARRAY_SIZE; i++)
        hA[i] = 1.*i;
    double inc_val = 2.0;
    // Explicitly stage data to the device, run the kernel, and copy the result back.
    cudaMemcpy(dA, hA, sizeof(double) * ARRAY_SIZE, cudaMemcpyHostToDevice);
    increment<<<2, 512>>>(dA, inc_val, ARRAY_SIZE);
    cudaMemcpy(hA, dA, sizeof(double) * ARRAY_SIZE, cudaMemcpyDeviceToHost);
    // Verify the result on the host.
    double error = 0.;
    for (int i = 0; i < ARRAY_SIZE; i++)
        error += std::fabs(hA[i] - (i + inc_val));

    std::cout << "Test: " << (error < 1.E-9 ? "Passed" : "Failed") << std::endl;
    cudaFree(dA);
    free(hA);
    return 0;
}

The managed memory version, which follows, is shorter: it makes no reference to cudaMemcpy and there is no need for the host-side malloc/free calls. In other words, the implementation does not require explicit shadowing of a chunk of device memory by a corresponding chunk of host memory. Allocating unified memory suffices; upon a cudaDeviceSynchronize call, the runtime makes it available in a coherent fashion to the host or to the device.

#include <iostream>
#include <cmath>

const int ARRAY_SIZE = 1000;

__global__ void increment(double* aArray, double val, unsigned int sz) {
    unsigned int indx = blockIdx.x * blockDim.x + threadIdx.x;
    if (indx < sz)
        aArray[indx] += val;
}

int main() {
    double* mA;
    // A single managed allocation, visible to both host and device.
    cudaMallocManaged(&mA, ARRAY_SIZE * sizeof(double));
    for (int i = 0; i < ARRAY_SIZE; i++)
        mA[i] = 1.*i;
    double inc_val = 2.0;
    increment<<<2, 512>>>(mA, inc_val, ARRAY_SIZE);
    // Coherence point: the host may touch mA only after the kernel has finished.
    cudaDeviceSynchronize();
    double error = 0.;
    for (int i = 0; i < ARRAY_SIZE; i++)
        error += std::fabs(mA[i] - (i + inc_val));
    
    std::cout << "Test: " << (error < 1.E-9 ? "Passed" : "Failed") << std::endl;
    cudaFree(mA);
    return 0;
}

It is instructive to compare zero-copy and unified memory. For the former, the memory is allocated in page-locked fashion on the host. A device thread has to reach out to get the data. No guarantee of coherence is provided as, for instance, the host could change the content of the pinned memory while the device reads its content. For UM, the memory is allocated on the device and transparently made available where needed. Specifically, upon a call to

cudaError_t cudaMallocManaged(void** devPtr, size_t size, unsigned int flags = cudaMemAttachGlobal);

the user has, in devPtr, a pointer to an address of a chunk of device memory. This address can be equally well manipulated on the device and the host (although, as illustrated below, not simultaneously). Note that cudaMallocManaged and cudaMalloc are semantically identical; in fact, the former can be used anywhere the latter is used.

UM enables a "single-pointer-to-data" memory model. For instance, the same pointer can be used on the host in a memcpy operation to copy a set of integers to an array mA, and then on the device to alter, just like in the code snippet above, the value of each entry in mA. The data in mA remains coherent as long as the host does not touch entries in mA while the GPU executes a kernel; the host can safely operate with/on mA only after a cudaDeviceSynchronize call. Failure to obey this rule leads to a segfault, as illustrated in the following example, lifted from the CUDA Programming Guide.

__device__ __managed__ int x, y = 2;
__global__ void kernel() {
    x = 10;
}
int main() {
    kernel<<<1, 1>>>();
    y = 20; // ERROR: CPU access concurrent with GPU
    cudaDeviceSynchronize();
    return 0;
}

The segfault goes back to the CPU's attempt to reference a managed-memory variable, y in this case, while the device is executing a kernel. The example below, lifted from the same source, illustrates the typical strategy for handling data stored in UM by highlighting the role played by the cudaDeviceSynchronize call. Note that any function that logically guarantees that the GPU has finished execution, such as cudaStreamSynchronize, cudaMemcpy, or cudaMemset, can take the place of cudaDeviceSynchronize below.

__device__ __managed__ int x, y = 2;
__global__ void kernel() {
    x = 10;
}
int main() {
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();    
    y = 20; // GPU is idle so access is OK 
    return 0;
}

These code samples illustrate a second CUDA feature added to support UM: the __managed__ type qualifier, which declares a variable that is statically allocated in managed memory. Besides the cudaMallocManaged call and the __managed__ qualifier, the third and last feature introduced in CUDA 6 to support UM is cudaStreamAttachMemAsync. Its role is to choreograph the interplay between managed memory and concurrency in multithreaded CPU applications. The backdrop for its use is the observation that pages from managed allocations that were touched by a host thread are migrated back to the GPU before any kernel launch; consequently, no overlap of kernel execution and data transfer can take place in that CUDA stream. Overlap is still possible, but it calls for the use of multiple streams: while one kernel executes in one stream, a different stream can engage in a data transfer. This strategy works because a managed allocation can be attached to a specific stream, which gives the programmer control over which allocations are synchronized on which kernel launches.
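A minimal sketch of this mechanism follows (assuming a single CPU thread for brevity; the kernel, array size, and launch configuration are invented for the example, and error checking is omitted). The managed allocation is attached to a user-created stream so that host access to that allocation only needs to wait on work queued in that stream:

#include <cuda_runtime.h>
#include <cstdio>

const int N = 1024;

__global__ void scaleKernel(double* data, double factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    double* data;
    cudaMallocManaged(&data, N * sizeof(double));

    // Associate the managed allocation with this stream; host accesses to 'data'
    // now only need to be synchronized against work queued in 'stream', not
    // against kernels launched by other host threads in other streams.
    cudaStreamAttachMemAsync(stream, data);
    cudaStreamSynchronize(stream);

    for (int i = 0; i < N; i++)
        data[i] = 1.0 * i;

    scaleKernel<<<(N + 255) / 256, 256, 0, stream>>>(data, 2.0, N);
    cudaStreamSynchronize(stream);   // per-stream coherence point

    printf("data[10] = %f\n", data[10]);   // expect 20.0
    cudaFree(data);
    cudaStreamDestroy(stream);
    return 0;
}

In a multithreaded CPU application, each host thread would create its own stream and attach its own managed allocations to it, which is what allows data transfers for one stream to overlap with kernel execution in another.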

As mentioned before, a unified memory allocation physically takes place in device memory on the device that happens to be active at the time of the allocation. When this memory is operated upon by the CPU, the migration to host happens at page-level resolution, which is typically 4KB. The runtime tracks dirty pages and detects page faults. It transparently moves (over the PCIe bus) only the dirty pages. Pages touched by the CPU (GPU) are moved back to the device (host) when needed. Coherence points are kernel launches and device/stream synchronizations. For now, there is no oversubscription of the device memory. If several devices are available, the largest amount of managed memory that can be allocated is the smallest of the available device memories.
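The last restriction can be inspected programmatically. The sketch below (illustrative only, error checking omitted) loops over the available devices and queries cudaMemGetInfo; the smallest total device memory reported bounds the size of a managed allocation in a multi-GPU system under the scenario just described:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    size_t minTotal = (size_t)-1;
    for (int dev = 0; dev < deviceCount; dev++) {
        cudaSetDevice(dev);
        size_t freeMem = 0, totalMem = 0;
        cudaMemGetInfo(&freeMem, &totalMem);   // memory of the currently active device
        printf("Device %d: %zu MB free, %zu MB total\n",
               dev, freeMem >> 20, totalMem >> 20);
        if (totalMem < minTotal)
            minTotal = totalMem;
    }
    if (deviceCount > 0)
        printf("Managed allocations are bounded by roughly %zu MB\n", minTotal >> 20);
    return 0;
}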

A Short Example: Unified Memory and Thrust

Thrust is a CUDA C++ template library of parallel algorithms and data structures. With an interface similar to the C++ Standard Template Library (STL), Thrust provides a high-level interface to high-performance, GPU-accelerated implementations of common algorithms, such as sorts, scans, transforms, and reductions.

Scientific and engineering CUDA codes, such as the one briefly described at the beginning of this article, often involve a combination of custom kernels and calls to Thrust algorithms. To allow interoperability with the entire CUDA ecosystem of libraries, tools, and user kernels, Thrust provides a simple API for wrapping CUDA device pointers so that they can be passed to a Thrust algorithm (thrust::device_pointer_cast), and for extracting the raw device pointer from a Thrust device_ptr or device_vector (thrust::raw_pointer_cast) so that it can be used in custom kernels.
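A brief sketch of the two casts follows (the kernel, sizes, and initial values here are invented for the example, and error checking is omitted):

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

__global__ void addOne(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main() {
    const int N = 1000;

    // Wrap a raw CUDA device pointer so it can be handed to a Thrust algorithm.
    int* dRaw;
    cudaMalloc((void**)&dRaw, N * sizeof(int));
    cudaMemset(dRaw, 0, N * sizeof(int));
    thrust::device_ptr<int> dWrapped = thrust::device_pointer_cast(dRaw);
    int sumRaw = thrust::reduce(dWrapped, dWrapped + N);        // expect 0

    // Extract the raw pointer backing a device_vector so a custom kernel can use it.
    thrust::device_vector<int> vec(N, 1);
    addOne<<<(N + 255) / 256, 256>>>(thrust::raw_pointer_cast(vec.data()), N);
    cudaDeviceSynchronize();
    int sumVec = thrust::reduce(vec.begin(), vec.end());        // expect 2 * N

    std::cout << "sumRaw = " << sumRaw << ", sumVec = " << sumVec << std::endl;
    cudaFree(dRaw);
    return 0;
}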

By default, Thrust relies on implicit algorithm dispatch, using tags associated with its vector containers. For example, the system tag for the iterators of thrust::device_vector is thrust::cuda::tag, so algorithms dispatched on such iterators will be parallelized in the CUDA system. This does not work with memory allocated through cudaMallocManaged. To avoid having to introduce new vectors, or to wrap existing managed memory, simply to use a parallel algorithm, Thrust algorithms can be invoked with an explicitly specified execution policy. This approach is illustrated in the example below, where the array mA could also be passed directly, as is, to a host function or a CUDA kernel.

#include <iostream>
#include <cmath>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/system/omp/execution_policy.h>

const int ARRAY_SIZE = 1000;

int main(int argc, char **argv) {
    double* mA;
    cudaMallocManaged(&mA, ARRAY_SIZE * sizeof(double));
    thrust::sequence(mA, mA + ARRAY_SIZE, 1);
    // Reduce on the GPU, dispatching explicitly to the CUDA backend.
    double maximumGPU = thrust::reduce(thrust::cuda::par, mA, mA + ARRAY_SIZE, 0.0,
                                       thrust::maximum<double>());
    cudaDeviceSynchronize();
    // Reduce on the CPU, dispatching explicitly to the OpenMP backend.
    double maximumCPU = thrust::reduce(thrust::omp::par, mA, mA + ARRAY_SIZE, 0.0,
                                       thrust::maximum<double>());
    std::cout << "GPU reduce: " << (std::fabs(maximumGPU - ARRAY_SIZE) < 1e-10 ? "Passed" : "Failed") << std::endl;
    std::cout << "CPU reduce: " << (std::fabs(maximumCPU - ARRAY_SIZE) < 1e-10 ? "Passed" : "Failed") << std::endl;
    cudaFree(mA);
    return 0;
}

With this model, the programmer specifies only the Thrust backend of interest (how the algorithm should be parallelized), without being concerned about the system being able to dereference the iterators provided to the algorithm (where the data "lives"). This is consistent with the simpler programming and memory management enabled by UM.

Additional examples of using Thrust with UM are available in the GitHub Thrust-test repository. Note that using Thrust with managed memory requires the latest development version of Thrust, v1.8 (the CUDA Toolkit only provides Thrust v1.7).

Unified Memory at Work in Scientific Computing

Solving sparse systems of linear equations Ax = b, where A is either a symmetric or a nonsymmetric nonsingular matrix, is one of the most common tasks in scientific computing. For instance, numerical solution approaches for the practical problems discussed at the beginning of this article oftentimes require the solution of large sparse linear systems. Here, we briefly introduce SPIKE::GPU, a library for the GPU solution of mid-size sparse linear systems, and use it to assess the performance of managed memory in CUDA 6.

SPIKE::GPU is an open source sparse linear solver for systems of medium size; that is, up to approximately 0.5 million unknowns. The solution strategy has three steps. In step 1, a sparse matrix is reordered; that is, entries in A are moved around by shifting the order of the rows and columns in the matrix to accomplish two things (Figure 3):

  • Reordering 1: Move the large entries (in absolute value) of the A matrix to the diagonal to render the reordered matrix as close as possible to a diagonally dominant matrix,
  • Reordering 2: Reduce the sparse matrix to a dense band matrix (bandwidth reduction), or equivalently, group all the matrix entries close to the matrix diagonal.

In the second step of the solution sequence, the reordered matrix is partitioned into submatrices A1 through Ap that are LU-factorized independently. Some coupling exists between these submatrices, and it is accounted for through the parallel computation of bridging terms (called "spikes" in the literature). In the third and final step of the solution strategy, given that several approximations are made during the factorization of the matrix A, SPIKE::GPU relies on an iterative solver that is preconditioned by the approximate factorization obtained at the end of step 2. The iterative solver considered in step 3 belongs to the family of so-called Krylov subspace methods.

Figure 3: Sparsity pattern of the original sparse matrix (left), after applying the four-stage reordering algorithm (center), and after applying bandwidth reordering (right). The matrix illustrated is "Garon2" from the University of Florida collection of sparse matrices. The dimension of the matrix is 13535, the number of nonzero entries is 373235, and the half-bandwidth is 13531. After the first reordering, the half-bandwidth doesn't change, but note that nonzero entries populate the entire diagonal of the matrix due to the migration of "heavy" entries to the diagonal. After the band-reduction reordering, the half-bandwidth becomes 585, and from there on the matrix is treated as dense within the band.

The performance analysis for UM is done in conjunction with step 1 of the algorithm that seeks to reorder the matrix A such that its "heavy" entries are moved to the diagonal. The benefit of this substep is twofold: It decreases the probability of encountering a zero pivot during the factorization of the diagonal blocks A1 through Ap, and it increases the quality of the approximations to the coupling off-diagonal blocks (the "spikes"). Reordering 1 of step 1 in itself has four stages:

  • Stage 1: Form a bipartite graph reflecting the sparsity pattern of the matrix A — easy to do on the GPU
  • Stage 2: Find a partial reordering — currently relatively hard to do on the GPU
  • Stage 3: Refine partial reordering — currently very hard to do on the GPU
  • Stage 4: Extract permutation from the bipartite graph — easy to do on the GPU

Dan Negrut, an NVIDIA CUDA Fellow, is an ME Associate Professor at the University of Wisconsin-Madison, where he leads the Simulation-Based Engineering Lab. Radu Serban is an Associate Scientist at the University of Wisconsin-Madison specializing in computational dynamics and high-performance computing. Ang Li is a Ph.D. student in ECE at the University of Wisconsin-Madison working on high-performance computing and sparse linear algebra. Andrew Seidl is a Ph.D. student in Mechanical Engineering at the University of Wisconsin-Madison working on high-performance and distributed computing.

