Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Channels ▼


A Massively Parallel Stack for Data Allocation

Dynamic parallelism, or the ability to initiate kernel launches from the CUDA device, is an important evolutionary step in the CUDA software development platform. With it, developers can perform variable amounts of work based on divide-and-conquer algorithms and in-memory data structures such as trees and graphs — entirely on the GPU without host intervention. The variable nature of dynamic parallelism means that GPU applications can easily consume varying amounts of memory and generate inconstant amounts of data for later computational processing.

This article provides working code for a general-purpose massively parallel data structure, MappedTypeArray, that has the ability to accept variable numbers of results from dynamic algorithms and act as a fast malloc() method that can very quickly hand out items from a pre-allocated array of arbitrary type. Such a data structure requires: a fast, massively parallel counter that will not serialize execution when many parallel threads attempt to perform an increment at the same time, and the ability to create and utilize generic C++ objects transparently on both the host and device. The allocated memory and stack can be utilized on the host, which means this data structure runs well on multicore processors as well.

A massively parallel stack requires a robust low-wait counter; specifically, a ParallelCounter class (see Atomic Operations and Low-Wait Algorithms in CUDA) that preserves parallel performance even when every thread attempts to increment a single counter at the same time. It also needs the SHADOW_MACRO() GPU object management macro, which allows generic, standard layout C++ objects to move transparently between the host and device. C++ type trait methods are a means to ensure that C++ objects are usable on both the host and device. In particular, the SHADOW_MACRO() implementation is designed to preserve the lowest common denominator standard layout for compatibility with C structures. C programmers should be able to utilize the capabilities of this C++ code by wrapping a C-callable functional interface around specific objects (example wrappers are not provided).

As I've mentioned in the previous articles in this series, NVIDIA is moving to a unified virtual environment where objects and object pointers are transparently accessible from both host and GPU devices. This article will use mapped memory to simplify the handling of pointers shared between the host and device for object creation and post-processing. Mapped memory does present a performance problem because it is not currently cached (as of CUDA-5). The heavy reuse of data in a mapped object can result in a two orders-of-magnitude loss of CUDA performance. This article gets around such poor performance by using mapped memory only for linear traversals during object creation and results reporting. Using this technique, the MappedTypeArrayclass in the example code exhibits high performance while enabling the creation and use of extremely complex tree and graph objects on both the host and GPU. More adventurous readers may wish modify the example code to convert mapped memory into high-speed device memory, either through their own code or through the use of the SHADOW_MACRO().

A Motivating Example

The general-purpose CUDA memory allocator accessed via cudaMalloc() and cudaFree() is designed to service requests for arbitrarily sized memory regions. For many problems, a simpler memory allocator that manages only constant sized memory regions can be both significantly faster and preserve high degrees of parallelism in a massively parallel environment.

Listing One shows the need for the MappedTypeArray class by illustrating the cost of allocating 1 million individual objects, as opposed to a single block allocation of a million objects. The speed of the bulk allocation can be exploited by the MappedTypeArrayclass.

Listing One: Complete source for testMalloc.cu.

#include <iostream>
using namespace std;
#include <cstdio>

#define MAX_ELEMENTS 2000000

__device__ void *ara1;
__global__ void performBlockAlloc(int nElements, int sizeofElement)
  ara1 = malloc(nElements * sizeofElement);

__device__ void *ara2[MAX_ELEMENTS];
__global__ void performIndividualAlloc(int nElements, int sizeofElement)
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  while(tid < nElements) {
    ara2[tid] = malloc(sizeofElement);
    tid += blockDim.x * gridDim.x;

int main(int argc, char *argv[])
  try {
    if(argc < 3) {
      cerr << "Use: nElements size (in bytes) of each element" << endl;
      return -1;

    int nElements=atoi(argv[1]);
    int sizeElements=atoi(argv[2]);

    cerr << "Using: nElements " << nElements << " " << sizeElements << endl;

    performBlockAlloc<<<1,1>>>(nElements, sizeElements);
    if(cudaPeekAtLastError() != cudaSuccess) throw "single allocation failed!";

    int nBlocks = (nElements/256 + 1);
    nBlocks = (nBlocks > 65535)?65535:nBlocks;
    performIndividualAlloc<<<nBlocks,256>>>(nElements, sizeElements);
    if(cudaPeekAtLastError() != cudaSuccess) throw "individual allocations failed!";

  } catch (const char * e) {
    cerr << "caught exception: " << e << endl;
    cudaError err;
    if((err=cudaGetLastError()) != cudaSuccess)
      cerr << "CUDA reports: " << cudaGetErrorString(err) << endl;
    return -1;
  return 0;

Compiling this program with nvcc (with –arch=sm_20 or greater) and running it with nvprof demonstrates the tremendous cost of individual malloc() or new operations. In Listing Two (the profiler output), allocating 1 million individual objects ran 445x slower than creating a single memory region that contains a million objects.

Listing Two: Profile information for testMalloc.cu.

$ nvprof ./testMalloc 1000000 128
======== NVPROF is profiling testMalloc...
======== Command: testMalloc 1000000 128
Using: nElements 1000000 128
======== Profiling result:
 Time(%)      Time   Calls       Avg       Min       Max  Name
  100.00     7.86s       1     7.86s     7.86s     7.86s  performIndividualAlloc(int, int)
    0.00   17.66us       1   17.66us   17.66us   17.66us  performBlockAlloc(int, int)

A Massively-Parallel Memory Manager for Both Host and Device Data

To meet the need for a massively parallel stack and memory allocator, the MappedTypeArray class ties the two orders-of-magnitude faster performance of the CUDA memory allocator when creating an array of many objects with the performance robustness of the ParallelCounter class and the transparency of the SHADOW_MACRO().

More specifically, this article provides working source code for a MappedTypeArray class that uses cudaHostMalloc()to allocate an array containing a user-specified number of objects of type T. The MappedTypeArray class utilizes SHADOW_MACRO(), so it can transparently create and move data between the host and device memory. The MappedTypeArray class also utilizes a variant of the fast ParallelCounter class to provide an index into the unallocated portions of the type array. This new class, called BoundedParallelCounter, also provides methods that return the total count of the number of objects that have been allocated, and a means of traversing all allocated items.

The BoundedParallelCounter class has been designed to provide unique integers that cover all possible integers in the range [0 .. nTypes), regardless of how it is incremented and queried in a massively parallel environment.

The challenge with the ParallelCounter class is that it's designed so that all the threads in a warp can atomically modify the count vector in parallel without blocking any threads. Under this common situation, the value of a ParallelCounter object will appear to suddenly increase by warp size. While atomicAdd() operations are guaranteed to correctly increment a single integer in the count vector in the ParallelCounter class, no guarantees can be made that a sum across all elements in count will accurately reflect the contributions of all the threads while the counter is being modified in parallel. Thus, the ParallelCounter class as currently designed cannot support the memory allocation needs of the MappedTypeArray class.

A Bounded Parallel Counter Class for Memory Allocation

The source code is provided in this section for a BoundedParalleCounter class that will generate all unique integers between [0.. nTypes), regardless of how the object is incremented and queried in a massively parallel environment. Note that this does not imply that the integers will be delivered in consecutive order! As with the ParallelCounter class, the BoundedParallelCounter class minimizes blocking to preserve high parallel performance under all usage scenarios.

The main idea behind the BoundedParallelCounter class is that the value contained in each integer in the count array represents a unique index that can be used for memory allocation. The value of this unique index is calculated much like the index in a 2D array stored in contiguous memory (for example, count[i] * maxColCount).  The idea is illustrated in Figure 1.

Figure 1: Unique indexes calculated by the BoundedParallelCounter class.

Just like the ParallelCounter class, each element of the count array is atomically modified. The old value returned by atomicAdd() is used in the calculation of the correct index into the type array, which means no race condition can occur even when another thread incrementing the same element in the count array at the same time. The other thread will simply get the next index into the type array. This technique is not guaranteed to generate sequential indices, which is acceptable so long as duplicate indices are not generated.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.