Dynamic parallelism, or the ability to initiate kernel launches from the CUDA device, is an important evolutionary step in the CUDA software development platform. With it, developers can perform variable amounts of work based on divide-and-conquer algorithms and in-memory data structures such as trees and graphs entirely on the GPU without host intervention. The variable nature of dynamic parallelism means that GPU applications can easily consume varying amounts of memory and generate unpredictable amounts of data for later computational processing.
This article provides working code for a general-purpose massively parallel data structure, MappedTypeArray, that has the ability to accept variable numbers of results from dynamic algorithms and act as a fast malloc() replacement that hands out items from a pre-allocated array of arbitrary type. Such a data structure requires: a fast, massively parallel counter that will not serialize execution when many parallel threads attempt to perform an increment at the same time, and the ability to create and utilize generic C++ objects transparently on both the host and device. The allocated memory and stack can be utilized on the host, which means this data structure runs well on multicore processors as well.
A massively parallel stack requires a robust low-wait counter; specifically, a ParallelCounter class (see Atomic Operations and Low-Wait Algorithms in CUDA) that preserves parallel performance even when every thread attempts to increment a single counter at the same time. It also needs the SHADOW_MACRO() GPU object management macro, which allows generic, standard-layout C++ objects to move transparently between the host and device. C++ type traits provide a compile-time means of ensuring that C++ objects are usable on both the host and device. In particular, the SHADOW_MACRO() implementation is designed to preserve the lowest-common-denominator standard layout for compatibility with C structures. C programmers should be able to utilize the capabilities of this C++ code by wrapping a C-callable functional interface around specific objects (example wrappers are not provided).
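As a concrete illustration of the type-trait idea, a wrapper can refuse to compile for any type that is not standard layout. The guard below is a hedged sketch, not the SHADOW_MACRO() implementation, and it assumes a compiler that provides the C++11 <type_traits> header:

#include <type_traits>

// Sketch only: reject, at compile time, any type T that is not standard layout,
// since only standard-layout objects can be copied bit-for-bit between host
// and device memory.
template<typename T>
struct HostDeviceSharable {
   static_assert(std::is_standard_layout<T>::value,
                 "T must be standard layout to be shared between host and device");
   T data;
};

struct Plain { int a; float b; };           // standard layout: compiles
// struct HasVirtual { virtual void f(); }; // not standard layout: would trigger the static_assert

HostDeviceSharable<Plain> ok;               // fine
// HostDeviceSharable<HasVirtual> bad;      // compile-time error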
As I've mentioned in the previous articles in this series, NVIDIA is moving to a unified virtual environment where objects and object pointers are transparently accessible from both host and GPU devices. This article uses mapped memory to simplify the handling of pointers shared between the host and device for object creation and post-processing. Mapped memory does present a performance problem because it is not currently cached (as of CUDA 5). Heavy reuse of data in a mapped object can result in a two orders-of-magnitude loss of CUDA performance. This article gets around such poor performance by using mapped memory only for linear traversals during object creation and results reporting. Using this technique, the MappedTypeArray class in the example code exhibits high performance while enabling the creation and use of extremely complex tree and graph objects on both the host and GPU. More adventurous readers may wish to modify the example code to convert mapped memory into high-speed device memory, either through their own code or through the use of SHADOW_MACRO().
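The mapped (zero-copy) memory pattern described above can be sketched in a few lines. The buffer and kernel names below are illustrative, not part of the article's code; the point is that a single cudaHostAlloc() buffer becomes visible to the device through a pointer returned by cudaHostGetDevicePointer(), so no explicit cudaMemcpy() is needed:

#include <cstdio>

__global__ void fillKernel(int *d_buf, int n)
{
   int tid = threadIdx.x + blockIdx.x * blockDim.x;
   if(tid < n) d_buf[tid] = tid;   // single linear write, so the lack of caching does not hurt
}

int main()
{
   const int n = 1024;
   int *h_buf = NULL, *d_buf = NULL;

   cudaSetDeviceFlags(cudaDeviceMapHost);                           // must precede context creation
   cudaHostAlloc((void**)&h_buf, n * sizeof(int), cudaHostAllocMapped);
   cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);              // device view of the same memory

   fillKernel<<<(n + 255)/256, 256>>>(d_buf, n);
   cudaDeviceSynchronize();                                         // host reads the results directly

   printf("h_buf[100] = %d\n", h_buf[100]);
   cudaFreeHost(h_buf);
   return 0;
}

On 64-bit platforms with unified virtual addressing the host pointer itself can usually be passed to the kernel, but cudaHostGetDevicePointer() works on all mapped-memory-capable devices.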
A Motivating Example
The general-purpose CUDA memory allocator accessed via cudaMalloc() and cudaFree() is designed to service requests for arbitrarily sized memory regions. For many problems, a simpler memory allocator that manages only constant-sized memory regions can be significantly faster while still preserving a high degree of parallelism in a massively parallel environment.
Listing One shows the need for the MappedTypeArray class by illustrating the cost of allocating 1 million individual objects, as opposed to a single block allocation of a million objects. The speed of the bulk allocation can be exploited by the MappedTypeArray class.
Listing One: Complete source for testMalloc.cu.
#include <iostream>
using namespace std;
#include <cstdio>
#include <cstdlib>

#define MAX_ELEMENTS 2000000

// Allocate all the objects with a single device-side malloc() call.
__device__ void *ara1;
__global__ void performBlockAlloc(int nElements, int sizeofElement)
{
   ara1 = malloc(nElements * sizeofElement);
}

// Allocate each object with its own device-side malloc() call.
__device__ void *ara2[MAX_ELEMENTS];
__global__ void performIndividualAlloc(int nElements, int sizeofElement)
{
   int tid = threadIdx.x + blockIdx.x * blockDim.x;
   while(tid < nElements) {
      ara2[tid] = malloc(sizeofElement);
      tid += blockDim.x * gridDim.x;
   }
}

int main(int argc, char *argv[])
{
   try {
      if(argc < 3) {
         cerr << "Use: nElements size (in bytes) of each element" << endl;
         return -1;
      }
      int nElements=atoi(argv[1]);
      int sizeElements=atoi(argv[2]);
      cerr << "Using: nElements " << nElements << " " << sizeElements << endl;

      performBlockAlloc<<<1,1>>>(nElements, sizeElements);
      if(cudaPeekAtLastError() != cudaSuccess) throw "single allocation failed!";

      int nBlocks = (nElements/256 + 1);
      nBlocks = (nBlocks > 65535)?65535:nBlocks;
      performIndividualAlloc<<<nBlocks,256>>>(nElements, sizeElements);
      if(cudaPeekAtLastError() != cudaSuccess) throw "individual allocations failed!";

   } catch (const char * e) {
      cerr << "caught exception: " << e << endl;
      cudaError err;
      if((err=cudaGetLastError()) != cudaSuccess)
         cerr << "CUDA reports: " << cudaGetErrorString(err) << endl;
      cudaDeviceReset();
      return -1;
   }
   return 0;
}
Compiling this program with nvcc (with -arch=sm_20 or greater) and running it with nvprof demonstrates the tremendous cost of individual malloc() or new operations. In Listing Two (the profiler output), allocating 1 million individual objects ran roughly 445,000 times slower than creating a single memory region that contains a million objects.
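For reference, a build line such as the following works for this example (the file and executable names simply match the listings; sm_20 or newer is required because both kernels call malloc() on the device):

nvcc -arch=sm_20 testMalloc.cu -o testMalloc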
Listing Two: Profile information for testMalloc.cu.
$ nvprof ./testMalloc 1000000 128
======== NVPROF is profiling testMalloc...
======== Command: testMalloc 1000000 128
Using: nElements 1000000 128
======== Profiling result:
 Time(%)     Time  Calls      Avg      Min      Max  Name
  100.00    7.86s      1    7.86s    7.86s    7.86s  performIndividualAlloc(int, int)
    0.00  17.66us      1  17.66us  17.66us  17.66us  performBlockAlloc(int, int)
A Massively-Parallel Memory Manager for Both Host and Device Data
To meet the need for a massively parallel stack and memory allocator, the MappedTypeArray class combines the orders-of-magnitude performance advantage the CUDA memory allocator shows when creating an array of many objects in a single request with the performance robustness of the ParallelCounter class and the transparency of SHADOW_MACRO().
More specifically, this article provides working source code for a MappedTypeArray class that uses cudaHostAlloc() to allocate an array containing a user-specified number of objects of type T. The MappedTypeArray class utilizes SHADOW_MACRO(), so it can transparently create and move data between host and device memory. The MappedTypeArray class also utilizes a variant of the fast ParallelCounter class to provide an index into the unallocated portions of the type array. This new class, called BoundedParallelCounter, also provides methods that return the total count of objects that have been allocated, and a means of traversing all allocated items.
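To make the description concrete, here is a minimal sketch of the kind of fixed-type pool the class implements. The names and layout below are illustrative assumptions, not the article's actual MappedTypeArray source, and the single atomic counter shown here is exactly the contention point that the real class removes by using a BoundedParallelCounter:

#include <cstdint>

// Sketch only -- not the article's MappedTypeArray implementation.
// A fixed-capacity pool of objects of type T held in mapped memory.
template<typename T>
struct TypePoolSketch {
   T        *pool;       // mapped array of 'capacity' elements, visible to host and device
   uint32_t *nextFree;   // index of the next unused element (also in mapped memory)
   uint32_t  capacity;

   __device__ T* getNew() {                       // hand out the next unused T
      uint32_t i = atomicAdd(nextFree, 1u);
      return (i < capacity) ? (pool + i) : NULL;  // NULL once the pool is exhausted
   }
   __host__ __device__ uint32_t nAllocated() const { return *nextFree; }
   __host__ __device__ T& operator[](uint32_t i) const { return pool[i]; }
};

// One bulk cudaHostAlloc() call replaces millions of individual allocations.
// (Assumes unified virtual addressing, so the mapped host pointers are usable on the device.)
template<typename T>
TypePoolSketch<T> makePool(uint32_t capacity)
{
   TypePoolSketch<T> p;
   p.capacity = capacity;
   cudaHostAlloc((void**)&p.pool, capacity * sizeof(T), cudaHostAllocMapped);
   cudaHostAlloc((void**)&p.nextFree, sizeof(uint32_t), cudaHostAllocMapped);
   *p.nextFree = 0;
   return p;
}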
The BoundedParallelCounter class has been designed to provide unique integers that cover all possible integers in the range [0 .. nTypes), regardless of how it is incremented and queried in a massively parallel environment.
The challenge with the ParallelCounter class is that it's designed so that all the threads in a warp can atomically modify the count vector in parallel without blocking any threads. Under this common situation, the value of a ParallelCounter object will appear to suddenly increase by the warp size. While atomicAdd() operations are guaranteed to correctly increment a single integer in the count vector of the ParallelCounter class, no guarantees can be made that a sum across all elements in count will accurately reflect the contributions of all the threads while the counter is being modified in parallel. Thus, the ParallelCounter class as currently designed cannot support the memory allocation needs of the MappedTypeArray class.
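A short device snippet (illustrative only, not taken from the ParallelCounter source) shows the problem: each atomicAdd() is individually correct, but the sum computed by one thread races with increments from other threads, so two threads can arrive at the same value and it cannot serve as an allocation index:

__device__ unsigned int count[32];   // the sub-counters of a ParallelCounter-style object

__device__ unsigned int brokenAllocIndex(int nSub)
{
   // Correct and low-contention: each thread bumps one sub-counter...
   atomicAdd(&count[threadIdx.x % nSub], 1u);

   // ...but this sum races with increments made by other threads, so it is not
   // guaranteed to be unique across threads and cannot be used to hand out slots.
   unsigned int sum = 0;
   for(int i=0; i < nSub; i++) sum += count[i];
   return sum;
}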
A Bounded Parallel Counter Class for Memory Allocation
The source code is provided in this section for a BoundedParallelCounter class that will generate all unique integers between [0 .. nTypes), regardless of how the object is incremented and queried in a massively parallel environment. Note that this does not imply that the integers will be delivered in consecutive order! As with the ParallelCounter class, the BoundedParallelCounter class minimizes blocking to preserve high parallel performance under all usage scenarios.
The main idea behind the BoundedParallelCounter class is that the value contained in each integer in the count array represents a unique index that can be used for memory allocation. The value of this unique index is calculated much like an index into a 2D array stored in contiguous memory (for example, i * maxColCount + count[i], where i is the position of the element within the count array and maxColCount is the maximum number of indices each element may hand out). The idea is illustrated in Figure 1.
Figure 1: Unique indexes calculated by the BoundedParallelCounter class.
Just like the ParallelCounter class, each element of the count array is atomically modified. The old value returned by atomicAdd() is used in the calculation of the correct index into the type array, which means no race condition can occur even when another thread increments the same element of the count array at the same time. The other thread will simply get the next index into the type array. This technique is not guaranteed to generate sequential indices, which is acceptable so long as duplicate indices are never generated.
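A few lines of device code capture the indexing scheme of Figure 1. This is a sketch with illustrative names and sizes, not the article's BoundedParallelCounter source; the essential point is that the old value returned by atomicAdd() is private to the calling thread, so combining it with the sub-counter's position yields a unique index:

#define N_SUB_COUNTERS 32          // number of elements in the count array
#define MAX_COL_COUNT  (1 << 20)   // maximum indices each sub-counter may hand out

__device__ unsigned int count[N_SUB_COUNTERS];   // zero-initialized

// Returns a unique index in [0 .. N_SUB_COUNTERS * MAX_COL_COUNT), even when many
// threads call this concurrently. Indices are unique but not sequential.
// (A production version must also check that 'old' stays below MAX_COL_COUNT.)
__device__ unsigned int getUniqueIndex()
{
   unsigned int i   = threadIdx.x % N_SUB_COUNTERS;   // spread threads across sub-counters
   unsigned int old = atomicAdd(&count[i], 1u);       // old value belongs to this thread alone
   return i * MAX_COL_COUNT + old;                    // 2D-style index: "row" i, "column" old
}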