Channels ▼


A Massively Parallel Stack for Data Allocation

To make this approach work, it is essential that no element in the count array exceed maxColCount. For this reason, the getIndex() method checks that the atomicAdd() never exceeds the maxColCount limit passed in the constructor. If it does, the count is reduced by atomicSub(), and getIndex() attempts to scavenge an index from a different element in the count array. The host version of getIndex() attempts to uniformly pick from all elements in the count array by choosing an index based on the CPU Time Stamp Counter (TSC) counter. Picking from a uniform distribution keeps a serial CPU code from causing unnecessary scavenge() operations on the GPU. It is assumed that the CUDA threads will arbitrarily request memory, which means that the value of threadIdx.x can be used to uniformly increment the count array. The code should be modified if either of these assumptions is incorrect.

Listing Three: Source for the BoundedParallelCounter class (part 1 of 2 of MappedTypeArray.hpp).

#include <stdint.h>

#include "ParallelCounter.hpp"

// A parallel counter where the user provides the pointer to the
// counter array.  This means mapped memory can be used to simplify
// management.
template <int N_ATOMIC=32>
class BoundedParallelCounter {
  static const uint32_t uint32_max= 0xffffffff;
  uint32_t perCounterMax;
  uint32_t count[N_ATOMIC];

  inline __host__ __device__ uint32_t limit(int idx) {
    return( (idx+1) * perCounterMax); 

  inline __host__ __device__ uint32_t getIndex(int idx) {
    uint32_t val;
#if !defined(__CUDA_ARCH__)
    val = atomicAdd(count+idx, 1);
    if(val >= perCounterMax ) {
#if !defined(__CUDA_ARCH__)
      atomicSub(count+idx, 1); // return count to previous value
      return uint32_max; // signal exceeded the limit
    return val + idx * perCounterMax;

  inline __host__ __device__ uint32_t scavenge(int idx) 
    uint32_t val;
    for(int i=idx+1; i < N_ATOMIC; i++) {
      if(this->count[i] >= perCounterMax) 
	continue; // no chance of finding a free region
      val = getIndex(i);
      if(val != uint32_max) return val; // if there is a race, see if we won
    for(int i=0; i < idx; i++) {
      if(this->count[i] >= perCounterMax) 
	continue; // no chance of finding a free region
      val = getIndex(i);
      if(val != uint32_max) return val; // if there is a race, see if we won
    return(uint32_max); // no index found

  BoundedParallelCounter() : perCounterMax(0) {}
  BoundedParallelCounter(uint32_t perCounterMax) : 
    perCounterMax(perCounterMax) {}

  inline __host__ __device__ uint32_t increment() {
#if !defined(__CUDA_ARCH__)
    // select a random value or potentially use thread id in posix
    // systems. This keeps the host from dominating a single counter. 
    int idx;
      uint32_t lo;
      __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(idx));
      idx=lo % N_ATOMIC;
    // a failsafe for when reading the tdtsc does not work
    //idx = rand() % N_ATOMIC;
    int idx = (threadIdx.x % N_ATOMIC);
    // Use the following in cases where threadIdx.x is less than N_ATOMIC
    //int idx = (threadIdx.x + blockIdx.x * blockDim.x) % N_ATOMIC;
    uint32_t val= getIndex(idx);
    if(val == uint32_max) { // Find a free value (can increase serialization)
      val = scavenge(idx);

    if(val == uint32_max) { 
      // The counter has reached all limits and cannot increment further.
#if !defined(__CUDA_ARCH__)
      throw "No more free items in mapped memory";
      // This is an excellent place for an exception once they are implemented.
      // right now cause the kernel to trap and exit
    return val;

  // Set the individual counters using a single thread on host or device.
  inline __host__ __device__ void set(uint32_t x) {
    for(int i=0; i < N_ATOMIC; i++) count[i] = x;

  // Get the counts using a simple single threaded method
  inline __host__ __device__ uint32_t getCount() {
    // Assume single thread. This is a great place to use
    // __CUDA_ARCH__ to selectively create a parallel version.
    uint32_t total=0;
    for(int i=0; i < N_ATOMIC; i++) total += count[i];
    return total;

  inline __host__ __device__ uint32_t countIndex(int i) {
    return this->count[i]; 
  inline __host__ __device__ uint32_t getColCount() { return perCounterMax; }

#ifndef N_ATOMIC
#define N_ATOMIC 32

The MappedTypeArray Class

The MappedTypeArray class utilizes SHADOW_MACRO(), which means it can transfer itself for use on the device as well as the host. It is important to note that the constructor allocates an array of type T objects in mapped memory. For simplicity, this was required so the data could be accessed on both the host and device from the same pointer. The use of mapped memory is why the cudaMemcpy() operations of the object that contains a pointer will work as a single mapped pointer that can access the same data on both the host and device.

Much like malloc(), programmers call the MappedTypeArray alloc() method to get a new object of type T. For simplicity, no free() method is provided in this example. The programmer can use the allocated object on either the host or device, subject to the performance limitations of mapped memory (meaning the data is not cached, so each access can incur a PCIe data transfer cost). When required, the alloc() method will scavenge to find unallocated objects so that all the mapped memory objects will be utilized. The [] operator is defined along with size() so the allocated memory regions can be conveniently walked using a for loop. As will be demonstrated in the test code, the for loop can be a parallel OpenMP for loop on the host.

Listing Four: MappedTypeArray.hpp (part 2 of 2).

// Use a Bounded Parallel Counter to grab from a pre-allocated array of type.
template <class T>
class MappedTypeArray {

  BoundedParallelCounter<N_ATOMIC> boundedCount;
  T *typePtr; // the array of preallocated types

  MappedTypeArray(int nType) {
    // Initialize the SHADOW_MACRO variables
    usrManaged=false; my_d_ptr = NULL;

    // Determine the per column count maximum
    uint32_t perCounterMax = nType/N_ATOMIC + ((nType % N_ATOMIC)?1:0);

    // allocate all the space for the parallel counters at one time
    boundedCount = BoundedParallelCounter<N_ATOMIC>(perCounterMax);

    // Zero the individual counters in the parallel counter on host and device

    // map the type array (increased to a multiple of N_ATOMIC)
    cudaMallocHost(&typePtr, (perCounterMax*N_ATOMIC) * sizeof(T)); 
  ~MappedTypeArray() {
  inline __host__ __device__ T* base() {return(typePtr);}
  inline __host__ __device__ uint32_t newIndex() { 
  inline __host__ __device__ uint32_t size() { return boundedCount.getCount(); }
  inline __host__ __device__ T* alloc() { return base()+newIndex(); }
  inline __host__ T* operator[](uint32_t index) {
    int i;
    for(i=0; (i < N_ATOMIC) && (index >= boundedCount.countIndex(i)) ; i++) 
      index -= boundedCount.countIndex(i);
    return typePtr + index + boundedCount.getColCount() * i;


The best way to understand the MappedTypeArray class is to use it on both the host and a GPU in an example.

Test Code

Listing Five demonstrates how to use MappedTypeArray on both the host and GPU. Succinctly, this test code creates a histogram containing nBins of class ParallelCounter. These bins are uniformly filled based on integers in the range [0 ..nSamples) modulus nBins (for example, histo[tid % nBins]++). Both nBins and nSamples are set by user-provided command-line arguments.

Adding objects to the MappedTypeArray object on both the host and device demonstrates the usefulness of this class for complex code.

The counts of each bin of this histogram are saved in parallel on the device to output, the device side of a MappedTypeArray object. Note that output was preloaded with a few objects of HistoType on the host. In this example, a kernel-side printf() reports the size of output before and after the update. After the device-side update to output, a few additional objects of HistoType were also added by the host side of the MappedTypeArray object.

The code uses __is_standard_layout() in an assertion to ensure that the programmer has not provided a class to the MappedTypeArray template that breaks C/C++ standard layout conformance.

An OpenMP parallel for loop demonstrates how to use a host-side parallel loop to walk the allocated HistoType objects. This parallel loop fills a host-side STL vector that is used to sort the data on the host. The results are then sanity-checked for consistency.

Listing Five: Complete source code for

//Rob Farber
#include <iostream>
using namespace std;
#include <cstdio>
#include <cassert>
#include <vector>
#include <algorithm>
#include "MappedTypeArray.hpp"

#ifndef SPREAD
#define SPREAD 32

// A trivial Histo Class that will be used to push the histogram results
struct HistoType {
  uint32_t index;
  uint32_t count;

// a comparision function for sorting on the host
bool HistoTypeBinCmp(HistoType *a, HistoType *b) { return a->index < b->index; }

__device__ ParallelCounter<SPREAD> *histo;
__global__ void createHisto(int nBins) {
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  if(tid == 0) {
    histo = new ParallelCounter<SPREAD>[nBins];

// Kernels to create and calculate the histogram
__global__ void initHisto(int nBins) {
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  while(tid < nBins) {
    tid += blockDim.x * gridDim.x;

// Calculate the histo based on thread id
__global__ void doHisto(uint32_t nSamples, int nBins) {
  uint32_t tid = threadIdx.x + blockIdx.x * blockDim.x;

  while(tid < nSamples) {
    histo[tid % nBins] += 1;
    tid += blockDim.x * gridDim.x;

// Use the MappedTypeArray structure to push whatever data is required
__global__ void pushResults(uint32_t nSamples, MappedTypeArray<HistoType> *d_output,
			    int nBins) {
  int tid = threadIdx.x + blockIdx.x * blockDim.x;

  if(tid == 0)
    printf("Before push device reports: Size %d\n", d_output->size());

  while(tid < nBins) {
    HistoType *pt=d_output->alloc();
    pt->index = tid;
    pt->count = histo[tid].getCount();
    tid += blockDim.x * gridDim.x;
  tid = threadIdx.x + blockIdx.x * blockDim.x;
  if(tid == 0)
    printf("After push device reports: Size %d\n", d_output->size());

// Demonstrate that the host can allocate data
void testHostAlloc(MappedTypeArray<HistoType> &output, int n, int uid)
  for(int i=0; i < n; i++) {
    HistoType *pt=output.alloc();
    pt->index = uid+i; // a unique identifier
    pt->count = 0;

int main(int argc, char *argv[])
  uint32_t nSamples;
  int device=0, nBins;
  int nHostAlloc=4;

  try {
    if(argc < 4) {
      cerr << "Use: nthreads nBins device" << endl;
      return -1;

    if(cudaPeekAtLastError() != cudaSuccess) throw "failed to set device";

    int nThreadsPerBlock=256;
    int nBlocks=nSamples/nThreadsPerBlock + ((nSamples % nThreadsPerBlock)?1:0);
    if(nBlocks > 65535) nBlocks=65535;
    cerr << "device " << device << " nSamples " << nSamples
	 << " spread " << SPREAD << " nBlocks " << nBlocks
	 << " threadsPerBlock " << nThreadsPerBlock << endl;
    // overallocate memory by a factor of two
    int nPreallocateSize= 2*(nBins + nHostAlloc); 
    // TEST: completely fill the allocated memory

    cout << "MappedTypeArray is_standard_layout: " 
	 << __is_standard_layout(MappedTypeArray<HistoType>) << endl;
    cout << "MappedTypeArray is in standard layout" << endl;
    // demonstrate that we can perform host-side allocations as well.
    MappedTypeArray<HistoType> output(nPreallocateSize);
    if(cudaPeekAtLastError() != cudaSuccess) 
      throw "failed Mapped Array Creation";

    // show that we can allocate data on the host and have it appear
    // on the device
    testHostAlloc(output, nHostAlloc, 100);
    // call the kernels
    doHisto<<<nBlocks,nThreadsPerBlock>>>(nSamples, nBins);
    // Note that d_ptr() calls cudaMemcpy()
    pushResults<<<nBins/256 + 1,256>>>(nSamples, output.d_ptr(), nBins);
    if(cudaPeekAtLastError() != cudaSuccess) throw "kernel failure";
    // demonstrate that we can perform host-side allocations as well.
    // Note: Host and device allocations cannot be used at the same time! 
    testHostAlloc(output, nHostAlloc, 200);
    cout << "The host reports " << output.size() 
	 << " items allocated" << endl;

    // SANITY check that the allocated bins is correct
    assert(output.size() == (nBins + 2 * nHostAlloc));
    // Use OpenMP to demonstrate parallel host-side access
    uint32_t sum=0;
#pragma omp parallel for reduction(+ : sum)
    for(int i=0; i < output.size(); i++)
      sum += output[i]->count;
    // SANITY check: total counts should equal nSamples 
    assert(sum == nSamples);
    // get a pointer array
    vector<HistoType*> v(output.size());
#pragma omp parallel for
    for(int i=0; i < output.size(); i++)
      v[i] = output[i];
    // sort by indices and print
    sort(v.begin(), v.end(), HistoTypeBinCmp);
    for(int i=0; i < v.size(); i++)
      cerr << "bin " << v[i]->index << " count " << v[i]->count << endl;
    // SANITY check: sort by address and check that memory is not duplicated
    sort(v.begin(), v.end());
    for(int i=0; i < v.size(); i++)
      assert(v[i-1] != v[i]);
    cout << "total " << sum << " should have " << nSamples << endl;
  } catch (const char * e) {
    cerr << "caught exception: " << e << endl;
    cudaError err;
    if((err=cudaGetLastError()) != cudaSuccess)
      cerr << "CUDA reports: " << cudaGetErrorString(err) << endl;
    return -1;
  cout << "***** Passed all sanity checks! *****" << endl;
  return 0;

The NVIDIA nvprof text-based profiler is used to provide performance data. This eliminates the need to manually instrument the example code, thus making the code both cleaner and simpler.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.