A Robust Histogram for Massive Parallelism

CUDA application programmers know that they have two metrics for success: The code must be correct, and it must run with high performance by exploiting massive parallelism. Low-wait algorithms are essential to preserving parallelism (and thus performance) even when presented with pathological situations that appear to require sequential processing. The NVIDIA SDK code samples are a useful repository of working code that demonstrates how to implement a variety of non-trivial computations in CUDA. In particular, the NVIDIA histogram SDK example is a good resource to consult to see how to use atomic operations.

Unfortunately, the performance of the NVIDIA example code can degrade severely when many threads need to increment the same histogram bin at the same time. This matters because many programmers have adopted the NVIDIA histogram code into their own applications. Alternatively, the low-wait ParallelCounter class from my earlier tutorial can be used to construct a histogram that maintains high performance regardless of the pattern in which the histogram bins are incremented.

This article provides a complete working histogram example that preserves highly parallel performance even in pathological situations in which every thread is simultaneously trying to increment a single ParallelCounter object. A comparative analysis against the more traditional "increment within a vector of integers" approach used in the NVIDIA SDK sample code demonstrates that histograms constructed from ParallelCounter objects achieve equivalent high performance.

The histogram needs to be accessible from both the host and device for setup, computation, and reporting of results. Mapped memory is not an option because the bins are reused heavily, which would incur unnecessary and expensive PCIe data transfers. Instead, potentially large numbers of objects must be created and moved transparently between the host and device, which is why the SHADOW_MACRO() method from my last Dr. Dobb's tutorial will be used to provide transparent host and device transfers. In this article, I also demonstrate the simplicity of the dynamic parallelism introduced in CUDA 5.0.


To conform to the lowest common denominator of C structure compatibility, the source code for ParallelCounter.hpp (Listing One) implements most of the code to transparently move data between the host and device as a macro, SHADOW_MACRO(). Be aware that using a macro can cause name conflicts, among other issues.

The type of the class is passed via the macro argument TYPE, which acts much like a C++ template argument and gives SHADOW_MACRO() the ability to be used in many classes, structures, and C++ templates. The macro defines the member variable my_d_ptr, which points to the device-side memory. The Boolean member usrManaged flags whether this code performs the memory management or an external, user-allocated memory region is used for the host and device transfers with cudaMemcpy(). Public methods include:

  • d_ptr(): This method is called to get the device-side pointer. As needed, it performs any data allocation and/or initiates the data transfer between the host and device.  
  • set_d_ptr():  This sets the device pointer to a user-allocated region of memory and sets usrManaged to true (so this code will not call cudaFree()).
  • free_d_ptr(): This frees the device pointer when appropriate.
  • cpyHtoD(): This copies data from the host to the device with cudaMemcpy(). If needed, memory is allocated on the device.
  • cpyDtoH(): This copies data from the device to the host with cudaMemcpy().

The supporting macros SHADOW_MACRO_INIT and SHADOW_MACRO_FINI define code for inclusion in the constructor and destructor. The ParallelCounter class from my previous article is modified to make use of SHADOW_MACRO. Note that the TYPE passed to the macro is ParallelCounter<N_ATOMIC>.

Listing One: Source code for ParallelCounter.hpp.

// Rob Farber

#include <iostream>
#include <vector>
using namespace std;
#include <cstdio>
#include <cassert>

// parallel counter class definitions
#include <stdint.h>

#define SHADOW_MACRO(TYPE)                                              \
  private:                                                              \
  bool usrManaged;                                                      \
  TYPE *my_d_ptr;                                                       \
public:                                                                 \
  inline __host__ void set_d_ptr(TYPE* pt) {                            \
     if(!usrManaged && my_d_ptr) cudaFree(my_d_ptr);                    \
     my_d_ptr = pt; usrManaged=true;                                    \
  }                                                                     \
  inline __host__ TYPE* d_ptr() {                                       \
    if(!my_d_ptr) {cudaMalloc(&my_d_ptr,sizeof(TYPE));usrManaged=false;} \
    cpyHtoD();                                                          \
    return my_d_ptr;                                                    \
  }                                                                     \
  inline __host__ void free_d_ptr() {                                   \
    if(!usrManaged && my_d_ptr) cudaFree(my_d_ptr);                     \
    my_d_ptr = NULL; usrManaged=false;                                  \
  }                                                                     \
  inline __host__ void cpyHtoD() {                                      \
    if(!my_d_ptr) {cudaMalloc(&my_d_ptr,sizeof(TYPE));usrManaged=false;} \
    cudaMemcpy(my_d_ptr, this, sizeof(TYPE), cudaMemcpyDefault);        \
  }                                                                     \
  inline __host__ void cpyDtoH() {                                      \
    if(!my_d_ptr) {cudaMalloc(&my_d_ptr,sizeof(TYPE));usrManaged=false;} \
    cudaMemcpy(this, my_d_ptr, sizeof(TYPE), cudaMemcpyDefault);        \
  }

#define SHADOW_MACRO_INIT() my_d_ptr=NULL; usrManaged=false;
#define SHADOW_MACRO_FINI() free_d_ptr();

template <uint32_t N_ATOMIC=32>
struct ParallelCounter {
  uint32_t count[N_ATOMIC];

  // Transparent data movement
  SHADOW_MACRO(ParallelCounter<N_ATOMIC>)

  __host__ __device__ ParallelCounter<N_ATOMIC>() {
#if !defined(__CUDA_ARCH__)
    SHADOW_MACRO_INIT()
#endif
  }
  __host__ __device__ ~ParallelCounter<N_ATOMIC>() {
#if !defined(__CUDA_ARCH__)
    SHADOW_MACRO_FINI()
#endif
  }
  inline __device__ uint32_t operator-=(uint32_t x) {
    return atomicSub(count + (threadIdx.x % N_ATOMIC), x);
  }
  inline __device__ uint32_t operator+=(uint32_t x) {
    return atomicAdd(count + (threadIdx.x % N_ATOMIC), x);
  }
  // spread the counts across the counter
  __device__ __host__ void set(uint32_t x) {
    for(int i=0; i < N_ATOMIC; i++) count[i]=x/N_ATOMIC;
    for(int i=0; i < x % N_ATOMIC; i++) count[i] += 1;
  }
  inline __device__ __host__ uint32_t getCount() {
#if !defined(__CUDA_ARCH__)
    cpyDtoH(); // ensure the host sees the latest device-side counts
#endif
    // simplest slow method for right now.
    uint32_t sum=0;
    for(int i=0; i < N_ATOMIC; i++) {
      sum += count[i];
    }
    return sum;
  }
};
// end of class definitions


Note that this code is predicated on the assumption that the object size and layout are identical between the host processor and the GPU. It is important to check that the classes that include SHADOW_MACRO() are at least trivially copyable.

At some point in the future, CUDA will offer a form of high-performance cached mapped memory. Until then, the SHADOW_MACRO() code is useful because it encapsulates device-side data allocation and data movement between the host and device. Regardless of how the data moves between the host and device, the programmer must ensure that all C++ classes provide a data layout that is usable on both! This implies that C++ type traits, and the compiler facilities for checking them, will be important for the foreseeable future.
