CUDA application programmers have two metrics for success: the code must be correct, and it must run with high performance by exploiting massive parallelism. Low-wait algorithms are essential to preserving parallelism (and thus performance) even when presented with pathological situations that appear to require sequential processing. The NVIDIA SDK code samples are a useful repository of working code that demonstrates how to implement a variety of non-trivial computations in CUDA. In particular, the NVIDIA histogram SDK example is a good resource to consult to see how to use atomic operations.
Unfortunately, the performance of the NVIDIA example code can degrade severely when many threads need to increment the same histogram bin at the same time, a problem made more pressing by the number of programmers who have adopted the NVIDIA histogram code into their own applications. Alternatively, the low-wait ParallelCounter class from my earlier tutorial can be used to construct a histogram that maintains high performance regardless of how the histogram bins are incremented.
This article provides a complete working histogram example that preserves highly parallel performance even when confronted with the pathological situation in which every thread simultaneously tries to increment a single ParallelCounter object. A comparative analysis against the more traditional "increment within a vector of integers" approach used in the NVIDIA SDK sample code demonstrates that equivalent high performance can be achieved with histograms constructed from ParallelCounter objects.
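For reference, the traditional approach treats the histogram as a plain vector of integer bins that each thread increments with atomicAdd(). The following kernel is an illustrative sketch of that baseline, not the actual NVIDIA SDK sample; the kernel name and the choice of 256 byte-valued bins are mine:

__global__ void naiveHistogram(const unsigned char *data, int n,
                               unsigned int *bins) // 256 bins, zeroed by the caller
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n)
      atomicAdd(&bins[data[i]], 1u); // serializes when threads hit the same bin
}

Every atomicAdd() on the same bin must be serialized by the hardware, which is exactly why performance collapses when the input data concentrates into a few bins.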
The histogram needs to be accessible from both the host and device for setup, computation, and reporting of results. Mapped memory is not an option because the high reuse of the bins would incur the expense of many unnecessary PCIe data transfers. Instead, potentially large numbers of objects must be created and moved transparently between the host and device, which is why the SHADOW_MACRO() method from my last Dr. Dobb's tutorial will be used to provide transparent host and device transfers. In this article, I also demonstrate the simplicity of the dynamic parallelism introduced in CUDA 5.0.
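For readers who have not yet used dynamic parallelism, the following minimal sketch shows the idea: a kernel launches another kernel directly from the device. The kernels here are hypothetical and are not part of the histogram code:

// Child grid: each thread increments one element.
__global__ void childKernel(int *data, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) data[i] += 1;
}

// Parent grid: one thread launches the child grid from the device.
__global__ void parentKernel(int *data, int n)
{
   if (threadIdx.x == 0 && blockIdx.x == 0)
      childKernel<<<(n + 255)/256, 256>>>(data, n);
}

Dynamic parallelism requires a device of compute capability 3.5 or higher and relocatable device code (for example, nvcc -arch=sm_35 -rdc=true example.cu -lcudadevrt).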
ParallelCounter.hpp
To conform to the lowest common denominator of C structure compatibility, the source code for ParallelCounter.hpp (Listing One) implements most of the code that transparently moves data between the host and device as a macro, SHADOW_MACRO(). Be aware that using a macro can cause name conflicts, among other issues.
The type of the class is passed via the variable TYPE, which acts much like a C++ template argument and gives SHADOW_MACRO() the ability to be used in many classes, structures, and C++ templates. The macro defines the variable my_d_ptr, which points to the device-side memory. The Boolean variable usrManaged flags whether this code is performing the memory management or an external, user-allocated memory region is being used for the host and device transfers with cudaMemcpy(). Public methods include the following (a usage sketch appears after the list):
- d_ptr(): Gets the device-side pointer. As needed, it performs any data allocation and/or initiates the data transfer between the host and device.
- set_d_ptr(): Sets the device pointer to a user-allocated region of memory and sets usrManaged to true (so this code will not call cudaFree()).
- free_d_ptr(): Frees the device pointer when appropriate.
- cpyHtoD(): Copies data from the host to the device with cudaMemcpy(). If needed, memory is allocated on the device.
- cpyDtoH(): Copies data from the device to the host with cudaMemcpy().
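To make the call sequence concrete, here is a minimal usage sketch built around a hypothetical Payload struct; the struct, kernel, and printed value are illustrative only and are not part of the histogram code:

#include <cstdio>
#include "ParallelCounter.hpp" // provides SHADOW_MACRO and its companion macros

struct Payload {
   int value;
   SHADOW_MACRO(Payload) // adds my_d_ptr plus d_ptr(), cpyHtoD(), cpyDtoH(), ...
public:
   __host__ __device__ Payload() { SHADOW_MACRO_INIT() value = 0; }
   __host__ __device__ ~Payload() {
#if !defined(__CUDA_ARCH__)
      SHADOW_MACRO_FINI()
#endif
   }
};

__global__ void bump(Payload *p) { p->value += 42; }

int main()
{
   Payload p;
   bump<<<1,1>>>(p.d_ptr()); // d_ptr() allocates and copies host -> device
   cudaDeviceSynchronize();
   p.cpyDtoH();              // copy the result back from the device
   printf("value = %d\n", p.value); // prints 42
   return 0;
}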
The supporting macros SHADOW_MACRO_INIT and SHADOW_MACRO_FINI define code for inclusion in the constructor and destructor. The ParallelCounter class from my previous article is modified to make use of SHADOW_MACRO. Note that the TYPE passed to the macro is ParallelCounter<N_ATOMIC>.
Listing One: Source code for ParallelCounter.hpp.
// Rob Farber
#ifndef PARALLEL_COUNTER_HPP
#define PARALLEL_COUNTER_HPP

#include <iostream>
#include <vector>
using namespace std;
#include <cstdio>
#include <cassert>

// parallel counter class definitions
#include <stdint.h>

#define SHADOW_MACRO(TYPE) \
private: \
  bool usrManaged; \
  TYPE *my_d_ptr; \
public: \
  inline __host__ void set_d_ptr(TYPE* pt) { \
    if(!usrManaged && my_d_ptr) cudaFree(my_d_ptr); \
    my_d_ptr = pt; usrManaged=true; \
  } \
  inline __host__ TYPE* d_ptr() { \
    if(!my_d_ptr) { cudaMalloc(&my_d_ptr, sizeof(TYPE)); usrManaged=false; } \
    cpyHtoD(); \
    return my_d_ptr; \
  } \
  inline __host__ void free_d_ptr() { \
    if(!usrManaged && my_d_ptr) cudaFree(my_d_ptr); \
    my_d_ptr = NULL; usrManaged=false; \
  } \
  inline __host__ void cpyHtoD() { \
    if(!my_d_ptr) my_d_ptr = d_ptr(); \
    cudaMemcpy(my_d_ptr, this, sizeof(TYPE), cudaMemcpyDefault); \
  } \
  inline __host__ void cpyDtoH() { \
    if(!my_d_ptr) my_d_ptr = d_ptr(); \
    cudaMemcpy(this, my_d_ptr, sizeof(TYPE), cudaMemcpyDefault); \
  }

#define SHADOW_MACRO_INIT() my_d_ptr=NULL; usrManaged=false;
#define SHADOW_MACRO_FINI() free_d_ptr();

template <uint32_t N_ATOMIC=32>
struct ParallelCounter {
private:
  uint32_t count[N_ATOMIC];

  ////////////////////////////
  // Transparent data movement
  SHADOW_MACRO(ParallelCounter<N_ATOMIC>)

public:
  __host__ __device__ ParallelCounter<N_ATOMIC>() {
    SHADOW_MACRO_INIT()
  }
  __host__ __device__ ~ParallelCounter<N_ATOMIC>() {
#if !defined(__CUDA_ARCH__)
    SHADOW_MACRO_FINI()
#endif
  }
  inline __device__ uint32_t operator-=(uint32_t x) {
    return atomicSub(count + (threadIdx.x % N_ATOMIC), x);
  }
  inline __device__ uint32_t operator+=(uint32_t x) {
    return atomicAdd(count + (threadIdx.x % N_ATOMIC), x);
  }
  // spread the counts across the counter
  __device__ __host__ void set(uint32_t x) {
    for(int i=0; i < N_ATOMIC; i++) count[i] = x/N_ATOMIC;
    for(int i=0; i < x % N_ATOMIC; i++) count[i] += 1;
  }
  inline __device__ __host__ uint32_t getCount() {
#if !defined(__CUDA_ARCH__)
    this->cpyDtoH();
#endif
    // simplest slow method for right now.
    uint32_t sum=0;
    for(int i=0; i < N_ATOMIC; i++) {
      sum += count[i];
    }
    return sum;
  }
}; // end of class definitions
#endif
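As a quick illustration of the class in action, the following sketch (the kernel name and problem size are mine) has every thread increment the same counter. The atomic operations spread across the N_ATOMIC internal slots, so contention stays low even in this worst-case access pattern:

#include <cstdio>
#include "ParallelCounter.hpp"

__global__ void hammerCounter(ParallelCounter<32> *pc, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) *pc += 1; // atomicAdd() on slot threadIdx.x % 32
}

int main()
{
   ParallelCounter<32> pc;
   pc.set(0); // zero all internal slots

   const int n = 1 << 20;
   hammerCounter<<<(n + 255)/256, 256>>>(pc.d_ptr(), n);
   cudaDeviceSynchronize();

   // getCount() copies the object back to the host and sums the slots.
   printf("count = %u (expected %d)\n", pc.getCount(), n);
   return 0;
}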
Note that this code is predicated on the assumption that the object size and layout are identical on the host processor and the GPU. It is important to check that the classes that include SHADOW_MACRO() are at least trivially copyable.
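The <type_traits> header introduced in C++11 can automate this check at compile time, assuming an nvcc and host compiler with C++11 support. One subtlety: the user-provided destructor in ParallelCounter makes it non-trivially-copyable by the formal C++11 definition even though the author's design intends a byte-wise copy to be safe, so the weaker standard-layout test is the practical compile-time check for such classes:

#include <type_traits>
#include "ParallelCounter.hpp"

// Fails to compile if the class loses the layout guarantees that make
// the cudaMemcpy() of *this in SHADOW_MACRO() safe.
static_assert(std::is_standard_layout<ParallelCounter<32> >::value,
              "ParallelCounter must be standard-layout");

// For classes without user-provided constructors or destructors,
// the stronger check is preferable:
//   static_assert(std::is_trivially_copyable<SomeClass>::value,
//                 "SomeClass must be trivially copyable");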
At some point in the future, CUDA will offer a form of high-performance cached mapped memory. Until then, the SHADOW_MACRO() code is useful because it encapsulates device-side data allocation and data movement between the host and device. Regardless of how the data moves, the programmer must ensure that all C++ classes provide a data layout that is usable on both the host and the device! This implies that C++ type traits, and the C++ compiler methods to check type traits, will be important for the foreseeable future.