To facilitate the creation and use of complex data structures such as trees and graphs on both the host and device a convenient and arguably essential requirement for dynamic parallelism this article will focus on a set of macros that can be used in a broad spectrum of C++ classes to mimic the convenience of mapped memory. Use of the presented SHADOW_MACRO()
avoids the two order-of-magnitude decrease in CUDA performance when data in mapped memory is heavily reused, as demonstrated in the previous article in this series, Atomic Operations and Low-Wait Algorithms in CUDA.
NVIDIA is clearly moving toward a unified virtual environment, where objects will be transparently accessible from both host and GPU devices. NVIDIA does not currently support the caching of mapped memory on the device (as of CUDA 5), which is why I focus on a general method to move data reasonably transparently between the host and device.
I provide a complete working histogram example that shows how simple it is to use SHADOW_MACRO()
in real applications. The histogram is implemented using the ParallelCounter
class from the previous article, which preserves highly parallel performance even when confronted with pathological situations where every thread is simultaneously trying to increment a single ParallelCounter
object.
Why Use a Macro Instead of C++ Inheritance?
Guaranteeing that all the devices can use the layout of a C++ object is constant challenge for CUDA C++ programmers in a multi-device environment. The use of cudaMemcpy()
to transfer data is analogous to the use of memcpy()
(or read
and write
operations) to transfer data to/from disk or between multiple devices. As a result, CUDA C++ developers can leverage existing work by the C++ standards committee and the C++ compiler development community to define the conditions when sequential byte-oriented copy operations can be used without affecting the usability of a C++ object.
The previous article discussed POD_structs
, which is an early C++98 definition of a stringent set of conditions where layout compatibility is guaranteed after a byte-copy operation. Unfortunately, POD_structs
do not allow user-defined destructors and require that all data be declared public (meaning no protected
or private
data) among the loss of other very useful C++ capabilities.
More recently, the C++ community has clarified and relaxed to a certain degree the conditions in which a C++ object can be safely byte-copied. For more information, see Bjarne Stroustrup's discussion of generalized POD
s in the C++11 FAQ and Beman Dawes' POD's Revisited; Resolving Core Issue 568 (Revision 4).
The end result of this discussion has been a revision to the C++ type_traits
definitions. Key to this article are the is_standard_layout()
and the more general is_trivially_copyable()
methods to check the copyability of a C++ class:
is_standard_layout()
: A standard-layout type is a type with a simple linear data structure and access control that can easily be used to communicate with code written in other programming languages, such as C, eithercv
-qualified or not. This is true for scalar types, standard-layout classes, and arrays of any such types.is_trivially_copyable()
: A trivially copyable type is a type whose storage is contiguous (thus, its copy implies a trivial memory block copy, as if performed withmemcpy
), eithercv
-qualified or not. This is true for scalar types, trivially copyable classes, and arrays of any such types.
A useful discussion regarding the differences between these types is available in Standard Layout Classes and Trivially Copyable Types, Part I and Standard Layout Classes and Trivially Copyable Types, Part II.
While waiting to get a full C++11 type_traits
implementation, CUDA C++ programmers can use the existing GNU C++ front-end methods to check C++ type traits. Calling compiler front-end methods is certainly not the most desirable solution, but is acceptable because these methods are used to check the type of a class rather than perform some functional operation. In particular, the GNU __is_pod()
, __is_standard_layout()
, and __has_trivial_copy()
methods are useful to check that C++ classes used with SHADOW_MACRO()
are byte transferable. Microsoft users can utilize the is_pod()
, is_standard_layout()
, and has_trivial_copy()
methods.
C++ classes that conform to the __has_trivial_copy()
traits are certainly the most general, while the classes that meet the __is_standard_layout()
criteria provide a lowest common denominator that can work with both C and C++ code. Listing One illustrates the differences.
Listing One: Example of a POD, standard layout, and has_trivial_copy class.
#include <iostream> using namespace std; struct A { public: int i; }; struct B : public A { public: }; struct C : public A { public: float f; }; int main() { cout << boolalpha << "struct A " << endl << "\t__is_pod(A): " << __is_pod(A) << endl << "\t__is_standard_layout(A): " << __is_standard_layout(A) << endl << "\t__has_trivial_copy(A): " << __has_trivial_copy(A) << endl; cout << boolalpha << "struct B " << endl << "\t__is_pod(B): " << __is_pod(B) << endl << "\t__is_standard_layout(B): " << __is_standard_layout(B) << endl << "\t__has_trivial_copy(B): " << __has_trivial_copy(B) << endl; cout << boolalpha << "struct C " << endl << "\t__is_pod(C): " << __is_pod(C) << endl << "\t__is_standard_layout(C): " << __is_standard_layout(C) << endl << "\t__has_trivial_copy(C): " << __has_trivial_copy(C) << endl; return 0; }
Compiling and running this example shows that the most common (and general) case of a derived class that contains internal data is still byte-copyable, but is not usable as a C struct
. For the greatest generality, this article avoids the use of inheritance with the use of a macro to preserve standard layout compatibility with C.
Listing Two: Compile command and output to test C++ layout traits.
$ nvcc testLayout.cc -run struct A __is_pod(A): true __is_standard_layout(A): true __has_trivial_copy(A): true struct B __is_pod(B): false __is_standard_layout(B): true __has_trivial_copy(B): true struct C __is_pod(C): false __is_standard_layout(C): false __has_trivial_copy(C): true
ParallelCounter.hpp
To conform to the lowest common denominator of C structure compatibility, the source code for ParallelCounter.hpp
(Listing Three) implements most of the code to transparently move data between the host and device as a macro: SHADOW_MACRO()
. Be aware that using a macro can cause name conflicts, among other issues.
The macro is passed the type of the class via the variable TYPE
, which acts much like a C++ template argument and gives SHADOW_MACRO()
the ability to be used in many classes, structures, and C++ templates. This macro specifies the variable my_d_ptr
that points to the device-side memory. The Boolean variable usrManaged
flags whether this code is performing the memory management or an external, user-allocated memory region is being used for the host and device transfers with cudaMemcpy()
. Public methods include:
d_ptr()
: This method is called to get the device-side pointer. As needed, it performs any data allocation and/or initiates the data transfer between the host and device.set_d_ptr()
: This sets the device pointer to a user-allocated region of memory and setsusrManaged
totrue
so this code will not callcudaFree()
.free_d_ptr()
: This frees the device pointer when appropriate.cpyHtoD()
:cudaMemcpy()
. If needed, memory is allocated on the device.cpyDtoH()
: This copies data from the device to the host withcudaMemcpy()
.
The supporting macros SHADOW_MACRO_INIT
and SHADOW_MACRO_FINI
define code for inclusion in the constructor and destructor.
The ParallelCounter
class from the previous article is modified to make use of SHADOW_MACRO
. Note that the TYPE
passed to the macro is ParallelCounter<N_ATOMIC>
.
Listing Three: Source code for ParallelCounter.hpp.
// Rob Farber #ifndef PARALLEL_COUNTER_HPP #define PARALLEL_COUNTER_HPP #include <iostream> #include <vector> using namespace std; #include <cstdio> #include <cassert> // parallel counter class definitions #include <stdint.h> #define SHADOW_MACRO(TYPE) \ private: \ bool usrManaged; \ TYPE *my_d_ptr; \ public: \ inline __host__ void set_d_ptr(TYPE* pt) { \ if(!usrManaged && my_d_ptr) cudaFree(my_d_ptr); \ my_d_ptr = pt; usrManaged=true; \ } \ inline __host__ TYPE* d_ptr() { \ if(!my_d_ptr) {cudaMalloc(&my_d_ptr,sizeof(TYPE));usrManaged=false;} \ cpyHtoD(); \ return my_d_ptr; \ } \ inline __host__ void free_d_ptr() { \ if(!usrManaged && my_d_ptr) cudaFree(my_d_ptr); \ my_d_ptr = NULL; usrManaged=false; \ } \ inline __host__ void cpyHtoD() { \ if(!my_d_ptr) my_d_ptr = d_ptr(); \ cudaMemcpy(my_d_ptr, this, sizeof(TYPE), cudaMemcpyDefault); \ } \ inline __host__ void cpyDtoH() { \ if(!my_d_ptr) my_d_ptr = d_ptr(); \ cudaMemcpy(this, my_d_ptr, sizeof(TYPE), cudaMemcpyDefault); \ } #define SHADOW_MACRO_INIT() my_d_ptr=NULL; usrManaged=false; #define SHADOW_MACRO_FINI() free_d_ptr(); template <uint32_t N_ATOMIC=32> struct ParallelCounter { private: uint32_t count[N_ATOMIC]; ///////////////////////////////////////////////////////// // Transparent data movement SHADOW_MACRO(ParallelCounter<N_ATOMIC>) public: __host__ __device__ ParallelCounter<N_ATOMIC>() { SHADOW_MACRO_INIT() } __host__ __device__ ~ParallelCounter<N_ATOMIC>() { #if !defined(__CUDA_ARCH__) SHADOW_MACRO_FINI() #endif } inline __device__ uint32_t operator-=(uint32_t x) { return atomicSub(count + (threadIdx.x % N_ATOMIC), x); } inline __device__ uint32_t operator+=(uint32_t x) { return atomicAdd(count + (threadIdx.x % N_ATOMIC), x); } // spread the counts across the counter __device__ __host__ void set(uint32_t x) { for(int i=0; i < N_ATOMIC; i++) count[i]=x/N_ATOMIC; for(int i=0; i < x % N_ATOMIC; i++) count[i] += 1; } inline __device__ __host__ uint32_t getCount() { #if !defined(__CUDA_ARCH__) this->cpyDtoH(); #endif // simplest slow method for right now. uint32_t sum=0; for(int i=0; i < N_ATOMIC; i++) { sum += count[i]; } return sum; } }; // end of class definitions #endif
It is important to note that this code is predicated on the assumption that the object size and layout are identical between the host processor and the GPU. It is crucial to check that the classes that include SHADOW_MACRO()
are at least trivially copyable.