Channels ▼
RSS

Design

CUDA: Unifying Host/Device Interactions with a Single C++ Macro


To facilitate the creation and use of complex data structures such as trees and graphs on both the host and device — a convenient and arguably essential requirement for dynamic parallelism — this article will focus on a set of macros that can be used in a broad spectrum of C++ classes to mimic the convenience of mapped memory. Use of the presented SHADOW_MACRO() avoids the two order-of-magnitude decrease in CUDA performance when data in mapped memory is heavily reused, as demonstrated in the previous article in this series, Atomic Operations and Low-Wait Algorithms in CUDA.

NVIDIA is clearly moving toward a unified virtual environment, where objects will be transparently accessible from both host and GPU devices. NVIDIA does not currently support the caching of mapped memory on the device (as of CUDA 5), which is why I focus on a general method to move data reasonably transparently between the host and device.

I provide a complete working histogram example that shows how simple it is to use SHADOW_MACRO() in real applications. The histogram is implemented using the ParallelCounter class from the previous article, which preserves highly parallel performance even when confronted with pathological situations where every thread is simultaneously trying to increment a single ParallelCounter object.

Why Use a Macro Instead of C++ Inheritance?

Guaranteeing that all the devices can use the layout of a C++ object is constant challenge for CUDA C++ programmers in a multi-device environment. The use of cudaMemcpy() to transfer data is analogous to the use of memcpy() (or read and write operations) to transfer data to/from disk or between multiple devices. As a result, CUDA C++ developers can leverage existing work by the C++ standards committee and the C++ compiler development community to define the conditions when sequential byte-oriented copy operations can be used without affecting the usability of a C++ object.

The previous article discussed POD_structs, which is an early C++98 definition of a stringent set of conditions where layout compatibility is guaranteed after a byte-copy operation. Unfortunately, POD_structs do not allow user-defined destructors and require that all data be declared public (meaning no protected or private data) among the loss of other very useful C++ capabilities.

More recently, the C++ community has clarified and relaxed to a certain degree the conditions in which a C++ object can be safely byte-copied. For more information, see  Bjarne Stroustrup's discussion of generalized PODs in the C++11 FAQ and Beman Dawes' POD's Revisited; Resolving Core Issue 568 (Revision 4).

The end result of this discussion has been a revision to the C++ type_traits definitions. Key to this article are the is_standard_layout() and the more general is_trivially_copyable() methods to check the copyability of a C++ class:

  • is_standard_layout(): A standard-layout type is a type with a simple linear data structure and access control that can easily be used to communicate with code written in other programming languages, such as C, either cv-qualified or not. This is true for scalar types, standard-layout classes, and arrays of any such types.
  • is_trivially_copyable(): A trivially copyable type is a type whose storage is contiguous (thus, its copy implies a trivial memory block copy, as if performed with memcpy), either cv-qualified or not. This is true for scalar types, trivially copyable classes, and arrays of any such types.

A useful discussion regarding the differences between these types is available in Standard Layout Classes and Trivially Copyable Types, Part I and Standard Layout Classes and Trivially Copyable Types, Part II.

While waiting to get a full C++11 type_traits implementation, CUDA C++ programmers can use the existing GNU C++ front-end methods to check C++ type traits. Calling compiler front-end methods is certainly not the most desirable solution, but is acceptable because these methods are used to check the type of a class rather than perform some functional operation. In particular, the GNU __is_pod(), __is_standard_layout(), and __has_trivial_copy() methods are useful to check that C++ classes used with SHADOW_MACRO() are byte transferable. Microsoft users can utilize the is_pod(), is_standard_layout(), and  has_trivial_copy() methods.

C++ classes that conform to the __has_trivial_copy() traits are certainly the most general, while the classes that meet the __is_standard_layout() criteria provide a lowest common denominator that can work with both C and C++ code. Listing One illustrates the differences.

Listing One: Example of a POD, standard layout, and has_trivial_copy class.

#include <iostream>
using namespace std;

struct A
{
public:
    int i;
};

struct B : public A
{
public:
};

struct C : public A
{
public:
  float f;
};

int main()
{
  cout << boolalpha  << "struct A " << endl
       << "\t__is_pod(A): " << __is_pod(A)  << endl
       << "\t__is_standard_layout(A): " << __is_standard_layout(A) << endl 
       << "\t__has_trivial_copy(A): " << __has_trivial_copy(A) << endl;

  cout << boolalpha  << "struct B " << endl
       << "\t__is_pod(B): " << __is_pod(B)  << endl
       << "\t__is_standard_layout(B): " << __is_standard_layout(B) << endl 
       << "\t__has_trivial_copy(B): " << __has_trivial_copy(B) << endl;

  cout << boolalpha  << "struct C " << endl
       << "\t__is_pod(C): " << __is_pod(C)  << endl
       << "\t__is_standard_layout(C): " << __is_standard_layout(C) << endl 
       << "\t__has_trivial_copy(C): " << __has_trivial_copy(C) << endl;
  return 0;
}

Compiling and running this example shows that the most common (and general) case of a derived class that contains internal data is still byte-copyable, but is not usable as a C struct. For the greatest generality, this article avoids the use of inheritance with the use of a macro to preserve standard layout compatibility with C.

Listing Two: Compile command and output to test C++ layout traits.

$ nvcc testLayout.cc -run
struct A 
	__is_pod(A): true
	__is_standard_layout(A): true
	__has_trivial_copy(A): true
struct B 
	__is_pod(B): false
	__is_standard_layout(B): true
	__has_trivial_copy(B): true
struct C 
	__is_pod(C): false
	__is_standard_layout(C): false
	__has_trivial_copy(C): true

ParallelCounter.hpp

To conform to the lowest common denominator of C structure compatibility, the source code for ParallelCounter.hpp (Listing Three) implements most of the code to transparently move data between the host and device as a macro: SHADOW_MACRO(). Be aware that using a macro can cause name conflicts, among other issues.

The macro is passed the type of the class via the variable TYPE, which acts much like a C++ template argument and gives SHADOW_MACRO() the ability to be used in many classes, structures, and C++ templates. This macro specifies the variable my_d_ptr that points to the device-side memory. The Boolean variable usrManaged flags whether this code is performing the memory management or an external, user-allocated memory region is being used for the host and device transfers with cudaMemcpy(). Public methods include:

  • d_ptr(): This method is called to get the device-side pointer. As needed, it performs any data allocation and/or initiates the data transfer between the host and device.  
  • set_d_ptr():  This sets the device pointer to a user-allocated region of memory and sets usrManaged to true —so this code will not call cudaFree().
  • free_d_ptr(): This frees the device pointer when appropriate.
  • cpyHtoD(): This copies data from the host to the device with cudaMemcpy(). If needed, memory is allocated on the device.
  • cpyDtoH(): This copies data from the device to the host with cudaMemcpy().

The supporting macros SHADOW_MACRO_INIT and SHADOW_MACRO_FINI define code for inclusion in the constructor and destructor.

The ParallelCounter class from the previous article is modified to make use of SHADOW_MACRO. Note that the TYPE passed to the macro is ParallelCounter<N_ATOMIC>.

Listing Three: Source code for ParallelCounter.hpp.

// Rob Farber
#ifndef PARALLEL_COUNTER_HPP
#define PARALLEL_COUNTER_HPP

#include <iostream>
#include <vector>
using namespace std;
#include <cstdio>
#include <cassert>

// parallel counter class definitions
#include <stdint.h>

#define SHADOW_MACRO(TYPE)								\
  private:										\
  bool usrManaged;									\
  TYPE *my_d_ptr;									\
public:										\
  inline __host__ void set_d_ptr(TYPE* pt) {					\
     if(!usrManaged && my_d_ptr) cudaFree(my_d_ptr);				\
     my_d_ptr = pt; usrManaged=true;						\
  }											\
  inline __host__ TYPE* d_ptr() {							\
    if(!my_d_ptr) {cudaMalloc(&my_d_ptr,sizeof(TYPE));usrManaged=false;}	\
    cpyHtoD();									\
    return my_d_ptr;								\
    }											\
  inline __host__ void free_d_ptr() {						\
    if(!usrManaged && my_d_ptr) cudaFree(my_d_ptr);				\
    my_d_ptr = NULL; usrManaged=false;						\
    }											\
  inline __host__ void cpyHtoD() {						\
    if(!my_d_ptr) my_d_ptr = d_ptr();						\
    cudaMemcpy(my_d_ptr, this, sizeof(TYPE), cudaMemcpyDefault);		\
  }											\
  inline __host__ void cpyDtoH() {						\
    if(!my_d_ptr) my_d_ptr = d_ptr();						\
    cudaMemcpy(this, my_d_ptr, sizeof(TYPE), cudaMemcpyDefault);		\
  }

#define SHADOW_MACRO_INIT() my_d_ptr=NULL; usrManaged=false;
#define SHADOW_MACRO_FINI() free_d_ptr();

template <uint32_t N_ATOMIC=32>
struct ParallelCounter {
private:
  uint32_t count[N_ATOMIC];

  /////////////////////////////////////////////////////////
  // Transparent data movement
  SHADOW_MACRO(ParallelCounter<N_ATOMIC>)

public:
  __host__ __device__ ParallelCounter<N_ATOMIC>() {
    SHADOW_MACRO_INIT()
  }
  __host__ __device__ ~ParallelCounter<N_ATOMIC>() {
#if !defined(__CUDA_ARCH__)
    SHADOW_MACRO_FINI()
#endif
  }

  inline __device__ uint32_t operator-=(uint32_t x) {
    return atomicSub(count + (threadIdx.x % N_ATOMIC), x); 
  }
  inline __device__ uint32_t operator+=(uint32_t x) {
    return atomicAdd(count + (threadIdx.x % N_ATOMIC), x); 
  }
  // spread the counts across the counter
  __device__ __host__ void set(uint32_t x) {
    for(int i=0; i < N_ATOMIC; i++) count[i]=x/N_ATOMIC;
    for(int i=0; i < x % N_ATOMIC; i++) count[i] += 1;
  }
  inline __device__ __host__ uint32_t getCount() {
#if !defined(__CUDA_ARCH__)
    this->cpyDtoH();
#endif
    // simplest slow method for right now.
    uint32_t sum=0;
    for(int i=0; i < N_ATOMIC; i++) {
      sum += count[i];
    }
    return sum;
  }
};

// end of class definitions

#endif

It is important to note that this code is predicated on the assumption that the object size and layout are identical between the host processor and the GPU. It is crucial to check that the classes that include SHADOW_MACRO() are at least trivially copyable.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video