Atomic Operations and Low-Wait Algorithms in CUDA


Test 3: Utilizing a C++ Object on Both the Host and GPU

In this test (Listing Five), the entire ParallelCounter object is copied to and from the host with cudaMemcpy(). The object foo is initialized on the host, incremented on the GPU with the doPassedCounter() kernel, and copied back to the host, where the getCount() method is used to check the result.

Notice that the ability to call getCount() on both the GPU (in the previous example) and on the host (in this example) is enabled by annotating the getCount() method in the ParallelCounter class definition with __host__ and __device__ qualifiers.
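
For reference, a minimal sketch of what such a class definition might look like follows. This is not the actual Listing Two definition; the set() behavior and slot-selection details are assumptions, but it illustrates the __host__ and __device__ annotations plus the POD layout (public data only, no constructors or virtual methods) that make a whole-object cudaMemcpy() legal:

// A minimal sketch (not the actual Listing Two definition): SPREAD slots
// spread atomicAdd() contention across separate memory locations.
template<uint32_t SPREAD>
struct ParallelCounter {
  uint32_t count[SPREAD]; // public data only, so the type remains a POD

  __host__ __device__ void set(uint32_t x) {
    count[0] = x;                                    // store the value in slot 0
    for(uint32_t i=1; i < SPREAD; i++) count[i] = 0; // zero the remaining slots
  }
  __device__ void operator+=(uint32_t x) {
    // Threads pick different slots so concurrent increments rarely collide
    atomicAdd(count + (threadIdx.x % SPREAD), x);
  }
  __host__ __device__ uint32_t getCount() {
    uint32_t sum = 0;                                // total is the sum of all slots
    for(uint32_t i=0; i < SPREAD; i++) sum += count[i];
    return sum;
  }
};

With that sketch in mind, the Listing Five test follows: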

{ //  Instantiate an object on the CPU and copy it to the GPU
      ParallelCounter<SPREAD> foo, *d_foo;
      foo.set(0);
      cudaMalloc(&d_foo, sizeof(foo));
      cudaMemcpy(d_foo, &foo, sizeof(foo), cudaMemcpyDefault);
      if(cudaPeekAtLastError() != cudaSuccess) 
          throw "failed memcpy or cudaMalloc";
      
      doPassedCounter<<<nBlocks, nThreadsPerBlock>>>(d_foo,nSamples);
      if(cudaPeekAtLastError() != cudaSuccess) throw "failed doPassedCounter";       

      cudaMemcpy(&foo, d_foo, sizeof(foo), cudaMemcpyDefault);
      if(cudaPeekAtLastError() != cudaSuccess) throw "failed memcpy";

      uint32_t check = foo.getCount(); // read result on host
      cerr << nSamples << " " << check << endl;
      cudaFree(d_foo);
      if(cudaPeekAtLastError() != cudaSuccess) throw "failed free";
      assert(check == nSamples);
    }

Listing Five: Utilize a C++ object on both the host and GPU.
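
The doPassedCounter() kernel invoked above is defined in the full firstCounter.cu source (Listing Two, not reproduced here). A minimal sketch consistent with its use, assuming the same grid-stride loop as the doCounter() kernel in Listing Nine, would be:

__global__ void doPassedCounter(ParallelCounter<SPREAD> *counter, uint32_t nSamples) {
  uint32_t tid = threadIdx.x + blockIdx.x * blockDim.x;

  while(tid < nSamples) {
    *counter += 1;                 // one spread-out atomicAdd per increment
    tid += blockDim.x * gridDim.x; // grid-stride loop covers all samples
  }
}

The doMappedCounter() kernel used in the next test presumably has an identical body; only the memory backing the pointer argument differs.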

Test 4: Map an Object into UVA Mapped Memory for Use by Both Devices

Clearly, NVIDIA is moving toward unified virtual addressing (UVA), where objects are transparently accessible from both the host and GPU devices. At the moment, cudaMallocHost() must be called to map a region of memory into the address spaces of both the host and the device.

Listing Six creates a ParallelCounter object in mapped memory. The counter is set to zero on the host and then incremented in the doMappedCounter() kernel. A cudaDeviceSynchronize() call ensures that the kernel has completed, after which the state of the counter is read on the host. Note that no explicit memory transfers are required!

#ifdef USE_MAPPED
    { // test 4: use mapped memory (simpler and cleaner, but slow!)
      ParallelCounter<SPREAD> *ptCntr;
      // allocate in mapped memory
      cudaMallocHost(&ptCntr, sizeof(ParallelCounter<SPREAD>)); 
      if(cudaPeekAtLastError() != cudaSuccess) throw "cudaMallocHost failed";
      ptCntr->set(0); // initialize on the host
      doMappedCounter<<<nBlocks, nThreadsPerBlock>>>(ptCntr,nSamples);
      cudaDeviceSynchronize();

      uint32_t check = ptCntr->getCount(); // read result on host
      cerr << nSamples << " " << check << endl;
      cudaFreeHost(ptCntr);
      if(cudaPeekAtLastError() != cudaSuccess) throw "failed cudaFreeHost";

      assert(check == nSamples);
    }
#endif

Listing Six: Map an object into UVA memory.

While convenient for many problems, mapped memory is not cached on the GPU as of CUDA 5. This means that any computation that accesses a location in mapped memory many times will probably perform badly. A two-order-of-magnitude performance decrease will be shown for this approach in the performance analysis that follows.

Don’t let the poor computational performance of mapped memory prevent you from using it. The performance analysis in this article merely highlights the need to use mapped memory appropriately. In particular, the ability to use one pointer to access data on both the host and device is essential to many code implementations. In short, enjoy the convenience of mapped memory, but be aware that high performance requires a copy operation to and from global memory. As will be discussed in the next article, it is possible to implement a C++ base class that provides the convenience of mapped memory with high-performance C++ objects.
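
As a concrete illustration of that advice, the fragment below keeps the mapped-memory object for host-side convenience but stages it through global memory for the compute-intensive kernel. (This is a sketch, not code from the article; the d_cntr pointer and the staging copies are additions.)

ParallelCounter<SPREAD> *ptCntr, *d_cntr;
cudaMallocHost(&ptCntr, sizeof(ParallelCounter<SPREAD>)); // mapped memory
cudaMalloc(&d_cntr, sizeof(ParallelCounter<SPREAD>));     // global memory
ptCntr->set(0);                                           // initialize on the host

// stage the object into fast global memory before the heavy kernel
cudaMemcpy(d_cntr, ptCntr, sizeof(*ptCntr), cudaMemcpyDefault);
doPassedCounter<<<nBlocks, nThreadsPerBlock>>>(d_cntr, nSamples);
cudaMemcpy(ptCntr, d_cntr, sizeof(*ptCntr), cudaMemcpyDefault);

uint32_t check = ptCntr->getCount(); // read the result on the host as before
cudaFree(d_cntr);
cudaFreeHost(ptCntr);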

Performance

The NVIDIA nvprof text-based profiler is used to provide the following performance data. This choice eliminates the need to manually instrument the example code, thus making it cleaner and simpler.

To build the test code, save the source code in Listing Two to a file named firstCounter.cu. This file can be compiled under Linux for sm_20 and later devices with the nvcc compiler command in Listing Seven:

nvcc -O3 -DSPREAD=32 -arch=sm_20 firstCounter.cu -o first

Listing Seven: The nvcc command to build firstCounter.cu.

The test code is profiled while incrementing the counter 4 billion times with the nvprof command line shown in Listing Eight:

nvprof ./first 4000000000 0

Listing Eight: The nvprof command used to run the example.

Results 1 shows the output produced when running on a Kepler K20c installed as device 0:

======== NVPROF is profiling first...
======== Command: first 4000000000 0
device 0 nSamples 4000000000 spread 32 nBlocks 65535 threadsPerBlock 256
4000000000 4000000000
4000000000 4000000000
Checking if ParallelCounter is a POD: TRUE
***** Passed all sanity checks! *****
======== Profiling result:
 Time(%)      Time   Calls       Avg       Min       Max  Name
   50.05  711.98ms       1  711.98ms  711.98ms  711.98ms  doPassedCounter(ParallelCounter<unsigned int=32>*, unsigned int)
   49.95  710.55ms       1  710.55ms  710.55ms  710.55ms  doCounter(unsigned int)
    0.00   10.88us       1   10.88us   10.88us   10.88us  finiCounter(unsigned int*)
    0.00    6.37us       2    3.18us    3.01us    3.36us  [CUDA memcpy DtoH]
    0.00    3.07us       1    3.07us    3.07us    3.07us  initCounter(void)
    0.00    1.95us       1    1.95us    1.95us    1.95us  [CUDA memcpy HtoD]

Results 1: Performance results on a Kepler K20c.

Notice that both the doCounter() and doPassedCounter() kernels take approximately 710 ms. Running the same executable with an NVIDIA C2050 on device 1 produces the output in Results 2:

$:~/articles_nvidia/nv025$ nvprof ./first 4000000000 1
======== NVPROF is profiling first...
======== Command: first 4000000000 1
device 1 nSamples 4000000000 spread 32 nBlocks 65535 threadsPerBlock 256
Checking if ParallelCounter is a POD: TRUE
4000000000 4000000000
4000000000 4000000000
***** Passed all sanity checks! *****
======== Profiling result:
 Time(%)      Time   Calls       Avg       Min       Max  Name
   50.00     1.85s       1     1.85s     1.85s     1.85s  doCounter(unsigned int)
   50.00     1.85s       1     1.85s     1.85s     1.85s  doPassedCounter(ParallelCounter<unsigned int=32>*, unsigned int)
    0.00    3.74us       2    1.87us    1.79us    1.95us  [CUDA memcpy DtoH]
    0.00    3.14us       1    3.14us    3.14us    3.14us  initCounter(void)
    0.00    2.39us       1    2.39us    2.39us    2.39us  finiCounter(unsigned int*)
    0.00    1.09us       1    1.09us    1.09us    1.09us  [CUDA memcpy HtoD]

Results 2: Performance results on a Fermi C2050.

These results show that the Kepler card runs roughly 2.6x faster than the Fermi card when using the ParallelCounter class (0.71 seconds vs. 1.85 seconds).

Profiling Atomic Add of a Single Memory Location

The simple source code in Listing Nine performs the same work as the firstCounter.cu example. The two kernels initCounter() and doCounter() should be self-explanatory. The rest of the code follows the same logic as firstCounter.cu.

#include <iostream>
using namespace std;
#include <cstdio>
#include <cstdlib>  // for atoi() and strtoul()
#include <stdint.h>
#include <cassert>

__global__ void initCounter(uint32_t *result) {
  *result = 0;
}
__global__ void doCounter(uint32_t *result, uint32_t nSamples) {
  uint32_t tid = threadIdx.x + blockIdx.x * blockDim.x;

  while(tid < nSamples) {
    atomicAdd(result, 1);
    tid += blockDim.x * gridDim.x;
  }
}

int main(int argc, char *argv[])
{
  try {
    if(argc < 3) {
      cerr << "Use: nSamples device" << endl;
      return -1;
    }
    
    uint32_t nSamples=strtoul(argv[1], NULL, 10); // handles counts above INT_MAX
    int device=atoi(argv[2]);
    
    cudaSetDevice(device);
    if(cudaPeekAtLastError() != cudaSuccess) throw "failed to set device";
    
    int nThreadsPerBlock=256;
    int nBlocks = nSamples/nThreadsPerBlock +((nSamples%nThreadsPerBlock)?1:0);
    if(nBlocks > 65535) nBlocks=65535;
    
    cout << "device " << device << " nSamples " << nSamples
        << " nBlocks " << nBlocks
        << " threadsPerBlock " << nThreadsPerBlock << endl;

    uint32_t *result;
    cudaMalloc(&result, sizeof(uint32_t));
    if(cudaPeekAtLastError() != cudaSuccess) throw "cudaMalloc failed";

    initCounter<<<1,1>>>(result);
    doCounter<<<nBlocks,nThreadsPerBlock>>>(result, nSamples);

    uint32_t check = 10;
    cudaMemcpy(&check, result, sizeof(uint32_t), cudaMemcpyDefault);
    if(cudaPeekAtLastError() != cudaSuccess) throw "memcpy failed";

    cerr << nSamples << " " << check << endl;
    assert(check == nSamples);
    
    cudaFree(result);

  } catch (const char * e) {
    cerr << "caught exception: " << e << endl;
    cudaError err;
    if((err=cudaGetLastError()) != cudaSuccess)
      cerr << "CUDA reports: " << cudaGetErrorString(err) << endl;
    return -1;
  }

}

Listing Nine: Source code for simpleCounter.cu.

The simpleCounter.cu code can be built with the nvcc command in Listing Ten:

nvcc -O3 -arch=sm_20 simpleCounter.cu -o simpleCounter

Listing Ten: The nvcc compilation command for simpleCounter.cu.

Figure 1 shows the excellent performance that can be achieved with the ParallelCounter class. Due to excessive runtime, the C2050 simpleCounter.cu runtimes are reported only up to an nSamples of 400 million. The speed of the Kepler atomicAdd() relative to the C2050 is clearly shown by the green line. Still, even a Fermi GPU using the ParallelCounter class runs faster than a Kepler GPU incrementing a single counter with atomicAdd(). The K20c is clearly the fastest when using the ParallelCounter class. (Note that compiling the applications with sm_35 for Kepler did not affect the reported runtimes.)

Figure 1: Observed performance of simpleCounter.cu and the ParallelCounter class on a K20c and C2050 GPU.

The profiling results reported by nvprof after compiling firstCounter.cu with USE_MAPPED defined (Results 3) show the dramatic impact that the lack of caching has on mapped memory. Note that the runtime increased from 712 ms to 182 seconds (the first two kernels in the profile), which is a 255x slowdown!

$ nvprof ./first 4000000000 0
======== NVPROF is profiling first...
======== Command: first 4000000000 0
device 0 nSamples 4000000000 spread 32 nBlocks 65535 threadsPerBlock 256
4000000000 4000000000
4000000000 4000000000
4000000000 4000000000
Checking if ParallelCounter is a POD: TRUE
***** Passed all sanity checks! *****
======== Profiling result:
 Time(%)      Time   Calls       Avg       Min       Max  Name
   99.22   182.08s       1   182.08s   182.08s   182.08s  doMappedCounter(ParallelCounter<unsigned int=32>*, unsigned int)
    0.39  712.02ms       1  712.02ms  712.02ms  712.02ms  doPassedCounter(ParallelCounter<unsigned int=32>*, unsigned int)
    0.39  710.56ms       1  710.56ms  710.56ms  710.56ms  doCounter(unsigned int)
    0.00   10.85us       1   10.85us   10.85us   10.85us  finiCounter(unsigned int*)
    0.00    6.85us       2    3.42us    3.23us    3.62us  [CUDA memcpy DtoH]
    0.00    3.10us       1    3.10us    3.10us    3.10us  initCounter(void)
    0.00    2.05us       1    2.05us    2.05us    2.05us  [CUDA memcpy HtoD]

Results 3: Performance results on Kepler including the use of mapped memory.

Even though currently limited from a performance point of view, mapped memory is very useful for creating and moving complex objects between host and GPU memory, especially objects that contain pointers. The keys to remember with mapped memory are: use layout- and size-compatible POD structs, and copy heavily utilized regions of mapped memory to faster global memory.

Conclusion

The performance graph in Figure 1 really tells the story of this article. The ParallelCounter class is all about robust performance regardless of how it is used in a parallel environment. The ability to maintain high performance regardless of how it is used — including pathological cases where all the threads increment the counter at the same time — makes the ParallelCounter class useful in applications ranging from histograms to parallel stacks and data allocators.

C++ developers should note the object layout and size compatibility requirements between the host and device. This article discussed and used POD structs, which are the simplest and most restrictive form of C++ compatibility. Newer forms of C++ object compatibility exist, such as the C++11 type traits std::is_standard_layout and std::is_trivially_copyable.
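
For example, with a C++11-capable host compiler these properties can be checked at compile time. (A sketch; the SPREAD value of 32 matches the -DSPREAD=32 build flag used earlier.)

#include <type_traits>

// Compile-time layout checks for host/device compatibility
static_assert(std::is_pod<ParallelCounter<32> >::value,
              "ParallelCounter must be a POD");
static_assert(std::is_standard_layout<ParallelCounter<32> >::value,
              "ParallelCounter must be standard-layout");
static_assert(std::is_trivially_copyable<ParallelCounter<32> >::value,
              "ParallelCounter must be trivially copyable");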

In the future, it is likely that the need for explicit data movement will be almost entirely removed once NVIDIA enables a cached form of mapped memory. Perhaps some form of the Linux madvise() API will be used. When writing the examples for this article, I observed that mapped memory ran as fast as global memory whenever all the data fit inside a single cache line. This indicates that cached mapped memory has the potential to become the de facto method of sharing memory between the host and device(s).


Rob Farber is a frequent contributor to Dr. Dobb's on CPU and GPGPU programming topics.

