CUDA: Unifying Host/Device Interactions with a Single C++ Macro


Dynamically Allocate an Array of Objects on the Host and Initialize in Parallel

In Listing Nine (test 5), the entire device-side copy of the foo array is allocated with a single call to cudaMalloc(). Per the earlier discussion, the set_d_ptr() method is called with the location of each element in the device array. For demonstration purposes, the initialization loop is parallelized with the OpenMP "omp parallel for" pragma, which requires passing the -fopenmp flag to the host compiler through nvcc's -Xcompiler option (as in the build command shown later).

Listing Nine: The test 5 code snippet.

    { // test 5: dynamically allocate an array of objects. Makes use
      // of set_d_ptr and the contiguous layout of an array of
      // shadowed objects.
      int n=1000000;
      int index=10;
      ParallelCounter<SPREAD> *foo = new ParallelCounter<SPREAD>[n];
      ParallelCounter<SPREAD> *d_foo;
      cudaMalloc(&d_foo, n*sizeof(ParallelCounter<SPREAD>));
      
      // Note: the ParallelCounters array must be contiguous to work!
#pragma omp parallel for
      for(int i=0; i < n; i++) {
        foo[i].set(0);
        foo[i].set_d_ptr(d_foo + i);
      }
      
      //try index and adjacent indices to see if corruption occurs
      doTest5<<<nBlocks, nThreadsPerBlock>>>(foo[index].d_ptr(),nSamples);
      uint32_t check = foo[index].getCount(); // read result on host
      assert(check == nSamples);

      assert( (index-1) >=0);
      doTest5<<<nBlocks, nThreadsPerBlock>>>(foo[index-1].d_ptr(),nSamples);
      check = foo[index-1].getCount(); // read result on host
      assert(check == nSamples);

      assert( (index+1) < n);
      doTest5<<<nBlocks, nThreadsPerBlock>>>(foo[index+1].d_ptr(),nSamples);
      check = foo[index+1].getCount(); // read result on host
      assert(check == nSamples);

      delete [] foo;
      cudaFree(d_foo);
      cerr << "Passed test 5" << endl;
    }
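
As the comment in the listing notes, this approach relies on the host and device arrays sharing the same contiguous object layout. The test program verifies std::is_standard_layout at run time (the "is_standard_layout: 1" line in Listing Eleven); with a C++11 host compiler, the same check can also be made at compile time. A minimal sketch, not part of the article's test harness:

    #include <type_traits>

    // Compile-time guard: the shadowing scheme assumes ParallelCounter<SPREAD>
    // has a predictable, standard layout that is identical on host and device.
    static_assert(std::is_standard_layout< ParallelCounter<SPREAD> >::value,
                  "ParallelCounter<SPREAD> must be standard-layout");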

As part of the sanity checking, this code exercises the counters at index-1, index, and index+1. This provides a basic (albeit not comprehensive) check that adjacent ParallelCounter objects do not overlap or corrupt one another.
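
The doTest5() kernel itself is defined in the test harness (Listing Four) and is not reproduced here. A minimal sketch of what such an increment kernel might look like, assuming ParallelCounter provides an atomic operator+= on the device (the kernel name and grid-stride loop below are illustrative, not the article's exact code):

    // Hypothetical sketch of an increment kernel along the lines of doTest5().
    // The threads of the grid collectively perform nSamples increments.
    __global__ void incrementTest(ParallelCounter<SPREAD> *counter, uint32_t nSamples)
    {
      uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
      for (uint32_t i = tid; i < nSamples; i += blockDim.x * gridDim.x)
        *counter += 1; // assumes an atomic, device-side operator+=
    }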

Dynamically Allocate and Use Objects Solely on the Device

The final test (Listing Ten) demonstrates that ParallelCounter objects that include SHADOW_MACRO() can be created and utilized solely on the device:

  • The allocDeviceTest() kernel creates an array of n objects on the GPU.
  • The initDeviceTest() kernel initializes the objects, in parallel, to zero.
  • The fcnDeviceTest() kernel increments the counter in parallel nSamples times.
  • The finiDeviceTest() kernel calls getCount() on the device to return the counter value in result and frees the array of objects on the GPU.

An assert() is used on the host to verify the counter value is correct. A hedged sketch of what these device-only kernels might look like appears after Listing Ten.

Listing Ten: The test 6 code snippet.

    { // test 6: allocate and use on GPU
      int n=10000; //subject to internal GPU allocation limits
      int index=10;
      uint32_t *result;
      cudaMalloc(&result, sizeof(uint32_t));
      if(cudaPeekAtLastError() != cudaSuccess) throw "cudaMalloc failed";

      allocDeviceTest<<<1,1>>>(n);
      initDeviceTest<<<n/256+1,256>>>(n);
      fcnDeviceTest<<<nBlocks,nThreadsPerBlock>>>(index, nSamples);
      finiDeviceTest<<<1,1>>>(index, result);

      uint32_t check = 10; // arbitrary initial value that differs from nSamples
      cudaMemcpy(&check, result, sizeof(uint32_t), cudaMemcpyDefault);
      if(cudaPeekAtLastError() != cudaSuccess) throw "memcpy failed";
      assert(check == nSamples);

      cudaFree(result);
      cerr << "Passed test 6" << endl;
    }
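
The kernels themselves live in the test harness (Listing Four). As a rough, hedged sketch of how such device-only kernels can be written (the _sketch suffix below emphasizes that this is an illustration, not the article's exact code), a __device__ global pointer can hold an array created with device-side new so that no host pointer is ever involved:

    // Hedged sketch of device-only allocation, initialization, use, and cleanup.
    __device__ ParallelCounter<SPREAD> *d_counters;

    __global__ void allocDeviceTest_sketch(unsigned int n) {
      if (blockIdx.x == 0 && threadIdx.x == 0)
        d_counters = new ParallelCounter<SPREAD>[n]; // device-side heap allocation
    }
    __global__ void initDeviceTest_sketch(unsigned int n) {
      unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d_counters[i].set(0);               // assumes set() is callable on the device
    }
    __global__ void fcnDeviceTest_sketch(int index, uint32_t nSamples) {
      uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
      for (uint32_t i = tid; i < nSamples; i += blockDim.x * gridDim.x)
        d_counters[index] += 1;                      // assumes an atomic operator+=
    }
    __global__ void finiDeviceTest_sketch(int index, uint32_t *result) {
      if (blockIdx.x == 0 && threadIdx.x == 0) {
        *result = d_counters[index].getCount();      // getCount() runs on the device
        delete [] d_counters;                        // free the device-side array
      }
    }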

Performance Analysis

The NVIDIA nvprof text-based profiler is used to collect the performance data. This eliminates the need to manually instrument the example code, keeping it both cleaner and simpler.

To build the test code, save the source code in Listing Three to ParallelCounter.hpp and Listing Four to secondCounter.cu. The test application can be compiled under Linux for compute capability 2.0 (sm_20) and later devices with the nvcc command shown below:

nvcc -O3 -DSPREAD=32 -arch=sm_20 -Xcompiler -fopenmp secondCounter.cu -o second

Profiling the application with nvprof while incrementing the counter 4 billion times produces the report shown in Listing Eleven when running on a Kepler K20c installed as device 0:

Listing Eleven: Performance results on a Kepler K20c.

 $ nvprof ./second 4000000000 0
======== NVPROF is profiling second...
======== Command: second 4000000000 0
device 0 nSamples 4000000000 spread 32 nBlocks 65535 threadsPerBlock 256
ParallelCounter<SPREAD> is_standard_layout: 1
ParallelCounter is in standard layout
Passed test 1
Passed test 2
Passed test 3
Passed test 4
Passed test 5
Passed test 6
***** Passed all sanity checks! *****
======== Profiling result:
 Time(%)      Time   Calls       Avg       Min       Max  Name
   34.51     1.78s       3  594.33ms  536.01ms  710.85ms  doTest5(ParallelCounter<unsigned int=32>*, unsigned int)
   13.76  710.90ms       1  710.90ms  710.90ms  710.90ms  doTest3(ParallelCounter<unsigned int=32>*, unsigned int)
   13.76  710.90ms       1  710.90ms  710.90ms  710.90ms  doTest2(ParallelCounter<unsigned int=32>*, unsigned int)
   13.76  710.90ms       1  710.90ms  710.90ms  710.90ms  doTest4(ParallelCounter<unsigned int=32>*, unsigned int)
   13.76  710.90ms       1  710.90ms  710.90ms  710.90ms  doTest1(ParallelCounter<unsigned int=32>*, unsigned int)
   10.40  537.23ms       1  537.23ms  537.23ms  537.23ms  fcnDeviceTest(int, unsigned int)
    0.04    1.96ms       1    1.96ms    1.96ms    1.96ms  allocDeviceTest(unsigned int)
    0.00   84.25us       1   84.25us   84.25us   84.25us  initDeviceTest(unsigned int)
    0.00   27.90us       8    3.49us    3.04us    3.81us  [CUDA memcpy DtoH]
    0.00   20.10us       1   20.10us   20.10us   20.10us  finiDeviceTest(int, unsigned int*)
    0.00   15.01us       7    2.14us    2.05us    2.37us  [CUDA memcpy HtoD]

These profiling results clearly indicate that the shadowed ParallelCounter class delivers the same GPU performance as the original ParallelCounter class from the previous article. It is reasonable to conclude that SHADOW_MACRO() provides much of the convenience of mapped memory without negative performance consequences. How the shadowed version of ParallelCounter is used does not seem to matter, as all the tests take approximately the same time.

Profiling an NVIDIA C2070 on device 1 produces the output shown in Listing Twelve:

Listing Twelve: Performance results on a Fermi C2070.

$ nvprof ./second 4000000000 1
======== NVPROF is profiling second...
======== Command: second 4000000000 1
device 1 nSamples 4000000000 spread 32 nBlocks 65535 threadsPerBlock 256
ParallelCounter<SPREAD> is_standard_layout: 1
ParallelCounter is in standard layout
Passed test 1
Passed test 2
Passed test 3
Passed test 4
Passed test 5
Passed test 6
***** Passed all sanity checks! *****
======== Profiling result:
 Time(%)      Time   Calls       Avg       Min       Max  Name
   43.97     7.27s       3     2.42s     1.85s     2.76s  doTest5(ParallelCounter<unsigned int=32>*, unsigned int)
   11.20     1.85s       1     1.85s     1.85s     1.85s  fcnDeviceTest(int, unsigned int)
   11.20     1.85s       1     1.85s     1.85s     1.85s  doTest1(ParallelCounter<unsigned int=32>*, unsigned int)
   11.20     1.85s       1     1.85s     1.85s     1.85s  doTest2(ParallelCounter<unsigned int=32>*, unsigned int)
   11.20     1.85s       1     1.85s     1.85s     1.85s  doTest4(ParallelCounter<unsigned int=32>*, unsigned int)
   11.20     1.85s       1     1.85s     1.85s     1.85s  doTest3(ParallelCounter<unsigned int=32>*, unsigned int)
    0.01    1.80ms       1    1.80ms    1.80ms    1.80ms  allocDeviceTest(unsigned int)
    0.00  148.90us       1  148.90us  148.90us  148.90us  initDeviceTest(unsigned int)
    0.00   19.41us       1   19.41us   19.41us   19.41us  finiDeviceTest(int, unsigned int*)
    0.00   16.29us       8    2.04us    1.57us    2.18us  [CUDA memcpy DtoH]
    0.00   10.18us       7    1.45us    1.09us    2.02us  [CUDA memcpy HtoD]

The K20c Kepler GPU exhibits a roughly 2.6x speedup over the Fermi C2070, due (in part) to Kepler's much faster atomic operations. Still, the Fermi GPU performs well even though this is a pathological test in which all the threads attempt to increment the parallel counter at the same time.

Conclusion

The integration of the general-purpose SHADOW_MACRO() into the ParallelCounter class adds much of the transparency and simplicity of mapped memory without sacrificing speed. Host-side STL classes can work with SHADOW_MACRO()-enabled classes to leverage the power and convenience of the STL while potentially delivering significant performance gains over host-only execution. For both performance and convenience reasons, programmers should consider incorporating SHADOW_MACRO() into their classes.
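
For example, because std::vector guarantees contiguous storage, the host-side array of shadowed objects in test 5 could just as easily live in a vector. A minimal sketch under that assumption (this variant is not part of the article's test harness):

    #include <vector>

    // Hypothetical variant of test 5 that uses std::vector as the host-side
    // container; std::vector guarantees the contiguous layout that the
    // shadowing scheme requires.
    int n = 1000000;
    std::vector< ParallelCounter<SPREAD> > foo(n);
    ParallelCounter<SPREAD> *d_foo;
    cudaMalloc(&d_foo, n * sizeof(ParallelCounter<SPREAD>));

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
      foo[i].set(0);
      foo[i].set_d_ptr(d_foo + i); // point each host object at its device shadow
    }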

In the future, the need to explicitly manage data movement between host and device will likely be almost entirely removed if NVIDIA enables a cached form of mapped memory, perhaps through some form of the Linux madvise() API. When writing the examples for this article, I observed that mapped memory ran as fast as global memory when all the data fit inside a single cache line. This indicates that cached mapped memory has the potential to become the de facto method of sharing memory between the host and GPU devices in the near future.
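
For reference, the mapped (zero-copy) memory discussed above is set up with standard CUDA runtime calls along the following lines; a minimal sketch, not tied to the article's test code:

    // Minimal sketch of mapped (zero-copy) memory with the CUDA runtime.
    uint32_t *h_data, *d_data;
    cudaSetDeviceFlags(cudaDeviceMapHost);        // must precede context creation
    cudaHostAlloc(&h_data, sizeof(uint32_t), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_data, h_data, 0); // device-visible alias of h_data
    // Kernels can now dereference d_data directly; today such accesses are
    // uncached on the GPU, which is the gap a cached form of mapped memory
    // would close.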


Rob Farber is a frequent contributor to Dr. Dobb's on CPU and GPGPU programming topics.

Related Article

Atomic Operations and Low-Wait Algorithms in CUDA

