A Massively Parallel Stack for Data Allocation


Building the Test Code

Save the source code in Listing Five to histo.cu. Also save the source code for parts 1 and 2 (Listings Three and Four) to MappedTypeArray.hpp. The test application can be compiled under Linux for sm 2.0 and later devices with the following nvcc command:

nvcc -O3 -DSPREAD=32 -arch=sm_20 -Xcompiler -fopenmp histo.cu -o histo

Profiling the application with nvprof while incrementing the counter 4 billion times produces Listing Six when running on a Kepler K20c installed as device 0.

Listing Six: Performance results on a Kepler K20c.

$ nvprof ./histo 4000000000 5 0
======== NVPROF is profiling histo...
======== Command: histo 4000000000 5 0
device 0 nSamples 4000000000 spread 32 nBlocks 65535 threadsPerBlock 256
MappedTypeArray is_standard_layout: 1
MappedTypeArray is in standard layout
Before push device reports: Size 4
After push device reports: Size 9
The host reports 13 items allocated
bin 0 count 800000000
bin 1 count 800000000
bin 2 count 800000000
bin 3 count 800000000
bin 4 count 800000000
bin 100 count 0
bin 101 count 0
bin 102 count 0
bin 103 count 0
bin 200 count 0
bin 201 count 0
bin 202 count 0
bin 203 count 0
total 4000000000 should have 4000000000
***** Passed all sanity checks! *****
======== Profiling result:
 Time(%)      Time   Calls       Avg       Min       Max  Name
   99.97  716.39ms       1  716.39ms  716.39ms  716.39ms  doHisto(unsigned int, int)
    0.02  154.75us       1  154.75us  154.75us  154.75us  pushResults(unsigned int, MappedTypeArray<HistoType>*, int)
    0.00   23.23us       1   23.23us   23.23us   23.23us  createHisto(int)
    0.00    7.71us       1    7.71us    7.71us    7.71us  initHisto(int)
    0.00    3.36us       1    3.36us    3.36us    3.36us  [CUDA memcpy DtoH]
    0.00    1.95us       1    1.95us    1.95us    1.95us  [CUDA memcpy HtoD]

These profiling results clearly indicate that nearly all of the time was spent filling the histogram with the doHisto() kernel.

The output shows that the bins reported by the host contain the correct values for the host preload, the device histogram calculation, and the host load performed after the device has finished. The device-side printf() output (the "Size" messages before and after the push) confirms that the device-side size is also correct.
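For reference, the final sanity check amounts to summing every host-visible bin and comparing the total against the number of samples. The sketch below illustrates that kind of check only; the type and variable names (Bin, bins, nSamples) are hypothetical stand-ins, not the code from Listing Five.

#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical stand-in for the HistoType items retrieved from the device.
struct Bin { int id; uint64_t count; };

// Sum every host-visible bin and verify the total matches nSamples.
bool passesSanityCheck(const std::vector<Bin>& bins, uint64_t nSamples) {
  uint64_t total = 0;
  for (const Bin& b : bins) {
    std::cout << "bin " << b.id << " count " << b.count << '\n';
    total += b.count;
  }
  std::cout << "total " << total << " should have " << nSamples << '\n';
  return total == nSamples;
}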

Profiling on an NVIDIA C2070 as device 1 shows that MappedTypeArray works correctly and efficiently on Fermi cards as well (Listing Seven).

Listing Seven: Performance results on a Fermi C2070.

$ nvprof ./histo 4000000000 5 1
======== NVPROF is profiling histo...
======== Command: histo 4000000000 5 1
device 1 nSamples 4000000000 spread 32 nBlocks 65535 threadsPerBlock 256
MappedTypeArray is_standard_layout: 1
MappedTypeArray is in standard layout
Before push device reports: Size 4
After push device reports: Size 9
The host reports 13 items allocated
bin 0 count 800000000
bin 1 count 800000000
bin 2 count 800000000
bin 3 count 800000000
bin 4 count 800000000
bin 100 count 0
bin 101 count 0
bin 102 count 0
bin 103 count 0
bin 200 count 0
bin 201 count 0
bin 202 count 0
bin 203 count 0
total 4000000000 should have 4000000000
***** Passed all sanity checks! *****
======== Profiling result:
 Time(%)      Time   Calls       Avg       Min       Max  Name
  100.00     6.46s       1     6.46s     6.46s     6.46s  doHisto(unsigned int, int)
    0.00   98.09us       1   98.09us   98.09us   98.09us  pushResults(unsigned int, MappedTypeArray<HistoType>*, int)
    0.00   20.32us       1   20.32us   20.32us   20.32us  createHisto(int)
    0.00    6.46us       1    6.46us    6.46us    6.46us  initHisto(int)
    0.00    1.95us       1    1.95us    1.95us    1.95us  [CUDA memcpy DtoH]
    0.00    1.50us       1    1.50us    1.50us    1.50us  [CUDA memcpy HtoD]

An Interesting Experiment with Mapped Memory

Allocating a completely memory-mapped version of the MappedTypeArray class is straightforward (a minimal allocation sketch appears after the list below). Tests have shown that performance remained high even though MappedTypeArray objects heavily reuse the BoundedParallelCounter count array. It seems likely that performance stays high for two reasons:

  • Allocating/pushing data tends to be a linear operation that works well in mapped memory, even though mapped memory is not currently cached on the GPU.
  • Using the default size of 32 for N_ATOMIC means that the entire BoundedParallelCounter count array can fit onto one cache line. While high performance in this situation is not guaranteed, tests have shown that repeated operations on a single mapped-memory cache line (which will occur with repeated scavenge() operations) can deliver high performance. Unfortunately, it is not possible to guarantee that only a single cache line will be used, which means that arbitrary PCIe data transfers can unexpectedly degrade performance by orders of magnitude.
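For readers who want to experiment with a fully memory-mapped variant, the sketch below shows one way to place a device-visible object in mapped (zero-copy) memory with the CUDA runtime. The DeviceVisibleObject type is only a placeholder; this is not the allocation code from Listings Three and Four.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Placeholder for the real MappedTypeArray<T> from Listings Three and Four.
struct DeviceVisibleObject { unsigned int count[32]; };

int main() {
  // Mapped (zero-copy) allocations require host mapping to be enabled
  // before the CUDA context is created.
  cudaSetDeviceFlags(cudaDeviceMapHost);

  DeviceVisibleObject* hostPtr = nullptr;
  if (cudaHostAlloc((void**)&hostPtr, sizeof(DeviceVisibleObject),
                    cudaHostAllocMapped) != cudaSuccess) {
    fprintf(stderr, "cudaHostAlloc failed\n");
    return EXIT_FAILURE;
  }

  // The device sees the same bytes through a device pointer, so both the
  // host and GPU kernels can operate on the one object without copies.
  DeviceVisibleObject* devPtr = nullptr;
  cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0);

  // ... pass devPtr to kernels, use hostPtr on the CPU ...

  cudaFreeHost(hostPtr);
  return EXIT_SUCCESS;
}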

Conclusion

In practice, the MappedTypeArray class has proven to be extremely useful because it acts as both a parallel stack and a fast, fixed-size memory allocator. Of special note is the ability to load complex data structures on the host for use by the GPU, and vice versa.

The programmer must ensure that the C++ object layout is usable by both the host and device, and that layout compatibility is not compromised by specifying a non-standard-layout type as the MappedTypeArray template parameter. It is important to double-check that the type is either standard layout or can be trivially copied. (For more information, see the GNU __is_pod(), __is_standard_layout(), and __has_trivial_copy() compiler intrinsics. Microsoft users can use the is_pod(), is_standard_layout(), and has_trivial_copy() type traits.)
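As a concrete guard, the layout requirement can also be enforced at compile time with the standard <type_traits> header (or the compiler intrinsics named above on older toolchains). The payload struct below is purely illustrative; the real requirement applies to whatever type is passed to the MappedTypeArray template.

#include <type_traits>

// Illustrative payload type (not the HistoType from Listing Five).
struct Payload {
  unsigned int bin;
  unsigned long long count;
};

// Fail the build if the type cannot be shared safely between host and
// device through mapped memory.
static_assert(std::is_standard_layout<Payload>::value,
              "MappedTypeArray payload must be standard layout");
static_assert(std::is_trivially_copyable<Payload>::value,
              "MappedTypeArray payload must be trivially copyable");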

The capabilities provided by MappedTypeArray as a fast, fixed-size memory allocator and parallel stack are very useful, and arguably essential, for code that utilizes dynamic parallelism. For this reason, the MappedTypeArray class is designed to remain a general-purpose class both now and in the future.

To speed multi-device allocation, the MappedTypeArray class can be adapted to act as a NUMA-like memory allocator for applications that span multiple GPUs, as well as for hybrid code that also uses the host at the same time. The MappedTypeArray class can also be easily adapted to use cached mapped memory when that feature is eventually added to CUDA.

Looking to the future, it is likely that Gemini (or twin) GPU configurations will give way to 4-way, 8-way, and even 16-way GPU devices (see Rob Farber on the Far-reaching HPC Implications from GTC 2013 for why I think this will happen). It is reasonable to assume that these multi-GPU and hybrid CPU + GPU systems will further burden the CUDA memory allocator, making it even more performance-challenged. As a result, the need for a fast, fixed-size memory allocator like the MappedTypeArray class will become even more pressing.


Rob Farber is a frequent contributor to Dr. Dobb's on CPU and GPGPU programming topics.

Related Articles

Atomic Operations and Low-Wait Algorithms in CUDA

Unifying Host/Device Interactions with a Single C++ Macro

A Robust Histogram for Massive Parallelism

