Building the Test Code
Save the source code in Listing Five to histo.cu. Also save the source code for parts 1 and 2 (Listings Three and Four) to MappedTypeArray.hpp. The test application can be compiled under Linux for sm 2.0 and later devices with the following nvcc command:
nvcc -O3 -DSPREAD=32 -arch=sm_20 -Xcompiler -fopenmp histo.cu -o histo
Profiling the application with nvprof while incrementing the counter 4 billion times produces Listing Six when running on a Kepler K20c installed as device 0.
Listing Six: Performance results on a Kepler K20c.
$ nvprof ./histo 4000000000 5 0
======== NVPROF is profiling histo...
======== Command: histo 4000000000 5 0
device 0 nSamples 4000000000 spread 32 nBlocks 65535 threadsPerBlock 256
MappedTypeArray is_standard_layout: 1
MappedTypeArray is in standard layout
Before push device reports: Size 4
After push device reports: Size 9
The host reports 13 items allocated
bin 0 count 800000000
bin 1 count 800000000
bin 2 count 800000000
bin 3 count 800000000
bin 4 count 800000000
bin 100 count 0
bin 101 count 0
bin 102 count 0
bin 103 count 0
bin 200 count 0
bin 201 count 0
bin 202 count 0
bin 203 count 0
total 4000000000 should have 4000000000
***** Passed all sanity checks! *****
======== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
  99.97  716.39ms      1  716.39ms  716.39ms  716.39ms  doHisto(unsigned int, int)
   0.02  154.75us      1  154.75us  154.75us  154.75us  pushResults(unsigned int, MappedTypeArray<HistoType>*, int)
   0.00   23.23us      1   23.23us   23.23us   23.23us  createHisto(int)
   0.00    7.71us      1    7.71us    7.71us    7.71us  initHisto(int)
   0.00    3.36us      1    3.36us    3.36us    3.36us  [CUDA memcpy DtoH]
   0.00    1.95us      1    1.95us    1.95us    1.95us  [CUDA memcpy HtoD]
These profiling results clearly indicate that nearly all of the time was spent filling the histogram with the doHisto() kernel.
The output shows that the bins reported by the host contain the correct values for the host preload, the device histogram calculation, and the host load after the device has finished. The output from the printf() in doHisto() indicates that the device-side output is correct as well.
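For reference, the "total ... should have ..." and sanity-check lines in the output come from a host-side verification pass. A minimal sketch of that kind of check follows; the names (HistoType, bins, nBins, nSamples) are illustrative stand-ins rather than code from the article's listings:

#include <cstdio>

// Stand-in for the bin structure defined in the article's listings.
struct HistoType {
   int bin;
   unsigned long long count;
};

// Sum the per-bin counts and verify against the number of samples processed.
bool sanityCheck(const HistoType *bins, int nBins, unsigned long long nSamples)
{
   unsigned long long total = 0;
   for (int i = 0; i < nBins; ++i) {
      printf("bin %d count %llu\n", bins[i].bin, bins[i].count);
      total += bins[i].count;
   }
   printf("total %llu should have %llu\n", total, nSamples);
   return total == nSamples;
}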
Profiling an NVIDIA C2070 on device 1 (Listing Seven) shows that MappedTypeArray works correctly and efficiently on Fermi cards as well.
Listing Seven: Performance results on a Fermi C2070.
$ nvprof ./histo 4000000000 5 1
======== NVPROF is profiling histo...
======== Command: histo 4000000000 5 1
device 1 nSamples 4000000000 spread 32 nBlocks 65535 threadsPerBlock 256
MappedTypeArray is_standard_layout: 1
MappedTypeArray is in standard layout
Before push device reports: Size 4
After push device reports: Size 9
The host reports 13 items allocated
bin 0 count 800000000
bin 1 count 800000000
bin 2 count 800000000
bin 3 count 800000000
bin 4 count 800000000
bin 100 count 0
bin 101 count 0
bin 102 count 0
bin 103 count 0
bin 200 count 0
bin 201 count 0
bin 202 count 0
bin 203 count 0
total 4000000000 should have 4000000000
***** Passed all sanity checks! *****
======== Profiling result:
Time(%)      Time  Calls      Avg      Min      Max  Name
 100.00     6.46s      1    6.46s    6.46s    6.46s  doHisto(unsigned int, int)
   0.00   98.09us      1  98.09us  98.09us  98.09us  pushResults(unsigned int, MappedTypeArray<HistoType>*, int)
   0.00   20.32us      1  20.32us  20.32us  20.32us  createHisto(int)
   0.00    6.46us      1   6.46us   6.46us   6.46us  initHisto(int)
   0.00    1.95us      1   1.95us   1.95us   1.95us  [CUDA memcpy DtoH]
   0.00    1.50us      1   1.50us   1.50us   1.50us  [CUDA memcpy HtoD]
An Interesting Experiment with Mapped Memory
Allocating a completely memory-mapped version of the MappedTypeArray class is straightforward (a minimal allocation sketch appears after the list below). Tests have shown that performance remained high even though MappedTypeArray objects heavily reuse data in the BoundedParallelCounter count array. It seems likely that performance stays high for two reasons:
- Allocating/pushing data is a linear operation that works nicely in mapped memory, even though mapped memory is not currently cached on the GPU.
- Using the default size of 32 for N_ATOMIC means that the entire BoundedParallelCounter count array can fit on one cache line. While high performance in this situation is not guaranteed, tests have shown that repeated operations on a single mapped-memory cache line (which will occur with repeated scavenge() operations) can deliver high performance. Unfortunately, it is not possible to guarantee that only a single cache line will be used, which means that arbitrary PCIe data transfers can unexpectedly degrade performance by orders of magnitude.
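For readers who wish to experiment, here is a minimal sketch of the pinned, mapped (zero-copy) allocation that a completely memory-mapped MappedTypeArray would be built on. This is standard CUDA runtime boilerplate rather than code from the article's listings, and the buffer type and size are arbitrary:

#include <cuda_runtime.h>

int main()
{
   // Mapped (zero-copy) memory must be enabled before the context is created.
   cudaSetDeviceFlags(cudaDeviceMapHost);

   float *hPtr = 0, *dPtr = 0;
   size_t nBytes = 1024 * sizeof(float);

   // Pinned, mapped allocation: a single region visible to host and device.
   cudaHostAlloc((void**)&hPtr, nBytes, cudaHostAllocMapped);

   // dPtr aliases the same storage; kernels that dereference it read and
   // write host memory directly across PCIe -- no cudaMemcpy is required.
   cudaHostGetDevicePointer((void**)&dPtr, hPtr, 0);

   // ... launch kernels that operate on dPtr, then read results via hPtr ...

   cudaFreeHost(hPtr);
   return 0;
}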
Conclusion
In practice, the MappedTypeArray class has proven to be extremely useful because it acts as both a parallel stack and a fast, constant-sized memory allocator. Of special note is the ability to load complex data structures on the host for use by the GPU, and vice versa.
The programmer must ensure that the C++ object layout is usable by both the host and device, and that layout compatibility is not compromised by specifying a non-standard-layout type as the MappedTypeArray template parameter. Double-checking that the type is either standard layout or trivially copyable is important. (For more information, see the GNU __is_pod(), __is_standard_layout(), and __has_trivial_copy() compiler intrinsics. Microsoft users can use the is_pod(), is_standard_layout(), and has_trivial_copy() type traits.)
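As a concrete illustration, a compile-time version of this check can be written with the standard <type_traits> header on C++11-capable compilers. The HistoType struct below is a stand-in for whatever type is actually passed to the template:

#include <type_traits>

// Stand-in element type; substitute the type passed to MappedTypeArray.
struct HistoType {
   int bin;
   unsigned long long count;
};

// Fail the build if the type cannot safely be shared by host and device.
static_assert(std::is_standard_layout<HistoType>::value,
              "MappedTypeArray element type must be standard layout");
static_assert(std::is_trivially_copyable<HistoType>::value,
              "MappedTypeArray element type must be trivially copyable");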
The capabilities provided by MappedTypeArray as a fast, constant-sized memory allocator and parallel stack are very useful, and arguably essential for code that utilizes dynamic parallelism. For this reason, the MappedTypeArray class is designed to be a general-purpose class both now and in the future.
To speed multi-device allocation, the MappedTypeArray class can be adapted to act as a NUMA-like memory allocator for applications that span multiple GPUs, and for hybrid code that also uses the host at the same time. The MappedTypeArray class can also be easily adapted to utilize cached mapped memory when that feature is eventually added to CUDA.
Looking to the future, it is likely that Gemini (or twin) GPU configurations will give way to 4-way, 8-way, and even 16-way GPU devices (see Rob Farber on the Far-reaching HPC Implications from GTC 2013 for why I think this will happen). It is reasonable to assume that these multi-GPU and hybrid CPU + GPU systems will only further burden the CUDA memory allocator, causing it to become even more performance-challenged. As a result, the need for a fast, constant-sized memory allocator like the MappedTypeArray class will become even more pressing.
Rob Farber is a frequent contributor to Dr. Dobb's on CPU and GPGPU programming topics.
Related Articles
Atomic Operations and Low-Wait Algorithms in CUDA