CUDA, Supercomputing for the Masses: Part 6


In Part 5 of this article series on CUDA (short for "Compute Unified Device Architecture"), I discussed memory performance and the use of shared memory in reverseArray_multiblock_fast.cu. In this installment, I use the CUDA profiler to examine global memory.

Astute readers of this series timed the two versions of the reverse-array example discussed in Part 4 and Part 5 and were puzzled that the shared memory version is faster than the global memory version. Recall that the kernel in the shared memory version, reverseArray_multiblock_fast.cu, copies array data from global memory to shared memory and then back to global memory, while the slower kernel, reverseArray_multiblock.cu, copies data directly from global memory to global memory. Since global memory is roughly 100x-150x slower than shared memory, shouldn't the much slower global memory accesses dominate the runtime of both examples? Why is the shared memory version faster?

Answering this question requires understanding more about global memory, plus the use of an additional tool from the CUDA development environment: the CUDA profiler. Profiling CUDA software is fast and easy, as both the text and visual versions of the profiler read hardware profile counters on CUDA-enabled devices. Enabling text profiling is as simple as setting the environment variables that start and control the profiler. Using the visual profiler is equally easy: just start cudaprof and start clicking in the GUI. Profiling provides valuable insight, and the collection of profile events is handled entirely by hardware within CUDA-enabled devices. However, profiled kernels are no longer asynchronous: results are reported to the host only after each kernel completes, which minimizes any communications impact.
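As a minimal sketch of the text-profiler setup, the commands below enable the legacy command-line profiler via environment variables. The counter names (gld_coherent, gld_incoherent, gst_coherent, gst_incoherent) apply to compute capability 1.x devices and older toolkits, and the file names are my own choices, not prescribed by the article; newer toolkits replaced this mechanism with nvprof and Nsight.

```shell
# Turn the legacy CUDA text profiler on for the next program run
export CUDA_PROFILE=1                      # enable profiling
export CUDA_PROFILE_LOG=cuda_profile.log   # write results to this file
export CUDA_PROFILE_CONFIG=profile_config  # optional list of extra counters

# Request the coalesced/non-coalesced load and store counters (1.x devices)
cat > profile_config <<'EOF'
gld_coherent
gld_incoherent
gst_coherent
gst_incoherent
EOF

# ./reverseArray_multiblock   # then run the program; counters appear in the log
```

After the run, the log file lists one line per kernel launch and memory copy, with the requested counters appended.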

Global Memory

Understanding how to use global memory efficiently is essential to becoming an adept CUDA programmer. What follows is a brief discussion of global memory, sufficient to understand the performance difference between reverseArray_multiblock.cu and reverseArray_multiblock_fast.cu. Future columns will, of necessity, continue to explore efficient uses of global memory. In the meantime, a detailed discussion of global memory, with illustrations, can be found in Section 5.1.2.1 of the CUDA Programming Guide.

Global memory delivers the highest bandwidth only when the global memory accesses of a half-warp can be coalesced, so that the hardware can fetch (or store) the data in the fewest number of transactions. Devices of compute capability 1.0 and 1.1 can fetch data in a single 64-byte or 128-byte transaction. If the memory accesses cannot be coalesced, a separate memory transaction is issued for each thread in the half-warp, which is undesirable. The performance penalty for non-coalesced memory operations varies with the size of the data type. The CUDA documentation provides rough guidelines for the slowdown to expect with various data type sizes:

  • 32-bit data types will be roughly 10x slower
  • 64-bit data types will be roughly 4x slower
  • 128-bit data types will be roughly 2x slower

Global memory accesses by all threads in a half-warp can be coalesced into efficient memory transactions on the G80 architecture when:

  1. The threads access 32-, 64-, or 128-bit data types.
  2. All 16 words lie in the same segment, of size equal to the memory transaction size (or twice the memory transaction size when accessing 128-bit words). This implies that the starting address and alignment are important.
  3. The threads access the words in sequence: the kth thread in the half-warp accesses the kth word. Note that not all threads in a half-warp need to access memory for the accesses to coalesce; this is called a "divergent warp".
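To make these rules concrete, here is an illustrative sketch (the kernels and their names are mine, not from the article): the first kernel satisfies all three rules, while the second violates rule 3, so on a compute capability 1.x device each thread's access becomes a separate memory transaction.

```cuda
// Coalesced: thread k of each half-warp reads and writes word k of a
// contiguous, properly aligned run of 32-bit words (rules 1-3 hold).
__global__ void copyCoalesced(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Non-coalesced: the stride scatters the half-warp's 32-bit accesses
// across different segments, violating rule 3, so a 1.x device issues
// one transaction per thread instead of one per half-warp.
__global__ void copyStrided(float *out, const float *in, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```

Running both kernels under the profiler with the gld/gst counters enabled would show the strided version generating many incoherent loads and stores where the coalesced version generates none.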

Newer architectures, such as the GT200 family of devices, have more relaxed coalescing requirements than those just discussed. I will discuss architectural differences more deeply in a future column. For now, suffice it to say that if you tune your code to coalesce well on a G80 CUDA-enabled device, it will coalesce well on a GT200 device.


Comments:


(2013-02-16)
Great series, and very helpful so far until here. Can you please clarify whether this information (Lesson 6) remains relevant/true in CUDA 3.0? Admittedly I do not understand this lesson (yet), but I cannot reproduce your results either.

1) reverseArray_fast.cu (shared-memory version) IS slower. Numerous trials have validated my expectation. reverseArray_fast.cu performs an equal number of global-mem read/writes in addition to shared-memory read/writes. (NOTE: I've been timing kernels only, not host/device mem-copies. Is this wrong?)

2) The profiler-configuration flags discussed here appear unavailable in the GUI and CMD-line profiler. Is there a newer equivalent?

===
// launch kernel
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
sdkStartTimer(&timer);
reverseArrayBlockFast<<< dimGrid, dimBlock, sharedMemSize >>>( d_b, d_a );
cudaThreadSynchronize();
sdkStopTimer(&timer);

===
( // Run with: dimA = 256 * 1024)
NV_Warning: Ignoring the invalid profiler config option: gld_coherent
NV_Warning: Ignoring the invalid profiler config option: gld_incoherent
NV_Warning: Ignoring the invalid profiler config option: gst_coherent
NV_Warning: Ignoring the invalid profiler config option: gst_incoherent
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 670
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR 12e08ac4270b69d0
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 330.880 ] cputime=[ 534.400 ]
method=[ _Z17reverseArrayBlockPiS_ ] gputime=[ 20.896 ] cputime=[ 28.760 ] occupancy=[ 1.000 ]
method=[ memcpyDtoH ] gputime=[ 764.384 ] cputime=[ 1338.480 ]
method=[ memcpyHtoD ] gputime=[ 332.416 ] cputime=[ 463.600 ]
method=[ _Z21reverseArrayBlockFastPiS_ ] gputime=[ 23.424 ] cputime=[ 28.680 ] occupancy=[ 1.000 ]
method=[ memcpyDtoH ] gputime=[ 764.416 ] cputime=[ 1352.760 ]

===
( // Run with: dimA = 256 * 1024 * 32 )
NV_Warning: Ignoring the invalid profiler config option: gld_coherent
NV_Warning: Ignoring the invalid profiler config option: gld_incoherent
NV_Warning: Ignoring the invalid profiler config option: gst_coherent
NV_Warning: Ignoring the invalid profiler config option: gst_incoherent
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 670
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR 12e08ac4271a3c70
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 31802.623 ] cputime=[ 32370.561 ]
method=[ _Z17reverseArrayBlockPiS_ ] gputime=[ 528.640 ] cputime=[ 53.080 ] occupancy=[ 1.000 ]
method=[ memcpyDtoH ] gputime=[ 41440.992 ] cputime=[ 42200.438 ]
method=[ memcpyHtoD ] gputime=[ 25485.152 ] cputime=[ 25864.680 ]
method=[ _Z21reverseArrayBlockFastPiS_ ] gputime=[ 606.368 ] cputime=[ 29.320 ] occupancy=[ 1.000 ]
method=[ memcpyDtoH ] gputime=[ 38040.000 ] cputime=[ 38555.402 ]


(2012-10-26)

No noticeable difference on my machine. For some reason, using shared memory is slower.

alpha@alphahost:~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release$ ./reverseArray_slow
Correct!
alpha@alphahost:~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release$ cat cuda_profile_0.log
NV_Warning: Ignoring the invalid profiler config option: gputime
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 680
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR fffff6a3f3cd8c60
method,gputime,cputime,regperthread,occupancy
method=[ memcpyHtoD ] gputime=[ 158.016 ] cputime=[ 136.000 ]
method=[ _Z17reverseArrayBlockPiS_ ] gputime=[ 18.176 ] cputime=[ 9.000 ] regperthread=[ 10 ] occupancy=[ 1.000 ]
method=[ memcpyDtoH ] gputime=[ 158.624 ] cputime=[ 402.000 ]
alpha@alphahost:~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release$ ./reverseArray_fast
Correct!
alpha@alphahost:~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release$ cat cuda_profile_0.log
NV_Warning: Ignoring the invalid profiler config option: gputime
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 680
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR fffff6a3f3d2b9a0
method,gputime,cputime,regperthread,occupancy
method=[ memcpyHtoD ] gputime=[ 158.016 ] cputime=[ 130.000 ]
method=[ _Z17reverseArrayBlockPiS_ ] gputime=[ 19.808 ] cputime=[ 9.000 ] regperthread=[ 6 ] occupancy=[ 1.000 ]
method=[ memcpyDtoH ] gputime=[ 158.208 ] cputime=[ 404.000 ]


SeanTechWeb (2012-09-20)

Thanks for the report. This has been fixed.


(2012-09-20)

This page is overlaid on the code when using the print option. Try it!!

