CUDA, Supercomputing for the Masses: Part 21


Shared Memory

On Fermi, shared memory can be configured as either 16KB or 48KB per SM. It is now arranged in 32 banks that are 32 bits wide, which means that working with 64-bit quantities in shared memory no longer causes warp serialization (NVIDIA Programming Guide, section G.4.3.2). Previous-generation GPUs required a workaround to avoid this serialization: 64-bit data had to be split into two 32-bit values, stored separately in shared memory, and then merged back into the original form on retrieval. This is no longer necessary on Fermi. So if you used padding in legacy code to avoid shared memory bank conflicts on GT200/Tesla C1060:



__shared__ float tile[16][17];

then be sure to change both tile size and padding to warp size for Fermi:



__shared__ float tile[32][33];

However, the majority of 128-bit memory accesses will still cause a two-way bank conflict in shared memory.

Threads can communicate via shared memory without calling __syncthreads() if they all belong to the same warp, e.g. inside if (tid < 32) { ... }.

On Tesla C1060, a simple declaration was sufficient:


__shared__ int cross[32];

On the C2050 (Fermi), be sure to place volatile in front of the shared memory declaration if it is used for communication between threads in a thread block without explicit synchronization.



volatile __shared__ int cross[32];

The reason is:

  • C1060 (GT200) could access shared memory directly as operand.
  • C2050 (Fermi) uses load/store architecture into registers.

The volatile keyword removes the risk that the compiler silently caches a previously loaded shared memory value in a register and fails to reload it on the next reference.
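
For example, a warp-synchronous reduction relies on volatile so that each read of shared memory actually happens instead of being satisfied from a stale register. The following is a minimal sketch (not code from this series), assuming the block reduction has already narrowed the work down to a single warp operating on a shared array sdata of at least 64 elements:

// Final warp-synchronous stage of a block reduction. The volatile
// qualifier forces every access to sdata[] to go through shared
// memory, which matters on Fermi's load/store architecture.
__device__ void warpReduce(volatile float *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}

// Called from the reduction kernel once only one warp is still active:
//   if (tid < 32) warpReduce(sdata, tid);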

Shared memory bandwidth is a challenge, as noted in Vasily Volkov's GTC talk, "Better Performance at Lower Occupancy." Slide 47 argues that the CUDA Programming Guide is incorrect on this point: accessing shared memory is not as fast as accessing a register, even in the absence of a bank conflict. Further, the slide notes that the gap between shared memory and register bandwidth has increased with Fermi. Specifically:

  • Shared memory bandwidth was more than 3x lower than register bandwidth on pre-Fermi GPUs.
  • Shared memory bandwidth is more than 6x lower than register bandwidth on Fermi.

This presentation also notes on slide 50 that the gap between shared memory and arithmetic throughput has increased with Fermi. The data on slide 50 shows:

  • G80-GT200: 16 banks vs. 8 thread processors (2:1)
  • GF100: 32 banks vs. 32 thread processors (1:1)
  • GF104: 32 banks vs. 48 thread processors (2:3)

For this reason, it is recommended that registers be used whenever possible even though the number of available registers has decreased on Fermi GPUs as noted below.

Register and local memory

Each Fermi SM provides 32K 32-bit registers, which allows up to 63 registers per thread, or 21 registers per thread when the SM is fully populated with threads (according to David Kirk's August 2010 "CUDA and Fermi Update"). This can be confusing because Fermi's 32K registers appear to be simply double the 16K registers available on a GT200 SM; the reason is that a Fermi streaming multiprocessor is not the same as a G80 or GT200 streaming multiprocessor, as discussed next. In a nutshell:

  • The maximum number of possible registers is 63 because that is all the bits available for indexing into the register store.
  • If the SM is running its maximum of 1,536 threads, then only 21 registers can be used per thread because the register file is shared among them (32,768 registers / 1,536 threads ≈ 21 registers per thread).
  • This limit degrades gracefully from 63 to 21 as the workload (and hence the resource requirement) grows with the number of threads.
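
One practical consequence is that the compiler benefits from knowing the occupancy you intend. The __launch_bounds__ qualifier expresses that intent; the numbers in the sketch below are illustrative assumptions, not recommendations:

// Tell the compiler this kernel launches with at most 256 threads per
// block and should allow at least 6 resident blocks per SM
// (6 * 256 = 1,536 threads), so register use per thread is limited to
// roughly 32,768 / 1,536, or about 21 registers.
__global__ void
__launch_bounds__(256, 6)
scale(float *out, const float *in, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];
}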

The good news is that register spills go to local memory that is cached in L1, which can significantly improve the performance of applications that require local memory. In addition, the stack resides in up to 1KB of the L1 cache.

Fermi dual-issue streaming multiprocessors

Each Fermi streaming multiprocessor (SM) has its own:

  • Control unit that decodes and issues instructions and schedules threads.
  • Pipelines for integer, fp32, fp64, special function unit (SFU), DirectX 11 instructions.
  • Registers and shared memory plus the L1 and constant caches.
  • Texture units that utilize L1 cache.

BeHardware has a nice summary of how the SIMD units work together:

  • Each multiprocessor has a dual scheduler running at the low frequency and four execution blocks running at the high frequency:
    • Two 16-way SIMD units (the 32 "cores"): 32 FMA FP32s, 32 ADD INT32s, 16 MUL INT32s, 16 FMA FP64s.
    • A quadruple SFU unit: 4 FP32 special functions or 16 interpolations
    • A 16-way 32-bit Load/Store unit.
  • The two 16-way SIMD units are distinct and each works on a different instruction from a different warp.
    • The first unit can execute 16 FMA FP32s while the second concurrently processes 16 ADD INT32s, which appears to the scheduler as if they executed in one cycle.
    • The quadruple SFU unit is decoupled, so once it is engaged the scheduler can keep sending instructions to the two SIMD units, which means the SFUs and SIMDs can work concurrently. This can be a big win for applications that use transcendental functions.
    • It is not clear how dual-issue scheduling works with double-precision as 64-bit instructions use all the register resources. Thus we don't know if a single SIMD handles all double-precision operations or if two SIMDs run together at half speed.

Issuing instructions

Streaming multiprocessor instructions are issued in-order per warp. A warp consists of 32 consecutive threads that are analogous to a 32-thread vector. Maximum performance occurs when all 32 threads execute the same instruction.

Hardware has been added that handles branching and predication, which means that each thread can execute its own code path. The Fermi whitepaper notes, "Predication enables short conditional code segments to execute efficiently with no branch instruction overhead." There is no performance reduction if all threads in a warp branch together, and threads in different warps can take different code paths without reducing performance. As with previous architectures, there is a performance reduction when paths diverge within a warp because every code path taken within the warp must be executed.
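
The difference is easy to see in code. In the hypothetical sketch below, the first kernel branches on the low bit of threadIdx.x so both paths are executed within every warp, while the second branches on a warp-uniform condition and pays no divergence penalty:

// Divergent: odd and even lanes of every warp take different paths,
// so the warp executes both paths (with inactive lanes masked off).
__global__ void divergent(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = (threadIdx.x & 1) ? d[i] * 2.0f : d[i] + 1.0f;
}

// Warp-uniform: all 32 threads of a warp evaluate the condition the
// same way, so each warp follows a single path at full speed.
__global__ void uniform(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = ((threadIdx.x / 32) & 1) ? d[i] * 2.0f : d[i] + 1.0f;
}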

Registers and warp scheduling

Registers and other thread state are partitioned among threads. Unlike a CPU core, which can quickly switch between two hyper-threads, a GPU multiprocessor may switch between dozens of warps containing many threads. However, a thread will block when an instruction argument is not yet available.

Understanding register usage and when a thread will block leads to a study of ILP (Instruction Level Parallelism) as opposed to TLP (Thread Level Parallelism). Most CUDA programmers focus on achieving the highest possible occupancy to exploit TLP. The idea is to give the scheduler a large number of threads from which to choose, so it can hide latency and schedule work to best utilize all available resources.

In contrast, ILP can be used to hide latency and achieve high efficiency at low occupancy with fewer threads. The big advantage is more registers per thread. An excellent presentation by Vasily Volkov, which recommends using registers and multiple outputs per thread, discusses instruction-level parallelism and shows that as few as 64 threads can be enough to hide latency and maintain performance. A good discussion on this topic also took place on the NVIDIA developer forum.
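
A common way to expose ILP is to compute several independent outputs per thread so that consecutive instructions do not depend on one another. The following hypothetical sketch (not taken from Volkov's slides) processes four elements per thread with independent temporaries:

// Each thread produces four outputs using independent intermediate
// values, giving the scheduler several in-flight instructions per
// thread (ILP) instead of relying solely on a large thread count (TLP).
__global__ void saxpy4(float *y, const float *x, float a, int n)
{
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 3 < n) {
        float y0 = a * x[i]     + y[i];       // independent
        float y1 = a * x[i + 1] + y[i + 1];   // independent
        float y2 = a * x[i + 2] + y[i + 2];   // independent
        float y3 = a * x[i + 3] + y[i + 3];   // independent
        y[i]     = y0;
        y[i + 1] = y1;
        y[i + 2] = y2;
        y[i + 3] = y3;
    }
}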

The Fermi architecture actually encourages the use of smaller blocks since it can schedule more blocks due to the additional resources per SM. Concurrent kernel execution also means that those smaller blocks don't even need to be from the same kernel (although they do need to be from the same context.) This presents a new way of thinking about problems to achieve both high utilization and performance.

A very nice presentation for the August 2010 TeraGrid conference, "Analysis and Tuning Case Study," notes on slide 71 that the benefits of ILP and smaller thread blocks include:

  • Fewer threads means more registers per thread.
  • More thread blocks per multiprocessor.
    • For example, 512-thread blocks allow only two blocks per multiprocessor on GT200 (1,024-thread limit) and three on Fermi (1,536-thread limit).
  • An advantage is that doubling the work does not necessarily double register requirement.
    • Fewer registers per thread block implies more thread blocks (e.g. TLP).
  • Even if the maximum thread blocks are used, registers are faster than shared memory so use them.

CPU developers will recognize ILP as a form of superscalar execution. Anandtech has a nice discussion on how the GF104 has gone superscalar where they note, "superscalar execution is a method of extracting Instruction Level Parallelism from a thread. If the next instruction in a thread is not dependent on the previous instruction, it can be issued to an execution unit for completion at the same time as the instruction preceding it." This article also provides a good explanation of differences between GF100 and GF104 products.

The challenge for CUDA programmers regardless of their use of ILP or TLP lies in understanding how registers and other resources are allocated on the SM. There isn't much information available to help. One of the best current sources is Coon et al. 2008. (Tracking register usage during multithreaded processing using a scoreboard having separate memory regions and storing sequential register size indicators, U.S. Patent No. 7434032.)

Fermi has taken many steps in the right direction including spilling registers to high-speed L1 cache memory and unifying the memory spaces within the GPU. However, the compiler and PTX assembler will likely add unexpected resource consumption. (PTX-compliant binaries act as a hardware-neutral distribution format for GPU computing applications and middleware. When applications are installed on a target machine, the GPU driver translates the PTX binaries into the low-level machine instructions that are directly executed by the hardware. Depending on architecture, this JIT compilation can further increase register use.) Experimentation is still the best measure.

The GigaThread engine, concurrent kernels, and host/GPU data transfers

In a nutshell, the Fermi GigaThread engine provides concurrent kernel execution as well as performs simultaneous bidirectional data transfers (reading and writing data at the same time) over the PCIe bus.

Succinctly, commands are read by the GPU via the host interface. The GigaThread engine then creates and dispatches thread blocks to the various SMs. As previously discussed, individual SMs then schedule warps and distribute work among the ALUs (Arithmetic Logic Units) and other execution units on the streaming multiprocessor.

Thread blocks from the first kernel in the execution queue are launched first. If additional resources are available, thread blocks from a kernel in another stream (but the same program context) are launched as well. In the ideal case, two kernels can run concurrently on the same GPU when the sum of their resource requirements is less than or equal to the resources available on the GPU. However, few CUDA developers size their kernel execution configurations to use only half the GPU unless they purposely wish to exploit concurrent kernel execution. A more common scenario occurs when streaming multiprocessors gradually free up as a kernel completes, say because its load is unbalanced. In this case, the Fermi concurrent kernel capability allows some thread blocks from the next queued kernel to "get a head start" by running on the free SMs, which provides better GPU utilization and can reduce time to solution.
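
Concurrent kernel execution requires only that the kernels be launched into different non-default streams within the same context. A minimal sketch follows; kernelA, kernelB, and their arguments are placeholders:

// Launch two independent kernels into separate streams so that the
// GigaThread engine is free to overlap them on idle SMs.
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

kernelA<<<gridA, blockA, 0, s0>>>(dataA);   // runs in stream s0
kernelB<<<gridB, blockB, 0, s1>>>(dataB);   // may overlap in stream s1

cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);
cudaStreamDestroy(s0);
cudaStreamDestroy(s1);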

The GigaThread engine can manage 1,536 simultaneously active threads for each streaming multiprocessor across 16 kernels. Switching from one application to another is about 20 times faster on Fermi than on previous-generation GPUs. This is quick enough to maintain high utilization on a Fermi GPU even when running multiple applications, like a mix of compute code and graphics code. Please see Parts 15, 17, and 18 of this article series for examples of how to do this. Efficient multitasking is important for consumers running video games with physics-based effects, and for scientists who need to steer and view computationally intensive simulations.

Table 3 shows some frame rates for the primitive restart rendering of the virtual terrain example from Part 18.

Table 3

In addition to managing application context switching, the GigaThread engine also provides a pair of streaming data-transfer engines, each of which can fully saturate Fermi's PCI Express host interface. Because the PCIe bus is bidirectional, using both engines at once doubles the aggregate transfer bandwidth. Typically, one engine will be used to move data from host to GPU memory when setting up a GPU computation, while the other will be used to move results from GPU memory back to the host.
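
Overlapping the two transfer directions requires pinned (page-locked) host memory and asynchronous copies issued into separate streams. A brief sketch under those assumptions; the device buffers dIn and dOut are placeholders allocated elsewhere with cudaMalloc:

// Pinned host buffers are required for cudaMemcpyAsync to overlap.
float *hIn, *hOut;
cudaMallocHost(&hIn,  bytes);
cudaMallocHost(&hOut, bytes);

cudaStream_t up, down;
cudaStreamCreate(&up);
cudaStreamCreate(&down);

// One copy engine moves new input to the GPU while the other
// simultaneously returns earlier results to the host.
cudaMemcpyAsync(dIn,  hIn,  bytes, cudaMemcpyHostToDevice, up);
cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, down);

cudaStreamSynchronize(up);
cudaStreamSynchronize(down);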

IEEE 754-2008 arithmetic and atomic operations

Fermi improves the speed and accuracy of double precision calculations. Fully functional GF100 products provide eight times the peak double precision floating point performance over the GT200. Single precision operations have been accelerated and each SM has four Special Function Units (SFUs) that are dedicated to the fast execution of transcendental functions like sine, cosine, reciprocal, and square root.

Scientific and graphics applications benefit from the improved floating-point accuracy, especially when handling very small numbers. In particular, Fermi provides better handling of subnormals, which are small numbers that lie between 0 and the smallest normalized number the floating-point system supports. Previous architectures rounded subnormals to zero, introducing a loss of precision that manifested itself as artifacts or a loss of detail in graphics applications and a loss of precision in scientific applications. Preserving accuracy benefits graphics applications because it can mean the difference between a protrusion on a demon's face looking like a pimple or a tiny menacing spike. Improved accuracy can also prevent introducing non-physical artifacts into scientific calculations. For a general overview, I suggest viewing my Scientific Computing column, "Numerical precision: how much is enough?"

The default settings for computation on GPU are now more conservative to support HPC:

  • Denormal support.
  • IEEE-conformant division and square root.

If your application runs faster on Fermi with -arch=sm_13 than with -arch=sm_20, then the PTX JIT has used the "old" Tesla C1060 settings, which favor speed:

  • Flush-to-zero instead of denormal support.
  • No IEEE-precise division and no IEEE-precise square root.

For similar results with -arch=sm_20, use: -ftz=true -prec-div=false -prec-sqrt=false

Additional numerical speed and accuracy information is contained in the CUDA Programming Guide (v3.2 dated 10/22/2010) sections 5.4.1 and G.2

Fermi now utilizes a fused multiply-add (FMA) capability instead of the multiply-add (MAD) instruction used in previous GPU generations. FMA is more precise because it retains full precision in the intermediate result and rounds only once at the end. For CUDA programmers, this means that numerical results will change (for the better) on Fermi architectures. The inclusion of FMA instructions also accelerates certain functions such as division and square root.
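
If you want to compare against the older behavior, or control rounding explicitly, CUDA provides intrinsics for this. The kernel below is a small illustrative sketch (the kernel and variable names are made up):

__global__ void fma_demo(float *r, float a, float b, float c)
{
    // On Fermi the compiler may contract a*b + c into a single FMA,
    // which keeps the full-precision product and rounds only once.
    r[0] = a * b + c;

    // Explicitly request a fused multiply-add, round-to-nearest.
    r[1] = __fmaf_rn(a, b, c);

    // Separate multiply and add, each individually rounded, which
    // approximates the older unfused behavior (no contraction).
    r[2] = __fadd_rn(__fmul_rn(a, b), c);
}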

Atomic operations are nearly an order of magnitude faster on Fermi because of the L2 cache. If a memory address is resident in the L2 cache, the data at that location can be read, modified, and updated in L2 without a round-trip to and from global memory. For more information on how to use atomic operations, consult the NVIDIA Programming Guide or the Wu-chun Feng and Shucai Xiao paper, "To GPU Synchronize or Not GPU Synchronize?"
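
As a simple illustration of the kind of code that benefits, the hypothetical histogram kernel below hammers a small set of global memory counters with atomicAdd, which is exactly the access pattern that Fermi's L2-resident atomics accelerate:

// Each thread classifies one input byte and atomically increments the
// corresponding global counter; contended updates can be resolved in
// Fermi's L2 cache rather than requiring round-trips to DRAM.
// bins must have 256 entries, one per possible byte value.
__global__ void histogram(unsigned int *bins,
                          const unsigned char *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}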

Summary

The Fermi architecture provides both improved processing power and a more general computational framework that includes a real cache hierarchy, support for ECC memory for increased reliability, and greatly improved double-precision performance and accuracy.

As they say, "the proof is in the pudding." The fact that the world's largest production environments in animation, supercomputing, and the oil and gas industries are now running large numbers of CUDA-enabled GPGPUs demonstrates the validity of this computational model and hardware. They certainly appear tasty.

With more than 250 million CUDA-enabled GPGPUs already sold and many organizations considering GPGPU technology, knowledgeable GPGPU developers are in a wonderful position. The advent of the Fermi architecture means that GPGPU development has clearly not stagnated, and it opens up even more opportunities for the cognoscenti!

It is also worth noting that CUDA is maturing and that pathways are being created to run CUDA on multicore processors and to translate it to OpenCL (see the References below). Figure 2, from slide 74 of the Oxford briefing, summarizes these pathways:

Figure 2: source NVIDIA slide 74 (http://people.maths.ox.ac.uk/gilesm/cuda/cudaconf_oxford.pdf)

References

MCUDA: http://impact.crhc.illinois.edu/mcuda.php

Ocelot: http://code.google.com/p/gpuocelot/

Swan: http://www.multiscalelab.org/swan

