The CUDA Execution Model
To potentially increase performance, each hardware multiprocessor has the ability to actively process multiple blocks at one time. How many depends on the number of registers per thread and how much shared memory per block is required by a given kernel. The blocks that are processed by one multiprocessor at one time are referred to as active. Kernels with minimal resource requirements can better utilize (or occupy) each multiprocessor because the registers and shared memory of the multiprocessor are split among all the threads of the active blocks. Use the CUDA occupancy calculator to explore the trade-offs between number of threads and active blocks versus the number of registers and amount of shared memory. Finding the right combination can greatly increase the performance of your kernels. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch. (See the Part 3 discussion on
cudaGetLastError() to find out how to catch these failures.)
Each active block is split into SIMD ("Single Instruction Multiple Data") groups of threads called "warps". Each warp contains the same number of threads, called the "warp size", which are executed by the multiprocessor in a SIMD fashion. This means each thread within a warp is broadcast the same instruction from the instruction store, which directs the thread to perform some operation or manipulation of local and/or global memory. The SIMD model is efficient and cost effective from a hardware standpoint, but from a software standpoint it unfortunately serializes conditional operations (e.g., both branches of the conditional must be evaluated). Be aware that conditional operations can have profound effects on the runtime of your kernels. With care this is generally a manageable problem but it can be problematic for some problems.
Active warps (that is, all the warps from all active blocks) are time-sliced: The thread scheduler periodically switches from one warp to another to maximize the use of the multiprocessor's computational resources. The order of execution of the warps within a block and of blocks themselves is undefined, which means they can occur in any order. However, threads can be synchronized with
__syncthreads(). Be aware that only after the execution of
__syncthreads() are writes to shared (and global) memory guaranteed to be visible. Unless the variable is declared as volatile, the compiler is free to optimize (that is, reorder or eliminate) memory reads and writes to increase performance. The
__syncthreads() call is allowed inside the scope of a conditional, but only if the conditional evaluates identically across the entire thread block. If not, the code execution is likely to hang or produce unintended side effects. Happily,
__syncthreads() has low overhead as it only takes four (4) clock cycles to issue for a warp so long as no other thread has to wait for any other thread. A half-warp is either the first or second half of a warp, which is an important concept for memory accesses including coalescing memory accesses as discussed later in this article.
There are several take away messages from the previous discussion:
- Multiprocessor resources such as shared memory are limited and valuable.
- Effectively managing limited multiprocessor resources, such as shared memory, for a range of CUDA-enabled device configurations is a fact of life for CUDA developers.
- Be aware that conditional operations (such as
ifstatements) can have a profound effect on the runtime of your kernels.
- The CUDA occupancy calculator and nvcc compiler are important tools to learn and use -- especially when exploring execution configurations.