

CUDA vs. Phi: Phi Programming for CUDA Developers

Key Distinctions for CUDA Programmers

Both CUDA and Phi coprocessors provide high degrees of parallelism that can deliver high application performance. For the most part, CUDA programmers with existing application code have already written their software so it can run well on Phi coprocessors. However, additional work may be required to achieve the highest possible performance.

Differences between MIMD and SIMT parallel thread execution will cause some applications to run faster (potentially much faster) on one hardware platform or the other. Those interested in portability will look to exploit the commonality between Phi coprocessor and CUDA threads. Others may find that differences between SIMT and MIMD performance characteristics and parallel runtime capabilities justify the selection of one device over another, or elicit a "wait and see" attitude on porting efforts. Many in the HPC world will leverage the different capabilities of MIMD and SIMT threads to make the fullest use of all their hardware, be it CUDA-enabled GPUs or Phi coprocessors.

Exploiting MIMD Execution

Many CUDA applications do not utilize the MIMD capability of the SIMT execution model (with the minor exception of conditionals within a warp discussed next).
There are three ways to initiate MIMD execution on GPUs:

  • Through the use of conditional operations via an uber-kernel.
  • Via the new CUDA 5 Hyper-Q capability.
  • Through the use of asynchronous streams.

Example 5 demonstrates how to call separate tasks using conditionals in a CUDA uber-kernel:

__device__ void foo() { ... }
__device__ void bar() { ... }

__global__ void mimd() {
      if (blockIdx.x == 0) foo();
      if (blockIdx.x == 1) bar();
}

int main() {
      mimd<<< 2, 512 >>>();   // Corresponds to 512-wide Phi SIMD
}

Example 5: MIMD execution with conditional (courtesy NVIDIA).

Load balancing can be a challenge in the uber-kernel approach. In the previous code snippet, the kernel will not return until bar() completes, so even if foo() finishes early, the resources it used cannot be given to other work. This can be inefficient, especially when bar() takes much longer to run than foo().

CUDA 5 introduced Hyper-Q, which is a feature that helps programmers saturate a GPU with multiple different kernel launches. Hyper-Q is useful when the individual kernels are too small to saturate the GPU, or when work needs to be initiated at different times from the CPU process. This avoids the load-balancing issues of the uber-kernel approach that requires all work be batched up front. However, Hyper-Q is very new; it will be interesting to see how quickly it is adopted and how much it benefits CUDA applications in practice.

The use of asynchronous streams also solves the load-balancing problem, as foo() and bar() can run either one after the other or concurrently depending on resource utilization. In other words, if foo() finishes early, then something else can be launched to fill its place while bar() is still running. Without streams or Hyper-Q, this cannot happen. To utilize streams, the CUDA example becomes what is shown in Example 6.

__global__ void foo() { ... } // Note: now __global__
__global__ void bar() { ... }

int main() {
      cudaStream_t s1, s2;
      cudaStreamCreate(&s1);
      cudaStreamCreate(&s2);

      foo<<< 1, 512, 0, s1 >>>();
      bar<<< 1, 512, 0, s2 >>>();
}

Example 6: MIMD execution with streams (courtesy NVIDIA).

While using CUDA streams is straightforward, most programmers rely on the default stream and do not set up separate streams or make use of multiple concurrent kernels. This is fine for large uniform workloads that fully utilize the device, but MIMD threads have the potential to utilize the hardware more fully when the workload is irregular or when there are multiple smaller workloads that individually do not occupy all of the SMs. The fact that Phi coprocessor programmers get MIMD threads for free can benefit such workloads.


The MIMD thread capability of the Phi coprocessor eliminates the restrictions of block-oriented GPU computation, as each thread is free to follow its own code path. The GPU SIMT execution model approximates this MIMD ideal by freeing each thread block to follow a separate code path.

The challenge occurs when threads within a thread block, specifically threads within a single warp, need to follow separate paths. While SIMD execution inside the SM can deliver high performance, conditional operations can cause a sequential slowdown by requiring the warp to execute more than one code path. This is referred to as "warp divergence."

Static compiler analysis can eliminate some conditional operations or determine when predication can provide a low-impact evaluation of conditionals at runtime. With predication, the SM evaluates both sides of the branch and keeps one of the results based on the value of the Boolean branch condition. While predication can be effective for small branches, it is not effective when conditionals branch into longer code paths. In extreme cases, conditional operations can force the SM to follow a separate SIMD code path for each thread, causing a slowdown proportional to the number of threads per warp.

Transitioning CUDA applications to the Phi coprocessor can reduce or eliminate the effects of conditional operations on the SIMD blocks because of the use of individual MIMD threads. For this reason, applications that exhibit high warp divergence on GPUs may run faster on Phi coprocessors. However, conditional operations also affect the efficiency of the vector unit. Vector hardware supports conditional operations through a mask that defines which elements are active. Zeros in the vector mask represent lost cycles.

Vector Parallel versus TLP

A key challenge for CUDA programmers looking to run on Phi coprocessors lies in reconciling the difference between the 60-240 concurrent vector/MIMD threads supported by the preproduction Phi coprocessors and the thousands to millions of threads they are strongly encouraged to use when writing a CUDA program. Succinctly, the degree of parallelism is comparable between the two architectures: floating-point SIMD threads translate well to vector operations, and both hardware platforms can provide high degrees of MIMD parallelism.

CUDA programmers are taught to express as much parallelism as possible in their programs through the use of thousands to millions of concurrent threads. The CUDA argument is twofold: the more parallelism a developer expresses today, the better the application will run on ever more parallel hardware in the future; and the more active thread blocks an application provides to the device, the better the chance that every SM will be fully utilized so the device can deliver high performance. The second point is referred to as thread-level parallelism, or TLP.

Based on the x86 architecture, the Phi coprocessor provides an unprecedented level of parallelism in a vector multicore architecture by providing up to 60 hyperthreaded cores, where each core can run four MIMD threads concurrently. In combination with the per-core wide vector unit, the preproduction Phi coprocessors can perform thousands of floating-point operations per clock cycle.

SIMD versus Vector Parallel

At the level of a thread block, SIMD execution naturally exploits the parallel operations of the groups of threads within the thread block. CUDA programmers are well aware that they must create this mapping to a grid of thread blocks by hand. Newer programming platforms like OpenACC (which is analogous to OpenMP) utilize pragma annotations and a compiler to map the application source code to a grid of thread blocks.

Most Phi coprocessor programmers rely on directives or Intel's Cilk Plus to help the compiler vectorize their code. It is also possible to program the vector units directly using compiler intrinsic operations or assembly language. Without conditional operations, both SIMD threads and vector operations fully utilize the hardware. As discussed previously, conditional operations can detrimentally affect the runtime of a thread block and the efficiency of a vector unit.

Thread-Level Parallelism (TLP)

CUDA programmers rely on TLP to schedule work on every SM on a GPU. GPU languages such as CUDA and OpenCL enforce the restriction that only threads within a thread block can communicate with each other. The benefit of this restriction is that all the thread blocks can run independently of each other and be assigned arbitrarily by the GPU scheduler. Each SM internally maintains a queue of the active thread blocks it needs to process. Those thread blocks that have all the data dependencies satisfied for their current instruction (remember these are SIMD devices) are marked as ready to run. Each SM can then process ready-to-run thread blocks as computational resources become available, which means the SM has the ability to keep all internal integer, floating-point, and special function units busy.

Programming with large numbers of threads means ensuring the queues inside each SM contain as many thread blocks as possible. (The actual number, known as "occupancy," depends on resource utilization in the SM.) The greater the number of thread blocks queued on each SM, the better the performance is likely to be because of high utilization. The independence of the thread blocks also means the application can scale to arbitrarily large numbers of SMs as long as there are enough thread blocks to keep each SM busy.

Thread Scheduling and Affinity

Thread scheduling on Phi coprocessors is performed by the Linux operating system that runs on the device. In addition to the flexibility of the MIMD thread model and full POSIX threads, Linux also provides programmers access to a vast amount of multithreaded algorithms and software developed for multicore x86 hardware. Existing software projects such as OpenCL; CUDA-x86; Ocelot; the Thrust TBB and OpenMP back ends; plus many others demonstrate that the CUDA TLP execution model will translate easily, with high performance and excellent scalability, to the Phi coprocessor runtime environment.

Thread affinity and NUMA support play an important role in performance because they affect the efficiency of the Phi coprocessor caches and the dependence on memory bandwidth. While the 200 GB/s bandwidth of the preproduction Phi coprocessor memory subsystem is impressive, it is still a bottleneck for a teraflop/s chip. Care should be exercised because memory layouts designed for coalesced GPU accesses can cause false sharing across cores on Phi coprocessors (with 16 32-bit values per cache line and 4 threads per core, some false sharing is likely).
Three preset schemes are available for tuning thread affinity: compact, scatter, and balanced:

  • Compact tries to use the minimum number of cores by pinning four threads to a core before filling the next core.
  • Scatter tries to distribute threads evenly across all cores.
  • Balanced tries to scatter threads equally across all cores such that adjacent threads (sequential thread numbers) are pinned to the same core. One caveat: "all cores" here means the total number of cores minus one, because one core is reserved for the operating system during an offload.

Interested readers can find more about the affinity schemes in "Best Known Methods for Using OpenMP on Intel Many Integrated Core Architecture" when it becomes available on the public Intel site.
Note: Because the Phi coprocessor runs Linux, operating-system jitter due to system daemons and multiple user processes can introduce performance variations, especially for tightly coupled parallel calculations such as reductions. An excellent starting paper to understand the impact of jitter is "The Case of the Missing Supercomputer Performance."

Large Pages

Large pages will be of special interest to CUDA programmers looking to use Phi coprocessors. Briefly, the use of the larger 2MB pages can benefit many algorithms — even such common tasks as matrix multiplication. In virtual memory architectures like those of GPUs and x86 processors, a TLB (Translation Lookaside Buffer) provides an on-chip cache to improve the speed of virtual address translation. When a page entry is in the TLB, application addresses can be translated to physical RAM addresses with minimal overhead and no additional RAM accesses. While TLB caches are fast, they are also quite small, and the performance penalty incurred by a TLB miss is significant.

The importance of larger pages for floating-point dominated applications can be understood by considering any array operation — however trivial — that requires stepping through memory in strides that are greater than the standard page size used by the system. These are common scenarios that frequently occur when working with two- and higher-dimensional matrices. Because of the stride size, each memory access requires looking up a new page in the TLB. If the array is sufficiently large, then each memory access can cause a TLB miss and a corresponding performance drop. Utilizing larger page sizes can result in fewer TLB misses, which increases application performance because the processor does not have to wait (or wait as long) for data.

Phi coprocessors have the option to utilize 2 MB pages, which can potentially speed application performance by reducing or eliminating TLB misses. Regardless, memory bandwidth limitations and TLB misses affect both GPUs and Phi coprocessors.

Thread Communications and Atomic Operations

Most CUDA developers have already designed their applications so they do not communicate outside of a thread block. In general, these applications should run well and with high performance on Phi coprocessors. However, some applications might have to solve multiple small problems where each problem is too small to fully utilize the GPU. In these cases, CUDA programmers should look to the potential speedup Phi coprocessors can deliver through the use of MIMD threads. Full hardware utilization might be achieved by exploiting the asynchronous message passing and concurrent thread execution of the POSIX thread model. (This might also indicate that performance gains can be achieved on the GPU with concurrent kernels.)

Some CUDA applications require that thread blocks communicate. The workaround in CUDA is to pass data via global memory and use atomic operations (instructions that are guaranteed to complete a change to a memory location in one operation) as the mechanism for passing data or for synchronization. From a scalability point of view, atomic operations are a last-resort solution for any parallel algorithm because they can serialize, or linearize, execution, causing a factor-of-N slowdown on devices containing N processing elements. Still, it is impossible to implement some algorithms without atomic operations. For these algorithms, lock-free and wait-free data structures utilized within concurrent POSIX threads might provide a higher-performance algorithmic solution.

Performance Is the Reason for These Devices

The following example demonstrates that preproduction Phi coprocessors can deliver a teraflop/s of performance with the MKL in offload (thunking) and native modes, as shown by the graph in Figure 3. The matrix.c code also provides an OpenMP implementation for a baseline reference. A more detailed discussion of this source code can be found in "Programming Intel's Xeon Phi: A Jumpstart Introduction," my first article in this series. The MKL library can run in three modes: natively on the coprocessor, in offload mode, and on the host processor. The MKL runtime reported by doCheck() provides the ability to observe how fast highly optimized code can run on the Phi coprocessor.
