CUDA vs. Phi: Phi Programming for CUDA Developers


In my previous article on the new Intel Xeon Phi coprocessors, I focused on getting up and running as quickly as possible. This article discusses how programming the Phi compares with CUDA programming.

Intel designed the 60-core Phi coprocessor (previously called "MIC" in the literature) so it can be programmed like a conventional x86 processor core while incorporating extensions such as a bidirectional ring interconnect for massive parallelism and a wide 512-bit vector unit per core to deliver high floating-point performance. While CUDA applications can run on x86 hardware, it's important to know how architectural differences between GPUs and Intel Xeon Phi coprocessors affect performance and application design. The good news is that CUDA applications will readily map onto the Phi coprocessor's vector-parallel architecture and run with high performance. The challenge lies in achieving the best possible performance.

The preproduction Intel Xeon Phi coprocessors, based on the first-generation product codenamed "Knights Corner," currently provide more than a teraflop/s of floating-point performance using pragmas that augment existing code so that work is offloaded from the host processor to the Phi. This was discussed in my previous article. Other means of integrating operations between the host and the Phi include: recompiling the source code to run directly on the coprocessor as a separate native SMP (Symmetric Multi-Processor) many-core Linux computer; accessing the coprocessor as an accelerator through optimized libraries such as the Intel Math Kernel Library (MKL); and using each coprocessor as an MPI node or, alternatively, as a device containing a cluster of MPI nodes.
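As a concrete illustration of the pragma-based offload approach, the minimal sketch below marks a function and loop for execution on the coprocessor. The function scale(), the array a, and the size N are hypothetical; the pragma syntax follows the Intel compiler offload extensions that also appear in Example 3 later in this article.

#include <omp.h>

#define N 1000000

// Hypothetical example: the inout clause copies the array a to the
// coprocessor, the loop runs there in parallel, and the results are
// copied back across the PCIe bus when the offload region completes.
void scale(float *a, float factor)
{
  #pragma offload target(mic) inout(a : length(N))
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
    a[i] *= factor;
}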

From the list of usage models above, experienced programmers will recognize that the Phi supports the full gamut of modern and legacy programming models — including CUDA. The advice from Intel's James Reinders is, "Program with lots of threads that use vectors with your preferred programming languages and parallelism mode" (see "Programming for the Intel Xeon family of products"). Figure 1 illustrates the performance potential of a Phi coprocessor as one uses the combined parallel and vector capability of the device.


Figure 1: Intel Xeon Phi coprocessor vector parallelism vs. performance (courtesy Intel).

The Phi's offload mode reflects how CUDA programmers (and GPU developers using OpenACC) currently utilize their GPU devices. In offload mode, data must move across the PCIe bus to the external device, just as when programming a GPU (see Figure 2). Similarly, the optimized MKL library for the Intel Xeon Phi coprocessor provides both a native interface (where data resides on the device) and a thunking interface (where data is transferred across the PCIe bus for every operation), just like CUBLAS and CUFFT. A working SGEMM example in this article demonstrates that a preproduction Intel Xeon Phi coprocessor can deliver more than a teraflop/s using both the thunking and native interfaces.


Figure 2: Intel Xeon Phi coprocessor PCIe connectivity (courtesy Intel).
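For readers unfamiliar with the thunking style mentioned above, the sketch below shows what a plain host-side SGEMM call looks like. The cblas_sgemm, mkl_malloc, and mkl_free calls are standard MKL; the note that setting MKL_MIC_ENABLE=1 in the environment lets MKL transparently offload part of the work to the coprocessor refers to MKL's Automatic Offload feature and should be checked against the MKL documentation for your release. The matrix size n is an arbitrary illustrative value.

#include <stdio.h>
#include <mkl.h>

// Sketch: a host-side SGEMM call. With MKL Automatic Offload enabled
// (e.g., MKL_MIC_ENABLE=1 in the environment), MKL can split the work
// between the host and the Phi coprocessor, moving the matrices across
// the PCIe bus much like a thunking CUBLAS interface would.
int main()
{
  const int n = 4096;   // illustrative matrix dimension
  float *A = (float*)mkl_malloc(sizeof(float)*n*n, 64);
  float *B = (float*)mkl_malloc(sizeof(float)*n*n, 64);
  float *C = (float*)mkl_malloc(sizeof(float)*n*n, 64);
  for (int i = 0; i < n*n; i++) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

  // C = 1.0*A*B + 0.0*C
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);

  printf("C[0] = %f\n", C[0]);
  mkl_free(A); mkl_free(B); mkl_free(C);
  return 0;
}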

Both Intel Xeon Phi coprocessors and GPU devices can accelerate MPI (Message Passing Interface) applications in offload mode, where portions of the application are offloaded to the remote device. However, CUDA programmers need to remember that they can also run MPI and OpenMP code natively on the Phi coprocessor. In fact, each Phi runs Linux internally. While not the focus of this article, MPI programmers might find that they can simply recompile existing applications to run on the Phi without any porting effort. Running MPI natively, each Phi can act as a separate SMP node in a distributed MPI application. Alternatively, the large number of cores (currently 61 in the preproduction devices, of which 60 are available for computation) also means that each coprocessor can act as a device containing a cluster of MPI nodes.
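A minimal MPI program that could be recompiled to run natively on the coprocessor is sketched below. The cross-compile command in the comment is a representative example of Intel MPI usage at the time, not an exact recipe for every installation.

#include <stdio.h>
#include <mpi.h>

// Minimal MPI program that can be cross-compiled for native execution on
// the coprocessor (e.g., mpiicc -mmic hello_mpi.c, as an illustrative
// command) so that each Phi acts as an ordinary SMP node in a
// distributed MPI application.
int main(int argc, char **argv)
{
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("Hello from MPI rank %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}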

The analysis in my article, Intel's 50+ core MIC architecture, indicates that memory capacity will likely be the main limitation for the current generation of Intel Xeon Phi coprocessors — especially for developers who wish to run SMP or MPI applications natively on the device. That article also predicts that offload programming will be the most popular mode, because the large memory capacity of the host can augment the limited per-device memory, and state-of-the-art host processors can accelerate serial sections of code.

Software and Tools to Port CUDA Code to Phi Coprocessors

Even though they are in the final stages of preproduction, Phi coprocessors already enjoy an ecosystem of compilers, profilers, and debuggers as a result of their compatibility with existing x86 hardware.

To run on Intel Xeon Phi coprocessors, CUDA kernels need to be modified. At the moment, this must be done by hand. While it is technically possible to run CUDA on Phi coprocessors, products such as CUDA-x86 do not currently generate code for these devices. An OpenCL compiler for the Intel Xeon Phi coprocessor is coming, which means that CUDA programmers can consider Wu Feng's CU2CL CUDA-to-OpenCL source translator to port their code. In the future, an LLVM translation project might be able to create executable code for the Phi.

Table 1 provides a brief summary of the similarities and differences between CUDA and Intel Xeon Phi coprocessor programming modes and possible approaches (aside from hand translation) for porting CUDA.

Languages such as C/C++/Fortran, etc.
  Intel Xeon Phi coprocessor: Both native and offload modes, but requires the use of a threading model such as Pthreads or OpenMP.
  CUDA-enabled devices: Only through the offload programming mode. Many languages can be accelerated only by calling CUDA, OpenCL, or library methods.

CUDA, OpenCL acceleration
  Intel Xeon Phi coprocessor: Offload mode. OpenCL compiler support is coming. Technically possible for CUDA, but products such as CUDA-x86 do not currently generate code for the Intel Xeon Phi coprocessor. Alternate possible paths include (1) the CU2CL CUDA-to-OpenCL source translator, (2) LLVM translation, and (3) manual translation.
  CUDA-enabled devices: On-device as an offload accelerator.

Directive-based programming
  Intel Xeon Phi coprocessor: Via OpenMP, natively and in offload mode.
  CUDA-enabled devices: Via OpenACC as an external accelerator.

Programming with libraries
  Intel Xeon Phi coprocessor: Both native and offload mode.
  CUDA-enabled devices: Both on-device and offload.

MPI
  Intel Xeon Phi coprocessor: Both native and offload mode.
  CUDA-enabled devices: Only offload support.

Table 1: Summary of common programming approaches.

Differences Between a CUDA and a Phi Coprocessor Thread

Both CUDA and Intel Xeon Phi coprocessor applications utilize a large number of concurrent threads of execution to exploit hardware parallelism and achieve high performance. It is important to understand the distinctions between CUDA and Intel Xeon Phi coprocessor threads.

From a programming standpoint, both CUDA and Intel Xeon Phi coprocessor threads are regions of serial code that have been identified by the programmer or compiler as candidates for parallel execution. It is up to the runtime of the device to decide how the parallel threads will run.

Each serial region of code contained within a thread is computationally universal, meaning that any computable function can theoretically be implemented within an individual CUDA or Phi coprocessor thread. In other words, neither thread type places any restriction on the type of computation that can be expressed within a single thread; a programmer is free to use whatever code they wish in a thread. This does not imply that CUDA and Phi coprocessor threads provide equivalent capabilities and performance in a parallel environment, as discussed in the next section.

The generic programmability of individual CUDA and Phi coprocessor threads can be seen in Examples 1 and 3, which show single-threaded "Hello World" programs for CUDA and the Phi coprocessor, respectively. To alleviate concerns that the GPU is offloading the printf() to the host processor, a simple function toBinary() has been added that converts an integer to a binary string on the device. These two examples are provided to demonstrate the generic programmability of individual CUDA and Phi coprocessor threads. The real value in GPU and Phi coprocessor programming lies in the performance these devices can deliver when running large numbers of parallel threads.

#include <stdio.h>

// Device function: writes the 32 bits of d (least-significant bit first)
// into buf as a NULL-terminated string of '0' and '1' characters.
__device__ void toBinary(char* buf, int d)
{
  for(int i=0; i < 8*sizeof(int); i++)
    buf[i] = ((d>>i)&0x1)?'1':'0';
  buf[8*sizeof(int)] = 0;
}

// Kernel: each thread prints its global thread ID and the binary form of d.
__global__ void hello(int d)
{
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  const char *s="Hello World";
  char buf[8*sizeof(int)+1];
  toBinary(buf,d);

  printf("%s thread %d, d=%d binary %s\n",s, tid, d, buf);
}

int main()
{
  // Launch a single thread in a single block.
  const int nThreads=1, nBlocks=1;
  hello<<<nBlocks, nThreads>>>(5);
  cudaThreadSynchronize();   // wait for the kernel and flush device printf
  return 0;
}

Example 1: A CUDA "Hello World" example.

  $ nvcc -arch=sm_20 hello.cu -run
  Hello World thread 0, d=5 binary 10100000000000000000000000000000
  

Example 2: Compilation and runtime output of the CUDA "Hello World."

Example 3 is the "Hello World" implemented in OpenMP; it can run natively on a Phi coprocessor or in offload mode. The offload pragmas specify which functions and loops are to be offloaded to the Phi coprocessor.

#include <omp.h>
#include <stdio.h>

// Everything between the push and pop offload_attribute pragmas is also
// compiled for the coprocessor (mic) target so it can run in an offload region.
#pragma offload_attribute (push, target (mic))

// Writes the 32 bits of d (least-significant bit first) into buf as a
// NULL-terminated string of '0' and '1' characters.
void toBinary(char* buf, int d)
{
  for(int i=0; i < 8*sizeof(int); i++)
    buf[i] = ((d>>i)&0x1)?'1':'0';
  buf[8*sizeof(int)] = 0;
}

// Each OpenMP thread prints its thread number and the binary form of d.
void hello(int d)
{
  int tid = omp_get_thread_num();
  const char *s="Hello World";
  char buf[8*sizeof(int)+1];
  toBinary(buf,d);

  printf("%s thread %d, d=%d binary %s\n",s, tid, d, buf);
}
#pragma offload_attribute (pop)

int main()
{
  const int nThreads=1;
  // Offload the parallel loop to the coprocessor.
  #pragma offload target(mic)
  #pragma omp parallel for
  for(int i=0; i < nThreads; i++) hello(5);
  return 0;
}

Example 3: An OpenMP "Hello World" example.

Example 4 shows two commands: the first compiles Example 3 for native execution on the coprocessor, and the second compiles it for offload execution from the host. The output of the a.out binary running in offload mode is also shown.

  $ icc -mmic -openmp -Wno-unknown-pragmas -std=c99 hello.c
  $ icc -openmp -std=c99 hello.c
  $ ./a.out
  Hello World thread 0, d=5 binary 10100000000000000000000000000000

Example 4: Phi coprocessor compilation and offload mode output.

CUDA and Phi Coprocessor Threads Differ When Running in Parallel

While programming a single CUDA or Phi coprocessor thread is similar, programming multiple CUDA and Phi coprocessor threads in parallel can be very different. These differences stem from the fact that CUDA threads must be grouped together into blocks of threads (called "thread blocks" in CUDA and "work-groups" in OpenCL) that execute concurrently on the GPU Streaming Multiprocessors (SMs) according to a SIMD (Single Instruction Multiple Data) model, while Phi coprocessors run generic MIMD (Multiple Instruction Multiple Data) threads individually on the x86 cores.

The freedom MIMD provides in allowing any thread to execute any instruction at any time confers some clear performance advantages for many computational problems. For this reason, GPU devices also support MIMD execution by allowing any block of threads to be scheduled to run on any SM on the GPU, which lets the GPU perform many different instructions in parallel at the same time. This MIMD capability is most efficient when the problem matches the granularity at which the SMs issue work: the 32-thread warp on current GPU devices. GPUs that support this modified form of MIMD capability are referred to as SIMT (Single Instruction Multiple Thread) devices.

From a software point of view:

  • The Phi coprocessor uses MIMD threads that place no restriction on how threads run or communicate. POSIX Pthreads are a popular example of an API that supports this generic capability. It is the programmer's responsibility to ensure that threads do not deadlock, enter race conditions, or limit scalability. OpenMP is a pragma-based approach that simplifies the use of these generic threads.
  • CUDA programmers are required to group threads into blocks. All threads within a block execute simultaneously according to a SIMD execution model. Only threads within a thread block can communicate with each other. MIMD execution is possible at the granularity of a thread block, but it must be manually initiated by the programmer. Blocks can communicate through global memory on the GPU, but this is discouraged as it can cause deadlock, race conditions, and scalability issues. CUDA and OpenCL are the gateway development platforms for programming GPU architectures. OpenACC is a new pragma-based approach that simplifies the use of GPU threads. (The sketch after this list contrasts the two thread organizations.)
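The contrast can be made concrete with a minimal sketch of a hypothetical saxpy-style operation written both ways. The function names, the 256-thread block size, and the launch configuration are illustrative choices, not prescriptions.

// CUDA: each thread handles one element. Threads are grouped into
// thread blocks (256 threads here, an illustrative size) that the
// hardware schedules onto the SMs.
__global__ void saxpy_cuda(int n, float a, const float *x, float *y)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}
// Launch example: saxpy_cuda<<<(n + 255)/256, 256>>>(n, a, d_x, d_y);

// Phi coprocessor: the same index space is expressed as an OpenMP loop;
// the runtime distributes iterations across the MIMD cores and the
// compiler is free to vectorize the loop body for the 512-bit vector unit.
void saxpy_phi(int n, float a, const float *x, float *y)
{
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}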

Degree of Parallelism

Hardware designers can leverage the SIMD restriction that all threads must run the same instruction at the same time to pack a lot of power-efficient computational capability into a small space. Both CPU and GPU processor designs exploit these desirable SIMD characteristics. For example, all modern x86 processors support various SSE (Streaming SIMD Extensions) instructions to accelerate application performance.
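As a small illustration of CPU SIMD, the sketch below uses SSE compiler intrinsics to add four pairs of single-precision values with one vector instruction per 128-bit chunk; the function and variable names are hypothetical.

#include <xmmintrin.h>

// Adds four pairs of floats at once using 128-bit SSE registers.
void add4(const float *a, const float *b, float *c)
{
  __m128 va = _mm_loadu_ps(a);           // load a[0..3]
  __m128 vb = _mm_loadu_ps(b);           // load b[0..3]
  _mm_storeu_ps(c, _mm_add_ps(va, vb));  // c[0..3] = a[0..3] + b[0..3]
}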

The GPU hardware designers ramped this up to an extreme by "betting the company" on SIMD-based streaming multiprocessors. This is the reason that GPU product designs can tout thousands of concurrent threads of execution.

For example, a new NVIDIA Kepler K20 GPU claims to have 2,880 "CUDA cores." In reality, computing the actual number of threads a K20 can run at any particular moment in time is far more complex than this simple marketing number. More precisely, a Kepler SM can concurrently execute up to 64 warps (where a warp is 32 threads), meaning that 2,048 SIMD threads can be actively running (and not queued) at any moment in time. A K20 GPU contains 15 SMs, which means up to 2K * 15 = 30K threads can be queued to run at any given moment on the device. How many are actively running during any given clock tick ranges from none at all to some number around the 2,880 "CUDA core" figure quoted by NVIDIA. (Note that the K20 SM can issue two instructions each from four warps on the same clock cycle, which means the hardware can issue as many as 15 * 4 * 2 = 120 warp instructions per clock. Even with perfect code, some internal pipelines do not have the ability to sustain the parallelism of 3,840 threads for very long. This appears to be the reason that the K20 GPUs are marketed at the lower number of 2,880 CUDA cores.) The actual number depends on the code (meaning what instructions are queued on each SM), what SM resources are available, and how many instructions have their data dependencies satisfied. Similarly, the number of possible MIMD operations that can run at a given moment in time is also complicated.

Simplistically, the number of SMs provides an approximate count of the MIMD instruction streams that can be actively running. However, each K20 SM has the ability to run up to 64 warps, meaning that 15 * 64 or 960 MIMD operations can potentially run at any given moment in time. This is why the K20 GPUs provide an aggregate of 960 program counters per device. Bottom line: GPU hardware is designed to provide a high degree of both SIMD and MIMD parallelism.
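The thread-capacity arithmetic from the last two paragraphs can be collected into one small illustrative calculation. The constants below are the figures used in this discussion for a Kepler-class device, not values queried from hardware.

#include <stdio.h>

int main()
{
  // Figures used in the discussion above for a Kepler-class GPU.
  const int sms              = 15;  // streaming multiprocessors
  const int warps_per_sm     = 64;  // resident warps per SM
  const int threads_per_warp = 32;

  printf("Resident SIMD threads per SM:     %d\n", warps_per_sm * threads_per_warp);       // 2,048
  printf("Resident SIMD threads per device: %d\n", sms * warps_per_sm * threads_per_warp); // 30,720
  printf("Warp-level (MIMD) contexts:       %d\n", sms * warps_per_sm);                    // 960
  return 0;
}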

GPU manufacturers are betting that the application code will utilize a sufficiently large number of threads that the hardware can exploit a high degree of on-device SIMD and MIMD parallelism to deliver high performance. SIMD processing is closely related to vector processing, so this bet was originally predicated on the belief that application developers with existing vector code would be willing to port their software to the then-new high-performance, low-cost GPU hardware.

The Phi coprocessor hardware designers followed a different approach, ramping up their design to "bet the product line" on many-core parallelism combined with a wide per-core vector unit. Vector processor design is well understood, so designers can pack a large amount of floating-point and integer capability into a small space on a chip power-efficiently. Vector processors augment traditional multicore MIMD threads by providing additional operations per clock. As a result, the per-core vector unit essentially multiplies the number of cores by the number of concurrent vector operations, giving Phi coprocessors a very high degree of floating-point and integer parallelism. Further, vector processing is a well-established programming model that was once the de facto standard for parallel programming. As a result, the Phi coprocessor design ramps up performance and lowers the cost of access for users running decades' worth of legacy software written for vector hardware.
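For comparison with the GPU figures above, a similar back-of-the-envelope lane count can be written for the preproduction Knights Corner part discussed in this article (60 compute cores, each with a 512-bit vector unit). Clock rate and fused multiply-add are deliberately left out of this simple illustration.

#include <stdio.h>

int main()
{
  // Illustrative figures for the preproduction Knights Corner coprocessor:
  // 60 usable compute cores, each with a 512-bit wide vector unit.
  const int cores       = 60;
  const int vector_bits = 512;
  const int sp_lanes    = vector_bits / 32;  // 16 single-precision lanes
  const int dp_lanes    = vector_bits / 64;  //  8 double-precision lanes

  printf("Single-precision vector lanes per clock: %d\n", cores * sp_lanes); // 960
  printf("Double-precision vector lanes per clock: %d\n", cores * dp_lanes); // 480
  return 0;
}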

Basically, the Phi coprocessor designers are betting that there is sufficient vector and thread parallelism in the application code to efficiently utilize the hardware and achieve high performance. It does not matter whether the threads require SIMD or MIMD execution, because the Phi coprocessor cores can run either efficiently. However, each Phi coprocessor thread must utilize the vector unit, or else the device will only be able to deliver the performance of its parallel Pentium-class cores. For non-vector code, the fallback is to use the Phi coprocessors as support processors that run conventional x86 applications. In this way, Phi coprocessor customers can leverage all of their hardware investment with minimal software effort.

The good news for CUDA programmers who wish to utilize Phi coprocessors is that CUDA maps very nicely onto vector hardware. When writing CUDA code that uses the GPU SIMD processors efficiently, developers also create an efficient mapping to x86 SSE and the new Phi coprocessor's wide vector instructions. Because the width at which the GPU streaming multiprocessors process a thread block is an integer multiple of the 512-bit wide vector instructions, performance is expected to be excellent on Phi devices. In short, the SIMD operations for a thread block should translate to one or more wide vector operations.

