The preproduction Intel Xeon Phi coprocessors based on the Knights Corner (KNC) chips can provide well over one teraflop of floating-point performance. Developers can reach this supercomputing level of number crunching power via one of several routes:
- Using pragmas to augment existing code so it offloads work from the host processor to the Intel Xeon Phi coprocessor(s)
- Recompiling source code to run directly on the coprocessor as a separate many-core Linux SMP compute node
- Accessing the coprocessor as an accelerator through optimized libraries such as the Intel MKL (Math Kernel Library)
- Using each coprocessor as a node in an MPI cluster or, alternatively, as a device containing a cluster of MPI nodes
From this list, experienced programmers will recognize that the Phi coprocessors support the full gamut of modern and legacy programming models. Most developers will quickly find that they can program the Phi in much the same manner they program existing x86 systems. The challenge lies in expressing enough parallelism and vectorization to achieve high floating-point performance, as the Intel Xeon Phi coprocessors provide more than an order-of-magnitude increase in core count over current-generation quad-core processors. Massive vector parallelism is the path to realizing that performance.
The focus of this first article is to get up and running on Intel Xeon Phi as quickly as possible. Complete working examples will show that only a single offload pragma is required to adapt an OpenMP square-matrix multiplication example to run on a Phi coprocessor. Performance comparisons demonstrate that both the pragma-based offload model and using Intel Xeon Phi as an SMP processor compare favorably against the MKL library optimized for the host, and that the optimized Phi MKL library can easily deliver over a teraflop. A second installment, next week, will discuss how programming the Phi compares with CUDA programming.
The Xeon Phi Hardware Model from a Software Perspective
The Intel Xeon Phi KNC processor is essentially a 60-core SMP chip in which each core has a dedicated 512-bit-wide vector unit. All the cores are connected via a 512-bit bidirectional ring interconnect (Figure 1). Currently, the Phi coprocessor is packaged as a separate PCIe device, external to the host processor. Each Phi contains 8 GB of RAM that provides all the memory and file-system storage available to user processes, the Linux operating system, and ancillary daemon processes. The Phi can mount an external host file system, which should be used for all file-based activity to conserve device memory for user applications. Even though Linux on Intel Xeon Phi provides a conventional SMP virtual-memory environment, the coprocessor cards do not support paging to an external device.
Figure 1: Knights Corner microarchitecture.
A preproduction card using a Knights Corner chip achieved 189 GB/s on the STREAM Triad benchmark with ECC (Error Correcting Code) enabled; the production cards shipping next month are expected to deliver higher performance. The theoretical maximum bandwidth of the Intel Xeon Phi memory system is 352 GB/s (5.5 GT/s × 16 channels × 4 bytes per transfer), but internal bandwidth limitations inside the KNC chips (specifically the ring interconnect), plus the overhead of ECC memory, limit achievable performance to 200 GB/s or less.
Each Intel Xeon Phi core is based on a modified Pentium processor design that supports hyperthreading and some new x86 instructions created for the wide vector unit. As illustrated in Figure 2, developers need to utilize both parallelism and vector processing to achieve high performance. Programmers are free to work with their preferred programming languages and parallelism models so long as the application can scale to match Phi capabilities.
Figure 2: High-performance Xeon Phi applications exploit both parallelism and vector processing.
Per the analysis presented earlier in my Dr. Dobb's Journal article, "Intel's 50+ Core MIC Architecture," memory capacity will likely be the main limitation for the current generation of Phi devices (previously called "MIC") especially for those who wish to run native SMP or MPI applications directly on the device.
The current PCIe packaging complicates the offload programming model because external data breaks the SMP execution model's assumption that any thread can access any data in a shared-memory system without paying a significant performance penalty. As can be seen in Figure 3, the PCIe bandwidth is significantly lower than that of the on-board memory.
Figure 3: PCIe and memory bandwidths.
So achieving high offload computational performance with external coprocessors requires that developers:
- Transfer the data across the PCIe bus to the coprocessor and keep it there
- Give the coprocessor enough work to do
- Focus on data reuse within the coprocessor(s) to avoid memory-bandwidth bottlenecks and minimize data movement back and forth to the host processor
Be aware that the preproduction Intel Xeon Phi cards have only one DMA engine, so any communications (network file system, MPI, sockets, ssh, and so forth) between the coprocessor and host can interfere with offload data transfers and thereby degrade application performance.
While the aggregate Intel Xeon Phi computational performance is high, each core is slow and has limited floating-point performance compared with a modern Sandy Bridge processor core. High performance can be achieved only when a large number of parallel threads (a minimum of 120) are utilized, and they issue instructions to the wide vector units quickly enough to keep the vector pipelines full. The current generation of coprocessor cores supports up to four concurrent threads of execution via hyperthreading. Most developers will rely on the compiler to recognize when the special Intel Xeon Phi wide vector instructions can be issued to the per-core vector units. (More adventurous programmers can utilize compiler intrinsic operations or assembly language to access the vector units directly.) This means that existing libraries and applications must be recompiled to run well on the Phi. In general, the best floating-point performance will be realized when each core runs two threads that actively issue instructions to the vector unit. For a 61-core coprocessor, this means the programmer must be able to effectively utilize 120 threads (two threads on each of the 60 cores that remain after one core is reserved for the operating system).
Empirically, it appears that the internal Pentium cores are not fast enough to keep their associated vector units busy when running only one thread. Running two threads per core appears to be the sweet spot: the minimum thread count that generally delivers the best performance. This is only a rule of thumb, as much depends on the type and amount of work performed by each thread before it issues a vector operation. (Note that future Intel Xeon Phi products will likely support greater parallelism, so writing applications that can scale to higher thread counts is highly encouraged.)
The key to Intel Xeon Phi floating-point performance is the efficient use of the per-core vector unit. To access the vector unit, the compiler must be able to recognize SSE-compatible constructs so it can generate the special Intel Xeon Phi vector instructions. Developers with legacy code can test whether their applications will benefit from Xeon Phi floating-point capability by simply telling the compiler to utilize SSE instructions on their current x86 processor (through the GNU -msse switch or its equivalent). Applications that run faster with SSE (or, conversely, slow down when SSE instructions are disabled) will likely benefit from the Intel Xeon Phi wide vector unit. Applications that don't benefit from the SSE instruction set will be limited to the performance of the individual Pentium-based cores. Although this means Intel Xeon Phi will probably not be a performance star for non-vector applications, these coprocessors can still be used as support devices that provide many-core parallelism and high memory bandwidth.