Programming Intel's Xeon Phi: A Jumpstart Introduction


The preproduction Intel Xeon Phi coprocessors based on the Knights Corner (KNC) chips can provide well over one teraflop of floating-point performance. Developers can reach this supercomputing level of number crunching power via one of several routes:

  • Using pragmas to augment existing codes so they offload work from the host processor to the Intel Xeon Phi coprocessor(s)
  • Recompiling source code to run directly on the coprocessor as a separate many-core Linux SMP compute node
  • Accessing the coprocessor as an accelerator through optimized libraries such as the Intel MKL (Math Kernel Library)
  • Using each coprocessor as a node in an MPI cluster or, alternatively, as a device containing a cluster of MPI nodes.

From this list, experienced programmers will recognize that the Phi coprocessors support the full gamut of modern and legacy programming models. Most developers will quickly find that they can program the Phi in much the same manner that they program existing x86 systems. The challenge lies in expressing sufficient parallelism and vector capability to achieve high floating-point performance, as the Intel Xeon Phi coprocessors provide more than an order of magnitude more cores than current-generation quad-core processors. Massive vector parallelism is the path to realizing that performance.

The focus of this first article is to get up and running on Intel Xeon Phi as quickly as possible. Complete working examples will show that only a single offload pragma is required to adapt an OpenMP square-matrix multiplication example to run on a Phi coprocessor. Performance comparisons demonstrate that both the pragma-based offload model and using Intel Xeon Phi as an SMP processor compare favorably against the MKL library optimized for the host, and that the optimized Phi MKL library can easily deliver over a teraflop. A second installment, next week, will discuss how programming the Phi compares with CUDA programming.
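As a rough illustration of the single-pragma claim, here is a minimal sketch of an OpenMP square-matrix multiply with one offload pragma added. It is not the article's own example: the function name `mat_mult` is hypothetical, and the sketch assumes the Intel compiler's `#pragma offload target(mic)` syntax with `in`/`out` clauses and its `__INTEL_OFFLOAD` predefined macro. With any other compiler the guard hides the pragma and the same loop nest runs on the host.

```c
#include <assert.h>

/* Multiply two n x n row-major matrices: C = A * B.
   mat_mult is a hypothetical name used only for this sketch. */
void mat_mult(float *A, float *B, float *C, int n)
{
    assert(n > 0);
#ifdef __INTEL_OFFLOAD
    /* Assumed Intel offload syntax: copy A and B to the coprocessor,
       run the following block there, then copy C back to the host. */
#pragma offload target(mic) in(A : length(n * n)) in(B : length(n * n)) out(C : length(n * n))
#endif
    {
        /* Each thread works on its own rows of C, so there are no races. */
#pragma omp parallel for
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                C[i * n + j] = 0.0f;
            for (int k = 0; k < n; k++) {
                float a = A[i * n + k];
                /* Stride-1 inner loop that a compiler can vectorize. */
                for (int j = 0; j < n; j++)
                    C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}
```

Removing the guarded pragma leaves an ordinary OpenMP routine, which is the form that could be recompiled to run directly on the coprocessor.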

The Xeon Phi Hardware Model from a Software Perspective

The Intel Xeon Phi KNC processor is essentially a 60-core SMP chip where each core has a dedicated 512-bit wide SIMD vector unit (conceptually a much wider relative of the SSE, or Streaming SIMD Extensions, units found on host processors, with its own instruction set). All the cores are connected via a 512-bit bidirectional ring interconnect (Figure 1). Currently, the Phi coprocessor is packaged as a separate PCIe device, external to the host processor. Each Phi contains 8 GB of RAM that provides all the memory and file-system storage that every user process, the Linux operating system, and ancillary daemon processes will use. The Phi can mount an external host file-system, which should be used for all file-based activity to conserve device memory for user applications. Even though Linux on Intel Xeon Phi provides a conventional SMP virtual memory environment, the coprocessor cards do not support paging to an external device.


Figure 1: Knights Corner microarchitecture.

A preproduction card using a Knights Corner chip achieved 189 GB/s on the STREAM triad benchmark with ECC (Error Correcting Code) enabled. The production Intel cards shipping next month are expected to deliver higher performance. The theoretical maximum bandwidth of the Intel Xeon Phi memory system is 352 GB/s (5.5 GTransfers/s * 16 channels * 4 B/Transfer), but internal bandwidth limitations inside the KNC chips (specifically the ring interconnect) plus the overhead of ECC memory limit achievable performance to 200 GB/s or less.
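The bandwidth arithmetic above can be checked with a few lines of C. This is only a sketch of the calculation quoted in the text; the function names are hypothetical.

```c
#include <assert.h>

/* Theoretical peak in GB/s = transfer rate (GT/s) * channels * bytes per transfer. */
double peak_bandwidth_gbs(double gtransfers_per_s, int channels, int bytes_per_transfer)
{
    assert(channels > 0 && bytes_per_transfer > 0);
    return gtransfers_per_s * channels * bytes_per_transfer;
}

/* Fraction of the theoretical peak that a measured result achieves. */
double bandwidth_efficiency(double measured_gbs, double peak_gbs)
{
    assert(peak_gbs > 0.0);
    return measured_gbs / peak_gbs;
}
```

With the figures quoted above, `peak_bandwidth_gbs(5.5, 16, 4)` gives 352 GB/s, and the measured 189 GB/s works out to roughly 54% of that peak.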

Each Intel Xeon Phi core is based on a modified Pentium processor design that supports hyperthreading and some new x86 instructions created for the wide vector unit. As illustrated in Figure 2, developers need to utilize both parallelism and vector processing to achieve high performance. Programmers are free to work with their preferred programming languages and parallelism models so long as the application can scale to match Phi capabilities.


Figure 2: High-performance Xeon Phi applications exploit both parallelism and vector processing.

Per the analysis presented earlier in my Dr. Dobb's Journal article, "Intel's 50+ Core MIC Architecture," memory capacity will likely be the main limitation for the current generation of Phi devices (previously called "MIC") — especially for those who wish to run native SMP or MPI applications directly on the device.

The current PCIe packaging complicates the offload programming model because data held in host memory breaks an assumption made by the SMP execution model: that any thread can access any data in a shared memory system without paying a significant performance penalty. As Figure 3 shows, the PCIe bandwidth is significantly lower than that of the on-board memory.


Figure 3: PCIe and memory bandwidths.

So achieving high offload computational performance with external coprocessors requires that developers:

  • Transfer the data across the PCIe bus to the coprocessor and keep it there
  • Give the coprocessor enough work to do
  • Focus on data reuse within the coprocessor(s) to avoid memory bandwidth bottlenecks and moving data back and forth to the host processor.
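One way to judge whether an offload gives the coprocessor "enough work" is to compare the arithmetic performed with the bytes that must cross the PCIe bus. The sketch below does this for a single-precision n x n matrix multiply, assuming A and B are sent once, C is returned once, and a float occupies 4 bytes. The function names are hypothetical and no hardware figures are assumed.

```c
#include <assert.h>

/* An n x n matrix multiply performs n^3 multiplies and n^3 adds. */
double matmul_flops(int n)
{
    assert(n > 0);
    return 2.0 * n * n * n;
}

/* Bytes moved over PCIe: three n x n float matrices (A and B in, C out). */
double matmul_transfer_bytes(int n)
{
    assert(n > 0);
    return 3.0 * n * n * sizeof(float);
}

/* Work available per transferred byte; simplifies to n / 6 for 4-byte floats. */
double flops_per_transferred_byte(int n)
{
    return matmul_flops(n) / matmul_transfer_bytes(n);
}
```

Because the ratio grows linearly with n, larger matrices amortize the transfer cost better, and reusing data already resident on the coprocessor raises the ratio further.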

Be aware that the preproduction Intel Xeon Phi cards have only one DMA engine, so any communications (network file-system, MPI, sockets, ssh, and so forth) between the coprocessor and host can interfere with offload data transfers and thereby slow application performance.

While the aggregate Intel Xeon Phi computational performance is high, each core is slow and has limited floating-point performance when compared with a modern Sandy Bridge processor. High performance can be achieved only when a large number of parallel threads (minimum 120) are utilized, and they issue instructions to the wide vector units quickly enough to keep the vector pipeline full. Each current-generation coprocessor core supports up to four concurrent threads of execution via hyperthreading. Most developers will rely on the compiler to recognize when the special Intel Xeon Phi wide vector instructions can be issued to the per-core vector units. (More-adventurous programmers can use compiler intrinsics or assembly language to access the vector units.) This means that existing libraries and applications must be recompiled to run well on the Phi. In general, the best floating-point performance will be realized when each core is running two threads that actively issue instructions to the vector unit. For a 61-core coprocessor, this means that the programmer must be able to effectively utilize 120 threads inside the application (two threads on each of the 60 cores that remain after one core is reserved for the operating system).

Empirically, it appears that the internal Pentium cores are not fast enough to keep their associated per-core vector unit busy when running only one thread. Two threads per core appears to be the minimum thread count that generally reaches the performance sweet spot. This is only a general rule of thumb, as much depends on the type and amount of work performed by each thread before it issues a vector operation. (Note that future Intel Xeon Phi products will likely support greater parallelism, so the ability to support higher application thread counts is highly encouraged.)
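The thread-count rule of thumb reduces to simple arithmetic, sketched here with a hypothetical helper. The only inputs are the figures given in the text: 61 cores, one reserved for the operating system, and between two and four threads per core.

```c
#include <assert.h>

/* Threads to request = (total cores - cores reserved for the OS) * threads per core. */
int target_thread_count(int total_cores, int reserved_cores, int threads_per_core)
{
    assert(total_cores > reserved_cores && reserved_cores >= 0);
    assert(threads_per_core > 0);
    return (total_cores - reserved_cores) * threads_per_core;
}
```

For a 61-core card, `target_thread_count(61, 1, 2)` yields the 120 threads discussed above, and four threads per core would give 240. In an OpenMP program the result could be passed to `omp_set_num_threads`, although the best value remains workload dependent.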

The key to Intel Xeon Phi floating-point performance is the efficient use of the per-core vector unit. To access the vector unit, the compiler must be able to recognize SSE-compatible constructs so it can generate the special Intel Xeon Phi vector instructions. Developers with legacy code can test if their applications will benefit from Xeon Phi floating-point capability by simply telling the compiler to utilize the SSE instructions on the current x86 processor (through the GNU -msse or other compiler switch). Applications that run faster with SSE (or conversely slow down when the use of SSE instructions is disabled) will likely benefit from the Intel Xeon Phi wide vector unit. Applications that don't benefit from the SSE instruction set will be limited to the performance of the individual Pentium-based cores. Although this means Intel Xeon Phi will probably not be a performance star for non-vector applications, these coprocessors can still be used as support devices that provide many-core parallelism and high memory bandwidth.
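To make "vector-friendly constructs" concrete, here is a minimal sketch of the kind of loop a vectorizing compiler can usually handle: unit-stride accesses, no dependence between iterations, and `restrict`-qualified pointers that rule out aliasing. The function names are hypothetical. One way to run the experiment described above on a host, assuming GCC, is to time the same code built with `-O3` and with `-O3 -fno-tree-vectorize`.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
   restrict promises the compiler that x and y do not overlap. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Number of single-precision values that fit in one vector register. */
size_t floats_per_vector(size_t vector_bits)
{
    assert(vector_bits % (8 * sizeof(float)) == 0);
    return vector_bits / (8 * sizeof(float));
}
```

A 128-bit SSE register holds 4 floats while a 512-bit register holds 16, so a loop that already speeds up with SSE has considerably more headroom on the wider unit.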



Comments:

ubm_techweb_disqus_sso_-4d8fa8ccea523b7d61f2033f1273cdeb
2013-10-27T23:54:40

Ok, I found a workaround.

Copy the libraries from /opt/intel//composer_xe_2013_sp1.1.106/compiler/lib/mic to the directory where I'm running the program.


ubm_techweb_disqus_sso_-4d8fa8ccea523b7d61f2033f1273cdeb
2013-10-27T23:40:54

Hello,

When I ran the program on the Phi card I got

./firstMatrix.mic: error while loading shared libraries: libmkl_intel_lp64.so: cannot open shared object file: No such file or directory

Can you help me?


ubm_techweb_disqus_sso_-9693305658e3eb3e55bdc63a84d548fa
2013-02-09T09:51:08

Hello,

Is it possible to launch a Java program on the Phi card?
In that case, does the JVM 'see' all 60 cores, or will each JVM run on one core?

Thanks,


ubm_techweb_disqus_sso_-abcdf7519f63ec03c8fe76e793f18960
2012-12-14T12:50:05

Knight's corner but where is the I/O?
nVidia graphics card running as Linux cluster makes an awesome GIMP rendering engine. Why use "Phi" when Tesla is already available? Makes a fair x86 cluster with cpu emulation; I don't think that Intel will promote that but hint AMD: the graphics engine on your APUs (and Tegra) is faster than the "cpu".

