Intel's 50+ core MIC architecture: HPC on a Card or Massive Co-Processor?


MIC as a Linux compute node

The MIC architecture is based on modified Pentium processing cores coupled to a per-core vector unit. The performance implications of this design decision will most certainly be hotly debated by the GPU and MIC communities for years to come. The Intel Server Room Blog notes, "The MIC architecture products are first and foremost compute nodes. They run an open source [L]inux OS, they are networked and can run applications." So, let's analyze MIC as if it were a separate many-core Linux computer connected to the host system by the PCIe bus, or MIC-as-a-compute-node.

From a source compatibility point of view, this model is attractive to organizations with millions of lines of legacy code: take your existing legacy source code, recompile for MIC using the "-mmic" compiler flag, and run. Most build systems make it easy to specify both the compiler and any special flags like "-mmic", which lends support to the claim by Jeff Nichols, a Director at Oak Ridge National Laboratory, that they were able to port "millions of lines of code... literally in days" to MIC. MIC-as-a-compute-node is of interest to owners of existing OpenMP, MPI + vector, and hybrid MPI + OpenMP applications, as these codes have the potential to recompile and run.

But how well will it run? Even without hardware at hand, it is possible to get a sense of how well recompiled x86-based legacy code will run on MIC-as-a-compute-node by considering three factors: Amdahl's Law, wide SSE-like vector characteristics, and on-board memory capacity.

Amdahl's Law models the ideal speedup that can be achieved when a single-threaded program is modified to run on parallel hardware. The speedup of a program running on multiple processors is limited by the time required by the sequential fraction of the program. In the best case, those sections of code that can be parallelized have their runtime reduced by a factor of N, where N is the number of parallel processing elements. The time taken to complete the serial sections of code (i.e., those sections that cannot be parallelized) does not change, which means the serial sections can dominate the runtime when N is large.
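The relationship described above can be written as speedup = 1 / (s + (1 - s) / N), where s is the serial fraction of the runtime. The following is a minimal sketch of that formula; the function name and the 5% serial fraction used in the demonstration are illustrative assumptions, not measurements of any real application.

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Ideal speedup on n processing elements according to Amdahl's Law."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # The serial part is unchanged; the parallel part shrinks by a factor of n.
    return 1.0 / (serial_fraction + parallel_fraction / n)


if __name__ == "__main__":
    # Hypothetical program whose runtime is 5% serial (an assumed value).
    for n in (1, 8, 50, 1000):
        print(f"N={n:5d}  ideal speedup={amdahl_speedup(0.05, n):6.2f}")
```

As N grows, the result approaches 1 / s, so a program that is 5% serial can never run more than 20 times faster regardless of the core count.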

Co-processors have the advantage of being able to exploit the performance capabilities of the latest and highest clock rate processors in the host system. In contrast, using MIC-as-a-compute-node means that serial sections of code will run on a single 1.2 GHz to 1.6 GHz Pentium core. This difference in processor performance relative to a state-of-the-art processing core can increase the fraction of time spent in sequential code and cause applications to run more slowly on MIC-as-a-compute-node compared to MIC-as-a-coprocessor.
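To make the comparison concrete, here is a rough sketch that extends Amdahl's Law with a per-core slowdown factor. Everything numeric in it is an assumption chosen for illustration: the 4x slowdown of a MIC core relative to a host core, the 5% serial fraction, and the decision to ignore PCIe transfer costs in the coprocessor case. Runtimes are normalized so that the whole program on one host core takes 1.0.

```python
def normalized_runtime(serial_fraction: float, n_cores: int,
                       core_slowdown: float, serial_on_host: bool) -> float:
    """Runtime relative to running the whole program on one host core.

    core_slowdown is how many times slower one MIC core is than a host core
    (a hypothetical parameter; no measured value is available).
    """
    # Serial code runs at host speed (coprocessor) or MIC-core speed (compute node).
    serial_cost = serial_fraction * (1.0 if serial_on_host else core_slowdown)
    # In both usage models the parallel code runs on the MIC cores.
    parallel_cost = (1.0 - serial_fraction) * core_slowdown / n_cores
    return serial_cost + parallel_cost


if __name__ == "__main__":
    s, cores, slowdown = 0.05, 50, 4.0  # all assumed values
    coprocessor = normalized_runtime(s, cores, slowdown, serial_on_host=True)
    compute_node = normalized_runtime(s, cores, slowdown, serial_on_host=False)
    print(f"MIC-as-a-coprocessor : {coprocessor:.3f}")
    print(f"MIC-as-a-compute-node: {compute_node:.3f}")
```

Under these assumptions the only difference between the two models is the serial term, which is why a modest serial fraction can separate them noticeably. A real coprocessor run would also pay data-transfer costs that this sketch leaves out.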

The key to MIC floating-point performance is the efficient use of the per-core vector unit. To access the vector unit, the compiler must be able to recognize SSE-compatible constructs so it can generate the MIC SSE-like assembly language instructions. The test with your current hardware and compiler is simple: tell your compiler to utilize the SSE instructions on your x86 processor through the "-msse" or other compiler switch. Applications that run faster will probably benefit from the MIC vector unit. (Conversely, disable the use of SSE instructions and check whether the application slows down.) Those applications that don't benefit from the SSE instruction set will most likely be limited to the performance of the individual Pentium-based cores (that is, to the performance of 50 to 64 Pentium processors running at 1.2 GHz to 1.6 GHz). For additional analysis and discussion of "-msse" compatibility on MIC, see Greg Pfister's "MIC and the Knights".

Balance ratios are conventional, established measures used in HPC to evaluate potential system performance. My 2010 GTC presentation lists the four important balance ratios for the current PNNL (Pacific Northwest National Laboratory) Chinook supercomputer: memory capacity, memory bandwidth, aggregate link bandwidth, and interconnect latency.

Table 2 below compares the MIC balance ratios for legacy workloads against the PNNL Chinook supercomputer. (Note: This table makes several assumptions about MIC capabilities and so the values should be considered with caution.) The ratios can be easily updated as Intel publishes more performance data on KNC:

  • Intel has not yet released information about the amount of memory that will be available on each MIC card. This table arbitrarily assumes that each MIC card will contain 8 GB of RAM.
  • Intel has not yet released information about how fast MIC can communicate across the PCIe bus. The table arbitrarily assumes effective utilization of the available PCIe bandwidth (e.g. 16 GB/s on a PCIe gen-2 bus and 32 GB/s should the Knights Corner cards utilize a PCIe gen-3 interface).
  • Intel has not yet released information about communication latency through the internal ring interconnect or across the PCIe bus. The following table assumes that communications latency will be software limited and comparable to existing InfiniBand software stacks.
  • This table lists 8-core Chinook and 50-core KNC balance ratios based on the assumption that peak floating-point performance will be achieved by all the per-core vector units.

| Balance Category | MIC Knights Corner (50-core) | Chinook (8-core) |
|---|---|---|
| Memory Amount (Bytes/flop) | 0.008 | 0.46 |
| Memory Bandwidth (B/s/flop/s) | ? | 0.21 |
| Aggregate Link BW (B/s/flop/s) | 0.016 (PCIe gen-2), 0.032 (PCIe gen-3) | 0.17 |
| Interconnect Latency (ms) | < 2 | 1.1 |

Table 2: Potential balance ratios for MIC-as-a-compute-node

With the exception of latency, larger values are preferred for generic legacy workloads such as NWChem – a porting effort referenced in the MIC literature – for which the Chinook supercomputer was designed. The small ratio of MIC memory capacity to flops indicates that memory capacity will be a significant obstacle for many legacy applications. From a communications bandwidth and latency standpoint, MIC-as-a-compute-node is very interesting, as most applications will run at a fraction of the peak floating-point rate. In particular, a PCIe gen-3 bus has the potential to act as a 256 Gb/s data link, which exceeds current commodity InfiniBand capabilities.
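The arithmetic behind the Knights Corner column of Table 2 can be sketched as follows. The 8 GB memory size and the PCIe bandwidths are the assumptions stated in the list above. The peak rate of 1 Tflop/s is not given in the text; it is inferred here because it is the value that makes 8 GB correspond to 0.008 bytes/flop, so treat it as an assumption as well.

```python
GB = 1e9  # decimal gigabyte, matching the GB/s figures quoted in the text

# Assumed values: none of these are published Knights Corner specifications.
ASSUMED_PEAK_FLOPS = 1e12        # inferred from 8 GB / 0.008 bytes per flop
ASSUMED_CARD_MEMORY = 8 * GB     # arbitrary assumption stated in the article
PCIE_GEN2_BANDWIDTH = 16 * GB    # bytes/s, as assumed in the article
PCIE_GEN3_BANDWIDTH = 32 * GB    # bytes/s, as assumed in the article


def balance_ratio(resource: float, peak_flops: float) -> float:
    """Resource (bytes or bytes/s) available per flop (or flop/s) of peak rate."""
    return resource / peak_flops


def gigabits_per_second(bytes_per_second: float) -> float:
    """Convert a bandwidth in bytes/s to Gb/s."""
    return bytes_per_second * 8 / 1e9


if __name__ == "__main__":
    print(f"Memory amount  : {balance_ratio(ASSUMED_CARD_MEMORY, ASSUMED_PEAK_FLOPS):.3f} bytes/flop")
    print(f"Link BW (gen-2): {balance_ratio(PCIE_GEN2_BANDWIDTH, ASSUMED_PEAK_FLOPS):.3f} B/s/flop/s")
    print(f"Link BW (gen-3): {balance_ratio(PCIE_GEN3_BANDWIDTH, ASSUMED_PEAK_FLOPS):.3f} B/s/flop/s")
    print(f"PCIe gen-3 link: {gigabits_per_second(PCIE_GEN3_BANDWIDTH):.0f} Gb/s")
```

Because each ratio is a simple quotient, the table can be updated by substituting published memory, bandwidth, and peak-rate figures once they are available.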

When running on MIC-as-a-compute-node, legacy "compile and run" customers should consider the following factors as listed in Table 3.

Feature Categories

Projected Application Profile to Run Well on MIC-as-a-Compute-Node

Memory Usage

  • MPI and OpenMP application code + data must have a small per-core memory footprint.
  • Must be cache friendly to efficiently use the on-core cache memory and avoid memory bandwidth bottlenecks.
  • Need to avoid serialization side effects from semaphores and atomic operations, such as those used by C++ smart pointers and reference-counted objects.

Balance of Scalar and Vector

  • High flop rates will be achieved through SSE-like vector operations.
  • Need to avoid Amdahl's Law sequential bottlenecks due to low Pentium performance compared to modern high clock-rate CPU cores.

Table 3: Projected application characteristics to run well on MIC-as-a-compute-node
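As a rough illustration of the "small per-core memory footprint" requirement, the sketch below divides the card memory evenly among the cores. The 8 GB card size is the article's arbitrary assumption, the 50-core count comes from the Knights Corner column of Table 2, and the `reserved_gb` parameter (memory held back for the on-card Linux OS and runtime) is a hypothetical knob whose real value is unknown.

```python
def per_core_budget_mb(card_memory_gb: float, n_cores: int,
                       reserved_gb: float = 0.0) -> float:
    """Memory available to each core, in decimal MB, if the card is shared evenly.

    reserved_gb is a hypothetical allowance for the on-card OS and runtime.
    """
    usable_gb = card_memory_gb - reserved_gb
    if usable_gb <= 0 or n_cores < 1:
        raise ValueError("need positive usable memory and at least one core")
    return usable_gb * 1000.0 / n_cores


def fits_on_card(footprint_mb_per_core: float, card_memory_gb: float,
                 n_cores: int, reserved_gb: float = 0.0) -> bool:
    """True if one process per core, each with this footprint, fits in memory."""
    return footprint_mb_per_core <= per_core_budget_mb(
        card_memory_gb, n_cores, reserved_gb)


if __name__ == "__main__":
    budget = per_core_budget_mb(card_memory_gb=8, n_cores=50)  # assumed card
    print(f"Per-core budget: {budget:.0f} MB")
    # Hypothetical MPI rank that needs 500 MB of code + data.
    print("500 MB per rank fits:", fits_on_card(500, 8, 50))
```

An MPI code that runs one rank per core replicates some data in every rank, so the per-rank footprint, rather than the total problem size, is the number to compare against this budget.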

In summary, MIC does not eliminate the need to rewrite legacy applications, except for those applications that can run in the memory footprint of the PCIe device. Of that subset, only those applications that do not rely on serial calculations and depend mostly on SSE acceleration will be able to benefit from the per-core vector unit to achieve high floating-point performance. Further, high performance will probably require cache-friendly applications. Those applications that can recompile and run will probably still require modification to make full use of the MIC capabilities. In particular, memory limitations will most likely require re-architecting the application to use the current generation of MIC devices as a co-processor.

