Intel's 50+ core MIC architecture: HPC on a Card or Massive Co-Processor?


Convergent Evolution in HPC: Intel MIC and NVIDIA GPU

Five years ago, NVIDIA disrupted the high-performance computing industry with the release of CUDA in February 2007. In combination with the low cost of teraflop/sec (single-precision) GPU hardware, NVIDIA brought supercomputing to the masses and co-processor acceleration to both C and Fortran applications. With the MIC announcement, Intel has followed suit along this convergent evolutionary path.

While MIC is similarly packaged as a PCIe device, Intel has taken a different architectural approach to massively parallel computing hardware. The Knights Corner (KNC) generation of MIC products appears to be HPC-oriented, which means high-end customers can now choose between two types of teraflop/sec-capable PCIe-based co-processors.

Comparing NVIDIA GPUs and Intel MIC

NVIDIA GPUs

GPU designs utilize many streaming multiprocessors (SMs), where each SM can run up to 32 concurrent SIMD (Single Instruction, Multiple Data) threads of execution. The current generation of Fermi GPUs supports 512 concurrent SIMD threads of execution that can be sub-divided into 16 separate SIMT (Single Instruction, Multiple Thread) tasks. The upcoming NVIDIA Kepler GPUs will support even greater parallelism; for example, the GTX 680 will support 1,536 concurrent SIMD threads.

Teraflop/sec performance is achieved through a per-SM hardware scheduler that can quickly identify those SIMD instructions that are ready to run (meaning they have no unresolved dependencies). Ready-to-run instructions are then dispatched to keep multiple integer, floating-point, and special function units busy. A per-GPU hardware scheduler similarly allocates work (via CUDA thread blocks or OpenCL™ work-groups) to ensure high utilization across all the SMs on a GPU.

High flops/watt efficiency is realized through the use of a SIMD execution model inside each SM, which requires less supporting logic than non-SIMD architectures. GPU hardware architects have been able to capitalize on these savings by devoting more power and die area to 64-bit addressing, additional ALUs, floating-point units, and special function units for transcendental functions. Some reviewers report that NVIDIA expects Kepler to deliver "about 3x improvement in [double precision] performance per watt …" over Fermi.
Other notable characteristics include:

  • Data-parallel operations are spread across the SMs of one or more GPU devices.
  • Task-parallelism is accomplished by running concurrent kernels on different SMs and/or multiple devices plus the host processor.
  • MPI jobs are accelerated by using one or more GPUs per process and capabilities like GPUDirect, which optimizes data transfer into device memory (a minimal sketch of the per-process device selection follows this list).
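
To show what the "one or more GPUs per MPI process" pattern looks like in code, here is a minimal C sketch of my own (not from the article): each MPI rank picks a device through the CUDA runtime API before allocating device memory. The round-robin rank-to-device mapping, the buffer size, and the build line are illustrative assumptions.

/*
 * Sketch: one GPU per MPI rank via the CUDA runtime C API.
 * Illustrative build line (paths are assumptions):
 *   mpicc mpi_gpu.c -o mpi_gpu -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, ndev;
    float *d_buf;
    const size_t n = 1 << 20;                 /* 1M floats, arbitrary size */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaGetDeviceCount(&ndev);                /* GPUs visible on this node */
    if (ndev < 1) {
        fprintf(stderr, "rank %d: no CUDA device visible\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    cudaSetDevice(rank % ndev);               /* round-robin rank-to-GPU map */

    cudaMalloc((void **)&d_buf, n * sizeof(float));
    /* ... launch kernels; a GPUDirect-aware MPI can move data directly
       to and from this device buffer ... */
    cudaFree(d_buf);

    printf("rank %d used GPU %d of %d\n", rank, rank % ndev, ndev);
    MPI_Finalize();
    return 0;
}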

Intel MIC

The Intel MIC architecture in the KNC chip utilizes x86 Pentium-based processing cores that support four threads per core. According to The Register, the next-generation Knights Corner has "64 cores on the die, and depending on yields and the clock speeds that Intel can push on the chip, it will activate somewhere between 50 and 64 of those cores and run them at 1.2GHz to 1.6GHz". The preceding implies that each KNC chip will provide between 200 and 256 concurrent threads of execution (50 to 64 cores × 4 threads per core).

Teraflop/sec floating-point performance can be achieved when enough of the chip's threads issue special SSE-like instructions to fully utilize the enhanced vector/SIMD unit that resides on each core. (Note: this requires either the special "-mmic" compiler switch, which tells the Intel compilers to look for cases where these MIC-specific vector instructions can be utilized, or hand-coding with intrinsic operations.)
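
As a concrete illustration (my own, not from the article or Intel), the loop below is the sort of code the Intel compiler can map onto the per-core vector unit when built with the "-mmic" switch. The SAXPY-style loop, the array size, the "#pragma simd" vectorization hint, and the build line are illustrative assumptions.

/*
 * Sketch: a loop the Intel compiler can vectorize for the MIC vector unit.
 * Illustrative build line (assumption): icc -mmic -O3 -vec-report2 saxpy.c
 */
#include <stdlib.h>
#include <stdio.h>

#define N (1024 * 1024)

int main(void)
{
    float *x = malloc(N * sizeof(float));
    float *y = malloc(N * sizeof(float));
    const float a = 2.0f;
    int i;

    for (i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* With -mmic, the compiler targets the 512-bit vector unit for this
       loop; #pragma simd is an Intel-specific request to vectorize. */
#pragma simd
    for (i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    free(x);
    free(y);
    return 0;
}

The same source compiled without -mmic simply runs on the host, which is part of the appeal of the MIC programming model.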

High flops/watt efficiency is realized by leveraging the simplicity of the original Pentium design, with its short, in-order execution pipeline, and the power savings of Intel's 22 nm manufacturing process. MIC also derives high flops/watt from its wide vector units. The logic for the Pentium core is small relative to modern processor cores, which left room for additional logic to support 64-bit addressing, four concurrent threads per core, and a large 512-bit wide vector unit. Per the TACC Stampede announcement, the KNC per-core vector unit is expected to deliver 50% higher floating-point performance in 2013.
Other notable characteristics include:

  • Data-parallel tasks appear to be mainly accelerated by the per-core vector units.
  • Task-parallelism is accelerated by running a task per thread and separate tasks on the device(s) and host processor (a small OpenMP sketch follows this list).
  • MPI jobs are accelerated by using one or more MIC devices per process, or by capabilities like MIC-as-a-compute-node, discussed later in this article.
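
To make the thread-per-task model concrete, here is a small OpenMP sketch of my own (not from the article). The two worker functions, the problem sizes, and the build line are illustrative assumptions; the point is that ordinary OpenMP source, built with "-mmic", can run natively on the co-processor with independent tasks running on separate hardware threads.

/*
 * Sketch: task parallelism as OpenMP sections, one independent task per thread.
 * Illustrative build line (assumption): icc -mmic -openmp tasks.c
 */
#include <omp.h>
#include <stdio.h>

static double sum_inverse_squares(long n)
{
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += 1.0 / ((double)i * i);
    return s;
}

static double sum_inverses(long n)
{
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += 1.0 / (double)i;
    return s;
}

int main(void)
{
    double a = 0.0, b = 0.0;

    /* Each section is an independent task that can run on its own
       hardware thread on the MIC (or on the host). */
#pragma omp parallel sections
    {
#pragma omp section
        a = sum_inverse_squares(50 * 1000 * 1000L);
#pragma omp section
        b = sum_inverses(50 * 1000 * 1000L);
    }

    printf("a=%f  b=%f  (up to %d threads available)\n",
           a, b, omp_get_max_threads());
    return 0;
}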

 

Degree of parallelism
  • NVIDIA GPU: Fermi supports 512 concurrent threads of execution; Kepler will triple this number to 1,536.
  • Intel MIC: Knights Corner is expected to support between 200 and 256 concurrent threads.

Achieving high performance
  • NVIDIA GPU: A per-SM hardware scheduler keeps multiple computational units busy by identifying and dispatching ready-to-run SIMD instructions.
  • Intel MIC: The compiler or programmer utilizes special SSE-like instructions to keep each per-core vector unit busy.

Achieving power efficiency
  • NVIDIA GPU: The per-SM SIMD execution model requires less supporting logic, leading to high power efficiency and floating-point performance; expect a 3x increase in Kepler double-precision efficiency.
  • Intel MIC: Leverages the simplicity of the original Pentium design and the floating-point capability of a 512-bit vector unit, along with the power savings of a 22 nm manufacturing process.

Data-parallel acceleration
  • NVIDIA GPU: Data-parallel operations are spread across the SMs of one or more GPU devices.
  • Intel MIC: Data-parallel operations are accelerated by the per-core vector units and spread across the cores of one or more devices.

Task-parallel acceleration
  • NVIDIA GPU: Concurrent kernel execution allows multiple kernels to run on one or more SMs.
  • Intel MIC: Concurrent threads run multiple tasks on the device.

MPI acceleration
  • NVIDIA GPU: MPI jobs are accelerated by using one or more GPUs per MPI process and optimized data transfer capabilities like GPUDirect.
  • Intel MIC: MPI jobs are accelerated by using one or more MIC devices per MPI process, or one MPI process per MIC core.

Table 1: GPGPU and MIC architectural approaches to massive parallelism

For more detailed information about the MIC architecture, I recommend reading "Larrabee: A Many-Core x86 Architecture for Visual Computing," as MIC is based on the Larrabee architecture with the visualization capability removed. The NVIDIA documentation (such as the Fermi whitepaper), my tutorial series in Dr. Dobb's, and my book, "CUDA Application Design and Development," are good sources for more detailed information about NVIDIA GPUs.

