Intel's 50+ core MIC architecture: HPC on a Card or Massive Co-Processor?


Programming Model Considerations for GPUs and MIC Co-processors

When evaluating co-processors for a legacy application, it is necessary to consider the characteristics of the programming model. Whenever possible, use benchmarks to evaluate the transfer and computational efficiency of co-processor applications that are similar to the intended application.

MPI has strong support as it is the de facto standard distributed scientific computing framework.  As mentioned above, developers must consider memory footprint and Amdahl's Law sequential code limitations when porting MPI code to MIC.  When porting to both MIC and GPU co-processors, some considerations are:

  • Most implementations use one co-processor per MPI process (see the sketch following this list).
  • Network bandwidth limitations tend to be a key bottleneck, making most MPI applications network bound rather than compute bound. Use whatever features are available (such as GPUDirect) to optimize data transfers.
  • Use all the co-processors from a single MPI process when communications latency is an issue. This can be particularly useful for applications that perform many latency-bound operations such as reductions.
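
As a rough illustration of the first point, the sketch below (a hypothetical example, assuming CUDA-capable co-processors and an MPI-3 library; the names and layout are illustrative rather than taken from any particular application) binds each MPI rank to one device based on its node-local rank:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Node-local rank via an MPI-3 shared-memory split; older MPI libraries
       would need a launcher-provided environment variable instead. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    /* Bind this process to one of the node's co-processors
       (one device per MPI process). Error checks omitted for brevity. */
    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    if (device_count > 0)
        cudaSetDevice(local_rank % device_count);

    printf("rank %d (node-local %d) using device %d of %d\n",
           world_rank, local_rank,
           device_count > 0 ? local_rank % device_count : -1, device_count);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}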

While directive-based programming such as OpenMP has strong support in the developer community, directives for co-processors are a work in progress. For legacy code based on OpenMP, code modifications will certainly be required because legacy applications have no concept of data location (they assume an SMP model). Something like "#pragma target (device)" is required. Standardization is moving quickly to prevent a "tower of Babel" proliferation of incompatible pragma specifications.
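
As a hedged illustration of what explicit data placement looks like, here is a minimal OpenACC-style sketch (assuming an OpenACC 1.0 compiler such as those from CAPS or PGI; the SAXPY-like loop is purely illustrative). The copyin/copy clauses describe movement across PCIe, a concept a legacy SMP OpenMP code never needed:

#include <stdio.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    /* The data clauses make the device-side location of x and y explicit:
       x is copied in, y is copied in and back out after the loop. */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; ++i)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
    return 0;
}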

Note that many OpenMP code bases were developed when two, four, or eight cores were considered "many." High core count processors may expose scaling issues. In this regard, you should:

  • Be aware that atomic and common synchronization operations (such as semaphores, reference counting, etc.) might expose unexpected serialization bottlenecks.
  • Pay particular attention to scaling behavior on cache-coherent architectures like MIC and the impact of conditional operations on SIMD-based GPU architectures.
  • Many legacy OpenMP applications employ directives only at the innermost loop level, which limits the achievable parallelism. Code modification may be required to expose more parallelism (see the sketch following this list).
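
A minimal sketch of that last point, assuming OpenMP 3.0 or later and an illustrative matrix-scaling routine, collapses a nested loop so the runtime can distribute rows*cols iterations rather than only the innermost loop's iterations:

#include <stdio.h>
#include <stdlib.h>

/* Collapsing the nest exposes rows*cols work items to the runtime,
   instead of parallelizing only the inner loop of each row. */
void scale_matrix(double *a, int rows, int cols, double s)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            a[(long)i * cols + j] *= s;
}

int main(void)
{
    int rows = 2000, cols = 2000;
    double *a = malloc((size_t)rows * cols * sizeof(double));
    for (long i = 0; i < (long)rows * cols; ++i) a[i] = 1.0;

    scale_matrix(a, rows, cols, 3.0);
    printf("a[0] = %f\n", a[0]);   /* expect 3.0 */

    free(a);
    return 0;
}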

Common libraries providing FFT and BLAS functionality should perform well on co-processors because they are optimized for a particular architecture and set of hardware capabilities.  (This assumes data transfers do not limit performance.) 
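
As a hedged sketch of this usage pattern, the example below assumes NVIDIA's cuBLAS library and a square SGEMM; it is meant only to show where the PCIe transfers sit relative to the optimized BLAS call:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 1024;                  /* square matrices for simplicity */
    size_t bytes = (size_t)n * n * sizeof(float);
    float *hA = malloc(bytes), *hB = malloc(bytes), *hC = malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;                 /* error checks omitted for brevity */
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    /* Host-to-device transfers across PCIe move O(n^2) data ... */
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    /* ... while the GEMM itself performs O(n^3) flops, which is why BLAS-3
       operations amortize the transfer cost better than BLAS-1 or BLAS-2. */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);        /* expect 2048.0 */

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}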

Language platforms based on a strong-scaling execution model, such as CUDA and OpenCL, will likely perform well on both architectures because they scale linearly with the number of processing elements and provide the best Amdahl's Law reduction in parallel code runtime. Scaling behavior of the computational kernels should not be an issue unless global atomic operations are utilized.
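
For reference, a strong-scaling kernel in OpenCL C might look like the minimal sketch below (device code only; the host-side context, command queue, and buffer setup are omitted, and the saxpy name is illustrative). Each work-item processes one element, so the same kernel maps onto however many processing elements the device provides:

/* One work-item per element; no inter-work-item synchronization or
   global atomics, so scaling depends only on device width and bandwidth. */
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global float *y,
                    const int n)
{
    int i = get_global_id(0);
    if (i < n)                  /* guard against a padded NDRange */
        y[i] = a * x[i] + y[i];
}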

In all programming approaches, high performance can be achieved when the compute-intensive portions of the application conform to the three rules of high-performance co-processor programming mentioned previously. If not, expect floating-point performance to be either PCIe or device-memory limited.

In summary, expect to use MIC and GPUs as co-processors, and expect software to evolve rapidly to hide differences between co-processor hardware. Growing support for OpenACC by vendors like CAPS and PGI can make co-processors an attractive, highly portable option for legacy OpenMP codes because the source code intrusion is relatively small. Vendor libraries provided by NVIDIA and Intel already provide an optimized framework for some applications. Generally, applications written in OpenCL and CUDA will deliver the greatest performance and longevity due to their use of a strong-scaling execution model that can achieve a linear parallel code speedup regardless of the number of processing elements. In addition, these languages provide asynchronous queues that can choreograph tasks and data movement among numerous devices.


Table 4 (below) summarizes the programming considerations for legacy codes, organized by approach.

MPI (Message Passing Interface)

  • Co-processor accelerated MPI processes can potentially make better use of on-board resources.
  • Assuming 50 cores and 8 GB per device, each MPI process on a MIC card will have roughly 160 MB for program and data storage. Note: more cores implies less data per core.
  • MPI processes on MIC cores must be particularly frugal in memory usage because each MPI process requires a separate copy of all data.
  • MIC-as-a-compute-node may exhibit Amdahl's Law sequential bottlenecks due to the low clock rate of its Pentium-derived cores compared to modern CPU cores.

Directive-based programming (OpenMP and OpenACC)

  • OpenACC has the potential to run legacy OpenMP applications on co-processors from any vendor with minimal modification.
  • Legacy code can potentially "compile and run" on MIC-as-a-compute-node, assuming both program and data fit in memory.
  • High core counts may expose surprising serialization bottlenecks. For example, reference counted objects such as smart pointers in C++ may cause serialization bottlenecks on atomic operations.
  • Applications should be "cache friendly" to avoid memory bandwidth bottlenecks. The effectiveness and messaging overhead of the MIC cache coherency model at high core counts is currently unknown.
  • MIC-as-a-compute-node may exhibit Amdahl's Law sequential bottlenecks due to the low clock rate of its Pentium-derived cores compared to modern CPU cores.
  • Beware the PCIe bottleneck.

Common libraries providing FFT and BLAS functionality

  • Optimized libraries should run well and reflect the co-processor hardware performance capabilities.
  • Subject to memory capacity and bandwidth limitations of the PCIe bus.

Accelerator execution models such as OpenCL

  • Applications based on a strong-scaling execution model, such as those written in OpenCL (and potentially CUDA), have the potential to run well.

Table 4: Summary of programming model considerations for legacy codes

The rapidity with which the industry is moving to support legacy programming is reflected in NVIDIA's directives-based developer effort at SC11, which delivered 5x to 20x speedups for several legacy applications in two days or less. Intel's HPC General Manager Rajeeb Hazra expresses a similar view about MIC compiler technology: "It eliminates code porting to a certain extent," redefining the effort so that, "It just makes it an optimization job." As always, caveat emptor applies regardless of the technology used.

Hands-on comments by TACC about programming MIC as of March 1, 2012 can be found in the "Oil and Gas High Performance Computing Workshop" video.

