Programming Model Considerations for GPUs and MIC co-processors
When evaluating co-processors for a legacy application, it is necessary to consider the characteristics of the programming model. Whenever possible, use benchmarks to evaluate the transfer and computational efficiency of co-processor applications that are similar to the intended application.
MPI has strong support because it is the de facto standard framework for distributed scientific computing. As mentioned above, developers must consider memory footprint and the sequential-code limits imposed by Amdahl's Law when porting MPI code to MIC. When porting to either MIC or GPU co-processors, some considerations are:
- Most implementations use one co-processor per MPI process.
- Network bandwidth tends to be a key bottleneck, making most MPI applications network-bound rather than compute-bound. Use whatever features are available (such as GPUDirect) to optimize data transfers.
- Use all the co-processors in a single MPI process when communications latency is an issue. This can be particularly useful for applications that perform many latency bound operations such as reductions.
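The one-co-processor-per-process convention in the first bullet can be sketched as a simple round-robin mapping. This is only an illustration: the function names are hypothetical, and a real application would obtain its node-local rank from the MPI launcher or library rather than pass it in by hand.

```python
def device_for_rank(local_rank: int, devices_per_node: int) -> int:
    """Pick a co-processor index for an MPI process (hypothetical helper).

    local_rank is the rank of the process within its node, assumed to be
    supplied by the MPI runtime. Ranks are assigned round-robin.
    """
    if devices_per_node <= 0:
        raise ValueError("a node needs at least one co-processor")
    return local_rank % devices_per_node


def ranks_per_device(ranks_per_node: int, devices_per_node: int) -> list[int]:
    """Count how many MPI processes end up sharing each co-processor."""
    counts = [0] * devices_per_node
    for local_rank in range(ranks_per_node):
        counts[device_for_rank(local_rank, devices_per_node)] += 1
    return counts


if __name__ == "__main__":
    # Four ranks on a four-device node: one co-processor per MPI process.
    print(ranks_per_device(4, 4))
    # Six ranks on the same node: two devices are shared by two ranks each.
    print(ranks_per_device(6, 4))
```

When more ranks than devices run on a node, some co-processors are shared, which is one reason the one-to-one layout is the common choice.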
While directives-based programming such as OpenMP has strong support in the developer community, directives for co-processors are a work in progress. Legacy OpenMP code will certainly require modification, because legacy applications have no concept of data location (they assume an SMP model). Something like "#pragma target (device)" is required. Standardization is moving quickly to prevent a "tower of Babel" proliferation of incompatible pragma specifications.
Note that many OpenMP code bases were developed when two, four, or eight cores were considered "many." High core count processors may expose scaling issues. In this regard, you should:
- Be aware that atomic and common synchronization operations (such as semaphores, reference counting, etc.) might expose unexpected serialization bottlenecks.
- Pay particular attention to scaling behavior on cache-coherent architectures like MIC and the impact of conditional operations on SIMD-based GPU architectures.
- Check where directives are placed. Many legacy OpenMP applications apply directives at the innermost loop level, which limits the achievable parallelism. Code modification may be required to expose more parallelism.
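A quick Amdahl's Law calculation shows why serialization that was harmless on eight cores becomes visible on a many-core device. The sketch below is illustrative only: the 95% parallel fraction and the core counts are assumed values, not measurements of any application or device.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's Law: speedup when only part of the runtime parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed: 5% of the runtime is serialized (for example by atomics
    # or locks). The core counts are illustrative.
    for cores in (2, 4, 8, 60):
        speedup = amdahl_speedup(0.95, cores)
        efficiency = speedup / cores
        print(f"{cores:3d} cores: speedup {speedup:5.2f}, "
              f"efficiency {efficiency:4.0%}")
```

With a fixed serial fraction, the speedup can never exceed the reciprocal of that fraction, so parallel efficiency falls steadily as cores are added.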
Common libraries providing FFT and BLAS functionality should perform well on co-processors because they are optimized for a particular architecture and set of hardware capabilities. (This assumes data transfers do not limit performance.)
Language platforms based on a strong-scaling execution model, such as CUDA and OpenCL, will likely perform well on both architectures because the runtime of parallel code can scale linearly with the number of processing elements, which is the best reduction in parallel code runtime that Amdahl's Law allows. Scaling behavior of the computational kernels should not be an issue unless global atomic operations are used.
In all programming approaches, high performance can be achieved when the compute-intensive portions of the application conform to the three rules of high-performance co-processor programming mentioned previously. If they do not, expect floating-point performance to be limited by either PCIe or device memory bandwidth.
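One way to estimate which limit applies is a roofline-style bound: attainable throughput is the smaller of peak compute and bandwidth multiplied by the work done per byte moved. This is a standard back-of-the-envelope model rather than a method taken from this article, and every number in the example is an assumed placeholder, not a vendor specification.

```python
def limiting_factor(peak_gflops: float,
                    pcie_gb_s: float,
                    device_mem_gb_s: float,
                    flops_per_pcie_byte: float,
                    flops_per_mem_byte: float) -> tuple[str, float]:
    """Return the tightest bound and the attainable GFLOP/s under it.

    flops_per_pcie_byte: floating-point operations performed per byte
        transferred across PCIe.
    flops_per_mem_byte: operations performed per byte read from or
        written to device memory.
    """
    bounds = {
        "compute": peak_gflops,
        "PCIe": pcie_gb_s * flops_per_pcie_byte,
        "device memory": device_mem_gb_s * flops_per_mem_byte,
    }
    name = min(bounds, key=bounds.get)
    return name, bounds[name]


if __name__ == "__main__":
    # Placeholder hardware figures, chosen only for illustration.
    peak, pcie, mem = 1000.0, 8.0, 150.0
    # Little work per transferred byte: the PCIe bus is the limit.
    print(limiting_factor(peak, pcie, mem, 10.0, 2.0))
    # Data stays resident on the device, but reuse from memory is low.
    print(limiting_factor(peak, pcie, mem, 1000.0, 2.0))
    # High reuse at both levels: the kernel can approach peak compute.
    print(limiting_factor(peak, pcie, mem, 1000.0, 100.0))
```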
In summary, expect to use MIC and GPUs as co-processors, and expect software to evolve rapidly to hide differences between co-processor hardware. Growing support for OpenACC by vendors like CAPS and PGI can make co-processors an attractive, highly portable option for legacy OpenMP codes because the source code intrusion is relatively small. Vendor libraries provided by NVIDIA and Intel already provide an optimized framework for some applications. Generally, applications written in OpenCL and CUDA will deliver the greatest performance and longevity because they use a strong-scaling execution model that can achieve a linear parallel code speedup regardless of the number of processing elements. In addition, these languages provide asynchronous queues that can choreograph tasks and data movement among numerous devices.
Approach | Programming Considerations for Legacy Codes
MPI (Message Passing Interface) |
|
|
|
|
|
Accelerator execution model like OpenCL |
|
Table 4: Summary table of various programming model comments
The speed with which the industry is moving to support legacy programming is reflected in NVIDIA's directives-based developer effort at SC11, which delivered 5x to 20x speedups for several legacy applications in two days or less. Intel's HPC General Manager Rajeeb Hazra expresses a similar view about MIC compiler technology: "It eliminates code porting to a certain extent," redefining the effort so that, "It just makes it an optimization job." As always, caveat emptor still applies regardless of the technology used.
Hands-on comments by TACC about programming MIC as of March 1, 2012 can be found in the "Oil and Gas High Performance Computing Workshop" video.


