The Code Migration Conundrum
While teraflop/s performance is compelling, there is no guarantee that any of these devices will deliver high performance (or even a performance benefit) for any given application. This uncertainty, coupled with the risk and cost of a porting effort, has kept many customers with legacy code from investing in this new technology.
Chip manufacturers, and the industry as a whole, have invested heavily in several programming models to make porting efforts as fast and risk-free as possible. As mentioned, NVIDIA's investment in CUDA has been very successful. Not surprisingly, well-established legacy programming models have also attracted much attention. For comparison purposes, this article will focus on four types of programming models that are supported by the industry for massively-parallel co-processors:
- MPI (Message Passing Interface)
- Directive-based programming like OpenMP and OpenACC
- Common libraries providing FFT and BLAS functionality
- Language platforms based on a strong-scaling execution model (CUDA and OpenCL)
The current packaging of GPU and MIC massively-parallel chips as external PCIe devices complicates each of these programming models. For example, the overhead incurred by host/device data transfers breaks an assumption made by the SMP execution model that any thread can access any data in a shared memory system without paying a significant performance penalty. Efforts like OpenACC (and potentially OpenMP 4.0) are attracting attention because they provide a standard method to specify data locality. The hope is that minimal code changes will be required to modify legacy code to run on co-processors.
Limited on-board memory also requires partitioning computational problems into pieces that can fit into device memory. At this time, a human programmer must partition larger computational problems into smaller pieces that can run on a co-processor, and must achieve high performance by efficiently overlapping computation and communication. Hybrid memory cubes hold hope for large-memory co-processors in the future, but it is unclear whether the next-generation NVIDIA Kepler or Intel MIC cards will use this technology.
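To make the partitioning and overlap argument concrete, here is a minimal sketch in plain Python that models the idea rather than driving real hardware. The function names `partition` and `pipeline_time`, and the simple timing model (host-to-device copies only, one copy engine, constant bandwidth and processing rate, no result transfer back to the host), are assumptions made for illustration and do not come from any vendor API.

```python
def partition(total_bytes, device_bytes, buffers=2):
    """Split a problem into (offset, size) chunks that fit in device memory.

    Hypothetical helper: with `buffers` chunks resident at once
    (2 = double buffering), each chunk gets device_bytes // buffers.
    """
    if total_bytes < 0 or device_bytes <= 0 or buffers <= 0:
        raise ValueError("sizes and buffer count must be positive")
    chunk = device_bytes // buffers
    if chunk == 0:
        raise ValueError("device memory too small for this many buffers")
    return [(off, min(chunk, total_bytes - off))
            for off in range(0, total_bytes, chunk)]


def pipeline_time(chunks, bandwidth, rate, buffers=2):
    """Estimate the seconds needed to copy and process every chunk.

    bandwidth: host-to-device transfer rate in bytes/s (assumed constant)
    rate:      bytes the device processes per second (assumed constant)
    buffers:   device buffers available; 1 means no overlap is possible
    """
    copy_done = 0.0      # time at which the most recent copy finished
    compute_done = []    # finish time of each chunk's computation
    for i, (_, size) in enumerate(chunks):
        # A buffer can be reused only after the chunk that previously
        # occupied it has been processed.
        buffer_free = compute_done[i - buffers] if i >= buffers else 0.0
        copy_done = max(copy_done, buffer_free) + size / bandwidth
        # Computation starts once the data has arrived and the device is idle.
        device_idle = compute_done[-1] if compute_done else 0.0
        compute_done.append(max(copy_done, device_idle) + size / rate)
    return compute_done[-1] if compute_done else 0.0


# Toy numbers: 10 bytes of work, 8 bytes of device memory, 2 B/s bus, 1 B/s device.
chunks = partition(10, 8)                        # [(0, 4), (4, 4), (8, 2)]
print(pipeline_time(chunks, 2, 1, buffers=1))    # 15.0, copies and compute serialized
print(pipeline_time(chunks, 2, 1, buffers=2))    # 12.0, copies hidden behind compute
```

Even in this toy model, the second buffer hides all but the first transfer behind computation. Choosing the chunk size and buffer count is exactly the decision that currently falls to the programmer.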
Succinctly, achieving performance with co-processors generally requires the programmer to:
- Transfer the data across the PCIe bus onto the device and keep it there;
- Give the device enough work to do;
- Focus on data reuse within the co-processor(s) to avoid memory bandwidth bottlenecks.
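These three rules can be checked with a back-of-the-envelope estimate. The sketch below is a deliberately crude model: the function name `offload_speedup` is hypothetical, the hardware figures are round illustrative numbers rather than measurements of any particular card, and the host is treated as purely compute-limited, which flatters the co-processor for memory-bound kernels.

```python
def offload_speedup(flops, bytes_moved, host_flops, device_flops,
                    pcie_bw, passes=1):
    """Estimated speedup from offloading; values below 1 mean a slowdown.

    flops:       floating-point operations per pass over the data
    bytes_moved: bytes crossing the PCIe bus (moved once, then kept resident)
    passes:      number of times the resident data is reused on the device
    """
    host_time = passes * flops / host_flops
    device_time = bytes_moved / pcie_bw + passes * flops / device_flops
    return host_time / device_time


# Illustrative, assumed hardware figures (not vendor specifications).
HOST = 100e9     # 100 GFLOP/s host
DEVICE = 1e12    # 1 TFLOP/s co-processor
PCIE = 8e9       # 8 GB/s across the bus

n = 1_000_000
# Single-precision vector add: n flops, three n-element arrays on the bus.
vec_once = offload_speedup(n, 12 * n, HOST, DEVICE, PCIE)
# Same data kept on the device and reused for 1000 passes.
vec_reused = offload_speedup(n, 12 * n, HOST, DEVICE, PCIE, passes=1000)

m = 4096
# Single-precision matrix multiply: 2*m^3 flops, three m-by-m matrices moved.
matmul = offload_speedup(2 * m**3, 3 * m * m * 4, HOST, DEVICE, PCIE)

print(vec_once, vec_reused, matmul)
```

Under these assumptions, a single vector add is far slower when offloaded because the transfer dominates. The same data reused across many passes, or a compute-heavy matrix multiply, comes out well ahead.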
Bottom line: Semantic limitations, the cost and complexity of using data located in multiple memory spaces, and limited on-board memory capacity currently prevent the automatic translation of legacy code to both GPUs and MIC co-processors. Some porting effort is required.


