Steve Reinhardt is Vice President of Joint Research at Interactive Supercomputing. Contact him at firstname.lastname@example.org.
Organizations that depend heavily on high-performance computing (HPC) are salivating at the performance potential of GPUs (graphics processing units used for general-purpose computing), which often deliver 10 or even 100 times the performance per chip of mass-market x86 sockets, usually with less power consumption. However, the path to widespread realization of this performance contains considerable obstacles. Advanced software can overcome these obstacles, though establishing a language environment that is both stable and highly productive in the GPU context will require investments that few HPC-dependent organizations may be expecting.
The primary obstacle is a change to the "virtual machine" model of the underlying hardware. Successful HPC languages like Fortran, C/C++, and MATLAB's M are based on a straightforward virtual machine, which has survived drastic hardware changes -- pipelining in serial processors in the '60s, vector processing in the '70s and '80s, and caches in the '80s and '90s -- enabling hundreds of millions of lines of code representing many billions of dollars of investment to be reused (albeit sometimes obscenely inefficiently) on new systems, and fueling the diffusion of HPC use into new domains. The pioneers of the stream processing technology behind GPUs understood the inefficiencies in the use of memory bandwidth caused by mapping typical HPC programs to general-purpose microprocessors, so they designed better memory interfaces and execution structures that often provide 10X and sometimes 100X higher performance for well-suited code. Unfortunately, the current state of the art in GPU languages and compilers means that this new memory hierarchy is exposed to the programmer, who must explicitly arrange data appropriately to reap the potential higher speed.
For example, in most GPUs there is a level of memory that is closest to the processor and provides the highest performance for memory-intensive operations; programmers are highly motivated to use this memory for best performance. To achieve this in OpenCL, the emerging standard for GPU programming, a programmer would add the __private keyword to the declaration of a variable, and would need to ensure that the total space used by all private-memory variables does not exceed the capacity of the private memory. Recently the Portland Group released Version 9.0 of its compilers, implementing a new Accelerator Model, which seems a significant step in the right direction, apparently going beyond less-intrusive declaration (modest value) to automatic identification (high value). If innovations such as the Accelerator Model do not take hold and the identification of closest-memory variables continues to be manual, coping with this will cost an adopting organization both money (programmer labor) and time (for code conversion). Market-wide, the manual approach will not be practical: it will be too expensive, and there aren't enough sufficiently proficient programmers to make this type of change to all the codes needing higher performance.
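To make the manual bookkeeping described above concrete, here is a minimal sketch. The OpenCL C kernel is held as a string, the way host programs typically hand source to the OpenCL runtime; it is not compiled or run here. The kernel name `smooth`, the helper functions, and the capacity figure are all hypothetical, chosen only for illustration; a real private-memory limit depends on the device.

```python
# Sketch only: the kernel, helper names, and capacity figure are hypothetical.

# OpenCL C kernel source, kept as a string as a host program would pass it
# to the OpenCL runtime. It is NOT compiled or executed in this example.
KERNEL_SOURCE = """
__kernel void smooth(__global const float *in, __global float *out, int n)
{
    __private float window[3];   /* explicit private-memory declaration */
    __private int i = get_global_id(0);
    if (i > 0 && i < n - 1) {
        window[0] = in[i - 1];
        window[1] = in[i];
        window[2] = in[i + 1];
        out[i] = (window[0] + window[1] + window[2]) / 3.0f;
    }
}
"""

# Sizes in bytes of a few OpenCL C scalar types.
TYPE_SIZES = {"char": 1, "short": 2, "int": 4, "float": 4, "long": 8, "double": 8}

# Assumed per-work-item private-memory budget (an illustrative number,
# not a figure for any real device).
ASSUMED_PRIVATE_CAPACITY_BYTES = 1024


def private_bytes(declarations):
    """Sum the bytes used by (type_name, element_count) private declarations."""
    return sum(TYPE_SIZES[type_name] * count for type_name, count in declarations)


def fits_in_private_memory(declarations, capacity=ASSUMED_PRIVATE_CAPACITY_BYTES):
    """Return True if the declarations fit within the assumed capacity."""
    return private_bytes(declarations) <= capacity


# Hand-maintained list of the two variables declared in the kernel body:
# float window[3] and int i. The programmer must keep it in sync with the kernel.
smooth_decls = [("float", 3), ("int", 1)]

print(private_bytes(smooth_decls), fits_in_private_memory(smooth_decls))
```

The declaration list has to be maintained by hand and rechecked whenever the kernel changes, which is exactly the kind of recurring labor that automatic identification by the compiler would remove.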
At Interactive Supercomputing, we believe that the current GPU programming methods are not high-level enough for widespread use, and widespread use is a prerequisite for large organizations to make the investments needed to exploit the power of GPUs broadly. GPU advocates may counter that the astonishing GPU performance potential will be so attractive that developers will rewrite their codes in droves. This is possible, but history should give us pause before we conclude that it is certain.
As a prominent example, the Cray-1 vector processor offered nearly as much performance benefit over its competitors as today's GPUs; that is, often a factor of 10 and rarely a factor of 100. (The Cray-1's factor-of-two performance advantage for pure scalar code didn't hurt, either.) Like the GPUs, the Cray-1 achieved its advantage by innovations in the memory hierarchy that matched the streaming and data re-use character of HPC algorithms, but Cray blended this hardware change with a smart compiler that transparently (or nearly so, via directives) exploited this new hierarchy for high performance. Fortran, and later C, programmers were typically unaware of memory ports, vector registers, and chaining, which were critical elements of achieving high performance. Many contemporary observers believed that without such a compiler Cray Research would not have achieved the commercial success that it did. GPU advocates will fairly point out that the GPUs' much greater affordability (on the order of $500 compared to the order of $10M for early Cray systems) will support much more experimentation, but our point here is about the step beyond experimentation, to widespread adoption, for which usable established standards are essential. Looking more broadly, since the invention of Fortran 50 years ago (and C 40 years ago), there have been no hardware innovations that have successfully disrupted the memory model of these languages as the current GPU programming methods do, and that fact should make us consider other outcomes. (Careful readers may object that the move to distributed-memory parallelism via MPI has disrupted the memory model in exactly this way. While MPI forces the programmer to deal with distributed memory explicitly, it does so without changing the language in a way that prevents the resulting code from being compiled efficiently by a compiler unaware of distributed memory.)