Graphics processing unit (GPU) specialist Nvidia has been vocal about its Kepler architecture-based Tesla K20 GPU. When the company unveiled the chip in May, it predicted that it would be the highest-performance processor the HPC industry had ever seen.
The firm's senior devtech engineer Peter Messmer has reported that recent performance tests on real-world scientific applications show that the forthcoming GPU surpasses expectations. He also says that he's thrilled about Kepler's new Hyper-Q feature, which is designed to increase performance for what the company describes as "thousands of legacy MPI applications", without requiring a major code rewrite.
NOTE: Message Passing Interface (MPI) is a standardized and portable message-passing system designed by a group of researchers from academia and industry to function on a wide variety of parallel computers. The standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in Fortran 77 or the C programming language. There are several well-tested and efficient implementations of MPI, including some that are free or in the public domain. These fostered the development of a parallel software industry and encouraged the development of portable and scalable large-scale parallel applications.
To illustrate the power of Hyper-Q, Messmer says that he picked CP2K, a popular MPI-based molecular simulation code that has traditionally been difficult to run well on GPUs. Hyper-Q maximizes GPU utilization for the CP2K application, resulting in more than double the performance compared to running the same code without it.
How Hyper-Q Works
Messmer writes, "A GPU consists of multiple CUDA cores grouped into streaming multiprocessors operating in parallel. A hardware unit called the CUDA Work Distributor (CWD) is responsible for assigning work to the individual multiprocessors."
"In the current Fermi architecture, the CWD has a single connection to the host CPU, and work from different MPI processes is merged into this single queue. This serialization could easily lead to false dependencies among work from different MPI processes, limiting the amount of work that can be executed concurrently on the GPU. This often results in an under-utilized GPU."
"Hyper-Q removes this limitation. As shown in the graphic, the new Kepler-based Tesla K20 GPU provides 32 work queues between the host and the GPU, enabling multiple MPI processes to run concurrently on the GPU. Each MPI process can be assigned to a different hardware work queue, maximizing GPU utilization and increasing overall performance."
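The difference between one merged queue and per-process queues can be illustrated with a small toy model. The sketch below is not how the CUDA Work Distributor is actually implemented; it assumes, purely for illustration, that every kernel takes one time step, occupies one GPU slot, and depends only on the previous kernel from the same process. The function name simulate and the slot count are hypothetical.

```python
from collections import deque


def simulate(queues, slots):
    """Count the time steps needed to drain a set of in-order hardware queues.

    queues: list of lists of (process_id, kernel_index), in submission order.
    Toy assumptions: each kernel runs for one step, uses one slot, and may
    start only after the previous kernel of the same process has finished.
    """
    pending = [deque(q) for q in queues]
    finished = set()
    steps = 0
    while any(pending):
        free = slots
        issued = []
        for q in pending:
            # Each queue issues strictly in order and stalls at a blocked head.
            while q and free > 0:
                proc, idx = q[0]
                if idx > 0 and (proc, idx - 1) not in finished:
                    break  # the head waits, and so does everything behind it
                issued.append(q.popleft())
                free -= 1
        finished.update(issued)  # kernels issued this step complete at its end
        steps += 1
    return steps


# Four MPI processes, each submitting three dependent kernels.
per_process = [[(p, k) for k in range(3)] for p in range(4)]

# Fermi-like: all work merged into a single queue, one process after another.
single_queue = [[item for q in per_process for item in q]]

print(simulate(single_queue, slots=4))  # 9 steps
print(simulate(per_process, slots=4))   # 3 steps
```

In the merged queue, a kernel waiting on its own predecessor also holds back unrelated kernels from other processes, which is the false dependency Messmer describes. The step counts come from the toy assumptions only and say nothing about the measured CP2K speedup.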
The suggestion here is that while MPI developers will be thrilled with the added performance, they'll be equally enamored with how Hyper-Q makes porting legacy MPI codes to the GPU significantly easier.
Nvidia explains that legacy MPI-based codes were often created to run on multicore CPU systems, with the amount of work assigned to each MPI process scaled accordingly. However, this often meant that MPI processes didn't generate enough work to fully occupy the GPU. To make the code launch enough work to fully utilize the GPU, developers frequently were required to modify their code significantly.
Nvidia claims that Hyper-Q reduces this recoding effort considerably, because developers can now throw many MPI processes with small and medium-size workloads at a shared GPU.
"Developers no longer need to modify their codes to put enough work into a single MPI process. Rather, they can send up to 32 MPI processes with variable workloads to the GPU and just let the GPU do all the heavy lifting to maximize performance," said Nvidia's Messmer.
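The argument about under-filled GPUs can be put as a back-of-the-envelope estimate. The sketch below is a deliberately crude model and not an Nvidia formula: it assumes each MPI process, on its own, keeps a fixed fraction of the GPU busy, that processes on separate hardware queues overlap perfectly, and that a single queue serializes them completely (the worst case). The function name estimated_utilization and the example fraction are hypothetical; only the figure of 32 queues comes from the article.

```python
HYPER_Q_QUEUES = 32  # hardware work queues on the Tesla K20, per the article
SINGLE_QUEUE = 1     # worst-case view of one host connection


def estimated_utilization(per_process_fraction, processes, hardware_queues):
    """Rough fraction of the GPU kept busy, under the assumptions above."""
    # Only as many processes as there are queues can overlap.
    concurrent = min(processes, hardware_queues)
    # Utilization cannot exceed a fully occupied GPU.
    return min(1.0, per_process_fraction * concurrent)


# Example: each process alone fills one sixteenth of the GPU.
fraction = 1 / 16
print(estimated_utilization(fraction, 8, SINGLE_QUEUE))     # 0.0625
print(estimated_utilization(fraction, 8, HYPER_Q_QUEUES))   # 0.5
print(estimated_utilization(fraction, 32, HYPER_Q_QUEUES))  # 1.0
```

Under this model, adding MPI processes raises utilization only when there are enough queues for them to run side by side, which is why small per-process workloads no longer have to be rewritten into larger ones.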