GPU innovator NVIDIA has announced its CUDA 4.0 toolkit for developers focused on building parallel applications that port to NVIDIA GPUs. With the company's own GPUDirect 2.0 technology shipping in this latest release, there is now peer-to-peer communication support across GPUs within a single server or workstation.
Ramping up its offering for multi-GPU application programming, NVIDIA is also highlighting its Unified Virtual Addressing (UVA) technology, which is intended to provide a single merged-memory address space for the main system memory and the GPU memories, again enabling so-called "quicker and easier" parallel programming. Finally, NVIDIA is currently highly vocal about its Thrust C++ Template Performance Primitives Libraries, a collection of open source C++ parallel algorithms and data structures designed to ease programming for C++ developers. With Thrust, NVIDIA claims that routines such as parallel sorting run 5X to 100X faster than with the Standard Template Library (STL) and Threading Building Blocks (TBB).
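For readers unfamiliar with the interface style Thrust follows, the sketch below shows the CPU-side STL baseline that the sorting comparison refers to. It is plain standard C++ and does not use Thrust itself; the helper name `sorted_copy` is hypothetical. The Thrust counterparts mentioned in the comments (`thrust::device_vector` and `thrust::sort`) are real library names, but they are not compiled or measured here, so the sketch says nothing about the claimed speedups.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// CPU baseline: sort a copy of the input with the STL.
// With Thrust, the data would typically live in a thrust::device_vector<int>
// and be sorted with thrust::sort(v.begin(), v.end()); that path needs the
// CUDA toolkit and is deliberately left out of this sketch.
std::vector<int> sorted_copy(std::vector<int> values) {
    std::sort(values.begin(), values.end());
    // Sanity check that the range really is in ascending order.
    assert(std::is_sorted(values.begin(), values.end()));
    return values;
}
```

Calling `sorted_copy({5, 3, 9, 1})` returns the values in ascending order. The iterator-pair call shape is the part that carries over to Thrust, which is why developers who already know the STL are said to face a lower barrier to entry.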
"Having access to GPU computing through the standard template interface greatly increases productivity for a wide range of tasks, from simple cashflow generation to complex computations," said Peter Decrem, director of rates products at market valuation and risk management company Quantifi. "The Thrust C++ library has lowered the barrier of entry significantly by taking care of low-level functionality like memory access and allocation, allowing the financial engineer to focus on algorithm development in a GPU-enhanced environment."
The CUDA 4.0 release also includes MPI integration with CUDA applications: modified implementations of the Message Passing Interface (MPI) standard automatically move data to and from GPU memory over InfiniBand when an application initiates an MPI send or receive call.
Other additions include:
- Multithread Sharing of GPUs -- Multiple CPU host threads can share contexts on a single GPU, making it easier for multithreaded applications to share a single GPU.
- Multi-GPU Sharing by Single CPU Thread -- A single CPU host thread can access all GPUs in a system. Developers can easily coordinate work across multiple GPUs for tasks such as "halo" exchange in applications.
- New NPP Image and Computer Vision Library -- A rich set of image transformation operations that enable rapid development of imaging and computer vision applications.
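The "halo" exchange mentioned in the list above can be sketched on the CPU alone. This is a simplified illustration, not CUDA code: the `Subdomain` type and `exchange_halos` function are hypothetical, and each subdomain is an ordinary `std::vector` standing in for the slice of a one-dimensional grid that one GPU would own. On real hardware, each assignment in `exchange_halos` would presumably be a device-to-device copy issued by the single host thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one GPU's share of a 1-D grid.
// Layout: [left halo | interior cells ... | right halo]
struct Subdomain {
    std::vector<double> cells;  // interior cells plus one halo cell per side

    Subdomain(std::size_t interior, double fill)
        : cells(interior + 2, fill) {
        assert(interior >= 1);  // need at least one real cell to share
    }

    double& left_halo()  { return cells.front(); }
    double& right_halo() { return cells.back(); }
    double first_interior() const { return cells[1]; }
    double last_interior()  const { return cells[cells.size() - 2]; }
};

// Copy each neighbour's boundary cell into the other's halo cell, so that
// both sides can compute their next step without further communication.
void exchange_halos(Subdomain& left, Subdomain& right) {
    left.right_halo() = right.first_interior();
    right.left_halo() = left.last_interior();
}
```

After `exchange_halos(a, b)`, the right halo of `a` holds the first interior value of `b`, and the left halo of `b` holds the last interior value of `a`. Keeping the halo cells inside the same buffer as the interior is one common layout choice; others are possible.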
A release candidate of CUDA Toolkit 4.0 will be available free of charge beginning March 4, 2011.


