Nvidia has released CUDA 6, a new version of the CUDA Toolkit said to include some of the most significant new functionality in its history.
CUDA is a parallel computing platform and programming model that strives to increase computing performance by harnessing the power of the graphics processing unit (GPU).
The most important new features of CUDA 6 are support for Unified Memory; CUDA on the Tegra K1 mobile/embedded system-on-a-chip; XT and Drop-In library interfaces; remote development in NSight Eclipse Edition; and what are more generally classed as "many improvements" to the CUDA developer tools.
On Unified Memory and how it might be used, the company notes:
"In a typical PC or cluster node today, the memories of the CPU and GPU are physically distinct and separated by the PCI-Express bus. Before CUDA 6, that is exactly how the programmer has to view things. Data shared between the CPU and GPU must be allocated in both memories, and explicitly copied between them by the program. This can add a lot of complexity to CUDA programs."
Now, though, Unified Memory aims to simplify GPU memory management by providing a unified pool of memory accessible to code running on either the CPU or the GPU.
Unified Memory creates a pool of managed memory shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU using a single pointer. The key is that the system automatically migrates data allocated in Unified Memory between host and device so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU.
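To make the single-pointer idea concrete, here is a minimal sketch assuming a CUDA 6 or later toolkit, a recent nvcc, and a GPU that supports managed memory. The kernel `saxpy` and the helper `saxpy_managed` are illustrative names, not taken from Nvidia's material. The same pointers are written by the CPU, updated by the GPU, and read back by the CPU, with no explicit `cudaMemcpy` calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// GPU kernel: y[i] = a * x[i] + y[i]
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical helper: runs saxpy on managed memory and checks the result.
// Returns true only if every CUDA call succeeded and all values match.
bool saxpy_managed(int n)
{
    float *x = nullptr, *y = nullptr;

    // One allocation per array, visible to both CPU and GPU.
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMallocManaged(&y, n * sizeof(float)) != cudaSuccess) {
        cudaFree(x);
        return false;
    }

    // Initialize on the CPU through the same pointers the GPU will use.
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 3.0f, x, y);

    // The CPU must not touch managed data until the kernel has finished.
    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (cudaDeviceSynchronize() == cudaSuccess);

    // Read the result on the CPU: 3 * 1 + 2 = 5 for every element.
    for (int i = 0; ok && i < n; ++i) ok = (y[i] == 5.0f);

    cudaFree(x);
    cudaFree(y);
    return ok;
}

int main()
{
    bool ok = saxpy_managed(1 << 20);
    puts(ok ? "managed-memory saxpy: OK" : "managed-memory saxpy: FAILED");
    return ok ? 0 : 1;
}
```

Without Unified Memory, the same program would need separate host and device allocations plus copies in each direction around the kernel launch. The call to `cudaDeviceSynchronize()` still matters here, since it is what makes it safe for the CPU to read `y` again.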
Nvidia says that parallel computing on every one of its GPUs has been a goal since the first release of CUDA. CUDA 6 and the new Tegra K1 system-on-a-chip (SoC) finally enable "CUDA Everywhere", meaning CUDA capability across the whole range, from the smallest mobile processor to the most powerful Tesla K40 accelerator.
CUDA 6 introduces XT Library interfaces, which provide automatic scaling of cuBLAS Level 3 and 2D/3D cuFFT routines to two or more GPUs. This means that if you have one or more dual-GPU accelerator cards in your workstation or cluster node, you can automatically take advantage of them for intensive FFTs and matrix-matrix multiplication.
cuBLAS XT also enables multiplication of matrices that are too large to fit in the memory of a single GPU, because it operates directly on matrices allocated in CPU memory, tiling the matrix and overlapping computation with memory transfers.
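As a rough illustration of operating directly on host-resident matrices, the sketch below calls `cublasXtSgemm` on ordinary `std::vector` storage. It assumes the cuBLAS XT header `cublasXt.h` is available and the program is linked against cuBLAS (for example with `-lcublas`). The helper name `xt_sgemm_host`, the matrix size, and the choice of device 0 alone are illustrative assumptions. On a multi-GPU machine, more device IDs could be listed in `devices`.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cublasXt.h>

// Hypothetical helper: multiplies two n x n matrices held in CPU memory
// with cuBLAS XT and checks the result. Returns true on success.
bool xt_sgemm_host(size_t n)
{
    // Column-major matrices in plain host memory; no cudaMalloc or cudaMemcpy.
    std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);

    cublasXtHandle_t handle;
    if (cublasXtCreate(&handle) != CUBLAS_STATUS_SUCCESS) return false;

    // Assumption: use only device 0. Add more IDs to spread work over GPUs.
    int devices[1] = {0};
    bool ok = cublasXtDeviceSelect(handle, 1, devices) == CUBLAS_STATUS_SUCCESS;

    // C = alpha * A * B + beta * C; the library tiles the matrices and
    // handles the transfers to and from the selected GPUs.
    const float alpha = 1.0f, beta = 0.0f;
    if (ok) {
        ok = cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                           n, n, n,
                           &alpha, A.data(), n,
                           B.data(), n,
                           &beta, C.data(), n) == CUBLAS_STATUS_SUCCESS;
    }

    cublasXtDestroy(handle);

    // Every entry is a sum of n products of 1 * 2, so it should equal 2n.
    const float expected = 2.0f * static_cast<float>(n);
    for (size_t i = 0; ok && i < n * n; ++i)
        ok = std::fabs(C[i] - expected) <= 1e-3f * expected;

    return ok;
}

int main()
{
    bool ok = xt_sgemm_host(512);
    puts(ok ? "cuBLAS XT sgemm on host matrices: OK"
            : "cuBLAS XT sgemm on host matrices: FAILED");
    return ok ? 0 : 1;
}
```

The caller never allocates device memory, which is why the matrices can, in principle, be larger than a single GPU's memory. The tile size and device list are tuning choices left at their defaults in this sketch.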
According to the Nvidia developer blog, "A common use case for GPU developers is to develop HPC software that runs on a remote server or cluster, or on an embedded system such as Jetson TK1. The NSight Eclipse Edition Integrated Development Environment now supports a complete remote development workflow: edit source code in the IDE running on your local PC (e.g., a laptop), then build, run, debug, and profile the application remotely on a server with a CUDA-capable GPU. NSight takes care of syncing the source code to the remote machine, and you can use all the CUDA-aware debugging and profiling features of NSight in your local IDE."