In Part 1 of this article series, I presented a simple first CUDA (short for "Compute Unified Device Architecture") program called moveArrays.cu to familiarize you with the CUDA tools for building and executing programs. For C programmers, this program did nothing more than call the CUDA API to allocate memory and move data to and from the CUDA device. Nothing new was added that might cause confusion in learning how to use the tools to build and run a CUDA program.
This article builds on that first example by adding a few additional lines of code to perform a simple calculation on the CUDA device -- specifically incrementing each element in a floating-point array by 1. Amazingly, this example already provides the basic framework ("move data to CUDA-enabled device(s), perform a calculation and retrieve result") for solving many problems with CUDA!
Before tackling more advanced topics, you first need to understand:
- What is a kernel? A kernel is a function callable from the host and executed on the CUDA device -- simultaneously by many threads in parallel.
- How does the host call a kernel? This involves specifying the name of the kernel plus an execution configuration. For the purposes of this column, an execution configuration just means defining the number of parallel threads in a group and the number of groups to use when running the kernel for the CUDA device. This is actually an important topic that will be discussed in greater depth in future columns.
- How to synchronize kernels and host code.
At the top of the Listing One (incrementArrays.cu), we see an example host routine, incrementArrayOnHost
and our first kernel, incrementArraysOnDevice
.
The host function incrementArrayOnHost
is just a simple loop over the number of elements in an array to increment each array element by one. This function is used for comparison purposes at the end of this code to verify the kernel performed the correct calculation on the CUDA device.
Next in the Listing One is our first CUDA kernel, incrementArrayOnDevice
. CUDA provides several extensions to the C-language. The function type qualifier __global__
declares a function as being an executable kernel on the CUDA device, which can only be called from the host. All kernels must be declared with a return type of void
.
The kernel incrementArrayOnDevice
performs the same calculation as incrementArrayOnHost
. Looking within incrementArrayOnDevice
, you see that there is no loop! This is because the function is simultaneously executed by an array of threads on the CUDA device. However, each thread is provided with a unique ID that can be used to compute different array indicies or make control decisions (such as not doing anything if the thread index exceeds the array size). This makes incrementArrayOnDevice
as simple as calculating the unique ID in the register variable, idx
, which is then used to uniquely reference each element in the array and increment it by one. Since the number of threads can be larger than the size of the array, idx
is first checked against N
, an argument passed to the kernel that specifies the number of elements in the array, to see if any work needs to be done.
So how is the kernel called and the execution configuration specified? Well, control flows sequentially through the source code starting at main until the line right after the comment containing the statement Part 2 of 2
in Listing One.
// incrementArray.cu #include <stdio.h> #include <assert.h> #include <cuda.h> void incrementArrayOnHost(float *a, int N) { int i; for (i=0; i < N; i++) a[i] = a[i]+1.f; } __global__ void incrementArrayOnDevice(float *a, int N) { int idx = blockIdx.x*blockDim.x + threadIdx.x; if (idx<N) a[idx] = a[idx]+1.f; } int main(void) { float *a_h, *b_h; // pointers to host memory float *a_d; // pointer to device memory int i, N = 10; size_t size = N*sizeof(float); // allocate arrays on host a_h = (float *)malloc(size); b_h = (float *)malloc(size); // allocate array on device cudaMalloc((void **) &a_d, size); // initialization of host data for (i=0; i<N; i++) a_h[i] = (float)i; // copy data from host to device cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice); // do calculation on host incrementArrayOnHost(a_h, N); // do calculation on device: // Part 1 of 2. Compute execution configuration int blockSize = 4; int nBlocks = N/blockSize + (N%blockSize == 0?0:1); // Part 2 of 2. Call incrementArrayOnDevice kernel incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N); // Retrieve result from device and store in b_h cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost); // check results for (i=0; i<N; i++) assert(a_h[i] == b_h[i]); // cleanup free(a_h); free(b_h); cudaFree(a_d);
This queues the launch of incrementArrayOnDevice
on the CUDA-enabled device and illustrates another CUDA addition to the C-language, an asynchronous call to a CUDA kernel. The call specifies the name of the kernel and the execution configuration enclosed between triple angle brackets "<<<" and ">>>". Notice the two parameters that specify the execution configuration: nBlocks
and blockSize
, which will be discussed next. Any arguments to the kernel call are provided via a standard C-language argument list for a function delimited in the standard C-language fashion with "(" and ")". In this example, the pointer to the device global memory a_d
(which contains the array elements) and N
(the number of array elements) are passed to the kernel.