Channels ▼


CUDA, Supercomputing for the Masses: Part 3

Looking at the Source Code

Looking at the source code for, you notice that the structure of the program is very, very similar to the structure of from Part 2. An error routine, checkCUDAError is provided so the host can print out a human-readable message and exit when an error is reported by cudaGetLastError. As can be seen, checkCUDAError is judiciously utilized throughout the program to check for errors.

The program essentially creates a 1D array of integers, h_a, containing the integer values [0 .. dimA-1]. Array h_a is moved via cudaMemcpy to array d_a, which resides in global memory on the device. The host then launches the reverseArrayBlock kernel to copy the array contents in reverse order from d_a to d_b, which is another global memory array. Again, cudaMemcpy is used to transfer data -- this time from d_b to the host. A check is then performed on the host to verify that the device produced the correct result (e.g, [dimA-1 .. 0]).

// includes, system
#include <stdio.h>
#include <assert.h>

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);

// Part3: implement the kernel
__global__ void reverseArrayBlock(int *d_out, int *d_in)
    int inOffset  = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in  = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
// Program main
int main( int argc, char** argv) 
    // pointer for host memory and size
    int *h_a;
    int dimA = 256 * 1024; // 256K elements (1MB total)

    // pointer for device memory
    int *d_b, *d_a;

    // define grid and block size
    int numThreadsPerBlock = 256;

    // Part 1: compute number of blocks needed based on 
    // array size and desired block size
    int numBlocks = dimA / numThreadsPerBlock;  

    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void **) &d_a, memSize );
    cudaMalloc( (void **) &d_b, memSize );

    // Initialize input array on host
    for (int i = 0; i < dimA; ++i)
        h_a[i] = i;

    // Copy host array to device array
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlock<<< dimGrid, 
         dimBlock >>>( d_b, d_a );

    // block until the device has completed

    // check if kernel execution generated an error
    // Check for any CUDA errors
    checkCUDAError("kernel invocation");

    // device to host copy
    cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

    // Check for any CUDA errors

    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++)
        assert(h_a[i] == dimA - 1 - i );

    // free device memory

    // free host memory

    // If the program makes it this far, then the results are 
    // correct and there are no run-time errors.  Good work!

    return 0;
void checkCUDAError(const char *msg)
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err) 
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, 
                                  cudaGetErrorString( err) );

A key design feature of this program is that both arrays d_a and d_b reside in global memory on the device. The CUDA SDK provides an example program, bandwidthTest, which provides some information about the device characteristics. On my system, the global memory bandwidth is slightly over 60 GB/s. This is excellent until you consider that this bandwidth must service 128 hardware threads -- each of which can deliver a large number of floating-point operations. Since a 32-bit floating-point value occupies four (4) bytes, global memory bandwidth limited applications on this hardware will only be able to deliver around 15 GF/s -- or only a small percentage of the available performance capability. (This assumes the application only reads from global memory and does not write to it.) Obviously, higher performance applications must reuse data in some fashion. This is the function of shared and register memory and it is our job as programmers to gain the maximum benefit of these memory types. To gain a better understanding of machine balance as floating-point capability relates to memory bandwidth (and other machine characteristics), read my article HPC Balance and Common Sense.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.