CUDA, Supercomputing for the Masses: Part 12

Demonstrating write-combining

The following program demonstrates the use of separate write-combined, mapped, pinned memory buffers to increment the elements of an array by one. This required changing incrementArrayOnHost and incrementArrayOnDevice to read from array a and write to array b. In this way, coherency issues are avoided and streaming performance should be achieved. The cudaHostAllocWriteCombined flag was also added to the cudaHostAlloc calls. We rely on the CUDA calls to the driver to issue the appropriate fence operation to ensure the writes become globally visible.

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda.h>

// define the problem and block size (illustrative values)
#define NUMBER_OF_ARRAY_ELEMENTS 100000
#define N_THREADS_PER_BLOCK 256

void incrementArrayOnHost(float *b, float *a, int N)
{
  int i;
  for (i=0; i < N; i++) b[i] = a[i]+1.f;
}

__global__ void incrementArrayOnDevice(float *b, float *a, int N)
{
  int idx = blockIdx.x*blockDim.x + threadIdx.x;
  if (idx < N) b[idx] = a[idx]+1.f;
}

// report the most recent CUDA error, if any, and exit
void checkCUDAError(const char *msg)
{
  cudaError_t err = cudaGetLastError();
  if( cudaSuccess != err) {
    fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
    exit(EXIT_FAILURE);
  }
}

int main(void)
{
  float *a_m, *b_m; // pointers to mapped host memory
  float *a_d, *b_d; // pointers to mapped device memory
  float *check_h;   // pointer to host memory used to check results
  int i, N = NUMBER_OF_ARRAY_ELEMENTS;
  size_t size = N*sizeof(float);
  cudaDeviceProp deviceProp;

#if CUDART_VERSION < 2020
#error "This CUDART version does not support mapped memory!\n"
#endif

  // Get properties and verify device 0 supports mapped memory
  cudaGetDeviceProperties(&deviceProp, 0);
  checkCUDAError("cudaGetDeviceProperties");

  if(!deviceProp.canMapHostMemory) {
    fprintf(stderr, "Device %d cannot map host memory!\n", 0);
    exit(EXIT_FAILURE);
  }

  // set the device flags for mapping host memory
  cudaSetDeviceFlags(cudaDeviceMapHost);
  checkCUDAError("cudaSetDeviceFlags");

  // allocate host mapped arrays
  int flags = cudaHostAllocMapped|cudaHostAllocWriteCombined;
  cudaHostAlloc((void **)&a_m, size, flags);
  cudaHostAlloc((void **)&b_m, size, flags);
  checkCUDAError("cudaHostAlloc");

  // Get the device pointers to memory mapped
  cudaHostGetDevicePointer((void **)&a_d, (void *)a_m, 0);
  cudaHostGetDevicePointer((void **)&b_d, (void *)b_m, 0);
  checkCUDAError("cudaHostGetDevicePointer");

  /* initialization of the mapped data. Since a_m is write-combined,
     it is not guaranteed to be initialized until a fence operation is
     called. In this case that should happen when the kernel is
     invoked on the GPU */
  for (i=0; i<N; i++) a_m[i] = (float)i;

  // do calculation on device:
  // Part 1 of 2. Compute execution configuration
  int blockSize = N_THREADS_PER_BLOCK;
  int nBlocks = N/blockSize + (N%blockSize > 0?1:0);

  // Part 2 of 2. Call incrementArrayOnDevice kernel
  incrementArrayOnDevice <<< nBlocks, blockSize >>> (b_d, a_d, N);
  checkCUDAError("kernel invocation");

  // Note the allocation and call to incrementArrayOnHost occurs
  // asynchronously to the GPU
  check_h = (float *)malloc(size);
  incrementArrayOnHost(check_h, a_m,N);

  // Make certain that all threads are idle before proceeding
  cudaThreadSynchronize();
  checkCUDAError("cudaThreadSynchronize");

  // cudaThreadSynchronize() should have caused an sfence
  // to be issued, which will guarantee that all writes are done

  // check results. Note: the updated array is in b_m, not b_d
  for (i=0; i<N; i++) assert(check_h[i] == b_m[i]);

  // cleanup
  free(check_h); // free host memory

  // free mapped memory (and device pointers)
  cudaFreeHost(a_m);
  cudaFreeHost(b_m);

  return 0;
}
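
The execution configuration in the listing rounds the block count up so that every array element gets a thread, and the idx < N test in the kernel keeps the surplus threads in the last block from touching memory past the end of the array. The host-only C sketch below illustrates that arithmetic without a GPU. The names blocks_for and emulate_increment are hypothetical helpers invented for this illustration; they are not part of CUDA or of the program above.

```c
#include <assert.h>

/* Number of blocks needed to cover n elements with block_size threads
   per block. This ceiling division gives the same result as
   N/blockSize + (N%blockSize > 0 ? 1 : 0) in the listing. */
int blocks_for(int n, int block_size)
{
    assert(n >= 0 && block_size > 0);
    return (n + block_size - 1) / block_size;
}

/* Host-side emulation of the launch (hypothetical helper): visit every
   (block, thread) pair the GPU would run and apply the same idx < n
   guard as the kernel. Returns how many threads had no element to
   process. */
int emulate_increment(float *b, const float *a, int n, int block_size)
{
    int idle = 0;
    int n_blocks = blocks_for(n, block_size);

    for (int block = 0; block < n_blocks; block++) {
        for (int thread = 0; thread < block_size; thread++) {
            int idx = block * block_size + thread; /* blockIdx.x*blockDim.x + threadIdx.x */
            if (idx < n)
                b[idx] = a[idx] + 1.f;
            else
                idle++;                            /* guarded out, as in the kernel */
        }
    }
    return idle;
}
```

For example, 100,000 elements with 256 threads per block need 391 blocks, which launches 100,096 threads; the guard idles the last 96.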


CUDA 2.2 changes the data movement paradigm by providing APIs for mapped, transparent data transfers between the host and GPU(s). These APIs also allow the CUDA programmer to make data sharing between the host and graphics processor(s) more efficient by exploiting asynchronous operation and full-duplex PCIe data transfers, by using write-combined memory, and by sharing pinned memory among multiple GPUs.

Personally, I have used these APIs as a convenience when porting existing scientific codes onto the GPU because mapped memory allows me to keep the host and device data synchronized while I incrementally move as much of the calculation onto the GPU as possible. This allows me to verify my results after each change to ensure nothing has broken, which can save real time and frustration when working with complex codes with many inter-dependencies. I also use these APIs to increase efficiency by exploiting asynchronous host and multiple GPU calculations, full-duplex PCIe transfers, and other nice features of the CUDA 2.2 release.
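
As a rough illustration of that verify-after-each-change habit, the host-only C sketch below compares a trusted host result against the values read back from a mapped buffer. The function first_mismatch and its tolerance parameter are hypothetical names chosen for this example, not part of the CUDA API; the idea is simply to report where the two results first diverge.

```c
#include <assert.h>
#include <stddef.h>

/* Compare a host reference array with a result array (for example, one
   read from mapped memory after the device has been synchronized).
   Returns the index of the first element that differs by more than
   tol, or -1 if all n elements agree. A NaN in either array is
   reported as a mismatch. */
long first_mismatch(const float *reference, const float *result,
                    size_t n, float tol)
{
    assert(tol >= 0.f);
    for (size_t i = 0; i < n; i++) {
        float diff = reference[i] - result[i];
        if (diff < 0.f) diff = -diff;
        if (!(diff <= tol))   /* written this way so NaN counts as a mismatch */
            return (long)i;
    }
    return -1;
}
```

Calling a check like this after each routine is moved to the device localizes a regression to the most recent change. A tolerance of zero demands bit-for-bit agreement, which suits the increment example; calculations that reorder floating-point operations on the GPU usually need a small nonzero tolerance.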

I also see the new CUDA 2.2 APIs facilitating the development of entirely new classes of applications ranging from operating systems to real-time systems.

One example is the RAID research performed by scientists at the University of Alabama and Sandia National Laboratories that transformed CUDA-enabled GPUs into high-performance RAID accelerators that can calculate Reed-Solomon codes in real time for high-throughput disk subsystems (see "Accelerating Reed-Solomon Coding in RAID Systems with GPUs," by Matthew Curry, Lee Ward, Tony Skjellum, and Ron Brightwell). From their abstract: "Performance results show that the GPU can outperform a modern CPU on this problem by an order of magnitude and also confirm that a GPU can be used to support a system with at least three parity disks with no performance penalty."
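
To give a feel for the arithmetic such an accelerator performs, here is a small CPU-side C sketch of two-parity coding over GF(2^8). It is not the GPU implementation from the paper. The field polynomial 0x11d, the generator 2, the flat disk-by-disk buffer layout, and the function names are all assumptions of this sketch, chosen to follow the convention commonly used for RAID-6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply two elements of GF(2^8), reducing by the polynomial
   x^8 + x^4 + x^3 + x^2 + 1 (0x11d), an assumption of this sketch. */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t product = 0;
    while (b) {
        if (b & 1) product ^= a;
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry) a ^= 0x1d;   /* reduce after the shift overflows 8 bits */
        b >>= 1;
    }
    return product;
}

/* data holds n_disks blocks of len bytes each, stored back to back.
   For every byte offset compute
     P = D0 ^ D1 ^ D2 ^ ...           (plain XOR parity)
     Q = D0 ^ 2*D1 ^ 4*D2 ^ ...       (weights are powers of 2 in GF(2^8)) */
void compute_pq(const uint8_t *data, size_t n_disks, size_t len,
                uint8_t *p, uint8_t *q)
{
    assert(n_disks > 0 && n_disks <= 255);
    for (size_t off = 0; off < len; off++) {
        uint8_t pv = 0, qv = 0, coeff = 1;
        for (size_t d = 0; d < n_disks; d++) {
            uint8_t byte = data[d * len + off];
            pv ^= byte;
            qv ^= gf_mul(coeff, byte);
            coeff = gf_mul(coeff, 2);
        }
        p[off] = pv;
        q[off] = qv;
    }
}

/* Rebuild one lost data block from P and the surviving data blocks. */
void recover_from_p(const uint8_t *data, size_t n_disks, size_t lost,
                    size_t len, const uint8_t *p, uint8_t *out)
{
    assert(lost < n_disks);
    for (size_t off = 0; off < len; off++) {
        uint8_t v = p[off];
        for (size_t d = 0; d < n_disks; d++)
            if (d != lost) v ^= data[d * len + off];
        out[off] = v;
    }
}
```

Every byte offset is computed independently of the others, which is the kind of data parallelism that maps naturally onto GPU threads. Recovering from two simultaneous failures additionally requires Q and division in the field, which this sketch omits.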

My guess is we will see a CUDA-enhanced Linux md (multiple device or software RAID) driver sometime in the near future. Imagine the freedom of not being locked into a proprietary RAID controller. If something breaks, just connect your RAID array to another Linux box to access the data. If that computer does not have an NVIDIA GPU then just use the standard Linux software md driver to access the data.

Don't forget that CUDA-enabled devices can accelerate and run multiple applications at the same time. An upcoming article demonstrating how to incorporate graphics and CUDA will exploit that capability. Until then, try running a separate graphics application while running one of your CUDA applications. I think you will be surprised at how well both applications will perform.

Rob Farber is a senior scientist at Pacific Northwest National Laboratory. He has worked in massively parallel computing at several national laboratories and as co-founder of several startups. He can be reached at
