Channels ▼


CUDA, Supercomputing for the Masses: Part 20

Annotating Traces with the NVTX Library

The NVTX library provides a very powerful way to label sections of the computation to provide an easy-to-follow link back to what the actual code is doing. While simple to use, annotating Parallel Nsight traces can greatly help in understanding what is going on as will be shown in the following example, which uses concurrent processing on both a CPU core and the GPU to perform a calculation.

As can be seen in the listing below, shares the work between a CPU core and the GPU by partitioning a computation on a vector between the two systems. Succinctly, this program:

  • Allocates a vector h_data of size DATASIZE on the host and initializes it with random values.
  • A portion of the vector is reserved for the GPU via the variable gpuLen. The vector d_data is then allocated and random values are transferred from h_data to d_data with cudaMemcpy().
  • The host then starts the asynchronous kernel d_SillyMult() on the GPU. This kernel scales the vector values in a silly manner by adding each vector element to itself nScale times. Similarly, the host performs the same computation using the method SillySum(), which calculates the sum of the absolute values of its portion of h_data as defined by hostLen. Once SillySum() returns, the cuBLAS routine cublasSasum() is called to finish the calculation on the GPU by computing the sum of the absolute value of the scaled d_data elements. The cublasSasum() method requires that the host system wait for a result from the GPU, thereby synchronizing the two sub-systems.
  • The partial sums from both CPU and GPU are then added together and compared against a golden check that calculates the sum of the absolute values of the entire h_data vector scaled by nScale with a multiplication. Because floating-point arithmetic is approximate, the results of the CPU/GPU computation will differ slightly from the golden check. However, this difference is small.

While obviously a contrived example, the value of hostLen can be chosen to balance the workload between the GPU and the CPU. As will be discussed below, Parallel Nsight is used to find the appropriate value of hostLen that can match the time taken to calculate the partial sum on the GPU (calculated by calling d_sillyMult() and cublasSasum()) with the time required to run the sillySum() method on a single core of the slower host processor.

To make the tuning easier, the start of the GPU computational region was noted by pushing a character string describing that region on the NVTX stack with nvtxRangePushA(char*). At the end of the region, the label is removed from the stack with nvtxRangePop(). As can be seen, the nvToolsExt.h header is included at the beginning of In addition, the nvToolsEx32_0 library was linked with the executable.

Similarly, the NVTX push and pop operations were used to mark the computational region for the host computation. Since CUDA kernel launches are asynchronous, it is possible to see the concurrent computation of the SillySum() function on the host through the nested NVTX regions.

#define DATASIZE 10000000
#include <nvToolsExt.h>

#include <math.h>
#include <cuda.h>
#include <cublas.h>
#include <cstdio>
#include <iostream>
#include <string>
using namespace std;

void initializeData(float* h_data, int datalen)
	nvtxRangePushA("Initialize Data");
	for(int i=0; i < datalen; i++) {
		h_data[i] = 1e-6f * rand()/(float)RAND_MAX;

float sillySum(float* h_data, int nScale, int datalen)
	float sum=0.f;
	for(int i=0; i < datalen; i++) {
		register float tmp = h_data[i];
		for(int j=1; j< nScale; j++) tmp += h_data[i];
		sum += fabs(tmp);

//Simple kernel fills an array with perlin noise
__global__ void d_sillyMult(float* data, int nScale, int datalen)
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if(tid < datalen) {
     register float tmp = data[tid];
     for(int j=1; j< nScale; j++) 
	  tmp += data[tid];
      data[tid] = tmp;

void myMain()
  const int datalen = DATASIZE;
  const int nScale = 1000;
  float *h_data, *d_data;
  int hostLen = 200000;
  int gpuLen=datalen - hostLen;
  char gpuID[256];
  char cpuID[256];

  cout << "cpuLen " << hostLen << " gpuLen " << gpuLen << endl;
  sprintf(gpuID,"GPU %dk floats",gpuLen/1000);
  sprintf(cpuID,"CPU %dk floats",hostLen/1000);


  // allocate space on the host
  h_data = new float[datalen];
  initializeData(h_data, datalen);

  // allocate space on the device
  cudaMalloc((void **)&d_data, sizeof(float)*gpuLen);

  // transfer the data
  cudaMemcpy(d_data, h_data+hostLen, sizeof(float)*gpuLen,

  int nThreads=256;
  int nBlocks = gpuLen/nThreads; 
  nBlocks += ((gpuLen%nThreads)>0)?1:0;

  float partialSum[2];
  cerr << "pause to separate out the calculation" << endl;

  d_sillyMult<<< nBlocks, nThreads>>>(d_data, nScale, gpuLen);

  partialSum[1] = sillySum(h_data, nScale, hostLen);

  partialSum[0] = cublasSasum(gpuLen, d_data, 1);

  cerr << "pause to separate out the golden" << endl;

  float hybridSum = partialSum[0] + partialSum[1];

  nvtxRangePushA("Shortcut golden");
  float hostSum = 0.f;
  for(int i=0; i < datalen; i++) hostSum += fabs(nScale * h_data[i]);

  cout << "hybrid Sum " << partialSum[0] + partialSum[1] << endl;
  cout << "host   Sum " << hybridSum << endl;
  cout << "difference " << fabs(hybridSum - hostSum) << endl;
  // free data
  delete [] h_data;

  getchar(); // To see the error values in Parallel Nsight
  cudaThreadSynchronize(); //ensure that Parallel Nsight sees the trace info

The application was traced using the Trace All option with Parallel Nsight. Note that there is a Tools Extension field in the trace below (e.g. the rows with the big blue blocks). Using a mouseover, we see that the CPU 200k floats computation of SillySum() on the host took 481,755 μs. The screenshot of the function call report, shows that the d_SillySum() kernel took 672,250 μs and the cublasSasum() method required negligible time.

[Click image to view at full size]

[Click image to view at full size]

The workload between the host and GPU was balanced by using the previous Nsight trace and report to adjust the size of hostLen so that approximately equal times were taken by the host and GPU. For convenience, very coarse adjustments were made. Finer adjusts can potentially provide even greater performance. Even so, we see that the GPU can process 9800k vector elements in the roughly same time that a single processor core can complete the work for 200k elements. This example makes two points:

  • The GPU is roughly 49 times faster for this particular problem.
  • Use of the NVTX library made it easy to understand the trace and identify the pertinent information on the display.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.