Annotating Traces with the NVTX Library
The NVTX library provides a very powerful way to label sections of the computation to provide an easy-to-follow link back to what the actual code is doing. While simple to use, annotating Parallel Nsight traces can greatly help in understanding what is going on as will be shown in the following example, which uses concurrent processing on both a CPU core and the GPU to perform a calculation.
As can be seen in the listing below, cpuGPU.cu shares the work between a CPU core and the GPU by partitioning a computation on a vector between the two systems. Succinctly, this program:
- Allocates a vector
h_dataof sizeDATASIZEon the host and initializes it with random values. - A portion of the vector is reserved for the GPU via the variable
gpuLen. The vectord_datais then allocated and random values are transferred fromh_datatod_datawithcudaMemcpy(). - The host then starts the asynchronous kernel
d_SillyMult()on the GPU. This kernel scales the vector values in a silly manner by adding each vector element to itselfnScaletimes. Similarly, the host performs the same computation using the methodSillySum(), which calculates the sum of the absolute values of its portion ofh_dataas defined byhostLen. OnceSillySum()returns, thecuBLASroutinecublasSasum()is called to finish the calculation on the GPU by computing the sum of the absolute value of the scaledd_dataelements. ThecublasSasum()method requires that the host system wait for a result from the GPU, thereby synchronizing the two sub-systems. - The partial sums from both CPU and GPU are then added together and compared against a golden check that calculates the sum of the absolute values of the entire
h_datavector scaled bynScalewith a multiplication. Because floating-point arithmetic is approximate, the results of the CPU/GPU computation will differ slightly from the golden check. However, this difference is small.
While obviously a contrived example, the value of hostLen can be chosen to balance the workload between the GPU and the CPU. As will be discussed below, Parallel Nsight is used to find the appropriate value of hostLen that can match the time taken to calculate the partial sum on the GPU (calculated by calling d_sillyMult() and cublasSasum()) with the time required to run the sillySum() method on a single core of the slower host processor.
To make the tuning easier, the start of the GPU computational region was noted by pushing a character string describing that region on the NVTX stack with nvtxRangePushA(char*). At the end of the region, the label is removed from the stack with nvtxRangePop(). As can be seen, the nvToolsExt.h header is included at the beginning of cpuGPU.cu. In addition, the nvToolsEx32_0 library was linked with the executable.
Similarly, the NVTX push and pop operations were used to mark the computational region for the host computation. Since CUDA kernel launches are asynchronous, it is possible to see the concurrent computation of the SillySum() function on the host through the nested NVTX regions.
#define DATASIZE 10000000
#include <nvToolsExt.h>
#include <math.h>
#include <cuda.h>
#include <cublas.h>
#include <cstdio>
#include <iostream>
#include <string>
using namespace std;
void initializeData(float* h_data, int datalen)
{
nvtxRangePushA("Initialize Data");
for(int i=0; i < datalen; i++) {
h_data[i] = 1e-6f * rand()/(float)RAND_MAX;
}
nvtxRangePop();
}
float sillySum(float* h_data, int nScale, int datalen)
{
float sum=0.f;
for(int i=0; i < datalen; i++) {
register float tmp = h_data[i];
for(int j=1; j< nScale; j++) tmp += h_data[i];
sum += fabs(tmp);
}
return(sum);
}
//Simple kernel fills an array with perlin noise
__global__ void d_sillyMult(float* data, int nScale, int datalen)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if(tid < datalen) {
register float tmp = data[tid];
for(int j=1; j< nScale; j++)
tmp += data[tid];
data[tid] = tmp;
}
}
void myMain()
{
const int datalen = DATASIZE;
const int nScale = 1000;
float *h_data, *d_data;
int hostLen = 200000;
int gpuLen=datalen - hostLen;
char gpuID[256];
char cpuID[256];
cout << "cpuLen " << hostLen << " gpuLen " << gpuLen << endl;
sprintf(gpuID,"GPU %dk floats",gpuLen/1000);
sprintf(cpuID,"CPU %dk floats",hostLen/1000);
cublasInit();
// allocate space on the host
h_data = new float[datalen];
initializeData(h_data, datalen);
// allocate space on the device
cudaMalloc((void **)&d_data, sizeof(float)*gpuLen);
// transfer the data
cudaMemcpy(d_data, h_data+hostLen, sizeof(float)*gpuLen,
cudaMemcpyHostToDevice);
int nThreads=256;
int nBlocks = gpuLen/nThreads;
nBlocks += ((gpuLen%nThreads)>0)?1:0;
float partialSum[2];
cerr << "pause to separate out the calculation" << endl;
getchar();
nvtxRangePushA(gpuID);
d_sillyMult<<< nBlocks, nThreads>>>(d_data, nScale, gpuLen);
nvtxRangePushA(cpuID);
partialSum[1] = sillySum(h_data, nScale, hostLen);
nvtxRangePop();
partialSum[0] = cublasSasum(gpuLen, d_data, 1);
nvtxRangePop();
cudaThreadSynchronize();
cerr << "pause to separate out the golden" << endl;
getchar();
float hybridSum = partialSum[0] + partialSum[1];
nvtxRangePushA("Shortcut golden");
float hostSum = 0.f;
for(int i=0; i < datalen; i++) hostSum += fabs(nScale * h_data[i]);
nvtxRangePop();
cout << "hybrid Sum " << partialSum[0] + partialSum[1] << endl;
cout << "host Sum " << hybridSum << endl;
cout << "difference " << fabs(hybridSum - hostSum) << endl;
// free data
cudaFree(d_data);
delete [] h_data;
cublasShutdown();
getchar(); // To see the error values in Parallel Nsight
cudaThreadSynchronize(); //ensure that Parallel Nsight sees the trace info
}
The application was traced using the Trace All option with Parallel Nsight. Note that there is a Tools Extension field in the trace below (e.g. the rows with the big blue blocks). Using a mouseover, we see that the CPU 200k floats computation of SillySum() on the host took 481,755 μs. The screenshot of the function call report, shows that the d_SillySum() kernel took 672,250 μs and the cublasSasum() method required negligible time.
The workload between the host and GPU was balanced by using the previous Nsight trace and report to adjust the size of hostLen so that approximately equal times were taken by the host and GPU. For convenience, very coarse adjustments were made. Finer adjusts can potentially provide even greater performance. Even so, we see that the GPU can process 9800k vector elements in the roughly same time that a single processor core can complete the work for 200k elements. This example makes two points:
- The GPU is roughly 49 times faster for this particular problem.
- Use of the NVTX library made it easy to understand the trace and identify the pertinent information on the display.


