### Annotating Traces with the NVTX Library

The NVTX library provides a powerful way to label sections of a computation, giving an easy-to-follow link back to what the code is actually doing. While simple to use, NVTX annotations can greatly help in understanding a Parallel Nsight trace, as will be shown in the following example, which uses concurrent processing on both a CPU core and the GPU to perform a calculation.

As can be seen in the listing below, cpuGPU.cu shares the work between a CPU core and the GPU by partitioning a computation on a vector between the two systems. Succinctly, this program:

- Allocates a vector `h_data` of size `DATASIZE` on the host and initializes it with random values.
- Reserves a portion of the vector for the GPU via the variable `gpuLen`. The vector `d_data` is then allocated on the device, and the corresponding values are transferred from `h_data` to `d_data` with `cudaMemcpy()`.
- Starts the asynchronous kernel `d_sillyMult()` on the GPU. This kernel scales the vector values in a silly manner by adding each vector element to itself `nScale` times. Concurrently, the host performs the same computation using the method `sillySum()`, which calculates the sum of the absolute values of its portion of `h_data` as defined by `hostLen`. Once `sillySum()` returns, the cuBLAS routine `cublasSasum()` is called to finish the calculation on the GPU by computing the sum of the absolute values of the scaled `d_data` elements. The `cublasSasum()` call requires the host to wait for a result from the GPU, thereby synchronizing the two subsystems.
- Adds the partial sums from CPU and GPU together and compares the result against a golden check that calculates the sum of the absolute values of the entire `h_data` vector scaled by `nScale` with a multiplication. Because floating-point arithmetic is approximate, the CPU/GPU result will differ slightly from the golden check, but the difference is small.

While obviously a contrived example, the value of `hostLen` can be chosen to balance the workload between the GPU and the CPU. As will be discussed below, Parallel Nsight is used to find the value of `hostLen` that matches the time taken to calculate the partial sum on the GPU (by calling `d_sillyMult()` and `cublasSasum()`) with the time required to run the `sillySum()` method on a single core of the slower host processor.

To make the tuning easier, the start of the GPU computational region was noted by pushing a character string describing that region onto the NVTX stack with `nvtxRangePushA(const char*)`. At the end of the region, the label is removed from the stack with `nvtxRangePop()`. As can be seen, the nvToolsExt.h header is included at the beginning of cpuGPU.cu. In addition, the executable was linked against the nvToolsExt32_0 library.

Similarly, the NVTX push and pop operations were used to mark the computational region for the host computation. Since CUDA kernel launches are asynchronous, the concurrent computation of the `sillySum()` function on the host is visible through the nested NVTX regions.
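The nesting pattern can be distilled to the following minimal sketch (an outline only; it assumes the kernel, data, and sizes are declared as in cpuGPU.cu). Because the kernel launch returns immediately, the CPU range is pushed and popped while the kernel is still running, so it appears nested inside the GPU range in the trace:

```cuda
nvtxRangePushA("GPU region");                        // outer range: GPU work
d_sillyMult<<<nBlocks, nThreads>>>(d_data, nScale, gpuLen); // returns at once

nvtxRangePushA("CPU region");                        // inner range: host work
partialSum[1] = sillySum(h_data, nScale, hostLen);   // runs while kernel runs
nvtxRangePop();                                      // close CPU region

partialSum[0] = cublasSasum(gpuLen, d_data, 1);      // blocks until GPU done
nvtxRangePop();                                      // close GPU region
```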

```cuda
#define DATASIZE 10000000
#include <nvToolsExt.h>
#include <math.h>
#include <cuda.h>
#include <cublas.h>
#include <cstdio>
#include <iostream>
#include <string>
using namespace std;

void initializeData(float* h_data, int datalen)
{
  nvtxRangePushA("Initialize Data");
  for(int i=0; i < datalen; i++)
    h_data[i] = 1e-6f * rand()/(float)RAND_MAX;
  nvtxRangePop();
}

float sillySum(float* h_data, int nScale, int datalen)
{
  float sum=0.f;
  for(int i=0; i < datalen; i++) {
    register float tmp = h_data[i];
    for(int j=1; j < nScale; j++) tmp += h_data[i];
    sum += fabs(tmp);
  }
  return(sum);
}

// Simple kernel that scales each element by repeated addition
__global__ void d_sillyMult(float* data, int nScale, int datalen)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if(tid < datalen) {
    register float tmp = data[tid];
    for(int j=1; j < nScale; j++) tmp += data[tid];
    data[tid] = tmp;
  }
}

void myMain()
{
  const int datalen = DATASIZE;
  const int nScale = 1000;
  float *h_data, *d_data;
  int hostLen = 200000;
  int gpuLen = datalen - hostLen;
  char gpuID[256];
  char cpuID[256];

  cout << "cpuLen " << hostLen << " gpuLen " << gpuLen << endl;
  sprintf(gpuID,"GPU %dk floats",gpuLen/1000);
  sprintf(cpuID,"CPU %dk floats",hostLen/1000);

  cublasInit();

  // allocate space on the host
  h_data = new float[datalen];
  initializeData(h_data, datalen);

  // allocate space on the device
  cudaMalloc((void **)&d_data, sizeof(float)*gpuLen);
  // transfer the data
  cudaMemcpy(d_data, h_data+hostLen, sizeof(float)*gpuLen,
             cudaMemcpyHostToDevice);

  int nThreads=256;
  int nBlocks = gpuLen/nThreads;
  nBlocks += ((gpuLen%nThreads)>0)?1:0;

  float partialSum[2];

  cerr << "pause to separate out the calculation" << endl; getchar();

  nvtxRangePushA(gpuID);
  d_sillyMult<<< nBlocks, nThreads >>>(d_data, nScale, gpuLen);
  nvtxRangePushA(cpuID);
  partialSum[1] = sillySum(h_data, nScale, hostLen);
  nvtxRangePop();
  partialSum[0] = cublasSasum(gpuLen, d_data, 1);
  nvtxRangePop();
  cudaThreadSynchronize();

  cerr << "pause to separate out the golden" << endl; getchar();

  float hybridSum = partialSum[0] + partialSum[1];

  nvtxRangePushA("Shortcut golden");
  float hostSum = 0.f;
  for(int i=0; i < datalen; i++) hostSum += fabs(nScale * h_data[i]);
  nvtxRangePop();

  cout << "hybrid Sum " << hybridSum << endl;
  cout << "host Sum " << hostSum << endl;
  cout << "difference " << fabs(hybridSum - hostSum) << endl;

  // free data
  cudaFree(d_data);
  delete [] h_data;
  cublasShutdown();
  getchar(); // to see the error values in Parallel Nsight
  cudaThreadSynchronize(); // ensure that Parallel Nsight sees the trace info
}
```

The application was traced using the Trace All option in Parallel Nsight. Note that there is a Tools Extension field in the trace below (the rows with the big blue blocks). Using a mouseover, we see that the "CPU 200k floats" computation of `sillySum()` on the host took 481,755 μs. The screenshot of the function call report shows that the `d_sillyMult()` kernel took 672,250 μs and the `cublasSasum()` call required negligible time.

The workload between the host and GPU was balanced by using the previous Nsight trace and report to adjust the value of `hostLen` so that the host and GPU took approximately equal times. For convenience, only very coarse adjustments were made; finer adjustments can potentially provide even greater performance. Even so, we see that the GPU can process 9,800k vector elements in roughly the same time that a single processor core can complete the work for 200k elements. This example makes two points:

- The GPU is roughly 49 times faster for this particular problem.
- Use of the NVTX library made it easy to understand the trace and identify the pertinent information on the display.