Analysis of Asynchronous I/O
Although asynchronous I/O streams have not yet been covered in this tutorial series, we can use the NVIDIA GPU Computing SDK version 3.1 sample simpleMultiCopy to show how Parallel Nsight handles codes with complex asynchronous behavior.
The following steps were used to build this SDK example:
- Download the Windows version of the CUDA 3.1 SDK. It can be found here.
- Run the executable, which will install the SDK examples in C:\ProgramData\NVIDIA Corporation.
- Copy the SDK folder NVIDIA GPU Computing SDK to one of your folders.
- Change to the folder NVIDIA GPU Computing SDK\C\src\simpleMultiCopy.
- Double-click on the simpleMultiCopy_vc90.sln icon. The Visual Studio Conversion Wizard will appear to create a version of this solution that can be used with Parallel Nsight.
- Build the project.
- Don’t forget to set the Nsight User Properties | Connection Name to specify your remote machine! Note that the connection name may also be set on the Activity page itself.
- Analysis does not necessarily require a remote connection. The name localhost can be used for Connection Name so long as the monitor is installed on the local machine.
Now run the executable using the analyzer. From the File menu on the top toolbar, select New | File … and then any of the options in the NVIDIA selection, as shown in the screen below:
Once an option is selected (in this case Trace All), the following screen appears:
Scrolling down, it is clear there is a wealth of options from which to choose.
Click on the Launch button. When the program pauses at the end, press Enter on the target machine keyboard to terminate the simpleMultiCopy application (or click the Kill button). Once Parallel Nsight has retrieved the trace from the target machine, a summary report will appear. The Capture Control icon will change from red, to yellow (indicating data is being transferred), to green.
As can be seen in the Trace timeline below, Parallel Nsight provides a tremendous amount of information that is easily accessible via mouseover, zooming, and various filtering operations. Given the volume of information in these traces, it is essential to know how to select a region of the timeline. Click the mouse at the desired starting time, and a vertical line will appear on the screen. Then press the Shift key and, with the mouse button held down, drag to the end of the region of interest. This produces a grey overlay as shown below. A nice feature is that the time interval for the region is calculated and displayed.
We see that it took a short while, about 0.02884 seconds, for the asynchronous transfers to get started, and a somewhat longer interval for all four streams to really start moving data. Clicking within the grey region zooms the display to show just the selected time interval, which makes it very convenient to select and zoom into intervals in the timeline. Other useful controls (consistent with typical timeline interfaces in CAD and audio software) are:
- Ctrl + Mousewheel: smoothly zoom into or out of the timeline.
- Ctrl + Drag: pan around in the timeline.
General workflow tips when using Parallel Nsight: the Application or System Trace options can be used to determine if the application is CPU bound, memory bound, or kernel bound. This can be done by looking at the Timeline.
- CPU bound. There will be large areas where no kernel or memory copy is occurring but the application threads (Thread State) are green.
- PCIe transfer limited. Kernel execution is blocked while waiting on memory transfers to or from the device. This can be seen by looking at the Memory row. If much time is being spent doing memory copies then consider using the Streams API to pipeline the application, which can overlap memory transfers and kernels. Before changing code, compare the duration of the transfers and kernels to ensure a performance gain will be realized.
- Kernel bound. If the majority of the application time is spent waiting on kernels to complete, switch to the "Profile CUDA" activity and re-run the application to collect information from the hardware counters. This can help guide how to optimize kernel performance.
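The three cases above, and the advice to compare transfer and kernel durations before changing code, can be sketched as a back-of-the-envelope calculation. The Python below is only an illustration: the names Chunk, classify, serial_time, and pipelined_time are hypothetical and are not part of Parallel Nsight or CUDA, and the durations are made-up numbers standing in for values read off the timeline. The pipelined estimate assumes an idealized device in which host-to-device copies, kernels, and device-to-host copies can each proceed independently, which is not true of every GPU, so treat the result as an upper bound on the gain.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Durations (seconds) for one unit of work, as read off the timeline."""
    h2d: float     # host-to-device copy
    kernel: float  # kernel execution
    d2h: float     # device-to-host copy


def classify(total, copy_busy, kernel_busy):
    """Name the largest share of the run: GPU idle, copies, or kernels.

    Assumes copies and kernels do not overlap in the measured run, so the
    remainder of `total` is time with no GPU activity (the CPU-bound case).
    """
    idle = total - copy_busy - kernel_busy
    shares = {
        "CPU bound": idle,
        "PCIe transfer limited": copy_busy,
        "kernel bound": kernel_busy,
    }
    return max(shares, key=shares.get)


def serial_time(chunks):
    """Total time when every copy and kernel runs one after another."""
    return sum(c.h2d + c.kernel + c.d2h for c in chunks)


def pipelined_time(chunks):
    """Idealized time when the three stages overlap across chunks.

    Each stage handles one chunk at a time, and a chunk cannot enter a
    stage until it has left the previous one.
    """
    h2d_done = kernel_done = d2h_done = 0.0
    for c in chunks:
        h2d_done += c.h2d
        kernel_done = max(kernel_done, h2d_done) + c.kernel
        d2h_done = max(d2h_done, kernel_done) + c.d2h
    return d2h_done


if __name__ == "__main__":
    # Made-up durations: four chunks, each dominated by the upload.
    chunks = [Chunk(h2d=2.0, kernel=1.0, d2h=1.0) for _ in range(4)]
    serial = serial_time(chunks)
    copies = sum(c.h2d + c.d2h for c in chunks)
    kernels = sum(c.kernel for c in chunks)
    print("diagnosis:", classify(serial, copies, kernels))
    print("serial:", serial, "pipelined (ideal):", pipelined_time(chunks))
```

With these sample numbers the run is transfer limited, and the idealized pipeline finishes in 10 time units instead of 16. If the estimate shows little difference for the durations measured in a real trace, restructuring the code around streams is unlikely to pay off.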
Zooming into a region of the timeline view allows Parallel Nsight to provide the names of the functions and methods as sufficient space becomes available in each region. This really helps the readability of the traces.