As with the example in the previous article, the GPU Threads window provides valuable information that helps you understand the C++ AMP code being debugged. Click the Expand Thread Switcher button in the upper-left corner, and a new panel displays both the Tile and the Thread coordinates that are active in the debugger. In addition, the GPU Threads window always displays the valid coordinate ranges for both the Tile and the Thread. In this case, the valid range for the Tile coordinates is [0..63, 0..63], and the valid range for the Thread coordinates is [0..15, 0..15]. Figure 1 shows Tile[0, 0] Thread[0, 0] as the active thread, along with the coordinate ranges. You can also use the Parallel Watch window to freeze and thaw GPU threads, just as you would with CPU threads.
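The two coordinate ranges combine to identify the element each thread works on. The following is a minimal sketch in plain C++ (not C++ AMP code), assuming the 64 x 64 tiles of 16 x 16 threads reported above; the Coord type and the to_global helper are hypothetical names introduced only for illustration.

```cpp
#include <cassert>

// Assumed dimensions, read off the ranges reported by the GPU Threads window:
// Tile coordinates span [0..63, 0..63] and Thread coordinates span [0..15, 0..15].
const int kTilesPerDim = 64;
const int kThreadsPerTileDim = 16;

// Hypothetical helper type for illustration; not part of C++ AMP.
struct Coord {
    int row;
    int col;
};

// Returns true when both components lie in [0, limit).
inline bool in_range(Coord c, int limit) {
    return c.row >= 0 && c.row < limit && c.col >= 0 && c.col < limit;
}

// Combines a Tile coordinate and a Thread (local) coordinate into the
// global coordinate of the element that thread works on:
// global = tile * tile_size + local, per dimension.
inline Coord to_global(Coord tile, Coord thread) {
    assert(in_range(tile, kTilesPerDim));
    assert(in_range(thread, kThreadsPerTileDim));
    Coord global = {tile.row * kThreadsPerTileDim + thread.row,
                    tile.col * kThreadsPerTileDim + thread.col};
    return global;
}
```

Under these assumptions, Tile[0, 1] Thread[0, 0] maps to the global coordinate [0, 16], and Tile[63, 63] Thread[15, 15] maps to [1023, 1023].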
Evaluating Expressions for Each GPU Thread in the Parallel Watch Window
The Parallel Watch window allows you to simultaneously display the values that one expression holds on multiple GPU threads. You just need to click on the <Add Watch> column and enter the expression. For example, you can add the following expressions as columns in the Parallel Watch window:
This way, you can set a breakpoint at the line "tiled_idx.barrier.wait()" and continue the execution so that the debugger stops at this breakpoint several times. Each time, the watches display the values of the expressions in the different GPU threads (Figure 2).
Figure 2: The Parallel Watch 1 window displaying the values that each expression holds on the different GPU threads.
With the col values evaluated for each thread, you can easily identify which piece of data each thread is working on. You will notice that the code takes some time to execute because the GPU software emulator works with a warp size of four threads (and evaluating so many variables for 256 threads consumes CPU resources). However, each Visual Studio update and GPU device driver might bring new features, so it is always wise to click on the "Dump statistics to Output window" button at the top right of the GPU Threads window. In this case, the Output window displays the features of the GPU software emulator. Notice the information about the grid dimensions, group dimensions, active groups, completed groups, and not started groups.
GPU Device Created.
'cppamp.exe' (GPU Device): Loaded 'C:\Users\gaston\Documents\Visual Studio 2013\Projects\cppamp\Debug\cppamp.exe'. Symbols loaded.
Information for 'cppamp.cpp_line_57' kernel on device 'DirectX Reference Rasterizer' with warp size 4:
Grid Dimensions : 64x64x1
Group Dimensions : 16x16x1
Shared Memory Usage per Group : 2048 bytes
Register Usage per Thread : 2064 bytes
Active Groups : 1
Completed Groups : 0
Not Started Groups : 4095
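The group counts in the dump are consistent with the reported dimensions. The sketch below, in plain C++, reproduces the arithmetic, assuming (as the numbers suggest) that the grid dimensions are expressed in groups; the GroupCounts type and the function names are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Values copied from the statistics dump.
const int kGridX = 64, kGridY = 64, kGridZ = 1;     // Grid Dimensions (assumed to be in groups)
const int kGroupX = 16, kGroupY = 16, kGroupZ = 1;  // Group Dimensions (threads per group)

// Hypothetical record mirroring the three counters shown in the dump.
struct GroupCounts {
    int active;
    int completed;
    int not_started;
};

// Total number of groups (tiles) in the grid.
inline int total_groups() {
    return kGridX * kGridY * kGridZ;
}

// Number of GPU threads in each group.
inline int threads_per_group() {
    return kGroupX * kGroupY * kGroupZ;
}

// Derives the "Not Started Groups" value from the other two counters.
inline GroupCounts count_groups(int active, int completed) {
    const int total = total_groups();
    assert(active >= 0 && completed >= 0 && active + completed <= total);
    GroupCounts counts = {active, completed, total - active - completed};
    return counts;
}
```

With one active group and no completed groups, count_groups(1, 0) leaves 4095 of the 4096 groups not started, which matches the dump. Each additional completed group lowers that count by one while a single group remains active.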
Now, set a breakpoint at the line "int row = tiled_idx.local" and execute the application until the debugger stops at this breakpoint. Click on the "Dump statistics to Output window" button at the top right of the GPU Threads window. In this case, the Output window will indicate that one group has been completed and that the number of not started groups is 4094 (4095 - 1):
Active Groups : 1
Completed Groups : 1
Not Started Groups : 4094
As you can see from the information, the execution has moved to Tile[0, 1], so Tile[0, 0] (also known as the first group) has completed its execution. If you continue the execution and take a look at the Parallel Watch 1 window (Figure 3), you will see 256 threads listed for Tile[0, 1]. Analyzing the information that the Parallel Watch 1 window provides for 256 GPU threads within Visual Studio would be cumbersome. If you are a Microsoft Excel user, you can click on the Open in Excel button at the top of the Parallel Watch 1 window and use Excel to analyze the snapshot of all the evaluated expressions in each thread. If you aren't a Microsoft Excel user, you can click on the dropdown, select the Export to CSV option, and use your favorite application to analyze the exported data.
Figure 3: The Parallel Watch 1 window displaying the evaluated expressions for the 256 threads related to Tile[0, 1].
The code has two calls to the tiled_idx.barrier.wait() method that block the execution of all the threads in a tile until all the threads within that tile have reached the call. You can use the GPU Threads window to visualize the blocked and active threads. Figure 4 shows the GPU Threads window displaying 24 GPU threads that are blocked in the call to tiled_idx.barrier.wait(). You can click on the flag icon at the left-hand side of the Thread Count column for the blocked threads, and the Parallel Watch 1 window will flag the 24 blocked threads in the grid. This way, you can easily identify the 24 Thread coordinates that are blocked and switch to them by using the grid in the Parallel Watch 1 window.
Figure 4: The GPU Threads window displaying 24 GPU threads that are blocked in the call to tiled_idx.barrier.wait().
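To make the blocked-thread count easier to reason about, here is a minimal single-threaded model of the bookkeeping behind a tile barrier, written in plain C++. It is only a sketch: the TileBarrierModel class is hypothetical, it is not C++ AMP code, and it does not claim to reflect how the runtime or the emulator implements tiled_idx.barrier.wait(). It simply counts the threads that have reached the barrier and releases them all when the last thread of the tile arrives.

```cpp
#include <cassert>

// Hypothetical model of a tile barrier. Threads that have arrived stay
// blocked until every thread in the tile has reached the barrier.
class TileBarrierModel {
public:
    explicit TileBarrierModel(int threads_in_tile)
        : threads_in_tile_(threads_in_tile), blocked_(0), releases_(0) {
        assert(threads_in_tile > 0);
    }

    // Records one thread reaching the barrier.
    // Returns true when this arrival releases the whole tile.
    bool arrive() {
        ++blocked_;
        if (blocked_ == threads_in_tile_) {
            blocked_ = 0;   // every waiting thread resumes
            ++releases_;    // the barrier can be reused for the next wait
            return true;
        }
        return false;
    }

    // Threads currently waiting at the barrier.
    int blocked() const { return blocked_; }

    // Threads that have not reached the barrier yet.
    int not_yet_arrived() const { return threads_in_tile_ - blocked_; }

    // Number of times the barrier has released the tile.
    int releases() const { return releases_; }

private:
    int threads_in_tile_;
    int blocked_;
    int releases_;
};
```

In this model, after 24 of the 256 threads in a tile have called arrive(), blocked() reports 24 and the remaining 232 threads still have to reach the barrier before anyone proceeds, which mirrors the situation shown in the GPU Threads window.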
Visual Studio 2012 and 2013 added useful enhancements that make it possible to understand what happens in GPU kernels, even when they launch dozens and dozens of GPU threads. If you take advantage of the debugging features, you will be able to optimize your algorithms and resolve many defects.
Gaston Hillar is a frequent contributor to Dr. Dobb's.