OpenGL Rendering Methods Trace Analysis
The following Trace All trace was taken with SIMPLE_ONE_BY_ONE defined.
The Compute timeline shows that the k_perlin() CUDA kernel only takes 0.1% of the time, which indicates this rendering method is clearly not limited by the performance of the CUDA kernel. The thinness of the vertical line showing the time taken in k_perlin() relative to the other activities required to render this 3D artificial world visually illustrates the speed of the CUDA kernel. Also note that the Thread State timeline is solid red. A mouseover tells us that the thread is idle.
The Tools Extension Events report shows that initializing the mesh takes very little time, roughly 202 μs. The nesting of the methods, as recorded by the NVTX calls, is also clearly visible. Follow-on calls to this rendering method show that the triangle fan mesh initialization is correctly called only once.
The OpenGL API Call Summary, as seen below, shows that most of the time in the SIMPLE_ONE_BY_ONE rendering code is spent in glDrawElements().
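The nesting and the call-once behavior described above can be illustrated with a small sketch. It assumes the NVTX C API from nvToolsExt.h (nvtxRangePushA() and nvtxRangePop()); the USE_NVTX build flag, the ScopedRange helper, and the initMeshOnce() and renderFrame() functions are hypothetical names introduced here for illustration and are not taken from the article's source. Without USE_NVTX the sketch compiles with no NVTX dependency and only tracks nesting depth.

```cpp
#include <cassert>

#ifdef USE_NVTX            // hypothetical build flag, not from the article
#include <nvToolsExt.h>    // declares nvtxRangePushA() and nvtxRangePop()
#endif

static int g_depth = 0;          // current range nesting depth (illustration only)
static int g_maxDepth = 0;       // deepest nesting observed so far
static int g_meshInitCount = 0;  // how many times the mesh was actually built

// RAII helper: opens a named range on construction, closes it on destruction,
// so ranges opened inside a function nest under the caller's range.
class ScopedRange {
public:
    explicit ScopedRange(const char* name) {
#ifdef USE_NVTX
        nvtxRangePushA(name);
#else
        (void)name;  // no profiler attached in this build
#endif
        ++g_depth;
        if (g_depth > g_maxDepth) g_maxDepth = g_depth;
    }
    ~ScopedRange() {
        assert(g_depth > 0);
        --g_depth;
#ifdef USE_NVTX
        nvtxRangePop();
#endif
    }
    ScopedRange(const ScopedRange&) = delete;
    ScopedRange& operator=(const ScopedRange&) = delete;
};

// Hypothetical stand-in for the triangle fan mesh setup: runs once only.
void initMeshOnce() {
    static bool initialized = false;
    if (initialized) return;
    ScopedRange range("mesh init");  // shows up nested inside "render"
    ++g_meshInitCount;
    initialized = true;
}

// Hypothetical stand-in for one pass through a rendering method.
void renderFrame() {
    ScopedRange range("render");
    initMeshOnce();
    // the OpenGL draw calls would go here
}
```

Calling renderFrame() repeatedly produces one outer range per call, while the inner mesh range appears only under the first call, which is the pattern the report is described as showing.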
In comparison, the Compute timeline taken when using the PRIMITIVE_RESTART rendering code shows that the k_perlin() CUDA kernel takes 1.1% of the time. In addition, the Thread State timeline rapidly alternates between red and green, indicating GPU activity. This is also shown in the Device % timeline.
Still, the OpenGL API Call Summary shows that swapping buffers for rendering is easily the dominant runtime component.
Zooming in on the first rendering operation with primitive restart shows that the complex, computationally intensive GPU Perlin Noise generation k_perlin() kernel visually appears to take roughly twice the time of the simple triangle fan mesh generation on the host!
The primitive restart Tools Extension Events report shows that the mesh initialization only takes 126 μs.
In contrast, we see that the MULTI_DRAW rendering code again spends little time in the k_perlin() kernel (circled for clarity in the figure below), although the kernel appears to make good use of the device when active. Most of the rendering time is spent in the OpenGL glMultiDrawElements() call.
This is confirmed with the Tools Extension Events report.
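To make the difference between the three methods concrete, the following sketch builds the host-side data each one would hand to OpenGL. It makes no OpenGL calls. The function names, the sentinel value, and the assumption that every fan has the same number of vertices are illustrative choices, not details taken from the article's source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sentinel: any index value that no real vertex uses will do.
// It would be registered with glPrimitiveRestartIndex().
constexpr std::uint32_t kRestartIndex = 0xFFFFFFFFu;

// Primitive restart: every fan lives in ONE index buffer, with a sentinel
// between fans, so a single glDrawElements() call can render all of them.
std::vector<std::uint32_t> buildRestartIndices(std::size_t numFans,
                                               std::size_t vertsPerFan) {
    assert(vertsPerFan >= 3);  // a fan needs at least one triangle
    std::vector<std::uint32_t> indices;
    indices.reserve(numFans * (vertsPerFan + 1));
    std::uint32_t next = 0;
    for (std::size_t f = 0; f < numFans; ++f) {
        for (std::size_t v = 0; v < vertsPerFan; ++v) indices.push_back(next++);
        if (f + 1 < numFans) indices.push_back(kRestartIndex);  // end this fan
    }
    return indices;
}

// Multidraw: one index count and one byte offset per fan, all passed to a
// single glMultiDrawElements() call (offsets are cast to pointers there).
struct MultiDrawArgs {
    std::vector<int> counts;               // number of indices in each fan
    std::vector<std::size_t> byteOffsets;  // where each fan starts in the buffer
};

MultiDrawArgs buildMultiDrawArgs(std::size_t numFans, std::size_t vertsPerFan) {
    assert(vertsPerFan >= 3);
    MultiDrawArgs args;
    for (std::size_t f = 0; f < numFans; ++f) {
        args.counts.push_back(static_cast<int>(vertsPerFan));
        args.byteOffsets.push_back(f * vertsPerFan * sizeof(std::uint32_t));
    }
    return args;
}

// One-by-one: the application issues a separate glDrawElements() per fan.
std::size_t oneByOneDrawCalls(std::size_t numFans) { return numFans; }
```

With two fans of four vertices each, buildRestartIndices(2, 4) yields nine entries: indices 0 through 3, the sentinel, then indices 4 through 7. The one-by-one method would issue two draw calls for the same geometry, and the count grows linearly with the number of fans.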
Examining the three Tools Extension Events reports, we gain a much better understanding of why primitive restart is so fast. Except in the primitive restart case, the time taken by the k_perlin() kernel (even though it purposely has a couple of performance issues for readers to find) is overwhelmed by the time taken by the OpenGL rendering calls.
Among the three OpenGL rendering methods, primitive restart is clearly the fastest. The NVTX-annotated rendering regions take the following approximate times:
- Primitive restart: around 60 μs.
- Multidraw: around 3,900 μs.
- Iteratively drawing each triangle fan: approximately 1,100,000 μs.
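The relative cost of the three methods follows directly from these figures. The short sketch below works out the ratios; the constant and function names are illustrative, and the inputs are the approximate times listed above, so the results are only rough.

```cpp
#include <cassert>

// Approximate NVTX region times quoted above, in microseconds.
constexpr double kPrimitiveRestartUs = 60.0;
constexpr double kMultiDrawUs = 3900.0;
constexpr double kOneByOneUs = 1100000.0;

// How many times longer the slower method takes than the faster one.
inline double slowdown(double slowerUs, double fasterUs) {
    assert(fasterUs > 0.0);  // guard against division by zero
    return slowerUs / fasterUs;
}
```

Using these values, slowdown(kMultiDrawUs, kPrimitiveRestartUs) is 65, slowdown(kOneByOneUs, kMultiDrawUs) is roughly 282, and slowdown(kOneByOneUs, kPrimitiveRestartUs) is roughly 18,300. In other words, drawing each fan individually is over four orders of magnitude slower than primitive restart in this trace.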


