Dive Deeper than the CPU Utilization Graph to Check Efficiency
The CPU utilization graph provides important information that allows you to detect a load imbalance problem when parallelized code runs on a multicore CPU. However, a sustained high CPU load for all the cores doesn't mean that your parallelized code is efficient.
The main goal of a parallelized algorithm is to translate multicore power into application performance. You parallelize your algorithms because you want to run code on all the available cores and take advantage of the horsepower offered by modern multicore CPUs. When you expect an algorithm to use all the available cores on any Windows version, it is very common to check the CPU Usage History graph shown by Windows Task Manager.
If you use the option that allows you to see one graph per CPU, Windows Task Manager displays one graph per logical core or hardware thread. If each graph displays a sustained high CPU utilization value, it means that the algorithm is running code on all the available cores. However, this high CPU utilization might represent unnecessary overhead added by the parallelization process rather than useful work.
In "Boosting Performance with Atomic Operations in .NET 4" I showed a simple example that demonstrated the importance of considering atomic operations when you want to achieve the best performance for a parallelized algorithm. This example is also useful to understand the importance of diving deeper than the CPU utilization graph to check the efficiency of the parallelization process.
The Premium and Ultimate editions of Visual Studio 2010 allow you to visualize the behavior of a multithreaded application. If you launch the concurrency profiling method for the lock version, Visual Studio will allow you to visualize the degree of parallelism in your application on the CPU utilization graph. If you click on CPU Utilization, the average CPU utilization value for the process could lead you to draw the wrong conclusions. The next screenshot shows an average CPU utilization of 85% when the code runs on a computer with a quad-core CPU.
The degree of parallelism in the application seems to be excellent. The application is running code on all the available cores. However, the application is running unnecessary synchronization code, and therefore, the algorithm has inefficient parallelized code. The application required 38,972 milliseconds to run, with the profiler running in the background.
If you switch to the Threads view, and you check the Synchronization blocking profile, you will see that System.Threading.Monitor.Enter is responsible for 22,662.51 milliseconds of exclusive blocking time. The next screenshot shows the valuable information provided by the Synchronization blocking profile report within the Threads view:
The lock keyword calls System.Threading.Monitor.Enter to acquire the mutual-exclusion lock. Each time the code calls System.Threading.Monitor.Enter, the application consumes CPU cycles. However, because the lock isn't necessary, these CPU cycles waste CPU horsepower.
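To make the relationship between the lock keyword and System.Threading.Monitor.Enter concrete, here is a minimal sketch. It is not the code from the earlier article: the class name, the Sync object, the _total counter, and the iteration count are hypothetical names chosen for illustration. The second method shows roughly what the C# 4 compiler emits for the lock statement in the first one.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class LockVersionSketch
{
    // Hypothetical shared state, used only for this illustration.
    private static readonly object Sync = new object();
    private static long _total;

    // Every iteration acquires and releases the monitor through the lock keyword.
    public static long CountWithLock(int iterations)
    {
        _total = 0;
        Parallel.For(0, iterations, i =>
        {
            lock (Sync)
            {
                _total++;
            }
        });
        return _total;
    }

    // Roughly the code that the C# 4 compiler generates for the lock statement above.
    public static long CountWithMonitor(int iterations)
    {
        _total = 0;
        Parallel.For(0, iterations, i =>
        {
            bool lockTaken = false;
            try
            {
                Monitor.Enter(Sync, ref lockTaken);
                _total++;
            }
            finally
            {
                if (lockTaken)
                {
                    Monitor.Exit(Sync);
                }
            }
        });
        return _total;
    }

    static void Main()
    {
        // Both calls produce the same total; both pay for one Monitor.Enter per iteration.
        Console.WriteLine(CountWithLock(1000000));
        Console.WriteLine(CountWithMonitor(1000000));
    }
}
```

With a loop body this small, the threads spend most of their time acquiring the monitor or waiting for it, which is the kind of pattern that the Synchronization blocking profile attributes to System.Threading.Monitor.Enter.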
If you launch the concurrency profiling method for the atomic operations version, the new average CPU utilization value is usually lower than the value shown for the lock version. The next screenshot shows an average CPU utilization of 69% when the code runs on a computer with a quad-core CPU.
The degree of parallelism in the application seems to be lower than in the previous version. However, the algorithm is more efficient because the application required less time to run: 16,056 milliseconds instead of 38,972 milliseconds, about 2.4 times faster, in both cases with the profiler running in the background. You don't want to waste CPU cycles. You just want your application to run faster while providing correctness, and to scale as the number of cores increases.
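The following sketch shows one way to compare elapsed time against average CPU utilization in code instead of reading them from a graph. It is an illustration under stated assumptions, not the example from the earlier article: the class name, the _total counter, the Measure helper, and the iteration count are hypothetical. It uses Interlocked.Increment as the atomic operation, Stopwatch for wall-clock time, and Process.TotalProcessorTime divided by the number of logical cores as an approximation of the average utilization.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static class AtomicVersionSketch
{
    // Hypothetical shared counter, used only for this illustration.
    private static long _total;

    // Updates the counter with an atomic operation instead of a lock.
    public static long CountWithInterlocked(int iterations)
    {
        _total = 0;
        Parallel.For(0, iterations, i =>
        {
            Interlocked.Increment(ref _total);
        });
        return _total;
    }

    // Hypothetical helper: reports the wall-clock time and the approximate
    // average CPU utilization (0 to 1) across all logical cores for the work.
    public static void Measure(Action work, out TimeSpan elapsed, out double utilization)
    {
        Process process = Process.GetCurrentProcess();
        TimeSpan cpuBefore = process.TotalProcessorTime;
        Stopwatch watch = Stopwatch.StartNew();

        work();

        watch.Stop();
        process.Refresh();
        TimeSpan cpuUsed = process.TotalProcessorTime - cpuBefore;

        elapsed = watch.Elapsed;
        utilization = cpuUsed.TotalMilliseconds /
                      (elapsed.TotalMilliseconds * Environment.ProcessorCount);
    }

    static void Main()
    {
        TimeSpan elapsed;
        double utilization;
        long result = 0;

        Measure(() => result = CountWithInterlocked(10000000), out elapsed, out utilization);

        Console.WriteLine("Result: {0}", result);
        Console.WriteLine("Elapsed: {0:F0} ms, average CPU utilization: {1:P0}",
                          elapsed.TotalMilliseconds, utilization);
    }
}
```

Running the same Measure helper against a lock-based loop makes it possible to put the two numbers side by side. The elapsed time is the value to optimize; the utilization figure only tells you how busy the cores were, not how much useful work they did.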
If you switch to the Threads view, and you check the Synchronization blocking profile, you will realize that there are just 1.09 milliseconds of exclusive blocking time, caused by System.Threading.Tasks.Parallel.For. The next screenshot shows the valuable information provided by the Synchronization blocking profile report within the Threads view:
When you worked with serial code running on single-core CPUs, a sustained high CPU load didn't mean that your code was efficient. The same happens in the multicore world. Profiling tools are very useful for detecting inefficient code. The Concurrency Visualizer introduced in the Premium and Ultimate editions of Visual Studio 2010 provides valuable information about the behavior of a multithreaded application. However, remember to dive deeper than the CPU utilization graph.

