TMonitor, a new tool developed by the CPUID team, lets you see what's going on with each hardware thread (logical core) on some modern multicore microprocessors.
Very old hardware running a single sequential application was easier to understand than modern hardware running many applications, each executing dozens of software threads distributed across the available hardware threads (logical cores).
Parallelized code creates many tasks (the packages) that are distributed, often through work stealing, among many software threads (the cars), which run on hardware threads (logical cores, the lanes). Packages, cars and lanes make it simpler to explain the main software and hardware layers involved in the execution of parallelized code. There is much more to it than this. However, I'll keep the focus on the packages, the cars and the lanes.
How long is it going to take to travel 800 miles with 4 packages, using 4 cars (1 package in each car)? It depends on three variables: the number of available lanes, the cars' maximum speed and the maximum speed limit for each lane. I'll assume the cars' drivers are going to respect the maximum speed limits. The answer also depends on other variables. However, I'll keep the focus on these three.
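To make the arithmetic concrete, here is a minimal sketch of the analogy in Python. It is a deliberately simplified model, not a description of how any real scheduler works: it assumes every lane has the same maximum speed limit, that each lane carries one car at a time, and that the cars leave in waves when there are fewer lanes than cars. The function name and the speeds used in the example (cars capable of 100 mph, lanes limited to 80 mph) are made up for illustration.

```python
import math


def travel_time_hours(distance_miles, packages, lanes, car_mph, lane_limit_mph):
    """Hours needed to deliver every package, one package per car.

    Simplifying assumptions (not from any real scheduler):
    - all lanes share the same maximum speed limit,
    - a lane carries one car at a time,
    - cars travel in waves of at most `lanes` cars.
    """
    if lanes < 1:
        raise ValueError("at least one lane is required")
    # Drivers respect the limit, so the effective speed is the lower value.
    speed = min(car_mph, lane_limit_mph)
    # With fewer lanes than cars, some cars must wait for a later wave.
    waves = math.ceil(packages / lanes)
    return waves * distance_miles / speed


# Illustrative numbers only: 100 mph cars, 80 mph lanes.
print(travel_time_hours(800, 4, 4, 100, 80))   # 10.0 (one wave)
print(travel_time_hours(800, 4, 2, 100, 80))   # 20.0 (two waves)
print(travel_time_hours(800, 4, 4, 100, 100))  # 8.0 (limit raised to 100 mph)
```

Halving the number of lanes doubles the travel time, while raising the maximum speed limit shortens it only if the cars can actually go that fast.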
There are many problems:
- Some lanes aren't completely independent lanes. They share some regions with other lanes (Hyper-Threading technology).
- The maximum speed limit for each lane can be reduced or increased (energy-saving schemes, Enhanced Intel SpeedStep Technology and Intel Turbo Boost Technology, among others).
- The cars' speed isn't constant. It changes because the drivers run into traffic jams on the roads (operating system scheduler decisions, shared hardware resources, concurrency problems, inefficient code and I/O bottlenecks, among others).
As you may guess, these problems happen in nanoseconds. Therefore, it is very important to understand modern parallel hardware in order to create efficient parallelized code.
TMonitor, the new tool developed by the CPUID team and still in beta, displays the active clock of each individual hardware thread (logical core) of a multicore microprocessor. It draws a graph showing the maximum speed limit for each lane, as shown in the following picture for a quad-core microprocessor (four physical cores without Hyper-Threading technology, so four logical cores and four hardware threads):
TMonitor displaying 4 idle hardware threads (all frequencies = 2,400 MHz = 2.4 GHz).
TMonitor uses a very high refresh rate (20 times per second), so it lets you see small clock variations for each hardware thread in real time.
TMonitor displaying 1 hardware thread with its increased clock (one of the frequencies = 2,800 MHz = 2.8 GHz).
TMonitor displaying 4 hardware threads with their increased clocks (all frequencies = 2,800 MHz = 2.8 GHz).
TMonitor can show you what's going on with the hardware threads. You can see the different maximum speed limits while parallelized applications are running. One of its interesting features is the ability to detect Intel Turbo Boost activation for each hardware thread.
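The idea behind spotting a boosted lane can be sketched in a few lines of Python. This is only an illustration of the comparison involved, not TMonitor's actual detection method, which isn't documented here. The function name, the 50 MHz tolerance and the position of the boosted thread in the second sample are assumptions. The 2,400 MHz and 2,800 MHz values come from the screenshots described above.

```python
def boosted_threads(clocks_mhz, base_mhz=2400, tolerance_mhz=50):
    """Return the indices of hardware threads running above the base clock.

    `tolerance_mhz` is an assumed margin that keeps small measurement
    jitter around the base clock from being reported as a boost.
    """
    return [
        index
        for index, mhz in enumerate(clocks_mhz)
        if mhz > base_mhz + tolerance_mhz
    ]


# Samples modeled on the three screenshots (one value per hardware thread).
idle = [2400, 2400, 2400, 2400]
one_boosted = [2400, 2800, 2400, 2400]  # which thread is boosted is arbitrary here
all_boosted = [2800, 2800, 2800, 2800]

print(boosted_threads(idle))         # []
print(boosted_threads(one_boosted))  # [1]
print(boosted_threads(all_boosted))  # [0, 1, 2, 3]
```

Taking one sample like this every 50 milliseconds would match the 20-times-per-second refresh rate mentioned earlier.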
The application is very simple to download and run. It comes in both 32-bit and 64-bit versions for Windows. This beta version has some limitations: it works only on Intel Core 2, Core i3, Core i5 and Core i7 microprocessors. However, taking into account the other excellent free tools developed by the CPUID team, you can expect support for many other microprocessors soon.
The next time you want to understand what's going on with your parallelized code, you can use TMonitor to get more information about the underlying hardware. This way, you'll be able to understand why some small changes in the code can produce very different performance results.
Don't forget about maximum speed limits, packages, cars and lanes.