Multi-Core Performance Tools
Tools dedicated to analyzing performance on multi-core processors are still relatively few in number. System profilers can provide information on the processes executing on a system; however, the messaging and coordination between those processes are not visible. Tools that offer visibility into this coordination typically must be aware of the particular API in use. POSIX Threads is a commonly employed multi-core programming API and therefore has relatively broad tool support.
Intel Thread Profiler
The Intel Thread Profiler identifies thread-related performance issues and is capable of analyzing OpenMP, POSIX, and Windows† multithreaded applications. When used to profile an application, some of the key capabilities include:
- The display of a histogram of aggregate data on time spent in serial or parallel regions.
- The display of a histogram of time spent accessing locks, in critical regions, or with threads waiting at implicit barriers for other threads.
Intel Thread Profiler employs what is termed critical path analysis, in which events are recorded, including spawning new threads, joining terminated threads, holding synchronization objects, waiting for synchronization objects to be released, and waiting for external events. An execution flow is the path that a thread takes through the application, and each of the events listed above can split or terminate a flow. The critical path is defined as the longest flow through the execution from the start of the application until it terminates. The critical path is important because any improvement in threaded performance along this path would increase overall performance of the application.
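The definition of the critical path as the longest flow can be illustrated with a minimal sketch. This is not how Intel Thread Profiler is implemented; the segment names, durations, and the `critical_path` helper are hypothetical, and the recorded events are assumed to form an acyclic graph in which a thread spawn splits a flow and a join terminates one.

```python
from functools import lru_cache

def critical_path(durations, successors, start):
    """Return (total_time, path) for the longest flow beginning at `start`.

    durations:  segment name -> time spent in that segment
    successors: segment name -> segments that can run once it finishes
    The graph is assumed to be acyclic (hypothetical trace format).
    """
    @lru_cache(maxsize=None)
    def longest(node):
        # Longest tail among all flows that continue from this segment.
        tails = [longest(nxt) for nxt in successors.get(node, ())]
        tail_time, tail_path = max(tails, key=lambda tp: tp[0], default=(0.0, ()))
        return durations[node] + tail_time, (node,) + tail_path

    total, path = longest(start)
    return total, list(path)

# Hypothetical trace: main spawns two workers and then joins both.
durations = {"main_start": 1.0, "worker_a": 4.0, "worker_b": 2.5, "main_join": 0.5}
successors = {
    "main_start": ["worker_a", "worker_b"],  # spawning threads splits the flow
    "worker_a": ["main_join"],               # joining terminates a worker flow
    "worker_b": ["main_join"],
}
total, path = critical_path(durations, successors, "main_start")
print(total, path)
```

In this hypothetical trace the critical path runs through `worker_a`, so shortening `worker_b` would not reduce the total time, whereas shortening `worker_a` would.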
Data recorded along the critical path includes the number of threads that are active and thread interactions over synchronization objects. Figure 8 depicts the Intel Thread Profiler GUI divided into two sections: Profile View and Timeline View. On top is the Profile View, which gives a histogram representation of data taken from the critical path and can be organized with different filters that include the following:
- Number of active threads on the critical path.
- Object view: identifies the synchronization objects encountered by threads.
- Thread view: shows the contribution of each thread to the critical path.
Benefits of these filters and views include:
- Knowledge of the amount of parallelism available during the application execution
- Helping locate load imbalances between threads
- Determining what synchronization objects were responsible for the most contention between threads
The Timeline View shows the critical path over the time that the application has run. The critical path travels from one thread to another and shows the amount of time threads spend executing or waiting for a synchronization object.
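The idea of reporting time by number of active threads can be sketched as follows. This is an illustrative calculation only, not the tool's algorithm; the `concurrency_histogram` helper and the sample intervals are hypothetical, and each interval is assumed to mark a period during which one thread was executing rather than waiting.

```python
from collections import defaultdict

def concurrency_histogram(intervals):
    """Return {active_thread_count: total_time} for (start, end) run intervals."""
    events = []
    for start, end in intervals:
        events.append((start, 1))    # a thread becomes active
        events.append((end, -1))     # a thread stops executing
    events.sort()
    hist = defaultdict(float)
    active, prev = 0, None
    for time, change in events:
        if prev is not None and time > prev:
            # Credit the elapsed time to the concurrency level that held during it.
            hist[active] += time - prev
        active += change
        prev = time
    return dict(hist)

# Hypothetical timeline: one long-running thread and two short helpers.
hist = concurrency_histogram([(0, 10), (2, 6), (2, 4)])
print(hist)
```

A large share of time at a count of one, as in this made-up timeline, is the kind of result that would point to serial regions or a load imbalance worth investigating.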
CriticalBlue Prism
CriticalBlue Prism is another example of a tool suite aimed at optimized software development for multi-core and/or multithreaded architectures. Prism can be used across the full range of activities needed to migrate existing sequential single-core software onto a multi-core platform.
Prism's analyses are based on a dynamic tracing approach. Traces of the user's software application are extracted either from a simulator of the underlying processor core or via an instrumentation approach in which the application is dynamically instrumented to produce the required data. Once a trace has been loaded into Prism, the user can start to analyze the application behavior in a multi-core context. In addition to standard profiling data showing functions and their relative execution times, Prism provides insight that is specific to a multi-core processor context. Examples of the views and analyses available in Prism are:
- Histogram showing activity over time by individual function and memory
- Dynamic call graph showing function inter-relationships and frequency
- Data dependency analysis between functions on sequential code
- What-if scheduling to explore the impact of executing functions in separate threads
- What-if scheduling to explore the impact of varying the numbers of processor cores employed
- What-if scheduling to explore the impact of removing identified data dependencies
- What-if scheduling to explore the impact of cache misses on multi-core execution performance
- What-if scheduling to explore the benefit of Intel Hyper-Threading Technology on multi-core execution performance
- Data race analysis between functions on multithreaded code
Figure 9 is a screenshot of Prism analyzing sequential code where the user has forced several functions to execute in their own threads and a trial schedule has been generated on four cores. This trial schedule was modeled on unchanged sequential code and enables the user to exhaustively test and optimize the parallelization strategy prior to making code changes.
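The general idea behind what-if scheduling can be sketched with a small greedy list scheduler. This is an assumption-laden illustration and not Prism's actual model; the `simulate_schedule` function, the task names, and the durations are all hypothetical, and cores are treated as identical with no cache or communication costs.

```python
import heapq

def simulate_schedule(durations, deps, cores):
    """Greedy list scheduling: return the finish time of the last task.

    durations: task -> execution time
    deps:      task -> set of tasks that must finish first
    cores:     number of identical cores available
    """
    remaining = {t: set(deps.get(t, ())) for t in durations}
    ready = sorted(t for t, d in remaining.items() if not d)
    running = []                       # min-heap of (finish_time, task)
    now, done = 0.0, 0
    while done < len(durations):
        # Start ready tasks while a core is free.
        while ready and len(running) < cores:
            task = ready.pop(0)
            heapq.heappush(running, (now + durations[task], task))
        if not running:
            raise ValueError("dependency cycle: no task can run")
        # Advance time to the next task completion.
        now, finished = heapq.heappop(running)
        done += 1
        newly_ready = []
        for task, waiting_on in remaining.items():
            if finished in waiting_on:
                waiting_on.remove(finished)
                if not waiting_on:
                    newly_ready.append(task)
        ready.extend(sorted(newly_ready))
    return now

# Hypothetical functions extracted from a sequential trace.
durations = {"decode": 2, "filter_a": 3, "filter_b": 3, "encode": 1}
deps = {"filter_a": {"decode"},
        "filter_b": {"decode", "filter_a"},
        "encode": {"filter_a", "filter_b"}}
# What if the dependency of filter_b on filter_a were removed?
deps_relaxed = dict(deps, filter_b={"decode"})

one_core = simulate_schedule(durations, deps, cores=1)
two_cores = simulate_schedule(durations, deps, cores=2)
two_cores_relaxed = simulate_schedule(durations, deps_relaxed, cores=2)
print(one_core, two_cores, two_cores_relaxed)
```

With these made-up numbers, adding a second core alone does not shorten the schedule because the tasks form a chain; the gain appears only once the data dependency is removed, which is the kind of trade-off a what-if analysis is meant to expose before any code is changed.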
Power Performance Tools
As previously mentioned, the "race to idle" power optimization strategy is implemented by employing the single-core and multi-core performance tools described earlier. The focus of this section is on tools to assist with idle mode optimization.
Two types of tools assess power performance. The first type measures the actual power used by the device via physical probes, a technique referred to as probe-based profiling. The second type employs counters in the platform that measure power state transitions, a technique referred to as power state-based profiling. For the sake of completeness, a brief description of each type of tool follows; however, only power state-based profiling is discussed at length and employed in the case study.
Probe-based profiling employs an external device to measure and record the power utilized by the system as a specific application executes. Typically, there is some mechanism to correlate the power readings with points in the application. An industry example of such a tool is the TMS320C55x† Power Optimization DSP Starter Kit, which integrates National Instruments Power Analyzer to provide a graphical view of power utilization over time. The Intel Energy Checker SDK is another probe-based profiling tool that targets desktop and server platforms. This tool measures power from the AC adaptor using a measurement tool such as those available from Watts up? and enables correlation with specific regions of application code. The data transfer assumes a shared file system, which currently limits applicability to desktop and server computing platforms.
Power-state profiling tools rely upon software interfaces into the platform's power states. Instead of providing a measure of power utilization, these interfaces provide the number of times transitions occur between the platform power states. The process of idle-mode optimization works by incrementally enabling application functionality and inspecting the recorded power data at every stage. In many cases, additional power state transitions will be recorded. Many of these additional transitions are necessary because as more functionality of the application is enabled, more processing is required. However, at each step, power state differences should be measured, understood, and optimized away if truly unneeded.
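The stage-by-stage comparison described above can be sketched as follows. The sketch assumes a Linux target that exposes the cpuidle sysfs interface, where each `state*/usage` file holds the number of times an idle state has been entered; the helper names, the stage names, and the sample counts are hypothetical, and other platforms will expose power states differently.

```python
from pathlib import Path

def read_idle_entries(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return {state_name: entry_count} for one CPU (assumes Linux cpuidle sysfs)."""
    counts = {}
    for state in sorted(Path(cpuidle_dir).glob("state*")):
        name = (state / "name").read_text().strip()
        counts[name] = int((state / "usage").read_text())
    return counts

def entries_between(before, after):
    """Idle-state entries that occurred between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}

def compare_stages(baseline, stage):
    """Extra idle-state entries in a stage relative to the baseline run."""
    names = sorted(set(baseline) | set(stage))
    return {name: stage.get(name, 0) - baseline.get(name, 0) for name in names}

# Hypothetical snapshots taken before and after equal-length runs of two stages.
# On a real target each snapshot would come from read_idle_entries().
baseline = entries_between({"C1": 1000, "C6": 400}, {"C1": 1300, "C6": 1000})
with_network = entries_between({"C1": 5000, "C6": 2000}, {"C1": 6500, "C6": 2700})
extra = compare_stages(baseline, with_network)
print(extra)
```

In this made-up comparison, enabling the extra functionality adds many more entries into the shallow state than into the deep state, which is the sort of difference that would then need to be understood and, if unneeded, removed.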
PowerTOP is a Linux-targeted tool that performs power state profiling and supports idle mode optimization techniques. The tool executes on the target device in an operating mode similar to the common Unix tool, top, and provides a dashboard-like display. The display is intended to provide real-time updates as your applications execute on the target device. Figure 10 displays a screenshot of PowerTOP and highlights its functionality. The tool provides six categories of information, which are summarized as follows:
- C state residency information. The share of time spent in each C state and the average duration of each stay in that state
- P state residency information. The percentage of time the processor is in a particular P state
- Wakeups per second. The number of times per second the system moves out of an idle C state
- Power usage. An estimate of the power currently consumed and the amount of battery life remaining
- Top causes for wakeups. A rank-ordered list of interrupts, processes, and functions causing the system to transition to C0
- Wizard mode. Suggestions for changes to the operating system that could reduce power utilization
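To illustrate how figures such as C state residency and wakeups per second can be derived, the following sketch works from two counter snapshots. It is not PowerTOP's implementation; the `residency_report` helper and the sample numbers are hypothetical, each snapshot is assumed to hold an entry count and a cumulative idle time in microseconds per C state for a single CPU, and every idle-state entry is assumed to end in one wakeup.

```python
def residency_report(before, after, interval_s):
    """Summarize idle behavior between two snapshots of one CPU.

    before/after: {state: (entry_count, idle_time_us)}
    interval_s:   seconds elapsed between the snapshots
    Returns (per_state_report, wakeups_per_second).
    """
    report = {}
    wakeups = 0
    for state, (count1, time1) in after.items():
        count0, time0 = before.get(state, (0, 0))
        entries = count1 - count0
        idle_us = time1 - time0
        wakeups += entries
        report[state] = {
            # Share of the interval spent in this state.
            "residency_pct": 100.0 * idle_us / (interval_s * 1_000_000),
            # Average length of one stay in this state.
            "avg_duration_ms": (idle_us / entries / 1000.0) if entries else 0.0,
        }
    return report, wakeups / interval_s

# Hypothetical snapshots taken 10 seconds apart.
before = {"C1": (100, 1_000_000), "C6": (50, 4_000_000)}
after = {"C1": (300, 2_000_000), "C6": (150, 9_000_000)}
report, wakeups_per_s = residency_report(before, after, interval_s=10)
print(report, wakeups_per_s)
```

A low wakeup rate combined with long average stays in the deepest state is generally the goal of idle mode optimization, since each wakeup pulls the processor back to C0.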
For more information about performance optimization and architecture options, see Break Away with Intel Atom Processors: A Guide to Architecture Migration by Lori Matassa and Max Domeika.