Lori Matassa and Max Domeika are the authors of Break Away with Intel Atom Processors: A Guide to Architecture Migration
Good software design seeks a balance between simplicity and efficiency. Application performance is one aspect of software design; however, correctness and stability are typically prerequisites to extensive performance tuning efforts. A typical development cycle is depicted in Figure 1 and consists of four phases: design, implementation, debugging, and tuning. The development cycle is iterative and concludes when performance and stability requirements are met. Figure 1 also offers a more detailed look inside the tuning phase, which consists of single processor core optimization, multi-core processor optimization, and power optimization.
One key fact to highlight about the optimization process is that changes made during this phase can require another round of design, implementation, and debugging. Ideally, a candidate optimization requires minimal changes, but there are no guarantees. Each change proposed as part of a possible optimization should be evaluated in terms of stability risk, implementation effort, and performance benefit.
Similarly, the tune step is also iterative with the goal of reaching a satisfactory equilibrium between single core, multi-core, and power performance. The components of the tune step are summarized as follows:
- Single processor core tuning. Optimization of the application assuming execution on one Intel Atom processor core. This step focuses on increasing performance, which is typically the reduction of execution time.
- Multi-core tuning. Optimization of the application taking advantage of parallel technology including Intel Hyper-Threading Technology and multiple processor cores. This step focuses on increasing performance, which is typically the reduction of execution time.
- Power tuning. Optimization of the application focusing on power utilization. This step focuses on reducing the amount of power used in accomplishing the same amount of work.
Single Processor Core Tuning
Single processor core tuning focuses on improving the behavior of the application executing on one physical Intel Atom processor core. Intel Hyper-Threading Technology is not considered during this phase; it enables one physical processor core to appear as two cores and introduces issues more related to multi-core processing. This tuning step isolates the behavior of the application from more complicated interactions with other threads or processes on the system. This step is not entirely focused on what traditionally is called serial tuning because parallelism in the form of vector processing or acceleration technology can be considered.
The foundation of performance tuning rests upon two complementary assertions: the Pareto principle and Amdahl's law. The Pareto principle, colloquially known as the 80/20 rule, states that 80 percent of the time spent in an application is in 20 percent of the code. This observation helps prioritize optimization efforts toward the areas of highest impact, namely the most frequently executed portions of the code. Amdahl's law provides guidance on the limits of optimization. For example, if your optimization can be applied to only 75 percent of the application's execution time, the maximum theoretical speedup is 4 times, because the remaining 25 percent of the time is unaffected.
Single processor core tuning itself comprises multiple steps: first gaining an understanding of the application, then tuning based upon general performance analysis, and finally analysis and tuning specific to the Intel Atom processor. The single processor core tuning process is summarized by the following steps:
- Benchmark. Develop a benchmark that represents typical application usage.
- Profile. Analyze and understand the architecture of the application.
- Compiler optimization. Use aggressive optimizations if possible.
- General microarchitecture tuning. Tune based upon insight from general performance analysis statistics. These statistics, such as clock cycles per instruction retired, are generally accepted performance analysis statistics that can be employed regardless of the underlying architecture.
- Intel Atom processor tuning. Tune based on insight about known processor "glass jaws." These include statistics and techniques to isolate performance issues specific to the Intel Atom processor.
Multi-Core Processor Tuning
The focus of multi-core processor tuning is on the effective use of parallelism that takes advantage of more than one processor core. This step pertains both to Intel Hyper-Threading Technology and to true multi-core processing; there are some issues specific to each, and where appropriate these differences are highlighted. At the application level, two techniques allow you to take advantage of multiple processor cores: multitasking and multithreading. Multitasking is the execution of multiple operating system processes on a system. In the context of one application, multitasking requires the division of work between distinct processes, and special effort is required to share data between processes. Multithreading is the execution of multiple threads and by default assumes memory is shared, which introduces its own set of concerns. This article limits itself to multithreading because multitasking is a more mature technology, and one where the operating system governs much of the policy of execution; multithreading in the context of the Intel Atom processor is much more under the control of the software developer.
Developing software for multi-core processors requires good analysis and design techniques. A wealth of information on these techniques is available in literature by Mattson et al., Breshears, and many others.
Tuning multithreaded applications on the Intel Atom processor requires ensuring good performance both when the application executes on the logical processor cores available via Intel Hyper-Threading Technology and when it executes on multiple physical processor cores. General multithreading issues that affect performance regardless of the architecture, such as lock contention and workload imbalance, must be addressed. One performance concern when executing under Intel Hyper-Threading Technology is the sharing of processor core resources. For example, the caches are effectively shared between two concurrently executing threads; in a worst-case scenario, one thread can cause the other to miss in the cache on every access. Tuning for multi-core processors adds another level of complication, as the possible thread interactions and cache behavior can be even more complicated. For instance, it is possible for two threads to cause false sharing, which limits performance but can be easily addressed. Understanding techniques to analyze performance and mitigate these performance issues is essential.
Converting a serial application to take advantage of multithreading requires an approach that uses the generic development cycle, consisting of these five phases: Analysis, Design, Implementation, Debug, and Tune. There are threading tools that help with code analysis, debugging, and performance tuning.
- Analysis. Develop a benchmark that represents typical system usage and comprises concurrent execution of processes and threads. In many cases, the benchmark from the single core tuning phase and the initial parallel implementation may be an appropriate starting point. Use a system performance profiler such as the Intel VTune Performance Analyzer to identify the performance hotspots in the critical path. Determine whether the identified computations can be executed independently. If so, proceed to the next phase; otherwise look for other opportunities with independent computations.
- Design. Determine changes required to accommodate a threading paradigm (data restructuring, code restructuring) by characterizing the application threading model (data-level or task-level parallelization). Identify which variables must be shared and if the current design structure is a good candidate for sharing.
- Implementation. Convert the design into code based on the selected threading model. Consider coding guidelines based on the processor architecture, such as the use of the PAUSE instruction within spin-wait loops. Make use of the multithreading software development methodologies and tools.
- Debug. Use runtime debugging and thread analysis tools such as Intel Thread Checker.
- Tune. Tune for concurrent execution on multiple processor cores, both with and without Intel Hyper-Threading Technology.
Power Tuning
Tuning that is focused on power utilization is a relatively new addition to the optimization process for Intel architecture processors. The goal of this phase is to reduce the power used by the application when executing on the embedded system. One of the key methods of doing so is helping the processor enter, and stay in, one of its idle states.
In an embedded system, power at its most fundamental level is the rate at which energy is consumed in driving the system, measured in watts.
Power can be consumed by several components in a system. Typically, the display and the processor are the two largest consumers of power in an embedded computing system. Other consumers of system power include the memory, hard drives, solid state drives, and communications. Power management features already exist in many operating systems and enable implementation of power policy where various components are powered down when idle for long periods. A simple example is turning off the display after a few minutes of idle activity. Power policy can also govern behavior based upon available power sources. For example, the embedded system may default to a low level of display brightness when powered by battery as opposed to being plugged into an outlet.
Several statistics exist for characterizing the power used by a system including:
- Thermal design power (TDP). The maximum amount of heat that a thermal solution must be able to dissipate from the processor so that the processor operates under normal operating conditions. TDP is typically measured in watts.
- "Plug load" power. A measure of power drawn from an outlet as the embedded system executes. Plug load power is typically measured in watts.
- Battery power draw. An estimate of the power drawn from a battery as the embedded system executes. Typically, battery power draw is stated in watts and is based upon estimates from ACPI.
Your project requirements will guide which of these power measurements to employ and what goals will be set with regard to them.
The Intel Atom processor enables a number of power states, which are classified into C-states and P-states. C-states are different levels of processor activity, ranging from C0, where the processor is fully active, down to C6, where the processor is completely idle and many portions of the processor are powered down. P-states, known as performance states, are different levels of processor frequency and voltage.
To determine whether optimizations improve power utilization, a tool is required to measure it. There are two categories of tools for measuring power on an embedded system. The first category provides a direct measurement and employs physical probes to measure the amount of power used. These could be as simple as a plug load power probe between the device and the electrical outlet, or more extensive probes placed on the system board to monitor various power rails, such as those required to execute the EEMBC EnergyBench benchmark.
The second category, power state profiling tools, employs an indirect method of measuring power utilization. Instead of directly measuring power, this class of tool measures and reports on the amount of time spent in different power states. The objective when using these tools is to understand what activities are causing the processor to enter C0 and to minimize them.
The goal of power tuning is two-fold:
- Minimize the total time spent in the active state.
- Maximize the length of each period spent in an inactive state.
On the surface these goals may seem redundant; however, in practice both are required. Power is expended in transitioning into and out of idle states, so a processor that repeatedly wakes up and goes back to sleep may consume more power than one that stays active for longer, uninterrupted periods and then idles for longer periods. In general, the desired end result is for the system to be in idle mode 90 percent of the time, although this depends on the specific workload and application. Techniques to meet this goal follow one of two tuning strategies, summarized as follows:
- Race to Idle. The tasks are executed as quickly as possible to enable the system to idle. This approach typically entails aggressive performance optimization using similar techniques as single core and multi-core performance tuning.
- Idle mode optimization. Iteratively add software components executing on the system and analyze power state transitions to ensure these components are as nondisruptive to power utilization as possible.
High power utilization has several causes, including:
- Poor computational efficiency
- Poor memory management
- Bad timer and interrupt behavior
- Poor power awareness
- Bad multithreading behavior