Dr. Dobb's | How The Theory of Constraints Can Help in Software Optimization

How The Theory of Constraints Can Help in Software Optimization

Software optimization is the process of improving the software by eliminating bottlenecks

June 24, 2009
URL:http://www.drdobbs.com/how-the-theory-of-constraints-can-help-i/212903287

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

--Donald Knuth (adapted from C. A. R. Hoare)

The Theory of Constraints (TOC) is a philosophy of continuous improvement which is defined as a framework that can be applicable to many disciplines. The theory is based on the idea that in any complex system, there is usually one aspect of that system that limits its ability to achieve its goal or optimal functioning. To achieve any significant improvement of the system, the constraint must be identified and resolved.

Software optimization, in the generic sense, is the process of improving the software by eliminating bottlenecks so that it operates more efficiently on a given hardware and uses resources optimally. In keeping with Knuth's advice, TOC can help developers to identify the bottlenecks of their code for optimization that have the most return on investment (ROI). Identifying the bottlenecks in the target application and eliminating them appropriately is the key to an efficient optimization.

There are many optimization methodologies, which help developers to answer the questions of why to optimize, what to optimize and how to optimize, and these methods aid developers to reach their performance requirements. These questions are not any different than the questions asked to improve an organization, supply chain management, or manufacturing.

In this article, I propose that TOC, which is a philosophy of management and improvement, can be also used and applied for software optimization. The focus of this article is how to identify the bottlenecks and eliminate them with the help of Intel VTune Performance Analyzer while following the framework that is described by Theory of Constraints.

Introduction to Software Optimization: Hotspots and More

Performance tuning usually focuses on reducing the time it takes to complete a well-defined workload. Performance events can be used to measure the elapsed time and therefore reducing the elapsed time of completing a workload is equivalent to reducing measured processor cycles (clockticks).

Intel VTune Performance Analyzer is a powerful tool for software developers who are interested in performance analysis. The easiest way to identify the hotspots in a given application is to sample the application with processor cycles and VTune analyzer helps the developer identify where most of the CPU cycles are spent, in addition to many other processor events1, by utilizing two profiling techniques; sampling and call graph. The sampling can be in two forms: processor event based and time based sampling. Event-based sampling (EBS) relies on the performance monitoring unit (PMU) supported by the processors. From this point forward, I refer to event-based sampling (EBS) as "sampling."

When a sampling activity used, VTune analyzer by default uses processor cycles and instructions retired to analyze the application. (Recent generations of Intel 64 and IA-32 processors feature microarchitectures using an out-of-order execution engine. They are also accompanied by an in-order front " instructions retired.") The count of cycles, also known as "clockticks," forms the fundamental basis for measuring how long a program takes to execute. The total cycle measurement is the start to finish view of total number of cycles to complete the application of interest. In typical performance tuning situations, the metric Total cycles can be measured by the event CPU_CLK_UNHALTED.CORE.

The other event used by default sampling activity is instructions retired. This event indicates the number of instructions that retired or executed completely. This does not include partially processed instructions executed due to branch mis-predictions. The ratio of clockticks (non-halted) and instructions retired is called clocks per instruction retired (CPI) and it can be good indicator of performance problems (indicator of the efficiency of instructions generated by the compiler and/or low CPU utilization). (The Intel Core microarchitecture is capable of reaching CPI as low as 0.25 in ideal situations. The greater value of CPI for a given workload indicates that there are more opportunities for code tuning to improve performance.)

The approach I use here is explained in greater details in Intel 64 and IA-32 Intel Architecture Optimization Reference Manual. In this approach, it is accurate to say that the total number of cycles an application takes is the sum of the cycles issuing μops4and the cycles not issuing μops (stalls). This formula is given with the processor event names in Formula 2.


Total Cycles = Cycles issuing μops + Cycles not issuing μops

Formula 1

It is worth noting that this decomposition is approximate in its nature so it should be taken as estimation.

Therefore, my focus is on three basic concepts:

Minimizing "stalls" by reducing or optimizing long latency operations such as memory accesses, floating-point operations, etc.,
Minimizing "non-retired" instructions by reducing the branch mis-predictions,
Minimizing "retired" instructions.

In Intel Core microarchitecture (Figure 1), up to four μops may be retired per cycle and 6 μops6 can be executed. The results of Intel 64 and IA-32 instructions must be committed in original program order before they are retired. Reducing the number of retired instructions will improve the execution time as will reducing the stalls.

Cycles wasted due to stalls indicate sub-optimal execution; therefore identifying them will be the main focus. Stalls can be categorized as"

Execution stalls
Front End stalls
Mispredicted branch pipeline flushing
Cycles wasted executing instructions that are not retired

With the introduction of Core architecture, the approach to identify and root-cause stall cycles became easier and more systematic. Before going into deeper analysis and explanation of the events, let's introduce the Theory of Constraints.

Figure 1: Intel Core Microarchitecture Pipeline Functionality

Applications Will Perform as Fast as Their Bottlenecks: Gentle Introduction to the Theory of Constraints

TOC is a philosophy of management and improvement developed by Eliyahu M. Goldratt. As mentioned earlier, it is a process of continuous improvement which focuses on the idea that in any complex system, there is usually one aspect of that system that limits its ability to achieve its goal.

The TOC processes defined by Goldratt are used to improve the health of an organization and these processes are very similar to the ones used by software optimization methodologies.

In this theory, five steps are defined. Before these steps are applied, a goal must be articulated. In our case, this is improving the performance of our application or meeting a performance requirement. Having articulated the goal, the steps can be given as:

Identify the constraint (the root-cause that prevents achievement of the goal)
Decide how to exploit the constraint (make sure the constraint identified is doing what it is designed to do. i.e: execution unit is executing but not stalling)
Subordinate and synchronize everything else to the above decisions
Elevate the performance of the constraint (increase capacity of the constraint. i.e: taking advantage of multiple cores, parallelism)
If in any of the above steps the constraint has shifted, go back to Step 1.

The most critical piece in these steps is identifying the constraints. The questions to ask are:

"What to change?",
"To What to Change?" and
"How to Change?"

The TOC defines processes called thinking processes to answer these questions and provides a problem-solving framework based on cause and effect. The basic idea while trying to reach our goal (desired performance) is to identify the hotspots, understand the underlying cause and effect relation and eliminate the causes. If we take the four major stall categories given earlier as an example, we can say that most of the time execution stalls are the cause of performance problems (effect) and resource related stalls. While sub-optimal algorithms are the cause of execution stalls (effect).

In the rest of the article I show how you can identify the constraints preventing optimal application performance by using the TOC methodology as a guideline for software optimization on Core architecture.

Core Architecture Cycle Analysis in Detail

Figure 2 gives you an idea of how to approach cycle decomposition and performance analysis exercise. While Figure 3 is one interpretation of Figure 2 for Core architecture with some of the microarchitectural event names given. This is clearly not the complete breakdown of all the events, but nonetheless a good starting point for further analysis.

Figure 2: Performance Events Drill-Down and Software Tuning Feedback Loop

The idea behind Figure 3 is to identify cycles with the help of VTune Performance Analyzer where: 1) no μops are dispatched for execution, and 2) cycles which are executed but are not retired (due to speculative nature of the processor). The non-retired cycles are basically non-productive cycles and cause ineffective usage of the execution unit.

Figure 3: One interpretation of Figure 2 with some of the microarchitectural event names.

Cycles dispatching μops can be counted with the RS_UOPS_DISPATCHED.CYCLES_ANY event while cycles where no μops were dispatched (stalls) can be counted with the RS_UOPS_DISPATCHED.CYCLES_NONE event. Therefore the equation given earlier in Formula 1 can be re-written as given in Formula 2. The ratio of RS_UOPS_DISPATCHED.CYCLES_NONE to CPU_CLK_UNHALTED.CORE will tell you the percentage of cycles wasted due stalls. These very stalls can turn the execution unit of a processor into a major bottleneck. The execution unit by definition is always the bottleneck because it defines the throughput and an application will perform as fast as its bottleneck. Therefore it is extremely critical to identify the causes for the stall cycles and remove them if possible.


CPU_CLK_UNHALTED.CORE ~ RS_UOPS_DISPATCHED.CYCLES_ANY + RS_UOPS_DISPATCHED.CYCLES_NONE

Formula 2

Our goal is to determine how we can minimize the causes for the stalls and let the "bottleneck" (i.e, execution unit due to stalls) do to what it is designed to do. In sum, the execution unit should not sit idle and wait for whatever reason.

There are many contributing factors to the stall cycles and sub-optimal usage of the execution unit. Memory accesses (e.g, cache misses), Branch mis-predictions (pipeline flushes as a result), Floating-point (FP) operations (ops) (e.g, long latency operations such as division, fp control word change etc) and μops not retiring due to the out of order (OOO) engine can be given as some of them.

Some of the key events that are used in breaking down the stalls cycles in Figure 3 are given below. VTune can help to sample these events not only on your application but also on the entire system.

MEM_LOAD_RETIRED.L2_LINE_MISS, MEM_LOAD_RETIRED.L1_LINE_MISS: Number of retired load operations that missed the L2/L1 (respectively) cache. When the event count is multiplied with penalty in cycles, one can estimate the impact on the stalls cycles. However, this approach oversimplifies the fact that the OOO engine can handle multiple outstanding load misses.
DTLB_MISSES.ANY: Number of Data Table Lookaside Buffer (DTLB) misses. The count includes misses detected as a result of speculative accesses. Typically a high count for this event indicates that the code accesses a large number of data pages. The cost of a DTLB lookup miss is about 10 cycles. The event MEM_LOAD_RETIRED.DTLB_MISS measures the number of load micro-ops that experienced a DTLB miss.
RESOURCE_STALLS.BR_MISS_CLEAR: Number of cycles after a branch misprediction is detected at execution until the branch and all older micro-ops retire. During this time new micro-ops cannot enter the out-of-order pipeline.
- BR_INST_RETIRED.MISPRED and its ratio to BR_INST_RETIRED.ANY can also give idea on how many of the overall branch instructions were mispredicted.
RAT_STALLS.ANY: Number of stall cycles due to various reasons, the breakdown of the stalls counted by this event can also be counted separately; see VTune analyzer help).
Type conversions and long latency operation will also impact the performance of the execution unit. For example:
- Division which is a long latency operation and its impact on the execution unit can be measured with DIV (counts the number of divide ops) and IDLE_DURING_DIV (counts the number of cycles the divider is busy and no other execution unit or load operation is in progress) respectively.
- RESOURCE_STALLS.FPCW: counts the number of cycles while execution was stalled due to writing the floating-point unit (FPU) control word.
DELAYED_BYPASS.[FP/LOAD/SIMD]: Number of times floating point/load/SIMD operations use data immediately after the data was generated by a non-floating point/non-SIMD unit. Such cases result in one penalty cycle due to data bypass between the units. Normally small penalties such as this one are hidden by out of order (OOO) execution.
ILD_STALLS: Number of Instruction Length Decoder stall cycles due to a length changing prefix (LCP). The event ILD_STALLS measures the number of times the slow decoder was triggered, the cost of each instance is 6 cycles. This event is front-end related.

Even though OOO engine takes care of small stall penalties (usually anything less 10 cycles), it can be a good exercise to identify the locations of these events to find a correlation with the un-dispatched cycles.

As mentioned above, non-productive cycles (non-retired instruction) utilize the execution unit unnecessarily; thus, identifying and eliminating those instructions is crucial. Although there isn't a direct event to measure the cycles associated with non-retiring μops, Formula 2 and 3 can be used to estimate non-retired cycles by using already performance events.


RS_UOPS_DISPATCHED -   (UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED " UOPS_RETIRED.MACRO_FUSION)

Formula 3


RS_UOPS_DISPATCHED -   (UOPS_RETIRED.ANY + UOPS_RETIRED. LD_IND_BR + UOPS_RETIRED.STD_STA)

Formula 4

Applying TOC: Cause and Effect Relation and Focusing Steps

By doing this decomposition exercise, we are simply identifying the contributors of the stall cycles in the application and we are indeed building the cause and effect relationship which in TOC is called as Current Reality Tree. The components are called 'UnDesirable Effects' (UDE) [6] and the UDEs are symptoms resulting from a to-be-determined cause.

The analysis and decomposition exercise can easily be performed by the help of Vtune analyzer as mentioned earlier. The Vtune analyzer will help the developers to identify the functions and source code segments causing these events. The outcome of the analysis should give an idea of what causes the stall cycles, which functions and code segments (with source line information) contribute to execution to be constrained. Once the major constraint is identified (i.e, biggest contributor to the stall cycles), the developer needs to look into removing or reducing the conditions causing the problems.

To identify the few things that need to be optimized, you should rely on cause and effect relationships. As you can imagine, there could be multiple symptoms because of the same cause or there could be multiple causes for the same symptom. The fewer the contributors that cause the problem(s) the more powerful and focused our optimization effort will be.

Figure 4: Identifying cause and effect relationships

Having discussed how cycle decomposition can be done, it is worthwhile summarizing how the five focusing steps of TOC can be used in the methodology mentioned above; see Table 1.

Table 1

It is critical to note that after constraints are identified and removed, the next steps should be to ensure others parts of the system (i.e, front end) keep up with the execution unit (Subordinate & Synchronize) so that execution unit doesn't sit idle. If this is not ensured then the solution is only sub-optimal and prevents us to reach our goal.

Think Parallel to Elevate

The "Elevate" step needs extra attention especially nowadays given that multicore processors are replacing single core processors, and already spanning from high-end servers to desktops and laptops. The expectation is that multicore processors will sooner or later become the norm. Therefore taking advantage of parallelism in the software development will help to better take advantage of the hardware resources and utilize the benefits of Core architecture.

Parallelism will help developers take advantage of the system and processor resources. Multithreading is an effective way to leverage multi-core processors and a great way to elevate the performance of the constraints. For example, the latency introduced by long floating point related ops such as divisions and their impact can be reduced by multi-threading. The VTune analyzer can help to evaluate the impact of such code regions. The number of division instructions (DIV) and the impact of the division on the execution unit (IDLE_DURING_DIV) can be used for this purpose.

Conclusion

Software optimization is a method of art and requires special attention right from the beginning of any software development cycle. Like Knuth said:" premature optimization is the root of all evil". One needs to identify the area of interest for optimization carefully by leveraging methods mentioned above. Core architecture with its improved PMU support and with the help of VTune analyzer's event based sampling makes systematic performance analysis easier. TOC can provide a new way of looking at software optimization and identifying the constraints on any system. Multicore processors provide great opportunities to increase the performance of applications but require new ways of thinking. Parallelism can be designed and implemented to avoid many potential constraints.

Think parallel when thinking of constraints.