Core Architecture Cycle Analysis in Detail
Figure 2 gives you an idea of how to approach cycle decomposition and performance analysis exercise. While Figure 3 is one interpretation of Figure 2 for Core architecture with some of the microarchitectural event names given. This is clearly not the complete breakdown of all the events, but nonetheless a good starting point for further analysis.
The idea behind Figure 3 is to identify cycles with the help of VTune Performance Analyzer where: 1) no μops are dispatched for execution, and 2) cycles which are executed but are not retired (due to speculative nature of the processor). The non-retired cycles are basically non-productive cycles and cause ineffective usage of the execution unit.
Cycles dispatching μops can be counted with the RS_UOPS_DISPATCHED.CYCLES_ANY event while cycles where no μops were dispatched (stalls) can be counted with the RS_UOPS_DISPATCHED.CYCLES_NONE event. Therefore the equation given earlier in Formula 1 can be re-written as given in Formula 2. The ratio of RS_UOPS_DISPATCHED.CYCLES_NONE to CPU_CLK_UNHALTED.CORE will tell you the percentage of cycles wasted due stalls. These very stalls can turn the execution unit of a processor into a major bottleneck. The execution unit by definition is always the bottleneck because it defines the throughput and an application will perform as fast as its bottleneck. Therefore it is extremely critical to identify the causes for the stall cycles and remove them if possible.
CPU_CLK_UNHALTED.CORE ~ RS_UOPS_DISPATCHED.CYCLES_ANY + RS_UOPS_DISPATCHED.CYCLES_NONE
Our goal is to determine how we can minimize the causes for the stalls and let the "bottleneck" (i.e, execution unit due to stalls) do to what it is designed to do. In sum, the execution unit should not sit idle and wait for whatever reason.
There are many contributing factors to the stall cycles and sub-optimal usage of the execution unit. Memory accesses (e.g, cache misses), Branch mis-predictions (pipeline flushes as a result), Floating-point (FP) operations (ops) (e.g, long latency operations such as division, fp control word change etc) and μops not retiring due to the out of order (OOO) engine can be given as some of them.
Some of the key events that are used in breaking down the stalls cycles in Figure 3 are given below. VTune can help to sample these events not only on your application but also on the entire system.
- MEM_LOAD_RETIRED.L2_LINE_MISS, MEM_LOAD_RETIRED.L1_LINE_MISS: Number of retired load operations that missed the L2/L1 (respectively) cache. When the event count is multiplied with penalty in cycles, one can estimate the impact on the stalls cycles. However, this approach oversimplifies the fact that the OOO engine can handle multiple outstanding load misses.
- DTLB_MISSES.ANY: Number of Data Table Lookaside Buffer (DTLB) misses. The count includes misses detected as a result of speculative accesses. Typically a high count for this event indicates that the code accesses a large number of data pages. The cost of a DTLB lookup miss is about 10 cycles. The event MEM_LOAD_RETIRED.DTLB_MISS measures the number of load micro-ops that experienced a DTLB miss.
- RESOURCE_STALLS.BR_MISS_CLEAR: Number of cycles after a branch misprediction is detected at execution until the branch and all older micro-ops retire. During this time new micro-ops cannot enter the out-of-order pipeline.
- BR_INST_RETIRED.MISPRED and its ratio to BR_INST_RETIRED.ANY can also give idea on how many of the overall branch instructions were mispredicted.
- RAT_STALLS.ANY: Number of stall cycles due to various reasons, the breakdown of the stalls counted by this event can also be counted separately; see VTune analyzer help).
- Type conversions and long latency operation will also impact the performance of the execution unit. For example:
- Division which is a long latency operation and its impact on the execution unit can be measured with DIV (counts the number of divide ops) and IDLE_DURING_DIV (counts the number of cycles the divider is busy and no other execution unit or load operation is in progress) respectively.
- RESOURCE_STALLS.FPCW: counts the number of cycles while execution was stalled due to writing the floating-point unit (FPU) control word.
- DELAYED_BYPASS.[FP/LOAD/SIMD]: Number of times floating point/load/SIMD operations use data immediately after the data was generated by a non-floating point/non-SIMD unit. Such cases result in one penalty cycle due to data bypass between the units. Normally small penalties such as this one are hidden by out of order (OOO) execution.
- ILD_STALLS: Number of Instruction Length Decoder stall cycles due to a length changing prefix (LCP). The event ILD_STALLS measures the number of times the slow decoder was triggered, the cost of each instance is 6 cycles. This event is front-end related.
Even though OOO engine takes care of small stall penalties (usually anything less 10 cycles), it can be a good exercise to identify the locations of these events to find a correlation with the un-dispatched cycles.
As mentioned above, non-productive cycles (non-retired instruction) utilize the execution unit unnecessarily; thus, identifying and eliminating those instructions is crucial. Although there isn't a direct event to measure the cycles associated with non-retiring μops, Formula 2 and 3 can be used to estimate non-retired cycles by using already performance events.
RS_UOPS_DISPATCHED - (UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED – UOPS_RETIRED.MACRO_FUSION)
RS_UOPS_DISPATCHED - (UOPS_RETIRED.ANY + UOPS_RETIRED. LD_IND_BR + UOPS_RETIRED.STD_STA)