Channels ▼
RSS

Tools

How The Theory of Constraints Can Help Software Optimization


Core Architecture Cycle Analysis in Detail

Figure 2 gives you an idea of how to approach cycle decomposition and performance analysis exercise. While Figure 3 is one interpretation of Figure 2 for Core architecture with some of the microarchitectural event names given. This is clearly not the complete breakdown of all the events, but nonetheless a good starting point for further analysis.

Figure 2: Performance Events Drill-Down and Software Tuning Feedback Loop

The idea behind Figure 3 is to identify cycles with the help of VTune Performance Analyzer where: 1) no μops are dispatched for execution, and 2) cycles which are executed but are not retired (due to speculative nature of the processor). The non-retired cycles are basically non-productive cycles and cause ineffective usage of the execution unit.

Figure 3: One interpretation of Figure 2 with some of the microarchitectural event names.

Cycles dispatching μops can be counted with the RS_UOPS_DISPATCHED.CYCLES_ANY event while cycles where no μops were dispatched (stalls) can be counted with the RS_UOPS_DISPATCHED.CYCLES_NONE event. Therefore the equation given earlier in Formula 1 can be re-written as given in Formula 2. The ratio of RS_UOPS_DISPATCHED.CYCLES_NONE to CPU_CLK_UNHALTED.CORE will tell you the percentage of cycles wasted due stalls. These very stalls can turn the execution unit of a processor into a major bottleneck. The execution unit by definition is always the bottleneck because it defines the throughput and an application will perform as fast as its bottleneck. Therefore it is extremely critical to identify the causes for the stall cycles and remove them if possible.


CPU_CLK_UNHALTED.CORE ~ RS_UOPS_DISPATCHED.CYCLES_ANY + RS_UOPS_DISPATCHED.CYCLES_NONE

Formula 2

Our goal is to determine how we can minimize the causes for the stalls and let the "bottleneck" (i.e, execution unit due to stalls) do to what it is designed to do. In sum, the execution unit should not sit idle and wait for whatever reason.

There are many contributing factors to the stall cycles and sub-optimal usage of the execution unit. Memory accesses (e.g, cache misses), Branch mis-predictions (pipeline flushes as a result), Floating-point (FP) operations (ops) (e.g, long latency operations such as division, fp control word change etc) and μops not retiring due to the out of order (OOO) engine can be given as some of them.

Some of the key events that are used in breaking down the stalls cycles in Figure 3 are given below. VTune can help to sample these events not only on your application but also on the entire system.

  • MEM_LOAD_RETIRED.L2_LINE_MISS, MEM_LOAD_RETIRED.L1_LINE_MISS: Number of retired load operations that missed the L2/L1 (respectively) cache. When the event count is multiplied with penalty in cycles, one can estimate the impact on the stalls cycles. However, this approach oversimplifies the fact that the OOO engine can handle multiple outstanding load misses.
  • DTLB_MISSES.ANY: Number of Data Table Lookaside Buffer (DTLB) misses. The count includes misses detected as a result of speculative accesses. Typically a high count for this event indicates that the code accesses a large number of data pages. The cost of a DTLB lookup miss is about 10 cycles. The event MEM_LOAD_RETIRED.DTLB_MISS measures the number of load micro-ops that experienced a DTLB miss.
  • RESOURCE_STALLS.BR_MISS_CLEAR: Number of cycles after a branch misprediction is detected at execution until the branch and all older micro-ops retire. During this time new micro-ops cannot enter the out-of-order pipeline.
    • BR_INST_RETIRED.MISPRED and its ratio to BR_INST_RETIRED.ANY can also give idea on how many of the overall branch instructions were mispredicted.
  • RAT_STALLS.ANY: Number of stall cycles due to various reasons, the breakdown of the stalls counted by this event can also be counted separately; see VTune analyzer help).
  • Type conversions and long latency operation will also impact the performance of the execution unit. For example:
    • Division which is a long latency operation and its impact on the execution unit can be measured with DIV (counts the number of divide ops) and IDLE_DURING_DIV (counts the number of cycles the divider is busy and no other execution unit or load operation is in progress) respectively.
    • RESOURCE_STALLS.FPCW: counts the number of cycles while execution was stalled due to writing the floating-point unit (FPU) control word.
  • DELAYED_BYPASS.[FP/LOAD/SIMD]: Number of times floating point/load/SIMD operations use data immediately after the data was generated by a non-floating point/non-SIMD unit. Such cases result in one penalty cycle due to data bypass between the units. Normally small penalties such as this one are hidden by out of order (OOO) execution.
  • ILD_STALLS: Number of Instruction Length Decoder stall cycles due to a length changing prefix (LCP). The event ILD_STALLS measures the number of times the slow decoder was triggered, the cost of each instance is 6 cycles. This event is front-end related.

Even though OOO engine takes care of small stall penalties (usually anything less 10 cycles), it can be a good exercise to identify the locations of these events to find a correlation with the un-dispatched cycles.

As mentioned above, non-productive cycles (non-retired instruction) utilize the execution unit unnecessarily; thus, identifying and eliminating those instructions is crucial. Although there isn't a direct event to measure the cycles associated with non-retiring μops, Formula 2 and 3 can be used to estimate non-retired cycles by using already performance events.


RS_UOPS_DISPATCHED -   (UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED – UOPS_RETIRED.MACRO_FUSION) 
Formula 3


RS_UOPS_DISPATCHED -   (UOPS_RETIRED.ANY + UOPS_RETIRED. LD_IND_BR + UOPS_RETIRED.STD_STA) 

Formula 4


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video