Avoiding Pipeline Stalls in Hyper-Threaded Processors


Richard Gerber and Andrew Binstock are the authors of "Programming with Hyper-Threading Technology".


Due to the architecture of modern processors and their tendency to attempt pre-execution of known independent blocks of code, multi-threaded and parallel programming can be a tricky adventure. It is imperative that developers be aware of certain processor-level issues that may adversely affect the performance of a multi-threaded application. One common source of performance degradation is the pipeline stall. This article explores a few specific situations in which pipeline stalls occur. Even though many of these situations cannot be completely avoided or removed, developers should make the effort to reduce their frequency whenever possible.

Spin Waits and the pause Instruction

The NetBurst architecture is particularly adept at spotting sequences of instructions that it can execute out of original program order, that is, ahead of time. These sequences are characterized by:

  • Having no dependency on other instructions;
  • Not causing side effects that affect the execution of other instructions (such as modifying a global state).

When the processor spots these sequences, it executes the instructions and stores the results. The processor cannot fully retire these instructions because it must first verify that the assumptions made during their speculative execution are correct. To do this, the assumed instruction path and context are compared with the correct instruction path. If the speculation was indeed correct, then the instructions are retired (in program order). However, if the assumptions are wrong, a lot of things can happen. In a particularly bad case, called a full stall, all instructions in flight are terminated and retired in careful sequence, all the pre-executed code is thrown out, and the pipeline is cleared and restarted at the point of incorrect speculation -- this time with the correct path.

One common sequence that the processor frequently executes out of order is the spin wait. This tight loop generally consists of a handful of assembly instructions written here in pseudo-code:


top_of_loop:
	   load x into a register
	   compare to 0
	   if not equal, goto top_of_loop
	   else . . .

The three instructions -- load register, compare, jump -- are ones the speculative execution engine is particularly good at recognizing and blazing through. It sees that the loop does not depend on any variables being calculated by other instructions and so the sequence can be executed without fear of disturbing other instructions. In addition, it knows that if x changes value while the loop is running, this change will be caught before the instructions are retired by the processor. As a result, it grabs this sequence and executes it numerous times and very quickly. In the process, it floods the processor's store of instructions to be retired with the repeated iterations of the loop. With no reason to slow down, the speculative execution continues to crank out the instructions at full tilt. Finally, the variable being waited on changes value. The instruction-retirement logic recognizes this change and triggers the full pipeline stall: it discards all the pre-executed iterations of the loop that are waiting to be retired, it retires all other instructions in flight, and it determines where the pipeline should resume and sets the pipeline to that instruction.

This pipeline stall, however, is not the only downside. Dozens of loop iterations were performed needlessly. This unnecessary work tied up execution units and it flooded the reorder buffer, the area inside the processor that holds speculatively executed instructions prior to their retirement. On a processor with Hyper-Threading Technology, this extra work has a serious, detrimental impact on the performance of the other thread: it starves the second logical processor of resources, so that both threads are effectively incapable of doing any work simply because the loop is spinning so fast.

It is clear that the loop variable cannot change faster than the memory bus can update it. Hence, there is no benefit in pre-executing the loop faster than the time needed for a memory access. By inserting a pause instruction into a loop, the programmer tells the processor to wait (literally to do nothing) for the amount of time equivalent to this memory access. On processors with Hyper-Threading Technology, this respite enables the other thread to use all of the resources on the physical processor and continue processing.

Inserting the pause instruction can be done in one of two ways. With embedded assembly language, it is simply:


_asm
{
   pause
}
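The block syntax above is the Microsoft-style inline assembler, which the Microsoft compiler does not accept when targeting x64. As a rough sketch of an equivalent for other toolchains, the helper below emits the same instruction through GCC/Clang extended asm and falls back to the intrinsic or to nothing elsewhere. The names cpu_pause and pause_n_times are hypothetical, chosen only for this illustration, and the "memory" clobber is an added assumption so that the compiler re-reads the waited-on variable on each iteration.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // declares _mm_pause() for the Microsoft compiler
#endif

// Hypothetical helper: issue one pause hint, or do nothing on targets
// where the instruction is not available.
inline void cpu_pause() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    // GCC/Clang extended asm; the "memory" clobber also acts as a
    // compiler-level barrier (an assumption of this sketch).
    __asm__ __volatile__("pause" ::: "memory");
#elif defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    _mm_pause();
#else
    // Assumption: no equivalent hint is issued on this target.
#endif
}

// Hypothetical helper: issue the hint n times and report how many
// were issued, so the wrapper can be exercised in isolation.
inline int pause_n_times(int n) {
    int issued = 0;
    for (int i = 0; i < n; ++i) {
        cpu_pause();
        ++issued;
    }
    return issued;
}
```

A spin loop would call cpu_pause() once per iteration, in the same position as the pause instruction in the embedded assembly above.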

Using the intrinsics in the Intel C++ compiler and newer versions of the Microsoft C/C++ compiler, the instruction is _mm_pause(). For example, a tight loop might be:


while ( x != synchronization_variable )
   _mm_pause();
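To make the fragment above concrete, here is a minimal sketch of a complete spin wait between two threads. It assumes a C++11 compiler and uses std::atomic for the synchronization variable, which the original fragment does not specify. The names SPIN_PAUSE, spin_until_equal, and run_spin_wait_demo are hypothetical and exist only for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || \
    defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>            // declares _mm_pause()
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() ((void)0)    // assumption: no pause hint on this target
#endif

// Hypothetical helper: spin until `flag` equals `expected`, issuing a
// pause on every iteration. Returns the number of iterations spent waiting.
inline unsigned long spin_until_equal(const std::atomic<int>& flag,
                                      int expected) {
    unsigned long spins = 0;
    while (flag.load(std::memory_order_acquire) != expected) {
        SPIN_PAUSE();
        ++spins;
    }
    return spins;
}

// Hypothetical demo: a producer thread publishes a value and then sets
// the flag; the calling thread spin-waits on the flag and reads the value.
inline int run_spin_wait_demo() {
    std::atomic<int> flag{0};
    int payload = 0;

    std::thread producer([&] {
        payload = 42;                              // written before the flag
        flag.store(1, std::memory_order_release);  // publish
    });

    spin_until_equal(flag, 1);
    assert(flag.load(std::memory_order_acquire) == 1);
    int seen = payload;   // visible because of the release/acquire pairing

    producer.join();
    return seen;
}
```

The acquire load in the loop plays the role of "load x into a register" from the pseudo-code; the pause between loads is what keeps the loop from flooding the reorder buffer with speculative iterations.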

On processors that predate Hyper-Threading Technology, the pause instruction is translated into a no-op, that is, a no-operation instruction, which simply introduces a one-instruction delay.

