Jim Dempsey

Dr. Dobb's Bloggers

Superscalar Programming with HyperThreading and Shared Cache Systems

August 27, 2010

Today's article examines superscalar programming techniques on HyperThreading and shared-cache systems.

For background information, see my five-part article series on Superscalar Programming 101 (Matrix). That series demonstrates superscalar techniques but does not fully demonstrate the relationship between running your HyperThreading-capable system with HyperThreading disabled versus HyperThreading enabled. Here I focus on that relationship.

In typical programming experience, HyperThreading is often quoted as yielding a 15% to 30% boost in performance. See http://en.wikipedia.org/wiki/HyperThreading/ for more information. To some, the interpretation is:

While one thread attains 100%, two threads each attain 57% to 65% of that performance.

Because you want peak performance out of each thread, there is an incentive (a misconception) to turn HyperThreading off to boost the performance of each thread.
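The 57% to 65% range in that interpretation is simple arithmetic: if two HT siblings together deliver (1 + boost) times the throughput of one thread, and the work is split evenly between them, each sibling gets half of that total. The sketch below works this out; the function name per_thread_share and the even-split assumption are illustrative and not taken from the article series.

```cpp
#include <cassert>

// Per-thread share of single-thread throughput when two HT siblings
// split a combined throughput of (1 + boost) evenly.
// `boost` is a fraction (0.15 means a 15% boost).
// Assumption: the two siblings share the core's throughput equally.
double per_thread_share(double boost) {
    assert(boost >= 0.0);
    return (1.0 + boost) / 2.0;
}
```

A 15% boost gives (1 + 0.15) / 2 = 0.575 and a 30% boost gives (1 + 0.30) / 2 = 0.65, which is where the quoted range comes from.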

In my article series, I demonstrated the performance differences among various parallel programming techniques but did not show charts comparing HT-enabled and HT-disabled processors. Here, I begin with a chart of the effects of HT-enabled versus HT-disabled on typical code that is not optimized for cache locality.

Figure 1

Figure 1 shows the performance relationships on a four-processor Intel Xeon X7560 system, with each processor having 8 cores and each core capable of HyperThreading. Technique P is a simple parallel technique that is not cache sensitive. PT is a similar technique that performs a transposition prior to multiplication and attains an 8x performance boost over technique P. PTx is the same as PT with the addition of _mm_... xmm intrinsic functions, and attains a 13x performance boost over technique P. The article series also shows cache-sensitive and significantly faster techniques, but they are counterproductive for this comparative illustration, which is intended to show the benefits gained under customary use.
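To make the difference between P and PT concrete, here is a minimal serial sketch of the two access patterns. It is not the code from the article series: the names Matrix, multiply_plain, and multiply_transposed are hypothetical, the element type double is an assumption, and the threading that P and PT use is left out so that only the memory access order is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // n x n, stored row-major

// Plain product: the inner loop reads b[k * n + j], stepping n elements
// at a time, so consecutive reads of b land in different cache lines.
Matrix multiply_plain(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
    return c;
}

// Transpose b first; the inner loop then reads both operands
// sequentially, one cache line after another.
Matrix multiply_transposed(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix bt(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            bt[j * n + i] = b[i * n + j];

    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
    }
    return c;
}
```

Both functions compute the same product; only the order in which the second matrix is read changes. The sequential inner loop of the transposed form is also the shape that lends itself to xmm intrinsics.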

Other than the dip in chart line P for N between 350 and 650, we find a performance boost with HT of between 700% and 750% over the same processor model with HT turned off.

Why isn't this 15% to 30%? Is black magic involved? Are the data correct?

I will attempt to convince you that both the data and the chart are correct.

On the Intel Xeon X7560 (per processor):

HT-disabled: 8 cores, 1 thread per core, 8 L1 caches, 8 L2 caches, 1 L3 shared cache
HT-enabled: 8 cores, 2 threads per core, 8 L1 caches, 8 L2 caches, 1 L3 shared cache

You might reply, "So…" (pregnant pause here).

How can this be? Lower per-thread performance == faster code?

Under some narrow set of circumstances, the performance boost of HT can be extraordinary.

From http://www.anandtech.com/show/3648/xeon-7500-dell-r810/7/

On the Xeon X7560:

L3: 63 clocks
L2: 9 clocks (7x faster than L3)
L1: 4 clocks (2.25x faster than L2, 15.75x faster than L3)

With the matrix sizes represented in these charts, the entire matrix can be held in cache. As each thread works through its slice of the matrix, 64-byte cache lines are read from one of: L1, L2, L3, or memory (4 memory buses on this system).
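Whether a given problem size fits in cache is easy to estimate. The sketch below computes the footprint of a set of square matrices and the number of 64-byte cache lines that footprint spans. The element type double, the helper names, and the cache capacity passed to fits_in_cache are assumptions for illustration; the article does not state them.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLineBytes = 64;  // cache line size quoted in the article

// Bytes occupied by `count` square n x n matrices.
// Assumption: elements are doubles.
std::size_t matrix_footprint_bytes(std::size_t n, std::size_t count) {
    assert(n > 0 && count > 0);
    return count * n * n * sizeof(double);
}

// Number of cache lines needed to cover `bytes`, rounded up.
std::size_t cache_lines_needed(std::size_t bytes) {
    return (bytes + kCacheLineBytes - 1) / kCacheLineBytes;
}

// True when the footprint is no larger than the given cache capacity.
bool fits_in_cache(std::size_t bytes, std::size_t cache_bytes) {
    return bytes <= cache_bytes;
}
```

For example, three matrices (two inputs and one result) at N = 650 occupy 3 * 650 * 650 * 8 = 10,140,000 bytes when the elements are 8-byte doubles, which you can compare against the L3 capacity of your own processor.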

With HT off, the thread's source of data is one of:

its own L1 cache
its own L2 cache
the shared L3 cache

With HT on, the thread's source of data is one of:

its own L1 cache
its HT sibling's L1 cache (same cache)
its own L2 cache
its HT sibling's L2 cache (same cache)
the shared L3 cache

Note that although HT siblings share a single L1 and a single L2 cache, some parallel functions benefit because data fetched by one thread can be used by its HT sibling. In this case, one HT sibling pays the price of 63 clock ticks for the fetch from L3 into L1/L2, and the other HT sibling benefits by finding (some of) that data already in the L1/L2 cache.
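A rough way to see why this matters is to compute the average cost of a load as a weighted mix of the L1 and L3 latencies quoted earlier. This is a deliberately simplified model, not a measurement: it ignores L2, main memory, and any overlap of latency with computation, and the hit fractions you feed it are hypothetical.

```cpp
#include <cassert>

// Latencies in clocks, as quoted in the article for the Xeon X7560.
constexpr double kL1Clocks = 4.0;
constexpr double kL3Clocks = 63.0;

// Average clocks per load when `l1_hit_fraction` of loads hit L1
// and all remaining loads are served from L3.
// Simplification: L2 and main memory are not modeled.
double average_load_clocks(double l1_hit_fraction) {
    assert(l1_hit_fraction >= 0.0 && l1_hit_fraction <= 1.0);
    return l1_hit_fraction * kL1Clocks
         + (1.0 - l1_hit_fraction) * kL3Clocks;
}
```

In this model, every load that one sibling finds in L1 because the other sibling already fetched the line costs 4 clocks instead of 63. Raising the L1 hit fraction from 0 to 0.5, for instance, lowers the average from 63 to 33.5 clocks per load.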

When using the higher-performing, cache-optimized techniques Parallel Tag Team Transpose (PTT) and Parallel Tag Team Transpose with _mm_... xmm intrinsics (PTTx), we find less benefit from HT at the smaller matrix sizes.

Figure 2

What HT buys you is that under some (many) circumstances, the cache-load and memory-load latencies experienced by one HT sibling can be exploited by the other HT sibling. If you code accordingly, you too can experience significant benefits from HT-capable processors.

These levels of benefit from HT can be attained in certain sections of your code. It would be wise to keep this in mind before you decide to turn off HT.
