Superscalar Programming with HyperThreading and Shared Cache Systems
Today's article examines superscalar programming techniques on HyperThreading and shared cache systems.
For background information, see my five-part article series, Superscalar Programming 101 (Matrix). That series demonstrates superscalar techniques but does not fully show how running a HyperThreading-capable system with HyperThreading disabled compares with running it with HyperThreading enabled. Here I focus on that relationship.
It is often quoted that, for typical programs, HyperThreading yields a 15% to 30% boost in performance. See http://en.wikipedia.org/wiki/HyperThreading/ for more information. To some, the interpretation is:
While one thread attains 100%, two threads each attain 57% to 65% performance.
Because you want peak performance out of each thread, there is an incentive (a misconception) to turn HyperThreading off to boost the performance of each thread.
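The arithmetic behind that interpretation can be sketched as follows. This is only an illustration: the function name perThreadShare is hypothetical, and it assumes the core's total throughput is split evenly between the two HT siblings.

```cpp
#include <cassert>

// Fraction of a lone thread's speed that each HT sibling would get if the
// core's total throughput rose by `boost` (0.15 means 15%) and was split
// evenly between the two siblings. The even split is an assumption.
double perThreadShare(double boost) {
    assert(boost >= 0.0);
    return (1.0 + boost) / 2.0;
}
```

With a 15% boost this gives 0.575 per thread, and with a 30% boost it gives 0.65, which is where the 57% to 65% range comes from.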
In my article series, I demonstrated the performance differences among various parallel programming techniques but did not show charts comparing HT-enabled and HT-disabled processors. Here, I begin with a chart of the effects of HT enabled versus HT disabled on typical code that is not optimized for cache locality.
Figure 1 shows the performance relationships on a 4-processor Intel Xeon X7560 system, with each processor having 8 cores and each core capable of HyperThreading. Technique P is a simple parallel technique that is not cache sensitive. PT is a similar technique that performs a transposition prior to multiplication and attains an 8x performance boost over technique P. PTx is the same as PT with the addition of _mm_... xmm intrinsic functions, and attains a 13x performance boost over technique P. The article series also shows cache-sensitive and significantly faster techniques, but they are counterproductive in this comparative illustration, which is intended to show the benefits gained under customary use.
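For readers who have not seen the series, here is a minimal sketch of what the difference between P and PT might look like. It is not the code from the series: the names Matrix, forEachRowSlice, multiplyP and multiplyPT are hypothetical, the matrices are assumed square and row-major, and std::thread stands in for whatever threading model the original used.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Matrix = std::vector<double>;  // n x n, row-major: (i, j) is m[i * n + j]

// Hypothetical helper: split rows [0, n) into contiguous slices, one per thread.
void forEachRowSlice(std::size_t n, unsigned nThreads,
                     const std::function<void(std::size_t, std::size_t)>& body) {
    assert(nThreads > 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nThreads; ++t)
        pool.emplace_back(body, n * t / nThreads, n * (t + 1) / nThreads);
    for (auto& th : pool) th.join();
}

// In the spirit of technique P: the inner loop walks down a column of b,
// so consecutive reads of b are n doubles apart.
void multiplyP(const Matrix& a, const Matrix& b, Matrix& c,
               std::size_t n, unsigned nThreads) {
    forEachRowSlice(n, nThreads, [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < n; ++k)
                    sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
    });
}

// In the spirit of technique PT: transpose b once, after which both
// inner-loop operands are read with unit stride.
void multiplyPT(const Matrix& a, const Matrix& b, Matrix& c,
                std::size_t n, unsigned nThreads) {
    Matrix bt(n * n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            bt[j * n + k] = b[k * n + j];
    forEachRowSlice(n, nThreads, [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < n; ++k)
                    sum += a[i * n + k] * bt[j * n + k];
                c[i * n + j] = sum;
            }
    });
}
```

Both functions add the products in the same order, so they produce identical results. Only the memory access pattern of the inner loop differs, and that is the part the transposition is meant to improve.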
Other than the dip in chart line P for N between 350 and 650, we find a performance boost with HT of between 700% and 750% over the same processor(s) with HT turned off.
Why isn't this 15% to 30%? Is black magic involved? Are the data correct?
I will attempt to convince you that the data, and the chart drawn from them, are correct.
On the Intel X7560 (per processor):
HT-disabled: 8 cores, 1 thread per core, 8 L1 caches, 8 L2 caches, 1 L3 shared cache
HT-enabled: 8 cores, 2 threads per core, 8 L1 caches, 8 L2 caches, 1 L3 shared cache
You might reply "so…." (pregnant pause here)
How can this be? Lower performance == faster code??
Under some narrow set of circumstances, the performance boost of HT can be extraordinary.
From http://www.anandtech.com/show/3648/xeon-7500-dell-r810/7/
On Xeon X7560
L3 63 clocks
L2 9 clocks (7x faster than L3)
L1 4 clocks (2.25x faster than L2, 15.75x faster than L3)
With the matrix sizes represented in the charts above, the entire matrix can be held in cache. As each thread works through its slice of the matrix, 64-byte cache lines are read from one of L1, L2, L3, or memory (4 memory buses on this system).
With HT off the thread's source of data is one of:
its own L1 cache
its own L2 cache
the shared L3 cache
With HT on the thread's source of data is one of:
its own L1 cache
its HT sibling's L1 cache (same cache)
its own L2 cache
its HT sibling's L2 cache (same cache)
the shared L3 cache
Note that the HT siblings share the same L1 and L2 caches, so some parallel functions benefit when data fetched by one thread can be used by the other HT sibling. In that case, one HT sibling pays the price of the 63 clock ticks for the fetch from L3 to L1/L2, and the other HT sibling gets the use of (some of) that data from the L1/L2 cache.
When using the higher-performing cache-optimized techniques, Parallel Tag Team Transpose (PTT) and Parallel Tag Team Transpose with _mm_... xmm intrinsics (PTTx), we find less benefit from HT at the smaller matrix sizes.
What HT buys you is that, under some (many) circumstances, the cache load and memory load latencies experienced by one HT sibling can be exploited by the other HT sibling. If you code accordingly, you too can experience significant benefits from HT-capable processors.
These levels of benefit from HT can be attained in certain sections of your code. It would be wise to keep this in mind before you decide to turn off HT.
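To put rough numbers on the sibling-sharing effect, the sketch below uses the X7560 latencies quoted in this article. It is a best-case illustration: the function name averageClocksPerLine is hypothetical, and the model assumes perfect reuse (every sibling needs the same cache line and it is never evicted in between), which real code will only approximate.

```cpp
#include <cassert>

// Latencies in clocks, as quoted in the article for the Xeon X7560.
constexpr double kL1Clocks = 4.0;
constexpr double kL2Clocks = 9.0;
constexpr double kL3Clocks = 63.0;

// Average clocks per cache line when `threadsSharing` threads on one core
// all need the same line: the first pays for the fetch from L3, and the
// rest find the line in the shared L1. Assumes perfect reuse (best case).
double averageClocksPerLine(int threadsSharing) {
    assert(threadsSharing >= 1);
    return (kL3Clocks + (threadsSharing - 1) * kL1Clocks) / threadsSharing;
}
```

Under these assumptions a lone thread pays 63 clocks per line fetched from L3, while two HT siblings sharing the line average (63 + 4) / 2 = 33.5 clocks each. How close real code gets to that figure depends on how much of the fetched data both siblings actually use.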

