Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Parallel

Programming the Cell Processor


The Bitmap: Hitting the Bottleneck

Throughout development, Step 5 has constantly been the bottleneck of the entire process. At the highest degree of optimization, it still visits only between 35- and 114-million edges per second per SPE. This throughput is much less than the other steps. We started with a sequential version of Step 5, which is pretty straightforward (Listing Six; available at http://www.ddj.com/code/), and we optimized it extensively, obtaining Listing Seven (available at http://www.ddj.com/code/). The result is not exactly the most readable code, but it gives you a good idea of what deeply optimized source code for the Cell looks like.

The initial code does not compile efficiently on the Cell: Only 19 registers are used out of 128, the average CPI is 2.18, no SIMD operations are used, only 3 percent of the issues are dual issues (SPEs have two separate instruction pipelines and can issue two instructions at a time, if they don't conflict), and 48.5 percent of the cycles are spent waiting in stalls, either for memory operations or for intermediate results. On the other hand, our final implementation is 3.68-times faster, it uses 82 registers, it is optimally SIMDized, has a CPI equal to 0.99, 31 percent of the issues are dual, and stalls occur in 32.9 percent of the cycles.

Code in step5() (Listing Seven) is divided in two similar portions. The first portion processes vertices owned by the local SPE, which are ready immediately because they did not need to be transferred in the all-to-all exchange. The second portion processes each incoming Qin after waiting for the associated sentinel. Since it takes more time to process a Qin (4.10 microseconds) than to transfer it (1.19 microseconds), and the first Qin does not need to be transferred, the actual total time spent waiting for sentinels is zero.

As for optimizations, we inlined functions and unrolled the loops. Both optimizations reduce branch penalties and create larger basic blocks. With larger basic blocks, the compiler can schedule the instructions more efficiently, decreasing stalls and increasing the dual issue rate.

We employ 128-bit loads and SIMD operations (the spu_xxx intrinsics in Listing Seven) wherever it is convenient. This leads to better register exploitation and lower CPI. We also issue four loads per iteration with "offset pointers"; that is, a base pointer plus a small literal offset. These are compiled as load instructions in the "d-form," which calculates the sum of a register and an immediate offset without the need for a separate address computation instruction.

We got rid of branches by using software speculation. Branches cost up to 24 cycles when predicted incorrectly. So, we set bits in the bitmap unconditionally, even when they were already set. This has no impact on the algorithm's veracity. And we queue vertex codes into Qnext even when they're not needed. Then, we advance the end pointer of the queue only if the vertex was actually added. This allows most of the code to proceed unconditionally.

We use "restrict" pointers. Normally, a sequence of multiple loads causes stalls because the compiler does not try to schedule them concurrently. In fact, this would break the C pointer semantics if the loads alias. And bitmap load/stores are responsible for more than 30 percent of the time spent in this step. To improve performance, we declare as __restrict__ the pointers to values in the bitmap, guaranteeing that no aliasing takes place and allowing the four loads to proceed in parallel. And at graph generation, we shuffle adjacency lists appropriately so that they do not cause aliasing.

Conclusion

In short, the Cell offers an impressive potential for performance. However, due to its architecture and limited support offered by the compiler, you can't expect to exploit this potential by just recompiling your current applications. Applications must be radically redesigned in terms of computation and data transfers. Computational bottlenecks must often be analyzed and addressed manually, and data transfers must be properly orchestrated in order to hide their latency completely under the computational delays. This work has been supported by the Laboratory Directed Research and Development Program for the Data-Intensive Computing Initiative at Pacific Northwest National Laboratory, operated by Battelle for the U.S. Department of Energy under Contract DEAC0576RL01830.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.