Channels ▼


Programming the Cell Processor

Source Code Accompanies This Article. Download It Now.

Fitting In

Listing Two has a major issue—it assumes no limits on the size of its data structures. But the space in an LS is limited—code, stack, global data, and heap must fit in 256 KB. Therefore, we leave Q, Qnext, and G in main memory, and operate on smaller buffers, called bQ, bQnext, and bG, respectively, which are allocated in the LS. Figure 1 describes the new algorithm.

Figure 1: A dataflow diagram of our final implementation of the algorithm.

Step 1 fetches a portion of Q into buffer bQ via a DMA transfer. We hide the latency of this transfer with a double-buffering technique. In short, while buffer 0 is being processed, new data are transferred into buffer 1. When we finish processing buffer 0 and transferring buffer 1, we swap the buffers: Buffer 1 comes into use, and another transfer is initiated on buffer 0; see Listing Three (available at The same technique is used for putting bQnext into main memory (in Step 6).

Step 2 scans vertices in bQ and loads the respective adjacency lists in bG until bG is full. Also, bG is managed as a double buffer. But adjacency lists are sparse blocks in main memory, therefore, multiple transfers are needed. We carry them out with a DMA list. The Cell supports DMA lists up to 2048 transfers, each involving up to 16 KB. You must know the size of each transfer in advance to set it up in the DMA list, but there is no obvious way to know the size of an adjacency list before loading it. We solve this with a hack— we represent vertices with vertex codes; for instance, a 32-bit word where the upper bits contain the vertex identifier, and the lower bits contain its adjacency list length. The results are given in Listing Four (available at, where macros VERTEX_CODE_TO_INDEX and VERTEX_CODE_TO_LENGTH extract the two fields from a vertex code. With the help of the length field, Step 2 can stuff the maximum possible amount of adjacency lists into bG, minimizing wasted space.

Step 3 splits the adjacency lists loaded by Step 2 into the respective Qouts. To expedite this process, at graph generation, we prepare adjacency lists in the graph in a split per-SPE format. A header specifies the offset and length of each per-SPE portion of that list.

Step 4 is an all-to-all exchange in which each SPE delivers each of the Qout queues into a Qin inside its respective recipient. To detect at recipient side when each Qin is ready, we use a sentinel. The sentinel is a flag that the sender sets when the transfer is complete. To detect transfer completion, the recipient waits for the sentinel to change value. Hardware support is employed to guarantee that sentinels are not transferred before their payload. That is, we interleave payload and sentinels in a DMA list and employ a DMA list with a barrier; for instance, intrinsic mfc_putlb in Listing Five (available at

For maximum efficiency, we avoid the transfer of Qout[i] to the same SPE i, and we schedule transfers in a circular fashion that exploits the geography of the EIB. The send order is computed by init_all_to_all() and stored in Qout_send_order; see Listing Five. The order and coordinates of transfers are known in advance, so the DMA list can be prepared once for all at program initialization. Thanks to this, Step 4 boils down to just a call to mfc_putlb.

Step 5 scans the vertices in each incoming Qin queue; if they were not marked before, it adds them to its Qnext and marks them. Step 5 is the most expensive step of the algorithm and hardest to optimize.

Step 6 commits Qnext to Qnext in main memory with double buffering. The same considerations made for Step 1 apply.

To achieve the best results, we overlap computation and data transfers as much as possible. Due to double buffering, an iteration takes two "scheduling periods," but iterations are pipelined, so one iteration completes at each period. With the aforementioned parameters, a period lasts 47 microseconds and processes 2700 edges on average. The latency of data transfers is completely hidden. Computations never have to wait for data being transferred. The overall latency is determined by the sum of all computations.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.