Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Parallel

Programming the Cell Processor


The Problem

To illustrate the peculiarities of Cell programming, we use the Breadth-First Search (BFS) on a graph. Despite its simplicity, this algorithm is important because it is a building block of many applications in computer graphics, artificial intelligence, astrophysics, national security, genomics, robotics, and the like.

Listing One is a minimal BFS implementation in C. Variable G contains the graph in the form of an array of adjacency lists. G[i].length tells how many neighbors the i-th vertex has, which are in G[i].neighbors[0], G[i].neighbors[1], and so on. The vertex from which the visit starts is in variable root. A BFS visit proceeds in levels: First, the root is visited, then its neighbors, then its neighbors' neighbors, and so on. At any time, queue Q contains the vertices to visit in the current level. The algorithm scans every vertex in Q, fetches its neighbors, and adds each neighbor to the list of vertices to visit in the next level, Qnext. To prevent being caught in loops, the algorithm avoids visiting those vertices that have been visited before. To do so, it maintains a marked array of Boolean variables. Neighbors are added to Qnext only when they are not already marked, then they get marked. At the end of each level, Q and Qnext swap, and Qnext is emptied.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* ... */

/* the graph */
vertex_t * G;

/* number of vertices in the graph */
unsigned card_V;

/* root vertex (where the visit starts) */
unsigned root;

void parse_input( int argc, char** argv );

int main(int argc, char ** argv)
{
  unsigned *Q, *Q_next, *marked;
  unsigned  Q_size=0, Q_next_size=0;
  unsigned  level = 0;

  parse_input(argc, argv);
  graph_load();

  Q      = 
          (unsigned *) calloc(card_V, sizeof(unsigned));
  Q_next = 
          (unsigned *) calloc(card_V, sizeof(unsigned));
  marked = 
          (unsigned *) calloc(card_V, sizeof(unsigned));

  Q[0] = root;
  Q_size  = 1;
  while (Q_size != 0)
    {
      /* scanning all vertices in queue Q */
      unsigned Q_index;
      for ( Q_index=0; Q_index<Q_size; Q_index++ )
      {
        const unsigned vertex = Q[Q_index];
        const unsigned length = G[vertex].length;
        /* scanning each neighbor of each vertex */
        unsigned i;
      for ( i=0; i<length; i++)
          {
            const unsigned neighbor =
              G[vertex].neighbors[i];
      if( !marked[neighbor] ) {
            /* mark the neighbor */
            marked[neighbor]      = TRUE;
            /* enqueue it to Q_next */
            Q_next[Q_next_size++] = neighbor;
          }
        }
      }
      level++;
      unsigned * swap_tmp;
      swap_tmp    = Q;
      Q           = Q_next;
      Q_next      = swap_tmp;
      Q_size      = Q_next_size;
      Q_next_size = 0;
    }
  return 0;
}
Listing One

On a Pentium 4 HT running at 3.4 GHz, this algorithm is able to check 24-million edges per second. On the Cell, at the end of our optimization, we achieved a performance of 538-million edges per second. This is an impressive result, but came at the price of an explosion in code complexity. While the algorithm in Listing One fits in 60 lines of source code, our final algorithm on the Cell measures 1200 lines of code.

Let's Get Parallel

The first step in adapting programs to a multicore architecture is making it parallel. The basic idea is to split loop for (Q_index=0; Q_index<Q_size; Q_index++)... among different SPEs. Then you access a lock marked by the protection of a synchronization mechanism. Locks work fine in cache-coherent shared-memory machines with uniform memory and limited threads, but scale poorly on multicore systems. Instead, we partition both the vertex space and the marked array evenly among the SPEs. Each SPE explores only the vertices it owns, and forwards the others to their respective owners. Function which_owner() returns the owner of a given vertex identifier.

Rather than synchronizing at a fine grain, we adopt a Bulk-Synchronous Parallel (BSP) approach. In BSP, an algorithm is split in steps, and all the cores execute the same step at the same time. After each step, there is a barrier; see barrier() in Listing Two (available at http://www.ddj.com/code/). At a barrier, whoever finishes first waits for all the others to complete before proceeding to the next step. The BSP approach is very common in the parallel programming community because it greatly simplifies the control flow and the communication protocols, at the expense of a negligible performance penalty.

Listing Two is a pseudo-C rendition of the algorithm in BSP form. The code is executed by each of the SPEs, numbered from 0 to Nspe-1. Each SPE examines the neighbor lists of each of the vertices in its Q, encountering neighbors that belong to different SPEs. It dispatches them, putting those owned by the i-th SPE in queue Qout[i]. Then, an all-to-all exchange takes place, which routes the vertices to their respective owners. Each Qout[i] is sent to the i-th SPE, which receives it into Qin[s], where s is the identifier of the sender SPE. Next, each SPE examines the incoming adjacency lists, and marks and adds vertices to its private Qnext as before. The entire algorithm is complete when all the SPEs find their Qs empty. This is done via a reduce_all operation, which performs a distributed addition of all the variables Q_size among all the SPEs.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.