# The Mystery and Magic of Prefix Scan

June 07, 2012

Prefix Scan might be considered an elemental parallel computation. This operation computes all partial sums of a vector of values resulting in a vector, the same size as the original vector, where each element is the sum of the preceding elements in the original vector up to the corresponding position. There are two versions of Prefix Scan: inclusive and exclusive. For an inclusive scan, the result is the sum of all preceding values as well as the value of the element itself. The exclusive version does not include the value of the vector element.

Prefix Scan will work with any associative combining operation. The most common is addition (and I'll use this as my default operation), but the operations of multiplication, maximum, minimum, and logical operations (such as AND, OR, and Exclusive OR) are all possible, too. The combining operation used does not need to be commutative because of the fixed order that is used to combine elements.

Prefix Scan can be used as part of many other parallel algorithms, one of which I will demonstrate in the next post. Before we go there, allow me to employ a bit of prestidigitation and magic words (that Harry Potter doesn't even know) to astound and amaze you with a description and transformations of the Prefix Scan algorithm. If I can't quite bring you to "astounded and amazed," I hope that I can keep you above "bored and yawning."

To start, below is serial code to implement inclusive Prefix Scan of an array of `N` integers, storing the results in corresponding elements of the array `prefixScan`.

``` prefixScan[0] = A[0];
for (int i = 1; i < N; i++)
prefixScan[i] = prefixScan[i-1] + A[i];
```

After a quick pass of my hands over the code and a hearty "Presto Chango!" I find that the code for the exclusive Prefix Scan would initialize the `prefixScan[0]` element with a value of zero and, in the loop, the index of the `A` element will be `[i-1]`.

Before I get too deeply into the topic of Prefix Scan, let's revisit and keep in mind the two concepts that I Illustrated in my last post: work and span. The work of the serial Prefix Scan is `O(N)` since there are `O(N)` addition operations. This is also the span of the serial algorithm. The DAG (Directed Acyclic Graph) for the code above is a simple chain of `O(N)` nodes, each with a single addition operation. The nature of the algorithm suggests that this operation is inherently serial. You can't compute the value of `prefixScan[23]` without computing all the preceding items.

So ("Alakazam!"), here's how you can do it in parallel.

Let's first take a look at how you can implement the inclusive Prefix Scan operation on the PRAM model. For the algorithm we need `N` processors, one per item stored in the arrays, and indexed with the same labels as the corresponding array item, `0` to `N-1`. The very first step is to copy all items from the original array into the results array. Next, the algorithm loops `O(log N)` times. Each loop iteration will have processors reading the value of a preceding item (if it is within the array boundaries) and adding the value to the value of the item with the same index as the processor. The distance between the two array elements to be added together will be powers of 2 starting with 1 (2**0), then 2, then 4, and so on.

As an example, consider the updates done to the [7] element of the results array. During the course of the algorithm execution this element will be updated by values in the [6], [5], and [3] elements. At no time will my fingers leave my hand as I describe the details of these three updates. After the first loop iteration the [7] element will contain the sum of the original values of the [6] and [7] elements. During the second iteration the value of the [5] element, which has the sum of the [4] and [5] items, is added into the [7] element ([6-7]). This will leave the sum of the [4] to [7] values, [4-7], into the [7] item. In the third iteration the value in the [3] item, which now contains the sum of the [0] to [3] original values, is added to the value held in the [7] element. Consequently ("Hocus Pocus!"), the [7] item has the sum of all values [0-7] that precede it in the array.

Below is pseudo-code for the PRAM algorithm. The number of executions of the outer loop is the ceiling of the base-2 logarithm of the length of the array, `N`. The body of the loop is executed in parallel by the (up to) `N` processors. The offset back to the array element value to be added to the element at index `k` is 2 raised to the (`j-1`) power. If the offset indexed element is in the array, the two values are added and the result stored in the `[k]` element.

```for j := 1 to log2(N) do
for all k in [0..N-1] parallel do
if (k ≥ 2**(j-1)) then
x[k] := x[k – 2**(j-1)] + x[k]  // or whatever operation needed
fi
od
od
```

### More Insights

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.

# First C Compiler Now on Github

The earliest known C compiler by the legendary Dennis Ritchie has been published on the repository.

# HTML5 Mobile Development: Seven Good Ideas (and Three Bad Ones)

HTML5 Mobile Development: Seven Good Ideas (and Three Bad Ones)

# Building Bare Metal ARM Systems with GNU

All you need to know to get up and running... and programming on ARM

# Amazon's Vogels Challenges IT: Rethink App Dev

Amazon Web Services CTO says promised land of cloud computing requires a new generation of applications that follow different principles.

# How to Select a PaaS Partner

Eventually, the vast majority of Web applications will run on a platform-as-a-service, or PaaS, vendor's infrastructure. To help sort out the options, we sent out a matrix with more than 70 decision points to a variety of PaaS providers.

More "Best of the Web" >>