The Mystery and Magic of Prefix Scan
Prefix Scan might be considered an elemental parallel computation. This operation computes all partial sums of a vector of values resulting in a vector, the same size as the original vector, where each element is the sum of the preceding elements in the original vector up to the corresponding position. There are two versions of Prefix Scan: inclusive and exclusive. For an inclusive scan, the result is the sum of all preceding values as well as the value of the element itself. The exclusive version does not include the value of the vector element.
Prefix Scan will work with any associative combining operation. The most common is addition (and I'll use this as my default operation), but the operations of multiplication, maximum, minimum, and logical operations (such as AND, OR, and Exclusive OR) are all possible, too. The combining operation used does not need to be commutative because of the fixed order that is used to combine elements.
Prefix Scan can be used as part of many other parallel algorithms, one of which I will demonstrate in the next post. Before we go there, allow me to employ a bit of prestidigitation and magic words (that Harry Potter doesn't even know) to astound and amaze you with a description and transformations of the Prefix Scan algorithm. If I can't quite bring you to "astounded and amazed," I hope that I can keep you above "bored and yawning."
To start, below is serial code to implement inclusive Prefix Scan of an array of
N integers, storing the results in corresponding elements of the array
prefixScan = A; for (int i = 1; i < N; i++) prefixScan[i] = prefixScan[i-1] + A[i];
After a quick pass of my hands over the code and a hearty "Presto Chango!" I find that the code for the exclusive Prefix Scan would initialize the
prefixScan element with a value of zero and, in the loop, the index of the
A element will be
Before I get too deeply into the topic of Prefix Scan, let's revisit and keep in mind the two concepts that I Illustrated in my last post: work and span. The work of the serial Prefix Scan is
O(N) since there are
O(N) addition operations. This is also the span of the serial algorithm. The DAG (Directed Acyclic Graph) for the code above is a simple chain of
O(N) nodes, each with a single addition operation. The nature of the algorithm suggests that this operation is inherently serial. You can't compute the value of
prefixScan without computing all the preceding items.
So ("Alakazam!"), here's how you can do it in parallel.
Let's first take a look at how you can implement the inclusive Prefix Scan operation on the PRAM model. For the algorithm we need
N processors, one per item stored in the arrays, and indexed with the same labels as the corresponding array item,
N-1. The very first step is to copy all items from the original array into the results array. Next, the algorithm loops
O(log N) times. Each loop iteration will have processors reading the value of a preceding item (if it is within the array boundaries) and adding the value to the value of the item with the same index as the processor. The distance between the two array elements to be added together will be powers of 2 starting with 1 (2**0), then 2, then 4, and so on.
As an example, consider the updates done to the  element of the results array. During the course of the algorithm execution this element will be updated by values in the , , and  elements. At no time will my fingers leave my hand as I describe the details of these three updates. After the first loop iteration the  element will contain the sum of the original values of the  and  elements. During the second iteration the value of the  element, which has the sum of the  and  items, is added into the  element ([6-7]). This will leave the sum of the  to  values, [4-7], into the  item. In the third iteration the value in the  item, which now contains the sum of the  to  original values, is added to the value held in the  element. Consequently ("Hocus Pocus!"), the  item has the sum of all values [0-7] that precede it in the array.
Below is pseudo-code for the PRAM algorithm. The number of executions of the outer loop is the ceiling of the base-2 logarithm of the length of the array,
N. The body of the loop is executed in parallel by the (up to)
N processors. The offset back to the array element value to be added to the element at index
k is 2 raised to the (
j-1) power. If the offset indexed element is in the array, the two values are added and the result stored in the
for j := 1 to log2(N) do for all k in [0..N-1] parallel do if (k ≥ 2**(j-1)) then x[k] := x[k – 2**(j-1)] + x[k] // or whatever operation needed fi od od