Structured parallel patterns are highly relevant to parallel programming. The patterns approach I take here is based on the observation that a relatively small set of task organizations and their relationships to data recur frequently in parallel algorithms. Understanding and using these patterns when appropriate can lead to clearer and more maintainable parallel programs. In turn, supporting these patterns directly in a parallel programming platform (and in particular supporting their composition) enables the efficient development of portable and maintainable, yet high-performance, parallel software. In this article, I continue my catalog of useful patterns with a discussion of the "scan" pattern. A scan arises from a loop-carried dependency such as the following:

a[0] = b[0]; for (int i=1; i<N; i++) { a[i] = f(a[i-1],b[i]); }

Each iteration of this loop depends on data computed in the previous iteration. The data-dependency graph of this code fragment looks something like Figure 1.

Let's call the operation given by **f** the "combiner". In general, if we place no restrictions on the combiner, this pattern should be called a "fold". Note that the combiner takes as input the output of the previous loop iteration as well as a new input value. Folds cannot, in their most general definition, be implemented efficiently in parallel. However, if the combiner is associative, that is, if we can combine adjacent inputs in any order so the combiner satisfies **f(a,f(b,c)) = f(f(a,b),c)**, then efficient parallel implementations are possible. We call this special case of fold a "scan". The parallel implementation of scan depends on reordering the operations using the associativity of the combiner. In fact, there are (at least) three ways to parallelize scan with different performance characteristics. I'll first discuss these three approaches to parallelization, then discuss how scan can be used in practice with an example.

A scan takes a one-dimensional array as input and produces a one-dimensional array of the same size as its output. Every element of the output array is a reduction of all the elements of the input array up to the position of that output element. So one way to parallelize a scan is to form a reduction tree for every output element, and then merge the redundant elements of all those trees. This gives rise to the parallel algorithm in Figure 2. This algorithm invokes the combiner operation O(N lg N) times and takes O(lg N) steps to complete. Note that I use "lg" for "base 2 logarithm"; the base is not important for the "big-O" order notation but later I will give exact values for operation counts.

When discussing parallel algorithms, the total number of operations is called the "work complexity" and the longest path through the data-dependency graph of the algorithm is called the "step complexity". Given an infinite number of parallel processors, the step complexity gives the best-case latency for the algorithm. In practice, though, the work complexity can be important too, since usually we will only have a finite number of processors, and will have to serialize some of the available parallelism. This serialization converts some of the work complexity into latency.

The obvious serial implementation of scan (given by the code given at the start of this article and Figure 1) only invokes the combiner function O(N) times. The work complexity of the naive parallel algorithm in Figure 2 is worse than this, but the step complexity is better. So the naive parallel algorithm would do more work in total, but would still complete faster than the serial algorithm given enough processors. On P processors, parallel scan is NOT generally P times faster than serial scan, that is, it does not scale linearly.

It is possible to derive work-efficient parallel scan algorithms that have O(N) work complexity and O(lg N) step complexity. However, these algorithms typically require more evaluations of the combiner f than the serial implementation. In addition, the longest path (the step complexity) typically increases as well when we try to reduce the work complexity. For example, the graph in Figure 3 shows an alternative "tree" parallel algorithm for scan. This algorithm reorganizes the scans based on a binary expansion of the output position and is work-efficient. However it invokes the combiner **f** 11 times, whereas the original serial algorithm only had to invoke it seven times. It also takes five steps (although it could be compressed to four by merging two "layers" of operators in this specific case) to complete, whereas the algorithm in Figure 2 only required three steps. On the other hand, the tree algorithm is not as regular as the naive algorithm given above, and so in practice may be harder to implement efficiently.

Another way to derive a work-efficient scan algorithm is by "blocking", which is more efficient on a finite number of processors as it creates relatively coarse-grained units of work. First, divide the input array up into **P** equal-sized chunks (for "P processor cores"), then do a scan in parallel on each processor (taking **O(N)** work and **O(N/P)** time). The top-most (reduction) results of these partial scans are then collected into another array and scanned in turn. This takes **O(P)** time using a serial algorithm on one processor, **O(lg P)** if parallelized; although if **P** is small a parallel approach may not be efficient. The k^{th} output of this "master" scan are then used to initialize parallel updates of the **(k+1)**^{th} chunk, taking **O(N - N/P)** work and **O(N/P)** time. The 0th chunk does not have to be updated as it is already correct (you could also do only reductions in the first pass, but then would need an extra sub-scan in the last pass). It is obvious the blocked parallel algorithm takes more work than the single serial scan, since we have to go over the arrays twice, plus the overhead of the "master" scan, but it does take only **O(N)** work altogether, and so is work-efficient.

It's useful to compare the actual number of operations for these various algorithms, since some significant costs and scale factors can hide behind the big-O notation. Table 1 (thanks to Michael Liao for most of this) shows the actual number of operations in the various scan algorithms we have discussed, assuming a serial implementation of the "master" scan in the blocked algorithm.

One of the problems with scan is that on a serial processor we really don't want use the same order as the parallel algorithm, since it takes more work. Therefore, in comparison with reduction, it is harder to get a single consistent and efficient portable implementation that gets exactly the same result on all possible processor configurations when an only "approximately" associative operator, such as floating-point addition, is used as the combiner. This is not a problem, however, for truly associative combiner operations.

Another point worth mentioning is that there are really two definitions of scan: inclusive scan, which does the reduction of the input up to and including the position of the output element, and exclusive scan, which only does the reduction of the sequence in the input before the current output position. The algorithms shown here are for inclusive scan. Inclusive scan has the advantage that the last element of the output is also the reduction of the input. However, exclusive scan is convenient for many algorithms. Exclusive scan can be derived from the inclusive scan by inserting an identity element (which depends on the combiner being used) at the beginning and dropping the last element from the output of the inclusive scan. To avoid confusion in my future posts I intend to call the exclusive scan a "(parallel) prefix" operation and inclusive scan just "(parallel) scan". This is not a universal convention, unfortunately.

So, what are the applications of scan? For the previous patterns based on loop parallelization, map and reduce, it was pretty obvious when they could be applied. Scan is harder to recognize but does show up surprisingly often, and in fact is often used internally in the implementation of some of other important patterns I plan to discuss in the future. Basically, if a serial algorithm has a 1D loop-carried dependency of the form shown at the start of this posting, it is worth checking if this dependency can be converted into a scan.

Applications of scan include manipulation of data structures, where it can be used to compute the positions of objects prior to a gather or scatter. In particular it can be used to efficiently implement pack and split (to be discussed later), which can in turn be used to efficiently implement radix sort. Scan can also be used to compute the integral of a sampled function, using "add" as the combiner. For example, given a histogram, an add-scan can be used to compute the cumulative distribution. This can in turn be used with binary search to generate random samples according to a particular probability distribution, a common requirement in Monte Carlo algorithms.

However, these are relatively obvious applications, based on using "add" as the combiner and the interpretation of scan as "integration". Let's consider a not-so-obvious application based on another combiner. Suppose you have the following code structure that you want to parallelize, where **c** is an array of Boolean flags:

t = a[0]; for (int i=0; i<N; i++) { if (c[i]) t = a[i]; b[i] = t; }

For example, suppose we had the following inputs:

c = [0,0,1,0,1,1,0,0,1,0,0,0,1,0,1] a = [4,6,1,4,3,8,2,7,9,1,4,6,8,9,1]

Applying the above code would generate:

b = [4,4,1,1,3,8,8,8,9,9,9,9,8,8,1]

In other words, the Boolean array **c** marks the start of a "segment" and the value from the input array **a** is "broadcast" over every element of that segment. There is an implied "starting" segment as well that **a[0]** is copied over, whether or not **c[0]** is 1. For simplicity **c** is given as a "parallel" input… but in a real application, it might be the output of yet another scan or other parallel operation.

Consider defining a combiner function **f** that takes two tuples **(C0,V0)** and **(C1,V1)** as input and generates an output tuple **(C,V)**. In these tuples, **C** is a Boolean flag and **V** is a value (same element type as arrays **a** and **b**, above). If we define f as follows, we can use it in the scan pattern to compute the "segment broadcast" shown above:

(C,V) = f((C0,V0),(C1,V1)) = (C0|C1, C1 ? V1 : V0)

Suppose we also have "zip" and "unzip" functions. Define the zip function so that it combines two arrays of the same size into a single array of tuples, while unzip takes them apart (note: this example is pseudocode, not any particular language, platform, or library). So as a preprocess, let's combine the inputs **c** and **a** into a single array **q** using "zip":

q = zip(c,a) = [(0,4),(0,6),(1,1),(0,4),(1,3),(1,8),(0,2),(0,7),(1,9),(0,1),(0,4),(0,6),(1,8),(0,9),(1,1)];

Now let's apply the scan pattern using our combiner "f", using a serial implementation:

p[0] = q[0]; for (int i=1; i<N; i++) { p[i] = f(p[i-1],q[i]); }

What happens? Well, the first element of the output **p[0]** is just a copy of the input **q[0]**, so that gives (0,4). Then, in the first iteration we apply the combiner f to (0,4) and (0,6), giving (0,4) as a result:

p[0] = (0,4) = q[0] p[1] = (0,4) = f(p[0],q[1]) = f((0,4),(0,6)) = (0|0, 0 ? 6 : 4)

Continuing with the other iterations we have:

p[2] = (1,1) = f(p[1],q[2]) = f((0,4),(1,1)) = (0|1, 1 ? 1 : 4) p[3] = (1,1) = f(p[2],q[3]) = f((1,1),(0,4)) = (1|0, 0 ? 4 : 1) p[4] = (1,3) = f(p[3],q[4]) = f((1,1),(1,3)) = (1|1, 1 ? 3 : 1) p[5] = (1,8) = f(p[4],q[5]) = f((1,3),(1,8)) = (1|1, 1 ? 8 : 3) p[6] = (1,8) = f(p[5],q[6]) = f((1,8),(0,2)) = (1|0, 0 ? 2 : 8) p[7] = (1,8) = f(p[6],q[7]) = f((1,8),(0,7)) = (1|0, 0 ? 7 : 8) p[8] = (1,9) = f(p[7],q[8]) = f((1,8),(1,9)) = (1|1, 1 ? 9 : 8) p[9] = (1,9) = f(p[8],q[9]) = f((1,9),(0,1)) = (1|0, 0 ? 1 : 9) p[10] = (1,9) = f(p[9],q[10]) = f((1,9),(0,4)) = (1|0, 0 ? 4 : 9) p[11] = (1,9) = f(p[10],q[11]) = f((1,9),(0,6)) = (1|0, 0 ? 6 : 9) p[12] = (1,8) = f(p[11],q[12]) = f((1,9),(1,8)) = (1|1, 1 ? 8 : 9) p[13] = (1,8) = f(p[12],q[13]) = f((1,8),(0,9)) = (1|0, 0 ? 9 : 8) p[14] = (1,1) = f(p[13],q[14]) = f((1,8),(1,1)) = (1|1, 1 ? 1 : 8)

Now extracting the second element of the output tuples, we have the desired result:

unzip(q,1) = [4,4,1,1,3,8,8,8,9,9,9,9,8,8,1]

Are we done? No, we are not. We have defined a combiner that produces the desired output when used with the serial implementation, but we can't use it with a parallel implementation until we show that the combiner is associative. In other words, we need a proof that the following equality is true:

f(f((C0,V0),(C1,V1)), (C2,V2)) = f((C0,V0), f((C1,V1),(C2,V2)))

It's clear that the "**C0|C1**" part of the combiner is associative, since the Boolean "or" operation is associative. We can prove the conditional assignment part is associative too by converting it into some Boolean masking operations and using Boolean algebra identities. However, associativity can also be inferred from a case analysis, as in Figure 4. If **C2** is 1, we always pick **V2** as the output. If **C2** is 0, then if **C1** is 1, we pick **V1**. We only pick **V0** if both **C1** and **C2** are false. Interestingly, we never care what **C0** is.

Note that associativity is not the same as commutativity. The above combiner does not commute, that is, in general **f(a,b)** does not equal **f(b,a)**; it's asymmetrical. But it is associative, and that's all we care about. We could probably also optimize the above process to get rid of the "or-scan" part of the operation; I'll leave this as an exercise for the reader…

Of course we don't want to have to go through this every time we see a new piece of code. Unfortunately, in general, proving associativity is hard although it is possible to disprove it with a single counter-example. In order to avoid going through the process every time, one approach is to have a library of "examples" of serial dependencies that are equivalent to scans, and then recognize those when they appear in serial code.

Scan is a classic data-parallel operation and the literature contains many interesting examples of its use and discussion of its implementation; for example, see Prefix Sums and Their Applications by Guy E. Blelloch. This paper includes some additional non-obvious examples, such as how to turn general linear recurrences (which show up frequently in applications like signal processing and linear algebra) into scans.

In summary, scan is a useful operation but it can sometimes be hard to see how to apply it in practice. However, often the key to parallelizing an algorithm is recognizing that some unavoidable one-dimensional serial dependency uses an associative combiner and can be converted into a scan. Scan also shows up in the implementation of many other useful operations, such as sort, pack, and segmented collective operations (the "segmented broadcast" above is an example of a segmented collective operation).

### More Patterns