# The Mystery and Magic of Prefix Scan

You may have noticed that there isn't any synchronization in the pseudo-code. PRAM algorithms execute in a lockstep fashion. Thus, each step is assumed to complete on all processors, and all variables are assumed to be updated with any new values, before the next step begins. Moreover, when specifying a PRAM model, you must also decide whether concurrent reads of the same shared variable are allowed and, if allowed, how writes to a shared location by more than one processor will be handled. Aside from the possibility of a shared loop iteration variable (`j`), the above algorithm doesn't require concurrent reads or writes, so I don't need to go into that much detail.

As in the serial algorithm, the work for the PRAM Prefix Scan algorithm is `O(N)`. This is a case, though, where Big-O notation hides a significant difference between the work actually done by the serial and parallel algorithms. The serial code uses exactly `N-1` additions to compute the Prefix Scan. The PRAM algorithm uses that many additions in just the first loop iteration. Subsequent loop iterations execute `N/2`, `N/4`, `N/8`, and so on, additions. Summing these up, assuming `N` is a power of 2, yields another `N-1` additions, for a total of `2N-2` addition operations. The PRAM algorithm thus does twice as much work as the serial code. However (and this is the important part), the span of the PRAM algorithm is `O(log N)`. It is the span metric, not work, that determines the execution time of a parallel algorithm.
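A quick way to sanity-check these counts is to tally the additions directly. The sketch below (my own back-of-the-envelope check in Python, not the pseudo-code from above) counts the serial algorithm's additions and sums the per-iteration counts just described:

```python
# Tally the addition counts described above, for a power-of-two N.

def serial_scan_additions(n):
    """A serial inclusive prefix scan performs exactly n - 1 additions."""
    x = list(range(1, n + 1))
    adds = 0
    for i in range(1, n):
        x[i] += x[i - 1]   # one addition per element after the first
        adds += 1
    return adds

def pram_scan_additions(n):
    """Per the text: N-1 additions in the first iteration, then N/2, N/4, ..."""
    adds = n - 1           # first loop iteration
    step = n // 2
    while step >= 1:       # subsequent iterations: N/2 + N/4 + ... + 1 = N - 1
        adds += step
        step //= 2
    return adds

n = 1024
print(serial_scan_additions(n))  # 1023  (N - 1)
print(pram_scan_additions(n))    # 2046  (2N - 2)
```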

Now, most systems don't have an unlimited number of cores, so a more practical algorithm for Prefix Scan is needed for real machines. For some number of cores/threads in a system, I can divide the array into a number of (roughly) equal-sized chunks, one chunk per thread to be used. Each thread first executes a serial Prefix Scan on its assigned chunk. This gives me a set of "partial" prefix scans. To compute the final answer for any chunk, I need the sum of all the elements in all of the preceding chunks.

Luckily, the final element of an inclusive Prefix Scan is the total of all the elements in the array. If I copy each chunk's final element into a new array and run an exclusive Prefix Scan on that array (*"Walla Walla, Washington!"*), each location of the new array will hold the sum of all elements in the chunks that precede the corresponding thread's chunk. Each thread then simply adds its value from the new array to every element of its assigned chunk to compute the final Prefix Scan values.
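The three phases just described can be sketched sequentially; in this illustration (the names and chunk-partitioning details are mine), each loop over chunks stands in for the work one thread would do in parallel:

```python
from itertools import accumulate

def chunked_inclusive_scan(x, p):
    """Three-phase prefix scan over p (roughly) equal-sized chunks."""
    n = len(x)
    bounds = [(i * n // p, (i + 1) * n // p) for i in range(p)]

    # Phase 1: each "thread" runs a serial inclusive scan on its own chunk.
    out = list(x)
    for lo, hi in bounds:
        for i in range(lo + 1, hi):
            out[i] += out[i - 1]

    # Phase 2: gather each chunk's final element (its total) and run an
    # exclusive scan over those totals -- entry k becomes the sum of all
    # elements in the chunks preceding chunk k.
    totals = [out[hi - 1] for _, hi in bounds]
    offsets = [0] + list(accumulate(totals))[:-1]

    # Phase 3: each "thread" adds its offset to every element of its chunk.
    for (lo, hi), off in zip(bounds, offsets):
        for i in range(lo, hi):
            out[i] += off
    return out

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(chunked_inclusive_scan(data, p=4))  # [3, 4, 8, 9, 14, 23, 25, 31]
print(list(accumulate(data)))             # same answer from the library scan
```

The library's `accumulate` produces the same result, which makes the sketch easy to check against.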

If I have `P` cores/threads and `N` is divisible by `P`, what are the work and span of this more real-world implementation? The first phase has `P*(N/P - 1)` total additions to execute a Prefix Scan operation over the local chunks, and the last phase has `P*(N/P)` additions to update the elements in each chunk with the sum of all previous chunks. `P` is likely in the range of 2 to 16, so the Prefix Scan of final chunk elements in the middle phase is expected to be done serially, since the overhead of doing it in parallel would probably exceed the serial execution time. For this part, the work (and span) is `P-1`. Combining the work for the three phases gives a total of `(N - P) + (P - 1) + N = 2N - 1`, just about the same as the PRAM algorithm. The span, assuming zero synchronization cost between phases, will be `2*(N/P - 1) + (P - 1) = 2N/P + P - 3`, which is `O(N)` for practical values of `P`.
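Those totals can also be verified operationally by instrumenting the three-phase scheme with a counter. This is a sketch of my own tally, assuming `N` is divisible by `P`:

```python
def count_chunked_additions(n, p):
    """Run the three-phase scan on range(n) and count every addition."""
    x = list(range(n))
    adds = 0
    size = n // p  # assumes n is divisible by p
    # Phase 1: serial inclusive scan inside each chunk: P*(N/P - 1) additions.
    for c in range(p):
        for i in range(c * size + 1, (c + 1) * size):
            x[i] += x[i - 1]
            adds += 1
    # Phase 2: serial exclusive scan of the P chunk totals: P - 1 additions.
    offsets = [0] * p
    for c in range(1, p):
        offsets[c] = offsets[c - 1] + x[c * size - 1]
        adds += 1
    # Phase 3: add each chunk's offset to its elements: P*(N/P) additions.
    for c in range(p):
        for i in range(c * size, (c + 1) * size):
            x[i] += offsets[c]
            adds += 1
    return adds

print(count_chunked_additions(1024, 8))  # 2047, i.e. 2N - 1
```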

Now that we've reviewed (or revealed) the details of Prefix Scan, let me tie it all together through the almost otherworldly lens of manycore processors. With nothing up my sleeve, watch what happens to the span when I increase the number of cores available to work on the Prefix Scan algorithm. As `P` approaches `N` (i.e., as the number of cores assigned to the computation gets closer to the number of items in the array), the first part of my span formula, `2*(N/P - 1)`, gets closer and closer to a constant. "But wait," you say, "doesn't the `P` term now approach `N`?" True. However, at some point the overhead of the parallelism becomes small enough, relative to the execution time, that I can change the serial middle phase into the PRAM algorithm version. Thus, with a quick *"Abracadabra!"* and a well-timed *"A la peanut butter sandwiches!"*, that term drops to `O(log N)`, which then becomes the span of this algorithm.
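To see the effect numerically, I can evaluate the span formula with the serial middle phase against a variant whose middle phase has logarithmic span. The constant factor `c = 2` on the log term is an assumption of mine for illustration, not a figure from the analysis above:

```python
from math import log2

N = 1 << 20  # roughly a million array elements

def span_serial_middle(p):
    # 2*(N/P - 1) for the two outer phases, plus P - 1 for the serial
    # scan of the P chunk totals (the span formula derived above).
    return 2 * (N // p - 1) + (p - 1)

def span_pram_middle(p, c=2):
    # Same outer phases, but a PRAM-style middle phase whose span
    # grows like c*log2(P); c is an assumed constant factor.
    return 2 * (N // p - 1) + c * int(log2(p))

for p in (8, 1024, N):
    print(p, span_serial_middle(p), span_pram_middle(p))
```

With `P` equal to `N`, the serial middle phase leaves a span of `N - 1` steps, while the PRAM-style middle phase leaves only a logarithmic number of steps.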

And so, as the smoke clears and the rabbit jumps back into my hat, let me leave you with a final thought. Manycore processors, with their hundreds or thousands of cores, can be the "secret ingredient" that increases the value of `P` in my work and span formulas above, making even a practical implementation of Prefix Scan faster and closer to the ideal PRAM execution time than running the algorithm on 2 or 8 cores. With that tidbit in mind, for my next trick, I will show how Prefix Scan can be used to reduce the overall span of the parallel Quicksort algorithm.