One of the most common patterns is the Reduction. A reduction combines all the elements in a collection into one using an associative two-input, one-output operator. Given n elements in the collection, using the operator any two adjacent elements can be chosen and combined into one to give n-1 elements. This process can be repeated until there is only one element. If the operator is addition, then a reduction computes the sum of all the elements in a collection. If it is maximum, then the reduction computes the largest value in the collection.
Reductions are used in many algorithms to compute error metrics and termination conditions for iterative algorithms. In addition, reduction may be itself be used in parallel as part of another algorithm. Matrix multiplication, for example, can be seen as a "map" of "reductions" over "partitions" formed from the rows and columns of the input matrices, where Map and Partition are patterns. Reductions are also used extensively in Monte Carlo simulations (using in finance, computer graphics, image processing, and a variety of other fields) were averages and variances of a large number of random simulations need to be computed.
Recall that an operator *is associative if it satisfies (a * b) * c = a * (b * c), which means that the result is independent of the order in which the combinations are performed. If this is the case, then a wide variety of different operation orders can be used in implementing the reduction. Modular integer arithmetic (where overflow wraps around), logic operations such as AND, OR, and XOR, and minimum and maximum are all associative.
Unfortunately, floating point arithmetic is only approximately associative: due to round-off errors, different orders can give different results. In order to achieve deterministic results with floating point reductions, a specific order must be chosen in advance. In order to get portable results, the operators themselves must be portable (for example, both machines need to be IEEE floating-point compliant) and the same order of execution needs to be used on both machines.
The two most natural orders for implementing reduction are shown in the diagrams below. Let's call the one on the left the serial order and the one on the right the binary tree order.
Due to the chain of dependencies in the serial order, it cannot be executed in parallel. However, every horizontal "level" of the binary tree order can be executed in parallel over lg n passes, and if the operator is fully associative, the serial order can be reorganized into the binary tree order.
Interestingly, regardless of the value of n the binary tree order does not require any more executions of the operator than the serial order: both require n-1 executions of the operator. However, the binary tree order does use more memory: each "pass" requires an array of temporary values as output, and so requires O(n) space. The serial order, in contrast, requires only a single intermediate value and so requires only O(1) space. In practice, though, we have to read in the input, and efficient off-chip memory access typically delivers an entire block of data at once (for example, a cache line). Since we can reuse the space for the inputs for the intermediate values when using the binary tree order, no extra space is actually required. Still, the serial order will tend to be more efficient since it can use registers for its intermediate result.
As a counter to this, the binary tree order is often more numerically stable if the values being added are all the same sign and roughly the same magnitude, as is often the case in Monte Carlo simulations. Consider a summation of a million elements using single-precision arithmetic, where all the input elements have the same sign and are roughly the same magnitude. Single precision floating point only has about 6 to 7 decimal digits of precision. Towards the end of the chain in the serial order, new values from the input end up getting added to a very large value in the "accumulator". In the worst case, the relatively small input values get rounded down relative to the large accumulator value and are ignored. Large reductions done in serial order then have an unfortunate tendency to "plateau" and return an incorrect answer if performed with only single precision. In contrast, the binary tree order tends to combine values of roughly similar magnitude for all intermediate computations given the same input.
Regardless of how a basic reduction module is built, larger reductions can be built from it. Suppose we have a four-way reduction "block"; we can build a parallel 16-way reduction using two passes, where the first pass uses four parallel blocks and the last uses only one.
We do not care how the basic block is implemented. One way, therefore, to get a fast parallel reduction is to use a serial order in the basic blocks and then compose them in parallel. This approach can be parametrized by the length of the serial block. In general, the optimal value of this parameter will depend on the number of registers available and other factors like cache block sizes, but there should be a fair range of serial block sizes that are efficient on a given machine.
In order to achieve deterministic execution, you only have to pick a specific order and stick to it. In order to achieve portability between processors, you need to use the same order (in particular, the same serial basic block size as well as the same tree structure) on all processors. In practice, while it may not be possible to simultaneously use the optimal parameter values for each machine, a consistent size somewhat larger than 2 (say, 8) should perform reasonably well on a variety of architectures while still being portable. The biggest portability concern will usually appear when converting from serial code to parallel code, since that will often require converting from a serial order to a parallelizable binary tree order.
The Intel RapidMind platform supports reductions natively, as well as multidimensional reductions. For example, with a multidimensional reduction it is possible to reduce the rows or columns of a 2D array, giving a 1D array of results. The Intel RapidMind platform also supports a form of reduction we call "category reduction", which is similar to the type of reduction used in Google's map-reduce approach to parallelism, where only elements with the same "category label" are combined.