Parallel Computation Patterns
We will now introduce a collection of parallel computation patterns. We have divided parallel patterns into two categories: computational patterns, which operate on data values, and data access patterns, which do not. These are often combined, and many of the computational patterns in this section also access and update data in specific (and typically coherent) ways.
The map parallel computation pattern applies a function to every element of a collection (or set of collections with the same shape), and creates a new collection (or set of collections) with the results from the function invocations. The order of execution of the function invocations is not specified, which allows for parallel execution. If the functions are pure functions with no side effects, then the map operation is deterministic while succinctly allowing the specification of a large amount of parallelism. In general, the (pure) functions used by the map can also recursively support other kinds of serial and parallel patterns and data management.
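As an illustrative sketch (the function name `scale` is hypothetical, not from the text), Python's `concurrent.futures` can express a map over a pure function; because `scale` has no side effects, the unspecified order of invocation cannot change the result:

```python
from concurrent.futures import ThreadPoolExecutor

def scale(x):
    # Pure function: no side effects, so invocation order does not matter.
    return 2 * x

values = [1, 2, 3, 4]
with ThreadPoolExecutor() as pool:
    # The pool may evaluate the invocations in any order, but the output
    # collection has the same shape and ordering as the input.
    result = list(pool.map(scale, values))
# result == [2, 4, 6, 8], identical to the serial list(map(scale, values))
```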
The map operation accesses data for input and output in a way that exposes useful spatial coherence. Many functions are executed at once, and it is known in advance which functions access neighboring values in the input and output collections. This makes it possible to automatically implement a variety of serial, parallel, and memory optimizations in the implementation of the map function, including software pipelining, cache prefetch and eviction, and cache boundary alignment. If the behavior of neighboring elements in a map can be assumed to lead to similar control flow behavior, then some simple approaches to vectorization based on masking can also be effective.
A reduction applies a pairwise associative operation to all the elements of a collection, reducing it to a single element.
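One way to sketch the pairwise structure (the helper name `tree_reduce` is illustrative) is a tree-shaped reduction, where each sweep combines independent pairs that could run in parallel; correctness relies on the operator being associative:

```python
def tree_reduce(op, xs):
    # Pairwise, tree-shaped reduction: each sweep halves the collection,
    # and the pairs within a sweep are independent of one another.
    # Requires op to be associative; takes O(log n) sweeps.
    xs = list(xs)
    while len(xs) > 1:
        pairs = [op(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:          # an odd leftover element is carried forward
            pairs.append(xs[-1])
        xs = pairs
    return xs[0]
```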
Sometimes, when writing a function intended to be used in a map operation, it is desired to also compute a reduction at the same time. A good example is an iterative solver. The inner loop of such a solver usually performs both a matrix vector operation and a reduction, the latter being used to test convergence. In general, efficient implementations will need to fuse patterns together. There are other examples, such as pack, where fusion is even more important for performance.
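A hypothetical fused inner loop for such a solver might look like the following sketch (all names are illustrative): the matrix-vector product (the map part) and the squared residual used for the convergence test (the reduction part) are computed in a single pass over the rows:

```python
def fused_matvec_residual(A, x, b):
    # Fuses a map (one dot product per row of A) with a reduction
    # (sum of squared residuals against b, used to test convergence),
    # so each row of A is read only once.
    sq_err = 0.0
    y = []
    for row, bi in zip(A, b):
        yi = sum(a * xj for a, xj in zip(row, x))  # map part
        y.append(yi)
        sq_err += (yi - bi) ** 2                   # fused reduction part
    return y, sq_err
```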
Some other forms of reduction are sometimes used. These can be seen as fusions of pure reductions with other patterns. Multidimensional reductions (for example, reductions of the rows of an array) can be expressed by combining a partitioning pattern with a map and a reduction. In a category reduction, an operator labels each element, and then a reduction is applied to all elements with the same label. The Google map-reduce programming model is based on a single fused map and category reduction operation combined with the serial execution patterns.
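A serial sketch of a category reduction (the helper name `category_reduce` is hypothetical) labels each element and then reduces each label's bucket; in a parallel implementation the buckets can be reduced independently:

```python
from collections import defaultdict
from functools import reduce

def category_reduce(op, label, values):
    # Label each element, then reduce all elements that share a label.
    # Each bucket's reduction is independent and could run in parallel.
    buckets = defaultdict(list)
    for v in values:
        buckets[label(v)].append(v)
    return {k: reduce(op, vs) for k, vs in buckets.items()}

# Sum values by parity: the odd elements reduce to 9, the even ones to 6.
by_parity = category_reduce(lambda a, b: a + b, lambda v: v % 2, [1, 2, 3, 4, 5])
```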
C. Superscalar Sequences
Sequence is a fundamental serial pattern. In the sequence pattern, one operation is completely finished before another one is started. However, when the operations are pure functions without side effects, the operations given in a sequence only need to be ordered by their data dependencies, which in the case of pure functions are made explicit.
In general a sequence generates a DAG (task graph) of data dependencies. A simple asynchronous execution rule allows for parallelism while still permitting serial reasoning by the programmer: if a task tries to read data that is not yet ready, it blocks until the input data is ready.
Although in this pattern the input code is conceptually serial, the data dependencies in the graph allow independent tasks to execute in parallel.
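A minimal sketch of this execution rule uses futures (all task bodies here are hypothetical): the code reads as a serial sequence, each task blocks only on the data it actually reads, and the independent tasks `b` and `c` are free to run in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

# A conceptually serial sequence of pure tasks. Futures make the data
# dependencies explicit: calling .result() blocks until the input is ready,
# which is exactly the asynchronous execution rule described above.
with ThreadPoolExecutor(max_workers=4) as pool:
    a = pool.submit(lambda: 2)
    b = pool.submit(lambda: a.result() + 1)    # depends on a
    c = pool.submit(lambda: a.result() * 10)   # depends on a, not on b
    d = pool.submit(lambda: b.result() + c.result())
    result = d.result()
# The task graph is a DAG: a -> {b, c} -> d, so b and c may overlap.
```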
Under the superscalar model, direct communication and synchronization between tasks using message passing is not permitted. In fact, tasks do not need to be simultaneously active, and their execution may even be serialized. Instead of unstructured low-level communication, two other structured patterns for sharing and communicating data between simultaneously active tasks can be used: the pipeline pattern and nested parallelism. The pipeline pattern allows for producer-consumer communication, while the nested parallelism pattern allows for child-parent communication.
A pipeline is a set of simultaneously active tasks or "stages" that communicate in a producer-consumer relationship. A pipeline is not expressible as a superscalar task graph, since in a pipeline the data in a stage is persistent and stages are conceptually activated at the same time, unlike the tasks in a superscalar task graph. Pipelines are common in image and signal processing. Their model of local state update is a form of coherence not covered by other patterns. In addition, pipelines can be used to parallelize serially dependent activities ("folds") like compression and decompression.
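The producer-consumer structure of a pipeline can be sketched with Python generators (stage names are illustrative). Note this sketch is serial; a real pipeline would run the stages concurrently, for example as threads connected by queues, but the per-stage local state and the producer-consumer dataflow are the same:

```python
def produce(n):
    # First stage: produces a stream of items.
    for i in range(n):
        yield i

def square(stage_in):
    # Middle stage: consumes its producer lazily and keeps only local state.
    for x in stage_in:
        yield x * x

def total(stage_in):
    # Final stage: drains the pipeline.
    s = 0
    for x in stage_in:
        s += x
    return s

result = total(square(produce(5)))  # 0 + 1 + 4 + 9 + 16
```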
Pipelines by themselves are not a complete solution to parallelization since pipelines tend to have a fixed number of stages. As such, they do not automatically scale to a large number of cores. However, pipelines can provide a useful multiplier on parallelism in otherwise difficult-to-parallelize problems.
Recursion is another fundamental serial control flow pattern. It is also associated with stack-based data allocation, which has good data coherence properties. When parallel patterns are nested recursively, they can be used to spawn additional parallel tasks. This allows a program to generate an arbitrary amount of nested parallelism. This form of nested parallelism is distinct from the form of nested parallelism that can be derived from segmented collective operations. However, it may be possible in many cases to identify certain patterns of more general recursive nested parallelism and map them into segmented operations for efficiency.
Nested parallelism can be invoked simply by invoking parallel patterns inside other parallel patterns, for example, by using a reduction or a map inside a function used inside another reduction or map. This generates a hierarchical task graph that can be expanded as needed to generate additional parallelism. The nested parallelism can be either task-parallel or data-parallel.
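As a sketch (the data and helper name are hypothetical), a map over the rows of a matrix whose element function itself performs a reduction generates exactly this kind of hierarchical task graph:

```python
from concurrent.futures import ThreadPoolExecutor

def row_sum(row):
    # Inner reduction; this could itself be expanded into a parallel
    # tree reduction if the row were large enough to warrant it.
    return sum(row)

matrix = [[1, 2], [3, 4], [5, 6]]
with ThreadPoolExecutor() as pool:
    totals = list(pool.map(row_sum, matrix))  # outer map over rows
grand_total = sum(totals)                     # outer reduction over row sums
# totals == [3, 7, 11]; grand_total == 21
```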
As a practical matter, arbitrary amounts of parallelism may not be useful. One of the advantages of deterministic parallel patterns is that they are all consistent with a specific serial ordering. An implementation needs to target a "grain size" that is most efficient for a given hardware target. Tasks that are too small need to be merged into larger serial tasks, while large serial tasks need to be decomposed into parallel tasks, preferably automatically. Serial consistency allows this to happen automatically without changing the result of the program.
F. Scans and Recurrences
A recurrence expresses one output from a function in terms of prior outputs. Recurrences often occur in serial code due to the use of loop-carried dependencies, but in certain cases they can be parallelized. One-dimensional recurrences can be parallelized into logarithmic time implementations if the dependency is associative, in which case it is usually called a scan [Blelloch 1990]. Multidimensional recurrences with a nesting depth of n can also always be parallelized over n-1 dimensions, even if the operator is not associative, using Lamport's hyperplane theorem [Lamport 1974].
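One simple formulation of the logarithmic-depth scan is the following sketch (the name `inclusive_scan` is illustrative): it makes O(log n) sweeps, each of which is itself a parallelizable map, and its correctness relies on the operator being associative. (This formulation does O(n log n) total work; work-efficient variants exist but are longer.)

```python
def inclusive_scan(op, xs):
    # Log-depth inclusive scan: each sweep is an independent, parallelizable
    # map over the elements. Requires op to be associative.
    xs = list(xs)
    step = 1
    while step < len(xs):
        xs = [xs[i] if i < step else op(xs[i - step], xs[i])
              for i in range(len(xs))]
        step *= 2
    return xs
```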
A 1D recurrence, even if its operator is not associative, is common and is often known as a fold. Folds will typically need to be implemented serially, although a sequence of folds inside a map can be transformed into a parallel implementation using pipelines. As with reductions, there is the fundamental problem of identifying associative functions to allow parallelization as scans, as well as the problem of operations, such as floating-point arithmetic, that are only approximately associative.
Examples of recurrences include integration and infinite impulse response (recursive) filters. Many matrix factorization algorithms, such as Cholesky factorization, can also often be expressed as recurrences.
Scans (and, in general, recurrences) over segmented and partitioned collections can also be implemented efficiently in a load-balanced form even in the case of segmented arrays where the sub-arrays may not all be the same size. Using such balanced primitive operations, it is possible to implement, for example, a balanced parallel form of recursive and "irregular" algorithms such as quicksort.
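A serial reference sketch of a segmented scan (the name `segmented_scan` and the flag representation are illustrative) shows the semantics: a flag marks the start of each segment, and the scan restarts at every segment boundary. The load-balanced parallel form comes from rewriting this as an ordinary scan over (flag, value) pairs with a modified associative operator, so the log-depth scan machinery applies unchanged regardless of segment sizes:

```python
def segmented_scan(op, xs, flags):
    # flags[i] == True marks the start of a new segment. Within a segment
    # the scan accumulates; across a boundary it restarts from the new value.
    out = []
    for x, f in zip(xs, flags):
        if f or not out:
            out.append(x)
        else:
            out.append(op(out[-1], x))
    return out

# Two segments: [1, 2] and [3, 4, 5].
result = segmented_scan(lambda a, b: a + b,
                        [1, 2, 3, 4, 5],
                        [True, False, True, False, False])
```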