### Global Communication

A global communication operation is one in which many tasks must participate. When such operations are implemented, it may not be sufficient simply to identify individual producer/consumer pairs. Such an approach may result in too many communications or may restrict opportunities for concurrent execution. For example, consider the problem of performing a parallel reduction operation, that is, an operation that reduces N values distributed over N tasks using a commutative associative operator such as addition:

$$ S = \sum_{i=0}^{N-1} X_i . $$

Let us assume that a single "manager" task requires the result S of this operation. Taking a purely local view of communication, we recognize that the manager requires the values $X_0$, $X_1$, etc., from tasks 0, 1, etc. Hence, we could define a communication structure that allows each task to communicate its value to the manager independently. The manager would then receive the values and add them into an accumulator (Figure 6). However, because the manager can receive and sum only one number at a time, this approach takes $O(N)$ time to sum N numbers -- not a very good parallel algorithm!
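The centralized approach can be sketched as follows. This is a minimal illustration, not message-passing code: the list of values and the loop stand in for the N tasks delivering their values to the manager one at a time, so the manager performs all N-1 additions itself.

```python
# Illustrative sketch of the centralized reduction of Figure 6.
# The manager receives and sums one value per step, so summing
# N values takes N sequential receive/add steps.
values = [3, 1, 4, 1, 5, 9, 2, 6]  # one value per task (illustrative data)

accumulator = 0
steps = 0
for v in values:        # the manager receives one number at a time
    accumulator += v    # ... and adds it into the accumulator
    steps += 1

print(accumulator)  # 31
print(steps)        # 8 sequential steps for N = 8 values
```

The step count grows linearly with N, which is exactly the $O(N)$ behavior noted above.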

This example illustrates two general problems that can hinder efficient parallel execution in algorithms based on a purely local view of communication:

- The algorithm is *centralized*: it does not distribute computation and communication. A single task (in this case, the manager task) must participate in every operation.
- The algorithm is *sequential*: it does not allow multiple computation and communication operations to proceed concurrently.

We must address both these problems to develop a good parallel algorithm.

### Distributing Communication and Computation

We first consider the problem of distributing the computation and communication associated with the summation. We can distribute the summation of the N numbers by making each task $i$, $0 < i < N-1$, compute the sum

$$ S_i = X_i + S_{i+1} , $$

where $S_{N-1} = X_{N-1}$ and $S_0 = S$ is the complete sum.

The communication requirements associated with this algorithm can be satisfied by connecting the N tasks in a one-dimensional array (Figure 7). Task N-1 sends its value to its neighbor in this array. Tasks 1 through N-2 each wait to receive a partial sum from their right-hand neighbor, add this to their local value, and send the result to their left-hand neighbor. Task 0 receives a partial sum and adds this to its local value to obtain the complete sum. This algorithm distributes the N-1 communications and additions, but permits concurrent execution only if multiple summation operations are to be performed. (The array of tasks can then be used as a pipeline through which partial sums flow.) A single summation still takes N-1 steps.
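The one-dimensional array algorithm can be sketched sequentially as below. This is an illustration of the data flow, not actual message-passing code: the loop plays the role of the chain of tasks, with the running partial sum standing in for the message passed from each task to its left-hand neighbor.

```python
# Sketch of the array summation of Figure 7: partial sums flow
# from task N-1 leftward to task 0.
values = [3, 1, 4, 1, 5, 9, 2, 6]   # X_i held by task i (illustrative data)
N = len(values)

partial = values[N - 1]              # task N-1 sends its value X_{N-1}
for i in range(N - 2, 0, -1):        # tasks N-2 .. 1: receive, add, send
    partial = values[i] + partial    # S_i = X_i + S_{i+1}
total = values[0] + partial          # task 0 forms the complete sum S

print(total)  # 31, after N-1 = 7 communication/addition steps
```

The additions are distributed over the tasks, but they still form a single sequential chain of N-1 steps, matching the observation above.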

### Uncovering Concurrency: Divide and Conquer

Opportunities for concurrent computation and communication can often be uncovered by applying a problem-solving strategy called divide and conquer. To solve a complex problem (such as summing N numbers), we seek to partition it into two or more simpler problems of roughly equivalent size (e.g., summing N/2 numbers). This process is applied recursively to produce a set of subproblems that cannot be subdivided further (e.g., summing two numbers). The strategy is summarized in Algorithm 2.1. The divide-and-conquer technique is effective in parallel computing when the subproblems generated by problem partitioning can be solved concurrently. For example, in the summation problem, we can take advantage of the following identity ($N = 2^n$, $n$ an integer):

$$ \sum_{i=0}^{2^n - 1} X_i \;=\; \sum_{i=0}^{2^{n-1} - 1} X_i \;+\; \sum_{i=2^{n-1}}^{2^n - 1} X_i . $$

The two summations on the right-hand side can be performed concurrently. They can also be further decomposed if $n > 1$, to give the tree structure illustrated in Figure 8. Summations at the same level in this tree of height $n$ can be performed concurrently, so the complete summation can be achieved in $\log N$ rather than N steps.
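The tree summation can be sketched level by level. Again this is an illustration rather than parallel code: each pass of the loop corresponds to one level of the tree in Figure 8, and all the pairwise additions within a level are independent of one another, so on a parallel machine they could execute concurrently.

```python
# Sketch of the divide-and-conquer summation: values are combined
# pairwise, level by level, so an N = 2^n element sum completes in
# n = log2(N) levels.
values = [3, 1, 4, 1, 5, 9, 2, 6]   # N = 8 = 2^3 (illustrative data)

level = values
levels = 0
while len(level) > 1:
    # All additions at one tree level are mutually independent and
    # could be performed concurrently by separate tasks.
    level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    levels += 1

print(level[0])  # 31
print(levels)    # 3 levels for N = 8, i.e. log2(N) steps instead of N-1
```

The total number of additions is still N-1, but they are now spread over only $\log N$ concurrent levels.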

In summary, we observe that in developing an efficient parallel summation algorithm, we have distributed the N-1 communication and computation operations required to perform the summation and have modified the order in which these operations are performed so that they can proceed concurrently. The result is a regular communication structure in which each task communicates with a small set of neighbors.