In the partitioning phase of the design process, our efforts are focused on defining as many tasks as possible. This is a useful discipline because it forces us to consider a wide range of opportunities for parallel execution. We note, however, that defining a large number of fine-grained tasks does not necessarily produce an efficient parallel algorithm.
One critical issue influencing parallel performance is communication costs. On most parallel computers, we have to stop computing in order to send and receive messages. Because we typically would rather be computing, we can improve performance by reducing the amount of time spent communicating. Clearly, this performance improvement can be achieved by sending less data. Perhaps less obviously, it can also be achieved by using fewer messages, even if we send the same amount of data. This is because each communication incurs not only a cost proportional to the amount of data transferred but also a fixed startup cost.
In addition to communication costs, we may need to be concerned with task creation costs. For example, the performance of the fine-grained search algorithm illustrated in Figure 13, which creates one task for each search tree node, is sensitive to task creation costs.
Figure 12: Effect of increased granularity on communication costs in a two-dimensional finite difference problem with a five-point stencil. The figure shows fine- and coarse-grained two-dimensional partitions of this problem. In each case, a single task is exploded to show its outgoing messages (dark shading) and incoming messages (light shading). In (a), a computation on an 8x8 grid is partitioned into 8x8=64 tasks, each responsible for a single point, while in (b) the same computation is partioned into 2x2=4 tasks, each responsible for 16 points. In (a), 64x4=256 communications are required, 4 per task; these transfer a total of 256 data values. In (b), only 4x4=16 communications are required, and only 16x4=64 data values are transferred.
If the number of communication partners per task is small, we can often reduce both the number of communication operations and the total communication volume by increasing the granularity of our partition, that is, by agglomerating several tasks into one. This effect is illustrated in Figure 12. In this figure, the reduction in communication costs is due to a surface-to-volume effect . In other words, the communication requirements of a task are proportional to the surface of the subdomain on which it operates, while the computation requirements are proportional to the subdomain's volume. In a two-dimensional problem, the "surface'' scales with the problem size while the "volume'' scales as the problem size squared. Hence, the amount of communication performed for a unit of computation (the communication/computation ratio ) decreases as task size increases. This effect is often visible when a partition is obtained by using domain decomposition techniques.
A consequence of surface-to-volume effects is that higher-dimensional decompositions are typically the most efficient, other things being equal, because they reduce the surface area (communication) required for a given volume (computation). Hence, from the viewpoint of efficiency it is usually best to increase granularity by agglomerating tasks in all dimensions rather than reducing the dimension of the decomposition.
The design of an efficient agglomeration strategy can be difficult in problems with unstructured communications.
We can sometimes trade off replicated computation for reduced communication requirements and/or execution time. For an example, we consider a variant of the summation problem presented in Section 2.3.2, in which the sum must be replicated in each of the N tasks that contribute to the sum.
Figure 13: Using an array (above) and a tree (below) to perform a summation and a broadcast. On the left are the communications performed for the summation (s); on the right, the communications performed for the broadcast (b). After 2(N-1) or 2 log N steps, respectively, the sum of the N values is replicated in each of the N tasks.
A simple approach to distributing the sum is first to use either a ring- or tree-based algorithm to compute the sum in a single task, and then to broadcast the sum to each of the N tasks. The broadcast can be performed using the same communication structure as the summation; hence, the complete operation can be performed in either 2(N-1) or 2 log N steps, depending on which communication structure is used (Figure 13).
These algorithms are optimal in the sense that they do not perform any unnecessary computation or communication. However, there also exist alternative algorithms that execute in less elapsed time, although at the expense of unnecessary (replicated) computation and communication. The basic idea is to perform multiple summations concurrently, with each concurrent summation producing a value in a different task.
We first consider a variant of the array summation algorithm based on this idea. In this variant, tasks are connected in a ring rather than an array, and all N tasks execute the same algorithm so that N partial sums are in motion simultaneously. After N-1 steps, the complete sum is replicated in every task. This strategy avoids the need for a subsequent broadcast operation, but at the expense of (N-1)2 redundant additions and (N-1)2 unnecessary communications. However, the summation and broadcast complete in N-1 rather than 2(N-1) steps. Hence, the strategy is faster if the processors would otherwise be idle waiting for the result of the summation.
The tree summation algorithm can be modified in a similar way to avoid the need for a separate broadcast. That is, multiple tree summations are performed concurrently so that after log Nsteps each task has a copy of the sum. One might expect this approach to result in O(N2) additions and communications, as in the ring algorithm. However, in this case we can exploit redundancies in both computation and communication to perform the summation in just O(N log N) operations. The resulting communication structure, termed a "butterfly", is illustrated in Figure 14. In each of the log N stages, each task receives data from two tasks, performs a single addition, and sends the result of this addition to two tasks in the next stage.
Figure 14: The butterfly communication structure can be used to sum N values in log N steps. Numbers located in the bottom row of tasks are propagated up through intermediate stages, thereby producing the complete sum in each task in the top row.
Figure 15; The communication structures that result when tasks at different levels in a tree or butterfly structure are agglomerated. From top to bottom: a tree, a butterfly, and an equivalent representation of the butterfly as a hypercube. In each case, N=8 , and each channel is labeled with the step in which it is used for communication.
Agglomeration is almost always beneficial if analysis of communication requirements reveals that a set of tasks cannot execute concurrently. For example, consider the tree and butterfly structures illustrated in Figures 8 and 14. When a single summation problem is performed, only tasks at the same level in the tree or butterfly can execute concurrently. (Notice, however, that if many summations are to be performed, in principle all tasks can be kept busy by pipelining multiple summation operations.) Hence, tasks at different levels can be agglomerated without reducing opportunities for concurrent execution, thereby yielding the communication structures represented in Figure 15. The hypercube structure shown in this figure is a fundamental communication structure that has many applications in parallel computing.