Floorplan Algorithm Design
Partition. Algorithm 2.2, like Algorithm 1.1, has no obvious data structure to which we can apply domain decomposition techniques. Hence, we use a fine-grained functional decomposition in which each search tree node is explored by a separate task. As noted earlier, this means that new tasks will be created in a wavefront as the search progresses down the search tree, which will tend to be explored in a breadth-first fashion. Notice that only tasks on the wavefront can execute concurrently. We also need to address the issue of how to manage the Amin value, which must be accessed by all tasks. For now, we assume that it is encapsulated in a single task with which other tasks will communicate.
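The wavefront behavior can be sketched as follows. All names here are illustrative assumptions, not part of Algorithm 2.2: a hypothetical IMPLS list gives each cell's implementation count, and a breadth-first queue stands in for the set of runnable tasks, so that only nodes on the current frontier are eligible to execute.

```python
from collections import deque

# Hedged sketch of the fine-grained decomposition: each search-tree node
# is a task, identified by the tuple of implementation choices made so far.
IMPLS = [2, 3, 2]  # assumption: cell i has IMPLS[i] implementations

def expand(node):
    """Create one child task per implementation of the next cell."""
    depth = len(node)
    if depth == len(IMPLS):
        return []                      # leaf: a complete floorplan
    return [node + (k,) for k in range(IMPLS[depth])]

def wavefront_search():
    """Breadth-first exploration: the queue holds the current wavefront."""
    frontier = deque([()])             # root task
    leaves = []
    while frontier:
        node = frontier.popleft()
        children = expand(node)
        if children:
            frontier.extend(children)  # new tasks join the wavefront
        else:
            leaves.append(node)
    return leaves

print(len(wavefront_search()))  # 2*3*2 = 12 complete floorplans
```

Note how the queue grows as the wavefront widens: with no pruning, every node in the tree eventually becomes a task, which is exactly the granularity problem addressed later in the agglomeration phase.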
A quick review reveals one deficiency in this design. The breadth-first exploration strategy is likely to decrease performance dramatically by delaying discovery of solution nodes and hence reducing the amount of pruning that occurs, thereby leading to considerable redundant computation. We must bear this issue in mind in subsequent design phases.
Communication. In a parallel implementation of simple search (Algorithm 1.1), tasks can execute independently and need communicate only to report solutions. In contrast, branch-and-bound search requires communication during execution in order to obtain and update the search bound Amin. In designing a communication structure to achieve this goal, we need to trade off the benefits of frequent accesses to a centralized Amin value (which tends to reduce the amount of the search tree that must be explored) against communication costs.
One approach is to encapsulate responsibility for maintaining Amin in a centralized task, with which each task communicates when a solution is produced or a bound is required. This approach is simple and may even be efficient if communication is cheap, evaluating a node is expensive, and the number of processors is not too large. However, the centralized approach is inherently nonscalable. Since the manager must take a certain amount of time to process a request, the maximum rate at which it can service requests, and hence the maximum number of tasks that can execute concurrently, is bounded.
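A rough sketch of this centralized scheme follows; the class and method names are assumptions, and a lock-protected object stands in for the manager task with which workers would communicate. Workers read the bound before expanding a node and report each solution they find.

```python
import threading

# Hedged sketch of a centralized Amin manager (names are illustrative).
class AminManager:
    def __init__(self):
        self._lock = threading.Lock()
        self._amin = float("inf")

    def read_bound(self):
        """Return the current best (smallest) known area."""
        with self._lock:
            return self._amin

    def report_solution(self, area):
        """Keep the smaller of the current bound and the new solution."""
        with self._lock:
            if area < self._amin:
                self._amin = area
            return self._amin

manager = AminManager()

def worker(areas):
    # Each worker prunes against the shared bound before reporting.
    for a in areas:
        if a < manager.read_bound():
            manager.report_solution(a)

threads = [threading.Thread(target=worker, args=([30, 25],)),
           threading.Thread(target=worker, args=([40, 20],))]
for t in threads: t.start()
for t in threads: t.join()
print(manager.read_bound())  # 20
```

The lock in this sketch plays the role of the manager's serialized request processing: because every read and update passes through it, the rate at which requests can be serviced is bounded, which is precisely the scalability limit noted above.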
Various refinements to this centralized scheme can be imagined. We can modify Algorithm 2.2 to check Amin only periodically, for example when a depth counter incremented on each recursive call is an integer multiple of a specified frequency parameter. Or, we can partition the tree into subtrees, each with its own Amin submanager, and organize periodic exchanges of information between these submanagers. For example, submanagers can perform broadcast operations when they discover significantly better solutions.
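The depth-counter refinement might be sketched as follows, with an illustrative FREQ, an assumed tree shape, and a simulated central manager: the bound is re-read only at depths that are multiples of FREQ, and a possibly stale cached copy is used in between.

```python
# Hedged sketch of periodic bound checking (all parameters are assumptions).
FREQ = 2                   # re-read the central bound every FREQ levels
central_amin = 18          # stands in for the manager's current bound
manager_reads = 0          # counts communications with the manager

def read_central_bound():
    global manager_reads
    manager_reads += 1
    return central_amin

def search(depth, partial_area, cached_bound, max_depth=4):
    if depth == max_depth:             # leaf: a complete floorplan
        return
    if depth % FREQ == 0:              # periodic refresh of the bound
        cached_bound = read_central_bound()
    if partial_area >= cached_bound:   # prune against (possibly stale) copy
        return
    for cost in (1, 2):                # two hypothetical implementations
        search(depth + 1, partial_area + cost, cached_bound, max_depth)

search(0, 0, float("inf"))
print(manager_reads)  # 5 bound reads rather than 15 (one per interior node)
```

The trade-off is visible in the counter: fewer manager requests mean lower communication cost, but a stale cached bound can delay pruning and so admit some redundant computation.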
Agglomeration. In the agglomeration phase of the design process we start to address practical issues relating to performance on target computers. In the floorplan optimization problem, this means we must address two potential deficiencies of the fine-grained algorithm that we have developed. The first will be familiar from earlier problems: the cost of creating a large number of fine-grained tasks. This can be addressed using agglomeration, unless we believe that node evaluation is sufficiently expensive and task creation sufficiently cheap for the fine-grained algorithm to be efficient. For example, we can create one task for each search call in the foreach statement of Algorithm 2.2 until we reach a specified depth in the tree, and then switch to a depth-first strategy, thereby creating a single task that evaluates search calls in sequence (Figure 30). If the switch to depth-first search is performed at depth D and cell ci has I(ci) implementations, then in the absence of pruning this technique creates I(c1) × I(c2) × ... × I(cD) tasks.
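A minimal sketch of this agglomeration scheme, under an assumed IMPLS list of implementation counts and switch depth D: the tree is expanded to depth D, creating one coarse-grained task per depth-D node, and each task then completes its subtree sequentially, depth-first. All names are illustrative.

```python
from itertools import product

IMPLS = [2, 3, 2, 2]   # assumption: I(c1)..I(c4)
D = 2                  # switch to depth-first search at this depth

def coarse_tasks():
    """One task per depth-D node: I(c1) * ... * I(cD) tasks in all."""
    return list(product(*(range(n) for n in IMPLS[:D])))

def depth_first_finish(path):
    """Sequentially enumerate completions of one coarse-grained task."""
    count = 0
    stack = [path]                     # explicit stack gives depth-first order
    while stack:
        node = stack.pop()
        d = len(node)
        if d == len(IMPLS):
            count += 1                 # a complete floorplan
        else:
            stack.extend(node + (k,) for k in range(IMPLS[d]))
    return count

tasks = coarse_tasks()
print(len(tasks))                                  # I(c1)*I(c2) = 6 tasks
print(sum(depth_first_finish(t) for t in tasks))   # all 2*3*2*2 = 24 leaves
```

Each task is fully described by its length-D path vector, which is why, as noted under "Mapping" below, the data movement cost of shipping such a task to a worker is small.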
The second potential deficiency is more subtle and relates to the scheduling of tasks rather than to their creation. In the absence of explicit programmer control, we can assume that the tasks created to evaluate search tree nodes will execute either in the order that they are created or perhaps in a random order. In either case, the search tree tends to be explored in a breadth-first fashion. This is undesirable because it tends to reduce the effectiveness of pruning and hence cause redundant computation. The solution to this problem is to control the order in which search tree nodes are explored. That is, we must implement a task-scheduling algorithm. Because this is really a mapping issue, we discuss it under "Mapping."
Mapping. Recall that when we use a task-scheduling strategy, tasks (search tree nodes) become "problems" to be executed by one of a smaller number of "worker" tasks, typically one per processor. Workers generate new search problems as they expand nodes, and request new search problems each time they complete previously assigned problems. Requests can be handled using a centralized or decentralized strategy.
We can imagine a variety of alternative task-scheduling schemes for the floorplan optimization problem. One approach works in conjunction with the agglomeration scheme of Figure 30. A central manager first constructs a number of coarse-grained tasks, by exploring the search tree to depth D. These tasks are then assigned to idle workers in a demand-driven manner. Because each task can be represented by a short vector representing the path taken to its position in the tree, the data movement costs associated with this scheme are not high. Furthermore, because each processor executes one subtree at a time in a depth-first fashion, pruning is effective.
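Demand-driven assignment could look roughly like this: the manager's pre-expanded depth-D tasks sit on a shared queue, and each idle worker pulls the next one. The tree shape, worker count, and helper names are assumptions for illustration.

```python
from queue import Queue, Empty
import threading

IMPLS = [2, 3, 2]   # assumption: implementation counts per cell
D = 2               # manager expands the tree to this depth

def make_tasks():
    """Enumerate depth-D nodes; each task is a short path vector."""
    tasks = [()]
    for depth in range(D):
        tasks = [t + (k,) for t in tasks for k in range(IMPLS[depth])]
    return tasks

def count_leaves(path):
    """Stands in for depth-first completion of one assigned task."""
    n = 1
    for depth in range(len(path), len(IMPLS)):
        n *= IMPLS[depth]
    return n

task_q = Queue()
for t in make_tasks():
    task_q.put(t)          # the "manager": a queue of coarse-grained tasks

results = []
lock = threading.Lock()

def worker():
    while True:
        try:
            path = task_q.get_nowait()   # request a task when idle
        except Empty:
            return                       # no work left
        leaves = count_leaves(path)
        with lock:
            results.append(leaves)

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers: w.start()
for w in workers: w.join()
print(sum(results))   # 2*3*2 = 12 leaves covered across 6 tasks
```

Because a task is only a length-D tuple, handing one to a worker moves almost no data; the expensive part, the subtree evaluation, stays local to the worker.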
An interesting variant of this approach combines elements of both redundant work and cyclic mapping to avoid the need for a central manager. Every worker expands the tree to depth D. Then, each worker takes responsibility for a disjoint subset of the tasks generated. (This subset could be identified using a cyclic allocation strategy, for example.) Only if a worker becomes idle does it ask other workers for tasks.
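This decentralized variant might be sketched as follows (tree shape and worker count are assumptions; the work-stealing step for idle workers is omitted): every worker redundantly expands the tree to depth D, then keeps only the tasks that a cyclic allocation assigns to it.

```python
IMPLS = [2, 3, 2]   # assumption: implementation counts per cell
D = 2
NWORKERS = 4        # assumption

def expand_to_depth(d):
    """Redundant expansion: every worker runs this identical computation."""
    tasks = [()]
    for depth in range(d):
        tasks = [t + (k,) for t in tasks for k in range(IMPLS[depth])]
    return tasks

def my_tasks(worker_id):
    """Cyclic allocation: worker w owns tasks w, w+N, w+2N, ..."""
    all_tasks = expand_to_depth(D)
    return [t for i, t in enumerate(all_tasks) if i % NWORKERS == worker_id]

owned = [my_tasks(w) for w in range(NWORKERS)]
print([len(ts) for ts in owned])          # 6 tasks split cyclically: [2, 2, 1, 1]
assert sum(len(ts) for ts in owned) == 6  # the subsets are a disjoint cover
```

No communication is needed to agree on the allocation, since every worker computes the same task list and the same ownership rule; the redundant expansion to depth D is the price paid for eliminating the central manager.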
A third strategy, more complex but also more general, is initially to allocate the root node to a single worker. Load balancing is then achieved by causing workers with empty queues to request problems from other workers. Each worker can then enforce a local depth-first search strategy, and hence increase the amount of pruning, by ordering its queue of search problems according to their depth in the tree. This lets the worker select problems far from the root for local execution, while handing problems nearer the root to other workers.
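The depth-ordered queue at the heart of this strategy could be sketched as follows; the class name and the (depth, path) entries are assumptions. A worker executes the deepest problems itself, approximating depth-first search, and hands the shallowest problems, which represent the largest subtrees, to idle workers that ask.

```python
import heapq

class WorkQueue:
    """Local queue of search problems, ordered by depth in the tree."""

    def __init__(self):
        self._heap = []     # max-heap on depth via negated keys

    def add(self, depth, path):
        heapq.heappush(self._heap, (-depth, path))

    def take_local(self):
        """Deepest problem: executed locally, keeping search depth-first."""
        d, path = heapq.heappop(self._heap)
        return -d, path

    def give_away(self):
        """Shallowest problem: handed to a requesting idle worker."""
        i = max(range(len(self._heap)), key=lambda j: self._heap[j][0])
        d, path = self._heap.pop(i)
        heapq.heapify(self._heap)       # restore heap order after removal
        return -d, path

q = WorkQueue()
for depth, path in [(1, ("a",)), (3, ("a", "b", "c")), (2, ("a", "b"))]:
    q.add(depth, path)

print(q.take_local())  # (3, ('a', 'b', 'c'))  deepest problem stays local
print(q.give_away())   # (1, ('a',))           shallowest goes to an idle worker
```

Giving away shallow problems is deliberate: a problem near the root stands for a large subtree, so one transfer keeps the requesting worker busy for a long time and transfers are kept infrequent.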
Our choice of task-scheduling strategy will depend on characteristics of our problem and target computer and can be determined by analysis and experiment. Notice that the communication structures used for task scheduling can be integrated with those proposed earlier for maintaining Amin. For example, a central manager used for task scheduling can also maintain and distribute an up-to-date search bound with each task. In decentralized schemes, the worker tasks that execute search problems can broadcast improved search bound values to other workers.
The parallel algorithm designed in this case study is certainly more complex, and perhaps less obvious, than that developed for the atmosphere model. It is clear from the start that functional decomposition techniques should be used to define tasks, that responsibility for maintaining Amin should be isolated from the rest of the computation, and that we can increase task granularity by switching from a parallel to a sequential evaluation strategy at a specified depth in the search tree. If we were concerned with parallelizing simple search, the design might be complete at this stage. However, the need to support pruning requires that we proceed with further refinements. In particular, we introduce a task-scheduling algorithm so that we can pursue depth-first search on each processor while exposing higher-level search tree nodes to idle workers.