Dr. Dobb's | Searching Decision Trees

Searching Decision Trees

Decision trees let you model a variety of complicated problems in a simple way. Rod shares some generic techniques that can accelerate a range of complex decision problems.

December 01, 1997
URL:http://www.drdobbs.com/architecture-and-design/searching-decision-trees/184410349

Dr. Dobb's Journal December 1997: Algorithm Alley

Rod is the author of Visual Basic Algorithms: A Developer's Sourcebook of Ready-To-Run Code (John Wiley & Sons, 1996). You can contact him at [email protected].

In the February 1997 "Algorithm Alley," Oleg Kiselyov scheduled a chess tournament by recursively adding pairs of players, then backtracking when a conflict was found. This basic idea also appears in game programs, which recursively examine future moves, then backtrack when a series of moves looks like a bad idea. It is usually a good strategy when you need to find the "best" answer.

This approach is often called a "tree search." Recursion corresponds to moving down in the tree to extend a possible solution; backtracking corresponds to moving back up the tree. Of course, the complete tree is usually very large, sometimes infinite. To achieve useful answers, you need to somehow limit your search. This month, Rod Stephens shares some generic techniques that can accelerate a range of complex decision problems. You might find it interesting to compare Rod's generic techniques to the specific approach presented by Oleg.

-- Tim Kientzle

Decision trees let you model a variety of complicated problems in a simple way. Each node in the tree represents a single decision leading towards a solution to the problem. A path from the root of the tree to a leaf represents a complete solution. To solve the original problem, your program needs to find a path through the tree that corresponds to the best possible solution. Here, I'll examine decision trees and show how you can apply them to a variety of difficult problems. I'll then describe several strategies for searching decision trees to find the best solution available.

The Partitioning Example

Suppose you and your sister inherit a bag of variously sized diamonds. You need to divide the diamonds into two groups so each of you receives as close to the same total value as possible. This is an example of the classic partitioning problem. Another example might be loading an assortment of boxes onto two trucks so that each truck has the same total weight.

You can model this problem using a decision tree. Each node in the tree represents the decision to place an item in one group or the other. For example, the left branch out of the tree's root node represents the decision to place the first item in group 1. The right branch represents the decision to place the first item in group 2. All solutions that lie below the right branch in the decision tree correspond to solutions where the first item is in group 2.

A path from the root node to a leaf represents a possible solution. Each branching indicates the group that should hold an item. For example, the path on the extreme left side of the tree, that includes only left branches, corresponds to a solution where every item is placed in the first group. This is not a good solution to the partitioning problem, since moving any object to the second group will make the values of the two groups more equal. The program's goal is to search the decision tree to find the best solution.

More Examples

You can model many other complex problems using decision trees. In the "knapsack" problem, you have a knapsack with a fixed size and several objects that each have a size and a value. You want to fill the knapsack with items so the total value in the knapsack is as large as possible. Knapsack problems arise in many fields including finance (which investments do you select with limited funds?), supply (which items do you ship on your weekly courier boat?), and production planning (which products do you build in a limited amount of time?).

Each node in the knapsack decision tree represents the decision on whether or not to place an item in the knapsack. The node's left branch indicates the item should be placed in the knapsack, while the right branch indicates it should be left out.

The decision trees for both the partitioning problem and the knapsack problem are binary. For problems involving N items, these trees are N levels tall and contain a total of 2^N leaf nodes representing 2^N possible solutions.

In the classic traveling-salesman problem (TSP), a salesperson must visit several cities and return to the starting point. The problem is to determine the order for visiting the cities that minimizes the total distance traveled. The TSP arises most often in vehicle routine and scheduling applications such as bulk garbage pickup, package delivery, and repair crew dispatching.

Unlike the nodes in trees that represent partitioning and knapsack problems, the nodes in a TSP decision tree do not all have the same number of branches. Each branch represents the decision to travel to a particular city next. If there are N cities, the first node will have N branches -- one for each of the cities that could be visited first. Nodes on the second level of the tree will have N-1 branches since there are N-1 cities left to visit. Nodes in each level of the tree have one fewer branch than those above.

These trees are larger than those representing partitioning or knapsack trees. For an N city TSP, the decision tree contains N! leaf nodes representing N! possible solutions. For example, when N=20, a partitioning problem decision tree contains roughly one million leaf nodes. A 20-city TSP's decision tree contains more than 2×10¹⁸ leaf nodes.

Exhaustive Search

One way to find the best solution in a decision tree is to search the entire tree. Your program does not need to actually build the tree, but thinking of it as traversing the tree makes solving the problem easier.

Listing One is Visual Basic code that exhaustively searches a partitioning decision tree. A complete program that demonstrates this code and other methods described here, is available electronically (see "Availability," page 3).

The program's Values array holds the items' values. For each item, the array TestSolution contains an entry indicating the group that contains the item.

The subroutine ExhaustiveSearch takes as a parameter the item that it should consider. It first determines if it has reached a leaf node. If it has, it compares the test solution to the best solution found so far. If there is an improvement, it saves the test results and continues.

If it has not reached a leaf node, ExhaustiveSearch sets the item's TestSolution array value to 1, indicating the item it is considering should be placed in the first group. It then calls itself to recursively examine solutions where the item is in the first group. When that call returns, the routine sets the item's TestSolution value to 2, indicating it should be placed in the second group. It again calls itself recursively to examine the possible solutions.

ExhaustiveSearch is relatively simple. The code that traverses the tree consists of four steps:

: 1. Place the item in the first group.
: 2. Recursively examine solutions with the item in the first group.
: 3. Place the item in the second group.
: 4. Recursively examine solutions with the item in the second group.

Branch and Bound

While an exhaustive search is simple and effective, it can be extremely slow. Decision trees tend to be huge and searching them completely can take a long time.

The branch-and-bound technique prunes many of the branches out of a decision tree. Given the current partial solution, it calculates an upper bound on the best solution possible. If the current partial solution cannot be improved so it is better than a solution that has already been found, the program does not need to consider this test solution further.

For example, consider a program that solves a ten-item partitioning problem. Suppose it quickly finds a solution where the total values of the two groups differ by 100. It continues searching the decision tree and reaches a point where it has assigned four items to a test solution and is therefore four levels into the decision tree.

Now imagine that the total values of the two test groups at this point differ by 1000. Suppose also that the total values of all of the remaining items not yet assigned to either group is only 700. In this case, there is no way the program can assign the remaining items to make the group totals differ by less than 300. Since this cannot beat the best solution found so far, there is no point continuing with this test solution. The program can skip assigning the remaining six items and not visit the nodes that lie beneath this point in the tree.

Listing Two presents the BranchAndBound subroutine. The variable UnusedValues tracks the total value of the unassigned items. The subroutine uses that value to decide when the current test solution is hopeless.

Tight bounds on the possible solutions can often allow branch-and-bound to trim huge portions of the decision tree. To solve a 20-item partitioning problem, an exhaustive search visits all two-million-plus nodes. Branch-and-bound typically visits only 100,000 or 200,000, depending on the particular data values.

Heuristics

Branch-and-bound techniques allow a program to examine only a fraction of a decision tree. Problems like partitioning, knapsack, and TSP grow so quickly, however, that even a small fraction of the tree may be too large to search. The decision tree for a 50-item partitioning problem contains more than 2×10¹⁵ nodes. Even if branch-and-bound visited only 1/20 of them, a computer examining one million nodes per second would take more than a year to find the best solution. To solve problems this large, a program must use heuristics -- an algorithm that will probably give a good result, but which is not guaranteed to produce the best solution possible.

Different heuristics are tailored to specific problems. For example, there are specialized heuristics for finding solutions to the traveling salesman problem. Some heuristics fall into categories that apply to almost any problem. Some of these categories include hill climbing, random, and incremental improvement.

Hill Climbing

A hill-climbing heuristic builds a solution a piece at a time by adding pieces that move the solution closer to the goal. These heuristics are called "hill climbing" by analogy with a hiker trying to find the top of a mountain at night. At all times, the hiker moves in the direction of steepest ascent -- the direction that moves closer to the goal. Eventually the hiker either reaches the top of the mountain or becomes stuck on top of a smaller hill.

A hill-climbing heuristic for the partitioning problem might consider the items in order of size, largest to smallest. It would add each item to whichever group was smaller at the time. If there are an even number of items, and the items all have roughly the same size, this algorithm produces a reasonably good solution almost instantly.

Listing Three builds a solution using this heuristic. At each step, the code examines the entire list looking for the next largest item. It would be more efficient to sort the items according to size first, but even with this handicap, the heuristic is extremely fast.

On the other hand, since this algorithm examines only one of the myriad possible solutions, it is unlikely to find the absolute best solution. It will find the optimal solution only if the problem has some special structure. For example, if the problem involves an even number of items that all have exactly the same value, this algorithm always finds a perfect solution.

Random

The biggest drawback to the hill-climbing heuristic is that it considers only a single solution. A random heuristic avoids that problem by generating many solutions randomly. It then selects the best solution it finds after many trials.

Listing Four examines random solutions. It stops when it sees no improvement after a certain number of trials. The example program (available electronically) uses ten times the number of items squared. For example, for a 30-item list, the algorithm would run until it had found no improvement in 10×30²=9000 consecutive trials. This takes much longer than the single trial required by the hill-climbing heuristic, but it is still faster than branch-and-bound for large problems.

Random heuristics generally produce better results than hill-climbing heuristics if you let them run long enough. They are also easy to apply to practically any problem. Creating a good hill-climbing heuristic can be difficult; creating a random heuristic is simple.

Incremental Improvement

An incremental-improvement heuristic builds on the strengths of a random heuristic. It starts with a randomly generated solution. It then makes random changes to the solution, trying to make an improvement. If one of the changes gives a better solution, the algorithm keeps the changes and continues. If a change does not improve the solution, it is discarded, and the algorithm continues using the original solution. The process continues until the algorithm fails to improve the solution for a certain number of trials in a row.

At that point, the algorithm generates another random initial solution and tries to improve it incrementally. The program generates many random initial solutions and tries to improve them until a certain number of adjusted solutions in a row do not result in a better final result.

The types of improvements the heuristic can make depend on the problem. For example, in the partitioning problem, the algorithm would begin by randomly assigning the items to the two groups. It might then randomly select one item from each group and swap them. If that results in an improved solution, the items are left in their new groups. Otherwise, they are swapped back. Listing Five uses this strategy.

Some programs try every possible pairwise swap when looking for improvements. A program may even try every possible swap repeatedly until pairwise swaps can make no more improvements. There are times, however, when pairwise swaps cannot improve a solution even though a better solution is possible. Sometimes three, four, or even more items must be swapped to make any improvement. Over the years, programmers have developed extremely complicated strategies for finding improvements for particular problems. Often it is easier and almost as effective to use more random trials using a simpler method.

Branch-and-Bound Revisited

Branch-and-bound uses a current best solution to avoid searching all of the decision tree. When the algorithm starts, the best solution is defined to have the worst possible value, so any solution is an improvement. This makes the algorithm initially descend down the left-most branch of the tree, even though that is unlikely to produce a reasonable result. In the partitioning problem, that solution corresponds to putting every item in one group, and it is a terrible solution.

A program can speed the process slightly by using a fast heuristic to find an initial solution. Branch-and-bound can then use that as a starting point to avoid checking a larger part of the decision tree. If the heuristic is good, the savings can be large. The example program (available electronically) uses the hill-climbing algorithm to find an initial solution for its second branch-and-bound test.

Summary

You can model many difficult problems using decision trees. Using exhaustive search, finding the best solution is slow but sure. Branch-and-bound also produces an optimal solution, but it is much faster than exhaustive search.

Problems like partitioning, knapsack, and TSP are so difficult, however, that even branch-and-bound works only for the smallest cases. For larger problems, you must use heuristics. Hill-climbing algorithms generate a solution extremely quickly, but they are unlikely to be optimal. Random and incremental improvement strategies allow a program to search for solutions the hill-climbing heuristic might miss.

Listing One

Dim Values() As LongDim NumValues As Integer



Dim BestSolution() As Integer
Dim TestSolution() As Integer
Dim BestDifference As Long
Dim TestDifference As Long
Dim Total(1 To 2) As Long



' *********************************************
' Search exhaustively considering the indicated node.



Private Sub ExhaustiveSearch(node As Integer)
Dim i As Integer
Dim bin As Integer
Dim diff As Long
    NodesVisited = NodesVisited + 1
    ' See if we have assigned every node in the test solution.
    If node > NumValues Then
        ' See if this test solution is better than the previous best solution.
        Total(1) = 0
        Total(2) = 0
        For i = 1 To NumValues
            bin = TestSolution(i)
            Total(bin) = Total(bin) + Values(i)
        Next i
        diff = Abs(Total(2) - Total(1))
        If diff < BestDifference Then
            ' Save the improved solution.
            For i = 1 To NumValues
                BestSolution(i) = TestSolution(i)
            Next i
            BestDifference = diff
        End If
        Exit Sub
    End If
    ' Recursively see what would happen if we add the next item to bin 1.
    TestSolution(node) = 1
    ExhaustiveSearch node + 1
    ' Recursively see what would happen if we add the next item to bin 2.
    TestSolution(node) = 2
    ExhaustiveSearch node + 1
End Sub

Back to Article

Listing Two

Dim UnusedValues As Long' *********************************************
' Search using branch and bound considering the indicated node.



Private Sub BranchAndBound(node As Integer)
Dim i As Integer
Dim bin As Integer
Dim value As Long
    NodesVisited = NodesVisited + 1
    ' See if we have assigned every node in the test solution.
    If node > NumValues Then
        ' This must be better than the previous best solution or we would 
        ' not have come here. Save the solution.
        BestDifference = Abs(Total(2) - Total(1))
        For i = 1 To NumValues
            BestSolution(i) = TestSolution(i)
        Next i
        Exit Sub
    End If
    ' See what would happen if we add the item to each of the bins.
    value = Values(node)
    For bin = 1 To 2
        ' Add the item to this test bin.
        TestSolution(node) = bin
        Total(bin) = Total(bin) + value
        UnusedValues = UnusedValues - value
        ' See if it would still be possible to improve current best solution.
        If Abs(Total(2) - Total(1)) - UnusedValues < BestDifference Then
            ' It is possible. Recursively explore this solution further.
            BranchAndBound node + 1
        End If
        ' Remove the item from the test bin.
        Total(bin) = Total(bin) - value
        UnusedValues = UnusedValues + value
    Next bin
End Sub

Back to Article

Listing Three

' *********************************************' Generate a solution heuristically. At each step add the largest 
' unassigned item to the list with the smaller total.
Private Sub HillClimbing()
Dim i As Integer
Dim j As Integer
Dim best_j As Integer
Dim best_value As Long
    ' Assign the ith largest item.
    For i = 1 To NumValues
        NodesVisited = NodesVisited + 1
        ' Find the ith largest item rather inefficiently.
        best_value = -1000000000
        For j = 1 To NumValues
            If BestSolution(j) = 0 And _
                Values(j) > best_value _
            Then
                best_value = Values(j)
                best_j = j
            End If
        Next j
        ' Assign the item to the emptier bin.
        If Total(1) < Total(2) Then
            BestSolution(best_j) = 1
            Total(1) = Total(1) + best_value
        Else
            BestSolution(best_j) = 2
            Total(2) = Total(2) + best_value
        End If
    Next i
    BestDifference = Abs(Total(2) - Total(1))
End Sub

Back to Article

Listing Four

' *********************************************' Generate random solutions until we have max_the_same in 
' a row that produce no improvement.



Private Sub RandomSearch(max_the_same As Long)
Dim num_the_same As Long
Dim i As Integer
Dim bin As Integer
Dim diff As Long
    ReDim ordering(1 To NumValues)
    ' Generate random solutions.
    Do While num_the_same <= max_the_same
        Total(1) = 0
        Total(2) = 0
        ' Generate a random solution.
        For i = 1 To NumValues
            bin = Int(2 * Rnd + 1)
            TestSolution(i) = bin
            Total(bin) = Total(bin) + Values(i)
            NodesVisited = NodesVisited + 1
        Next i
        ' See if this test solution is better than the previous best solution.
        diff = Abs(Total(2) - Total(1))
        If diff < BestDifference Then
            ' Save the improved solution.
            For i = 1 To NumValues
                BestSolution(i) = TestSolution(i)
            Next i
            BestDifference = diff
            num_the_same = 0
        Else
            num_the_same = num_the_same + 1
        End If
    Loop
End Sub

Back to Article

Listing Five

' *********************************************' Generate random solutions. Improve them until we have same_per_trial 
' in a row with no improvement. Repeat until we have same_trials in a row 
' with no improvement.
' *********************************************
Private Sub IncrementalImprovement(same_trials As Long, same_per_trial As Long)
Dim trials_the_same As Long



Dim impr_the_same As Long
Dim i As Integer
Dim j As Integer
Dim bin As Integer
Dim diff As Long
Dim test_diff As Long
Dim test_total(1 To 2) As Long
Dim num_in_bin(1 To 2) As Integer
Dim num As Integer
    ReDim ordering(1 To NumValues)
    ' Generate random solutions.
    Do While trials_the_same <= same_trials
        Total(1) = 0
        Total(2) = 0
        num_in_bin(1) = 0
        num_in_bin(2) = 0
        ' Generate a random solution.
        For i = 1 To NumValues
            bin = Int(2 * Rnd + 1)
            TestSolution(i) = bin
            Total(bin) = Total(bin) + Values(i)
            num_in_bin(bin) = num_in_bin(bin) + 1
            NodesVisited = NodesVisited + 1
        Next i
        diff = Abs(Total(2) - Total(1))
        ' Generate random improvements.
        If num_in_bin(1) = 0 Or num_in_bin(2) = 0 Then
            impr_the_same = same_per_trial + 1
        Else
            impr_the_same = 0
        End If
        Do While impr_the_same <= same_per_trial
            ' Pick an item in bin 1.
            num = Int(num_in_bin(1) * Rnd + 1)
            For i = 1 To NumValues
                If TestSolution(i) = 1 Then
                    num = num - 1
                    If num <= 0 Then Exit For
                End If
            Next i
            ' Pick an item in bin 2.
            num = Int(num_in_bin(2) * Rnd + 1)
            For j = 1 To NumValues
                If TestSolution(j) = 2 Then
                    num = num - 1
                    If num <= 0 Then Exit For
                End If
            Next j
            NodesVisited = NodesVisited + 2
            ' See if the swap would help.
            test_total(1) = Total(1) - Values(i) + Values(j)
            test_total(2) = Total(2) - Values(j) + Values(i)
            test_diff = Abs(test_total(2) - test_total(1))
            If test_diff < diff Then
                ' It helps.
                diff = test_diff
                Total(1) = test_total(1)
                Total(2) = test_total(2)
                TestSolution(i) = 2
                TestSolution(j) = 1
                impr_the_same = 0
            Else
                ' No improvement.
                impr_the_same = impr_the_same + 1
            End If
        Loop    ' End improving this solution.
        ' See if this test solution is better than previous best solution.
        diff = Abs(Total(2) - Total(1))
        If diff < BestDifference Then
            ' Save the improved solution.
            For i = 1 To NumValues
                BestSolution(i) = TestSolution(i)
            Next i
            BestDifference = diff
            trials_the_same = 0
        Else
            trials_the_same = trials_the_same + 1
        End If
    Loop
End Sub

Back to Article

DDJ