# Three Parallel Backtracking Designs

December 09, 2011

Before we get into the meat of this post, let me revisit the serial code I used for illustration last time. Specifically, I want to use a better method for recording queen positions.

Rather than keeping a 2D array that mimics a physical chessboard, I propose using an integer vector with length equal to the number of rows. The data stored in each element will be the column of a placed queen. This eliminates the chance that two queens will be on the same row, and the test for a legal position can simply search all lower indexed elements to see if those elements correspond to the same column (vertical attack) or are along a diagonal from the newly placed queen. If such a case is found, the newly placed queen attacks an already placed queen and `stillLegal()` will return FALSE since the position has no hope of leading to a legal solution.

It may seem like a "six of one, half dozen of another" situation, but having the simpler and smaller data structure to represent the current state of a partial solution will be a benefit to our parallel solutions. The simpler aspect is seen in the tests needed in `stillLegal()`. Checking for queens in the same column and testing the diagonal attack squares is easy, as seen in the code below. The positions of queens placed in rows "above" the currently placed queen are all that need to be examined. Any new move generated at the same level simply overwrites the previous column value in the vector and the test ignores any column values that are stored in higher indexed ("below") locations.

```BOOL stillLegal(int *board, int r)
{
BOOL safe = TRUE;
int i;
// Check vertical
for ( i = 0; i < r; ++i)
if (board[i] == board[r]) safe = FALSE;
// Check diagonals
int ld = board[r];  //left diagonal columns
int rd = board[r];  // right diagonal columns
for ( i = r-1; i >= 0; --i) {
--ld; ++rd;
if (board[i] == ld || board[i] == rd) safe = FALSE;
}

return safe;
}```

In serial, the same data structure representing the current board state can be used as long as already tested moves are taken back and replaced by new moves. In parallel, different tasks are going to need a private copy of the board and current partial solution to be examined. Otherwise you have a data race. The smaller data structure saves time and space in the parallel code because each new task generated from a call to `NQueens()` will use a copy of the current board to be used as a formal parameter. Even beach gorillas with too many muscles can realize that a smaller amount of data to be copied will add less overhead to the computation. Thus, if a legal position is reached by placing a queen in the next column location of the current row, a copy of that board configuration must be generated and used for a newly spawned task. You'll see this in the three design versions of the parallel `NQueens()` function.

#### Design 1: Every New Recursive Call Generates a New Task

This is the simplest code to write. For every recursive call, you simply create a task. Using OpenMP task construct, the first backtracking design parallel code would be:

```void NQueensD1(int *board, int row, int N)
{
if (row == N)
handleSolution(board);

else {
for (int i = 0; i < N; ++i) {
board[row] = i;
if (stillLegal(board, row)) {
// make copy of current board position
int *bnew = new int[N];
for (int j = 0; j <= row; ++j) bnew[j] = board[j];
NQueensD1(bnew, row+1, N);
}
}
}
return;
}```

I've already mentioned one benefit of the above code: It's simple to implement from the serial code. Another benefit is that the load per task is going to be balanced.

However, as if to kick sand in your face, all tasks ultimately do very little work in this implementation. The only thing a non-leaf task does is create new tasks for queens on the next row of the board, if there are any legal positions for a queen on that row. It takes time to create a task and assign that task to a thread. Since there isn't much computation in any task, the task creation time will be a large percentage of the work time. In other words, the overhead of creating the task will be high, let alone assigning it to a thread for execution. Since each task must have a private copy of the state configuration, there can be a large memory usage, too. Thus, any time spent allocating memory for a new current state object and copying relevant values is just more overhead that was not needed in the serial version.

#### Design 2: One task Created for Each Sub-Tree Rooted at a Particular Depth

In the `NQueensD2()` code below, I use a defined constant `LEVEL` to trigger the generation of tasks. A single OpenMP thread executes the (recursive) function until `row` equals `LEVEL`. At this point, a new task is created to carry on the search from that point in the tree. As above, a new board configuration copy is made and used as a parameter to the task.

```void NQueens(int *board, int row, int N)
{
if (row == N)
handleSolution(board);
else {
for (int i = 0; i < N; ++i) {
board[row] = i;
if (stillLegal(board, row))
NQueens(board, row+1, N);
}
}
return;
}

void NQueensD2(int *board, int row, int N)
{
for (int i = 0; i < N; ++i) {
board[row] = i;
if (stillLegal(board, row)) {
if (row < LEVEL)
NQueensD2(board, row+1, N); // go deeper in search tree
else {
int *bnew = new int[N];
for (int j = 0; j <= row; ++j) bnew[j] = board[j];
NQueens(bnew, row+1, N);   // generate task of serial function
}
}
}
return;
}```

This second design minimizes task creation/assignment (to executing thread) overhead. If `LEVEL` is set to 4, the maximum number of tasks will be 4096. Since nowhere near the maximum number of tasks will be created, there won't be a huge memory drain with hundreds of thousands of different copies of the "current" partial solution state sitting in tasks waiting to be assigned to a thread for execution.

One problem with Design 2 is that the fixed number of tasks can lead to poor scaling. If there are 128 or 256 threads available, spawning 64 tasks will not utilize the resources of the platform. To overcome this problem, the `LEVEL` for creating tasks needs to be chosen smartly. Even though I know that all potential 4096 tasks will not be created, I do hope that enough tasks from that maximum will be spawned to keep my thread team busy.

A related issue will be choosing a value for `LEVEL` (in relation to `N`) that ensures enough tasks are created to keep threads busy while the original single thread goes up and down the (recursive) tree in order to spawn more tasks. It wouldn't be fruitful to spawn enough tasks for half the thread team only to have those tasks complete before more tasks on other branches of the search tree are spawned.

Perhaps a less obvious drawback to this backtracking design is that the amount of work given to the tasks may vary dramatically, meaning some tasks will finish well before others. This leads to a load imbalance. Even if a number of tasks equal to the number of threads is created, the load imbalance can still hamper good scaling.

#### Design 3: Create a New Task for Each Child Node, But Only To a Certain Depth

The second design took the 97-pound weakling that was Design 1 and pumped it up a bit. The third design combines the best parts of the first two. (The `NQueens()` function called from `NQueensD3()` is the same as the one given for the second design code.)

```void NQueensD3(int *board, int row, int N)
{
for (int i = 0; i < N; ++i) {
board[row] = i;
if (stillLegal(board, row)) {
if (row < LEVEL) {
int *bnew = new int[N];
for (int j = 0; j <= row; ++j) bnew[j] = board[j];
NQueensD3(bnew, row+1, N);  // generate a task
}
else
NQueens(board, row+1, N);   // no more tasks, search from here
}
}
return;
}```

In Design 3 we create a relatively small number of tasks. We want the total number of tasks to be (close to) some multiple of the number of threads available. This creates (potentially) multiple pieces of work for each thread. Even so, by limiting the number of tasks, the overhead of task creation, assignment, and termination will be reduced from that of Design 1. For the same value of `LEVEL`, the total number of tasks created will be about the same as from Design 2, but since tasks are created early and often, the advantages of parallel execution are seen much sooner.

It's not all sunshine and bully-free beaches, though. The algorithm (in general) can be harder to code. (Comparing the code for `NQueensD2()` and `NQueensD3()` that may not be readily apparent.) The "best" level of the search tree to halt new task spawning will depend on workload and number of threads. Implementing a dynamic algorithm for setting the `LEVEL` value can complicate the code design or require many test runs to find the optimal level to be used.

#### Performance

Normally I won't include performance figures in these posts since it takes too many details to describe the actual execution platform in order for readers to recreate my results. Six months down the road the advice above will still be valid, but the computational platforms and software will have changed and the actual performance figures will be stale. Whenever I do include performance results, you should only use them for relative purposes in the most general way as an algorithmic comparison metric.

(If I were doing this as an employee of Intel, I would have to detail the processors used, the OS version, the versions of the compiler and other support software, as well as include a federally mandated disclaimer. Since I'm not writing this as an Intel employee, I will only say that if you implement the codes yourself, please rest assured that your mileage will vary.)

Running the serial code and the code from the three design alternatives given above on a 12 core/24 thread platform with `N` = 16 (yielding 14,772,512 solutions) and `LEVEL` = 4 (where relevant), I got the following execution results.

Code Time (in Seconds) Speedup
Serial 705.954 1.0
Design 1 Killed by system ---
Design 2 109.383 6.45X
Design 3 44.489 15.90X

When running the Design 1 code, the execution was killed by the system. This is likely for creating too many tasks and exhausting system resources. (The code runs to completion with `N` = 14.) As expected, the Design 3 code runs better than the Design 2 code (and the Design 1 code for smaller board sizes).

All this is well and good for backtracking problems that are very regular. That is, problems like N-Queens and Knight's Tour will have a fixed number of next moves from a given partial solution state regardless of the node in the search tree. Does this advice hold for problems that are more irregular? In my next post I will examine a backtracking solution to a logic puzzle that can spawn different numbers of next moves at different points within the search tree.

### More Insights

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.

# First C Compiler Now on Github

The earliest known C compiler by the legendary Dennis Ritchie has been published on the repository.

# HTML5 Mobile Development: Seven Good Ideas (and Three Bad Ones)

HTML5 Mobile Development: Seven Good Ideas (and Three Bad Ones)

# Building Bare Metal ARM Systems with GNU

All you need to know to get up and running... and programming on ARM

# Amazon's Vogels Challenges IT: Rethink App Dev

Amazon Web Services CTO says promised land of cloud computing requires a new generation of applications that follow different principles.

# How to Select a PaaS Partner

Eventually, the vast majority of Web applications will run on a platform-as-a-service, or PaaS, vendor's infrastructure. To help sort out the options, we sent out a matrix with more than 70 decision points to a variety of PaaS providers.

More "Best of the Web" >>