Channels ▼

Clay Breshears

Dr. Dobb's Bloggers

Three Parallel Backtracking Designs

December 09, 2011

Before we get into the meat of this post, let me revisit the serial code I used for illustration last time. Specifically, I want to use a better method for recording queen positions.

Rather than keeping a 2D array that mimics a physical chessboard, I propose using an integer vector with length equal to the number of rows. The data stored in each element will be the column of a placed queen. This eliminates the chance that two queens will be on the same row, and the test for a legal position can simply search all lower indexed elements to see if those elements correspond to the same column (vertical attack) or are along a diagonal from the newly placed queen. If such a case is found, the newly placed queen attacks an already placed queen and stillLegal() will return FALSE since the position has no hope of leading to a legal solution.

It may seem like a "six of one, half dozen of another" situation, but having the simpler and smaller data structure to represent the current state of a partial solution will be a benefit to our parallel solutions. The simpler aspect is seen in the tests needed in stillLegal(). Checking for queens in the same column and testing the diagonal attack squares is easy, as seen in the code below. The positions of queens placed in rows "above" the currently placed queen are all that need to be examined. Any new move generated at the same level simply overwrites the previous column value in the vector and the test ignores any column values that are stored in higher indexed ("below") locations.

BOOL stillLegal(int *board, int r)
{
  BOOL safe = TRUE;
  int i;
  // Check vertical
  for ( i = 0; i < r; ++i)
    if (board[i] == board[r]) safe = FALSE;
    // Check diagonals
    int ld = board[r];  //left diagonal columns
    int rd = board[r];  // right diagonal columns
    for ( i = r-1; i >= 0; --i) {
      --ld; ++rd;
      if (board[i] == ld || board[i] == rd) safe = FALSE;
    }

    return safe;
}

In serial, the same data structure representing the current board state can be used as long as already tested moves are taken back and replaced by new moves. In parallel, different tasks are going to need a private copy of the board and current partial solution to be examined. Otherwise you have a data race. The smaller data structure saves time and space in the parallel code because each new task generated from a call to NQueens() will use a copy of the current board to be used as a formal parameter. Even beach gorillas with too many muscles can realize that a smaller amount of data to be copied will add less overhead to the computation. Thus, if a legal position is reached by placing a queen in the next column location of the current row, a copy of that board configuration must be generated and used for a newly spawned task. You'll see this in the three design versions of the parallel NQueens() function.

Design 1: Every New Recursive Call Generates a New Task

This is the simplest code to write. For every recursive call, you simply create a task. Using OpenMP task construct, the first backtracking design parallel code would be:

void NQueensD1(int *board, int row, int N)
{
  if (row == N) 
    handleSolution(board);
  
  else {
    for (int i = 0; i < N; ++i) {
      board[row] = i;
      if (stillLegal(board, row)) {
        // make copy of current board position
        int *bnew = new int[N];
        for (int j = 0; j <= row; ++j) bnew[j] = board[j];
#pragma omp task 
        NQueensD1(bnew, row+1, N);
      }
    }
  }
  return;
}

I've already mentioned one benefit of the above code: It's simple to implement from the serial code. Another benefit is that the load per task is going to be balanced.

However, as if to kick sand in your face, all tasks ultimately do very little work in this implementation. The only thing a non-leaf task does is create new tasks for queens on the next row of the board, if there are any legal positions for a queen on that row. It takes time to create a task and assign that task to a thread. Since there isn't much computation in any task, the task creation time will be a large percentage of the work time. In other words, the overhead of creating the task will be high, let alone assigning it to a thread for execution. Since each task must have a private copy of the state configuration, there can be a large memory usage, too. Thus, any time spent allocating memory for a new current state object and copying relevant values is just more overhead that was not needed in the serial version.

Design 2: One task Created for Each Sub-Tree Rooted at a Particular Depth

In the NQueensD2() code below, I use a defined constant LEVEL to trigger the generation of tasks. A single OpenMP thread executes the (recursive) function until row equals LEVEL. At this point, a new task is created to carry on the search from that point in the tree. As above, a new board configuration copy is made and used as a parameter to the task.

void NQueens(int *board, int row, int N)
{
  if (row == N) 
    handleSolution(board);
  else {
    for (int i = 0; i < N; ++i) {
      board[row] = i;
      if (stillLegal(board, row))
        NQueens(board, row+1, N);
    }
  }
  return;
}

void NQueensD2(int *board, int row, int N)
{
  for (int i = 0; i < N; ++i) {
    board[row] = i;
    if (stillLegal(board, row)) {
      if (row < LEVEL)
        NQueensD2(board, row+1, N); // go deeper in search tree
      else {
        int *bnew = new int[N];
        for (int j = 0; j <= row; ++j) bnew[j] = board[j];
#pragma omp task 
        NQueens(bnew, row+1, N);   // generate task of serial function
      }
    }
  }
  return;
}

This second design minimizes task creation/assignment (to executing thread) overhead. If LEVEL is set to 4, the maximum number of tasks will be 4096. Since nowhere near the maximum number of tasks will be created, there won't be a huge memory drain with hundreds of thousands of different copies of the "current" partial solution state sitting in tasks waiting to be assigned to a thread for execution.

One problem with Design 2 is that the fixed number of tasks can lead to poor scaling. If there are 128 or 256 threads available, spawning 64 tasks will not utilize the resources of the platform. To overcome this problem, the LEVEL for creating tasks needs to be chosen smartly. Even though I know that all potential 4096 tasks will not be created, I do hope that enough tasks from that maximum will be spawned to keep my thread team busy.

A related issue will be choosing a value for LEVEL (in relation to N) that ensures enough tasks are created to keep threads busy while the original single thread goes up and down the (recursive) tree in order to spawn more tasks. It wouldn't be fruitful to spawn enough tasks for half the thread team only to have those tasks complete before more tasks on other branches of the search tree are spawned.

Perhaps a less obvious drawback to this backtracking design is that the amount of work given to the tasks may vary dramatically, meaning some tasks will finish well before others. This leads to a load imbalance. Even if a number of tasks equal to the number of threads is created, the load imbalance can still hamper good scaling.

Design 3: Create a New Task for Each Child Node, But Only To a Certain Depth

The second design took the 97-pound weakling that was Design 1 and pumped it up a bit. The third design combines the best parts of the first two. (The NQueens() function called from NQueensD3() is the same as the one given for the second design code.)

void NQueensD3(int *board, int row, int N)
{
  for (int i = 0; i < N; ++i) {
    board[row] = i;
    if (stillLegal(board, row)) {
      if (row < LEVEL) {
        int *bnew = new int[N];
        for (int j = 0; j <= row; ++j) bnew[j] = board[j];
#pragma omp task 
        NQueensD3(bnew, row+1, N);  // generate a task
      }
      else 
        NQueens(board, row+1, N);   // no more tasks, search from here
    }
  }
  return;
}

In Design 3 we create a relatively small number of tasks. We want the total number of tasks to be (close to) some multiple of the number of threads available. This creates (potentially) multiple pieces of work for each thread. Even so, by limiting the number of tasks, the overhead of task creation, assignment, and termination will be reduced from that of Design 1. For the same value of LEVEL, the total number of tasks created will be about the same as from Design 2, but since tasks are created early and often, the advantages of parallel execution are seen much sooner.

It's not all sunshine and bully-free beaches, though. The algorithm (in general) can be harder to code. (Comparing the code for NQueensD2() and NQueensD3() that may not be readily apparent.) The "best" level of the search tree to halt new task spawning will depend on workload and number of threads. Implementing a dynamic algorithm for setting the LEVEL value can complicate the code design or require many test runs to find the optimal level to be used.

Performance

Normally I won't include performance figures in these posts since it takes too many details to describe the actual execution platform in order for readers to recreate my results. Six months down the road the advice above will still be valid, but the computational platforms and software will have changed and the actual performance figures will be stale. Whenever I do include performance results, you should only use them for relative purposes in the most general way as an algorithmic comparison metric.

(If I were doing this as an employee of Intel, I would have to detail the processors used, the OS version, the versions of the compiler and other support software, as well as include a federally mandated disclaimer. Since I'm not writing this as an Intel employee, I will only say that if you implement the codes yourself, please rest assured that your mileage will vary.)

Running the serial code and the code from the three design alternatives given above on a 12 core/24 thread platform with N = 16 (yielding 14,772,512 solutions) and LEVEL = 4 (where relevant), I got the following execution results.

Code Time (in Seconds) Speedup
Serial 705.954 1.0
Design 1 Killed by system ---
Design 2 109.383 6.45X
Design 3 44.489 15.90X

When running the Design 1 code, the execution was killed by the system. This is likely for creating too many tasks and exhausting system resources. (The code runs to completion with N = 14.) As expected, the Design 3 code runs better than the Design 2 code (and the Design 1 code for smaller board sizes).

All this is well and good for backtracking problems that are very regular. That is, problems like N-Queens and Knight's Tour will have a fixed number of next moves from a given partial solution state regardless of the node in the search tree. Does this advice hold for problems that are more irregular? In my next post I will examine a backtracking solution to a logic puzzle that can spawn different numbers of next moves at different points within the search tree.

Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 


Video