Tools

Parallelizing N-Queens with the Intel Parallel Composer

, February 17, 2009

Solving an old problem with new tools

2 Parallelizing the Solution

This section describes the main "solver" routine and describes how to go about parallelizing it. We'll look at some basic parallel analysis and decomposition of the algorithm so that we know what can be parallelized. Then, we'll describe each of the parallel solutions. At the end of each solution, we'll describe what we think are the tradeoffs to both usability and performance we see in the methods.

First of all we need to find out where most of the time is spent in our program. This will directly lead us to the point where we have to start with the parallelization. We will use a profiler to find the hotspot in our program. In our case, this is the Intel Performance Tuning (PTU) from Whatif.intel.com; however, you could some other profiler, like the Vtune Performance Analyzer.

2.1 The Serial "Solver" Function

We have implemented a couple of serial versions. The first version is almost an inefficient direct translation of the pseudocode to C++, where we used a Standard Template Library (STL) vector to keep the information of the solution vector. This version is kind of inefficient as it resizes the STL vector any time a backtracking solution is discarded. The next version replaces the variable vector by a fixed array to get rid of the performance penalty. Of course this could be also handled differently.

The major change is that we use a more modular approach where we encapsulate the solver and the placing algorithm in separate functions. The straightforward algorithm is of course not the fastest one. Martin Richards the father of BCPL presented in 2005 a solution for the Queens problem using a recursive bit pattern algorithm. Our C++ version is a direct translation of the presented MCPL version. Some additional notes on the different algorithms: First of all there are a lot of clever possibilities how to improve the serial code. The second implementation is the easiest one to use in a threaded playground to see how the different paradigms can be implemented, so this is the basis for the parallel algorithms. Please keep in mind that the purpose of this sample code guide is NOT to teach the fastest possible algorithm but to show different implementations of parallel paradigms.

The third algorithm is only included to show that the use of the fastest algorithm should be done before starting with any parallelization concept. On the other hand it's much more complicated to understand. NOTE: The back track tree version is based on the original paper Backtracking Algorithms in MCPL using Bit Patterns and Recursion by Martin Richards.

2.1.1 Standard Template Library (STL) Solution

The sample code can be found here in the \Nqueens\nq-stl\ directory. This sample represents a simple transcription of the above algorithm to C++ using an STL vector object to represent the solution. The vector methods pop_back/push_back expand and collapse the solution vector. Accessing the STL vector in this manner is so poor that we replaced the vector by an integer array in the next implementation.

2.1.2 Serial Solution

The sample code can be found here in the \Nqueens\nq-serial\ directory. This sample represents a more modularized version of the N-Queens algorithm. Everything is wrapped into the solve function, where the solution vector is replaced by a simple integer array. Inside the solve function we call setQueen() which loops over the different columns in the first row. The advantage of this approach is that the different computations are independent which makes it easier to parallelize later. The core loop of the nqueens solver function is shown below.


void setQueen(int queens[], int row, int col) {
  for(int i=0; i<row; i++) {
     if (queens[i]==col) // vertical attacks
       return;
       if (abs(queens[i]-col) == (row-i) ) // diagonal attacks
          return;
       }
       queens[row]=col; // column is ok, set the queen
       if(row==size-1) {
           nrOfSolutions++;
         }
    else {
         for(int i=0; i<size; i++) { // try to fill next row
             setQueen(queens, row+1, i);
} [...]

For each column (i) we start our recursive backtracking algorithm from our driver function solve().


void solve(int queens[]) {
   for(int i=0; i<size; i++) {
   // try all positions in first row
   // create separate array for each recursion
       setQueen(queens, 0, i);
   }
}

Using a profiler allows you to find out where most of the time is spent in the program.

It should not be surprising that most time is spent in the abs() function, which we use to calculate if two queens are on the same diagonal (see algorithm description). Therefore abs() is the hottest function in setQueen(), which itself is called 348150 times by our core solve function from above.

2.2 Parallelizing the "Solver"

This example cannot replace studying in depth the different parallel paradigms we offer, but it will give you enough information to get you started. If you have worked through the serial examples, you should be able to identify hotspots, which follows nicely the Pareto principle which applies to most applications that 80% of the performance is spent in 20% of the code. This consideration, in general, limits the overall parallelization efforts. Next consider that one should start with parallelization on the highest possible abstraction level to make the solution as general as possible. This method should help make the approach reusable and refactorable for later use.

We have three modules to consider: solve(), setQueens(), and abs(). Of course, the solve() function is the one with the highest abstraction count and the abs() function with the lowest. So we use the solve() function. By looking at our decision tree we see that the different branches are independent from each other. This is an important and necessary condition for the parallelization strategy.

2.2.1 Windows 32 Threading API

The sample code can be found here in the \Nqueens\nq-win32api-intel\ directory.

A "classic" parallel implementation of the Nqueens algorithm would make use of a native threading like the Windows 32 API. The main advantage of the API approach is that the user has more control and power over threading with C++ than with any other languages. But the amount of code it takes to implement a given solution is of course much higher as the programmer has to implement all the thread handling which in the other cases are handled by the runtime. Therefore this approach is sometimes referred to as the assembly language for threading. You are more flexible in assembly language, but you could get what you want much faster in a high-level language with results that were as good or at least good enough. Also the number of cores which influences the number of created threads has to be determined. This is not so easy if you have a platform- independent solution in mind. This sample code guide is too short to teach the threading API so we provide only a very rough idea of the underlying method. First of all we have to create a thread local storage to store the thread-specific information like which part of the loop to work on. This is done in :


// Thread local data structure that will
//contain iteration range and message to print out
struct _thr_params {
   size_t start;
   size_t end;
   int *queens;
   std::string msg;
};
typedef struct _thr_params thr_params;

The next thing is to identify the portions of code to thread. This is of course still the module setQueens(). So we have to implement a driver function to coordinate the work of the multiple threads. This is done in run_threaded_loop():


void run_threaded_loop (int num_thr, size_t size, int _queens[]) {
    HANDLE* threads = new HANDLE[num_thr];
    thr_params* params = new thr_params[num_thr];
    for (int i = 0; i < num_thr; ++i) {
      // Give each thread equal number of rows
      params[i].start = i * (size/num_thr);
      params[i].end = params[i].start + (size/num_thr);
      params[i].queens = _queens;
      // Pass argument-pointer to a different
      // memory for each thread's parameter to avoid data races
      threads[i] = CreateThread (NULL, 0, run_solve,
      static_cast<void *> (¶ms[i]), 0, NULL);
   }
   // Join threads: wait until all threads are done
   WaitForMultipleObjects (num_thr, threads, true, INFINITE);
   // Free memory
   delete[] params;
   delete[] threads;
}

The next part is to encapsulate the code we like to run in parallel into a new function which we call run_solve(). Every instance of run_solve has its own local data for the start and end point of the loop indices.


// Worker thread function (to be executed by each thread in parallel)
DWORD WINAPI run_solve (void* param)
{
   // Retrieve arguments passed to this thread: param is a pointer to
   // void, static_cast converts this to a pointer to thr_params
   thr_params* params = static_cast<thr_params*> (param);
   // Print messages
   for (size_t i = params->start; i < params->end; ++i) {
   // create separate array for each recursion
   setQueen(new int[size], 0, i);
   }
   return 0;
}

Of course we also have to enable correct programming structures for avoiding race conditions which look very similar to the OpenMP and TBB approach:


CRITICAL_SECTION lock;
CRITICAL_SECTION plock;

By comparing the different solutions it's obvious by just looking at the pure number of source lines that the threading API approach is the most complex method with more pitfalls and opportunities to make an error. But it is also of course the approach with the most freedom and control over the implementation. It would also be nice to implement a clever way to automatically determine the number of cores to choose the right number of threads. In Intel TBB and OpenMP this is done automatically. So the threading API approach which we use can easily lead to an over-subscription (more threads than cores) or under-subscription (less threads than cores) when moving the executable to a different machine. Using more threads than cores would look like this in the Intel Thread Profiler:

The blue bar means oversubscription and leads directly to a performance degradation. This is also left as an exercise for the user together with the task to get rid of the yellow synchronization bars.

2.2.2 Parallelizing with task and taskcomplete

The sample code can be found here in the \Nqueens\nq-parexp-intel\ directory.

The Intel compiler uses newly added C/C++ language extensions to make parallel programming easier. There are four keywords introduced within this version of the compiler: __taskcomplete, __task, __par, and __critical. These keywords are used as statement prefixes. In order for the application to benefit from the parallelism afforded by these keywords, you must use the /Qopenmp compiler option during compilation. The compiler will link in the appropriate runtime support libraries, and the runtime system will manage the actual degree of parallelism. The parallel extensions utilize the OpenMP 3.0 runtime library, but abstract out the use of the OpenMP pragmas and directives, keeping the code more naturally written in C or C++.

We can parallelize the function, solve(), using __par. This prefix allows us to modify the function for parallel processing. Assuming that there is no overlap among the arguments, the solve() function is modified with the addition of the __par keyword. With no change to the way the function is called, the computation is parallelized. This will look like this:


void solve() {
  __par for(int i=0; i<size; i++) {
    // try all positions in first row
    // create separate array for each recursion
    // started here
    setQueen(new int[size], 0, i);
    }
}

The sample uses the __par directive to parallelize the loop by simply putting the macro in front of the loop. If we run our program several times we see that the results are different and it seems we have introduced some kind of error in our program by using the __par. This is one of the most common problems in parallel programming and is called a race condition. In this circumstance, multiple threads will each operate on a particular block of code, but the overall outcome of the execution is dependent upon which thread reaches the block of code first. In our example, two processes each increment the value of nrOfSolutions by 1 if they found a valid solution. This consists of three distinct operations:

Load the value of the nrOfSolutions into a register.
Add 1 to the value in the register.
Write the contents of the register back into the variable nrOfSolutions.

It is clear that if the two processes each operate one after the other, the final value of nrOfSolutions after both processes complete will be equal to the initial value plus 2. Another possible outcome, however, is that the first process could complete the first two steps outlined above, and then that process could be preempted by the second process. The second process would read the value of nrOfSolutions as for instance 6, increment it to 7, and then write it back to the variable nrOfSolutions. Once the first process resumes, it would continue where it left off, writing 6 back as the value of nrOfSolutions. Thus, in the first case, the code generates a final value of 7 for nrOfSolutions , and in the second case, the same code generates a final value of nrOfSolutions for X. Which result would be generated by any particular execution of the code could be unpredictable. This effect comes into play as we move from a serial program flow to a parallel program flow and is caused by the fact that all threading approaches communicate via the shared memory. Since detection of race conditions by code inspection is not feasible for non-trivial code, there are tools to help with this task: One way is to instrument the code, execute it with test data and log possible race conditions, for example with the Intel Thread Checker. Another approach is to flag such conditions during debugging of a desired code path. Intel Parallel Composer includes the Intel Parallel Debugger Extension that extends the Microsoft Visual Studio Debugger with capabilities that facilitate tracking down problems typical for threaded code.

A solution to this problem is to intercept a race condition introduced by the concurrent behavior of the algorithm by protecting the counter for the solutions by a mutual exclusion construct like _critical. This will of course slow down the execution a little bit as we serialize the program flow but we will get the right solution. The final code would look like this:


__critical
{
   // increasing the solution counter is not atomic
   nrOfSolutions++;
}

2.2.3 Starting with OpenMP 3.0

The sample code can be found here in the \Nqueens\nq-openmp-intel\directory.

OpenMP is an industry standard for portable multi-threaded application development. This approach is effective at fine-grain (loop-level) and large-grain (function-level) threading. OpenMP directives provide an easy and powerful way to convert serial applications into parallel applications, enabling potentially big performance gains from parallel execution on multi-core and symmetric multiprocessor systems. The original source is compiled unmodified. Although directives are inserted into the code, when no action is taken on them (for example, the application is not running in shared memory parallel mode) they do not change the program. For shared memory parallel computers, this allows for simple comparisons between serial and parallel runs.

Because only directives are inserted into the code, it is possible to make incremental code changes. The ability to make incremental code changes helps programmers maintain serial consistency. When the code is run on one processor, it gives the same result as the unmodified source code. OpenMP is a single source code solution that supports multiple platforms and operating systems. There is also no need to determine the number of cores as the OpenMP runtime chooses the right number for you.

The latest OpenMP, version 3.0, contains a new task-level parallelism construct that simplifies its use for parallelizing functions, in addition to the loop-level parallelism for which OpenMP is most commonly used. In our example we use exactly the same approach as in the parallel exploration version. The only difference is that the __par macro is replaced by an OpenMP pragma with exactly the same functionality.


void solve() {
   #pragma omp parallel for
   for(int i=0; i<size; i++) {
   // try all positions in first row
   // create separate array for each recursion
   // started here
   setQueen(new int[size], 0, i);
   }
}

The important function in our program is the solve function. Solve is easy to parallelize as the solutions are independent (review the search tree). The sample uses the OpenMP #pragma parallel for to parallelize the important loop. By using the Intel Parallel Debugger Extension we directly see that the same race condition is happening here.

We can avoid this shared access by protecting the counter for the solutions with a mutual exclusion construct like #pragma omp critical or #pragma omp atomic:


#pragma omp atomic
   nrOfSolutions++; // increasing the solution counter is not atomic

There is one distinct advantage of OpenMP over other typical Windows threading approaches such as Win32 threads. The pragma based technique allows us to use an incremental approach to parallelism this means one decides which parts and hotspot to parallelize and than one can go stepwise through your code and thread wherever it is necessary. Another advantage is that the serial code is don't destroyed and stays intact. Furthermore one doesn't have to care about advanced thread handling as the OpenMP runtime automatically does thread pooling whenever possible.

Since typical Windows threading does not use thread pooling, each thread encounters the thread startup overhead. This can be prohibitive for many applications. In summary OpenMP is a flexible and simple set of pragmas, function calls, and environment variables that explicitly instruct the compiler how and where to thread your application. By taking advantage of OpenMP functionality, threading programming is not that much harder than single-threaded programming.

2.2.4 OpenMP 3.0 Task queuing

The sample code can be found here in the \Nqueens\nq-openmp-taskq\directory.

Sometimes programs with irregular patterns of dynamic data structures or with complicated control structures like recursion are hard to parallelize efficiently. The work queuing model allows the user to exploit irregular parallelism, beyond that possible with OpenMP 2.0 or 2.5. The task pragma specifies the environment within which the enclosed units of work (tasks) are to be executed. When a task pragma is encountered lexically within a task block, the code inside the task block is conceptually queued into the queue associated with the task. To preserve sequential semantics, there is an implicit barrier at the completion of the task. The user is responsible for ensuring that no dependencies exist or that dependencies are appropriately synchronized, either between the task blocks, or between code in a task block and code in the task block outside of the task blocks. For our example this would look like:


#pragma omp parallel
#pragma omp single
{
  for(int i=0; i<size; i++) {
    // try all positions in first row
    // create separate array for each recursion
    // started here
#pragma omp task
    setQueen(new int[size], 0, i);
  }
}

In our example we only need one task queue therefore we need to set up the queue only by one thread (omp single). The setQueenscalls are independent of each others and therefore they fit nicely into the task concept.

With the Intel Parallel Debugger Extension in Visual Studio, you can easily inspect the state of tasks, teams, locks, barriers, or taskwaits in your OpenMP program in dedicated windows:

If you want to rule out problems that may have been introduced by threading your code you can also serialize the execution of the parallel regions without recompilation by selecting "Serialize parallel regions" from the Intel-specific debug menu. As a result the subsequent parallel regions in your program will be executed with only one thread, regardless of the settings of num_threads in your program or in the environment.

2.2.5 Parallelizing with Intel Threading Building Blocks (TBB)

The sample code can be found here in the \Nqueens\nq-tbb-intel\directory.

Intel Threading Building Blocks (TBB) offers a rich methodology to express parallelism in a C++ program. It is a library that helps you take advantage of multicore processor performance. It represents a higher-level, task-based parallelism that abstracts platform details and threading mechanism for performance and scalability. It nicely fits into the object-oriented and generic framework of C++. TBB uses a runtime based programming model and provides the user with generic parallel algorithms based on a template library similar to the standard template library (STL). The TBB task scheduler does the load balancing for you. With thread-based programming, you are often stuck dealing with load-balancing yourself, which can be tricky to get right. By breaking your program into many small tasks, the Intel TBB scheduler assigns tasks to threads in a way that spreads out the work evenly.

For the programmers not used to work with templates here is a very brief introduction to get you started with Intel TBB. Templates are programming constructs, which allow a more generic and data type independent programming. This generic approach can be combined with overloading and inheritance of operators to reach a very high level of abstraction and therefore allow powerful parallel design concepts.

Templates are construction skeletons with one or more parameterized types. Templates can be used to create functions (function templates), classes (class templates) and templates itself (template templates). The big advantage of templates in relation to macros is the type security, which is examined during compile time and thus helps avoids runtime errors. Here is an example of a simple function template to compute the maximum of two elements:


template <typename T>
T max(T x, T y)
{
  if (x < y)
    return y;
  else
    return x;
}

This template is called like a normal function: for example, max(3, 7). The compiler determines the type by looking at the parameters. So this would not only work for integers but for floating-point numbers, strings and even classes. The only necessary conditions is that compare operator and the copy constructor are defined for the used data type. The only other thing one needs to know to use TBB is how functors work. Functors in C++ are used to work on the elements of a given data structure. A functor is a class which is used as a function. The class is defined by overloading the function call operator therefore an operator() member function is defined. A simple functor would look like this:


#include <list>
#include <algorithm>
#include <iostream>
struct Sum {
  double sum;
  Sum():sum(0.0) {}
  void operator()(const double & d) { sum += d; }
};
std::list<double> myList;
 ...
Sum mySum = std::for_each(myList.begin(), myList.end(), Sum());
std::cout << "Summe " << mySum.sum << std::endl;

The class sum is doing the summation. The constructor is initialized with zero. The operator() is defined to add another value. The class is used together with the for_each statement to call the operator() for every element in MyList. The result is returned by for_each and the result is placed in mySum.sum. This methodology already used in C++ is extended by TBB to realize simple parallelization concepts.

Intel TBB provides a couple of functions templates like parallel_for, parallel_while, parallel_reduce, pipeline, parallel_sort, and parallel_scan, along with some concurrent containers. In our example we make use of the parallel_for template. This functor is used to divide the recursive distribution of the data for the parallel tasks up to an arbitrary selectable grain size. For the processing interval the TBB class tbb::block_range is used which can be instantiated by numbers, pointers and random-access iterators. The functor has to work on the whole range instead on a single element. For the solve loop, the sample implements the parallel_for template function. The first parameter of the call is a blocked_range object that describes the iteration space. The constructor takes three parameters: The lower and upper bound of the range and the <grainsize>. The parallel_for subdivides the range into sub-ranges that have approximately <grainsize> elements.

The second parameter to the parallel_for function is the function object to be applied to each subrange. So this would look like:


class SetQueens {
public:
void operator()( const blocked_range<size_t>& r ) const {
   for( size_t i=r.begin(); i!=r.end(); ++i ) {
   // try all positions in first row
   // create separate array for each recursion
   // started here
   setQueen(new int[size], 0, (int)i);
   }
}
void solve() {
   parallel_for(blocked_range<size_t>(0, size, 1), SetQueens() );
}

As in the OpenMP sample, this sample attempts to protect the solution counter by the mutual exclusion construct NrOfSolutionsMutexType:


if(row==size-1) {
{
   // increment is not atomic, so setting a lock is required here
   NrOfSolutionsMutexType::scoped_lock mylock(NrOfSolutionsMutex);
   nrOfSolutions++;
   }
   // closing block invokes scoped_lock destructor which releases lock
}

There are of course more and clever ways to implement this in TBB. For instance, a reduce operation can be used but this is left to the reader as an additional exercise. An interesting question is if the parallelization really is efficient and scalable.

One method is to look at the runtime and divide it by the number of threads to calculate the speedup in comparison to the serial version. The disadvantage of this approach is that one gets no insight into the parallel problem. There is a very sophisticated tool the Intel Thread Profiler which allows interesting insight into the real parallel execution of the concurrent flow graph. It works for all featured approaches. Here is the view of the Intel TBB runtime.

The two green bars show that our solution is well balanced but with a lot of synchronization overhead which is caused by synchronizing the access to the global variable which counts the number of solutions. It is crucial to all parallelization approaches to validate and verify that the used approach is providing the maximal performance by avoiding slowdowns caused by serialization. See Amdahl's law on how serial parts of a program can affect the parallel performance. In our example it is pretty straightforward to get rid of this unnecessary synchronization. Every time we add a number to the solution counter we have to synchronize this access, as it is a global variable for all threads. To avoid this synchronization we just have to implement a local solution variable for each thread. At the end of the program we simply sum up the local solution variables to the global one. The resulting code run should almost look like this:

2.2.6 Using lambda Functions with Intel TBB

The sample code can be found here in the \Nqueens\nq-tbb-lambda\directory.

Lambda functions have been included in the working draft of the next C++ standard C++0x. This addition makes STL/Intel TBB algorithms an order of magnitude more usable then in the past. The Intel compiler is the first C++ compiler which implements lambda function. A lambda abstraction in general defines an unnamed function, so it's a bunch of code without a name. Lambda functions together with closures are a powerful concept because they combine code with a scope. The characteristic of this construct is that it ties a scope to a code block which than can be passed around and the scope doesn't change for the particular block it is tied to. The term used in the standard is "capture" (of an auto), and an auto can be captured by value or reference. A closure is a function that can refer to and alter the values of bindings established by binding forms that textually include the function definition. A lambda construct is almost the same as a function object in C++ or a function pointer in C. A closure then can be seen as a pointer to a function and a pointer to a set of name/variable bindings. Finally a lambda function together with a closure can be seen as a lot of syntactic sugar around function objects and function pointers. This syntactic sugar is intended to be a convenient way to write function objects or lambdas right at the point of use.

This section explains in some detail about using of C++ Lambda expressions. In order to use Intel's implementation of lambda expressions, you need to request C++0X with the command-line option /Qstd that is: /Qstd=c++0X. The compiler creates an anonymous function object upon evaluating a lambda expression. This function object, created by a lambda expression, may live longer than the block in which it is created.

You must ensure that it does not use variables that were destroyed before the function object is used. The following example shows how a function object created by a lambda expression will look like in the context of the Nqueens example. Tighter C++ and TBB integration allows the simplification of the functor operator() concept by using lambda functions and closures to pass code as parameters:


void solve() {
  parallel_for(blocked_range<size_t>(0, size, 1),
   [](const blocked_range<int> &r){
    for (int i = r.begin(); i != r.end(); ++i) {
     setQueen(new int[size], 0, (int)i);
    };
 });
}

Previous 1 2 3 4 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Tools

Parallelizing N-Queens with the Intel Parallel Composer

2 Parallelizing the Solution

2.1 The Serial "Solver" Function

2.1.1 Standard Template Library (STL) Solution

2.1.2 Serial Solution

2.2 Parallelizing the "Solver"

2.2.1 Windows 32 Threading API

2.2.2 Parallelizing with task and taskcomplete

2.2.3 Starting with OpenMP 3.0

2.2.4 OpenMP 3.0 Task queuing

2.2.5 Parallelizing with Intel Threading Building Blocks (TBB)

2.2.6 Using lambda Functions with Intel TBB

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Tools Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Tools

Parallelizing N-Queens with the Intel Parallel Composer

2 Parallelizing the Solution

2.1 The Serial "Solver" Function

2.1.1 Standard Template Library (STL) Solution

2.1.2 Serial Solution

2.2 Parallelizing the "Solver"

2.2.1 Windows 32 Threading API

2.2.2 Parallelizing with __task and __taskcomplete

2.2.3 Starting with OpenMP 3.0

2.2.4 OpenMP 3.0 Task queuing

2.2.5 Parallelizing with Intel Threading Building Blocks (TBB)

2.2.6 Using lambda Functions with Intel TBB

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Tools Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

2.2.2 Parallelizing with task and taskcomplete