# You've Got Parallel Code in My Chocolate

So, here's the problem in a nutshell: My take on a parallel Depth-First Search (DFS) algorithm visits all nodes in a graph and the order of visitation is likely to change from one run to the next. I've recently been told that there are algorithms that require the visitation order of nodes be the same as it is for a serial execution. This is known as a "topological order" of graph nodes, which is defined by serial DFS on a directed acyclic graph (DAG). The context for this order requirement is in solving sparse linear equations where the matrix of a sparse system can be modeled as a DAG. The topological order on the DAG determines the order of equations to be solved. Thus, is there some way to process (visit) nodes of a DAG that touches those nodes in the same order each time the code is run and does not depend on the number of threads used?

I am assuming that the computation at each node in the graph is independent of all other nodes. However, knowing a little about linear algebra and solving systems of equations, I suspect that the computations are going to be "mostly independent." I'll address this potential dependence at the end of this post after I expound upon my initial thoughts around how to access nodes in a specific ordering.

While the original parallel solution I gave in *The Art of Concurrency* uses a single queue, the extra condition on visitation order has me thinking about using two containers: a stack to traverse/visit nodes and a queue to hold the tasks of processing the nodes. One thread handles the traversal while the other `N-1` threads do the node processing.

The "boss" thread works alone to visit each node in the DAG in topological order. When an unvisited node is popped off the stack, the boss thread encapsulates that node into a task and puts that task into the queue. The "worker" threads wait on the queue. When something is available, a worker dequeues a task and performs the required computation.

The single thread processing all the unvisited nodes from the stack will preserve the precedence of visitation to be topological as required. All other threads pulling tasks out of a queue will at least dole out node processing in the proper order, but will also allow concurrent execution of that processing. Below is code to implement this strategy using Windows threads. First, the shared declarations and the boss thread’s function.

```cpp
long visited[NUM_NODES];
long order[NUM_NODES];
int adj[NUM_NODES][NUM_NODES];
long gCount = 0;
stack<int> S;             // STL stack class
concurrent_queue<int> Q;  // TBB concurrent_queue container

unsigned __stdcall bossThread(void *pArg)
{
    int i, k;
    int nodeCount = 0;

    while (S.size() > 0) {
        k = S.top(); S.pop();
        if (InterlockedCompareExchange(&visited[k], 1L, 0L) == 0) {
            Q.push(k);                // enqueue on workers' queue
            order[k] = nodeCount++;
        }
        for (i = NUM_NODES-1; i >= 0; i--)
            if (adj[k][i] && !visited[i]) S.push(i);
    }
    return 0;
}
```

The `visited` array tracks each node's status (all entries initialized to zero for "unvisited"; the boss sets an entry to 1 when the node is queued, and a worker later sets it to 2 when the node is processed). The `order` array records the position at which each node was placed into the worker queue, and the `adj` matrix is the adjacency matrix representation of the graph. The `gCount` counter keeps track of the number of nodes that have been processed by the worker threads. When this counter reaches `NUM_NODES`, the computation has completed and the worker threads will terminate. Since only the boss thread will be using it, I have decided to use the STL `stack` object (`S`) as the stack for assuring nodes are queued in topological order.

One instance of the `bossThread()` function is launched and assumes that one reference to each node in the graph has been pushed onto `S`. (This ensures that all connected components of the graph will be processed.) As long as there is something still on the stack, the boss thread pops off a value and, if the boss thread has not visited this node, the node is placed at the tail end of the queue, `Q`. At this point, just for the purposes of my example code, I note the `order` in which the node was placed into the worker queue, which will yield the topological order of the graph nodes. Finally, any adjacent node that has not been visited by the boss thread is pushed onto the stack before the boss pops the next node off the stack.

This really is just the serial DFS algorithm, with the "processing" of a node being simply to put it into the queue for the worker threads, which do the actual computation for each node. Simple enough, right? The worker thread function, `pDFSearch()`, is also pretty simple.

```cpp
unsigned __stdcall pDFSearch(void *pArg)
{
    int k;

    while (1) {
        if (gCount == NUM_NODES) break;   // computation is complete
        if (!Q.try_pop(k)) continue;      // nothing ready; test exit condition again
        if (InterlockedCompareExchange(&visited[k], 2L, 1L) == 1) {
            InterlockedIncrement(&gCount);
            /* Do something to VISIT node k */
        }
    }
    return 0;
}
```

(Note that `k` must only be used after a successful `try_pop()`; testing the exit condition and going back to the top of the loop when the pop fails avoids touching a stale or uninitialized `k`.)

I set each worker thread into a loop that exits when `gCount` reaches `NUM_NODES`. If the counter hasn't reached the final value, the thread looks into the queue, `Q`, for a new node to process. If no node is ready, control returns to the top of this spin-wait loop, where the exit condition is tested again in case the computation finished in the meantime. If a node was pulled off the queue, it is processed after the `gCount` counter is incremented atomically with `InterlockedIncrement()`, and then the thread goes back to the queue for the next node (or to find that the termination condition has been reached).

You may have noticed that I've glossed over the use of the `InterlockedCompareExchange()` function. Recall that the three parameters are the destination, the new value, and the old value. In an atomic way, the function compares the value stored in the destination with the old value and, if they are equal, stores the new value in the destination. The function returns the original value of the destination variable.