# You've Got Serial Code in my Peanut Butter

You may be familiar with the saying "When you have a hammer, everything looks like a nail." Parallelism is a fine "hammer," but sometimes you are confronted with nuts and bolts or pins and needles. One of the most often cited examples of where parallelism does not apply is the serial process of pregnancy. You can't create a child in just one month using nine women.

A less clear case of when parallelism isn't a viable option is for data dependencies or race conditions that result from parallelization of code. Yes, we all understand the need for serializing code that accesses shared resources, but identifying those parts of the code can be less obvious. Software tools can be an indispensable assistant for locating race conditions, especially when dealing with variable accesses via pointers or data structures that have been implemented by a third party.

Due to the fixed word sizes within processors, round off error will only approximate floating-point computations. Everyone has learned to live with this restriction in a serial execution. However, when you change up the order of the same computations, the effects of round off can drastically alter the answer. (Quick example: Add up 10 million very, very small numbers and then one relatively large number. If all the small numbers are done first, they might actually sum up to something that isn't dwarfed by the remaining large number. On the other hand, starting with the large number and adding all the small values to it might eliminate the effect of all the small values in the final total.) Doing floating-point computations in parallel will change up the relative order of those operations and potentially change the final results.

If a bit-for-bit equivalent result to the serial code outcome is required from the parallel code, your only option may be to execute the critical computations in serial. If all of the computations are deemed "critical," your parallel scalability will be hobbled and the application could very easily perform worse than the serial version with all the overhead of the serial implementation in the middle of the parallel execution.

So far, so obvious, right?

I began reviewing these issues when I got a question about my implementation of Depth-First Search (DFS) on a graph that I used in *The Art of Concurrency*. The observation was that the results of the program were different when run in parallel than they were when run in serial. Knowing that the most critical requirement for a correct parallel code (modulo round off error) is to get the same answer as an equivalent serial code, I reviewed my DFS implementation.

The parallel algorithm starts with some node in the graph and places a reference to all other nodes that are connected to this node by an edge into a shared stack. Once the stack has been primed, all threads participating in the search will pop a node from the stack. If that node has not been visited by some previous thread, the current node is noted as having been visited and the required computation is performed on the contents of the node. Finally, any unvisited nodes connected to the current node are pushed onto the stack before the thread pops a new node from the stack.

Since my code was just for illustration purposes, there is actually no work being done at each node as it is visited. The application simply notes the order of visitation taken from a shared counter. In serial, because of the stack structure, the visitation order of nodes will be fixed based on the starting node in the graph and the order in which connected nodes are pushed onto the stack. Since I was assuming that the processing done at each node must be independent of processing any other node in the graph, the order of node visitations being different between a serial execution and a two- or four- or eight-thread execution (or even between two different runs of multiple threads) was a feature, not a problem.

It turns out that the order of visitation is actually an important aspect of some computations. I was introduced to the notion of a "topological order" of graph node processing. In a directed acyclic graph (DAG), a depth-first search defines a topological order of nodes. This order is used in solving sparse linear equations where the matrix of a sparse system can be modeled as a DAG and the topological order on the DAG determines the order of equations to be solved. Assuming that the computation at each node in the graph is still independent (or mostly independent) of all other nodes, the question was posed: How could the parallel DFS visit nodes in the same order each time the code was run and did not depend on the number of threads used?

After some thought, I came up with a "back of the envelope" solution, which I will describe in my next post. Think about this on your own and we'll compare notes.