# Finding My Summer Vacation's Shortest Path

With such simple loops, the actual concurrent implementation can easily be done in OpenMP or Intel Threading Building Blocks. Below I've given a TBB solution. The nested loops have been placed into the `tryK` class, which is called from the `parallel_for()` algorithm using the built-in `blocked_range2d` range class to divide up the iterations of the two innermost loops.

```cpp
#include <algorithm>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range2d.h"
using namespace tbb;

class tryK {
    const int k;
    float **D;
public:
    tryK(const int k_, float **D_) : k(k_), D(D_) {}

    void operator()(const blocked_range2d<int>& r) const {
        for (int i = r.rows().begin(); i < r.rows().end(); i++)
            for (int j = r.cols().begin(); j < r.cols().end(); j++)
                D[i][j] = std::min(D[i][j], D[i][k] + D[k][j]);
    }
};

void cFloyds(float **D, int N) {
    // rGrainSize and cGrainSize are constants set by the programmer
    for (int k = 0; k < N; k++)
        parallel_for(blocked_range2d<int>(0, N, rGrainSize, 0, N, cGrainSize),
                     tryK(k, D));
}
```

The first argument to the `parallel_for()`

call is the range of the loop that is to be subdivided and run in parallel. Since I'm using the `blocked_range2d`

class, once I have specified the type of the two ranges, I need to list two triples as the parameters to the `Range`

class. The initial triplet is for the first (`rows`

) loop and the other for the second (`cols`

) loop. The first two parts of a triplet are the start and end points of the corresponding loop. The third part is the granularity argument: the size of a subrange that should not be further subdivided.

In the above code, I've assumed that the two constants, `rGrainSize` and `cGrainSize`, are set by the programmer. These fix the number of rows and columns, respectively, at which the TBB scheduler stops subdividing that dimension of the subrange. Finding good values would likely involve some trial-and-error testing to see how small the array blocks can get before performance suffers.

You may have noticed that the entire array is accessed within each iteration. I began this post talking about a lower triangular matrix needed to find the distance on a map between two cities. On a map, roads go both ways, at least between cities, so why do I need to compute over the whole array? Couldn't the whole operation be further optimized by using only half the matrix? If I have a directed graph, there can be a path from Peoria to Quebec, but there may be no path from Quebec to Peoria. (Since my scenario takes place in summer, there might be road construction that closes all roads *into* Peoria or just the one you would normally travel when driving down from Quebec.) To compute the shortest path for all pairs of nodes in the graph, it's best to work with the whole array. Besides, the extra coding to translate the order of the `i`, `j`, and `k` index variables to access a valid element in the lower triangular section can add complexity to the code and overhead to the computation.

The details of the TBB code should be comprehensible if you understand the serial algorithm, the serial code, and the operation of the TBB `parallel_for()` algorithm. However, if you've been concentrating as you examined the code rather than simply lounging under a tree, sipping a Mint Julep, you might be asking yourself, "What about a data race on the `k`th row?" If you are imbibing a refreshing drink in the shade of a spreading chestnut, clear your thoughts and go back to the last code block to see if you can locate a problem.

If you examine the serial algorithm, each iteration of the `k`-loop accesses (reads and potentially updates) all entries of the `D` matrix. The structure of the loops does this one row at a time, reading values from a fixed row (`k`). Without loss of generality, I can assume that entire row updates are assigned to individual threads by TBB. Consider two threads, `T0` and `T1`, where `T0` is assigned some row `i != k` and `T1` is assigned to update row `k`. Are there interleavings of `T0` and `T1` such that `T1` could update `D[k][j]` both before (one interleaving) and after (another interleaving) `T0` updates `D[i][j]`? Yes, I can show that quite easily. But does it matter, or can you return to your lounging activities?

This last question is really the crux of the matter. Let me rewrite the body of the inner loop by substituting `k` for `i` to explicitly illustrate the computation that `T1` would execute. This substitution gives me

```cpp
D[k][j] = min(D[k][j], D[k][k] + D[k][j]);
```

From the original weight matrix, I know that `D[k][k]` will be zero. The only way to get a smaller value in a main-diagonal element of the distance matrix would be to have a negative cycle from a node to itself; however, it was stipulated that the graph could have no negative cycles. The result of the above operation is to take the minimum of `D[k][j]` and itself, which results in no updates to any element of the distance matrix within row `k`. Thus, regardless of how TBB divides up the distance matrix for updates, there will be no data race on elements of the (fixed) row `k`.

Sometimes interleaving analysis is not quite enough. Just because there appears to be a data race doesn't mean there is a problem: you also have to show that the problematic interleaving schedule has actual adverse consequences.

And that is my tip for your travel planning process. That, and don't forget to pack plenty of sunscreen.