Parallel Shellsort
As described, the Shellsort algorithm interleaves h independent Insertion Sorts. Starting out, before the i-loop starts, the h sorted partitions each contain one element. First, one unsorted item is inserted into the first partition, then an item is inserted into the second partition, then an item is inserted into the third partition, and so on until one item is inserted into the h partition. One more item, one after the other, is inserted to each of the partitions. This continues until the rightmost item is inserted to the sorted position with its partition.
When you think about it, the sorting within a partition is independent of the sorting within any other partition. There are no race conditions since there are no shared array locations used between partitions for a given value of h. As written, the serial code doesn't expose the data decomposition opportunities of each independent h-partition. For concurrent design and implementation purposes, it would be better for the serial algorithm to sort an entire h-partition of data before going on to sort the next h-partition. The following modified Shellsort code adds an outer loop to do just that.
void Shellsort(int N, int *A)
{
int i, j, k, h, v;
h = 1; while (h < N) h = 3*h + 1;
h /= 3;
while (h != 1) {
for (k = 0; k < h; k++) {
for (i = k; i < N; i+=h) {
v = A[i]; j = i;
while (A[j-h] > v) {
A[j] = A[j-h]; j -= h;
if (j <= h) break;
}
A[j] = v;
}
} // for k
h /= 3;
}
InsertionSort(N,A); // clean up
}
This modification of the serial algorithm is an example of changing the algorithm to gain a better chance for concurrency. With this formulation of Shellsort, we can easily parallelize the code. On the outer k-loop, a loop parallel solution such as OpenMP or Intel Threading Building Blocks will work. The code below shows an OpenMP version of the modified serial code. I have declared the i, j, and v variables within the scope of the k-loop to make them private by default. The firstprivate clause on the parallel pragma will create a private copy of the variable h and will also initialize it with the value of the shared copy prior to entering the parallel region.
int parallel_shellsort(int N, int *A)
{
int k, h;
h = 1; while (h < N) h = 3*h + 1;
h /= 3;
#pragma omp parallel firstprivate(h)
{
while (h != 1) {
#pragma omp for
for (k = 0; k < h; k++) {
int i, j, v;
for (i = k; i < N; i+=h) {
v = A[i]; j = i;
while (A[j-h] > v) {
A[j] = A[j-h]; j = j – h;
if (j <= h) break;
}
A[j] = v;
}
} // for k
h /= 3;
} // end while
} // end parallel region
InsertionSort(N,A); // clean up
}
Within the parallel region, iterations of the k-loop are divided among the threads and the implicit barrier of the parallel omp for pragma holds threads until all the partitions have been sorted. After release from the barrier, each thread updates its local copy of h. Since each thread will hold the same value of h, the number of iterations will be the same for each thread and threads will terminate execution of the parallel region at the same time.
It should be obvious that there is one flaw with the scalability of concurrent Shellsort. It all starts out respectable. At any time during the sorting, there will be h different independent partitions that will differ in size by at most one item, which speaks to good load balance between partitions. However, the number of partitions steadily decreases as the algorithm proceeds. Eventually there will be a single partition of data to be sorted. Regardless of the data size, the scalability of this concurrent algorithm gets worse as the computation proceeds.
Besides the declining parallelism involved in succeeding rounds of Shellsort, there is another mark against this threaded code that should also provide a lesson when looking to parallelize other algorithms. The strength of the Shellsort algorithm for fewer exchanges done over long distances to quickly place items closer to their final position is the concurrent algorithm's weakness. In the first serial code version of the algorithm, the long distance moves of data can take advantage of some cache reuse. With the serial code modifications to process an entire h-partition before moving on to the next partition, I have even better cache reuse for each iteration of the i-loop. In fact, by working through the entire h-partition all at once, subsequent processing of h-partitions will likely find many, if not all, the necessary cache lines already available.
Under the parallel version, assigning consecutive h-partitions to different threads will start to run into false-sharing problems. Any two successive h-partitions will share almost all of the same cache lines. The concurrent algorithm is constantly reading (for comparison of the item to be inserted) and writing (moving items down to make an insertion point) elements from the cache lines for each iteration of the while-loop within the i-loop.
Like Madame Defarge, be aware that your parallel algorithm may hide a terrible nature beneath a veneer of quietly "knitting" in the corner. Behaviors that may be positive in the serial code can be broken and detrimental when made to run in parallel. In some cases, those detriments can be moderated with some extra coding or a slightly different approach to the order of execution. Other cases will require a complete revision of your approach or an admission that a parallel algorithm will not be efficient.

