
Clay Breshears

Dr. Dobb's Bloggers

Parallel Shellsort

October 08, 2012

As described, the Shellsort algorithm interleaves h independent Insertion Sorts. Before the i-loop starts, each of the h sorted partitions contains one element. First, one unsorted item is inserted into the first partition, then an item is inserted into the second partition, then an item is inserted into the third partition, and so on until one item has been inserted into the hth partition. Then one more item is inserted into each of the partitions, one after the other. This continues until the rightmost item has been inserted into its sorted position within its partition.

When you think about it, the sorting within a partition is independent of the sorting within any other partition. There are no race conditions since there are no shared array locations used between partitions for a given value of h. As written, the serial code doesn't expose the data decomposition opportunities of each independent h-partition. For concurrent design and implementation purposes, it would be better for the serial algorithm to sort an entire h-partition of data before going on to sort the next h-partition. The following modified Shellsort code adds an outer loop to do just that.

void Shellsort(int N, int *A)
{
  int i, j, k, h, v;
  h = 1; while (h < N) h = 3*h + 1;
  h /= 3;
  while (h != 1) {
    for (k = 0; k < h; k++) {
      for (i = k+h; i < N; i+=h) {
        v = A[i]; j = i;
        while (j >= h && A[j-h] > v) {  // stay within the array
          A[j] = A[j-h]; j -= h;
        }
        A[j] = v;
      }
    }  // for k
    h /= 3;
  }
  InsertionSort(N,A);  // clean up
}

This modification of the serial algorithm is an example of changing an algorithm to gain a better chance for concurrency. With this formulation of Shellsort, we can easily parallelize the code. A loop-parallel solution such as OpenMP or Intel Threading Building Blocks will work on the outer k-loop. The code below shows an OpenMP version of the modified serial code. I have declared the i, j, and v variables within the scope of the k-loop to make them private by default. The firstprivate clause on the parallel pragma creates a private copy of the variable h for each thread and initializes it with the value of the shared copy prior to entering the parallel region.

void parallel_shellsort(int N, int *A)
{
  int k, h;
  h = 1; while (h < N) h = 3*h + 1;
  h /= 3;
#pragma omp parallel firstprivate(h)
 {
  while (h != 1) {
#pragma omp for
    for (k = 0; k < h; k++) {
      int i, j, v;
      for (i = k+h; i < N; i+=h) {
        v = A[i]; j = i;
        while (j >= h && A[j-h] > v) {  // stay within the array
          A[j] = A[j-h]; j -= h;
        }
        A[j] = v;
      }
    }  // for k
    h /= 3;
  }  // end while
 }  // end parallel region
  InsertionSort(N,A);  // clean up
}

Within the parallel region, iterations of the k-loop are divided among the threads, and the implicit barrier at the end of the omp for worksharing construct holds the threads until all the partitions have been sorted. After release from the barrier, each thread updates its private copy of h. Since every thread holds the same value of h, the number of iterations of the while-loop is the same for each thread, and all threads reach the end of the parallel region at the same time.

It should be obvious that there is one flaw in the scalability of concurrent Shellsort. Things start out well. At any time during the sorting, there are h independent partitions that differ in size by at most one item, which makes for good load balance between partitions. However, the number of partitions steadily decreases as the algorithm proceeds, until eventually there is a single partition of data to be sorted. Regardless of the data size, the scalability of this concurrent algorithm gets worse as the computation proceeds.

Besides the declining parallelism in succeeding rounds of Shellsort, there is another mark against this threaded code, one that holds a lesson when looking to parallelize other algorithms. The strength of the Shellsort algorithm, making fewer exchanges over long distances to quickly move items closer to their final positions, is the concurrent algorithm's weakness. In the first serial version of the algorithm, the long-distance moves of data can take advantage of some cache reuse. With the serial code modified to process an entire h-partition before moving on to the next partition, I get even better cache reuse for each iteration of the i-loop. In fact, by working through an entire h-partition at once, subsequent processing of h-partitions will likely find many, if not all, of the necessary cache lines already available.

Under the parallel version, assigning consecutive h-partitions to different threads runs into false-sharing problems. Any two successive h-partitions share almost all of the same cache lines. The concurrent algorithm is constantly reading (to compare the item to be inserted) and writing (to move items down and open an insertion point) elements from those cache lines on each iteration of the while-loop within the i-loop.

Like Madame Defarge, be aware that your parallel algorithm may hide a terrible nature beneath a veneer of quietly "knitting" in the corner. Behaviors that may be positive in the serial code can be broken and detrimental when made to run in parallel. In some cases, those detriments can be moderated with some extra coding or a slightly different approach to the order of execution. Other cases will require a complete revision of your approach or an admission that a parallel algorithm will not be efficient.
