Clay Breshears

Dr. Dobb's Bloggers

Implementing Partition with Prefix Scan

July 11, 2012

The number of elements in the subset of A to be sorted next is examined. If there is more than one, a temporary pointer (At) is initialized to the first subset element in A. This first element is chosen as the pivot element and the actual number of non-pivot elements in the current subset (n) is computed. At this point, I can populate the (local) mask vectors for elements less-than and greater-than the pivot value. I've chosen the TBB parallel_for() algorithm to do this since I'm expecting that there are many threads/cores available that are not being used to run the Quicksort() function. I've shown the lambda version of parallel_for() here.

Once the vectors have been filled, the [0] element in each (corresponding to the pivot value) is set to -1 since I know that the parallel_scan() is an inclusive scan and I need the result from an exclusive scan for my 0-based array indexing.
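A serial equivalent of that mask fill might look like the sketch below. The names At, LT, and GT follow the article; the helper function itself, and the layout assumption that the pivot sits in At[0] with the subset occupying At[0..n-1], are mine.

```cpp
#include <vector>

// Serial equivalent of the parallel_for that fills the mask vectors.
// Assumptions (mine, not spelled out in the article): the pivot is in
// At[0] and the scanned subset occupies At[0..n-1].
void fillMasks(const std::vector<int>& At, int n,
               std::vector<int>& LT, std::vector<int>& GT)
{
    int pivot = At[0];
    for (int i = 1; i < n; ++i) {          // run with parallel_for in the article
        LT[i] = (At[i] < pivot) ? 1 : 0;
        GT[i] = (At[i] < pivot) ? 0 : 1;   // duplicates land in greater-than
    }
    // Pivot slots preset to -1 so the inclusive scan behaves like an
    // exclusive scan for 0-based array indexing.
    LT[0] = GT[0] = -1;
}
```

In the real code this loop body sits inside the parallel_for() lambda; the serial form just makes the mask values easy to check by hand.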

Before going further, I must point out that my simple test workload generated unique keys in my data set. The parallel_scan() code shown below will handle duplicate keys in the less-than and greater-than partitions. I've neatly side-stepped handling of duplicate pivot keys by having those elements placed into the greater-than partition by default (i.e., not being less than the pivot). If a separate pivot partition were used, I would need a third mask vector and parallel_scan() call to pull out those equal-to pivot elements. In cases where there are duplicate keys, handling the equal-to partition would reduce the sizes of the subarrays that need to be further sorted and potentially shorten the overall execution time, but it would increase the code complexity and add some processing time to each iteration.
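For reference, that three-way variant might look like the following serial sketch. The EQ mask and this helper are hypothetical, not part of the article's code; the layout assumption is the same as before (pivot in At[0], subset in At[0..n-1]).

```cpp
#include <vector>

// Hypothetical three-way mask fill: a third vector (EQ) pulls out the
// equal-to-pivot elements, at the cost of a third parallel_scan.
void fillThreeWayMasks(const std::vector<int>& At, int n,
                       std::vector<int>& LT, std::vector<int>& EQ,
                       std::vector<int>& GT)
{
    int pivot = At[0];
    for (int i = 1; i < n; ++i) {
        LT[i] = (At[i] <  pivot) ? 1 : 0;
        EQ[i] = (At[i] == pivot) ? 1 : 0;
        GT[i] = (At[i] >  pivot) ? 1 : 0;
    }
    LT[0] = EQ[0] = GT[0] = -1;   // same exclusive-scan adjustment
}
```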

Speaking of scans, here are the two parallel_scan() calls, again, assuming I have many available threads/cores that can perform this computation.

      parallel_scan(blocked_range<int>(0, n), pScan(B, LT), 
                    auto_partitioner());
      parallel_scan(blocked_range<int>(0, n), pScan(C, GT), 
                    auto_partitioner());

      int ll = B[n-1]+1;  // length of data movement vectors
      int gl = C[n-1]+1;

The less-than scan results are stored in the B array while the greater-than scan results are placed in the C array. After the scans are done, the number of items in each partition is taken from the last element of the result arrays. The lengths of the two partitions (ll and gl) will be used to copy the values back to the A array and to set the index pairs that will be placed into the shared queue for some thread to pick up and sort.
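A serial stand-in for those scans shows why the preset -1 turns the inclusive result into exclusive destinations. This helper is my own sketch; the actual pScan class would implement TBB's two-pass parallel_scan Body protocol, which I don't reproduce here.

```cpp
#include <vector>

// Serial stand-in for parallel_scan: an inclusive prefix sum over a mask.
// Because mask[0] was preset to -1, out[i] is the 0-based destination of
// element i within its partition, and out[n-1]+1 is the partition length.
void inclusiveScan(const std::vector<int>& mask, int n, std::vector<int>& out)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += mask[i];
        out[i] = sum;
    }
}
```

For example, a less-than mask of {-1, 1, 0, 0, 1} scans to B = {-1, 0, 0, 0, 1}, so ll = B[4]+1 = 2: two elements in the less-than partition, with destinations 0 and 1.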

The next step is to move the elements from their original spots in A to their new locations in A.

      parallel_for(blocked_range<int>(1, n),
        [&](const blocked_range<int>& ii) {
          for (int i = ii.begin(); i < ii.end(); ++i) {
            if (LT[i])
              Ld[B[i]] = At[i];
            else
              Gd[C[i]] = At[i];
          }
        }
      );

      q = ll+p;   // final index of the pivot in A

      // Copy back to A
      for (int i = 0; i < ll; ++i)
        At[i] = Ld[i];
      At[ll++] = pivot;
      for (int i = 0; i < gl; ++i)
        At[ll+i] = Gd[i];

In the TBB parallel_for() algorithm, I use the fact that each element will be in only one of the partitions to determine which element is moved in each iteration of the loop. I've also used "work" arrays, Ld and Gd, to temporarily hold elements. The index into these work arrays receiving a data item is the corresponding value from the scan result vectors, B and C.

Once all the data items have been partitioned into the work arrays, I move them back, in order, to the A array for further sorting. I've used a serial algorithm here since the sizes of the partitions are not necessarily equal. I could have done this in a parallel_for() call with the upper bound being the maximum of ll and gl, but I would then need a test to make sure there is (or isn't) an item to be moved back for each value of the loop counter. With the overhead of the parallel_for() plus the added comparison test, I figured it was likely not worth the effort to parallelize this section of my code. (Experimentation might determine the partition size at which such parallelization becomes worthwhile, falling back to the serial code when the partitions are too small.)

Finally, the rest of the Quicksort() function does some housekeeping to prepare for sorting the two newly created partitions. This involves incrementing a shared count (tCount) of total sorted items (the pivot element was put into its sorted position as a result of the Partition operation). I need an atomic pragma to avoid a data race between the OpenMP threads running Quicksort(). Then I package up the indices for the partitions and push each of the pairs into the shared queue.

#pragma omp atomic
      ++tCount;
      d1.lo = p;
      d1.hi = q-1;
      Q.push(d1);
      d2.lo = q+1;
      d2.hi = r;
      Q.push(d2);
    }
    else if (p == r) {
#pragma omp atomic
      ++tCount;
      if (tCount == N) Done = TRUE;   // Signal main thread that sorting is done
    }
  }      // while(1)
}

The final bit of code handles the case when there is only one item to be partitioned. Since one thing by itself is in sorted order, I only need to increment the shared counter (tCount). Also, any thread that has only one item to be partitioned will check to see if the whole array has been sorted by comparing the shared counter value with the number of items to be sorted (N). Each thread could make this check after all the previous partitioning work. However, only those threads that encounter a single item to be partitioned really have a chance that this is the last item to be placed into its sorted position (which it already will be). This cuts down on the total number of superfluous tests that are done.

So, there you have it, a prefix scan-based Partition algorithm within Quicksort. If I have lots and lots of cores available, I can reduce the span of the execution and achieve some good speedup by harnessing the extra cores to execute the prefix scans and the parallel loops that I've coded using the Intel Threading Building Blocks algorithms.
