
Clay Breshears

Dr. Dobb's Bloggers

Insert Your Own Clever Bubblesort Pun

March 30, 2012

Of course, for the last partition (myid == NUM_THREADS-1) there is no succeeding partition, so no final comparison is needed. All other threads now become "peeping Toms," looking over the boundary at the first element of the next partition. If the next thread has gotten a slow start and has not yet processed its first comparison (from the second code segment), there is a chance of a data race. To avoid this, the peeking thread must hold the BLock[] mutex indexed by its own myid.

Let me illustrate this with a concrete example using threads (Tx) whose myid values are 4 and 5. For my purposes here, I will have T5 unable to get past the initial assignment to the local exch flag. T4, on the other hand, races through the partition of data assigned to it and arrives at the code segment above. At this point both threads are given free rein to execute, and both want read (and potentially write) access to the first element of T5's partition. Before any comparison can take place, however, each thread must acquire a mutex from the BLock array. T4 requests the mutex associated with its own myid value (BLock[4]), since it is the peeking thread, and T5 requests the mutex indexed by the ID of the thread that could be peeking into its partition, which is also BLock[4]. Only one of the threads is granted the mutex and makes the comparison (and possible swap), while the other waits for the mutex to be released before performing its own compare (and possible swap).
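The boundary protocol can be sketched in a few lines of Pthreads code. This is a minimal illustration, not the article's actual source: the worker function name, partition arithmetic, and a two-thread setup are my own, and only a single pass is shown. The key point is that thread myid guards its own first comparison with BLock[myid-1], while the peek over the right boundary is guarded by BLock[myid], so both contenders for a boundary element request the same mutex.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 2
#define N 8

int A[N] = {4, 7, 2, 9, 8, 1, 6, 3};       /* illustrative data */
pthread_mutex_t BLock[NUM_THREADS];        /* BLock[i] guards the boundary
                                              between partitions i and i+1 */

static void *bubble_pass(void *arg)
{
    long myid = (long)arg;
    int start = (int)myid * (N / NUM_THREADS);
    int end   = start + (N / NUM_THREADS) - 1;  /* last index in my partition */
    int j, t;

    for (j = start; j < end; ++j) {
        if (j == start && myid > 0) {
            /* First comparison: the thread on my left may be peeking at
               A[start], so hold the boundary mutex it will also request. */
            pthread_mutex_lock(&BLock[myid - 1]);
            if (A[j] > A[j + 1]) { t = A[j]; A[j] = A[j + 1]; A[j + 1] = t; }
            pthread_mutex_unlock(&BLock[myid - 1]);
        } else if (A[j] > A[j + 1]) {
            t = A[j]; A[j] = A[j + 1]; A[j + 1] = t;
        }
    }

    /* Peek over the boundary at the first element of the next partition,
       holding the mutex indexed by my own myid. The last partition has
       no successor, so it skips this step. */
    if (myid != NUM_THREADS - 1) {
        pthread_mutex_lock(&BLock[myid]);
        if (A[end] > A[end + 1]) { t = A[end]; A[end] = A[end + 1]; A[end + 1] = t; }
        pthread_mutex_unlock(&BLock[myid]);
    }
    return NULL;
}

/* Run one full pass with NUM_THREADS workers. */
void run_pass(void)
{
    pthread_t th[NUM_THREADS];
    long i;
    for (i = 0; i < NUM_THREADS; ++i) pthread_mutex_init(&BLock[i], NULL);
    for (i = 0; i < NUM_THREADS; ++i) pthread_create(&th[i], NULL, bubble_pass, (void *)i);
    for (i = 0; i < NUM_THREADS; ++i) pthread_join(th[i], NULL);
}
```

Note that a single pass is not a complete sort; in the full algorithm this body sits inside the while loop governed by the exchange-counting code shown below.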

Before we go on to the final code segment, you may be wondering whether the above will still give us a correctly sorted output. (Spoiler alert: It does.) Can we prove that it will, or show a case where it leads to an erroneous result? In my post describing the task decomposition version of parallel Bubblesort, I needed three cases (only T4 swaps, only T5 swaps, and both T4 and T5 swap) to demonstrate the correctness of the algorithm. I can use those three cases again to show that my data decomposition algorithm results in a sorted output. On top of those three cases, though, I need to add the ordering of which thread gains the mutex first and thus does its comparison before the other thread is allowed to proceed. That makes six cases to examine. In my prior post, the largest element from the T4 partition had already been swapped into the first element of the T5 partition, which is not the case for this algorithm. Thus, I also need to consider whether the outcome of the first compare-and-swap does or doesn't alter the swap decision of the second thread, which doubles my total again to 12 cases.

Rather than enumerate all 12 cases (and since three items can be arranged in only 6 ways, I'm not completely convinced all 12 are possible or needed), let me settle this with some simple logic. In the worst-case scenario, a value may travel one position in the wrong direction when it is found close to the boundary and the threads are synchronized as described above. However, since at least one of the threads performed an exchange, each thread will make another pass through the data, and the item that inadvertently moved the wrong way will be recognized as out of place and set back on the path of "forward" motion. Hence, the worst we can expect is that a few more passes than absolutely necessary will be required. Since the code runs in parallel, I'm hoping that won't be a large or even common detriment.

This brings us to the point where the threads must determine if any of their sister threads executed an exchange or that none of them did and the sort is complete.

    pthread_mutex_lock(&m_Done);
    if (!exch) ++Done[k%2];
    pthread_mutex_unlock(&m_Done);

    pth_barrier(&bBarrier);

    if (Done[k%2] == NUM_THREADS) break;  // All threads report no exchanges

    pthread_mutex_lock(&m_Done);
    if (!exch) --Done[k%2];
    pthread_mutex_unlock(&m_Done);
    ++k;
  } // end WHILE
  return 0;
}

The first thing a thread does is acquire the m_Done mutex in order to update one of the two elements of the Done array. The choice of which element to increment is based on the local value of k. If the thread did not perform an exchange of any items in the partition, the chosen Done element is incremented. Right after the m_Done mutex is released, a thread waits at a barrier.

I wrote the barrier function, pth_barrier(), based on the specification and code from Programming with POSIX Threads (Addison-Wesley, 1997) by David R. Butenhof. My barrier object, bBarrier, holds a mutex, a condition variable, a running count of remaining threads left to reach the barrier, the total number of threads that are expected, and a variable to note the current "color" of the barrier. The initialization function (not shown) for the barrier object sets up the synchronization objects and sets the count and number of threads to the value provided as the parameter of the function call. The color is initialized to zero (which I like to think of as RED).

Upon reaching the barrier, a thread acquires the mutex within the object, saves the current color value into a local variable, and decrements the running count. If it is not the last thread to arrive, it waits on the condition variable (yielding the mutex) for as long as its local color value matches the barrier's current color. When the last thread decrements the barrier count to zero, it broadcasts the wakeup signal to all threads waiting on the condition variable, flips the barrier's color to the opposite value (think BLUE), and resets the count of threads needing to reach the barrier back to the total number of threads. Finally, the mutex is released.

Waking up the threads before fixing up the barrier object for the next use seems dangerous. Does this actually work? Yes, because any thread waiting on the condition variable must reacquire the associated mutex before it can return from the wait call. Even though the last thread to the barrier issues the wakeup broadcast first, it still holds the mutex and doesn't release it until the barrier has been reset. Read the section in Butenhof's book on implementing a barrier construct with Pthreads for all the implementation details, including how the use of the color prevents incorrect barrier synchronizations.
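Based on the description above, a two-color barrier in the style of Butenhof's design might look like the following. This is my own sketch, not the article's actual pth_barrier() source: the struct field names are illustrative, and error checking is omitted for brevity.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    int count;     /* threads still to arrive in this cycle */
    int nthreads;  /* total number of threads expected      */
    int color;     /* flips each cycle: 0 = "RED", 1 = "BLUE" */
} pth_barrier_t;

void pth_barrier_init(pth_barrier_t *b, int n)
{
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->cv, NULL);
    b->count = b->nthreads = n;
    b->color = 0;  /* start out RED */
}

void pth_barrier(pth_barrier_t *b)
{
    pthread_mutex_lock(&b->lock);
    int my_color = b->color;              /* remember this cycle's color */
    if (--b->count == 0) {
        /* Last thread in: wake everyone, then reset the barrier before
           releasing the mutex, so no waiter sees a half-reset state. */
        pthread_cond_broadcast(&b->cv);
        b->color = !b->color;             /* flip for the next cycle */
        b->count = b->nthreads;
    } else {
        /* Wait until the color flips; the loop guards against spurious
           wakeups and against racing into the next barrier cycle. */
        while (my_color == b->color)
            pthread_cond_wait(&b->cv, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}
```

The saved local color is what makes reuse safe: a thread that oversleeps and wakes up in the next cycle still compares against the color it saw on arrival, so it cannot be confused with threads arriving at the subsequent barrier.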

When threads are released from the barrier, each checks on the count of threads that have not had exchanges in the just-completed pass through the data. If it was all of them (NUM_THREADS), the sort is complete and threads break out of the infinite loop. Otherwise, all threads that didn't have an exchange decrement the relevant element of Done (protected by the m_Done mutex, of course) and all threads increment their local value of k before going back to execute another iteration through the while loop. The barrier ensures that each thread has the same value of k (without making the counter shared and protected by another layer of synchronization).

The one last detail to explain is why two different Done counters are needed. The reason is related to the two colors of the barrier. With only one Done counter, I could create a deadlock. As an example, imagine that only one thread, T7, had an exchange during a pass. As each non-exchanging thread reaches the barrier, it increments the single Done counter. If T7 is last to the barrier (because it did the extra work of an exchange, not because it was lazy), it doesn't wait on the condition variable, but resets the barrier and goes on to the next iteration of the while loop. Now, if all the other threads are slow to wake up (not exchanging data items can be tiring for a thread) and T7 zips through its assigned partition without an exchange, T7 could reach the increment of Done before any of the other threads have gotten out of the barrier. When those threads do exit the barrier, they will find the Done counter equal to NUM_THREADS and assume that the sorting is done. This might be true, or the last exchange that T7 performed might have moved an out-of-place item into the partition of T8, which then needs to confirm that the item has reached its final position or should be bubbled down the array. Either way, even if the data is sorted, T7 will be left hanging at the barrier since all the other threads have broken out of the while loop. Deadlock. The pair of Done counters, and the switching between them on each pass, prevents this problem.

I hope that the above code and descriptions make sense to you. This data decomposition of Bubblesort got a bit tricky, which is unusual. I don't think I've seen a data decomposition parallel algorithm that was much more than parallelizing some loops. Perhaps sorting is one of those problems that aren't amenable to a data decomposition approach. (Wink. Wink.)
