Insert Your Own Clever Bubblesort Pun
Of course, for the last partition (myid == NUM_THREADS-1) there is no succeeding partition, so there is no need for a final comparison. All other threads now become "peeping Toms" by looking over the boundary at the first element of the next partition. If the next thread has gotten a slow start and has not yet been able to process its first comparison (from the second code segment), there is the chance that a data race can result. To avoid this, the peeking thread must hold the BLock mutex that is indexed by its own myid value.
Let me illustrate this with a concrete example using threads T4 and T5 (myid values 4 and 5). For my purposes here, I will have T5 unable to get past the initial assignment to the local exch flag. On the other hand, T4 races through the partition of data assigned to it and arrives at the code segment above. At this point both threads are given free rein to execute, but both want read (and potentially write) access to the first element in T5's partition. However, before any comparison can take place, both threads request a mutex from the BLock array. T4 requests the mutex associated with its own myid value (BLock[4]) since it is the peeking thread; T5 requests the mutex indexed by the ID number of the thread that could be peeking into the T5 partition, which is also BLock[4]. Only one of the threads will be granted the mutex and make the comparison (and possible swap), while the other thread waits for the release of the mutex before performing its own compare (and possible swap).
Before we go on to the final code segment, you may be wondering whether or not the above will still give us a correctly sorted output. (Spoiler Alert: It does.) Can we prove that it will, or show a case where it leads to an erroneous situation? In my post describing the task decomposition version of parallel Bubblesort, I needed three cases (only T4 swaps, only T5 swaps, and both swap) to demonstrate the correctness of the algorithm. I can use those three cases again in showing that my data decomposition algorithm will result in a sorted output. On top of those three cases, though, I need to add the ordering by which thread gains access to the mutex first and thus does its comparison before the other thread is allowed. That's six cases to be examined. In my prior post, the largest element from the T4 partition had already been swapped into the first element of the T5 partition, which is not the case for this algorithm. Thus, I also need to consider whether the outcome of the first compare-and-swap does or doesn't alter the swap decision of the second thread. This doubles my total again to 12 cases.
Rather than enumerate all 12 cases (if all 12 are indeed possible or needed; since three items can be arranged in only six ways, I'm not completely convinced we would need to consider 12), let me settle this with some simple logic. In the worst-case scenario, a value close to the boundary may travel one position in the wrong direction when the threads synchronize as described above. However, since at least one of the threads performed an exchange, each thread will make another pass through the data, and the item that had inadvertently moved the wrong way will be recognized as out of place and set back on the path of "forward" motion. Hence, the worst we can expect is that a few more passes than absolutely necessary will be required. Since the code runs in parallel, I'm hoping that won't be a huge, or even common, detriment.
This brings us to the point where the threads must determine if any of their sister threads executed an exchange or that none of them did and the sort is complete.
    pthread_mutex_lock(&m_Done);
    if (!exch) ++Done[k%2];
    pthread_mutex_unlock(&m_Done);

    pth_barrier(&bBarrier);

    if (Done[k%2] == NUM_THREADS) break;  // All threads report no exchanges

    pthread_mutex_lock(&m_Done);
    if (!exch) --Done[k%2];
    pthread_mutex_unlock(&m_Done);
    ++k;
  } // end WHILE
  return 0;
}
The first thing a thread does is acquire the m_Done mutex in order to update one of the two elements of the Done array. The choice of which element to increment is based on the local value of k. If the thread did not perform an exchange of any items in its partition, the chosen Done element is incremented. Right after the m_Done mutex is released, the thread waits at a barrier.
I wrote the barrier function,
pth_barrier(), based on the specification and code from Programming with POSIX Threads (Addison-Wesley, 1997) by David R. Butenhof. My barrier object,
bBarrier, holds a mutex, a condition variable, a running count of remaining threads left to reach the barrier, the total number of threads that are expected, and a variable to note the current "color" of the barrier. The initialization function (not shown) for the barrier object sets up the synchronization objects and sets the count and number of threads to the value provided as the parameter of the function call. The color is initialized to zero (which I like to think of as RED).
Upon reaching the barrier, a thread acquires the mutex within the object, saves the color value into a local variable declared in the function, and decrements the running count. If it is not the last thread to arrive, it waits on the condition variable (yielding the mutex) for as long as its local color value matches the current barrier color. When the last thread decrements the barrier count down to zero, it broadcasts the wakeup signal to all threads waiting on the condition variable, changes the color of the barrier to the opposite of the current color (think BLUE), and resets the count of threads needing to reach the barrier back to the total number of threads. Finally, the mutex is released.
Waking up the threads before fixing up the barrier object for the next use seems dangerous. Does this actually work? Yes, since any thread that is waiting on the condition variable must first acquire the associated mutex before it can return from the wait call. However, even though the last thread to the barrier first issues the wakeup signal to all waiting threads, it also holds the mutex and doesn't release it until the barrier is reset. Read the section in Butenhof's book on implementing a barrier construct with Pthreads for all the implementation details and how the use of the color prevents incorrect barrier synchronizations.
When threads are released from the barrier, each checks on the count of threads that have not had exchanges in the just-completed pass through the data. If it was all of them (
NUM_THREADS), the sort is complete and threads
break out of the infinite loop. Otherwise, all threads that didn't have an exchange decrement the relevant element of
Done (protected by the
m_Done mutex, of course) and all threads increment their local value of
k before going back to execute another iteration through the
while loop. The barrier ensures that each thread has the same value of
k (without making the counter shared and protected by another layer of synchronization).
The one last detail to explain is why two different
Done counters are needed. The reason is related to the different colors of barriers. If only one
Done counter were used, I could create a deadlock. As an example, imagine that only one thread,
T7, had an exchange during a pass. As all non-exchange threads reach the barrier, they increment the single
Done counter. If
T7 is last to the barrier — because it did the extra work of an exchange, not because it was lazy — it doesn't wait on the condition variable, but resets the barrier and goes on to the next iteration of the
while loop. Now, if all the other threads that had been waiting are slow to wake up — because not exchanging data items can be tiring for a thread — and
T7 zips through the assigned partition without an exchange, then
T7 could reach the increment of
Done before any of the other threads have gotten out of the barrier. When threads do exit the barrier, they will find the
Done counter to be equal to
NUM_THREADS and assume that the sorting is done. This might be true or the last exchange that
T7 performed might have moved an out-of-place item into the partition of
T8, which needs to confirm that the item has reached the final position or should be bubbled down the array. Still, even if the data is sorted,
T7 will be left hanging at the barrier since all other threads have broken out of the
while loop. Deadlock. The pair of
Done counters, and the switching between them on each pass, prevents this problem.
I hope that the above code and descriptions make sense to you. This data decomposition of Bubblesort got a bit tricky, which is unusual. I don't think I've seen a data decomposition parallel algorithm that was much more than parallelizing some loops. Perhaps sorting is one of those problems that aren't amenable to a data decomposition approach. (Wink. Wink.)