C/C++

Measuring Parallel Performance: Optimizing a Concurrent Queue

By Herb Sutter, December 01, 2008

When it comes to scalability and concurrency, more is always better.

Example 3: Reducing Head Contention

In Examples 1 and 2, the producer was responsible for lazily removing the nodes consumed since the last call to Produce. That's bad for performance for several reasons, notably because it forces a producer to touch both ends of the queue, so every thread that uses the queue, whether producer or consumer, has to touch the queue's head end. Even though a producer and a consumer don't use the same spinlocks and so can run fully concurrently with respect to each other, the fact that they touch the same memory inherently adds invisible contention: updates to the memory containing the head nodes have to be propagated to all threads on other cores, not just to the consumer threads that naturally have to touch the head end to do their work.

In Example 3, we'll let each consumer be responsible for trimming the node it consumed (which it was touching anyway) and this gives better locality. The first thing we notice is that we can get rid of divider—itself a source of contention because it was used by both consumers and producers:


// Example 3 (diffs from Example 2):
// Moving cleanup to the consumer
//
  LowLockQueue() {
    first = last = new Node( nullptr ); // no more divider
    producerLock = consumerLock = false;
  }

Consume now doesn't need to deal with divider, but must add the work to clean up the previous now-unneeded first dummy node when it consumes an item:


bool Consume( T& result ) {
  while( consumerLock.exchange(true) )
    { }                           // acquire exclusivity

  if( first->next != nullptr ) {  // if queue is nonempty
    Node* oldFirst = first;
    first = first->next;
    T* value = first->value;      // take it out
    first->value = nullptr;       // of the Node
    consumerLock = false;         // release exclusivity

    result = *value;              // now copy it back
    delete value;                 // and clean up
    delete oldFirst;              // both allocations
    return true;                  // and report success
  }

  consumerLock = false;           // release exclusivity
  return false;                   // queue was empty
}

Next, Produce becomes simpler because we can eliminate the lazy cleanup code. However, just eliminating that code leads to a very subtle pitfall because one existing line also has to change. Can you see why?

  bool Produce( const T& t ) {
    Node* tmp = new Node( t );    // do work off to the side

    while( producerLock.exchange(true) )
      { }                         // acquire exclusivity

    last->next = tmp;             // A: publish the new item
    last = tmp;                   // B: not "last->next"

    producerLock = false;         // release exclusivity
    return true;
  }

Changing Responsibilities Can Introduce Bugs

Note that line B used to be last = last->next;. That was always slightly inefficient because it needlessly reread last (a holdover from the original code written by someone else). Now, if left unchanged, it becomes something much worse: a small race window. Now that there's no divider and consumers clean up consumed nodes, the way consumers know there's an item available to be consumed is to check first->next; if it's not null, it's okay to go ahead and consume a node—and delete what used to be the first one because that node is no longer needed. The trouble arises when a sequence like the following occurs:

Initially: the queue is empty, so first == last.
The producer (running the Example 2 code, without the Example 3 correction) executes line A: last->next = tmp; // A: publish
The consumer performs an entire call to Consume, taking the just-published node and deleting the now-unnecessary previous first node, which is the node that last still points to.
The producer then dereferences last in line B: last = last->next; // B: update last
Oops: line B accesses freed memory.

The key is that the act of publishing the new node (line A) not only advertises that the new node is ready to be consumed, but also implicitly transfers ownership of the preceding node to the consumer. Hence, line B must not dereference last again, but should just assign from tmp directly.
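The pieces of Example 3 can be assembled into one compilable sketch. The article shows only the diffs from Example 2, so several details here are assumptions filled in for illustration: the Node layout (a heap-allocated T* payload and an atomic next pointer), the use of std::atomic<bool> for the two spinlock flags, the destructor, the deleted copy operations, and the assert. In particular, this sketch has Produce allocate a copy with new T(t), which may differ from the Node constructor the article uses.

```cpp
#include <atomic>
#include <cassert>

// Sketch of the Example 3 queue. Node layout and member types are assumptions.
template <typename T>
class LowLockQueue {
  struct Node {
    explicit Node(T* val) : value(val), next(nullptr) {}
    T* value;                     // payload; null in the leading dummy node
    std::atomic<Node*> next;      // written by a producer, read by a consumer
  };

  Node* first;                    // head: touched only while holding consumerLock
  std::atomic<bool> consumerLock;
  Node* last;                     // tail: touched only while holding producerLock
  std::atomic<bool> producerLock;

public:
  LowLockQueue()
    : first(new Node(nullptr)), consumerLock(false),
      last(first), producerLock(false) {}   // no divider

  LowLockQueue(const LowLockQueue&) = delete;
  LowLockQueue& operator=(const LowLockQueue&) = delete;

  ~LowLockQueue() {               // assumes no other thread is still using the queue
    while (first != nullptr) {
      Node* tmp = first;
      first = tmp->next.load();
      delete tmp->value;          // deleting a null payload is a no-op
      delete tmp;
    }
  }

  bool Produce(const T& t) {
    Node* tmp = new Node(new T(t));          // do work off to the side
    while (producerLock.exchange(true)) {}   // acquire exclusivity
    last->next = tmp;             // A: publish; the old node now belongs to consumers
    last = tmp;                   // B: assign from tmp, never reread last->next
    producerLock = false;         // release exclusivity
    return true;
  }

  bool Consume(T& result) {
    while (consumerLock.exchange(true)) {}   // acquire exclusivity
    Node* next = first->next;
    if (next != nullptr) {        // queue is nonempty
      Node* oldFirst = first;
      first = next;               // the consumed node becomes the new dummy
      T* value = first->value;
      assert(value != nullptr);   // a published node always carries a payload
      first->value = nullptr;
      consumerLock = false;       // release exclusivity before copying
      result = *value;
      delete value;
      delete oldFirst;            // consumer trims the previous dummy node
      return true;
    }
    consumerLock = false;         // release exclusivity
    return false;                 // queue was empty
  }
};
```

With this sketch, a LowLockQueue<int> q accepts q.Produce(10) followed by q.Produce(20), and two calls to q.Consume(out) then return the items in that order. Note that line B never touches the node that line A handed over to the consumers, which is the ownership rule described above.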

"But," someone might object, "will this interleaving really happen? After all, A-B is a very small window for a call to Consume to fit into." True, it won't happen often. Based on experience, however, I can report that under heavy stress on a multicore system, this tends to fail once for every few tens of millions of items moving through the queue. This was the only race I wrote (that I know of) when putting these examples together, and it was a real pain to reproduce and diagnose.

Moral: When you change responsibilities for cleanup, code that used to be innocuous can suddenly turn into a subtle race window.

Measuring Example 3

But back to the main event: How well does moving the cleanup responsibility and reducing contention on the head of the queue really help? Again, before seeing my results, consider how much, and why, you think this is likely to affect throughput, scalability, contention, and the oversubscription penalty.

Figure 3 shows the Example 3 performance results. The effects are mainly on the left-hand small object graph, with only incremental improvements for large objects. For small objects, peak throughput has improved by nearly another factor of two, and we've again improved scalability and actually get close to reaching the dashed line, which represents our capacity for getting more work done using more cores. There is some dropoff due to contention as we exceed about 20 active threads (e.g., 12 producers and 8 consumers), and for the first time we can actually see the oversubscription wall on the left-hand graph beyond 24 threads. Although we'd like to scale that wall, right now we're happy to just be able to approach it in the first place!

Figure 3: Example 3 throughput (total items through queue per second).
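For readers who want to collect this kind of "items through the queue per second" number themselves, here is a rough sketch of a measurement harness. It is not the harness used for the figures in this article. The names MeasureThroughput, RunResult, and MutexQueue are hypothetical, the item type is fixed to int for simplicity, and MutexQueue is only a mutex-guarded stand-in with the same Produce/Consume interface so that the sketch is self-contained.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in queue with the same interface as LowLockQueue.
template <typename T>
class MutexQueue {
  std::mutex m;
  std::queue<T> items;
public:
  bool Produce(const T& t) {
    std::lock_guard<std::mutex> hold(m);
    items.push(t);
    return true;
  }
  bool Consume(T& result) {
    std::lock_guard<std::mutex> hold(m);
    if (items.empty()) return false;
    result = items.front();
    items.pop();
    return true;
  }
};

struct RunResult {
  long long consumed;       // items that made it through the queue
  double itemsPerSecond;    // total items divided by wall-clock time
};

// Runs the given numbers of producer and consumer threads against q and
// reports overall throughput. Requires at least one consumer to drain the queue.
template <typename Queue>
RunResult MeasureThroughput(Queue& q, int producers, int consumers,
                            int itemsPerProducer) {
  assert(producers >= 0 && consumers > 0 && itemsPerProducer >= 0);
  const long long total = static_cast<long long>(producers) * itemsPerProducer;
  std::atomic<long long> consumed(0);
  std::vector<std::thread> threads;
  const auto start = std::chrono::steady_clock::now();

  for (int p = 0; p < producers; ++p)
    threads.emplace_back([&q, itemsPerProducer] {
      for (int i = 0; i < itemsPerProducer; ++i) q.Produce(i);
    });

  for (int c = 0; c < consumers; ++c)
    threads.emplace_back([&q, &consumed, total] {
      int item = 0;
      while (consumed.load() < total) {       // stop once every item is accounted for
        if (q.Consume(item)) consumed.fetch_add(1);
        else std::this_thread::yield();       // queue momentarily empty
      }
    });

  for (auto& t : threads) t.join();
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return { consumed.load(), total / elapsed.count() };
}
```

Calling MeasureThroughput(q, 2, 2, 1000) on a MutexQueue<int> q moves 2,000 items through the queue. Sweeping the producer and consumer counts and plotting itemsPerSecond gives a curve of the same general kind as the figure, though the absolute numbers depend entirely on the machine, the item size, and the queue under test. Timing here includes thread startup, so short runs will understate steady-state throughput.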
