Last month, I showed code for a lock-free queue that supported the limited case of exactly two threads: one producer and one consumer. That's useful, but maybe not as exciting now that our first rush of lock-free coding glee has worn off. This month, let's tackle the general problem of supporting multiple producers and multiple consumers with as much concurrency as possible. The code in this article uses four main design techniques:
First, we'll use (the equivalent of) two locks: one for the head end of the queue to regulate concurrent consumers, and one for the tail to regulate concurrent producers. We'll use ordered atomic variables (C++0x atomic<T>) directly instead of prefabricated mutexes, but functionally we're still writing spinlocks; we're just writing them by hand. Although this means it's not a purely "lock-free" or nonblocking algorithm, it's still quite concurrent: multiple consumers and multiple producers can make progress at the same time, because we arrange to do as much work as possible outside the small critical code regions that update the head and the tail, respectively.
Second, we'll have the nodes allocate the contained T object on the heap and hold it by pointer instead of by value. To experienced parallel programmers this might seem like a bad idea at first, because it means that when we allocate each node we'll also need to perform an extra heap allocation, and heap allocations are notorious scalability busters on many of today's memory allocators. It turns out that, even on a system with a nonscalable allocator, the benefits typically outweigh the costs: holding the T object by pointer lets us get greater concurrency and scalability among the consumer threads, because we can move the work of actually copying the T value out of the critical section of code that updates the shared data structure.
Third, we don't want the producer to be responsible for lazily removing the nodes consumed since the last call to Produce, because that is bad for performance: it adds contention on the queue's head end, and it needlessly delays reclaiming consumed nodes. Instead, we'll make each consumer responsible for trimming the node it consumed; the consumer was touching that node anyway, so this also gives better locality.
Fourth, we want to follow the advice that "if variables A and B are not protected by the same mutex and are liable to be used by two different threads, keep them on separate cache lines" to avoid false sharing, or "ping-ponging," which limits scalability. In this case, we want to add padding to ensure that different nodes (notably the first and last nodes), the first and last pointers into the list, and the consumerLock and producerLock variables are all on different cache lines.