Last month, I showed code for a lock-free queue that supported the limited case of exactly two threads: one producer and one consumer. That's useful, but maybe not as exciting now that our first rush of lock-free coding glee has worn off. This month, let's tackle the general problem of supporting multiple producers and multiple consumers with as much concurrency as possible. The code in this article uses four main design techniques:
First, we'll use (the equivalent of) two locks: one for the head end of the queue to regulate concurrent consumers, and one for the tail to regulate concurrent producers. We'll use ordered atomic variables (C++0x atomic<T>) directly instead of prefabricated mutexes, but functionally we're still writing spinlocks; we're just writing them by hand. Although this means it's not a purely "lock-free" or nonblocking algorithm, it's still quite concurrent: multiple consumers and multiple producers can make progress at the same time, because we arrange to do as much work as possible outside the small critical code regions that update the head and the tail, respectively.
Second, we'll have the nodes allocate the contained T object on the heap and hold it by pointer instead of by value. To experienced parallel programmers this might seem like a bad idea at first, because it means that when we allocate each node we'll also need to perform an extra heap allocation, and heap allocations are notorious scalability busters on many of today's memory allocators. It turns out that, even on a system with a nonscalable allocator, the benefits typically outweigh the costs: holding the T object by pointer lets us get greater concurrency and scalability among the consumer threads, because we can move the work of actually copying the T value out of the critical section of code that updates the shared data structure.
Third, we don't want the producer to be responsible for lazily removing the nodes consumed since the last call to Produce, because that is bad for performance: it adds contention on the queue's head end, and it needlessly delays reclaiming consumed nodes. Instead, we'll make each consumer responsible for trimming the node it consumed; the consumer was touching that node anyway, so this also gives better locality.
Fourth, we want to follow the advice that "if variables A and B are not protected by the same mutex and are liable to be used by two different threads, keep them on separate cache lines" to avoid false sharing, or "ping-ponging," which limits scalability. In this case, we want to add padding to ensure that different nodes (notably the first and last nodes), the first and last pointers into the list, and the consumerLock and producerLock variables are all on different cache lines.