A Two-Lock Multiproducer/Consumer Queue
The queue data structure itself is a singly linked list. To make the code simpler for the empty-queue boundary case, the list always contains a dummy node at the head, and so the first element logically in the queue is the one contained in the node after first. Figure 1 shows the layout of an empty queue, and Figure 2 shows a queue that holds some objects.
T objectsempty queue.
Each node holds the T object by pointer and adds padding:
template <typename T>
struct LowLockQueue {
private:
struct Node {
Node( T* val ) : value(val), next(nullptr) { }
T* value;
atomic<Node*> next;
char pad[CACHE_LINE_SIZE - sizeof(T*)- sizeof(atomic<Node*>)];
};
Like any shared variable, the next pointer needs to be protected by a mutex or, as here, declared as an ordered atomic type (C++0x atomic<> or Java/.NET volatile). The padding here is to ensure that two Node objects won't occupy the same cache line; in particular, in a nonempty queue, having the first and last nodes in the same cache line would penalize performance by causing invisible contention between the producer and consumer. Note that the amount of padding shown here and later on errs on the side of being too conservative: Each Node will be allocated on the heap, and a heap allocator may add its own extra overhead to each allocated block for alignment and housekeeping information, which effectively adds extra padding. If so, and if we know how much that will be, we could reduce our internal padding proportionately to make a heap-allocated Node exactly fill one cache line.
Next, here are our queue control variables:
char pad0[CACHE_LINE_SIZE];
// for one consumer at a time
Node* first;
char pad1[CACHE_LINE_SIZE
- sizeof(Node*)];
// shared among consumers
atomic<bool> consumerLock;
char pad2[CACHE_LINE_SIZE
- sizeof(atomic<bool>)];
// for one producer at a time
Node* last;
char pad3[CACHE_LINE_SIZE
- sizeof(Node*)];
// shared among producers
atomic<bool> producerLock;
char pad4[CACHE_LINE_SIZE
- sizeof(atomic<bool>)];
Again, we add padding to make sure hat data used by different threads stay physically separate in memory and cache. Clearly, we want the consumer-end data and the producer-end data to be on separate cache lines; but even though only one producer and one consumer will be active at a time, we want to keep the locking variable separate so that waiting consumers spinning on consumerLock won't contend on the cache line that contains first which the active consumer is updating, and that waiting producers spinning on producerLock won't slow down the active producer who is updating last.
The constructor just sets up the initial empty state, and the destructor (in .NET or Java, this would be the disposer) just walks the internal list and tears it down:
public:
LowLockQueue() {
first = last = new Node( nullptr );
producerLock = consumerLock = false;
}
~LowLockQueue() {
while( first != nullptr ) { // release the list
Node* tmp = first;
first = tmp->next;
delete tmp->value; // no-op if null
delete tmp;
}
}


