Dr. Dobb's | Atomic Reference Counting Pointers

Atomic Reference Counting Pointers

Reference counting pointers provide a powerful and convenient way of managing heap memory in C++.

December 01, 2004
URL:http://www.drdobbs.com/atomic-reference-counting-pointers/184401888

December, 2004: Atomic Reference Counting Pointers

Kirk Reinholtz is the Chief Programmer of the Mission Data System project at the Jet Propulsion Laboratory. He is also the Principle Investigator of the Verifiable C++ research task. He can be contacted at [email protected].

Reference counting pointers (RCP) provide a powerful and convenient way of managing heap memory in C++. RCPs keep track of your heap allocations and release the memory when it's no longer needed, thereby eliminating the possibility of several dangerous and common bugs that come with using raw pointers. For example, thanks to RCP:

You can't use an object that's already been deleted (causing memory corruption).
You can't delete something twice (causing memory corruption).
You can't fail to delete an object (causing memory leaks).

However, you do gain the power of managed memory, and aren't restricted in your use of the many powerful and productive features of C++.

In this article, I present a C++ RCP implementation that is async-safe, lock-free, thread-safe, and multiprocessor-safe. The implementation is especially interesting because it's async-safe and lock-free, which means that you can use it in signal handlers and interrupt code, unlike most lock-based implementations. This makes the implementation especially suitable for use in operating-system kernels and multithreaded real-time systems. And as if that weren't enough, it's also fast!

Thread-safe means the RCP implementation operates correctly when used in a multithreaded program. Multiprocessor safe (MP-safe) means the RCPs work even when the program uses multiple threads that are executed on multiple processors. Thread and MP safety are pretty typical for an RCP implementation. What's more interesting about this implementation is it is also async-safe, meaning that it even works when used in a signal or interrupt handler, which is not the case for most RCP implementations currently available. Finally, the algorithms are what's called "lock-free," which essentially means that blocking one thread can never cause another thread to block within the algorithm. This property is the reason the algorithm is async-safe, for if an algorithm didn't have this property, then a signal (which in effect blocks a thread) may attempt to use the algorithm and become blocked itself. Lock-free also implies that the bugaboo of multithreaded programs—priority inversion—is not an issue.

For example, consider the following function that atomically increments a counter. It is thread-safe because the locks ensure that only one thread increments the counter at a time. It is probably even MP-safe, assuming the lock itself is a type that works in an MP system. However, it is not async-safe. Imagine some thread has acquired the lock, but not released it (between lines 2 and 3, for example). Now imagine a signal handler is executed, which interrupts the thread at that point (between lines 2 and 3). Now, if the signal handler were to try to call increment, it would block on line 2 when it tries to lock(). Your program is now broken because the thread is blocked by the signal handler so it can't release the lock, and the signal handler can't continue because it's blocked on the lock. The thread never continues.

1  void increment(int* p){
2    lock();
3    *p += 1;
4    unlock();
5  }

We developed the RCP implementation at the Jet Propulsion Laboratory as part of a project called "Verifiable C++" that explores ways in which C++ can be used to develop safety-critical and mission-critical real-time software. Our goal is to mitigate the risks of using C++ to develop critical systems, while developing advanced coding techniques that let you bring the full power of C++ to bear upon the ever larger and complex programs we develop for future spacecraft.

Implementation

My RCP implementation uses powerful nonlocking synchronization instructions that are available in many modern processors. The PowerPC, for example, has a pair of instructions called LWARX and STWCX, which use a reservation mechanism to ensure that a variable that was read and then written by one thread was not modified in between by some other thread. Most other processors have similar hardware support for synchronizing threads. I focus on the PowerPC in this article. A future article will show how to do this on modern Intel processors.

The examples shown here are written in pidgin-C++. For clarity, they're missing various declarations, type casts, and whatnot. The complete code and tests are available at http://www.cuj.com/code/.

The RCP algorithms are implemented in terms of a few primitives. I've presented each primitive in pidgin-C++, but in real life they're implemented in assembly language to avoid various issues with the compiler reordering memory reads and writes. CUJ has published two previous articles on this subject. Table 1 summarizes the primitives. The source code for the primitives is also available at http://www.cuj.com/code/.

The PowerPC instructions lwarx (read-reserved) and stwcx (write-reserved) are used to construct the RCP algorithms. The instructions work basically as presented here. In real life they execute in the CPU hardware and are atomic. I've left out some details; for example, if the address used in the lwarx does not match the address used in the subsequent stwcx, then bad things can happen.

int lwarx(address a){
  RESERVED = 1;
  return mem[a];
}
int stwcx(address a,int v){
  int retval = RESERVED;
  if(RESERVED){
    mem[a] = v;
  };
  RESERVED = 0;
  return retval;
}

The operating system must do a few things in order for this to work. Fortunately, OS code exploits these operations, and so does the right thing. For example, it must clear the RESERVED flag on each context switch, and in a multiprocessor system it must clear the RESERVED flag if any other processor writes to the cache line containing the reserved address. It must do this to ensure that a reservation owned by one thread or processor isn't used by another thread or processor. This is just a simplified description; for complete details, see [1].

One of the key functions you need for RCP is a way to atomically increment a variable and return the new value. Here's how to do that using lwarx and stwcx:

int aAdd(addr mem, int delta){
  int tmp;
  do{
    tmp=lwarx(mem);
    tmp += delta;
  }while(!stwcx(mem,tmp);
  return tmp;
}

This works by reading the counter with a reservation, so that if any other thread interrupts (which clears the reservation), then the write fails, causing it to loop and try again until it succeeds.

Another key function atomically swaps the contents of two memory locations. This algorithm does so, but the second memory location (arg r2) must not be subject to contention. (The RCP algorithm respects this restriction, but I'm warning you in case you try to use this for other purposes.)

void aSwap(addr r1,addr r2){
  void* tmp(*r2);
  void* tmp1;
  do{
    tmp1 = lwarx(r1);
  }while(!stwcx(r1,tmp));
  *r2 = tmp1;
};

The atomic increment and fetch function is the most complex, and is the key to this implementation of thread-safe reference counting pointers. I finally figured out how to do this (after several days) when the kids were out of the house and I was relaxing on the couch enjoying the silence. But anyway...this algorithm atomically creates a copy of a location in memory and increments the counter at the address stored in that location. In RCP terms, it atomically copies the reference out of memory, then increments the counter of the referent. This has to be done atomically to avoid various race conditions. For example, one thread may read the reference, another overwrites it and deletes the old value (the one the first thread just read), and then the first thread increments the counter of the now-defunct object, which is bad.

addr aIandF(addr r1){
  addr tmp;int c;
  do{
    do{
      tmp = *r1;
      if(!tmp)break;
      c = lwarx(tmp);
    }while(tmp != *r1);
  }while(tmp && !stwcx(tmp,c+1));
  return tmp;
};

To see how important this function is, look at the copy constructor and list the ways in which it can go wrong in an multithreaded environment.

RCP::RCP(const RCP& rhs){ 
1:
2:  this->ref_ = rhs.ref_;
3:
4:  ref_->cntr_ += 1;
5:
};

In this code, rhs.ref_ might get deleted at line 3, causing you to increment a counter of a deleted object at line 3. The heap is corrupted. Okay, that's easy, so just increment the counter first:

1:
2: rhs.ref_->cntr += 1;
3:
4: this->ref_ = rhs.ref_

This seems better, but is still broken. The problem is that line 2 is really several CPU instructions—one that fetches the rhs.ref_ and another that increments the counter.

So you really have the same problem: The rhs might get deleted before you do the increment. Okay, use the atomic increment:

1: Atomics::incr(&(rhs.ref_->cntr_,1)
2: this->ref_ = rhs.ref_

But even that doesn't work because you still need to fetch the address of the thing to increment, and then atomically increment it. It can still get deleted between those two steps. Not to mention that rhs.ref_ could get assigned to between the incrementing and when you assign it to this->ref_.

If that happened, then this->ref_ would end up pointing at a referent that has the wrong count, which eventually causes a heap problem when the object is deleted while this reference is still in existence.

This is why I wrote aIandF(), which does all three operations as one atomic transaction. It reads the pointer to the counter, increments the counter, and returns the pointer—all in such a manner that no other threads can cause an incorrect result.

RCP Class

I describe the algorithms here in terms of a simple RCP class and typical referent:

class RCP{
public:
  RCP();
  RCP(Node*);
  ~RCP();
  RCP(const RCP&);
  RCP& operator=(const RCP&);
  Node* operator->();
  Node& operator*();
  int ref_;
};
class Node{
public:
  ...        // usual stuff
  int cntr_; // Intrusive counter
}

A full implementation, of course, wraps a bunch of template stuff around this, but doing so here would obscure the basic point of this article.

The RCP constructor takes a reference to an ordinary object and constructs an RCP to that object. It's bad form to use the raw pointer in any way once this has been done. It's also bad form to pass the same raw pointer to more than one RCP constructor. The reason it's bad form is that it enables race conditions that can't be fully mitigated by the RCP algorithm. For example, the first RCP may go out of scope before you passed the object to the second RCP, causing the second RCP to reference a deleted object, which is obviously bad. The raw object must initialize its counter to zero.

RCP::RCP(Node* p){
  Node* tmpp(p);
  int tmpi(aAdd(&(tmpp->cntr_),1));
  aSwap(&(this->ref_),&tmpp);
};

The key to the destructor algorithm is to first atomically swap the RCP being deleted with a null pointer on the stack. Once it's on the stack, you know the pointer itself can't be modified by another thread, though the referent is still vulnerable to race conditions. You then use the atomic increment to decrement the counter and return the value.

If the counter goes to zero, then this must be the last RCP at that referent. Since you have the RCP on the stack, it can't be copied or otherwise accessed from any other thread, so you can delete it without fear of race conditions. Quite a mouthful, but straightforward, as you can see.

RCP::~RCP(){
  Node* tmp(0);
  aSwap(&(this->ref_),&tmp);
  if(tmp && !aAdd((&(tmp->cntr_)),-1)){
    delete tmp;
  };
};

The RCP copy constructor creates a copy of an existing RCP, which means the counter of the referent must be incremented in order to reflect the additional reference. There are several race conditions to be concerned about: The RCP being copied or the one being constructed might be deleted or copied or assigned to, while the copy constructor is executing (picture two RCP global variables, and various threads doing various operations on those global variables).

The key is to use aIandF() to atomically increment the counter and get a new pointer at the referent onto the stack where it isn't subject to contention, which guarantees that the counter remains at least one (so it won't get deleted while the algorithm is operating on it, no matter what). I then swap the pointer on the stack with the RCP member data, after which it again becomes subject to contention. This leaves the value before the copy constructor executed on the stack, which I ignore because by the rules of C++, you can't apply a copy constructor with a well-formed object on the lhs, so the value on the stack can't be an RCP, and so it wouldn't be safe to treat it as an RCP.

RCP::RCP(const RCP& rhs){
  void* tmp(aIandF(&(rhs.ref_)));
  aSwap(&(this->ref_),tmp);
};

The RCP assignment operator works by constructing a new RCP to the rhs on the stack, and then swapping that new RCP with the lhs. The lhs RCP is then on the stack, and is processed using the normal RCP destructor when the function returns.

RCP& RCP::operator=(const RCP& rhs){
  RCP tmp(rhs);
  aSwap(tmp,*this);
};

The dereferencing operators, operator-> and operator*, are done in the usual manner. Beware that these, in effect, create raw pointers at the managed object, and so some care must be taken to ensure that the object isn't deleted while the raw pointers are being used.

The basic idea is to ensure that the RCP being dereferenced is in scope as long as or longer than the raw pointer so created is in scope. Beware also that there will be references in CPU registers and whatnot that you didn't create, so you want to understand the C++ lifetime rules to make sure you don't accidentally reference a bad pointer.

Node* RCP::operator->(){
  return ref_;
};
Node& RCP::operator*(){
  return *ref_;
};

Usage Rules

The purpose of the usage rules is to ensure that you don't accidentally create an RCP at an illegal object, and that you don't accidentally delete an object while you're manipulating it. For the most part, these are only an issue in a multithreaded program.

An RCP may be executed concurrently and/or asynchronously on any number of instances of any combination of: either side of operator=, in operator->, in operator*, and as the argument to an RCP copy constructor.
You must not use any RCP before construction. An RCP, as any other object in C++, must be constructed before it is used. I mention this because there are various ways, especially in a multithreaded program, to use an object before construction.
You must apply the CTOR only once. The RCP algorithm does not manage the memory to which the CTOR is applied: It only replaces that memory with a value that is managed. So, if you manage to apply a CTOR to a location that is a legit RCP, you break the RCP algorithm. This could happen if you were using placement new, for example, and did two placement news without a placement delete between them.
You must not use any RCP after destruction. Again, this applies mostly to multithreaded and multiprocessing programs and those that use placement operators.
You must apply the destructor only once. Applying an RCP destructor to the same memory location twice without an intervening CTOR is illegal.
If you construct an RCP using a raw pointer, then you must do so only once for that pointer. We've even had discussions on disallowing this form of RCP construction, and instead combining the object creation and the RCP creation. The idea is to never create a raw pointer in an ad hoc manner, and so eliminate this particular risk.
You must not reference a raw pointer used to construct an RCP in any way once the RCP has been constructed.
operator* and operator-> must be used in such a manner that you know that the RCP remains in scope (for example, its DTOR is not called) for longer than the resulting raw reference. If you do otherwise, then the RCP may delete the object when it goes out of scope, or allow another RCP to do so, resulting in an invalid pointer.

Performance

Table 2 shows the approximate performance of various operations on my 1-GHz PowerBook, compiling -O2 using GCC Version 3.3. You can see that the performance is pretty good. Your performance depends on how often the stwcx fails, causing the loop to execute again. This doesn't appear to be much of a practical concern because there are only a few CPU instruction times between each lwarx and stwcx, amounting to, say, a 10 ns window. You can calculate the approximate probability of an interrupt occurring during that 10 ns window by computing the percentage of time the program is inside of the 10 ns window, which is also the probability that any single interrupt hits the window.

Here's the equation:

   probability of looping  = (rcp ops/sec) * rcp_window

For example, at 100,000 RCP operations per second, the probability of going around the loop is about one in a thousand, which has little measurable effect on program performance. A detailed analysis is more complicated because in aIandF() looping can also occur when variables are changed, even when a reservation wasn't lost, but this effect is very small.

Testing

Since it's hard to write a sound multithreaded algorithm, I paid particular attention to testing this implementation. I was concerned with not only the theoretical soundness, but also the interactions with the compiler: The C++ Standard gives the compiler lots of room to reorder (or even remove) operations, so that it can do powerful optimizations. Unfortunately, that makes implementing this type of algorithm challenging because the whole basis of the algorithm is the order in which things are done. The compiler wants to reorder or remove operations, you want to force them to be exactly what you want, and a tug-of-war results.

For the first test I ran several threads, each doing various operations on various RCPs. The operations are sequenced so that each thread executes all pairs of operations. The test code is heavily instrumented with assertions to detect any failures. This test passed many hours of execution.

At this point, I wondered if I'd actually tested much: After all, how do I know all interactions were exercised? What if the threads never actually interacted in interesting ways? To address these concerns, I did fault seeding such that various very small windows of vulnerability opened up, then ran the tests again. Every seeded fault caused failures, giving me some confidence that the lack of failures means the algorithm is sound.

All testing was done on a Macintosh PowerBook using GCC Version 3.3 20030304 (Apple Computer Inc. build 1640). We also ran the SPIN model checker [2] on the algorithm, which did not detect any bugs. Finally, there are efforts underway to develop formal mathematical proofs that the algorithms are correct. That work isn't complete either.

Full Disclosure

If you Google around on "lwarx" for a bit, you'll find there is some controversy about using the synchronization instructions in "user" code. There seem to be four general reasons:

lwarx/stwcx were broken in some CPU versions. Okay, so were arithmetic instructions of various sorts, but we still use arithmetic. Thankfully, it appears that these opcodes work in current processors.
They smell like something only the OS should use. I admit I felt the same way. The next point covers the facts of the matter.
They can cause processes and the OS to interfere with each other. On a single-CPU machine with an OS that clears the reservation upon entry, as specified in the PPC programming manual [1], user programs cannot interfere with the OS because the OS clears the reservation flag whenever it's entered. But that does mean that the OS interferes with user programs, because if the OS gets control between an lwarx and stwcx, the stwcx fails and causes a retry. This seems reasonable, given the interval between the lwarx and stwcx in my algorithms is only a few nanoseconds and the OS doesn't get control at anywhere near that rate, so the probability of interference is rather low. However, depending on how the opcodes are implemented, it is perhaps possible for user programs running on multiprocessor computers to interfere with the OS on the other processor. I don't know if this is a real issue.
The reservations typically have some "granularity." A normal write to an address "near" the one that was read with reservation can clear the reservation. The definition of "near" is different for different processors. If your algorithm happened to perform such a write between the lwarx and stwcx, then various starvation/livelock problems could occur. My RCP algorithms only write local variables between lwarx and stwcx, which are put into registers in the assembly language implementations and so do not raise the specter of this failure.

There is some nonportable code in the implementation. In particular, the code that calls copyAndInc has a cast from a member data pointer to an int, which happens to return what I want (an offset into the class to the counter). I only know it does what I want because I checked, but it might not do what you want with your compiler, and it might not do what either of us want for some complicated class with multiple inheritance or whatever, so you may have to compute the offset in some other manner. For example, it might be better to, at initialization time, construct an instance of the class, get a pointer at cntr_ and another at the instance, and store the difference as the offset for that class.

Future Work

This RCP implementation was developed to convey the basic algorithms in a clear manner. An industrial-strength RCP implementation needs the usual template adornments that make the implementation complete and convenient to use.

This implementation uses what's called "intrusive counters," which means that the counter is part of the referent itself, as opposed to a separately allocated object. The distinction isn't critical to the integrity of my RCP algorithms, so an industrial-strength implementation could use either technique. But nonintrusive pointers might get rid of the massive member data pointer/offset hack, which would be good.

There are a number of powerful container algorithms (and other algorithms) that can be implemented using the same synchronization mechanisms, but have not yet been implemented so far as I can Google. I will present async-safe and lock-free implementations of some of the algorithms in future articles.

References

[1] http://www.freescale.com/files/if/cnb/MPCFPE32B.pdf.

[2] http://www.spinroot.com/.

December, 2004: Atomic Reference Counting Pointers

Table 1: Primitives that implement RCP.

Primitive	Description
lwarx	CPU-provided operation to atomically read a memory location and set a reservation flag.
stwcx	CPU-provided operation to atomically write to a memory location. If the reservation flag is set, and in any case return the value of the reservation flag.
aAdd	Atomically mem[addr] += delta; return mem[addr].
aSwap	Atomically tmp = mem[addr1]; mem[addr1] = mem[addr2]; mem[addr2] = tmp.
aIandF	Atomically tmp = mem[addr1]; mem[tmp] += delta;return tmp.

December, 2004: Atomic Reference Counting Pointers

Table 2: Approximate performance of various operations.

Function	Speed
Atomics::incr	16ns
Atomics::swap	23ns
Atomics::copyAndInc	22ns
RPC()	12ns
RCP(Node*)	70ns
RCP(RCP&)	68ns
~RCP on empty RCP	42ns
~RCP on non-empty RCP	62ns
rcp1 = rcp2	91ns