Dr. Dobb's | Improving Performance with Custom Pool Allocators for STL

Improving Performance with Custom Pool Allocators for STL

Anthony presents a highly flexible and configurable replacement for C++'s std::allocator for use with node-based standard containers.

September 01, 2005
URL:http://www.drdobbs.com/cpp/improving-performance-with-custom-pool-a/184406243

September, 2005: Improving Performance With Custom Pool Allocators for STL

Anthony is a software developer in the Natural Language Processing group at Microsoft Research. He can be contacted at [email protected].

When preaching the benefits of C++, evangelists invariably mention things such as type safety, generic programming, and multiple inheritance when trying to convince devotees of other languages that C++ is better. For me, one of the most attractive features of C++ is the Standard Template Library (STL). Having access to a standardized, well-designed, and well-tested set of containers with algorithms to operate on them takes the load off programmers and lets us concentrate on the interesting parts of the code.

At the root of the STL is an incredibly rich set of containers. The STL provides default implementations for contiguous storage (vectors), lists, sets, and maps. Most STL implementations have hash sets and hash maps, which will become canonical in the next version of the Standard. This rich set of containers lets you store data in the most efficient manner. If frequent insertions and deletions are likely but the data does not need to be sorted by a key, the data can be put in a list. If constant-time lookup is a requirement, they can be put in a hash map instead. Best of all, if your requirements change, you can usually change container types without having to touch too much of your code. While the ideal of container-neutral code is unfortunately unattainable—see Item 2 of Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library, by Scott Meyers (Addison-Wesley, 2001)—much of the code that inserts, iterates through, and retrieves elements from the various container types is, thanks to the magic of iterators, similar if not identical.

The Problem

As is always the case, however, general solutions do not come without a price. With STL containers, one of the ways we pay that price is in the final template argument to all of the containers—std::alloc<T>. Every STL container lets you specify the type of the allocator used to acquire storage for whatever it is you're storing, as well as the container's bookkeeping information. You don't have to specify it, because the Standard Library provides a default argument for you in the form of std::allocator.

The default implementations of std::allocator provided by STL implementations vary in complexity by the STL implementation from thin wrappers around new and delete to carefully built allocators running to thousands of lines of code. Different allocator implementations shine in different scenarios. If you're allocating a big block of memory, holding on to it for a long time, and then releasing it all at once (as is the case with vectors and deques), it may be hard to beat the minimalist wrapper around new and delete. If you're using node-based containers (such as maps, sets, or lists), allocators optimized for smaller objects are probably a better way to go.

In some situations, however, the standard allocator can penalize you heavily in terms of both time and space, and unless you're paying attention, you won't even realize it. There won't be any memory leaks, and your program won't crash (at least, not until it runs out of memory). You'll just be using more memory than you should, and possibly spending a lot of unnecessary time in calls to allocate and delete. The problem is that std::allocator is, by definition, a general-purpose allocator, and yet there's really no such thing as general-purpose memory usage. Different programs use memory differently. Even within the same program, different areas have different memory usage patterns.

What To Do?

I'm certainly not the first person to realize that this is a problem. Scott Meyers, for instance, talks about using alternatives to std::allocator (see Effective STL). Andrei Alexandrescu devotes an entire chapter of Modern C++ Design: Generic Programming and Design Patterns Applied (Addison-Wesley, 2001) to an allocator for small objects. The Boost library provides two wrappers around its pool_allocator that are std::allocators. It would be nice to have a whole family of allocators, each optimized for a different memory usage pattern, ideally sharing as much code as possible. It should be possible to switch from one allocation strategy to another as easily as possible.

Again, each program has different memory usage patterns, and this makes it difficult for memory allocator writers to write a general-purpose allocator that's good for all cases. No matter how your memory allocator is designed, it's possible to write pathological programs that result in fragmented memory and poor allocation times (see "An Estimate of the Store Size Necessary for Dynamic Storage Allocation," by J.M. Robson, Journal of the ACM, 18(3):416-423, 1971). However, there are common usage patterns that are seen again and again, especially with standard containers. Once you know how your container is going to be used, you ought to be able to specify the allocation strategy that performs best with your particular usage pattern. In addition, the environment in which the code is being executed plays a part in design decisions. For instance, if an allocator is only going to be used in single-threaded environments, there's no need to worry about protecting its data against access from multiple threads.

The family of allocators just described is a perfect example of a case where policy-based design can really shine. Knowledge of how the container will be used, performance requirements on the container, the size of the objects that will be allocated, and the environment in which the code will be executed are all generally known at compile time. Ideally, users of the allocator could specify the right combination of policies, and an allocator configured to perform well for the given situation would be generated at compile time.

Design

The allocator library I present here (available electronically; see "Resource Center," page 3) is a pool allocator. Pool allocators are an extremely efficient method for providing allocation services for objects of fixed size s. The basic strategy is to reserve a chunk of memory sufficient for storing N objects of size s all at once. When the allocator is asked to provide a chunk of memory, it simply returns an offset mod s into the allocated chunk. This is far more efficient than calling operator new separately for each request because it avoids the bookkeeping overhead required of a general-purpose memory allocator that has to service requests for different-sized allocations. It's an economy of scale.

The decision to use a pool allocation scheme limits its usability in some cases. Because pool allocators are based on the idea that you'll be allocating objects of a single size, using them with containers that are likely to request different amounts of memory with each allocation doesn't pay off. This means that you won't see the same performance boosts when using the allocator with large containers of type std::vector, std::deque, or std::string as you would in node-based containers—such as std::map, std::set, and std::list To guarantee O(1) amortized insertion time, std::vector and friends have to request memory proportional to the amount of memory currently allocated at each request. In fact, requests for allocations greater than 64 bytes will be shunted off to malloc and free. This isn't such a serious limitation because the default memory allocator is generally optimized for dealing with large, variable-sized memory blocks anyway.

A potentially more serious caveat is that, since the allocator uses nonstatic data, it's not technically Standard compliant because the Standard requires that allocators of the same type be equivalent. See Effective STL (Item 10) for a thorough explanation of the issue. This amounts to requiring that an allocator for a given type be able to deallocate memory allocated by any other instance of an allocator for that type. For many uses of standard containers, this requirement is unnecessary (some might say Draconian). However, there are two cases where this requirement is absolutely necessary: list::splice and swap(). The case of swap() is especially serious because it is needed in order to implement certain operations on containers in an exception-safe manner (see Exceptional C++, Item 12). Technically, swap could be (and in some cases, is) implemented in the face of allocators that don't compare equally—items could be copied or the allocators could be swapped along with the data—but this is not always the case. For this reason, if you're using swap() or list::splice, you should make sure to use HoldingPolicySingleton; otherwise, you're bound to run into some really nasty behavior.

The lowest level of the allocator is the PoolChunk class, which represents a contiguous chunk of memory living at m_pMem, big enough for NUMBLOCKS objects of size ALLOCSIZE. It uses a neat space-saving trick (used by many implementations of pool allocators) to store a linked list of memory blocks in the same space that it uses for servicing allocation requests. Upon construction, the beginning of each block is an integer pointing to the next free block. The object keeps track of the head of this list in m_pNext. When a block is requested, the head of the list is updated to point to the second chunk in the list, and the chunk of memory that used to be the head is returned. When a block is freed, the beginning of the block is updated to point to the old head, and this block becomes the new head.

PoolChunk exposes a very basic interface:

PoolChunk(size_t allocsize_set, IndexType num_blocks_set) constructs a new PoolChunk for num_blocks_set objects of size allocsize_set.
bool In(void* p) simply tells you whether the memory at p comes from this PoolChunk.
bool Empty() tells you whether the chunk is empty.
bool Full() tells you whether the chunk is full.
void* allocate() returns the next item from the free list, or null if the chunk is full.
void deallocate(void* p) puts p at the head of the free list.

The second layer of the allocator is the BunchOfChunks object, so named because it aggregates multiple PoolChunks. Because each PoolChunk has a static size, you need more than one of them in order to be able to satisfy an arbitrary number of allocation requests. Like the PoolChunk object, the BunchOfChunks object only handles allocation for objects of a fixed size. It does so by keeping a collection of PoolChunk objects and farming the requests out to them. The BunchOfChunks interface is even simpler than the PoolChunk interface. Aside from the constructor and destructor, only two public functions are exposed:

• void* allocate() finds a nonempty PoolChunk and allocates a block of memory from it.
void deallocate(void* p) finds the PoolChunk that allocated p, and deallocates it from that PoolChunk.

I've deliberately left the exact mechanisms for storing and accessing individual PoolChunks vague in this description, since these are details handled by BunchOfChunks's STORAGEPOLICY (one of the required template arguments).

The previous two objects provide allocation services for objects of fixed size. In order to be able to handle requests for allocations of arbitrary size, which is required of an std::allocator, you need to aggregate BunchOfChunks objects for various sizes. PoolAllocator, the third level in the design hierarchy, does just this. It keeps a vector of BunchOfChunks objects at m_vboc, sorted by size. Once again, the interface is straightforward:

void* allocate(size_t stNumBytes) does a binary search in m_vboc to find a BunchOfChunks for size stNumBytes. If it finds one, it just calls the allocate() method on it. If it doesn't find one, it creates a new one and inserts it into the vector.
void deallocate(void* pDeallocateMe, size_t stNumBytes) finds the BunchOfChunks for stNumBytes and calls the deallocate() method on it.

The final layer of the allocator is pool_allocator_stl, which contains the implementation of the std::allocator interface. It holds an instance of PoolAllocator and delegates allocation and deallocation requests to it.

The multitiered design (individual chunk object -> collection of chunks -> collection of collections of chunks for various sizes) is a typical design for pool allocators—both boost::pool and Alexandrescu's SmallObject follow similar designs. Readers will notice a more-than-passing resemblance to Alexandrescu's SmallObject implementation, to which I am much indebted, as it was my first exposure to the issue of small object allocation. What makes this implementation more flexible is the rich set of policies for specifying the behavior of the allocator.

Evaluation

Evaluating memory allocators is notoriously difficult because every program uses memory somewhat differently. To measure allocation and deallocation directly, I wrote a drop-in std::allocator (implemented in trace_allocator.hpp, available electronically) that records each allocation and deallocation to a log file. I specified this allocator in containers that use memory in different ways, and then wrote a testing application that "plays back" the memory allocations and deallocations using different allocators, monitoring memory usage, page faults, and execution times.

The four containers I attached the allocator to are:

simple_map: creates an int->int map, fills it with some integer pairs, then lets it go out of scope. Based on a quick survey of my colleagues, as well as a look through code I've written and/or maintain, this is a very common scenario.
fixed_priority_queue: is based on a std::set of fixed size. The set is allowed to grow to a certain size. Thereafter, new inserts are only allowed if the item to be inserted is greater than the smallest item in the set. The smallest item is deleted and the new item is inserted. A real-world example of this would be an MRU cache.
remove_if: creates an int->int map and adds a number of integer pairs to it. Generate a random integer, call remove_if on the map with that integer. Repeat three times.
random: creates an int->int map. Insert a number of integer pairs to it. Then, for some number of iterations, randomly either insert a new pair or delete an old pair.

Figures 1 through 4 (1, 2, 3, 4)depict the memory profiles for the cases I instrumented. I played each allocator trace back using allocators built with these compilers, in each case running once with the default (compiler included) STL, and once with STLport 4.6.2:

Microsoft Visual C++ 7.1
GCC 3.3.3 under cygwin
GCC 3.3.3 under Linux

Performance data for each combination of policies, on all configurations, is in the spreadsheet entitled "results.xls" (available electronically).

Policies

The actual definition of pool_allocator_stl is as follows:

template<
class T,
class STORAGEPOLICY,
template<class> class ThreadingPolicy,
template<class> class HoldingPolicy,
class GrowthPolicy,
bool DeletionPolicy=true>
class pool_allocator_stl.

Aside from the first template argument T, which is the type of object to be allocated, the rest of the arguments are all configurable policies that govern various aspects of the allocator's behavior.

I'll now give a quick description of each of the policies, along with some charts showing the effects of each policy on each of the data sets. The performance numbers I present for each policy are based on code compiled with the Microsoft Visual C++ 7.1 compiler using the default STL implementation, with aggressive speed optimizations turned on (/Ox /Os).

StoragePolicy. The StoragePolicy is the most interesting and challenging part of the allocator. It dictates how individual PoolChunks are stored in a BunchOfChunks. It is also responsible for actually releasing the memory allocated by BunchOfChunks. The interface is defined by the following functions:

void Store(ChunkType* pStoreMe) stores the chunk in question for later retrieval.
• void Purge() releases all memory from the container.
ChunkType* GetChunkForAlloc() returns a pointer to a nonempty PoolChunk in the collection, or NULL if there is not one.
ChunkType* FindChunkDealloc(void* p, iterator& itrRet) finds and returns the PoolChunk containing p. Also returns a StoragePolicy::iterator (necessary typedef) via itrRet that points to the deallocated chunk. This enables us to efficiently remove the chunk from our collection if necessary
void DealWithEmptyChunk(iterator itr) is called by BunchOfChunks whenever a chunk becomes empty as the result of a deallocation.

I've implemented three different StoragePolicies:

StoragePolicySimple keeps its PoolChunks in an std::list. When a chunk is requested for allocation, it looks in the chunk most recently allocated from. If there's still room, it just returns that chunk. If the current chunk is full, it does a linear search in the list for a chunk with free space. Allocation and deallocation are O(c), where c is the number of chunks in the list. Deallocation is handled similarly.
StoragePolicyBoundary is similar to StoragePolicySimple, but it does a bit of extra housekeeping to enforce a boundary between full and nonfull chunks. This improves worst-case allocation speed to O(1), though deallocation is still O(c) in the number of chunks in the list.
StoragePolicyOrdered keeps its PoolChunks in an std::map. Allocation is O(c), but deallocation is O(logc).

Figure 5 shows the effect of StoragePolicy on speed in each of the contains I profiled.

HoldingPolicy. The HoldingPolicy is a simple policy that lets you control the lifetime of the pool_allocator object held by pool_allocator_stl. It has only a single function, T& Get(), which returns a reference to the encapsulated object.

Currently, there are two HoldingPolicies:

HoldingPolicyStack, which puts the encapsulated object on the stack, to be cleaned up whenever it goes out of scope.
HoldingPolicySingleton, which provides access to a shared pool_allocator for all objects.

ThreadingPolicy. ThreadingPolicy consists of a few simple typedefs:

VolatileType, inspired by Alexandrescu's ThreadingModel, is a way to make sure that data members shared across multiple threads are properly qualified as volatile.
MutexType indicates a type that must in turn have a scoped_lock typedef. Thread safety for multithreaded policies is guaranteed by wrapping all state-changing code in a scoped_lock.

I've provided two separate threading policies:

ThreadingSingleThreaded is a policy in which the lock is not operational. As you might expect, it's always preferred for single-threaded programs. You can use it in multithreaded programs as well, so long as you are using HoldingPolicyStack. Each container has its own copy of the allocator, and any container operations causing allocation or deallocation that might be called from multiple threads have to be locked by the client for thread safety anyway, so there's no need to lock the allocator separately.
ThreadingMultiThreaded is a policy suitable for use in multithreaded situations. It's a thin wrapper around boost::mutex for maximum portability.

GrowthPolicy. Each BunchOfChunks object has a GrowthPolicy member that's responsible for calculating the size of the next chunk to allocate. The policy implements a single function, unsigned int NextChunkSize().

Two GrowthPolicies are defined:

GrowthPolicyLinear allocates fixed-size chunks of size 255 each time.
GrowthPolicyExponential doubles the size of the last chunk.

The GrowthPolicy basically controls the relationship between N, the number of objects allocated; and C, the number of PoolChunks encapsulated by the BunchOfChunks. This has a predictable effect on the performance guarantees of the container. When you use GrowthPolicyLinear, C is O(N). When you use GrowthPolicyExponential, C is O(log(N)). This lets you optimize for time or space. For instance, recall that allocation and deallocation time using StoragePolicySimple were both O(C). If you use GrowthPolicyLinear, then allocation and deallocation time will be O(N), but the amount of wasted space will be O(1) because you'll never be wasting space for more than 255 objects. If you use GrowthPolicyExponential, allocation and deallocation will be O(log(N)), but the amount of wasted space will be O(N). Figures 6 and 7 show the effect of GrowthPolicy on speed and working set size.

DeletePolicy. DeletePolicy is just a Boolean flag that's passed as a template argument to pool_allocator_stl. It controls whether objects are actually deallocated when deallocate() is called. When DeletePolicy is set to True, deallocation is performed as per normal. When it's set to False, deallocation becomes nonoperational. There are plenty of times when this can come in handy. For instance, a very common memory usage pattern is to fill a container, perform some processing with the elements in it, save the results as temporary, and then destroy the object (or let it go out of scope). In this case, you know you're going to be getting rid of the entire container at once, so there's no need to call deallocate() separately for each object. All of the memory will be released at once as soon as the allocator goes out of scope. Figure 8 shows the effect of DeletePolicy on speed.

Comparative Performance

As you can see in Figures 9 through 12, the graphs comparing different policies, the best set of policies to use can vary depending on the allocation pattern. To determine how much performance gain I could achieve over less-configurable solutions, I also ran each scenario against the default std::allocator for the STL I was using, as well as against boost::pool_allocator and boost::fast_pool_allocator.

In most platform/STL combinations, the allocator presented in this article performs better than the others. This is not surprising because, in each case, we were able to tune the instantiated allocator type according to the specific memory use at hand. One thing that puzzles me, however, is that the boost::fast_allocator is invariably faster on Linux when using the STL provided with GCC, sometimes by a large margin. The most likely explanation seems to me that the allocator was probably optimized in this configuration.

Conclusion

I've presented here a highly flexible and configurable replacement for std::allocator for use with node-based standard containers. An efficient and configurable allocator is useful because of the widespread use of standard containers. On several occasions, I've been able to make substantial improvements to the performance and/or memory usage of programs by a simple one-line modification that specifies a different allocator for a given container. Hopefully, the allocator described here can be as useful to you as it has been to me—give it a try and see!

Acknowledgments

Thanks to Eric Niebler for pointing out the danger of swap()ing containers whose allocators have state, and to Andrei Alexandrescu for pointing out the necessity of special allocators for small objects in the first place.

DDJ

September, 2005: Improving Performance With Custom Pool Allocators for STL