Efficient Variable Automatic Buffers
By Matthew Wilson
Automatic variables, including arrays, are substantially quicker than dynamically allocating storage from the heap. However, since automatic variables actually occupy space in the program stack reserved by the compiler, they are of a fixed size. Though some compilers and platforms provide facilities for dynamically allocating stack space by adjusting the stack pointer, this support is neither standard nor universal, and is quite limited.
In this article, I present auto_buffer, a simple, platform-independent STLSoft template class (http://stlsoft.org/) that attempts to fulfill dynamically sized allocations from an encapsulated array; otherwise, it allocates space from its parameterizing allocator. Because auto_buffer lets programmers specify the size of the encapsulated memory as a parameter to the template, this scheme gives you more control over memory allocation. Such control is especially significant in circumstances of limited stack or when stack allocations over system page size are undesirable. You can also use instances of auto_buffer in object composition as well as with variable automatic variables.
Stack variables are allocated from the executing thread's stack, in the scope of the local block within which they are defined, by adjustment of the stack pointer. Global variables are fixed in program global memory and allocated by reservation of space in the global memory area. In Listing 1, m1 is a global variable, m2 a stack variable, and m3 (a stack variable that points to) heap memory. Here, I focus on the implications of the use of stack memory, although some of the issues I discuss also apply to global memory.
Since the allocation of memory for global and stack variables is carried out at compile time, there is no actual "allocation" simply a manipulation of pointers or addresses. Disregarding virtual-memory issues, the memory already exists. Consequently, this form of memory allocation is extremely efficient; in fact, it is the most efficient form of memory allocation. An additional advantage is that you can determine, at compile time, the size of the available memory via the sizeof operator.
However, since the allocation is carried out at compile time, it can only be of fixed, predetermined size. This is often perfectly acceptable such as when, for example, dealing with filesystem entity names, which have a fixed maximum on most platforms. When writing such code, you can simply declare a buffer of the maximum potential size, confident that passing it to any function will not result in buffer overwrites. However, when dealing with APIs whose functions may use buffers of any length (the Windows Registry API, for instance), you can never guarantee to have a fixed buffer of sufficient size (though most of us have written RegXxx() code passing buffers of _MAX_PATH size!).
Another disadvantage of stack variables is that variables within a single function having a combined size above a certain (operating system) threshold can cause compilers to insert stack-checking code, thereby binding the executable to the C-Runtime Library.
You have a number of alternatives to fixed-size stack memory. Heap memory, the most obvious option, is the opposite of frame memory. It is obtained from the heap, or from one of a set of heaps, at runtime. Every heap API requires a function to allocate memory (malloc()) and, except where using garbage collection, a corresponding function to return it (free()).The advantage of heap memory is that the size of the buffer can be any practical size within the limitations of the runtime system (although some older memory APIs restrict the maximum size of individual buffers).
The main disadvantage of heap memory is that heap allocations are considerably slower than stack/global allocations (due to the complexity of implementing the memory-allocation schemes to reclaim and defragment freed memory). Additionally, it is possible that the request may not actually be satisfiable at runtime, requiring client code to manage that eventuality (whether through exception handling or through testing the returned value against NULL). Finally, if you forget to free allocated chunks, your process will likely die a slow death through memory exhaustion.
alloca() (and its vendor-specific variants) is an attempt to merge the speed of stack memory with the flexibility of heap memory. The alloca() function allocates memory, whose size is determined at runtime, from the stack rather than from the heap by simply adjusting the stack pointer. The memory is automatically freed when the enclosing scope exits. This is a useful facility and, where available and applicable, provides the best solution for most cases of automatic buffers. Unfortunately, alloca() has a number of disadvantages.
- It cannot give compile-time size and requires a local variable to keep track of memory size.
- It cannot reallocate memory in the same manner that realloc() (and its platform-specific analogs) can, so in that way, it cannot be a direct replacement for heap memory APIs.
- It is nonstandard, so it is not guaranteed to be available with all platforms and compilers.
- It has varied failure behavior: On Linux, it returns NULL if it fails to allocate the requested memory, on Win32 a machine-dependent exception is generated.
alloca() also requires stack-checking code and the context for its use is restricted. For instance, the Microsoft Win32 C-runtime _alloca() cannot be used from within certain exception-handling contexts since it can cause the system to crash unpredictably.
Two other disadvantages of alloca() are even more significant:
- By virtue of its mechanism, alloca() cannot be used to hold memory outside the current execution context, so it cannot be used for allocating variable-sized blocks inside object instances, templates, or other functions (i.e., it is not possible to write a local_strdup() unless you use macros).
- Finally, and most seriously, because of its adjusting of stack pointers, it can easily cause stack exhaustion when used in a function that performs a number of allocation/deallocation cycles. Indeed, the functions I used to test performance for this article quickly failed on both Linux and Win32 because of stack exhaustion. If it is not possible or practicable to move the alloca() call and processing of the memory it returns into a separate function called within a loop which may hurt performance too much to be useful anyway you are better off avoiding alloca() for looping code.
The C-language C99 enhancements offer another memory allocation alternative in the form of Variable Length Arrays (VLAs). VLAs provide a great deal of power and (syntactically at least) address the issue of dynamically sized stack array variables . Though a few compilers support them in C++, VLAs are not a part of Standard C++. (Randy Meyers tells me that, although there is no factor hindering VLA adoption into C++, the valarray class is the generally recommended alternative.) Hence, the main disadvantage of VLAs in C++ is lack of portability.
Where implemented, VLAs will likely be layered on top of either alloca() (or a similar technique) or heap memory. Whether they support all types, or just Plain Old Data (POD) types fundamental and simple aggregate types is implementation dependent. So, although the syntax of the language will be clearer than current solutions based on either alloca() or heap memory, the implications for performance, robustness, and availability are largely the same. Digital Mars C/C++ uses alloca(), and thus does not work with nonPOD types. Comeau C++ uses whatever VLA support is provided by the backend compiler. (Comeau Computing is considering directly implementing VLA in terms of alloca(), which would widen the support considerably.) Impressively, GCC supports nonPOD types in its VLAs, and does not experience the stack exhaustion that it does with alloca().
The purpose of auto_buffer is to emulate the syntax and semantics of a standard array variable. The requirements are:
- To attempt to allocate requested memory from its internal buffer, otherwise allocating from the heap
- To be as compatible as is possible with normal raw array syntax in the access to and manipulation of its allocated storage, in particular facilitating the use of array arithmetic and index operators .
- To provide basic STL container methods empty(), size(), begin(), and end(). Note that begin() and end() are added purely as a programmatic convenience and should not be taken to mean that auto_buffer is a full STL container. It is not, in part because it is not (currently) able to allocate and manage instances of nonPOD types.
auto_buffer's constructor performs only the allocation of memory, not in-place construction of elements. Storing arrays of nonPOD type objects would cause problems, since the construction and/or destruction of these objects would not be carried out. While the presence of the m_internal array member prevents compilation of parameterizations of auto_buffer with classes that do not provide publicly accessible default constructors, default-constructible types could still cause problems. Hence, the presence of the constraint in the constructor, which uses the type_is_non_class_or_trivial_class constraint class (see Listing 2) to prevent compilation for nonPOD types. The constraint is simply a union containing the type, thereby preventing an instance being created when parameterized with a nonPOD type. (This check can be omitted by specification of a preprocessor symbol, but you're responsible for ensuring correctness if you use it.)
The class itself is comprised of a single constructor, destructor, data(), size(), and empty() methods, const and non-const sequence methods (begin() and end()), and const and non-const conversion operators ; see Listing 3 .
The implementation of the class is straightforward. Almost all the action is in the constructor. It takes a single argument: the requested number of elements in the array. This is tested against the size of the internal buffer and, if not larger, the m_buffer member is set to point to m_internal, the internal array. (This is fairly similar to the small string optimization; see Effective STL, by Scott Meyers, Addison-Wesley, 2002). If the requested size is larger, a request is made to the allocator, setting m_buffer to the allocated block. (All accessor methods refer to m_buffer, in either case.)
In the latter case, it is possible for the allocation to fail. Because an important requirement for the class (and the STLSoft libraries as a whole) is to be as widely compatible as possible, the constructor is written to work correctly both in situations where allocation failures result in an exception, and in those cases where the allocate() method returns NULL. When exceptions are thrown, they are propagated to the caller of the auto_buffer constructor and the instance of the auto_buffer is not constructed.
Some allocators do not throw exceptions when they fail to secure enough memory for the requested allocation, instead returning NULL (for example, the STL that ships with Visual C++ 5 and 6; see my article "Generating Out-Of-Memory Exceptions," Windows Developer's Journal, May 2001). Also, if you are creating a small program, you may not want to compile/link in exception-handling mechanisms and may deliberately plug in a NULL-on-failure allocator. In such circumstances, it is a good idea to leave the auto_buffer in a coherent state; therefore, the initializing condition for m_cItems discriminates on whether m_buffer is nonNULL. In the case where NULL is returned, the remaining construction of the auto_buffer instance results in initialization of the m_cItems member to 0, thereby providing correct behavior for the use of this empty instance; namely, that begin() == end(), empty() returns true and size() returns 0.
This scheme uses the fragile technique of relying on the relative member (initialization) order of m_buffer and m_cItems. The constructor includes assertions (static/compile-time and dynamic/runtime) to guard against any maintenance edits that are not mindful of this requirement and might change this ordering, resulting in the discrimination against the value of m_buffer testing garbage and the classic undefined results: a crash. Such member ordering dependencies are not generally a good idea, but I chose to use the technique here because it allows me to declare m_cItems as const, and the assertions ensure that all is well .
In keeping with good practice, the constructor is explicit (though it's hard to conceive of an implicit conversion scenario against which to guard). The destructor has a straightforward implementation. By testing m_cItems against the size of the internal buffer, it determines whether m_buffer points to m_internal; if not, it frees the heap memory by calling the allocator's deallocate() method.
The remainder of the method implementations are trivial. Pointer conversions (operator value_type *() and const version) are used rather than index operators (value_type &operator () and const version), since the pointer conversion approach supports both conversion to pointer and (implicit) array indexing, whereas the index operator approach on its own allows only (explicit) array indexing .
Finally, the auto_buffer class can be used in composition within other classes.
Parameterizing the Template
Using the template requires parameterization of the element type, allocator type, and size of the internal array (m_internal). Again, the class does not require a particular memory-allocation scheme to support its semantics. As long as the allocator supports the STL Allocator (http://sgi.com/tech/stl/Allocators.html) concept, you may use any compliant allocator. This approach offers several advantages in addition to the benefit of conforming to STL practices, particularly when the code is to work with Visual C++ and you want to avoid linkage to the C-Runtime Library. The size of m_internal, measured in number of elements rather than bytes, is given by the third parameter, which defaults to 512. Client code can specify any size here, to best match the most common required array size, for maximum performance benefit.
For all compilers other than Borland's, the auto_buffer derives from the allocator and takes advantage of the Empty Base Optimization (EBO) (http://www.gotw.ca/publications/mill07.htm) when appropriate. However, with Borland 5.5 and 5.6, this causes such significant performance degradation (actually worse than the malloc()/free() scenario in all cases) that an allocator is, instead, shared between all instances .
I measured auto_buffer's performance via a test program on both Linux (single-processor desktop 800 MHz, 128 MB) and Win32 (single-processor desktop 2 GHz, 512 MB). It was tested against memory-allocation types such as: stack variables; heap using against malloc()/free(); heap using operator new/delete; dynamic stack allocation using alloca()/alloca(); VLAS; and std::vector.
For each allocation type, the program allocates a block, the size determined by the first program parameter, accesses a byte within it (to prevent the compiler optimizing away the loop), and then deallocates it. The operation is repeated for a given number of times determined by the second program parameter. Since the program's auto_buffer class is parameterized with the default 512 for the internal buffer size, the two sizes tested are above and below this; specifically, 100 and 1000 bytes repeated 10 million times. (Tests of 10 and 100 bytes with a 64-byte buffer size yielded virtually identical relative results for all compilers.)
auto_buffer does a test (comparing the size of its internal buffer against the requested buffer size) before making any heap allocation. In circumstances where it must allocate from the heap, performance is less than going straight to the heap. Therefore, the purpose of the performance test is to quantify the presumed superiority where it uses its internal buffer, and the presumed inferiority where it allocates from the heap.
The resultant times were normalized with respect to the time taken for malloc()/free(), and are expressed in percentage terms; see Table 1. Except where marked, they are on the Win32 platform. For figures marked by an asterisk (*), it proved impossible (on both Linux and Win32) to obtain meaningful performance figures due to stack exhaustion crashes on any useful level of loop repeats. This was the case for all alloca() implementations as well as VLAs with Digital Mars (which is expected given its implementation over alloca()). The surprising thing is that GCC's VLA did not suffer stack exhaustion (as its alloca() did), which implies it uses a different technique. Though about five times slower than stack memory, it was twice as fast as auto_buffer. (My suspicion, therefore, is that it adjusts the stack pointer back when the scope, rather than the function, is exited thus avoiding the exhaustion.)
When the requested size of the buffer causes heap allocation, the cost is less than 113 percent for all compilers, except Borland, where it is 213 percent. This cost is not in any way detrimental, since the wise programmer will specify the template size parameter such that the vast majority of allocations will fit within that size. The overall effect in such circumstances will be a significant net performance gain.
The vector performance is 100-370 percent and 129-2600 percent for the two scenarios, so there's no doubt that auto_buffer is vastly more efficient than vector. But remember that auto_buffer is not resizable, nor does it (currently) support nonPOD types. Furthermore, it does not initialize the memory it acquires (for POD types). All of this is by design I wanted one-time allocation and wanted it fast and so it's fair to draw a comparison with vector, but only in these circumstances. auto_buffer is not a replacement for vector, and it is not intended to be.
auto_buffer has two disadvantages. The first is that the current implementation designed for maximum efficiency cannot be used with nonPOD types. However, the reason for the existence of the class, and where it has found its main use thus far, is as dynamic local character (ANSI or Unicode) arrays, so it fully satisfies its requirements. Hence, though there are plans to expand its capabilities to work with nonPOD types using traits/policies to efficiently discriminate any needed (un)initialization they are not yet a high priority. Indeed, before releasing such an updated version, I would want to ensure that reimplementation to policy-based initialization (or lack thereof) would not have a deleterious effect on performance for currently supported types. auto_buffer would also need to be significantly faster than std::vector for nonPOD class types; if it weren't, then there'd not be much point!
The second disadvantage is more significant, subtle, and potentially expensive. Although it can be advantageous to use auto_buffer as a member of other classes, you should only do so in cases where the most common allocation size is exactly, or close to, the internal buffer size. If the typical size is larger than the internal buffer, most instances of the class will not use their internal buffer at all. If the typical size is much less than the internal buffer, most instances of the class will not use most of their buffer. In either case, when the containing class is on the heap, the auto_buffer and its internal buffer will also exist on the heap, potentially wasting a large amount of memory.
Table 2 lists the relative merits of the various memory-allocation schemes. The advantages over the other schemes are efficiency, dynamic size, platform independence, compiler independence, flexibility of allocation scheme, and the avoidance of stack-checking code. Clearly, when used wisely, auto_buffer has the best mix of features of all the schemes for automatic arrays of fundamental and POD types. In addition, auto_buffer can be used in composition, although care must be exercised.
Thanks to Randy Meyers for enlightening me about VLAs, in particular, the cross-language issues. Thanks also to Greg Comeau of Comeau Computing, and Walter Bright of Digital Mars, for lending their expertise here and there.
 Randy Meyers, "The New C: Why Variable Length Arrays?" C/C++ Users Journal, October 2001.
 I don't like implicit conversion operators as a rule, but in this case, they're required to achieve array-like behavior, emulating built-in arrays' names being synonyms for the address of their (first) elements.
 Please forgive the nonstandard preprocessor symbols beginning with underscores. This is a bad habit I'm gradually shedding. From STLSoft v1.7.1 onwards, they will not appear.
 Practice wins out over theory here. In principle, offsetof() is undefined for nonPOD types, so applying it here is not strictly valid. However, for all compilers supported by STLSoft, it does what's required, so it is used. If STLSoft is ported to a compiler that lays out classes such that this would not apply, then auto_buffer, or the macro, will be rewritten accordingly.
 Though index operators would afford the possibility of some simple index validation (for example, assert or even exception), I felt that this still does not make up for the concomitant restriction of client code expressions.
 Thankfully, allocators are not allowed to act as if they have per-instance data, and most actually do not; but if they did, there's a slight possibility of a multithreading race condition here, which is not pleasant.
Matthew is a software development consultant for Synesis Software, author of the STLSoft libraries, and author of Imperfect C++ (to be published by Addison-Wesley, 2004). He can be contacted via email@example.com or http://stlsoft.org/.