Space Efficient Sets and Maps

By Michael Carrato, July 01, 2003

Combining the best of the vector and associative container worlds.

Introduction

Standard C++ provides a rich set of collection templates, including the tree-based associative collections std::set and std::map. std::set and std::map are flexible, scalable, and highly performing, but there are cases where the size overhead of the internal trees is impractical. This impracticality is especially evident for small data types, where the size overhead of tree-based collections can be as high as 300%.

Sorted vectors, on the other hand, have very little overhead and have gained in popularity among developers in search of a leaner, meaner associative collection. However, the rigid structure of sorted vectors limits their flexibility, relegating them to special-purpose use.

This article describes a new type of sorted-vector-based associative collection, which is almost as flexible as the tree-based collections and almost as space efficient as pure sorted vectors. The result is a general purpose, space efficient, drop-in replacement for the standard associative containers, std::set and std::map.

Background

The Amazing std::vector

Structurally, the Standard C++ vector could not be any simpler. Implemented internally as a contiguous array of elements, the Standard C++ vector achieves random access to elements by simple pointer arithmetic, and space overhead is minimal.

Yet, despite its structural simplicity, std::vector is a powerhouse: small, fast, flexible, scalable. I have always been fascinated by one particular feature of std::vector: its ability to expand its capacity automatically in amortized constant time [1], despite having to recreate its entire structure to expand.

The reason for this amazing property is that std::vector always allocates some extra space when expanding, just in case you decide to add more to the end. And, most importantly, the amount of extra space is a multiplicative factor of the current capacity.

To summarize the complexity analysis of std::vector::push_back, by allocating extra space in multiplicative increments, "the N's cancel" and the resulting expansion complexity is independent of the size of the container. In layman's terms: as the collection grows, the extra space allocated grows too, meaning that reallocation happens less often as the collection gets bigger. It's a brilliant, mathematically sound design, the compounding interest of the collection world.

The raw power and flexibility of std::vector has made it the ordered collection of choice for many applications where a linked list might have been used in the past. As long as inserts and erases occur only at the end of the collection, std::vector is generally superior in speed and space to std::list, for most applications.

Sorted Vectors

Sorted vectors do not, in general, benefit from this efficient expansion property. The reason is that each insert or erase in a sorted vector forces a reorganization of the entire structure, copying an average of N/2 elements per insert/erase. This is unacceptable for all but the most limited applications: where the collection size is small, or where modifications are rare.

The ideal solution would be to find a way to allocate extra space in a sorted vector, just like regular vectors do, but do it in such a way that does not require objects to be moved en masse for each modification. The key is to generalize the concept of extra space so that the same principle can be applied for sorted collections.

Design

The "Extra Space"

In std::vector, the extra space takes the form of uninitialized memory at the end of the buffer. During a reallocation, std::vector allocates space for N + kN elements, where k is the expansion factor. As the vector grows, the extra kN elements are filled in.

But, in order to satisfy the amortized constant complexity constraint, it doesn't matter where the extra elements go, or even how they're stored. The only thing that is important is that kN elements are efficiently stored somewhere, somehow, without incurring a reallocation (and recopying) of the vector. The other details are incidental, from a complexity standpoint.

It happens that the extra space for std::vector is located past the end of the valid elements. But std::vector push_back complexity would be no different if the extra space were a separate data structure entirely. In fact, if such an implementation could be made to look contiguous to the user, this might even be a valid design for std::vector.

This reveals an important constraint in designing the "extra space." Whatever the implementation, it must appear seamless to the user. For sorted vectors, the elements in the extra space must always appear to be in sorted order with respect to the rest of the collection. And, to be compatible with tree-based collections, inserts into the extra space must happen in O(log(N)) time, or better.

The following sections describe the design of a hybrid collection, CompactSet, with the core elements stored in a sorted vector, and the new elements inserted into a std::set. This set serves as the extra space, storing elements efficiently, in sorted order, until enough have accumulated to rebuild the sorted vector.

CompactMap is also provided in the source code (available for download at <www.cuj.com/code>), as a simple manipulation of CompactSet to allow both keys and values.

CompactSet: A Hybrid Collection

CompactSet consists of the following members:

archiveSet: a sorted vector-based collection that contains the base N elements of the collection.

dynamicSet: a std::set that contains up to kN newly inserted elements.

compactionThreshold: an integer containing the maximum size of the extra space.

When first constructed, archiveSet is empty. Elements are inserted into dynamicSet until its size equals compactionThreshold, at which point dynamicSet is merged into archiveSet. This step is called compaction. After compaction, compactionThreshold is set (as a multiplicative factor of N, where N has increased), and the process starts over again.

Figure 1 shows an example CompactSet of integers just before compaction. The archive set contains values from the previous compaction, and the dynamic set contains values inserted since then.

Compaction merges the two collections into a new, larger archiveSet, shown in Figure 2. dynamicSet is cleared, in preparation for the next batch of inserts.

Compaction

Compaction must occur in linear time (or better), and the rebuilt archive set must be in sorted order.

Fortunately, there is a standard algorithm that satisfies these constraints: set_union. This function takes two sorted collections as input, merging the values into a third sorted collection, in linear time.

By analogy to the std::vector expansion algorithm, CompactSet compaction occurs in amortized constant time, as long as compactionThreshold is always initialized to a multiplicative factor of the collection size.

A United Front

In order for CompactSet to act like a standard set, it must present a united front to the user at all times. The implementation must hide the complexities of the hybrid implementation, essentially merging the two internal collections dynamically for every member function. In reality, no actual merging occurs (except during compaction), but each function must act as if it is working on a merged std::set. And each function must run in a complexity that is no worse than the corresponding std::set version.

Implementation and Complexity Analysis

CompactSet mimics the std::set interface, which can be broken down into four basic operations:

Find
Insert
Erase
Iterate

The following sections describe the algorithms for implementing each of these operations in CompactSet.

Find

find must locate a value and return its iterator. Since a value may occur in either of the two collections (or neither, but not both), both must be checked. The algorithm for finding a value val is as follows:

If(dynamicSet.contains(val))
  return dynamicSet.find(val)
Else if (archiveSet. contains(val))
  return archiveSet.find(val)
Else return end()

Neglecting the iterator construction for now, this algorithm does no more than 2log(N) comparisons. This is complexity O(log(N)), like std::set::find.

Insert

The std::set specification does not allow duplicates. So CompactSet::insert must ensure no more than one copy of a given value exists in both archiveSet and dynamicSet.

insert first searches for the value in dynamicSet. If the value exists there, it returns immediately, since std::set does not reinsert existing values. Otherwise, insert attempts to insert the value into archiveSet, by calling archiveSet.tryInsert.

This may seem to be a contradiction, since archiveSet does not generally allow efficient inserts. However, there is one case where archiveSet can efficiently insert a value: when the value fits into a slot where another element was previously erased. And, of course, the value may already exist in archiveSet, in which case tryInsert succeeds without inserting the value.

Therefore, tryInsert returns one of the following three state codes:

existed: the value already existed and was not inserted.
inserted: the value was inserted as a new element.
insertionFailed: the value could not be inserted efficiently.

If archiveSet.tryInsert returns insertionFailed, the value is inserted into the "extra space," dynamicSet. (The other return codes are necessary so that CompactSet::insert can construct the return value.)

A complexity analysis reveals that there are no worse than 2log(N) comparisons. In addition, compaction may occur as a result of insert. So the total complexity for insert is (O(log(N)) + amortized O(1)). This is equivalent to amortized O(log(N)), which is, on average, the same as std::set insert.

Erase

The erase algorithm was not described in the design section. Most of the design was in terms of insert. However, erase is analogous to insert in many ways and shares many of the same design concepts.

Like insert, erase triggers a compaction based on the compactionThreshold variable and does so in such a way that the performance guarantees are preserved: erase runs in amortized O(log(N)).

However, erase presents an additional problem: if the value to be erased happens to be in archiveSet, how can it be erased without incurring a linear performance hit?

The solution is to hold a parallel collection of Booleans in a vector<bool> member named deleted. The deleted vector is resized to the number of elements in archiveSet, and each Boolean is initially false. A value of true in the deleted vector indicates that the corresponding value in archiveSet has been erased. The deleted vector adds only one bit per element of overhead.

The archiveSet.erase function locates the index of the value to be erased in archiveSet and sets the corresponding Boolean to true in the deleted vector. During the next compaction, the erased values are filtered out, and the deleted vector is reset.

A side effect of this design is deferred value destruction (see sidebar).

The basic algorithm for CompactSet::erase(val) is as follows:

archiveEraseCount = archiveSet.erase(val)
//might still be in the dynamicSet
if(archiveEraseCount ==  0)
  dynamicSet.erase(val)

Iterate

Iterators were the most difficult piece to design and implement. The iterator must be able to navigate forward and backward through the two heterogeneous collections, merging dynamically. And it must support all iterator operations in constant time.

CompactSet iterators contain the following members:

archiveIter: current archiveSet iterator.
dynamicIter: current dynamicSet iterator.
archiveIsCurrent: Boolean that indicates the current active element (archive or dynamic).
collectionPtr: points to the owner collection.

The two iterator members, archiveIter and dynamicIter, indicate the current iteration position in each of the internal sets. One points to the current value, and the other points to the upper bound of the current value in the non-current collection. In other words, the non-current iterator points to the value that would follow the current value in that collection. If the non-current collection has no value greater than the current value, the non-current iterator contains end.

The archiveIsCurrent Boolean member indicates which iterator is current.

The CompactSet end iterator is represented by end iterators in both archiveIter and dynamicIter.

The collectionPtr member is a pointer to the iterator's parent CompactSet. This is necessary so that the iterator manipulation functions can access the begin and end iterators for the internal sets, which are needed for traversal.

The above constraints define a valid CompactSet iterator. Iterator traversal functions (operators ++ and --) always return valid iterators. However, any insert or erase operation invalidates all iterators. This is the most significant deviation from the std::set specification, whose iterators remain valid for the life of the collection.

Figure 3 shoes an example CompactSet iterator. This diagram shows the begin iterator for a sample CompactSet with 14 total members (7 in each internal set). Calling begin should return an iterator pointing to the value 1, the lowest value in the entire collection. In this case, 1 happens to be in the dynamic set. Therefore, dynamicIter points to 1 and is defined to be current. (archiveIsCurrent is set to false.)

The non-current iterator, archiveIter, points to value 2, which is the upper bound of the current value (1) in the archive set.

Manipulating the Iterator

The algorithm for incrementing the iterator is as follows:

Increment the current iterator.
Set archiveIsCurrent to (*archiveIter < *dynamicIter)

Figure 4 shows the iterator from Figure 3, incremented once. The current iterator (which was dynamicIter) has been incremented, so that it points to value 3. Since 2 < 3, archiveIter is now current, and archiveIsCurrent gets the value true.

Decrementing is similar:

Decrement both iterators.
Set archiveIsCurrent to (*dynamicIter < *archiveIter)
Increment the non-current iterator.

Decrementing is slightly more complicated than incrementing. This is by design: increments are generally more common in user code, and the iterators are optimized for incrementing.

(The boundary conditions are not included in these descriptions. In the actual code, there are checks to avoid running off the ends of the archive and dynamic sets).

Dereferencing is simply a matter of dereferencing the current iterator.

All iterator functions run in constant time.

Figure 5 compares the functionality of the standard tree-based collections (set, map) with Loki::AssocVector (a pure associative vector implementation, designed primarily for small, static collections [2]) and CompactSet/Map.

Making It Configurable with Policies

In Modern C++ Design [2], Andrei Alexandrescu introduces the concept of policies, which are used to customize the behavior of a class. The policy class is passed as a template argument in the class declaration.

The MaximizeSpeed policy uses std::vector as the base collection for archiveSet. During compaction, a new std::vector is created, and the elements are merged into that new std::vector. Memory usage effectively doubles during this process. At the end of compaction, the original std::vector is removed, and memory usage drops back down. So, even though the steady-state memory usage is low, the transient memory usage is not.

The MinimizeSpace policy uses std::deque for the archiveSet collection. Compaction for the MinimizeSpace policy destroys the original sets as it builds the new one, resulting in much lower transient space usage during compaction. (This feature relies on deallocating erase in std::deque, which is not the case for some versions of the standard library.) The cost of this space efficiency is performance: MinimizeSpace is significantly slower than MaximizeSpeed. But the MinimizeSpace version allows for efficient storage and manipulation of CompactSets even in environments with extremely limited memory.

Source Code and Further Documentation

Full source code and documentation for CompactSet and CompactMap can be downloaded at <www.cuj.com/code>. See the documentation for more details on the implementation. The code has been tested with GCC 3.1, GCC 2.95.2, Borland CC 5.5, and Microsoft Visual Studio 6.0. Due to compiler limitations, policies are not available in the GCC 2.95.2 and Microsoft Visual Studio 6.0 versions.

Notes

[1] Bjarne Stroustrup. The C++ Programming Language (Addison-Wesley, 2000).

[2] Andrei Alexandrescu. Modern C++ Design (Addison-Wesley, 2001).

About the Author

Michael Carrato received his MS degree in computer engineering from the State University of New York at Buffalo. He has nine years of professional experience in design and development in C/C++, Java, and Smalltalk. He currently works as a software engineer at Cymfony, Inc., in Williamsville, NY, and can be reached at [email protected].

1 2 3 4 5 6 7 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Space Efficient Sets and Maps

Introduction

Background

The Amazing std::vector

Sorted Vectors

Design

The "Extra Space"

CompactSet: A Hybrid Collection

Compaction

A United Front

Implementation and Complexity Analysis

Find

Insert

Erase

Iterate

Manipulating the Iterator

Making It Configurable with Policies

Source Code and Further Documentation

Notes

About the Author

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Space Efficient Sets and Maps

Introduction

Background

The Amazing std::vector

Sorted Vectors

Design

The "Extra Space"

CompactSet: A Hybrid Collection

Compaction

A United Front

Implementation and Complexity Analysis

Find

Insert

Erase

Iterate

Manipulating the Iterator

Making It Configurable with Policies

Source Code and Further Documentation

Notes

About the Author

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content