Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Space Efficient Sets and Maps


Space Efficient Sets and Maps

Introduction

Standard C++ provides a rich set of collection templates, including the tree-based associative collections std::set and std::map. std::set and std::map are flexible, scalable, and highly performing, but there are cases where the size overhead of the internal trees is impractical. This impracticality is especially evident for small data types, where the size overhead of tree-based collections can be as high as 300%.

Sorted vectors, on the other hand, have very little overhead and have gained in popularity among developers in search of a leaner, meaner associative collection. However, the rigid structure of sorted vectors limits their flexibility, relegating them to special-purpose use.

This article describes a new type of sorted-vector-based associative collection, which is almost as flexible as the tree-based collections and almost as space efficient as pure sorted vectors. The result is a general purpose, space efficient, drop-in replacement for the standard associative containers, std::set and std::map.

Background

The Amazing std::vector

Structurally, the Standard C++ vector could not be any simpler. Implemented internally as a contiguous array of elements, the Standard C++ vector achieves random access to elements by simple pointer arithmetic, and space overhead is minimal.

Yet, despite its structural simplicity, std::vector is a powerhouse: small, fast, flexible, scalable. I have always been fascinated by one particular feature of std::vector: its ability to expand its capacity automatically in amortized constant time [1], despite having to recreate its entire structure to expand.

The reason for this amazing property is that std::vector always allocates some extra space when expanding, just in case you decide to add more to the end. And, most importantly, the amount of extra space is a multiplicative factor of the current capacity.

To summarize the complexity analysis of std::vector::push_back, by allocating extra space in multiplicative increments, "the N's cancel" and the resulting expansion complexity is independent of the size of the container. In layman's terms: as the collection grows, the extra space allocated grows too, meaning that reallocation happens less often as the collection gets bigger. It's a brilliant, mathematically sound design, the compounding interest of the collection world.

The raw power and flexibility of std::vector has made it the ordered collection of choice for many applications where a linked list might have been used in the past. As long as inserts and erases occur only at the end of the collection, std::vector is generally superior in speed and space to std::list, for most applications.

Sorted Vectors

Sorted vectors do not, in general, benefit from this efficient expansion property. The reason is that each insert or erase in a sorted vector forces a reorganization of the entire structure, copying an average of N/2 elements per insert/erase. This is unacceptable for all but the most limited applications: where the collection size is small, or where modifications are rare.

The ideal solution would be to find a way to allocate extra space in a sorted vector, just like regular vectors do, but do it in such a way that does not require objects to be moved en masse for each modification. The key is to generalize the concept of extra space so that the same principle can be applied for sorted collections.

Design

The "Extra Space"

In std::vector, the extra space takes the form of uninitialized memory at the end of the buffer. During a reallocation, std::vector allocates space for N + kN elements, where k is the expansion factor. As the vector grows, the extra kN elements are filled in.

But, in order to satisfy the amortized constant complexity constraint, it doesn't matter where the extra elements go, or even how they're stored. The only thing that is important is that kN elements are efficiently stored somewhere, somehow, without incurring a reallocation (and recopying) of the vector. The other details are incidental, from a complexity standpoint.

It happens that the extra space for std::vector is located past the end of the valid elements. But std::vector push_back complexity would be no different if the extra space were a separate data structure entirely. In fact, if such an implementation could be made to look contiguous to the user, this might even be a valid design for std::vector.

This reveals an important constraint in designing the "extra space." Whatever the implementation, it must appear seamless to the user. For sorted vectors, the elements in the extra space must always appear to be in sorted order with respect to the rest of the collection. And, to be compatible with tree-based collections, inserts into the extra space must happen in O(log(N)) time, or better.

The following sections describe the design of a hybrid collection, CompactSet, with the core elements stored in a sorted vector, and the new elements inserted into a std::set. This set serves as the extra space, storing elements efficiently, in sorted order, until enough have accumulated to rebuild the sorted vector.

CompactMap is also provided in the source code (available for download at <www.cuj.com/code>), as a simple manipulation of CompactSet to allow both keys and values.

CompactSet: A Hybrid Collection

CompactSet consists of the following members:

  • archiveSet: a sorted vector-based collection that contains the base N elements of the collection.

  • dynamicSet: a std::set that contains up to kN newly inserted elements.

  • compactionThreshold: an integer containing the maximum size of the extra space.

    When first constructed, archiveSet is empty. Elements are inserted into dynamicSet until its size equals compactionThreshold, at which point dynamicSet is merged into archiveSet. This step is called compaction. After compaction, compactionThreshold is set (as a multiplicative factor of N, where N has increased), and the process starts over again.

    Figure 1 shows an example CompactSet of integers just before compaction. The archive set contains values from the previous compaction, and the dynamic set contains values inserted since then.

    Compaction merges the two collections into a new, larger archiveSet, shown in Figure 2. dynamicSet is cleared, in preparation for the next batch of inserts.

    Compaction

    Compaction must occur in linear time (or better), and the rebuilt archive set must be in sorted order.

    Fortunately, there is a standard algorithm that satisfies these constraints: set_union. This function takes two sorted collections as input, merging the values into a third sorted collection, in linear time.

    By analogy to the std::vector expansion algorithm, CompactSet compaction occurs in amortized constant time, as long as compactionThreshold is always initialized to a multiplicative factor of the collection size.

    A United Front

    In order for CompactSet to act like a standard set, it must present a united front to the user at all times. The implementation must hide the complexities of the hybrid implementation, essentially merging the two internal collections dynamically for every member function. In reality, no actual merging occurs (except during compaction), but each function must act as if it is working on a merged std::set. And each function must run in a complexity that is no worse than the corresponding std::set version.

    Implementation and Complexity Analysis

    CompactSet mimics the std::set interface, which can be broken down into four basic operations:

    • Find
    • Insert
    • Erase
    • Iterate
    The following sections describe the algorithms for implementing each of these operations in CompactSet.

    Find

    find must locate a value and return its iterator. Since a value may occur in either of the two collections (or neither, but not both), both must be checked. The algorithm for finding a value val is as follows:

    If(dynamicSet.contains(val))
      return dynamicSet.find(val)
    Else if (archiveSet. contains(val))
      return archiveSet.find(val)
    Else return end()
    
    Neglecting the iterator construction for now, this algorithm does no more than 2log(N) comparisons. This is complexity O(log(N)), like std::set::find.

    Insert

    The std::set specification does not allow duplicates. So CompactSet::insert must ensure no more than one copy of a given value exists in both archiveSet and dynamicSet.

    insert first searches for the value in dynamicSet. If the value exists there, it returns immediately, since std::set does not reinsert existing values. Otherwise, insert attempts to insert the value into archiveSet, by calling archiveSet.tryInsert.

    This may seem to be a contradiction, since archiveSet does not generally allow efficient inserts. However, there is one case where archiveSet can efficiently insert a value: when the value fits into a slot where another element was previously erased. And, of course, the value may already exist in archiveSet, in which case tryInsert succeeds without inserting the value.

    Therefore, tryInsert returns one of the following three state codes:

    • existed: the value already existed and was not inserted.
    • inserted: the value was inserted as a new element.
    • insertionFailed: the value could not be inserted efficiently.
    If archiveSet.tryInsert returns insertionFailed, the value is inserted into the "extra space," dynamicSet. (The other return codes are necessary so that CompactSet::insert can construct the return value.)

    A complexity analysis reveals that there are no worse than 2log(N) comparisons. In addition, compaction may occur as a result of insert. So the total complexity for insert is (O(log(N)) + amortized O(1)). This is equivalent to amortized O(log(N)), which is, on average, the same as std::set insert.

    Erase

    The erase algorithm was not described in the design section. Most of the design was in terms of insert. However, erase is analogous to insert in many ways and shares many of the same design concepts.

    Like insert, erase triggers a compaction based on the compactionThreshold variable and does so in such a way that the performance guarantees are preserved: erase runs in amortized O(log(N)).

    However, erase presents an additional problem: if the value to be erased happens to be in archiveSet, how can it be erased without incurring a linear performance hit?

    The solution is to hold a parallel collection of Booleans in a vector<bool> member named deleted. The deleted vector is resized to the number of elements in archiveSet, and each Boolean is initially false. A value of true in the deleted vector indicates that the corresponding value in archiveSet has been erased. The deleted vector adds only one bit per element of overhead.

    The archiveSet.erase function locates the index of the value to be erased in archiveSet and sets the corresponding Boolean to true in the deleted vector. During the next compaction, the erased values are filtered out, and the deleted vector is reset.

    A side effect of this design is deferred value destruction (see sidebar).

    The basic algorithm for CompactSet::erase(val) is as follows:

    archiveEraseCount = archiveSet.erase(val)
    //might still be in the dynamicSet
    if(archiveEraseCount ==  0)
      dynamicSet.erase(val)
    

    Iterate

    Iterators were the most difficult piece to design and implement. The iterator must be able to navigate forward and backward through the two heterogeneous collections, merging dynamically. And it must support all iterator operations in constant time.

    CompactSet iterators contain the following members:

    • archiveIter: current archiveSet iterator.
    • dynamicIter: current dynamicSet iterator.
    • archiveIsCurrent: Boolean that indicates the current active element (archive or dynamic).
    • collectionPtr: points to the owner collection.
    The two iterator members, archiveIter and dynamicIter, indicate the current iteration position in each of the internal sets. One points to the current value, and the other points to the upper bound of the current value in the non-current collection. In other words, the non-current iterator points to the value that would follow the current value in that collection. If the non-current collection has no value greater than the current value, the non-current iterator contains end.

    The archiveIsCurrent Boolean member indicates which iterator is current.

    The CompactSet end iterator is represented by end iterators in both archiveIter and dynamicIter.

    The collectionPtr member is a pointer to the iterator's parent CompactSet. This is necessary so that the iterator manipulation functions can access the begin and end iterators for the internal sets, which are needed for traversal.

    The above constraints define a valid CompactSet iterator. Iterator traversal functions (operators ++ and --) always return valid iterators. However, any insert or erase operation invalidates all iterators. This is the most significant deviation from the std::set specification, whose iterators remain valid for the life of the collection.

    Figure 3 shoes an example CompactSet iterator. This diagram shows the begin iterator for a sample CompactSet with 14 total members (7 in each internal set). Calling begin should return an iterator pointing to the value 1, the lowest value in the entire collection. In this case, 1 happens to be in the dynamic set. Therefore, dynamicIter points to 1 and is defined to be current. (archiveIsCurrent is set to false.)

    The non-current iterator, archiveIter, points to value 2, which is the upper bound of the current value (1) in the archive set.

    Manipulating the Iterator

    The algorithm for incrementing the iterator is as follows:

    Increment the current iterator.
    Set archiveIsCurrent to (*archiveIter < *dynamicIter)
    
    Figure 4 shows the iterator from Figure 3, incremented once. The current iterator (which was dynamicIter) has been incremented, so that it points to value 3. Since 2 < 3, archiveIter is now current, and archiveIsCurrent gets the value true.

    Decrementing is similar:

    Decrement both iterators.
    Set archiveIsCurrent to (*dynamicIter < *archiveIter)
    Increment the non-current iterator.
    
    Decrementing is slightly more complicated than incrementing. This is by design: increments are generally more common in user code, and the iterators are optimized for incrementing.

    (The boundary conditions are not included in these descriptions. In the actual code, there are checks to avoid running off the ends of the archive and dynamic sets).

    Dereferencing is simply a matter of dereferencing the current iterator.

    All iterator functions run in constant time.

    Figure 5 compares the functionality of the standard tree-based collections (set, map) with Loki::AssocVector (a pure associative vector implementation, designed primarily for small, static collections [2]) and CompactSet/Map.

    Making It Configurable with Policies

    In Modern C++ Design [2], Andrei Alexandrescu introduces the concept of policies, which are used to customize the behavior of a class. The policy class is passed as a template argument in the class declaration.

    The MaximizeSpeed policy uses std::vector as the base collection for archiveSet. During compaction, a new std::vector is created, and the elements are merged into that new std::vector. Memory usage effectively doubles during this process. At the end of compaction, the original std::vector is removed, and memory usage drops back down. So, even though the steady-state memory usage is low, the transient memory usage is not.

    The MinimizeSpace policy uses std::deque for the archiveSet collection. Compaction for the MinimizeSpace policy destroys the original sets as it builds the new one, resulting in much lower transient space usage during compaction. (This feature relies on deallocating erase in std::deque, which is not the case for some versions of the standard library.) The cost of this space efficiency is performance: MinimizeSpace is significantly slower than MaximizeSpeed. But the MinimizeSpace version allows for efficient storage and manipulation of CompactSets even in environments with extremely limited memory.

    Source Code and Further Documentation

    Full source code and documentation for CompactSet and CompactMap can be downloaded at <www.cuj.com/code>. See the documentation for more details on the implementation. The code has been tested with GCC 3.1, GCC 2.95.2, Borland CC 5.5, and Microsoft Visual Studio 6.0. Due to compiler limitations, policies are not available in the GCC 2.95.2 and Microsoft Visual Studio 6.0 versions.

    Notes

    [1] Bjarne Stroustrup. The C++ Programming Language (Addison-Wesley, 2000).

    [2] Andrei Alexandrescu. Modern C++ Design (Addison-Wesley, 2001).

    About the Author

    Michael Carrato received his MS degree in computer engineering from the State University of New York at Buffalo. He has nine years of professional experience in design and development in C/C++, Java, and Smalltalk. He currently works as a software engineer at Cymfony, Inc., in Williamsville, NY, and can be reached at [email protected].


  • Related Reading


    More Insights






    Currently we allow the following HTML tags in comments:

    Single tags

    These tags can be used alone and don't need an ending tag.

    <br> Defines a single line break

    <hr> Defines a horizontal line

    Matching tags

    These require an ending tag - e.g. <i>italic text</i>

    <a> Defines an anchor

    <b> Defines bold text

    <big> Defines big text

    <blockquote> Defines a long quotation

    <caption> Defines a table caption

    <cite> Defines a citation

    <code> Defines computer code text

    <em> Defines emphasized text

    <fieldset> Defines a border around elements in a form

    <h1> This is heading 1

    <h2> This is heading 2

    <h3> This is heading 3

    <h4> This is heading 4

    <h5> This is heading 5

    <h6> This is heading 6

    <i> Defines italic text

    <p> Defines a paragraph

    <pre> Defines preformatted text

    <q> Defines a short quotation

    <samp> Defines sample computer code text

    <small> Defines small text

    <span> Defines a section in a document

    <s> Defines strikethrough text

    <strike> Defines strikethrough text

    <strong> Defines strong text

    <sub> Defines subscripted text

    <sup> Defines superscripted text

    <u> Defines underlined text

    Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

     
    Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.