Now that work on the next revision of C++ has begun, the Standardization Committee is beginning to consider extensions to the Standard C++ library. One of the first proposals the committee will consider is a proposal to add hash tables to the standard library.
Why hash tables? One reason is that hash tables are one of the most frequently requested extensions. C++ has associative containers (set, map, multiset, and multimap) that allow fast O(log N) lookup of an element by key and that store elements in ascending order. Unlike some other languages, C++ doesn't have hash tables. Many users find this an obvious hole in the language. Hash tables and sorted associative containers have different performance tradeoffs: hash tables have poorer worst-case performance, but, in many real-world applications, better performance on average.
A second reason for this proposal is that there is extensive experience with hash tables implemented in C++ in the style of standard containers. Hash tables were proposed for the C++ Standard in 1995 ; the proposal was rejected, mainly because it was submitted at the wrong time. Three independently written libraries (SGI , Dinkumware , and Metrowerks ) now provide hashed associative containers as an extension. The GNU C++ library includes hash tables derived from SGI's hashed associative containers. Standardization committees do lots of different things, but one of the things they're supposed to do is standardize existing practice. We now have multiple, widely used, incompatible definitions of hash tables, and that's exactly what standardization is supposed to prevent.
I recently submitted a new hash table proposal , and the standardization committee has begun to consider it. This column is based on that proposal. There will undoubtedly be changes -- there always are -- so, while it's safe to assume that hash tables will eventually make it into the standard library, you should not assume they will look exactly like the hash tables described in this proposal.
What Does a Proposal Look Like?
A formal proposal should include these sections:
- Motivation. You should explain why this proposal should be considered. Why is it important, and what problem area does it address? What kinds of programmers will use it? To what extent is it based on existing practice?
- Impact on the Standard and on users. What does it depend on, and what depends on it? Is it a pure extension, or do you propose changing existing library components? Does your proposal involve core language changes? Can it be implemented portably, or does it require nonstandard facilities or compiler magic?
- Design decisions. Having a record of your intentions will be important to programmers who are trying to understand the Standard, and to the committee as it makes changes and bug fixes in the future. (In the past, we've occasionally had problems when the committee made "obvious" changes without a full understanding of the original design.) This section should address design issues that you think are important, as well design issues that other people raise. If you've made any decisions that you don't think are important, but where you had to make some definite choice, that too is important to know.
Some specific questions to address in this section: Why did you choose the specific design that you did? What alternatives did you consider, and what are the tradeoffs? What are the performance implications of your choices? What decisions are left up to implementers? Were there any "obvious" ideas that you tried and rejected?
- Proposed text. Wording that's suitable for the Standard. There will probably be some editorial tinkering, but you should assume that it will be minor. The style and order of presentation should be consistent with that of the C++ Standard's library clauses .
- References. This section can be important in areas that a lot of people have worked on: if the committee is choosing one design in an area where there are lots of alternatives, we can be more confident if we know you've looked at the other possibilities.
I already discussed the motivation for standardizing hash tables. As for the impact on the Standard, this proposal is almost a pure extension. (Only "almost," because I'm proposing to add a new declaration to the header <functional>.) It doesn't require changes to any of the standard requirement tables, and it can be and has been implemented in Standard C++.
The rest of this column will discuss design decisions. As you'll see, designing a set of requirements for the Standard isn't quite the same as designing a library. The important point is that you're not just documenting what your library does: you're imposing requirements that are supposed to be satisfied by any conforming implementation. There are tradeoffs between specificity and implementer latitude, and you have to think carefully about those tradeoffs.
The three implementations in current use, as well as the Barreiro/Fraley/Musser proposal, all used the names hash_set, hash_map, hash_multiset, and hash_multimap. Existing practice suggests that these names should be retained.
The distinction between the four hashed containers is the same as the distinction between the four standard associative containers. All four hashed containers allow lookup of elements by key. In hash_set and hash_multiset, the elements are the keys; modification of elements is not allowed. In hash_map and hash_multimap, the elements are of type std::pair<const Key, Value>. The key part can't be modified, but the value part can. In hash_set and hash_map, no two elements may have the same key; in hash_multiset and hash_multimap, there may be any number of elements with the same key.
Similarly, because of existing practice, this proposal defines the hash_set and hash_multiset classes within a new <hash_set> header, and hash_map and hash_multimap within <hash_map>.
This proposal defines a set of Hashed Associative Container requirements and then separately describes four classes that conform to those requirements. In that respect, this proposal follows the lead of the Standard and the Barreiro/Fraley/Musser proposal. However, Barreiro, Fraley, and Musser proposed more extensive changes to the container requirements. They proposed two new requirements tables, not one: Sorted Associative Container, satisfied by the existing standard associative containers, and Hashed Associative Container. They then modified the existing Associative Container requirements (Table 69 in the C++ Standard) so that both sorted associative containers and hashed associative containers would satisfy the new Associative Container requirements. The difference is shown in Figure 1.
I believe that the Barreiro/Fraley/Musser taxonomy is better: the generality of the name "Associative Container," and the specificity of Table 69, aren't a good match. However, that proposal was made before the C++ Standard was finalized. The superiority of the Barreiro/Fraley/Musser taxonomy isn't so great as to justify changing an existing requirements table that users may be relying on.
The three hash table implementations in current use are not identical, but they are similar enough that, for simple uses, they are interchangeable. This proposal attempts to maintain a similar level of compatibility.
Chaining vs. Open Addressing
Knuth  distinguishes between two kinds of hashing: chaining, where a hash code is associated with the head of a linked list, and open addressing, where a hash code is associated with an index into an array.
I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:
- It's necessary to distinguish between a vacant position and an occupied one.
- It's necessary either to restrict the hash table to types with a default constructor and construct every array element ahead of time, or else to maintain an array some of whose elements are objects and others of which are raw memory.
- Open addressing makes collision management difficult: if you're inserting an element whose hash code maps to an already-occupied location, you need a policy that tells you where to try next. This is a solved problem, but the best solutions are complicated.
- Collision management is especially complicated when erasing elements is allowed. (See  for a discussion.) A container class for the standard library ought to allow erasure.
- Collision management schemes for open addressing tend to assume a fixed-size array that can hold up to N elements. A container class for the standard library ought to be able to grow as necessary when new elements are inserted, up to the limit of available memory.
Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.
All further discussion will assume chaining. Each linked list within a hash table is called a bucket. The average number of elements per bucket is called the load factor, or a.
Lookup within a Bucket
When looking up an item in a hash table by key k, the general strategy is to find the bucket that corresponds to k and then to perform a linear search within that bucket. The first step uses the hash function; the second step must use something else.
The most obvious technique is to use std::find or the equivalent: look for an item whose key is equal to k. Naturally, it would be wrong for operator== to be hard-wired in; it should be possible for the user to provide a function object with semantics of equality. As an example where some predicate other than operator== is useful, suppose the user is storing C-style strings (i.e., pointers to null-terminated arrays of characters). In this case, equality of keys k1 and k2 shouldn't mean pointer comparison; it should mean testing that strcmp(k1, k2) == 0.
This proposal takes such an approach. Hashed associative containers are parameterized by two function objects: a hash function and an equality function. Both have defaults.
An alternative technique is possible: instead of testing for equality, sort each bucket in ascending order by key. Linear search for a key k would mean searching for a key k' such that k < k' and k' < k are both false. Again, this shouldn't be taken to mean that operator< would literally be hard-wired in; users could provide their own comparison function, so long as that function has less-than semantics.
The performance characteristics of the two techniques are slightly different. The following table shows the average number of comparisons required for a search through a bucket of n elements:
Using Equality Using Less-Than Failed Search n n/2 Successful Search n/2 n/2 + 1
The difference for a failed search is that with less-than you can tell that a search has failed as soon as you see a key that's larger than k; with equal-to you have to get to the end of the bucket. The difference for a successful search is with equal-to you can tell that a search has succeeded as soon as you see a key that's equal to k; with less-than all you know when you find a key that's not less than k is that the search has terminated, and you need an extra comparison to tell whether it has terminated in success or failure.
I do not see a clear-cut performance advantage from either technique. Which technique is faster depends on usage pattern: the load factor and the relative frequency of failed and successful searches. (Remember, a bucket may only contain a single element. That extra "+ 1" can be important!) There are also performance implications for insertion, but I expect those differences to be smaller because in most cases I expect insertion to be dominated by the cost of memory allocation or element construction and not by the cost of lookup.
For users, it's sometimes important for a container to present its elements as a sorted range; sorting elements by inserting them into an std::set, for example, is a common idiom. However, I see no value (other than the performance issues discussed above) in guaranteeing the order of elements within a single bucket. If the hash function is well chosen, after all, elements will be distributed between buckets in a seemingly random way. It is useful to guarantee that equivalent elements are adjacent, but we can make that guarantee with either equal-to or less-than.
From the point of view of user convenience, there isn't a huge difference between the two alternatives of equal-to and less-than. I view equality as slightly more convenient, since it's common to define data types that have equality operations but not less-than operations, and rather less common to do the reverse. There are some types where less-than is not a natural operation, and users would have to define a somewhat arbitrary less-than operation for no reason other than to put the objects in a hash table. One obvious example is std::complex<>.
Existing implementations differ. The SGI and Metrowerks implementations use equal-to, and the Dinkumware implementation uses less-than.
An aside: in principle, linear search isn't strictly necessary. A bucket doesn't have to be structured as a linked list; it could be structured as a binary tree, or as some other data structure. This proposal assumes linked lists, partly for reasons of existing practice (all of the C++ hash table implementations in widespread use are implemented in terms of linked lists) and partly because I believe that in practice a tree structure would hurt performance more often than it would help. Balanced trees have large per-node space overhead, and binary tree lookup is faster than linear search only when the number of elements is large. If the hash table's load factor is small and the hash function well chosen, trees have no advantage over linear lists.
The Hash Function Interface
Abstractly, a hash function is a function f(k) that takes an argument of type Key and returns an integer in the range [0, B), where B is the number of buckets in the hash table. A hash function must have the property that f(k1) and f(k2) are the same when k1 and k2 are the same. A good hash function should also have the property that f(k1) and f(k2) are unlikely to be the same when k1 and k2 are different.
It is impossible to write a fully general hash function that's valid for all types. (You can't just convert an object to raw memory and hash the bytes; among other reasons, that idea fails because of padding.) Because of that, and also because a good hash function is only good in the context of a specific usage pattern, it's essential to allow users to provide their own hash functions.
There can be a default hash function for a selected set of types; ideally, it should include the most commonly used types (e.g., strings).
There are two design decisions involving non-default hash functions:
- How can a user-written function return an integer in the range [0, B) for arbitrary B, especially since B may vary at run time?
- Should the hash function be a stand-alone function object or part of a larger package that controls other aspects of the hash policy?
In principle there are two possible answers to the first question. First, the hash function could take two arguments instead of one, where the second argument is B. The hash function would have the responsibility of returning some number in the range [0, B). Second, the hash function could return a number in a very large range, say [0, std::numeric_limits<std::size_t>::max()). The hash table class would be responsible for converting the hash code (the value returned by the hash function) into a bucket index in the range [0, B).
This proposal uses a single-argument hash function. The reasons are:
- Existing practice. The three shipping hash table implementations all use single-argument hash functions.
- The main argument in favor of a two-argument hash function is that a user-written hash function might make good use of the bucket count; it might, for example, involve multiplication modulo B. However, this argument is weaker than it might seem, since a user-written hash function that took B as an argument would have to cope with arbitrary B. (B is chosen by the hash table class, not the user.) This is a significant restriction, because it means a user-written hash function couldn't rely on any special numerical properties.
- As discussed later, the bucket count won't always remain the same during the lifetime of a hash table. This means that, for a particular key k, the hash table class may need the bucket index for two different bucket counts B1 and B2. With a double-argument hash function, the hash table class would have to call the hash function twice. With a single-argument hash function, the hash table class only has to invoke the hash function once.
If the hash function is to be packaged along with other aspects of hash policy, what should those aspects be? There are two obvious candidates. First, it could be packaged along with the function object that tests for key equality, or perhaps, even more generally, with a function object that specifies a policy for linear search within a bucket. Second, it could be packaged along with the parameters that govern resizing the bucket count.
This proposal uses a stand-alone hash function, rather than a hash function that's part of a policy package. This is mostly a consequence of other design decisions. First, as discussed later, this proposal controls bucket resizing by floating-point parameters that can be changed at run time, so there is no advantage in putting them in a policy class. Second, linear search within a bucket uses equality, and equality is such a common operation that in most cases I expect that a user-supplied equality predicate will have been written for some other purpose and will be reused as a hash table template argument. Making equality part of a larger policy class would make such reuse harder.
This proposal includes a function object hash<>, with an operator() that takes a single argument and returns an std::size_t. The hash<> template is an incomplete type; it is specialized, and declared as a complete type, for a few common types. I've chosen all of the built-in integer types, void*, char* and const char*, and std::basic_string<charT, traits, Allocator>. I believe that char*, const char*, and std::basic_string are important, because hash tables are often used for strings. Beyond that, the list is fairly arbitrary.
The hash<> function object is defined in the <functional> header. Another sensible alternative would have been to declare it in both <hash_set> and <hash_map>. Implementers would have to arrange for there to be only a single definition when both headers are included, but that's straightforward. The main reason I chose to put it in <functional> is that authors of user-defined types may need to specialize hash<> (which means they need the declaration) even if they have no need to use any of the hashed containers, but this choice too is basically arbitrary.
Trivial as it may seem, hash function packaging may be the most contentious part of this proposal. Existing implementations differ. The SGI and Metrowerks implementations use hash functions that aren't bundled with anything else, but the Dinkumware implementation uses a more general hash policy class. Since this is an interface issue, a decision is necessary.
Legacy implementations are a tricky enough problem that some people have suggested a drastic solution: abandon the names hash_set, hash_map, hash_multiset, and hash_multimap, because of the multiple pre-standard designs using those names, and instead standardize some alternate set of names like hashset, hashmap, hashmultiset, and hashmultimap. I hope nothing of the sort turns out to be necessary.
Control of Hash Resizing
The time required for looking up an element by key k is c1 + c2 n, where c1 and c2 are constants, and where n is the number of elements in the bucket indexed by k's hash code. If the hash function is well chosen, and elements are evenly distributed between buckets, this is approximately c1 + c2 N/B, where N is the number of elements in the container and B is the bucket count. If the bucket count is taken as a constant, then the asymptotic complexity for element lookup is O(N).
To maintain average case complexity O(1) for lookup, the bucket count must grow as elements are added to the hash table; on average the bucket count must be proportional to N. Another way of putting this is that the load factor, N/B, must be approximately constant.
Two methods of maintaining a roughly constant load factor are in current use. First is traditional rehashing. When the load factor becomes too large, choose a new and larger bucket count, B'. Then go through every element in the hash table, computing a new bucket index based on B'. This operation is called rehashing, and it is expensive. Since we want the amortized complexity of element insertion to be constant, we must use exponential growth; that is, B' = g B, where the growth factor, g, is larger than 1. In general this proportionality can only be approximate, since many hashing schemes require B to have special numerical properties (e.g., primality).
Second is a newer technique, incremental hashing . Incremental hashing structures the hash table in such a way that it is possible to add a single bucket at a time. When adding a bucket it is only necessary to examine the elements of a single old bucket, distributing some of them to the new one.
The advantage of incremental hashing is that insertion time becomes more predictable: there's no longer a large time difference between insertions that do trigger rehashing and insertions that don't. The disadvantage is that incremental hashing makes lookup slightly slower. The slowdown is for two reasons. First, the logic to determine a bucket index from a hash code is slightly more complicated: it requires one extra test. Second, incremental hashing results in a somewhat less uniform distribution of elements within buckets. It relies on a construction where there are conceptually B buckets, of which U are in current use; B is a power of two, and U > B/2. We first find a bucket index i in the range [0, B) and then find a bucket index j in the range [0, U) by subtracting U from i if necessary. If the original hash codes are evenly distributed, a bucket in the range [0, B-U) will on average have twice the number of elements as a bucket in the range [B-U, U).
Because of this tradeoff, there is not a clear choice between incremental hashing and traditional rehashing; both are legitimate implementation techniques. A standard, of course, need not and should not dictate implementation technique. The goal of this proposal is to allow both.
From a user's perspective, all of this is invisible in normal use. It's visible when users want to do one of these things:
- Examine the number of buckets, or the distribution of elements between buckets.
- Request a rehash even when the load factor isn't yet large enough so that one would be triggered automatically. Suppose, for example, that a hash table currently contains 100 elements, and a user is about to add 1,000,000 elements. A user would save a lot of time by requesting a rehash before performing those insertions.
- Control the parameters that govern automatic rehashing.
What are those parameters? The most obvious is the maximum load factor, since that's what triggers an automatic rehash. There's also a second parameter, which can be thought of in two different ways: as a growth factor (the constant of proportionality by which the bucket count grows in a rehash) or as a minimum load factor.
Letting users control that second parameter is less straightforward than it seems at first.
- For incremental hashing, it makes no sense to talk about a growth factor; incremental hashing doesn't work by exponential growth.
- Even if we regard the second parameter as a minimum load factor, it's not clear how it would be used in an incremental-rehashing implementation. The natural strategy for incremental rehashing is to add one bucket whenever the load factor exceeds the maximum; this automatically keeps the load factor within a tight range.
- The interactions between a minimum load factor and manual rehash are tricky. Again, consider the example where a hash table has 100 elements, and the user, anticipating a large insertion, requests that the table rehash itself for 1,000,000 elements. In between the rehash and the insertion, the load factor will be very small -- probably below reasonable values of the minimum.
- A minimum load factor presents difficulties for the hash table's initial state. An empty hash table -- one that has a nonzero bucket count, but zero elements -- has a load factor of zero. If we impose an invariant that the load factor must always lie between amin and amax, an empty hash table would fail to satisfy that invariant.
- The second parameter, however it's expressed, would probably have to be interpreted in some approximate or asymptotic sense. The number of buckets must always be an integer, and in some implementations it has to obey other constraints (e.g., primality). If we specify an exact growth factor, or a tight minimum and maximum, then there might simply be no suitable number available.
I don't know how to specify invariants that are precise enough to be meaningful and normative, but loose enough to accommodate traditional and incremental hashing, hash tables with bucket counts that are restricted to prime numbers, manual rehashing, and empty hash tables. We could include a member function for setting the growth factor (or minimum load factor), but not say exactly how that number is used. However, I see very little value in such a vacuous requirement. This proposal provides user control of the maximum load factor, but not of growth factor or minimum load factor.
One implication is that this proposal says what happens as the number of elements in a hash table increase, but doesn't say what happens as the number decreases. This is unfortunate. An unnecessarily small load factor wastes space (the magnitude of the load factor is a time/space tradeoff) and can lead to unnecessarily slow iteration.
There's still one last question about the maximum load factor: How should the user specify it? Setting amax to a small value optimizes for speed, and setting amax to a large value optimizes for space; surely this should be under user control. Letting users specify amax as an integer is an unnecessarily restrictive choice, since fractional values (especially in the range 0 < amax < 1) are sensible. There are three reasonable options: as a rational number (perhaps an ad hoc rational number, where the user provides the numerator and denominator separately), as an enum (the user may select one of a small number of predetermined values, such as 1/4, 1/2, 1, 3/2, 2), or as a floating-point number.
This proposal provides a member function that allows the user to set the maximum load factor at run time, using a floating-point number. The reasons are:
- In my opinion, there is no compelling performance advantage gained from setting the maximum load factor at compile time rather than at run time. The cost of a run-time floating-point parameter is one floating-point multiplication at every rehash (not every insertion). Even with incremental hashing, this is almost certain to be dwarfed by the cost of a rehash. And, except when using a compile-time constant of one, the costs of an integer or rational limit aren't much less in any case.
- An enum is less flexible from the user's point of view, because a user might want a load factor that's not in the predetermined list. Ad hoc rational numbers are just clumsy, both for the user and for the implementer.
I propose that the maximum load factor initially be set to one. This is an arbitrary choice.
The Dinkumware hash table implementation uses a compile-time integer constant (part of a hash traits class) to control the maximum load factor, and the Metrowerks implementation uses a run-time floating-point parameter. The SGI implementation does not provide any mechanism for controlling the maximum load factor.
There is one basic decision to be made about hash table iterators, which can be expressed either from the implementer's or the user's point of view. From the implementer's point of view: are the buckets singly linked lists, or doubly linked lists? From the user's point of view: are the iterators forward iterators, or bidirectional iterators?
From the implementer's point of view, there's no question that doubly linked lists are much easier to work with. One advantage is that you don't have to maintain a separate list for each bucket. You can keep a single long list, taking care that elements within a bucket remain adjacent; a bucket is then just a pair of pointers into the list. This is nice for the implementer, because a hash table iterator can just be a recycled std::list<>::iterator. It's nice for the user, because iteration is fast. This implementation technique may not appear to depend on the choice between singly and doubly linked lists, but it seems that it does. I don't know of a way to make the single-long-list technique work for singly linked lists: some operations that ought to be linear time would require a linear search through buckets. (The sticking point turns out to be that erasing a node in a singly linked list requires access to the node before it. The obvious workarounds just push the problem to a different place.)
From the user's point of view, the choice is a tradeoff. Hash tables that use singly linked lists have slower iterators, because an iterator first steps within a bucket and then, upon reaching the end of a bucket, steps to the next. Additionally, users may sometimes want to apply algorithms that require bidirectional iterators. If a hash table supplies bidirectional iterators, it's easier for users to switch between (say) hash_set<> and std::set<>. (But "easier" still doesn't mean easy. Some applications use the standard associative containers because of those containers' ordering guarantees, which can't possibly be preserved by hash tables.)
For the user, the disadvantage of bidirectional iterators is greater space overhead. The space overhead for singly linked lists is N + B words, where N is the number of elements and B is the bucket count, and the space overhead for doubly linked lists is 2N + 2B. This is an important consideration, because the main reason for using hashed associative containers is performance.
The SGI and Metrowerks implementations provide forward iterators. The Dinkumware implementation provides bidirectional iterators.
This proposal allows both choices. It requires hashed associative containers to provide forward iterators. An implementation that provides bidirectional iterators is conforming, because bidirectional iterators are forward iterators.
Like all STL containers, each of the hashed containers has member functions begin and end. The range [c.begin(), c.end()) contains all of the elements in the container, presented as a flat range. Elements within a bucket are adjacent, but the iterator interface presents no information about where one bucket ends and the next begins.
It's also useful to expose the bucket structure, for two reasons. First, it lets users investigate how well their hash function performs: it lets them test how evenly elements are distributed within buckets and look at the elements within a bucket to see if they have any common properties. Second, if the iterators have an underlying segmented structure (as they do in existing hash table implementations that use singly linked lists), algorithms that exploit that structure, with an explicit nested loop, can be more efficient than algorithms that view the elements as a flat range.
The most important part of the bucket interface is an overloading of begin and end. If n is an integer, [begin(n), end(n)) is a range of iterators pointing to the elements in the nth bucket. These member functions return iterators, of course, but not of type X::iterator or X::const_iterator. Instead they return iterators of type X::local_iterator or X::const_local_iterator. A local iterator is able to iterate within a bucket, but not necessarily between buckets; in some implementations it's possible for X::local_iterator to be a simpler data structure than X::iterator. X::iterator and X::local_iterator are permitted to be the same type; implementations that use doubly linked lists will probably take advantage of that freedom.
This bucket interface is not provided by the SGI, Dinkumware, or Metrowerks implementations. It is inspired partly by the Metrowerks collision-detection interface, and partly by earlier work  on algorithms for segmented containers.
The full text of the proposal, in a form and style that's similar to the existing C++ Standard, is shown in the sidebar, "Standardese for the Hash Table Proposal." All of the important design decisions have already been discussed; the text is merely a more formal way of saying the same thing.
Will the Standardization Committee adopt this proposal unchanged? Almost certainly not. It's a safe bet that hash tables will make it into the next revision of the C++ Standard, but it's also a safe bet that there will be changes along the way. And, of course, hash tables won't be the only way in which the standard library gets extended! If you want to write a proposal for a library extension, this can serve as a model.
 M. H. Austern. "Segmented Iterators and Hierarchical Algorithms," 1998, in M. Jazayeri, R. G. K. Loos, and D. R. Musser, ed., Generic Programming: International Seminar on Generic Programming, Castle Dagstuhl, Springer, 2001.
Matt Austern is the author of Generic Programming and the STL and the chair of the C++ standardization committees library working group. He works at AT&T Labs Research and can be contacted at [email protected].