High-performance Internet routers require very efficient lookups of IP addresses to quickly decide where to forward packets based on their destination IP address. Techniques used to address this problem include those based on hashing. In this article, I present a method based on multiple hash functions that produces good hash tables for applications such as this.
Open address hashing has long been used as a simple mechanism for doing fast lookups. A hash function takes a data element (or key) as input. It provides an index into an array of buckets, usually implemented as linked lists, as output. The key is stored in the bucket associated with its hash index. The entire structure is called a "hash table."
For example, a small database could keep track of employee information by using Social Security numbers as keys. The last four digits of the Social Security number provides a trivial hash function; this would require an array of 10,000 linked lists. Each entry in the linked lists would record the remaining five digits of an employee Social Security number and a pointer to all relevant employee information. If each list remains short, this provides an efficient way to access employee records. In routers, keys often consist of some of the bits of the destination address.
Linked lists give you an advantage in that you do not need to set aside memory ahead of time for the buckets. Instead, you assign memory to buckets as keys are placed. The corresponding disadvantage is that you may require several slow accesses to memory when searching for a key, since the linked list may not exhibit good locality for a cache. Since the time for memory accesses generally dominates the computation time for large hash tables, this lack of locality can create a bottleneck. Another disadvantage of this approach is you require additional memory for the pointers for the linked list.
Routers provide an example of a situation where these disadvantages prove problematic. A router that traverses a linked list to look up relevant information in a hash table would be too slow.
As an alternative, you can set aside a fixed amount of memory for the buckets. For example, suppose a cache line holds 32 bytes, and each key requires 8 bytes to store. You can assign each bucket 32 bytes so that searching an entire bucket requires bringing only a single cache line in from memory. There are two problems with this approach:
- It only works if at most four items end up in every bucket; otherwise, you need to worry about cases where buckets overflow. In practice, it is generally sufficient to make sure no bucket overflows with high probability; in the rare case of overflow, you can rehash the keys.
- This approach has the potential to waste a lot of memory. If most of the buckets lie empty or contain only one key, the hash table requires a lot of memory that it does not use. Routers in particular generally do not have memory to spare, so memory usage as well as lookup speed are important for this application.
Both problems relate to the distribution of keys among the buckets. To avoid wasting memory, you need an even spread of keys among the buckets. To avoid overflow, you need strong guarantees about how the distribution of keys behaves. Suppose that you have good hash functions available; by "good" I mean that the bucket each key ends up in appears random. Using bits from a secure hash function such as MD5 or SHA works well. If each key ends up in a random bucket, as in Figure 1(a), what does the distribution of keys into buckets look like?
The equation in Figure 1(a) shows that this case is easy to calculate mathematically: For n keys and m buckets, the probability of a bucket ending up with k keys is given. The first term comes from the fact that you must choose k out of the n keys to fall in the bucket. The second term comes from the n-k keys that do not fall in the bucket. The third term comes from the k keys that do fall in the bucket. For reasonably large n and m, you can approximate this expression by using the equation in Figure 1(b), which depends only on k and the ratio n/m. Using this approximation, I derived Table 1, which shows the probability of a bucket having k keys for various ratios n/m. This table lets you roughly estimate how often buckets overflow.
Unfortunately, Table 1 suggests that you need a lot of wasted memory to avoid overflow, stemming from the fact that the number of keys per bucket will vary dramatically. If you have the same number of keys as buckets (n=m), then on average there is one key per bucket, but Table 1 reflects that over 0.3 percent of all the buckets will have five or more keys. Even if you have twice as many buckets as keys (n=m/2) so that more than half of the buckets lie empty, about 1.6 out of every 10,000 buckets will have five or more keys. If my hash table has 10,000 buckets, I am still likely to find one or more buckets with too many items to fit in a cache line. For hash tables with thousands of buckets, I really need something like four times as many buckets as keys (n=m/4), which wastes an awful lot of space. In some applications, hash tables have tens of thousands or hundreds of thousands of keys, which would require even more empty space with this approach.
As another case for comparison, you might consider when cache lines are 32 bytes and keys are 4 bytes, so eight keys can fit in a bucket. Even if the keys and the number of buckets are equal (n=m), Table 1 suggests that roughly one out of every hundred thousand buckets will exceed the cap.
Two Hash Functions
Is it back to linked lists or is there a way to ensure a more even spread of keys among buckets? I suggest applying a simple but effective trick use two hash functions instead of one, as in Figure 2(b) as opposed to Figure 2(a). When placing a key in a bucket, you use two different hash functions to obtain two indices. You examine both buckets and see how many keys are already stored in each bucket; this requires an extra counter field for each bucket, or a special value (such as 0) to represent an empty place in the bucket. You then place the key in the bucket with fewer keys already there, breaking ties in any arbitrary way, such as selecting the first bucket chosen. Listing One is basic pseudocode for this approach. To search for a key now requires looking in two buckets, since the key could end up in either of the two choices.
Using this information, you can compute Table 2, which presents the same type of information about the distribution of keys as Table 1. Blank entries contain quantities with a value less than 1e-100, which seemed small enough to discount.
Just glancing at Table 2, you should see that the tails of the distribution fall off much more quickly than in Table 1. Indeed, the behavior appears completely different. In Table 1, as k increases, the fraction of buckets with k keys falls by a factor of roughly n/km. In Table 2, past some point, the fraction of buckets with k keys falls by becoming roughly the square of the previous fraction. This is the power of using two hash functions.
Suppose you want at most four keys in each bucket. If you have the same number of keys per bucket (n=m), then you are quite comfortable for a large range of bucket sizes fewer than two in every trillion buckets will have five keys. If you want at most eight keys per bucket, having four keys per bucket on average (n=4m) lets you feel quite secure that you will not overflow a bucket.
You might ask what can be done about buckets overflowing with too many keys. If you do not mind a miniscule probability of failure, you might design the system so that you can ignore the problem. For instance, if you design the system so that each bucket contains only two keys on average (n=2m), and you allow eight keys per bucket, then the probability a bucket ever overflows is so small (about 3.2e-97 per bucket) that you can probably just ignore it. This solution requires trading off some additional wasted space for more security. Of course, you better be sure to use a good hash function in this situation, since I created Table 2 with the assumption that each hash returns a random value. If you use a bad hash function, such as the last four digits of a Social Security number, these guarantees may not hold.
As another alternative, you can rehash values if necessary. For example, suppose you have a fixed number of keys and you choose a hash table size so that you have roughly a 10 percent chance of overflowing some bucket. If you represent the key as a number to obtain two hash functions, you might use bits from the MD5 hash of (key+x) and (key+y) for two numbers x and y. If you find a bucket overflows, you can change the values of x and y and try again. Each time, you have a 90 percent chance of succeeding without overflowing any buckets. With this technique, you can avoid wasted space, but you may have to pay the price of rehashing a few times.
If two hash functions work so well, what about three? Of course using more hash functions helps spread the keys more evenly among buckets, but the most substantial gains come from switching to two hash functions. Also, keep in mind that if you use three (or more) hash functions, you may have to examine three (or more) buckets every time you do a search. For the application to routers, each lookup may slow down the time required to forward a packet, so you would need to consider this tradeoff carefully. Table 3 provides the relevant numbers for comparison.
A less obvious improvement that I strongly recommend involves using a slightly different tie-breaking scheme due to Berthold Vöcking. Break the buckets into two equally sized groups. For example, call the buckets numbered 0 to m/2-1 the "left buckets" and those numbered m/2 to m-1 the "right buckets." Use the first hash function to choose a left bucket and the second hash function to choose a right bucket. Place the key in the bucket with fewer keys already, but if both buckets have the same number of keys, always place the key in the bucket on the left, as in Figure 2(c). The appropriate pseudocode in Listing Two almost exactly mirrors that of Listing One. Surprisingly, the asymmetry from splitting and breaking ties toward the left improves the resulting distribution. This may surprise you (it surprised me the first time I heard about it), but the numbers that appear in Table 4 do not lie.
This scheme still only uses two hash functions per object, although you could use the same technique with three or more hash functions. It also has a key advantage for some applications, including the application to routers: Because each index lies in a distinct part of the hash table, the lookups are easy to parallelize. If you can block your memory appropriately, you could parallelize the two memory accesses to take approximately the time of only one access. In the case of a router, where the hardware design can be controlled, adding the parallelism could yield large gains in speed at little additional cost.
Why does the asymmetry help? The following intuition may explain. When you insert the first keys, a bucket with two keys will arise if a new key already hashes to two locations with one key. In the original setup, if 1/10 of the buckets hold one key, the probability of a key hashing to two locations already holding a key would be 1/100. In the asymmetric setup, more buckets on the left will have one key than the buckets on the right. If 1/10 of the buckets hold one key, then perhaps 1/8 on the left do and 3/40 on the right do. (These numbers illustrate the point, even if they are not accurate.) Now the probability of a key hashing to two locations already holding a key would be (1/8)(3/40)<1/100. So with the asymmetry, creating buckets with larger numbers of keys is actually less likely.
The technique of using multiple hash functions applies to other similar problems. Certainly whenever you build a hash table and you would like to map buckets to cache lines or other fixed-sized blocks of memory, using two or more hash functions will lead to much greater control over the distribution of keys into buckets; see Figure 2.
But as another example, consider a distributed computing system consisting of many machines meant to handle computing jobs for a large population of users. Users send each job to a single machine and want fast responses. If jobs passed through a central controller, it could send the job to the currently least-loaded machine. The controller would have to maintain load information for all of the machines and direct each job. So much centralization could become a system bottleneck. Alternatively, each job could choose a random machine for processing using a good hash function, with some job identifier as the key. While this performs well, it is likely that some machines will become overloaded while others sit idle. A better distribution of load occurs if each job chooses two random machines using two hash functions, checks their load, and obtains service from the less loaded machine. Using more hash functions improves performance, as does splitting the machines into groups and breaking ties toward the left, but just using two choices yields the most significant improvement.
Michael's work is supported by NSF grant numbers CCR-9983832, CCR-0118701, and
index1 = hash1(key); // hash1 returns a number from [0,m - 1] index2 = hash2(key); // hash2 returns a number from [0,m - 1] if (count_keys(index1) < count_keys(index2)) place key in bucket for index1 increment count_keys(index1) else place key in bucket for index2 increment count_keys(index2)
index1 = hash1(key); // hash1 returns a number from [0,m/2 - 1] index2 = m/2 + hash2(key); // hash2 returns a number from [0,m/2 - 1] if (count_keys(index1) < count_keys(index2)) place key in bucket for index1 increment count_keys(index1) else place key in bucket for index2 increment count_keys(index2)