Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Database

Algorithm Alley


Dr. Dobb's Journal September 1997: Hash Functions

Hash Functions

Bob works at Oracle, maintaining referential integrity. He can be contacted at [email protected].


Over the past two years, I've experimented with many different hash functions, and eventually designed my own general-purpose function to address the weaknesses I found. To serve as a plug-in replacement for existing hash functions, it had to meet several requirements:

  • The keys must be unaligned variable-length byte arrays.
  • Sometimes keys are several such arrays.
  • Sometimes a set of independent hash functions are required.
  • Average key lengths range from 8 bytes to 200 bytes.
  • Keys might be character strings, numbers, bit-arrays, or weirder things.
  • Table sizes can be anything, including powers of 2.
  • The hash must be faster than its predecessor.
  • The hash must do a good job.

I developed this hash function with an elaborate search mechanism that tested for a variety of weaknesses. The result of this search is Listing One which is the fastest hash I've been able to design that meets all of my requirements.

Like most hashes, Listing One can be modeled like Example 1. In my hash, mix() takes 3n of the 6n+35 instructions needed to hash n bytes. Blocks of text are combined with the internal state (a,b,c) by addition. This combining step is the rest of the hash function, consuming the remaining 3n instructions. The only postprocessing is to choose c out of (a,b,c) to be the result.

Three tricks promote speed:

  • Mixing is done on three 4-byte registers rather than on a 1-byte quantity.
  • Combining is done on 12-byte blocks, reducing the loop overhead.
  • The final switch statement combines a variable-length final block with the registers a,b,c without a loop.

The golden ratio in Listing One is just an arbitrary value. It keeps all zeros from being hashed to all zeros.

In this article, I'll explain some things I've learned about hashes, and show how some competing hashes stack up.

The Hash Must Do a Good Job

The most interesting requirement was that the hash must be better than its competition. What does it mean for a hash to be good for hash table lookup?

A good hash function distributes hash values uniformly. If you don't know the keys before choosing the function, the best you can do is map an equal number of possible keys to each hash value. If keys are distributed uniformly, an excellent hash would be to choose the first few bytes of the key and use that as the hash value. Unfortunately, real keys aren't uniformly distributed. Choosing the first few bytes works quite poorly in practice.

The real requirement then is that a good hash function should distribute hash values uniformly for the keys that users actually use. To understand some of the problems, consider Table 1, the EMP table. If you consider each horizontal row to be a key, you'll notice some patterns:

  • Length matters. The only difference between zero and no value at all may be the length of the value. Also, "aa aaa" and "aaa aa" should hash to different values.
  • Keys often consist of substrings arranged in different orders. For example, the MGR of some keys is the EMPNO of others.
  • Some keys are mostly zero, with only a few bits set. (That pattern doesn't appear in this example, but it's a common pattern.)
  • Keys often differ in only a few bits. For example, all the keys are ASCII, so the high bit of every byte is 0.

It's easy to design hashes that handle the first three patterns. If the length is included in the data being hashed, then lengths are not a problem. If the hash does not treat text blocks commutatively, then substrings are not a problem. You can see how a hash behaves with strings that are mostly zeros by listing all strings with only one bit set and checking if that set of strings produces too many collisions.

The remaining pattern is that keys often differ in only a few bits. If a hash allows small sets of input bits to cancel each other out, and some keys differ in only those bits, then those keys will map to the same handful of hash values.

A Common Weakness

Usually, when a small set of input bits cancel each other out, it is because those input bits only affect a smaller set of bits in the internal state.

Consider the hash function in Example 2, which maps the strings "EXXXXXB" and "AXXXXXC" to the same value. These keys differ in bit 3 of the first byte and bit 1 of the seventh byte. After the seventh byte is combined, no additional processing will help because the internal states are already the same.

Any time n input bits can only affect m output bits, and n>m, then the 2n keys that differ in those input bits can only produce 2m distinct hash values. The same is true if n input bits can only affect m bits of the internal state -- later mixing may make the 2m results look uniformly distributed, but there will still be only 2m results. I call this problem "funneling."

The aforementioned function has many sets of 2 bits that affect only 1 bit of the internal state. If there are n input bits, there are (n choose 2)=(n*n/2-n/2) pairs of input bits, only a few of which match weaknesses in our sample function. It is a common pattern for keys to differ in only a few bits. If those bits match one of a hash's weaknesses, which is a rare but not negligible event, the hash will do extremely poorly. In most cases, though, it will do just fine. (This allows a function to slip through sanity checks, like hashing an English dictionary uniformly, while still frequently bombing on user data.)

In hashes built of repeated combine-mix steps, this weakness usually arises because of the following sequence:

1. A small number of bits y of one input block are combined, affecting only y bits of the internal state. So far so good.

2. The mixing step causes those y bits of the internal state to affect only z bits of the internal state.

3. The next combining step overwrites those bits with z more input bits, canceling out the first y input bits.

When z is smaller than the number of bits in the output, then y+z input bits have affected only z bits of the internal state, causing 2y+z possible keys to produce, at most, 2z hash values.

The same thing can happen in reverse:

1. Uncombine this block, causing y block bits to reverse their effect on y bits of the internal state.
2. Unmix the internal state, leaving x bits unaffected by the y bits from this block.
3. Unmixing the previous block reverses the effect on those x bits, canceling out this block's y bits.

If x is less than the number of bits in the output, then the 2x+y keys differing in only those x+y input bits can produce at most 2x hash values.

If the mixing function is not a permutation of the internal state, it is not reversible. Instead, it loses information about the earlier blocks every time it is applied, so keys differing only in the first few input blocks are more likely to collide. The mixing function ought to be a permutation.

It is easy to test whether this weakness exists: If the mixing step causes any bit of the internal state to affect fewer bits of the internal state than there are output bits, the weakness exists. This test should be run on the reverse of the mixing function as well. It can also be run with all sets of two internal state bits, or all sets of three.

The final mix and postprocessing needs to cause every bit of the last input block to affect every bit of the output. If some bit of the last input block does not affect some bits of the output, the user may choose to use only those output bits. Then the user will get the same result for any keys differing in only that input bit.

A Survey of Hash Functions

We now have a new hash function and a theory for evaluating hash functions. Listing Two presents source code implementations for a number of popular hash functions (for all except MD4). Table 2 shows how they compare.

  • Additive hash takes 5n+3 instructions. There is no mixing step. The combining step handles 1 byte at a time. Input bytes commute. The table length must be prime, and can't be much bigger than 1 byte because the value of variable hash is never much bigger than 1 byte.
  • Rotating hash takes 8n+3 instructions. This is the same as the additive hash, except it has a mixing step (a circular shift by 5) and the combining step is EXCLUSIVE-OR instead of addition. The table size is a prime, but the prime can be any size.
  • Pearson's hash preinitializes tab[] to an arbitrary permutation of 0-255. It takes 6n+2 instructions, but produces only a 1-byte result. Larger results can be made by running it several times with different initial hash values.
  • CRC hashing takes 9n+3 instructions. tab is initialized to simulate a maximal-length Linear Feedback Shift Register (LFSR). I used a 32-bit state with a polynomial of 0x04c11db7 for the tests.
  • Generalized CRC hashing takes 9n+3 instructions. It is the same as CRC hashing except it fills tab[] with arbitrary values. The values in the low bytes must form a permutation of 0..255.
  • Universal hashing takes 52n+3 instructions. The size of tab[] is the maximum number of input bits. Values in tab[] are chosen at random.
  • Zobrist hashing takes 10n+3 instructions. The size of tab[][256] is the maximum number of input bytes. Values of tab[][256] are chosen at random. Universal hashing can be implemented by a Zobrist hash with carefully chosen table values.
  • MD4 takes 9.5n+230 instructions. MD4 is a hash designed originally for cryptography by Ron Rivest. It takes 420 instructions to hash a block of 64 aligned bytes. I combined that with my hash's method of putting unaligned bytes into registers, adding 3n instructions. MD4 is overkill for hash table lookup.

Conclusion

A common weakness in hash functions is that a small set of input bits can cancel each other out. There is an efficient test to detect most such weaknesses, and many functions pass this test. The code I present here is for the fastest such function I could find. Hash functions without this weakness work equally well on all classes of keys.

More information about the search I used to discover this hash function can be found at http://ourworld.compuserve.com/homepage/bob_jenkins/avalanch.htm.


Listing One

typedef  unsigned long  int  ub4;   /* unsigned 4-byte quantities */typedef  unsigned       char ub1;   /* unsigned 1-byte quantities */


</p>
#define hashsize(n) ((ub4)1<<(n))
#define hashmask(n) (hashsize(n)-1)


</p>
/* mix -- mix 3 32-bit values reversibly.
For every delta with one or two bits set, and the deltas of all three
  high bits or all three low bits, whether the original value of a,b,c
  is almost all zero or is uniformly distributed,
* If mix() is run forward or backward, at least 32 bits in a,b,c
  have at least 1/4 probability of changing.
* If mix() is run forward, every bit of c will change between 1/3 and
  2/3 of the time.  (Well, 22/100 and 78/100 for some 2-bit deltas.)
mix() takes 36 machine instructions, but only 18 cycles on a superscalar
  machine (like a Pentium or a Sparc).  No faster mixer seems to work,
  that's the result of my brute-force search.  There were about 2^68
  hashes to choose from.  I only tested about a billion of those.
*/
#define mix(a,b,c) \
{ \
  a -= b; a -= c; a ^= (c>>13); \
  b -= c; b -= a; b ^= (a<<8); \
  c -= a; c -= b; c ^= (b>>13); \
  a -= b; a -= c; a ^= (c>>12);  \
  b -= c; b -= a; b ^= (a<<16); \
  c -= a; c -= b; c ^= (b>>5); \
  a -= b; a -= c; a ^= (c>>3);  \
  b -= c; b -= a; b ^= (a<<10); \
  c -= a; c -= b; c ^= (b>>15); \
}


</p>
/* hash() -- hash a variable-length key into a 32-bit value
  k       : the key (the unaligned variable-length array of bytes)
  len     : the length of the key, counting by bytes
  initval : can be any 4-byte value
Returns a 32-bit value.  Every bit of the key affects every bit of
the return value.  Every 1-bit and 2-bit delta achieves avalanche.
About 6*len+35 instructions.
The best hash table sizes are powers of 2.  There is no need to do
mod a prime (mod is sooo slow!).  If you need less than 32 bits,
use a bitmask.  For example, if you need only 10 bits, do
  h = (h & hashmask(10));
In which case, the hash table should have hashsize(10) elements.
If you are hashing n strings (ub1 **)k, do it like this:
  for (i=0, h=0; i<n; ++i) h = hash( k[i], len[i], h);
By Bob Jenkins, 1996.  [email protected].  You may use this
code any way you wish, private, educational, or commercial.  It's free.
See http://ourworld.compuserve.com/homepages/bob_jenkins/evahash.htm
Use for hash table lookup, or anything where one collision in 2^^32 is
acceptable.  Do NOT use for cryptographic purposes.
*/


</p>
ub4 hash( k, length, initval)
register ub1 *k;        /* the key */
register ub4  length;   /* the length of the key */
register ub4  initval;  /* the previous hash, or an arbitrary value */
{
   register ub4 a,b,c,len;


</p>
   /* Set up the internal state */
   len = length;
   a = b = 0x9e3779b9;  /* the golden ratio; an arbitrary value */
   c = initval;         /* the previous hash value */


</p>
   /*---------------------------------------- handle most of the key */
   while (len >= 12)
   {
      a += (k[0] +((ub4)k[1]<<8) +((ub4)k[2]<<16) +((ub4)k[3]<<24));
      b += (k[4] +((ub4)k[5]<<8) +((ub4)k[6]<<16) +((ub4)k[7]<<24));
      c += (k[8] +((ub4)k[9]<<8) +((ub4)k[10]<<16)+((ub4)k[11]<<24));
      mix(a,b,c);
      k += 12; len -= 12;
   }
   /*------------------------------------- handle the last 11 bytes */
   c += length;
   switch(len)              /* all the case statements fall through */
   {
   case 11: c+=((ub4)k[10]<<24);
   case 10: c+=((ub4)k[9]<<16);
   case 9 : c+=((ub4)k[8]<<8);
      /* the first byte of c is reserved for the length */
   case 8 : b+=((ub4)k[7]<<24);
   case 7 : b+=((ub4)k[6]<<16);
   case 6 : b+=((ub4)k[5]<<8);
   case 5 : b+=k[4];
   case 4 : a+=((ub4)k[3]<<24);
   case 3 : a+=((ub4)k[2]<<16);
   case 2 : a+=((ub4)k[1]<<8);
   case 1 : a+=k[0];
     /* case 0: nothing left to add */
   }
   mix(a,b,c);
   /*-------------------------------------------- report the result */
   return c;
}

Back to Article

Listing Two

/* Additive Hash */int additive(char *key, int len, int prime)
{
  int hash, i;
  for (hash=len, i=0; i<len; ++i) 
    hash += key[i];
  return (hash % prime);
}


</p>
/* Rotating Hash */
int rotating(char *key, int len, int prime)
{
  int hash, i;
  for (hash=len, i=0; i<len; ++i)
    hash = (hash<<5)^(hash>>27)^key[i];
  return (hash % prime);
}


</p>
/* Pearson's Hash */
char pearson(char *key, int len, char tab[256])
{
  char hash;
  int  i;
  for (hash=len, i=0; i<len; ++i) 
    hash=tab[hash^key[i]];
  return (hash);
}


</p>
/* CRC Hash and Generalized CRC Hash */
int crc(char *key, int len, int mask, int tab[256])
{
  int hash, i;
  for (hash=len, i=0; i<len; ++i)
    hash = (hash<<8)^tab[(hash>>24)^key[i]];
  return (hash & mask);
}


</p>
/* Universal Hash */
int universal(char *key, int len, int mask, int tab[MAXBITS])
{
  int hash, i;
  for (hash=len, i=0; i<(length<<3); i+=8)
  {
    register char k = key[i>>3];
    if (k&0x01) hash ^= tab[i+0];
    if (k&0x02) hash ^= tab[i+1];
    if (k&0x04) hash ^= tab[i+2];
    if (k&0x08) hash ^= tab[i+3];
    if (k&0x10) hash ^= tab[i+4];
    if (k&0x20) hash ^= tab[i+5];
    if (k&0x40) hash ^= tab[i+6];
    if (k&0x80) hash ^= tab[i+7];
  }
  return (hash & mask);
}


</p>
/* Zobrist Hash */
int zobrist( char *key, int len, int mask, int tab[MAXBYTES][256])
{
  int hash, i;
  for (hash=len, i=0; i<len; ++i)
    hash ^= tab[i][key[i]];
  return (hash & mask);
}

Back to Article

Listing Three

/* Compute the Funnel-15 result for CRC */void hum()
{
  ub4 i,j,k,whum,x;
  x=0x80000000;
  whum=31;
  for (i=0; i<(15*8); ++i)
  {
    x = (x & 0x80000000) ? ((x << 1) ^ 0x04c11db7) : (x << 1);
    printf("%.8lx\n",x);
    for (k=0, j=1; j; j=(j<<1)) if (j&x&0xff) ++k;
    if (k<whum)
    {
      printf("k is %ld\n",k);
      whum=k;
    }
  }
  printf("whum is %ld %ld %ld %ld\n",whum,x,k,j);
}

Back to Article

DDJ


Copyright © 1997, Dr. Dobb's Journal


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.