Algorithm Alley

What makes one hash function better than another? Bob knows the answer, and he has used his knowledge to design a new hash function that may be better than what you're using now.


September 01, 1997
URL:http://www.drdobbs.com/database/algorithm-alley/184410284

Dr. Dobb's Journal September 1997: Hash Functions

Hash Functions

By Bob Jenkins

Bob works at Oracle, maintaining referential integrity. He can be contacted at [email protected].


Over the past two years, I've experimented with many different hash functions, and eventually designed my own general-purpose function to address the weaknesses I found. To serve as a plug-in replacement for existing hash functions, it had to meet several requirements:

I developed this hash function with an elaborate search mechanism that tested for a variety of weaknesses. The result of this search is Listing One, the fastest hash I've been able to design that meets all of my requirements.

Like most hashes, Listing One can be modeled as in Example 1. In my hash, mix() takes 3n of the 6n+35 instructions needed to hash n bytes. Blocks of text are combined with the internal state (a,b,c) by addition; this combining step is the rest of the hash function, consuming the remaining 3n instructions. The only postprocessing is to choose c out of (a,b,c) as the result.

Three tricks promote speed:

The golden ratio in Listing One is just an arbitrary value. It keeps all zeros from being hashed to all zeros.

In this article, I'll explain some things I've learned about hashes, and show how some competing hashes stack up.

The Hash Must Do a Good Job

The most interesting requirement was that the hash must be better than its competition. What does it mean for a hash to be good for hash table lookup?

A good hash function distributes hash values uniformly. If you don't know the keys before choosing the function, the best you can do is map an equal number of possible keys to each hash value. If keys are distributed uniformly, an excellent hash would be to choose the first few bytes of the key and use that as the hash value. Unfortunately, real keys aren't uniformly distributed. Choosing the first few bytes works quite poorly in practice.

The real requirement then is that a good hash function should distribute hash values uniformly for the keys that users actually use. To understand some of the problems, consider Table 1, the EMP table. If you consider each horizontal row to be a key, you'll notice some patterns:

It's easy to design hashes that handle the first three patterns. If the length is included in the data being hashed, then lengths are not a problem. If the hash does not treat text blocks commutatively, then substrings are not a problem. You can see how a hash behaves with strings that are mostly zeros by listing all strings with only one bit set and checking if that set of strings produces too many collisions.

The remaining pattern is that keys often differ in only a few bits. If a hash allows small sets of input bits to cancel each other out, and some keys differ in only those bits, then those keys will map to the same handful of hash values.

A Common Weakness

Usually, when a small set of input bits cancel each other out, it is because those input bits only affect a smaller set of bits in the internal state.

Consider the hash function in Example 2, which maps the strings "EXXXXXB" and "AXXXXXC" to the same value. These keys differ in bit 3 of the first byte and bit 1 of the seventh byte. After the seventh byte is combined, no additional processing will help because the internal states are already the same.

Any time n input bits can only affect m output bits, and n > m, then the 2^n keys that differ in those input bits can only produce 2^m distinct hash values. The same is true if n input bits can only affect m bits of the internal state -- later mixing may make the 2^m results look uniformly distributed, but there will still be only 2^m results. I call this problem "funneling."

The aforementioned function has many sets of 2 bits that affect only 1 bit of the internal state. If there are n input bits, there are (n choose 2)=(n*n/2-n/2) pairs of input bits, only a few of which match weaknesses in our sample function. It is a common pattern for keys to differ in only a few bits. If those bits match one of a hash's weaknesses, which is a rare but not negligible event, the hash will do extremely poorly. In most cases, though, it will do just fine. (This allows a function to slip through sanity checks, like hashing an English dictionary uniformly, while still frequently bombing on user data.)

In hashes built of repeated combine-mix steps, this weakness usually arises because of the following sequence:

1. A small number of bits y of one input block are combined, affecting only y bits of the internal state. So far so good.

2. The mixing step causes those y bits of the internal state to affect only z bits of the internal state.

3. The next combining step overwrites those z bits with z more input bits, canceling out the first y input bits.

When z is smaller than the number of bits in the output, then y+z input bits have affected only z bits of the internal state, causing 2^(y+z) possible keys to produce, at most, 2^z hash values.

The same thing can happen in reverse:

1. Uncombine this block, causing y block bits to reverse their effect on y bits of the internal state.
2. Unmix the internal state, leaving x bits unaffected by the y bits from this block.
3. Unmixing the previous block reverses the effect on those x bits, canceling out this block's y bits.

If x is less than the number of bits in the output, then the 2^(x+y) keys differing in only those x+y input bits can produce at most 2^x hash values.

If the mixing function is not a permutation of the internal state, it is not reversible. Instead, it loses information about the earlier blocks every time it is applied, so keys differing only in the first few input blocks are more likely to collide. The mixing function ought to be a permutation.

It is easy to test whether this weakness exists: If the mixing step causes any bit of the internal state to affect fewer bits of the internal state than there are output bits, the weakness exists. This test should be run on the reverse of the mixing function as well. It can also be run with all sets of two internal state bits, or all sets of three.

The final mix and postprocessing needs to cause every bit of the last input block to affect every bit of the output. If some bit of the last input block does not affect some bits of the output, the user may choose to use only those output bits. Then the user will get the same result for any keys differing in only that input bit.

A Survey of Hash Functions

We now have a new hash function and a theory for evaluating hash functions. Listing Two presents source-code implementations for a number of popular hash functions (all except MD4). Table 2 shows how they compare.

Conclusion

A common weakness in hash functions is that a small set of input bits can cancel each other out. There is an efficient test to detect most such weaknesses, and many functions pass this test. The code I present here is for the fastest such function I could find. Hash functions without this weakness work equally well on all classes of keys.

More information about the search I used to discover this hash function can be found at http://ourworld.compuserve.com/homepage/bob_jenkins/avalanch.htm.


Listing One

typedef  unsigned long  int  ub4;   /* unsigned 4-byte quantities */
typedef  unsigned       char ub1;   /* unsigned 1-byte quantities */


#define hashsize(n) ((ub4)1<<(n))
#define hashmask(n) (hashsize(n)-1)

/* mix -- mix 3 32-bit values reversibly.
For every delta with one or two bits set, and the deltas of all three
high bits or all three low bits, whether the original value of a,b,c
is almost all zero or is uniformly distributed,
* If mix() is run forward or backward, at least 32 bits in a,b,c
  have at least 1/4 probability of changing.
* If mix() is run forward, every bit of c will change between 1/3 and
  2/3 of the time.  (Well, 22/100 and 78/100 for some 2-bit deltas.)
mix() takes 36 machine instructions, but only 18 cycles on a
superscalar machine (like a Pentium or a Sparc).  No faster mixer
seems to work; that's the result of my brute-force search.  There
were about 2^68 hashes to choose from.  I only tested about a
billion of those.
*/
#define mix(a,b,c) \
{ \
  a -= b; a -= c; a ^= (c>>13); \
  b -= c; b -= a; b ^= (a<<8); \
  c -= a; c -= b; c ^= (b>>13); \
  a -= b; a -= c; a ^= (c>>12); \
  b -= c; b -= a; b ^= (a<<16); \
  c -= a; c -= b; c ^= (b>>5); \
  a -= b; a -= c; a ^= (c>>3); \
  b -= c; b -= a; b ^= (a<<10); \
  c -= a; c -= b; c ^= (b>>15); \
}

/* hash() -- hash a variable-length key into a 32-bit value
   k       : the key (the unaligned variable-length array of bytes)
   len     : the length of the key, counting by bytes
   initval : can be any 4-byte value
Returns a 32-bit value.  Every bit of the key affects every bit of
the return value.  Every 1-bit and 2-bit delta achieves avalanche.
About 6*len+35 instructions.

The best hash table sizes are powers of 2.  There is no need to do
mod a prime (mod is sooo slow!).  If you need less than 32 bits,
use a bitmask.  For example, if you need only 10 bits, do
  h = (h & hashmask(10));
In which case, the hash table should have hashsize(10) elements.

If you are hashing n strings (ub1 **)k, do it like this:
  for (i=0, h=0; i<n; ++i) h = hash( k[i], len[i], h);

By Bob Jenkins, 1996.  [email protected].  You may use this
code any way you wish, private, educational, or commercial.  It's
free.  See
http://ourworld.compuserve.com/homepages/bob_jenkins/evahash.htm
Use for hash table lookup, or anything where one collision in 2^^32
is acceptable.  Do NOT use for cryptographic purposes.
*/

ub4 hash( k, length, initval)
register ub1 *k;        /* the key */
register ub4  length;   /* the length of the key */
register ub4  initval;  /* the previous hash, or an arbitrary value */
{
   register ub4 a,b,c,len;

   /* Set up the internal state */
   len = length;
   a = b = 0x9e3779b9;  /* the golden ratio; an arbitrary value */
   c = initval;         /* the previous hash value */

   /*---------------------------------------- handle most of the key */
   while (len >= 12)
   {
      a += (k[0] +((ub4)k[1]<<8) +((ub4)k[2]<<16) +((ub4)k[3]<<24));
      b += (k[4] +((ub4)k[5]<<8) +((ub4)k[6]<<16) +((ub4)k[7]<<24));
      c += (k[8] +((ub4)k[9]<<8) +((ub4)k[10]<<16)+((ub4)k[11]<<24));
      mix(a,b,c);
      k += 12; len -= 12;
   }

   /*------------------------------------- handle the last 11 bytes */
   c += length;
   switch(len)              /* all the case statements fall through */
   {
   case 11: c+=((ub4)k[10]<<24);
   case 10: c+=((ub4)k[9]<<16);
   case 9 : c+=((ub4)k[8]<<8);
      /* the first byte of c is reserved for the length */
   case 8 : b+=((ub4)k[7]<<24);
   case 7 : b+=((ub4)k[6]<<16);
   case 6 : b+=((ub4)k[5]<<8);
   case 5 : b+=k[4];
   case 4 : a+=((ub4)k[3]<<24);
   case 3 : a+=((ub4)k[2]<<16);
   case 2 : a+=((ub4)k[1]<<8);
   case 1 : a+=k[0];
   /* case 0: nothing left to add */
   }
   mix(a,b,c);

   /*-------------------------------------------- report the result */
   return c;
}

Back to Article

Listing Two

/* Additive Hash */
int additive(char *key, int len, int prime)
{
  int hash, i;
  for (hash=len, i=0; i<len; ++i) 
    hash += key[i];
  return (hash % prime);
}


/* Rotating Hash */
int rotating(char *key, int len, int prime)
{
  int hash, i;
  for (hash=len, i=0; i<len; ++i)
    hash = (hash<<5)^(hash>>27)^key[i];
  return (hash % prime);
}

/* Pearson's Hash */
char pearson(char *key, int len, char tab[256])
{
  char hash;
  int  i;
  for (hash=len, i=0; i<len; ++i)
    hash=tab[hash^key[i]];
  return (hash);
}

/* CRC Hash and Generalized CRC Hash */
int crc(char *key, int len, int mask, int tab[256])
{
  int hash, i;
  for (hash=len, i=0; i<len; ++i)
    hash = (hash<<8)^tab[(hash>>24)^key[i]];
  return (hash & mask);
}

/* Universal Hash */
int universal(char *key, int len, int mask, int tab[MAXBITS])
{
  int hash, i;
  for (hash=len, i=0; i<(len<<3); i+=8)
  {
    register char k = key[i>>3];
    if (k&0x01) hash ^= tab[i+0];
    if (k&0x02) hash ^= tab[i+1];
    if (k&0x04) hash ^= tab[i+2];
    if (k&0x08) hash ^= tab[i+3];
    if (k&0x10) hash ^= tab[i+4];
    if (k&0x20) hash ^= tab[i+5];
    if (k&0x40) hash ^= tab[i+6];
    if (k&0x80) hash ^= tab[i+7];
  }
  return (hash & mask);
}

/* Zobrist Hash */
int zobrist( char *key, int len, int mask, int tab[MAXBYTES][256])
{
  int hash, i;
  for (hash=len, i=0; i<len; ++i)
    hash ^= tab[i][key[i]];
  return (hash & mask);
}

Back to Article

Listing Three

/* Compute the Funnel-15 result for CRC */
void hum()
{
  ub4 i,j,k,whum,x;
  x=0x80000000;
  whum=31;
  for (i=0; i<(15*8); ++i)
  {
    x = (x & 0x80000000) ? ((x << 1) ^ 0x04c11db7) : (x << 1);
    printf("%.8lx\n",x);
    for (k=0, j=1; j; j=(j<<1)) if (j&x&0xff) ++k;
    if (k<whum)
    {
      printf("k is %ld\n",k);
      whum=k;
    }
  }
  printf("whum is %lu %lu %lu %lu\n",whum,x,k,j);
}

Back to Article

DDJ


Copyright © 1997, Dr. Dobb's Journal


Example 1: Modeling the hash function in Listing One.



Example 2: Typical hash function.



Table 1: The EMP Table.



Table 2: Comparison of several hash functions.

