Channels ▼

Al Williams

Dr. Dobb's Bloggers

Tiny Hashing

March 05, 2010

Today's embedded systems are often what I consider "big iron" -- computers that just a few years ago would have been powerful desktop PCs or servers are not tucked away inside of a variety of things.

However, most of my work centers on really small CPUs with very little memory and processing power. It takes a different kind of discipline to code for these computers. I mentioned in an earlier post that instead of an FFT, for example, you might use a Goertzel algorithm to decode tones; it is much more efficient, although not as general purpose. In fact, I make a point of collecting things for my "tiny toolbox" -- code that can replace bloated code in specific situations.

One that came to my attention recently was a hashing algorithm called Pearson hashing which is made for quickly hashing strings on 8 bit processors. As you probably know, hashing is transforming a key (usually a string) into an integer using some algorithm and then using the integer as an index into an array or a disk table. Using hashing you can get fast array access while using strings for keys instead of numbers.

As an example, suppose I'm writing a BASIC interpreter to run on a small CPU. I could easily create a table of function pointers for each keyword. The brute force approach would have me store the string with the function pointer and then just search the list. Of course, if the list were sorted, I could use a smarter technique, but in general I will have to make many string compares to find a keyword on average. Worse still, I have to store a lot more data and deal with variable length strings.

If I used hashing I would transform the keywords into an index. Let's try a simple (and not practical) algorithm. Suppose I shift the keyword to upper case and take the first letter's ASCII code and subtract the ASCII code for 'A' (65). Then I wind up with a number between 0 and 25. The DATA statement is index 3 and the INPUT statement is index 8. Of course, the IF statement is also index 8!

That's called a collision. A perfect hash algorithm won't create collisions for a given input set, so this is not a perfect hash algorithm! You could try to find a perfect algorithm by considering more (and maybe all) of the letters. For example, you might XOR the first two letters together so INPUT and IF would have different values. You'd have to try all the possibilities to see if there were any collisions.

This is the situation Pearson hashing addresses. It uses a simple algorithm and a 256 byte permutation table to create a high quality hashing algorithm that doesn't require much computing power. According to Wikipedia the algorithm, which was published by the eponymous Pearson in 1990, has several benefits:

  • It is extremely simple (each character is XORed or summed to the current hash code and the result is used to look up the new hash code in the permutation table).
  • It executes quickly on resource-limited processors.
  • There is no simple class of inputs for which collisions (identical outputs) are especially likely.
  • Given a small, privileged set of inputs (e.g., reserved words for a compiler), the permutation table can be adjusted so that those inputs yield distinct hash values, producing what is called a perfect hash function.

The key to making a perfect hash for our BASIC interpreter is manipulating the permutation table, but that's static analysis and you avoid all the run time string processing you'd expect to find in an interpreter.

Since I've read about the Pearson hash, I'm looking for a good chance to use it. Meanwhile, what's in your "tiny toolbox?"

Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 


Video