Memory-Efficient Adaptive Huffman Coding


Oct98: Algorithm Alley

Steven is a Ph.D. student in the computer science and operations research department of the University of Montreal in Canada. Yoshua is an associate professor in the same department. They can be contacted at [email protected] and bengioy@iro.umontreal.ca, respectively.


Periodically, someone claims to have "gotten around" Shannon's Law of Entropy. According to such claims, Shannon's limit on compression has a few loopholes, which are exploited by some new compression algorithm. Sometimes, the proposed algorithm has a flaw that's been overlooked. Other times, the algorithm is real, but the claim is based on a misunderstanding of Shannon's work.

Make no mistake: Every compression algorithm is subject to some form of Shannon's limit. The form that Steven and Yoshua discuss this month is just the simplest one. This particular form only applies to codes -- such as Huffman's -- that map a single input symbol to a single code symbol.

Other types of compression algorithms may not be limited by this precise formula, but they are still limited; every compression algorithm obeys some data model with its own unique notion of "entropy."

In practice, techniques (such as Huffman compression and arithmetic coding) that follow the simplest form of Shannon's Law work well for some types of data. Huffman coding, in particular, is simple to implement and free of patents, which pretty much guarantees its use for a long time.

This month, Steven and Yoshua take a look at a basic weakness common to many compression algorithms. Few compression algorithms work well when there are many symbols. As it turns out, a clever data structure solves this problem for Huffman coding.

-- Tim Kientzle


Many kinds of data can be viewed as a list of integers. This is true of character data, for instance. Usually, some integer values are more common than others, which makes it possible to compress the data by varying the number of bits used for different values. This is the idea behind Huffman compression. However, typical Huffman compression only works well when you have a small number of different values (usually fewer than 256). With 16-bit Unicode characters, for instance, standard Huffman coding requires too much memory to be practical for many applications.

In this article, we'll examine an alternative to typical Huffman compression that lets it work with large numbers of symbols. We'll also show how to perform adaptive Huffman compression, in which the encoding is varied on the fly to match the data. The adaptive nature of our algorithm also saves memory when you have a large potential alphabet (such as 16-bit Unicode) of which only a portion is actually used.

An Example

Suppose that you want to store people's heights as one field in a database. You might expect most people to be between 60 and 75 inches tall, although some might be as tall as 80 inches or as short as 50 inches. It seems a waste to use a full eight bits to store the height, since most values will fall in a fairly narrow range. But any code you use must allow for unusual values. When you store a rare height, you do not care if the code is very long; because it is rare, it won't make a big difference. However, common values should use short codes so that the total size of the database is kept as small as possible.

There are many ways to assign codes to symbols according to a probability law. In this example, a symbol is one height value; a height of 71 inches, for instance, is one possible symbol. The symbols here happen to be numbers, but they could be anything else, such as letters or colors. The optimal average code length for a given set S of symbols and a random variable X distributed over S is given by Shannon's formula; see Figure 1.

According to Shannon's formula, if a symbol s from set S occurs with probability p(X=s), the optimal code has a length of -log2 p(X=s) bits. The quantity H(X) is referred to as the "entropy of X." It represents the average number of bits required to code each symbol. Any encoding where the average code length is smaller than the entropy must lose information, making it impossible to unambiguously decode the result.
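In standard notation, Shannon's formula for the entropy is:

    H(X) = -\sum_{s \in S} p(X = s) \log_2 p(X = s)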

Huffman Coding

Huffman coding is one way to generate a nearly optimal binary code for a given set of symbols and probability law. You first estimate the probability of each symbol, usually by simply counting the frequency of each symbol. You then create a node for each symbol with its respective frequency. Finally, you combine these nodes to build a tree that specifies the code.

Initially, all nodes are marked as active. Pick the two lowest-frequency active nodes and create a parent node for them, whose frequency is the sum of the two nodes' frequencies. Once these two nodes have been processed, only the parent remains tagged as active. Repeat this procedure, merging the two least frequent active nodes, until you have a single parent node covering all the symbols. Figure 2 illustrates the process.
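As a rough illustration of this merging loop (not the article's code; the node layout and names are invented), a C++ sketch using a priority queue as the pool of active nodes might look like this:

    #include <queue>
    #include <utility>
    #include <vector>

    // A leaf holds a symbol and its frequency; an internal node holds the sum
    // of its children's frequencies.  (Ownership/cleanup is ignored in this sketch.)
    struct HuffNode {
        long freq;
        int symbol;                      // meaningful only for leaves
        HuffNode *left, *right;          // null for leaves
    };

    struct HeavierFirst {
        bool operator()(const HuffNode *a, const HuffNode *b) const {
            return a->freq > b->freq;    // makes the priority queue a min-heap
        }
    };

    HuffNode *buildHuffmanTree(const std::vector<std::pair<int, long>> &counts) {
        std::priority_queue<HuffNode*, std::vector<HuffNode*>, HeavierFirst> active;
        for (const auto &c : counts)
            active.push(new HuffNode{c.second, c.first, nullptr, nullptr});
        while (active.size() > 1) {
            HuffNode *a = active.top(); active.pop();   // two least frequent
            HuffNode *b = active.top(); active.pop();   // active nodes
            active.push(new HuffNode{a->freq + b->freq, -1, a, b});  // their new parent
        }
        return active.top();             // root of the finished code tree
    }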

The code for each symbol is given by the path from the root of the tree to each symbol. A right branch represents 0, a left branch 1. To decode the compressed information, read one bit at a time, and take the right or left branch as specified by that bit. When you reach a leaf, you have decoded one symbol. Start over at the root and continue reading bits to decode subsequent symbols.

Huffman has Problems

Huffman coding is efficient, since the average code length is generally close to the entropy. However, Huffman's method has three problems:

  • It is static. The code does not adapt to the data that is actually being compressed. Codes are sometimes developed from a single large "corpus" of typical data, then used to compress a variety of files. If the data doesn't quite match the corpus, then the code is no longer optimal. Even if you start by counting the probability in the file to be compressed, the probability can vary between sections of the file. You want a dynamic compression method, one that adjusts the code to the data being compressed right now.
  • The tree can be very large. If you code bytes, your tree will only have 511 nodes. But if you're encoding 32-bit integers, you would need 2^33 nodes! A better method should create nodes only when necessary.
  • The decoder has to somehow learn the tree. This usually requires storing the probability table in the compressed file, from which the decoder can construct the compression tree. Of course, the compressed file size plus the probability table may be bigger than the original file. A better method should provide some means to dispense with the probability table.

Our algorithm addresses these problems. First, it is dynamic (or adaptive). Second, it minimizes the number of nodes in the tree. The third problem is solved by the first solution: An adaptive algorithm does not need to store a table of probabilities. You can use one, if you want, to boost the compression right from the start, but it can be quite crude. The cruder it is, the less space it will use.

Storing Sets of Symbols

Our algorithm uses a tree with leaves that represent sets of symbols rather than individual symbols, and uses only two operations (migration and rebalancing) to remain as close as possible to optimal. The basic idea is for each leaf to hold symbols that have the exact same probability. This probability is computed incrementally, as we read symbols one by one from the file.

A symbol seen m times is put in the set of all symbols seen m times. This set lies in a leaf of the tree. When that symbol appears again, it "migrates" to the set of symbols seen m+1 times: it is removed from set m and inserted into set m+1. This set-based approach works because, as Shannon's equation shows, all symbols seen an equal number of times get codes of equal length. Example 1 presents the migration algorithm.
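Ignoring the tree itself for a moment, a minimal C++ sketch of the migration step could look like the following (the containers and names are invented for illustration, not the article's Example 1):

    #include <cstdint>
    #include <map>
    #include <set>

    // Each frequency class m holds the symbols seen exactly m times; a second
    // map remembers how many times each symbol has been seen so far.
    struct FrequencyClasses {
        std::map<long, std::set<uint32_t>> byCount;   // m -> symbols seen m times
        std::map<uint32_t, long> countOf;             // symbol -> m

        void update(uint32_t symbol) {
            long m = countOf[symbol];                 // 0 if never seen before
            if (m > 0) {
                byCount[m].erase(symbol);             // remove from set m
                if (byCount[m].empty()) byCount.erase(m);
            }
            byCount[m + 1].insert(symbol);            // insert into set m+1
            countOf[symbol] = m + 1;
        }
    };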

So we have a tree whose leaves are sets containing a certain number of symbols. Each set corresponds to a class of symbols with a given observed frequency (as opposed to a probability). The path from the root of the tree to a leaf is the prefix part of the code; the position of a symbol within its set gives the suffix part. The suffix is ceil(log2|S|) bits long, where |S| is the size of set S. This assigns every symbol in set S a ceil(log2|S|)-bit number, from 0 to |S|-1. Since all symbols within a set have the same observed frequency, the suffix length is at worst one bit longer than optimal.
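The suffix length is simply the smallest number of bits that can distinguish |S| values. A tiny helper (a sketch, not the article's code) makes this concrete: a set of six symbols, for example, gets 3-bit suffixes numbered 0 through 5.

    #include <cstdint>

    // Smallest b such that 2^b >= setSize, i.e. ceil(log2 |S|) for |S| >= 1.
    unsigned suffixBits(uint64_t setSize) {
        unsigned bits = 0;
        while ((uint64_t(1) << bits) < setSize) ++bits;
        return bits;
    }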

To ensure that a leaf gets the correct prefix code length, we "rebalance" the tree. Rebalancing tries to put a leaf at its correct depth according to its weight. The weight of a leaf is the number of symbols in the associated set times the frequency of that set; that is, the weight is m|S| for a set S whose symbols have each been seen m times. The weight of an internal node is the sum of the weights of its two sons. Rebalancing exchanges a leaf with its uncle whenever the leaf becomes at least twice as heavy as the uncle. Figure 3 shows how this affects the tree (this will probably remind you of AVL trees, which rebalance themselves to keep their depth uniform and search times short). We continue exchanging until the node is no longer twice as heavy as its uncle or until it reaches the root of the tree. Example 2 details this algorithm.
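A C++ sketch of the exchange loop follows, assuming a node layout with parent pointers and a weight field (this is an illustration of the rule just described, not the article's Example 2):

    // Weight is m|S| for a leaf and the sum of the children's weights otherwise.
    struct TreeNode {
        TreeNode *parent = nullptr, *left = nullptr, *right = nullptr;
        long weight = 0;
    };

    // While 'node' is at least twice as heavy as its uncle, swap the two
    // subtrees, moving 'node' one level closer to the root each time.
    void rebalance(TreeNode *node) {
        while (node->parent && node->parent->parent) {
            TreeNode *parent = node->parent;
            TreeNode *grand  = parent->parent;
            TreeNode *uncle  = (grand->left == parent) ? grand->right : grand->left;
            if (node->weight < 2 * uncle->weight) break;       // light enough: stop
            // Exchange the positions of 'node' and its uncle.
            if (parent->left == node)  parent->left  = uncle; else parent->right = uncle;
            if (grand->left  == uncle) grand->left   = node;  else grand->right  = node;
            uncle->parent = parent;
            node->parent  = grand;
            // Only 'parent' now covers a different set of leaves; fix its weight.
            parent->weight += uncle->weight - node->weight;
        }
    }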

In Figure 3, subtree A gets one level closer to the root and therefore gets a shorter prefix code. According to Shannon's formula, the code is one bit shorter for a symbol that is twice as probable as another. Here, a set that is twice as heavy (that is, probable) gets a prefix one bit shorter than the other two sets. A set is therefore maintained at approximately the appropriate depth, according to its weight. Figure 4 shows how we construct the tree starting from one initial leaf containing all the possible symbols, with the associated codes on the right. Eventually, the tree grows and nodes move up and down as rebalancing exchanges nodes at different levels in the tree.

To encode a file, read one symbol at a time. The prefix is the path in the tree from the root to the leaf corresponding to the set where the symbol is. The suffix is simply the position of the symbol in the set. Simply write the prefix and suffix to the output stream.

Symbols are decoded by reading the prefix bits. A "1" bit moves us down to the left in the tree, a "0" bit moves us down to the right. When we reach a leaf, we know the size of the set, so we can read the whole suffix as one chunk of bits and convert it to an index into this set. The index identifies the symbol being decoded. This is faster than typical Huffman implementations, which must read one bit at a time for the entire code.
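Under hypothetical node and bit-reader types (not the article's classes), the decoding loop might be sketched as follows: follow the prefix bits to a leaf, then read the whole suffix at once and use it as an index.

    #include <cstdint>
    #include <vector>

    struct DecodeNode {
        DecodeNode *left = nullptr, *right = nullptr;   // both null at a leaf
        std::vector<uint32_t> symbols;                  // leaf: the set, in order
    };

    struct BitReader {
        std::vector<int> bits;                          // 0/1 values
        size_t pos = 0;
        int next() { return bits[pos++]; }
    };

    // Decode one symbol: "1" goes left, "0" goes right, then ceil(log2 |S|)
    // suffix bits select a member of the leaf's set.
    uint32_t decodeSymbol(const DecodeNode *root, BitReader &in) {
        const DecodeNode *n = root;
        while (n->left) n = in.next() ? n->left : n->right;         // prefix
        size_t size = n->symbols.size();
        unsigned width = 0;
        while ((size_t(1) << width) < size) ++width;                // suffix length
        uint32_t index = 0;
        for (unsigned i = 0; i < width; ++i)
            index = (index << 1) | uint32_t(in.next());             // suffix bits
        return n->symbols[index];
    }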

After each symbol is encoded (at compression time) or decoded (at decompression time) we call the Update procedure in Example 1, which migrates the most recent symbol and rebalances the tree.

Implementation Details

Examples 1 and 2 look simple, but you have to be careful about the details if you want a fast, efficient implementation. The first such detail is the internal representation of the sets. Although the number of observed symbols will be small compared to the range, we do not want a set representation that always eats up one int per symbol. Our implementation of sets uses a linked list of ordered intervals. That is, if a set contains the numbers 1 to 10, we represent it in memory by {1...10} rather than {1,2,3,4,5,6,7,8,9,10}. Assuming 32-bit integers, the first representation uses two ints, plus a counter and a pointer, for a grand total of 16 bytes. The second uses 10 ints, plus 10 pointers, for a total of 80 bytes. Storing intervals instead of individual values also lets you search much faster for a number in the set. An alternative would be to use bit maps, which work well for scattered sets such as {2,3,5,7,11,13,17,23} that do not collapse into intervals, as long as the range isn't too large.
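A small C++ sketch of the interval idea (the layout here is invented and the article's implementation may differ): the set {1,...,10} is stored as the single run [1,10], and a symbol's 0-based position within the set, which serves as the code suffix, is found by walking the runs.

    #include <cstdint>
    #include <list>

    struct IntervalSet {
        struct Run { uint32_t lo, hi; };   // inclusive bounds, e.g. {1,10}
        std::list<Run> runs;               // kept sorted and disjoint
        long count = 0;                    // total number of symbols in the set

        // 0-based position of 'value' within the set, or -1 if absent.
        long indexOf(uint32_t value) const {
            long index = 0;
            for (const Run &r : runs) {
                if (value < r.lo) return -1;                     // runs are sorted
                if (value <= r.hi) return index + long(value - r.lo);
                index += long(r.hi - r.lo) + 1;
            }
            return -1;
        }
    };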

The second detail involves locating symbols in the tree. Each time you read a symbol, you must find that symbol and migrate it to another set. We use a sorted array to store this information and a binary search to locate items. With this structure, adding an item is relatively expensive -- proportional to the number of symbols already in the table -- but each symbol is added to the table only once, when it is first seen.
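A sketch of that lookup table (the names and the leaf type are placeholders, not the article's classes):

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Leaf;                                            // the tree leaf holding a set

    struct SymbolIndex {
        std::vector<std::pair<uint32_t, Leaf*>> entries;    // kept sorted by symbol

        std::vector<std::pair<uint32_t, Leaf*>>::iterator position(uint32_t symbol) {
            return std::lower_bound(entries.begin(), entries.end(), symbol,
                [](const std::pair<uint32_t, Leaf*> &e, uint32_t s) { return e.first < s; });
        }

        Leaf *find(uint32_t symbol) {                        // binary search, O(log n)
            auto it = position(symbol);
            return (it != entries.end() && it->first == symbol) ? it->second : nullptr;
        }

        void add(uint32_t symbol, Leaf *leaf) {              // O(n), but once per symbol
            entries.insert(position(symbol), {symbol, leaf});
        }
    };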

This algorithm does well on many data sources, including text and Unicode text (16-bit character codes). If you have a good idea of the statistical behavior of your data, you can prime the algorithm by starting with a particular tree. The tree in Figure 5, for example, was used to initialize an image compression module. This gave much better results than the simple initial tree. If you start with a different initial configuration, you can either transmit the information about the initial tree, or you can build it into both your coder and decoder.

Conclusion

The codes given by our algorithm are at worst two bits longer than the optimal code lengths given by Shannon's formula. The suffix part is no more than one bit longer than necessary, since ceil(log2|S|) is less than log2|S| + 1. Similarly, the prefix part can be at most one bit longer than optimal. These two bits are, however, rarely lost at the same time. In practice, we observed that the average length of the code generated by this algorithm is very close to the entropy given by Shannon's formula.

C++ source code, executables for Windows NT (compiled for a PentiumPro/II), and the demos are available electronically; see "Resource Center," page 3. As it comes, the codec is entirely encapsulated in an object, so it is easy to use more than one codec simultaneously in a program. That matters because we often have to code things with different characteristics in the same file, and it would be inefficient to use the same codec for everything. We have also included two demo programs -- one to compress (and uncompress) files as eight-bit symbols, the other as 16-bit symbols. The source compiles correctly on a number of platforms, such as SGI IRIX, SunOS, and Solaris, so you shouldn't have serious problems compiling it elsewhere.

DDJ


Copyright © 1998, Dr. Dobb's Journal
