
LZW Data Compression


The Implementation Blues

The concepts used in the compression algorithm are so simple that the whole algorithm can be expressed in only a dozen lines. But because of the management required for the string table, implementation of this algorithm is somewhat more complicated.
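For reference, those dozen lines look something like this. This is only a sketch, not Listing One itself; find_in_table, code_for, and add_to_table are hypothetical stand-ins for the string-table machinery discussed below:

    string = getc(input);                          /* first character starts STRING  */
    while ((character = getc(input)) != EOF) {
        if (find_in_table(string, character))      /* is STRING+CHARACTER known?     */
            string = code_for(string, character);  /* yes: extend STRING             */
        else {
            output_code(output, string);           /* no: emit the code for STRING,  */
            add_to_table(string, character);       /* define STRING+CHARACTER,       */
            string = character;                    /* and start over at CHARACTER    */
        }
    }
    output_code(output, string);                   /* flush the final string */

All of the real work hides inside the table routines.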

In Listing One, I have used code sizes of 12, 13, and 14 bits. In a 12-bit program, there are potentially 4096 strings in the string table. Every time a new character is read in, the string table has to be searched for a match. If a match is not found, a new string has to be added to the table. This causes two problems. First, the string table can get very large very fast. Even if string lengths average as low as 3 or 4 characters each, the overhead of storing a variable-length string and its code can easily reach 7 or 8 bytes per code. In addition, the amount of storage needed is indeterminate, since it depends on the total length of all the strings.

The second problem involves searching for strings. Each time a new character is read in, the algorithm has to search for the new string formed by STRING+CHARACTER. With a sorted list of strings and a binary search, finding a string takes on the order of log2 N comparisons, where N is the number of strings in the table. With 12-bit codes, that potentially means 12 string comparisons for every input character. The computational overhead can be prohibitive.

The first problem can be solved by storing the strings as code/character combinations. Because every string is actually a combination of an existing code and an appended character, you can store each string as a single code plus a character. For example, in the compression example shown, the string /WEE is actually stored as code 260 with appended character E. This takes only 3 bytes of storage instead of 5 (counting the string terminator). By back-tracking, you find that code 260 is stored as code 256 plus an appended character E. Finally, code 256 is stored as the character / plus a W.

Doing the string comparisons is a little more difficult. The new method of storage reduces the time needed for each string comparison, but it doesn't reduce the number of comparisons needed to find a match. This problem is solved by using a hashing algorithm to store strings. What this means is that you don't store code 256 in location 256 of an array; you store it in a location in the array based on an address formed by the string itself. When you are trying to locate a given string, you can use the test string to generate a hashed address and, with luck, find the target string in one search.

Because the code for a given string is no longer known merely by its position in the array, you need to store the code for a given string along with the string data. In Listing One, there are three array elements for each string: code_value[i], prefix_code[i], and append_character[i].
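A sketch of the declarations this implies, assuming 12-bit codes and a prime table about 25 percent larger than 4096 (5021 is one such prime; see the discussion of the hashing function below):

    #define BITS       12
    #define MAX_VALUE  ((1 << BITS) - 1)   /* 4095: the largest 12-bit code       */
    #define TABLE_SIZE 5021                /* prime, roughly 25% larger than 4096 */

    int           code_value[TABLE_SIZE];        /* code assigned to the string here */
    unsigned int  prefix_code[TABLE_SIZE];       /* code of the string's prefix      */
    unsigned char append_character[TABLE_SIZE];  /* the appended character           */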

When you want to add a new code to the table, the routine find_match generates the correct i. It hashes the prefix code and character into an address, then checks to see whether that location is already in use by a different string. If it is, find_match performs a secondary probe until an open location is found.

The hashing function used in this program is a straightforward xor-type hash. The prefix code and appended character are combined to form an array address. If the prefix code and character stored at that address match the ones being looked up, the address is returned. If the element is in use by a different string, a fixed-offset probe searches new locations until either an empty slot or a match is found. The average number of searches usually stays below 3 if the table is about 25 percent larger than needed, and performance can be improved further by increasing the table size. Note that for the secondary probe to work, the size of the table needs to be a prime number. The probe offset can be any integer between 1 and the table size; if the offset and the table size share a common factor, a search for an open slot can fail even when open slots are still available.
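A sketch of find_match along these lines, assuming the declarations above and a HASHING_SHIFT of BITS - 8 to fold the character into the address (an empty slot is assumed to be marked by a code_value of -1):

    #define HASHING_SHIFT (BITS - 8)

    /* Hash the prefix code and appended character into a table address.
       Return the slot if it is empty or already holds this exact string;
       otherwise re-probe at a fixed offset until one or the other turns up. */
    int find_match(int hash_prefix, unsigned int hash_character)
    {
        int index  = (hash_character << HASHING_SHIFT) ^ hash_prefix;
        int offset = (index == 0) ? 1 : TABLE_SIZE - index;

        for (;;) {
            if (code_value[index] == -1)           /* empty: string not in table */
                return index;
            if (prefix_code[index] == (unsigned) hash_prefix &&
                append_character[index] == hash_character)
                return index;                      /* found the string */
            index -= offset;                       /* secondary probe */
            if (index < 0)
                index += TABLE_SIZE;
        }
    }

Because TABLE_SIZE is prime, the fixed offset eventually visits every slot, so the loop terminates as long as at least one slot stays open.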

Implementing the decompression algorithm has its own set of problems. One of the problems from the compression code goes away. When you are compressing, you need to search the table for a given string. During decompression, you are looking for a particular code. This means that you can store the prefix codes and appended characters in the table indexed by their string code. This eliminates the need for a hashing function and frees up the array used to store code values.
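A sketch of how the expander defines each new string, assuming the prefix_code and append_character arrays described earlier, now indexed directly by code, and a global next_code counter:

    /* Define string next_code as: the string for old_code plus the first
       character of the string just decoded.  No hashing is needed, since
       the table is indexed directly by code. */
    void define_string(unsigned int old_code, unsigned char character)
    {
        prefix_code[next_code]      = old_code;
        append_character[next_code] = character;
        next_code++;
    }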

Unfortunately, the method used to store string values causes the strings to be decoded in reverse order. This means that all the characters for a given string have to be pushed into a stack buffer, then output in reverse order. In the program given here, this is done in the decode_string function. Once this routine is written, the rest of the algorithm translates into code easily.
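Here is a sketch of decode_string, assuming codes below 256 represent single raw characters and that the caller supplies a buffer large enough for the longest possible string:

    /* Follow a code back through the table, storing each appended
       character as we go; the result is the string in reverse order.
       Returns a pointer to the last character stored. */
    unsigned char *decode_string(unsigned char *buffer, unsigned int code)
    {
        while (code > 255) {                /* a defined string: walk the chain  */
            *buffer++ = append_character[code];
            code = prefix_code[code];
        }
        *buffer = (unsigned char) code;     /* the root is a single raw character */
        return buffer;
    }

The caller then writes characters from the returned pointer back down to the start of the buffer, which restores the original order.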

One problem when reading in data streams is determining when you have reached the end of the input. In this particular implementation, I have reserved the last defined code, MAX_VALUE, as a special end-of-data indicator. Though this may not be necessary when reading data files, it is helpful when reading compressed buffers out of memory. The expense of losing one defined code is minimal compared to the convenience gained.
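In the expansion loop this costs one comparison per code. A sketch, assuming an input_code routine that returns the next code from the compressed stream:

    while ((new_code = input_code(input)) != MAX_VALUE) {
        /* ... decode and output the string for new_code ... */
    }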

Results

It is somewhat difficult to characterize the results of any data compression technique. The level of compression achieved varies quite a bit, depending on several factors. LZW compression excels when confronted with data streams that have any type of repeated strings. Because of this, it does extremely well when compressing English text. Compression levels of 50 percent or better can be expected. Likewise, compressing saved screens and displays generally shows great results. Trying to compress data files is a little more risky. Depending on the data, compression may or may not yield good results. In some cases, data files compress even more than text. A little bit of experimentation usually gives you a feel for whether your data will compress well or not.

Your Implementation

The code accompanying this article works, but it was written with the goal of being illuminating rather than efficient, and some parts of it are decidedly slow. The variable-length input and output routines, for example, are short and easy to understand, but carry a lot of overhead. An LZW program using fixed-length 12-bit codes could see real improvements in speed just by recoding these two routines.
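As an illustration, here is a sketch of a simple variable-length output routine of the sort used in Listing One, assuming unsigned longs of at least 32 bits and the BITS constant from earlier (a real program must also flush the last partial byte when the stream ends):

    #include <stdio.h>

    /* Pack BITS-wide codes into the output stream, high-order bits first. */
    void output_code(FILE *output, unsigned int code)
    {
        static unsigned long bit_buffer = 0L;
        static int bit_count = 0;

        bit_buffer |= (unsigned long) code << (32 - BITS - bit_count);
        bit_count += BITS;
        while (bit_count >= 8) {            /* emit whole bytes as they fill */
            putc((int) (bit_buffer >> 24), output);
            bit_buffer <<= 8;
            bit_count -= 8;
        }
    }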

One problem with the code listed here is that it does not adapt well to compressing files of differing sizes. Using 14- or 15-bit codes gives better compression ratios on large files (because they have a larger string table to work with), but poorer performance on small files. Programs such as ARC get around this problem by using variable length codes. For example, when the value of next_code is between 256 and 511, ARC inputs and outputs 9-bit codes. When the value of next_code increases to the point where 10-bit codes are needed, both the compression and decompression routines adjust the code size. This means that the 12-bit and 15-bit versions of the program do equally well on small files.
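A sketch of the test such a program applies after each new string is defined, assuming current_bits starts at 9 and MAX_BITS caps the table size (the exact trigger point is a matter of convention, but the compressor and expander must use the same one or they fall out of step):

    /* Grow the code size once next_code no longer fits in current_bits. */
    if (next_code > ((1 << current_bits) - 1) && current_bits < MAX_BITS)
        current_bits++;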

Another problem on long files is that frequently the compression ratio begins to degrade as more of the file is read in. The reason for this is simple: Because the string table is of finite size, after a certain number of strings have been defined, no more can be added. But the string table is good only for the portion of the file that was read in while it was built. Later sections of the file may have different characteristics and really need a different string table.

The conventional way to solve this problem is to monitor the compression ratio. After the string table fills, the compressor watches to see whether the ratio degrades. After a certain amount of degradation, the entire table is flushed and rebuilt from scratch; the compression routine signals this to the expansion code by sending out a special code. An alternative is to keep track of how frequently strings are used and periodically flush values that are rarely used. An adaptive technique like this may be too difficult to implement in a reasonably sized program.
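One way to arrange the flush is sketched below, reserving a second special code in the same spirit as the MAX_VALUE end-of-data indicator. FLUSH_CODE and the helper routines here are hypothetical, not part of Listing One:

    #define FLUSH_CODE (MAX_VALUE - 1)   /* hypothetical: signals a table rebuild */

    /* Compressor side: once the table is full, watch the running ratio. */
    if (table_full && compression_degrading()) {
        output_code(output, FLUSH_CODE); /* tell the expander to do the same */
        initialize_table();              /* flush and rebuild from scratch   */
        table_full = 0;
    }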

One final technique is to take the LZW codes and run them through an adaptive Huffman coding filter. This generally squeezes out a few more percentage points of compression, at the cost of considerably more complexity in the code and quite a bit more run time.

Portability

The code in Listing One was written and tested on MS-DOS machines and has successfully compiled and executed with several C compilers. It should be portable to any machine that supports 16-bit integers and 32-bit longs in C. MS-DOS C compilers typically have trouble dealing with arrays larger than 64K bytes, preventing an easy implementation of 15- or 16-bit codes in this program. On machines using different processors, such as the VAX, these restrictions are lifted, and using larger code sizes becomes much easier.

In addition, porting this code to assembly language should be fairly easy on any machine that supports 16- and 32-bit math, and should offer significant performance improvements. Implementations in other high-level languages should be straightforward.

Bibliography

Welch, Terry. "A Technique for High-Performance Data Compression." Computer, June 1984.

Ziv, J., and A. Lempel. "A Universal Algorithm for Sequential Data Compression." IEEE Transactions on Information Theory, May 1977.

Rucker, Rudy. Mind Tools. Boston, Mass.: Houghton Mifflin Company, 1987.


Mark is a programmer for Greenleaf Software, Inc., Dallas, Texas.

