Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Design

LZW Data Compression


Compression

The LZW compression algorithm in its simplest form is shown in Figure 1. Each time a new code is generated, it means a new string has been added to the string table. Examination of the algorithm shows that LZW always checks whether the strings are already known and, if so, outputs existing codes rather than generating new codes.

Figure 1: The compression algorithm

  ROUTINE LZW_COMPRESS
  STRING = get input character
  WHILE there are still input characters DO
    CHARACTER = get input character
    IF STRING+CHARACTER is in the string table THEN
        STRING = STRING+character
    ELSE
      output the code for STRING
      add STRING+CHARACTER to the string table
      STRING = CHARACTER
    END of IF
  END of WHILE
  output the code for STRING

A sample string used to demonstrate the algorithm is shown in Figure 2. The input string is a short list of English words separated by the / character. As you step through the start of the algorithm for this string, you can see that in the first pass through the loop the system performs a check to see if the string /W is in the table. When it doesn't find the string in the table, it generates the code for /, and the string /W is added to the table. Because 256 characters have already been defined for codes 0 - 255, the first string definition can be assigned to code 256. After the system reads in the third letter, E, the second string code, WE, is added to the table, and the code for letter W is output. This process continues until, in the second word, the characters / and W are read in, matching string number 256. In this case, the system outputs code 256, and adds a three-character string to the string table. The process again continues until the string is exhausted and all of the codes have been output.

Figure 2: The compression process

  Input string:/WED/WE/WEE/WEB/WET

  Character input  Code output  New code value and associated string
  ------------------------------------------------------------------

   /W                /                     256 = /W
   E                W                      257 = WE
   D                E                      258 = ED
   /                D                      259 = D/
   WE               256                    260 = /WE
   /                E                      261 = E/
   WEE              260                    262 = /WEE
   /W               261                    263 = E/W
   EB               257                    264 = WEB
   /                B                      265 = B/
   WET              260                    266 = /WET
   <EOF>            T

The sample output for the string is also shown in Figure 2, along with the resulting string table. As you can see, the string table fills up rapidly, because a new string is added to the table each time a code is generated. In this highly redundant example input, five code substitutions were output, along with seven characters. If we were using 9-bit codes for output, the 19-character input string would be reduced to a 13.5-byte output string. Of course, this example was carefully chosen to demonstrate code substitution. In the real world, compression usually doesn't begin until a sizable table has been built, usually after at least 100 or so bytes have been read in.

Decompression

The companion algorithm for compression is the decompression algorithm. It takes the stream of codes output from the compression algorithm and uses it to exactly recreate the input stream. One reason for the efficiency of the LZW algorithm is that it does not need to pass the string table to the decompression code. The table can be built exactly as it occurred during compression, using the input stream as data. This is possible because the compression algorithm always outputs the STRING and CHARACTER components of a code before it uses the code in the output stream. This means that the compressed data is not burdened with carrying a large string translation table.

The decompression algorithm is shown in Figure 3. Just like the compression algorithm, it adds a new string to the string table each time it reads in a new code. All it needs to do in addition is to translate each incoming code into a string and send it to the output.

Figure 3: The decompression algorithm.

  ROUTINE LZW_DECOMPRESS
  Read OLD_CODE
  output OLD_CODE
  WHILE there are still input characters DO
    Read NEW_CODE
    STRING = get translation of NEW_CODE
    output STRING
    CHARACTER = first character in STRING
    add OLD_CODE + CHARACTER to the translation table
    OLD_CODE = NEW_CODE
    END of WHILE

Figure 4 shows the output of the algorithm given the input created by the compression discussed earlier in the article. The important thing to note is that the decompression string table ends up looking exactly like the table built up during compression. The output string is identical to the input string from the compression algorithm. Note that the first 256 codes are already defined to translate to single character strings, just like in the compression code.

Figure 4: The decompression process

  Input codes:/ W E D 256 E 260 261 257 B 260 T

  Input     OLD_CODE  STRING  CHARACTER  New table entry
  NEW_CODE              Output
  ------------------------------------------------------

    /         /        /
    W         /        W         W          256 = /W
    E         W        E         E          257 = WE
    D         E        D         D          258 = ED
    256       D        /W        /          259 = D/
    E         256      E         E          260 = /WE
    260       E        /WE       /          261 = E/
    261       260      E/        E          262 = /WEE
    257       261      WE        W          263 = E/W
    B         257      B         B          264 = WEB
    260       B        /WE       /          265 = B/
    T         260      T         T          266 = /WET

The Catch

Unfortunately, the nice, simple, decompression algorithm shown in Figure 4 is just a little too simple. There is a single exception case in the LZW compression algorithm that causes some trouble on the decompression side. If there is a string consisting of a (STRING, CHARACTER) pair already defined in the table, and the input stream sees a sequence of STRING, CHARACTER, STRING, CHARACTER, STRING, the compression algorithm outputs a code before the decompressor gets a chance to define it.

A simple example illustrates the point. Imagine the string JOEYN is defined in the table as code 300. When the sequence JOEYNJOEYNJOEY appears in the table, the compression output looks like that shown in Figure 5.

Figure 5: Sample problem

  Input string:... JOEYNJOEYNJOEY ...

  Character input  New code value and associated string  Code output
  ------------------------------------------------------------------

  JOEYN                   300 = JOEYN                  288 (JOEY)
  A                       301 = NA                     N
    .                           .                        .
    .                           .                        .
    .                           .                        .
  JOEYNJ                  400 = JOEYNJ                 300 (JOEYN)
  JOEYNJO                 401 = JOEYNJO                400

When the decompression algorithm sees this input stream, it first decodes the code 300, then outputs the JOEYN string and adds the definition for, lets say, code 399 to the table, whatever that may be. It then reads the next input code, 400, and finds that it is not in the table. This is a problem.

Fortunately, this is the only case where the decompression algorithm will encounter an undefined code. Because it is, in fact, the only case, you can add an exception handler to the algorithm. The modified algorithm just looks for the special case of an undefined code and handles it. In the example in Figure 6, the decompression routine sees code 400, which is undefined. Because it is undefined, it translates the value of OLD_CODE, which is code 300. It then adds the CHARACTER value, J, to the string. This results in the correct translation of code 400 to string JOEYNJ.

Figure 6: The modified decompression algorithm.

  ROUTINE LZW_DECOMPRESS
  Read OLD_CODE
  output OLD_CODE
  WHILE there are still input characters DO
    Read NEW_CODE
    IF NEW_CODE is not in the translation table THEN
       STRING = get translation of OLD_CODE
       STRING = STRING+CHARACTER
    ELSE
       STRING = get translation of NEW_CODE
    END of IF
    output STRING
    CHARACTER = first character in STRING
    add OLD_CODE + CHARACTER to the translation table
    OLD_CODE = NEW_CODE
  END of WHILE


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.