Compression
The LZW compression algorithm in its simplest form is shown in Figure 1. Each time a new code is generated, it means a new string has been added to the string table. Examination of the algorithm shows that LZW always checks whether the strings are already known and, if so, outputs existing codes rather than generating new codes.
Figure 1: The compression algorithm
ROUTINE LZW_COMPRESS STRING = get input character WHILE there are still input characters DO CHARACTER = get input character IF STRING+CHARACTER is in the string table THEN STRING = STRING+character ELSE output the code for STRING add STRING+CHARACTER to the string table STRING = CHARACTER END of IF END of WHILE output the code for STRING
A sample string used to demonstrate the algorithm is shown in Figure 2. The input string is a short list of English words separated by the / character. As you step through the start of the algorithm for this string, you can see that in the first pass through the loop the system performs a check to see if the string /W is in the table. When it doesn't find the string in the table, it generates the code for /, and the string /W is added to the table. Because 256 characters have already been defined for codes 0 - 255, the first string definition can be assigned to code 256. After the system reads in the third letter, E, the second string code, WE, is added to the table, and the code for letter W is output. This process continues until, in the second word, the characters / and W are read in, matching string number 256. In this case, the system outputs code 256, and adds a three-character string to the string table. The process again continues until the string is exhausted and all of the codes have been output.
Figure 2: The compression process
Input string:/WED/WE/WEE/WEB/WET Character input Code output New code value and associated string ------------------------------------------------------------------ /W / 256 = /W E W 257 = WE D E 258 = ED / D 259 = D/ WE 256 260 = /WE / E 261 = E/ WEE 260 262 = /WEE /W 261 263 = E/W EB 257 264 = WEB / B 265 = B/ WET 260 266 = /WET <EOF> T
The sample output for the string is also shown in Figure 2, along with the resulting string table. As you can see, the string table fills up rapidly, because a new string is added to the table each time a code is generated. In this highly redundant example input, five code substitutions were output, along with seven characters. If we were using 9-bit codes for output, the 19-character input string would be reduced to a 13.5-byte output string. Of course, this example was carefully chosen to demonstrate code substitution. In the real world, compression usually doesn't begin until a sizable table has been built, usually after at least 100 or so bytes have been read in.
Decompression
The companion algorithm for compression is the decompression algorithm. It takes the stream of codes output from the compression algorithm and uses it to exactly recreate the input stream. One reason for the efficiency of the LZW algorithm is that it does not need to pass the string table to the decompression code. The table can be built exactly as it occurred during compression, using the input stream as data. This is possible because the compression algorithm always outputs the STRING and CHARACTER components of a code before it uses the code in the output stream. This means that the compressed data is not burdened with carrying a large string translation table.
The decompression algorithm is shown in Figure 3. Just like the compression algorithm, it adds a new string to the string table each time it reads in a new code. All it needs to do in addition is to translate each incoming code into a string and send it to the output.
Figure 3: The decompression algorithm.
ROUTINE LZW_DECOMPRESS Read OLD_CODE output OLD_CODE WHILE there are still input characters DO Read NEW_CODE STRING = get translation of NEW_CODE output STRING CHARACTER = first character in STRING add OLD_CODE + CHARACTER to the translation table OLD_CODE = NEW_CODE END of WHILE
Figure 4 shows the output of the algorithm given the input created by the compression discussed earlier in the article. The important thing to note is that the decompression string table ends up looking exactly like the table built up during compression. The output string is identical to the input string from the compression algorithm. Note that the first 256 codes are already defined to translate to single character strings, just like in the compression code.
Figure 4: The decompression process
Input codes:/ W E D 256 E 260 261 257 B 260 T Input OLD_CODE STRING CHARACTER New table entry NEW_CODE Output ------------------------------------------------------ / / / W / W W 256 = /W E W E E 257 = WE D E D D 258 = ED 256 D /W / 259 = D/ E 256 E E 260 = /WE 260 E /WE / 261 = E/ 261 260 E/ E 262 = /WEE 257 261 WE W 263 = E/W B 257 B B 264 = WEB 260 B /WE / 265 = B/ T 260 T T 266 = /WET
The Catch
Unfortunately, the nice, simple, decompression algorithm shown in Figure 4 is just a little too simple. There is a single exception case in the LZW compression algorithm that causes some trouble on the decompression side. If there is a string consisting of a (STRING, CHARACTER) pair already defined in the table, and the input stream sees a sequence of STRING, CHARACTER, STRING, CHARACTER, STRING, the compression algorithm outputs a code before the decompressor gets a chance to define it.
A simple example illustrates the point. Imagine the string JOEYN is defined in the table as code 300. When the sequence JOEYNJOEYNJOEY appears in the table, the compression output looks like that shown in Figure 5.
Figure 5: Sample problem
Input string:... JOEYNJOEYNJOEY ... Character input New code value and associated string Code output ------------------------------------------------------------------ JOEYN 300 = JOEYN 288 (JOEY) A 301 = NA N . . . . . . . . . JOEYNJ 400 = JOEYNJ 300 (JOEYN) JOEYNJO 401 = JOEYNJO 400
When the decompression algorithm sees this input stream, it first decodes the code 300, then outputs the JOEYN string and adds the definition for, lets say, code 399 to the table, whatever that may be. It then reads the next input code, 400, and finds that it is not in the table. This is a problem.
Fortunately, this is the only case where the decompression algorithm will encounter an undefined code. Because it is, in fact, the only case, you can add an exception handler to the algorithm. The modified algorithm just looks for the special case of an undefined code and handles it. In the example in Figure 6, the decompression routine sees code 400, which is undefined. Because it is undefined, it translates the value of OLD_CODE, which is code 300. It then adds the CHARACTER value, J, to the string. This results in the correct translation of code 400 to string JOEYNJ.
Figure 6: The modified decompression algorithm.
ROUTINE LZW_DECOMPRESS Read OLD_CODE output OLD_CODE WHILE there are still input characters DO Read NEW_CODE IF NEW_CODE is not in the translation table THEN STRING = get translation of OLD_CODE STRING = STRING+CHARACTER ELSE STRING = get translation of NEW_CODE END of IF output STRING CHARACTER = first character in STRING add OLD_CODE + CHARACTER to the translation table OLD_CODE = NEW_CODE END of WHILE