Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Parallel

LZW Revisited


JUN90: LZW REVISITED

Shawn is a programmer/analyst for MicroBilt Inc. of Atlanta, Georgia. You can reach him through Interlink; or write him at 2127B Powers Ferry Rd., Marietta, GA 30067.


When the October 1989 issue of Dr. Dobb's came out, I was delighted to see Mark Nelson's article on LZW data compression ("LZW Data Compression"). Mark presented a clear description with code of a basic LZW compression program. As I used the program, however, I discovered that for some larger files, even when using 14-bit codes, the compressed file would actually be larger than the original. Because Mark's intention was to enlighten and not to complicate the subject, his code omitted some optimizing additions to LZW compression. Although, he did describe some optimization techniques -- including the use of variable code size -- as well as clearing the string table after the compression ratio degrades.

The original program's lack of performance on larger files can be traced to the fixed-length string table. When the table is full, new codes cannot be added and must be sent out in character form with no compression done. When this happens, you could actually be sending out an 8-bit character using a 14-bit code; this explains how the file can get larger. If your incoming data changes, even moderately, with the string table full, then the compression ratio begins to degrade rapidly.

With this in mind, imagine compressing a small file with your code size set at 14 bits. If your string table needs less than 511 entries, you could have used 9-bit codes saving 5 bits per output code. Of course you wouldn't want to fix your code size at 9 bits because the compression on larger files would suffer. What you would like is for the code size to start at 9 bits and if that table filled, the code size could increment to 10, thus providing optimal performance on any size file.

My Implementation

One of the more elegant features about LZW compression are that the compression and expansion programs build the exact same string table for a particular file. This means at the same point the compression program's string table is full so is the expansion program's. At this point you should try to adjust your code size. As you can see in the compression section of Listing One (page 127), I wait until after I have sent out the current code before the code size is incremented because the current code belongs to the previous code size. Notice in the compression section I increment when the code size is greater than max_code, while in the expansion program I increment when the code size equals max_code. This is because the expansion section is working a code behind using old_code instead of new_code. It is also because of this that I must handle a special case when incrementing the code size on an end-of-file condition.

Finally, you should also set some arbitrary limit on your code size. If you use 14 bits, your codes stay well under the positive integer maximum of 32767, which suits the program without any modification. Don't forget your table size needs to be a prime number somewhat larger than 2AX_CODE_SIZE. To implement the table clearing, start by monitoring the number of bytes (not codes) read in and then sent out. After a predetermined interval, compute the new compression ratio and check it against the previous one. If the ratio has increased, you need to clear out the string table and start over. You then need a device to send a signal from the compression program to the expansion program to clear the string table. The easiest way to accomplish this is by reserving the first of the 9-bit codes. In my example, I used 256 as the CLEAR_TABLE code. I also used 257 as the TERMINATORto signal the end-of-file condition. This means the first available code for compression is now 258, which I've defined as FIRST_CODE.

When combining both methods, you should not experience any degradation in compression until the table is full. When the table is full, you will first check to see if you can increase the code size. If you can't, then (and only then) will you start to monitor your compression ratio at your predefined interval and ultimately clear the string table. When this happens, you can reset your code size back to 9-bits because basically you're starting from scratch. Although you still won't get performance as good as PKZIP (from PKWARE, Glendale, Wisc.), you now have the source for a much improved version of this basic LZW compression program. Table 1 lists some typical compression levels I've achieved with this program.

Table 1: Typical compression levels using revised LZW

   Before     After      % of Original         File type
---------------------------------------------------------------------

  115,094    40,636           35%          .C - C program
   11,054     4,811           43%          .C - C program
  230,582   141,659           61%          .EXE - Executable
   16,944    12,905           76%          .EXE - Executable
   90,610    20,806           22%          .TXT - Redundant text file
  110,592    64,804           58%          .LIB - C object library

Your Implementation

Even though the program works well as is, there are still some improvements that can be made. As Mark suggested, the input and output routines can be modified for more speed. Also, a more sophisticated hashing routine might speed it up. For better compression, you might experiment with table clearing. I found on .EXE files the compression ratio drops steadily after a code size increase, then bottoms out and then starts rising again. If you suspend clearing until you are back to just below the starting ratio you can get a somewhat better compression. I also noticed that in smaller text files, I can at times get better compression by clearing the table instead of increasing the code size. Be careful, however, about basing any optimization methods on any preanalysis of the data. If, for example, you wish to use it with stream I/O you will be working with buffers and not files where any preanalysis might be difficult or useless.

_LZW REVISITED_ by Shawn M. Regan

[LISTING ONE]

<a name="014d_0008">

/* Basic LZW Data Compression program published in DDJ October 1989 issue.
 * Original Author: Mark R. Nelson
 * Updated by: Shawn M. Regan, January 1990
 * Added: - Method to clear table when compression ratio degrades
 *        - Self adjusting code size capability (up to 14 bits)
 * Updated functions are marked with "MODIFIED". main() has been updated also
 * Compile with -ml (large model) for MAX_BITS == 14 only
 */

#include <stdio.h>

#define INIT_BITS 9
#define MAX_BITS 14           /* Do not exceed 14 with this program */
#define HASHING_SHIFT MAX_BITS - 8

#if MAX_BITS == 14            /* Set the table size. Must be a prime    */
#define TABLE_SIZE 18041      /* number somewhat larger than 2AX_BITS.*/
#elif MAX_BITS == 13
#define TABLE_SIZE 9029
#else
#define TABLE_SIZE 5021
#endif

#define CLEAR_TABLE 256    /* Code to flush the string table */
#define TERMINATOR  257    /* To mark EOF Condition, instead of MAX_VALUE */
#define FIRST_CODE  258    /* First available code for code_value table */
#define CHECK_TIME  100    /* Check comp ratio every CHECK_TIME chars input */

#define MAXVAL(n) (( 1 <<( n )) -1)   /* max_value formula macro */

unsigned input_code();
void *malloc();

int *code_value;                      /* This is the code value array */
unsigned int *prefix_code;            /* This array holds the prefix codes */
unsigned char *append_character;      /* This array holds the appended chars */
unsigned char decode_stack[4000];     /* This array holds the decoded string */

int num_bits=INIT_BITS;               /* Starting with 9 bit codes */
unsigned long bytes_in=0,bytes_out=0; /* Used to monitor compression ratio */
int max_code;                         /* old MAX_CODE */
unsigned long checkpoint=CHECK_TIME;  /* For compression ratio monitoring */

main(int argc, char *argv[])
{
   FILE *input_file, *output_file, *lzw_file;
   char input_file_name[81];
 /* The three buffers for the compression phase.  */
   code_value=malloc(TABLE_SIZE*sizeof(unsigned int));
   prefix_code=malloc(TABLE_SIZE*sizeof(unsigned int));
   append_character=malloc(TABLE_SIZE*sizeof(unsigned char));

   if (code_value==NULL || prefix_code==NULL || append_character==NULL) {
      printf("Error allocating table space!\n");
      exit(1);
   }
 /* Get the file name, open it, and open the LZW output file. */
   if (argc>1)
      strcpy(input_file_name,argv[1]);
   else {
      printf("Input file name: ");
      scanf("%s",input_file_name);
   }
   input_file=fopen(input_file_name,"rb");
   lzw_file=fopen("test.lzw","wb");
   if (input_file == NULL || lzw_file == NULL) {
      printf("Error opening files\n");
      exit(1);
   }
   max_code = MAXVAL(num_bits);     /* Initialize max_value & max_code */
   compress(input_file,lzw_file);       /* Call compression routine */

   fclose(input_file);
   fclose(lzw_file);
   free(code_value);                    /* Needed only for compression */

   lzw_file=fopen("test.lzw","rb");
   output_file=fopen("test.out","wb");
   if (lzw_file == NULL || output_file == NULL) {
      printf("Error opening files\n");
      exit(1);
   }
   num_bits=INIT_BITS;                  /* Re-initialize for expansion */
   max_code = MAXVAL(num_bits);
   expand(lzw_file,output_file);        /* Call expansion routine */

   fclose(lzw_file);                    /* Clean it all up */
   fclose(output_file);
   free(prefix_code);
   free(append_character);
}
/* MODIFIED This is the new compression routine. The first two 9-bit codes
 * have been reserved for communication between the compressor and expander.
 */
compress(FILE *input, FILE *output)
{
   unsigned int next_code=FIRST_CODE;
   unsigned int character;
   unsigned int string_code;
   unsigned int index;
   int i,             /* All purpose integer */
   ratio_new,         /* New compression ratio as a percentage */
   ratio_old=100;     /* Original ratio at 100% */

   for (i=0;i<TABLE_SIZE;i++)   /* Initialize the string table first */
      code_value[i]=-1;
   printf("Compressing\n");
   string_code=getc(input);     /* Get the first code */

 /* This is the main compression loop. Notice when the table is full we try
  * to increment the code size. Only when num_bits == MAX_BITS and the code
  * value table is full do we start to monitor the compression ratio.
  */
   while((character=getc(input)) != (unsigned)EOF) {
      if (!(++bytes_in % 1000)) {     /* Count input bytes and pacifier */
         putchar('.');
      }
      index=find_match(string_code,character);
      if (code_value[index] != -1)
         string_code=code_value[index];
      else {
         if (next_code <= max_code ) {
            code_value[index]=next_code++;
            prefix_code[index]=string_code;
            append_character[index]=character;
         }
         output_code(output,string_code);   /* Send out current code */
         string_code=character;
         if (next_code > max_code) {      /* Is table Full? */
            if ( num_bits < MAX_BITS) {     /* Any more bits? */
               putchar('+');
               max_code = MAXVAL(++num_bits);  /* Increment code size then */
            }
            else if (bytes_in > checkpoint) {         /* At checkpoint? */
               if (num_bits == MAX_BITS ) {
                ratio_new = bytes_out*100/bytes_in; /* New compression ratio */
                if (ratio_new > ratio_old) {        /* Has ratio degraded? */
                  output_code(output,CLEAR_TABLE); /* YES,flush string table */
                  putchar('C');
                  num_bits=INIT_BITS;
                  next_code=FIRST_CODE;        /* Reset to FIRST_CODE */
                  max_code = MAXVAL(num_bits); /* Re-Initialize this stuff */
                  bytes_in = bytes_out = 0;
                  ratio_old=100;               /* Reset compression ratio */
                  for (i=0;i<TABLE_SIZE;i++)   /* Reset code value array */
                        code_value[i]=-1;
               }
               else                                /* NO, then save new */
               ratio_old = ratio_new;            /* compression ratio */
            }
            checkpoint = bytes_in + CHECK_TIME;    /* Set new checkpoint */
            }
         }
      }
   }
   output_code(output,string_code);   /* Output the last code */
   if (next_code == max_code) {       /* Handles special case for bit */
      ++num_bits;                     /* increment on EOF */
      putchar('+');
   }
   output_code(output,TERMINATOR);    /* Output the end of buffer code */
   output_code(output,0);             /* Flush the output buffer */
   output_code(output,0);
   output_code(output,0);
   putchar('\n');
}
/* UNCHANGED from original
 * This is the hashing routine.
 */
find_match(int hash_prefix, unsigned int hash_character)
{
   int index, offset;

   index = (hash_character << HASHING_SHIFT ) ^ hash_prefix;
   if (index == 0 )
      offset=1;
   else
      offset = TABLE_SIZE - index;
   while(1) {
      if (code_value[index] == -1 )
         return(index);
      if (prefix_code[index] == hash_prefix &&
                                     append_character[index] == hash_character)
         return(index);
      index -= offset;
      if (index < 0)
         index += TABLE_SIZE;
   }
}
/* MODIFIED This is the modified expansion routine. It must now check for the
 * CLEAR_TABLE code and know when to increment the code size.
 */
expand(FILE *input, FILE *output)
{
   unsigned int next_code=FIRST_CODE;
   unsigned int new_code;
   unsigned int old_code;
   int character,
   counter=0,
   clear_flag=1;          /* Need to clear the code value array */
   unsigned char *string;
   char *decode_string(unsigned char *buffer, unsigned int code);

   printf("Expanding\n");

   while((new_code=input_code(input)) != TERMINATOR) {
      if (clear_flag) {       /* Initialize or Re-Initialize */
         clear_flag=0;
         old_code=new_code;   /* The next three lines have been moved */
         character=old_code;  /* from the original */
         putc(old_code,output);
         continue;
      }
      if (new_code == CLEAR_TABLE) {     /* Clear string table */
         clear_flag=1;
         num_bits=INIT_BITS;
         next_code=FIRST_CODE;
         putchar('C');
         max_code = MAXVAL(num_bits);
         continue;
      }
      if (++counter == 1000) {           /* Pacifier */
         counter=0;
         putchar('.');
      }
      if (new_code >= next_code) {       /* Check for string+char+string */
         *decode_stack=character;
         string=decode_string(decode_stack+1,old_code);
      }
      else
         string=decode_string(decode_stack,new_code);

      character = *string;              /* Output decoded string in reverse */
      while (string >= decode_stack)
         putc(*string--,output);

      if (next_code <= max_code) {      /* Add to string table if not full */
         prefix_code[next_code]=old_code;
         append_character[next_code++]=character;
         if (next_code == max_code && num_bits < MAX_BITS) {
            putchar('+');
            max_code = MAXVAL(++num_bits);
         }
      }
      old_code=new_code;
   }
   putchar('\n');
}
/* UNCHANGED from original
 * Decode a string from the string table, storing it in a buffer.
 * The buffer can then be output in reverse order by the expansion
 * program.
 */
char *decode_string(unsigned char *buffer, unsigned int code)
{
   int i=0;

   while(code > 255 ) {
      *buffer++ = append_character[code];
      code=prefix_code[code];
      if (i++ >= 4000 ) {
         printf("Error during code expansion\n");
         exit(1);
      }
   }
   *buffer=code;
   return(buffer);
}

/* UNCHANGED from original
 * Input a variable length code.
 */
unsigned input_code(FILE *input)
{
   unsigned int return_value;
   static int input_bit_count=0;
   static unsigned long input_bit_buffer=0L;

   while (input_bit_count <= 24 ) {
     input_bit_buffer |= (unsigned long) getc(input) << (24 - input_bit_count);
     input_bit_count += 8;
   }
   return_value=input_bit_buffer >> (32-num_bits);
   input_bit_buffer <<= num_bits;
   input_bit_count -= num_bits;
   return(return_value);
}
/* MODIFIED Output a variable length code.
 */
output_code(FILE *output, unsigned int code)
{
   static int output_bit_count=0;
   static unsigned long output_bit_buffer=0L;

   output_bit_buffer |= (unsigned long) code << (32 - num_bits -
                                                             output_bit_count);
   output_bit_count += num_bits;
   while (output_bit_count >= 8) {
      putc(output_bit_buffer >> 24, output);
      output_bit_buffer <<= 8;
      output_bit_count -= 8;
      bytes_out++;                    /* ADDED for compression monitoring */
   }
}










Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.