Channels ▼
RSS

Design

LZW Data Compression

Source Code Accompanies This Article. Download It Now.


Every programmer should have at least some exposure to the concept of data compression. Programs such as ARC by System Enhancement Associates (Wayne, N.J.) and PKZIP by PKWARE (Glendale, Wisc.) are ubiquitous in the MS-DOS world. ARC has also been ported to quite a few other operating systems, for example, Unix, CP/M, and so on. CP/M users have long had SQ and USQ to squeeze and expand programs. Unix users have the COMPRESS and COMPACT utilities. Yet the data compression techniques used in these programs typically show up in only two places: File transfers over phone lines and archival storage.

Data compression has an undeserved reputation for being difficult to master, hard to implement, and tough to maintain. In fact, the techniques used in the previously mentioned programs are relatively simple, and can be implemented with standard utilities taking only a few lines of code. In this article, I'll discuss Lempel-Ziv-Welch (LZW) compression, a good, all-purpose data compression technique that belongs in every programmer's toolbox.

LZW, for example, by compressing the screens, can easily chop 50K bytes off a program that has several dozen help screens. With LZW compression, 500K bytes of software could be distributed to end users on a single 360K byte floppy disk. Highly redundant database files can be compressed to ten percent of their original size.

LZW Fundamentals

The original Lempel/Ziv approach to data compression was first published in 1977, and Terry Welch's refinements to the algorithm were published in 1984. The algorithm is surprisingly simple. In a nutshell, LZW compression replaces strings of characters with single codes. It does not do any analysis of the incoming text. Instead, it just adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output replacing the string of characters.

The code generated by the LZW algorithm can be of any length, but it must have more bits in it than a single character. The first 256 codes (when using 8-bit characters) are by default assigned to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program runs, as shown in Listing One with 12-bit codes. This means codes 0 - 255 refer to individual bytes, and codes 256 - 4095 refer to substrings.

Listing One


/*************************************************************************
** LZW data compression/expansion demonstration program.
** Mark R. Nelson
**************************************************************************/
#include <stdio.h>

#define BITS 12                      /* Setting the number of bits to 12, 13 */
#define HASHING_SHIFT BITS-8         /* or 14 affects several constants.     */
#define MAX_VALUE (1 << BITS) - 1    /* Note that MS-DOS machines need to    */
#define MAX_CODE MAX_VALUE - 1       /* compile their code in large model if */
                                     /* 14 bits are selected.                */
#if BITS == 14
  #define TABLE_SIZE 18041           /* The string table size needs to be a  */
#endif                               /* prime number that is somwhat larger  */
#if BITS == 13                       /* than 2**BITS.                        */
  #define TABLE_SIZE 9029
#endif
#if BITS <= 12
  #define TABLE_SIZE 5021
#endif

void *malloc();

int *code_value;                     /* This is the code value array         */
unsigned int *prefix_code;           /* This array holds the prefix codes    */
unsigned char *append_character;     /* This array holds the appended chars  */
unsigned char decode_stack[4000];    /* This array holds the decoded string  */

/**************************************************************************
** This program gets a file name from the command line.  It compresses the
** file, placing its output in a file named test.lzw.  It then expands
** test.lzw into test.out.  Test.out should then be an exact duplicate of
** the input file.
**************************************************************************/

main(int argc, char *argv[])
{
FILE *input_file;
FILE *output_file;
FILE *lzw_file;
char input_file_name[81];
/*
**  The three buffers are needed for the compression phase.
*/
    code_value=malloc(TABLE_SIZE*sizeof(unsigned int));
    prefix_code=malloc(TABLE_SIZE*sizeof(unsigned int));
    append_character=malloc(TABLE_SIZE*sizeof(unsigned char));
    if (code_value==NULL || prefix_code==NULL || append_character==NULL)
    {
        printf("Fatal error allocating table space!\n");
        exit();
    }
/*
** Get the file name, open it up, and open up the lzw output file.
*/
    if (argc>1)
        strcpy(input_file_name,argv[1]);
    else
    {
        printf("Input file name? ");
        scanf("%s",input_file_name);
    }
    input_file=fopen(input_file_name,"rb");
    lzw_file=fopen("test.lzw","wb");
    if (input_file==NULL || lzw_file==NULL)
    {
        printf("Fatal error opening files.\n");
        exit();
    };
/*
** Compress the file.
*/
    compress(input_file,lzw_file);
    fclose(input_file);
    fclose(lzw_file);
    free(code_value);
/*
** Now open the files for the expansion.
*/
    lzw_file=fopen("test.lzw","rb");
    output_file=fopen("test.out","wb");
    if (lzw_file==NULL || output_file==NULL)
    {
        printf("Fatal error opening files.\n");
        exit();
    };
/*
** Expand the file.
*/
    expand(lzw_file,output_file);
    fclose(lzw_file);
    fclose(output_file);
    free(prefix_code);
    free(append_character);
}

/*
** This is the compression routine.  The code should be a fairly close
** match to the algorithm accompanying the article.
**
*/
compress(FILE *input,FILE *output)
{
unsigned int next_code;
unsigned int character;
unsigned int string_code;
unsigned int index;
int i;
    next_code=256;             /* Next code is the next available string code*/
    for (i=0;i<TABLE_SIZE;i++) /* Clear out the string table before starting */
        code_value[i]=-1;
    i=0;
    printf("Compressing...\n");
    string_code=getc(input);   /* Get the first code*/
/*
** This is the main loop where it all happens.  This loop runs util all of
** the input has been exhausted.  Note that it stops adding codes to the
** table after all of the possible codes have been defined.
*/
    while ((character=getc(input)) != (unsigned)EOF)
    {
        if (++i==1000)                   /* Print a * every 1000    */
        {                                /* input characters.  This */
            i=0;                         /* is just a pacifier.     */
            printf("*");
        }
        index=find_match(string_code,character); /* See if the string is in */
        if (code_value[index] != -1)             /* the table.  If it is,   */
            string_code=code_value[index];       /* get the code value.  If */
        else                                     /* the string is not in the*/
        {                                        /* table, try to add it.   */
            if (next_code <= MAX_CODE)
            {
                code_value[index]=next_code++;
                prefix_code[index]=string_code;
                append_character[index]=character;
            }
            output_code(output,string_code);     /* When a string is found  */
            string_code=character;               /* that is not in the table*/
        }                                        /* I output the last string*/
    }                                            /* after adding the new one*/
/*
** End of the main loop.
*/
    output_code(output,string_code);  /* Output the last code */
    output_code(output,MAX_VALUE);    /* Output the end of buffer code */
    output_code(output,0);            /* This code flushes the output buffer*/
    printf("\n");
}
/*
** This is the hashing routine.  It tries to find a match for the prefix+char
** string in the string table.  If it finds it, the index is returned.  If
** the string is not found, the first available index in the string table is
** returned instead.
*/
find_match(int hash_prefix,unsigned int hash_character)
{
int index;
int offset;

    index = (hash_character << HASHING_SHIFT) ^ hash_prefix;
    if (index == 0)
        offset = 1;
    else
        offset = TABLE_SIZE - index;
    while (1)
    {
if (code_value[index] == -1)
      return(index);
if (prefix_code[index] == hash_prefix && append_character[index] == hash_character)
      return(index);
  index -= offset;
  if (index < 0)
      index += TABLE_SIZE;
    }
}
/*
**  This is the expansion routine.  It takes an LZW format file, and expands
**  it to an output file.  The code here should be a fairly close match to
**  the algorithm in the accompanying article.
*/
expand(FILE *input,FILE *output)
{
unsigned int next_code;
unsigned int new_code;
unsigned int old_code;
int character;
int counter;
unsigned char *string;
char *decode_string(unsigned char *buffer,unsigned int code);
    next_code=256;          /* This is the next available code to define */
    counter=0;              /* Counter is used as a pacifier.            */
    printf("Expanding...\n");

    old_code=input_code(input);  /* Read in the first code, initialize the */
    character=old_code;          /* character variable, and send the first */
    putc(old_code,output);       /* code to the output file                */
/*
**  This is the main expansion loop.  It reads in characters from the LZW file
**  until it sees the special code used to inidicate the end of the data.
*/
    while ((new_code=input_code(input)) != (MAX_VALUE))
    {
        if (++counter==1000)           /* This section of code prints out  */
        {                              /* an asterisk every 1000 characters*/
            counter=0;                 /* It is just a pacifier.           */
            printf("*");
        }
/*
** This code checks for the special STRING+CHARACTER+STRING+CHARACTER+STRING
** case which generates an undefined code.  It handles it by decoding
** the last code, adding a single character to the end of the decode string.
*/
        if (new_code>=next_code)
        {
            *decode_stack=character;
            string=decode_string(decode_stack+1,old_code);
        }
/*
** Otherwise we do a straight decode of the new code.
*/
        else
            string=decode_string(decode_stack,new_code);
/*
** Now we output the decoded string in reverse order.
*/
        character=*string;
        while (string >= decode_stack)
            putc(*string--,output);
/*
** Finally, if possible, add a new code to the string table.
*/
        if (next_code <= MAX_CODE)
        {
            prefix_code[next_code]=old_code;
            append_character[next_code]=character;
            next_code++;
        }
        old_code=new_code;
    }
    printf("\n");
}
/*
** This routine simply decodes a string from the string table, storing
** it in a buffer.  The buffer can then be output in reverse order by
** the expansion program.
*/
char *decode_string(unsigned char *buffer,unsigned int code)
{
int i;

    i=0;
    while (code > 255)
    {
        *buffer++ = append_character[code];
        code=prefix_code[code];
        if (i++>=4094)
        {
            printf("Fatal error during code expansion.\n");
            exit();
        }
    }
    *buffer=code;
    return(buffer);
}
/*
** The following two routines are used to output variable length
** codes.  They are written strictly for clarity, and are not
** particularly efficient.
*/
input_code(FILE *input)
{
unsigned int return_value;
static int input_bit_count=0;
static unsigned long input_bit_buffer=0L;
    while (input_bit_count <= 24)
    {
       input_bit_buffer |= (unsigned long) getc(input) << (24-input_bit_count);
       input_bit_count += 8;
    }
    return_value=input_bit_buffer >> (32-BITS);
    input_bit_buffer <<= BITS;
    input_bit_count -= BITS;
    return(return_value);
}
output_code(FILE *output,unsigned int code)
{
static int output_bit_count=0;
static unsigned long output_bit_buffer=0L;
    output_bit_buffer |= (unsigned long) code << (32-BITS-output_bit_count);
    output_bit_count += BITS;
    while (output_bit_count >= 8)
    {
        putc(output_bit_buffer >> 24,output);
        output_bit_buffer <<= 8;
        output_bit_count -= 8;
    }
}


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video