Dr. Dobb's | Differential Compression Algorithms

Differential Compression Algorithms

At first glance, differential image compression suggests some degree of graphics dependence. As James points out, however, a close look at source-code implementations reveals that data need not be graphics related at all.

April 01, 1993
URL:http://www.drdobbs.com/database/differential-compression-algorithms/184408976

APR93: DIFFERENTIAL COMPRESSION ALGORITHMS

This article contains the following executables: DIFFCOMP.ARC

James recently graduated from California State University, Fullerton with a masters in computer science. He can be contacted at P.O. Box 1097, Yorba Linda, CA 92686-1097.

At first glance, differential image compression suggests some degree of graphics dependence--after all, the word "image" in the title implies that graphics must be used. However, a close look at source-code implementations of differential image-compression algorithms--like that presented by John Bridges in his article "Differential Image Compression," DDJ, February 1991--reveals that data need not be graphics at all.

Differential image compression keeps track of only the information necessary to change the currently active screen of graphics into the next screen; it then repeats this process for subsequent screens. In John's coding, for example, differential image compression uses the active screen of graphics as a fixed-size buffer for data and the compression/decompression algorithms simply modify the data in the buffer. That the buffer in the algorithms is actually a graphics screen (commonly a VGA screen for its elegant pixel-to-memory map) is simply a dependency of the algorithms presented. There's nothing inherent in this approach to prevent or inhibit the buffer from being any portion of random-access memory containing any type of data.

Still, you should make one minor modification when the buffer makes this leap into nongraphics memory. The category for the compression and decompression algorithms should be changed from differential image compression to simply differential compression to signify that the data in the buffer may be used for any general purpose.

Implementation Details

Implementation of differential compression begins with resolving at least the following three issues in the design phase:

Is the data buffer fixed or variant in length? Regardless of whether it's fixed or variant, what is the maximum size of the data buffer when full? Can it all be kept in memory or must portions be accessed from disk?
How does the data buffer get used and possibly updated so that the size of the encoded differential information can be kept as small as possible?
What type of initial information should be put into the data buffer at the start of both the compression and decompression programs to get the best overall compression results?

In my implementation of differential compression, I resolved the three issues this way:

I keep coding simple and allow the entire data buffer to be kept in memory. The buffer contains exactly 256 blocks, each having 8 bytes of storage data. Thus, the size of the data buffer is 256x8=2 Kbytes long.
I encode source data by comparing successive 8-byte blocks with all 256 blocks in the data buffer. The index of the best-matching block is then output as one byte, and differential information is output to update the best-matching block with the new data. Overall compression is optimized when the source data contains similar, ideally repetitive, blocks of data.
To guarantee at least one matching byte for the first block of source data, block #0 is filled with all 0 bytes; block #1 is filled with all 1 bytes; block #2 is filled with all 2 bytes; and so on to block #255, which is filled with all 255 bytes.

Coding Details

Figure 1 presents the encoding algorithm, and Listing One (page 146) lists PACK.C, its C implementation. Figure 2 shows the decoding technique, while Listing Two (page 146) lists the UNPACK.C decompression program. The listings resemble the algorithm structures with one exception: Special handling needs to be taken to handle the last byte(s) of source data, which may be smaller than the block size of the data buffer. According to the pack and unpack programs, any source file whose size MOD eight does not equal 0 requires this special handling. Specific details are documented in the source code of these two programs.

Figure 1: Encoding algorithm.

  initialize all blocks in data buffer
  while not end of file
  {
    load one block of source data
    find best matching block in data buffer
    output index of best matching block

    output a bit map (one byte) for the ordered bytes in
    the data buffer which need to be changed
    (0 bit ==> matching bytes)
    (1 bit ==> new byte in loaded block of source data)

    output all new bytes in loaded block of source data
    update best matching block with loaded block of data
  }

Figure 2: Decoding algorithm

  initialize all blocks in data buffer exactly as was done
  in the encoding algorithm

  while not end of file
  {
    load best matching block index
    load bit map (one byte) for the ordered bytes in
    the data buffer which need to be changed

    directly load all new (i.e. changed) bytes into the
    best matching block of data (automatic update)

    output the best matching block of data which now contains an exact
    copy of the originally loaded block
  }

In these programs, compression is optimized whenever the data in the source file contains highly similar or exactly repetitive data, as determined by the order of data in each block. For example, "abcdefgh" as a block of 8 bytes is similar to "accddfgf," and consequently would be compressed into 5 bytes (block index+01001001 bit map byte+c+d+f). However, "abcdefgh" is not similar to "habcdefg," which would be compressed into 10 bytes (block index+11111111 bit map byte+h+a+c+d+e+f+g). You can get around this problem in one of three ways:

Add a shift index just after the block index to signal how many bytes the data in the key block needs to be shifted before applying the comparison tests. In this case, the "habcdefg" block gets compressed into 2.375 bytes (block index+001 shift index+00000000 bit map byte).
Increase the number of blocks in the data buffer.
Prescan the source file, adding highly repetitive blocks to a list and then save this list and its parameters to the beginning of the encoded data file.

This will keep the number of blocks in the data buffer to a minimum, resulting in the fewest number of bits required for the block index. The data buffer should then be static during the encoding and decoding process since the optimization work was already done during the prescanning process. If "habcdefg" is a popular block, it will be encoded efficiently. If not, any loss of compression should be negligible.

Suggested Improvements

There are a number of possible improvements you can make to programs. For one thing, you might increase the number of bytes in the data buffer; specifically, increase BLOCKFACTOR to an integer>1.

Likewise, you can delete code which updates the data buffer. For instance, you can make the necessary deletions in the pack program. At the end of the unpack program, use the replacement code shown in Listing Three (page 146).

Another improvement might be to write your own code to prescan the source data, find the ideal initial information to be put in the data buffer, and then save this information at the start of the encoded data file.

You can also rework the code entirely to handle N blocks rather than the fixed 256 count and then find the optimal size of N for any given source data.

Finally, consider allowing less than a 100 percent copy of the original and then cut back on the number of changed bytes output during the encoding process.



_DIFFERENTIAL COMPRESSION ALGORITHMS_
by James H. Sylvester

[LISTING ONE]



/**************************************************************************/
/*  PACK.C -- by James Sylvester                                            */
/**************************************************************************/

#include <stdio.h>

void main(int argc, char *argv[])
{
  const blockfactor = 1;              /* adjust as desired */
  const blocksize = blockfactor * 8;
  char  buffer [257] [8];             /* enter blocksize for second index */

  int   i, j, currentsize;
  FILE  *sf, *tf;               /* sourcefile & targetfile respectively */
  int   bestblock;              /* best matching block in buffer */
  int   bestcount, matchcount;
  int   changeindex, bitvalue;

/* Verify and perform error handling for opening both the input file and   */
/* the output file as binary files.  Note, the input file has the original */
/* data and the output file will contain the encoded/compressed data.      */
  if (argc != 3)
  {
    printf("Correct usage is  >pack source_filename target_filename\n");
    exit(1);
  }
  if ((sf = fopen(argv[1], "rb"))==NULL)  /* read only mode for input file */
  {
    printf("Unable to open source file %s\n", argv[1]);
    exit(2);
  }
  if ((tf = fopen(argv[2], "wb"))==NULL)  /* write only mode for output file */
  {
    printf("Unable to open target file %s\n", argv[2]);
    exit(3);
  }
/* Initialize buffer with all zeros in the first block, all ones in the   */
/* second block, and so forth so that there will be at least one matching */
/* byte in the first block of input data regardless what it might be.     */
  for (i = 0; i < 256; i++)
    for (j = 0; j < blocksize; j++)
      buffer [i] [j] = i;
  while (1)  /* while true ==> stay in loop until internal exit */
  {
/* Load the next block of data from the sourcefile into the last slot of */
/* the buffer.  Also, keep track of how many bytes were read to signify  */
/* proper handling for the unfull block when at the end of file.         */
    currentsize = blocksize;
    for (j = 0; j < blocksize; j++)
    {
      i = getc(sf);
      if (i == EOF)
      {
        currentsize = j;     /* reset correct currentsize */
        break;               /* exit for loop */
      }
      buffer [256] [j] = i;  /* put character into last block */
    }
    if (currentsize == 0)    /* input data ended with previous full block */
    {
      printf("%s now contains encoded/compressed data!\n", argv[2]);
      exit(4);
    }
/* Find the best matching block in buffer for the recently loaded block */
/* of input data.  Afterwards, send the bestblock index to the output   */
/* file and process the changes between the two blocks.                 */
    bestcount = -1;
    for (i = 0; i < 256; i++)
    {
      matchcount = 0;
      for (j = 0; j < currentsize; j++)
        if (buffer [i] [j] == buffer [256] [j])
          matchcount++;
      if (matchcount > bestcount)
      {
        bestcount = matchcount;
        bestblock = i;
        if (bestcount == currentsize)
          break;                       /* exit for loop */
      }
    }
    putc (bestblock, tf);
    for (i = 0; i < blockfactor; i++)
    {
      changeindex = 0;
      bitvalue = 1;
      for (j = i*8; j < i*8+8; j++)
      {
        if (j >= currentsize)
          break;
        if (buffer [bestblock] [j] != buffer [256] [j])
          changeindex += bitvalue;
        bitvalue *= 2;
      }
      putc(changeindex, tf);
      for (j = i*8; j < i*8+8; j++)
      {
        if (changeindex % 2 == 1)
          putc(buffer [256] [j], tf);
        changeindex /= 2;
      }
    }
    if (currentsize < blocksize)  /* input data ended with unfull block */
    {
      putc(currentsize, tf);
      printf("%s now contains encoded/compressed data!\n", argv[2]);
      exit(5);
    }
    /* Update the best matching block in buffer with new data. Compression  */
    /* should improve as further loaded blocks match exact copies in buffer */
    for (j = 0; j < blocksize; j++)
      buffer [bestblock] [j] = buffer [256] [j];
  }
}

[LISTING TWO]


/**************************************************************************/
/*  UNPACK.C -- by James Sylvester                                          */
/**************************************************************************/

#include <stdio.h>

void main(int argc, char *argv[])
{
  const blockfactor = 1;              /* adjust as desired */
  const blocksize = blockfactor * 8;
  char  buffer [257] [8];             /* enter blocksize for second index */

  int   i, j;
  FILE  *sf, *tf;        /* sourcefile & targetfile respectively */
  int   bestblock = -1;  /* best matching block in buffer */
  int   changeindex;

/* Verify and perform error handling for opening both the input file and the */
/* output file as binary files. Input file contains encoded/compressed data */
/* data and output file should contain an  exact copy of the original data. */
  if (argc != 3)
  {
    printf("Correct usage is  >unpack source_filename target_filename\n");
    exit(1);
  }
  if ((sf = fopen(argv[1], "rb"))==NULL)  /* read only mode for input file */
  {
    printf("Unable to open source file %s\n", argv[1]);
    exit(2);
  }
  if ((tf = fopen(argv[2], "wb"))==NULL)  /* write only mode for output file */
  {
    printf("Unable to open target file %s\n", argv[2]);
    exit(3);
  }
/* Initialize buffer with exactly same information as in PACK.C program. */
  for (i = 0; i < 256; i++)
    for (j = 0; j < blocksize; j++)
      buffer [i] [j] = i;
/* Reconstruct original data from encoded data in sourcefile. */
  while (1)  /* while true ==> stay in loop until internal exit */
  {
    if (bestblock == -1)     /* input data yet to be loaded */
    {
      bestblock = getc(sf);
      if (bestblock == EOF)  /* original and encoded files had 0 bytes */
      {
        printf("%s now contains reconstructed original data!\n", argv[2]);
        exit(4);
      }
      changeindex = getc(sf);
    }
    else
    {
      bestblock = getc(sf);
      if (bestblock == EOF)    /* input data ended with previous full block */
      {
        for (j = 0; j < blocksize; j++)  /* output full block */
          putc(buffer [256] [j], tf);
        printf("%s now contains reconstructed original data!\n", argv[2]);
        exit(5);
      }
      changeindex = getc(sf);
      if (changeindex == EOF)  /* input data ended with unfull block */
      {
        for (j = 0; j < bestblock; j++)  /* reinterpret bestblock as  */
                                         /* last blocksize and output */
                                         /* this last, partial block  */
          putc(buffer [256] [j], tf);
        printf("%s now contains reconstructed original data!\n", argv[2]);
        exit(6);
      }
      for (j = 0; j < blocksize; j++)  /* output full block */
        putc(buffer [256] [j], tf);
    }
    for (i = 0; i < blockfactor; i++)
    {
      if (i > 0)
        changeindex = getc(sf);
      for (j = i*8; j < i*8+8; j++)
      {
        if (changeindex % 2 == 1)
          buffer [bestblock] [j] = getc(sf);  /* directly load changes */
                                              /* into buffer bestblock */
        changeindex /= 2;
        buffer [256] [j] = buffer [bestblock] [j];  /* copy block info */
      }
    }
  }
}

[LISTING THREE]




        if (changeindex % 2 == 1)
        { buffer [256] [j] = getc(sf); }
        else
        { buffer [256] [j] = buffer [bestblock] [j]; }
        changeindex /= 2;
      }         }
  }
}