Random Access Data Compression



September 01, 1997
URL:http://www.drdobbs.com/random-access-data-compression/184403389


Figure 1: BPE transformation

Original data:     A   B   A   B   C   A   B   C   D
                   |   |   |   |   |   |   |   |   |
Change AB to E:     ---     ---    |    ---    |   |
                     |       |     |     |     |   |
Intermediate data:   E       E     C     E     C   D
                     |       |     |     |     |   |
Change EC to F:      |        -----       -----    |
                     |          |           |      |
Compressed data:     E          F           F      D


Philip Gage

Need to compress text but still jump about within it? This simple compression scheme may do the trick.


This article describes a data compression algorithm called Byte Pair Encoding (BPE). BPE has not only excellent decompression-time performance but some unusual features as well. In particular, BPE supports direct access to compressed data, allowing a program to jump to any byte in a compressed file and retrieve or modify data at that point. I illustrate this process by presenting portable C programs for compression and decompression, and a text file browser program that demonstrates the random access technique.

BPE in a Nutshell

I described BPE at length in a previous article (see "A New Algorithm for Data Compression," CUJ, February 1994). Since space is limited, and since I want to focus on the random-access feature, I will give only a brief refresher on BPE here. I encourage readers who want more detail to consult the original article. Also, an expanded illustration of the BPE algorithm is available (along with this article) on the CUJ web site. See http://www.cuj.com.

The BPE algorithm is a general-purpose lossless compression method suitable for all types of data, although this article concentrates on text-file applications. The compression process is very simple. Unlike many other compression methods, BPE does not require variable-length bit packing or complex data structures. BPE loads an entire file or a block of data into memory and compresses it in multiple passes. The data is reanalyzed after each pass to account for changes as it is compressed.

BPE operates by finding the most frequently occurring pair of adjacent bytes in the data and replacing all instances of the pair with an unused byte that does not occur in the data. The compressor repeats this process until no more compression is possible, either because there are no more frequently occurring byte pairs or because there are no more unused bytes to represent pairs. The compressor then writes the resulting table of byte-pair substitutions to the output, immediately preceding the packed data. The following pseudocode illustrates the compression process:

Read file into buffer
While (compression possible)
{
   Find most frequent byte pair in buffer
   Add pair to table and assign it an unused byte
   Replace all such pairs with this unused byte
}
Write pair table and buffer

The decompressor first reads the pair table from the input stream, and stores the table in a buffer. The decompressor then processes each byte of compressed data, maintaining a small stack of partially expanded bytes. The next byte to be processed is popped from the stack, or if the stack is empty, read from the input stream. If the byte is a pair code, the pair of bytes is looked up in the table and pushed onto the stack; otherwise the byte is output as a normal character. The pseudocode below illustrates this decompression process.

Read and store pair table in buffer
While (stack not empty OR not end of file)
{
   If stack empty, read byte
   Else pop byte from stack
   If byte is pair code,
       push pair on stack
   Else write byte
}

Figure 1 shows the pair transformations that occur when the string "ABABCABCD" is compressed or decompressed. (This is the same data used for the examples on the web site.) You can read the diagram downwards for compression and upwards for decompression. This illustration shows why BPE supports random access. Each byte in the compressed data (bottom line) is an independent code that corresponds directly to one or more bytes in the original data. Thus, any section of compressed data can be decompressed on the fly. In addition, any part of the compressed data can be modified without affecting the rest of the data.

BPE in Code

I've included a sample compressor, compress.c (Listing 1) and decompressor, expand.c (Listing 2) to help illustrate how BPE works. For simplicity, the C code accompanying this article compresses only small text files and includes only minimal error handling. This code compresses typical text files to roughly half of their original size, but of course the actual amount of compression depends on the data being compressed.

This sample code relies on the fact that the ASCII character set uses only codes from 0 through 127. That frees up codes 128 through 255 for use as pair codes. Assigning pair codes in numerical order allows the byte pair table to be stored with only two bytes per pair. The byte value that represents each pair is indicated by the pair's position in the table. For a description of this code, please consult the sidebar.

The sidebar also briefly describes modifications to improve compression, to handle large files, and to handle binary data. These modifications were fully implemented in my previous article [1]. The full source code implementing these features is available on the CUJ ftp site under this month's code listings (see p. 3 for downloading details).

Random Access Decompression

Few commonly used compression algorithms allow random access to compressed data. Virtually all modern compression methods use adaptive models and generate variable-bit-length codes that must be decoded sequentially from beginning to end [2]. As a result, random access to large data sets can be provided only by dividing the data into smaller files or blocks and compressing each chunk separately. Even if an application is sophisticated enough to decompress separate blocks of data (the conventional technique), it still must begin decompressing each block at the start of that block.

There is a simple text compression trick that allows random access; it employs the unused high-order bit in ASCII characters to indicate that a preceding space character (the most common character in text files) has been removed. This technique in effect removes all single spaces and reduces runs of consecutive spaces to half length, compressing typical text by about 15 percent. Another simple approach is to encode common words as one byte using a fixed dictionary of the most frequently used words. But schemes such as these are not general-purpose and have limited usefulness.
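
To make the trick concrete, here is a minimal sketch (the helper names are mine, not from any listing); it assumes 7-bit ASCII input and folds each removed space into the high bit of the byte that follows it:

/* Sketch of the high-bit space-removal trick described above.
   Hypothetical helpers, not part of the BPE listings; assumes the
   input uses only 7-bit ASCII codes. */
#include <stdio.h>

void pack_spaces(FILE *in, FILE *out)
{
  int c, pending = 0;            /* is there an unwritten space? */
  while ((c = getc(in)) != EOF) {
    if (pending) {
      putc(c | 0x80, out);       /* high bit: "a space precedes this byte" */
      pending = 0;
    }
    else if (c == ' ')
      pending = 1;               /* hold the space and try to fold it in */
    else
      putc(c, out);
  }
  if (pending)
    putc(' ', out);              /* input ended with a lone space */
}

void unpack_spaces(FILE *in, FILE *out)
{
  int c;
  while ((c = getc(in)) != EOF) {
    if (c & 0x80) {              /* restore the folded-in space */
      putc(' ', out);
      putc(c & 0x7F, out);
    }
    else
      putc(c, out);
  }
}

Because every output byte still decodes on its own, a reader can start anywhere in the packed text, which is what makes this trick compatible with random access.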

BPE is a universal compression algorithm that supports random access for all types of data, not just text. The global substitution process of BPE produces a uniform data format that allows decompression to begin anywhere in the data. Using BPE, data from anywhere in a compressed block can be immediately decompressed without having to start at the beginning of the block. This can provide very fast access for some applications. It's like being able to grab a volume of an encyclopedia off the shelf and open it to the exact page you want without having to flip through all the earlier pages in the volume.

The text file browser program browse.c (Listing 3) illustrates how BPE can be used for direct access to compressed data. The program accepts a single command-line argument, the name of a file to view. This program accepts both normal text files and compressed files. The browser displays the current screen of text and the location in the file as a percentage, from 0 at the top to 99 at the bottom. The user then enters other percentage values to view any part of the file. Entering a number outside the 0 to 99 range exits the program.

The browser uses fseek to jump to a specified location in the file and then decompresses a screen of text on the fly. Jumping to a random location generally results in a partial first line of text being displayed. This could be handled simply by not printing characters until after the first newline character is seen. The browser could be improved by adding a Graphical User Interface (GUI) with a scroll bar to set the percentage and allow the user to scroll through a file by dragging the scroll bar with a mouse.
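
As a rough sketch of the skip-to-newline refinement (a hypothetical helper, not part of Listing 3, though it relies on Listing 3's pairtable global), the browser could expand and discard bytes up to the first newline right after each fseek, then call decompress_page as usual:

/* skip_partial_line: after an fseek into packed data, expand and throw
   away bytes up to and including the first newline, so the next call to
   decompress_page starts on a line boundary. Simplification: if the
   newline happens to be buried inside a pair expansion, the few expanded
   characters still on the stack are dropped along with it. */
void skip_partial_line(FILE *in)
{
  unsigned char stack[16];
  short c, top = 0;

  for (;;) {
    if (top)
      c = stack[--top];
    else if ((c = getc(in)) == EOF)
      break;
    if (c & 0x80) {              /* expand pair code, as in Listing 3 */
      stack[top++] = pairtable[c & 0x7F][1];
      stack[top++] = pairtable[c & 0x7F][0];
    }
    else if (c == '\n')
      break;                     /* discard everything up to here */
  }
}

The main loop in Listing 3 would simply call skip_partial_line(in) immediately after the fseek that moves to the new percentage.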

LINESPERPAGE determines how many lines of text are printed per screen. This value may be adjusted as needed. This program assumes that the previous text will scroll off the top of the screen. On some systems, a clear screen call may improve operation.

Modifying the browser to scroll forward an exact number of lines or pages is easy; just decompress and count newline characters. Scrolling backwards a specified number of lines is also easy, thanks to another unusual BPE feature: decompression can be done backwards! This is done by processing the compressed bytes in reverse, storing the generated data in reverse, and reversing the order in which the pair bytes are pushed onto the stack. Thus newlines can be counted backwards from the current position while the unpacked data is stored backwards in a buffer, as in the sketch below.
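
The following is a rough sketch of that idea (the function and its parameters are hypothetical, and it assumes the pairtable global from Listing 3 has already been loaded). It reads packed bytes in reverse starting just before a given file offset, reverses the order in which each pair's bytes are pushed, and hands back the original bytes in reverse order for the caller to scan for newlines:

/* expand_backward: expand packed bytes in reverse, producing original
   bytes last-to-first. "pos" is a file offset just past the packed bytes
   to expand; "stop" is the offset of the first packed byte (the browser's
   "start"). Fills outbuf with up to "want" bytes, in reverse order, and
   returns how many were produced. */
long expand_backward(FILE *in, long pos, long stop,
                     unsigned char *outbuf, long want)
{
  unsigned char stack[16];
  short c, top = 0;
  long n = 0;

  while (n < want) {
    if (top)
      c = stack[--top];
    else {
      if (pos <= stop)           /* reached the start of the packed data */
        break;
      fseek(in, --pos, SEEK_SET);
      c = getc(in);              /* step one packed byte backwards */
    }
    if (c & 0x80) {              /* reversed push order, so the right-hand
                                    byte of the pair comes off first */
      stack[top++] = pairtable[c & 0x7F][0];
      stack[top++] = pairtable[c & 0x7F][1];
    }
    else
      outbuf[n++] = c;           /* output accumulates in reverse */
  }
  return n;                      /* caller reads outbuf back to front */
}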

Possible Applications

The BPE method is useful for applications requiring small, fast decompression code, such as image display, self-extracting programs, and embedded systems. It could also be useful for communication links, since worst-case data expansion is low and it operates on discrete packets of data.

BPE is a relatively fault-tolerant compression scheme. If a few bytes of input to the decompressor are corrupted by transmission errors, only a similar fraction of output bytes are affected, and only in the current block of data. Thus, BPE may be more suitable for real-time, mission-critical, and data-transmission applications than most other compression algorithms.

A useful modification to the text browser technique would be to enable hypertext access to regions within the compressed file. The compression program would create an index file of byte offsets to topic names, chapter headers, and other selected phrases within the compressed file. The browser would then highlight these phrases and use the byte offsets to jump to the selected links. Note that it should also be possible to implement a text editor based on the browser technique that can modify, insert, or delete text almost as easily as a normal editor.
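
For example (a hypothetical index format, not something the listings implement), the index could be a simple array of name/offset records, and following a link would amount to an fseek plus a call to decompress_page from Listing 3:

/* Hypothetical hypertext index entry: the compressor would record the
   offset in the packed file at which each topic's text begins. */
struct topic_entry {
  char name[32];      /* topic name, chapter header, or other phrase */
  long offset;        /* byte offset of that text in the packed data */
};

/* Follow a link: jump to the topic and display one screen of text. */
void goto_topic(FILE *in, const struct topic_entry *t)
{
  fseek(in, t->offset, SEEK_SET);
  decompress_page(in);           /* from Listing 3 */
}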

Finally, BPE could be useful for managing a compressed database. In this scheme, each database table would have its own BPE pair table, allowing delete, insert, and update of records by reusing the same pair table for compressing new data. Of course, the tables may require re-compression after extensive updating, to restore the compression ratio. This approach also supports binary search on compressed data.
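
As a sketch of the binary search idea only (the record layout, the fixed packed record size, and the key placement are all assumptions, not anything the listings define), records packed with one shared pair table and padded to a fixed packed length could be searched like this:

#include <string.h>

#define RECBYTES 32                       /* packed bytes per record (assumed) */

extern unsigned char pairtable[128][2];   /* shared pair table, as in Listing 3 */

/* Expand one packed record into out (at most outmax bytes); return length. */
static int expand_record(const unsigned char *rec, int reclen,
                         unsigned char *out, int outmax)
{
  unsigned char stack[16];
  int i = 0, top = 0, n = 0, c;

  while (i < reclen || top) {
    if (top)
      c = stack[--top];
    else
      c = rec[i++];
    if (c & 0x80) {                       /* expand pair code */
      stack[top++] = pairtable[c & 0x7F][1];
      stack[top++] = pairtable[c & 0x7F][0];
    }
    else if (n < outmax)
      out[n++] = c;
  }
  return n;
}

/* Binary search an array of packed records that are sorted by their
   expanded text and begin with their key. Returns record number or -1. */
long find_record(const unsigned char *packed, long nrecs, const char *key)
{
  unsigned char text[256];
  long lo = 0, hi = nrecs - 1, mid;
  int len, cmp;

  while (lo <= hi) {
    mid = (lo + hi) / 2;
    len = expand_record(packed + mid * RECBYTES, RECBYTES, text, 255);
    text[len] = '\0';
    cmp = strncmp((char *)text, key, strlen(key));  /* compare the key prefix */
    if (cmp == 0)
      return mid;
    else if (cmp < 0)
      lo = mid + 1;
    else
      hi = mid - 1;
  }
  return -1;
}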

Summary

The BPE algorithm has several significant advantages over typical compression methods: decompression is extremely simple and fast, requiring only a few hundred bytes of memory for variables; compressed data can be decompressed starting at any point and even modified in place; worst-case expansion of incompressible data is very low; and corrupted input damages only a correspondingly small portion of the output.

In contrast to the example code, these techniques are not limited to the compression of small text files. BPE can be used on any type or size of data set by using buffering and unused byte tests as described in Reference [1].

The main disadvantages of BPE are slow compression speed and a lower compression ratio than some other methods. However, slow compression with fast decompression is acceptable for many applications. Random access to compressed data adds a new twist that may outweigh the disadvantages in many cases.

References

[1] Philip Gage. "A New Algorithm for Data Compression," The C Users Journal, February 1994.

[2] Mark Nelson. The Data Compression Book (M&T Books, 1991).

Philip Gage has a BS degree in computer science and works as a software engineer in Colorado Springs, CO. He can be reached at [email protected].


Listing 1: compress.c — Byte pair encoding compression

/* compress.c -- Byte Pair Encoding compression */
/* Copyright 1996 Philip Gage */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* for memset */

#define MAXSIZE 65535L  /* Input file buffer size */
#define HASHSIZE 8192   /* Hash table size, power of 2 */
#define THRESHOLD   3   /* Increase for speed, min 3 */

void compress (FILE *input, FILE *output)
{
  unsigned char *buffer, *left, *right, *count;
  unsigned char a, b, bestcount=0, pairtable[128][2];
  int i, j, index, bestindex, code=128;
  size_t size;

  /* Dynamically allocate buffers and check for errors */
  buffer = (unsigned char *)malloc(MAXSIZE);
  left = (unsigned char *)malloc(HASHSIZE);
  right = (unsigned char *)malloc(HASHSIZE);
  count = (unsigned char *)malloc(HASHSIZE);
  if (buffer==NULL || left==NULL ||
      right==NULL || count==NULL) {
    printf("Error allocating memory\n");
    exit(1);
  }

  /* Read input file into buffer and check for errors */
  size = fread(buffer,1,MAXSIZE,input);
  if (size == MAXSIZE) {
    printf("File too big\n");
    exit(1);
  }
  for (i=0; i<size; i++)
    if (buffer[i] > 127) {
      printf("This program works only on text files\n");
      exit(1);
    }

  do {  /* Replace frequent pairs with bytes 128..255 */

    /* Enter counts of all byte pairs into hash table */
    memset(count,0,HASHSIZE);
    for (i=0; i+1 < size; i++) {   /* i+1 avoids wraparound when size is 0 */
      a = buffer[i];
      b = buffer[i+1];
      index = (a ^ (b << 6)) & (HASHSIZE-1);
      while ((left[index] != a || right[index] != b) &&
             count[index] != 0)
        index = (index + 1) & (HASHSIZE-1);
      left[index] = a;
      right[index] = b;
      if (count[index] < 255)
        ++count[index];
    }

    /* Search hash table for most frequent pair */
    bestcount = THRESHOLD - 1;
    for (i=0; i<HASHSIZE; i++) {
      if (count[i] > bestcount) {
        bestcount = count[i];
        bestindex = i;
      }
    }

    /* Compress if enough occurrences of pair */
    if (bestcount >= THRESHOLD) {

      /* Add pair to table using code as index */
      a = pairtable[code-128][0] = left[bestindex];
      b = pairtable[code-128][1] = right[bestindex];

      /* Replace all pair occurrences with unused byte */
      for (i=0, j=0; i<size; i++, j++)
        /* i+1 < size keeps the pair test from reading past the data */
        if (i+1 < size && a == buffer[i] && b == buffer[i+1]) {
          buffer[j] = code;
          ++i;
        }
        else
          buffer[j] = buffer[i];
      size = j;
    }
    else
      break;
  } while (++code < 255);

  /* Write pair count, pair table and compressed data */
  putc(code,output);
  fwrite(pairtable,2,code-128,output);
  fwrite(buffer,1,size,output);
  free(buffer); free(left); free(right); free(count);
}

int main (int argc, char **argv)
{
  FILE *in, *out;
  if (argc != 3)
    printf("Usage: compress inputfile outputfile\n");
  else if ((in=fopen(argv[1],"rb"))==NULL)
    printf("Error opening input %s\n",argv[1]);
  else if ((out=fopen(argv[2],"wb"))==NULL)
    printf("Error opening output %s\n",argv[2]);
  else {
    compress(in,out);
    fclose(out);
    fclose(in);
  }
  return 0;
}/* End of File */


Listing 2: expand.c — Byte pair encoding decompression

/* expand.c -- Byte Pair Encoding decompression */
/* Copyright 1996 Philip Gage */

#include <stdio.h>

void decompress (FILE *in, FILE *out)
{
  unsigned char stack[16], pair[128][2];
  short c, top = 0;

  /* Check for optional pair count and pair table */
  if ((c = getc(in)) > 127)
    fread(pair,2,c-128,in);
  else
    putc(c,out);

  for (;;) {

    /* Pop byte from stack or read byte from file */
    if (top)
      c = stack[--top];
    else if ((c = getc(in)) == EOF)
      break;
    /* Push pair on stack or output byte to file */
    if (c > 127) {
      stack[top++] = pair[c-128][1];
      stack[top++] = pair[c-128][0];
    }
    else
      putc(c,out);
  }
}

int main (int argc, char **argv)
{
  FILE *in,*out;

  if (argc != 3)
    printf("Usage: expand inputfile outputfile\n");
  else if ((in=fopen(argv[1],"rb"))==NULL)
    printf("Error opening input %s\n",argv[1]);
  else if ((out=fopen(argv[2],"wb"))==NULL)
    printf("Error opening output %s\n",argv[2]);
  else {
    decompress(in,out);
    fclose(out);
    fclose(in);
  }
  return 0;
}/* End of File */


Listing 3: browse.c — Byte pair encoding file browser

/* browse.c -- Byte Pair Encoding file browser */
/* Copyright 1996 Philip Gage */

#include <stdio.h>
#define LINESPERPAGE 23

unsigned char pairtable[128][2];

void decompress_page (FILE *in)
{
  unsigned char stack[16];
  short c, top=0, linecount=0;

  for (;;) {

    /* Pop byte from stack or read byte from file */
    if (top)
      c = stack[--top];
    else if ((c=getc(in)) == EOF)
      break;

    /* Push pair on stack or print byte */
    if (c & 0x80) {
      stack[top++] = pairtable[c & 0x7F][1];
      stack[top++] = pairtable[c & 0x7F][0];
    }
    else {
      putchar(c);

      /* Count linefeeds and quit after one screen */
      if (c == '\n' && ++linecount == LINESPERPAGE)
        break;
    }
  }
}

int main (int argc, char **argv)
{
  FILE *in;

  long start, end;
  int c, percent = 0;

  if (argc != 2)
    printf("Usage: browse inputfile\n");
  else if ((in=fopen(argv[1],"rb"))==NULL)
    printf("Error opening input %s\n",argv[1]);
  else {
 
    /* Check for optional pair count and pair table */
    if ((c = getc(in)) & 0x80)
      fread(pairtable,2,c & 0x7F,in);
    else
      ungetc(c,in);

    /* Find first and last byte of packed data */
    start = ftell(in);
    fseek(in,0L,SEEK_END);
    end = ftell(in);
    fseek(in,start,SEEK_SET);

    for (;;) {

      /* Display current page and read new percent */
      decompress_page(in);
      printf("\n%d %% into file, enter new %% ",percent);
      scanf("%d",&percent);

      /* Quit program if input percent out of range */
      if (percent < 0 || percent > 99)
        break;

      /* Move file pointer to new position in file */
      fseek(in,start+percent*(end-start)/100,SEEK_SET);
    }
    fclose(in);
  }
  return 0;
}/* End of File */


Sidebar: BPE Code Description


Listing 1 shows the compression program compress.c. The critical issue in the compress function is how long it takes to find the most frequent pair. A brute force search requires no additional data structures but gives ridiculously slow performance. On the other hand, a 256x256 array can be used to hold the counts, but occupies 64K of memory as unsigned char and 128K as short. The approach I use in this program is to count pairs in a hash table and then search the table for the highest count.

The compressor reads the entire input file into a buffer and compresses it in place. An outer loop performs up to 127 global substitutions, replacing frequent pairs with unused characters in the range 128 to 255. Byte pairs are tallied in a hash table; each entry in the table consists of a left byte value, a right byte value, and a count. The compressor searches the hash table for the highest count. If the highest count found is large enough, the compressor replaces all occurrences of that pair in the buffer with the associated pair code. Otherwise, the break statement ends the loop. After the loop completes, the compressor writes the final code value (which encodes the number of pairs), the pair table, and the packed data to the output file.

The compressor creates a complete pair table and data buffer after each substitution pass, so compression can be stopped after any pass. Since the most frequently occurring pairs are compressed first, quitting after the first few passes can greatly speed up the process while still obtaining reasonable compression. Thus you can easily tune the time/space performance by modifying the THRESHOLD value in compress.c.

This program uses dynamic allocation via malloc to avoid memory limitations that would be imposed by some systems. You can adjust the input buffer size MAXSIZE, and the hash table size HASHSIZE, as necessary. MS-DOS allows a MAXSIZE of only 64KB, but other systems may allow larger values. On MS-DOS systems, compile compress.c with a large memory model or reduce MAXSIZE. Setting HASHSIZE too small may result in hash table overflow. (It would probably be a good idea to add some error checking here.)

The decompression program expand.c (Listing 2) is very simple and fast, using only 276 bytes for variables. In this respect the decompression algorithm differs from most compression schemes, where the compressor and decompressor maintain the same large data model in synchronization. In BPE, it is the compressor that does most of the work. The compressor packages the data nicely to simplify decompression.

The first byte of the compressed data file represents the number of pairs in the pair table (which immediately follows). The number of pairs, which can range from 0 to 127, is encoded in this first byte as a value ranging from 128 to 255 respectively. This encoding scheme allows the decompressor to handle a normal text file in the same way as a compressed file. If the first byte is less than 128, then no pair table is expected and the first byte is output as is. Since the normal BPE decompression process passes non-pair-code bytes through unchanged, the program can accept both compressed and uncompressed files.

A Complete BPE Implementation

These programs can be modified to handle large files by buffering blocks of data and compressing each block separately. Each block is written with its own pair table and the number of packed data bytes that follow. This technique also supports the use of data streams and can provide better compression by allowing local adaptation to the data in each block.

Similarly, binary files can be handled by adding bytes to the buffer until it is full, or only a small number of unused bytes remain to represent pair codes. Since these unused codes are typically scattered around the character set, the pair table can be written as three bytes per substitution, two for the pair and one for the unused byte that replaces the pair.
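
As a literal rendering of that three-bytes-per-substitution idea (a sketch only, with a hypothetical writer function; the full implementation in [1] defines its own format), the table could be written like this:

#include <stdio.h>

/* Hypothetical table writer for the binary-data case: each substitution
   is stored as (left byte, right byte, replacement code), since the
   replacement codes are no longer consecutive. */
void write_pair_table(FILE *out, unsigned char pairs[][3], int npairs)
{
  int i;
  putc(npairs, out);             /* number of substitutions */
  for (i = 0; i < npairs; i++) {
    putc(pairs[i][0], out);      /* left byte of the pair */
    putc(pairs[i][1], out);      /* right byte of the pair */
    putc(pairs[i][2], out);      /* unused byte that replaces the pair */
  }
}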

The original BPE code in Reference [1] implements both buffering and binary data processing to form a complete, general-purpose compression system. This code is available on the CUJ ftp site with this month's code listings (see p. 3 for downloading instructions). This version can handle files of any size containing any type of data. It compresses text files somewhat better than the simplified version presented in this article by exploiting all unused characters in the entire character set, not just the upper half. Compared to typical LZW (Lempel-Ziv-Welch) programs, this original BPE implementation has slower compression, faster decompression, and a similar compression ratio on typical binary data and small text files, but a slightly lower ratio on large text files.

BPE's worst-case data expansion on truly random data is only about 0.2 percent, while many other compression schemes may greatly inflate such data. If the data cannot be compressed, it is passed through unchanged except for the addition of a few header bytes for each block.
