Despite advances in data-compression techniques, more often than not it's the combination of selected algorithms--along with knowledge of a data file's contents--that produces optimal results.
July 01, 1993
URL:http://www.drdobbs.com/database/algorithm-alley/184409041
File-compression algorithms are among the most fascinating in computer science. The very idea that a certain sequence of bytes can be represented by another, shorter sequence of bytes makes me wonder if, someday, a programmer will discover the ultimate compression method that can reduce any data set to a single integer. Imagine how much disk space you would gain with a 99.999 percent file-compression ratio.
That may seem to be an impossible dream until you consider a story told by Martin Gardner in his book, Gotcha (W.H. Freeman, 1982). As Gardner tells the tale, on visiting Earth, an inquisitive alien wishes to collect all of human knowledge. Having no room for the Encyclopedia Britannica aboard an already-cramped spaceship (and lacking a CD-ROM player), the alien proposes a clever method for compressing the Encyclopedia's volumes. Assuming there are fewer than 1000 unique letters, digits, and other symbols in the text, our resourceful visitor assigns each symbol a three-digit code from 000 to 999, including leading 0s. The word "Snow," for example, might be encoded as 083110111119. Translating the entire Encyclopedia this way produces a giant integer that, with an imaginary decimal point to the left, is equivalent to the decimal fraction, A/B. To complete the data compression, the visitor places a mark on a small rod of otherworldly material that divides the bar into two lengths, A and B. After returning to Zenon (or wherever), the alien precisely measures the marked rod, obtains lengths A and B, and divides A/B to yield the original integer, which a computer decodes to print out the Encyclopedia.
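The alien's encoding step is easy to try for yourself. Here is a quick Python sketch (the function name is mine) that concatenates a zero-padded three-digit ASCII code for each character, reproducing the article's "Snow" example:

```python
def alien_encode(text: str) -> str:
    """Concatenate each character's ASCII code as a zero-padded three-digit string."""
    return "".join(f"{ord(ch):03d}" for ch in text)

print(alien_encode("Snow"))  # 083110111119, as in the article's example
```

Decoding just splits the digit string back into groups of three; the hard part, as Gardner notes, is marking the rod.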
Is this a valid data-compression technique? Yes. Is it possible to implement? Hardly. To precisely mark the rod, explains Gardner, would require measuring distances many times smaller than an electron. If, however, you could make an appropriately fine measurement, the method would work. So, theoretically speaking, does this imply that all data is infinitely compressible? Maybe not, but imagine a "rod drive" of the future that moves a quark-size mark up and down to compress gigabytes of data on a staff no bigger than a popsicle stick. What a great backup system that would be! All of this is science fiction, of course, although the story makes me wonder what sorts of exotic data-compression techniques--not to mention aliens--are out there waiting to be discovered.
Popular text-compression algorithms such as variable-width Huffman codes, which have been covered extensively in DDJ, produce impressive compression ratios. Though highly efficient, Huffman codes must be processed one bit at a time, causing compression programs to run slowly.
When choosing a compression algorithm, however, efficiency isn't always the primary concern. Speed and ease of use might be more important in some applications, and in this column, I'll focus on algorithms that, unlike Huffman codes, run quickly while still appreciably reducing text-file sizes. The methods gain much of their speed simply by processing text as characters rather than as individual bits. I'll start with the most basic of these methods: embedded tabs. Though simple, the algorithms for inserting and deleting tabs belong in every programmer's toolbox. I'll also explore some other alien--that is, less common--text-compression algorithms that you might want to try.
To insert tabs, Example 1 calls a subfunction, NextChar, which peeks ahead to the next character in line L. If that character equals a unique sentinel appended to the end of the line, the algorithm sets the flag Eol to True. A While loop appends tabs to a temporary string T. A second While loop appends blanks to T to fill out partial columns that can't be aligned with a tab. Removing tabs is the simpler of the two related algorithms; see Example 2. In this case, the procedure merely examines line L one character at a time, outputting blanks up to the next tab stop for every tab character it encounters.
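The two algorithms can also be sketched compactly in Python. This is not a line-for-line port of the Pascal listings, and the function names are mine, but the idea is the same: runs of blanks become tabs wherever a tab stop is crossed, and tabs expand back to blanks out to the next tab stop:

```python
TAB_WIDTH = 8  # try 4 for source code files, as the article suggests

def insert_tabs(line: str, tab_width: int = TAB_WIDTH) -> str:
    """Replace runs of blanks that cross tab-stop boundaries with tab characters."""
    out = []
    col = 0                      # current output column
    i = 0
    while i < len(line):
        j = i
        while j < len(line) and line[j] == " ":
            j += 1               # scan a run of blanks
        if j > i:
            target = j           # column at which the run ends
            # emit a tab for each tab stop the run crosses
            while (col // tab_width + 1) * tab_width <= target:
                out.append("\t")
                col = (col // tab_width + 1) * tab_width
            out.append(" " * (target - col))  # fill the partial column
            col = target
            i = j
        else:
            out.append(line[i])
            col += 1
            i += 1
    return "".join(out)

def remove_tabs(line: str, tab_width: int = TAB_WIDTH) -> str:
    """Expand each tab into blanks out to the next tab stop."""
    out = ""
    for ch in line:
        if ch == "\t":
            out += " " * (tab_width - len(out) % tab_width)
        else:
            out += ch
    return out
```

A round trip through insert_tabs and remove_tabs reproduces the original line exactly, which is what makes the technique safe to use on source files.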
INSTAB.PAS (Listing One, page 142) implements Algorithm #6. REMTAB.PAS (Listing Two, page 142) implements Algorithm #7. To alter the column width, change tabWidth to any value from 2 up. A width of 4, instead of the standard 8, usually produces better results on program source files. For simplicity, and to keep the listings short, I designed the programs to write their output to a new file, TXT.OUT, which is overwritten in the current directory without warning. The original file is never changed.
While experimenting with these programs, I noticed an odd effect that tabs can have on other compression software such as the popular PKZIP and LHA. Table 1 shows the results of one test in which I inserted 4-column tabs into a Pascal source file of 16,273 bytes, reducing the file to 15,052 bytes. PKZIP further reduced the tabbed file to 5614 bytes, but to my surprise, this file was 277 bytes larger than the compressed, tab-less text. LHA did a better job than PKZIP in compressing this file, but here again, the compressed file with tabs was 224 bytes larger than the compressed, tab-less original.
These numbers suggest some interesting points about PKZIP and LHA, and possibly other data compressors. The amount of savings gained from compressing a file can be highly dependent on the nature of the information in that file. Also, compressing text with tabs apparently interferes with algorithms that compact runs of identical values (not only blanks) into a count N and a value V, a method commonly called "run-length encoding." It therefore makes good sense to strip tabs from text files before compressing them. Perhaps a smart text-compression engine could recognize this fact and detab files before compressing. During decompression, the tabs could easily be added back. I've never seen a data compressor that does this, but the idea might be worth giving a whirl.
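Run-length encoding as described above, collapsing a run of identical values into a count N and a value V, takes only a few lines. In this Python sketch the two-byte count/value format is my own choice, not PKZIP's or LHA's:

```python
def rle_encode(data: bytes) -> bytes:
    """Pack runs of identical bytes as (count, value) pairs; count capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1               # extend the current run
        out += bytes([j - i, data[i]])
        i = j
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    """Expand (count, value) pairs back into runs."""
    out = bytearray()
    for k in range(0, len(data), 2):
        out += bytes([data[k + 1]]) * data[k]
    return bytes(out)
```

A run of blanks compresses to just two bytes, which is why detabbing before compression can help: a tab character breaks what would otherwise be one long, cheap run.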
One drawback with indentation compression is the need to reserve a unique escape code, but a side benefit is that compressed files tend to display faster--the exact opposite of the extra runtime overhead usually added by compression algorithms. To display a line of compressed text, assuming the display row is clear, just move the cursor to the position of the first nonblank character, saving the time it would take to display that many spaces one at a time. It's amazing how much display speed you can gain from this simple method. A programmer's editor could use the concept internally to increase output speed on relatively slow terminals or with sluggish GUIs such as Microsoft Windows. The compressed text would also conserve memory.
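The indentation-compression scheme boils down to replacing a line's leading blanks with an escape code and a count. A minimal Python sketch, assuming an escape byte of 01 hex that never occurs in the text (both the escape value and the two-byte format are my assumptions, not the article's):

```python
ESC = "\x01"  # assumed escape byte; must not occur in the text itself

def pack_indent(line: str) -> str:
    """Replace two or more leading blanks with ESC plus a one-character count."""
    n = len(line) - len(line.lstrip(" "))
    return ESC + chr(n) + line[n:] if n > 1 else line

def unpack_indent(line: str) -> str:
    """Restore the leading blanks encoded by pack_indent."""
    if line.startswith(ESC):
        return " " * ord(line[1]) + line[2:]
    return line
```

To display a packed line quickly, a program could read the count, position the cursor, and print only the text that follows, exactly as described above.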
Example 3, Algorithm #8, lists the pseudocode for differentially compressing a word W given a preceding word P. Example 4, Algorithm #9, expands a compressed word W, again given an uncompressed preceding word P. To use these algorithms, initialize P to a null string, then pass a list of words W to the appropriate procedure.
DIFF.PAS (Listing Three, page 142) and UNDIFF.PAS (Listing Four, page 142) implement Algorithms #8 and #9, respectively. To test the programs, I compressed a sorted list of 21,400 proper names. Compressing the file with DIFF reduced its size from 174,510 to 112,935 bytes, a savings of about 35 percent. That's pretty good as compression ratios go, but consider what happened when I again compressed each of the two files using LHA. The original, plain text reduced to 57,295 bytes. The differentially compressed text slimmed down to only 42,977 bytes, a total reduction of greater than 75 percent. Now, that's a hot compression ratio!
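The pair of algorithms might be sketched in Python like this (function names are mine; as in the Pascal listings, the shared-prefix length is stored as a single character code, which limits prefixes to 255 characters):

```python
def diff_compress(words):
    """Replace each word's prefix shared with the previous word by a count character."""
    prev = ""
    out = []
    for w in words:
        i = 0
        while i < len(w) and i < len(prev) and w[i] == prev[i]:
            i += 1               # measure the shared prefix
        out.append(chr(i) + w[i:])
        prev = w
    return out

def diff_expand(codes):
    """Rebuild each word from the count character and the previous expanded word."""
    prev = ""
    out = []
    for c in codes:
        w = prev[:ord(c[0])] + c[1:]
        out.append(w)
        prev = w
    return out

names = ["Anders", "Anderson", "Andrew", "Brown"]
# e.g. "Anderson" after "Anders" becomes chr(6) + "on"
print(diff_compress(names))
```

The method pays off only on sorted input, where adjacent entries tend to share long prefixes; on unsorted text the count characters are mostly zero and the file grows by one byte per word.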
The message here is that, despite advances in data-compression techniques, it's often the combination of selected algorithms, along with knowledge of a data file's contents, that produces optimal results. Removing tabs from text files, taking advantage of indentation, or packing sorted files differentially (to name just a few of many possibilities), and then compressing those precompressed files with other techniques, can reduce file sizes more than any single algorithm on its own.
Sorry, but I can't return marked rods or popsicle sticks. (I just know some alien with a sense of humor is going to send me a bag of these.)
Example 1: Pseudocode for inserting tabs (Algorithm #6).

procedure InsertTabs(L: String);
var
  T: String;
  J, K: Integer;
  C, Q: Char;
  Eol: Boolean;  { End of line }

  function NextChar(var C: Char): Char;
  begin
    C <- L[J + 1];
    Eol <- C = Q;  { True at end of line }
    Return C;
  end;

begin
  Set T to null string;
  Set Q to unique char;
  Append Q to L as sentinel;
  K <- 0;  { Column count }
  repeat
    J <- K;
    while NextChar(C) = blank do
    begin
      J <- J + 1;
      if J mod tabWidth = 0 then
      begin
        Append tab to T;
        K <- J;
      end;
    end;
    while (K < J) do
    begin
      Append blank to T;
      K <- K + 1;
    end;
    if not Eol then
    begin
      Append C to T;
      K <- K + 1;
    end;
  until Eol;
  Return T;
end;
Table 1: Sizes in bytes of a Pascal source file, before and after compression.

              Original    PKZIP      LHA
Without tabs    16,273    5,337    5,127
With tabs       15,052    5,614    5,351
Bytes saved      1,221     -277     -224
Example 2: Pseudocode for removing tabs (Algorithm #7).

procedure RemoveTabs(L: String);
var
  T: String;
  I: Integer;
  C: Char;
begin
  Set T to null string;
  for I <- 1 to Length(L) do
  begin
    C <- L[I];
    if C = tab then
      repeat
        Append blank to T
      until Length(T) mod tabWidth = 0
    else
      Append char C to T;
  end;
  Return T;
end;
Example 3: Pseudocode for differential compression (Algorithm #8).

procedure Compress(var W, P: String);
var
  T: String;   { Temporary copy of W }
  I: Integer;  { String index }
begin
  Copy W to T;
  Set I equal to 1;
  while (I <= Length(W)) and (I <= Length(P)) and (W[I] = P[I]) do
    I <- I + 1;
  Delete I - 1 chars from head of W;
  Insert Chr(I - 1) at head of W;
  Set P equal to T;
end;
Example 4: Pseudocode for differential decompression (Algorithm #9).

procedure Decompress(var W, P: String);
var
  I: Integer;  { String index }
begin
  Set I to value of W[1];
  Delete W[1] from W;
  while (I >= 1) do
  begin
    Append P[I] to head of W;
    I <- I - 1;
  end;
  Set P equal to W;
end;
Listing One

{ instab.pas -- Algorithm #6: Insert Tabs by Tom Swan }

program InsTab;

const
  outFName = 'TXT.OUT';
  tabWidth = 8;    { Try 4 for source code files }
  blank    = #32;  { ASCII blank character }
  tab      = #09;  { ASCII tab character }

var
  InFName: String;
  InFile, OutFile: Text;
  Line: String;

procedure InsertTabs(var L: String);
var
  T: String;
  J, K: Integer;
  C, Q: Char;
  Eol: Boolean;  { End of line }

  function NextChar(var C: Char): Char;
  begin
    C := L[J + 1];
    Eol := C = Q;  { True at end of line }
    NextChar := C
  end;

begin
  T := '';     { Set result T to null }
  Q := #0;     { Sentinel }
  L := L + Q;  { Append sentinel to L }
  K := 0;      { Column count }
  repeat
    J := K;
    while NextChar(C) = blank do
    begin
      J := J + 1;
      if J mod tabWidth = 0 then
      begin
        T := T + tab;
        K := J
      end;
    end;
    while (K < J) do
    begin
      T := T + blank;
      K := K + 1
    end;
    if not Eol then
    begin
      T := T + C;
      K := K + 1
    end;
  until Eol;
  L := T  { Return T via parameter L }
end;

begin
  Writeln('Insert tabs');
  Write('Input file name? ');
  Readln(InFName);
  Assign(InFile, InFName);
  Reset(InFile);
  Assign(OutFile, outFName);
  Rewrite(OutFile);
  Write('Inserting tabs...');
  while not Eof(InFile) do
  begin
    Readln(InFile, Line);
    InsertTabs(Line);
    Writeln(OutFile, Line)
  end;
  Writeln;
  Close(InFile);
  Close(OutFile);
  Writeln(InFName, ' -> ', outFName)
end.
Listing Two

{ remtab.pas -- Algorithm #7: Remove Tabs by Tom Swan }

program RemTab;

const
  outFName = 'TXT.OUT';
  tabWidth = 8;    { Try 4 for source code files }
  blank    = #32;  { ASCII blank character }
  tab      = #09;  { ASCII tab character }

var
  InFName: String;
  InFile, OutFile: Text;
  Line: String;

procedure RemoveTabs(var L: String);
var
  T: String;
  I: Integer;
  C: Char;
begin
  T := '';
  for I := 1 to Length(L) do
  begin
    C := L[I];
    if C = tab then
      repeat
        T := T + blank
      until Length(T) mod tabWidth = 0
    else
      T := T + C
  end;
  L := T  { Return T via parameter L }
end;

begin
  Writeln('Remove tabs');
  Write('Input file name? ');
  Readln(InFName);
  Assign(InFile, InFName);
  Reset(InFile);
  Assign(OutFile, outFName);
  Rewrite(OutFile);
  Write('Removing tabs...');
  while not Eof(InFile) do
  begin
    Readln(InFile, Line);
    RemoveTabs(Line);
    Writeln(OutFile, Line)
  end;
  Writeln;
  Close(InFile);
  Close(OutFile);
  Writeln(InFName, ' -> ', outFName)
end.
Listing Three

{ diff.pas -- Algorithm #8: Differential Compression by Tom Swan }

program Diff;

const
  outFName = 'TXT.OUT';

var
  InFName: String;
  InFile, OutFile: Text;
  AWord, Prev: String;

procedure Compress(var W, P: String);
var
  T: String;   { Temporary copy of W }
  I: Integer;  { String index }
begin
  T := W;
  I := 1;
  while (I <= Length(W)) and (I <= Length(P)) and (W[I] = P[I]) do
    I := I + 1;
  Delete(W, 1, I - 1);
  W := Chr(I - 1) + W;
  P := T
end;

begin
  Writeln('Differential Compression');
  Write('Input file name? ');
  Readln(InFName);
  Assign(InFile, InFName);
  Reset(InFile);
  Assign(OutFile, outFName);
  Rewrite(OutFile);
  Prev := '';
  Write('Compressing...');
  while not Eof(InFile) do
  begin
    Readln(InFile, AWord);
    Compress(AWord, Prev);
    Writeln(OutFile, AWord)
  end;
  Writeln;
  Close(InFile);
  Close(OutFile);
  Writeln(InFName, ' -> ', outFName)
end.
Listing Four

{ undiff.pas -- Algorithm #9: Differential Decompression by Tom Swan }

program UnDiff;

const
  outFName = 'TXT.OUT';

var
  InFName: String;
  InFile, OutFile: Text;
  AWord, Prev: String;

procedure Decompress(var W, P: String);
var
  I: Integer;
begin
  I := Ord(W[1]);
  Delete(W, 1, 1);
  while (I >= 1) do
  begin
    W := P[I] + W;
    I := I - 1
  end;
  P := W
end;

begin
  Writeln('Differential Decompression');
  Write('Input file name? ');
  Readln(InFName);
  Assign(InFile, InFName);
  Reset(InFile);
  Assign(OutFile, outFName);
  Rewrite(OutFile);
  Prev := '';
  Write('Decompressing...');
  while not Eof(InFile) do
  begin
    Readln(InFile, AWord);
    Decompress(AWord, Prev);
    Writeln(OutFile, AWord)
  end;
  Writeln;
  Close(InFile);
  Close(OutFile);
  Writeln(InFName, ' -> ', outFName)
end.
Copyright © 1993, Dr. Dobb's Journal