Despite advances in data-compression techniques, more often than not it's the combination of selected algorithms--along with knowledge of a data file's contents--that produces optimal results.
July 01, 1993
URL:http://www.drdobbs.com/database/algorithm-alley/184409041
File-compression algorithms are among the most fascinating in computer science. The very idea that a certain sequence of bytes can be represented by another, shorter sequence of bytes makes me wonder if, someday, a programmer will discover the ultimate compression method that can reduce any data set to a single integer. Imagine how much disk space you would gain with a 99.999 percent file-compression ratio.
That may seem to be an impossible dream until you consider a story told by Martin Gardner in his book, Gotcha (W.H. Freeman, 1982). As Gardner tells the tale, on visiting Earth, an inquisitive alien wishes to collect all of human knowledge. Having no room for the Encyclopedia Britannica aboard an already-cramped spaceship (and lacking a CD-ROM player), the alien proposes a clever method for compressing the Encyclopedia's volumes. Assuming there are fewer than 1000 unique letters, digits, and other symbols in the text, our resourceful visitor assigns each symbol a three-digit code from 000 to 999, including leading 0s. The word "Snow," for example, might be encoded as 083110111119. Translating the entire Encyclopedia this way produces a giant integer that, with an imaginary decimal point to the left, is equivalent to the decimal fraction, A/B. To complete the data compression, the visitor places a mark on a small rod of otherworldly material that divides the bar into two lengths, A and B. After returning to Zenon (or wherever), the alien precisely measures the marked rod, obtains lengths A and B, and divides A/B to yield the original integer, which a computer decodes to print out the Encyclopedia.
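The alien's encoding step is easy to try for yourself. Here is a quick Python sketch (the function name is mine) that concatenates a zero-padded three-digit ASCII code for each character, reproducing the article's "Snow" example:

```python
def alien_encode(text: str) -> str:
    """Concatenate each character's ASCII code as a zero-padded three-digit string."""
    return "".join(f"{ord(ch):03d}" for ch in text)

print(alien_encode("Snow"))  # 083110111119, as in the article's example
```

Decoding just splits the digit string back into groups of three; the hard part, as Gardner notes, is marking the rod.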
Is this a valid data-compression technique? Yes. Is it possible to implement? Hardly. To precisely mark the rod, explains Gardner, would require measuring distances many times smaller than an electron. If, however, you could make an appropriately fine measurement, the method would work. So, theoretically speaking, does this imply that all data is infinitely compressible? Maybe not, but imagine a "rod drive" of the future that moves a quark-size mark up and down to compress gigabytes of data on a staff no bigger than a popsicle stick. What a great backup system that would be! All of this is science fiction, of course, although the story makes me wonder what sorts of exotic data-compression techniques--not to mention aliens--are out there waiting to be discovered.
Popular text-compression algorithms such as variable-width Huffman codes, which have been covered extensively in DDJ, produce impressive compression ratios. Though highly efficient, Huffman codes must be processed one bit at a time, causing compression programs to run slowly.
When choosing a compression algorithm, however, efficiency isn't always the primary concern. Speed and ease of use might be more important in some applications, and in this column, I'll focus on algorithms that, unlike Huffman codes, run quickly while still appreciably reducing text-file sizes. The methods gain much of their speed simply by processing text as characters rather than as individual bits. I'll start with the most basic of these methods: embedded tabs. Though simple, the algorithms for inserting and deleting tabs belong in every programmer's toolbox. I'll also explore some other alien--that is, less common--text-compression algorithms that you might want to try.
To insert tabs, Example 1 calls a subfunction, NextChar, which peeks ahead to the next character in line L. If that character equals a unique sentinel appended to the end of the line, the algorithm sets the flag Eol to True. A While loop appends tabs to a temporary string T. A second While loop appends blanks to T to fill out partial columns that can't be aligned with a tab. Removing tabs is the simpler of the two related algorithms; see Example 2. In this case, the procedure merely examines line L one character at a time, outputting blanks up to the next tab stop for every tab character it encounters.
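The two algorithms can also be sketched compactly in Python. This is not a line-for-line port of the Pascal listings, and the function names are mine, but the idea is the same: runs of blanks become tabs wherever a tab stop is crossed, and tabs expand back to blanks out to the next tab stop:

```python
TAB_WIDTH = 8  # try 4 for source code files, as the article suggests

def insert_tabs(line: str, tab_width: int = TAB_WIDTH) -> str:
    """Replace runs of blanks that cross tab-stop boundaries with tab characters."""
    out = []
    col = 0                      # current output column
    i = 0
    while i < len(line):
        j = i
        while j < len(line) and line[j] == " ":
            j += 1               # scan a run of blanks
        if j > i:
            target = j           # column at which the run ends
            # emit a tab for each tab stop the run crosses
            while (col // tab_width + 1) * tab_width <= target:
                out.append("\t")
                col = (col // tab_width + 1) * tab_width
            out.append(" " * (target - col))  # fill the partial column
            col = target
            i = j
        else:
            out.append(line[i])
            col += 1
            i += 1
    return "".join(out)

def remove_tabs(line: str, tab_width: int = TAB_WIDTH) -> str:
    """Expand each tab into blanks out to the next tab stop."""
    out = ""
    for ch in line:
        if ch == "\t":
            out += " " * (tab_width - len(out) % tab_width)
        else:
            out += ch
    return out
```

A round trip through insert_tabs and remove_tabs reproduces the original line exactly, which is what makes the technique safe to use on source files.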
INSTAB.PAS (Listing One, page 142) implements Algorithm #6. REMTAB.PAS (Listing Two, page 142) implements Algorithm #7. To alter the column width, change tabWidth to any value from 2 up. A width of 4, instead of the standard 8, usually produces better results on program source files. For simplicity, and to keep the listings short, I designed the programs to write their output to a new file, TXT.OUT, which is overwritten in the current directory without warning. The original file is never changed.
While experimenting with these programs, I noticed an odd effect that tabs can have on other compression software such as the popular PKZIP and LHA. Table 1 shows the results of one test in which I inserted 4-column tabs into a Pascal source file of 16,273 bytes, reducing the file to 15,052 bytes. PKZIP further reduced the tabbed file to 5614 bytes, but to my surprise, this file was 277 bytes larger than the compressed, tab-less text. LHA did a better job than PKZIP in compressing this file, but here again, the compressed file with tabs was 224 bytes larger than the compressed, tab-less original.
These numbers suggest some interesting points about PKZIP and LHA, and possibly other data compressors. The amount of savings gained from compressing a file can be highly dependent on the nature of the information in that file. Also, compressing text with tabs apparently interferes with algorithms that compact runs of identical values (not only blanks) into a count N and a value V, a method commonly called "run-length encoding." It therefore makes good sense to strip tabs from text files before compressing them. Perhaps a smart text-compression engine could recognize this fact and detab files before compressing. During decompression, the tabs could easily be added back. I've never seen a data compressor that does this, but the idea might be worth giving a whirl.
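Run-length encoding as described above, collapsing a run of identical values into a count N and a value V, takes only a few lines. In this Python sketch the two-byte count/value format is my own choice, not PKZIP's or LHA's:

```python
def rle_encode(data: bytes) -> bytes:
    """Pack runs of identical bytes as (count, value) pairs; count capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1               # extend the current run
        out += bytes([j - i, data[i]])
        i = j
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    """Expand (count, value) pairs back into runs."""
    out = bytearray()
    for k in range(0, len(data), 2):
        out += bytes([data[k + 1]]) * data[k]
    return bytes(out)
```

A run of blanks compresses to just two bytes, which is why detabbing before compression can help: a tab character breaks what would otherwise be one long, cheap run.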
One drawback with indentation compression is the need to reserve a unique escape code, but a side benefit is that compressed files tend to display faster--the exact opposite of the extra runtime overhead usually added by compression algorithms. To display a line of compressed text, assuming the display row is clear, just move the cursor to the position of the first nonblank character, saving the time it would take to display that many spaces one at a time. It's amazing how much display speed you can gain from this simple method. A programmer's editor could use the concept internally to increase output speed on relatively slow terminals or with sluggish GUIs such as Microsoft Windows. The compressed text would also conserve memory.
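The indentation-compression scheme boils down to replacing a line's leading blanks with an escape code and a count. A minimal Python sketch, assuming an escape byte of 01 hex that never occurs in the text (both the escape value and the two-byte format are my assumptions, not the article's):

```python
ESC = "\x01"  # assumed escape byte; must not occur in the text itself

def pack_indent(line: str) -> str:
    """Replace two or more leading blanks with ESC plus a one-character count."""
    n = len(line) - len(line.lstrip(" "))
    return ESC + chr(n) + line[n:] if n > 1 else line

def unpack_indent(line: str) -> str:
    """Restore the leading blanks encoded by pack_indent."""
    if line.startswith(ESC):
        return " " * ord(line[1]) + line[2:]
    return line
```

To display a packed line quickly, a program could read the count, position the cursor, and print only the text that follows, exactly as described above.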
Example 3, Algorithm #8, lists the pseudocode for differentially compressing a word W given a preceding word P. Example 4, Algorithm #9, expands a compressed word W, again given an uncompressed preceding word P. To use these algorithms, initialize P to a null string, then pass a list of words W to the appropriate procedure.
DIFF.PAS (Listing Three, page 142) and UNDIFF.PAS (Listing Four, page 142) implement Algorithms #8 and #9, respectively. To test the programs, I compressed a sorted list of 21,400 proper names. Compressing the file with DIFF reduced its size from 174,510 to 112,935 bytes, a savings of about 35 percent. That's pretty good as compression ratios go, but consider what happened when I again compressed each of the two files using LHA. The original, plain text reduced to 57,295 bytes. The differentially compressed text slimmed down to only 42,977 bytes, a total reduction of greater than 75 percent. Now, that's a hot compression ratio!
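The pair of algorithms might be sketched in Python like this (function names are mine; as in the Pascal listings, the shared-prefix length is stored as a single character code, which limits prefixes to 255 characters):

```python
def diff_compress(words):
    """Replace each word's prefix shared with the previous word by a count character."""
    prev = ""
    out = []
    for w in words:
        i = 0
        while i < len(w) and i < len(prev) and w[i] == prev[i]:
            i += 1               # measure the shared prefix
        out.append(chr(i) + w[i:])
        prev = w
    return out

def diff_expand(codes):
    """Rebuild each word from the count character and the previous expanded word."""
    prev = ""
    out = []
    for c in codes:
        w = prev[:ord(c[0])] + c[1:]
        out.append(w)
        prev = w
    return out

names = ["Anders", "Anderson", "Andrew", "Brown"]
# e.g. "Anderson" after "Anders" becomes chr(6) + "on"
print(diff_compress(names))
```

The method pays off only on sorted input, where adjacent entries tend to share long prefixes; on unsorted text the count characters are mostly zero and the file grows by one byte per word.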
The message here is that, despite advances in data-compression techniques, it's often the combination of selected algorithms, along with knowledge of a data file's contents, that produces optimal results. Removing tabs from text files, taking advantage of indentation, or packing sorted files differentially (to name just a few of many possibilities), and then compressing those precompressed files with other techniques, can reduce file sizes more than any single algorithm on its own.
Sorry, but I can't return marked rods or popsicle sticks. (I just know some alien with a sense of humor is going to send me a bag of these.)
Example 1: Pseudocode for inserting tabs (Algorithm #6).

procedure InsertTabs(L: String);
var
  T: String;
  J, K: Integer;
  C, Q: Char;
  Eol: Boolean;  { End of line }

  function NextChar(var C: Char): Char;
  begin
    C <- L[J + 1];
    Eol <- C = Q;  { True at end of line }
    Return C;
  end;

begin
  Set T to null string;
  Set Q to unique char;
  Append Q to L as sentinel;
  K <- 0;  { Column count }
  repeat
    J <- K;
    while NextChar(C) = blank do
    begin
      J <- J + 1;
      if J mod tabWidth = 0 then
      begin
        Append tab to T;
        K <- J;
      end;
    end;
    while (K < J) do
    begin
      Append blank to T;
      K <- K + 1;
    end;
    if not Eol then
    begin
      Append C to T;
      K <- K + 1;
    end;
  until Eol;
  Return T;
end;
Table 1: Sizes in bytes of a Pascal source file, before and after compression.

              Original    PKZIP      LHA
Without tabs    16,273    5,337    5,127
With tabs       15,052    5,614    5,351
Bytes saved      1,221     -277     -224
Example 2: Pseudocode for removing tabs (Algorithm #7).

procedure RemoveTabs(L: String);
var
  T: String;
  I: Integer;
  C: Char;
begin
  Set T to null string;
  for I <- 1 to Length(L) do
  begin
    C <- L[I];
    if C = tab then
      repeat
        Append blank to T
      until Length(T) mod tabWidth = 0
    else
      Append char C to T;
  end;
  Return T;
end;
Example 3: Pseudocode for differential compression (Algorithm #8).

procedure Compress(var W, P: String);
var
  T: String;   { Temporary copy of W }
  I: Integer;  { String index }
begin
  Copy W to T;
  Set I equal to 1;
  while (I <= Length(W)) and (I <= Length(P)) and (W[I] = P[I]) do
    I <- I + 1;
  Delete I - 1 chars from head of W;
  Insert Chr(I - 1) at head of W;
  Set P equal to T;
end;
Example 4: Pseudocode for differential decompression (Algorithm #9).

procedure Decompress(var W, P: String);
var
  I: Integer;  { String index }
begin
  Set I to value of W[1];
  Delete W[1] from W;
  while (I >= 1) do
  begin
    Append P[I] to head of W;
    I <- I - 1;
  end;
  Set P equal to W;
end;
Listing One

{ instab.pas -- Algorithm #6: Insert Tabs by Tom Swan }

program InsTab;

const
  outFName = 'TXT.OUT';
  tabWidth = 8;    { Try 4 for source code files }
  blank    = #32;  { ASCII blank character }
  tab      = #09;  { ASCII tab character }

var
  InFName: String;
  InFile, OutFile: Text;
  Line: String;

procedure InsertTabs(var L: String);
var
  T: String;
  J, K: Integer;
  C, Q: Char;
  Eol: Boolean;  { End of line }

  function NextChar(var C: Char): Char;
  begin
    C := L[J + 1];
    Eol := C = Q;  { True at end of line }
    NextChar := C
  end;

begin
  T := '';     { Set result T to null }
  Q := #0;     { Sentinel }
  L := L + Q;  { Append sentinel to L }
  K := 0;      { Column count }
  repeat
    J := K;
    while NextChar(C) = blank do
    begin
      J := J + 1;
      if J mod tabWidth = 0 then
      begin
        T := T + tab;
        K := J
      end;
    end;
    while (K < J) do
    begin
      T := T + blank;
      K := K + 1
    end;
    if not Eol then
    begin
      T := T + C;
      K := K + 1
    end;
  until Eol;
  L := T  { Return T via parameter L }
end;

begin
  Writeln('Insert tabs');
  Write('Input file name? ');
  Readln(InFName);
  Assign(InFile, InFName);
  Reset(InFile);
  Assign(OutFile, outFName);
  Rewrite(OutFile);
  Write('Inserting tabs...');
  while not Eof(InFile) do
  begin
    Readln(InFile, Line);
    InsertTabs(Line);
    Writeln(OutFile, Line)
  end;
  Writeln;
  Close(InFile);
  Close(OutFile);
  Writeln(InFName, ' -> ', outFName)
end.
Listing Two

{ remtab.pas -- Algorithm #7: Remove Tabs by Tom Swan }

program RemTab;

const
  outFName = 'TXT.OUT';
  tabWidth = 8;    { Try 4 for source code files }
  blank    = #32;  { ASCII blank character }
  tab      = #09;  { ASCII tab character }

var
  InFName: String;
  InFile, OutFile: Text;
  Line: String;

procedure RemoveTabs(var L: String);
var
  T: String;
  I: Integer;
  C: Char;
begin
  T := '';
  for I := 1 to Length(L) do
  begin
    C := L[I];
    if C = tab then
      repeat
        T := T + blank
      until Length(T) mod tabWidth = 0
    else
      T := T + C
  end;
  L := T  { Return T via parameter L }
end;

begin
  Writeln('Remove tabs');
  Write('Input file name? ');
  Readln(InFName);
  Assign(InFile, InFName);
  Reset(InFile);
  Assign(OutFile, outFName);
  Rewrite(OutFile);
  Write('Removing tabs...');
  while not Eof(InFile) do
  begin
    Readln(InFile, Line);
    RemoveTabs(Line);
    Writeln(OutFile, Line)
  end;
  Writeln;
  Close(InFile);
  Close(OutFile);
  Writeln(InFName, ' -> ', outFName)
end.
Listing Three

{ diff.pas -- Algorithm #8: Differential Compression by Tom Swan }

program Diff;

const
  outFName = 'TXT.OUT';

var
  InFName: String;
  InFile, OutFile: Text;
  AWord, Prev: String;

procedure Compress(var W, P: String);
var
  T: String;   { Temporary copy of W }
  I: Integer;  { String index }
begin
  T := W;
  I := 1;
  while (I <= Length(W)) and (I <= Length(P)) and (W[I] = P[I]) do
    I := I + 1;
  Delete(W, 1, I - 1);
  W := Chr(I - 1) + W;
  P := T
end;

begin
  Writeln('Differential Compression');
  Write('Input file name? ');
  Readln(InFName);
  Assign(InFile, InFName);
  Reset(InFile);
  Assign(OutFile, outFName);
  Rewrite(OutFile);
  Prev := '';
  Write('Compressing...');
  while not Eof(InFile) do
  begin
    Readln(InFile, AWord);
    Compress(AWord, Prev);
    Writeln(OutFile, AWord)
  end;
  Writeln;
  Close(InFile);
  Close(OutFile);
  Writeln(InFName, ' -> ', outFName)
end.
Listing Four

{ undiff.pas -- Algorithm #9: Differential Decompression by Tom Swan }

program UnDiff;

const
  outFName = 'TXT.OUT';

var
  InFName: String;
  InFile, OutFile: Text;
  AWord, Prev: String;

procedure Decompress(var W, P: String);
var
  I: Integer;
begin
  I := Ord(W[1]);
  Delete(W, 1, 1);
  while (I >= 1) do
  begin
    W := P[I] + W;
    I := I - 1
  end;
  P := W
end;

begin
  Writeln('Differential Decompression');
  Write('Input file name? ');
  Readln(InFName);
  Assign(InFile, InFName);
  Reset(InFile);
  Assign(OutFile, outFName);
  Rewrite(OutFile);
  Prev := '';
  Write('Decompressing...');
  while not Eof(InFile) do
  begin
    Readln(InFile, AWord);
    Decompress(AWord, Prev);
    Writeln(OutFile, AWord)
  end;
  Writeln;
  Close(InFile);
  Close(OutFile);
  Writeln(InFName, ' -> ', outFName)
end.
Copyright © 1993, Dr. Dobb's Journal