Channels ▼
RSS

Database

Algorithm Alley

Source Code Accompanies This Article. Download It Now.


NOV93: ALGORITHM ALLEY

It's been said that a black hole in space is the ultimate data compressor. But, as I recently discovered, the grand poobah of compressors is actually an anonymous clerk who invents acronyms for government publications. ATBAOTCOF is a gem that I recently came across in a volume on southern navigational hazards. As everybody knows, ATBAOTCOF is an abbreviation for "Area to Be Avoided Off the Coast of Florida." Don't laugh. It takes real talent to come up with compact, phonetic acronyms like that. It's phonetic because, if you could pronounce it, ATBAOTCOF is the sound you'd probably make upon accidentally straying into the ATBAOTCOF and being fined a maximum civil penalty of $50,000.

Palindromic Encryption

Government or military acronyms aren't always so ridiculous. The word "radar," for instance, is a shining exception. An acronym for "Radio Detecting And Ranging," radar is also a palindrome, a word that is spelled the same forward and backward. My middle name, Reyer, is a palindrome. Madam is another. Palindromes can also be phrases or sentences such as, quoting from Webster's, Roma tibi subito motibus ibit amor.

A few months ago, while working on a book about file structures, I stumbled across a data-encryption method used in Microsoft Windows Calendar files. I call it the "palindrome encryption algorithm" because the technique produces numeric palindromes from most input sequences. Unfortunately, there's no corresponding recovery method. Once data is encrypted using this algorithm, there is no way to restore the original information.

That may seem to make palindrome encryption as appealing as a bank that accepts deposits but doesn't permit withdrawals. The algorithm may have practical applications, though, in producing encrypted keys with built-in redundancy checks. For example, a Calendar file is identified by the palindrome signature bytes B5 A2 B0 B3 B3 B0 A2 B5, formed by adding the ASCII values of the characters in the uppercase string CALENDAR to the lowercase string radnelac, which is calendar spelled backwards.

Example 1, Algorithm #13, describes the palindrome encryption algorithm in pseudocode. PALPAK.PAS (Listing One, page 144) implements the algorithm in Pascal. Some input strings produce repetitious sequences of identical values--evidence, I suppose, that a corresponding recovery method is unlikely to be discovered. ABCD, for example, translates to the four hexadecimal bytes A5 A5 A5 A5. Technically, that's a numeric palindrome, but it's not a desirable outcome. Interestingly, however, preprocessing this and similarly degenerate cases as palindromes--in other words, feeding the program ABCDDCBA--produces a usable palindrome key. In fact, inputting palindromes such as radar and madam always produces numeric palindromes. How strange!

Also interesting is the fact that any palindrome with an even number of symbols can be compressed by 50 percent simply by throwing out the redundant half. This suggests that palindrome encryption could double as a data-compression method, but since there's no corresponding recovery algorithm, this property has no apparent value. If anybody discovers a practical use for Algorithm #13 or an adaptation, I'd like to know.

Blazing Text Compression

July's "Alien Text-File Compression" produced a basketful of mail along with several interesting text-compression techniques. Chuck Guzis sent along his text compressor that can squash files at blazing speed. Listings Two and Three (page 144 and 145, respectively) implement Chuck's algorithm in C. Compile the programs, then enter a command such as COMP5 IN.TXT OUT.TXT to compress file IN.TXT to file OUT.TXT. Use DECOMP5 similarly to decompress a packed file. Chuck explains how the code operates:

I based the algorithm on the old five-level Baudot TTY's that used five-bit words and escape codes to shift between character sets. This inspired me to use "short" bytes to compress eight-bit characters. Huffman encoding works similarly, but instead of varying the number of bits according to a character's frequency of occurrence, I use only two lengths--short (five-bit) and long (eight-bit) codes. Five bits can store 32 values--enough room for 26 characters plus six punctuation marks or other symbols. Using this fact, the algorithm packs three characters in a 16-bit word, leaving one bit as a flag. Unencoded characters with values below 128 are stored literally. Characters with values above 128 are stored with a zero prefix, which also indicates a run of same value characters when followed by a value less than 128. Most files compress by about 30 percent. Unpacking uses a simple table lookup, and is very fast.

Short and Sweet

Chuck Walbourn contributed short, but sweet, ideas for compressing text information containing numeric values. He writes:

A good way to compress text that contains numbers is to convert them into binary integers or floats. This saves space, of course, but only if the representation is smaller than the ASCII original. You also need to include an escape code to mark encoded values. Another method is to use binary coded decimal (BCD), replacing pairs of digits with bytes. For example, the string "Abc123456" compresses to "Abc" 0x12 0x34 0x56, a total of six bytes instead of nine.

These techniques are examples of a simple but effective compression method called "compact notation." You use this method every time you "compress" dates such as January 15, 1994 to a shorthand numeric form 1-15-94. In a computer, assuming 1900 for the century, any date can be further packed into a 16-bit word using four bits for the month, five bits for the day, and seven bits for the year. The interesting point about compact notation is that it compresses data by searching for equivalent, but shorter, forms of symbols or other values. Only the form of the information changes, not their values.

Zenon Revisited

July's opener, in which an alien from Zenon compressed an encyclopedia by placing a mark on a rod, drew several comments. Russ Pillsbury writes, "I got a real kick out of the marked rod [compression] method. At sub-electron distances, you can insert nearly an infinite number of marks on the same rod. Since any book or volume of books will generate a unique fraction, you can probably place the entire works of humanity on that same stick!"

Sounds great, Russ. When you are finished marking a sample rod, be sure to send me a review copy. First, however, you may want to listen to author Neil Rubenking's ideas for gaining even better compression ratios (as if infinite compression weren't enough). Neil writes:

There's a variation on your stick-marking compression that gets truly excellent results when some characters are more common than others. You start with an imaginary ruler that's marked off with a space for each character proportional to the frequency of that character. Lay this ruler against your stick to start. Take the first character in your input and find its range on the ruler. Now reduce the ruler to the size of that range and move it so it precisely fits on the stick. Find the second character on your ruler and squish the ruler to fit that range. Keep going until you've handled all the input characters. The nice thing is that the number of required significant digits increases more slowly when you encode a more common character. Say you just use A, B, and C, in proportion 2:1:1. Encoding an A reduces the active range to 1/2; encoding a B or C reduces it to 1/4. You can also do this with standard IEEE real numbers, jamming in characters until you've used up all your significant digits and then starting a new real number. It works! The hard part is knowing when to stop decoding the real numbers, but it can be done.

This compression technique, which goes by the name "arithmetic encoding," reminds me of the supposedly true observation that every work--including all of Shakespeare plus the text of every Superman comic ever published--is encoded somewhere in the value of pi--assuming, that is, pi's decimal expansion never repeats. In fact, my unfinished novel is also probably encoded in pi, so I may as well not complete the thing since it has already been "published." (And if you believe that, I know where you can rent a roomful of monkeys to write your next computer program.) (For more information on arthimetic encoding, see "Arithmetic Coding and Statistical Modeling" by Mark Nelson, DDJ, February 1991.)

Mashed Dictionaries

Neil Rubenking also offered an interesting, and perhaps more practical, text compression technique, which employs a variation on Chuck Guzis's method of using short character codes:

I use a kind of differential compression to mash down the dictionary for my anagram generator, NAMEGRAM. First, I reduce every input letter to a number from 0 to 25 (the dictionary consists of capitalized words with no punctuation). Words are limited to 31 characters. Each word is represented by the number of letters it shares with the previous word, the number of additional letters, and the codes for the letters themselves. So your AARDVARK, AARDWOLF, AARONIC example would be encoded as 0 8 A A R D V A R K 4 4 W O L F 3 4 ONIC. This data stream is subject to additional compression by one-third. Every number is less than 32, so 15 bits is enough to hold three numbers. I simply mash three numbers into every two bytes, obtaining better results than PKZIP. A one megabyte file, for example, mashes down to about 290K. The same zipped file is 330K! This just goes to show what you can do when you have control over the data being compressed.

Your Turn

Write to me, make suggestions, complain about your taxes, or even better, share a favorite algorithm. Send mail in care of DDJ, or send compressed text files to my Compuserve ID, 73627,3241. When sending files, please tell me what encryption or compression methods you use. Be forewarned: Undecipherable files will be forwarded to the appropriate government agency for dumping in the ATBAOTCOF.

Example 1:

input
  S: String;
var
  I, E, C: Integer;
begin
  I <- 1;
  E <- Length(S);
  while (I <= Length(S)) do
  begin
    C <- (Uppercase(S[I])
      + Lowercase(S[E]);
    Write C to output;
    I <- I + 1;
    E <- E - 1;
  end;
end;



_ALGORITHM ALLEY_
by Tom Swan


[LISTING ONE]


(* ---------------------------------------------------------------------- *(
**  palpak.pas -- Algorithm #13: Palindrome Encryption                       **
** -----------------------------------------------------------------------**
**  Demonstrates data encryption technique for producing keys or signatures, **
**  Microsoft Calendar files, for example, begin with signature bytes formed **
**  by applying the algorithm to the word Calendar. There is no known        **
**  recovery method for restoring text encrypted using this method.          **
**  Copyright (c) 1993 by Tom Swan. All rights reserved.                     **
)* ---------------------------------------------------------------------- *)

program PalPak;
var S1, S2: String;

function Uch(C: Char): Integer;
begin
  if (C in ['a' .. 'z']) then
    Uch := Ord(C) - 32
  else
    Uch := Ord(C)
end;
function Lch(C: Char): Integer;
begin
  if (C in ['A' .. 'Z']) then
    Lch := Ord(C) + 32
  else
    Lch := Ord(C)
end;
procedure Encrypt(S: String);
var I, E, C: Integer;
begin
  I := 1;
  E := Length(S);
  while (I <= Length(S)) do
  begin
    C := Uch(S[I]) + Lch(S[E]);
    Write(C, ' ');
    I := I + 1;
    E := E - 1
  end
end;
var S: String;
begin
  repeat
    Write('Enter a string: ');
    Readln(S);
    if Length(S) > 0 then
    begin
      Write('Encrypted string: ');
      Encrypt(S);
      Writeln
    end
  until Length(S) = 0;
end.

[LISTING TWO]
<a name="0347_000d">

/* ----------------------------------------------------------------------- *\
**  comp5.cpp -- Compress an ASCII text file.                              **
** - A fast, efficient text compressor that can acheive up to about 33     **
** percent reduction on most text files. - Lowercase characters are        **
** compressed 3 per word. Since there are 32 values in 5 bits, the program **
** can pack the basic 26-character alphabet plus 6 specials characters:    **
** <space>, comma, period, semicolon, hyphen, and quote. -If a value of 00 **
** is detected, next character is taken as a literal respresentation if    **
** the next character has a value in excess of 127 (0x80); otherwise, the  **
** value represents a repeated character count of the next char.           **
**   Copyright (c) 1993 by Chuck Guzis. All rights reserved.               **
\* ----------------------------------------------------------------------- */

#include  <stdio.h>
#include  <stdlib.h>
#include  <string.h>
#include  <ctype.h>

#define LINE_MAX 128          // Length of the longest line
#define DUPE_THRESHOLD 4      // Smallest duplicated byte count
typedef unsigned char UCHAR;  // Define a UCHAR
typedef unsigned int UINT;    // Define a UINT
long Comp_Count;              // How many bytes of compressed text?
long Orig_Count;              // How many bytes of original text?
UCHAR
  lbi[ LINE_MAX+1],           // Input line buffer
  lbo[ LINE_MAX+1];           // Output line buffer
FILE *in;                     // Input file
FILE *out;                    // Output file

//  Our 5-bit alphabet.
unsigned char Alphabet[] = "abcdefghijklmnopqrstuvwxyz ,.;- ";

//  An array for testing for the above.
UCHAR AlphaMap[256];          // One byte per ASCII value

//  Function prototypes.
void main(int, char **);      // Main function
void Error(char *, char *);   // Error routine
void Compress(void);          // Compressor
int DupeCheck(UCHAR *);       // Scan for duplicate characters

void main( int argc, char *argv[])
{
  UCHAR
    *ch,                      // Pointer and short int used to
    i;                        //  build AlphaMap.
  in = NULL;
  out = NULL;
  Comp_Count = 0;
  Orig_Count = 0;

//  Open the files, give an error if problems arise.
  if ( argc != 3)
    Error( "Command form is - COMP5 <in-file> <out-file>\n", NULL);
  if ( !(in = fopen( argv[1], "r")) )
    Error( "Can\'t open %s.\n", argv[1]);
  if ( !(out = fopen( argv[2], "wb")) )
    Error( "Can\'t create %s.\n", argv[2]);

//  Construct the AlphaMap array.  This is done purely for speed, as
//  the memchr() function could be used to search the string.
  memset( AlphaMap, 0, sizeof( AlphaMap)); // Clear it to 0
  for ( ch = Alphabet, i = 1; *ch; ch++, i++)
    AlphaMap[ *ch] = (UCHAR) i;
//  Read the input file, compress and encode it.
  while( fgets( (char *)lbi, sizeof( lbi)-1, in))
  { // Read until eof
    Orig_Count += (strlen( (char *)lbi));   // Total original UCHARs
    Compress();
  } // Digest the file
//  Show some summary data.
  printf( "\n\n"
    "\tOriginal text size:\t\t%ld\n"
    "\tCompressed text length:\t\t%ld\n"
    "\tSavings:\t\t\t%ld\n",
    Orig_Count, Comp_Count, Orig_Count-Comp_Count);
  fclose(in);
  fclose( out);
  exit(0);
}  // End of main

//  Compress - Compress a line and do repeated byte encoding.
//  This is a two-pass operation. Internally, we store a flag in lbi of
//  0x8000 + count. The next byte in lbi[] may contain the repeated byte.
void Compress( void)
{
  register UINT k;
  UCHAR
    *ch1,               // Source pointer
    *ch2;               // Destination pointer
  for ( ch1 = lbi, ch2 = lbo; *ch1;)
  { // Scan the line
    if ( (k = DupeCheck( ch1)) > DUPE_THRESHOLD )
    { // Compression is okay
      *ch2++ = 0;
      *ch2++ = (UCHAR) k;
      *ch2++ = *ch1;            // Store the repeated string
      ch1 += k;
    }
    else
    { // See about characters > 127 -- quote them
      if ( *ch1 > 127)
      { // If needs quoting
        *ch2++ = 0;
        *ch2++ = *ch1++;
      }
      else
      {
//  See if there are three consecutive symbols that reside in our 5-bit
//  alphabet. Note that an end-of-line will fail the test automatically, so
//  it doesn't require checking.
    if ( AlphaMap[ *ch1] &&
         AlphaMap[ *(ch1+1)] &&
         AlphaMap[ *(ch1+2)])
    { // Bingo--got all three
      k = 0x8000 | ((AlphaMap[ *ch1]-1) << 10) |
                   ((AlphaMap[ *(ch1+1)]-1) << 5) |
                    (AlphaMap[ *(ch1+2)] -1);
      *ch2++ = (UCHAR) (k >> 8);   // Store first byte
      *ch2++ = (UCHAR) (k & 255);  // Store second byte
      ch1 += 3;  // Advance
    }
    else
     *ch2++ = *ch1++;  // Everything else
      } // If character below 128
    } // If not compressible string
  } // Scan the line
  Comp_Count += (ch2 - lbo);  // Update compressed count
  fwrite((char *)lbo, (size_t)(ch2 - lbo), sizeof( UCHAR),
    out);  // Dump the line
  return;
} // Compress

//  DupeCheck - Return the number of duplicated characters.
//  Scans to end of line. Always returns at least 1.
int DupeCheck( UCHAR *what)
{
  UCHAR cref;               // Reference character
  int k;                    // Induction variable
  for ( cref = *what++, k = 1; *what; what++, k)
    if ( cref != *what)
      break;                // Just scan for same character
  return k;
} // DupeCheck

//  Error - Give an error and a message, then exit.
//  A string argument may be given, if desired.
void Error( char *msg, char *str)
{
  if ( in)
    fclose(in);      // Close files if still open
  if ( out)
    fclose( out);
  if ( str)
    fprintf( stderr, msg, str);
  else
    fprintf( stderr, msg);
  exit(1);
} // Error


<a name="0347_0030"><a name="0347_000e"><a name="0347_000f">
[LISTING THREE]
<a name="0347_000f">

/* ----------------------------------------------------------- *\
**  decomp5.cpp -- Decompress file compressed with comp5.      **
**   Copyright (c) 1993 by Chuck Guzis. All rights reserved.   **
\* ----------------------------------------------------------- */

#include  <stdio.h>
#include  <stdlib.h>
#include  <string.h>
#include  <ctype.h>

#define LINE_MAX 128              // Length of the longest line
typedef unsigned char UCHAR;      // Define a UCHAR
typedef unsigned int UINT;        // Define a UINT
FILE *in;                         // Input file
FILE *out;                        // Output file

//  Our 5-bit alphabet.
char Alphabet[] = "abcdefghijklmnopqrstuvwxyz ,.;- ";

//  Function prototypes.
void main( int, char **);          // Main function
void Error( char *, char *);       // Error routine
void Decompress( void);            // Decompress
void main( int argc, char *argv[])
{
  in = NULL;
  out = NULL;
//  Open the files, give an error if problems arise.
  if ( argc != 3)
    Error( "Command form is - DECOMP5 <in-file> <out-file>\n", NULL);
  if ( !(in = fopen( argv[1], "rb")) )
    Error( "Can\'t open %s.\n", argv[1]);
  if ( !(out = fopen( argv[2], "w")) )
    Error( "Can\'t create %s.\n", argv[2]);
//  Read the input file, decode it.
  Decompress();
  fclose(in);
  fclose( out);
  exit(0);
}  // End of main

//  Decompress - Read and decompress the data.
//  This routine does all of the work. Single-character I/O is done
//  in the interest of simplicity.
void Decompress( void)
{
  UCHAR ch;             // Current character
  UINT k;               // Scratch word cell
  while( !feof(in))
  { // Read and decode
    ch = (UCHAR) fgetc( in);
    if ( ch & 0x80)
    { // Special flagging, packed 5 bits
      k = (ch << 8);
      ch = (UCHAR) fgetc( in);
      k |= ch;           // Form a word
      fputc( Alphabet[ (k >> 10) & 31], out);
      fputc( Alphabet[ (k >> 5) & 31], out);
      fputc( Alphabet[ k & 31], out);   // Unpack all three
    }
    else
    {
      if ( ch == 0)
      { // Escape or repeated byte
    ch = (UCHAR) fgetc( in);
    if ( ch < 128)
    { // Repeated byte
      k = ch;
      ch = (UCHAR) fgetc( in);  // Get the actual byte
      while (k--)
        fputc( ch, out);        // Write the repeated byte
    }
    else
      fputc( ch, out);      // Not repeated, but just quoted
      }
      else
    fputc( ch, out);        // Just an ordinary character
    } // If not flagged
  } // Read the input
  return;
} // Decompress

//  Error - Give an error and a message, then exit.
//  A string argument may be given, if desired.
void Error( char *msg, char *str)
{
  if ( in)
    fclose(in);         // Close files if still open
  if ( out)
    fclose( out);
  if ( str)
    fprintf( stderr, msg, str);
  else
    fprintf( stderr, msg);
  exit(1);
} // Error






Copyright © 1993, Dr. Dobb's Journal


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video