*Kas is a consultant specializing in the design and implementation of ultrahigh-speed data compression algorithms. He can be reached at 578 Fairfield Ave., Stamford, CT 06902.*

"Omit needless words! Omit needless words! Omit needless words!"

-- Will Strunk, Jr.

Data compression is many times thought of as an exercise in redundancy removal. Actually, it is much more. Data compression cuts right to the heart of one of the two classical problems of information theory -- how best to encode a message. (The other is how best to send a message in the presence of noise.) "Best" here is taken to mean most efficient, in terms of bits per symbol. When a message has been expressed in the fewest possible bits per symbol, it is said to be optimally encoded. No bits are wasted. This is the goal of data compression.

Claude Shannon was among the first to try to quantify the encoding efficiency of various coding schemes, including ordinary English text. One of Shannon's favorite investigative techniques was a simple cocktail-party game in which he would pick a page in a book at random and (while reading a portion of text out loud, one letter at a time) have volunteers try to guess what the next letter(s) would be, based solely on the letters that had come before. If the player could not guess correctly, Shannon would give the player the correct answer, and that would form the "clue" for the next round. Shannon kept a tally of correct and incorrect guesses as the game went on. At the end of the game, it was possible to tally the redundancy (or encoding inefficiency) of the given portion of text. For example, the outcome of one game might be that the player correctly guessed 60 out of 90 letters in a stretch of text. This would imply that two thirds of the letters were redundant, because the player could predict them in advance, based on conventional spelling rules, grammar, and usage.

Shannon concluded that ordinary English text is anywhere from 70 to 80 percent redundant. This, in turn, implies that only about 2 bits per 8-bit byte of text stored on disk (or in RAM) actually contain information -- the remaining bits are redundant. Shannon would turn this statement around and say that the average information content of English is 2 bits per symbol, or thereabouts.

Not content to let party-goers determine the information content of data streams, Shannon looked for ways to calculate the information content of various "messages." The strategy he used was disarmingly simple. Let the unit of information be the "bit," yes or no, one or zero. Let a message (or event) be deemed informative only to the extent that it resolves uncertainty in the mind of the observer. If the observer already knows (or can correctly guess) a message, then that message conveys no information. If the message cannot be guessed, it does convey information.

Suppose our alphabet is only two letters long. The information conveyed by a stream of bits (each bit representing one letter of our alphabet) is inversely proportional to the predictability of the bits' values. This is made clearer by imagining that our bit values represent opposite sides of a coin (1 for heads, 0 for tails). Each toss of the coin resolves an uncertainty of 1 bit -- assuming the coin lands heads-up half the time and tails-up half the time. But consider the case of a weighted coin: Suppose we know (from experience) that our "dishonest nickel" falls heads-up three-fourths of the time. How much uncertainty is resolved with each toss? Clearly, it must be less than 1 bit, because if we simply guess "heads" 100 percent of the time, we'll be correct more often than not when attempting to guess the outcome of successive coin tosses. What it means is that the information efficiency of the toss has been degraded. And we can calculate the amount.

Start by reducing everything to its probability of occurrence. An honest nickel has a 0.50 chance of turning up heads, and an equal chance of turning up tails, and therefore an informational degree of freedom corresponding to -0.50 * log(0.50) for heads, plus -0.50 * log(0.50) for tails, or a total of 1.00 bit per toss. (Here, we mean base-2 logarithm when we say "log.") By contrast, the dishonest nickel is constrained so that its informational degree of freedom is as illustrated in Example 1. In other words, if an honest nickel is telling us 1 bit of information per toss, a dishonest nickel that falls heads-up 75 percent of the time is telling us only 0.811 bits of information per toss. You could say that each toss is 19.9 percent redundant!

#### Example 1: The informational degree of freedom of the dishonest nickel

- (0.75) * log(0.75) for heads + - (0.25) * log(0.25) for tails ------------------------------------- TOTAL = - [ 0.25 * (-2) + 0.75 * (-0.415) ] = 0.811 bit per toss

With English text, we have an alphabet of 26 characters, representing (if you will) 26 possible outcomes for each "toss" (each symbol), and thus 4.76 bits per "toss" if all outcomes are equally probable. But we know that all outcomes are not equally likely in English text. The 26-sided dice have been weighted so that "e" turns up more often than any other letter, with "x" occurring much less frequently than "s," and so forth. To obtain the informational degree of freedom of English text, we need to determine the probability P of occurrence of each of the 26 letters of the alphabet, then sum the terms - Plog(P) for all letters.

Shannon reserved a special name for this quantity, which we've been calling the "informational degree of freedom." He called it entropy, in honor of the fact that the equation that expresses it is of the same form as the equation derived by Boltzmann for thermodynamic entropy, namely S = k log(W) where S is entropy, W is the number of ways in which the parts of the system can be rearranged, and k (for computations involving gas molecules, etc.) is a fundamental constant of nature, now known as Boltzmann's constant.

In information theory as well as thermodynamics, entropy is a measure of freedom of choice. It represents the average uncertainty as to which of many states a system might be in. For a data stream, it's the number of bits per symbol required to encode the message.

Because entropy calculation is a straightforward matter, it's also easy to determine the degree of redundancy in a file. If a data file is represented as 8-bit bytes, we can obtain the redundancy of the file by subtracting its entropy from 8. (The answer will be in bits per byte.) A Turbo C program (called ENTROPY.C) to calculate the entropy and apparent redundancy of a file is shown in Listing One. Note that this program calculates the entropy of a message with respect to a certain model, in this case the order-0 finite context model. An order-1 finite context model would likely produce a different calculation. Furthermore, there are other ways of modeling data, all of which may generate vastly different calculations. (As an aside, you can think of Shannon's model as a readily available neural network, the human brain.)

Listing One is short and self-explanatory. It is appropriate to point out, however, that the calculation method used in this program offers a first approximation only. Shannon himself saw each symbol (indeed, each string) in a message as constituting the "source" of the symbol located immediately downstream of itself. For instance, in the word "quiet," the "q" can be considered a message source for the message "u," just as "qu" can be considered a message source for "i," and "ui" a source for "e," and so on. The central idea here is that context, as well as statistical abundance, determines information content. A more accurate entropy estimate is obtained when context (upstream and downstream characters) is taken into effect. This can be done by tallying character frequencies in a two-dimensional array, with one dimension given by an upstream character and the other given by the character currently being read. (The "alphabet" associated with "q" is just one letter long: namely "u." Because there is only one allowable following symbol for "q," the "u" contributes nothing to the information of the message, and its entropy contribution in the context of "q" is zero.)

Having said all this, the zero-order or static entropy calculation offered in ENTROPY.C nonetheless gives a very good first approximation to entropy. For example, running ENTROPY on itself (the file ENTROPY.EXE being some 25K in size) yields an entropy estimate of 6.933 bits per byte, for a file redundancy of 13 percent. Using ARC 6.02 to compress the same file resulted in a 14 percent size reduction. Running ENTROPY on COMMAND.COM (Compaq DOS 3.3) gave an entropy estimate of 6.436 bits per byte, for an apparent redundancy of 20 percent; ARC reduced the file by 22 percent.

Knowing nothing else about a file other than the frequency of occurrence of its constituent bytes, and using no data compression techniques whatsoever, we are able to calculate its redundancy to an astonishingly accurate degree. When you think about it, this is pretty amazing. (It wasn't long after Shannon published his work in this area, of course, that a fellow by the name of Huffman devised an algorithm to exploit statistical redundancy in files, to achieve more efficient coding.)

Understanding entropy is basic to understanding data compression. Only when data compression is seen in the context of efficient information encoding (as opposed to mere redundancy removal) can true insight into the data compression problem be obtained.

### References

- Huffman, D.A. "A Method for the Construction of Minimum Redundancy Codes." Proceedings of Institute of Electrical and Radio Engineers 40 (9), 1098-1101. September, 1952.
- Shannon, C.E. & W. Weaver. The Mathematical Theory of Communication. Urbana, Illinois: Univ. of Illinois Press, 1949.
- Lelewer, Debra A. and Hirschberg, Daniel S. "Data Compression." ACM Computing Surveys, vol. 19, no 3. New York, September, 1987.

_ENTROPY_ by Kas Thomas

**[LISTING ONE]**

<a name="006c_0007"> /* * * * * * * * * * * * * * * * * ENTROPY.C * * * * * * * * * * * * * * * */ /* Calculates zero-order entropy of a file, a la Shannon. */ /* Turbo C version by Kas Thomas */ /* You may distribute this listing to fellow programmers. Please retain */ /* authorship notices, however. */ /* This program will give an approximate measure of how compressible a */ /* given file is using Huffman-type compression techniques. It calculates */ /* the best compression possible using order-0 finite context modelling. */ /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */ #include<stdio.h> #include<stdlib.h> #include<math.h> #define LOG(x) 3.32 * log10(x) /* base-2 log macro */ #define ENTROPY(x) -(x * LOG(x)) /* classic definition of entropy */ FILE *in; /* input file pointer */ unsigned int table[256]; /* count data goes here */ void read_input(void); double analyze(void); void usage(void); /* ------------------- MAIN ------------------ */ main(int ac, char **av) { double result; /* return value of analyze() */ if (ac==1) usage(); /* explain program usage & exit */ in = fopen(av[1],"rb"); /* open the input file */ if (!in) printf("\nCouldn't open input file."); /* error message */ if (!in) exit(-1); /* exit program if file couldn't be opened */ printf("\n ** Reading file . . ."); /* status message */ read_input(); /* read file & tally character frequencies */ printf("\n ** Calculating . . .\n"); /* status message */ result = analyze(); /* analyze the frequency data */ /* finally, print the results to the screen */ printf("\n The file \"%s\" has a zero-order",av[1]); printf("\n entropy of %3.3f bits per byte.\n",result); printf("\n Approximate shrinkage potential"); printf("\n using Huffman techniques:"); printf(" %2.0f%%\n\n\n",100-(result * 100)/8); fclose(in); /* close file */ return (1); /* optional, but a good idea anyway */ } /* end function main() */ /* ----------------------- read_input() ----------------------- */ void read_input() { int ch; while (( ch = getc(in)) != EOF) /* until EOF reached . . . */ table[ch]++; /* read a byte at a time & tally char counts */ } /* end function read_input() */ /* ----------------------- analyze() -------------------------- */ double analyze() { double accum = 0.0; /* entropy will accumulate here */ double freq; /* frequency of occurrence of character */ long fsize = 0L; /* input file's size */ register int z; /* scratch variable */ fsize = ftell( in ); /* get file size */ for (z = 0; z < 256; z++) /* for every position in table */ if (table[z]) /* if data exists */ { freq = (double) table[z]/fsize; /* calculate frequency */ accum += (double) ENTROPY(freq); /* get entropy contribution */ } return accum; } /* end analyze() */ /* --------------------------------- usage() -------------------------- */ /* Explain program & exit. */ void usage() { printf("\n\n"); printf(" Entropy v1.00 by Kas Thomas. Public Domain.\n\n"); printf(" Syntax: ENTROPY {filename} [Enter]\n\n"); printf(" Entropy is a measure of information storage efficiency.\n"); printf(" This program calculates a file's entropy, hence its\n"); printf(" compressibility, using the entropy equation of Shannon.\n"); printf(" (See \"Information Theory: Symbols, Signals, & Noise,\"\n"); printf(" by John Pierce, Dover, 1981).\n\n"); exit(1); } /* end function usage() */

Copyright © 1991, *Dr. Dobb's Journal*