Channels ▼
RSS

Design

Star Encoding

Source Code Accompanies This Article. Download It Now.


Aug02: Algorithm Alley

Mark is a DDJ contributing editor. You can reach Mark at markn@ieee.org or via his web site at http://www.dogma.net/markn/.


Transformation algorithms are an interesting class of data-compression techniques in which you can perform reversible transformations on datasets to increase their susceptibility to other compression techniques. For instance, the Burrows-Wheeler Transform (see my article "Data Compression with the Burrows-Wheeler Transform," DDJ, September 1996) is a completely reversible sorting algorithm that can create long runs of identical characters in the output text. Using a move-to-front algorithm followed by Huffman or arithmetic coding results in compression that outperforms all but the best algorithms. Likewise, in the JPEG compression algorithm, 8×8 blocks of pixels can be run through the Discrete Cosine Transform (DCT), which transforms the spatial data points to the frequency domain. At that point, lossy compression algorithms can then be effectively applied.

The transformations I describe here generally don't compress data at all; instead, they prep the data before compression is applied. Thus, a key factor in using any transform is understanding what compression technique can effectively be used on the output. With the Burrows-Wheeler Transform (BWT), the best results come from a three-stage compressor. The output of the BWT transform is usually piped through a move-to-front stage, then a run-length encoder stage, and finally an entropy encoder (normally arithmetic or Huffman coding).

Star Encoding

BWT is a sorting algorithm that radically changes the nature of an input file. Star encoding is simpler to understand and can be implemented easily with the help of the C++ Standard Library.

Star encoding works by creating a large dictionary of commonly used words expected in the input files. The dictionary must be prepared in advance and must be known to both the compressor and decompressor. Each word in the dictionary has a star-encoded equivalent, in which as many letters as possible are replaced by the "*" character. For example, a commonly used word such as "the" might be replaced by the string "t**". The star-encoding transform simply replaces every occurrence of the word "the" in the input file with "t**".

Ideally, the most common words have the highest percentage of "*" characters in their encodings. If done properly, this means that the transformed file will have a huge number of "*" characters. This ought to make the transformed file more compressible than the original plain text.

Figure 1(a), for example, is a section of text from Project Gutenberg's version of Romeo and Juliet. Running this text through the star encoder yields Figure 1(b). You can see that the encoded data has exactly the same number of characters, but is dominated by stars. It certainly looks as though it is more compressible. But before you can test the compressibility of the output data, you need to build programs to create the dictionary, star encode a file using a dictionary, and decode the file using an identical dictionary.

Creating the Dictionary

MakeDictA.cpp (available electronically; see "Resource Center," page 5) creates the dictionary. MakeDictA is invoked with a list of filenames as command-line arguments. It scans through each file looking for tokens, adds them to the dictionary if needed, and updates the token's frequency count.

Once the tokens have been counted, the list is sorted by frequency. The program then goes through the list, starting with the most frequent tokens, and assigns them star codes. As a final step, the dictionary is written to standard output.

Tokenizing and Scanning

Unfortunately, the C++ Standard Library doesn't have a stream tokenizer. If my needs were more complicated, I might use the Boost Tokenizer (http://boost.org/libs/ tokenizer/index.htm), but in this case, I simply wrote the code shown in Listing One. This function is called repeatedly from main(). The map<string,int> object named frequency is updated as each token is encountered. If the string is not found in frequency, it is inserted with a count of 1. Otherwise, the count is incremented; see Listing Two.

Sorting and Assigning Codes

After all the tokens from all the input files have been scanned, you have a map with an appearance count for every word encountered in the input text. The next step requires that you iterate through all the words, starting at the most frequent and going all the way to the least.

Unfortunately, the frequency map is keyed on the word name, not the frequency. So before code assignment can start, you need to reorder the data. You can do this with just a few lines of code by inserting all the string/integer pairs into a map<int,string> object called counts. Since this map is keyed on the integer value instead of the string, the values are automatically sorted in the form you want; see Listing Three.

With the data sorted correctly, the remaining piece of hard work is to iterate over the counts map and assign a code to each one, as in Listing Four. A couple of interesting things to note show up in this loop. First, each time a new code is assigned to a token, the code value is stored in a set<string> called used_codes. You have to keep track of the used code values during the assignment process so that no star code is inadvertently used twice.

Second, you can see that the value of each token and its code is being written to standard output. (I write the frequency count as well, although this is technically not needed. It helps me to keep an eye on the process and ensure it is working as desired.)

The Code Assignment Algorithm

In MakeDictA, I perform a star-code assignment in a routine called create_star_code(). This algorithm uses a greedy heuristic to assign codes, attempting to use the highest possible number of "*" characters to each token; see Listing Five. For a token of length N, the assignment routine first attempts to replace all characters in the token with stars, then all but one, then all but two, and so on. The often overlooked next_permutation function from the C++ Standard Library helps with this.

Executing MakeDictA

To test this program, I went to Project Gutenberg (http://promo.net/pg/) and retrieved the text of eight Shakespeare plays (also available electronically). With allowances for line breaks, the text below shows the command line I used to create the dictionary file:

[c:\star] MakeDictA text\AsYouLikeIt.txt text\Hamlet.txt text\JuliusCaesar.txt text\KingLear.txt text\Macbeth.txt text\RomeoAndJuliet.txt text\TheTamingOfTheShrew.txt text\TheTempest.txt > DictA.txt

The program creates 16,927 codes in fairly short order, producing DictA.txt (available electronically) shown here in a somewhat abridged form:

the *** 4556

I * 4390

and **d 3676

to ** 3218

of *f 2899

you **u 2775

a a 2714

...[skipping 8500 lines]...

vexations ****t**n* 1

vestall **s*a** 1

versall **r**l* 1

verities *** i***s 1

verge v*r** 1

...[skipping 8500 lines]...

'Twentie *T****** 1

'Saue *S*** 1

'Prethee *P****** 1

'Pre *P** 1

'Boue *B***

With this somewhat limited vocabulary, you can see that even words that appear very rarely are replaced with a very high percentage of stars. The most frequent words are replaced completely.

Compressing and Decompressing The Transformed Data

StarEncode.cpp and StarDecode.cpp (both available electronically) transform text files to and from the star-encoded format. Both programs are invoked with the name of the dictionary file on the command line. The programs then transform standard input to standard output. Internally, they simply read the dictionary file into a map<string,string> object that holds all the encodings. They then read in text, passing special characters through unchanged, and translating tokens whenever possible.

Figure 2 is the encoding program in operation, as it encodes Romeo and Juliet using the dictionary created earlier. The result of this encoding should be a file that is now more compressible than the original. To test this theory, I compressed both the original and encoded file in WinZip, with the results shown in Figure 3. The results are encouraging. The compressed file decreases in size by almost 12 percent, with the compression ratio improving from 60 percent to 64 percent — tangible savings with only a small increase in computation.

Variant B

While building the encoder, I started wondering about the structure of my star codes. It seemed kind of arbitrary to insist that every star code have the same number of stars as the word it was encoding. For example, "the" (the most frequent word) was encoded as "***," while "I" (the second most frequent word) was encoded with a single star. Shouldn't the most frequent word get the fewest number of stars?

To test this theory, I created a second dictionary builder using a new heuristic for assigning codes. In this heuristic, a star code is simply a single-star character concatenated with a plain-text code number, with the most frequent token being assigned code 0.

Using this scheme simplified my encoder. The main loop that assigns codes to tokens no longer needed to keep track of used codes and was simple enough to do code assignment inside the loop, instead of in a separate function; see Listing Six. Comparing the output from Variant B to that described earlier shows how the two strategies create quite different dictionaries:

the *0 4556

and *1 3676

to *2 3218

of *3 2899

you *4 2775

...[skipping 8500 lines]...

vexations *8480 1

vestall *8481 1

versall *8482 1

verities *8483 1

verge *8484 1

...[skipping 8500 lines]...

'Twentie *16899 1

'Saue *16900 1

'Prethee *16901 1

'Pre *16902 1

'Boue *16903 1

One difference, in this case, is that I don't bother to encode strings of just a single character in length, as their encoded values would always be larger.

To test this variant, I compiled MakeDictB.cpp, created DictB.txt, and ran RomeoandJuliet.txt through StarEncode.cpp; see Figure 1(c). While the text is somewhat harder to read, it is more compact. The character set is much more limited here because every word is star encoded using only the star character and the integers.

Comparing A and B

Figure 4 is the listing of a Zip file that contains both encoded texts plus the original. In this case, the difference between a standard Zip compressor and Variant B runs nearly 15 percent, quite a nice savings. You can see that Variant B comes out ahead, with a delta that seems worth going for.

Recommendations and Caveats

Star encoding is a handy way to squeeze some additional fluff out of your files. Of course, there is a catch. For star encoding to be effective, you have to know in advance what the vocabulary of the source files is going to be. There are many times when this is the case; for example, computer-generated reports, e-mail or newsgroup archives, trace or log files, cached web pages, and the like. Still, I doubt star encoding could be used in a general-purpose compressor. A hypothetical archiver called "StarZip" might attempt to create a good dictionary and tokenizer for various file types, but the task seems difficult.

While the code presented could form the basis for production code, it needs some hardening. For example, the programs presented here can't cope with something as simple as a star character embedded in the plain text. Caveat emptor.

Star encoding was developed by a team at the University of Central Florida working under Professor Amar Mukherjee. The research that led to this work was supported by the National Science Foundation.

For More Information

Franceschini, Robert and Amar Mukherjee, "Data Compression Using Encrypted Text," Proceedings of the Third Forum on Research and Technology, Advances on Digital Libraries, 1996.

Franceschini, Robert et al., "Lossless, Reversible Transformations that Improve Text Compression Ratios" (http://vlsi.cs.ucf .edu/datacomp/papers/textcomp.doc).

The VLSI and Data Compression Lab at the University of Central Florida (http://vlsi.cs.ucf.edu/

DDJ

Listing One

string get_token( ifstream &file )
{
    string token;
    while ( file )
    {
        char c;
        file >> c;
        if ( isalpha( c ) || c == '\'' )
            token += c;
        else if ( token.size() )
            return token;
    }
    return token;
}

Back to Article

Listing Two

string token = get_token( infile );
if ( token.size() == 0 )
  break;
map< string, int >::const_iterator ii = frequency.find( token );
if ( ii == frequency.end() )
  frequency[ token ] = 1;
else
  frequency[ token ]++;

Back to Article

Listing Three

multimap<int, string> counts;
for ( map<string, int>::iterator ii = frequency.begin() ; 
      ii != frequency.end();
      ii++ )
  counts.insert( pair<int,string>( (*ii).second, (*ii).first ) )

Back to Article

Listing Four

set<string> used_codes;
for ( multimap<int, string>::reverse_iterator jj = counts.rbegin() ; 
      jj != counts.rend() ; 
      jj++ )
{
  string code = create_star_code( (*jj).second, used_codes );
  used_codes.insert( code );
  codes[ (*jj).second ] = code;
  cout << (*jj).second << " " << code << " " << (*jj).first << endl;
}

Back to Article

Listing Five

for ( int star_count = token.size() ; star_count > 0 ; star_count-- )
{
  vector<char> pattern( token.size(), '*' );
  for ( int i = star_count ; i < token.size() ; i++ )
    pattern[ i ] = '-';
  do 
  {
    string test( token );
    for ( int j = 0 ; j < token.size() ; j++ )
      if ( pattern[ j ] == '*' )
        test[ j ] = '*';
    set<string>::const_iterator kk = used_codes.find( test );
    if ( kk == used_codes.end() )
      return test;
  } while ( next_permutation( pattern.begin(), pattern.end() ) );
}
return token;

Back to Article

Listing Six

int star_code = 0;
for ( multimap< int, string >::reverse_iterator jj = counts.rbegin() ; 
      jj != counts.rend() ; 
      jj++ )
{
  string token = (*jj).second;
  if ( token.size() > 1 )
  {
    cout << token 
         << " " 
         << '*' 
         << star_code++ 
         << " " 
         << (*jj).first 
         << endl;
  }
}

Back to Article


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video