
Data Compression with the Burrows Wheeler Transform


In mathematics, difficult problems can often be simplified by performing a transformation on a data set. For example, digital-signal processing programs use the Fast Fourier Transform (FFT) to convert sets of sampled audio data points to equivalent sets of frequency data. Pumping up the bass or applying a notch filter is then just a matter of multiplying some frequency data points by a scaling factor. Perform an inverse FFT on the resulting points, and voila, you have a new audio waveform transformed according to your specifications.

Michael Burrows and David Wheeler recently released the details of a transformation function that opens the door to some revolutionary new data-compression techniques. The Burrows Wheeler Transform (BWT) transforms a block of data into a format that is extremely well-suited for compression. It does such a good job that even the demonstration programs I'll present here (available electronically) will outperform state-of-the-art programs.

The Burrows Wheeler Transform

Michael Burrows and David Wheeler released a research report in 1994 discussing work they had been doing at the Digital Systems Research Center (Palo Alto, CA). Their paper, "A Block-sorting Lossless Data Compression Algorithm," presented a data-compression algorithm based on a previously unpublished transformation discovered by Wheeler in 1983.

While the paper discusses a complete set of algorithms for compression and decompression, the heart of the paper consists of the disclosure of the BWT algorithm.

BWT Basics

The BWT algorithm rearranges a block of data using a sorting algorithm. The resulting output block contains exactly the same data elements as the input block, differing only in their ordering. The transformation is reversible, meaning the original ordering of the data elements can be restored with no loss of fidelity.

The BWT is performed on an entire block of data at once. Most of today's familiar lossless compression algorithms operate in streaming mode, reading a single byte or a few bytes at a time. But with this new transform, you want to operate on the largest chunks of data possible. Since the BWT operates on data in memory, you may encounter files too big to process in one fell swoop. In these cases, the file must be split up and processed a block at a time. The demo programs that accompany this article work comfortably with block sizes of 50 KB to 250 KB.

To illustrate, I'll start with a much smaller data set; see Figure 1. This string contains seven bytes of data. To perform the BWT, the first thing you do is treat a string S, of length N, as if it actually contains N different strings, with each character in the original string being the start of a specific string that is N bytes long. (In this case, the word "string" doesn't have the meaning it does in C/C++ semantics, where it represents a null-terminated set of characters. In the context of BWT, a string is just a collection of bytes.) We also treat the buffer as if the last character wraps back to the first.


Figure 1.

At this point, it's important to remember that you don't actually make N-1 rotated copies of the input string. In the demonstration program, I represent each of the strings by a pointer or an index into a memory buffer. The complete set of strings is arranged as in Figure 2.


Figure 2.

The next step in the BWT is to perform a lexicographical sort on the set of input strings; that is, you want to order the strings using a fixed comparison function, one that behaves like the C functions strcmp() or memcmp(). In this high-level view of the algorithm, the comparison has to wrap around when it reaches the end of the buffer, so a slightly modified comparison function is needed.
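To make the sorting step concrete, here is a minimal sketch in C++ (my own illustration, not code from the demo programs; the name sort_rotations is mine). It represents each rotated string by its starting offset into the buffer and gives the comparison function the wraparound behavior described above:

#include <algorithm>
#include <vector>

// Sketch of the sort, assuming the whole block fits in memory.
// Each rotation is represented only by its starting offset; the
// comparator wraps around the end of the buffer instead of building
// N rotated copies of the input.
std::vector<int> sort_rotations( const unsigned char *buf, int n )
{
    std::vector<int> rows( n );
    for ( int i = 0 ; i < n ; i++ )
        rows[ i ] = i;                          // rotation starting at offset i
    std::sort( rows.begin(), rows.end(),
        [ buf, n ]( int a, int b ) {
            for ( int k = 0 ; k < n ; k++ ) {   // compare with wraparound
                unsigned char ca = buf[ ( a + k ) % n ];
                unsigned char cb = buf[ ( b + k ) % n ];
                if ( ca != cb )
                    return ca < cb;
            }
            return false;                       // rotations are identical
        } );
    return rows;
}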

After sorting, the set of strings is arranged as in Figure 3. There are two important points to note in this picture. First, the strings have been sorted, but I've kept track of which string occupied which position in the original set. So string 0, the original unsorted string, has now moved down to row 3 in the array.


Figure 3.

Second, I've tagged the first and last columns in the matrix with the special designations "F" and "L," signifying the first and last columns of the array. Column F contains the characters in the original string in sorted order. So the original string "DRDOBBS" is represented in F as "BBDDORS."

The characters in column L don't appear to be in any particular order, but in fact they have an interesting property. Each character in L is the prefix character to the string that starts in the same row in column F.

The actual output of the BWT, oddly enough, consists of two things: a copy of column L and the primary index (an integer indicating which row in L contains the original first character of the buffer). So, performing the BWT on the original string generates the output string L, which contains "OBRSDDB," and a primary index of 5.

The integer 5 is found easily enough since the original first character of the buffer will always be found in column L in the row that contains S1. Since S1 is simply S0 rotated left by a single character position, the first character of the buffer is rotated into the last column of the matrix. Therefore, locating S1 is equivalent to locating the buffer's first character position in L.
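Given the sorted offsets, producing the transform output is a short loop. Again, this is my own sketch building on the sort_rotations() helper above, not the actual BWT.CPP code: the last column of row r holds the byte just before that rotation's starting offset, and the primary index is the row whose rotation starts at offset 1, that is, S1.

#include <vector>

// Sketch: emit L and the primary index from the sorted rotation
// offsets. rows[ r ] holds the starting offset of the rotation that
// occupies row r after the sort.
void bwt_output( const unsigned char *buf, int n,
                 const std::vector<int> &rows,
                 std::vector<unsigned char> &L, int &primary_index )
{
    L.resize( n );
    for ( int r = 0 ; r < n ; r++ ) {
        L[ r ] = buf[ ( rows[ r ] + n - 1 ) % n ];  // last column of row r
        if ( rows[ r ] == 1 )                       // this row holds S1
            primary_index = r;
    }
}

For "DRDOBBS", the sorted offsets work out to {4, 5, 2, 0, 3, 1, 6}, and this loop yields L = "OBRSDDB" with a primary index of 5, matching Figure 3.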

Two Obvious Questions

At this point, there are two obvious questions. First, how can this possibly be a reversible transformation? Generally, a sort() function doesn't come with an unsort() partner that can restore your original ordering. Second, what possible good does this strange transformation provide?

Unsorting column L requires the use of something called the "transformation vector," an array that defines the order in which the rotated strings are scattered throughout the rows of the matrix of Figure 3.

The transformation vector, T, is an array with one entry for each row of the matrix. For a given row i containing the rotated string S[j], T[i] is defined as the row where S[j+1] is found. In Figure 3, row 3 contains S0, the original input string, and row 5 contains S1, the string rotated one character to the left. Thus, T[3] contains the value 5. S2 is found in row 2, so T[5] contains a 2. For this matrix, the transformation vector works out to {1,6,4,5,0,2,3}.

Figure 4 shows how the transformation vector is used to walk through the various rows. For any row that contains S[i], the vector provides the value of the row where S[i+1] is found.


Figure 4.

The Rosetta Vector

The reason the transformation vector is so important is that it provides the key to restoring L to its original order. Given L and the primary index, you can restore the original S0. In this case, Example 1 does the job. You can get there from here.

Example 1: Implementing the transformation vector.

#include <iostream>
using namespace std;

int T[] = { 1, 6, 4, 5, 0, 2, 3 };
char L[] = "OBRSDDB";
int primary_index = 5;

void decode()
{
    int index = primary_index;
    for ( int i = 0 ; i < 7 ; i++ ) {
        cout << L[ index ];   // output the character in column L for this row
        index = T[ index ];   // follow the vector to the next row
    }
}
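Tracing the loop: starting at row 5, decode() prints "D", then follows T through rows 2, 4, 0, 1, 6, and 3, emitting "R", "D", "O", "B", "B", and "S": the original string, "DRDOBBS".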

Here is the core premise of the Burrows Wheeler transform: Given a copy of L, you can calculate the transformation vector for the original input matrix. And consequently, given the primary index, you can recreate S0, or the input string.

The key that makes this possible is that calculating the transformation vector requires only that you know the contents of the first and last columns of the matrix. And believe it or not, simply having a copy of L means that you do, in fact, have a copy of F as well.

Given just the copy of L, we don't know much about the state of the matrix. Figure 5 shows L, which I've moved into the first column for purposes of illustration. In this figure, F is going to be in the next column. And fortunately for us, F has an important characteristic: It contains all of the characters from the input string in sorted order. Since L also contains all the same characters, we can determine the contents of F by simply sorting L!


Figure 5.

Now you can start calculating T. The character "O" in row 0 moves to row 4 in column F, which means T[4]=0. But what about row 1? The "B" could match up with either the "B" in row 0 or row 1. Which do you select?

The choice is not ambiguous, although the reasoning behind it may not be intuitive. By definition, column F is sorted. The strings that begin with a "B" in column L appear in sorted order as well: each "B" in L is the prefix of the string starting in column F of the same row, so these strings all share their first character and are ordered by everything that follows it, and the rows supplying that remainder are already sorted.

Since the strings in F must appear in sorted order, all the strings that start with a common character in L appear in the same order in F, although not necessarily in the same rows. Because of this, you know that the "B" in row 1 of L is going to move up to row 0 in F. The "B" in row 6 of L moves to row 1 of F.
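Here is a minimal sketch of that pairing in code (my own illustration, not the UNBWT.CPP source; the name build_T is mine). Rather than literally sorting L to form F, it counts symbols to find where each character's run of rows begins in F, then pairs the k-th occurrence of each character in L with the k-th row of that character's run:

#include <vector>

// Sketch: rebuild the transformation vector from L alone, assuming
// 8-bit symbols. first_row[ c ] is the row in F where character c's
// run begins; as each occurrence of c is seen in L, it is paired with
// the next unused row of that run.
std::vector<int> build_T( const std::vector<unsigned char> &L )
{
    int n = (int) L.size();
    int count[ 256 ] = { 0 };
    int first_row[ 256 ] = { 0 };
    for ( int i = 0 ; i < n ; i++ )
        count[ L[ i ] ]++;
    for ( int c = 1 ; c < 256 ; c++ )
        first_row[ c ] = first_row[ c - 1 ] + count[ c - 1 ];
    std::vector<int> T( n );
    for ( int i = 0 ; i < n ; i++ )
        T[ first_row[ L[ i ] ]++ ] = i;  // k-th c in L pairs with k-th c in F
    return T;
}

Feeding this function L = "OBRSDDB" reproduces the vector {1, 6, 4, 5, 0, 2, 3} used in Example 1.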

Once that difficulty is cleared up, it's a simple matter to recover the transformation vector from the two columns; see Figure 6. And once that's done, recovering the original input string is short work as well: simply applying the code shown in Example 1 does the job.


Figure 6.

The Punchline

Figure 7 shows a snippet of debug output taken from the test program BWT.CPP. Each line of output shows a prefix character followed by a sorted string. The prefix character is actually what is found in column L after the strings have been sorted.

     t: hat acts like this:<13><10><1
     t: hat buffer to the constructor
     t: hat corrupted the heap, or wo
     W: hat goes up must come down<13
     t: hat happens, it isn't  likely 
     w: hat if you want to dynamicall
     t: hat indicates an error.<13><1
     t: hat it removes arguments from
     t: hat looks like this:<13><10><
     t: hat looks something like this
     t: hat looks something like this
     t: hat once I detect the mangled
Figure 7.

This section of output consists of a group of sorted strings that all start with the characters "hat." Not surprisingly, the prefix character for almost all of these strings is a lowercase "t." There are also a couple of appearances of the letter w.

Whenever you see repetition like this, you have an opportunity for data compression. Sorting all of the strings gives rise to a set of prefix characters that has lots of little clumps where characters repeat.

Using the Opportunity

You could take the output of the BWT and apply a conventional compressor to it, but Burrows and Wheeler suggest an improved approach: a Move-to-Front scheme, followed by an entropy encoder.

A Move-to-Front (MTF) encoder simply keeps all 256 possible codes in a list. Each time a character is to be output, you send its position in the list, then move it to the front. In the aforementioned example, column L might start with the following sequence: tttWtwttt. If you encode that using an MTF scheme, you would output the following integers: {116,0,0,88,1,119,1,0,0}.
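Here is a minimal sketch of such an encoder (an illustration, not the MTF.CPP source; the name mtf_encode is mine). Running it on the sequence "tttWtwttt" produces exactly the integers listed above:

#include <vector>

// Sketch: Move-to-Front over 8-bit symbols. The list starts as
// 0..255 in order; each input byte is replaced by its current list
// position and then moved to the front.
std::vector<int> mtf_encode( const std::vector<unsigned char> &in )
{
    unsigned char order[ 256 ];
    for ( int i = 0 ; i < 256 ; i++ )
        order[ i ] = (unsigned char) i;
    std::vector<int> out;
    out.reserve( in.size() );
    for ( size_t k = 0 ; k < in.size() ; k++ ) {
        unsigned char c = in[ k ];
        int pos = 0;
        while ( order[ pos ] != c )         // locate the symbol
            pos++;
        out.push_back( pos );
        for ( int j = pos ; j > 0 ; j-- )   // slide earlier entries down
            order[ j ] = order[ j - 1 ];
        order[ 0 ] = c;                     // move the symbol to the front
    }
    return out;
}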

If the output of the BWT operation contains a high proportion of repeated characters, you can expect that applying the MTF encoder will give a file filled with lots of zeros, heavily weighted toward the lowest integers.

At that point, the file can finally be compressed using an entropy encoder, typically a Huffman or arithmetic encoder. The example programs accompanying this article use an adaptive order-0 arithmetic encoder as the final step of the process.

And the Winner Is...

Using the BWT as a front end for data compression has a result similar to that of simple statistical-modeling programs. An order-n statistical model simply uses the previous n characters to predict the value of the next character in a string. Compression programs based on statistical modeling can give good results, at the expense of high memory requirements.

The BWT, in effect, uses the entire input string as its context. But instead of using a preceding run of characters to model the output, it uses the string following a character as a model. Because of the sorting method used, characters hundreds of bytes downstream can affect the ordering of the BWT output.

This characteristic of the BWT has another side effect: In general, the bigger the block size, the better the compression. As long as the data set is homogeneous, bigger blocks will generate longer runs of repeats, leading to improved compression. Of course, the sorting operations needed at the front end of the BWT will usually have O(N*log(N)) performance, meaning bigger blocks might slow things down considerably. (Although Burrows and Wheeler report that a suffix tree sort can be done in linear time and space.)

You might also be impressed by the fact that the BWT is apparently not covered by any software patents. When asked, Burrows and Wheeler both indicated that DEC had not filed for any patents on the technique. Since the algorithm was published in 1994, this may mean the technique is in the public domain. (Consult an attorney, however.)

Even if the BWT is in the public domain, you need to be careful when selecting an entropy encoder for the final stage of compression. Arithmetic coding offers excellent compression, but is covered by quite a few patents. Huffman coding is slightly less efficient, but seems less likely to provoke intellectual property fights.

A Reference Implementation

One of the nice things about BWT compression is that it's easy to isolate each step of the process. I took advantage of that when writing the demo programs to accompany this article. The compression and decompression processes will consist of running several programs in sequence, piping the output of one program to the input of another, and repeating until the process is complete.

Compressing a file using the demo programs is done using the following command line: RLE input-file|BWT|MTF|RLE|ARI > output-file. The program files include the following:

  • RLE.CPP implements a simple run-length encoder. If the input file has many long runs of identical characters, the sorting procedure in the BWT can degrade dramatically; the RLE front end prevents that from happening. (A sketch of one possible run-length scheme appears after this list.)
  • BWT.CPP is where the standard Burrows Wheeler transform is done. This program outputs repeated blocks consisting of a block-size integer, a copy of L, the primary index, and a special last-character index. This is repeated until BWT.EXE runs out of input data.
  • MTF.CPP is the Move-to-Front encoder.
  • RLE.CPP appears a second time because the MTF output is typically top-heavy with runs of zeros; applying another RLE pass can improve overall compression. Further processing of the MTF output is fertile ground for additional improvements.
  • ARI.CPP is an order-0 adaptive arithmetic encoder, directly derived from the code published by Witten, Neal, and Cleary in their 1987 "CACM" article.
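The article relies on RLE both as a front end and as a post-MTF pass but doesn't spell out the byte format, so here is a sketch of one common scheme (an assumption on my part; the actual format in RLE.CPP may differ): after three identical bytes appear literally, a count byte tells the decoder how many further copies follow.

#include <cstdio>

// Sketch of a common run-length format: three literal repeats,
// then one count byte (0-255) giving the number of extra copies.
void rle_encode( FILE *in, FILE *out )
{
    int prev = -1, run = 0, c;
    while ( ( c = getc( in ) ) != EOF ) {
        putc( c, out );
        if ( c == prev && ++run == 3 ) {    // third repeat seen
            int extra = 0;
            while ( extra < 255 && ( c = getc( in ) ) == prev )
                extra++;                    // absorb the rest of the run
            putc( extra, out );             // emit the count byte
            if ( extra < 255 && c != EOF )
                ungetc( c, in );            // first byte after the run
            prev = -1;
            run = 0;
        } else if ( c != prev ) {
            prev = c;
            run = 1;
        }
    }
}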

The output of the compressor can be decompressed by applying the same algorithms in reverse. The command line that accomplishes this is UNARI input-file|UNRLE|UNMTF|UNBWT|UNRLE > output-file. Each of the programs here has a one-to-one correspondence with its namesake from the compression section.

The ten source files that make up this program suite are fairly short, and their inner workings are simple. The most complicated programs are the arithmetic encoder and decoder, which are well-documented elsewhere.

Results

I used the standard Calgary Corpus to benchmark my version of BWT compression. The corpus is a set of 18 files with varying characteristics that can be found via anonymous ftp at ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus. For purposes of comparison, I show the raw size of each file, the compressed size using the BWT compressor, and the compressed size using PKZIP with the default compression level selected.

As Table 1 shows, this demonstration program manages to compete well with PKWare's PKZip. I consider this encouraging in light of the fact that there should be plenty of headroom left for tweaking additional percentage points from my code.

Table 1: Comparison of results.

File Name    Raw Size     PKZIP Size   PKZIP Bits/Byte   BWT Size    BWT Bits/Byte
bib          111,261      35,821       2.58              29,567      2.13
book1        768,771      315,999      3.29              275,831     2.87
book2        610,856      209,061      2.74              186,592     2.44
geo          102,400      68,917       5.38              62,120      4.85
news         377,109      146,010      3.10              134,174     2.85
obj1         21,504       10,311       3.84              10,857      4.04
obj2         246,814      81,846       2.65              81,948      2.66
paper1       53,161       18,624       2.80              17,724      2.67
paper2       82,199       29,795       2.90              26,956      2.62
paper3       46,526       18,106       3.11              16,995      2.92
paper4       13,286       5,509        3.32              5,529       3.33
paper5       11,954       4,962        3.32              5,136       3.44
paper6       38,105       13,331       2.80              13,159      2.76
pic          513,216      54,188       0.84              50,829      0.79
progc        39,611       13,340       2.69              13,312      2.69
progl        71,646       16,227       1.81              16,688      1.86
progp        49,379       11,248       1.82              11,404      1.85
trans        93,695       19,691       1.68              19,301      1.65
Total:       3,251,493    1,072,986    2.64              978,122     2.41

Acknowledgment

My thanks go out to the many people who reviewed and commented on this article, including Nico de Vries, Leo Broukhis, Jean-Loup Gailly, and David Wheeler.

Bibliography and Resources

You can find resources, additional source code, and commentary regarding this compression on my Web page at http://web2.airmail.net/markn, under "Magazine Articles." An excellent source of information on data compression in general can be found at http://www.cs.waikato.ac.nz/~singlis/compression-pointers.html.

Burrows, M. and D.J. Wheeler. "A Block-sorting Lossless Data Compression Algorithm." Digital Systems Research Center Research Report 124, 1994. http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html.

Witten, I.H., R. Neal, and J.G. Cleary. "Arithmetic Coding for Data Compression." Communications of the Association for Computing Machinery, June 1987.

Nelson, Mark and Jean-Loup Gailly. The Data Compression Book, Second Edition. New York, NY: M&T Books, 1995.


Mark is the author of C++ Programmer's Guide to the Standard Template Library (IDG Books, 1995) and The Data Compression Book, Second Edition (M&T Books, 1995).

