Combinatorial Data Compression
Newcomers to the world of data compression often stumble on this old idea in hopes of creating a novel and powerful algorithm. In a nutshell, the idea is to create an enumerative coding system that uses combinatorial numbering to identify a message, in hopes of providing a more compact representation . Unfortunately, these schemes always fail, for reasons that I'll lay out in this article.
CombinationsTo design a combinatorial algorithm that will compress files, you can think of the file as a series of integers. Since most of the files that you use are streams of bytes, consider each file to be a sequence of integers with values from 0 to 255.
If you go back to your first classes on probability and statistics, you might remember the definition of a combination. A combination is simply a way of selecting a number of things from a larger set. When you are trying to compress a file of bytes, the natural size of this set is 256.
Probability theory tells us that we can count the number of combinations of a given size using a pretty simple formula. If a set has n elements and we are choosing k at a time, the number of possible combinations is given by the formula n!/k!*(n-k)!. This number is also known as the binomial coefficient.
Just for a simple example, the number of different ways you can select three bytes out of a set of 256 is 2,763,520. In general, with a large set, most combinations are going to generate very large numbers. The exceptions will be values of that are either very small or very close to the size of the set.
Combinations are well ordered, so any instance of the three bytes has a specific number between 0 and 2,763,519. We can call this the combinatorial rank. This means I can identify any three byte sequence by a combination number.
Assuming all combinations are equal, we can use an optimal arithmetic coder to encode this number in lg(2,763,510) bits, roughly 21.4. That's interesting, because the three bytes actually take up 24 bits, so maybe there is some savings to be found here.
The First Problem
Knowing the combinatorial rank is good, but it won't let you reconstruct a compressed file on its own. The combinatorial rank gives you the set of bytes in the file, but doesn't tell you the order of those bytes. If there are k bytes, they can be ordered in k! different permutations. So to fully describe the file, you need to encode the combination rank and the permutation number.
Encoding the permutation for your 3 byte file is going to take 2.58 bits, calcluated as lg(3!). This makes the total needed to encode your three byte file 23.98 bits. Admittedly not a lot of savings, but it's also non-zero.
Let's look at the number of bits needed to encode a 20 byte message. The number of combinations of this length are roughly 2.8*10^29, which will take 97.8 bits to encode. 20! is roughly 2.43*10^18, which will take 61.1 bits to encode. The total comes out to 158.9 bits. Since we're encoding 160 bits of information, there is clearly a greater savings.
As the message size increases, the savings start to grow. At a message length of 50 we save 7.5 bytes, at 75 bytes we save 16 bytes per message. The trend looks good. By the time you get to a message length of 100 bytes, you're saving 32 bits per message - a compression of 4% for doing nothing but recoding!
The Second Problem
The second problem you encounter in the combinatorial system is that, by definition, a combination is composed of unique elements. So if you are compressing a three byte file, you can't have any duplicate bytes. Is this a problem?
Your inclination is to hope not. You know that every compression scheme only works on a subset of files, so perhaps the combinatorial scheme can be developed to work on segments of files with no duplicates.
How likely are you to find a duplicate in a file of three bytes? You can start by enumerating the total number of files of that length: 256^3. And you know how many files there are with no duplicates: the combinatorial number times the number of permutations. So it's a simple matter to calculate the probability that a message of length k has no duplicated bytes. The value will be n!/(n-k)!*256^k.
For a value of 3, we see that the probability of no duplicate bytes is .988 - this means you can compress almost every file by a fraction of a bit.
You'd like to think that you can look at pretty long stretches of data and expect a low probability of duplicates, but unfortunately you run into the Birthday Paradox. In the birthday paradox, you're asked a question something like this: in a room of 23 people, what are the chances that two people share a common birthday? For most people, the answer, 50% or so, is non-intuitive.
Likewise, it means that a file with 100 bytes and no duplicates is such a rarity that it might as well never appear - the chances are less than one in a billion.
Facing the Music
You can see the problem here. We can compress very short sequences using a combinatorial system, but the savings are very small. Even so, we can compress most files. We can compress longer files for greater savings, but very few sequences will prove to be eligible.
It's actually worse than that. Let's work out the number on a hypothetical compressor. This compressor will use a combinatorial scheme to compress all files of 10 bytes. The compressor will look at the file, and if it has no duplicates, it will set a flag symbol in the output stream to be true, followed by the combinatorial number, followed by the permutation.
If the 10 byte file has duplicates, the compressor will generate a flag symbol of false, followed by the uncompressed data.
If this scheme gives us some savings, we can scale it up to operate on files of any size - we'll just compress them in 10-byte chunks.
So let's analyze the result. First, the number of files that will make it past the first test is pretty impressive: 83.695%. Each of these files will be compressed down to 79.743 bits. The remaining 16.305 percent will take exactly 80 bits in the output stream. So the overall size of our output file thus far is going to add up to 79.78519 bytes. Our algorithm is still in the black!
Unfortunately, we also need to account for the cost of the flag message. Using optimal coding, when the flag is true we are going to require .25679 bits. When it is false, optimal coding of the much rarer message will require 2.6 bits. Add in the cost of the flag, and the average output size goes up to a smidgen over 80 bits.
In other words, you lose.
The problem is a familiar one in data compression. Every time you come up with a way to encode a subset of files that saves some space, you find that all your savings are lost when you try to encode the files that aren't part of the subset. Even using a single bit to flag special files as being incompressible is enough to wipe out your savings. It is the definition of Whac-A-Mole.
With combinatorial coding you will find that same rule to hold true as for all forms of data compression: It isn't going to be a universal compressor that can reduce every file in size. The only reason it will be useful is if you have an input set of files that all have a common characteristic: a preponderance of streams where duplicates are rare. And odds are, this set of files will probably be compressible using some more reasonable algorithm.