Dr. Dobb's | The Book Cipher Algorithm

The Book Cipher Algorithm

The Book cipher algorithm uses letters of subsequent words in some text or book as a key to encode messages.

September 24, 2008
URL:http://www.drdobbs.com/architecture-and-design/the-book-cipher-algorithm/210603676

Unless you're a professional cryptanalyst, writing cryptography code means meddling with "powers" you cannot fully comprehend, and seemingly insignificant slips can be fatal. During World War II, for instance, Polish and British mathematicians broke Germany's Enigma code only because the same message-key was enciphered twice at the beginning of every message. The Germans did this to avoid mistakes caused by radio interference, but at the same time, it ruined their carefully planned cryptosystem. And how many slips are there in the code that multiplies big numbers, looks for 1000-digit primes, and encrypts the fixed header of your document?

With the Book cipher algorithm, you're safe from these kinds of errors because it is simple enough that you can code it in a few lines of C that are completely understandable, but still extremely secure. The so-called Beale ciphers (unmuseum.org/beal.htm), which supposedly point to a treasure buried somewhere in Bedford County, Virginia, were published in 1885, and two of the three still have not been decoded. This secret (or maybe hoax) has occupied some of the best cryptanalytic minds. Likewise, when Simon Singh gave 10 problems in the appendix of The Code Book, problem #5 (Book cipher) was the most difficult one for the winners of the £10,000 prize (www.simonsingh.com/Cipher_Challenge.html). Still, the Book cipher has probably never been used in commercial software.

Book Cipher Algorithms

Basically, the Book cipher algorithm uses letters of subsequent words in some text or book as a key to encode a message. Figure 1 is the simplest form, usually called the "running key cipher." In this case, text (usually from a book) is used to provide a very long key stream. The book used is agreed upon in advance, while the passage used is chosen randomly for each message and secretly indicated somewhere in a previous message. In this example, we agreed to use J.K. Rowling's Harry Potter and the Order of the Phoenix and to start on page 335, line 28, with the sentence, "Hermione bit her lip and did not answer." We write this text under the plaintext and use it as the running key. The particular message to send is "DRDOBBS." We XOR the corresponding characters of the message and the running key to get the ciphertext 12 23 22 2 11 13 29.

Plaintext         D   R   D   O   B   B   S
Plaintext (hex)   44  52  44  4F  42  42  53
Running key       H   E   R   M   I   O   N
Running key (hex) 48  45  52  4D  49  4F  4E
Ciphertext (hex)  0C  17  16  02  0B  0D  1D
Ciphertext (dec)  12  23  22  2   11  13  29

Figure 1: Running key cipher.
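
The XOR step in Figure 1 can be sketched in a few lines of C. This is only an illustration; the function name running_key_xor is ours and does not appear in the listings.

```c
#include <stddef.h>

/* XOR each message byte with the key byte at the same offset.
   Applying the function twice with the same key restores the input. */
void running_key_xor(const char *msg, const char *key, size_t n,
                     unsigned char *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)((unsigned char)msg[i] ^ (unsigned char)key[i]);
}
```

Called with "DRDOBBS", "HERMION", and a length of 7, it fills the output with the bytes 0C 17 16 02 0B 0D 1D shown in Figure 1.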

The running key cipher is much better than the famous Vigenère cipher because we do not repeat the key—a book is hopefully long enough to encode everything we have to say. However, security is still poor because the entropy per character of both the plaintext and the running key is low, and the combining operation is relatively easily inverted. By guessing probable plaintexts along the ciphertext, attackers will eventually recognize the book and break the code.
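
To see why guessing plaintext helps an attacker, consider the following sketch. It XORs a guessed word against the ciphertext; when the guess is right, readable key text falls out. The function try_crib and its "all letters" test are our own simplification, not a complete attack.

```c
#include <ctype.h>
#include <stddef.h>

/* XOR the ciphertext with a guessed plaintext ("crib"). If the guess is
   right, the result is the running key at that offset. Returns 1 when every
   recovered byte is a letter, a crude hint that the guess may be correct.
   key_out must have room for n + 1 chars. */
int try_crib(const unsigned char *ct, const char *crib, size_t n, char *key_out)
{
    int all_letters = 1;
    for (size_t i = 0; i < n; i++) {
        unsigned char k = (unsigned char)(ct[i] ^ (unsigned char)crib[i]);
        key_out[i] = (char)k;
        if (!isalpha(k)) all_letters = 0;
    }
    key_out[n] = '\0';
    return all_letters;
}
```

With the ciphertext from Figure 1 and the guess "DRDOBBS", the recovered key is "HERMION", which an attacker can then try to extend into a recognizable sentence.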

A better idea is to replace words in the plaintext with the location of words from a book. For example, to encode the word "computer," you look for the first appearance of "computer" in the previously chosen book and enter its position as the ciphertext. The second appearance of the word "computer" is replaced by the position of the second appearance of that word in the book, and so on. The real problem with this approach is finding the word: If you agree to use David Copperfield as a code book, and then try to encrypt an article about hash functions, you are unlikely to find all the necessary words.
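
A minimal sketch of the word lookup follows, assuming the book is held in memory as a single string and that words are runs of letters. The names same_word and find_word are hypothetical helpers, not part of the listings.

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison of a word in the text (length len) with target. */
static int same_word(const char *w, size_t len, const char *target)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (target[i] == '\0' ||
            tolower((unsigned char)w[i]) != tolower((unsigned char)target[i]))
            return 0;
    }
    return target[len] == '\0';
}

/* Return the 1-based word number of the nth occurrence of target in book,
   or 0 if the book does not contain that many occurrences. */
long find_word(const char *book, const char *target, int nth)
{
    long wordno = 0;
    const char *p = book;
    while (*p) {
        while (*p && !isalpha((unsigned char)*p)) p++;   /* skip separators */
        const char *start = p;
        while (isalpha((unsigned char)*p)) p++;          /* scan one word */
        if (p > start) {
            wordno++;
            if (same_word(start, (size_t)(p - start), target) && --nth == 0)
                return wordno;
        }
    }
    return 0;
}
```

A return value of 0 is exactly the failure described above: the chosen book simply does not contain the word often enough.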

An alternative approach that gets around this problem is to replace individual letters rather than words, in which case the Book cipher is properly a cipher. Figure 2 illustrates the concept. We are encoding a message "DRDOBBS" using the same passage from Harry Potter and the Order of the Phoenix. To code the letter "D," we look for the first word in the passage starting with "D" (it's the 6th word, "did"). Then we look for the first word starting with "R" (the 11th word, "rang"), then for the next word starting with "D" (the 16th word, "down"), and so on. The final ciphertext is 6, 11, 16, 17, 2, 10, 15.

Figure 2: Encoding the message "DRDOBBS" using the Book cipher.

Decoding is even simpler. We take the ciphertext number-by-number and look for the corresponding words in the book, generating the original plaintext.
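
The letter-based scheme can be sketched on an in-memory passage as follows. This is an illustration under our own assumptions (words are runs of letters, and one search position is kept per letter); the function names are hypothetical and the code is independent of the listings later in the article.

```c
#include <ctype.h>
#include <stddef.h>

/* Collect the upper-cased initial letter of each word; returns word count. */
size_t initials(const char *book, char *out, size_t max)
{
    size_t n = 0;
    int in_word = 0;
    for (; *book && n < max; book++) {
        if (isalpha((unsigned char)*book)) {
            if (!in_word) out[n++] = (char)toupper((unsigned char)*book);
            in_word = 1;
        } else {
            in_word = 0;
        }
    }
    return n;
}

/* Encode letters of msg as 1-based word numbers. next[] holds, per letter,
   the index where the next search starts, so no word is used twice.
   Returns the number of codes written, or -1 if a letter runs out. */
int book_encode(const char *msg, const char *init, size_t nwords,
                size_t next[26], long *codes)
{
    int count = 0;
    for (; *msg; msg++) {
        int c = toupper((unsigned char)*msg);
        if (c < 'A' || c > 'Z') continue;       /* only letters are encoded */
        size_t i = next[c - 'A'];
        while (i < nwords && init[i] != c) i++;
        if (i == nwords) return -1;
        next[c - 'A'] = i + 1;
        codes[count++] = (long)i + 1;
    }
    return count;
}

/* Decode: each number selects a word; its initial is the plaintext letter.
   out must have room for count + 1 chars. Returns -1 on an invalid code. */
int book_decode(const long *codes, int count, const char *init,
                size_t nwords, char *out)
{
    for (int i = 0; i < count; i++) {
        if (codes[i] < 1 || (size_t)codes[i] > nwords) return -1;
        out[i] = init[codes[i] - 1];
    }
    out[count] = '\0';
    return 0;
}
```

Using only the sentence "Hermione bit her lip and did not answer." as the book, the message "BAD" encodes as 2, 5, 6. A second "D" cannot be encoded afterward because the sentence contains only one word starting with that letter.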

Practical Issues

The Book cipher is straightforward to implement and secure, but using it in real life presents some difficulties. First of all, you can encrypt only letters, preferably after converting them to uppercase. Most cryptography textbooks deal with uppercase letters only, because introducing symbols that appear very often (such as blanks) or very rarely (such as $ or #) gives attackers an unnecessary advantage. On the other hand, sometimes you have to encode a message containing numbers or special symbols, and reading decoded text without blanks is a real pain.

This issue can be resolved by pre-encoding the plaintext using an algorithm similar to UUENCODE, which produces uppercase letters only. A possible candidate is the Five-Letter Codegroup Filter (www.fourmilab.ch/codegroup). While encoding, you could compress text and even implement some kind of double encryption. Compression is good because the main problem with the Book cipher is the need for a long book. To encode a 5000-character plaintext, you need at least a 5000-word book. In fact, it should be much longer because you cannot expect the frequency of all letters in the text to match the frequency of initial letters of words in the book.
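
As a rough illustration of pre-encoding, the sketch below maps every byte to two letters in the range A to P. This is a scheme of our own chosen for brevity; it is not the format used by UUENCODE or by the Codegroup filter, and the function names are hypothetical.

```c
#include <stddef.h>

/* Write two letters from 'A'..'P' for every input byte (high nibble first).
   out must hold 2 * n + 1 chars. */
void letters_encode(const unsigned char *in, size_t n, char *out)
{
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = (char)('A' + (in[i] >> 4));
        out[2 * i + 1] = (char)('A' + (in[i] & 0x0F));
    }
    out[2 * n] = '\0';
}

/* Inverse of letters_encode. Returns the number of bytes written,
   or -1 if the input has odd length or contains a char outside 'A'..'P'. */
long letters_decode(const char *in, unsigned char *out)
{
    size_t n = 0;
    for (; in[0] != '\0'; in += 2, n++) {
        if (in[1] == '\0') return -1;
        if (in[0] < 'A' || in[0] > 'P' || in[1] < 'A' || in[1] > 'P') return -1;
        out[n] = (unsigned char)(((in[0] - 'A') << 4) | (in[1] - 'A'));
    }
    return (long)n;
}
```

The price of this simplicity is that the message doubles in length, so it consumes twice as many words from the book. That is one more reason to compress the text first.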

Due to the disproportion in frequencies of all letters and initial letters, the coding process may become impossible, and you could run out of some letters even before you start coding. For example, there are practically no English words beginning with the letter "X"—in the Harry Potter book, there is only one such "word" in the phrase "a fiery X appeared on the door." In other books, the only "words" beginning with "X" are chapter numbers written as Roman numerals. In the second Beale cipher (the only one that was successfully decoded), every "X" was coded as 1005, probably pointing to the word "sexes" in the Declaration of Independence, which was used as a key ("sexes" obviously does not start with "x," but does sound like it). Some books also contain zero words beginning with the letter "Z", unless they contain the word "zero" itself.
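
Before encoding, it is possible to check whether the letters extracted from a book can cover a given message at all. The following sketch compares supply and demand per letter; count_letters and first_shortage are hypothetical helpers, and both inputs are assumed to be ordinary C strings.

```c
#include <ctype.h>

/* Count occurrences of each letter A..Z in s (case-insensitive). */
void count_letters(const char *s, long counts[26])
{
    for (int i = 0; i < 26; i++) counts[i] = 0;
    for (; *s; s++) {
        int c = toupper((unsigned char)*s);
        if (c >= 'A' && c <= 'Z') counts[c - 'A']++;
    }
}

/* Return the alphabetically first letter that the message needs more often
   than the key material supplies, or 0 if the supply covers the message. */
char first_shortage(const char *key_letters, const char *message)
{
    long have[26], need[26];
    count_letters(key_letters, have);
    count_letters(message, need);
    for (int i = 0; i < 26; i++)
        if (need[i] > have[i]) return (char)('A' + i);
    return 0;
}
```

A result of 'X' or 'Z' would be typical for the situation described above, and signals that another book has to be added before the message can be encoded.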

Therefore, it seems like a good idea to investigate the frequencies of first, second, third, or even last letters in words, to find out which of them matches the frequency of all letters. We analyzed 14 books, from Pride and Prejudice to The Da Vinci Code, and found out that the distribution of frequencies for the third letter in the word has the best correlation with frequencies of all letters. A possible solution is to let users choose which letter in the word to use, by setting an appropriate parameter.
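
A comparison of this kind can be sketched as follows. Note that the sketch does not reproduce our correlation analysis; it uses a simpler measure, the sum of absolute differences between two relative frequency distributions, and the function names are hypothetical.

```c
#include <ctype.h>

/* Fill all[] with counts of every letter and kth[] with counts of the
   k-th letter (1-based) of every word that is long enough. */
void letter_counts(const char *text, int k, long all[26], long kth[26])
{
    int pos = 0;                         /* letters seen in the current word */
    for (int i = 0; i < 26; i++) all[i] = kth[i] = 0;
    for (; *text; text++) {
        int c = toupper((unsigned char)*text);
        if (c >= 'A' && c <= 'Z') {
            all[c - 'A']++;
            if (++pos == k) kth[c - 'A']++;
        } else {
            pos = 0;
        }
    }
}

/* Sum of absolute differences between the two relative frequency
   distributions: 0 means identical, 2 means no overlap at all. */
double distribution_gap(const long a[26], const long b[26])
{
    long ta = 0, tb = 0;
    double gap = 0.0;
    for (int i = 0; i < 26; i++) { ta += a[i]; tb += b[i]; }
    if (ta == 0 || tb == 0) return 2.0;
    for (int i = 0; i < 26; i++) {
        double d = (double)a[i] / ta - (double)b[i] / tb;
        gap += d < 0 ? -d : d;
    }
    return gap;
}
```

Running letter_counts over a book for k = 1, 2, 3 and comparing the gaps gives a quick indication of which letter position comes closest to the overall letter distribution of that book.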

To get the system rolling, you also have to agree with the correspondent about the books you will use to encode the messages. Using "real" books isn't very practical (OCRing them on both sides introduces different errors). So you will probably use some texts available on the Internet, and you might even consider promoting some of your own essays into "books" for additional security. (If we were running the NSA, we would make sure we had a machine-readable copy of the Library of Congress in a form suitable for fast searching.)

Implementation

To illustrate the algorithm, we wrote three programs in C—bkadd (Listing One), bkcode (Listing Two), and bkdecode (Listing Three). They use Standard C, except for processing command-line arguments with the Microsoft-specific Visual C _splitpath and _makepath functions. File processing is via the getc and putc functions, and can be improved by using buffers to increase efficiency.
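
On compilers without _splitpath and _makepath, the default-extension handling could be approximated with standard functions. The helper below is a hypothetical sketch, not part of the listings; it only appends an extension when the file name has none and does not replicate every detail of the Microsoft functions.

```c
#include <stdio.h>
#include <string.h>

/* Copy path to out, appending "." + ext when the file name part has no
   extension. Returns 0 on success, -1 if out is too small. */
int with_default_ext(const char *path, const char *ext, char *out, size_t size)
{
    const char *slash = strrchr(path, '/');
    const char *bslash = strrchr(path, '\\');
    const char *name = path;
    int n;
    /* The file name starts after the last directory separator, if any */
    if (slash && slash + 1 > name) name = slash + 1;
    if (bslash && bslash + 1 > name) name = bslash + 1;
    if (strchr(name, '.'))
        n = snprintf(out, size, "%s", path);
    else
        n = snprintf(out, size, "%s.%s", path, ext);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For example, with_default_ext("allice", "cod", buf, sizeof buf) yields "allice.cod", while "mybook.txt" is copied unchanged.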

Listing One


// bkadd
#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD_LEN 100

int main(int argc, char *argv[])
{
   FILE *fp_cod, *fp_book;
   int  ch, charno;
   char word[MAX_WORD_LEN];
   int  fromstart = 1;
   int  wlen = 0;
   // Argument processing
   char code[_MAX_PATH], book[_MAX_PATH];
   char drive[_MAX_DRIVE];
   char dir[_MAX_DIR];
   char fname[_MAX_FNAME];
   char ext[_MAX_EXT];

   if (argc != 3 && argc != 4) goto error1;
   _splitpath(argv[1], drive, dir, fname, ext );
   if (strlen(ext) == 0) strcpy(ext, "cod");
   _makepath(code, drive, dir, fname, ext);

   _splitpath(argv[2], drive, dir, fname, ext );
   if (strlen(ext) == 0) strcpy(ext, "txt");
   _makepath(book, drive, dir, fname, ext);

   if (argc == 3 ||(sscanf(argv[3], "%d", &charno) != 1)
                 || (charno == 0)) charno=1;
   if (charno < 0) { fromstart = 0; charno = -charno; }

   // File opening

   if ((fp_cod  = fopen(code, "a")) == NULL) goto error2;
   if ((fp_book = fopen(book, "r")) == NULL) goto error3;

   // Main loop
   do {
      ch = getc(fp_book);
      if (isalpha(ch)) {
         // Ignore letters beyond the buffer size to avoid an overflow
         if (wlen < MAX_WORD_LEN) word[wlen++] = (char)ch;
      }
      else {
         if (charno <= wlen)
            putc(toupper(word[fromstart ?
            (charno - 1) : (wlen - charno)]), fp_cod);
         wlen = 0;
      }
   } while (!feof(fp_book));
   // Termination
   fclose(fp_book); fclose(fp_cod); return 0;
// Error handling
error1: printf("USAGE: bkadd codfile bookfile [charno]\n"); return 1;
error2: printf("Can not open code file %s\n", argv[1]); return 1;
error3: fclose(fp_cod);
        printf("Can not open book file %s\n", argv[2]); return 1;
}

Listing Two


// bkcode

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define N_CAPITAL_LETTERS 26
#define LTRTOIDX(x) ((x) - 'A')

int main(int argc, char *argv[])
{
    char plain[_MAX_PATH], codfile[_MAX_PATH];
    char cipher[_MAX_PATH], positions[_MAX_PATH];
    char drive[_MAX_DRIVE];
    char dir[_MAX_DIR];
    char fname[_MAX_FNAME];
    char ext[_MAX_EXT];

    FILE *fp_plain, *fp_cod, *fp_cipher, *fp_pos;
    long pos[N_CAPITAL_LETTERS];
    int  ch_plain, ch_cod;

    // Argument processing
    if (argc != 3) goto error1;
    _splitpath(argv[1], drive, dir, fname, ext );
    if (strlen(ext) == 0) strcpy(ext, "cod");
    _makepath(codfile, drive, dir, fname, ext);
    _makepath(positions, drive, dir, fname, "pos");
    _splitpath(argv[2], drive, dir, fname, ext );
    if (strlen(ext) == 0) strcpy(ext, "txt");
    _makepath(plain, drive, dir, fname, ext);
    _makepath(cipher, drive, dir, fname, "cry");

    // File opening
    if ((fp_plain  = fopen(plain, "r")) == NULL) goto error3;
    if ((fp_cod = fopen(codfile, "r")) == NULL) goto error3;
    if ((fp_cipher = fopen(cipher, "w")) == NULL) goto error3;
    fp_pos = fopen(positions, "rb+");

    // Position array initialization
    memset(pos, 0, sizeof(pos));
    if (fp_pos != NULL) {
       if (fread(pos, sizeof(long), N_CAPITAL_LETTERS, fp_pos)
             != N_CAPITAL_LETTERS)
          memset(pos, 0, sizeof(pos));   // ignore a truncated .pos file
       fclose(fp_pos);
    }
    // Main loop
    do {
         ch_plain = toupper(getc(fp_plain));
         if (ch_plain >= 'A' && ch_plain <= 'Z')  // A-Z only, so the index stays within pos[]
         {
            fseek(fp_cod, pos[LTRTOIDX(ch_plain)], SEEK_SET);
            do {
               ch_cod = getc(fp_cod);
            } while (!(ch_cod == ch_plain || ch_cod == EOF));
            if (ch_cod == ch_plain) {
               pos[LTRTOIDX(ch_plain)] = ftell(fp_cod);
               fprintf (fp_cipher, "%1ld ", pos[LTRTOIDX(ch_plain)]);
            } else goto error2;
         }
    } while (ch_plain != EOF);
    // Termination
    fp_pos = fopen(positions, "wb");
    if (fp_pos != NULL) {
       fwrite (pos, sizeof(long), N_CAPITAL_LETTERS, fp_pos);
       fclose(fp_pos);
    } else printf ("Warning: can not save positions\n");
    fclose(fp_plain); fclose(fp_cod); fclose(fp_cipher);
    return 0;
// Error handling
error1: printf ("USAGE: bkcode codfile message\n"); return 1;
error2: printf ("Ran out of letter %c!\n", ch_plain);
        fclose(fp_plain); fclose(fp_cod);
        fclose(fp_cipher); remove(cipher); return 1;
error3: printf ("Can not open files\n"); return 1;
}

Listing Three


// bkdecode
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char codfile[_MAX_PATH];
    char cipher[_MAX_PATH], decoded[_MAX_PATH];
    char drive[_MAX_DRIVE];
    char dir[_MAX_DIR];
    char fname[_MAX_FNAME];
    char ext[_MAX_EXT];

    FILE *fp_cipher, *fp_cod, *fp_decoded;
    long position, cod_size;

    // Argument processing
    if (argc!=3) goto error1;
    _splitpath(argv[1], drive, dir, fname, ext );
    if (strlen(ext) == 0) strcpy(ext, "cod");
    _makepath(codfile, drive, dir, fname, ext);
    _splitpath(argv[2], drive, dir, fname, ext );
    _makepath(cipher, drive, dir, fname, "cry");
    _makepath(decoded, drive, dir, fname, "txt");

    // File opening
    if ((fp_cipher  = fopen(cipher, "r")) == NULL) goto error3;
    if ((fp_cod = fopen(codfile, "r")) == NULL) goto error3;
    if ((fp_decoded = fopen(decoded, "w")) == NULL) goto error3;

    // Determine codfile size
    fseek(fp_cod, 0, SEEK_END);
    cod_size = ftell(fp_cod);

    // Main loop
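    // Stop at EOF or at the first token that is not a number
    while (fscanf(fp_cipher, "%ld", &position) == 1) {
        // Positions are 1-based; reject anything outside the codfile
        if (--position >= 0 && position < cod_size) {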
           fseek(fp_cod, position, SEEK_SET);
           putc(getc(fp_cod), fp_decoded);
        }
        else goto error2;
    }
    // Termination
    fclose(fp_cipher); fclose(fp_cod);
    fclose(fp_decoded);
    return 0;
// Error handling
error1: printf ("USAGE: bkdecode codfile cipher \n"); return 1;
error2: printf ("Invalid ciphertext\n");
        fclose(fp_cipher); fclose(fp_cod);
        fclose(fp_decoded); remove (decoded); return 1;
error3: printf ("Can not open files\n"); return 1;
}

The idea is to create a "tank" of letters from books for each of the correspondents. Using the command bkadd allice mybook.txt 1, Bob creates the file allice.cod with the initial letters of each word in mybook.txt. The last parameter determines which letter of each word should be used: positive values 1, 2, 3, and so on select the first, second, third, etc. letter, while negative values count letters from the end of the word (-1 is the last letter). Words too short to contain the requested letter are skipped. If this parameter is omitted, the first letter is used by default.

Program bkcode is used to encode a message: bkcode allice message1.txt transforms message1.txt into message1.cry using letters from allice.cod. At the successful completion of the process, the program generates the file allice.pos with pointers to the positions of last used letters a, b, c, d... in allice.cod. When the next message is encoded (bkcode allice message2.txt), the search for the letters automatically continues from the previously memorized positions; no part of the (transformed) book is used twice. If the process fails (that is, if the supply of at least one of the letters in allice.cod is consumed), the appropriate message is generated and the file allice.pos is unchanged. Bob should add another book to allice.cod using the command bkadd allice myotherbook.txt 1 and repeat the bkcode command.

To decode a message, Alice uses the command bkdecode allice message1.cry, producing the message1.txt file. Figure 3 shows the complete process. For two-way communication, Alice should generate the file bob.cod from different books and use it to encode messages to Bob.

Figure 3: Programs and results of coding and decoding the message using the Book cipher.

Eve is welcome to intercept any of the .cry files, but without knowledge of the books used, she is clueless even if her other name is "Susan Fletcher."

How Strong Is It?

Cryptanalysts mostly agree that the Book cipher, if used properly, is practically unbreakable; nearly as good as the one-time pad. Why isn't it used every day? Maybe because of that "if used properly" clause—the complete algorithm is somehow "private." The next time you bury a treasure, you can describe its location within an encrypted message and be reasonably sure that it will not be decoded for the next 150 years, but if you have to organize a secure correspondence for a web of spies all over the world, finding, deploying, and protecting adequate books might prove very difficult. By implementing the Book cipher in your applications, you don't meddle with powers you cannot comprehend—you leave the meddling to users of your software. The average user will probably go to www.gutenberg.org, download the first book, and use it as a key without even bothering to delete the copyright message (which is basically the same for every book on that site).

On the other hand, if users of your software contribute ideas of their own—using texts from the Internet instead of books, using different languages, pre-encoding messages using other algorithms, and so on—then potential attackers would be facing many groups of short messages encrypted using (at least slightly) different algorithms, which might present the ultimate challenge.

Dejan is editor-in-chief of PC Press, a personal computer magazine in Serbia and former Yugoslavia. He can be contacted at www.ristanovic.com. Jelica is a professor of computer engineering at the School of Electrical Engineering, University of Belgrade. She can be contacted at www.jeca.rs.