*Simon is a freelance programmer and author, whose titles include *Beginning Perl* (Wrox Press, 2000) and *Extending and Embedding Perl* (Manning Publications, 2002). He's the creator of over 30 CPAN modules and a former Parrot pumpking. Simon can be reached at [email protected]*

By the time you read this, we should be coming up to a significant date: The 7th of June 2005 is the 51st anniversary of the death of Alan Turing. I have a very high regard for Alan Turing, but then his impact on the computer world cannot be underestimated: It is fair to say that he single-handedly ensured both the academic credibility of the discipline of computer science and the practical utility of the digital computer.

It is also no great exaggeration to say that the ingenuity of he and his team at Bletchley Park changed the face of the naval landscape during the Second World War. To find an expert mathematician, an innovative computer programmer, and a groundbreaking cryptographer would be astounding; to find them all in the same person was nothing short of miraculous.

So it was in honor of Alan Turing that I decided to teach myself Banbarismus, one of the cryptographic methods that he invented in order to decrypt the German naval Enigma code.

## Enigma Machines

The Enigma machine used during the war went through several revisions, but the basic principle remained the same. The simplest form of code is the substitution cipher: You scramble the alphabet, and then substitute each letter of the original ("plaintext") with the letter in the same position in the scrambled alphabet. So if you had:

abcdefghijklmnopqrstuvwxyz Code: ekmflgdqvzntowyhxuspaibrcj

Then the letters "c" "a" "t" would be substituted to give "m" "e" "p" (the "ciphertext"). In Perl, we'd implement a substitution cipher with the *tr///* operator:

$text = "cat"; $text =~ tr/a-z/ekmflgdqvzntowyhxuspaibrcj/; print $text; # mep

Simple substitution ciphers like this are easy to break, using frequency analysis: The letter that most appears in the ciphertext is likely to represent "e" in an English text, and then "t" and so on. The Enigma machine uses a combination of simple substitution ciphers—one preprogrammed one called the "reflector" and one user-programmable one called the "Steckerboard."

Between the Steckerboard and the reflector is where the magic happens. There is a space for three, or in naval Enigma up to five, rotors. The operator chooses three rotors from the box that the machine comes in, and places them in the correct order inside the machine. Each of these rotors encodes a substitution cipher. Each character of plaintext gets substituted through the Steckerboard, substituted again through each of the rotors, substituted through the reflector, back again through the rotors, and back again through the Steckerboard before the encrypted character lights up on the operator's display.

So far we have seven substitution ciphers acting in series, which is actually no better than one substitution cipher. What makes Enigma special is that the rotors rotate after each character. So after one substitution, the code on the first rotor would be:

abcdefghijklmnopqrstuvwxyz Code: kmflgdqvzntowyhxuspaibrcje

The cipher is scrambled for each letter, meaning that two "A"s in a row would encipher to different letters. This all went to create what was thought to be an "unbreakable" cipher—until Turing came along.

## Our Very Own Enigma

Enough talk—let's have some Perl. There are two Enigma machine implementations on CPAN; one of them is *Crypt::OOEnigma*, which models the physical Enigma machine in an OO way: You create your own *Rotor* objects, create a *Steckerboard* object, plug them all into an *Enigma* object, and off you go. This is very good as an example of OO Perl, but not as an example of an Enigma machine, unfortunately—it doesn't deal with the fact that rotors rotate at different speeds. (Turing found that if the five rotors were set to "AAAAA," the second would move to "B" when the first was at "R," the third would move when the second was at "F," the fourth would move when the third was at "W," and so on. The mnemonic "Royal Flags Wave, Kings Above" was used to remember the turnover positions of the rotors.)

*Crypt::Enigma* is better, but does not use the Royal Flags turnovers of naval Enigma. It is simpler to use, though. Here's a Perl-based Enigma machine:

use Crypt::Enigma; my($enigma) = Crypt::Enigma->new(); $enigma->setup("341", "HAL", "CGC");

First we set up the Enigma machine; "341" states that we want wheel 3, then wheel 4, then wheel 1. The next three letters tell you what the rotors should be showing when you insert them into the machine, and then rotors are rotated, with turnovers, until they show the next letters.

Next we configure the Steckerboard:

$enigma->stekker($_->[0] => $_->[1]) for ( [qw/a p/], [qw/b m/], [qw/c l/], [qw/d e/], [qw/j u/], [qw/o z/], [qw/v h/], [qw/x n/], [qw/y i/], [qw/f s/] );

This makes up the "base state" of the Enigma, the "Grundstellung." Now we can start typing. We want to use *Term::ReadKey* to look at individual keys, and we use a ReadMode of 4 so that the keys are not displayed on the screen:

use Term::ReadKey; ReadMode 4; my $chars; while (defined ($key = ReadKey(0))) {

Now we need to normalize the key we've read: *Crypt::Enigma* wants lower-case characters, and we also need to stop if we see a control character like ^C.

last if ord($key) < 10; $key = lc $key; next unless $key =~ /[a-z]/; print uc($enigma->input($key));

Finally, we want to format the ciphertext in the same way the German Enigma did it: groups of four characters:

$chars++; print " " if $chars % 4 == 0; print "\n" if $chars % 36 == 0; }

Now if I run this and hold down the "a" key, I get the following output:

JXJC TTPO LNVG ZNSW EZSC BWQQ FEIJ WHRR RCTI NJFG WELY HLSF XTGZ EQPG SYTR EXBW YGDE XENS PIML BNSS OKDG GYGJ ZCGW GEBC GJTD LXVI PCUD OEKK KGFD BYDT DYLS MHMU QFJL BCTY ZYWQ VFWS

As you can see, this is now something far more than a simple substitution cipher. So how did Turing break it?

## Banbarismus

When decrypting a message with Enigma, the German operators would need to enter a three letter code, the indicator, to get the machine into the right state away from the Grundstellung. This meant that when an encrypted message was sent, the indicator was itself encoded and sent along with the message. A raid on a German submarine gave Turing's team the the codebook used to encode the indicator. So now, from a message header, we know the indicator; we don't know the Grundstellung, but we can work out parts of it.

The reason for this is the way the rotors advance. If we have a machine whose rotors show "CGC" and one that shows "CGG," and we type four characters on the first machine, both will be in the same state. From then on, the two machines will be marching in sync, and given letter frequencies, there's a good chance that the same plaintext letter will appear at similar points in the messages. This is clearly demonstrated in our "all A" message; the same message encoded with a start of "CGG" looks like this:

TTPO LNVG ZNSW EZSC BWQQ FEIJ ...

So what we do to recover the difference in starts is to put the texts next to each other and slide them until we have a match. The amount we slide them is called the "depth." This is the Banbarismus procedure, so called because the messages were printed onto special sheets of paper—the top sheet contained holes under each of the characters so that the bottom sheet could be seen as the top message slid along—printed in Banbury in Oxfordshire. Here are the results of applying Banbarismus to our two messages:

Depth: 0 JXJCTTPOLNVGZNSWEZSCBWQQFEIJ TTPOLNVGZNSWEZSCBWQQFEIJ (0 characters in common) Depth: 1 JXJCTTPOLNVGZNSWEZSCBWQQFEIJ TTPOLNVGZNSWEZSCBWQQFEIJ (0 characters in common) Depth: 2 JXJCTTPOLNVGZNSWEZSCBWQQFEIJ TTPOLNVGZNSWEZSCBWQQFEIJ (0 characters in common) Depth: 3 JXJCTTPOLNVGZNSWEZSCBWQQFEIJ TTPOLNVGZNSWEZSCBWQQFEIJ (2 characters in common) Depth: 4 JXJCTTPOLNVGZNSWEZSCBWQQFEIJ TTPOLNVGZNSWEZSCBWQQFEIJ (25 characters in common)

When the number of common characters reaches a maximum, the message is said to be "set at depth." So although we don't know the original rotor starts, we know the two indicators and the optimal depth. At Bletchley Park, this information would be written like so:

01, 02 EJK + 4 = EJP

This says that a machine with an indicator "EJK" (as in message number 01) is four rotations away from one with the indicator "EJP" (message number 02). Now you do this for all the captured texts you have, and you can produce a set of correspondences. Here's a part of one such set:

05,06 DXE + 2 = DXW 06,07 DXW + 2 = DXD 14,15 DIL + 2 = DIO 14,17 DIL + 6 = DIZ 02,05 DXO + 6 = DXE 02,06 DXO + 8 = DXW 20,21 DID + 2 = DIC

I've deliberately looked at those with very similar indicators because they indicate a very similar Grund. Now we have to find a partial alphabet that has these properties: E is 2 places away from W, and so on. With a bit of head scratching, you can come up with:

L.O.I.Z.E.W.D.C

Compare this against a normal alphabet, and you have the cipher used on one of the rotors:

ABCDEFGHIJKLMNOPQRSTUVWXYZ L.O.I.Z.E.W.D.C

You'll see that O and C pair up—C encrypts to O, and O to C. This tells us we're on the right track. With more correspondences, we can actually identify which rotor this is—if the first two letters of the indicator are the same, then there has been no turnover; the alphabet turns out to run from U to O, and so the rotor must be number 1—its turnover (remember R for Royal) is the only one that is outside this range.

This vastly reduces the number of combinations required to brute-force all the settings, and means the messages can be broken using Turing's Bombe computers.

### Break Shark!

OK, let's now do this in Perl. I used the set of "test" messages from Tony Sale's Lecture on Naval Enigma, from which I learned the various Enigma procedures.

The core of Banbarismus is setting the message at depth: working out the optimal depth that gives us the largest number of characters in common between two messages. The Perl implementation of this might be a bit surprising, so let's have a look at it:

use List::Util qw(reduce); sub compare { my ($text_a, $text_b) = map {$crypts[$_]} @_; my @holes = map { ($text_a ^ (" " x $_ . $text_b)) =~ tr/\0// } 0..52; my $offset = reduce { $holes[$a] > $holes[$b] ? $a : $b } 0..52; return $offset; }

We're assuming that we have the the ciphertexts in the array *@crypts*, and we'll pass in two message numbers, expecting to return the depth.

This is based on a trick I learnt from Nicholas Clark, who in turn claims to have got it from Damian Conway. Originally, the question was about finding the length of longest common prefix between two strings. Damian's solution was to *xor* the strings together—then any identical characters between the two strings would be replaced by character 0's:

$one = "abcabc"; $two = "abc "; $one ^ $two # "\0\0\0ABC"

Now you count the *\0*s at the beginning of the string:

($one ^ $two) =~ /^(\0+)/; length($1)

And you've got the length of the common prefix.

Here we're using the same technique to count the common characters, by counting (with *tr///*) the nulls all the way through the *xor*-ed string:

($one ^ $two) =~ tr/\0//;

Now setting something at depth is just a matter of sliding it along, by adding a variable number of spaces at the beginning of one of the texts, then counting the nulls—the "holes" in the Banbarismus sheets:

my @holes = map { ($text_a ^ (" " x $_ . $text_b)) =~ tr/\0// } 0..52;

Now we have to find out which depth produced the most holes. This uses *List::Util*'s *reduce* routine, which "accumulates" a list:

my $offset = reduce { $holes[$a] > $holes[$b] ? $a : $b } 0..52;

In this case, *$a* is the current "best offset" and *$b* is the next entry in the list from 0 to 52. If the current position has more holes than *$b*'s position, we keep *$a*; else, *$b* becomes our new "best offset." This is faster than the more usual:

my ($offset) = sort { $holes[$b] <=> $holes[$a] } 0..52;

because *reduce* only scans over the list once: the algorithm is order *O(n)* rather than *sort*'s *O(n log n)*.

So far so good? We simply compare all the cyphertexts, find the largest offset, and then we can build our associations, right?

### Statistical Significance

Wrong. The problem is, we shouldn't be expecting to see something significant for each pair of messages. The very nature of setting at depth is a statistical process, and we should make sure that the result we get is statistically significant before we accept it. If we decide that one pair of messages is set at depth 3 because there are nine holes, is this significant? If every other depth gives us two or three holes, then it probably is; if every other depth gives us eight holes, it's probably not.

We can apply a very simple test to determine statistical significance: We take the standard deviation of the data and see if our best figure is more than *n* standard deviations away from the average. There's a statistical rule of thumb that says that 95 percent of the data in a normal distribution lies *n*=2 standard deviations away from the mean; that means if our best depth is more than two standard deviations away from the mean, it's very likely to be significant. So, let's work out the average and standard deviation.

Again, *List::Util* comes in handy with its *sum* routine:

my $average = sum(@holes)/@holes; my $stddev = sqrt( sum( map { ($average - $_)**2 } @holes) / @holes );

And now we can reject this association if our best depth doesn't give a number of holes two standard deviations away from the mean:

return if $holes[$offset] - $average < $stddev * 2;

### Putting It Together

With this *compare* routine in place, we can put something together that produces a "first indicator plus offset equals second indicator" list like the one we looked at above:

use File::Slurp; my @crypts = read_file("crypts"); my @indicators = read_file("indicators"); for my $i (0..$#crypts) { for my $j ($i+1 .. $#crypts) { if ( my $offset = compare($i, $j) ) { print "($i, $j) $indicators[$i] + $offset = $indicators[$j]\n"; } } }

For the test messages from Tony Sale's page, this produced 28 lines of correspondences:

01,14 DXL + 26 = DIL 01,15 DXL + 28 = DIO 02,05 DXO + 6 = DXE 02,06 DXO + 8 = DXW 02,08 DXO + 12 = DXZ ...

Of course, now we have to do the hard part of arranging the alphabet on the rotor. I'm sure there is a way to do this with constraint-based linear programming, but... well, we have to do some code-breaking work or this isn't any fun. We can, however, make it a bit easier by visualizing the problem.

### Pretty Graphs

I like to use *Graphviz* to visualize data, and especially on OS X, you can drag around the graph nodes to your heart's content; there's a *Graphviz* Perl module we could use, but the file format is easy enough to cook up manually. Given the line:

02,06 DXO + 8 = DXW

we want to represent the information that "O" and "W" are eight places apart. In *Graphviz*, we can create a line between node "O" and node "W" and label it "8," like so:

o -- w [label="8"]

In fact, this is so easy to do we can just filter the output of our previous script with a couple of lines of Perl:

while (<>) { /(\(?\d+, ?\d+\)?) (\w+) \+ (\d+) = (\w+)/ or die $_.": OOF"; next unless substr($2,0,2) eq substr($4,0,2); print substr($2,2,1). " -- ".substr($4,2,1)." [label=\"$3\"];\n"; }

This checks that no fallover has occurred, just to make things a bit simpler, then outputs the appropriate line. Feeding this to *Graphviz*, and playing about with the nodes until they're more or less in a straight line, we get the network shown in the diagram:

Notice there are still some problems with this diagram; we have the chains

Z + 6 = L I + 6 = L Z + 2 = I

which evidently can't be resolved, so even though there was a statistical significance to the correlations, we have to be discerning about the ones we use. However, with this data, we can construct a partial rotor alphabet. Cracking the rest of the code is left as an exercise for the reader!

**TPJ**