*The authors are associated with the COSIC research group of the Department of Electrical Engineering at the Katholieke Universiteit Leuven, Belgium. They can be contacted at [email protected], [email protected], and [email protected], respectively.*

An increasing number of applications use software implementations of cryptographic algorithms to provide an acceptable security level at a low cost. However, the design of secure cryptographic primitives that achieve very high software performance is a challenging problem. The best that can be achieved currently is to design fast primitives with some provable properties. Subsequently, these candidate algorithms are scrutinized by cryptanalysts, and lessons are learned from discovered flaws. Still, there are primitives that are provably secure under some reasonable assumptions (such as the difficulty of factoring the product of two large primes), but these are several orders of magnitude slower than the fastest algorithms currently in use.

In this article, we'll compare different approaches and design choices and their performance in software. We hope to give potential users, cryptanalysts, and designers a sense of the diverse approaches behind present-day cryptographic primitives. In the process, we'll examine three types of cryptographic primitives:

- Additive stream ciphers.
- Hash functions.
- Block ciphers.

An extended version of this article, entitled "Recent Developments in the Design of Conventional Cryptographic Algorithms," is included in *Computer Security and Industrial Cryptography: State of the Art and Evolution* (Springer-Verlag, 1998). For the latest information on a selection of block ciphers, see http://www.ii.uib.no/~larsr/bc.html.

### Cryptographic Primitives

**Additive stream ciphers** stretch a short key to a key-stream sequence. The sender adds this key-stream sequence to the data, simulating the operation of a one-time pad (but without the perfect secrecy). The recipient can recover the data by subtracting the same key-stream sequence. Additive stream ciphers are also known as pseudorandom string generators.
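The additive principle itself fits in a few lines of Python. The keystream generator below is a hash-based counter construction used purely as a stand-in (it is not any cipher discussed in this article); the point is that encryption and decryption are the same XOR operation:

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    """Toy keystream generator (hash of key plus counter) -- a stand-in
    for a real stream cipher's generator, for illustration only."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha1(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def crypt(key: bytes, data: bytes) -> bytes:
    """Add (XOR) the keystream to the data. Applying it twice cancels,
    so the same function both encrypts and decrypts."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))

ciphertext = crypt(b"short key", b"a longer message")
plaintext = crypt(b"short key", ciphertext)   # recovers the message
```

Because addition (here, XOR) is its own inverse, sender and recipient run identical code; all the cryptographic strength must come from the keystream generator.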

The stream ciphers that are discussed here are "alleged RC4" and SEAL. Cryptographic literature contains a large number of papers on other constructions derived from linear feedback shift registers. They are usually defined at bit level, which makes them more suited for hardware than for software. (Although it is certainly possible to adapt these constructions to software environments, we will not consider such possibilities here.)

**Cryptographic hash functions** compress strings of arbitrary lengths to strings of fixed lengths (typically 64, 128, or 160 bits). In addition, they satisfy the following properties:

- Preimage resistance: It should be hard to find a preimage for a given hash result.
- Second preimage resistance: It should be hard to find a second input that hashes to the same result as a given input.
- Collision resistance: It should be hard to find two different inputs with the same hash result.

While these properties are simple, experience has taught us that achieving them is quite hard.

The hash functions discussed here are MD4, MD5, SHA-1, RIPEMD-160 (see "The RIPEMD-160 Cryptographic Hash Function," by A. Bosselaers, H. Dobbertin, and B. Preneel, *DDJ*, January 1997), MD2, Snefru, Subhash, and Tiger. The first four belong to the so-called "MD*x* family."

**Block ciphers** transform a relatively short string (typically 64 or 128 bits) to a string of the same length, under control of a secret key. The transformation is invertible.

Block ciphers are often used to construct other primitives, such as hash functions, stream ciphers, and MACs. For this purpose, a number of different operation modes have been defined (see ISO/IEC 10116, "Information Technology -- Security Techniques -- Modes of Operation of an *n*-bit Block Cipher Algorithm," IS 10116, 1997). The popularity of block ciphers in cryptography is closely related to the popularity of the Data Encryption Standard (DES) (see FIPS 46, "Data Encryption Standard," Federal Information Processing Standard [FIPS], Publication 46, National Bureau of Standards, U.S. Department of Commerce, Washington D.C., January 1977). After its publication as a Federal Information Processing Standard in 1977, DES became widely used to provide cryptographic protection. You can anticipate that the popularity of block ciphers will continue, as NIST is preparing for a successor of DES, the AES (Advanced Encryption Standard).

In this article, we'll compare the following block ciphers:

- DES (and 3-DES).
- FEAL.
- IDEA.
- Blowfish.
- Khufu.
- SAFER (and variants).
- LOKI91.
- CAST.
- RC5.
- SHARK.
- SQUARE.
- MISTY1 and MISTY2.
- 3-WAY.

### Brute Force Attacks

Resistance against brute force attacks is determined by the general parameters of the cryptographic algorithm -- key length, output size, and size of the internal state.

For encryption algorithms, the most important brute force attack is an exhaustive key search. For an ideal cipher with a *k*-bit key, searching the key space requires 2^{k} encryptions. As an example, DES with its 56-bit key is vulnerable to this attack. With dedicated hardware costing about $1 million, a DES key can be recovered in about an hour; the cost per key is less than $50. RSA Data Security's contest has demonstrated that a DES key can be recovered using idle cycles on the Internet over a period of 90 days. For long-term security (10 or more years), *k* should be at least 75 to 90 bits (see "Minimal Key Lengths for Symmetric Ciphers to Provide Adequate Commercial Security," by M. Blaze, W. Diffie, R.L. Rivest, B. Schneier, T. Shimomura, E. Thompson, and M. Wiener, http://www.bsa.org/, January 1996); 128 bits provide an ample margin.
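The arithmetic behind these estimates is simple enough to sketch; the trial rate used below is a hypothetical round number, not a claim about any particular machine:

```python
def worst_case_trials(k: int) -> int:
    """Exhaustive search of a k-bit key space: 2^k trials in the worst case
    (half that on average)."""
    return 2 ** k

def search_years(k: int, trials_per_second: float) -> float:
    """Worst-case search time in years at a hypothetical trial rate."""
    return worst_case_trials(k) / trials_per_second / (365 * 24 * 3600)

# Every extra key bit doubles the work: moving from a 56-bit to a
# 128-bit key multiplies the search effort by 2^72.
ratio = worst_case_trials(128) // worst_case_trials(56)

fast_56 = search_years(56, 1e10)    # well under a year at this rate
slow_128 = search_years(128, 1e10)  # astronomically long at any rate
```

The exponential gap, not the absolute rate, is the message: hardware speedups shave key bits linearly, so a comfortable margin (128 bits) stays out of reach even for optimistic rates.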

For a hash function, the output size is an important parameter. If the output size is *n* bits, brute force attacks to find a preimage or a second preimage require about 2^{n} hash operations. For preimage or second preimage resistance, a value of *n*=75 to 90 is sufficient for 10 years. Due to the birthday paradox, finding a collision requires only about 2^{n/2} operations (see *Handbook of Applied Cryptography*, by A.J. Menezes, P.C. van Oorschot, and S.A. Vanstone, CRC Press, 1997; and "Cryptographic Primitives for Information Authentication: State of the Art," by B. Preneel, *Computer Security and Industrial Cryptography: State of the Art and Evolution*, Springer-Verlag, 1998). Consequently, a collision-resistant hash function requires *n*=160.
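The birthday paradox is easy to demonstrate on a deliberately weakened hash. The sketch below truncates SHA-1 to 24 bits; a collision then falls out after roughly 2^{12} attempts rather than the 2^{24} a preimage would take:

```python
import hashlib

def toy_hash(data: bytes, bits: int = 24) -> int:
    """SHA-1 truncated to `bits` bits -- deliberately weak, for the demo."""
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest, "big") >> (8 * len(digest) - bits)

def find_collision(bits: int = 24):
    """Birthday search: hash distinct inputs until two collide.
    Expected effort is about 2**(bits/2) hash operations."""
    seen = {}
    i = 0
    while True:
        msg = b"message-%d" % i
        h = toy_hash(msg, bits)
        if h in seen:
            return seen[h], msg, i   # two distinct inputs, one hash value
        seen[h] = msg
        i += 1

m1, m2, tries = find_collision()
# m1 != m2, yet toy_hash(m1) == toy_hash(m2), after only a few
# thousand attempts.
```

Scaling the same square-root effect up is exactly why *n*=160 is needed for collision resistance when *n*=80 or so would do for preimage resistance.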

Comparable attacks exist for block ciphers. In a code book attack, the cryptanalyst constructs a table that lists all the plaintext/ciphertext pairs of a block cipher for a given key. This attack can be precluded by choosing *n* large, or by changing the key frequently. If a block cipher is used in CBC, CFB, or OFB mode, information on the plaintext starts to leak after 2^{n/2} encryptions. This explains why NIST requires that AES candidate algorithms have a block length of 128 bits.

Additive stream ciphers have the property that the key stream will eventually repeat. If the internal state consists of *m* bits, this will happen after about 2^{m/2} bits if the next state function is a random function. If the next state function is a permutation, one expects repetitions after 2^{m-1} bits. A sufficiently large value of *m* can preclude this attack.

### Design Principles for Cryptographic Algorithms

Designing a slow and secure algorithm is easy for someone who understands the current designs and their strengths and weaknesses. If, on the other hand, the performance has to be pushed to the limits, a thorough analysis has to be combined with a good understanding of the limitations of the processor or technology in which the system has to be implemented. Very little is known regarding which structures provide the best security for a given number of instructions per encrypted or hashed byte.

Besides the obvious restrictions imposed by the brute force attacks, the recent progress in cryptanalysis (such as differential and linear cryptanalysis) is an important factor in the development of new algorithms. These cryptanalytic techniques have brought new insights in the security of cryptographic primitives and stimulated research on building blocks such as S-boxes and diffusion layers. This section summarizes the different design aspects.

Hash functions and block ciphers operate on relatively large inputs (64...256 bits). These primitives are designed based on a principle proposed by Shannon: Small nonlinear substitutions are alternated with mixing functions. In the 1970s, Feistel proposed to construct ciphers by iterating a relatively simple transformation, called a "round." Except in a few degenerate cases, an algorithm can be made arbitrarily secure by adding more rounds. However, performance will decrease accordingly.

If every input bit is treated in a similar way, one can speak of a uniform transformation, or substitution-permutation network (SPN); see Figure 1. A disadvantage of this approach is that the inverse function may be different from the function itself. With block ciphers, this can be a problem for hardware and smart-card applications. A clear advantage is the inherent parallelism. Examples of block ciphers in this class are SAFER, SHARK, SQUARE, and 3-WAY. Subhash is a hash function using this approach.
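A minimal SPN round might look as follows. The S-box and mixing layer here are toy choices for illustration, not taken from any of the ciphers named above; note that the inverse round is genuinely a different function, which is the implementation disadvantage just mentioned:

```python
# Toy SPN round on a 4-byte block: key addition, byte-wise S-box,
# then a simple invertible linear mixing layer. Illustrative only.
SBOX = [(7 * x + 3) % 256 for x in range(256)]   # invertible: gcd(7, 256) = 1
INV_SBOX = [0] * 256
for x in range(256):
    INV_SBOX[SBOX[x]] = x

def spn_round(block, round_key):
    block = [x ^ k for x, k in zip(block, round_key)]   # key addition
    a, b, c, d = (SBOX[x] for x in block)               # nonlinear layer
    return [a ^ b, b ^ c, c ^ d, d]                     # linear diffusion

def inv_spn_round(block, round_key):
    m0, m1, m2, m3 = block
    d = m3                                              # undo diffusion...
    c = m2 ^ d
    b = m1 ^ c
    a = m0 ^ b
    block = [INV_SBOX[x] for x in (a, b, c, d)]         # ...then the S-box
    return [x ^ k for x, k in zip(block, round_key)]    # and key addition
```

All four bytes pass through the same layers each round (the "uniform" property), and the S-box lookups are independent of one another, which is the source of the inherent parallelism.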

A different approach consists of dividing the input into two halves, and applying a nonlinear function only to the right half. The result is added into the left half, and subsequently left and right halves are swapped. Ciphers following this approach are called "Feistel ciphers" (see Figure 2). Since the nonlinear part requires most of the computation, two rounds of a Feistel cipher require about the same effort as a uniform transformation. The output of one nonlinear function is input directly to the next one, which decreases the amount of parallelism but increases the propagation of local changes. Due to the special structure of the round transformation, the nonlinear function itself need not be invertible, and the round transformation is equal to its inverse. Since DES is a Feistel cipher, more cryptanalytic experience is available on Feistel ciphers than on any other general structure. Other Feistel ciphers are FEAL, Blowfish, Khufu, LOKI91, CAST, and MISTY1.
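The Feistel structure can be sketched generically. The round function below is a deliberately non-invertible toy (squaring modulo the word size, with a hypothetical key mix), which the structure tolerates; decryption is the same loop with the round keys reversed:

```python
MASK32 = 0xFFFFFFFF

def toy_f(x, k):
    """Toy round function -- squaring is NOT invertible, which is fine
    inside a Feistel network. Not from any published cipher."""
    return (x * x + k) & MASK32

def feistel_encrypt(left, right, round_keys, f=toy_f):
    """One swap-and-mix per round: the right half is fed through f and
    added into the left half, then the halves are swapped."""
    for k in round_keys:
        left, right = right, left ^ f(right, k)
    return left, right

def feistel_decrypt(left, right, round_keys, f=toy_f):
    """Identical structure, round keys in reverse order."""
    for k in reversed(round_keys):
        right, left = left, right ^ f(left, k)
    return left, right
```

Because each round only XORs `f`'s output into one half, undoing a round needs `f` evaluated again, never inverted; that is why the same hardware or code can serve both directions.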

The approach of a Feistel cipher can be further extended by dividing the input into more parts (so called "generalized unbalanced Feistel networks"), as in RC2, the hash functions of the MD*x*-family, MD2, Tiger, and Snefru. Other variants and extensions of uniform transformations and Feistel ciphers have been proposed (RC5, MISTY2, and IDEA, for instance).

A nonlinear component is essential to every strong cryptographic primitive. The goal of the designer is to build a large nonlinear primitive from smaller ones.

A straightforward way to implement simple nonlinear functions is with lookup tables, or S-boxes. The DES uses eight different S-boxes with six input bits and four output bits (denoted with 6->4); the total size of 256 bytes was clearly dictated by hardware constraints of the mid 1970s. Byte level S-boxes (8->8) are very popular because they are suited for software implementations on 8-bit processors. Ciphers that use such S-boxes include SAFER, MD2, alleged RC4, SHARK, and SQUARE. LOKI91 uses S-boxes of type 12->8.

For modern processors with 32-bit or 64-bit words, S-boxes with more output bits can provide higher efficiency. Snefru was the first algorithm to use 8->32 S-boxes; this example was followed by Blowfish, Khufu, CAST, and SQUARE. SHARK and Tiger use even larger lookups (8->64). For SHARK and SQUARE, these are obtained by combining the byte-level S-boxes with the diffusion operation.

For Blowfish, Khufu, and alleged RC4, the S-boxes contain secret values, which makes attacks more difficult. The values of the S-boxes are often selected at random (MD2 and Snefru, for example), or carefully selected to achieve certain properties (DES, for instance). In the case of SAFER, MISTY1, MISTY2, SHARK, and SQUARE, the S-boxes contain some mathematical structure (such as an exponentiation over a finite field).

If the nonlinear operation is not implemented by table lookups, it is realized using the available processor instructions. FEAL uses addition and rotation; the MD*x* family adds simple bitwise Boolean functions to these; IDEA uses addition and multiplication. The innovative choice for RC5 was data-dependent rotations.
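The data-dependent rotation can be illustrated in isolation. The sketch below is one mixing step in the style of an RC5 half-round, with a hypothetical round key; it is not the full cipher or its key schedule:

```python
W = 32                 # word size in bits
MASK = (1 << W) - 1    # mask to keep results within one word

def rotl(x, r):
    """Rotate a W-bit word left by r positions (r taken mod W)."""
    r %= W
    return ((x << r) | (x >> (W - r))) & MASK

def half_round(a, b, round_key):
    """RC5-style mixing step: XOR the halves, rotate by an amount taken
    from the DATA itself (the low bits of b), then add a round key.
    The data-dependent rotation amount is what makes the step hard to
    approximate linearly."""
    return (rotl(a ^ b, b % W) + round_key) & MASK
```

Note that this is exactly the kind of operation that performs well on processors with a rotate instruction and poorly on those without, a point the performance section returns to.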

To restrict the complexity of the implementation, nonlinear operations can only be applied to small parts of the block. Several techniques are used to spread local changes. Linear transformations are well suited for this purpose. The simplest solution is a bit permutation, as is used by DES, LOKI91, and Subhash, or a rotation, as in Khufu and Khafre. An alternative is to add up the output of several S-boxes, as is done for Blowfish and CAST. More general linear transformations are the pseudo-Hadamard transform used in SAFER, and the diffusion operation based on Maximum Distance Separable (MDS) linear codes used in SHARK and SQUARE (see "The Block Cipher Square Algorithm," by J. Daemen, L.R. Knudsen, and V. Rijmen, *DDJ*, October 1997.)
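SAFER's pseudo-Hadamard transform, for example, is a two-byte linear map with an equally cheap inverse; shown here on bytes (mod 256), as a sketch of the diffusion idea rather than a SAFER implementation:

```python
def pht(a, b):
    """2-point pseudo-Hadamard transform on bytes, as used in SAFER:
    each output byte depends on both input bytes."""
    return (2 * a + b) % 256, (a + b) % 256

def inv_pht(x, y):
    """Inverse: from x = 2a + b and y = a + b it follows that
    a = x - y and b = 2y - x (all mod 256)."""
    return (x - y) % 256, (2 * y - x) % 256
```

Layering such 2-point transforms across a block spreads a one-byte change to every output byte in a logarithmic number of steps, while each step costs only two additions.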

Some cryptographic primitives have no separate diffusion operation, but combine linear and nonlinear operations in such a way that changes spread quickly through the block. Examples of this approach are FEAL, IDEA, and the MD*x* family.

The key schedule computes the round keys from the external key. Some applications require very fast key schedules: banking applications, for instance, which use a new session key per transaction, or Asynchronous Transfer Mode (ATM) communication lines.

One issue is the existence of weak keys -- those for which the block cipher is more vulnerable. Weak keys have been identified for DES, LOKI91, and IDEA. Ideally, such keys should not exist. If they form only a sparse subset of the key space, they pose no security problem if the block cipher is used for encryption; they can be a problem for other applications such as hash functions based on block ciphers.

Recently, related-key attacks have been developed, in which an attacker obtains ciphertext corresponding to keys with a known or chosen relation. The lesson learned from these attacks is that key schedules should not be too simple.

Again, many approaches can be distinguished, ranging from a selection of bits (DES), through a rotation operation (SAFER, LOKI91, IDEA), to nonlinear operations (CAST). SQUARE uses a key schedule based on linear codes to guarantee a large distance between round keys derived from different external keys. The importance of a good key schedule is demonstrated by the case of SAFER. After L.R. Knudsen's attack on SAFER-K, the key schedule was improved by adding a parity byte to the round keys. The new SAFER is called "SAFER-SK."

Some block ciphers such as Blowfish, SHARK, and RC5 use the block cipher itself to perform the key schedule. The hash function SHA-1 uses a variant of a shortened cyclic code to diffuse the message input throughout the calculations (this operation plays the same role as the key schedule of a block cipher). Tiger also applies a diffusion transformation to the message input; it consists of Boolean operations, additions, and rotations.

The aforementioned criteria relate mostly to block ciphers and hash functions. Next, we'll discuss the most important design choices of two additive stream ciphers. The small number of proposals does not yet allow us to identify general approaches.

SEAL uses basic operations borrowed from the MD*x* family, and is tuned to be fast on the 80486 processor. It derives a large part of its strength from a strong re-keying procedure, based on the hash function SHA-1, that is carried out for every block. This re-keying is quite slow, which implies that the performance depends on the block size.

Alleged RC4 is an innovative design using a 256-byte table (containing an 8-bit permutation), which is updated during every iteration. Its description is simple; the basic operations are additions and table lookups on bytes. The key schedule converts a key of arbitrary size to a seemingly random 8-bit permutation.
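That description fits in a dozen lines. The sketch below follows the widely circulated "alleged RC4" description; whether it matches RSA's proprietary RC4 is exactly the "alleged" caveat:

```python
def arc4_keystream(key: bytes, n: int) -> bytes:
    """Keystream generator per the public 'alleged RC4' description:
    a 256-byte permutation updated with byte additions and swaps."""
    # Key schedule: turn a key of arbitrary length into an initial
    # 8-bit permutation.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Keystream generation: two index updates, a swap, and a lookup
    # per output byte.
    out = bytearray()
    i = j = 0
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return bytes(out)
```

Every operation is a byte addition or a table access inside a 256-byte table, which is why the cipher is fast on essentially any processor with a small cache.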

### Performance Evaluation

Optimizing the performance of software and hardware is quite different. Fast hardware relies on parallelism and pipelining, whereas, for software, the access to memory is a key to high performance: Designers try to minimize access to slow memory, and to stay as much on chip (registers and cache memory) as possible. However, parallelism becomes increasingly important for software as well: Recent evolutions in general-purpose processors are clearly oriented toward more inherent parallelism, both on the hardware level (multiple execution units) and the software level (SIMD instructions). Two basic multiple-issue architectures can be distinguished -- superscalar and very long instruction word (VLIW).

- A superscalar processor has dynamic issue capability: Varying numbers of instructions are issued every clock cycle. The hardware dynamically decides which instructions are simultaneously issued and to which execution units.
- A VLIW processor has fixed issue capability: Every clock cycle, a fixed number of instructions is issued, formatted as one long instruction (hence the name). The software (that is, the compiler) is completely responsible for creating a package of instructions that can be issued simultaneously.

It is a challenge for designers of new cryptographic primitives to exploit this parallelism in an optimal way, without compromising security.

An additional way to exploit parallelism inherent in many algorithms is single-instruction, multiple-data (SIMD) processing. A SIMD instruction performs the same operation in parallel on multiple data elements packed into a single processor word. Tuned to accelerate multimedia and communications software, these instructions can be found in an increasing number of general-purpose processor architectures. Examples include Intel's MMX, UltraSPARC's VIS, and the PA-RISC 2.0 architecture's MAX.

The problem of accessing slow memory becomes more and more important, as memory access times decrease more slowly than processor cycle times. This suggests that faster cryptographic primitives will make use of logic and arithmetic operations available on a standard processor, and of relatively small S-boxes (typically only a few KB, so that they fit in the primary on-chip cache). The recent trend to include ever larger secondary caches on a dedicated bus, keeping latency to a minimum, opens up new perspectives. The advantage of S-boxes is that they can yield a strong nonlinear relation between input and output. S-boxes with 8 input bits and 32 or 64 output bits seem particularly suited to the current 32-bit or 64-bit architectures.

Other, less important aspects that influence software performance are word size, byte ordering (endianness), and carries, which tend to be processor dependent. Running an algorithm on a processor that has a smaller word size than that of the algorithm will normally result in reduced performance, as it requires more instructions to do the same operations. On a processor having a larger word size than that of the algorithm, advantage might be taken of the new SIMD instructions, if available. Byte ordering influences performance if endian-sensitive operations are used (like add, multiply, and rotate over nonmultiples of eight) and the processor doesn't support the addressing mode the algorithm uses. All algorithms considered here either use endian-neutral operations or specify the endian convention to be employed (except for SEAL, for which either convention is allowed, but inclusion in the ciphertext of information about the endian convention used is suggested).
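The byte-ordering issue can be seen directly by packing the same 32-bit word both ways; nothing in this snippet is specific to any of the algorithms above:

```python
import struct

word = 0x0A0B0C0D

little = struct.pack("<I", word)   # least significant byte first
big = struct.pack(">I", word)      # most significant byte first

# The two layouts are byte-reversals of each other. An algorithm that
# fixes one convention forces a byte swap on processors whose native
# order is the other, before any word-wide add, multiply, or rotate.
```

Reading `big` back with the little-endian convention yields a different word entirely, which is why endian-sensitive operations only mix safely with one declared byte order.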

No such thing as the software performance of a given algorithm exists, even if figures are given for a specific processor. The key is again the use of memory -- very different results are obtained depending on whether the data resides in cache, in main memory, or on disk. On the other hand, one wants to measure the performance of the algorithm rather than of the computer. Other factors influencing the performance of an algorithm's implementation include equivalent representations using tables and the quality of the compiler.

You can ask whether or not you should try to optimize the design toward a single processor: Designing and reviewing a cryptographic algorithm will take several years, and by that time the processor will probably be outdated. But most processors are downward compatible, and if one tunes an algorithm to a recent processor without exploiting particular features (such as the availability of certain carries), it is likely to achieve a high speed on most other processors as well (SEAL is a good example). Finally, it should be noted that the evolution of processors on smart cards is significantly slower than that of general purpose processors. Therefore, there is still an interest in new algorithms for older processors.

Table 1 gives an overview of the speed of the most important algorithms. To measure as much as possible the performance of the algorithm (and not of the compiler or of the computer used), and to compare the algorithms on an equal basis, we decided to implement all the algorithms on a single, widely used processor -- a 90-MHz Pentium. The most important features of the Pentium are a complex instruction set, a 32-bit word size, a small register set of seven general-purpose registers, Little-endian addressing, a two-way superscalar architecture, and separate on-chip code and data caches of 8 KB each. All implementations have been written in assembly language, and all of them have been optimized to the same (high) level. A comparison of the CISC (Complex Instruction Set Computer) Intel architecture with the most popular RISC (Reduced Instruction Set Computer) architectures (MIPS IV, PA-RISC 2.0, PowerPC, SPARC V9, Alpha EV5) suggests that the Pentium can to a certain extent be considered as a lower bound. Current RISC features include 64-bit word size, at least 31 general-purpose registers, provisions for both Little- and Big-endian addressing, and up to 32 KB of primary (data) cache. Algorithms doing well on the Intel architecture are expected to do well on any modern 32-bit processor. Important exceptions are algorithms using a rotate instruction, which is not available on all RISC architectures.

Most algorithms, despite the fact that their operations are basically serial in nature, contain enough instruction-level parallelism to keep both execution units of the Pentium busy for most of the time. Some algorithms (SHA-1 and RIPEMD-160, for instance) contain even more parallelism than most current general-purpose processors are able to provide. Only Khufu contains hardly any instruction-level parallelism, so that its performance does not benefit much from having parallel execution units. Also, the inherent parallelism of SEAL is limited, but it is nevertheless remarkably fast on the Pentium, which is explained by the fact that it was tuned by its designers to be fast on 80x86 processors (use of registers, choice and scheduling of operations). Algorithms with word sizes larger than 32 bits (SHARK, RC5-64/24, and Tiger, for example) will perform relatively better on 64-bit architectures. The SHARK implementation has the additional disadvantage on a Pentium of having tables that are too large to fit into the on-chip data cache, a property it shares with Snefru. Tiger has tables as large as the on-chip cache, which is still a problem, as additional memory is required for the state, a copy of the input data, and the data buffer itself. IDEA suffers from the Pentium's slow integer multiplication (nine cycles), but a one-cycle multiply would still only result in a doubling of the speed, highlighting the fact that IDEA is more than just integer multiplications. An interesting alternative is an IDEA implementation using the SIMD MMX instructions, which provide a fast and parallel (but unfortunately only signed) multiplication. Such an implementation requires on the order of 400 cycles per 64-bit block, making it about 1.5 times faster than a Pentium-only implementation (see "IDEA: A Cipher for Multimedia Architectures?" by H. Lipmaa, Springer-Verlag, 1999, and http://home.cyber.ee/helger/fastidea/).
Some algorithms (Snefru and SHA-1) will perform relatively better on processors having a large register set, such as most RISC processors. This enables the complete internal state to be kept in registers.

Running these implementations on a Pentium Pro or a Pentium II will not necessarily result in the same relative speeds: Some implementations heavily rely on reading a 32-bit register shortly after having written to an 8-bit subset of the same register. On the successors of the Pentium this results in so-called partial register stalls, each of which accounts for at least seven cycles, hence degrading performance considerably.

**DDJ**

*Copyright © 1998, Dr. Dobb's Journal*