Joan is an employee of PWI, Brussels, Belgium. He can be contacted at firstname.lastname@example.org. Craig works at PictureTel Corp. He can be contacted at craigc@ pictel.com.
Panama is a cryptographic module that can be used both as a cryptographic hash function and stream cipher. It is designed to be very efficient in software implementations on 32-bit architectures. Its basic operations are on 32-bit words.
Panama has a low per-byte workfactor while still claiming high security. The price paid for this is a relatively high fixed computational overhead for every execution of the hash function. This makes the Panama hash function less suited for the hashing of messages shorter than the equivalent of a typewritten page. For the stream cipher, it results in a relatively long initialization procedure. Hence, in applications where speed is critical, too frequent resynchronization should be avoided.
A typical application for Panama might be the encryption or decryption of video-rate data in conditional access applications (pay TV, for instance). Set-top boxes and digital TVs increasingly include media processors for decoding compressed video and for performing other computationally intensive image-processing tasks. This is an application space where data rates are high, high-performance processors are increasingly likely to be present, and decryption must be done (without unduly burdening an already heavily loaded processor).
In this article, we'll examine Panama's basic design principles and implementation. For more in-depth information, see our paper "Fast Hashing and Stream Encryption with Panama" in Fast Software Encryption (edited by S. Vaudenay, Springer-Verlag, 1998, http://standard.pictel.com/ftp/research/security/).
Basic Design Principles
Panama is based on a finite state machine with a 544-bit state and 8192-bit buffer. The state and buffer can be updated by performing an iteration. There are two modes for the iteration function:
- A Push iteration injects an input and generates no output.
- A Pull iteration takes no input and generates an output. (A blank Pull iteration is a Pull iteration in which the output is discarded.)
The hashing state is updated by a parallel nonlinear transformation, the buffer operates as a linear feedback shift register, similar to that applied in the compression function of The Secure Hash Algorithm (SHA) (see "SHA: The Secure Hash Algorithm," by William Stallings, DDJ, April 1994; and Secure Hash Standard, Federal Information Processing Standard, Publication 180, NIST, U.S. Department of Commerce, Washington, D.C., May 1993). The updating transformation of the state has high diffusion and distributed nonlinearly. Its design is aimed at providing high nonlinearity and fast diffusion for multiple iterations. This is realized by the combination of four distinct transformations, each with its specific contribution -- one for nonlinearity, one for bit dispersion, one for diffusion, and one for injection of buffer and input bits.
The buffer behaves as a linear feedback shift register that ensures that input bits are injected into the state over a wide interval of iterations. In Push mode, the input to the shift register is formed by the external input; in Pull mode, by part of the state. The Panama hash function is defined as performing Push iterations with message blocks as input. If all message blocks have been injected, a number of blank Pull iterations allow the last message blocks to be diffused into the buffer and state. This is followed by a final Pull iteration to retrieve the hash result.
Panama's stream encryption scheme is initialized by doing two Push iterations to inject the key and diversification parameter, followed by a number of blank Pull iterations to let the key and parameter be diffused into the buffer and state. After this initialization, the scheme is ready to generate keystream bits at leisure by performing Pull iterations.
The state is denoted by a and consists of 17 (32-bit) words a to a. The buffer b is a linear feedback shift register with 32 stages, each consisting of eight 32-bit words. The three possible modes for the Panama module are:
- Reset mode, in which the state and buffer are set to 0.
- Push mode, in which an eight-word input input is applied and there is no output.
- Pull mode, in which there is no input and an eight-word output z is delivered.
The buffer update operation is defined as follows: Let b be the contents of the buffer before the update operation and B after it. In Push mode in Example 1(a), q is the input block inp; in Pull mode, it is part of the state a, with its eight component words given by Example 1(b).
The state updating transformation is defined as follows: Let a be the state before applying the transformation and A after it. In Example 2(a), <<< denotes cyclic shift to the left. In Push mode, p corresponds to the input inp, while in Pull mode it is part of the buffer b with its eight component words given by Example 2(b). In Pull mode, the output z consists of eight words of the state, prior to the application of the update operation; see Example 2(c).
The Panama Hash Function
The Panama hash function maps a message of arbitrary length M to a hash result of 256 bits. The Panama hash function is executed in two phases:
- Padding. M is converted into a string M' with a length that is a multiple of 256 by appending a single 1, followed by a number d of 0-bits with 0<=d<256.
- Iteration. The input sequence M'=m1 m2...mV is loaded into the Panama module as per Table 1.
After all input blocks have been loaded, an additional 32 blank Pull iterations are performed. Then the Hash result is returned. The number of Push and Pull iterations to hash a V-block input sequence is V+33.
Panama can be turned into a secure MAC by simply including a secret key in the message input.
The Panama Stream Encryption Scheme
The stream cipher is initialized by first loading the 256-bit key K, the 256-bit diversification parameter Q, and performing 32 additional blank Pull iterations. During key-stream generation, an eight-word block z is delivered at the output for every iteration. In practice, the diversification parameter allows for frequent resynchronization without the need to change the key. Table 2 presents the sequence diagram of the Panama stream encryption scheme.
Panama's heavy reliance on bit-wise logical operations make it well-suited to implementation on 8-, 16-, 32-, or 64-bit processors, except that its use of 32-bit rotations does somewhat favor 32-bit architectures.
To determine the maximum theoretical performance of Panama in its hash and stream cipher modes on a suitably parallel processor architecture, we will identify the critical path through the algorithm. This is the closed path through the algorithm's computational flowgraph that has the largest weighted instruction count, the weighting being the number of cycles of latency associated with each type of instruction.
For instance, on most processors the result of a simple operation, such as an addition or XOR, can be used in the subsequent cycle; these instructions are said to have a one-cycle latency. On modern high-performance processors, it is also common for shifts and rotates to be single-cycle instructions. However, reading from memory takes several cycles. Even when the data is in the CPU's local cache it commonly suffers a two- or three-cycle latency on modern, deeply pipelined processors.
The software critical path of Panama is through the state updating transformation, whose input is the output of the previous iteration. Each word of the state incurs seven single-cycle instructions: four XORs, one OR, one NOT, and one cyclic shift (except for the unrotated word). By merging the XOR operations of the diffusion and buffer injection layers, they can be implemented with a logic depth of two (rather than three). Hence, the critical path is just six cycles.
In addition to these 17×7-1 logical operations, the state updating transformation entails a total of 16 reads (from buffer stages 4 and 16 for Pull, or input p and buffer stage 16 for Push). Updating the buffer is not on the software critical path and is most efficiently implemented as a circular buffer in memory with moving pointers used to create the appearance of a shift register. Its execution involves 16 reads, 16 XOR operations, and 16 writes (buffer stages 0 and 25), plus three or four instructions to update each of the pointers to the accessed stages for simulating the shift register, three pointers being needed for Push and four for Pull.
In Pull mode, an additional eight reads, eight XORs, and eight writes are necessary for encrypting the data buffer. Thus, ignoring for the moment the few extra instructions necessary for implementing the loop and maintaining pointers into the data buffers, we have a workload of 191 instructions for each iteration of Push and 218 for each Pull. This is equivalent to about six instructions per byte hashed, or 6.8 instructions per byte enciphered.
An estimate of how many fully pipelined execution units the algorithm might usefully exploit can be obtained by dividing the total number of operations per iteration by the number of cycles in the critical path. For Panama this works out to 191 or 218 instructions per iteration divided by six cycles per iteration, from which we estimate that the hashing and encryption modes might reasonably exploit processors with 30 or more 32-bit datapaths operating in parallel so long as enough of them can access memory concurrently. However, on practical processors where (cache) memory accesses tend to be limited to two per cycle, Panama will take at least 24 cycles per iteration for hashing, and 32 cycles for ciphering due to the memory- access bottleneck, so that performance might reach the point of memory saturation with as few as seven or eight concurrent execution units.
Panama excels on processors having substantial instruction-level parallelism. With Intel's next-generation Merced processor expected to have much greater instruction-level parallelism than its predecessors, efficiency on such processors will soon become a mainstream concern. In the meantime, some of the best examples of highly parallel CPUs are in media processors and some advanced DSPs.
To demonstrate the throughputs achievable by Panama, we wrote an optimized C implementation (available electronically; see "Resource Center," page 3) for the Philips TriMedia TM-1000 processor -- a Very Long Instruction Word (VLIW) CPU containing five 32-bit pipelined execution units sharing a common set of 128 32-bit registers. All five execution units can perform arithmetic and logical operations, but loads, stores, and shifts are each supported by only two of them. The two execution units that support shifts are distinct from the two that support loads and stores (see http://www.trimedia.philips.com/). Given an appropriate instruction mix, the processor can issue up to five instructions per clock cycle.
For the benchmarked implementation, the Push and Pull routines were each expressed as fully unrolled inline C code, leaving the TriMedia C compiler to recognize and exploit the allowable overlap between the different steps, which make up the state updating transformation.
In C, the use of pointers to access memory can often lead to situations in which the compiler cannot be sure that memory accessed through one pointer is not that which is also accessed through another pointer. Such aliasing can severely restrict the amount of parallelism available for the compiler to exploit and is often unintentional on the part of the programmer because standard C is limited in its ability to express the true aliasing constraints. The TriMedia compiler addresses this problem by adopting the restricted pointer syntax proposed by the Numerical C Extensions Group of ANSI X3J11 (see http://www.lysator.liu.se/c/restrict.html). Where appropriate, the Panama benchmark code used restricted pointers.
The loop for iterating Pull mode compiled into 234 TriMedia assembly instructions. The scheduled code was tightly packed by the compiler into less than 50 instruction-issue cycles; that is, a sustained instruction-issue rate very close to the theoretical maximum of five instructions per cycle. Compiled code for Push showed comparable efficiency. The optimized C code was benchmarked on a 100-MHz TriMedia processor by encrypting or hashing a 128-KB data buffer. This buffer size is chosen as several times larger than the on-chip data cache so as to make the benchmark be representative of the sustainable encryption or hashing performance to external memory, in this case comprised of synchronous DRAM. At the level of performance achieved by Panama, external memory bandwidth can become a significant factor in the overall performance. For the encryption benchmark the data buffer was encrypted in-place so as to minimize the performance loss arising from memory accesses. No off-chip cache was present (the TriMedia chip does not actually support off-chip cache).
An encryption throughput of 4.5 bits per cycle was measured, equivalent to 1.8 cycles per byte, or 450 Mbits/sec on a 100-MHz processor. This includes all loop overhead and cycles lost to cache misses, memory accesses, and so on. This is uncommonly fast among stream ciphers. For comparison, two other acknowledged fast software ciphers -- RC4 (see Applied Cryptography, Second Edition, by Bruce Schneier, John Wiley & Sons, 1996) and SEAL (see "A Software-Optimized Encryption Algorithm," by P. Rogaway and D. Coppersmith in Fast Software Encryption, LNCS 809, edited by R. Anderson, Springer-Verlag, 1994) are reported as capable of 10.6 cycles per byte and 3.5 cycles per byte, or 75 Mbits/sec and 230 Mbits/sec, respectively, when benchmarked under these same conditions (see C.S.K. Clapp's "Optimizing a Fast Stream Cipher for VLIW, SIMD, and Superscalar Processors," in Fast Software Encryption, LNCS 1267, edited by E. Biham, Springer-Verlag, 1997). Panama is also slightly faster on this processor than the variants of WAKE (also described in Clapp's paper).
Notably, Panama achieves its speed advantage not by having the lowest work factor among these ciphers. Rather, Panama's speed advantage comes from the substantial degree of parallelism present in the algorithm, an attribute that can be well exploited by a VLIW processor such as the TriMedia, and should make it shine on Intel's Merced.
On the TriMedia processor, Panama achieves a hashing throughput of 5.1 bits per cycle, equivalent to 1.6 cycles per byte, or 510 Mbits/sec on a 100-MHz device. A simple comparison shows that the per-byte workload of Panama is similar to that of MD4 (see "The MD4 Message-Digest Algorithm, by R.L. Rivest, RFC 1320, Internet Activities Board, Internet Privacy Task Force, April 1992), the fastest member of the family of hash functions to which MD5, SHA-1, and RIPEMD-160 all belong. Benchmarks for these popular hashes have been published for the Intel Pentium processor in "Hashing on the Pentium," by A. Bosselaers, R. Govaerts, and J. Vandewalle in Advances in Cryptology: Proceedings Crypto'96, LNCS 1109, edited by N. Koblitz, (Springer-Verlag, 1996) from which we can make comparisons to Panama.
Performance of an optimized C-code implementation of Panama on a 200-MHz Pentium Pro (using a library function for rotate) was measured at 244 Mbits/sec for ciphering and 267 Mbits/sec for hashing; that is, a throughput of 1.2 bits per cycle for ciphering and 1.3 bits per cycle for hashing. This compares to hashing speeds reported by Bosselaers et al. for SHA-1 and RIPEMD-160 of 0.24 bits per cycle and 0.21 bits per cycle, respectively, for optimized C-code. (Bosselaers et al. report performance on a 90-MHz Pentium, while here we report performance on a 200-MHz Pentium Pro. By converting all results into the normalized measure of bits per cycle we attempt to provide a uniform basis for comparison; however, be aware that no allowance has been made for the architectural differences between the Pentium and Pentium Pro.) Bosselaers et al. also report optimized assembly-code versions of SHA-1 and RIPEMD-160 as achieving 0.54 bits per cycle and 0.44 bits per cycle, respectively. It is currently unknown what further speed improvement could be achieved for Panama by assembly coding it, but even without such improvement it shows about a 2× speed advantage over assembly-coded versions of these other hashes.
Copyright © 1998, Dr. Dobb's Journal