### Acceleration Technologies

We are currently researching solutions to realize the vision of encrypting the Internet so that HTTPS sessions are accelerated by factors. The next micro-architecture generation adds new instructions for potentially speeding up symmetric encryption by 3-10 times. These instructions not only provide better performance but also protect applications against an important type of threat known as side-channel attacks. Second, we have developed improved integer arithmetic software that can speed up key exchange and establishment procedures by a factor of 40 to 100 percent.

Third, the Intel Core i7 micro-architecture re-introduces the SMT feature into the CPU. SMT is ideal for hiding the cycles of compute-intensive public key encryption software under the stall times of network application memory lookups.

In the next generation of Intel processors, a new set of instructions will be introduced that enable high performance and secure round encryption and decryption. These instructions are AESENC (AES round encryption), AESENCLAST (AES last round encryption), AESDEC (AES round decryption), and AESDECLAST (AES last round decryption). Two additional instructions are also introduced for implementing the key schedule transformation, AESIMC and AESKEYGENASSIST.

The design of these new processor instructions is based on the structure of AES. Systems such as AES involve complex mathematical operations such as finite field multiplications and inversions [6], as discussed earlier. These operations are time or memory consuming when implemented in software, but they are much faster and more power efficient when implemented by using combinatorial logic. Moreover, the operands involved in finite field operations can fit into the SIMD registers of the IA architecture. In this article, we discuss the concept of implementing an entire AES round as a single IA processor instruction by using combinatorial logic. An AES round instruction is much faster than its equivalent table-lookup-based software routine and can also be pipelined, thereby allowing the computation of an independent AES round result potentially every clock cycle.

The AESENC instruction implements these transformations of the AES specification in the order presented: ShiftRows, S-box, MixColumns, and AddRoundKey. The AESENCLAST implements ShiftRows, S-box, and AddRoundKey but not MixColumns, since the last round omits this transformation. The AESDEC instruction implements inverse ShiftRows, inverse S-box, inverse MixColumns, and AddRoundKey. Finally, the AESDECLAST instruction implements inverse ShiftRows, inverse S-box, and AddRoundKey, omitting the inverse MixColumns transformation. More details about these AES instructions can be found in [7].

Our AES instructions can be seen as cryptographic primitives for implementing not only AES but a wide range of cryptographic algorithms. For example, several submissions to NIST's recent SHA-3 hash function competition use the AES round or its primitives as building blocks for computing cryptographic hashes. Moreover, combinations of instruction invocations can be used for creating more generic mathematical primitives for finite field computations. Our new instructions outperform by approximately 3-10 times the best software techniques doing equivalent mathematical operations on the same platform.

Together with the AES instructions, Intel will offer one new instruction supporting carry-less multiplication, named PCLMULQDQ. This instruction performs carryless multiplication of two 64-bit quadwords that are selected from the first and second operands, according to the immediate byte value.

Carry-less multiplication, also known as Galois Field (GF) multiplication, is the operation of multiplying two numbers without generating or propagating carries. In the standard integer multiplication, the first operand is shifted as many times as the positions of bits equal to "1" in the second operand. The product of the two operands is derived by adding the shifted versions of the first operand to each other. In carry-less multiplication, the same procedure is followed, except that additions do not generate or propagate carry. In this way, bit additions are equivalent to the exclusive OR (XOR) logical operation.

Carry-less multiplication is an essential component of the computations done as part of many systems and standards, including cyclic redundancy check (CRC), Galois/counter mode (GCM), and binary elliptic curves, and it is very inefficient when implemented in software in today's processors. Thus, an instruction that accelerates carry-less multiplication is important for accelerating GCM and all communication protocols that depend on it [8].

We have also developed integer arithmetic software that can accelerate big number multiplication and modular reduction by at least 2X. Such routines are used not only in RSA public key encryption but also in Diffie Hellman key exchange and elliptic curve cryptography (ECC). Using our software, we are able to accelerate RSA 1024 from a performance of approximately 1500 signatures per second (OpenSSL v.0.9.8g) or 2000 signatures per second (OpenSSL v.0.9.8.h), to potentially 2900 signatures per second on a single Intel Core i7 processor. Similarly, we are able to accelerate other popular cryptographic schemes such as RSA 2048 and Elliptic Curve Diffie-Hellman, based on the NIST B-233 curve.

The performance of RSA can be improved by accelerating the big number multiplication that is an essential and compute-intensive part of the algorithm. Our implementation uses an optimized schoolbook big number multiplication algorithm. RSA is a compute-intensive operation consuming millions of clocks on multiplying, adding, and subtracting 64-bit quantities. However, the state which RSA accesses is small, typically consisting of key information as well as 16-32 multipliers that fit into the L1 cache of Intel CPUs. With our software, an RSA 1024 decrypt operation consumes about 0.99 million clocks, whereas the corresponding RSA 2048 decrypt operation consumes about 6.73 million clocks on an Intel Core i7 processor. This is about 40 percent faster than corresponding operations that use OpenSSL (v. 0.9.8h).

The code listed in Code 1 illustrates the main idea, which is to combine multiply and add operations with a register recycling technique for intermediate values. In Code 1, 'a' and 'b' hold the two large numbers to be multiplied, and the results are stored in 'r'. These operations are repeated over the entire inputs to generate intermediate values that are then combined with addition to produce the large number multiplication result.

asm("mulq %3;\n" :"=a"(t0), "=d"(t1) :"a"(a[0]), "g"(b[0]) :"cc"); t2 = t0; t3 = t1; r[0] = t2; t2 = t3; t3 = t4; t4 = 0; asm("movq (%5), %%rax;\n\t" "mulq 8(%6);\n\t" "addq %3, %0;\n\t" "adcq %4, %1;\n\t" "adcq $0, %2;\n\t" "movq 8(%5), %%rax;\n\t" "mulq (%6);\n\t" "addq %3, %0;\n\t" "adcq %4, %1;\n\t" "adcq $0, %2;\n" :"+r"(t2), "+r"(t3), "+r"(t4), "=a"(t0), "=d"(t1) :"r"(a), "g"(b) :"cc"); r[1] = t2; asm("mulq %3;\n" :"=a"(t0), "=d"(t1) :"a"(a[1]), "g"(b[1]) :"cc"); asm("addq %2, %0;\n\t" "adcq %3, %1;\n" :"+r"(t0), "+r"(t1) :"r"(t3), "r"(t4) :"cc"); r[2] = t0; r[3] = t1;

We also investigated other techniques for big number multiplication, including Karatsuba-like constructions, but we found this schoolbook algorithm implementation to be the fastest [9, 10].

**Simultaneous Multithreading**. The most recent Intel i7 core micro-architecture re-introduces the feature of hyperthreading (now referred to as simultaneous multi-threading or SMT) into the CPU. SMT represents a major departure from the earlier core micro-architecture, where each core was single threaded. As part of our research, we have demonstrated that SMT can result in substantial performance improvements for a certain class of workloads. Such workloads are associated with secure web transactions. We propose a new programming model where one compute-intensive thread performs only RSA public key encryption operations, and another thread performs memory access-intensive tasks. We show that RSA is an ideal companion thread for four representative memory access-intensive workloads when SMT is used, resulting in a 10–100 percent potential efficiency increase.

The system benefits most when a thread performing dependent-memory lookups is paired with an RSA thread. The throughput of the memory thread almost doubles, reaching the value it would have had if it hadn't been paired with RSA. Another way to interpret the same result is that the RSA computation comes for free, because of SMT. In reality the RSA computation is hidden under the very long stall times of the memory thread. We also observe that the throughput of a single memory thread is increased by approximately 30 percent when SMT is switched on, and the memory thread is multiplexed with another memory thread. The same throughput is almost doubled when the memory thread is paired with an RSA thread. These results indicate that RSA is a much better companion thread than a second memory thread, due to the fact that one workload is memory access-intensive, and the other workload is compute-intensive. If an RSA thread is paired with a memory thread, then RSA performance also increases by 21 percent when SMT is switched ON as compared to OFF [11].

To further validate our position that SMT is beneficial especially to crypto workloads, we built a test bed running SpecWeb 2005. The test bed consisted of a server machine using an Intel Core i7 processor connected to two client machines running a total of four client engines. We measured the server's capacity with SMT turned on and off for the banking and support (regular HTTP) workloads. Our experiments indicate that SMT improves the overall system performance by at least 10 percent -- more for the banking workload than the support workload. This result is in accordance with our earlier experiments, and it indicates that crypto workloads can take advantage of SMT.

The overall impact of our cryptographic algorithm acceleration technologies is in Figure 3. The first bar represents the crypto overhead of a 230 Kbyte SSL transaction as it runs on an Intel Core i7 processor today. The encryption scheme used is AES-256 in the counter mode. The next bar shows the acceleration gain if AES is implemented with the new instructions. The third bar shows the incremental gain by using our RSA software and SMT. Finally, the last bar shows the gain associated with replacing SHA1 with GCM. GCM is a message authentication scheme offering the same functionality as HMAC-SHA1. As is evident from the figure, our acceleration technologies substantially reduce the crypto overheads resulting in significant performance and efficiency improvement.