### High Performance Implementation of AES in Counter Mode

Significant performance optimization for encrypting (and decrypting) can be achieved if software using the AES instructions is designed to process multiple data blocks in parallel. This "software pipelining" technique is applicable for parallelizable modes of operation such as Electronic Code Book (ECB), CTR, and decryption with the Cipher Block Chaining (CBC-Decryption) mode.

In such modes, different data blocks can be encrypted (or decrypted) independently of each other, and the hardware that supports the AES round instructions is pipelined. This allows independent AES instructions to be dispatched, theoretically every one to two CPU clock cycles, if data can be provided sufficiently fast. As a result, the AES throughput can be significantly enhanced for parallel modes of operation, if the software implementation itself is pipelined. Instead of completing the encryption of one data block and then continuing to the subsequent block, it is preferable to write software sequences that compute one AES round on multiple blocks, using one round key, and only then continue to compute the subsequent round for multiple blocks. This technique speeds up any parallelizable mode of operation, in particular the CTR mode. Listing 3 shows a code snippet encrypting eight blocks in parallel as part of the CTR mode (where the counters are encrypted).

mov rdx, OFFSET keyex_addr ; load Round key movdqu xmm0, XMMWORD PTR [rdx] pxor xmm1, xmm0 pxor xmm2, xmm0 pxor xmm3, xmm0 pxor xmm4, xmm0 pxor xmm5, xmm0 pxor xmm6, xmm0 pxor xmm7, xmm0 pxor xmm8, xmm0 mov ecx, 9 main_loop: ; load Round key add rdx, 0x10 movdqu xmm0, XMMWORD PTR [rdx] aesenc xmm1, xmm0 aesenc xmm2, xmm0 aesenc xmm3, xmm0 aesenc xmm4, xmm0 aesenc xmm5, xmm0 aesenc xmm6, xmm0 aesenc xmm7, xmm0 aesenc xmm8, xmm0 loop main_loop add rdx, 0x10 movdqu xmm0, XMMWORD PTR [rdx] aesenclast xmm1, xmm0 aesenclast xmm2, xmm0 aesenclast xmm3, xmm0 aesenclast xmm4, xmm0 aesenclast xmm5, xmm0 aesenclast xmm6, xmm0 aesenclast xmm7, xmm0 aesenclast xmm8, xmm0 ; storing the encrypted blocks movdqu XMMWORD PTR [dest], xmm1 movdqu XMMWORD PTR [dest+0x10], xmm2 movdqu XMMWORD PTR [dest+0x20], xmm3 movdqu XMMWORD PTR [dest+0x30], xmm4 movdqu XMMWORD PTR [dest+0x40], xmm5 movdqu XMMWORD PTR [dest+0x50], xmm6 movdqu XMMWORD PTR [dest+0x60], xmm7 movdqu XMMWORD PTR [dest+0x70], xmm8

### High Performance Implementation of Galois Counter Mode

We now examine how GCM can be efficiently computed by using the PCLMULQDQ instruction, in combination with some improved algorithms.

The most compute-intensive part of GCM is the computation of the Galois hash, which is multiplication in the finite field GF(2^{128}), defined by the reduction modulo g = x^{128} + x^{7} + x^{2} + x + 1. The multiplication in this field is carried out in two steps: the first step is the carry-less multiplication of two 128-bit elements, and the second step is the reduction of the 256-bit carry-less product modulo g = x^{128} + x^{7} + x^{2} + x + 1. We explain these steps in the rest of this section.

### Computing a 256-bit Carry-less Product with the PCLMULQDQ Instruction

The following algorithm steps can be viewed as one iteration of a carry-less schoolbook multiplication:

- Multiply carry-less by the following operands: A
_{0}with B_{0}, A_{0 1}with B_{1}, A_{0}with B_{1}, and A_{1}with B_{0}. Let the results of the above four multiplications be A_{0}* B_{0}= [C_{1}: C_{0}], A_{1}* B_{1}= [D_{1}: D_{0}], A_{0}, * B_{1}= [E_{1}: E_{0}], A_{1}* B_{0}= [F_{1}: F_{0}] - Construct the 256-bit output of the multiplication [A
_{1}: A_{0}] * [B_{1}: B_{0}] as follows in Equation 5:

One can also trade off one multiplication for additional XOR operations. This 2-step alternative approach can be viewed as a "carry-less Karatsuba" multiplication [9]:

- Multiply carry-less by the following operands: A
_{1}with B_{1}, A_{0}with B_{0}, and A_{0}⊕ A_{1}with B0 ͵B; A_{1}. Let the results of the above three multiplications be [C_{1}: C_{0}], [D_{1}: D_{0}], and [E_{1}: E_{0}], respectively. - Construct the 256-bit output of the multiplication [A1: A0] * [B1 : B0] as follows in Equation 6:

Both methods can be used for the first step of the computation of the Galois hash.