### Efficient Reduction

To reduce a 256-bit carry-less product modulo g, we first split it into two 128-bit halves. The least-significant half is simply XOR'd with the final remainder (since the degree of g is 128). For the most-significant part, we develop an algorithm that realizes division via two multiplications. This algorithm can be seen as an extension of the Barrett reduction algorithm [10] to modulo-2 arithmetic, or as an extension of the Feldmeier CRC generation algorithm [11] to dividends and divisors of arbitrary size.

Since we do not need to take into account the least-significant half of the input (see above), we investigate the efficient generation of a remainder p(x) defined as follows in Equation 7:

Where c(x) is a polynomial of degree s-1, with coefficients in GF(2), representing the most-significant bits of the carry-less product (for GCM, s =128).

t is the degree of the polynomial g. (for GCM, t = 128).

g(x) is the irreducible polynomial defining the final field (for GCM, g = g(x) = x^{128} + x^{7} + x^{2} + x + 1).

For the polynomials p(x), c(x), and g(x) we write as in Equation 8:

Hereafter, we use the notation L^{u}(v) to denote the coefficients of the u least-significant terms of the polynomial v and M^{u}(v) to denote the coefficients of its u most-significant terms. The polynomial p(x) can be expressed in Equation 9 as:

where q(x) is a polynomial of degree s -1 equal to the quotient from the division of c (x) * x^{t} with g. The intuition behind the equation above is that the t least-significant terms of the dividend c (x) * x^{t} equal zero. Further, the dividend c (x) * x^{t} can be expressed as the sum of the polynomials g * q and p as in Equation 10 below:

where operator '+' means XOR ('⊕'). From the equation above one can expect that the t least-significant terms of the polynomial g * q are equal to the terms of the polynomial p. Only if these terms are equal to each other, the result of the XOR operation g * q ⊕ p is zero for its t least-significant terms. Hence in Equation 11:

Now we define in Equation 12:

The polynomial g* represents the t least-significant terms of the polynomial g. Obviously in Equation 13:

However, the t least-significant terms of the polynomial q * g_{t} * x^{t} are zero. Therefore in Equation 14:

From the equation above it follows that in order to compute the remainder p we need to know the value of the quotient q. The quotient can be calculated in a similar manner as that of the Barrett reduction algorithm in Equation 15:

In Equation 16 let:

where q^{+} is an s-degree polynomial equal to the quotient from the division of x^{t+s} with g, and p^{+} is the remainder from this division. The degree of the polynomial p^{+} is t -1. From the two previous equations we get in Equation 17:

and in Equation 18:

One can see that the polynomials c * g * q^{+} and g * q * x^{s} are of degree t + 2 * s - 1. The polynomial c * p^{+} is of degree t + s - 2, and the polynomial p * xs is of degree t + s - 1. As a result, the s most-significant terms of the polynomials in the left- and right-hand side of the previous equation are not affected by the polynomials c * p^{+} and p * x^{s}. Hence in Equation 19:

Next, we observe that the s most-significant terms of the polynomial c * g * q^{+} are equal to the s most-significant terms of the polynomial g * M^{s} (c * q^{+}) * x^{s}. The polynomial M^{s}(c*q) * x^{s} results from c * q^{+} by replacing the s least-significant terms of this polynomial with zeros. The intuition behind this observation is that the s most-significant terms of the polynomial c * g * q^{+} are calculated by adding together the s most-significant terms of the polynomial c * q^{+} in as many offset positions as defined by the terms of the polynomial g. Thus, the s most-significant terms of c * g * q^{+} do not depend on the s least-significant terms of c * q^{+}, and consequently, this results in Equation 20:

Equation 20 above is satisfied for q given by Equation 21:

Since there is a unique quotient q satisfying Equation 10 one can show that there is a unique quotient q satisfying equation (20). As a result this quotient q must be equal to M^{s} (c (x) * q^{+} (x)).

It follows that the polynomial p is found in Equation 22 by

The equation above can be translated to the following algorithm for computing the polynomial p.

Preprocessing: Compute the polynomials g* and q^{+} for the given irreducible polynomial g. The polynomial g* is of degree t - 1, consisting of the t least-significant terms of g, and the polynomial q^{+} is of degree s, and is equal to the quotient of the division of x^{t+s} with the polynomial g.

- Multiply the input c with q
^{+}. The result is a polynomial of degree 2 s – 1. - Multiply the s most-significant terms of the polynomial resulting from Step 1 with g*. The result is a polynomial of degree t+s – 2.
- Return the t least-significant terms of the polynomial resulting from Step 2. This is the desired remainder.

One can see that the quotient from the division of x256 with g is g itself. The polynomial g = g (x) = x^{128} + x^{7} + x

Special attention should be paid when implementing the GCM mode, because the standard specifies that the bits inside their 128-bit double quadwords are reflected. That is, the bit corresponding to the least-significant coefficient of the polynomial representation of the entities that are multiplied is bit number 127, rather than bit number 0. This also implies that the order of bits in the reduction polynomial is [11100001:<120 zeros>:1] as opposed to [1:<120 zeros>:10000111]. Note that this property is not merely the difference between Little Endian and Big Endian notations.

To handle this peculiarity, we point out the following fundamental property of carry-less multiplication in Equation 23, namely:

Using the identity above, and shifting-by-one of two registers containing the carry-less product of two inputs, the Galois Hash can be computed using the PCLMULQDQ instruction, regardless of the bit-order representation of the input and the output operands (see [2] for details and code samples).

### Estimated Performance Benefits

Encryption in CTR mode can be accelerated by roughly an order of magnitude, compared with some current and frequently used AES look-up tables that are based on implementations of AES (for example, OpenSSL implementation). The 64-bit carry-less multiplication helps speed up the computation of the GCM, and avoids the potential security problems that are associated with the current implementation that is based on look-up tables.

### Conclusion

In this article, we describe Intel's new instructions for high-performance cryptographic processing, which also eliminate all currently known software side-channel threats. Our main focus was on the use of these instructions for obtaining a high-performing and secure implementation of AES-GCM authenticated encryption.

Significant acceleration can be achieved when the new instructions are used efficiently, by taking advantage of the parallelism in the CTR mode, and by using the new techniques for carry-less multiplication and reduction modulo sparse polynomials.

### References

[1] S. Gueron. "Advanced Encryption Standard (AES) Instructions Set." At http://softwarecommunity.intel.com.

[2] S. Gueron and M. Kounavis. "Carry-Less Multiplication and its Usage for Computing The GCM Mode." http://softwarecommunity.intel.com

[3] M. Dworkin. "Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) for Confidentiality and Authentication." Federal Information Processing Standard Publication FIPS 800-38D, April 20, 2006. http://csrc.nist.gov

[4] "IEEE 802.1AE - Media Access Control (MAC) Security." IEEE 802.1 MAC Security Task Group Document. At http://www.ieee802.org

[5] J. Viega and D. McGrew. "The Use of Galois/Counter Mode (GCM)." In IPsec Encapsulating Security Payload (ESP), IETF RFC 4106. At http://www.rfc-archive.org

[6] "IEEE Project 1619.1 Home." At https://siswg.net

[7] "The Fibre Channel Security Protocols Project." ISO-T11 Committee Archive. At http://www.t11.org

[8] J. Salowey, A. Choudhury and D. McGrew. "RSA-based AES-GCM Cipher Suites for TLS." At http://www1.ietf.org

[9] A. Karatsuba and Y. Ofman. "Multiplication of Multidigit Numbers on Automata." Soviet Physics. Doklady, Volume 7, pages 595-596, 1963.

[10] P. Barrett. "Implementing the Rivest, Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor." Master's Thesis, University of Oxford, UK, 1986.

[11] D. Feldmeier. "Fast Software Implementation of Error Correcting Codes." IEEE Transactions on Networking, December, 1995.

### Acknowledgments

We are indebted to all of our many colleagues who contributed to the design, specification, and implementation of the new AES and PCLMULQDQ instructions.

*This article and more on similar subjects may be found in the Intel Technology Journal, June 2009 Edition, "Advances in Internet Security". More information can be found at http://intel.com/technology/itj
*

*
.
*

*
*

*
*