### Employing SIMD to Vectorize SHA-1 Computation

Intel SSE2 allows parallel operations on four 32-bit integers, stored in 128-bit XMM registers. The known SSE2 vectorization attempts of SHA-1 are rightfully focused on the message scheduling part for rounds 17-80, as a relatively isolated compute chain of W values which can be allocated in SIMD registers.

In the following statements:

W[i] = (W[i-3] ^ W[i-8] ^ W[i-14] ^ W[i-16]) rol 1; (**) ... W[i] + K(i);

It is obvious, that the dependency **W[i] -> W[i-3]** prevents straightforward vectorization of this part for the architectures with vector lengths more than 3. There are two approaches to have it vectorized on the machine with registers of four elements:

- The vectorization of only part of the expression:
**(W[i-8] ^ W[i-14] ^ W[i-16])**W'[i ] = W[i-8] ^ W[i-14] ^ W[i-16] W'[i+1] = W[i-7] ^ W[i-13] ^ W[i-15] W'[i+2] = W[i-6] ^ W[i-12] ^ W[i-14] W'[i+3] = W[i-5] ^ W[i-11] ^ W[i-13]

It takes four XMM registers to keep the last 16 W values. Vectors W[i-8:i-5] and W[i-16:i-13] are naturally aligned with XMM registers, while vector W[i-14:i-11] requires data from two different XMM registers. This can be most efficiently computed with the Intel Supplemental SSE3 (SSSE3) instruction PALIGNR.

And then the scalar code does the rest of the computation steps:

W[i] = (W'[i] ^ W'[i-3]) rol 1; ... W[i] + K;

Note that the completed computation of

**W[i]**needs to be reflected back to the vector side of the processor in time to be used in the next round of calculations. - Vectorize the whole expression plus more, overcoming dependency with additional iteration (approach by Dean Gaudet):
// done on 4 consequtive W[i] values in a single XMM register W[i ] = (W[i-3] ^ W[i-8] ^ W[i-14] ^ W[i-16]) rol 1 W[i+1] = (W[i-2] ^ W[i-7] ^ W[i-13] ^ W[i-15]) rol 1 W[i+2] = (W[i-1] ^ W[i-6] ^ W[i-12] ^ W[i-14]) rol 1 W[i+3] = ( 0 ^ W[i-5] ^ W[i-11] ^ W[i-13]) rol 1 // this additional calculation unfortunately requires many additional operations W[i+3] ^= W[i] rol 1 // once we have 4 W[i] values in XMM we can also add four K values with one instruction W[i:i+3] += {K,K,K,K}

Both require either extensive amount of data shuffling between general purpose and XMM registers (1) or in XMM registers (2), and if applied for all rounds 17-80 end up being not much faster than the best available scalar code.

Our implementation uses the second approach for the rounds 17-32 (4 vectorized iterations) only. But for the rounds 33 to 80 we came up with the improved vectorization presented below.

### Equal Transformation of SHA-1 Algorithm to Remove Vectorization Overhead

The second approach relies on the following distributive identity:

(A ^ B) rol C = (A rol C) ^ (B rol C) (1)

Let's also employ another one. If W defined as:

W[i] = W[i-A] ^ W[i-B]

Then, as can be shown with basic substitution (see below), it is equivalent to:

W[i] = W[i-2*A] ^ W[i-2*B] (2)

if values (i-2*A) and (i-2*B) are still in the defined range for i

.For those curious to see the proof of it:

If (i-2*A) and (i-2*B) are in the defined range for i, then:

By substitution:

W[i-A] = W[i-A-A] ^ W[i-A-B]; W[i-B] = W[i-B-A] ^ W[i-B-B]; W[i] = (W[i-A-A] ^ W[i-A-B]) ^ (W[i-B-A] ^ W[i-B-B]);

Considering that: W[i-A-B] = W[i-B-A];

W[i] = W[i-A-A] ^ W[i-B-B]; W[i] = W[i-2*A] ^ W[i-2*B] QED.

We need one more basic identity:

(X rol 1) rol 1 = X rol 2 (3)

Now, if we apply (2) in an extended form for four elements being XOR'ed instead of two, with rotates and XOR distribution identities (1) and (3) to the statement (**) calculating W[i] in SHA-1, it allows the following transformation:

W[i] = (W[i-3] ^ W[i- 8] ^ W[i-14] ^ W[i-16]) rol 1

for i>32, is equivalent to:

W[i] = (W[i-6] ^ W[i-16] ^ W[i-28] ^ W[i-32]) rol 2 (4)

The outcome is that the dependency

**W[i]->W[i-3]**was transformed into**W[i]->W[i-6]**, without increasing the amount of calculations, thanks to mutual cancelling of equal values with XOR. It can now be vectorized with 4 element SIMD in a very straightforward way for rounds 33-80. Also quite fortunately vectors**W[i-16]**,**W[i-28]**, and**W[i-32]**align themselves perfectly with XMM registers, each of which keeps four elements of array**W[]**, and only the one starting at**W[i-6]**requires forming the vector from the two "neighbour" XMM registers. This still can be handled very efficiently with a single SSSE3 PALIGNR instruction. The snippet of the vectorized code to compute 4 W values for the rounds 33-80 is below (all W's below are XMM registers, each keeping four consecutive values as defined by algorithm):movdqa W_TMP, W_minus_04 pxor W, W_minus_28 ; W equals W[i-32:i-29] before XOR ; W = W[i-32:i-29] ^ W[i-28:i-25] palignr W_TMP, W_minus_08, 8 ; W_TMP = W[i-6:i-3], combined from ; W[i-4:i-1] and W[i-8:i-5] vectors pxor W, W_minus_16 ; W = (W[i-32:i-29] ^ W[i-28:i-25]) ^ W[i-16:i-13] pxor W, W_TMP ; W = (W[i-32:i-29] ^ W[i-28:i-25] ^ W[i-16:i-13]) ^ W[i-6:i-3]) movdqa W_TMP, W ; 4 dwords in W are rotated left by 2 psrld W, 30 ; rotate left by 2 W = (W >> 30) | (W << 2) pslld W_TMP, 2 por W_TMP, W movdqa W, W_TMP ; four new W values W[i:i+3] are now calculated paddd W_TMP, [K_XMM] ; adding 4 current round's values of K movdqa [WK(i)], W_TMP ; storing for downstream GPR instructions to read

In the end four values of

**W[i]+K**are calculated in W_TMP XMM register, then we use memory instructions (every 128-bit store is later read by four 32-bit loads) as a way to move precalculated data from XMM to general purpose registers domain. It allows offloading of these operations to memory unit where stores and loads can be executed independently from ALU operations. Latency of the L1 cache store plus load is either completely hidden since W pre-computation can be scheduled early or served by store to load forwarding mechanism in the microarchitecture.There are a few other improvements worth mentioning. The use of SSSE3 instruction PSHUFB allows efficient conversion between big- and little-endian data formats for rounds 1 to 16, where values of

**W[i]**are read from the message data, four values are converted by a single PSHUFB instruction. The other optimization is software pipelining: if an application uses the interface to SHA-1 update function such that it allows to process several continuous 64-byte blocks at one call, then applying software pipelining (also used by Dean in his implementation) may give about 5% to 10% of additional performance improvement.### Performance Benefits and Integration

The overall performance improvement of this implementation over the best known scalar implementations ranges from ~1.2X to ~1.5X, achieving as low as 5.8 cycles per byte on 1024-byte buffer being hashed on the latest generations of Intel processors. Also, here at Intel, we have a unique opportunity to use performance simulators to measure performance and tune the code for the several future generations of processors in a development pipeline. This code was very well tuned and will maintain and improve its performance advantage over the scalar implementations going forward.

We are providing 64-bit assembly implementation of SHA-1 using platform independent NASM[7] syntax (it was also tested with YASM[8]). We wanted the final performance to be the best and independent of the platform and compiler being used. Assembler, in the case of SHA-1, has a solid performance advantage over C/C++ code with all the compilers we tried. The speedups ranged from large to moderate. Coding in assembly language allowed perfect control over the instruction scheduling as well as register allocation, which compilers cannot yet achieve. The assembly code employs macros efficiently to remain high-level. The code should be easy to integrate into existing applications. Please refer to the instructions in the beginning of the source file. We took care of the CPUID SSSE3 feature flag check and dispatch. A 32-bit version may show up later too, although it must be noted that the smaller number of registers available to 32-bit apps will impact performance compared to 64-bit code.

### Conclusion

This implementation notably advances the performance of SHA-1 algorithm compared to existing implementations. We are encouraging all projects utilizing SHA-1 to integrate this new fast implementation and are ready to help if issues or concerns arise (you are welcome to leave a comment or write an email to the authors). It is provided "as is" and free for either commercial or non-commercial use. We are already working with several interested customers and open source projects to integrate it. Hopefully you will see this SHA-1 implementation as a part of OpenSSL library, and, of course, Intel Integrated Performance Primitives will have it soon too.

I would like to thank and give credit to Dean Gaudet for his SHA-1 vectorization research, with several key ideas used in this implementation, as well as to a few exceptional engineers from various corners of Intel who contributed to this arguably world-best SHA-1 implementation for IA at the moment. They are: Ronen Zohar, Jim Guilford, Vinodh Gopal, Wajdi Feghali, Mark Buxton, Shay Gueron, Zino Benaissa and Anu Suryanarayanan. Thank you very much.