Nowadays we are seeing more and more services in the Internet and personal computing requiring secure data communications and storage. HTTPS with the SSL and TLS protocols are replacing non-secured HTTP traffic in online commerce, banking, secure web mail, and generally what is now being called "the Internet cloud." File system encryption is becoming ubiquitous on client operating systems as well as in enterprise solutions. In addition to security applications, we are seeing the usage of cryptographic hash algorithms in data de-duplication applications for storage and networking. These trends are very well reflected by the feedback we receive from our customers and partners who are seeing a continuous shift in their workload distribution towards more cryptographic computations. We are doing our best to address these compute intensive workloads. In 2010 Intel introduced AES-NI -- the new Intel Architecture (IA) instruction set extension for the Advanced Encryption Standards (AES), the first implemented in processors produced with 32-nm technology like Intel Xeon 5600 and 2010 Intel CoreTM processor families, utilizing u-architecture codenamed "Westemere", with dramatic performance improvements [1]. We also found that it is possible to leverage previously introduced IA instruction set extensions with some algorithmic innovations to increase the performance of the widely used Secure Hash Algorithm (SHA-1) [2]. This is the subject of this article.

### Brief History of SHA-1

SHA-1 has been implemented many times in various software and hardware products, including implementations for the Intel Architecture instruction set. One must note that virtually all widely used software implementations of SHA-1 are using only scalar ALU operations on the general purpose register set, while Intel has introduced several generations of vector (or SIMD) instruction set extensions to IA32 and Intel64. Some of them, namely Intel SSE2, Intel Supplemental SSE3 (SSSE3) and Intel SSE4 are particularly useful for integer algorithms.

SHA-1 received its share of researchers' attention. Besides research devoted to the security aspects such as strength and potential weaknesses, there have also been explorations paying attention to implementation efficiency with regard to modern superscalar and parallel computer architectures, like [3] or [4]. The summary is that although generally SHA-1 is a vastly parallel algorithm in terms of instruction level parallelism (it would have been able to efficiently utilize up to 7 ALU pipelines), it does not suit itself easily for an efficient implementation with SIMD instruction set architectures. Our work, however, demonstrates that SHA-1 can efficiently use SIMD to better utilize the capabilities of IA for enhanced performance.

There have been attempts to implement SHA-1 with SSE2. One good example is the implementation made by Dean Gaudet [5]. We built on some of Dean's ideas with several new important improvements.<.p>

It would be appropriate for the reader who is not familiar with SHA-1 to give a brief introduction. Certainly more details can be found in [2] or [6].

SHA-1 produces a 160 bit (20 byte) hash value (digest), taking as an input a sequence of 1 or more 512 bit (64 byte) data blocks. The original source data also requires some padding according to the standard. The data is treated as an array of big-endian 32-bit values. Processing each of the 64-byte input blocks consists of 80 iterations also known as "rounds."

Schematically SHA-1 looks like this using C-like syntax:

int F1(int B, int C, int D) { return (D ^ ( B & (C ^ D)); } int F2(int B, int C, int D) { return (D ^ B ^ C); } int F3(int B, int C, int D) { return (B & C) | (D & (B ^ C)); } #define K1 0x5A827999 #define K2 0x6ED9EBA1 #define K3 0x8F1BBCDC #define K4 0xCA62C1D6 int K( int i ) { return i < 20 ? K1 : i < 40 ? K2 : i < 60 ? K3 : K4; } int F( int i, int B, int C, int D ) { return i < 20 ? F1(B, C, D) : i < 40 ? F2(B, C, D) : i < 60 ? F3(B, C, D) : F2(B, C, D) ; } // Update HASH[] by processing a one 64-byte block in MESSAGE[] void SHA1( int HASH[], int MESSAGE[] ) { // these arrays are not necessary but used to better highlight dependencies int A[81], B[81], C[81], D[81], E[81]; int W[80]; int i, FN; A[0] = HASH[0]; B[0] = HASH[1]; C[0] = HASH[2]; D[0] = HASH[3]; E[0] = HASH[4]; for ( i=0; i<80; ++i ) { // W[i] calculation, also known as 'Message Scheduling' if ( i < 16 ) // reverse the order of bytes on little-endian W[i] = BIG_ENDIAN_LOAD( MESSAGE[i] ); else W[i] = ROTATE_LEFT( W[i-3] ^ W[i-8] ^ W[i-14] ^ W[i-16], 1 ); FN = F( i, B[i], C[i], D[i] ); A[i+1] = FN + E[i] + ROTATE_LEFT( A[i], 5 ) + W[i] + K(i); B[i+1] = A[i]; C[i+1] = ROTATE_LEFT( B[i], 30 ); D[i+1] = C[i]; E[i+1] = D[i]; } HASH[0] += A[80]; HASH[1] += B[80]; HASH[2] += C[80]; HASH[3] += D[80]; HASH[4] += E[80]; }