Instead of computing a Cyclic Redundancy Check (CRC) of an entire message with a traditional linear method, we use a faster method to split the data buffer into a number of smaller fixed-size segments, compute the CRC on these segments in parallel, then execute a recombination step of computing the effective CRC using the partial CRCs of the segments. Parallelized CRC computation is used to maximize the throughput of the CRC32 instruction and results in excellent performance.

### Overview

A CRC is the remainder, or residue (typically 32 bits), of binary division of a potentially long message, by a CRC polynomial. There is an ever-increasing need for very high-speed CRC computations on processors for end-to-end integrity checks. In this article, we show how Intel processors can accelerate CRC computation with a fixed CRC polynomial of degree 32 — the iSCSI CRC (0x11EDC6F41) using the CRC32 instruction introduced in the Intel Core i7 Processor.

The instruction definition can be found in the Intel 64 and IA-32 Architectures Software Developer's Manual Volume 2A [1]. The instruction operates on two operands, the accumulated CRC value and the new data, and updates the accumulated CRC with computations based on the new data. The instruction can operate on a maximum data size of 64 bits (a Qword). In the Intel Core i7 processors, the instruction is implemented with a latency of three cycles and a throughput of one cycle.

### Detecting the CRC32 Instruction

CRC32 belongs to the SSE4.2 family of instructions. To test for its presence, a programmer executes the CPUID instruction with **EAX = 1**, and then tests bit 20 of ECX on the output.

Microsoft's Visual Studio 2010 and GNU Compiler Collection (gcc) both provide easy mechanisms to query the CPUID instruction from C. In Visual Studio, this is provided by an intrinsic from <intrin.h>. Sample code is available from Microsoft [3]. In the gcc, <cpuid.h> includes code to query the CPUID instruction. On gcc version 4.3.4, the following code sequence could be used to test for the presence of SSE 4.2:

#include <cpuid.h> #include <stdio.h> void main () { unsigned int eax, ebx, ecx, edx; __get_cpuid(1, &eax, &ebx, &ecx, &edx); if (ecx & bit_SSE4_2) printf ("SSE4.2 is supported\n"); return; }

### Basic Idea: Perform CRC of Multiple Parts of a Buffer

The basic concepts in this article are derived from and explained in detail in the patents and pending applications [4][5][6] as well as the whitepaper Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction (PDF) [2]. If we divide the buffer into three non-overlapping parts, we can perform three CRC calculations in parallel. Since the calculation for each part is independent of the calculations for the other parts, each set of three CRC32 instructions is no longer dependent and can execute at the throughput rate.

At the end of this process, we have three separate CRC values. These need to be merged into a single value. This is the recombination "overhead" that we need to pay in order to process the buffer in parallel. On the Intel Core i7 Processor, the recombination can be done using lookup tables. In this article, we focus on the case of computing the CRC of a fixed-size buffer of length 1024 bytes.

### Detailed Implementation: CRC of a 1024-Byte Buffer

A 1024-byte (128 Qword) buffer can be broken into a (most-significant) segment of 43 Qwords, followed by two segments of 42 Qwords each, followed by a final (least-significant) Qword. Each of the three segments can be processed in parallel, resulting in three partial CRCs. We'll call the three values (from most-significant to least-significant) **crc0**, **crc1**, and **crc2**.

The **crc2** value is in the correct bit position with respect to the final Qword. The other two partial CRCs need to be "shifted" to the right, so that their new CRC values appear at the correct bit positions with respect to the final Qword. Since **crc0** can be handled in the exact same way as **crc1**, we only describe **crc1** in the rest of this section. The "shifting" of a partial CRC value is done using a single lookup table. To keep the table size reasonable, the table only contains 256 entries, so that each byte of **crc1** needs to be processed separately. Each table entry corresponding to a byte **b**, is the 32-bit result of applying CRC to a data buffer containing **b** shifted a fixed distance to the left. The table effectively moves the contribution of the byte, to the right by a fixed number of bits. Since the 4 bytes are staggered by 1 byte, the results of the lookup operation are also staggered by 1 byte.

The details of the recombination of a partial CRC value are shown in Figure 1.

Doing the table lookup for 1 byte of **crc1** gives **crc1a**, which is in the same bit position as **crc2**. However, the lookups for the other 3 bytes give results that are shifted (**crc1b…crc1d**). The 4 **crc1x** values are **xor**'d to the final Qword, then a final CRC32 operation combines the modified final Qword with **crc2** to give the final CRC value. The trade-off here is that we use just one table at the cost of doing some bit-shift operations to align the **crc1x** values correctly.

Another approach would be to use four different tables for the 4 bytes of **crc1**, each of which shifts the original byte down by 8 bits less than the previous table (e.g., **crc1b** would be shifted 1 byte less than **crc1a**). This results in four CRC values, each of which is in the same position. These can just be **xor**'d together without requiring any aligning shift operations. However, to reduce the data size, we just use one table (for **crc1**) in our implementation.

The C code for the function can be found in Listing One. The two tables of constants have been omitted, however they can be found in the aforementioned whitepaper [2].

LISTING ONE #include <stdio.h> #include <stdlib.h> #include <intrin.h> #include "types.h" // buffer size is 1024 (= 336 * 3 + 16) #define BLOCKSIZE 336 #define BLOCKSIZE8 (BLOCKSIZE / 8) extern UINT32 mul_table1_336[256]; extern UINT32 mul_table1_672[256]; UINT32 crc_1024_c(UINT8 *buffer, UINT32 initval){ UINT64 crc0, crc1, crc2, tmp; UINT64 *p_buffer; p_buffer = (UINT64*)buffer; crc1 = crc2 = 0; // Do first 8 bytes here for better pipelining crc0 = _mm_crc32_u64(initval, p_buffer[0]); for(int i = 0; i < 42; i++){ crc1 = _mm_crc32_u64(crc1, p_buffer[1 + 1*BLOCKSIZE8 + i]); crc2 = _mm_crc32_u64(crc2, p_buffer[1 + 2*BLOCKSIZE8 + i]); crc0 = _mm_crc32_u64(crc0, p_buffer[1 + 0*BLOCKSIZE8 + i]); } // merge in crc1 tmp = p_buffer[127]; tmp ^= mul_table1_336[crc1 & 0xFF]; tmp ^= ((UINT64)mul_table1_336[(crc1 >> 8) & 0xFF]) << 8; tmp ^= ((UINT64)mul_table1_336[(crc1 >> 16) & 0xFF]) << 16; tmp ^= ((UINT64)mul_table1_336[(crc1 >> 24) & 0xFF]) << 24; // merge in crc0 tmp ^= mul_table1_672[crc0 & 0xFF]; tmp ^= ((UINT64)mul_table1_672[(crc0 >> 8) & 0xFF]) << 8; tmp ^= ((UINT64)mul_table1_672[(crc0 >> 16) & 0xFF]) << 16; tmp ^= ((UINT64)mul_table1_672[(crc0 >> 24) & 0xFF]) << 24; return (UINT32) _mm_crc32_u64(crc2, tmp); }