Channels ▼
RSS

Parallel

Fast, Parallelized CRC Computation Using the Nehalem CRC32 Instruction


Instead of computing a Cyclic Redundancy Check (CRC) of an entire message with a traditional linear method, we use a faster method to split the data buffer into a number of smaller fixed-size segments, compute the CRC on these segments in parallel, then execute a recombination step of computing the effective CRC using the partial CRCs of the segments. Parallelized CRC computation is used to maximize the throughput of the CRC32 instruction and results in excellent performance.

Overview

A CRC is the remainder, or residue (typically 32 bits), of binary division of a potentially long message, by a CRC polynomial. There is an ever-increasing need for very high-speed CRC computations on processors for end-to-end integrity checks. In this article, we show how Intel processors can accelerate CRC computation with a fixed CRC polynomial of degree 32 — the iSCSI CRC (0x11EDC6F41) using the CRC32 instruction introduced in the Intel Core i7 Processor.

The instruction definition can be found in the Intel 64 and IA-32 Architectures Software Developer's Manual Volume 2A [1]. The instruction operates on two operands, the accumulated CRC value and the new data, and updates the accumulated CRC with computations based on the new data. The instruction can operate on a maximum data size of 64 bits (a Qword). In the Intel Core i7 processors, the instruction is implemented with a latency of three cycles and a throughput of one cycle.

Detecting the CRC32 Instruction

CRC32 belongs to the SSE4.2 family of instructions. To test for its presence, a programmer executes the CPUID instruction with EAX = 1, and then tests bit 20 of ECX on the output.

Microsoft's Visual Studio 2010 and GNU Compiler Collection (gcc) both provide easy mechanisms to query the CPUID instruction from C. In Visual Studio, this is provided by an intrinsic from <intrin.h>. Sample code is available from Microsoft [3]. In the gcc, <cpuid.h> includes code to query the CPUID instruction. On gcc version 4.3.4, the following code sequence could be used to test for the presence of SSE 4.2:

#include <cpuid.h>
#include <stdio.h>

void
main () {
  unsigned int eax, ebx, ecx, edx;

  __get_cpuid(1, &eax, &ebx, &ecx, &edx);

  if (ecx & bit_SSE4_2)
    printf ("SSE4.2 is supported\n");

  return;
}

Basic Idea: Perform CRC of Multiple Parts of a Buffer

The basic concepts in this article are derived from and explained in detail in the patents and pending applications [4][5][6] as well as the whitepaper Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction (PDF) [2]. If we divide the buffer into three non-overlapping parts, we can perform three CRC calculations in parallel. Since the calculation for each part is independent of the calculations for the other parts, each set of three CRC32 instructions is no longer dependent and can execute at the throughput rate.

At the end of this process, we have three separate CRC values. These need to be merged into a single value. This is the recombination "overhead" that we need to pay in order to process the buffer in parallel. On the Intel Core i7 Processor, the recombination can be done using lookup tables. In this article, we focus on the case of computing the CRC of a fixed-size buffer of length 1024 bytes.

Detailed Implementation: CRC of a 1024-Byte Buffer

A 1024-byte (128 Qword) buffer can be broken into a (most-significant) segment of 43 Qwords, followed by two segments of 42 Qwords each, followed by a final (least-significant) Qword. Each of the three segments can be processed in parallel, resulting in three partial CRCs. We'll call the three values (from most-significant to least-significant) crc0, crc1, and crc2.

The crc2 value is in the correct bit position with respect to the final Qword. The other two partial CRCs need to be "shifted" to the right, so that their new CRC values appear at the correct bit positions with respect to the final Qword. Since crc0 can be handled in the exact same way as crc1, we only describe crc1 in the rest of this section. The "shifting" of a partial CRC value is done using a single lookup table. To keep the table size reasonable, the table only contains 256 entries, so that each byte of crc1 needs to be processed separately. Each table entry corresponding to a byte b, is the 32-bit result of applying CRC to a data buffer containing b shifted a fixed distance to the left. The table effectively moves the contribution of the byte, to the right by a fixed number of bits. Since the 4 bytes are staggered by 1 byte, the results of the lookup operation are also staggered by 1 byte.

The details of the recombination of a partial CRC value are shown in Figure 1.

[Click image to view at full size]
Figure 1. Recombining a partial CRC.

Doing the table lookup for 1 byte of crc1 gives crc1a, which is in the same bit position as crc2. However, the lookups for the other 3 bytes give results that are shifted (crc1b…crc1d). The 4 crc1x values are xor'd to the final Qword, then a final CRC32 operation combines the modified final Qword with crc2 to give the final CRC value. The trade-off here is that we use just one table at the cost of doing some bit-shift operations to align the crc1x values correctly.

Another approach would be to use four different tables for the 4 bytes of crc1, each of which shifts the original byte down by 8 bits less than the previous table (e.g., crc1b would be shifted 1 byte less than crc1a). This results in four CRC values, each of which is in the same position. These can just be xor'd together without requiring any aligning shift operations. However, to reduce the data size, we just use one table (for crc1) in our implementation.

The C code for the function can be found in Listing One. The two tables of constants have been omitted, however they can be found in the aforementioned whitepaper [2].

LISTING ONE

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>
#include "types.h"

// buffer size is 1024 (= 336 * 3 + 16)
#define BLOCKSIZE  336
#define BLOCKSIZE8 (BLOCKSIZE / 8)

extern UINT32 mul_table1_336[256];
extern UINT32 mul_table1_672[256];

UINT32 crc_1024_c(UINT8 *buffer, UINT32 initval){
    UINT64 crc0, crc1, crc2, tmp;
    UINT64 *p_buffer;

    p_buffer = (UINT64*)buffer;
    crc1 = crc2 = 0;

    // Do first 8 bytes here for better pipelining
    crc0 = _mm_crc32_u64(initval, p_buffer[0]);

    for(int i = 0; i < 42; i++){
      crc1 = _mm_crc32_u64(crc1, p_buffer[1 + 1*BLOCKSIZE8 + i]);
      crc2 = _mm_crc32_u64(crc2, p_buffer[1 + 2*BLOCKSIZE8 + i]);
      crc0 = _mm_crc32_u64(crc0, p_buffer[1 + 0*BLOCKSIZE8 + i]);
    }

    // merge in crc1
    tmp  = p_buffer[127];
    tmp ^= mul_table1_336[crc1 & 0xFF];
    tmp ^= ((UINT64)mul_table1_336[(crc1 >>  8) & 0xFF]) <<  8;
    tmp ^= ((UINT64)mul_table1_336[(crc1 >> 16) & 0xFF]) << 16;
    tmp ^= ((UINT64)mul_table1_336[(crc1 >> 24) & 0xFF]) << 24;

    // merge in crc0
    tmp ^= mul_table1_672[crc0 & 0xFF];
    tmp ^= ((UINT64)mul_table1_672[(crc0 >>  8) & 0xFF]) <<  8;
    tmp ^= ((UINT64)mul_table1_672[(crc0 >> 16) & 0xFF]) << 16;
    tmp ^= ((UINT64)mul_table1_672[(crc0 >> 24) & 0xFF]) << 24;

    return (UINT32) _mm_crc32_u64(crc2, tmp);
}


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video