Fast, Parallelized CRC Computation Using the Nehalem CRC32 Instruction

CRC of an Arbitrary Length Buffer

The basic approach to performing the CRC of an arbitrary-length buffer is to divide up the input into a series of fixed-sized buffers, then process each such buffer in a manner similar to the aforementioned example. The size of the fixed-size buffer can be selected to optimize the performance of the range of buffer lengths in a specific usage model. In one specific example, 1920 bytes of the input are iteratively processed until the size of the remaining bytes is less than 1920. We then process 960 bytes, 480 bytes, and/or 240 bytes, depending on how many bytes remain after each step. Finally, the remaining bytes (fewer than 240) are processed in a straightforward linear manner.

The processing for each buffer is handled in a manner similar to processing the 1024-byte fixed-size buffer, except that there is no "final Qword" (because 1920 is divisible by 3). Instead, the processing of the third (least-significant) portion of the input (i.e., the processing that results in crc2) stops one Qword short. Then the recombination becomes identical to that of the 1024-byte fixed-size buffer.

One further optimization is to note that the tables are defined by the distance the partial CRC value needs to be shifted down. This is basically 1/3 or 2/3 of the buffer size minus the 8 bytes to reflect that we are stopping one Qword early. So, for the 240-byte buffer, we need tables that shift by 72 and 152 bytes. For the 480-byte buffer, we need tables that shift by 152 and 312 bytes, and so on. The second table for a given buffer size is the same as the first table for the next larger buffer. That means that for four sizes of buffers, we only need five tables (rather than eight). The use of a small number of fixed buffer sizes is needed to keep the number of tables reasonable. Refer to the aforementioned whitepaper [2] for the detailed description, code, and performance of methods to perform CRC on arbitrary-length buffers.

Performance- CRC of a 1024-Byte Buffer

The performance was measured on the Intel Core i7 Processor and represents the performance on a single core running a single thread. For each run, we compute the CRC of the same data buffer 350,000 times. The CRC output of one calculation is used as the initial CRC for the next calculation to serialize them. The rdtsc instruction is used to sample the processors timestamp clock before and after the run, to get the number of timestamps per run. We then perform 256 runs, discarding the 64 fastest and 64 slowest times, and use the mean of the remaining 128 values. This number is then adjusted to reflect that the timestamp clock may not run at the same rate as the processor core clock due to turbo mode and/or power saving features. Our CRC implementation designed for the fixed-size of 1024 bytes performs at the rate of ~ 1.17 cycles/Qword.

Conclusion

We described optimized algorithms and code for computing the iSCSI CRC, which unleashes the full potential of the CRC32 instruction introduced in the Intel Core i7 processor. The performance of these methods tend towards 1.17 cycles/Qword for large/fixed buffers.

Acknowledgments

The authors would like to thank Erdinc Ozturk, Gil Wolrich, and Deniz Karakoyunlu for their contributions to the development and optimization of the implementation.

References

[1] Intel 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M http://www.intel.com/products/processor/manuals/

[3] __cpuid, __cpuidex http://msdn.microsoft.com/en-us/library/hskdteyh.aspx

[4] "Determining a Message Residue," Gopal et al. United States Patent 7,886,214

[5] "Determining a Message Residue," Gueron et al. United States Patent Application 20090019342

[6] "Determining a Message Residue," Gopal et al. United States Patent Application 20090158132

[7] "A Tutorial on CRC Computations", Ramabadran et al., Micro, IEEE, IEEE Computer Society, Aug. 1988, pp. 62-75.

[8] "High-Speed CRC Design for 10 Gbps applications," Lin et al., lSCAS 2006, IEEE, pp. 3177-3180.

[9] "Cyclic Redundancy Code (CRC) Polynomial Selection for Embedded Networks," Koopman et al., The International Conference on Dependable Systems and Networks, DSN- 2004, pp. 1-10.

[10] "Parallel CRC Realization," Campobello et al., IEEE Transactions on Computers, vol. 52, No. 10, IEEE Computer Society, Oct. 2003, pp. 1312-1319.

[11] "A Systematic Approach to Building High Performance Software-based CRC Generators," Kounavis et al., Proceedings of the 10th IEEE Symposium on Computers and Communications, ISCC 2005; pp. 855-862.

Vinodh Gopal, Jim Guilford, Martin Dixon, and Wajdi Feghali are IA Architects with the IAG Group at Intel Corporation.

More Insights

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.