*John has developed process-control systems for the steel industry, spacecraft telemetry systems for NASA, a real-time distributed database for a long-distance telephone company, and a GPS navigation system for a marine electronics firm. He recently completed his MS in computer science and can be reached through the DDJ office.*

As computers have become increasingly ubiquitous, data-communication applications such as electronic mail, FAX, remote file transfers, and computer teleconferencing have become commonplace. Noise and bandwidth limitations have always been and will continue to be a major issue in computer communications, even as new technologies such as ISDN replace the old-fashioned, noisy, voice-grade telephone line.

The most common way of dealing with noisy transmission channels is to split the data into records and send a few extra bytes of checksum information with each data record. Checksum information is a numeric value which is computed from the data being transmitted and then used by the receiver to verify correct transmission. Over the years, a number of methods have been developed to perform this verification.

The earliest methods involved the transmission of parity bits. An additional bit was appended to each byte, which caused the byte to contain an even number of 1 bits. If a byte with an odd number of 1 bits was received, a transmission error must have occurred. While this method, known as "vertical parity," provides some verification, errors can still easily go undetected. For example, any character received with an even number of bit errors will pass this simple parity test, rather than being reported as an invalid character.

Another method, known as "horizontal parity," uses a similar technique--and suffers from similar problems. In horizontal parity, several bytes of data are grouped into a block. An additional parity byte is then appended to the end of the block. The value of this parity byte is calculated so that an even number of 1 bits will have been sent in each bit position of the block. This calculation can be performed by simply exclusive-ORing together all the bytes in the block. Here again, two single-bit errors can cancel each other if they occur in the same bit position in a block.

A more powerful error detection method is the Cyclic Redundancy Check, or CRC. Based on results from the theory of cyclic groups, CRC schemes are among the most reliable means of error detection in use. In most CRC algorithm implementations, a 16-bit CRC checkword is formed by treating all the bits of a transmission block's message portion as one large number, and computing the remainder of this number after division by a known 17-bit divisor. This 16-bit CRC checkword is adequate to detect all single-bit errors, all burst errors of 16 or fewer bits in length, and all double-bit errors separated by fewer than 65,536 bits (or 8192 bytes). This is a respectable showing for a technique requiring so little additional data.

Another desirable property of CRC is that the generation and testing of the checkword can be readily implemented in hardware by sequentially shifting, masking, and exclusive-ORing the bits of the bytes being transmitted. Unfortunately, most computer architectures lack the instructions required to allow this type of bit-oriented calculation to be performed efficienily. Because of this, software CRC generation and testing tends to be processor intensive. A good assembly-language implementation of CRC will typically require at least six machine instructions for each bit transmitted, or over 50 machine instructions for each byte of data being transmitted.

At high data rates, slow processors can have performance problems.

### Enter Fletcher

In the late '70s, an error-detection technique was developed which provides error-detection properties nearly equal to those of CRC but requires much less computational effort. This method, known as the "Fletcher checksum," was devised by John G. Fletcher of Lawrence Livermore Labs. The algorithm and an analysis of its performance characteristics were published in the IEEE Transactions on Communications in January 1982. One version of the Fletcher checksum (known as the one's complement version) has since been adopted for use in the class-4 transport layer of the ISO network protocol. In spite of all this. Fletcher's checksum is not as well known as more commonly used techniques.

In his paper on this algorithm, Fletcher analyzed several common measures of the error-detection ability of his checksum algorithm and compared the results to those for a CRC algorithm. The results were generally similar for both algorithms, but CRC had a slight edge in most categories.

Both the CRC and Fletcher's checksum will detect all single-bit errors. Both allow a small fraction of errors to get through undetected: 0.001526 percent using a CRC vs. 0.001538 percent using Fletcher's checksum. Both detect all double-bit errors as long as the erroneous bits are close enough to each other. However, the definition of "close enough" varies from 2040 bits for Fletcher's algorithm to 65,535 bits for the CRC algorithm. In practical terms, this means that a CRC will detect all double-bit errors so long as a block size of under 8191 bytes is used, while Fletcher's checksum can only guarantee this performance on blocks of fewer than 255 bytes.

### The Algorithm

The check field in Fletcher's algorithm is made up of two 8-bit check values. The first is initially set to 0. As each byte in the message is processed, its value is added to this check value, and the remainder on division by 255 is saved as the new first check value. The second check value is computed in a similar manner. Its value is initialized to 0, and updated as each byte in the message is processed. This time, however, it is updated by adding the current value of the first check value. Again, the remainder on division by 255 is saved as the new second check value. Example 1(a) shows pseudocode for this.

#### Example 1: (a) Pseudocode for the first part of Fletcher's algorithm; (b) pseudocode for the second part of Fletcher's algorithm.

(a) integer i, sum1, sum2; sum1 = 0; sum2 = 0; for i from 1 to message_length do sum1 = ( sum1 + message[i] ) modulo 255; sum2 = ( sum2 + sum1 ) modulo 255; end for (b) check1 = 255 - (( sum1 + sum2 ) modulo 255); message[message_length+1] = check1; check2 = 255 - (( sum1 + check 1 ) modulo 255); message[message_length+2] = check2;

After these two check values have been computed, they must be incorporated into the data stream so that they can be checked at the receiving end. One approach is to simply append them to the end of the message block. This would provide good error detection, but many of the desirable properties discussed previously would be lost. To avoid this, the checksum itself is not transmitted with the message. Rather, a value is computed from the checksum bytes, which causes the receiver to end up with a checksum of 0 when the checksum of all data from the start of the message data through the end of the checksum field is generated. The first of these two values is 255 minus the modulo-255 remainder of the sum of the 2 message-checksum bytes. The second value is 255 minus the modulo-255 remainder of the sum of the first message checksum byte and previously computed first checksum byte. Pseudocode for the second part of the computation is shown in Example 1(b).

### A Real-world Example

As far as the theory goes, that's about all there is to Fletcher's checksum algorithm. In practice, however, additional details must be taken into account--for example, how to do it fast in assembly language.

I received my real-world introduction to Fletcher's checksum in 1989 while working as a contractor at NASA. A team of NASA engineers had developed an experimental local area network which had a series of custom-built processing nodes connected through fiber-optic cable. The network was operational, but was not meeting its performance expectations. I was called in to investigate and suggest improvements.

After some examination, I determined that when the network was operating at maximum capacity, over half of the CPU time was spent in a single subroutine: the subroutine that calculated the first part of Fletcher's checksum for the 2040-byte data packets. This checksum calculation was perfonned using the lines of C shown in Example 2(a). The corresponding assembly language (for the Motorola 680x0) generated by the C statements is shown in Example 2(b).

#### Example 2: (a) An unoptimized C implementation of the basic checksum; (b) the corresponding assembly language.

(a) register unsigned char *ptr; register short int i, len, sum1, sum2; sum1 = sum2 = 0; for (i=0; i<len; i++) { sum1 += *ptr++; if (sum1 >= 255) sum1 -= 255; sum2 += sum1; if (sum2 >= 255) sum2 -= 255; } (b) moveq #0,d5 move.w d5,d4 ; sum1 = sum2 = 0; moveq #0,d6 bra.s L_29 ; for(i=0; i<len; i++) { L_28 moveq #0,d0 move.b (a3)+,d0 add.w d0,d4 ; sum1 += *ptr++; cmp.w #255,d4 blt.s L_32 sub.w #255,d4 ; if(sum1 >= 255) sum2 -= 255; L_32 add.w d4,d5 ; sum2 += sum1; cmp.w #255,d5 blt.s L_36 sub.w #255,d5 ; if(sum2 >= 255) sum2 -= 255; L_36 addq.w #1,d6 L_29 cmp.w d7,d6 blt.s L_28 ; }

This is reasonably efficient code generation: All variables are kept in registers, there are no superfluous loads or stores, and there are no simple peephole optimizations which would speed things up. Even so, the 13 instructions in this loop require an average of 84 clock cycles to execute, assuming that each of the two subtract operations is executed on half of the passes. On the 10-MHz 68010 processor used in this project, this calculation required 8.4 microseconds per byte, or about 17 milliseconds to process the 2040-byte data buffers used in this network.

The most obvious improvement to the existing algorithm is to eliminate the last IF statement, replacing it with a remaindering operation performed upon exit from the loop. This is possible because of the linearity of the modulus operator, which provides that the modulo-255 sum of a series of numbers is the same as the modulo-255 remainder of the sum of the series. It is now necessary, however, to use a long (32-bit) integer for sum2. This will prevent overflow, so that the remainder can be computed correctly. A long integer representation of sum2 cannot overflow with a buffer smaller than (2[32])/255 bytes. Because this is just over the 2[24]-byte address space limit of the 68010 processor, overflow will not be a problem.

The initial value of len need not be preserved inside the subroutine, so the FOR loop can be changed to a slightly quicker WHILE loop which decrements len in the loop's exit test. This in turn can be coded in assembly language as a DBRA machine instruction. With these changes, the C-version algorithm can be modified, as shown in Example 3(a).

#### Example 3: (a) An improved C version of the basic checksum; (b) the corresponding assembly language.

(a) register unsigned char *ptr; register short int sum1, len; register unsigned long int sum2; sum1 = sum2 = 0; while (len--) { sum1 += *ptr++; if (sum1 >= 255) sum1 -= 255; sum2 += sum1; } sum2 %= 255; (b) loop moveq #0,d3 move.b (a3)+,d3 add.w d3,d0 ; sum1 += *ptr++; cmp.w #255,d0 blt.s no_sub sub.w #255.d0 ; if (sum1 >= 255) sum1 -= 255; no_sub add.1 d0,d1 ; sum2 += sum1; dbra d2,loop ; while (len--);

When the C code is turned into assembly language, the register usage is as follows:

- D0 holds sum1, the first checksum value.
- D1 holds sum2, the second checksum value.
- D2 holds len, the length in bytes of the buffer.
- A0 holds ptr, a pointer into the buffer.

These changes drop the inner loop to eight instructions per byte with an execution time of 56 cycles, or 5.6 microseconds per byte. While this is a significant improvement, a few tricks remain.

The most obvious problem is that after adding a byte in memory into the current sum1 value, three instructions are required to maintain the proper modulo-255 remainder for this value. Because sum1 should remain in the 0-254 range, it would be helpful if byte arithmetic could be used to perform this addition, thus reducing the first three instructions to a single add.b (a0)+,d0 instruction. As it turns out, this change will almost work correctly. A byte addition such as this will be performed using modulo-256 arithmetic, so if an overflow occurs, the resulting sum will be 1 smaller than the correct value.

Luckily, the 68010 instruction set includes an extended add instruction designed for use in performing extended precision arithmetic. This instruction adds two registers and increments the result if the extended bit was set when the instruction was entered. The extended bit will be set by the previous add instruction whenever the sum exceeds 255. By keeping a value of 0 in a register and performing an extended add of this 0 value to the sum of the previous addition, the increment required to compensate for the error caused by using modulo-256 arithmetic is performed whenever an overflow has occurred in the previous addition.

There is, however, one special case in which this scheme fails: when the addition to the sum1 value stored in register DO results in a value of 255. This isn't in the 0-254 range required by the checksum algorithm, and because it doesn't exceed 255, it isn't large enough to generate the overflow which would force it into this range. A close examination of the algorithm reveals that this can be worked around outside the loop. If this situation arises at the end of the buffer, sum1 will contain a value of 255 instead of its proper value of 0. This can be easily tested for on exiting the loop. If this situation occurs anywhere other than at the end of the loop, the next nonzero byte to be added to the sum1 value will cause an overflow, which will correct the condition. The sum1 value is also required to form the sum2 value. Because the only important part of the sum2 value is the modulo-255 remainder, adding the erroneous value of 255 to this value is equivalent (that is, it is modulo-255 congruent) to adding the correct value of 0.

Using D4 to hold the constant 0 value required for the add extended trick, the inner loop reduces to the four instructions shown in Example 4. This optimized version is a considerable improvement over the initial implementation. Only four instructions are executed for each byte of data being checksummed. The three adds require eight, four, and six machine cycles, respectively, while the decrement and branch instruction takes ten machine cycles, for a total of 28 machine cycles per byte. With this loop reduced to the point where nearly a third of the execution time is used by the decrement and branch instruction, another significant improvement in execution time can be obtained by unrolling the loop.

#### Example 4: The final optimized version of the basic checksum.

; Register Usage ; D0 the first checksum value (sum1) ; D1 the second checksum value (sum2) ; D2 the length in bytes of the buffer (len) ; D4 contains zero, for the addx.b instruction ; A0 pointer to the buffer (ptr) loop add.b (a0)+,d0 ; sum1 += *ptr++; addx.b d4,d0 ; if (sum1 >= 256) sum1 += 1; add.1 d0,d1 ; sum2 += sum1; dbra d2,loop ; while (len--);

The way I finally implemented this routine was to unroll the inner loop 16 times. Before entering this loop, an address register is loaded with the address of the appropriate occurrence of the three-instruction addition sequence to be used the first time through the loop. This is based on the modulo-16 remainder of the byte count. If this value is 0, the loop will be entered at its end; if this value is 1, the loop will be entered at the last set of add instructions, and so on. The byte-count value is then set to one-sixteenth of its actual value to compensate for the loop being unrolled.

With this final enhancement, three add instructions must be executed per byte, using 18 machine cycles per byte. In addition, one decrement and branch instruction must be executed for each 16 bytes, using an average of ten-sixteenths of a machine cycle per byte. This is over four times faster than the original version of the algorithm.

The revision of the checksum-generation code resulted in a routine over four times faster (18 5/8 vs. 84 machine cycles) than the original implementation. This optimization resulted in a near doubling of the throughput of the network using this function. This is a classic case of finding the one routine which consumes the majority of processing time, and obtaining a big overall performance improvement by tuning only that one routine. While the new checksum routine is machine dependent due to its being written in Motorola 680x0 assembly language, comparable instructions are available in all contemporary computers with which I am familiar, making the technique broadly applicable.

### Fletcher vs. CRC

An interesting question is how the performance of Fletcher's checksum algorithm compares with the performance of a well-tuned, assembly-language implementation of a CRC-generation algorithm.

To get a rough order-of-magnitude idea of the relative performance of these two algorithms, I examined several routines which generate the CRC values used in the XModem protocol. The best CRC implementations have inner loops which involve two AND instructions, two shift instructions, a conditional jump, and one or two exclusive-OR instructions for each bit in a message. This averages to 6.5 instructions per bit, or 52 instructions for each byte in a message--nearly 20 times more instructions per byte than Fletcher's technique.

In summary, Fletcher's checksum technique provides excellent error detection using a simple algorithm that can be implemented in only a few lines of code. Compared to CRC error detection, Fletcher's algorithm provides a comparable level of data integrity with a fraction of the computational effort. It's an excellent choice where high-link speeds must be achieved while using moderate- or low-performance computer hardware.

$CDDJ

_FLETCHER'S CHECKSUM_ by John Kodis Example 1: (a) integer i, sum1, sum2; sum1 = 0; sum2 = 0; for i from 1 to message_length do sum1 = ( sum1 + message[i] ) modulo 255; sum2 = ( sum2 + sum1 ) modulo 255; end for. (b) check1 = 255 - (( sum1 + sum2 ) modulo 255); message[message_length+1] = check1; check2 = 255 - (( sum1 + check1 ) modulo 255); message[message_length+2] = check2; Example 2: (a) register unsigned char *ptr; register short int i, len, sum1, sum2; sum1 = sum2 = 0; for (i=0; i<len; i++) { sum1 += *ptr++; if (sum1 >= 255) sum1 -= 255; sum2 += sum1; if (sum2 >= 255) sum2 -= 255; } (b) moveq #0,d5 move.w d5,d4 ;sum1 = sum2 = 0; moveq #0,d6 bra.s L_29 ;for(i=0; i<len; i++) { L_28 moveq #0,d0 move.b (a3)+,d0 add.w d0,d4 ; sum1 += *ptr++; cmp.w #255,d4 blt.s L_32 sub.w #255,d4 ; if(sum1 >= 255) sum1 -= 255; L_32 add.w d4,d5 ; sum2 += sum1; cmp.w #255,d5 blt.s L_36 sub.w #255,d5 ; if(sum2 >= 255) sum2 -= 255; L_36 addq.w #1,d6 L_29 cmp.w d7,d6 blt.s L_28 ;} Example 3: (a) register unsigned char *ptr; register short int sum1, len; register unsigned long int sum2; sum1 = sum2 = 0; while (len--) { sum1 += *ptr++; if (sum1 >= 255) sum1 -= 255; sum2 += sum1; } sum2 %= 255; (b) loop moveq #0,d3 move.b (a3)+,d3 add.w d3,d0 ; sum1 += *ptr++; cmp.w #255,d0 blt.s no sub sub.w #255,d0 ; if (sum1 >= 255) sum1 -= 255; no_sub add.1 d0,d1 ; sum2 += sum1; dbra d2,loop ; while (len--); Example 4: ; Register Usage ; D0 the first checksum value (sum1) ; D1 the second checksum value (sum2) ; D2 the length in bytes of the buffer (len) ; D4 contains zero, for the addx.b instructions ; A0 pointer into the buffer (ptr) loop add.b (a0)+,d0 ; sum1 += *ptr++; addx.b d4,d0 ; if (sum1 >= 256) sum1 += 1; add.l d0,d1 ; sum2 += sum1; dbra d2,loop ; while (len--);

Copyright © 1992, *Dr. Dobb's Journal*