Carl Ek is an architect at Code Integrity Solutions, a consultancy focused on helping companies get more value from their static source code analysis and build tool investments. Prior to Code Integrity Solutions, Carl held senior engineering positions at Electronic Arts and in the compiler development division at IBM.
"There are 0010 0000 kinds of people in the world: Those that understand the difference between Big Endian and Little Endian, and those that do not."
Since all binary processors (hardware or software) have an endian design, correct processing of the data based on that endian design is extremely important. The statement above is a version of another joke, but with a twist: the binary is represented in little endian, giving some mild humor for those that understand. For those that don't understand endianness, the humor is lost, much like a processor which has an endian processing defect. In this article, I describe the kinds of defects that occur and the ways in which static analysis tools can help detect programming errors and enforce correct programming. But first, let's define some terms:
- Endianness. The order in which the bytes of a multi-byte datatype are laid out in storage.
- Big Endian. The most significant byte is stored at the lowest address, and significance decreases as addresses ascend.
- Little Endian. The least significant byte is stored at the lowest address, and significance increases as addresses ascend.
For example, the two-byte integer 54,321 = 0xD431
- Big Endian: stored as D4 31
- Little Endian: stored as 31 D4
It is easy to see that if a little endian processor stored the integer value 54,321 and the bytes were subsequently read in big endian order, the value would be incorrectly derived as 0x31D4, or 12,756.
Hence, endianness issues in programming are twofold:
- (a) When endianness is not mixed, care must be taken if programs are to be portable across both little and big endian processors: this is an "Endian Neutral" strategy.
- (b) When endianness in programs is mixed, datatypes must be processed with their corresponding correct byte order: this is an "Endian Protocol" strategy.
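The 54,321 example above can be sketched in C. This is a minimal illustration, not code from any particular product; the function names store_be16, store_le16, load_be16 and load_le16 are hypothetical helpers invented for this sketch. They write and read a two-byte value in an explicit byte order, so the result does not depend on the processor running them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store value with the most significant byte first. */
void store_be16(uint8_t buf[2], uint16_t value)
{
    assert(buf != NULL);
    buf[0] = (uint8_t)(value >> 8);    /* high byte at the lowest address */
    buf[1] = (uint8_t)(value & 0xFF);  /* low byte at the next address */
}

/* Hypothetical helper: store value with the least significant byte first. */
void store_le16(uint8_t buf[2], uint16_t value)
{
    assert(buf != NULL);
    buf[0] = (uint8_t)(value & 0xFF);  /* low byte at the lowest address */
    buf[1] = (uint8_t)(value >> 8);    /* high byte at the next address */
}

/* Hypothetical helper: interpret two bytes as a big endian value. */
uint16_t load_be16(const uint8_t buf[2])
{
    assert(buf != NULL);
    return (uint16_t)(((unsigned)buf[0] << 8) | buf[1]);
}

/* Hypothetical helper: interpret two bytes as a little endian value. */
uint16_t load_le16(const uint8_t buf[2])
{
    assert(buf != NULL);
    return (uint16_t)(((unsigned)buf[1] << 8) | buf[0]);
}
```

Storing 54,321 with store_le16 yields the bytes 31 D4. Reading them back with load_le16 recovers 54,321, while reading them with load_be16 produces 12,756, which is the defect described above.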
Endianness and the C Programming Language
The C language gives the programmer many ways to "shoot oneself in the foot". Since the language gives access to the bits and bytes of storage through pointers and other mechanisms, access to data depends upon the storage layout of the bytes.
The two strategies (a) and (b) above can both be handled with carefully coded algorithms. In the case of (a), the Endian Neutral strategy, some general rules for C/C++ can avoid endian related problems if they are followed diligently. Here is a (non-exhaustive) list:
- Avoid using unions which combine different multi-byte datatypes.
  - The layout of the union members may differ with the endianness of the platform.
- Avoid accessing byte arrays outside of the byte datatype.
  - The order of the bytes in the array is endian dependent.
- Avoid using bit-fields and byte-masks.
  - Since the layout of the storage is dependent upon endianness, the masking of the bytes and selection of the bit fields is endian sensitive.
- Avoid casting pointers from a multi-byte type to other byte types.
  - When a pointer is cast from one type to another, the endianness of the source (i.e., the original target) is lost and subsequent processing may be incorrect.
All rules are similar in concept: avoid processing bytes in an order with assumptions about their storage layout. More rules for C and C-style languages could be derived from the above, and rules could be developed for other languages. Enforcing rules such as the above will result in more lines of code which are portable across different endian systems. Good implementations of this strategy could be run on one endian system, and could be ported with minimal changes to another endian system.
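As a rough illustration of the common idea behind these rules, the sketch below contrasts a host-dependent read with two endian neutral reads of a 32-bit value from a byte buffer. The function names are hypothetical and chosen only for this example. The first function copies the bytes as they sit in memory, so its result changes with the host; the other two build the value arithmetically from individual bytes, so they return the same result on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Host-dependent (hypothetical name): the bytes are reinterpreted in
 * whatever order the host uses, so the result differs between little
 * and big endian machines. */
uint32_t read_u32_host_order(const uint8_t *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Endian neutral (hypothetical name): the data is declared to be big
 * endian, and the value is assembled with shifts, which operate on
 * values rather than on storage layout. */
uint32_t read_u32_be(const uint8_t *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Endian neutral (hypothetical name): same idea for little endian data. */
uint32_t read_u32_le(const uint8_t *p)
{
    assert(p != NULL);
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}
```

For the bytes 12 34 56 78, read_u32_be always returns 0x12345678 and read_u32_le always returns 0x78563412, whereas read_u32_host_order returns one or the other depending on the machine.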
What is the best practice when the endianness of datatypes in a single application is mixed? In this case the program must properly process both little endian and big endian data, and an Endian Protocol strategy is needed. The fundamental operation in these algorithms is swapping bytes as data is processed. This can be done in hardware or in software. However, hardware solutions are impractical in many cases because of the variability of the configurations and data to be handled. Software byte swapping methods have been developed in which the endianness is defined or derived, and the data bytes are processed in the correct context based on their endianness.
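A software byte swap of this kind might look like the following sketch. It is an assumption-laden illustration rather than a standard facility: the type data_endian_t and the functions swap16, swap32, host_endian and to_host32 are hypothetical names introduced here. The endianness of the data is passed in explicitly ("defined"), the endianness of the host is probed at run time ("derived"), and bytes are swapped only when the two differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tag describing the byte order of a piece of data. */
typedef enum { DATA_BIG_ENDIAN, DATA_LITTLE_ENDIAN } data_endian_t;

/* Reverse the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Reverse the four bytes of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

/* Derive the host byte order by inspecting the first byte of a known value. */
data_endian_t host_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1 ? DATA_LITTLE_ENDIAN : DATA_BIG_ENDIAN;
}

/* Convert a raw 32-bit value, copied from storage as-is, to host order.
 * The caller states the byte order the data was written in. */
uint32_t to_host32(uint32_t raw, data_endian_t data_order)
{
    assert(data_order == DATA_BIG_ENDIAN || data_order == DATA_LITTLE_ENDIAN);
    return data_order == host_endian() ? raw : swap32(raw);
}
```

With this sketch, swap16(0xD431) gives 0x31D4, matching the earlier 54,321 example, and applying swap32 twice returns the original value. Keeping the data's byte order as an explicit parameter is one possible design; it makes the assumption about each datum visible at the call site.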
In the C language, numerous byte swap facilities have been designed for this:
- The swab function: swaps each pair of adjacent bytes in a buffer
- Byte swapping macros:
- ntohs: network to host short
- ntohl: network to host long