Performance Evaluation and Optimization
In addition to implementing traditional compiler optimizations, the BEC implements several byte order-specific optimizations to reduce the overhead of byte order swap operations added at the boundaries of big- and little-endian code.
Bswap Elimination Optimization
The bswap elimination optimization is based on the concept of swap-tolerant expressions, which is defined as an expression that can be replaced with another expression operating on (some or all of ) arguments of different byte order and producing a valid result of the same or a different byte order.
For example, a comparison to a constant (for example, x == 0x12345678) is swap-tolerant because it has a counterpart (y == 0x78563412) that, if given a swapped argument SWAP(x), would produce the same result as the original expression.
A bit-wise AND is swap-tolerant because there is an operation (the same AND) such that, taking the swapped arguments, it would produce the correct result, but of different byte order.
Arithmetic (+, -, *, /) operations are not swap-tolerant because they strictly require data of specific byte order to produce correct results.
Domain is defined as a set of expressions of the code under compilation.
Domain entry is an expression outside the domain, result of which is taken as an argument by an expression belonging to the domain.
Domain exit is an expression outside the domain that takes a result of an expression belonging to the domain as an argument.
A swap-tolerant domain is defined as a set of swap-tolerant expressions that can be replaced by their counterpart expressions such that if some or all of the domain arguments are replaced with data of different byte order then all the domain results would be valid results of the same or a different byte order.
Swap of the domain is a code transformation involving the following two actions:
- Change of byte order of some or all of the domain entries and exit by placing or removing byte swap operations at necessary domain entries and exits.
- Substituting all expressions in the domain with their counterparts operating on different byte order, so that code semantics are preserved. Byte swap operations are removed if the entry or exit expression is a byte swap.
Consider this example:
T1 = SWAP(A)
T2 = SWAP(B)
RES = T1 == T2
The expression "T1 == T2" comprises a swap-tolerant domain, expressions "SWAP(A)" and "SWAP(B)" are domain entries, assignment "RES = ..." is a domain exit. A domain swap would be:
T1 = A //byte swap is removed
T2 = B //byte swap is removed
RES = T1 == T2 //byte order of result is the same
Domain swap benefit is an estimate of the performance benefit from the swap and is computed by factoring in the amount of code removed minus the amount of code added (taking into account execution cycle counts of specific instructions placed and removed).
To build a swap-tolerant domain one should start with any swap-tolerant expression or from a byte swap that needs to be removed and extend the domain with connected swap-tolerant expressions. If further domain extension is either impossible or performance negative, convert the current domain and move to the next expression.
Data Byte Order Choice
The BEC discussed in this article is primarily used to compile legacy code to execute on modern little-endian architectures. It has also been used to prototype new network stacks in which the endianess of the protocol headers is explicitly part of the data type definitions, eliminating the necessity of programmer vigilance to maintain the byte order. One typical usage model is where the programmer marks all of the legacy code as big-endian rather than determining and employing explicit byte order declarations on the minimal set of data. In this use case, the compiler adds BOCOs after loads and before stores of the data even when its byte order is not really sensitive. A second usage model is in a mixed endian environment where the user application requires big-endian semantics and the underlying operating system is a commercial offering for the target requiring little-endian semantics. This usage model also has a fair number of additional but semantically unnecessary BOCOs.
Another optimization that improves performance is related to the choice of the actual data byte order. At times, the compiler can prove that the programmer specified byte order has no impact on the program semantics. In cases where data with opposite byte order would perform better, the compiler will convert the data to employ that byte order.
The data (variables, data structures, heap data, function arguments, and so on) is byte order sensitive if changing its byte order may have side effects on the application.
Figure 2 is a flow chart that represents the algorithm used to determine if data is byte order sensitive. For example, the byte order of a top level static variable, the address of which is never taken, is not byte order sensitive.

Figure 2: Determining byte order sensitivity (Source: Intel Corporation, 2011).
Determining if data is byte order sensitive is done conservatively; that is, assume data is byte order sensitive unless proven otherwise. To determine if function arguments are byte order sensitive, the compiler additionally ensures that all the calls of the function are known (including indirect calls).
For byte order insensitive data, the compiler may change its byte order if the resulting code would be more efficient.
An implementation works on two compilation passes. On the first pass, it accumulates information about data usage and also calculates byte order preference from the performance perspective. At the second pass, the data usage is adjusted according to the selected byte order for each specific piece of data. To reduce compile time the BEC may heuristically break analysis and make conservative decisions.
Performance
To estimate the penalty for the BOCOs a modified version of little-endian benchmarks was used: the benchmarks were adapted and compiled to execute with big-endian semantics. The execution times were compared with the original benchmarks compiled as a regular little-endian code.
| CPU | Mode | Overhead Compared to Little-Endian | Overhead Compared to Little-Endian |
|---|---|---|---|
| Benchmarks | Geomean | ||
| Intel Atom Processor | 32 | 1-13% | 5.6% |
| Intel Core i7 Processor | 32 | 1-14% | 8.5% |
| Intel Core i7 Processor | 64 | 5-30% | 13.8% |
Table 2: Penalty for the byte swaps (Data source: Intel Corporation 2011).
Table 2 shows the performance impact of the benchmarks compiled as big-endian (system libraries compiled as little-endian) compared with the same benchmarks compiled entirely as little-endian. The benchmarks comprise the C benchmarks in SPEC2000 and EEMBC 1.1. The benchmarks were executed on systems based upon the Intel Atom processor and the Intel Core i7 processor under both 32- and 64-bit modes as detailed in the table. The Intel Atom processor-based system executed at 1.66 GHz, included 2 GB RAM, and ran Ubuntu Linux 10.04. The Intel Core i7 processor-based system executed at 3.33 GHz, included 6 GB RAM, and ran Red Hat Enterprise Linux 5.U2 x86. The Intel C++ Compiler Standard Edition for Linux OS with Bi-Endian Technology, version 11.2, was employed in all cases.
There are two reasons for the overhead. The first is that byte swap operations take time. The overhead depends on the efficiency of byte swap elimination optimizations. Performance impact of the bswap operations also depends on the efficiency of byte swaps on underlying computer architecture. For that reason the gap on the Intel Atom architecture having the MOVBE instruction is lower. The MOVBE instruction performs a load from memory or a store to memory with a byte swap of the data as it transits to or from memory.
The second reason is that the presence of the byte swaps might be an obstacle for optimizations to generate good code. For example, byte swaps in a loop may prevent it from being vectorized. Byte swap eliminations applied at an early compilation phase usually helps to minimize this issue.
The overall impact of byte swap elimination on geomean score for the selected benchmarks is about 10–15 percent. There are tests (for example, in the EEMBC suite) that improve their performance in times when these optimizations are applied.
Conclusion
The BEC enables migration of large legacy code bases containing byte order dependencies and significantly reduces the debug and validation times associated with porting legacy source code to an execution environment with opposite byte order. The BEC does not find the specific byte order dependencies, but instead enables the application to execute with the same byte order semantics that it produced. By maintaining the assumptions of the existing and proven execution environment, latent bugs that would otherwise materialize do not require debugging.
This article detailed the evolution and design of the bi-endian technology, extensions to aid in its use, porting process, and performance optimizations. A performance assessment shows the overhead of employing the BEC as ranging from 5.6 percent to 13.8 percent on a range of benchmark programs. Future research directions include enabling full use of C++ features such as operator overloading with bi-endian types.
Acknowledgements
Our thanks go to Matt Adiletta, Bob Kushlis, and Jack Johnson for early work on the Bi-Endian Compiler. We thank Azita Refah, Peter Horn, Michael Rice, Kittur Ganesh, Damion Desai, Greg Anderson, Joe Wolf, and Kevin J. Smith for their contributions to this project spanning multiple years.
References
[1] Adiga, H. Writing Endian-Independent Code in C.
[2] Adiletta, M., Wilkinson, H., and Kushlis R., "Method and apparatus for I mplementing a bi-endian capable compiler," U.S. Patent 7 552 427, June 22, 2006.
[3] Blanc, B. and Marraoui, B. Endianness or Where is Byte 0. Retrieved. Dec. 21, 2008 [PDF].
[4] Cohen, D. 1981. On Holy Wars and a Plea for Peace. IEEE Computer 14, 10 (Oct. 1981), 48-54.
[5] Domeika, M. 2008. Software Development for Embedded Multi-core Systems. Elsevier, 93-99.
[6] Domeika, M., Loenko, M., Ozhdikhin, P., Brevnov, E., 2011. Bi-Endian Compiler: A Robust and High Performance Approach for Migrating Byte Order Sensitive Applications. Worldcomp '11: International Conference on Embedded Systems and Applications, 55-60.
[7] James, D. 1990. Multiplexed Buses: the Endian Wars Continue. IEEE Micro 10, 3 (Jun. 1990), 9-21.
[8] Matassa, L., Endianness Whitepaper, Intel Software Network.
[9] Souloglou, J., Rawsthorne, A., "Program code conversion," U.S. Patent 7 421 686, Sept. 2, 2008.
[10] Understanding Big and Little Endian Byte Order. Better Explained.
The benchmark results are not official results because the source code was modified to execute with big-endian semantics. See http://software.intel.com/en-us/articles/optimization-notice for more information regarding performance and optimization choices in Intel software products.
Evgueni Brevnov, Max Domeika, Mikhail Loenko, Pavel Ozhdikhin, Xinan Tang, and Hugh Wilkinson are engineers or architects at Intel Corp. This article is adapted from material in the Intel Technology Journal, Volume 16 Issue 1, and parts of it are copyright Intel Corporation 2011-2012.


