Channels ▼
RSS

C/C++

Writing a Bi-Endian Compiler


Performance Evaluation and Optimization

In addition to implementing traditional compiler optimizations, the BEC implements several byte order-specific optimizations to reduce the overhead of byte order swap operations added at the boundaries of big- and little-endian code.

Bswap Elimination Optimization

The bswap elimination optimization is based on the concept of swap-tolerant expressions, which is defined as an expression that can be replaced with another expression operating on (some or all of ) arguments of different byte order and producing a valid result of the same or a different byte order.

For example, a comparison to a constant (for example, x == 0x12345678) is swap-tolerant because it has a counterpart (y == 0x78563412) that, if given a swapped argument SWAP(x), would produce the same result as the original expression.

A bit-wise AND is swap-tolerant because there is an operation (the same AND) such that, taking the swapped arguments, it would produce the correct result, but of different byte order.

Arithmetic (+, -, *, /) operations are not swap-tolerant because they strictly require data of specific byte order to produce correct results.

Domain is defined as a set of expressions of the code under compilation.

Domain entry is an expression outside the domain, result of which is taken as an argument by an expression belonging to the domain.

Domain exit is an expression outside the domain that takes a result of an expression belonging to the domain as an argument.

A swap-tolerant domain is defined as a set of swap-tolerant expressions that can be replaced by their counterpart expressions such that if some or all of the domain arguments are replaced with data of different byte order then all the domain results would be valid results of the same or a different byte order.

Swap of the domain is a code transformation involving the following two actions:

  • Change of byte order of some or all of the domain entries and exit by placing or removing byte swap operations at necessary domain entries and exits.
  • Substituting all expressions in the domain with their counterparts operating on different byte order, so that code semantics are preserved. Byte swap operations are removed if the entry or exit expression is a byte swap.

Consider this example:


T1 = SWAP(A)
T2 = SWAP(B)
RES = T1 == T2

The expression "T1 == T2" comprises a swap-tolerant domain, expressions "SWAP(A)" and "SWAP(B)" are domain entries, assignment "RES = ..." is a domain exit. A domain swap would be:

T1 = A //byte swap is removed
T2 = B //byte swap is removed
RES = T1 == T2 //byte order of result is the same

Domain swap benefit is an estimate of the performance benefit from the swap and is computed by factoring in the amount of code removed minus the amount of code added (taking into account execution cycle counts of specific instructions placed and removed).

To build a swap-tolerant domain one should start with any swap-tolerant expression or from a byte swap that needs to be removed and extend the domain with connected swap-tolerant expressions. If further domain extension is either impossible or performance negative, convert the current domain and move to the next expression.

Data Byte Order Choice

The BEC discussed in this article is primarily used to compile legacy code to execute on modern little-endian architectures. It has also been used to prototype new network stacks in which the endianess of the protocol headers is explicitly part of the data type definitions, eliminating the necessity of programmer vigilance to maintain the byte order. One typical usage model is where the programmer marks all of the legacy code as big-endian rather than determining and employing explicit byte order declarations on the minimal set of data. In this use case, the compiler adds BOCOs after loads and before stores of the data even when its byte order is not really sensitive. A second usage model is in a mixed endian environment where the user application requires big-endian semantics and the underlying operating system is a commercial offering for the target requiring little-endian semantics. This usage model also has a fair number of additional but semantically unnecessary BOCOs.

Another optimization that improves performance is related to the choice of the actual data byte order. At times, the compiler can prove that the programmer specified byte order has no impact on the program semantics. In cases where data with opposite byte order would perform better, the compiler will convert the data to employ that byte order.

The data (variables, data structures, heap data, function arguments, and so on) is byte order sensitive if changing its byte order may have side effects on the application.

Figure 2 is a flow chart that represents the algorithm used to determine if data is byte order sensitive. For example, the byte order of a top level static variable, the address of which is never taken, is not byte order sensitive.


Figure 2: Determining byte order sensitivity (Source: Intel Corporation, 2011).

Determining if data is byte order sensitive is done conservatively; that is, assume data is byte order sensitive unless proven otherwise. To determine if function arguments are byte order sensitive, the compiler additionally ensures that all the calls of the function are known (including indirect calls).

For byte order insensitive data, the compiler may change its byte order if the resulting code would be more efficient.

An implementation works on two compilation passes. On the first pass, it accumulates information about data usage and also calculates byte order preference from the performance perspective. At the second pass, the data usage is adjusted according to the selected byte order for each specific piece of data. To reduce compile time the BEC may heuristically break analysis and make conservative decisions.

Performance

To estimate the penalty for the BOCOs a modified version of little-endian benchmarks was used: the benchmarks were adapted and compiled to execute with big-endian semantics. The execution times were compared with the original benchmarks compiled as a regular little-endian code.

CPU Mode Overhead Compared to Little-Endian Overhead Compared to Little-Endian
Benchmarks Geomean
Intel Atom Processor 32

1-13% 5.6%
Intel Core i7 Processor 32 1-14% 8.5%
Intel Core i7 Processor 64 5-30% 13.8%

Table 2: Penalty for the byte swaps (Data source: Intel Corporation 2011).

Table 2 shows the performance impact of the benchmarks compiled as big-endian (system libraries compiled as little-endian) compared with the same benchmarks compiled entirely as little-endian. The benchmarks comprise the C benchmarks in SPEC2000 and EEMBC 1.1. The benchmarks were executed on systems based upon the Intel Atom processor and the Intel Core i7 processor under both 32- and 64-bit modes as detailed in the table. The Intel Atom processor-based system executed at 1.66 GHz, included 2 GB RAM, and ran Ubuntu Linux 10.04. The Intel Core i7 processor-based system executed at 3.33 GHz, included 6 GB RAM, and ran Red Hat Enterprise Linux 5.U2 x86. The Intel C++ Compiler Standard Edition for Linux OS with Bi-Endian Technology, version 11.2, was employed in all cases.

There are two reasons for the overhead. The first is that byte swap operations take time. The overhead depends on the efficiency of byte swap elimination optimizations. Performance impact of the bswap operations also depends on the efficiency of byte swaps on underlying computer architecture. For that reason the gap on the Intel Atom architecture having the MOVBE instruction is lower. The MOVBE instruction performs a load from memory or a store to memory with a byte swap of the data as it transits to or from memory.

The second reason is that the presence of the byte swaps might be an obstacle for optimizations to generate good code. For example, byte swaps in a loop may prevent it from being vectorized. Byte swap eliminations applied at an early compilation phase usually helps to minimize this issue.

The overall impact of byte swap elimination on geomean score for the selected benchmarks is about 10–15 percent. There are tests (for example, in the EEMBC suite) that improve their performance in times when these optimizations are applied.

Conclusion

The BEC enables migration of large legacy code bases containing byte order dependencies and significantly reduces the debug and validation times associated with porting legacy source code to an execution environment with opposite byte order. The BEC does not find the specific byte order dependencies, but instead enables the application to execute with the same byte order semantics that it produced. By maintaining the assumptions of the existing and proven execution environment, latent bugs that would otherwise materialize do not require debugging.

This article detailed the evolution and design of the bi-endian technology, extensions to aid in its use, porting process, and performance optimizations. A performance assessment shows the overhead of employing the BEC as ranging from 5.6 percent to 13.8 percent on a range of benchmark programs. Future research directions include enabling full use of C++ features such as operator overloading with bi-endian types.

Acknowledgements

Our thanks go to Matt Adiletta, Bob Kushlis, and Jack Johnson for early work on the Bi-Endian Compiler. We thank Azita Refah, Peter Horn, Michael Rice, Kittur Ganesh, Damion Desai, Greg Anderson, Joe Wolf, and Kevin J. Smith for their contributions to this project spanning multiple years.

References

[1] Adiga, H. Writing Endian-Independent Code in C.

[2] Adiletta, M., Wilkinson, H., and Kushlis R., "Method and apparatus for I mplementing a bi-endian capable compiler," U.S. Patent 7 552 427, June 22, 2006.

[3] Blanc, B. and Marraoui, B. Endianness or Where is Byte 0. Retrieved. Dec. 21, 2008 [PDF].

[4] Cohen, D. 1981. On Holy Wars and a Plea for Peace. IEEE Computer 14, 10 (Oct. 1981), 48-54.

[5] Domeika, M. 2008. Software Development for Embedded Multi-core Systems. Elsevier, 93-99.

[6] Domeika, M., Loenko, M., Ozhdikhin, P., Brevnov, E., 2011. Bi-Endian Compiler: A Robust and High Performance Approach for Migrating Byte Order Sensitive Applications. Worldcomp '11: International Conference on Embedded Systems and Applications, 55-60.

[7] James, D. 1990. Multiplexed Buses: the Endian Wars Continue. IEEE Micro 10, 3 (Jun. 1990), 9-21.

[8] Matassa, L., Endianness Whitepaper, Intel Software Network.

[9] Souloglou, J., Rawsthorne, A., "Program code conversion," U.S. Patent 7 421 686, Sept. 2, 2008.

[10] Understanding Big and Little Endian Byte Order. Better Explained.

The benchmark results are not official results because the source code was modified to execute with big-endian semantics. See http://software.intel.com/en-us/articles/optimization-notice for more information regarding performance and optimization choices in Intel software products.


Evgueni Brevnov, Max Domeika, Mikhail Loenko, Pavel Ozhdikhin, Xinan Tang, and Hugh Wilkinson are engineers or architects at Intel Corp. This article is adapted from material in the Intel Technology Journal, Volume 16 Issue 1, and parts of it are copyright Intel Corporation 2011-2012.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video