Embedded Development Compilers


Dr. Dobb's Journal August 1998: Embedded Systems

Don is director of compiler products for Microtec and can be reached at [email protected]. Cesar is Microtec's PowerPC compilers project manager and can be reached at [email protected].


Compilers designed for embedded application development are more than just cross compilers. Their feature sets are tailored to the development of ROMable, reentrant, and compact high-performance applications. These compilers typically offer more control of storage allocation, data alignment, structure layout, and size/performance trade-offs.

In this article, we'll examine the differences between native and embedded development systems. In the process, we'll discuss the unique features of embedded development compilers that make them different from native compilers.

Embedded Systems

Most of us have on occasion used a screwdriver to hammer in a nail. While we recognize that it isn't the most efficient tool in this case, it may be more readily available and it manages to do the job. However, we seek out a hammer when the task grows beyond a few nails. This is also true in embedded application development. You can get away with using a native compiler -- assuming you can match it to your system architecture -- but you quickly discover its shortcomings when developing large, complex, or deeply embedded applications.

To understand the differences between embedded and native compilers, you need to understand the differences between the systems you are designing. The term "embedded" has come to mean many things. Obviously, the microcontroller at the heart of an antilock braking system is embedded. But what about the Windows PC mounted inside a wide-bed plotter? While the latter configuration could be argued as embedded, the development needs do not match the typical scenario. Classic embedded systems generally:

  • Are self-contained and self-starting.
  • Are designed and tailored to perform a specific task.
  • Have limited resources, especially when it comes to program storage.
  • Are remotely deployed, making hardware and even software upgrades difficult. Many are completely isolated from any external interfaces.

Sections and Memory Mapping

Embedded developers usually need control over individual data types. Embedded compilers tend to separate code, strings, constants, literals, and data (initialized or uninitialized) into individual sections. This simplifies the load ordering and locating of these data types within the system's memory map. Most compilers will also assist with even further separation by allowing you to locate individual data types from different modules in different locations. You can even specify new section names for specific data within a module.
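To make this concrete, here is a sketch using GCC-style section attributes (the exact spelling -- a pragma, a keyword, or an attribute -- varies from toolkit to toolkit, and the section names here are hypothetical):

/* Route one table into a hypothetical nonvolatile section and one
   into a named read-only section; everything else defaults. */
__attribute__((section(".nvram")))
int config_table[16];

__attribute__((section(".rodata.tables")))
const short sine_table[4] = { 0, 7071, 10000, 7071 };   /* stays in ROM */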

Who needs all this? Embedded systems can incorporate many different types of memory. Consider a system that executes its main program from ROM. It also contains initialized and uninitialized data that will reside in RAM, and some configuration or setup tables stored in nonvolatile memory (NVRAM or battery backed-up CMOS RAM). Even if the system was designed so that memory appeared as one contiguous address block, your code must fit in ROM and the data in RAM. This means you must be able to place your various data in the proper locations. This is obviously easier to do if you can categorize it.
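A memory map like this is typically described to the linker directly. The following sketch uses GNU-ld-style script syntax purely as an illustration (embedded linkers of this class each have their own command language); the regions, origins, and lengths are hypothetical:

MEMORY
{
    rom   (rx) : ORIGIN = 0x00000000, LENGTH = 512K
    ram   (rw) : ORIGIN = 0x20000000, LENGTH = 128K
    nvram (rw) : ORIGIN = 0x40000000, LENGTH = 8K
}
SECTIONS
{
    .text  : { *(.text) *(.rodata) } > rom
    .data  : { *(.data) } > ram AT > rom   /* runs in RAM, loads in ROM */
    .bss   : { *(.bss) } > ram
    .nvram : { *(.nvram) } > nvram
}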

System Startup Considerations

This same ROM-based system has many startup implications that native application developers can safely ignore. C and C++ require initialization for global data and allow user-specified initializers. In a ROM-based system, everything with an initial value (strings, literals, constants, and initialized data) must be placed in ROM. Most of this data is read-only and can remain in ROM. Initialized data is usually the exception, as it must end up in RAM if you wish to change its initial value.

Moving initialized data into RAM is usually performed by startup code that moves this data to its final destination. However, some work must be done by the linker to ensure that references to this data are given the final run-time address location in RAM -- not the link-time or ROM location. The linker must also provide enough information to the startup code to find and copy this data to the correct location.
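In C terms, the startup copy amounts to something like the following sketch, assuming the linker exports hypothetical symbols marking the ROM image and the RAM home of the initialized data (and the bounds of the uninitialized data, which must be zeroed):

extern char __data_rom[], __data_start[], __data_end[];
extern char __bss_start[], __bss_end[];

void copy_data(void)
{
    char *src = __data_rom;
    char *dst = __data_start;
    while (dst < __data_end)            /* copy ROM image to RAM home */
        *dst++ = *src++;
    for (dst = __bss_start; dst < __bss_end; dst++)
        *dst = 0;                       /* zero uninitialized data */
}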

Additionally, the startup code must initialize the stack and heap pointers, and must know the location and size of the heap. Without the luxury of virtual memory, you have to allocate an adequate portion of the memory resource to serve as heap space. If you allocate too much, resources are wasted; too little, and the program fails at run time.

Even beyond the language requirements, a ROM-based system needs its hardware and software configuration initialized properly before it can run an application. This is true even for workstations. In fact, the boot code for a workstation is similar to the code to start an embedded system. You could even say a workstation is an embedded system at this level. It really isn't different from other embedded systems until it starts to load an operating system.

The aforementioned section-renaming capability can help with the tasks of initializing and locating interrupt vector tables. Other features such as data and bitfield alignment control make it possible to initialize I/O devices in C/C++.
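For example, a memory-mapped device can be described with a structure laid over its register block. This is only a sketch -- the device, its register layout, and the base address are all hypothetical, and packed is the nonstandard extension discussed below:

packed struct uart_regs {
    unsigned char  status;
    unsigned char  control;
    unsigned short baud_div;
};
#define UART ((volatile packed struct uart_regs *)0xFFFF8000)
 ...
UART->control = 0x01;   /* single-byte write to the control register */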

Data Alignment and Access

Data alignment options are commonly implemented in structure layouts. This is where users expect more control over mapping, since the language rules tie allocation order to declaration order.

In the native world, data is aligned to simplify access by the supporting hardware, thereby maximizing performance. Additionally, standards committees define interfaces that enable the sharing of binary objects within a system. These application binary interfaces (ABIs) regulate such things as object module format, register use and parameter passing, and data alignment.

ABIs also exist in the embedded world, often in modified form to reduce memory requirements. Still, embedded designers want more flexibility in data alignment. They often need to match the data layout to specific hardware requirements in order to share data between different CPU architectures, to directly map to hardware registers, to preserve existing formats when migrating old software, or simply to save space.

There are a couple of extensions (on top of standard C) worth noting in Listing One. First, the packed keyword is used by some compilers to indicate that the structure should not contain any padding for alignment. Second, the use of char and short types for bitfields adds flexibility in the way these bitfields are aligned and accessed.

Many desktop-development tools do not allow the use of char and short types for bitfields. Such tools would allocate 32 bits for the structure in Listing One. Tools that do not support packed would optimize access to the bitfields and might also need to allocate a padding byte between field2 and field3 to align the short word on an even address. Most desktop toolkits would access any field of the structure in Listing One with a single-byte read/modify/write operation.

With many embedded tools, each different bitfield type declaration tells the compiler to start a new allocation unit. In this case, field3 would start in the next word (or byte if packed), even though it would fit in the current byte. This would introduce a single bit of padding between field2 and field3. Bitfield types can also be used to indicate the desired memory access. For instance, any access of field3 through field7 should use a two-byte read/modify/write cycle, even though any of these fields can be accessed in a single byte. This allows users not only to map alignment, but also to control access to meet hardware needs. Obviously, this type of access is restricted to architectures that can handle it. Some architectures simply can't perform misaligned memory accesses.

The packed keyword is also useful for nonbitfield structures because it allows portability across architectures and provides space savings. In cases where the CPU architecture can perform misaligned memory accesses (such as with Motorola's 68020 and Intel's 80x86), additional compiler effort is not required beyond the necessary data allocation to access packed structures.

For architectures that can't handle misaligned memory accesses (such as Motorola's 68000 and AMD's 29000), the compiler takes one of two approaches. Some compilers attempt to analyze individual structures and determine their access requirements; misaligned accesses will be converted from single load/store operations to a series of byte and word moves. Other compilers simply perform single-byte moves whenever operating on potentially misaligned data.
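Expressed in C, the byte-by-byte approach amounts to a sketch like this (big-endian byte order, as on the 680x0):

void store32_unaligned(unsigned char *p, unsigned long v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)(v);      /* four byte moves, any alignment */
}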

For systems that produce recoverable exceptions for misaligned accesses, another possible method for handling packed structures is to generate code as if there were no alignment issues and perform corrections with an exception handler.

Consider the structure and code in Listing Two. Code for systems that support misaligned accesses is straightforward, as shown in Listing Two(b) (using Motorola 680x0 assembly syntax). Code for systems that can't support misaligned accesses can be optimized as in Listing Two(c), or basically supported without optimization, as in Listing Two(d). Each method has its advantage. Supporting optimized code generation for targets that don't support misaligned access becomes increasingly challenging as program complexity grows. We have not seen a compiler that can handle misaligned access for every situation.

Why? Consider Listing Three, in which the code generated for func() cannot be optimal for both situations (unless it is inlined). The conservative approach would be to access members using byte moves. However, this appears to be inefficient when dealing with aligned data. Generating optimal code that uses word operations will fail when given the address of a misaligned structure. There are many cases like this. In some cases, the fact that a structure is packed or misaligned is hidden from the compiler. The result is code generation that causes exceptions at run time.

Such cases lend credibility to the run-time exception processing method. Code can be generated that is optimal for the best-case scenario, yet will still execute correctly in all other cases. The downside is the penalty paid in performance due to exception processing overhead. However, this method may actually perform better than other alternatives when faced with lots of code similar to the previous example.

Data alignment can also impact the size of global variable storage. Without some intervention, compilers typically reserve storage for globals in the order of declaration. This is more for convenience than by rule, but it can lead to a large amount of overhead to correctly align the data. The C code in Listing Four(a) gives you the assembly code in Listing Four(b) from most compilers (PowerPC syntax), which costs 16 bytes to store 11 bytes of actual data.

While 32 percent overhead is probably an extreme case, 10 to 20 percent is realistic. However, if the linker reorganized this data so that all eight-byte elements were allocated followed by four, then two, and finally one byte of data, a minimal amount of padding would be required (less than 1 percent).

Though data reorganization requires minimal overhead at link time, few linkers implement this feature. Good programming practices (ordering the data with overhead reduction in mind, for example) can also reduce requirements below 10 percent without toolkit interaction.
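For instance, simply reordering the declarations of Listing Four(a) from largest to smallest leaves padding only where the next aligned object begins:

int   i;    /* 4-byte items first */
float f;
short s;    /* then 2-byte */
char  c;    /* then 1-byte; no interior padding remains */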

Optimizations

Every industrial-strength compiler offers tradeoffs between the time it takes to compile a program and the quality of the code generated. Native compilers typically support a direct exchange of time spent compiling for time spent running: If you specify a high optimization level, compilation will take longer, but the code will be faster than it would be at a lower level of optimization.

Because of virtual memory, code size is rarely a consideration for native compilers. Any effect of optimization on code size is a side effect of the effort to increase run-time speed; sometimes less code runs faster, so code size shrinks as a consequence of the quest for speed.

On the other hand, inline expansion of function calls (inlining) speeds up execution at the cost of tolerating larger code sections. The tradeoff is especially clear in architectures in which the calling convention is expensive and is implemented by some complex instruction or instructions. This would be the case, for instance, with the VAX call instruction or with move-multiple instructions available in several architectures to save several registers at a time.

Embedded development compilers need to pay attention not only to the total code size but also to its interaction with data size, as it affects usage of ROM (frequently, a limiting factor), RAM (for run-time data), and stack (another limiting resource, as it needs to be preallocated in nonvirtual systems).

Common Optimization Issues

Both native and embedded development compilers face code explosion from inlining, and both offer ways to prevent inlining altogether, to prevent or enable it selectively (on a function-by-function basis), or to make it conditional based on the size of the compiled function.

The difference lies in the size of the threshold: Native compilers can afford to inline aggressively because they expect that any paging costs will be recouped by otherwise faster execution. Embedded development compilers, on the other hand, need to set the threshold low, as running out of ROM is catastrophic. Similar considerations apply to other straight trades of size for speed, such as loop unrolling.
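The knobs involved might look like the following sketch, which uses GCC-style spellings purely as an illustration (option names and attribute syntax vary by toolkit):

static inline int sq(int x) { return x * x; }   /* inlining candidate */

int bulky(int x) __attribute__((noinline));     /* never inline this one */

/* a global size threshold is typically a command-line option, e.g. (GCC):
     cc -O2 -finline-limit=50 file.c */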

Another area of shared concern is the placement and allocation of compiler-generated data needed to support the compiled code. Examples of this class of data include jump tables needed for the implementation of switch statements, and class hierarchy representations for run-time type identification and for exception handling.

Native compilers investigate the density of the cases of a switch (that is, how many of the values between the lowest and the highest are actually used as targets) to decide between compiling the switch as a complex series of conditionals (cascaded ifs) or as an indirect jump.

The latter alternative, when desirable, involves producing an array of labels (code pointers, for example). The parameters involved in deciding whether to use this optimization vary with the nature of the target system. A native system, with ample memory and the backing of virtual storage, may use the indirect jump approach (especially inside a loop) at the least provocation, to preserve agile control flow. This approach costs two conditionals to determine if the actual switch expression is in range, followed by one indirect jump, compared to an average of N/2 conditionals for the complex conditional approach (or log N for a binary-search implementation, not commonly seen in practice).

The same problem, addressed by an embedded development compiler, reveals additional aspects. The time costs are the same; however, the jump table has to be put in ROM (or somehow delivered to the embedded target along with the rest of the code). A big table full of null pointers to account for the unused cases is not acceptable; a large table is tolerable only if it is fairly dense (in which case it is also the optimal solution to this problem).

Embedded compilers either give you finer control over when they will rely on generating switch tables, or provide implementations with multiple indirect jumps (to select between dense subsets of the original range). For instance, assume a switch has target labels for 1, 2, 3, 4, 11, 12, 13, 14, 21, 22, 23, 24, and so on. You can generate a dense representation of the jumps and use conditionals to select a range before doing the final indirect jump.
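Hand-lowered into C, the two-level approach might look like this sketch (the handler functions are hypothetical, and a real compiler would emit the equivalent directly):

typedef void (*handler)(void);
extern void h1(void), h2(void), h3(void), h4(void);

static handler tbl_1_4[]   = { h1, h2, h3, h4 };
static handler tbl_11_14[] = { h1, h2, h3, h4 };

void dispatch(int x)
{
    if (x >= 1 && x <= 4)
        tbl_1_4[x - 1]();           /* dense table for cases 1..4 */
    else if (x >= 11 && x <= 14)
        tbl_11_14[x - 11]();        /* dense table for cases 11..14 */
    /* ...and so on for 21..24; stragglers fall to cascaded ifs */
}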

Compilers could potentially further exploit the regular structure of this example, although we are not aware of any that do. An example would be to use arithmetic to reduce the first selection; in our case, select table i if the switch expression divided by 10 yields i, then do an indirect jump to the final destination, based on the remainder. Investing in this sort of optimization is justified only if you know that enough user code will see an improvement.
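Continuing the previous sketch, the arithmetic selection would reduce the cascade of range checks to a division:

static handler *tables[] = { tbl_1_4, tbl_11_14 };

void dispatch2(int x)
{
    if (x >= 1 && x <= 14 && x % 10 >= 1 && x % 10 <= 4)
        tables[x / 10][x % 10 - 1];   /* /10 picks the table... */
}

(where the selected entry `tables[x / 10][x % 10 - 1]()` is then called: division by 10 picks the table, and the remainder indexes within it).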

Code versus Stack

Some embedded applications do not take advantage of operating systems, so developers need to take responsibility for memory placement. Lacking a mechanism to reallocate or relocate memory sections at run time and a virtual addressing scheme that can be extended as needed, these applications depend on absolute addresses.

This includes allocating a given address range for the activation-frame stack. Making a large initial allocation ensures that well-behaved programs don't run out of stack space, but could be wasteful if it forces you to add physical memory.

This requirement is addressed by two general facilities in the embedded development toolkit -- an optimization component to control the growth of the stack, and an assembler/linker/loader feature to estimate and report the stack requirements of nonrecursive code. The optimization component needs to balance two opposite goals: minimization of stack maintenance overhead versus minimization of total stack depth.

The stack maintenance overhead comes from having to lay out activation frames on function (or scope) entry, pushing them onto the stack, and then having to terminate them and pop them out of the stack at function (or scope) exit.

The semantic requirements (argument passing by copying values, and so on) could be satisfied by continuously grabbing stack space that is never reused. Of course, that doesn't work in finite memory, but it suggests that some of this work (the reclamation of stack frames) can be postponed until a convenient moment. Time-saving techniques such as combining stack-frame pops (for example, allocating one frame not just for a function but for a subset of the call graph that is likely to be called as a unit) depend on this observation.

In Listing Five, for instance, the compiler may determine that g is only called from h. So, a single frame created for h that preallocates room for t, z, w (on behalf of h), and y (on behalf of g) would do well. On entry to h, the single frame would be allocated. Upon call to g, it would find room already allocated for its local (y) without requiring any allocation overhead. On return from g, no extra overhead is incurred either. Only when returning from h do you have to spend effort popping up the stack frame.

This optimization reduces the size of the generated code and the execution time for the function. However, z and w are both allocated storage, even though any single execution of the function requires only one of them. To properly complete this optimization, you need to perform the deallocations as soon as the locals involved become dead. This excludes combining pops for functions that may not be called at all (allocate a frame only if called, deallocate it as soon as possible).

In Listing Five, it looks as if this optimization is not useful, saving a few bytes of stack here and there. Nevertheless, there are two cases to consider:

  • If z is a big object (say, a large array), then you need to decide if it is wise to allocate storage for both z and w before you need them. For example, assume that only the first call to h allocates z, while later calls require just w. Then you may expose yourself to running out of stack in one of those later calls to h.
  • In addition to the ostensible locals, you may use this mechanism to clean up temporaries created by the compiler in order to enforce the language semantics.

Since the framing overhead includes calling destructors (in C++), combining pops may not be consistent with the language semantics: postponing the stack-frame pop may result in destructors being executed in the wrong order or in the wrong lexical context.

Obviously, you wouldn't generate code that pushes and pops the stack (by subtracting and adding from and to the stack pointer register) following the nested scopes contour of the source program. Instead, you would just keep track of available stack offsets "in the compiler's head." For instance, you may allocate the same stack offset to both z and w in the previous example, confident that this won't result in a collision at run time.
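At the source level, the same sharing could be written explicitly with a union -- a sketch only, since the compiler performs the equivalent silently with overlapping stack offsets:

union { int z; int w; } slot;   /* z and w are never live together,
                                   so one stack slot serves both */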

Linker/Loader Optimizations

Virtual memory systems can afford to keep around both the inlined function calls and the body of the function -- even if there aren't any calls to the function -- for simplicity of implementation. Those pieces of dead code will effectively hide in shared libraries and will be paged out harmlessly. Embedded systems can't afford this luxury, so they require a clean-up mechanism to pick up whatever isn't needed elsewhere.

For instance, in Listing Six, the call to in_century may be eliminated by inlining, but the body of the function (because it is not static) cannot. If what is shown is all there is in that translation unit, the compiler will generate code for this function, but that code may be wasted at run time if no other translation unit calls it.

Embedded systems can't afford to hide that sort of waste under the virtual memory rug. Fortunately, they have an asset often unavailable to native development tools: control of the linking/loading mechanism.

Native compilers use the linking/loading facilities of the underlying operating system. Not doing so would mean losing access to system-supplied (or third-party) libraries and debuggers. Therefore, native systems have traditionally put up with obsolete linking technology (six-character monocase externals, for instance).

Embedded development systems provide their own linkers/loaders and can tailor them to remove inefficiencies like the one just discussed. This potential demands that the linkers do more than the traditional activities of concatenating sections and resolving and relocating cross-references.

Some embedded toolkits can perform final optimization at linking/loading time, as well as provide ways to trim down standard libraries to support the application's exact needs. For instance, an application that doesn't require floating-point support can save a large amount of code in the standard libraries if it can configure out that support.

It is likely that native systems will begin adopting optimizing linkers in the future. The overhead associated with C++ template usage, for instance, stems from the generation of unnecessary code "just in case."

Consider a class template that provides a number of standard methods. Say there is a method (perhaps a print method) that is not needed in a given application. Once the program instantiates the class template a few times, a print method exists for each instance, none of them needed. Recognizing this sort of dead code (the method is dead, even if the whole class isn't) generally requires waiting until link time. Worse, keeping the dead method may force us to keep (or to instantiate afresh) other methods on its behalf. Imagine having to keep stdio around just because of those unneeded print functions.

The asm() Pseudofunction

The asm() pseudofunction is available in most compilers and is a standard syntactic feature of C++. In embedded application development, however, it is used so extensively that the barebones asm("string") syntax needs additional functionality; the extensions discussed below are effectively required for successful embedded development.

One common extension is a mechanism for telling the compiler the extent of an asm() statement's effect on the surrounding code. Consider the effect of optimizations on a program with lots of mixed C/C++ and asm() statements. The compiler wants to limit the amount of parsing required for asm() statements, yet it may not know their impact on register or variable use, or how much code these statements create. Moreover, in a well-designed modern compiler, the front end doesn't want to know about the syntax of the assembler at all. The pessimistic approach (any asm() is potentially capable of modifying every register, stack, and heap location) is unacceptable: it produces such inefficient code that developers are forced to write the entire function, or translation unit, directly in assembler.

Another example is the interpretation of high-level identifiers inside asm() statements. Most implementations allow users to manipulate program variables by name from asm() statements. This eliminates the need to know how the variable is accessed in the program. However, not all implementations are tuned for possible side effects of this feature.

With no other use of the local variable i in Listing Seven, the compiler may optimize it away, leading to unexpected behavior. One way to prevent this would be to disable any optimizations that perform propagation or remove unused variables. This would work, but is highly undesirable. The correct alternative is for the compiler to understand the asm() syntax well enough to realize that i is used in these statements.

Another problem impacts the variable j. Without understanding variable usage in the asm() statements, the compiler may wrongly assume that the previously initialized value of j is still valid (residing in a register). Alternatively, a conservative compiler may assume that it must reload every register-based variable after any asm().
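One way such usage can be declared is through operand constraints. Here is Listing Seven recast as a sketch in GCC-style extended asm (the backquoted names in the listing are a toolkit-specific equivalent):

int j, k;

void func(void)
{
    register int i;
    j = 1;
    asm(" move.l %1,%0" : "=r"(i) : "m"(k));   /* reads k, writes i */
    asm(" move.l %1,%0" : "=m"(j) : "r"(i));   /* reads i, writes j */
    j++;    /* the compiler now knows j was rewritten by the asm() */
}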

The impact of asm() statements on optimal register allocation and branch displacements is exposed in Listing Eight (using 680x0 syntax). This example modifies the register d0 in an asm() statement. In some implementations, this will corrupt one of the previously allocated register variables, unless the compiler is aware of the use in the asm() and allocates registers accordingly.

In this case, the size of the code in the for loop is unknown (assuming multiple instructions are allowed in a single asm() statement). For architectures that implement short-displacement branching (such as the Motorola 680x0), this makes it difficult to determine what size displacement can safely be used.

These are just some of the considerations needed to make asm() more effective. Desktop implementations typically make you responsible for controlling and dealing with the side effects of asm() statements. However, most embedded compilers attempt to handle this with minimal loss of optimization. Some even allow users to specify the known side effects of an asm() statement so the compiler can generate optimal code, as sketched below.
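In the style of Listing Eight, such a declaration might look like this sketch (a GCC-style clobber list, shown as an illustration):

/* d0 is declared clobbered, so the register allocator
   keeps register variables out of it */
asm(" move.l #0x100000,d0\n movec d0,vbr" : : : "d0");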

Interrupt Handlers

Native application developers rarely need to implement interrupt handlers in their programs. Embedded developers, by contrast, often want to write interrupt handlers in C/C++. The main differences between an interrupt handler and a standard function are how they are called, how they return, and what program context is saved and restored around their activation.

Compilers generally reserve registers for global or subroutine usage. They assume that each nested function is allowed to use the subroutine registers without saving them. Interrupt functions are called out of context and, therefore, must save and restore any registers they use. Interrupt functions are also not directly called. Rather, they are entered through an exception vector table when a specific exception occurs. This process also saves specific information and requires that users return from the interrupt function in a specific fashion. Most of this is transparent and built into the architecture's exception processing model, making it easier than it sounds. However, having a method to generate these differences directly greatly simplifies developing interrupt handlers in a high-level language.

Some compilers distinguish these functions by adding an interrupt keyword to qualify the function. Some even provide options to treat interrupt functions as standard functions, making them directly callable to simplify debugging.
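The shape of such a handler might be as follows -- a sketch, since the qualifier's spelling varies (GCC accepts an attribute form on several embedded targets; other compilers use an interrupt keyword outright):

volatile int ticks;

__attribute__((interrupt))
void timer_isr(void)
{
    ticks++;   /* used registers are saved/restored automatically,
                  and the return uses the exception mechanism (RTE) */
}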

Conclusion

The choice of appropriate compiler features is constrained by the type of embedded application. The required implementation techniques are not substantially different from those required for native applications, but the emphasis is different enough to warrant different tools. You can find many of the same features and optimizations present one by one in native development systems; it is the synergy of finding them together under the roof of one toolkit that makes embedded development compilers a breed apart.

DDJ

Listing One

packed struct {
    unsigned char field1:4;
    unsigned char field2:3;
    unsigned short field3:1;
    unsigned short field4:5;
    unsigned short field5:2;
    unsigned short field6:4;
    unsigned short field7:4;
} s;
  ...
s.field5 = 3;


Listing Two

(a)
packed struct {
    char c;
    int i;
    char c1;
    short s;
} s;
 ...
s.i = 0x11223344;

(b)
move.l  #$11223344,_s+1 ; move 4 bytes from s.i at an odd address


(c)
move.b  #$11,_s+1   ; move first byte of s.i at an odd address
move.w  #$2233,_s+2 ; move next 2 bytes of s.i at an even address
move.b  #$44,_s+4   ; move last byte from s.i at an even address

(d)
move.b  #$11,_s+1   ; move first byte of s.i at an odd address
move.b  #$22,_s+2   ; move next byte of s.i at an even address
move.b  #$33,_s+3   ; move next byte of s.i at an odd address
move.b  #$44,_s+4   ; move last byte from s.i at an even address


Listing Three

void func (packed struct S *);
 ...
packed struct S {
    int i1;
} s1;
 ...
packed struct {
    char c;
    packed struct S s2;
} s3;
 ...
func (&s3.s2);
func (&s1);


Listing Four

(a)
char c;
int i;
short s;
float f;

(b)
        .sect   .bss
        .align  4
        .globl  c
c:
        .space 1
        .align  4
        .globl  i
i:
        .space 4
        .align  4
        .globl  s
s:
        .space 2
        .align  4
        .globl  f
f:
        .space 4


Listing Five

extern int f( int );
static int g( int x )
{
    int y = f( x );
    return f( y );
}
int h( int a, int b )
{
    int t;
    if (a < b) {
        int z = b - a;
        t = 2*z;
    } else {
        int w = a*a - b;
        t = w*w;
    }
    return g( t )+1;    
}


Listing Six

#include <stdio.h>

int in_century( int year ) { return year % 100; }
void print_date( int yy, int mm, int dd )
{
    printf( "%d-%d-%d", in_century( yy ), mm, dd );
}


Listing Seven

int j, k;
func() {
    register int i;
    ...
    j = 1;
    asm(" move.l `k`,`i`");
    asm(" move.l `i`,`j`");
    j++;
}


Listing Eight

func() {
    register int i, j, k, l, m;
    ...
    asm(" move.l #$100000,d0");
    asm(" movec d0,vbr");
    ...
    for (i=0; i<10; i++)
                asm(" ...");
}



Copyright © 1998, Dr. Dobb's Journal
