
Optimizing Embedded Linux

By Todd Fischer, May 01, 2002

Todd shares seven hard-won techniques to aid in the embedded Linux development process.


Todd is director of engineering and telephony for RidgeRun. He can be contacted at [email protected].

Software development for embedded Linux requires cross-development tools, the Linux kernel, device drivers for the embedded device's I/O ports and peripherals, common libraries, and a means to load the software into the embedded device. Luckily, embedded Linux distributions generally include all the tools, kernel, drivers, libraries, and loaders you need for product development to take place.

But once these pieces are in place, the real work begins and you must ask yourself questions such as:

How can I reduce the RAM and flash memory requirements?
What tools can I find to help tune the system?
What savings can I expect when tuning?

With questions such as these in mind, developers at RidgeRun (where I work) have come up with seven techniques to aid in the embedded Linux development process.

You need to decide on the appropriate kernel version before you touch the first line of code. For desktop PCs, picking the most recent release of the kernel is a safe choice. However, embedded devices may not need the latest features — and the added bulk they bring. Picking a stable kernel that supports the capabilities critical to the embedded target is an easy way to reduce the memory footprint.

A good approach is to select a kernel distribution specifically tuned for embedded devices and for the processor family you plan to use. For example, Linux distributions based on uClinux (http://www.uClinux.org/) are designed for processors that do not have memory management units (MMUs). In addition, developers have tuned uClinux to reduce the memory footprint. Table 1 shows the output of the size utility for two kernel versions (text, data, and bss are the size utility results). All of the numbers I present here are based on results obtained from an ARM9 cross-development environment.

Table 1 shows the savings from using an older kernel targeted to the embedded space compared to a more recent one. Although the uninitialized data requirements of uClinux 2.0.38 are larger, it is likely that the Linux 2.4.0 file subsystem will dynamically allocate more memory than uClinux 2.0.38, so the total RAM requirements of uClinux 2.0.38 may still be smaller.

Configure Wisely

The Linux kernel is modularized in several different ways. All processor dependencies are isolated so the kernel can be built for use on one of several different processor families. Also, the Linux kernel supports feature inclusion/exclusion based upon selections via a configuration tool such as xconfig.

To configure the kernel, run the configuration tool several times, each time removing a feature. Be sure to do a make clean and make dep between runs. Rebuild the kernel after each run to see the changes in kernel size (again using the size utility). Table 2 presents our results when using this technique. Carefully understanding the operating-system features needed in the embedded device and excluding all other features is the simplest method to reduce the Linux kernel memory footprint.
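The before-and-after comparison described above can be automated. The following is a minimal sketch, assuming the size utility prints its default Berkeley-style columns (text, data, bss, dec, hex, filename). The file names and byte counts in the sample are made up for illustration and are not measurements.

```python
def parse_size_output(output):
    """Parse Berkeley-format `size` output into {filename: {text, data, bss}}."""
    results = {}
    for line in output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a data row.
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        text, data, bss = (int(f) for f in fields[:3])
        results[fields[5]] = {"text": text, "data": data, "bss": bss}
    return results


def footprint_delta(before, after):
    """Return bytes saved per segment (positive means `after` is smaller)."""
    return {seg: before[seg] - after[seg] for seg in ("text", "data", "bss")}


# Hypothetical numbers for illustration only; these are not measurements.
SAMPLE = """
   text    data     bss     dec     hex filename
 900000  100000  200000 1200000  124f80 vmlinux.before
 800000   90000  150000 1040000   fde80 vmlinux.after
"""

sizes = parse_size_output(SAMPLE)
saved = footprint_delta(sizes["vmlinux.before"], sizes["vmlinux.after"])
print(saved)
```

In practice, you would capture the output of the cross-development size tool after each rebuild and feed it to parse_size_output in place of the sample text.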

Shrinking the Target Device Filesystem Image

The phrase "kernel file subsystem" refers to the kernel code implementation that supports the filesystem functionality. On the other hand, the term "target device filesystem image" refers to the filesystem contents built for the target device. In typical configurations, the target device filesystem contains all executables and data except for the boot loader and the kernel itself. The boot loader, kernel, and target device filesystem components are stored in the device's flash memory.

When examining flash footprint reduction, the Linux filesystem plays a unique role that doesn't match historical approaches to embedded-device software. Historically, developers designed embedded-device software to control the device in a specific fixed manner. They statically linked the application software that provided the high-level functionality for the device with the operating-system kernel; ROM (or more recently, Flash) held the resulting single executable. After powering the device, the processor's reset vector pointed to the executable, causing the single executable to run. The software in most existing embedded devices has no concept of separate executables residing in a filesystem.

Embedded devices running Linux can use the notion of a filesystem in a powerful way to reduce the size of the flash footprint and reduce the amount of RAM required during execution. This approach works because Linux uses a demand paging scheme via the processor's memory management unit (MMU). In simplistic terms, demand paging loads only the first page of a program into memory and then transfers control to the program. As the program runs, it makes references outside the page or transfers control to code residing on a different page. In either case, a page fault occurs when the MMU detects that the page containing the requested information is not loaded. The kernel handles the exception by demand loading the appropriate page (and configuring the MMU data structures appropriately). The program can continue running even when the program is not entirely loaded into RAM.

With demand paging, each page that makes up the filesystem can be compressed as the target device filesystem image is created. When a page fault occurs, the demand paging exception handler locates the compressed page in the flash memory, decompresses the page into RAM, and then allows execution to continue as before. This approach eliminates the long wait for an entire program to decompress before execution begins, and anything stored in the filesystem can be compressed. In Linux, CRAMFS supports this technology.

A master of the target device filesystem image traditionally is made on the development workstation. The mkcramfs tool compresses the master filesystem into a single compressed image on the workstation. The boot loader, compressed kernel, and compressed filesystem image are then transferred into the target device's flash memory. The boot loader starts executing on power-up, decompressing the kernel into RAM. Linux boots up and mounts the compressed filesystem. At each page fault, the appropriate page image is decompressed before being loaded into RAM. The amount of flash memory required is thus reduced.

To shrink the target device filesystem image, create the master version of the target device filesystem image on the development workstation in an fs directory. Use the du utility, as in Example 1(a), to measure the size of the target device filesystem image. Run mkcramfs and check the compressed filesystem image size, as in Example 1(b).

In short, using CRAMFS can provide a significant reduction in the amount of flash required to hold the target device filesystem image. Table 3 shows our savings.
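Before building an image, it can be useful to estimate what page-by-page compression might save. The sketch below walks a directory tree and compresses each file in page-sized chunks with zlib, which mirrors the per-page idea described above. It is only an approximation: the 4096-byte page size is an assumption, and the estimate ignores the filesystem's own metadata, so the actual mkcramfs image size will differ.

```python
import os
import zlib

PAGE_SIZE = 4096  # assumed page size; adjust to match the target


def tree_sizes(root, page_size=PAGE_SIZE):
    """Return (raw_bytes, compressed_bytes) for regular files under root.

    Each file is compressed one page at a time, so the result reflects
    per-page compression rather than whole-file compression.
    """
    raw = compressed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Count only regular files; skip symlinks and device nodes.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                while True:
                    page = f.read(page_size)
                    if not page:
                        break
                    raw += len(page)
                    compressed += len(zlib.compress(page, 9))
    return raw, compressed


if __name__ == "__main__":
    # "fs" is the master filesystem directory named in the text.
    raw, packed = tree_sizes("fs")
    print(f"uncompressed: {raw} bytes, page-compressed estimate: {packed} bytes")
```

Compressing page by page generally gives a worse ratio than compressing a whole file at once, which is the price paid for being able to decompress any single page on demand.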

Optimizing Executables

Several compiler optimization options are available. Optimization settings -O1 and -O2 optimize for performance and are used regularly. Less well known is -Os, which tells the compiler to optimize for size. -Os enables all of the -O2 optimizations that typically do not increase code size. It also performs further optimizations designed to reduce code size.

To optimize executables, rebuild the kernel several times, each time changing the compiler optimization setting (in our experience, the Linux kernel did not build correctly if there was no optimization option). Run a "make clean" between rebuilds to ensure all files recompile using the new optimization setting. Table 4 shows our results.

Since the primary focus for Linux and associated technologies is the desktop and server space, the compiler developers have not put as much effort into compiling for minimum size. In addition, compiler defects tend to appear after activating lesser used features like "compile for minimum size." You need to weigh the memory savings gained when compiling for minimum size against the problems that may arise from using less stable compiler options. Table 4 presents results using ARM9 cross-compilation. Compilers for other processor families may support a more effective "compile for minimum size" option.

Library Squeezing

Because of the nature of shared libraries, the functions in the library are included even when they may not be used by any of the applications. Some embedded devices ship with a fixed set of applications; new applications cannot be added. In these fixed capability devices, unused library functions increase the memory footprint without adding any value. Various library compression tools are available to build libraries containing only the functions and data structures required by the application set.

One such open-source tool is the Library Optimizer (http://sourceforge.net/projects/libraryopt/), which rebuilds shared libraries to contain only the object files needed to provide the functions and data structures required by executables and shared libraries in a given directory tree. It can be used to reduce filesystem image size for embedded systems.

For this example, I optimize a large shared library, libc Version 2.1.3 from the GNU glibc suite of libraries. During development, we use a version of the libc library that contains symbols used for debugging. A smaller version of the libc library can be created using the strip utility, which simply removes sections containing debugging symbol information. The stripped version still contains the full set of functions and data structures. The first two entries in Table 5 show the full libc library with and without symbols.

To determine the maximum libc optimization possible using tools such as Library Optimizer, I optimized the libc library to contain only those functions required to make the simplest.c program in Example 2(a) run.

Example 2(a) appears to be completely empty. However, you can interrogate the compiled version, a.out, to determine the list of unresolved externals; see Example 2(b). Even though simplest.c appears to be empty, the abort() and __libc_start_main() functions are required, as is the __iostdio_used data structure. Using Linux shared libraries requires an ELF shared library program interpreter library. The interpreter, ld-linux.so.2, uses functions in the libc library, so it must be included when optimizing libc.
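Listing unresolved externals is easy to script. The following is a minimal sketch, assuming the usual nm text output in which an undefined symbol appears with no address and a "U" type letter. The addresses and the defined symbols in the sample are hypothetical; only abort and __libc_start_main come from the discussion above.

```python
def undefined_symbols(nm_output):
    """Return the sorted names of undefined ("U") symbols in nm output."""
    names = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined symbols have no address, so the row is just "U name".
        if len(fields) == 2 and fields[0] == "U":
            names.append(fields[1])
    return sorted(names)


# Hypothetical nm output for illustration; the addresses are invented.
SAMPLE_NM = """
08049540 D __data_start
         U __libc_start_main
         U abort
080483d0 T main
"""

unresolved = undefined_symbols(SAMPLE_NM)
print(unresolved)
```

Running the same filter over every executable and shared library in the target filesystem gives the set of functions an optimized library must keep.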

To squeeze the library, run the Library Optimizer on the libc library on a filesystem that contains the compiled version of simplest.c and ld-linux.so.2 to create an optimized libc library. For our results, the optimized library requires 173 separate object files. Generally, each file in the glibc source contains one function. Therefore, completely resolving all externals required by simplest.c, by ld-linux.so.2, and by the functions and data structures they pull in from within libc itself takes around 173 libc functions. These functions include most of the common filesystem functions (open, read, write, and so on), string functions (strcpy, strlen, and the like), and memory copy functions (memcpy, memset, and so on). These functions are required because every program has standard in, standard out, and standard error file handles, plus associated functions called by the related filesystem functions. The size of the minimum libc is also in Table 5.

For a real-world example, we put together an ARM9-based MP3 player and web browser using the Microwindows windowing environment. Library Optimizer was again run on this real-world configuration, with the results in Table 5.

If an embedded device supports software download capabilities, what happens if the shared libraries were optimized by removing functions not used by the base set of applications? There are several possibilities:

Necessary functions are available. A new application can be run properly if the application limits the use of library functions to the reduced set supported in the optimized libraries. Using this approach requires careful understanding of which functions were discarded, and building and testing the new application in an environment containing libraries with a similar function set. Managing new application development to a nonstandard set of functions requires effort in areas that are not valued by the customer.
Application builds missing functions into the application. The application can include the missing functions. This statically links the application to the missing functions. If another new application needs one of the missing functions, it will have to also statically link in the function even if another application has the same statically linked function. Flash memory is not utilized optimally when several new applications support the same statically linked functions. However, the inefficiency is typically small if the majority of the embedded devices are not enhanced or upgraded (which is a common scenario). This approach creates difficulties with managing and integrating the missing functions and may have licensing issues as well.
Replace the functionally reduced C library. A simple approach is to include the full libc library as part of the new application download process. Using this approach appears to negate the advantage in not shipping the full libc library in the first place.

While my focus has been on removing unused libc library functions, this approach may be even more valuable removing functions from more specialized libraries. Any library with available source code can be optimized.

At first it appears that removing the unused library functions for just one application could be accomplished by statically linking the application to the library. Technically this is true. However, the library's license (the LGPL, for example) may impose different requirements for static versus dynamic linking.

Analyze Static Buffer and Array Sizes

The data and bss segments in software object files contain information about the statically allocated RAM requirements. The data segment contains the initial value for the variables and the bss segment indicates the size needed for the uninitialized variables. Examining the kernel object file makes it possible for you to identify the largest static allocations.

To analyze static buffer and array sizes, use Example 3 to display the list of data and bss segments with a size greater than or equal to 0x1000 (4096) bytes in the Linux 2.4 kernel. The command arm-linux-nm is the nm cross-development utility for the ARM processor. If the second column contains a "b" or "B," it is a bss segment. Likewise, if the second column contains a "d" or "D," it is a data segment. A lowercase segment indicator means the segment is local, and uppercase means the segment is global.
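The filtering rule above can be expressed in a few lines. This sketch assumes nm was run in a mode that prints three columns per symbol (hexadecimal size, type letter, name), as the size-sorted output of GNU nm does. The symbol names and sizes in the sample are hypothetical.

```python
THRESHOLD = 0x1000  # 4096 bytes, the cutoff used in the text


def large_static_symbols(nm_output, threshold=THRESHOLD):
    """Return (size, segment, scope, name) for large data/bss symbols."""
    rows = []
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        size_hex, kind, name = fields
        # Only "b"/"B" (bss) and "d"/"D" (data) rows are of interest.
        if kind.lower() not in ("b", "d"):
            continue
        size = int(size_hex, 16)
        if size >= threshold:
            segment = "bss" if kind.lower() == "b" else "data"
            scope = "global" if kind.isupper() else "local"
            rows.append((size, segment, scope, name))
    # Largest allocations first.
    return sorted(rows, reverse=True)


# Hypothetical symbols for illustration only.
SAMPLE_NM = """
00002000 b example_log_buf
00001000 D example_table
00000800 b small_buf
00004000 T some_function
"""

for size, segment, scope, name in large_static_symbols(SAMPLE_NM):
    print(f"{size:8d} {segment:4s} {scope:6s} {name}")
```

The text symbol and the 2048-byte buffer are dropped, leaving only the data and bss entries at or above the threshold.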

From Example 3, the largest statically allocated variables are the block driver, kernel error log buffer, and TTY subsystem. Searching the kernel header files uncovers where large static arrays are allocated. Since embedded device capabilities are typically much more limited than desktops or servers, most of these arrays can be made smaller without impacting operation; see Table 6. You can change the header files in Table 6 to lower the values used for static allocations. Again, rebuild the kernel after each source-code modification to see the changes in kernel size. Table 7 presents our results.

RAM requirements are significantly reduced by lowering the maximum number of peripherals supported. Since embedded devices have a known number of peripherals, setting the associated constants to the matching value reduces the RAM usage, thereby improving system performance and/or cost.

Analyzing Dynamic Memory Use

Linux kernel routines dynamically allocate memory from the page allocator, slab allocator, or the kernel memory allocator. You can reduce dynamic memory use by analyzing which routines call the various allocators and identifying routines consuming large amounts of dynamic memory. The largest kernel allocations are from the slab allocator.

The slab allocator manages pools of memory for specific purposes. When allocated memory is freed, the slab allocator keeps the memory associated with its pool. Only when the underlying page allocator runs low on memory does the slab allocator release free memory from the various pools. Reallocating slab-managed memory is much faster than performing all the steps necessary for allocating from general memory. This approach is used because Linux developers noticed that the various subsystems freed memory only to later allocate memory for the same purpose.

To analyze dynamic memory usage, interrogate the current state of the slab allocator via the /proc/slabinfo pseudofile in Example 4 (smaller allocations not listed). Multiplying the cache entries in use (column 2) by the cache size (column 4) produces the memory consumed by the cache. The largest four caches are all associated with the filesystem and block drivers. The four caches are dentry_cache (28928), blkdev_requests (49152), inode_cache (62592), and buffer_head (799104).
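The multiplication described above (column 2 times column 4) is simple to script. The sketch below assumes each cache line starts with the cache name, followed by the number of objects in use, the total number of objects, and the object size. The cache names and counts in the sample are hypothetical.

```python
def slab_usage(slabinfo_text):
    """Return {cache_name: bytes_in_use} from /proc/slabinfo-style text."""
    usage = {}
    for line in slabinfo_text.splitlines():
        fields = line.split()
        # Skip the version header and any line without numeric columns.
        if len(fields) < 4 or not fields[1].isdigit() or not fields[3].isdigit():
            continue
        in_use = int(fields[1])    # column 2: cache entries in use
        obj_size = int(fields[3])  # column 4: size of each entry in bytes
        usage[fields[0]] = in_use * obj_size
    return usage


# Hypothetical caches for illustration only.
SAMPLE_SLABINFO = """
slabinfo - version: 1.1
example_cache_a      100    120    128    4    4    1
example_cache_b       50     80     96    2    2    1
"""

usage = slab_usage(SAMPLE_SLABINFO)
for name, nbytes in sorted(usage.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} {nbytes:8d}")
```

On a running target, the same function can be applied to the contents of /proc/slabinfo to rank the caches by memory consumed.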

The filesystem allocates many buffers to hold copies of the information on (or headed to) the mass storage device. Buffers hold copies of the directory structure (dentry). The internal data structure (inode) keeps track of where on the mass storage device files are stored. Using these buffers reduces the number of times information is read from the mass storage device, thus improving performance. For desktop or server environments, this performance improvement can be substantial.

For many embedded devices, reading and writing files is not done as frequently. Embedded devices typically use flash for mass storage, so reading information is typically faster than with a mechanical disk type storage device. Therefore, reducing buffer cache usage appears to be a simple change to reduce the RAM footprint. For embedded devices that do a minimum of writes to the filesystem, the performance with reduced buffer caches should not be impacted.

To analyze dynamic memory usage and test this theory, measure the various cache sizes as the amount of installed RAM changes. Table 8 summarizes our findings.

The kernel automatically adjusts the buffer sizes based on the installed RAM. As an embedded device under development nears completion, you can find additional gains by manually tuning the cache sizes.

Conclusion

Linux developers typically target the desktop and server environment. However, because of the modular kernel design, you can remove or trim unneeded features to meet simpler embedded device demands. Cross-development tools make Linux a viable option for many of the microprocessors used in embedded devices. Various utility tools, including the kernel configuration tool and the cross-development tools, support kernel configuration and analysis. The information provided by these tools can guide the developer to areas of greatest memory savings. Once the system is tuned, you get all the advantages of Linux in an embedded device package.

References

A complete list of references can be found at http://www.ridgerun.com/more/.

DDJ
