Dr. Dobb's | Building a Test Harness for RTOS

Building a Test Harness for RTOS

"Sparky," the test system for RTOS that Cort describes here, does a build of all software components and tools and runs a set of measurement, specification, and regression tests.

May 16, 2008
URL:http://www.drdobbs.com/global-developer/building-a-test-harness-for-rtos/207800603

Cort is Director of Engineering at FSMLabs. He can be contacted at [email protected].

Five years ago at FSMLabs, our real-time operating-system software had grown to support so many features and platforms that it was beyond the scope of our small technical staff to fully test and still develop the system at a reasonable pace. The dilemma we faced was that on one hand, users of real-time software expect and require immense amounts of testing and validation before seeing the software. On the other hand, we wanted to maintain our relatively small technical staff of highly skilled and focused developers. Our size let us put customers directly in contact with the people who write the code, and that was something we didn't want to change. Consequently, we couldn't assign a large group of test engineers to the daily execution of a test system, nor could we employ any of the available large and inflexible test systems.

We searched for a system that would fit our requirements, but found nothing of use. The general Linux/BSD test community is primitive, relying for the most part on "many users who act as beta testers." The Linux Test Project (LTP) doesn't include tests for the latest kernel, offering superficial and inconsistent tests. The best you can find at the Open Source Development Labs (OSDL) is a listing of "uptimes" for platforms from the OSDL test site. And while there are testing tools and harnesses for general use, they're not designed for operating system work. With this in mind, we built a test infrastructure that did not require great effort to develop or maintain.

What the System Does

In its current form, our test system does a build of all software components and tools, boots each computer type we support, and runs a set of measurement, specification, and regression tests. It then reports all of this in a web page, packaging the compiled and assembled pieces on a remote server where we can master CDs, provide automated downloads, or access copies ourselves.

First of all, we needed our build system to do what most test systems do—build the system and make sure no one had broken the build in the past 24 hours. Linux-based builds are run on the build "master" server—"Sparky"—which is a Linux host. NetBSD, FreeBSD, and OpenBSD builds are started on remote hosts dedicated specifically for those builds. This lets us run many builds in parallel, but report all the results and data on a single host (Sparky). As the system grew, we could add compile capacity to it with new hosts.

This process starts with a complete build of our toolchain—binutils, GCC, glibc, and all the target filesystem utilities (used for embedded systems and NFS root filesystem). Everything is built from scratch to ensure that the entire build process—which is all implemented in shell scripts—works. Because we target x86, x86 64-bit, PowerPC 6xx (including altivec support), PowerPC 4xx, PowerPC 8xx, PowerPC e500, FRV, MIPS, and ARM/XScale, we end up building an entire toolchain and root filesystem for a number of architectures.

The system then builds the Linux/BSD kernels (we have about 25 different source trees that need to be built), RTCore (the real-time kernel), and all real-time drivers and add-ons to RTCore (real-time networking, memory protection, interfaces, and so on). While the builds run, they show "in progress" on the web page. Once complete, they display success/failure and are color-coded appropriately: blue in progress, red failed, and green success.

When the entire build test for a given architecture is complete, the Linux, NetBSD, or FreeBSD kernel is copied to the tftp boot file location on our server, the root filesystem for the tests is copied to the NFS server, and the build for the next architecture/configuration is started. While the next build continues, the current tests commence.

Every test starts by booting a given board that has been put on our test rack. We centralize this information in the board.sh shell script, which lets each platform be referred to by a specific board name or platform—"Opteron 64-bit SMP" or just "host114." Board.sh is called by the test system to start logging the serial console output and reset power on the board. Once this stage is complete, we know the compiler tools have produced a working and booting Linux/BSD kernel, NFS root works, and compiled healthy user utilities.

The test system then runs some standard tests that all platforms must pass—about 350 in all. They test for the presence of previously fixed bugs, ensure the system conforms to our software specification documents, and test the general health and functioning of the system. The tests run a total of 15 iterations. The next set of standard tests is measurement tests (see devnet.developerpipeline.com/documents/s=9854/q=1/cuj0404dougan/0404dougan.htm) that verify that performance is within expected tolerances for each platform and ensure that we have a consistent source of benchmarks. We often use those to verify that feature creep has not affected system performance—which lets us study long-term trends in our system months or years later.

Because not every test can be run on every platform and sometimes the tests require multiple hosts, we encode that information in the shell script itself; for example, if the platform ppc_gemini is the only one that can run this test, yet two hosts are required. Figure 1 illustrates the results during a build with some failures, some successes, and some in-progress.

[Click image to view at full size]

Figure 1: Status web page (partial).

Hardware and Technology for Testing

Again, our software is based on the operating system RTCore, which runs the bare hardware and lets other nonreal-time operating systems (what we call "general-purpose operating systems" or GPOS) run when no real-time task needs to run. This way, our software is an operating system itself but runs cooperatively with other operating systems—currently Linux, NetBSD, and FreeBSD.

We needed to test RTCore when interacting not only with each of these operating systems but when running on Intel x86, four distinct classes of PowerPC, StrongARM, XScale, ARM7, ARM9, Alpha, and MIPS. These processors range from four-processor VME-bus-based systems used for industrial and military applications down to MMU-less ARM7 processors that run for days on battery when tracking the movements of mountain lions in the Florida Everglades.

We have to push each piece of hardware to the limits of what it can be expected to provide, but not beyond. Spurious failures due to stressing the system too much would lead either to far too much intervention by us to address the problem, or to complacency by neglecting an error when the test system reported it. False errors are simply unacceptable because they would seriously damage the trust we have in the test system.

This required that we compile a detailed list of our expectations for each piece of hardware, then test against that. That's no small task, but is extremely important. When starting on a project like this, care should be taken to ensure that this is done the first time, and continued, with all the diligence necessary.

The test infrastructure hardware itself was made up of standard parts available from the Internet. At the heart of our system are X10 power control modules. These devices connect to a host computer via RS232, turning "receiver" X10 modules on/off. Normally, these are used to turn appliances on/off but we used them to control computers.

To log events in our system, we use serial (RS232) consoles for everything. That means we have about 100+ serial lines that we have to manage and log data on. We started off with inexpensive USB 8-way RS232 devices, but eventually switched to Cyclades 32-way serial devices.

Tracking tests, bug fixes, and changes to the system with something this large and complex isn't easy. If your revision-control software isn't up to the task, then it's even harder. We use BitKeeper to manage our source code, test results, and other data. It's an easy-to-use tool that lets us script things beautifully.

First Implementation and Some Mistakes

The first implementation of the test harness was crude because of time and resource constraints. The original machine that ran these tests was built from spare parts and had no case. (The name "Sparky" comes from our attempts to power-on the board without an actual on-off switch.)

Still, the value of the system was proven within a few days. We found compile errors we didn't know existed because we never compiled with those options enabled. We quickly added the ability to do runtime tests and expanded to what I've described here. We needed a simple and easy way to report the results of the run. We experimented with several methods, including e-mail reports. However, e-mail was inconvenient, got mixed up with spam, and wasn't workable. We ended up using a web-page system so that we could easily access this information and catalog it in a database.

Testing Is Important

We were careful not to force all testing to be automated. Testing can be dull, methodical, and boring, but we did not want to lose the idea that testing is a job that everyone does. A serious concern with a system like this is that people become auditors of the system, rather than driving the tests. We avoided this and improved our testing by being careful—and we actually write more tests now that we know they will get run regularly and attended to, instead of rarely run and never maintained.

In effect, we "outsourced" our boring and monotonous testing to a set of machines, closing the loop between writing, testing, and delivering. We installed a policy early on in this effort that every new feature or bug fix includes a test in the nightly system. We have not had a recurrence of a bug or "false fix" since then.

The important part of this was to not be draconian in our enforcement. We had a good culture of testing and wanted to improve it. Forcing people to do extra work without a good reason would have likely destroyed that. Instead, what people began to see is that they were able to move on to other projects more quickly when they did not have to return to an old bug and fix it over and over again.

Programmers also like to know that once they write something, they can later blame someone else for breaking it. With proof, that's much easier. This led to the natural evolution of a development cycle that is now codified in our development process:

Write documentation.
Write a test.
Write an example.
Write the code.
Make sure the example and test work properly.

We were also able to replicate this test system in other places. Our office in India was able to recreate a smaller test setup within a few days with few instructions. One of the unforeseen advantages of this system is that remote employees could use what we had here—all it takes is a remote login. When we need a given developer (in another office) to test some software on hardware that they do not have (an embedded board of a certain type, for example), it would have required us to purchase another board or send the actual board to them. Either way, it would have taken days to get this done for perhaps a few hours of work. Now they can log into our system remotely, use the console, reset the board, and work on it directly.

Having a history of our tests helped us in tracking trends. We were able to gather and display months of history easily. It was important to see performance trends in our software. Was our performance increasing or decreasing? It was, in fact, useful in determining whether our testing was getting better: It actually tested the test system itself!

This also served as a useful tool in testing newly implemented engineering processes. Did they help, hurt, or do nothing? Software test systems aren't only required to test software—they can test everything related that ends up as software.

We use this on a regular basis for performance predictions, too. How much of a performance improvement can we expect in the first three months of supporting a new platform? How long does it take for a new feature or newly supported platform to stabilize and become reliable?

Why Did It Work So Well For Us?

This system worked well for us. Our small, tightly focused group quickly saw the benefit of the system, embraced it, and made it more useful. Without that, no matter how technically good the system was, it would have failed.

We internalized the idea of the test system. Rather than seeing it as something that happens after software is written, it has become part of our software design and development process. When we first discuss features or components of our system, we discuss how it will be tested and how it will be folded into our test system. Our software requirement documents include comments on how features can be tested properly and completely in an automated fashion. If something is likely to be too difficult to test, it gets redesigned—testing has become that important to our process.

We conservatively estimate that the test system saves us three man-days every workday. Consequently, we do not see maintenance of it as a chore or an expenditure. We immediately see the benefit from what we've done and are immediately rewarded. In our view, time and money are saved. That was valuable in getting the project started. We don't have a large IT or testing staff to develop the system, so we have to effectively steal time and energy away from other (sometimes critical) projects. This was made easier by making the payoff immediate rather than needing to make the argument that an investment was being made and payoff would come eventually.

Paradoxically, if you can show engineers how you can save them work and then actually deliver, they are willing to put great effort into projects like this. This let the project snowball until it quickly became a critical component in how we develop software and plan engineering activities. All this was realized in hindsight, of course. At the time, we didn't know that this would pay off in such a way or that it would be so well received. It was initially an experiment.

Without the automated test system running all the time, our programmers would not have had the time to perform some of these "by-hand" tests. Problems with usability and documentation were discovered that wouldn't have been found otherwise. The end result is that our system improved all aspects of our testing and developers were able to spend time on other tasks.

Editor's Note

In 2007, FSMLabs sold RTLinux and RTCore to WindRiver, but licensed back rights for use in the enterprise field. FSMLabs continues to use and improve Sparky.