Measuring Periodic Task Scheduling

By Cort Dougan, February 01, 2004

High-performance real-time applications absolutely demand predictable response times. Cort presents a C program that measures periodic task-scheduling jitter in Linux.


In the world of real-time systems, "jitter" is the variation in latency that makes response times unpredictable. Of course, high-performance real time with worst-case jitter in the neighborhood of 10 μs is perfectly compatible with a convenient and well-specified API, but you still need to keep track of it. In this article, I present a C program (Listing 1) that measures periodic task-scheduling jitter under Linux and RTLinuxPro. I also examine why simply measuring periodic jitter characterizes the determinism of a real-time system for many real-time applications.

Actually, the application I present compiles and runs under Linux, BSD, and any other UNIX-like operating system that supports POSIX, including RTCore, the hard real-time kernel developed by FSM Labs (the company I work for). Compiling and executing the code in these different environments illustrates the differences in the determinism of the underlying operating system. Consequently, with this program you can gather concrete information about the operating system's limitations for specific purposes.

I've run the program under a stock Linux 2.4.19 on a dual-processor Pentium 4 running at 2.2 GHz. Hyper-threading was enabled, so the system was effectively running with four CPUs. Since real-time performance is usually very sensitive to small changes in hardware, this platform may not represent all hardware in this class. The program creates one thread for each available CPU. Under RTCore, these threads are pinned so that each runs on a different CPU. Since this version of Linux has no mechanism for assigning individual threads to different processors, the threads there are allowed to run on any processor that Linux schedules them on. Each of these threads runs at 1 kHz (a period of 1 ms) and computes the difference between the time it was scheduled to wake up and the time it actually woke up. This value is usually referred to as "periodic task-scheduling jitter." The worst (largest) value observed for each thread is stored in the array worst[], and a low-priority thread prints these values.
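To make the measurement concrete, here is a minimal single-threaded sketch of one such periodic loop. It is not Listing 1: the names ts_add_ns, ts_diff_ns, and worst_jitter_ns are hypothetical, and it assumes a POSIX system that provides clock_nanosleep() with CLOCK_MONOTONIC. The program described above would run one loop like this per CPU and keep each thread's result in worst[].

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for clock_gettime and clock_nanosleep */
#endif
#include <assert.h>
#include <errno.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Return t advanced by ns nanoseconds (0 <= ns < one second). */
struct timespec ts_add_ns(struct timespec t, long ns)
{
    assert(ns >= 0 && ns < NSEC_PER_SEC);
    t.tv_nsec += ns;
    if (t.tv_nsec >= NSEC_PER_SEC) {
        t.tv_nsec -= NSEC_PER_SEC;
        t.tv_sec += 1;
    }
    return t;
}

/* Return a - b in nanoseconds. */
long long ts_diff_ns(struct timespec a, struct timespec b)
{
    return (long long)(a.tv_sec - b.tv_sec) * NSEC_PER_SEC
         + (a.tv_nsec - b.tv_nsec);
}

/*
 * Sleep until each absolute deadline, then record how late the wakeup was.
 * Returns the worst lateness observed in nanoseconds, or -1 on error.
 */
long long worst_jitter_ns(long period_ns, int iterations)
{
    struct timespec next, now;
    long long worst = 0;

    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return -1;

    for (int i = 0; i < iterations; i++) {
        int rc;

        /* Absolute deadlines keep errors from accumulating across periods. */
        next = ts_add_ns(next, period_ns);
        do {
            rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        } while (rc == EINTR); /* retry if a signal interrupted the sleep */
        if (rc != 0)
            return -1;

        if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;

        /* Jitter for this period: actual wakeup time minus scheduled time. */
        long long late = ts_diff_ns(now, next);
        if (late > worst)
            worst = late;
    }
    return worst;
}
```

Calling worst_jitter_ns(1000000L, 1000) would run the loop at 1 kHz for about one second and return the largest lateness observed, in nanoseconds.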

This worst-case measured value is a characteristic of the underlying operating system running on a specific hardware platform. The total delay is the sum of several delays that occur before the thread runs. The first delay is caused by the hardware itself. When an interrupt (timer interrupt or otherwise) is asserted, there is a finite (sometimes large) delay before the processor begins executing the first instruction of the interrupt handler. This hardware-induced delay can be made worse by an operating system or application that disables interrupts for prolonged periods of time. The lower bound for the worst-case hardware-induced delay is the longest period of time that the operating system or any running application disables interrupts.

The next component of the delay is caused by the execution of low-level interrupt handlers. These generally do things such as save processor state, disable interrupts, acknowledge the pending interrupt, and transfer control to a higher level handler specifically for this interrupt. The time to execute this code is often dominated by instruction and data accesses that miss in the Translation Lookaside Buffer (TLB) and cache. If the system is under heavy load with a number of applications running, all doing many memory accesses, the pressure on the TLB and cache is very heavy. This tends to evict the entries used by the low-level interrupt handlers, so they must be reloaded on each interrupt. This slows the execution of the interrupt handlers, of course.

Once the low-level interrupt handler has transferred control to the higher level interrupt handler, a scheduling event is generated. This causes the scheduler to pick the appropriate thread. This selection process is not free. The scheduler may have to sort through a very large list of runnable threads and compare relative priorities as well as other factors when selecting the next thread to run. In addition, it requires that more data and instructions be loaded into the cache, as well as more TLB entries. This is sometimes called "scheduling delay." Once the thread to run has been selected, the scheduler must restore its state and resume its execution. This process requires that the state of the last thread to execute be saved and the state of the next thread to run be restored. This is a fairly expensive operation and, since all registers must be saved and restored, it generates a great deal of memory traffic. Once the old thread state has been saved, the processor must wait for all operations to complete so that any exceptions that could have occurred from the previous thread have already happened.
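The selection cost is easy to see in a toy model. The sketch below is a hypothetical linear scan over a run queue, not the scheduler of Linux or RTCore; struct task, its fields, and pick_next are names I made up for illustration, and I assume that a larger priority value means a more urgent thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical run-queue entry; a real kernel tracks far more state. */
struct task {
    int id;
    int priority; /* assumption: larger value means more urgent */
    int runnable; /* nonzero if the task is ready to run */
};

/*
 * Return the index of the highest-priority runnable task, or -1 if no
 * task is runnable. Ties go to the earliest entry. Every entry is
 * examined, so the cost grows linearly with the length of the queue.
 */
int pick_next(const struct task *queue, size_t n)
{
    int best = -1;

    assert(queue != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!queue[i].runnable)
            continue;
        if (best < 0 || queue[i].priority > queue[best].priority)
            best = (int)i;
    }
    return best;
}
```

Each entry the scan touches is another potential cache or TLB miss, which is why a long run queue on a loaded system adds to the scheduling delay.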

After all this has completed, the thread can finally begin executing. The steps just described all contribute to the final delay, referred to as "periodic task-scheduling jitter."

Significance of the Test

The program reports the "worst-case observed periodic scheduling jitter," a value useful for giving an empirical measure of how deterministic a system is. Experimental validation of worst-case jitter gives you confidence that operations complete correctly even in a worst-case situation, at least for the specific hardware configuration and load that you tested.

For example, assume you have a haptic control system that requires that an input device be polled for its position at 100 Hz and this position read requires anywhere from 24.2 μs to 1.124 ms to complete. In fact, this is typically how an analog joystick is read. Assume, also, that this 100-Hz read is necessary to maintain synchronous operation with physical or visual cues fed back to a system operator. If the read of the position is too slow or inaccurate, then the cues being presented to users may be inaccurate and the whole system can fail. This is often the case in targeting or tracking systems with visual cues and with physical (force feedback) cues in robotic control systems.

Imagine that this system controls a robot arm that provides feedback by exerting force back on the operator through the joystick whenever the arm encounters an object or lifts a certain weight. If that feedback is not timely and tightly coupled to the position read of the joystick, then operator-induced oscillation can result. If operators do not get timely feedback on their own actions, then the whole system may not work properly.

In this system, I have a 100-Hz rate giving me a period of 10 ms. In the worst case, the position measurement can take 1.124 ms, so the system can tolerate a worst-case periodic scheduling jitter of 10 ms - 1.124 ms = 8.876 ms before one position read overlaps the following one.
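The same arithmetic can be written as a small helper. This is only a sketch; period_us and jitter_budget_us are hypothetical names, and all times are in microseconds.

```c
#include <assert.h>

/* Period in microseconds of a task that runs rate_hz times per second. */
long period_us(long rate_hz)
{
    assert(rate_hz > 0);
    return 1000000L / rate_hz;
}

/*
 * Largest wakeup delay, in microseconds, that still lets a read of
 * worst_read_us finish before the next period begins.
 */
long jitter_budget_us(long rate_hz, long worst_read_us)
{
    return period_us(rate_hz) - worst_read_us;
}
```

For the joystick example, jitter_budget_us(100, 1124) returns 8876, which is the 8.876 ms worked out above.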

If the worst-case jitter is less than 8.876 ms, then I'm assured that I will get my 100-Hz measurement rate — but it won't give me uniform measurements. One read may end up completing 8.876 ms later than it should, but the following one may occur right on time. The read shows that the input device changed by a certain amount in 10 ms but, in reality, the change occurred in far less time since the measurements were taken nearly right after one another.
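A short sketch shows how large this distortion can be. The function names are hypothetical, and I assume both reads take the same time to complete, so the spacing between them depends only on how late each one started.

```c
#include <assert.h>

/*
 * Actual time, in microseconds, between two consecutive reads, given
 * the nominal period and how late each read was.
 */
long actual_interval_us(long period_us, long late_first_us, long late_second_us)
{
    assert(period_us > 0 && late_first_us >= 0 && late_second_us >= 0);
    return period_us + late_second_us - late_first_us;
}

/*
 * Ratio of the rate of change computed with the nominal period to the
 * true rate of change over the actual interval.
 */
double rate_scale_error(long period_us, long actual_us)
{
    assert(period_us > 0 && actual_us > 0);
    return (double)actual_us / (double)period_us;
}
```

With a 10-ms period, a first read that is 8.876 ms late, and a second read that is on time, actual_interval_us(10000, 8876, 0) gives 1124 μs. Software that assumes the nominal 10 ms would then report only about 11 percent of the device's true rate of change.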

If I decide that I can tolerate only a 25 percent error, then I would need the worst-case scheduling jitter to be less than 2.219 ms (25 percent of 8.876 ms). If I am able to measure less than this value on my system during heavy load, then I can be reasonably sure that my system will be able to measure the position with that amount of accuracy. Looking at this another way, I can also use the measured value to tell what the maximum update frequency can be given the performance of my operating system.
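Both calculations can be sketched in a few lines. The maximum-rate formula is my own reading of "looking at this another way": it assumes the same rule as before (jitter may use at most a fixed fraction of the period minus the worst-case read time) and solves it for the period. The function names are hypothetical, and times are in microseconds.

```c
#include <assert.h>

/* Jitter allowed when only `fraction` of the full budget may be used. */
double tolerated_jitter_us(double budget_us, double fraction)
{
    assert(fraction > 0.0 && fraction <= 1.0);
    return budget_us * fraction;
}

/*
 * Highest update rate, in Hz, for which the measured jitter stays within
 * `fraction` of (period - worst read time). This solves
 *     jitter <= fraction * (period - read)
 * for the shortest acceptable period.
 */
double max_rate_hz(double measured_jitter_us, double worst_read_us,
                   double fraction)
{
    assert(fraction > 0.0 && fraction <= 1.0);
    assert(measured_jitter_us >= 0.0 && worst_read_us >= 0.0);

    double min_period_us = measured_jitter_us / fraction + worst_read_us;
    assert(min_period_us > 0.0);
    return 1e6 / min_period_us;
}
```

tolerated_jitter_us(8876.0, 0.25) gives 2219 μs, and max_rate_hz(2219.0, 1124.0, 0.25) gives back the original 100 Hz. A smaller measured jitter lets the same joystick be polled faster, up to the limit set by the 1.124-ms read itself.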

The Numbers

I ran the tests on each machine for 120 hours. During that time, each machine was put under heavy load. Twelve copies of find / were run, sending their output to /dev/null to generate disk interrupts. Five copies at a time of dd if=/dev/hda of=/dev/null were run to generate disk activity, staggered at 30-second intervals so that they did not simply read from the buffer cache. Four outgoing ping -f instances were run on a 3Com 905B 100-Mbps Ethernet card targeting a remote machine, along with four incoming streams of ping -f from the same remote machine.

The number of interrupts per second was collected with vmstat each second, and the arithmetic mean was 19,965.4 interrupts per second during the whole run. The total interrupt count during the run was 8.62505×10⁹. This is an important figure since it shows that the operating system was performing other work at a reasonable rate while taking these measurements. Getting extremely good performance out of an idle system is not generally difficult, and not very useful because the system needs to perform work!
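As a quick consistency check, the mean rate and the run length should reproduce the total. The helper below is a hypothetical sketch of that arithmetic, nothing more.

```c
#include <assert.h>

/*
 * Expected number of events for a mean rate (events per second)
 * sustained over a run of the given length in hours.
 */
double expected_total(double mean_per_sec, double hours)
{
    assert(mean_per_sec >= 0.0 && hours >= 0.0);
    return mean_per_sec * hours * 3600.0;
}
```

A 120-hour run is 432,000 seconds, so expected_total(19965.4, 120.0) comes to roughly 8.62505×10⁹, in agreement with the reported interrupt count.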

After the tests ran, the stock Linux 2.4.19 run showed a worst-case scheduling jitter of 862 ms (see Figure 1), while the RTCore run showed a worst-case scheduling jitter of 47.3 μs (see Figure 2).

What is important is not what level of precision you need, but that you do need a guarantee of a certain level of precision. Whether your real-time application requires 20 μs worst-case jitter or only needs 200 ms, you still have a requirement — and a requirement means you need a guarantee. Even if that guarantee exceeds your requirement, you still know that you will meet your deadline every single time. If your application depends on meeting its deadlines, it's best to make sure that the operating system you choose will let it.

Measurements of Other Systems

If you have a candidate operating system on some candidate hardware, it is useful to run this program. Real-time performance varies wildly from one hardware platform to another, even with the same operating system.

I include the test program and a makefile (Listing 2) to build it so that you can run it. I welcome the opportunity to hear about your experiences when you run this program on other systems.

Even modifications to this program to take advantage of advanced features are interesting. I left out many RTCore-specific optimizations so it would be as portable as possible, but when comparing best-of-breed applications on different systems, it would be useful to see what kind of results would come out of it. For example, a three-line change to optimize RTCore performance results in a worst-case periodic scheduling jitter of 13.2 μs on the same hardware in a 120-hour run under the same load.

Cort Dougan is director of engineering and cofounder of FSMLabs. He can be contacted at [email protected].
