Determining Processor Utilization

Performance requirements are often considered after the fact when it comes to selecting processors for embedded products.


July 01, 2002

Byron is an independent firmware developer specializing in microprocessor and DSP design and development for data acquisition, control, and Internet appliances. He can be reached at [email protected].


In an ideal world, selecting a processor for new embedded products would be based on the hardware architecture and performance requirements. In reality, we select the architecture and processor based on everything from cost, available tools, and developer familiarity, to support and even politics. Consequently, performance requirements are often considered after the fact and performance considerations are neglected until a problem arises.

Existing products can be affected by performance requirements too. Performance that's "good enough" early in a product's lifecycle may not be later on. Worse, products may be developed without considering performance requirements at all, resulting in a partially functional application. In short, ignoring performance requirements can have serious consequences.

In this article, I present two approaches — duty cycle and counter — for handling performance issues at the latter stages of product development. For the purposes of illustration, I also present an example protocol analyzer test instrument that applies these approaches. Although a real-time operating system (RTOS) is assumed, these approaches are general purpose enough that you can use them for any application running under an RTOS or superloop.

Duty cycle and counter are empirical approaches. The duty-cycle approach uses duty cycles to determine aggregate processor utilization; a duty cycle is the ratio of time on to total time. The counter approach, on the other hand, uses counters to log the time spent in tasks. Counts are calculated and compared to give processor utilization as a function of differential time, and this approach can determine both aggregate and individual task utilization. Although both approaches determine processor utilization, they provide it not in terms of cycles, but as a ratio of time in use. If you need to address performance issues during the analysis and design stages of product development, use an analytical approach such as Rate Monotonic Analysis (RMA), which is also useful for assessing whether system-wide deadlines are met.

Duty Cycle

To use the duty-cycle approach, you need a multitasking, preemptive, and priority-based RTOS that always has at least one task — the null task — running. The null task, also referred to as the idle task, is the lowest priority task. Consequently, it runs when no other task is running. Listing One is pseudocode for a typical null task. The null task is simply a superloop designed to use idle processor cycles.

The duty-cycle approach works by instrumenting application code to observe nonidle tasks. Code is inserted to toggle an output port on the processor; the port is a single line or pin that is turned on or off depending on which code is executing, so the pin's state indicates whether the application is idle. The toggled pin is then available for monitoring by a test instrument, which gathers the task information. Once gathered, the information is analyzed to determine processor utilization under the operating conditions of interest.

For monitoring, the duty-cycle approach requires code changes to the null task and the task scheduler. Changes within the null task focus on driving an available output pin: code is added to set the pin upon entering the null task. The task scheduler is part of the RTOS kernel and is responsible for scheduling and swapping tasks; code is added here to reset the pin whenever a nonidle task is swapped in.

Changes to the task scheduler require access to the kernel source, while changes to the null task do not. The net effect of these changes is to drive the pin high when the system is idle and low when it is nonidle (doing work). Alternatively, the code changes could be isolated entirely to the task scheduler; however, the logic needed there to distinguish idle from nonidle tasks is more complex than simply setting the pin in the null task and resetting it in the scheduler. Listing Two is pseudocode for the null task and task scheduler changes.

There are at least two duty-cycle techniques you can use to determine processor utilization: the "multimeter technique" and the "oscilloscope/logic analyzer technique." The code changes required are the same for each; the only difference is how the information is gathered. The techniques are named for the test instrument used to gather duty-cycle information from the output pin.

The multimeter technique, which uses a multimeter as its measuring instrument, lets you determine average processor utilization. Moreover, it determines aggregate processor utilization for the whole application, rather than individual tasks. You can use the multimeter technique during the implementation, integration, and testing stages of development.

The multimeter technique's advantages are that it provides a quick approximation of performance; requires minimal analysis knowledge and diagnostic skills; requires a standard, inexpensive piece of test equipment; and is minimally code invasive, requiring only a couple of changes in the code.

Accuracy, however, can be a problem with the multimeter technique. Because it averages, it is accurate only for applications that are nonfluctuating, that is, applications that utilize the processor at a fairly constant rate. An application whose utilization varies greatly from one time interval to the next is considered bursty, and averaging a bursty application can lead to gross inaccuracies.

For example, if the multimeter is averaging a bursty application whose processor utilization over five evenly dispersed time intervals is 5, 76, 34, 17, and 65 percent, it reports an average of about 39 percent. If the actual utilization is 27 percent, the error is significant; bursty applications may give results that are too far off to be useful for determining processor utilization. On the other hand, a nonbursty application over the same intervals with readings of 25, 29, 28.5, 26.5, and 27 percent yields an average close to the actual. Therefore, if greater accuracy is needed, you're better off using another technique (such as the oscilloscope/logic analyzer technique).
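Working out the arithmetic makes the size of the error explicit:

(5 + 76 + 34 + 17 + 65) / 5 = 39.4 percent, versus an actual utilization of 27 percent

(25 + 29 + 28.5 + 26.5 + 27) / 5 = 27.2 percent, within a fraction of a percent of the actual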

Another problem with the multimeter technique is that it is only good for measuring overall processor utilization. If you need individual task utilization, this technique is unusable. Finally, this technique requires additional hardware — namely, the multimeter. Although this is a disadvantage, it is a minor one since multimeters are inexpensive.

On the other hand, the oscilloscope/logic analyzer technique graphically tracks the duty cycle to determine aggregate processor utilization using a logic analyzer or oscilloscope. Since modern oscilloscopes have much of the same functionality as logic analyzers, the two instruments are, for all practical purposes, interchangeable for this technique.
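With the pin convention described earlier (high when idle, low when busy), the aggregate utilization over a capture window of length T reduces to a simple ratio, where t_low denotes the total time the pin is low (nonidle) during the window:

U = t_low / T

This is exactly the calculation performed on the waveform captures in the example later in this article.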

Like the multimeter technique, the scope technique is general purpose, determines aggregate processor utilization, may be used during the same phases of development, relies on a null task-based RTOS, and is minimally code invasive. However, the scope technique requires more detailed knowledge of analysis theory and diagnostic skills: you need to know how to operate a scope and interpret its results. It is also expensive due to the cost of the equipment. Compared to the multimeter technique, the scope technique is more accurate, flexible, and scalable. Its accuracy is in the 98-99 percent range, and it produces accurate results for both bursty and nonbursty applications. It also scales easily: greater processor-utilization granularity is obtained by modifying more code and using multichannel scopes.

One enhancement you can make to the scope technique is to support individual task utilization; it can be used with either a null task or a nonnull task-based RTOS. Supporting this enhancement requires a multichannel scope, enough output pins, and more extensive code changes. Say, for example, you have a system with six tasks. You could scale the technique up by adding code to toggle six different output pins, a separate pin per task. Although this scales well for six tasks, it becomes cumbersome beyond about 16 tasks because of the number of code changes and scope channels required. Therefore, I don't recommend this enhancement for applications with a large number of tasks.

Finally, the scope technique lets you capture processor utilization with greater accuracy than the multimeter technique. It can be enhanced to calculate individual task utilization simply by changing code to drive different output pins and connecting a multichannel scope to monitor the results. Keep in mind that the maximum number of tasks monitored is limited by the number of scope input channels and by how many code changes you can manage at once.
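One way to realize the per-task-pin enhancement is with a single change in the scheduler rather than edits in every task. The following sketch is purely hypothetical: the port address, the one-pin-per-task assignment, and the eight-task limit are assumptions for illustration.

/* Hypothetical: one output pin per task on a memory-mapped 8-bit port.
   At each context switch, the scheduler raises the bit for the incoming
   task and clears the rest, so each scope channel shows one task's
   duty cycle. */
#define TASK_PORT  (*(volatile unsigned char *)0x00FFF002)

void mark_running_task(int task_id)     /* task_id in 0..7 */
{
    TASK_PORT = (unsigned char)(1u << task_id);  /* one-hot output */
}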

Counter Approach

The counter approach is more flexible and general purpose than the duty-cycle approach, and provides a means of displaying processor utilization for each task in an application. Like the other empirical approaches, it can be used during the implementation, integration, and testing stages of development. Furthermore, it is as accurate as the scope technique and does not require expensive test instruments. All you need is a preemptive RTOS, a real-time counter, and software to capture and display the counts.

This approach operates by keeping separate data structures or bins for each task. The bin holds counter information and any other required statistic information. When the scheduler performs a context switch, the bin for that specific task is updated. Updating consists of enabling the real-time clock and starting the counter for that bin. The counter is incremented while the task is running. When the scheduler is ready to run another task, it disables the clock. Once the context switch occurs, the time differential or delta is recorded in the bin associated with that task before a new task is scheduled. Deltas for each bin are summed to provide the complete time for the application. User-written profiling utilities display the collected bin information.
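To make the bookkeeping concrete, here is a minimal C sketch of the bin data structure and the context-switch update just described. The names (task_bin, read_counter(), MAX_TASKS) are assumptions for illustration; a real port would use whatever free-running counter and task identifiers the RTOS and hardware provide.

#include <stdint.h>

#define MAX_TASKS 8                 /* assumed task-count limit */

typedef struct {
    uint32_t sum_delta;             /* accumulated run time, in counter ticks */
    uint32_t switches;              /* times the task has been scheduled */
} task_bin;

static task_bin bins[MAX_TASKS];
static uint32_t last_count;         /* counter value at the previous switch */
static int      running_task;       /* task that owns the elapsed interval */

extern uint32_t read_counter(void); /* hypothetical free-running hardware counter */

/* Called by the scheduler at every context switch: charge the time since
   the previous switch to the task that was just running, then start
   timing the incoming task. */
void profile_context_switch(int next_task)
{
    uint32_t now = read_counter();

    bins[running_task].sum_delta += now - last_count;  /* record the delta */
    bins[next_task].switches++;

    last_count   = now;
    running_task = next_task;
}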

The counter approach requires more code changes than either duty-cycle technique because potentially every task in the system is modified. This approach does not require a null task-based RTOS if it is monitoring each task's utilization individually; however, if aggregate utilization is the goal, a null task-based RTOS is required.

Similar to the duty-cycle approach, the counter approach has two techniques that can be used to determine processor utilization.

The "single null task technique" is similar to duty-cycle techniques because it requires a change to the RTOS's task schedule to operate. Code in the task scheduler is modified so that a counter is started when the null task is scheduled. When a nonidle task is scheduled, the null task is preempted. The task scheduler stops the counter, calculates the delta, and sums the delta with the existing delta in the idle bin. It then resets and restarts the counter. The counter runs until the scheduler detects a context switch back to the null task. At that time the scheduler stops the timer, records the summed delta, and stores it in the nonidle bin. Once stored, the process is ready to start all over again. Listing Three presents the necessary changes to support this technique.

The scheduler is responsible for recording the sums of each bin's deltas. The utilization number for this technique is derived by:

U(n) = Σ b_i / Σ (a_i + b_i)

where U(n) is the utilization for n tasks, a_i the deltas for the idle task, b_i the deltas for all the nonidle tasks, Σ b_i the summation of the nonidle deltas, and Σ (a_i + b_i) the summation of all the deltas for both the idle and nonidle tasks. This ratio of nonidle time to total time gives the aggregate processor utilization for the single null task technique.
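In code, the ratio might look like the following sketch, assuming the idle and nonidle bins accumulate raw counter ticks as in Listing Three (the function name is hypothetical):

#include <stdint.h>

/* Aggregate utilization for the single null task technique:
   nonidle time divided by total time, returned as a percentage. */
double aggregate_utilization(uint32_t idle_sum, uint32_t nonidle_sum)
{
    uint32_t total = idle_sum + nonidle_sum;

    if (total == 0)
        return 0.0;        /* nothing measured yet */
    return 100.0 * (double)nonidle_sum / (double)total;
}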

This technique is superior to both duty-cycle techniques. First, there are no required changes to the task code because all changes occur within the task scheduler. Second, it does not require a great deal of analysis expertise to use; the code profiler utility translates the information for you. Third, no additional equipment is needed to gather the utilization information. Finally, this technique is easy to scale by adding more bins.

On the other hand, the "multiple tasks technique" is a variation of the single null task technique — it is a scaled technique to provide better granularity at the individual task level. This technique uses multiple bins to store processor utilization information for each task. The multiple task technique has the ability to determine both aggregate and individual task utilization information.

It operates almost identically to the single null task technique, the main difference being there are more bins to manage. Consequently, the logic to manage them is also more complex. For example, instead of juggling between an idle and nonidle bin, the task scheduler has to keep track of one bin for every task. The upside of this added complexity is that no additional code changes are needed within any of the application tasks. A side effect of isolating the changes to the task scheduler is that they are transparent to the application. Listing Four is pseudocode for managing multiple tasks.

The multiple task technique is more flexible than the single null task technique. Furthermore, it does not require a null task, because it keeps track of all tasks rather than just an idle and a nonidle one. It may be configured to profile aggregate utilization, individual utilization, or both concurrently. The technique is usable for either a small or large number of tasks, although the bin bookkeeping in the scheduler can become hard to manage for a very large number of tasks.
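A user-written profiling utility of the kind mentioned earlier could then reduce the bins to per-task percentages. This is a minimal sketch that reuses the hypothetical task_bin structure from the earlier fragment:

#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint32_t sum_delta;     /* accumulated run time, in counter ticks */
    uint32_t switches;      /* times the task has been scheduled */
} task_bin;

/* Print each task's share of the total measured time. */
void report_utilization(const task_bin bins[], int ntasks)
{
    uint32_t total = 0;
    int i;

    for (i = 0; i < ntasks; i++)
        total += bins[i].sum_delta;
    if (total == 0)
        return;             /* nothing to report yet */

    for (i = 0; i < ntasks; i++)
        printf("task %2d: %5.1f%%  (%lu switches)\n", i,
               100.0 * (double)bins[i].sum_delta / (double)total,
               (unsigned long)bins[i].switches);
}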

The counter approach works well for finding the processor utilization of an application composed of tasks. In fact, all of the empirical techniques discussed here provide processor utilization for tasks. While they do not include ISRs, ISR utilization is easily added to the scope technique by inserting pin-toggling code into the ISR itself. The counter approach can also be modified to deal with ISRs, although it requires more extensive modifications: specifically, additional logic within the ISR to handle the ISR bin and its control.
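The extra ISR logic amounts to timestamping interrupt entry and exit and charging the difference to a dedicated ISR bin. A hypothetical sketch, using the same assumed read_counter() primitive as before:

#include <stdint.h>

extern uint32_t read_counter(void);  /* hypothetical free-running hardware counter */

static uint32_t isr_bin;             /* accumulated ISR time, in counter ticks */
static uint32_t isr_entry_count;     /* counter value captured at ISR entry */

/* Call at the top of the ISR, after context is saved. */
static inline void profile_isr_enter(void)
{
    isr_entry_count = read_counter();
}

/* Call just before returning from the ISR. Note that a fuller
   implementation would also subtract this interval from the
   interrupted task's bin so the time is not counted twice. */
static inline void profile_isr_exit(void)
{
    isr_bin += read_counter() - isr_entry_count;
}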

An Example

To illustrate, I'll show how one of the empirical techniques can be used to determine processor utilization. Specifically, the product here is a real Integrated Services Digital Network (ISDN) test instrument that supports two 64K-baud channels and one 16K-baud channel (a configuration referred to as Basic Rate Interface, or BRI). The application supports 41 synchronized tasks, which have different priorities based on the external events they service. The tasks are objects provided by the Real-Time eXecutive in C (RTXC), a preemptive multitasking RTOS that satisfies the requirements of the selected empirical technique. The application software runs on a proprietary hardware platform based on the Motorola 68360 processor.

The goal here is to determine whether the current product has enough processing power to support an enhancement without a processor upgrade; specifically, how much of the processor is utilized during a single-channel test. The results can then be analyzed and scaled to see if there is enough bandwidth for the current product's processor to support 24 channels simultaneously.

To determine the worst-case processor utilization for BRI, the application must be stressed. This is done using one of the most processor-intensive ISDN tests, the Bit Error Rate Test (BERT), which measures the quality of a channel. BERT operates by testing a single 64K-baud ISDN channel, sending a fixed sequence of 63 bits every 20 seconds. Bits are simultaneously received and compared against those sent; the results of the comparison yield the channel quality.

I selected the scope technique for this evaluation because it is accurate, general purpose, works for both bursty and nonbursty applications, and provides aggregate processor utilization. Furthermore, this technique is minimally invasive, requiring only a couple of code changes.

To use the scope technique, the application requires a couple of code additions to the null task and the scheduler. Listing Five is the modified RTXC null task. SET_P1 is a macro that sets the output pin to a 5-volt logic level through the Xilinx part; whenever the application is idle, the output pin is set high. Listing Six is the relevant code within RTXC's scheduler routine, postem(), which handles ISR and task scheduling. The scheduler code resets the output pin to a 0-volt low-logic level when a context swap to a nonidle task occurs. These changes let you monitor the output pin as it toggles on/off depending on the task: a high condition signifies the application is idle, while a low condition signifies it is doing work. Once these code changes are complete and the scope is connected to the appropriate pin, the application is ready to test.
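The macros themselves are hardware specific. On this instrument they drive the pin through the Xilinx part; on a generic memory-mapped port they might reduce to something like the following sketch, in which the register address and bit position are purely hypothetical:

/* Hypothetical memory-mapped port register; the real instrument drives
   the pin through a Xilinx part, so the address and bit will differ. */
#define PORT_REG  (*(volatile unsigned char *)0x00FFF000)
#define PIN1_BIT  0x01

#define SET_P1    (PORT_REG |=  PIN1_BIT)   /* pin high: system idle */
#define RESET_P1  (PORT_REG &= ~PIN1_BIT)   /* pin low: doing work */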

Figures 1 and 2 show two significant events of the BERT test. Figure 1, a waveform capture of the section where the application is busy exercising the channel, shows the area of interest over a one-second range. Figure 2, a greater magnification of the right side of Figure 1, shows the spikes of nonidle task time. Total nonidle task time is calculated by summing all the low-level intervals in Figure 1; starting from the first falling edge of the large nonidle waveform, this produced a total of 670 milliseconds for all nonidle tasks, which translates to a processor utilization of 67 percent.
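Applying the duty-cycle ratio from earlier gives the same figure directly:

U = t_low / T = 670 ms / 1000 ms = 67 percent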

For comparison, the multimeter reading on the same pin was 2.42 volts, which translated to 48.5 percent. As you can see, the multimeter approach does not give good accuracy here because the test runs over 20 seconds, and most of that time is spent idle. To produce an accurate result with the multimeter technique, the 20-second period would have to be normalized to one second; in other words, to get the 3.35 volts that corresponds to 67 percent, you would have to put the multimeter lead on the pin at the start of the nonidle burst and remove it one second later. This is clearly impractical for this setup. The scope technique, by contrast, lets you focus on the area of interest (in this case, the nonidle time) and analyze it to determine the normalized processor utilization for the worst-case usage.

Conclusion

The results from the BERT test verified that there was a problem: the existing hardware would not be able to scale up and handle the simultaneous testing of 24 64K-baud channels. This confirmed the original analysis of the application, based on deadlines and overall processor cycles available, which also concluded there was not enough processor bandwidth to accomplish our goal. Thus, the empirical test verified the analytical RMA results.

Keep in mind that the earlier you deal with performance requirements, the less expensive the product is to develop; problems cost more to repair later in the development cycle. It costs more to fix performance problems late than it would to apply an analytical approach such as RMA at the outset. When that is not possible, however, the empirical approaches presented here are at your disposal.

References

Baldwin, J.T., "Predicting and Estimating Real-Time Performance," Embedded Systems Programming, February 1995.

Klein, Ralya, Pollak, Obenza, and Harbour, A Practitioner's Handbook for Real-Time Analysis. Kluwer Academic Publishers, 1993.

Miller, West, Embedded System Design Primer. Impatiens Publications, 2001.

Thomae, D.A., "Processor Selection Using Rate Monotonic Analysis," Embedded Systems Programming, May 1996.

DDJ

Listing One

Null_task
{
    loop
    {
        Typically, nop code goes here, although
        user-specific code may be inserted here
    }
    forever     // loops until preempted by a higher priority task
}


Listing Two

(a)

null_task
{
    loop
    {
        set output pin
    }
    forever
}
(b)
task_scheduler
{
    preempt currently running task
    select highest priority task that is ready to run
    ...
    check ready task to see if it is the idle task
    if ready task is non-idle task
    {
        reset output pin
    }
    ...
}


Listing Three

task_scheduler
{
    ...
    if power up sequence true
    {
        clear bins
        reset timer counter
    }
    Determine if scheduled task is idle or non-idle
    Case task idle:         // the interval just ended was non-idle time
    {
        stop timer counter
        calculate delta from timer counter
        sum counter delta with contents of non-idle task bin
        store results in non-idle task bin
        reset timer counter
        start timer counter
    }
    Case task non-idle:     // the interval just ended was idle time
    {
        stop timer counter
        calculate delta from timer counter
        sum counter delta with contents of idle task bin
        store results in idle task bin
        reset timer counter
        start timer counter
    }
    ...
}


Listing Four

task_scheduler
{
    ...
    if power up sequence true
    {
        clear bins
        reset timer counter
    }
    Determine the identity of the currently executing task (being swapped out)
    stop timer counter
    calculate delta from timer counter
    sum counter delta with contents of currently executing task bin
    store results in currently executing task bin
    reset timer counter
    start timer counter
    ...
}


Listing Five

/* C main() module */
int main(void)
{
    int i;
    ...
    for (;;)            /* loop forever (null task) */
    {
        if (brkflag)
            ;           /* this could be "break;" which aborts RTXC */
        SET_P1;         /* macro that sets port pin 1 via the Xilinx part */
    }
    ...
    return(1);
}


Listing Six

/* RTXC task scheduler */
static FRAME *postem(void)  /* returns with interrupts disabled */
{
 ...
    if (hipritsk->priority != NULL_TASK_PRIORITY)  /* non-idle task scheduled */
        RESET_P1;   /* macro, resets port pin 1 */
 ...
    return(hipritsk->sp);   /* exit to hipritsk via tcb.sp */
}



Figure 1: Waveform capture of the section where the application is busy exercising the channel.

Figure 2: Spikes of nonidle task time in Figure 1.
