Measuring Execution Time and Real-time Performance: Part 1

In the first of a two-part tutorial, David Stewart focuses on techniques for measuring execution time, first defining key attributes and giving an overview of methods, then providing details on using each method.


November 06, 2006
URL:http://www.drdobbs.com/embedded-systems/measuring-execution-time-and-real-time-p/193502123

Many embedded systems require hard or soft real-time execution that must meet rigid timing constraints. Further complicating the issue is that for a variety of reasons, most of these same embedded systems have very limited processing power; it is not uncommon for them to be using an 8-bit or 16-bit processor operating at 10 MHz or less.

Real-time systems theory advocates the use of an appropriate scheduling algorithm and performing a schedulability analysis prior to building the system. Adherence to this theory alone does not lead to working embedded systems, and thus use of this theory is often dismissed by practitioners.

Practitioners, on the other hand, spend days, if not weeks, testing and debugging hard-to-find and difficult-to-replicate problems because their system is not performing to specification. Often, these problems are related to the system's timing: functional testing was done using good tools, and the system usually produces a correct response, just not always on time.

There exists a balance between theory and practice, where proper design of real-time code enables the real-time analysis of it. Systematic techniques for measuring execution time can then be used alongside the guidelines provided by real-time systems theory to help an engineer design, analyze, and if necessary quickly fix timing problems in real-time embedded systems.

This series of two articles discusses techniques for measuring and optimizing real-time code, and analyzing performance by correlating the measurements with the real-time specifications through use of real-time systems theory. Since this series is directed towards practitioners, simple rules of thumb that encapsulate the knowledge of complex theories and proofs are presented.

Several other activities of the development process can benefit from estimating and measuring execution time using the methods described here. This includes debugging hard-to-find timing errors that result in hiccups in the system, estimating processing needs of software, and determining the hardware needs when enhancing functionality of an existing system or reusing code in subsequent generations of embedded systems.

Overview of Measurement Techniques
Many different methods exist to measure execution time, but there is no single best technique. Rather, each technique is a compromise between multiple attributes, such as resolution, accuracy, granularity, and difficulty. A summary of the key attributes follows:

Resolution is the smallest time increment that the timing hardware can distinguish. For example, a stop watch measures with a 0.01 sec resolution, while a logic analyzer might be able to measure with a resolution of 50 nsec.

Accuracy is how close a value measured with a given method comes to the actual time that a perfect measurement would have obtained. If a particular measurement is repeated several times, there is usually some amount of error in the measurements. Thus, measurements could yield answers of the form x +/- y. In this case, y is the accuracy of the measurement x.

Granularity is the size of the code unit that can be measured, and is usually specified subjectively. For example, coarse granularity (also called coarse-grain) methods would generally measure execution time on a per-process, per-procedure, or per-function basis.

In contrast, a method that has fine granularity (also called fine-grain) can be used to measure execution time of a loop, small code segment, or even a single instruction. Important to note is that some fine-grain techniques can also be used to perform coarse-grain measurements, although the effort in doing so could be much greater than using a coarse-grain method.

Difficulty subjectively defines the effort to obtain measurements. A method that requires the user to simply run the code and produces an instant answer or a table of results is considered easy. A method that requires instrumentation such as a logic analyzer and filtering of data to obtain the answers is considered hard.

Typically, software-only methods are easier, but yield only coarse-grain results. Hardware-assisted methods are hard, but they can provide fine-grain results with high accuracy.

Table 1: Summary of methods to measure execution time

Table 1, above, summarizes the methods presented in this article and the attributes of each. Note that in many cases the attributes are approximations or subjective ratings, not exact values; however, comparing the attributes of different methods should provide sufficient information to help choose the best measurement technique for a particular need.

The method of choice can also depend on the hardware features and instrumentation tools available. For example, some methods require special hardware features like a digital output port, while other techniques require a specific software application or measurement instrumentation to be available. In some cases, the hardware or tools needed can be quite expensive and the cost and lack of availability can prevent using a particular method.

On the other hand, having access to the right tools can significantly decrease the amount of effort needed to obtain needed measurements, and thus obtaining the tools most suited to a project's needs could be a worthwhile investment.

The design of the software can also have a major impact on the ability to obtain measurements of execution time, but it is not classified as an attribute, as there is no way to quantify or qualify every possible variation.

In particular, the execution time of software designed in an ad-hoc manner (also known as "spaghetti code") is very difficult to measure, because the starting and stopping points of the code are not easy to identify. If there are multiple and inconsistent entry or exit points to the same piece of code, then obtaining accurate measurements is nearly impossible.

On the other hand, software designed so that it is "analyzable" clearly has a single entry and exit point for any part of it that needs to be measured, and those entry and exit points are defined consistently for all code segments that have similar functionality.

Selecting a Method
To select which measurement method to use, first consider the reason for measuring execution time. The most common reasons for measuring execution time are to refine estimates, optimize code, analyze real-time performance, and to debug timing errors.

Refining estimates is usually done during the design phase or early in the implementation phase. The estimates might be used to select which processor to use, or to obtain ballpark figures on how many iterations of a particular function can be executed per second.

Coarse-grain measurements can provide some of these answers fairly quickly. Sometimes, the measurements can even be made on the host processor, with an approximated scale factor for the target processor (such as "embedded processor X is about 18 times slower than host processor Y").

Optimizing code could use coarse-grain methods or fine-grain methods, depending on what is being optimized. If optimization is at a global scale, such as deciding whether it would be faster to use arrays or linked lists in a particular application, then a coarse-grain technique to measure execution time of complete functions is usually sufficient.

On the other hand, for localized optimizations, such as those that are specific to a target processor and occur during the late stages of development or when trying to fine-tune an application, a fine-grain technique that can measure execution time of a single line of code is usually needed.

Analyzing real-time performance can use a coarse-grain technique, but often only fine-grain techniques can provide the necessary accuracy. The accuracy needs to be at least five to ten times finer than the period of the fastest task.

Thus, if the fastest task in the system has a period of 10 msec, then a measurement technique that provides an accuracy of at least 1 to 2 msec for functions is needed to provide fairly good answers. More accuracy is better, especially if the Central Processing Unit (CPU) is either overloaded or operating at almost 100% utilization. In these cases, a technique with microsecond accuracy is needed.

Debugging timing errors usually needs a fine-grain method with maximum resolution. It is often necessary to measure not only user code, but also real-time operating system (RTOS) code, and to detect any anomalies that might be occurring, such as missed deadlines or tasks not executing at the desired rate.

Measurement Methods
Of the measurement techniques that were summarized in Table 1, a few methods are quite straightforward, but most are only applicable to UNIX-based systems, such as most embedded versions of Linux.

The software analyzer method is an all-encompassing description of some features provided by commercial RTOSes and tools. The techniques described towards the end of this tutorial are the ones based on hardware, and can be used independently of the RTOS. These can provide the most accurate results, but also involve the most complexity.

Stop-watch. A stop watch is only suitable for non-interactive programs, preferably running on single-tasking systems. It can be used to measure the time of things like numerical code that may take minutes or hours to execute, and when measurements only need to be approximate (e.g., to the nearest second).

The method simply involves using the chronograph feature of a digital wrist-watch (or other equivalent timing device). When the program starts, start the watch. When the program ends, stop the watch, and read the time.

Date command. The date command is useful when using a UNIX-based system or any other RTOS that has a command that displays the current date and time.

The date command is used like a stopwatch, except it uses the built-in clock of the computer instead of an external stopwatch. This method is more accurate than a stop-watch, but has the same coarse granularity: it can only accurately measure complete, non-interactive processes.

A typical way to use the command is to wrap the program that is being measured in a shell script or alias with the following commands:

date > output
program >> output
date >> output

As with the stop-watch method, this will only provide an estimate of how long the full program takes to execute. It does not take into consideration preemption, interrupts, or I/O. Most accurate answers are obtained on non-preemptive systems. This method is useful if the output serves as a log, so that the start and end time of each execution is logged into the file. A sample use is for long simulations that run in the background overnight, and it can provide information to know precisely when it ended.

Time command (UNIX). The time command is useful when using a UNIX-based system. Other RTOSes might provide a similar command. Execution time measurement is activated by prefixing time to a command line. This command not only measures the time between the beginning and end of the program, but it also computes the execution time used by the specific program, taking into consideration preemption, I/O, and other activities that cause the process to give up the CPU.

The output depends on which version of the time command is being used. In some cases, the time command is part of the shell. In other cases, it can be found in /usr/bin/time. In each case the output is the same information, just the format is different. For example:

% time program
8.400u 0.040s 0:18.40 56.1%

Interpreting the output, the first item (with a u appended, u = user CPU time) is the execution time of program, shown here as 8.4 sec. This is the amount of time the CPU was actually executing the program. Any time spent preempted, blocked for I/O, or performing RTOS functions is excluded.

The second item (with s appended, s=system), is the execution time used by the RTOS while running the program. This includes execution time for items such as device drivers, interrupt handlers, or other system calls directly associated with the program. The example shows that 0.04 sec of execution time was for system functions.

The third item is the total time that the program was executing in the system, whether it be running, blocked, or waiting on the ready queue. In this case, it was 18.4 sec. This is about the same time that would be reported using the date method above.

The fourth item is the average percentage of CPU time used when the task was ready or running. The value primarily depends on the load of the system, and has little meaning as far as measuring execution time.

Prof and Gprof (UNIX). The previous methods can only be used to measure a complete program. Many times, it is necessary to measure execution time at a finer granularity.

One method to measure execution on a per function basis is to use the prof or gprof profiling mechanisms available in UNIX. Profiling means to obtain a set of timing measurements for all (or a large part) of the code. The granularity of a profile depends on the method. In this case, both prof and gprof measure execution time with the granularity of a function. The resolution is usually that of the system clock, meaning on the order of 10 msec.

Both prof and gprof do similar things, except that gprof gives much more detailed results than prof. The measured time properly takes into account preemption, such that if a process is preempted, the clock stops until the process starts to execute again. This profiling mechanism, however, does slow down execution of the program by a non-negligible amount.

So the execution time measured when using prof or gprof will be greater than the real execution time of the program when it is not being profiled. Despite this inaccuracy, the method can be useful to identify which functions in the program are using the most execution time, to identify where optimizations might need to be made the most.

To use prof, compile with the -p option, then run the program as follows (other compiler options can be used too; this is just an example):

% gcc -p -o program program.c
% program

When the program terminates, the file mon.out is automatically created. It is a binary file that contains the timing data by function for the program. To view the timing data, type the following:

% prof program

A more detailed profile report can be obtained using gprof, by compiling with the -pg option as follows:

% gcc -pg -o program program.c
% program

Running the program creates the file gmon.out, which can be viewed as follows:

% gprof program

For information that describes the format of the statistics and what each entry means, look at the online UNIX manuals for prof and gprof for the specific operating system version being used.

Clock(). Although the prof/gprof method provides more detailed information than the first few methods presented, it is often necessary to measure execution time with finer granularity than a function.

Suppose prof was used and it shows that 90% of the time is spent in one subroutine. That subroutine becomes the primary target for optimization. But if the routine includes several loops, the next step is then to identify the most time-consuming parts within that subroutine.

A possible approach is to use the clock() function, as provided by many operating systems, including UNIX. In this case, however, the program must be instrumented such that the clock is read at the beginning and end of the code segment(s) being measured.

Instrumenting the code means adding lines of code explicitly to perform the timing measurements. Such lines of code are temporary, and are removed once the desired data has been collected.

This method is useful for fine-grain measurements, such as a code segment or loop, but it is not as convenient as prof/gprof to obtain measurements of multiple functions or processes at once. Here is an example of code that uses clock().

    #include <stdio.h>
    #include <time.h>

    clock_t start, finish;
    double total;

    start = clock();
    /* code segment(s) being measured */
    finish = clock();

    /* CLOCKS_PER_SEC (historically also named CLK_TCK) converts
       clock ticks to seconds. */
    total = (double)(finish - start) / (double)CLOCKS_PER_SEC;
    printf("Total = %f\n", total);

There are several issues that must be taken into account when using clock(). The issues stem from the fact that there is no standard implementation of this function, thus it can produce different results for different operating systems.

For example, it can provide a value in microseconds, seconds, or clock ticks. The reference manual for the particular operating system should be consulted prior to using the clock() function.

Depending on the system, clock() might behave differently if the system is preemptive. In some cases, if the task is preempted, the value returned by clock() will include the time spent by the other task too. In other cases, it will only include time used by its own process.

The clock() function is certainly more useful when the implementation properly deals with preemption. But even if it does not, see descriptions in the following sections on how to deal with preemption.

It is also important to note the resolution. Even though clock() might report time in microseconds, the resolution is usually the same as the system clock, which can be computed as 1/sysconf(_SC_CLK_TCK). On many UNIX systems, this is 10 msec or longer. Calling sysconf() with the argument _SC_CLK_TCK returns the number of system clock ticks per second.

If more resolution than 10 msec is needed, then one of two approaches can be used:

1) Create a loop around what needs to be measured that executes 10, 100, or 1000 times or more (see the sketch after this list). Measure execution time to the nearest 10 msec, then divide that time by the number of times the loop executed. If the loop executed 1000 times using a 10 msec clock, you obtain a resolution of 10 µsec for the loop.

2) Use a hardware-based method.
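
For illustration, here is a minimal sketch of the loop technique, assuming a clock()-based measurement as described above. One caution: an optimizing compiler may remove a loop whose result is unused, so the work inside the loop must have a visible side effect (the volatile variable here serves that purpose):

#include <stdio.h>
#include <time.h>

#define ITERATIONS 1000

volatile double sink = 1.0;  /* volatile keeps the loop from being optimized away */

int main(void)
{
    clock_t start, finish;
    double per_iteration;
    int i;

    start = clock();
    for (i = 0; i < ITERATIONS; i++)
        sink = sink * 1.0001 + 1.0;  /* stand-in for the code being measured */
    finish = clock();

    /* Dividing the total time by the iteration count improves the
       effective resolution by a factor of ITERATIONS. */
    per_iteration = ((double)(finish - start) / (double)CLOCKS_PER_SEC) / ITERATIONS;
    printf("Per-iteration time = %.9f sec\n", per_iteration);
    return 0;
}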

The advantage of the loop method is that it does not require any special hardware. The disadvantage is that it forces a change in the code; the change might affect the functionality, and could even cause the program to crash.

At the very least, the code slows down by a factor equal to the number of iterations performed just to get a reading, and thus real-time performance is lost. If this is not acceptable, then one of the other methods must be used.

Software Analyzer. The term software analyzer is used as an all-encompassing phrase for software tools provided by a variety of RTOS and tool vendors designed specifically for measuring execution time. Examples include TimeTrace [6] and WindView [7].

It is beyond the scope of this tutorial to describe how to use any such tools, or to even recommend one tool over the other. Rather, this section provides a general discussion to aid in understanding capabilities of these tools.

The first step in using a software analyzer is to determine the resolution and granularity. The resolution should be one of the specifications of the product. It can also be determined experimentally by slowly increasing the execution time of a code segment, then observing the smallest increment by which the measured value changes. That increment is typically the resolution.

If the software analyzer is based on the system clock, then the resolution will likely be on the order of a millisecond. If the analyzer is based on some other hardware-based method, such as using an onboard timer/counter chip, then the resolution might be in the microseconds range.

The granularity is another important item to identify. Some software analyzers will be like prof/gprof, and only be able to provide information on a per-function or per-process basis. As with prof/gprof, such analyzers are good if coarse-grain measurements are satisfactory, but not very useful when optimizing localized code segments or tracking down timing or synchronization errors.

A good software analyzer will not only provide information on a per-function or per-process basis, but it will also contain a means for measuring execution time of smaller segments, such as a loop, block of code, or even a single statement. The ability to measure execution time of interrupt handlers and RTOS overhead is also a bonus.

Some software analyzers provide a timing trace to show precisely what process is executing at what time. Such a timing trace could be helpful to an expert when debugging timing and synchronization errors, but it does not offer data in a convenient format for analyzing real-time performance.

Also, if the timing trace is not correlated to the source code, then it is not possible to identify what part of the code is responsible for extended periods of execution when such an event is detected in the timing trace. Instead, a state mode that provides tabular data that can be analyzed or downloaded is needed.

Another issue to consider when using software analyzers is the resources they consume. Some analyzers add overhead, and thus slow down code. Most analyzers require lots of memory to log data, making the tool ineffective when an embedded system's memory is already fully allocated. In such cases, the hardware-based methods described below can be used instead.

Timer/Counter Chip. Most embedded computers have timer/counter chips that are user programmable. If such a chip is available, then it can be used to obtain fine-grain measurements of code segments. The method presented here, however, is not very useful for coarse-grain measurements, such as total execution time used by a function or process.

This method is similar to using the clock() method described earlier, in that the starting and stopping points of the code being measured are instrumented directly into the code. At the beginning of the code segment, the current countdown (or count-up) value of the timer/counter is read. At the end of the code, the value is read again. The difference between these two values represents how many timer ticks have elapsed.

It is then necessary to determine the value of a timer tick. The timer tick is typically a multiple of the microprocessor's clock period. It could be fixed, or user-programmable.

For example, an 8 MHz microcontroller has a cycle time of 125 nsec. A timer-chip on this microcontroller has the timer tick user programmable as 1x, 4x, 16x, or 256x, depending on the bit-pattern written to one of the timer's control registers. Suppose 16x is chosen. This means the timer-tick is 16 times 125 nsec, or 2 µsec.

This yields a mechanism with a resolution of 2 µsec, and usually an accuracy of twice the resolution, meaning 4 µsec. With this accuracy, it is possible to measure execution time of rather small code segments.
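
As a minimal sketch of the technique, assume a hypothetical memory-mapped 16-bit up-counting timer register; the address (0xFFD0 here) and the 2 µsec tick are placeholders to be taken from the processor's datasheet:

/* Hypothetical memory-mapped 16-bit free-running timer register. */
#define TCNT       (*(volatile unsigned short *)0xFFD0)
#define TICK_USEC  2  /* 16 x 125 nsec = 2 usec per tick, as in the example */

unsigned short t_start, t_stop;
unsigned long elapsed_usec;

t_start = TCNT;
/* code segment being measured */
t_stop = TCNT;

/* Unsigned 16-bit subtraction yields the correct tick count even if
   the counter wraps once between the two readings. */
elapsed_usec = (unsigned long)(unsigned short)(t_stop - t_start) * TICK_USEC;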

If an RTOS is being used, there is a possibility that the RTOS has already configured the timer/counter chip. In such a case, either use a second timer/counter chip if one is available, or use the same chip as the RTOS, but only read it. Do not change the timer configuration in any way, as that can cause the RTOS to crash.

A question arises: where do the answers go? In the clock() example earlier, a print statement displayed results. But on a system in which this timer/counter method is used, there is a good possibility that a video display is not available.

If a small display is available (even a simple 4-digit 7-segment LCD display), then values can be shown on the display. An alternative is to send the data out on an output port, and collect it using a chart recorder or logic analyzer. A third possibility is to store the data in memory at a known location, then to peek into that memory using a debugging tool or a processor's built-in monitor.

One issue that needs to be considered is overflow. If the timer is 16 bits, and its resolution is programmed to be 2 µsec, then it will reset and start over every 130 msec. As a rule of thumb, the method should be restricted to measuring code segments that are at most 10% of this maximum range, meaning up to about 13 msec for a 16-bit timer with 2 µsec resolution.

In such a case, if the measurement is continuing on a periodic basis, approximately 1 in 10 readings will be wrong, as it coincides with the timer overflowing. That reading needs to be spotted and discarded. This is quite easy to do as long as the code segment takes about the same amount of time every time, in which case the data reading that is discarded is the one that does not make sense.

Another issue occurs in a preemptive environment or when interrupts are present. If the code segment being measured can be preempted, then false data readings will be provided every time such a preemption occurs within the code segment.

Several possibilities exist. One is to disable interrupts whenever a measurement begins, then re-enable them when the measurement ends. This could affect real-time performance by causing priority inversion, and could cause the application to fail to meet its specifications.

But often this is acceptable during the testing phase in order to get the measurements of various code segments. A second alternative is to discard readings that are much longer than the average reading, as they represent measurements that include preemption. Anytime readings are discarded, care must be taken not to accidentally keep an incorrect reading or discard a valid one.

As a general rule, any discarding of data must always be done with great care. Only discard a value if there is a reasonable explanation. If there is concern that a good value might accidentally be discarded, and such a mistake cannot be tolerated, then use a different method that is not subject to the overflow of the timer chip, or more suited to account for preemption.
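
As an illustration of discarding with care, the following sketch averages a series of periodic readings while rejecting any that fall far above the raw mean; the 1.5x threshold is an assumption that must be tuned per application:

/* Sketch: average timer readings, discarding probable preemption or
   overflow outliers. Assumes the measured segment normally takes
   about the same time on every pass. */
unsigned long filtered_average(const unsigned long *readings, int n)
{
    unsigned long sum = 0, kept_sum = 0, avg;
    int i, kept = 0;

    if (n <= 0)
        return 0;

    for (i = 0; i < n; i++)
        sum += readings[i];
    avg = sum / n;

    for (i = 0; i < n; i++) {
        if (readings[i] <= avg + avg / 2) {  /* within 1.5x of the mean */
            kept_sum += readings[i];
            kept++;
        }
    }
    return (kept > 0) ? (kept_sum / kept) : 0;
}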

Logic Analyzers
A logic analyzer is one of the best tools for accurately measuring execution time with microsecond resolution, especially when accurate timing is essential. The drawback is that it requires specialized hardware and more effort than some of the techniques described above.

There are two approaches to using a logic analyzer. One approach is to hook up the probes to the CPU pins. Connecting the logic analyzer to a CPU emulator or using a bus analyzer has the same effect. While this method is least obtrusive on the real-time code, it is also the most difficult, as it requires reverse engineering the code to correlate logic analyzer measurements with the source code.

Some logic analyzers provide processor disassembly support, but that only provides correlation to the assembly code, and not necessarily to the source code. This approach is not advocated, as it is very difficult and does not yield answers that are any better than the other approach described next.

However, a variation of this approach is to monitor only a single memory location, in which case this becomes the same as the other approach.

The other approach is to send strategic signals to an output port, which are read by the logic analyzer as events. The code is instrumented to send signals at the start and end of each code segment.

The instrumentation is encapsulated within a macro, so that redefining the macro to an empty statement disables the instrumentation without the need to change any part of the application code. This approach is compatible both with large applications that use commercial RTOS and smaller systems based on custom executives or even ad-hoc code.

Necessary Embedded Hardware Features
To use a logic analyzer to measure code, signals must be sent from the software to the analyzer. The easiest way is to use a digital output port. It is highly recommended that any embedded application be designed with at least one such port dedicated to testing and debugging. A single 8-bit or 16-bit port serves as a window into the running program and can save tremendous development time.

Some embedded hardware offers more sophisticated windows to the inside, such as JTAG and BDM. However, each of these requires an additional engine to drive the mechanism, and while very useful for debugging functional code, their use can greatly affect real-time performance; thus they are not recommended when measuring execution time.

An 8-bit digital output port is usually sufficient for most applications. A 16-bit port might be desirable for larger applications, as it enables encoding more information to send to the logic analyzer. If at least an 8-bit port is not available, there are other alternatives.

If there is access to the CPU's address and data lines (for example, if an emulator is attached to the system), then only a single memory location on the CPU needs to be reserved. The address of that memory location is used to trigger the logic analyzer, while the data lines contain the information that would otherwise be sent to the digital output port.
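
For example, a sketch of this variation, where the reserved address (hypothetical here) triggers the analyzer and the written value carries the event code:

/* Hypothetical reserved location in externally visible memory; the
   analyzer triggers on this address, and the value written is the
   event code that would otherwise go to the digital output port. */
#define MEZ_TRIGGER      (*(volatile unsigned char *)0x00FFF0)
#define MEZ_EVENT(code)  (MEZ_TRIGGER = (unsigned char)(code))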

A similar method can be used if a bus-analyzer (such as a VMEbus or PCIbus analyzer) is present in the system. A bus analyzer only monitors accesses to external memory that go over the bus. Thus the single memory location that is selected must be an external memory location that can be captured by the bus analyzer.

A bus analyzer is in fact a logic analyzer, with all the probes permanently affixed to each wire on the bus. Therefore the techniques for using a bus analyzer are the same as described in this section when using a logic analyzer.

Even if there is only a single bit of output available or even a single serial or digital-to-analog output port, it is still possible to measure execution time, although it is much more difficult. If the output is analog, then an oscilloscope is needed instead of a logic analyzer.

Logic Analyzer Features
A logic analyzer must be set up to capture the data being sent to the digital output port or over the address lines. However, not every logic analyzer is the same. Some key features can greatly simplify collecting data for the purpose of measuring execution time and real-time performance.

First, the logic analyzer should support state mode. That is, it displays collected data as a list of hexadecimal numbers, one line per entry in the analyzer's buffer. All but the lowest-cost analyzers usually have this mode. It is still possible to measure execution time using timing graphs, but this is much more difficult, and forces each measurement to be performed manually.

The logic analyzer should support automatic detection of transitions (often called transitional mode). That is, it monitors the data lines and collects one entry every time it detects that the output on the data lines has changed. Even some high-end analyzers lack this capability, while some low-end analyzers have it. If the logic analyzer does not support this mode, then more sophisticated external triggering, combined with setting up the analyzer in sequence mode, is needed. Here, it is assumed that transitional mode is available.

A deep buffer on the analyzer is highly desirable. The more data that can be collected during a single execution, the more different items can be measured, and the more measurements of periodic or repeated code can be captured. This leads to higher confidence in measurements of average and worst-case execution times. A deep buffer also increases the ability to measure rare events, like an occasional interrupt. Some logic analyzers have buffers that hold one or two million events. The general rule is the more the better.

To measure execution time, only 16 channels are needed, and if only an 8-bit output port is used, then only 8 channels are needed. Most logic analyzers—even the lowest-end ones—have this many channels. For purposes of measuring execution time, additional channels are not needed. Measuring execution time can become tedious, thus automating parts of it is highly desirable.

To automate some of the data filtering, a computer is needed. Thus some form of output from the analyzer, whether Ethernet, GPIB, or high-speed serial, is very helpful. Alternately, one of the newer generation of logic analyzers with a built-in host computer can be used.

A search option that enables typing in a data pattern, and displaying only the data that matches the pattern, is also very helpful for quickly viewing some results. Lack of the search option, however, does not invalidate use of the analyzer, as the same effect can be achieved after uploading data from the analyzer to a host computer.

Once an appropriate logic analyzer is selected, connecting it is straightforward. Simply connect the 16 bits of the digital output port to the corresponding first 16 channels of the logic analyzer. If an 8-bit output port is used, then only connect the first 8 channels. For simplicity, be sure that bit 0 of the output port is connected to channel 0 of the logic analyzer, bit 1 to channel 1, etc.

Next, measuring execution time for a single code segment is described. The method is then expanded, later in this series, to instrument complete tasks and measure code for an entire application at once.

Measuring Time for Code Segments
The following discussion assumes C or C++. If using any other language, including assembly language, it should be fairly obvious how to adapt the method.

The first step is to set up macros for writing to the output port. This step is recommended because different architectures and different output devices may require different methods of writing output. However, it is desirable to become accustomed to a single set of commands.

Suppose the macros are called MEZ_START and MEZ_STOP, and a definition is created for an 8-bit output port. Following is a sample definition:

#define MEZ_START(id) output(dioport, 0x50 | ((id) & 0xF))
#define MEZ_STOP(id)  output(dioport, 0x60 | ((id) & 0xF))

These definitions assume a multitasking system. The id is an identification number that enables measuring execution time for multiple code segments at once; each code segment is simply given a separate id number. The macros assume a maximum of 16 ids (numbered 0 through 15).

The 0x50 and 0x60 codes are arbitrarily defined; they can be any values that use only the upper four bits of the 8-bit output, since the bottom four bits carry the id. A full profiling of an application might encompass a dozen or so codes. The encoding is quite flexible, although reserving the top four bits as the event code and the bottom four bits as the id makes it easy to view the items in hexadecimal on the logic analyzer.
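
The instrumentation can be compiled out by redefining the macros as empty statements, so the application code never changes. A sketch, where the MEZ_ENABLE flag name is an assumption:

#ifdef MEZ_ENABLE
#define MEZ_START(id) output(dioport, 0x50 | ((id) & 0xF))
#define MEZ_STOP(id)  output(dioport, 0x60 | ((id) & 0xF))
#else
#define MEZ_START(id)  /* instrumentation disabled */
#define MEZ_STOP(id)
#endif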

The code whose execution time is to be measured is then instrumented to include MEZ_START and MEZ_STOP macros. For example:

:
MEZ_START(1);
funcA();
MEZ_STOP(1);
MEZ_START(2);
y = a + b * c;
MEZ_STOP(2);
:

In this example, two code segments are being measured simultaneously. The first is to obtain the execution time of the function funcA(). The second is to obtain the execution time of the operation y=a+b*c.

The code is compiled. Prior to executing it, the logic analyzer is turned on and set up in transitional mode to collect data from the output port. The code is then executed. Data collection on the logic analyzer is halted, and the output displayed.

Depending on the analyzer, there could be many different columns. Two columns are most important for this task: the data column and the time column.

The data column will show the data codes that are output as a result of the MEZ_START and MEZ_STOP macros. For the above example, the data should be 0x51, 0x61, 0x52, and 0x62, in that order.

The logic analyzer automatically time-stamps every event. The timestamp can generally be displayed as relative or absolute. Relative means that the time column shows the amount of time that elapsed since the reading on the previous line. Absolute is a cumulative time.

For example, assuming that funcA() took 358 µsec and the calculation to determine y took 14 µsec, the output would appear as follows (both relative and absolute time mode shown):
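
Data    Relative    Absolute
0x51      ---           0 u
0x61     358 u        358 u
0x52       2 u        360 u
0x62      14 u        374 u

(The 2 u between the 0x61 and 0x52 events is an assumed, system-dependent instrumentation overhead between consecutive measurements; the actual value depends on the port-write time.)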

The u represents microseconds; this is a typical convention used by most logic analyzers. Other common abbreviations are n for nanoseconds, m for milliseconds, and s for seconds. From this output, the measured execution time is readily obtained. Relative mode is usually easiest to use if the start and stop operations are consecutive. Absolute mode is useful when nesting measurements.

Using this method, any code segment(s) in the application can be measured. Measuring individual code segments is especially helpful when optimizing code. The code can be measured prior to optimization then again after the optimization, and the amount of savings (if any) is readily known.

When optimizing code, execution time should always be measured, to prevent making changes to the code that appear as optimizations, but in reality either do not affect execution time or worse, slow down execution time.

There are caveats to measuring execution time in this manner. In particular, it does not account for preemption or interrupts, and thus measured values could be misleading.

Furthermore, it is incomplete, in that only a few code segments are measured. That is insufficient when trying to measure real-time performance for the entire application. Variations of this technique, discussed later, do take into account preemption and other related issues.

Collecting Data Through a Single Bit
Some embedded systems are so restrictive that a spare 8-bit digital output port is an unavailable luxury. Execution time can still be measured with as little as a single output bit available.

The primary limitation with using only a single bit is that only one item can easily be measured at once. The MEZ_START macro is modified to set the bit to 1, while the MEZ_STOP macro resets the bit to 0.

If there is more than 1 bit, but less than 8-bits available, various encoding strategies can be used to measure more than a single item at a time. For example, with 3 bits, two of the bits can be used to identify the code segment, thus allowing four code segments to be measured at a time. The third bit is toggled as in the single-bit case, to provide measurements.
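
A sketch of this 3-bit encoding, assuming the same output() interface as before; the low two bits select one of four code segments, and bit 2 marks start versus stop:

#define MEZ_START(id) output(dioport, 0x4 | ((id) & 0x3))  /* bit 2 set: start */
#define MEZ_STOP(id)  output(dioport, ((id) & 0x3))        /* bit 2 clear: stop */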

When using only one or two bits, an oscilloscope can replace the logic analyzer. Another alternative if an oscilloscope is available is to use a digital-to-analog output port.

Several different clearly-distinguishable analog levels are pre-defined, with the occurrence of each one representing an event. The resolution of an analog output is generally a function of the conversion time. Values in the range of 10 to 50 µsec are not uncommon, while very accurate ones (like a 20-bit converter) could be in the millisecond range.

This represents much lower resolution as compared to using digital outputs. Nevertheless, if this is the only means of sending signals from the embedded processor to a measurement instrument, then it is still better than not having such an ability.

Next, in Part 2 of this tutorial, the author focuses on real-time analysis and various techniques for analyzing real-time performance.

David B. Stewart is Director of Software Engineering at InHand Electronics.

References
[1] D. Katcher, H. Arakawa, and J. Strosnider, "Engineering and Analysis of Fixed Priority Schedulers," IEEE Transactions on Software Engineering, Vol. 19, No. 9, Sep. 1993.

[2] A. Secka, "Automatic Debugging of a Real-Time System Using Analysis and Prediction of Various Scheduling Algorithm Implementations," M.S. Thesis, Dept. of Electrical and Computer Engineering, University of Maryland, Supervisor D. Stewart, Nov. 2000. 

[3] M. Steenstrup, M.A. Arbib, and E.G. Manes, "Port Automata and the Algebra of Concurrent Processes," J. Computer and System Sciences, Vol. 27, No. 1, pp. 29-50, August 1983.

[4] D.B. Stewart and P.K. Khosla, "Mechanisms for Detecting and Handling Timing Errors," Comm. of the ACM, Vol. 40, No. 1, pp. 87-94, January 1997.

[5] D.B. Stewart, "Designing Software Components for Real-Time Applications," in Proc. of Embedded Systems Conference, San Francisco, CA, Class 507/527, Apr. 2001.

[6] TimeTrace, TimeSys Corp.

[7] WindView, Wind River Systems.
