Channels ▼


Using Hardware Trace for Performance Analysis

Source Code Accompanies This Article. Download It Now.

October, 2005: Using Hardware Trace for Performance Analysis

Michael is a senior software engineer at Green Hills Software. He can be reached at [email protected]

Whether you want to create a faster printer, squeeze extra features into your cell-phone design, or minimize the cost of your embedded device by lowering your computation and processor requirements, performance analysis is a vital component of developing high-quality embedded systems. Thus, the consequences of inefficient code are very real when making embedded devices. There are several traditional tools and techniques that you can use to diagnose and help fix performance problems in embedded systems. However, many of these tools have limitations because they require modification to the code running on the system. This means that the tool interferes with the execution of the software, which can lead to inaccurate results or other problems.

One class of tools that does not have any impact on your running system, yet makes performance analysis easy, is based on hardware trace technology. Hardware trace essentially gives you a complete history of the instructions executed by your microprocessor. This information can be collected without modifying the running code or altering its runtime characteristics—it is collected with zero intrusion on your system. Trace-analysis tools can then convert this information into meaningful and powerful performance-analysis data that lets you analyze your system (without altering its runtime behavior) and optimize your system for maximum performance.

In this article, I examine some traditional performance-analysis techniques as they apply to embedded systems and discuss some of their inherent limitations. Next, I explore hardware trace by investigating what it is and how it can be used to effectively and efficiently debug your embedded software. Finally, I walk through a real-world example to illustrate some of the unique benefits of using hardware trace for system optimization.

Performance Analysis and Optimization

Performance analysis is an important part of developing a complete embedded product. Whatever your reasons for needing higher performance from your software, you almost certainly could use code that runs faster. However, performance analysis is generally a difficult task because it requires careful measurement and analysis of your system. In addition, after gaining visibility into what your software is doing, you must find areas where the code is performing unnecessary computations or other actions. Oftentimes, system performance can be greatly improved with relatively simple changes. However, finding the places that can be easily optimized is often a challenging task.

There are many techniques that have been traditionally used to analyze software performance. The most common techniques involve various forms of profiling, which give you information about how often each part of your program is running. The most popular forms of profiling provide statistical information about what code is running. This data can be collected by either periodically sampling the program counter or by instrumenting the object code running on your system. Regardless of the method, profiling methods generally maintain an array of the program counter locations in your program. Profiling methods then increment a counter for each program counter location as it is encountered, letting you see how much time is spent at each instruction in your program. A debugger or other tool then correlates this raw information to the time spent in each task, function, and source code line in your system.

Statistical profiling lets you see the percentage of time spent at every location in your program. For example, Figure 1 is output from a profiling tool for the program in Listing One. However, this information comes at a steep cost, especially for real-time systems. By either halting your system periodically to sample the program counter or modifying your executable, standard profiling methods change the behavior of your system, often in dramatic ways. It is possible that these methods could change the timing behavior of your system, potentially causing your software to miss deadlines or encounter other serious problems. Also, storing the profiling information on embedded systems is often a challenge because either additional memory or a way to transmit this information as the system runs is required. Finally, instrumentation solutions may increase the size of your code, potentially making it too large to fit into the limited memory available on many embedded systems.

Hardware Trace

Hardware trace is a technology available on many embedded microprocessors that allows you to effectively analyze your system without any of the drawbacks previously mentioned. In addition, hardware trace offers some unique performance-analysis benefits that are unavailable with standard profiling methods.

Hardware trace provides a complete history of what your microprocessor is doing as your software runs. This lets you examine the various interactions in your software in great detail. By collecting information about what instructions are executed, when interrupts and context switches occur, and sometimes even what data addresses and values are loaded and stored, hardware trace provides nearly complete visibility into the workings of your software. In addition, because hardware trace uses dedicated logic on your processor and specialized collection devices, hardware trace does not impact the execution of your software and you are able to collect detailed information about your production software while it runs at full speed. These capabilities let you visualize your software in unprecedented ways, making it easier to debug and optimize your software.

Hardware trace data is output over a dedicated port from a trace-enabled microprocessor. There are several embedded microprocessors available today that can output trace data, including the ARM7, ARM9, and ARM11; the IBM PowerPC 405 and 440; and the PowerPC 5500 and MAC7100 families from Freescale. In addition, more and more microprocessors are adding support for trace as the benefits of debugging and analyzing embedded systems with hardware trace are becoming more widely recognized.

Special hardware called a "trace probe" or a "trace port analyzer" connects directly to a microprocessor's trace port and captures the trace data. This data is then uploaded to a host system, such as a PC running Windows or Linux, to be decompressed and analyzed. Figure 2 is a block diagram of a trace-collection system using an embedded microprocessor.

Trace probes also offer features that are particularly important when optimizing the performance of an embedded system. The most important of these features is the ability to collect "time stamps" along with each piece of trace information that is collected. This lets you accurately measure the amount of time elapsed between two events and to precisely analyze where your program is spending its time. In addition, some processors let you collect information about how many cycles each instruction takes, which allows you to investigate the effects that your caches are having on your system and perform other detailed types of analysis for improving the performance of software.

Once the trace data is uploaded, host tools let you analyze various parts of the execution of your system, including debugging incorrect behavior and searching for performance problems. There are a number of trace probes and trace-analysis software packages available, including the SuperTrace Probe and MULTI TimeMachine Suite from Green Hills Software (where I work;

Hardware Trace Applied to Performance Analysis

By providing a wealth of information about the execution of your system without changing its behavior, hardware trace data enables unique advanced performance-analysis techniques. In addition, because it lets you collect timing and cycle information, you can very accurately measure important system metrics such as interrupt latency and context switch times. And once you can measure these numbers accurately, you can find the parts of your system that are running too slowly and that can be easily sped up.

The first and most important step of performance optimization is to find the "hot spots" where your system is spending a large amount of time and the code can be improved to yield the greatest gains in performance. These are often inner loops and other parts of your code that are executed many times, although they could also be inappropriate algorithms or other slow code. Locating these hot spots is the primary purpose of nearly all performance-analysis tools, including those tools that let you visualize your program using trace data.

One major advantage that trace data provides in searching for hot spots is that it enables you to locate anomalous execution patterns in your software. This may be a function that takes an unusually long time with certain parameters; a loop that takes too long on one or more specific iterations; or even an interrupt service routine that is fast most of the time, but sometimes takes too long.

For instance, if a function executes 100 times, taking 100 cycles 99 times, and then taking 10,100 cycles the final time, the function takes an average of 20,000/100= 200 cycles per call. A statistical-analysis profiling tool will be unable to distinguish this case from a function that takes 200 cycles every time it is called. The types of performance improvements that you would look for may differ between these two cases because you may need to debug just one slow path through your function in the first case, and you may want to do a more general overhaul of your function or algorithm if it takes 200 cycles on every call and its performance needs to be improved.

However, trace-analysis tools can distinguish between these two cases by giving you a list of each call to the function and its duration. Once you have identified the functions that are taking most of your execution time, you can then examine the paths through that function that cause it to run slowly and eliminate these inefficiencies.

Many trace tools also let you visualize your system over time so that you can graphically see what functions take most of the time in your system. By allowing you to visualize this data rather than having to pore over lists of numbers, you can gain a solid understanding of how your software spends its time to easily let you make your program more efficient. In addition, some trace tools even let you easily determine the parameters to your function for a slow case so that you can immediately debug it and improve your system performance.

An Example

Listing Two is a simple program that sorts an array a large number of times. I use an implementation of the quicksort algorithm, which should be getting reasonable performance from the sorting algorithm. Each time through the main loop, I read an array, sort it, and verify that it is sorted correctly. If the sort function ever fails, then you bail out of the main loop and print an error message.

When running this program, you discover that the program runs slower than expected. If you were using an ordinary performance-analysis tool, such as a statistical profiler, you would be able to determine that the sort function takes too long. In addition, you could determine what source lines were the cause of the delay. However, it would be difficult for you to distinguish between a sorting algorithm that is too slow on average and one that is too slow during a single iteration of the loop. You could then insert additional code into the loop to measure the execution time of each iteration to see if you could discover anything interesting. You would then have to dig deeper once you discovered the iteration that was taking too long. In short, you would have to do a lot of work and probably recompile and rerun the program several times before you could solve the problem. In addition, many embedded systems have strict timing deadlines, so inserting instrumentation code may cause incorrect behavior or lead to other negative outcomes. While this additional code may not affect this simple example, it is easy to imagine a scenario where tight deadlines might be missed when inserting additional instrumentation code.

If you collect trace data for this example instead of using traditional performance-analysis techniques, you can then easily get information about each function that executes and how long it takes to run. You can even precisely measure the amount of time that each iteration through the loop takes and how many cycles elapsed. This information is invaluable in locating hot spots quickly so that you can maximize your efficiency when tuning your system. Moreover, this data lets you focus on the code that runs slowly by precisely accounting for all elements of your system, including cache hits and misses, interrupt latencies, and others.

Returning to the example, I run the program on an ARM processor and collect trace data as the program runs. Figure 3 shows a photo of this setup and Figure 4 shows the resulting output from a TimeMachine tool called the "PathAnalyzer" that shows a graphical call stack over time. You can immediately see that the quick_sort() function takes up the majority of the running time and that a single iteration takes longer than the rest. If you want to investigate the quick_sort() function in more detail, you can easily look at all of the calls to quick_sort() and their durations.

Browsing all of the calls to the quick_sort() function results in Figure 5, which contains a list of the 10 calls to quick_sort() and the duration of each call in processor clock cycles. You can quickly see that the seventh call to quick_sort() takes almost four times longer than the rest. And you are able to easily collect and display this information while the system is running at full speed without modifying its behavior in any way.

At this point, you know that you should investigate this specific call to quick_sort(). By examining and debugging the code in this example, you discover that you are using an array that is already sorted on the seventh iteration. As you may recall, quicksort degenerates into an O(N2) algorithm when it is run on an already sorted array, which is the cause of our performance problem in this example. There are several options for fixing this performance problem by modifying the implementation of quicksort; this is left as an exercise for the reader. For more information on the quicksort algorithm, see a book on algorithms such as Algorithms In C by Robert Sedgewick (Addison-Wesley, 1998).

In addition, if you simply wanted to characterize this sorting algorithm, you could collect trace data from our implementation running on various arrays and collate very accurate results into some interesting information about our algorithm. While this type of analysis can be done in other ways with simple sorting algorithms, if you are collecting the data from sensors on your embedded system and want to analyze the cost of reading the data and sorting it over long periods of time, hardware trace would be an ideal solution because it does not modify the runtime behavior of your system, yet provides detailed and accurate information about the execution of your system.

While this was a fairly simple example, you can see how trace tools make it painless to find hot spots and perform advanced analysis in an embedded system so that performance optimization can be performed more efficiently and effectively.


Hardware trace provides a mechanism through which embedded developers can easily gather very accurate information about the execution of their systems without having to modify their software or impact the runtime behavior of their system. This enables hardware trace to be used in nearly any embedded system that runs on a trace-enabled processor, regardless of whether the application has strict timing requirements. Also, because hardware trace does not modify the behavior of your system, you can analyze your software in real-world conditions while it is running on actual data collected from the rest of the system. These characteristics make trace a promising solution for many embedded developers.

Whether you want to increase the performance of your application to free up CPU cycles for extra features, to lower the clock frequency of your processor to save power, or to achieve higher performance for some key metric of your system, optimizing embedded systems is an important task that hardware trace can make easier. By providing zero intrusion and enabling very accurate measurements, hardware trace can help you achieve your performance-optimization goals quickly and efficiently.


Listing One

/* This code implements a simple bubble sort algorithm.  Figure 1 shows the 
 * output from a traditional profiling tool for this code.  You can see that
 * roughly 28% of the time running this program consists of copying memory 
 * using the memset function.  From looking at the code, this is a non-obvious
 * result.
#include <stdio.h>
#define N 10
int array_to_sort[N] = {15, 6, 25, 36, 92, 0, 2, 67, 82, 91};

void bubble_sort(int *array, int size)
    int i, j;
    for(i=0; i<size-1; i++) {
    for(j=0; j<size-i-1; j++) {
        if(array[j] > array[j+1]) {
        int tmp = array[j];
        array[j] = array[j+1];
        array[j+1] = tmp;
void print_array(int *array, int size)
    int i;
    for(i=0; i<size; i++) {
    printf("%d\n", array[i]);
int main(int argc, char *argv[])
    int tmp_array[N];
    memcpy(tmp_array, array_to_sort, sizeof(array_to_sort));
    bubble_sort(tmp_array, N);
    print_array(tmp_array, N);
    return 0;
Back to article

Listing Two
/* This code contains an implementation of the basic quicksort algorithm. 
 * The program's main loop reads an array, sorts it and then verifies that 
 * the array is properly sorted.  By tracing this program, we can easily 
 * identify slow call to quick_sort() and determine causes of this slowdown.

#include <stdio.h>
#define N 100

int array1[N];
int array2[N];

#define swap(a, b) { int tmp = a; a = b; b = tmp; }

void init()
    int i;
    for(i=0; i<N; i++) {
    array1[i] = rand() % 100;
void read_array(int *dest, int *size, int which_array)
    *size = N;
    if(which_array != 6) {
    memcpy(dest, array1 + which_array,
        sizeof(int) * (N - which_array));
    memcpy(dest+N-which_array, array1,
        sizeof(int) * which_array);
    } else {
    int i;
    for(i=0; i<N; i++) {
        array2[i] = i;
    memcpy(dest, array2, sizeof(int) * *size);
void quick_sort_r(int *array, int l, int r)
    int partition_key, i, j;
    if(r <= l)

    i = l-1;
    j = r;
    partition_key = array[r];
    while(1) {
    while(array[++i] < partition_key)
    while(partition_key < array[--j]) {
        if( j == l )
    if( i >= j )
    swap(array[i], array[j]);
    swap(array[i], array[r]);
    quick_sort_r(array, l, i-1);
    quick_sort_r(array, i+1, r);
void quick_sort(int *array, int size)
    quick_sort_r(array, 0, size-1);
int verify_array(int *array, int size)
    int i;
    int last_val = 0;
    for(i=0; i<size; i++) {
    if(array[i] < last_val)
        return 0;
    last_val = array[i];
    return 1;
int main(int argc, char *argv[])
    int i;
    int tmp_array[N];
    int size;
    for(i=0; i<10; i++) {
    // Let's read in an array and sort it
    read_array(tmp_array, &size, i);
    quick_sort(tmp_array, size);

    // Now let's verify that our array is
    // sorted properly.
    if(verify_array(tmp_array, size) == 0) {
        printf("Array not sorted correctly\n");
    return 0;
Back to article

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.