A Benchmark Apologia



G. Michael Vose is the editor of "OS Report: News and Views on OS/2," and he can be reached at Box 3160, Peterborough, NH 03458. Dave Weil is the Microsoft C project manager, a co-author of the Microsoft C Compiler, and the principal representative from Microsoft on the ANSI C Standard Committee. He can be contacted at 16011 NE 36th Way, Box 97010, Redmond, WA 98073.


Battered and berated, benchmarks might remind you of Wile E. Coyote: They keep getting blown up, but they never seem to go away. Consistently outsmarted by those cunning, optimizing-compiler-writing road runners, benchmarks put on a new face almost yearly and dive back into the performance-testing fray.

Why do we bother with benchmarks? What have they done for us lately? If they are so easy to trick, what is to be gained by testing software with benchmarks? Will compiler technology always be one step ahead of benchmarks, which supposedly help us evaluate new software? Which comes first: the test suite or the product to test?

On the plus side, benchmarks can help us sort out real compiler performance from compiler vendor hype. They can help uncover the flaws in a compiler or show where a compiler is strong. Benchmarks even occasionally expose bugs and language implementation errors.

But benchmarks can also mislead us into thinking a compiler doesn't perform well, when the real problem is the benchmark itself. Conversely, a compiler may show outstanding performance on a benchmark only to prove mediocre in real-world situations. Knowing where the problem lies--in the compiler or in the benchmark--can be tricky.

Ignoring the vast problems encountered in using benchmarks to measure computer hardware performance, let's look at the practice of benchmarking language-compiler software to see if benchmarks can tell us anything useful. This discussion can also highlight the pitfalls we might encounter in designing and using compiler benchmarks and evaluating their results.

What To Test?

Benchmarks often try to do too much. They attempt to test several different qualities of a compiler and then roll the results of those tests into one monolithic total. This is sort of like using a college graduate's overall grade point average to figure out how good he or she will be at math. This flaw raises a fundamental question about what benchmarks are trying to measure.

The primary performance issues for compiler testers encompass several fundamental compiler operations: code quality, including its compactness and its speed of execution; compiling speed; compiler features, such as support for different memory models and compiler settings, or switches; and standards conformance.

The secondary compiler issues subject to testing by benchmarks include the environment of the compiler and the operation of the compiler on different hardware platforms. So the first step in designing benchmarks is to figure out what to measure. After you decide what you want to measure, then you can set out to design a test algorithm to generate some data.

Testing individual statements or expressions is extremely difficult. Designing a test to determine the speed of pointer dereferencing or integer math operations exposes the benchmark writer to a variety of pitfalls. Chief among them is the use of loops to stretch the execution of the test code out to some meaningful, measurable time period. Iterating a piece of test code fifty thousand to one hundred thousand times to obtain a benchmark time of one to two seconds raises the question of whether the benchmark is measuring the test code or the compiler's loop overhead. And in benchmarks that run for only one to two seconds, a difference of a third of a second or less may simply be too small to measure accurately.
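One way to see how much of such a measurement is loop overhead is to time an empty loop of the same trip count and subtract it from the timed run. The following sketch uses the standard library's clock() routine rather than the time_0()/time_n() helpers shown in the listings later in this article; the loop counter and the destination are declared volatile so an optimizer cannot simply delete the loops (which, as discussed next, is exactly what a good optimizer would otherwise do). The timer-resolution caveat still applies, of course.

  #include <stdio.h>
  #include <time.h>

  #define COUNT 100000L

  volatile long sink;       /* volatile: the stores cannot be optimized away */

  int main(void)
          {
          volatile long i;
          volatile long *p = &sink;
          clock_t t0, t1, t2;

          t0 = clock();
          for (i = 0; i < COUNT; i++)
                  ;                        /* empty loop: overhead only */
          t1 = clock();
          for (i = 0; i < COUNT; i++)
                  sink = *p;               /* overhead plus the pointer dereference */
          t2 = clock();

          printf("loop overhead:  %ld ticks\n", (long)(t1 - t0));
          printf("dereference:    %ld ticks\n", (long)((t2 - t1) - (t1 - t0)));
          return 0;
          }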

Loops also frequently fall victim to compiler optimizations. A good compiler can pull loop-invariant code out of a loop and thereby distort the result of a benchmark. In some extreme cases, as we'll explain in one of the examples that follow, an optimizing compiler might do away with a loop altogether. Compilers can also unroll a loop, turning some or all of its iterations into a straight-line sequence of operations while leaving other work inside the loop; this, too, will distort a benchmark's results unless you are specifically trying to measure a compiler's ability to optimize loop code. But such a test may not tell you anything about the quality of the code generated for the body of the loop.
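As a simple illustration (a contrived fragment, not taken from any particular benchmark), a compiler that hoists invariant code is free to turn the first loop below into something closer to the second; a benchmark that intended to measure the multiplication now measures only the addition and the loop.

  /* As written: x * y is recomputed on every iteration. */
  for (i = 0; i < n; i++)
          a[i] = x * y + i;

  /* After invariant hoisting, effectively: */
  t = x * y;
  for (i = 0; i < n; i++)
          a[i] = t + i;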

One way to lessen the problems of using loops in benchmarks is to place a large body of code within the loop. Large code blocks do not necessarily prevent loop optimizations from taking place, but they do ensure that the loop overhead is not significant compared with the time spent doing the operations inside the loop--supposedly what you are trying to measure.

Trying to test individual statement execution probably isn't useful anyway. Even if you could determine that a compiler dereferenced pointers at blazing speeds, an application probably wouldn't spend much of its time dereferencing pointers. Testing individual expressions might not paint an accurate portrait of a compiler's performance for a total application. If you wanted to know how well a compiler handles floating-point operations, the best test you could devise would be a floating-point intensive application, such as an FFT algorithm. Using real working code can often yield more meaningful performance data about compilers than an artificially contrived test can.

Knowing exactly what you are testing becomes crucial in a variety of situations. One common area of confusion is that gray area where a compiler and an operating system overlap. For example, does a disk I/O benchmark test a compiler's ability to handle files or its calls to the operating system? Does that matter as long as you always benchmark compilers under the same operating system? Do disk I/O routines ever go directly to the hardware and bypass the operating system?

A recently published operating system benchmark suffered from a misunderstanding of the gray area where compilers and operating systems intersect. The benchmark sought to test the operating system's virtual memory manager. It created a large array and accessed elements in the array. But the compiler used to compile the benchmark mapped the array in such a manner as to cause a segment swap with each array-element access. The resulting test of the virtual memory manager turned out to be a test of the compiler's memory manager instead. The operating system got blamed for poor performance, but the compiler was the culprit.

A common pitfall in designing compiler tests is a failure to isolate the routines that do the work that you actually want to measure. To measure the speed of matrix multiplication, for example, a test might set up two arrays and then multiply them. Measuring the time needed to set up the arrays, as well as the time needed to make the matrix multiplications, does not tell you how well the compiler generates code to do matrix math. The timing routine needs to be placed around the multiplication operation, not around all the declaration and set-up code.
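A minimal sketch of the right placement, here using the standard clock() routine and arbitrary array sizes and fill values: the set-up runs before the clock starts, and only the multiplication is timed.

  #include <stdio.h>
  #include <time.h>

  #define N 20

  double a[N][N], b[N][N], c[N][N];

  int main(void)
          {
          int i, j, k;
          clock_t start, stop;

          /* Set-up: fill the operand arrays.  Deliberately not timed. */
          for (i = 0; i < N; i++)
                  for (j = 0; j < N; j++)
                          {
                          a[i][j] = (double)(i + j);
                          b[i][j] = (double)(i - j);
                          }

          start = clock();                 /* start the clock here ...  */
          for (i = 0; i < N; i++)
                  for (j = 0; j < N; j++)
                          {
                          double sum = 0.0;
                          for (k = 0; k < N; k++)
                                  sum += a[i][k] * b[k][j];
                          c[i][j] = sum;
                          }
          stop = clock();                  /* ... and stop it here      */

          /* Print one element so the result is used and can be checked. */
          printf("c[1][1] = %g\n", c[1][1]);
          fprintf(stderr, "%ld ticks\n", (long)(stop - start));
          return 0;
          }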

Today you must possess a knowledge of optimization techniques in order to test compilers. If you want to measure code compactness, particularly for a modern optimizing compiler, you must create test algorithms that cannot be reduced in size by a given optimization technique unless you specifically want to measure the implementation of that optimization. Not knowing how a given piece of test code might be optimized leaves you wide open to generating misleading data. Another complication is that you also can't always assume that a compiler optimizes code just because the vendor claims it can; one well-known Unix C compiler purports to do common subexpression elimination but actually performs this optimization in only the simplest of cases.
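For example (an illustrative fragment only), a compiler that really performs common subexpression elimination should compute a * b + c just once in code like the first fragment below, effectively producing the second.

  /* As written: */
  x = a * b + c;
  y = (a * b + c) / d;

  /* After common subexpression elimination, effectively: */
  t = a * b + c;
  x = t;
  y = t / d;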

Evaluating Results

Possibly the hardest part of using benchmarks is interpreting their results. The proper design and coding of benchmarks minimizes this problem. But even well-done tests can be overinterpreted or occasionally misinterpreted.

Many benchmarks leave themselves open to improper interpretation because they fail to produce verifiable results. If a compiler's code executes an algorithm 25 percent faster than its rivals but calculates a wrong answer, how do you interpret its performance? If you don't know it calculated an incorrect answer, you may arrive at an unwarranted conclusion. Benchmarks should always calculate a verifiable solution to a problem.
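The check need not be elaborate; even comparing a result against a value known in advance will catch a compiler that produces fast but wrong code. A minimal sketch:

  /* Sum the integers 1 through 1000; the answer must be 500500. */
  long i, sum = 0;

  for (i = 1; i <= 1000; i++)
          sum += i;

  if (sum != 500500L)
          printf("WRONG ANSWER: %ld\n", sum);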

Interpreting benchmark results also suffers from an intrinsic fault of benchmarks: their size. Benchmark code, because it needs to be portable and easily read, is frequently brief. But can tests using a small piece of code be extrapolated to similar performance characteristics when code size grows to hundreds or thousands of lines? Small test code usually gets compiled with a small or medium memory model; will the results of these restricted-memory operations apply to large- and huge-memory-model programs?

Compilers may store data differently for specific memory models, particularly for large data models. One compiler may store data in the near data segment until this segment overflows and then store subsequent data in far data segments; another may choose to store all data larger than a certain threshold size--or all uninitialized data--in far data segments. For example, Microsoft C always stores uninitialized array data in a far data segment. The compiler does not know how big the array will be, because arrays can have different sizes in different modules (as long as they are not initialized), and the linker resolves the space-allocation problem by allocating space for the largest size. This is called the common model; the name is taken from the Fortran concept of COMMON data. Initializing the array will ensure that it goes into the near data segment if it is less than 32K in size. This memory-model-specific treatment of data can affect the performance of a compiler on a benchmark such as the Sieve. These factors can be controlled in most compilers via initialization, by keywords such as near or far, or by control of the data threshold size.
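As a sketch of the Sieve case just mentioned (the exact placement rules are compiler specific, so treat the comments below as an illustration rather than a specification), the declaration of the flags array alone can decide which data segment it lands in under the rules described above; the two declarations are shown side by side only for comparison.

  char flags[8191];            /* uninitialized: may go to a far data segment  */
  char flags2[8191] = {0};     /* initialized and under 32K: near data segment */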

Now that we've entered the 32-bit era, another frequent benchmark problem is whether ints are 16 or 32 bits. One compiler vendor recently related that several of his customers called to complain that his new compiler was 25 percent slower than a popular competing compiler on a certain integer-math benchmark. The benchmark really only used 16-bit integers, but people testing a 32-bit 80386 compiler declared all the integers as 32 bits. The compiler had to add extra code to cope with storing 16-bit numbers in 32-bit data areas--code that slowed it down by 25 percent. Recoding in 16 bits showed this compiler to be superior to the competing compiler. This incident shows how the assumptions made when running and interpreting benchmarks can create false impressions.
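Recoding in 16 bits simply means choosing a type whose size does not change between the two compilers, so that both are asked to do the work the benchmark intended. A minimal sketch:

  int   total;        /* 16 bits under one compiler, 32 bits under the other */
  short total16;      /* 16 bits under both                                  */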

Another common problem with interpreting benchmark results is inequalities in testing procedures. For example, a tester once walked into a room full of 80386 machines and started plugging in a disk containing an executable file of what he thought was a good benchmark. As he went from machine to machine recording test results, he uncovered some wide disparities. Later snooping revealed that the machines had different-sized caches and different expanded memory managers; both the caches and memory managers had been installed at boot time. Therefore the benchmark results were skewed by factors not immediately apparent. The tester's mistake, of course, was in not carefully controlling all the factors that could affect the results.

Another example of inequitable test procedures shows up in some tests of file I/O. One recent benchmark attempted to measure the efficiency of low-level file I/O by writing a large file to disk and then performing random seeks in the file, reading and writing pieces of it. When this operation used media such as a floppy disk, running the same program consecutively with different compilers, deleting the test file between runs, resulted in each compiler performing the test more slowly than the previous one. Was this because the compilers were really less efficient? The problem stemmed from the fact that the files were created on different areas of the disk (farther out each time) because of the operating system's scheme for allocating disk sectors. The disk head-seek time therefore became significant (because of the way the file was written) and penalized each subsequent run. When the test was rerun on a reformatted disk, so that the file was always created in the same place, each compiler ran the benchmark in almost the identical amount of time. Were they all equally good at low-level file I/O? Profiling this benchmark revealed that it spent more than 95 percent of its time in the operating system and less than 5 percent in the actual file-handling code, so the benchmark was really measuring the operating system. The only compiler with a significantly different (faster) time on this benchmark did some internal buffering to cut down on the number of system calls; this, however, subverted the point of using low-level I/O, which is to force information directly to the disk.

A compiler's options also complicate the benchmark problem. When testing compilers you must know what compiler switches to set to generate the equivalent code from each compiler to make the tests as fair as possible. Optimization levels are a prime example. If you decide to run all benchmarks using the default switch settings for all the compilers you test, some compilers will default to no optimization, others will default to full optimization, and still others will fall somewhere in between. This lack of care in testing can mislead for the following reasons.

  • You provide no information on what these defaults were for each compiler.
  • The testing compares apples to oranges.
  • Most users do not exhibit the same lack of care in running their compiler, except perhaps when first getting accustomed to it.

Stack checking is another example of an area where compiler defaults can differ. Some compilers turn stack checking on by default, and others turn it off. On a benchmark such as the Fibonacci, stack checks could have a huge effect on execution time even though virtually every compiler provides some user control over stack checks. Benchmarkers must be sensitive to the need for compilers to be compared on equal terms.

The Good and the Bad

There are many examples of both good and bad benchmarks. The Dhrystone, Whetstone, and Sieve benchmarks have been around for quite some time, and, although not perfect, are at least well understood.

The Dhrystone benchmark, Version 1.1, is susceptible to some string-handling optimizations. It also produces no output that you can check for accuracy. But the Dhrystone still is one of the better benchmarks, because it does test a good mix of statements and operations. One major exception is floating-point operations: Dhrystone does no floating-point tests even though some testers make the mistake of judging floating-point performance by running the Dhrystone. A new version of the Dhrystone currently being written will address these problems.

The Whetstone benchmark does test floating-point operations and has also been around for a long time. Whetstone performance can get a boost from compilers that insert in-line code for certain routines, and this benchmark also contains some loop invariant code that can be taken out of a loop. The Whetstone benchmark also exhibits a frustrating lack of verifiability, because it produces no output that you can check but performs many complex operations that open the door to a lot of potentially wrong answers.

The Sieve benchmark tests a compiler's ability to perform integer operations, memory references, and simple control structures, such as loops. The Sieve calculates a testable result, prime numbers, and deserves high marks as a test whose output you can verify. But, like other benchmarks, the Sieve tests just a few operations in a small memory environment and therefore cannot tell you much about the many operations that it doesn't test. Plus, as mentioned earlier, the Sieve's results can be misleading if you don't account for an individual compiler's method of handling uninitialized arrays.
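For reference, a version of the familiar Byte Sieve is sketched below; it is reconstructed from memory, so details may differ from any particular published listing. Note the uninitialized flags array, exactly the declaration affected by the memory-model behavior described earlier, and the printed count, which makes the result verifiable.

  #include <stdio.h>

  #define SIZE 8190

  char flags[SIZE + 1];                /* uninitialized, as in the original */

  int main(void)
          {
          int i, k, prime, count, iter;

          for (iter = 1; iter <= 10; iter++)
                  {
                  count = 0;
                  for (i = 0; i <= SIZE; i++)
                          flags[i] = 1;
                  for (i = 0; i <= SIZE; i++)
                          if (flags[i])
                                  {
                                  prime = i + i + 3;
                                  for (k = i + prime; k <= SIZE; k += prime)
                                          flags[k] = 0;
                                  count++;
                                  }
                  }
          printf("%d primes\n", count);   /* should print 1899 */
          return 0;
          }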

Other well-known benchmarks leave a lot to be desired. The so-called Loop and Float benchmarks have major flaws, as do some disk I/O benchmarks. One version of the Fibonacci benchmark frequently gets misinterpreted.

The Float benchmark provides a classic example of how an optimizing compiler can destroy a poorly written test algorithm. The Float algorithm (see Example 1) performs a series of multiplication and division operations inside a loop. An optimizing compiler can reduce this code to a simple assignment operation, c = b. The compiler first propagates constants and discovers that it has a sequence of assignments of the form

  for(...) {
        c = CONST_VAL_1;
        c = CONST_VAL_2;
        c = CONST_VAL_1;
        c = CONST_VAL_2;
        ....
        }

The compiler will then throw away the extra stores to c, leaving only c = CONST_VAL_2. Next the loop optimizer hoists this code out of the loop, because it is invariant, leaving simply c = b. This program could be modified in one of several simple ways to prevent this optimization. Initializing the variables a and b with a function call, as in init(&a, &b), will prevent constant propagation. Alternatively, modifying the body of the loop so that a or b or both also get modified would prevent this optimization, unless the compiler was also smart enough to do the following induction analysis:

  for(....) {
            c = a * b;
            a *= a;
            b-=a;
            ...
            }

The Whetstone benchmark provides good examples of this kind of unoptimizable loop.

Example 1

  /*  The floating-point arithmetic benchmark does repeated
   *  multiplications & divisions in a loop that is large
   *  enough to make the looping time insignificant.  This
   *  program is from the August 1983 issue of Byte magazine.
   */
  #include <stdio.h>

  #define CONST1 3.141597E0
  #define CONST2 1.789032E4
  #define COUNT 50000

  main ()
          {
          double a, b, c;
          unsigned int i;

          time_0();

          a = CONST1;
          b = CONST2;

          for ( i = 0; i < COUNT; ++i)
                   {
                   c = a * b;
                   c = c / a;
                   c = a * b;
                   c = c / a;
                   c = a * b;
                   c = c / a;
                   c = a * b;
                   c = c / a;
                   c = a * b;
                   c = c / a;
                   c = a * b;
                   c = c / a;
                   c = a * b;
                   c = c / a;
                   }
          fprintf (stderr, "%d\n", time_n());
          }
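A sketch of the first fix mentioned above: replace Example 1's two constant assignments with a call to a small helper (init() is the hypothetical routine named earlier). Because a and b no longer appear as compile-time constants inside main(), constant propagation cannot collapse the loop body, although a compiler that expands functions in line could still see through it.

  void init (double *pa, double *pb)
          {
          *pa = CONST1;
          *pb = CONST2;
          }

  /* ... and in main(), instead of  a = CONST1;  b = CONST2;  */
          init (&a, &b);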

The Fibonacci test provides another example of how benchmarks can mislead if the results are not interpreted correctly. This benchmark (see Example 2) computes a Fibonacci number sequence using recursion. But using iteration might be more efficient. This benchmark really tests function-call overhead. The issue here boils down to the question: What do the results mean?

Example 2

  /*  The Fibonacci benchmark tests recursive procedure calls.  */

  #include <stdio.h>

  #define NTIMES 34
  #define NUMBER 24

  main()
          {
          int i;
          unsigned value, fib();

          time_0();

          printf ("%d iterations: ", NTIMES);

          for( i = 1; i <= NTIMES; i++)
                  value = fib(NUMBER);

          printf ("\n");

          fprintf (stderr, "%d\n", time_n());
          }
  unsigned fib(x)
  int x;
          {
          if( x > 2)
                 return (fib(x-1) + fib(x-2));
          else
                 return (1);
          }

This benchmark is small enough and straightforward enough that most compilers generate similar code for it. But test results show that compilers that do well on most other benchmarks sometimes lose badly on this one. The reason has to do with register variables. If a compiler does not implement register variables, the code for the path that computes fib(x-1) + fib(x-2) will look like this (in pseudo-assembler):

   push(x-1)
   call fib
   push ax       ;save result in memory
   push(x-2)
   call fib
   pop bx        ;retrieve saved result
   add ax,bx     ;add current result and saved result

A compiler that implements register variables produces this code:

   push(x-1)
   call fib
   mov si,ax     ;save result in register
   push(x-2)
   call fib
   add ax,si     ;add current and saved results

At first glance, this latter sequence looks as if it should always execute faster. But because these compilers must save and restore the value of the register si every time the function is called--and because this path is taken only one-half the times that fib is called--the overhead of the save/restore on each call is greater than the overhead of saving and restoring the intermediate result in memory. The compiler that uses register variables, contrary to what your expectations may be, runs slower than one that does not.

In most cases, however, register variables help produce more compact code and enhance an algorithm's execution speed. This Fibonacci algorithm really is an exception rather than the rule (other ways to alleviate this problem include saving/restoring si only in the path where it is used). But benchmarkers seldom engage in this kind of analysis.
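As noted earlier, an iterative version computes the same numbers without any calls at all; the sketch below (fib_iter is just an illustrative name) is a reminder of what the recursive benchmark is really measuring, because it removes the function-call overhead that is the very thing under test.

  #include <stdio.h>

  /* Iterative equivalent of Example 2's fib(): fib(1) = fib(2) = 1. */
  unsigned fib_iter(int x)
          {
          unsigned prev = 1, cur = 1, next;
          int i;

          for (i = 3; i <= x; i++)
                  {
                  next = prev + cur;
                  prev = cur;
                  cur = next;
                  }
          return cur;
          }

  int main(void)
          {
          printf("%u\n", fib_iter(24));   /* same value Example 2 computes */
          return 0;
          }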

The Loop benchmark was originally created to factor out loop overhead in benchmarks that used loops to artificially increase a benchmark's running time rather than increase the number of operations at the source level. The problem with this technique is that many compilers that perform loop optimizations will remove these loops, not as a benchmark ploy, but because other loop optimizations may result in empty loop bodies. For example, loop unrolling on loops that iterate a small number of times, as in

  float a[4];
  ...
  for (i = 0; i < 4; i++) {
         a[i] = (float)i;
         }

might well be expanded to four individual assignments and the loop removed entirely.
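That is, the optimizer may effectively produce something like:

  a[0] = 0.0;
  a[1] = 1.0;
  a[2] = 2.0;
  a[3] = 3.0;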

A better solution to the problem of dealing with loop overhead is to simply make sure that the body of the loop is large and complex enough that the loop overhead is insignificant.

So how can you protect yourself from misleading benchmark claims? Always insist on benchmarks that provide verifiable output, so that you can at least be certain the code a compiler generates computes the right answer. Read benchmark code whenever you can to judge the vulnerability of the test algorithms to optimization. And be sure to find out whether all the conditions were the same when a given test was performed on competing systems.

The Acid Test

There will probably never be a perfect compiler benchmark. Compiler technology marches forward at a steady rate, and performance tests almost always follow the creation of new compiler techniques.

To protect yourself as you evaluate compilers, you need to remember a few simple but important ideas:

  • Understand what a test really measures.
  • Understand optimizations and how they affect code.
  • Understand that compilers make assumptions about data storage, optimizations, and so on.
  • Understand the importance of an equal environment for running tests.
  • Take care in interpreting benchmark results.

The true measure of any compiler is how well it suits your programming style and application needs. Benchmarks might be more useful if they measured compiler performance on a variety of common tasks, ranging from heavily compute-bound tasks (for example, the FFT program for floating-point operations, [scalar] matrix inversion for array manipulation, table lookups or tree manipulation for pointers) to I/O bound operations (for example, an object decoder, such as the one used at Microsoft). Similar kinds of applications could test other library related items, such as string manipulation, memory allocation, and the like. Benchmarks that are too simple tell you little, and most compilers end up looking much the same.

Developments of the future will no doubt create a need to test multitasking compiler operations, real-time applications, and multiprocessor systems. These complex systems will only add to the problems of creating realistic and credible benchmarks. But knowing the pitfalls can help us differentiate the signal from the noise.


Copyright © 1989, Dr. Dobb's Journal

