C/C++ Compiler Optimization

By Matthew Wilson, May 01, 2004

Squeezing the maximum execution speed from C/C++ compilers requires an understanding of optimization switches.


Focusing on speed

Matthew is a software-development consultant for Synesis Software, creator of the STLSoft libraries, and author of Imperfect C++ (Addison-Wesley, 2004). He can be contacted via http://stlsoft.org/.

In "Comparing C++ Compilers" (DDJ, October 2003), I compared leading Win32 C/C++ compilers against several criteria, including build size, build speed, and execution speed. In this article, I focus exclusively on execution speed. Because the performance of template code is something that is of particular interest to me—and modern C++ practice involves templates to a significant degree—most of the tests performed involve templates. However, there are also two C tests, for reasons that will be explained shortly.

I built all executables with each compiler's maximum speed optimization settings (see Table 1), as far as each allows for the target architecture—Pentium 4—of my test machine. The source and makefiles are available electronically (see "Resource Center," page 5).

There are a couple of issues that need to be addressed from the previous article. First, I made two errors (see http://synesis.com.au/resources/articles/errata/ddj200310.html), the first being that, when I did the Dhrystone test, most of the compilers were optimized for space rather than speed. I don't have any explanation for this; it was not my intention, rather a regrettable oversight. The second misstep was that I failed to apply the -ox flag, in addition to -ot, for the Watcom compiler. This one was plain ignorance, and I thank the chaps from the Open Watcom organization (http://www.openwatcom.org/) for helping me see clearly through the perplexing array of optimization options.

The second issue is that some of the scenarios had been built for size, and their speeds were tested. This was fair in the context of the previous article, since I was examining a raft of compiler characteristics, and optimization for space is a legitimate option that is often advised as the best policy for large systems. However, this message was not well expressed in the article, as I received several e-mails taking me to task on the issue. Furthermore, with the hindsight of the tests run for this article, it is clear that compilers vary considerably in their ability to provide good speed as a byproduct of size optimization. Some of the scenarios here are the same as those from the previous article, but when optimized for speed, the results for some compilers differ markedly. For others, they are pretty much the same.

Another difference is that the set of compilers to be examined has changed, reflecting changes in the industry over the last six months. Borland 5.6.4 (C++ BuilderX) is used instead of Version 5.6 (C++ Builder 6). Digital Mars is now at 8.38, rather than 8.34. I use the new Intel 8.0 rather than 7.0. Open Watcom is 1.2, rather than 1.0.

I've dropped Visual C++ 7.0 because it's an unnecessary enhancement to Version 6.0 when you consider that the excellent Version 7.1 is available for free as part of the .NET SDK.

Comeau 4.3.3 is now featured, though Comeau still does not yet officially support Win32. Despite this, I felt it was important to include it because it is the only 100-percent standard conforming compiler currently available. Also note that I have used it with the Visual C++ 6.0 back end. This means that some aspects of the performance may reflect that of Visual C++ 6.0 rather than Comeau's innate abilities. This is an artifact of the Comeau architecture and its reliance on a back-end compiler, and not something we can (expect to) do anything about other than to employ a different back-end compiler. However, if you're a Comeau user on Win32, one thing you might want to do is to e-mail the vendor about developing Intel back-end compatibility, as Comeau is a demand-driven (and very responsive) vendor. Note that Comeau uses the Visual C++ 6.0 runtime libraries and Intel uses the Visual C++ 7.1 runtime libraries.

Tests

There are two tests that are exclusively or primarily C only—Dhrystone and zlib, which I also featured in the previous article.

There are seven C++ tests: auto_buffer, fixed_array, int2string, multi_array, pod_vector, string tokenization (Boost), and string tokenization (STLSoft). For the C++ tests, I've endeavored to isolate any compiler library-specific performance by using the processheap_allocator from WinSTL (the Win32 subproject of STLSoft; http://winstl.org/) for all classes that take allocators. This directly accesses the Win32 heap API for the current process's heap, so all compilers' C++ executables should have the same memory allocation scheme and experience the same conditions.

auto_buffer uses the STLSoft template of the same name that efficiently provides local buffers whose sizes are determined at runtime (see "Efficient Variable Automatic Buffer," C/C++ Users Journal, December 2003). This test creates and uses 100-byte buffers, 10 million times.
fixed_array and multi_array are rectangular array template tests, conditionally compiled from the same source file. The former uses the STLSoft fixed_array_3d template class, the latter the Boost multi_array template. Both scenarios create dynamically sized arrays of 10×50×100 doubles, then walk through them accessing and setting each element to exercise the indexing functionality of the array classes.
int2string uses the STLSoft integer_to_string template function suite (http://www.cuj.com/documents/s=8943/cujexp0312wilson/) to efficiently perform conversions of 10 million integers to character string form.
pod_vector is an STLSoft template that provides superior performance over std::vector for POD (plain-old-data) types. It achieves this by omitting the destruction of elements, using direct memory-manipulation functions (memcpy(), for example), and auto_buffer. The first two are always beneficial; the third represents an optimization in the average case. This test exercises a range of operations, such as front insertion, back insertion, front erasure, back erasure, and the like.
The two string tokenization scenarios are the same as described in my October 2003 article.
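The access pattern of the two rectangular array scenarios can be sketched as follows. This is only an illustration: array3d and walk() are hypothetical names invented here, not the STLSoft fixed_array_3d or Boost multi_array classes, and the storage scheme (a flat std::vector with row-major index arithmetic) is an assumption about how such a class might be laid out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dynamically sized 3D array of doubles.
// Not the STLSoft or Boost class; flat row-major storage is assumed.
class array3d
{
public:
    array3d(std::size_t d0, std::size_t d1, std::size_t d2)
        : d0_(d0), d1_(d1), d2_(d2), data_(d0 * d1 * d2, 0.0)
    {}

    // Maps (i, j, k) onto the flat buffer. This is the indexing work
    // that the compiler has to inline and simplify.
    double& at(std::size_t i, std::size_t j, std::size_t k)
    {
        assert(i < d0_ && j < d1_ && k < d2_);
        return data_[(i * d1_ + j) * d2_ + k];
    }

    std::size_t dim0() const { return d0_; }
    std::size_t dim1() const { return d1_; }
    std::size_t dim2() const { return d2_; }

private:
    std::size_t d0_;
    std::size_t d1_;
    std::size_t d2_;
    std::vector<double> data_;
};

// Sets every element, reads it back, and returns the sum of the values read.
double walk(array3d& a)
{
    double total = 0.0;
    for (std::size_t i = 0; i < a.dim0(); ++i)
        for (std::size_t j = 0; j < a.dim1(); ++j)
            for (std::size_t k = 0; k < a.dim2(); ++k)
            {
                a.at(i, j, k) = static_cast<double>(i + j + k);
                total += a.at(i, j, k);
            }
    return total;
}
```

Constructing array3d a(10, 50, 100) and calling walk(a) would touch the same number of elements as the scenarios described above. How quickly it runs depends largely on whether the compiler inlines at() and hoists the index arithmetic out of the inner loop.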

Apart from the Dhrystone, all the tests are carried out using a custom test harness that executes each compiler/scenario permutation a given number of times, extracts the performance figures via regular expressions, and calculates their averages, discarding the lowest and highest to try and avoid any blips or operating system caching. The Dhrystone figures were similarly obtained by calculating the averages of a large number of executions for each compiler.
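The averaging step of such a harness might look like the sketch below. The function name trimmed_mean is invented for illustration, and the sketch assumes that exactly one lowest and one highest sample are discarded; the actual harness is not shown in this article.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Average of the samples after discarding one lowest and one highest value.
// Assumption: exactly one sample is dropped from each end. At least three
// samples are required so that something is left to average.
double trimmed_mean(std::vector<double> samples)
{
    assert(samples.size() >= 3);
    std::sort(samples.begin(), samples.end());
    // Skip the first (lowest) and the last (highest) element.
    const double sum =
        std::accumulate(samples.begin() + 1, samples.end() - 1, 0.0);
    return sum / static_cast<double>(samples.size() - 2);
}
```

Dropping the extremes means a single run disturbed by, say, a cold file cache does not drag the reported average away from the typical figure.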

For the Dhrystone test, the higher the number of Dhrystones per second, the better. For all other tests, lower time indicates better performance; all of these were obtained by measuring the active section of code using the WinSTL performance_counter (see my previous article; http://www.windevnet.com/documents/win0305a/).
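Timing only the active section can be sketched as below. This uses std::chrono::steady_clock from the later C++11 standard library as a portable stand-in, not the WinSTL performance_counter used for the article's figures, and time_section_us is a name made up for this example.

```cpp
#include <cassert>
#include <chrono>

// Runs work() once and returns the elapsed wall-clock time in microseconds.
// Setup and teardown belong outside this call so that only the code under
// test is measured.
template <typename F>
long long time_section_us(F work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(
               stop - start).count();
}
```

A caller would prepare its data first and then pass just the measured operation, for example time_section_us([&] { run_scenario(input); }), where run_scenario and input are placeholders for whatever is being tested.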

There are two final points to note. Intel 8 is not explicitly supported by the version of Boost (1.30) that I used in this test, and warnings to that effect were printed to the console during compilation of multi_array and string tokenizer (Boost). Despite this, I doubt that there's anything in the Boost libraries that would be significantly different for compilation with Intel Versions 7 (which is recognized) and 8. The superior performance of the Intel compiler for these two scenarios bears this out.

The second issue is that the Digital Mars 8.38 compiler crashes when compiling the auto_buffer and pod_vector scenarios if exceptions are turned on via the -Ae option, so these two scenarios were built without exceptions for Digital Mars. Once again, I don't think this affects the results much, but it's only fair to mention it, since all other compilers have exceptions enabled for all C++ scenarios.

I spent time ensuring as much compatibility as possible, but still not all compilers support all of the C++ scenarios. Digital Mars is not configured with the Boost 1.30 I used, despite now conforming almost completely to the standard. Borland experienced internal compiler errors compiling some things from Boost. Others had similar issues.

As it turns out, these missing data points are of little consequence. As Table 2 illustrates, even if you adjust the average scores of these compilers to take into account only those scenarios in which they participate, the top five or six ranking places remain unchanged.

As expected, the correct Dhrystone results (Figure 1) paint a different picture to that presented in my October 2003 article. Visual C++ 7.1 is the best and, along with Intel, is head and shoulders above the rest. CodeWarrior also stands out with a good performance. Then come Open Watcom, Visual C++ 6.0, and Comeau close together. Borland, GCC, and Digital Mars fill the last three places, at around 60 percent of the performance of Visual C++ 7.1. The previous test had Digital Mars, Intel, GCC, VC++ 6.0, VC++ 7.1, Borland, CodeWarrior, Open Watcom, in that order; so its results were, indeed, misleading for most of our compilers.

In the zlib scenario, the manipulation of the large file to be compressed is done outside the timed region, so the performance figures obtained represent that of the compression function—zlib's compress()—only. As Figure 2 shows, Borland is the best, closely followed by CodeWarrior, then Open Watcom and Digital Mars. Intel, Visual C++ 7.1, and GCC trail by about 10-20 percent, and Visual C++ 6 and Comeau by about 35 percent.

The first of my C++ tests is a mixture of expected and surprising results; see Figure 3. Intel and Open Watcom are noticeably superior, followed by CodeWarrior and Visual C++ (6 and 7.1, respectively). GCC and Digital Mars are roughly twice as slow; Comeau around three times, and Borland is about five times slower. For such a simple template as auto_buffer, this is not good.

With the fixed_array performance scenario (Figure 4), once again Intel is the best, but with Visual C++ 7.1 snapping at its heels. Next are GCC, CodeWarrior, and Visual C++ 6, about 50 percent slower. Digital Mars is more than twice as slow as Intel, Borland three times, and Comeau around five times. The fixed_array_3d template is more complex than auto_buffer, but it's still surprising to see such a large range in performance.

If you were using Boost's powerful multi_array template class on Win32, the results of the multi_array test (Figure 5) would indicate that you should be using Intel or GCC; nothing else comes close. Visual C++ 6.0, CodeWarrior, and Visual C++ 7.1 all come in about the same, being about three times as slow as Intel. Alas, Comeau seems to be having a hard time, being more than 20 times slower than Intel.

An interesting feature of this test is that it shows that, with Intel, the Boost rectangular array is about on a par with the STLSoft one, which I wrote with performance in mind. For all other compilers, the STLSoft class performs significantly better (up to four times faster), which reflects its simpler, less flexible design.

I think these two rectangular array tests ably show just how challenging an area template optimization can be. The considerably increased sophistication of Boost's multi_array template over STLSoft's fixed_array exposes the difficulties that all compilers—except Intel in this specific case—have in translating the simple logical requirements of a programmer's intent into efficient code. It's no straightforward matter, and it is reckless to write arbitrarily complex code and just assume that the compiler takes care of optimizing it all away for you. This is a serious issue that all fans of template complexity, metaprogramming, and the like, should be aware of.

For integer-to-string conversions (Figure 6) Intel wins again, followed closely by GCC, then Digital Mars and Comeau. CodeWarrior and Visual C++ are next at about twice the cost of Intel. Open Watcom C and Borland bring up the rear.

When I badgered Walter Bright to look at the Digital Mars template optimization of the integer_to_string template, he explained that the compiler was not fully inlining all of the supporting functions. He reworked the compiler so that it does so with Version 8.38, as its previous performance was more in line with that of Borland. I would assume that this is what's happening, to varying degrees, with the slower compilers in this scenario. Indeed, my guess is that inlining depth, or lack thereof, is a major factor in the performance differences throughout the C++ scenarios.

The pod_vector (Figure 7) scenario doesn't throw too many surprises, other than that Comeau gives Intel a serious run for its money. Given the fact that this scenario exercises a number of different aspects of the pod_vector template, coupled with the complexity of pod_vector relative to most of those in the other scenarios presented here, it's impressive that we have such a relatively close grouping over the eight compilers featured in this summary. For a change, Intel is less than twice as fast as its competitors.

Figures 8 and 9 show Boost and STLSoft tokenizer performances, respectively. For Boost's string tokenizer, it is Intel first, and this time with GCC second. Just as with multi_array, these two compilers give the best performance with Boost. CodeWarrior and Visual C++ (6 and 7.1) are also in the game. Borland is more than twice as slow, and Comeau three times.

With the STLSoft tokenizer, Visual C++ 7.1 and CodeWarrior pip Intel for line honors by about 15 percent, which is at least a break from the monotony. Next come GCC and Visual C++ 6 at 25-30 percent slower, and then Borland, Comeau, and Digital Mars at about twice as slow as Visual C++ 7.1.

Conclusion

Any ranking scheme is, of course, arbitrary, so I'll stick to a straightforward one. The compilers score points from 10 for best performance down to 2 for worst; those that do not compile or execute for a given scenario are given zero. Table 2 shows the averages of these rankings for the C scenarios, C++ scenarios, and overall. For those compilers that did not feature in all scenarios, the number of missing scenarios is noted along with an average score for the scenarios in which they did feature.
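The scoring scheme can be sketched roughly as follows. Several details here are assumptions rather than the method actually used for Table 2: the names score_scenario, average_scores, and Averages are invented, a negative time is used as a marker for "did not compile or execute", participants are ranked among themselves counting down from 10, ties are not treated specially, and lower values are taken as better (so Dhrystone figures would need to be negated or inverted first).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Points for one scenario, indexed by compiler. The fastest participant
// gets 10, the next 9, and so on; a negative time (hypothetical marker
// for "did not compile or execute") scores 0.
std::vector<int> score_scenario(const std::vector<double>& times)
{
    assert(times.size() <= 10); // keeps every participant's score above 0

    std::vector<std::size_t> order;
    for (std::size_t i = 0; i < times.size(); ++i)
        if (times[i] >= 0.0)
            order.push_back(i);

    // Sort participating compilers from fastest to slowest.
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return times[a] < times[b]; });

    std::vector<int> points(times.size(), 0);
    for (std::size_t rank = 0; rank < order.size(); ++rank)
        points[order[rank]] = 10 - static_cast<int>(rank);
    return points;
}

struct Averages
{
    double overall;       // zeros for missing scenarios included
    double participating; // only scenarios in which the compiler scored
};

// Averages one compiler's points across all scenarios.
Averages average_scores(const std::vector<int>& points)
{
    assert(!points.empty());
    int total = 0;
    int taken = 0;
    for (int p : points)
    {
        total += p;
        if (p > 0)
            ++taken;
    }
    Averages result;
    result.overall = static_cast<double>(total) / points.size();
    result.participating = taken > 0 ? static_cast<double>(total) / taken : 0.0;
    return result;
}
```

With nine compilers all taking part in a scenario, this yields the 10-down-to-2 spread described above. The two averages correspond to the overall score and the adjusted score for the scenarios in which a compiler did feature.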

These rankings seem to indicate that Intel is the fastest compiler, with an impressive average score of 9.22. Since the Intel compiler specifically targets Intel processors, it's not surprising that it has a superior performance. All's not fair in business, and if you're exclusively targeting the Intel architecture, you probably want to seriously consider the Intel compiler—it provides a variety of optimization options for squeezing out the last drop in performance.

I confess, I was pleasantly surprised by the performance of Visual C++ 7.1 in averaging 7.56; it represents a significant improvement over Version 6, at least as far as these scenarios exercise their capabilities. (It might finally convince me to give up the trusty old Visual Studio 98.) CodeWarrior comes in a close third with 7.44, which is consistent with my expectations and previous experience with it (note that CodeWarrior 9 has just been released and may well perform even better than 8.3). GCC is the best performing of the free compilers, scoring 6.67.

I was surprised to see Visual C++ 6 come in next best, as this challenged several of my preconceptions/prejudices. Comeau was next, which is quite impressive when you consider that it's not yet officially supported on Win32, and that it used Visual C++ 6 as the back end; had I used it with CodeWarrior or Visual C++ 7.1, it would probably have scored higher.

The remaining three compilers were all stymied by virtue of not being compatible with all the scenarios, and their average scores are correspondingly low. I've also included a weighted average of just the scenarios in which they did score, as a teaser for what kind of performance we might expect if/when they do support the templates. The next version of Borland is just around the corner. Digital Mars is now almost entirely standards compliant, and we can hope that the Boost configuration will be available soon. Alas, Open Watcom still seems some way from having sophisticated template support, but we can look forward to that time with some eagerness, if its performance in the auto_buffer scenario indicates likely performance in broader template contexts.

We should remember that this test was all about performance of compiled code, and focused on template code at that. We've not discussed conformance (Comeau takes the cake here), or cost (Digital Mars, GCC, Open Watcom, and Visual C++ 7.1 are free; Intel is free on Linux), or usability (CodeWarrior and Visual C++ have the best IDEs, in my opinion), or cross-platform abilities (CodeWarrior, Comeau, GCC, Intel, and Watcom all have versions that work with multiple platforms), or quality of warnings (they all score well on this). Even though some compilers do very well, it's not appropriate to assume that these relative performances will be reflected in, say, manipulation of polymorphic types; not without conducting tests to prove it, anyway. In any case, every compiler in this test acquits itself well in at least one scenario. I'm maintaining an errata/update page for this article at http://synesis.com.au/resources/articles/errata/ddj200405.html. Performance results for newer compiler versions and, heaven forfend, any errata will be available there.

Whatever your work, I advise you to use as many compilers as possible to ensure the best quality of your software. In this context, the speed of some compilers may be secondary to their quality. You might elect to use a conforming compiler to validate your code's correctness, but actually build using a faster but less modern compiler.

Acknowledgments

Thanks to DDJ readers and engineers from Microsoft, Metrowerks, and Scitech (Open Watcom) who gave me feedback from the October 2003 article. Particular thanks are due to Greg Comeau for super-human patience and helpfulness in the face of an hourly e-mail barrage from a poor fool who can't read and apply simple web site instructions, and to Walter Bright for continuing responsiveness with respect to making enhancements to the Digital Mars compiler.

DDJ
