Thomas uses fundamental data processing and arithmetic operations to measure the performance of C++, C#, and Java.
July 01, 2005
URL:http://www.drdobbs.com/cpp/microbenchmarking-c-c-and-java/184401976
Generally speaking, benchmarking is the process of comparing two or more systems to determine which is more efficient or provides better performance, with the ultimate goal of making improvements to one or more of the systems being measured. It's no surprise that few processes in software development are more controversial than benchmarking, as tests are often tweaked to make one product appear to be better than it actually is, or to make it look better than its competitors. In addition, some benchmarks are simply meaningless or irrelevant in what they measure and how they apply to the system under scrutiny. Clearly, for a benchmark to have any value, you need to understand what is actually being measuredand under what conditions.
With all of these caveats in place (along with the familiar cop out "your mileage may vary"), it is worth noting that there are valid reasons for benchmarking systems, the least of which is that benchmarking is in itself interesting. When it comes to computer systems, most benchmarking seems to occur when measuring CPU performance, network throughput, 3D graphic cards, and the like. In this article, however, I compare programming languages with their associated compilers and runtime environments, in particular C++, C#, and Java. To do so for this benchmark, I've modified, extended, and fixed code portions from tests implemented by Christopher W. Cowell-Shah [1] and Doug Bagley [2], whose original shootouts are now out of date or had some flaws [3].
I refer to the tests presented here as "micro benchmarks," which means they perform elementary operations that every programming language should offer. I've split the end results into two categoriessimple data processing and arithmetic functionsbecause these are the two major distinctions of what applications do most of their time (80/20 rule [4]). These tests give a good picture of commonly used algorithms and data structures that you find in most applications today. That's the reason I chose them. I haven't addressed other issues, such as multithreading on multiple CPUs, because they deserve an article in their own rite.
The goal of this micro benchmark is to evaluate the performance characteristics of Java, C#, and C++ on Windows and Linux. The source code files I use can be downloaded from ToMMTi-Systems [5].
Table 1 lists the settings I use for the different language platforms. For some unknown reason(s), compiling with the Intel C/C++ Compiler -fast
option failed under Linux; that's why I used -O3
instead. I would have liked to include the Intel C/C++ Compiler for Windows results, but the STLPort 5.0 RC2 failed to compile under Windows. I used both the GCC STL and STLPort 5.0 RC2 STL to get comparable results for Linux and Windows, because Microsoft Visual C 7.1 has a slow HashMap STL implementation.
Language | Compiler Settings | Runtime Settings |
Java | -g:none -target: 1.4 (JRE1.4) | -server or -client |
-target: 5 (JRE5) | ||
(IBM JRE supports no client/server option) | ||
C# | /optimize+ /debug- /checked- /unsafe+ /target:exe | .NET: none |
Mono: --optimize=all | ||
C++ | GCC: -O3 -march=pentium4 -msse -msse2 -mfpmath=sse | none |
Intel C/C++: -O3 -march=pentium4 -mcpu=pentium4 | ||
Microsoft Visual C 7.1: | ||
/Ox /Og /Ob2 /Oi /Ot /Oy /GT /GL /G7 /GA /GF /FD /EHsc /ML /arch:SSE2 |
Because the .NET Framework 1.1 lacks a LinkedList implementation, I used Rodney S. Foley's [6] source code to fill this gap. The .NET Framework 2.0 (currently at beta 2 status) includes a LinkedList implementation under System.Collections.Generic
, but it is not used in this benchmark to make the C# comparison more accurate, but included in the second file of the C# benchmark. My own tests showed that it is a bit slower (by approximately 6 percent) than Foley's implementation, maybe due to the use of generics.
I didn't use Java 5's StringBuilder
implementation for the same reason. I wanted the 1.4.2 baseline comparison to be more accurate. On the client VM, StringBuilder
is 150 percent (dynamic memory allocation) and 310 percent (fixed-memory allocation) faster than StringBuffer
. On the server VM, it's even faster170 percent and 470 percent, respectively.
Most of these micro benchmarks are straightforward and have just enough code to hopefully not get optimized away. They should show a direction, but they should not be taken as absolute values. I've spent time making them as solid as I could in terms of optimization safety, repeatability, and equality. Every test, except the arithmetic and the trigonometric, is run 10 times to let the JIT do its work and to compensate for small errors due to the operating system's background tasks. I won't give any rating because I simply can't rate a platform. It all depends on what you want to make with a platform and what benefits and compromises you can accept and what not. Furthermore, I can't explain the results. I could speculate, but that would not help. There are more knowledgeable people out there who can perform this task.
Configuration wise, the hardware for my testbed is a 3.20-GHz Intel Pentium 4 (Stepping 09) with Hyperthreading Technology, 512 KB L2-Cache, and 1256-MB PC2700 (333 MHz) of memory. There are two software configurations. For Linux, I ran Mandrake Linux 10.1, Vanilla Kernel 2.4.11 (SMP enabled), Glibc 2.3.3, GNU C++ Compiler (GCC) 3.4.1, the Intel C++ Compiler 8.1 (build 024), STLPort 5.0 RC2, Mono 1.0.6, Sun Java 1.4.2_07 (build 05), Sun Java 1.5.0_02 (build 09), and IBM Java 1.4.2 (build cxia32142sr1a-20050209). On the Windows side, I used Windows XP Professional + SP2, Microsoft Visual Studio 7.1 C++ Compiler (build 7.10.3077), .NET Framework 1.1 + SP1 (build 7.10.6001.4), .NET Framework 2.0 Beta 2 (build 8.00.50110.28), the Intel C++ Compiler 8.1 (build 20050201Z), Sun Java 1.4.2_07 (build 05), Sun Java 1.5.0_02 (build 09), IBM Java 1.4.2 (build cxia32142sr1a-20050209), and JIT Excelsior 3.6 (mp3) + Sun Java 1.4.2_06 (build 03).
Before proceeding with the benchmarks, it is important to note that, in contrast to ordinary benchmarks, a lower value in these tests is always better because it is the measured time to perform an operation.
Figure 1 shows the results of the 32-bit integer arithmetic test. It simply does the basic arithmetic operations with 32-bit integer variables in a loop. Here is the pseudocode:
while (i < Max) { Result -= i++ Result *= i++ Result += i++ Result /= i++ }
The 64-bit double arithmetic test in Figure 2 is same as the 32-bit integer arithmetic test, but uses 64-bit double variables instead of 32-bit integer. Here is the pseudocode:
while (i < Max) { Result -= i++ Result *= i++ Result += i++ Result /= i++ }
Figure 3 shows the results of the Long
arithmetic test, which replaces the 64-bit double variables with 64-bit long
variables. Here is the pseudocode:
while (i < Max) { Result -= i++ Result *= i++ Result += i++ Result /= i++ }
The trigonometric functions test runs the sin
, cos
, tan
, log10
, and sqrt
functions in a loop, performed with 64-bit double arithmetic. Figure 4 presents the results. Here is the pseudocode:
while (i < Max) { sine += sin(i) cosine += cos(i) tangent += tan(i) logarithm +=log10(i) squareRoot += sqrt(i) i++ }
File I/O performs a write/read of 8,100,000 bytes in/from a text file. Starting with this test, each subsequent test is done 10 times. Figure 5 shows the results. Here is the pseudocode:
while (i < Max) { writeLine ("abcdefghijklmno...") i++ } while (not eof) { result += readLine() }
The array test measures the basic array set
and get
functions, including allocation of the required memory. Figure 6 shows the results. Here is the pseudocode:
while (i < Max) { x[i] = i + 1 y[i] = 0 i++ } while (k < 1000) { while (j < Max) { y[j] += x[j] j++ } k++ }
Exception handling creates 2x50000 exceptions, which are interleaved. Figure 7 shows the results. Here is the pseudocode:
while (i < Max) { try { try { if ( i is odd ) { throw new lo_Exception } else { throw new hi_Exception } } catch(lo_exception) { Lo++ } } catch(hi_exception) { Hi++ } i++ }
A single hash map is filled with 32-bit integers via a string key in this test. Then a find
is performed with a 62.5 percent success ratio of all values; see Figure 8. Here is the pseudocode:
while (i < Max) { insert value of i with key of i as hex string into hash map i++ } i=0 while (i < Max) { if( hash map contains key of i as decimal string ) { result++ } i++ }
This time, two hash maps are used. One is filled and iterated to perform find
, get
, and set
operations on the second. Figure 9 shows the results. Here is the pseudocode:
while (i < 10000) { insert value of i with key of i as decimal string into hash map1 i++ } i=0 while (i < Max) { for each entry in hash map1 { if( hash map2 contains key of hash map1 entry) { hash map2 entry value += hash map1 entry value } else { insert hash map1 entry value into hash map2 via hash map1 entry key } } i++ }
Heap sort is a simple and well-known sorting algorithm. Figure 10 shows the results of the test. Here is the pseudocode:
while (i < Max) { fixed size array[i] = pseudo random value (i) i++ } sort array[Max] via heap sort
Three double-linked lists are used to perform add
, get
, and remove
operations. Adding and removing is performed by a 4:3 ratio. Figure 11 shows the results. Here is the pseudocode:
while (i < Max) { clear all lists while (l < list1 size) add l to list1 list2 = deep copy of list1 while ( list2 is not empty ) { add first list2 element to the end of list3 remove first list element from list2 } while ( list3 is not empty ) { add last list3 element to the end of list2 remove last list element from list3 } while ( list1 is not empty ) { add first list1 element to the begin of list4 remove first list element from list1 } list1 = deep copy of list4 iterate through list1 and list2 and search for differences i++ }
Three 32-bit integer 30×30 matrices are used to perform a simple matrix multiply operation. Figure 12 shows the results. Here is the pseudocode:
while (i < Max) { ROW_WIDTH = COL_WIDTH = 30 fill matrix1[ROW_WIDTH][COL_WIDTH] with 'col+ROW_WIDTH*row' of each matrix element fill matrix2[ROW_WIDTH][COL_WIDTH] with 'col+ROW_WIDTH*row' of each matrix element matrix3 = matrix1 * matrix2 i++ }
Six loops are nested into each other to perform some 32-bit integer add
operations.Figure 13 shows the results. Here is the pseudocode:
x=0 for (a = 0; a < Max; a++) for (b = 0; b < Max; b++) for (c = 0; c < Max; c++) for (d = 0; d < Max; d++) for (e = 0; e < Max; e++) for (f = 0; f <; Max f++) x += a + b + c + d + e + f
String concatenation is performed here, but the memory allocation is done outside the benchmarking loop. That's why I called it "fixed" or "preallocated." Figure 14 presents the results. Here is the pseudocode:
preallocate memory for string1 init string2 array with { "hello", "bla", "hi", "jap" } while (i < Max) { string1 += string2[i modulo 4] i++ }
This test is entirely the same as the "preallocated" version, but this time no buffer is preallocated and the allocation time is included in the benchmark. Figure 15 presents the results. Here is the pseudocode:
init string2 array with { "hello", "bla", "hi", "jap" } while (i < Max) { string1 += string2[i modulo 4] i++ }
This last test performs dynamic object creation/destruction and calls a method with six parameters, which are passed by value. Figure 16 shows the results. Here is the pseudocode:
while (i < Max) { create testObject(i, i, i, i, i, i) testObject.doSomething(i, i, i, i, i, i) testObject.doSomething(i, i, i, i, i, i) testObject.doSomething(i, i, i, i, i, i) testObject.doSomething(i, i, i, i, i, i) destroy testObject i++ }
Figure 17 shows the average results for the nonarithmetic/nontrigonometric tests, while Figure 18 shows the average results for the arithmetic and trigonometric tests.
[1] http://www.cowell-shah.com/research/benchmark/code.
[2] http://dada.perl.it/shootout/.
[3] http://www.w3sys.com/pages.meta/benchmarks.html.
[4] http://www.javaworld.com/jw-03-1998/jw-03-hotspot.html.
[5] http://www.tommti-systems.com/mainDateien/reviews/languages/bench.zip.
[6] http://foleyutilities.sourceforge.net/.
Performance comparison C++, C#, and Java, http://www.tommti-systems.de/main-Dateien/reviews/languages/benchmarks.html.
Nine Language Performance Round-Up, http://www.osnews.com/story.php?news_id=5602.
Performance of Java versus C++, http://www.idiom.com/~zilla/Computer/javaCbenchmark/.
Java theory and practice: A brief history of garbage collection, ftp://www6.software.ibm.com/software/developer/library/j-jtp10283.pdf.
Java theory and practice: Dynamic compilation and performance, ftp://www6.software.ibm.com/software/developer/library/j-jtp12214.pdf.
Java theory and practice: Anatomy of a flawed micro benchmark, http://www6.software.ibm.com/software/developer/library/ j-jtp02225.html.
Performance Considerations for Runtime Technologies in the .NET Framework, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/dotnetperftechs.asp.
Thomas Bruckschlegel is the founder of ToMMTi-Systems, which deals with 3D hardware, software-performance evaluation, and benchmark/tool development. Thomas can be reached at [email protected].
Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.