Dr. Dobb's | Microbenchmarking C++, C#, and Java

Microbenchmarking C++, C#, and Java

Thomas uses fundamental data processing and arithmetic operations to measure the performance of C++, C#, and Java.

July 01, 2005
URL:http://www.drdobbs.com/cpp/microbenchmarking-c-c-and-java/184401976

Generally speaking, benchmarking is the process of comparing two or more systems to determine which is more efficient or provides better performance, with the ultimate goal of making improvements to one or more of the systems being measured. It's no surprise that few processes in software development are more controversial than benchmarking, as tests are often tweaked to make one product appear to be better than it actually is, or to make it look better than its competitors. In addition, some benchmarks are simply meaningless or irrelevant in what they measure and how they apply to the system under scrutiny. Clearly, for a benchmark to have any value, you need to understand what is actually being measured—and under what conditions.

With all of these caveats in place (along with the familiar cop out "your mileage may vary"), it is worth noting that there are valid reasons for benchmarking systems, the least of which is that benchmarking is in itself interesting. When it comes to computer systems, most benchmarking seems to occur when measuring CPU performance, network throughput, 3D graphic cards, and the like. In this article, however, I compare programming languages with their associated compilers and runtime environments, in particular C++, C#, and Java. To do so for this benchmark, I've modified, extended, and fixed code portions from tests implemented by Christopher W. Cowell-Shah [1] and Doug Bagley [2], whose original shootouts are now out of date or had some flaws [3].

I refer to the tests presented here as "micro benchmarks," which means they perform elementary operations that every programming language should offer. I've split the end results into two categories—simple data processing and arithmetic functions—because these are the two major distinctions of what applications do most of their time (80/20 rule [4]). These tests give a good picture of commonly used algorithms and data structures that you find in most applications today. That's the reason I chose them. I haven't addressed other issues, such as multithreading on multiple CPUs, because they deserve an article in their own rite.

The goal of this micro benchmark is to evaluate the performance characteristics of Java, C#, and C++ on Windows and Linux. The source code files I use can be downloaded from ToMMTi-Systems [5].

Table 1 lists the settings I use for the different language platforms. For some unknown reason(s), compiling with the Intel C/C++ Compiler -fast option failed under Linux; that's why I used -O3 instead. I would have liked to include the Intel C/C++ Compiler for Windows results, but the STLPort 5.0 RC2 failed to compile under Windows. I used both the GCC STL and STLPort 5.0 RC2 STL to get comparable results for Linux and Windows, because Microsoft Visual C 7.1 has a slow HashMap STL implementation.

Language	Compiler Settings	Runtime Settings
Java	-g:none -target: 1.4 (JRE1.4)	-server or -client
	-target: 5 (JRE5)
	(IBM JRE supports no client/server option)
C#	/optimize+ /debug- /checked- /unsafe+ /target:exe	.NET: none
		Mono: --optimize=all
C++	GCC: -O3 -march=pentium4 -msse -msse2 -mfpmath=sse	none
	Intel C/C++: -O3 -march=pentium4 -mcpu=pentium4
	Microsoft Visual C 7.1:
	/Ox /Og /Ob2 /Oi /Ot /Oy /GT /GL /G7 /GA /GF /FD /EHsc /ML /arch:SSE2

Table 1: Compiler settings used for different language platforms.

Because the .NET Framework 1.1 lacks a LinkedList implementation, I used Rodney S. Foley's [6] source code to fill this gap. The .NET Framework 2.0 (currently at beta 2 status) includes a LinkedList implementation under System.Collections.Generic, but it is not used in this benchmark to make the C# comparison more accurate, but included in the second file of the C# benchmark. My own tests showed that it is a bit slower (by approximately 6 percent) than Foley's implementation, maybe due to the use of generics.

I didn't use Java 5's StringBuilder implementation for the same reason. I wanted the 1.4.2 baseline comparison to be more accurate. On the client VM, StringBuilder is 150 percent (dynamic memory allocation) and 310 percent (fixed-memory allocation) faster than StringBuffer. On the server VM, it's even faster—170 percent and 470 percent, respectively.

Most of these micro benchmarks are straightforward and have just enough code to hopefully not get optimized away. They should show a direction, but they should not be taken as absolute values. I've spent time making them as solid as I could in terms of optimization safety, repeatability, and equality. Every test, except the arithmetic and the trigonometric, is run 10 times to let the JIT do its work and to compensate for small errors due to the operating system's background tasks. I won't give any rating because I simply can't rate a platform. It all depends on what you want to make with a platform and what benefits and compromises you can accept and what not. Furthermore, I can't explain the results. I could speculate, but that would not help. There are more knowledgeable people out there who can perform this task.

Configuration wise, the hardware for my testbed is a 3.20-GHz Intel Pentium 4 (Stepping 09) with Hyperthreading Technology, 512 KB L2-Cache, and 1256-MB PC2700 (333 MHz) of memory. There are two software configurations. For Linux, I ran Mandrake Linux 10.1, Vanilla Kernel 2.4.11 (SMP enabled), Glibc 2.3.3, GNU C++ Compiler (GCC) 3.4.1, the Intel C++ Compiler 8.1 (build 024), STLPort 5.0 RC2, Mono 1.0.6, Sun Java 1.4.2_07 (build 05), Sun Java 1.5.0_02 (build 09), and IBM Java 1.4.2 (build cxia32142sr1a-20050209). On the Windows side, I used Windows XP Professional + SP2, Microsoft Visual Studio 7.1 C++ Compiler (build 7.10.3077), .NET Framework 1.1 + SP1 (build 7.10.6001.4), .NET Framework 2.0 Beta 2 (build 8.00.50110.28), the Intel C++ Compiler 8.1 (build 20050201Z), Sun Java 1.4.2_07 (build 05), Sun Java 1.5.0_02 (build 09), IBM Java 1.4.2 (build cxia32142sr1a-20050209), and JIT Excelsior 3.6 (mp3) + Sun Java 1.4.2_06 (build 03).

Before proceeding with the benchmarks, it is important to note that, in contrast to ordinary benchmarks, a lower value in these tests is always better because it is the measured time to perform an operation.

Figure 1 shows the results of the 32-bit integer arithmetic test. It simply does the basic arithmetic operations with 32-bit integer variables in a loop. Here is the pseudocode:

while (i < Max)
{
   Result -= i++
   Result *= i++
   Result += i++
   Result /= i++
}

Figure 1: 32-bit integer arithmetic.

The 64-bit double arithmetic test in Figure 2 is same as the 32-bit integer arithmetic test, but uses 64-bit double variables instead of 32-bit integer. Here is the pseudocode:

while (i < Max)
{
   Result -= i++
   Result *= i++
   Result += i++
   Result /= i++
}

Figure 2: 64-bit double arithmetic.

Figure 3 shows the results of the Long arithmetic test, which replaces the 64-bit double variables with 64-bit long variables. Here is the pseudocode:

while (i < Max)
{
   Result -= i++
   Result *= i++
   Result += i++
   Result /= i++
}

Figure 3: Results.

The trigonometric functions test runs the sin, cos, tan, log10, and sqrt functions in a loop, performed with 64-bit double arithmetic. Figure 4 presents the results. Here is the pseudocode:

while (i < Max)
{
   sine += sin(i)
   cosine += cos(i)
   tangent += tan(i)
   logarithm +=log10(i)
   squareRoot += sqrt(i)
   i++
}

Figure 4: Trigonometric functions.

File I/O performs a write/read of 8,100,000 bytes in/from a text file. Starting with this test, each subsequent test is done 10 times. Figure 5 shows the results. Here is the pseudocode:

while (i < Max)
{
   writeLine ("abcdefghijklmno...")
   i++
}
while (not eof)
{
   result += readLine()
}

Figure 5: File I/O.

The array test measures the basic array set and get functions, including allocation of the required memory. Figure 6 shows the results. Here is the pseudocode:

while (i < Max)
{
   x[i] = i + 1
   y[i] = 0
   i++
}
while (k < 1000)
{
   while (j < Max)
   { 
 y[j] += x[j]
 j++
   }
   k++
}

Figure 6: Array.

Exception handling creates 2x50000 exceptions, which are interleaved. Figure 7 shows the results. Here is the pseudocode:

while (i < Max)
{
   try
   {
 try
 {
    if ( i is odd )
    {
  throw new lo_Exception
    }
    else
    {
  throw new hi_Exception
    }
 }
 catch(lo_exception)
 {
    Lo++
 }
   }
   catch(hi_exception)
   {
 Hi++
   }
   i++
}

Figure 7: Exception.

A single hash map is filled with 32-bit integers via a string key in this test. Then a find is performed with a 62.5 percent success ratio of all values; see Figure 8. Here is the pseudocode:

while (i < Max)
{
   insert value of i with key 
      of i as hex string into hash map
   i++
}
i=0
while (i < Max)
{
   if( hash map contains key 
       of i as decimal string )
   {
 result++
   }
   i++
}

Figure 8: Single hash map.

This time, two hash maps are used. One is filled and iterated to perform find, get, and set operations on the second. Figure 9 shows the results. Here is the pseudocode:

while (i < 10000)
{
   insert value of i with key of i 
       as decimal string into hash map1
   i++
}
i=0
while (i < Max)
{
   for each entry in hash map1
   {
 if( hash map2 contains key of hash map1        entry)
 {
    hash map2 entry value += 
        hash map1 entry value
 }
 else
 {
    insert hash map1 entry value 
       into hash map2 via hash map1 
       entry key
 }
   }
   i++
}

Figure 9: Multiple hash map.

Heap sort is a simple and well-known sorting algorithm. Figure 10 shows the results of the test. Here is the pseudocode:

while (i < Max)
{
   fixed size array[i] = 
      pseudo random value (i)
   i++
}
sort array[Max] via heap sort

Figure 10: Heap sort.

Three double-linked lists are used to perform add, get, and remove operations. Adding and removing is performed by a 4:3 ratio. Figure 11 shows the results. Here is the pseudocode:

while (i < Max)
{
   clear all lists
   while (l < list1 size)
 add l to list1
   list2 = deep copy of list1
   while ( list2 is not empty )
   {
 add first list2 element to the 
     end of list3
 remove first list element from list2
   }
   while ( list3 is not empty )
   {
 add last list3 element to the 
     end of list2
 remove last list element from list3
   }
   while ( list1 is not empty )
   {
 add first list1 element to the 
     begin of list4
 remove first list element from list1
   }
   list1 = deep copy of list4
   iterate through list1 and 
      list2 and search for differences
   i++
}

Figure 11: List.

Three 32-bit integer 30×30 matrices are used to perform a simple matrix multiply operation. Figure 12 shows the results. Here is the pseudocode:

while (i < Max)


{
   ROW_WIDTH = COL_WIDTH = 30
   fill matrix1[ROW_WIDTH][COL_WIDTH] 
 with 'col+ROW_WIDTH*row' 
 of each matrix element
   fill matrix2[ROW_WIDTH][COL_WIDTH] 
 with 'col+ROW_WIDTH*row' 
 of each matrix element
   matrix3 = matrix1 * matrix2
   i++ 
}

Figure 12: Matrix multiply.

Six loops are nested into each other to perform some 32-bit integer add operations.Figure 13 shows the results. Here is the pseudocode:

x=0
for (a = 0; a < Max; a++)
  for (b = 0; b < Max; b++)
  for (c = 0; c < Max; c++)
  for (d = 0; d < Max; d++)
  for (e = 0; e < Max; e++)
  for (f = 0; f <; Max f++)
    x += a + b + c + d + e + f

Figure 13: Nested loop.

String concatenation is performed here, but the memory allocation is done outside the benchmarking loop. That's why I called it "fixed" or "preallocated." Figure 14 presents the results. Here is the pseudocode:

preallocate memory for string1
init string2 array with { "hello", "bla", "hi", "jap" }
while (i < Max)
{
   string1 += string2[i modulo 4]
   i++
}

Figure 14: String concatenation (dynamic memory allocation).

This test is entirely the same as the "preallocated" version, but this time no buffer is preallocated and the allocation time is included in the benchmark. Figure 15 presents the results. Here is the pseudocode:

init string2 array with { "hello", "bla", "hi", "jap" }
while (i < Max)
{
   string1 += string2[i modulo 4]
   i++
}

Figure 15: String concatenation (no buffer).

This last test performs dynamic object creation/destruction and calls a method with six parameters, which are passed by value. Figure 16 shows the results. Here is the pseudocode:

while (i < Max)
{
   create testObject(i, i, i, i, i, i)
   testObject.doSomething(i, i, i, i, i, i)
   testObject.doSomething(i, i, i, i, i, i)
   testObject.doSomething(i, i, i, i, i, i)
   testObject.doSomething(i, i, i, i, i, i)
   destroy testObject
   i++
}

Figure 16: Object creation/ destruction and method call.

Figure 17 shows the average results for the nonarithmetic/nontrigonometric tests, while Figure 18 shows the average results for the arithmetic and trigonometric tests.

Figure 17: Average results without arithmetic.

Figure 18: Average arithmetic and trigonometric.

References

[1] http://www.cowell-shah.com/research/benchmark/code.

[2] http://dada.perl.it/shootout/.

[3] http://www.w3sys.com/pages.meta/benchmarks.html.

[4] http://www.javaworld.com/jw-03-1998/jw-03-hotspot.html.

[5] http://www.tommti-systems.com/mainDateien/reviews/languages/bench.zip.

[6] http://foleyutilities.sourceforge.net/.

For More Information

Performance comparison C++, C#, and Java, http://www.tommti-systems.de/main-Dateien/reviews/languages/benchmarks.html.

Nine Language Performance Round-Up, http://www.osnews.com/story.php?news_id=5602.

Performance of Java versus C++, http://www.idiom.com/~zilla/Computer/javaCbenchmark/.

Java theory and practice: A brief history of garbage collection, ftp://www6.software.ibm.com/software/developer/library/j-jtp10283.pdf.

Java theory and practice: Dynamic compilation and performance, ftp://www6.software.ibm.com/software/developer/library/j-jtp12214.pdf.

Java theory and practice: Anatomy of a flawed micro benchmark, http://www6.software.ibm.com/software/developer/library/ j-jtp02225.html.

Performance Considerations for Runtime Technologies in the .NET Framework, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/dotnetperftechs.asp.

Thomas Bruckschlegel is the founder of ToMMTi-Systems, which deals with 3D hardware, software-performance evaluation, and benchmark/tool development. Thomas can be reached at [email protected].