Performance Portable C++


The Benchmarks

To illustrate how array organization can affect performance, I use three examples that calculate area and volume (source-code listings are available at www.ddj.com/code/):

  • Triangle area, 8 FLOPs/Triangle; see Listing Two online.
  • 2D Quadrilateral area, 8 FLOPs/Quadrilateral; see Listing Three online.
  • 3D Brick volume, 60 FLOPs/Brick (Hexahedron); see Listing Four online.
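As a rough illustration of where a count of 8 FLOPs can come from, the sketch below computes a 2D triangle area from the cross product of two edges and a quadrilateral area from the cross product of its diagonals. This is not the code from Listings Two and Three. The `Point2D` struct and both function names are hypothetical, and the count assumes that only additions, subtractions, and multiplications are tallied.

```cpp
#include <cmath>

// Hypothetical 2D point used only for this sketch.
struct Point2D {
    double x;
    double y;
};

// Triangle area: half the magnitude of the cross product of two edges.
// 4 subtractions + 2 multiplications + 1 subtraction + 1 multiplication
// by 0.5 = 8 FLOPs (std::fabs is not counted here).
inline double triangle_area(const Point2D& a, const Point2D& b,
                            const Point2D& c) {
    const double abx = b.x - a.x, aby = b.y - a.y;
    const double acx = c.x - a.x, acy = c.y - a.y;
    return 0.5 * std::fabs(abx * acy - acx * aby);
}

// Quadrilateral area: half the magnitude of the cross product of the two
// diagonals. Vertices must be listed in order around the perimeter.
// The operation count is the same as for the triangle: 8 FLOPs.
inline double quad_area(const Point2D& a, const Point2D& b,
                        const Point2D& c, const Point2D& d) {
    const double acx = c.x - a.x, acy = c.y - a.y;
    const double bdx = d.x - b.x, bdy = d.y - b.y;
    return 0.5 * std::fabs(acx * bdy - bdx * acy);
}
```

With so little arithmetic per element, the cost of fetching the coordinates is likely to matter as much as the computation itself, which is why the array layout is worth measuring.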

I apply these calculations to hundreds of thousands of mesh elements. Each benchmark includes enough elements to produce a memory footprint of over 20 megabytes, but I order the elements in a cache-optimal way so that cache reuse occurs.

My benchmarks have two interacting array classes. I use a Point class to store coordinates, and a shape-specific class to store the point indices that define the shape.
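A minimal sketch of how two such classes might interact is shown below, assuming a 2D mesh. The class names `PointsAoS` and `PointsSoA`, the `Triangle` struct, and `total_triangle_area` are all hypothetical and are not taken from the article's listings. The point is that both coordinate containers expose the same `x(i)`/`y(i)` interface, so the algorithm can be compiled against either memory layout without being rewritten.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Array-of-structs layout: the x and y of one point sit next to each other.
class PointsAoS {
public:
    explicit PointsAoS(std::size_t n) : data_(n) {}
    double& x(std::size_t i) { return data_[i].x; }
    double& y(std::size_t i) { return data_[i].y; }
    double x(std::size_t i) const { return data_[i].x; }
    double y(std::size_t i) const { return data_[i].y; }
private:
    struct XY { double x, y; };
    std::vector<XY> data_;
};

// Individual-arrays layout: all x values contiguous, all y values contiguous.
class PointsSoA {
public:
    explicit PointsSoA(std::size_t n) : x_(n), y_(n) {}
    double& x(std::size_t i) { return x_[i]; }
    double& y(std::size_t i) { return y_[i]; }
    double x(std::size_t i) const { return x_[i]; }
    double y(std::size_t i) const { return y_[i]; }
private:
    std::vector<double> x_, y_;
};

// Shape-specific connectivity: each triangle holds three point indices.
struct Triangle {
    std::size_t n[3];
};

// Sums triangle areas. The Points type is a template parameter, so the
// loop body is identical for both layouts.
template <class Points>
double total_triangle_area(const Points& p, const std::vector<Triangle>& tris) {
    double sum = 0.0;
    for (std::size_t t = 0; t < tris.size(); ++t) {
        const std::size_t a = tris[t].n[0];
        const std::size_t b = tris[t].n[1];
        const std::size_t c = tris[t].n[2];
        const double abx = p.x(b) - p.x(a), aby = p.y(b) - p.y(a);
        const double acx = p.x(c) - p.x(a), acy = p.y(c) - p.y(a);
        sum += 0.5 * std::fabs(abx * acy - acx * aby);
    }
    return sum;
}
```

Switching layouts then comes down to changing one type name at the call site, for example passing a `PointsSoA` instead of a `PointsAoS`. The connectivity here is stored as an array of structs only; a version with one index array per corner could be given the same treatment.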

The examples I've chosen have subtly different memory layouts and memory access patterns. The subtlety helps to emphasize how the interplay of algorithms and data layouts can influence the effectiveness of compiler optimizations in addition to memory latency.

The Results

Figures 2, 3, and 4 show the performance of Listings Two, Three, and Four, respectively. For each bar in the graphs, 20 runs were made, and the minimum time was used. There was little variance among the 20 runs since each run was made on a "dedicated" processor having no other users. Table 1 provides the processor/compiler details of each benchmark.

All results within a given bar color are normalized against the minimum time for that color. This lets you measure the relative performance of a given data layout for a given environment.
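The bookkeeping described above can be sketched as follows. The function names `best_time` and `normalize` are hypothetical, and the timing values in the usage note are invented purely for illustration; they are not measurements from the article.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimum time over the repeated runs of one layout.
// Assumes the vector is non-empty.
inline double best_time(const std::vector<double>& run_times) {
    return *std::min_element(run_times.begin(), run_times.end());
}

// Normalizes the best time of each layout against the fastest layout for
// the same processor/compiler pair (one bar color). The fastest layout
// maps to 1.0 and every other layout to a value greater than 1.0.
inline std::vector<double> normalize(const std::vector<double>& best_per_layout) {
    const double fastest = best_time(best_per_layout);
    std::vector<double> out(best_per_layout.size());
    for (std::size_t i = 0; i < best_per_layout.size(); ++i) {
        out[i] = best_per_layout[i] / fastest;
    }
    return out;
}
```

For example, made-up best times of 2.0, 2.5, 4.0, and 3.0 seconds for four layouts normalize to 1.0, 1.25, 2.0, and 1.5. Because each color is divided by its own minimum, the normalized values carry no information about which processor/compiler pair is faster in absolute terms.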

Comparing different bar colors to determine the fastest compiler won't work, because each color is normalized against its own minimum. When interpreting results, pick two bar colors, then compare how they behave across all four memory layouts. For example, in Figure 3 you can compare the Core2/path3.0 results to the Power5/xlC7.0 results and see that the fastest memory layout for one configuration is the slowest for the other. The performance difference here is as much as 13 percent. By using performance portable array classes, you can get the best performance in both environments.

Mnemonic           Processor         Compiler        Compiler Options
-----------------  ----------------  --------------  --------------------------------
Core2/icpc10.0     Intel Xeon E5345  Intel v10.0     -fast
Core2/g++4.1.1     Intel Xeon E5345  GNU v4.1.1      -O3 -mfpmath=sse -static
Core2/PGI6.2-3     Intel Xeon E5345  PGI v6.2-3      -fast -Mipa=fast,inline -Bstatic
Core2/path3.0      Intel Xeon E5345  Pathscale v3.0  -Ofast -static
Opteron/g++4.1.1   AMD 8216          GNU v4.1.1      -O3 -mfpmath=sse -static
Opteron/PGI6.2-3   AMD 8216          PGI v6.2-3      -fast -Mipa=fast,inline -Bstatic
Opteron/path3.0    AMD 8216          Pathscale v3.0  -Ofast -static
Itanium2/icpc10.0  Intel             Intel v10.0     -fast
Itanium2/g++3.4.4  Intel             GNU v3.4.4      -O3 -static
Power5/xlC7.0      IBM               xlC v7.0        -O5 -qnoeh
Power5/g++3.3.3    IBM               GNU v3.3.3      -O3 -static

Table 1: Processor/compiler details of each benchmark. Each test was compiled on the same processor it was run on, except the Core2 runs, which had to be compiled on a 2.4-GHz Intel Xeon using an E75xx chipset.

Figure 2: Point class/Triangle class.

Figure 3: Point class/Quadrilateral class.

Figure 4 is also interesting. The graph shows that the most efficient arrangement of data is to implement the Point class as individual arrays and the Brick connectivity class as an array of structs. Compare this to the layout where Point and Brick are both implemented as arrays of structs. The performance penalty for using a struct-like Point class with the icpc v10.0 compiler on the Itanium2 is about 36 percent. This is surprising because many scientific applications use struct-like Point classes almost exclusively.

Figure 4: Point class/Brick class.

The graphs also show that compiling a given benchmark with the same compiler on two different architectures gives different results. In Figure 2, comparing Core2/PGI6.2-3 with Opteron/PGI6.2-3 shows a performance divergence of up to 26 percent with exactly the same compiler, data layouts, and compiler options. This only becomes evident when you compile and run on two different machines. The example reveals a key advantage of flexible data structures: the ability to switch data structures lets you catch performance problems related to data layout that may not be obvious through profiling tools. The spikes in some of the graphs show clear performance problems for some configurations, but a profiler may not catch that effect because the usage of a given data structure may be spread uniformly throughout the code.

