Performance Portable C++

Performance portability means that code can achieve good performance across a range of computer architectures while maintaining a single body of source code.


May 07, 2008
URL:http://www.drdobbs.com/parallel/performance-portable-c/207600622

Jeff is a computer scientist at Lawrence Livermore National Laboratory where he contributes to several software projects managed through the ASC program.


Programmers have two basic ways of organizing arrays of data; see Figure 1. The performance of each choice can vary greatly as code is ported from machine to machine and compiler to compiler.

It's easy to switch between the Array-like and Struct-like implementations in Figure 1 by hiding the array details behind a class API. Listing One shows how a coordinate array is implemented as a performance-portable Point class. There are two important features of the Point class implementation:

- The accessor functions are short, declared inline, and defined directly in the header, so their bodies are visible to the compiler at every call site.
- The accessors return references to the underlying elements, so indexing goes straight to the data without copying.

Together, these two features let almost all compilers optimize away most (if not all) of the class overhead, especially when interprocedural analysis has been enabled in the compiler optimization flags.

If you use classes having the above form, you can quickly switch between array layouts as you port code. The easiest way to do this is to create a configuration header file with system-specific layout choices, and #include that configuration file at the top of each array class header file.
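For example, a configuration header along these lines can select the layout per platform. This is a minimal sketch using the macro names from Listing One below; the file name and platform tests are illustrative, not prescriptive:

/* config.h -- hypothetical configuration header; #include it at the
   top of each array class header. */
#ifndef CONFIG_H
#define CONFIG_H

#define ML_STRUCT 0
#define ML_ARRAY  1

/* Illustrative per-platform defaults -- tune from your own benchmarks. */
#if defined(__ia64__)               /* e.g., Itanium2 */
#define POINT_MEM ML_ARRAY
#else
#define POINT_MEM ML_STRUCT
#endif

#endif /* CONFIG_H */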

If you don't hide the array implementation as I describe here, you can end up completely rewriting your software when switching from one form of array layout to the other.


(a) 

double x[10000] ;         
double y[10000] ; 
double z[10000] ;         

(b)

struct {           
  double x,y,z ;
} point[10000] ;   

Figure 1: (a) Array-like, (b) Struct-like.

  
#include <vector>

#define ML_STRUCT 0
#define ML_ARRAY  1

/* POINT_MEM is set to ML_ARRAY or ML_STRUCT in the configuration header */
#if POINT_MEM == ML_ARRAY
class Point {
public:
   Point(const int size) : m_x(size), m_y(size) {}
   inline double &x(const int idx) { return m_x[idx] ; }
   inline double &y(const int idx) { return m_y[idx] ; }
private:
   Point() ;   /* default construction disallowed */
   std::vector<double> m_x ;
   std::vector<double> m_y ;
} ;
#else /* ML_STRUCT */
class Point {
public:
   Point(const int size) : m_p(size) {}
   inline double &x(const int idx) { return m_p[idx].x ; }
   inline double &y(const int idx) { return m_p[idx].y ; }
private:
   struct Coord { double x, y ; } ;
   Point() ;   /* default construction disallowed */
   std::vector<Coord> m_p ;
} ;
#endif
Listing One

The Benchmarks

To illustrate how array organization can affect performance, I use three examples that calculate area and volume (source-code listings are available at www.ddj.com/code/):

- Listing Two computes triangle areas (Point class plus a Triangle connectivity class).
- Listing Three computes quadrilateral areas (Point class plus a Quadrilateral connectivity class).
- Listing Four computes hexahedral brick volumes (Point class plus a Brick connectivity class).

I apply these calculations to hundreds of thousands of mesh elements. I include enough elements in my benchmark to have a memory footprint of over 20 megabytes, but I organize elements in a cache-optimal way, so cache reuse occurs.

My benchmarks have two interacting array classes. I use a Point class to store coordinates, and a shape-specific class to store the point indices that define the shape.

The examples I've chosen have subtly different memory layouts and memory access patterns. The subtlety helps to emphasize how the interplay of algorithms and data layouts can influence the effectiveness of compiler optimizations in addition to memory latency.
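To make that structure concrete, here is a rough sketch of the Triangle benchmark's shape (my own names and area formula, not the article's actual listings, which are available at the URL above). It pairs the Point class from Listing One with a connectivity class holding three point indices per triangle; only the Struct-like connectivity layout is shown for brevity:

#include <vector>

class Triangle {
public:
   Triangle(const int size) : m_node(size) {}
   /* index of the given corner (0-2) of triangle idx */
   inline int &node(const int idx, const int corner)
      { return m_node[idx].c[corner] ; }
private:
   struct Conn { int c[3] ; } ;
   std::vector<Conn> m_node ;
} ;

/* Sum of signed triangle areas via the 2D cross product. */
double totalArea(Point &p, Triangle &t, const int numTri)
{
   double area = 0.0 ;
   for (int i = 0; i < numTri; ++i) {
      int a = t.node(i, 0), b = t.node(i, 1), c = t.node(i, 2) ;
      area += 0.5 * ((p.x(b) - p.x(a)) * (p.y(c) - p.y(a)) -
                     (p.x(c) - p.x(a)) * (p.y(b) - p.y(a))) ;
   }
   return area ;
}

Because every coordinate access goes through p.x() and p.y(), the loop is oblivious to whether Point stores two separate vectors or one vector of structs.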

The Results

Figures 2, 3, and 4 show the performance of Listings Two, Three, and Four, respectively. For each bar in the graphs, 20 runs were made, and the minimum time was used. There was little variance among the 20 runs since each run was made on a "dedicated" processor having no other users. Table 1 provides the processor/compiler details of each benchmark.

All results within a given bar color are normalized against the minimum time for that color. This lets you measure the relative performance of a given data layout for a given environment.

Comparing different bar colorings to determine the fastest compiler won't work. When interpreting results, you should pick two bar colorings, then compare how those colors interact across all four memory layouts. For example, when looking at Figure 3, you can compare the Core2/pathscale3.0 results to the Power5/xlC7.0 results to see that the fastest memory layout for one configuration is the slowest for the other. The performance difference here is as much as 13 percent. By using performance portable array classes, you can get the best performance in both environments.

Mnemonic             Processor          Compiler         Compiler Options
Core2/icpc10.0       Intel Xeon E5345   Intel v10.0      -fast
Core2/g++4.1.1       Intel Xeon E5345   GNU v4.1.1       -O3 -mfpmath=sse -static
Core2/PGI6.2-3       Intel Xeon E5345   PGI v6.2-3       -fast -Mipa=fast,inline -Bstatic
Core2/path3.0        Intel Xeon E5345   Pathscale v3.0   -Ofast -static
Opteron/g++4.1.1     AMD 8216           GNU v4.1.1       -O3 -mfpmath=sse -static
Opteron/PGI6.2-3     AMD 8216           PGI v6.2-3       -fast -Mipa=fast,inline -Bstatic
Opteron/path3.0      AMD 8216           Pathscale v3.0   -Ofast -static
Itanium2/icpc10.0    Intel Itanium2     Intel v10.0      -fast
Itanium2/g++3.4.4    Intel Itanium2     GNU v3.4.4       -O3 -static
Power5/xlC7.0        IBM Power5         xlC v7.0         -O5 -qnoeh
Power5/g++3.3.3      IBM Power5         GNU v3.3.3       -O3 -static

Table 1: Processor/compiler details of each benchmark. Each test was compiled on the same processor it was run on, except the Core2 runs, which had to be compiled on a 2.4-GHz Intel Xeon using an E75xx chipset.

Figure 2: Point class/Triangle class.

Figure 3: Point class/Quadrilateral class.

Figure 4 is also interesting. Looking at the graph, the most efficient arrangement of data is to implement the Point class as individual arrays, and the Brick connectivity class as an array of structs. Compare this to the layout where Point and Brick are both implemented as arrays of structs. The performance degradation for using a Struct-like Point class with the icpc v10.0 compiler on the Itanium2 is about 36 percent. This is surprising because many scientific applications use Struct-like Point classes almost exclusively.

Figure 4: Point class/Brick class.

Also, the graphs show that compiling a given benchmark with the same compiler but on two different architectures gives different results. Looking at Figure 2, a comparison of the Core2/PGI6.2-3 with the Opteron/PGI6.2-3 shows a performance divergence of up to 26 percent when using exactly the same compiler, data layouts, and compiler options. This fact only becomes evident when compiling on two different machines. This example reveals a key advantage of flexible data structures. The ability to switch data structures lets you catch performance problems related to data layout that may not be obvious through profiling tools. The spikes in some of the graphs show clear performance problems for some configurations, but a profiler may not catch that effect because the usage of a given data structure may be spread uniformly throughout the code.

STL versus Naked Pointers

I initially ran the benchmarks for this article using STL vectors to store data, but was curious how the STL performance compared to using naked pointers. Listing Five (available online) uses the restrict keyword with pointers in place of STL vectors.
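As a rough illustration of the idea (my sketch, not the article's Listing Five), the pointer variant of the triangle loop looks like this; __restrict__ is the g++ spelling of C99's restrict qualifier, and most of the compilers in Table 1 accept a similar extension:

/* conn holds three point indices per triangle */
double totalArea(const double * __restrict__ x,
                 const double * __restrict__ y,
                 const int    * __restrict__ conn,
                 const int numTri)
{
   double area = 0.0 ;
   for (int i = 0; i < numTri; ++i) {
      int a = conn[3*i], b = conn[3*i+1], c = conn[3*i+2] ;
      area += 0.5 * ((x[b] - x[a]) * (y[c] - y[a]) -
                     (x[c] - x[a]) * (y[b] - y[a])) ;
   }
   return area ;
}

The restrict qualifier promises the compiler that the arrays don't alias one another, which can enable optimizations the compiler might not otherwise be able to prove safe.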

In Figures 5, 6, and 7, I plot the ratio of the pointer time to the STL time for the Triangle, Quadrilateral, and Brick benchmarks, respectively (a value less than 1 means the naked pointers are faster). The Quadrilateral benchmark results show that the Pathscale 3.0 compiler on the Opteron ran 32 percent faster on average when using restricted pointers instead of the STL. On the other hand, the g++ compiler ran slightly faster on every benchmark configuration when using the STL. I had to omit the xlC compiler from this comparison because of overly aggressive optimizations that occurred when using pointers.

Figure 5: Point class/Triangle class.

Figure 6: Point class/Quadrilateral class.

Figure 7: Point class/Brick class.

Class Partitioning

Before hiding the implementation of arrays behind a class API, you may want to consider how that extra layer of API affects your code. Having presented hard data on the technique and its performance, I now turn to some guidelines for organizing array classes.

First of all, it's almost certain that any partitioning of data chosen at the beginning of a project will have changed by the time the project ends. Keeping this in mind, you want to partition your data so that there's as little pain as possible if you have to refactor your code. There are two extremes of data organization:

- Put every array in one monolithic class.
- Spread the arrays across many small classes, each holding a few closely related arrays.

The first choice has the advantage of better readability and refactorability. If all of your data members are prefixed with the same short mnemonic, then it is almost like having a namespace for your array data. It is more refactorable because you can move data around within the class, and the class API won't change, so your code won't have to change.

The first choice has the disadvantage that it is hard to instantiate multiple copies of the object without incurring a huge fixed overhead in memory and construction time. For instance, if a class implemented with STL vectors holds 800 arrays, you pay the penalty of instantiating all 800 vectors even if you are cloning the class for just a few of the vectors it contains. Construction and initialization can also be a problem, especially for subsets of arrays.

The second choice has the advantage of extreme flexibility. You can group just a few arrays that are often used together in a class, examples being coordinate vectors or velocity vectors. This is optimal when you want to construct/destruct a lot of temporary objects throughout your run. Small classes are also easier to tune for performance portability.

The disadvantage can be reduced readability because almost every variable will effectively have a different namespace prefix. It's not always fun to write that kind of code, or read it in equations. The readability issue can be mitigated by pulling data out into local temporaries before use, but that can cause code bloat, not to mention the introduction of errors due to cut-and-paste code and typos. Another drawback of many small classes is that if you want to refactor your code, say by moving an array from one class to another, that change will often require changes throughout your code.

Now that I've covered the advantages/disadvantages of the two extremes, I look at a third choice, which is to group data having similar "topological" characteristics.

A good indication that arrays are topologically similar is that they have the same length. For example, when working with physics on meshes, velocity components are often defined at the coordinates, so perhaps coordinates and velocities should be encapsulated in the same class.

In contrast, particle data may also contain coordinates, but you wouldn't necessarily want to group particle coordinates with mesh coordinates because particles serve a completely different function. Particles can move independently of the mesh coordinates, and they will probably be created and destroyed more often.

The advantage of grouping arrays topologically is that they can often be nested in hierarchies. For example, one array class could contain array data common to all the nodes of a mesh, while another class could contain extra array data pertaining to a subset. An index set could be used to map array indices in the subset to corresponding indices in the larger mesh as a way to implement inheritance.
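A hedged sketch of that arrangement (hypothetical names, not code from any particular project):

#include <vector>

/* Arrays defined at every mesh node: coordinates and velocities have
   the same length and topology, so they live together. */
class NodeData {
public:
   NodeData(const int numNode)
      : m_x(numNode), m_y(numNode), m_xd(numNode), m_yd(numNode) {}
   inline double &x(const int idx)  { return m_x[idx] ; }
   inline double &y(const int idx)  { return m_y[idx] ; }
   inline double &xd(const int idx) { return m_xd[idx] ; }
   inline double &yd(const int idx) { return m_yd[idx] ; }
private:
   std::vector<double> m_x,  m_y ;   /* coordinates */
   std::vector<double> m_xd, m_yd ;  /* velocities  */
} ;

/* Extra arrays defined only on a subset of nodes; the index set maps
   each subset slot back to its node in the full mesh. */
class SubsetData {
public:
   SubsetData(const std::vector<int> &map)
      : m_map(map), m_flux(map.size()) {}
   inline double &flux(const int idx) { return m_flux[idx] ; }
   inline int parent(const int idx) const { return m_map[idx] ; }
private:
   std::vector<int>    m_map ;   /* index set */
   std::vector<double> m_flux ;  /* defined only on the subset */
} ;

A loop over the subset can then read mesh data and subset data side by side -- mesh.x(sub.parent(i)) alongside sub.flux(i) -- which is the index-set form of inheritance described above.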

Topological grouping is a good guideline, but it's only a guideline. There may be performance-sensitive groups of arrays that should be split off into separate classes for easier cache tuning.

Once you have decided how to group your classes, there is one other important point. For classes that have an underlying implementation that is Struct-like, make sure that variables that are likely to be used together are in adjacent locations in the struct. This greatly increases your cache hits, and thus your performance. Resist the urge to order struct members strictly alphabetically because that can cause a huge performance penalty.
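For example (a hypothetical layout, assuming the position components are the hot fields):

struct Particle {
   double px, py, pz ;   /* hot: read together in every inner loop  */
   double weight ;       /* warm: read occasionally                 */
   int    id ;           /* cold: diagnostics only, so it goes last */
} ;

Ordering members alphabetically instead can scatter hot fields among cold ones, dragging extra cache lines through every inner loop.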

Conclusion

Performance portability is likely to become a necessary part of programming in the near future. Already, some compilers generate good SSE code for the x86 processor architecture only when stride-one vectors (that is, Array-like layouts) are used. At the same time, older superscalar machines usually perform best when Struct-like layouts are used as often as possible. Without the flexibility to switch memory layouts, performance will almost certainly suffer as code is ported, and that will only become more true on future architectures.

The technique I describe here is used in a large multiphysics project encompassing hundreds of thousands of lines of code. It was originally adopted as a refactoring tool more than a performance portability tool. At one point, the team was able to quickly refactor the hydrodynamics portion of its physics code for a 42 to 100 percent speedup, depending on the problem being solved and the machine being used. Hydrodynamics can be the dominant portion of the runtime for many physics applications, so doubling performance with just a change of data structures is impressive. Part of the gain probably came from the compiler recognizing extra optimizations it could apply (the compiler flags were unchanged), and the rest came from different cache latency characteristics.

Thanks to Brian McCandless for questioning me during a presentation on the sidebar material, and inspiring me to write this article. His group uses a variant of this technique in their software.

References

Triangle and quadrilateral area algorithms: softsurfer.com/algorithms.htm.

Grandy, Jeff. "Efficient Computation of Volume of Hexahedral Cells," Lawrence Livermore National Laboratory, UCRL-ID-128886, October 30, 1997.

Next-Generation Performance Portability

To take full advantage of multicore, NUMA, and heterogeneous technologies, lots of code will likely need to be rewritten, especially in the High Performance Computing (HPC) arena.

To mitigate the impact of this shift to new architectures, I'm investigating the use of source-to-source translation as a possible salvation. I've used a sophisticated tool called ROSE (www.roseCompiler.org) to write a Tiny C language extension that implements the techniques in this article in a way that is transparent to users. All the drawbacks I mention in the article go away, and new features emerge, especially with respect to multidimensional arrays.

Users create a single schema file describing the struct/array grouping of array names used within a code. The compiler can then use the schema to generate structs or arrays as appropriate at compile time. The groupings can be hierarchical, thus subtle relationships among groups of arrays can easily be captured.

I plan to investigate generating RapidMind, Intel Threading Building Blocks, or other backend code to complement the performance portable language extension. My hope is that source translation combined with a vector-like language extension might let the power of next-generation architectures be exploited with a minimal amount of effort devoted to code rewrites.

—J.K.
