Linear Algebra with C++ Template Metaprograms

A C++ technique called "template metaprograms" makes it possible for you to exploit the interpretive nature of the C++ template mechanism to write metaprograms that are interpreted at compile time and generate specialized algorithms as their output.

August 01, 1996
URL:http://www.drdobbs.com/cpp/linear-algebra-with-c-template-metaprogr/184409935

August 1996: Linear Algebra with Template Programs

Linear Algebra with C++ Template Metaprograms

Rapid linear algebra is just one use

Todd Veldhuizen and Kumaraswamy Ponnambalam

Todd is a graduate student in the Vision and Image Processing Group, and Kumaraswamy is a professor in the department of Systems Design Engineering, University of Waterloo. They can be contacted at [email protected].

C++ provides a number of advantages over C and Fortran. Since C++ is an object-oriented language, it offers a high level of abstraction and code reusability. C++ is also well-suited to writing numerical codes, since mathematical entities such as matrices and vectors can be treated as objects. The operator-overloading mechanism of C++ makes it possible to write equations in terms of matrices and vectors; for example, the matrix sum D=A+B+C can be written in C++ as D=A+B+C;. This represents an enormous advantage in clarity over C or Fortran, where numerical analysts must work at a much lower level of abstraction. So why hasn't the numerical community been migrating to C++?

Part of the reason is that a steep performance price must be paid to realize the benefits of C++. Performance losses of a factor of up to ten times have been reported when numerical codes were rewritten using C++ class libraries. This is not due to incompetence on the part of library developers, but rather limitations of C++ that make the language inherently inefficient for numerical computing.

Suppose, for example, you wanted to carry out a vector operation that involved adding three vectors; see Example 1(a). When you write code such as this, you want to generate a single loop, like Example 1(b), that evaluates the expression using all the vectors at once. But, due to the pairwise manner in which C++ evaluates expressions, this is not at all what happens. With a naive implementation of the class Vector, what really happens is something like Example 1(c). Two temporary vectors and three loops were required to evaluate an expression needing only one loop and no temporary vectors if hand coded! This is one reason why C++ has such severe performance problems for linear algebra.

A Solution

Algorithm specialization provides one way to overcome these performance problems. Algorithm specialization replaces general-purpose algorithms with much faster specialized versions. By generating specialized algorithms to carry out linear algebra operations, the performance problems of C++ can be overcome. Moreover, the object-oriented advantages of C++ are preserved.

Algorithm specialization usually requires a partial evaluator (see Partial Evaluation and Automatic Program Generation, by N.D. Jones et al., Prentice Hall, 1993), which specializes a program for a set of known parameters, such as matrix and vector sizes. However, partial evaluators introduce many problems. They are highly complex to implement and require language extensions to specify how far optimizations should be carried (for more details, refer to "Compiling Scientific Code Using Partial Evaluation," by A. Berlin and D. Weise, Computer, December 1990).

In this article, we'll describe a different approach using techniques that were partly developed while Todd was an intern at Rogue Wave Software.

A C++ technique called "template metaprograms" makes it possible to exploit the interpretive nature of the C++ template mechanism to write "meta" programs that are interpreted at compile time and generate specialized algorithms as their output. These metaprograms may be thought of as recipes that the compiler can follow to generate a specialized algorithm. This approach overcomes some of the problems with partial evaluation.

A metaprogram recipe consists of a set of code generation rules expressed as C++ template classes. We'll describe a library of recipes designed for common linear algebra operations on small, dense systems. When you write high-level, object-oriented code such as Example 2(a), the compiler follows the library for vector summation and translates the code into Example 2(b).

A second technique, called "expression templates," lets you unwind any complicated matrix or vector expression into a set of equations. Similar recipes have been designed for dot products, PLU factorizations, matrix products, and other operations.

Expression Templates

The basic notion of expression templates is that parse trees of expressions can be represented as C++ types by using recursive templates. As the accompanying text box entitled "A Simple Example of Recursive Templates" illustrates, a recursive template is simply a template that takes other templates as parameters.

To build parse trees of expressions using recursive templates, operators such as +, -, *, and / can be overloaded so that, rather than performing any useful computation, they simply return a type representing a parse tree of the operation. No calculation is performed until the expression template (representing a parse tree of the expression) is assigned to a vector or matrix. At this point, the expression can be evaluated in a single pass using no temporaries. (For a detailed explanation of expression templates and their applications, see Todd's article "Expression Templates," C++ Report, June 1995.)

Template Metaprograms

The C++ template mechanism requires the compiler to act as an interpreter in many ways. An early example of this notion was a small program, written by Erwin Unruh and circulated among members of the ANSI/ISO C++ standards committee, that did not actually compile successfully. Instead, it generated warning messages that contained a list of prime numbers. It turns out that it is possible to do arbitrarily complicated computations at compile time using these metaprograms-C++ templates constitute an interpretive programming language on their own. There are analogs for most C flow-control structures, including if/else/else if, for and do..while loops, and switch statements. (See the text box entitled "A Simple Example of Template Metaprograms.")

The interpretive nature of the compiler can be exploited to generate specialized algorithms, transforming Example 3(a), for instance, into Example 3(b).

When the routine dot() is encountered at compile time, the compiler runs a template metaprogram that unwinds the loop of the dot product and inserts the single equation representing the result. In practice, algorithms are decomposed into a set of code-generation rules encoded as C++ templates; these templates are the template metaprogram used for generating the algorithm. When an algorithm is needed, the C++ compiler uses the metaprogram to generate the necessary code.

This technique can be used to perform quite complex optimizations. For example, it is possible to implement a Cooley-Tukey Fast Fourier Transform (FFT) server class using template metaprograms. At compile time, the template metaprogram calculates the complex roots of unity and the array reordering to produce a massively inlined, optimized FFT routine. (This example program is available on our web site; see "Conclusion.")

A Prototype Linear-Algebra Library

We implemented a C++ class library for linear algebra that provides vector and matrix classes, and supports expressions involving vectors and matrices, matrix-vector products, matrix-matrix products, PLU factorization and solution of linear systems, and miscellaneous math functions. All of the operations in the library were implemented using expression templates and template metaprograms. For example, multiplying two matrices does not result in a call being generated to a matrix-multiply routine in the library; rather, the matrix multiplication is unwound into a set of equations that are inserted inline into the program.

In its current version, this library has a few important limitations. The library is restricted to small, dense systems with sizes that are known at compile time. The systems must be small because compilers place an upper limit on the expansion of recursive templates, which limits the length of loops that may be unwound. However, even if it were possible to unwind longer loops, it would not be desirable: Unwinding the loops for, say, a 100100 matrix multiplication would result in several megabytes of code being generated. This would have disastrous consequences for instruction caches.

Even when using this library on small systems, it is possible to see program sizes expand considerably. To control this code bloat, the C++ class library allows algorithm specialization to be disabled for particular vectors and matrices.

Benchmark Results

We tested this C++ class library's performance on a 486/100-MHz machine. The results are reported relative to the performance of Math.h++, which is a commercial class library from Rogue Wave Software. The Math.h++ library is carefully optimized: Many linear algebra operations result in calls to hand-coded assembly-language Basic Linear Algebra Subroutines (BLAS).

Another limitation of the techniques described in this article is that the necessary C++ language features are not yet universally supported. The most important of these are partial specialization and member templates. This situation will change as more compilers adhere to the ANSI/ISO C++ standard.

Benchmarks were obtained for individual library operations, and two real-world applications; see Figure 1. For vector dot products and element sums, the speed increase was a modest 4-6 times. However, for evaluating vector expressions with two or three operands, the prototype library was 6-20 times as fast as Math.h++. This is due to the use of the expression templates technique, which avoids the use of multiple loops and temporary vectors.

For matrix initialization, speed increases of 6-24 times the Math.h++ library were observed; see Figure 2. The prototype library generates only two processor instructions to initialize an element, which results in the large speed increases for small matrices. Matrix-vector products were 3.5-8 times faster using the prototype library. Matrix expressions are similar to vector expressions, and ran 2-12 times faster. Solving linear systems was 3-6 times faster.

The results for individual library operations were very promising. The true test however, is the performance of the library on real-world applications. Two sample problems were implemented. The first of these was drawn from a Computational Fluid Dynamics code written by Kershaw et al. of Lawrence Livermore National Laboratory. The code implements a 3-D, unstructured grid shock-hydrodynamics model of the Inertial Confinement Fusion process. Their scheme is based on modular computations on small vectors (usually 5-10 elements). They initially used Rogue Wave's Math.h++ library for the entire code, but found that it was unacceptably slow. By rewriting several key modules in C, they were able to achieve a speed increase of ten times over the Math.h++ version.

One of these key modules was selected as a test case for the prototype library. The module contains manipulations of small, dense vectors and matrices, and implementations in both C and Math.h++ were available. A third version was implemented using the prototype library. The running time of the three versions were compared; see Table 1. The prototype library version ran somewhat slower than the C version, and it ran at 10.68 times the speed of the Math.h++ version. These results were very promising: They illustrated that it is possible to achieve the efficiency of C using the prototype library.

The second application was the Newmark-b method, a technique for computing the motion of dynamic systems over time. Each time step involves matrix and vector expressions, matrix-vector products, and solving linear systems. The method was implemented using the prototype library and Math.h++. The relative speed increase, shown in Figure 3, ranged from a factor of 3.6 for 9 degrees-of-freedom systems to 6.9 for single degrees-of-freedom systems.

Conclusion

The template techniques, template meta-programs, and expression templates solve some of the performance problems associated with using C++ for linear algebra. Performance increases of 3-20 times that of a commercial class library have been achieved.

Some of the new language features adopted by the ANSI/ISO C++ standards committee will allow for extension to larger systems and systems with sizes that are unknown at compile time.

With the next generation of compilers supporting these features, it seems reasonable to expect that C++ will be able to rival or exceed the performance of C for numerical computing, while preserving the benefits of an object-oriented approach. More information on these techniques is available at http://monet .uwaterloo.ca/blitz/

References

Berlin, A. and D. Weise. "Compiling Scientific Code Using Partial Evaluation." Computer, December 1990.

Jones, N.D., C.K. Gomard, and P. Sestoft. Partial Evaluation and Automatic Program Generation. Englewood Cliffs, N.J.: Prentice Hall, 1993.

Kershaw, D.S., M.K. Prasad, and M.J. Shaw. "Object-Oriented Development of a Three-Dimensional Unstructured Grid Shock Hydrodynamics Model," OONSKI '94 Conference Proceedings.

Veldhuizen, T. "Using C++ Template Metaprograms." C++ Report, May 1995.

--- . "Expression Templates." C++ Report, June 1995.

DDJ

A Simple Example of Template Metaprograms

This template metaprogram calculates factorials at compile time. When Factorial<3> is instantiated, the value enum is set to 3*Factorial<2>::value. This triggers the instantiation of Factorial<2>, which triggers the instantiation of Factorial<1>, and so on.

The templates are instantiated recursively until the root case Factorial<0> is hit. Then Factorial<3>::value is set to 3*2*1*1=6. For a detailed explanation of template metaprograms and their applications, see Todd's article "Using C++ Template Metaprograms" (C++ Report, May 1995).

- T.V. & K.P.

A Simple Example of Recursive Templates

If you wanted to represent the binary tree using template classes, you could define a template class TreeNode, which represents a single node:

Binary Tree

Template Class TreeNode

template<class L, class T, class R>
class TreeNode {
   ...
};

The two template parameters L and R represent the left and right children of the node, and T represents the type of the node itself. Assuming that A, B, C, D, and E are C++ types, the binary tree could be represented by this template instance:

TreeNode<B, A, TreeNode<D, C, E> >

-T.V. & K.P.

Table 1: Comparing library performance.

                  Running time
Version           (microseconds)

C                 80.3
Prototype Library 81.5
Math.h++          869.6

Example 1: (a) Adding three vectors; (b) code that evaluates the expression efficiently; (c) code actually generated to evaluate the expression.

template<int N>
class Factorial {
  public:
    enum { value = N * 
Factorial<N-1>::value };
};
class Factorial<0> {
  public:
    enum { value = 1 };
};
const int x = Factorial<3>::value;  
// Calculate 3!



(a)     
// Four vectors with 1000 elements of double
Vector<double> y(1000), a(1000), b(1000), c(1000);
y = a + b + c;


(b)     
for (int i=0; i < 1000; ++i)
    y[i] = a[i] + b[i] + c[i];


(c)     
// Temporary vector __t1 allocated, which receives (a+b)
Vector<double> __t1(1000);
for (int i=0; i < 1000; ++i)
    __t1[i] = a[i] + b[i];
// Temporary vector __t2 allocated, which receives (__t1 + c)
Vector<double> __t2(1000);
for (int i=0; i < 1000; ++i)
    __t2[i] = __t1[i] + c[i];
// Result is copied into y
for (int i=0; i < 1000; ++i)
    y[i] = __t2[i];

Example 2: The compiler transforms (a) into (b).

(a)

Vector<double,3> y, a, b, c;
y = a + b + c;


(b)

double y[3], a[3], b[3], c[3];
y[0] = a[0] + b[0] + c[0];
y[1] = a[1] + b[1] + c[1];
y[2] = a[2] + b[2] + c[2];

Example 3: The compiler transforms (a) into (b).

(a)

Vector<double,3> a, b;
double result = dot(a,b); // Dot product of two vectors


(b)

double a[3], b[3];
double result = a[0]*b[0] + a[1]*b[1] + a[2]*b[2];

Figure 1: Performance increase for selected vector operations.

Figure 2: Performance increase for some matrix operations.

Figure 3: Speed increase for the Newmark-b method.