STL & Generic Programming
How to optimize computationally-intensive operations with templates.
This is the second in a short series of articles on the subject of using C++ template and template metaprogramming techniques to create highly efficient code that makes C++ acceptable for scientific and numerical programming. In my last column, I talked about loop unrolling via inline function templates. I also mentioned a much more difficult problem, namely, the elimination of gratuitous temporary objects and gratuitous loop iterations when overloading operators such as operator+ for algebraic objects such as matrices. Expression templates solve that problem. Todd Veldhuizen and David Vandevoorde independently invented expression templates. Todd Veldhuizen's original article from the now defunct C++ Report is reprinted in C++ Gems . David Vandevoorde's take on expression templates can be found in the book on C++ templates he wrote with Nicolai Josuttis . This book is highly recommended reading even if you're not all that interested in advanced topics such as expression templates. It is clearly the definitive source and reference for all your template-related questions.
Gratuitous Temporaries and Loops
Suppose that we are working on a piece of software, perhaps a library for scientific programming, that supports matrices and algebraic matrix operations such as addition and multiplication. If we did this in C++, we would, of course, have a class template Matrix. Listing 1 shows what is probably the simplest possible implementation of such a class template. A real-life matrix class would not look quite like this, but the details of the design and implementation are irrelevant for the discussion at hand. Our matrix class would of course be incomplete without overloaded operators for the algebraic operations. Listing 1 has a naive version of operator+ for matrices. Suppose now that we have two matrices of doubles A and B of equal size n x m, and we write:
Matrix<double, n, m> S = A + B;
Without any optimization occurring, executing the overloaded operator+ will construct a temporary matrix object from which the matrix S is then copy-constructed. In this case, it is quite likely that your compiler will optimize away the temporary by applying what is known as RVO (return value optimization): Instead of creating a temporary and copying it, the sums of the elements of A and B will be placed directly into the matrix S. However, RVO is an elusive beast (for example, see ). All we must do is to consider a situation where the matrix S has been constructed and used before:
Matrix<double, n, m> S; // work with S S = A + B;
This is certainly a plausible scenario. In this case, RVO is not going to happen: the assignment operator takes a const reference to a matrix as its argument. That reference needs to be bound to something, and that something can only be a temporary object created by operator+. In the world of serious scientific programming and number crunching, the machine code that is thus generated from the assignment S = A + B is simply laughable. These people want to see one iteration over the n x m matrix positions, during which the elements of A and B are added and the sums are placed in the respective positions of S. Instead, we make one iteration to add the elements of A and B and park them all in a temporary memory area. Then we make a second iteration to move the sums from the temporary location to the target matrix S, which is clearly a colossally dumb thing to do in a context where time and space efficiency are an overriding concern. Therefore, if we want to save scientific and numerical software engineers from eternal damnation in the dank and musty world of Fortran, we have to do something about this. As a matter of fact, avoiding the creation and/or copying of gratuitous temporaries has been a hot topic in the C++ community for quite some time. Andrei Alexandrescu's article  shows the current state of the art in this area. However, amazing and important as his techniques are, they do not completely solve the problems we're having here with matrix addition. Suppose we have three matrices A, B, and C, and we write:
Matrix<double, n, m> S; // work with S S = A + B + C;
Under Listing 1's naive implementation, this will place the sums of A and B into a temporary matrix (one iteration over n x m positions). It will then place the sum of that temporary matrix and C into a second temporary matrix (a second iteration over n x m elements). Finally it will copy the elements of the second temporary to S (a third iteration over n x m elements). We could now embark on a long discussion of all the different ways to get rid of the creation and/or copying of the temporaries. However, none of this would satisfy the serious scientific software engineer. The only way to placate her is to not only get rid of all the temporary objects, but also to ensure that only one iteration over the n x m matrix positions takes place. That is what expression templates do. I'll show you what they are and what they do first, and then I'll explain why they're called "expression templates."
The conventional overloaded operator+ as shown at the end of Listing 1 places the sum of its operands into a matrix object and returns that object. The only way to stomp out gratuitous temporary objects and gratuitous iterations arising from this operator once and for all is to change operator+ radically so that it never even performs the summation. Instead, operator+ will create and return an object of a completely different kind. It is a lightweight object that holds nothing but references to the two operands of the summation. This object passes itself off as the actual sum of the two operands by implementing the method ElementAt, which returns the sum of the elements at position (i,j) of the two operands. The method computes that sum on the fly from the references to the two operands that the object holds.
Listing 2 shows a first (but ultimately useless) attempt at an implementation. Class EtMatrixAdd provides the lightweight objects as described earlier. A new operator+ simply constructs an EtMatrixAdd object from its operands and returns it. Finally, a new assignment operator takes an EtMatrixAdd object as its right-hand side. The implementation of that new assignment operator is simple, because, as I said before, the EtMatrixAdd object passes itself off as a matrix containing the sum of the two operands by implementing ElementAt. Listing 2 does not show a matrix copy constructor that works with the new EtMatrixAdd class, but it would not be difficult to make that work as well. Now if we have three matrices S, A, and B, all of equal size, and we write:
S = A + B;
then there will be no temporary matrix objects. operator+ returns a temporary lightweight object of type EtMatrixAdd, which holds references to the matrices A and B. The loop inside the assignment operator calculates the desired sum at each position by calling the lightweight object's method ElementAt, which simply fetches and adds the respective elements of the operands A and B. We have thus pretty much achieved the performance of a handwritten loop.
The code in Listing 2 is far from providing a general solution to our problems. To see why, let us look at the sum of three matrices again:
S = A + B + C;
Here, the compiler would first apply our new operator+ to the two matrices A and B. The result is a temporary, lightweight object of type EtMatrixAdd that holds references to the matrices A and B. In other words, the compiler acts as if we had written:
EtMatrixAdd<double, n, m> Temp1 = A + B;
Next, the compiler must generate code for the addition Temp1 + C, and the result of that addition should be a second temporary lightweight object Temp2 that holds references to the two operands Temp1 and C. However, that's not possible with what we have so far. Our operator+ can only add two matrices, but not an EtMatrixAdd object and a matrix. Furthermore, our lightweight EtMatrixAdd objects can only hold references to matrices, but not to other EtMatrixAdd objects. Listing 3 fixes these two shortcomings. Listing 3 looks fairly complicated, but if you understand the code in Listing 2, you really don't have to look at Listing 3 in much detail. Here's the gist of the changes: First, EtMatrixAdd has been modified to take two additional template parameters, LeftOp and RightOp. Instead of holding references to two matrices, the new EtMatrixAdd holds references to LeftOp and RightOp. In Listing 2's assignment operator, which allows us to assign from an EtMatrixAdd object to a matrix, the old EtMatrixAdd type has been replaced with the new one. Otherwise, the assignment operator has remained the same. Similarly, operator+ for the addition of two matrices remains essentially the same, except for the two new template arguments to the EtMatrixAdd object that it returns. Finally, there is now a second operator+ for adding an EtMatrixAdd object to a matrix object. The result is, of course, simply an EtMatrixAdd object that holds references to the two operands.
We have thus made it possible for the compiler to handle the line:
S = A + B + C;
The code that gets generated is the same code that would be generated from these lines:
typedef Matrix<double, n, m> Mat; EtMatrixAdd<double, n, m, Mat, Mat> Temp1 = A + B; EtMatrixAdd< double, n, m, EtMatrixAdd<double, n, m, Mat, Mat>, Mat> Temp2 = Temp1 + C; S = Temp2;
The addition A + B is performed using the first operator+ in Listing 3, and the result is an EtMatrixAdd object that holds references to A and B. The addition Temp1 + C is performed using the second operator+ in Listing 3, and the result is an EtMatrixAdd object that holds references to Temp1 and C. The assignment S = Temp2 loops over all matrix positions (i, j) and performs the assignment:
S.ElementAt(i, j) = Temp2.ElementAt(i, j);
The function call on the right-hand side is in fact the inlined one-liner:
Temp1.ElementAt(i, j) + C.ElementAt(i, j);and this in turn resolves, via inlining, to:
A.ElementAt(i, j) + B.ElementAt(i, j) + C.ElementAt(i, j);
Therefore, the assignment:
S = A + B + C
is performed as efficiently as with a handwritten loop.
Taking a Step Back
The code in Listing 3 is still a far cry from maturity. The main problem here is that an addition's operands can be two entirely different things, namely good old matrix objects and new lightweight objects. The mature implementations of  and  resolve this by defining one class template, which, via an additional template parameter, can morph into a good old matrix or into a new lightweight object. There are a number of subtleties and gotchas along the way, so if you're thinking about using expression templates or even implementing them yourself, make sure to study  or  very carefully. However, there is no need to go into any of that here, because as far as the idea behind expression templates is concerned, we have it all in Listing 3.
So why the term "expression template?" Take another look at the type of the temporary object Temp2 that was created when performing the addition A + B + C. If we suppress the template arguments double, n, and m for readability, that type becomes:
EtMatrixAdd<EtMatrixAdd<Mat, Mat>, Mat>
It is plain to see that the nested structure of this type represents the parse tree of the expression (x + y) + z. A concrete object of this type augments that parse tree by references to concrete operands. And finally, an assignment such as:
S = A + B + C
amounts to an evaluation of the expression, based on the parse tree that is in the type and the operands that are in the object. Looking at things in this way explains the term "expression templates" for class templates such as our EtMatrixAdd. It also reveals, in yet another way, the enormous power of C++ templates. We previously saw how to use them to write full-fledged programs that get executed at compile time. Here, they are used to generate, at compile time, parse trees of algebraic expressions that allow efficient evaluation of these expressions at run time.
 Todd Veldhuizen. "Expression Templates," in Stanley B. Lippman (Editor), C++ Gems (SIGS Books and Multimedia, 1996), pp. 475-488.
 David Vandevoorde and Nicolai M. Josuttis. C++ Templates: The Complete Guide (Addison-Wesley, 2002).
 Andrei Alexandrescu. "Generic<Programming>: Move Constructors," C/C++ Users Journal C++ Experts Forum, February 2003
About the Author
Thomas Becker works as a senior software engineer for Zephyr Associates, Inc. in Zephyr Cove, Lake Tahoe. He can be reached at [email protected].