### STL & Generic Programming

*How to optimize computationally-intensive operations with templates.*

### Introduction

This is the second in a short series of articles on the subject of using C++ template
and template metaprogramming techniques to create highly efficient code that makes
C++ acceptable for scientific and numerical programming. In my last column, I
talked about loop unrolling via inline function templates. I also mentioned a
much more difficult problem, namely, the elimination of gratuitous temporary objects
and gratuitous loop iterations when overloading operators such as **operator+**
for algebraic objects such as matrices. Expression templates solve that problem.
Todd Veldhuizen and David Vandevoorde independently invented expression templates.
Todd Veldhuizen's original article from the now defunct *C++ Report* is reprinted
in *C++ Gems* [1]. David Vandevoorde's take on expression templates can be
found in the book on C++ templates he wrote with Nicolai Josuttis [2]. This book
is highly recommended reading even if you're not all that interested in advanced
topics such as expression templates. It is clearly the definitive source and reference
for all your template-related questions.

### Gratuitous Temporaries and Loops

Suppose that we are working on a piece of software, perhaps a library for
scientific programming, that supports matrices and algebraic matrix operations
such as addition and multiplication. If we did this in C++, we would, of course,
have a class template **Matrix**. Listing
1 shows what is probably the simplest possible implementation of such a
class template. A real-life matrix class would not look quite like this, but
the details of the design and implementation are irrelevant for the discussion
at hand. Our matrix class would of course be incomplete without overloaded operators
for the algebraic operations. Listing 1 has
a naive version of **operator+** for matrices. Suppose now that we have two
matrices of **double**s **A** and **B** of equal size *n x m*,
and we write:

Matrix<double, n, m> S = A + B;
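Listing 1 itself is not reproduced with this text. As a point of reference, a naive implementation along the lines the article describes might look like the following sketch; the names **Matrix** and **ElementAt** follow the article's usage, but the details are illustrative, not the actual listing:

```cpp
#include <cstddef>

// A deliberately naive fixed-size matrix (illustrative sketch, not Listing 1 verbatim).
template <class T, std::size_t Rows, std::size_t Cols>
class Matrix {
public:
    Matrix() : m_data() {}  // zero-initialize all elements
    T& ElementAt(std::size_t i, std::size_t j) { return m_data[i][j]; }
    const T& ElementAt(std::size_t i, std::size_t j) const { return m_data[i][j]; }
private:
    T m_data[Rows][Cols];
};

// Naive operator+: performs the summation eagerly and returns a full result matrix.
template <class T, std::size_t Rows, std::size_t Cols>
Matrix<T, Rows, Cols> operator+(const Matrix<T, Rows, Cols>& lhs,
                                const Matrix<T, Rows, Cols>& rhs) {
    Matrix<T, Rows, Cols> result;
    for (std::size_t i = 0; i < Rows; ++i)
        for (std::size_t j = 0; j < Cols; ++j)
            result.ElementAt(i, j) = lhs.ElementAt(i, j) + rhs.ElementAt(i, j);
    return result;
}
```

The returned **result** is exactly the object whose creation and copying the rest of this article sets out to eliminate.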

Without any optimization occurring, executing the overloaded **operator+**
will construct a temporary matrix object from which the matrix **S** is then
copy-constructed. In this case, it is quite likely that your compiler will optimize
away the temporary by applying what is known as RVO (return value optimization):
Instead of creating a temporary and copying it, the sums of the elements of **A**
and **B** will be placed directly into the matrix **S**. However, RVO is
an elusive beast (for example, see [3]). We need only consider a situation
where the matrix **S** has been constructed and used before:

Matrix<double, n, m> S;
// work with S
S = A + B;

This is certainly a plausible scenario. In this case, RVO is not going to happen:
the assignment operator takes a **const** reference to a matrix as its argument.
That reference needs to be bound to something, and that something can only be
a temporary object created by **operator+**. In the world of serious scientific
programming and number crunching, the machine code that is thus generated from
the assignment **S = A + B** is simply laughable. These people want to see
one iteration over the *n x m* matrix positions, during which the elements
of **A** and **B** are added and the sums are placed in the respective positions
of **S**. Instead, we make one iteration to add the elements of **A** and
**B** and park them all in a temporary memory area. Then we make a second iteration
to move the sums from the temporary location to the target matrix **S**, which
is clearly a colossally dumb thing to do in a context where time and space efficiency
are an overriding concern. Therefore, if we want to save scientific and numerical
software engineers from eternal damnation in the dank and musty world of Fortran,
we have to do something about this. As a matter of fact, avoiding the creation
and/or copying of gratuitous temporaries has been a hot topic in the C++ community
for quite some time. Andrei Alexandrescu's article [3] shows the current state
of the art in this area. However, amazing and important as his techniques are,
they do not completely solve the problems we're having here with matrix addition.
Suppose we have three matrices **A**, **B**, and **C**, and we write:

Matrix<double, n, m> S;
// work with S
S = A + B + C;

Under Listing 1's naive implementation, this
will place the sums of **A** and **B** into a temporary matrix (one iteration
over *n x m* positions). It will then place the sum of that temporary matrix
and **C** into a second temporary matrix (a second iteration over *n x m*
elements). Finally it will copy the elements of the second temporary to **S**
(a third iteration over *n x m* elements). We could now embark on a long
discussion of all the different ways to get rid of the creation and/or copying
of the temporaries. However, none of this would satisfy the serious scientific
software engineer. The only way to placate her is to not only get rid of all the
temporary objects, but also to ensure that only one iteration over the *n x
m* matrix positions takes place. That is what expression templates do. I'll
show you what they are and what they do first, and then I'll explain why they're
called "expression templates."

### Expression Templates

The conventional overloaded **operator+** as shown at the end of Listing
1 places the sum of its operands into a matrix object and returns that object.
The only way to stomp out gratuitous temporary objects and gratuitous iterations
arising from this operator once and for all is to change **operator+** radically
so that it never even performs the summation. Instead, **operator+** will
create and return an object of a completely different kind. It is a lightweight
object that holds nothing but references to the two operands of the summation.
This object passes itself off as the actual sum of the two operands by implementing
the method **ElementAt**, which returns the sum of the elements at position
**(i,j)** of the two operands. The method computes that sum on the fly from
the references to the two operands that the object holds.

Listing 2 shows a first (but ultimately
useless) attempt at an implementation. Class **EtMatrixAdd** provides the
lightweight objects as described earlier. A new **operator+** simply constructs
an **EtMatrixAdd** object from its operands and returns it. Finally, a new
assignment operator takes an **EtMatrixAdd** object as its right-hand side.
The implementation of that new assignment operator is simple, because, as I
said before, the **EtMatrixAdd** object passes itself off as a matrix containing
the sum of the two operands by implementing **ElementAt**. Listing
2 does not show a matrix copy constructor that works with the new **EtMatrixAdd**
class, but it would not be difficult to make that work as well. Now if we have
three matrices **S**, **A**, and **B**, all of equal size, and we write:

S = A + B;

then there will be no temporary matrix objects. **operator+** returns a temporary
lightweight object of type **EtMatrixAdd**, which holds references to the matrices
**A** and **B**. The loop inside the assignment operator calculates the
desired sum at each position by calling the lightweight object's method **ElementAt**,
which simply fetches and adds the respective elements of the operands **A**
and **B**. We have thus pretty much achieved the performance of a handwritten
loop.
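Since Listing 2 is likewise not reproduced here, the following self-contained sketch shows the approach just described. A stripped-down **Matrix** is included so the example compiles; the details may differ from the actual listing:

```cpp
#include <cstddef>

template <class T, std::size_t Rows, std::size_t Cols> class Matrix;

// Lightweight stand-in for a sum: holds only references to the two operands
// and computes each element on the fly.
template <class T, std::size_t Rows, std::size_t Cols>
class EtMatrixAdd {
public:
    EtMatrixAdd(const Matrix<T, Rows, Cols>& lhs, const Matrix<T, Rows, Cols>& rhs)
        : m_lhs(lhs), m_rhs(rhs) {}
    T ElementAt(std::size_t i, std::size_t j) const {
        return m_lhs.ElementAt(i, j) + m_rhs.ElementAt(i, j);
    }
private:
    const Matrix<T, Rows, Cols>& m_lhs;
    const Matrix<T, Rows, Cols>& m_rhs;
};

template <class T, std::size_t Rows, std::size_t Cols>
class Matrix {
public:
    Matrix() : m_data() {}
    T& ElementAt(std::size_t i, std::size_t j) { return m_data[i][j]; }
    const T& ElementAt(std::size_t i, std::size_t j) const { return m_data[i][j]; }
    // Assigning from the lightweight object: the single loop in which
    // the summation actually happens.
    Matrix& operator=(const EtMatrixAdd<T, Rows, Cols>& rhs) {
        for (std::size_t i = 0; i < Rows; ++i)
            for (std::size_t j = 0; j < Cols; ++j)
                m_data[i][j] = rhs.ElementAt(i, j);
        return *this;
    }
private:
    T m_data[Rows][Cols];
};

// operator+ no longer adds anything; it merely records the operands.
template <class T, std::size_t Rows, std::size_t Cols>
EtMatrixAdd<T, Rows, Cols> operator+(const Matrix<T, Rows, Cols>& lhs,
                                     const Matrix<T, Rows, Cols>& rhs) {
    return EtMatrixAdd<T, Rows, Cols>(lhs, rhs);
}
```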

The code in Listing 2 is far from providing a general solution to our problems. To see why, let us look at the sum of three matrices again:

S = A + B + C;

Here, the compiler would first apply our new **operator+** to the two matrices
**A** and **B**. The result is a temporary, lightweight object of type **EtMatrixAdd**
that holds references to the matrices **A** and **B**. In other words, the
compiler acts as if we had written:

EtMatrixAdd<double, n, m> Temp1 = A + B;

Next, the compiler must generate code for the addition **Temp1 + C**, and the
result of that addition should be a second temporary lightweight object **Temp2**
that holds references to the two operands **Temp1** and **C**. However,
that's not possible with what we have so far. Our **operator+** can only add
two matrices, but not an **EtMatrixAdd** object and a matrix. Furthermore,
our lightweight **EtMatrixAdd** objects can only hold references to matrices,
but not to other **EtMatrixAdd** objects. Listing
3 fixes these two shortcomings. Listing 3
looks fairly complicated, but if you understand the code in Listing
2, you really don't have to look at Listing
3 in much detail. Here's the gist of the changes: First, **EtMatrixAdd**
has been modified to take two additional template parameters, **LeftOp** and
**RightOp**. Instead of holding references to two matrices, the new **EtMatrixAdd**
holds references to **LeftOp** and **RightOp**. In Listing
2's assignment operator, which allows us to assign from an **EtMatrixAdd**
object to a matrix, the old **EtMatrixAdd** type has been replaced with the
new one. Otherwise, the assignment operator has remained the same. Similarly,
**operator+** for the addition of two matrices remains essentially the same,
except for the two new template arguments to the **EtMatrixAdd** object that
it returns. Finally, there is now a second **operator+** for adding an **EtMatrixAdd**
object to a matrix object. The result is, of course, simply an **EtMatrixAdd**
object that holds references to the two operands.
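The following self-contained sketch captures the gist of these changes. Again, it is an illustration of the technique, not Listing 3 verbatim:

```cpp
#include <cstddef>

// EtMatrixAdd is now generic over its operand types: LeftOp and RightOp
// may each be a Matrix or another EtMatrixAdd.
template <class T, std::size_t Rows, std::size_t Cols, class LeftOp, class RightOp>
class EtMatrixAdd {
public:
    EtMatrixAdd(const LeftOp& lhs, const RightOp& rhs) : m_lhs(lhs), m_rhs(rhs) {}
    T ElementAt(std::size_t i, std::size_t j) const {
        return m_lhs.ElementAt(i, j) + m_rhs.ElementAt(i, j);
    }
private:
    const LeftOp& m_lhs;
    const RightOp& m_rhs;
};

template <class T, std::size_t Rows, std::size_t Cols>
class Matrix {
public:
    Matrix() : m_data() {}
    T& ElementAt(std::size_t i, std::size_t j) { return m_data[i][j]; }
    const T& ElementAt(std::size_t i, std::size_t j) const { return m_data[i][j]; }
    // Assignment from any EtMatrixAdd: one loop, sums computed on the fly.
    template <class LeftOp, class RightOp>
    Matrix& operator=(const EtMatrixAdd<T, Rows, Cols, LeftOp, RightOp>& rhs) {
        for (std::size_t i = 0; i < Rows; ++i)
            for (std::size_t j = 0; j < Cols; ++j)
                m_data[i][j] = rhs.ElementAt(i, j);
        return *this;
    }
private:
    T m_data[Rows][Cols];
};

// Matrix + Matrix
template <class T, std::size_t Rows, std::size_t Cols>
EtMatrixAdd<T, Rows, Cols, Matrix<T, Rows, Cols>, Matrix<T, Rows, Cols> >
operator+(const Matrix<T, Rows, Cols>& lhs, const Matrix<T, Rows, Cols>& rhs) {
    return EtMatrixAdd<T, Rows, Cols,
                       Matrix<T, Rows, Cols>, Matrix<T, Rows, Cols> >(lhs, rhs);
}

// EtMatrixAdd + Matrix
template <class T, std::size_t Rows, std::size_t Cols, class LeftOp, class RightOp>
EtMatrixAdd<T, Rows, Cols,
            EtMatrixAdd<T, Rows, Cols, LeftOp, RightOp>, Matrix<T, Rows, Cols> >
operator+(const EtMatrixAdd<T, Rows, Cols, LeftOp, RightOp>& lhs,
          const Matrix<T, Rows, Cols>& rhs) {
    return EtMatrixAdd<T, Rows, Cols,
                       EtMatrixAdd<T, Rows, Cols, LeftOp, RightOp>,
                       Matrix<T, Rows, Cols> >(lhs, rhs);
}
```

Note that the lightweight objects hold their operands by reference; this is safe here because all temporaries live until the end of the full expression **S = A + B + C**, in which the assignment also takes place.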

We have thus made it possible for the compiler to handle the line:

S = A + B + C;

The code that gets generated is the same code that would be generated from these lines:

typedef Matrix<double, n, m> Mat;
EtMatrixAdd<double, n, m, Mat, Mat> Temp1 = A + B;
EtMatrixAdd<double, n, m, EtMatrixAdd<double, n, m, Mat, Mat>, Mat> Temp2 = Temp1 + C;
S = Temp2;

The addition **A + B** is performed using the first **operator+** in Listing
3, and the result is an **EtMatrixAdd** object that holds references to
**A** and **B**. The addition **Temp1 + C** is performed using the second
**operator+** in Listing 3, and the result
is an **EtMatrixAdd** object that holds references to **Temp1** and **C**.
The assignment **S = Temp2** loops over all matrix positions **(i, j)**
and performs the assignment:

S.ElementAt(i, j) = Temp2.ElementAt(i, j);

The function call on the right-hand side is in fact the inlined one-liner:

Temp1.ElementAt(i, j) + C.ElementAt(i, j);

and this in turn resolves, via inlining, to:

A.ElementAt(i, j) + B.ElementAt(i, j) + C.ElementAt(i, j);

Therefore, the assignment:

S = A + B + C

is performed as efficiently as with a handwritten loop.

### Taking a Step Back

The code in Listing 3 is still a far cry from maturity. The main problem here is that an addition's operands can be two entirely different things, namely good old matrix objects and new lightweight objects. The mature implementations of [1] and [2] resolve this by defining one class template, which, via an additional template parameter, can morph into a good old matrix or into a new lightweight object. There are a number of subtleties and gotchas along the way, so if you're thinking about using expression templates or even implementing them yourself, make sure to study [1] or [2] very carefully. However, there is no need to go into any of that here, because as far as the idea behind expression templates is concerned, we have it all in Listing 3.

So why the term "expression templates"? Take another look at the type of the
temporary object **Temp2** that was created when performing the addition
**A + B + C**. If we suppress the template arguments **double**, **n**,
and **m** for readability, that type becomes:

EtMatrixAdd<EtMatrixAdd<Mat, Mat>, Mat>

It is plain to see that the nested structure of this type represents the parse
tree of the expression *(x + y) + z*. A concrete object of this type augments
that parse tree by references to concrete operands. And finally, an assignment
such as:

S = A + B + C

amounts to an evaluation of the expression, based on the parse tree that is in
the type and the operands that are in the object. Looking at things in this way
explains the term "expression templates" for class templates such as our **EtMatrixAdd**.
It also reveals, in yet another way, the enormous power of C++ templates. We previously
saw how to use them to write full-fledged programs that get executed at compile
time. Here, they are used to generate, at compile time, parse trees of algebraic
expressions that allow efficient evaluation of these expressions at run time.
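This correspondence between nested template types and parse trees can be demonstrated in isolation. The following skeleton uses bare-bones stand-ins (**Mat** and **Add** are illustrative, not the article's classes) and lets the compiler itself confirm that the type of a chained addition mirrors the parse tree; it assumes a compiler with **decltype** and **static_assert** support:

```cpp
#include <type_traits>
#include <utility>

// Bare-bones stand-ins -- illustrative skeletons, not the article's classes.
struct Mat {};
template <class L, class R> struct Add {};

// Every addition merely records its operand types in the result type.
template <class L, class R>
Add<L, R> operator+(const L&, const R&) { return Add<L, R>(); }

// The type of x + y + z is the parse tree of the expression (x + y) + z:
static_assert(
    std::is_same<
        decltype(std::declval<Mat>() + std::declval<Mat>() + std::declval<Mat>()),
        Add<Add<Mat, Mat>, Mat> >::value,
    "the nested template type mirrors the parse tree");
```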

### References

[1] Todd Veldhuizen. "Expression Templates," in Stanley B. Lippman
(Editor), *C++ Gems* (SIGS Books and Multimedia, 1996), pp. 475-488.

[2] David Vandevoorde and Nicolai M. Josuttis. *C++ Templates: The Complete Guide* (Addison-Wesley, 2002).

[3] Andrei Alexandrescu. "Generic&lt;Programming&gt;: Move Constructors,"
*C/C++ Users Journal C++ Experts Forum*, February 2003.

### About the Author

Thomas Becker works as a senior software engineer for Zephyr Associates, Inc. in Zephyr Cove, Lake Tahoe. He can be reached at **thomas@styleadvisor.com**.