# How to: Write a parallel_for Loop (Matrix Multiply)


A poster on the ISN forums provided a link to an MSDN article titled How to: Write a parallel_for Loop, which shows how to use Visual Studio 2010's **Concurrency::parallel_for** to compute the product of two matrices. The example illustrates how to use **parallel_for** in a nested loop.

The typical C++ serial method to compute the matrix multiplication of a square matrix might look as follows:

```cpp
// Computes the product of two square matrices.
void matrix_multiply(
    double** m1, double** m2, double** result, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        for (size_t j = 0; j < size; j++) {
            double temp = 0;
            for (size_t k = 0; k < size; k++) {
                temp += m1[i][k] * m2[k][j];
            }
            result[i][j] = temp;
        }
    }
}
```

Using the Visual Studio Concurrency Runtime's **parallel_for**, the code looks like:

```cpp
// Computes the product of two square matrices in parallel.
void parallel_matrix_multiply(
    double** m1, double** m2, double** result, size_t size)
{
    parallel_for (size_t(0), size, [&](size_t i) {
        for (size_t j = 0; j < size; j++) {
            double temp = 0;
            for (size_t k = 0; k < size; k++) {
                temp += m1[i][k] * m2[k][j];
            }
            result[i][j] = temp;
        }
    });
}
```

This makes use of C++0x lambda functions and the Concurrency template for the **parallel_for**. The conversion is elegantly simple: essentially two lines of code are rewritten, the first **for** statement and the closing brace of that **for** statement.

The reported performance boost on a four-processor system for 750x750 matrix multiplication was:

Serial: 3853 (ticks)

Parallel: 1311 (ticks)

Approximately a 2.94x speed-up.

This is not ideal scaling, which would produce a 4x speed-up on four processors. But considering that there must be some setup overhead, 2.94x is not bad for a matrix this small.

Although this article illustrates a **parallel_for**, a new parallel programmer might be misled into assuming that it illustrates how to write a parallel matrix multiplication function. After all, this example appeared in an MSDN article -- an authoritative source.

Let's see how we can do better at parallelization of the Matrix Multiplication.

First off, I do not have Visual Studio 2010 so I cannot use the sample program as-is. I do have QuickThread (I am the author of this parallel programming toolkit www.quickthreadprogramming.com). So I adapted the code to use QuickThread. Adaptation is relatively easy. Change the headers and minor differences in syntax.

```cpp
// Computes the product of two square matrices in parallel.
void parallel_matrix_multiply(
    double** m1, double** m2, double** result, size_t size)
{
    parallel_for( 0, size,
        [&](intptr_t iBegin, intptr_t iEnd) {
            for (intptr_t i = iBegin; i < iEnd; ++i) {
                for (intptr_t j = 0; j < size; j++) {
                    double temp = 0;
                    for (intptr_t k = 0; k < size; k++) {
                        temp += m1[i][k] * m2[k][j];
                    }
                    result[i][j] = temp;
                }
            }
        } );
}
```

In the QuickThread dialect, **parallel_for** passes a half-open range, whereas the parallel_for of the VS 2010 Concurrency Runtime passes a single index. QuickThread chose to pass the half-open range because, knowing the range, the programmer and/or compiler can optimize the code within the loop.

Using the QuickThread modified code on a four-core (2 by AMD Opteron 270 dual core) system the tick counts line up as:

Serial: 11011

Parallel: 3417

For a scaling ratio of 3.22x.

The processor models were not listed for the Visual Studio test, so it is hard to tell why QuickThread yields a 3.22x improvement on four cores while VS yields a 2.94x improvement on four processors (cores?).

Setting aside the issue of which threading toolkit is better, let's look at the parallelization of the matrix multiplication from the standpoint of improving the performance.

The first thing any programmer should know is: improve the serial algorithm first, and then address the problem through parallelization (save the SSE intrinsic functions for last).

If you look at the innermost loop of the serial function, we find:

```cpp
for (size_t k = 0; k < size; k++) {
    temp += m1[i][k] * m2[k][j];
}
```

While the **k** index in **m1[i][k]** sequentially accesses memory, the **k** index in **m2[k][j]** does not. This means the compiler is unable to vectorize the loop. Potentially worse, stepping down the rows will, at certain points, cause cache evictions.
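A minimal sketch of that access-pattern difference (the function and names here are illustrative, assuming the usual row-major layout):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the distance, in doubles, between m[k][j] and m[k+1][j] for a
// size x size row-major matrix stored in one contiguous block. Walking
// the k index of m2[k][j] advances by this full-row stride per step,
// while m1[i][k] advances by exactly one double -- the unit stride the
// vectorizer wants.
std::ptrdiff_t column_walk_stride(std::size_t size) {
    std::vector<double> flat(size * size);     // row-major backing store
    const std::size_t j = 1, k = 0;
    return &flat[(k + 1) * size + j] - &flat[k * size + j];
}
```

For a 750x750 matrix, each step of the k index in m2[k][j] jumps 750 doubles (6000 bytes), touching a new cache line every iteration.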

To vectorize the matrix multiplication (and minimize cache evictions) you would have to transpose the matrix **m2** (into **m2t**), and then you could swap the indexes in the inner loop. To transpose the matrix m2 you will have to allocate memory and then write out the transposition of the matrix. This may sound absurd -- adding the overhead to:

- perform an allocation (actually size+1 allocations)
- run the transposition loops
- write the equivalent of the result output an additional time

You might assume that this must be slower than the simple matrix multiplication. Well, you would be wrong. The rewritten serial code looks like:

```cpp
// Computes the DOT product of two vectors, returning the result as a double.
// (Makes the code easier to read.)
inline double DOT(double* v1, double* v2, size_t size)
{
    double temp = 0.0;
    for (size_t i = 0; i < size; i++) {
        temp += v1[i] * v2[i];
    }
    return temp;
}

void matrix_multiplyTranspose(
    double** m1, double** m2, double** result, size_t size)
{
    // Create a matrix to hold the transposition of m2.
    double** m2t = create_matrix(size);
    for (size_t i = 0; i < size; i++) {
        for (size_t j = 0; j < size; j++) {
            m2t[j][i] = m2[i][j];
        }
    }
    // Now matrix multiply with the transposed matrix m2t.
    for (size_t i = 0; i < size; i++) {
        for (size_t j = 0; j < size; j++) {
            result[i][j] = DOT(m1[i], m2t[j], size);
        }
    }
    destroy_matrix(m2t, size);
}
```
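The helpers **create_matrix** and **destroy_matrix** are not shown in the article; a plausible sketch (the names match the calls above, but these bodies are my assumption) consistent with the double** layout and the "size+1 allocations" noted earlier might be:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical implementation of the create_matrix helper used above.
// One allocation for the row-pointer array plus one per row accounts
// for the size+1 allocations mentioned in the text.
double** create_matrix(std::size_t size)
{
    double** m = new double*[size];        // 1 allocation: row pointers
    for (std::size_t i = 0; i < size; i++)
        m[i] = new double[size]();         // size allocations: zeroed rows
    return m;
}

// Hypothetical matching cleanup helper.
void destroy_matrix(double** m, std::size_t size)
{
    for (std::size_t i = 0; i < size; i++)
        delete[] m[i];
    delete[] m;
}
```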

In a real implementation you might want to consider preserving the buffers allocated on the first call. Then, on subsequent calls, check whether the prior allocation was sufficient for the current call. If so, bypass the allocation; if not, delete the buffers and allocate new ones. This further optimization detail is left to readers. Also note that if the m2 matrix is going to be used many times (e.g., in a filter system), you can perform the transposition once. These are implementation strategies to keep in mind when taking these suggestions into your own code.
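One way to sketch that buffer-reuse strategy (the function name and static-cache structure here are illustrative, not from the article):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch: keep the transpose buffer across calls, reusing
// it whenever the prior allocation is large enough. Not thread-safe as
// written (static state); a real implementation would guard or
// per-thread this.
static double** cached_m2t  = nullptr;
static std::size_t cached_size = 0;

double** get_transpose_buffer(std::size_t size)
{
    if (size > cached_size) {                  // prior allocation too small
        for (std::size_t i = 0; i < cached_size; i++)
            delete[] cached_m2t[i];
        delete[] cached_m2t;
        cached_m2t = new double*[size];
        for (std::size_t i = 0; i < size; i++)
            cached_m2t[i] = new double[size];
        cached_size = size;
    }
    return cached_m2t;                         // bypass allocation when big enough
}
```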

Let's see how the serial with transposition technique stacks up against the serial without transposition as run on the 2x dual-core AMD Opteron system:

Serial: 10531

Serial Transpose: 5998

An improvement of approximately 1.76x over serial alone.

The parallel transpose function looks like:

```cpp
void parallel_matrix_multiplyTranspose(
    double** m1, double** m2, double** result, size_t size)
{
    // Create a matrix to hold the transposition of m2.
    double** m2t = create_matrix(size);
    // Perform the transposition of m2 to m2t.
    parallel_for( 0, size,
        [&](intptr_t iBegin, intptr_t iEnd) {
            for (intptr_t i = iBegin; i < iEnd; ++i) {
                for (intptr_t j = 0; j < size; j++) {
                    m2t[j][i] = m2[i][j];
                }
            }
        } );
    // Perform the matrix multiplication using m2t.
    parallel_for( 0, size,
        [&](intptr_t iBegin, intptr_t iEnd) {
            for (intptr_t i = iBegin; i < iEnd; ++i) {
                for (intptr_t j = 0; j < size; j++) {
                    result[i][j] = DOT(m1[i], m2t[j], size);
                }
            }
        } );
    destroy_matrix(m2t, size);
}
```

Let's look at the performance comparisons on a system with more cores.

A dual Xeon 5570 system running Ubuntu 10.0 and QuickThread 2.0.0.RC001 has:

- Two Processors
- 16 hardware threads as 8 cores with HyperThreading

The test program was modified to use **_rdtsc()** instead of a millisecond timer, to attain better resolution.
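A minimal sketch of that timing approach (the article's actual harness is not shown; this assumes an x86 compiler exposing __rdtsc, e.g. via <x86intrin.h> on GCC/Clang or <intrin.h> on MSVC):

```cpp
#include <x86intrin.h>   // __rdtsc on GCC/Clang; MSVC provides it via <intrin.h>

// Times an arbitrary chunk of work in time-stamp-counter ticks, which
// gives far finer resolution than a millisecond timer.
unsigned long long time_work_in_ticks()
{
    volatile double sink = 0.0;           // volatile: keep the loop from being optimized away
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < 1000000; i++)     // stand-in for the work being timed
        sink += i * 0.5;
    unsigned long long t1 = __rdtsc();
    return t1 - t0;
}
```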

```
matrix size = 750 x 750
serial:              9072682708  (_rdtsc() ticks)
serial Transpose:    1181206962  ratio 7.68086x (to original serial)
parallel:            1261734015  ratio 7.19065x (to original serial)
parallel Transpose:   165788596  ratio 54.7244x (to original serial)
                                 ratio 7.1248x  (to Transpose serial)
```

On a larger matrix we have:

```
matrix size = 2048 x 2048
serial:              212494313170
serial Transpose:     28047918588  ratio 7.57612x (to original serial)
parallel:             28098476692  ratio 7.56249x (to original serial)
parallel Transpose:    3785739057  ratio 56.1302x (to original serial)
                                   ratio 7.4088x  (to Transpose serial)
```

While this system has 16x the number of hardware threads, it has only 8x the number of floating-point execution units. In this respect, scaling appears almost linear.

I have purposely stayed away from using the SSE intrinsic functions. I suggest you not use them until the last phase of optimization; doing so early complicates your parallelization efforts.

As good as the performance is (approximately 55x on dual Xeon 5570), is there room for additional improvement before adding SSE intrinsic functions?

You will have to wait for my reply, as my next article will address this subject.

I wish to thank one of my users for letting me have access to his Dell R610 for testing purposes. I have yet to configure the test program to run on Intel's Parallel Universe system. I may need to do this for my next article (unless a reader is willing to run a test program for me).