# How to: Write a parallel_for Loop (Matrix Multiply)

June 21, 2010

A poster on the ISN forums provided a link to an MSDN article titled How to: Write a parallel_for Loop, which shows how to use Visual Studio 2010's Concurrency::parallel_for to compute the product of two matrices. The example illustrates how to use parallel_for in a nested loop.

The typical serial C++ code to compute the product of two square matrices might look as follows:

```
// Computes the product of two square matrices.
void matrix_multiply(
    double** m1, double** m2, double** result, size_t size)
{
    for (size_t i = 0; i < size; i++)
    {
        for (size_t j = 0; j < size; j++)
        {
            double temp = 0;
            for (size_t k = 0; k < size; k++)
            {
                temp += m1[i][k] * m2[k][j];
            }
            result[i][j] = temp;
        }
    }
}

```

Using the VS 2010 Concurrency Runtime's parallel_for, the code looks like:

```
// Computes the product of two square matrices in parallel.
void parallel_matrix_multiply(
    double** m1, double** m2, double** result, size_t size)
{
    parallel_for (size_t(0), size, [&](size_t i)
    {
        for (size_t j = 0; j < size; j++)
        {
            double temp = 0;
            for (size_t k = 0; k < size; k++)
            {
                temp += m1[i][k] * m2[k][j];
            }
            result[i][j] = temp;
        }
    });
}

```

This makes use of C++0x lambda functions and the Concurrency template for parallel_for. The conversion is elegant in its simplicity: essentially only two lines of code change -- the outer for statement and its closing brace.

The reported performance boost on a four-processor system for 750x750 matrix multiplication was:

Serial: 3853 (ticks)
Parallel: 1311 (ticks)

Approximately a 2.94x speed-up.

This is not the ideal scaling of a 4x speed-up on four processors. But considering that there must be some setup overhead, 2.94x is not too bad on a matrix this small.

Although this article illustrates a parallel_for, a new parallel programmer might be misled into assuming that it also illustrates how to write a good parallel matrix multiplication function. After all, it appeared in an MSDN article -- an authoritative source.

Let's see how we can do better at parallelizing the matrix multiplication.

First off, I do not have Visual Studio 2010, so I cannot use the sample program as-is. I do have QuickThread (I am the author of this parallel programming toolkit, www.quickthreadprogramming.com), so I adapted the code to use QuickThread. Adaptation is relatively easy: change the headers and adjust for minor differences in syntax.

```
// Computes the product of two square matrices in parallel.
void parallel_matrix_multiply(
    double** m1, double** m2, double** result, size_t size)
{
    parallel_for(
        0, size,
        [&](intptr_t iBegin, intptr_t iEnd)
        {
            for (intptr_t i = iBegin; i < iEnd; ++i)
            {
                for (intptr_t j = 0; j < size; j++)
                {
                    double temp = 0;
                    for (intptr_t k = 0; k < size; k++)
                    {
                        temp += m1[i][k] * m2[k][j];
                    }
                    result[i][j] = temp;
                }
            }
        }
    );
}

```

In the QuickThread dialect, parallel_for passes a half-open range, as opposed to the single index passed by the VS 2010 parallel_for. QuickThread chose to pass the half-open range because, knowing the range, the programmer and/or compiler can optimize the code within the loop.

Using the QuickThread modified code on a four-core (2 by AMD Opteron 270 dual core) system the tick counts line up as:

Serial: 11011
Parallel: 3417

For a scaling ratio of 3.22x.

The processor models were not listed for the Visual Studio test, so it is hard to tell why QuickThread yields a 3.22x improvement on four cores while VS yields a 2.94x improvement on four processors (cores?).

Setting aside the issue of which threading toolkit is better, let's look at the parallelization of the matrix multiplication from the standpoint of improving performance.

The first thing any programmer should know is: improve the serial algorithm first, and then address the problem through parallelization (save the SSE intrinsic functions for last).

If we look at the innermost loop of the serial function we find:

```
for (size_t k = 0; k < size; k++)
{
    temp += m1[i][k] * m2[k][j];
}

```

While the k index in m1[i][k] accesses memory sequentially, the k index in m2[k][j] does not. This means the compiler is unable to vectorize the loop. And, potentially worse, stepping down the rows will at certain points cause cache evictions.

To vectorize the matrix multiplication (and minimize cache evictions) you would have to transpose the matrix m2 (call it m2t) and then swap the indexes in the inner loop. To transpose m2, you have to allocate memory and then write out the transposition of the matrix. This may sound absurd -- adding the overhead to:

- perform an allocation (actually size+1 allocations)
- run the transposition loops
- write the equivalent of the result output an additional time

You might assume that this must be slower than the simple matrix multiplication. You would be wrong. The rewritten serial code looks like:

```
// Computes the DOT product of two vectors, returning the result as a double.
// (Makes the code easier to read.)
inline double DOT(double* v1, double* v2, size_t size)
{
    double temp = 0.0;
    for (size_t i = 0; i < size; i++)
    {
        temp += v1[i] * v2[i];
    }
    return temp;
}

void matrix_multiplyTranspose(
    double** m1, double** m2, double** result, size_t size)
{
    // Create a matrix to hold the transposition of m2.
    double** m2t = create_matrix(size);
    for (size_t i = 0; i < size; i++)
    {
        for (size_t j = 0; j < size; j++)
        {
            m2t[j][i] = m2[i][j];
        }
    }
    // Now matrix multiply with the transposed matrix m2t.
    for (size_t i = 0; i < size; i++)
    {
        for (size_t j = 0; j < size; j++)
        {
            result[i][j] = DOT(m1[i], m2t[j], size);
        }
    }
    destroy_matrix(m2t, size);
}

```

In a real implementation you might want to consider preserving the buffers allocated on the first call. Then, on subsequent calls, check whether the prior allocation is sufficient for the current call. If so, bypass the allocation; if not, delete the buffers and allocate new ones. This further optimization detail is left to the reader. Also note that if the m2 matrix is going to be used many times (e.g., in a filter system), you can perform the transposition once. These are implementation strategies to keep in mind when taking these suggestions into your own code.

Let's see how the serial with transposition technique stacks up against the serial without transposition as run on the 2x dual-core AMD Opteron system:

Serial: 10531
Serial Transpose: 5998

An improvement of approximately 1.76x over serial alone.

The parallel transpose function looks like:

```
void parallel_matrix_multiplyTranspose(
    double** m1, double** m2, double** result, size_t size)
{
    // Create a matrix to hold the transposition of m2.
    double** m2t = create_matrix(size);
    // Perform the transposition of m2 to m2t.
    parallel_for(
        0, size,
        [&](intptr_t iBegin, intptr_t iEnd)
        {
            for (intptr_t i = iBegin; i < iEnd; ++i)
            {
                for (size_t j = 0; j < size; j++)
                {
                    m2t[j][i] = m2[i][j];
                }
            }
        }
    );
    // Perform the matrix multiplication using m2t.
    parallel_for(
        0, size,
        [&](intptr_t iBegin, intptr_t iEnd)
        {
            for (intptr_t i = iBegin; i < iEnd; ++i)
            {
                for (size_t j = 0; j < size; j++)
                {
                    result[i][j] = DOT(m1[i], m2t[j], size);
                }
            }
        }
    );
    destroy_matrix(m2t, size);
}

```

Let's look at the performance comparisons on a system with more cores.

A dual Xeon 5570 system running Ubuntu 10.0 and QuickThread 2.0.0.RC001 has:

• Two processors (each a quad-core Xeon 5570 with HyperThreading, for 16 hardware threads total)

The test program was modified to use _rdtsc() instead of the millisecond timer to attain better resolution.

```
matrix size = 750 x 750

serial: 9072682708 (_rdtsc() ticks)

serial Transpose: 1181206962
ratio 7.68086x (to original serial)

parallel: 1261734015
ratio 7.19065x (to original serial)

parallel Transpose: 165788596
ratio 54.7244x (to original serial)
ratio 7.1248x (to Transpose serial)

```

On a larger matrix we have:

```
matrix size = 2048 x 2048

serial: 212494313170

serial Transpose: 28047918588
ratio 7.57612x (to original serial)

parallel: 28098476692
ratio 7.56249x (to original serial)

parallel Transpose: 3785739057
ratio 56.1302x (to original serial)
ratio 7.4088x (to Transpose serial)

```

This system, while having 16 hardware threads, has only 8 floating-point execution units (one per core). Measured against that limit, the scaling appears almost linear.

I have purposely stayed away from the SSE intrinsic functions. I suggest you not use them until the last phase of optimization; doing so earlier complicates your parallelization efforts.

As good as the performance is (approximately 55x on dual Xeon 5570), is there room for additional improvement before adding SSE intrinsic functions?

You will have to wait for my reply, as my next article will address this subject.

I wish to thank one of my users for letting me have access to his Dell R610 for testing purposes. I have yet to configure the test program to run on Intel's Parallel Universe system. I may need to do this for my next article (unless a reader is willing to run a test program for me).
