# Combining TBB and IPP with Intel Parallel Studio

### Matrix Multiply Example

Combining the TBB and IPP libraries worked well for the Game of Life example, but does this technique work well for a wide range of applications? Only a broader study could provide that answer, but a second example combining TBB and IPP adds further support to the hypothesis. Any technique that combines threading, caching, and vectorizing is valuable.

A matrix multiply is easily multi-threaded with TBB. Figure 1 outlines how a matrix multiply works: each output cell depends only on the row and column defining that cell [6]. Similarly, a cluster of output cells would depend only on the rows and columns defining the range of that cluster.

Figure 1: In a matrix multiply, the entry for a single cell x(3,4) uses row 3 and column 4 of the input matrices to compute the value. Similarly, a "tile" describing several output cells requires several rows and columns from the input matrices. TBB's 2-dimensional range defines a tile for each TBB thread.

TBB can define a series of 2-dimensional ranges for the entire output matrix and assign each entry in the series to a thread. Listing 3 shows a matrix multiply example using TBB and 2D ranges. Typically, a TBB parallel_for works on an entire row or set of rows but with a 2-dimensional range, the TBB parallel_for will work on a "tile" in the output matrix.

It is a well-known trick to transpose the second matrix in the multiply (the 4x5 matrix in Listing 1) to reduce cache misses. The columns become rows, so multiple entries will reside in a single cache line. Without transposing, each column entry is likely to reside in a cache line by itself. Only code using transposed data is shown here, but examples without transposing are included in the results.

Listing 3 shows considerable improvement over a single-threaded matrix multiply (see results) because it is multi-threaded but also because the cache is used much more effectively. Each TBB thread is assigned a tile of the output matrix and repeatedly uses the same rows and columns to compute the individual cells in that 2D range.

Listing 3: This TBB class multi-threads a matrix multiply using a 2D range - threading and caching the data. The innermost loop is not vectorized.

However, Listing 4 is much faster because it has added vectorizing to the threading and caching in Listing 3.

```
class MatrixMultiplyBody2D_Transposed_IPP {
public:
    void operator()( const blocked_range2d<size_t>& r ) const {
        // ... tile body: one IPP call per output cell in the 2D range ...
    }
};
```

Listing 4: This TBB class multi-threads a matrix multiply using a 2D range -- threading and caching the data -- just like in Listing 3 but in this class, the innermost loop has been replaced with a call to an IPP library function that will vectorize the multiply as well.

All the real work in the matrix multiply is done in the innermost loop -- the outer loops are just bookkeeping and are identical between Listings 3 and 4. By converting the core loop to an IPP library call, the multiply operations use SIMD instructions and are vectorized.

The ippmMul_mv_32f IPP API is defined in the IPP documentation as a "multiplication of a matrix by a vector". In fact, the matrix is just a row from the first input matrix and the vector is just a column from the second input matrix. The ippmMul_mm_32f API (note "mm" vs. "mv") specifically targets multiplying a matrix by a matrix, but its parameters don't accommodate TBB tiles and could not be combined easily with TBB.

### Matrix Multiply Results

The timing results in Table 2 show that the combination of IPP and TBB yields the best performance. Native performance is only slightly better than the performance for managed code for the same algorithm. Since all the time in the inner loop is spent in the optimized IPP library, using managed code to call IPP makes little difference.

```
Running 10 iterations in 32-bit...

Wall time: 0.538 seconds    MatrixMultiply Transposed IPP TBB
Wall time: 0.557 seconds    MatrixMultiply Transposed IPP Managed TBB
Wall time: 1.232 seconds    MatrixMultiply Transposed NoTBB IPP
Wall time: 7.238 seconds    MatrixMultiply Transposed NoIPP Managed TBB
Wall time: 7.405 seconds    MatrixMultiply Transposed NoIPP TBB
Wall time: 8.715 seconds    MatrixMultiply IPP TBB
Wall time: 9.142 seconds    MatrixMultiply IPP Managed TBB
Wall time: 11.364 seconds   MatrixMultiply NoIPP TBB
Wall time: 11.615 seconds   MatrixMultiply NoIPP Managed TBB
Wall time: 53.675 seconds   MatrixMultiply Transposed ST
```

Table 2: Using a 1.7 MB array on a Core i7 975, there is a 100X speedup from the single-threaded case ("Transposed ST") to the best solution ("Transposed IPP TBB"), which incorporates threading, caching, and vectorizing. As seen earlier in Table 1, almost all the benefit of threading, caching, and vectorizing is available to .NET developers ("Transposed IPP Managed TBB"). The benefit of transposing before the matrix multiply is the difference between the best time and the "MatrixMultiply IPP TBB" case that didn't transpose. All methods using transposing are substantially better than those without.

When threading, caching, and vectorizing are all working, the matrix multiply runs 100X faster than the single-threaded, unvectorized code. Transposing the matrix was clearly valuable: all the "Transposed" entries sit at the top of Table 2, so transposing before multiplying appears to be important regardless of the methodology.

From the results in Table 2, it is also possible to determine which is more important: TBB or IPP. The difference between IPP and "NoIPP" is about 7:1 ("IPP TBB" vs. "NoIPP TBB" in Table 2), while the difference between TBB and "NoTBB" ("IPP TBB" vs. "NoTBB IPP" in Table 2) is 2.5:1. Given the choice, using SIMD instructions appears to be more important than multi-threading for this algorithm. A look back at Table 1 for the Game of Life also supports the idea that using SIMD is more important than TBB.

### Conclusion

There are natural drawbacks to using IPP and TBB (independently or in combination) -- both take time to learn, and combining them requires careful validation. Finding the right IPP API can present challenges even for a knowledgeable developer; choosing ippmMul_mv_32f over ippmMul_mm_32f in the matrix multiply discussion is a good example.

Other drawbacks to combining IPP and TBB may include greater code complexity and a larger memory footprint. The two examples in this paper showed code that looked simpler and ran faster; IPP can be expected to run faster in general, but looking simpler is unlikely to hold in every case. Memory footprint may also grow with IPP because temporary variables in conventional algorithms require only 32 or 64 bits, but when an algorithm is converted to a series of IPP calls, those temporaries expand to become buffers.

This paper addresses the shortage of examples that combine IPP and TBB, but more importantly it provides a strong motivation to explore the combination further. A broader study would also help determine which library is more important -- IPP or TBB -- which would help prioritize code changes as a developer migrates toward using all possible performance dimensions. Finally, the examples above suggest that .NET managed code can benefit from the combination of IPP and TBB almost as much as native C++ code. Any tool or technique that can facilitate threading, caching, and vectorizing will get developers' attention. Other tools -- including Intel's Ct compiler -- are coming, but in the meantime, combining TBB and IPP can provide benefits now in a manner that will be supported by Parallel Studio on future Intel platforms.
