Matrix Multiply Example
Combining the TBB and IPP libraries worked well for the Game of Life example but does this technique work well for a wide range of applications? Only a broader study could provide that answer but a second example combining TBB and IPP adds further support to this hypothesis. Any technique that combines threading, caching, and vectorizing is valuable.
A matrix multiply is easily multi-threaded with TBB. Figure 1 outlines how a matrix multiply works: each output cell depends only on the row and column defining that cell [6]. Similarly, a cluster of output cells would depend only on the rows and columns defining the range of that cluster.
TBB can define a series of 2-dimensional ranges for the entire output matrix and assign each entry in the series to a thread. Listing 3 shows a matrix multiply example using TBB and 2D ranges. Typically, a TBB parallel_for works on an entire row or set of rows but with a 2-dimensional range, the TBB parallel_for will work on a "tile" in the output matrix.
It is a well-known trick to transpose the second matrix in the multiply (the 4x5 matrix in Listing 1) to reduce cache misses. The columns becomes rows and multiple entries will reside in a cache line. Without transposing, each column entry is likely to reside in a cache line by itself. Only code using transposed data is shown here but examples without transposing are included in the results.
Listing 3 shows considerable improvement over a single-threaded matrix multiply (see results) because it is multi-threaded but also because the cache is used much more effectively. Each TBB thread is assigned a tile of the output matrix and repeatedly uses the same rows and columns to compute the individual cells in that 2D range.
Listing 3: This TBB class multi-threads a matrix multiply using a 2D range - threading and caching the data. The innermost loop is not vectorized.
However, Listing 4 is much faster because it has added vectorizing to the threading and caching in Listing 3.
class MatrixMultiplyBody2D_Transposed_IPP {
public:
void operator()( const blocked_range2d
All the real work in the matrix multiply is done in the innermost loop -- the outer loops are just bookkeeping and are identical between Listings 3 and 4. By converting the core loop to an IPP library call, the multiply operations will use SIMD instructions and are vectorized.
The ippmMul_mv_32f IPP API is defined in the IPP documentation as a "multiplication of a matrix by a vector". In fact, the matrix is actually just a row from the first input matrix and the vector is actually just a column from the second input matrix. The ippmMul_mm_32f (note "mm" vs. "mv") IPP API's specifically targets multiplying a matrix by a matrix but its parameters don't accommodate the use of TBB tiles and could not be combined easily with TBB.
Matrix Multiply Results
The timing results in Table 2 show that the combination of IPP and TBB yields the best performance. Native performance is only slightly better than the performance for managed code for the same algorithm. Since all the time in the inner loop is spent in the optimized IPP library, using managed code to call IPP makes little difference.
Running 10 iterations in 32-bit... Wall time: 0.538 seconds MatrixMultiply Transposed IPP TBB Wall time: 0.557 seconds MatrixMultiply Transposed IPP Managed TBB Wall time: 1.232 seconds MatrixMultiply Transposed NoTBB IPP Wall time: 7.238 seconds MatrixMultiply Transposed NoIPP Managed TBB Wall time: 7.405 seconds MatrixMultiply Transposed NoIPP TBB Wall time: 8.715 seconds MatrixMultiply IPP TBB Wall time: 9.142 seconds MatrixMultiply IPP Managed TBB Wall time: 11.364 seconds MatrixMultiply NoIPP TBB Wall time: 11.615 seconds MatrixMultiply NoIPP Managed TBB Wall time: 53.675 seconds MatrixMultiply Transposed ST
When threading, caching, and vectorizing are all working, the performance speed up for matrix multiply was 100X from the single-threaded, unvectorized code. Transposing the matrix was clearly valuable with all the "Transposed" entries in Table 2 at the top of the list. Transposing the matrix before multiplying appears to be important regardless of the methodology.
From the results in Table 2, it is also possible to determine which is more important: TBB or IPP. The difference between IPP and "NoIPP" is about 7:1 ("IPP TBB" vs. "NoIPP TBB" in Table 2) while the difference between TBB and "NoTBB" ("IPP TBB" vs. "NoTBB IPP" in Figure 7) is 2.5:1. Given the choice, using SIMD instructions appears to be more important than multi-threading for this algorithm. A look at back at Table 1 for the Game of Life also supports the idea that using SIMD is more important than TBB.
Conclusion
There are natural drawbacks to using IPP and TBB (independently or in combination) -- both take time to learn and combining them requires careful validation. Finding the right IPP API can present challenges even for a knowledgeable developer choosing ippmMul_mv_32f over ippmMul_mm_32f in the matrix multiply discussion is a good example.
Other drawbacks to using the combination of IPP and TBB may include larger code complexity and memory footprint. The two examples provided in this paper showed code that looked simpler and ran faster. IPP is expected to run faster but looking simpler is unlikely to be the case in general. Memory footprint may also expand with IPP because temporary variables in conventional algorithms require only 32-bits or 64-bits but when the algorithm is converted to a series of IPP calls, the temporary variables expand to become buffers.
This paper addresses the shortage of examples that combine IPP and TBB but more importantly it provides a strong motivation to explore the combination of IPP and TBB further. A broader study would also help determine which library is more important - IPP or TBB. This would prioritize code changes as a developer migrates to using all possible performance dimensions. Finally, the examples above suggest that .NET managed code can benefit from the combination of IPP and TBB almost as much as native C++ code. Any tool or technique that can facilitate threading, caching, and vectorizing will get developers' attention. Other tools -- including Intel's Ct compiler -- are coming but in the meantime, combining TBB and IPP can provide benefits now in a manner which will be supported by Parallel Studio on future Intel platforms.


