Bob Davies is a software engineer at Intel.
Although Parallel Studio ships with Threading Building Blocks (TBB) and the Intel Performance Primitives library (IPP), there is nonethess a shortage of examples that show how to combine these two components. This article provides a motivation to combine these two libraries -- dramatically improved performance. The samples I present here will accrue to .NET developers using managed code as well.
There are three main dimensions to improving performance on Intel architecture. The oldest of these has been to exploit on-chip and off-chip caches -- consolidate references to the same memory as much as possible. A second dimension of performance arrived with SIMD (Single Instruction Multiple Data) instructions on Intel architecture -- performing operations on a vector of 4 or 8 data elements at the same time (vectorizing.) The most recent dimension for improving performance was the advent of multiple processors and advanced threading techniques.
Caches can be exploited without any change to the code but exploiting SIMD instructions is actual work -- the new instructions need to be inserted and loop counters adjusted. Multi-threading has pushed the need to reorganize developer code even further.
Intel provided a simple method to multi-thread code with OpenMP -- requiring minimal changes to the code in most cases. The success of OpenMP encouraged attempts to find more and better techniques to multi-thread code -- resulting in the creation of Intel's Threading Building Blocks (TBB) and Intel's upcoming vectorizing compiler known as "Ct".
TBB is capable of providing two of the three dimensions of performance -- efficient cache utilization and multi-threading. There is no overt effort to include SIMD instructions with TBB but these SIMD instructions may be included by using IPP. The Ct compiler is an attempt to attack all three dimensions of performance with explicit features that will help developers exploit caches, SIMD, and multi-threading.
Ct is not available yet but a close approximation to Ct can be created using Parallel Studio's TBB and IPP libraries. The rest of this article demonstrates how to combine TBB and IPP and the benefit it brings.
Game of Life
The Game of Life is useless sport but it serves well here to show how rethinking code can yield significant performance gains. The algorithm is simple to understand from Conway's original description in Scientific American -- cells die from overcrowding or isolation and cells are born when just the right number of neighbors is present. There are numerous implementations in a variety of languages.
The typical implementation of the core algorithm of Life is shown in Listing 1. The 8 surrounding cells are tested for special conditions such as top row or bottom row. The world of Life is a globe even though it is a rectangular matrix in memory. Cells in the bottom row affect the life and death of cells in the top row. Similarly cells in the left column are neighbors of cells in the extreme right column.
switch (cellPosition) {
case upperLeft:
if (TopRow && LeftCol) return *(source+((x*y)-1));
if (TopRow && !LeftCol) return *(source+(((x*y)-x)+(cellNum-1)));
if (LeftCol && !TopRow) return *(source+(cellNum-1));
return *((source+cellNum)-(x+1));
case upper:
if (TopRow) return *(source+(((x*y)-x)+cellNum));
return *((source+cellNum)-x);
case upperRight:
if (TopRow && RightColumn) return *(source+((x*y)-x));
if (TopRow && !RightColumn) return *(source+(((x*y)-x)+(cellNum+1)));
if (RightColumn && !TopRow) return *(source+((cellNum-(x*2))+1));
return *(source+(cellNum-(x-1)));
case right:
if (RightColumn) return *(source+(cellNum-(x-1)));
return *(source+(cellNum+1));
case lowerRight:
if (onBottomRow && RightColumn) return *source;
if (onBottomRow && !RightColumn) return *(source+((cellNum-((x*y)-x))+1));
if (RightColumn && !onBottomRow) return *(source+(cellNum+1));
return *(source+(((cellNum+x))+1));
case lower:
if (onBottomRow) return *(source+(cellNum-((x*y)-x)));
return *(source+(cellNum+x));
case lowerLeft:
if (onBottomRow && LeftCol) return *(source+(x-1));
if (onBottomRow && !LeftCol) return *(source+(cellNum-((x*y)-x)-1));
if (LeftCol && !onBottomRow) return *(source+(cellNum+((x*2)-1)));
return *(source+(cellNum+(x-1)));
case left:
if (LeftCol) return *(source+(cellNum+(x-1)));
return *(source+(cellNum-1));
Listing 1 handles one "Life" cell at a time -- no SIMD is active. The code in Listing 2 adds SIMD instructions through library calls to IPP to perform the identical operations as those in Listing 1. The code is much smaller in Listing 2 but both examples are multi-threaded with TBB.
Both the samples in Listings 1 and 2 are based off a single pointer -- source in Listing 1 and Current in Listing 2 -- and each cell depends only on its 8 neighbors. Both Listings 1 and 2 can be placed in a TBB parallel_for operator method. TBB assigns a range of cells to each thread with the parameters supplied to TBB's operator method.
// Vectors show index for cell to the right, lower right, below, lower left,
// left, above and left, above, and above and right - 8 neighboring cells.
int x[] = {1, 1, 0, -1, -1, -1, 0, 1};
int y[] = {0, 1, 1, 1, 0, -1, -1, -1};
ippiSet_8u_C1R(0, pTmp2, Stride, roi); // zip the accumulator.
for (int i = 0; i < 8; i++) {
pTmp = &Current[Offset + x[i] + y[i] * Stride];
ippiAdd_8u_C1RSfs(pTmp2, Stride, pTmp, Stride, pTmp3, Stride, roi, 0);
ippiCopy_8u_C1R(pTmp3, Stride, pTmp2, Stride, roi);
}
TBB is providing two of the three dimensions of performance -- threading and caching -- because in addition to threading, TBB spreads the cells among the different threads. Each cell is processed 8 times as there are 8 neighbors to each cell and when the amount of data for each thread fits in cache, there are no cache misses after the first pass over the data.
With TBB managing the task and limiting the buffer so it fits in cache, only the first IPP call in Listing 2 in the first iteration will have cache misses. All subsequent IPP calls on all iterations hit the cache because only a small portion of the matrix is provided to each thread. By combining the reduced cache misses with the multiple threads, TBB reduces execution time.
But TBB is providing only two of the three dimensions of performance -- threading and caching. The code in Listing 2 runs faster because the IPP library calls use SIMD instructions and exploits all three vectors of performance.
Comparing Game of Life Algorithms: The Results
The timing results in Table 1 show the benefit of combining IPP and TBB and how just using TBB alone falls short of the combination. The combination of TBB and IPP is about 4X faster that a version just using TBB without IPP ("GameOfLife NoIPP TBB"). Using IPP and not TBB ("GameOfLife NoTBB IPP") also falls short of the combination of TBB and IPP. Using TBB or IPP is good but using them in combination is better.
The benefits of IPP and TBB are also available to developers using managed code in .NET (both TBB and IPP are easily accessed from managed C++ code.) The results in Table 1 show that .NET developers can get almost all the benefit seen in native code because the TBB and IPP libraries are native code. Parallel Studio can increase performance and speed development for managed code as well.
Running 10 iterations in 32-bit... Wall time: 1.038 seconds GameOfLife IPP TBB Wall time: 1.256 seconds GameOfLife IPP Managed TBB Wall time: 1.688 seconds GameOfLife NoTBB IPP Wall time: 4.111 seconds GameOfLife NoIPP TBB Wall time: 15.888 seconds GameOfLife ST


