Channels ▼
RSS

Combining TBB and IPP with Intel Parallel Studio


Bob Davies is a software engineer at Intel.


Although Parallel Studio ships with Threading Building Blocks (TBB) and the Intel Performance Primitives library (IPP), there is nonethess a shortage of examples that show how to combine these two components. This article provides a motivation to combine these two libraries -- dramatically improved performance. The samples I present here will accrue to .NET developers using managed code as well.

There are three main dimensions to improving performance on Intel architecture. The oldest of these has been to exploit on-chip and off-chip caches -- consolidate references to the same memory as much as possible. A second dimension of performance arrived with SIMD (Single Instruction Multiple Data) instructions on Intel architecture -- performing operations on a vector of 4 or 8 data elements at the same time (vectorizing.) The most recent dimension for improving performance was the advent of multiple processors and advanced threading techniques.

Caches can be exploited without any change to the code but exploiting SIMD instructions is actual work -- the new instructions need to be inserted and loop counters adjusted. Multi-threading has pushed the need to reorganize developer code even further.

Intel provided a simple method to multi-thread code with OpenMP -- requiring minimal changes to the code in most cases. The success of OpenMP encouraged attempts to find more and better techniques to multi-thread code -- resulting in the creation of Intel's Threading Building Blocks (TBB) and Intel's upcoming vectorizing compiler known as "Ct".

TBB is capable of providing two of the three dimensions of performance -- efficient cache utilization and multi-threading. There is no overt effort to include SIMD instructions with TBB but these SIMD instructions may be included by using IPP. The Ct compiler is an attempt to attack all three dimensions of performance with explicit features that will help developers exploit caches, SIMD, and multi-threading.

Ct is not available yet but a close approximation to Ct can be created using Parallel Studio's TBB and IPP libraries. The rest of this article demonstrates how to combine TBB and IPP and the benefit it brings.

Game of Life

The Game of Life is useless sport but it serves well here to show how rethinking code can yield significant performance gains. The algorithm is simple to understand from Conway's original description in Scientific American -- cells die from overcrowding or isolation and cells are born when just the right number of neighbors is present. There are numerous implementations in a variety of languages.

The typical implementation of the core algorithm of Life is shown in Listing 1. The 8 surrounding cells are tested for special conditions such as top row or bottom row. The world of Life is a globe even though it is a rectangular matrix in memory. Cells in the bottom row affect the life and death of cells in the top row. Similarly cells in the left column are neighbors of cells in the extreme right column.


switch (cellPosition)  { 
case upperLeft:
    if (TopRow && LeftCol) return *(source+((x*y)-1));
    if (TopRow && !LeftCol) return *(source+(((x*y)-x)+(cellNum-1)));
    if (LeftCol && !TopRow) return *(source+(cellNum-1));
	return *((source+cellNum)-(x+1)); 
case upper:
    if (TopRow) return *(source+(((x*y)-x)+cellNum));
	return *((source+cellNum)-x); 
case upperRight:
    if (TopRow && RightColumn) return *(source+((x*y)-x));
    if (TopRow && !RightColumn) return *(source+(((x*y)-x)+(cellNum+1)));
    if (RightColumn && !TopRow) return *(source+((cellNum-(x*2))+1));
	return *(source+(cellNum-(x-1))); 
case right:
    if (RightColumn) return *(source+(cellNum-(x-1)));
	return *(source+(cellNum+1)); 
case lowerRight:
    if (onBottomRow && RightColumn) return *source;
    if (onBottomRow && !RightColumn) return *(source+((cellNum-((x*y)-x))+1));
    if (RightColumn && !onBottomRow) return *(source+(cellNum+1));
	return *(source+(((cellNum+x))+1)); 
case lower:
    if (onBottomRow) return *(source+(cellNum-((x*y)-x)));
	return *(source+(cellNum+x)); 
case lowerLeft:
    if (onBottomRow && LeftCol) return *(source+(x-1));
    if (onBottomRow && !LeftCol) return *(source+(cellNum-((x*y)-x)-1));
    if (LeftCol && !onBottomRow) return *(source+(cellNum+((x*2)-1)));
	return *(source+(cellNum+(x-1))); 
case left:
    if (LeftCol) return *(source+(cellNum+(x-1)));
	return *(source+(cellNum-1)); 

Listing 1: The traditional approach to handling all the special cases that occur in the Game of Life. For example, when looking at a cell neighbor in "UpperLeft", special handling must occur when the cell is in the top row or the left column.

Listing 1 handles one "Life" cell at a time -- no SIMD is active. The code in Listing 2 adds SIMD instructions through library calls to IPP to perform the identical operations as those in Listing 1. The code is much smaller in Listing 2 but both examples are multi-threaded with TBB.

Both the samples in Listings 1 and 2 are based off a single pointer -- source in Listing 1 and Current in Listing 2 -- and each cell depends only on its 8 neighbors. Both Listings 1 and 2 can be placed in a TBB parallel_for operator method. TBB assigns a range of cells to each thread with the parameters supplied to TBB's operator method.


// Vectors show index for cell to the right, lower right, below, lower left, 
// left, above and left, above, and above and right - 8 neighboring cells.
int x[] = {1, 1, 0, -1, -1, -1,  0,  1};
int y[] = {0, 1, 1,  1,  0, -1, -1, -1};
ippiSet_8u_C1R(0, pTmp2, Stride, roi);  // zip the accumulator.
for (int i = 0; i < 8; i++) {
    pTmp = &Current[Offset + x[i] + y[i] * Stride];
    ippiAdd_8u_C1RSfs(pTmp2, Stride, pTmp, Stride, pTmp3, Stride, roi, 0);
    ippiCopy_8u_C1R(pTmp3, Stride, pTmp2, Stride, roi);
} 

Listing 2: The non-traditional approach uses IPP library calls to do the same computation as Listing 1. The left and right columns automatically wrap to the opposite side column. The bottom two rows of the matrix are copied before the top row and the top two rows are copied to the bottom row to facilitate this approach. The extra copying eliminates all conditional statements in Listing 1.

TBB is providing two of the three dimensions of performance -- threading and caching -- because in addition to threading, TBB spreads the cells among the different threads. Each cell is processed 8 times as there are 8 neighbors to each cell and when the amount of data for each thread fits in cache, there are no cache misses after the first pass over the data.

With TBB managing the task and limiting the buffer so it fits in cache, only the first IPP call in Listing 2 in the first iteration will have cache misses. All subsequent IPP calls on all iterations hit the cache because only a small portion of the matrix is provided to each thread. By combining the reduced cache misses with the multiple threads, TBB reduces execution time.

But TBB is providing only two of the three dimensions of performance -- threading and caching. The code in Listing 2 runs faster because the IPP library calls use SIMD instructions and exploits all three vectors of performance.

Comparing Game of Life Algorithms: The Results

The timing results in Table 1 show the benefit of combining IPP and TBB and how just using TBB alone falls short of the combination. The combination of TBB and IPP is about 4X faster that a version just using TBB without IPP ("GameOfLife NoIPP TBB"). Using IPP and not TBB ("GameOfLife NoTBB IPP") also falls short of the combination of TBB and IPP. Using TBB or IPP is good but using them in combination is better.

The benefits of IPP and TBB are also available to developers using managed code in .NET (both TBB and IPP are easily accessed from managed C++ code.) The results in Table 1 show that .NET developers can get almost all the benefit seen in native code because the TBB and IPP libraries are native code. Parallel Studio can increase performance and speed development for managed code as well.


Running 10 iterations in 32-bit...

Wall time: 1.038 seconds     GameOfLife IPP TBB
Wall time: 1.256 seconds     GameOfLife IPP Managed TBB
Wall time: 1.688 seconds     GameOfLife NoTBB IPP
Wall time: 4.111 seconds     GameOfLife NoIPP TBB
Wall time: 15.888 seconds   GameOfLife ST

Table 1: Using a 15 Mb array and 10 Game of Life generations on a Core i7 975, the benefit of IPP and TBB ("GameOfLife IPP TBB") is dramatically faster than the single-threaded version ("GameOfLife ST"). The impact of just TBB is isolated from the combination of IPP and TBB in the entry "GameOfLife NoTBB IPP". Similarly, the impact of just IPP is seen in "Game of Life NoIPP TBB". The combination of IPP and TBB may also be implemented with managed code with only a modest impact on performance ("GameOfLife IPP Managed TBB").


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video