Parallel

Getting Started with TBB on Windows

By Dave Vanden Bout, December 23, 2008

TBB is open source, so you can see how it works

My First TBB Program

For my initial attempt, I wanted to concentrate on the mechanics of creating a TBB application with the Express Edition and not get too bogged down in the details of the algorithm. So I wrote a simple program that just multiplies two vectors item by item and places the products into another, equal-sized vector. I started by cranking up Express Edition and clicked on the New Project button (you can also use File -> New -> "Project..." from the menu). In the New Project window, I selected Win32 as the project type and Win32 Console Application as the template. I gave the project the creative name of "example1" and set its location to the C:\llpanorama\TBB\examples directory. After clicking OK in the New Project window, and then clicking Finish in the Win32 Application Wizard window, a window opened with a simple code skeleton. I made additions to the skeleton and arrived at the following program, which I will explain below.


01 // example1.cpp : Defines the entry point for the console application.
02 //
03 
04 #include "stdafx.h"
05 #include <iostream>
06 #include "tbb/task_scheduler_init.h"
07 #include "tbb/parallel_for.h"
08 #include "tbb/blocked_range.h"
09 #include "tbb/tick_count.h"
10 using namespace tbb;
11 using namespace std;
12
13 class vector_mult{
14     double *const v1, *const v2;	// multiply these vectors
15     double *v3;	                // put the result here
16 public:
17     // constructor copies the arguments into local storage
18     vector_mult(double *vec1, double *vec2, double *vec3)
19         : v1(vec1), v2(vec2), v3(vec3) { }
20     // overload () so it does a vector multiply
21     void operator() (const blocked_range<size_t> &r) const {
22         for(size_t i=r.begin(); i!=r.end(); i++)
23             v3[i] = v1[i] * v2[i];
24     }
25 };
26
27 const size_t vec_size = 1000000;
28 
29 int _tmain(int argc, _TCHAR* argv[])
30 {
31     // allocate storage for vectors
32     double *v1, *v2, *v3;
33     v1 = (double*)malloc(vec_size*sizeof(double));
34     v2 = (double*)malloc(vec_size*sizeof(double));
35     v3 = (double*)malloc(vec_size*sizeof(double));
36 
37     // init and multiply vectors serially and time the operation
38     for(size_t i=0; i<vec_size; i++) { v1[i] = v2[i] = v3[i] = i; }
39     tick_count serial_start = tick_count::now();
40     for(size_t i=0; i<vec_size; i++) { v3[i] = v1[i] * v2[i]; }
41     tick_count serial_end = tick_count::now();
42 
43     // reinit and multiply the vectors in parallel and time the operation
44     for(size_t i=0; i<vec_size; i++) { v1[i] = v2[i] = v3[i] = i; }
45     task_scheduler_init init;
46     tick_count parallel_start = tick_count::now();
47     parallel_for( blocked_range<size_t>(0,vec_size,1000),
                     vector_mult(v1,v2,v3) );
48     tick_count parallel_end = tick_count::now();
49 
50     // print some of the result vector to make sure it is correct
51     for(size_t i=0; i<10; i++) { cout << v3[i] << endl; }
52 
53     // print out the speedup delivered by parallelism
54     cout << "Speedup from parallelism is "
55         << (serial_end - serial_start).seconds()
56        / (parallel_end - parallel_start).seconds() << endl;
57     return 0;
58 }

The program starts by allocating some memory for vectors v1, v2, and v3 (lines 32-35). The size of these vectors is set by a constant defined on line 27.

I wanted to multiply the vectors serially at first to get a baseline for how long that took. I initialized the vectors on line 38 so that each element stored its index. The current time is stored on line 39, after which a loop is started on line 40 where each element in v1 and v2 is multiplied and stored in the corresponding location in v3. Finally, the completion time for this loop is stored on line 41. The starting and ending times are used to calculate the time it takes to multiply the vectors serially.

Next the same calculation is performed in parallel. The vectors are reinitialized on line 44. (I did this to erase the product values in v3 so I could be sure any correct values came from the parallel threads and not the preceding serial calculation.) A task scheduler object is then created on line 45. This object manages the scheduling of tasks onto physical threads (cores of the CPU, essentially). You can specify the number of threads when the scheduler object is instantiated, but it's usually best to let it determine the number of threads automatically, in case your application is run on a different type of CPU.

The starting and ending times are recorded on lines 46 and 48 for the parallel loop on line 47. The parallel_for subroutine does the actual creation of tasks that are scheduled for execution on the threads. It takes two arguments. The first is an object that stores the range of the vectors and a grainsize that specifies what size the vectors should be cut into. The second argument is an object whose constructor stores the pointers to the v1, v2, and v3 vectors as private variables (see lines 18-19). What parallel_for does is partition the range of the vectors into non-overlapping pieces no larger than the specified grainsize, and then create multiple copies of the vector_mult object and pass one of the subrange pieces to each one. Then the vector_mult objects are scheduled and run in parallel on the physical threads.

When one of the vector_mult objects runs, its () operator is called and passed the subrange that specifies the sections of the vectors it should multiply (see lines 21-24). The loop on line 22 gets the beginning and ending indices of the subrange and just multiplies the vector elements that lie between them. For example, the range beginning and ending might be 51000 and 52000, so the vector_mult object would do the following calculations:


v3[51000] = v1[51000] * v2[51000];
v3[51001] = v1[51001] * v2[51001];
...
v3[51999] = v1[51999] * v2[51999];

So each vector_mult object computes a small section of the final product vector. The cores just get copies of the vector_mult object and subranges in no particular order and compute the partial results. Since the subranges are non-overlapping, there is no danger that the objects are overwriting each other's results. And parallel_for makes sure the entire range is covered so a complete product vector is delivered.

After the parallel_for completes, a few of the product values are printed on line 51 just to make sure the program is doing something correctly. Then the amount of speed-up provided by using parallelism is printed on lines 54-56.

Once the program was written, I had to get it compiled and linked. This required the setting of a few project properties. I selected the Project -> "example1 Properties..." menu item to open the Property Pages window. Then I selected All Configurations from the Configurations drop-down menu and I set the following properties so the compiler and linker could find the TBB header files and libraries:


C/C++?General:
   Additional Include Directories = $(TBB20_INSTALL_DIR)\include
Linker?General:
   Additional Library Directories = $(TBB20_INSTALL_DIR)\ia32\vc8\lib

Then I selected the Debug configuration and set the following properties:


Linker?Input:
   Additional Dependencies = kernel32.lib $(NoInherit) tbb_debug.lib
Build Events?Post-Build Event:
   CommandLine = copy "$(TBB20_INSTALL_DIR)\ia32\vc8\bin\tbb_debug.dll" "$(OutDir)"

These property settings link the debug version of my simple program with the debug version of the TBB library and copy the debug version of the TBB dynamic link library to the directory where the executable file for my application can find it. I then selected the Release configuration and set the corresponding properties using the non-debug versions of the library:


Linker?Input:
   Additional Dependencies = kernel32.lib $(NoInherit) tbb.lib
Build Events?Post-Build Event:
   CommandLine = copy "$(TBB20_INSTALL_DIR)\ia32\vc8\bin\tbb.dll" "$(OutDir)"

Once these properties were all set up, I built the Debug and Release versions by selecting the configuration and then selecting Build -> "Build example1" from the menu bar. Once a version was built, I could run it using the Debug -> Start Without Debugging menu item. I ran both versions ten times and got the following speed-ups:

Speedup
Debug	Release
0.48	1.05
0.48	1.08
0.50	0.90
0.47	0.95
0.49	1.11
0.50	1.07
0.50	1.11
0.50	1.13
0.49	0.93
0.50	1.10

In the Debug version, the parallel code runs about half the speed of the serial code because all the diagnostic features of the TBB library are enabled. These features watch out for many of the stupid things I might do while developing code (like setting the number of physical threads to a negative number). I would expect this to have a major impact on the speed-up.

However, the speed-up in the Release version is really disappointing! The average speedup is just 1.04. In fact, the parallel code is slower than the serial code in 30 percent of the tests. Why is that!? My feeling is that a simple multiplication of vectors doesn't provide enough computational mass to outweigh the overhead of starting the threads, but I haven't run any experiments to investigate this.

As Alexey Kukanov illustrates in his Why a simple test can get parallel slowdown blog post, sometimes multithreading a simple problem leads to almost no speedup, or even to a performance slowdown. See my Let's try this again... blog post for a subsequent TBB experiment where I applied a more difficult problem to TBB and saw a significant speedup.

Conclusion

At this point, by following my example, you should be able to:

Set up a development environment for building TBB-based applications.
Successfully recompile the TBB libraries.
Use a few TBB constructs like the task scheduler, blocked ranges, and parallel_for.
Build a program from scratch that uses TBB.

And the best part is, it won't cost you a penny to do any of this.

Obviously, I've only scratched the surface of what TBB can do. As I continue to work with TBB, I'll be guided by the following questions:

What other constructs does TBB provide, and what parallel programming styles do they support?
What characteristics of a program make it suitable for parallelization?
If a program doesn't run well after it is coded for parallel execution, how do I find the bottlenecks?

Previous 1 2

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Parallel

Getting Started with TBB on Windows

My First TBB Program

Conclusion

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Parallel

Getting Started with TBB on Windows

My First TBB Program

Conclusion

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Parallel Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content