Introducing tbb::parallel_invoke
I would like to introduce you to a new template function recently added to Threading Building Blocks -- tbb::parallel_invoke. It gives TBB users a simple way to run several functions in parallel. For example, if you have three functions that do some work and you would like to run them simultaneously, you can write the following TBB code (I have skipped some things, such as scheduler initialization):
void Function1(); void Function2(); void Function3();
void RunFunctions() { tbb::parallel_invoke(Function1, Function2, Function3); }
Looks simple, doesn't it? You do not have to define any specific classes or write extra code to use parallel_invoke. It is possible to pass function pointers or functor objects to the template function using the same syntax:
void (*FuncPtr1)(void), (*FuncPtr2)(void);
void RunFuncPtrs() { tbb::parallel_invoke(FuncPtr1, FuncPtr2); }
class FunctorClass { public: void operator() () const {} } Functor1, Functor2;
void RunFunctors() { tbb::parallel_invoke(Functor1, Functor2); }
It also supports lambda functions available in C++0x:
tbb::parallel_invoke( []() { std::cout << "Hello!"; }, []() { std::cout << "Greetings!"; } );
Up to 10 functions can be run by parallel_invoke:
tbb::parallel_invoke(Func1, Func2, Func3, Func4, Func5, Func6, Func7, Func8, Func9, Func10);
Obviously, you could write your own code to run the functions in parallel, but when you use parallel_invoke you get all the usual benefits of TBB. Since parallel_invoke uses a task-based approach, the same code runs on any platform TBB supports and adapts to the number of cores available.
However, in order to be run by parallel_invoke, the functions must take no arguments and should return no value. The second restriction is not strict -- you can actually pass a non-void function, but the return value will be ignored, so doing this is not good design.
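One common way around the no-arguments, no-return-value restriction is to wrap each call in a zero-argument lambda that captures its input and writes its result into a separate variable. The sketch below illustrates the idea under some loud assumptions: invoke_both is a hypothetical stand-in built on std::thread so that the example compiles without TBB, and Square, Cube and SumOfSquareAndCube are made-up names. With TBB available, the call to invoke_both would be replaced by tbb::parallel_invoke with the same two lambdas.

```cpp
#include <thread>

// Hypothetical stand-in for tbb::parallel_invoke (NOT part of TBB):
// runs f on a new thread and g on the calling thread, then waits for both.
template <typename F, typename G>
void invoke_both(const F& f, const G& g) {
    std::thread worker(f);
    g();
    worker.join();
}

// Hypothetical functions that take an argument and return a value,
// so they cannot be passed to parallel_invoke directly.
int Square(int x) { return x * x; }
int Cube(int x) { return x * x * x; }

// Each lambda takes no arguments and returns nothing: the input is
// captured by value and the result is stored through a reference.
// The two lambdas write to different variables, so they do not race.
int SumOfSquareAndCube(int a, int b) {
    int squared = 0;
    int cubed = 0;
    invoke_both(
        [&squared, a]() { squared = Square(a); },
        [&cubed, b]() { cubed = Cube(b); });
    // Both results are safe to read once invoke_both has returned.
    return squared + cubed;
}
```

Giving each lambda its own result variable keeps the functions independent; if they had to update shared state, that state would need its own synchronization.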
tbb::parallel_invoke also includes exception handling and cancellation support. It behaves like other TBB template algorithms:
try {
    tbb::parallel_invoke(Function1, Function2, Function3);
} catch (tbb::captured_exception &exc) {
    // Processing exc
}
And now a little bit about implementation details. As I mentioned above, TBB tasks are used, so each user-defined function is run by a separate task. The tasks form a tree in which each leaf runs up to three user functions. For example, the five-function version looks like this (each box represents a task):
Note that each sub-root task runs a user-defined function in its own body to reduce the number of tasks. The most complicated case, with 10 user functions, looks like this:
The tasks do not block at the inner levels. Sub-root tasks use continuation-passing style to avoid blocking; wait_for_all is called only at the top level.



