Intel has announced Threading Building Blocks 2.2 (TBB), a high-level C++ library that abstracts threads to tasks to create portable and scalable parallel applications. Version 2.2 is available in both the commercial and open source releases. These are built from identical sources -- the only real difference is the license and support offerings.
TBB 2.2 maintains the functionality and platform support of previous versions and adds feature and performance improvements, including full support for the lambda capabilities of the new C++ draft standard (C++0x) and more flexibility for developers to redistribute with their applications. Among the new 2.2 features are:
- Automatic memory allocator replacement available. The memory allocator is one of the most popular features of Intel TBB. However, it can be time consuming to replace your own memory allocator calls. Version 2.2 uses a dynamic instrumentation method on Windows and the LD_PRELOAD mechanism on Linux to offer automatic memory allocator replacement throughout applications. Version 2.2 also improves the memory allocator's performance for large-block allocations (over 8K in size).
- Scaling of scheduler enhanced. Version 2.2 features a task scheduler reworked to behave more like an ideal Cilk-style scheduler, yielding more scalable behavior. Version 2.2 also improves the affinity partitioner, and changes the default partitioner for the loop templates from simple_partitioner to the adaptive, easier-to-use auto_partitioner.
- Automatic initialization available. Version 2.2 no longer requires an explicit initialization. Users of prior versions have told us that in a large application it is not easy to initialize in the right place. Version 2.2 takes care of automatically initializing the scheduler when it is first needed.
Additionally, 2.2 includes enhancements to parallel algorithms:
- Version 2.2 has a new parallel_invoke for running a group of functors simultaneously in parallel.
- Version 2.2 has a new parallel_for_each and a simplified parallel_for interface to make writing some common for loops easier.
- parallel_for_each(first, last, f) is like parallel_do(first, last, body) but without the feeder functionality that allows adding more work items. In other words, tbb::parallel_for_each is the parallel equivalent of std::for_each.
- The new overload parallel_for(first, last, step, f) lets you pass an integer first (auto i=first), last (i<last), and step (i+=step) for a given function f(i). It handles simple cases easily, especially with the use of lambdas. The original interface parallel_for(range, body, partitioner) has been retained. It's more general but also more complicated to write, even with the use of lambdas.
- Intel TBB's pipeline can now perform DirectX, OpenGL, and I/O parallelization by using the new thread_bound_filter feature. Some operations must be invoked from the same thread every time; by using a filter bound to a thread, you can guarantee that the final stage of the pipeline will always use the same thread.
- Exception safety support has been expanded significantly. Prior versions had support for exception propagation only in parallel_for, parallel_reduce, and parallel_sort. Support is expanded to include parallel_do, the new parallel_invoke, and parallel_for_each as well as the new forms of parallel_for and parallel_reduce.
- Lambda support has been extended to cover not only parallel_for, but also parallel_reduce, parallel_sort, and the new parallel_for_each and parallel_invoke algorithms. In addition, the new combinable and enumerable_thread_specific classes for thread local storage can accept lambdas. The documentation and code examples are expanded to show lambdas in action. The Intel Compiler 11.0 and Intel Parallel Studio offer lambda support today, and Microsoft will support it in Visual Studio 2010 (it is in the beta currently).
Concurrent container enhancements include:
- Thread local storage, which is portable across platforms, is now possible with the new enumerable_thread_specific and combinable classes. This can be useful for algorithms that reduce shared memory contention by creating local copies and then combining results later through something like a reduce operation.
- Unbounded non-blocking interface for concurrent_queue and new blocking concurrent_bounded_queue. Some operations require synchronization and may or may not block depending on whether or not the queue is bounded. To get the best behavior, use the unbounded form if you need only basic non-blocking push/try_pop operations to modify the queue. Otherwise use the bounded form which supports both blocking and non-blocking push/pop operations.
- Simplified interfaces for concurrent_hash_map that make it easier to utilize for common data types using the new tbb_hasher.
- Improved interfaces for concurrent_vector that remove a common extra step needed to use the vector output.
Programmers may need to make some changes to move from prior versions of Intel TBB to Version 2.2. Developers can either add #define TBB_DEPRECATED 1 to their code, which keeps the old interfaces available (at least for now), or adjust to the following changes:
- auto_partitioner() is now the default instead of simple_partitioner().
- Concurrent queue API changes: four interfaces have been renamed. To be consistent with the latest API, change pop_if_present to try_pop, push_if_not_full to try_push, begin to unsafe_begin, and end to unsafe_end.
- Concurrent vector API changes: compact has been renamed to shrink_to_fit, and three interfaces now consistently return an iterator. Previously grow_by returned size_type, grow_to_at_least returned nothing, and push_back returned size_type.
- The notion of task depth has been eliminated, so the following four members of class task have no effect: depth_type, depth, set_depth and add_to_depth. These have no effect in 2.2 even if you use TBB_DEPRECATED, but are nonetheless defined to permit their use without error messages.