Intel Threading Building Blocks: parallel_for()
Concurrency needn't be so complicated that you avoid it completely. One of the easiest ways to gain performance increases on multi-core platforms is with the parallel_for algorithm. To get a sense of how real-world developers are using Intel Threading Building Blocks, we spoke with Vincent Tan, a programmer with Pongrass Australia.
As described on the Intel Software Network, Tan created a multithreaded version of par2cmdline 0.4, a utility commonly used to repair corrupted Usenet postings via Reed-Solomon coding. By leveraging the Intel Threading Building Blocks 2.0 library (using TBB's mutex, concurrent_hash_map, atomic, and parallel_for constructs), the program can process files concurrently instead of serially. As a result, dual-core machines can nearly double performance when creating or repairing data files.
Q: How did you learn about parallel_for?
A: I read the Intel TBB tutorial and reference manuals. From there, I looked at the sample code.
Q: Was it easy to add the algorithm to your application?
A: After studying the sample code, it was straightforward to convert the code. The harder part was finding all of the shared resources (such as member variables) and then ensuring that access to them was thread-safe.
Q: Did you make any mistakes?
A: I originally specified a grain size, but I found that it did not really help, because TBB's default behavior was already good enough for the code where I tried it.
Q: What would be the most interesting use for this algorithm?
A: To be honest, I view it as a tool to solve a particular problem. The obvious for loops in the project's code pretty much dictated the use of parallel_for. I'll put it another way: If you can process elements of a random-accessible array in parallel (i.e., the elements have no interdependencies) then parallel_for is the tool you probably want.
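The "no interdependencies" condition Tan describes can be illustrated without TBB at all. The following is a minimal sketch using only the C++ standard library; process_independent is a hypothetical helper invented for this illustration, not a TBB function and not code from par2cmdline. Each worker thread receives its own contiguous slice of indices and touches only the elements in that slice, which is why no locking is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not part of TBB): applies `fn` to every element of
// `data`, giving each worker thread its own contiguous slice of indices.
template <typename T, typename Fn>
void process_independent(std::vector<T>& data, unsigned workers, Fn fn) {
    workers = std::max(1u, workers);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, w * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        pool.emplace_back([&data, begin, end, fn] {
            for (std::size_t i = begin; i != end; ++i) {
                data[i] = fn(data[i]);  // reads and writes element i only
            }
        });
    }
    for (std::thread& t : pool) t.join();  // wait for every slice to finish
}
```

If element i depended on element i-1, the slices could no longer run in any order and this approach (like parallel_for) would not apply directly.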
Q: What performance or productivity benefits did you gain?
A: CPU utilization on a dual-core machine went from approximately 40-45 percent to approximately 80-85 percent. Because I/O is still performed serially (non-overlapped), the code never achieves 100 percent utilization -- but a doubling of performance is good enough for most users.
Q: How should a developer get started with parallel_for?
A: Read the Intel TBB tutorial on the Documentation page of threadingbuildingblocks.org and study the sample code. The reference manual helps out with the nitty-gritty details but you'll probably only need it if you need to specify the grain size.
Here's a snippet of parallel_for at work in the par2cmdline source code.
// par2creator.cpp::973

// New function to hold the original loop body
void ProcessData(u32 outputblk, u32 endindex, size_t blklength, u32 inputblk)
{
  for( ; outputblk != endindex; ++outputblk )
  {
    // Select the appropriate part of the output buffer
    void *outbuf = &((u8*)outputbuf)[chunksize * outputblk];

    // Process the data through the RS matrix
    rs.Process(blklength, inputblk, inputbuf, outputblk, outbuf);
  }
}

// Encapsulates the loop body
class ApplyRSProcess
{
public:
  ApplyRSProcess(Par2Creator* obj, size_t blklength, u32 inputblk) :
    _obj(obj), _blklength(blklength), _inputblk(inputblk) {}

  // Called by parallel_for once for each sub-range of output blocks
  void operator()(const tbb::blocked_range<u32>& r) const
  {
    _obj->ProcessData(r.begin(), r.end(), _blklength, _inputblk);
  }

private:
  Par2Creator* _obj;
  size_t _blklength;
  u32 _inputblk;
};
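The snippet shows the body class but not the code that drives it. With TBB, a body like ApplyRSProcess is passed to tbb::parallel_for together with a tbb::blocked_range, and the library calls operator() on sub-ranges of the full range. The exact call in par2cmdline is not reproduced here. To make the mechanics visible, here is a rough standard-library-only sketch of the idea: IndexRange, split_and_run, and ScaleBlocks are hypothetical stand-ins written for this illustration, and the splitting strategy is a simplification, not how TBB's scheduler actually works.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Stand-in for a half-open index range [begin, end); NOT tbb::blocked_range.
struct IndexRange {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
    std::size_t size() const { return last - first; }
};

// Hypothetical driver: halves the range until a piece is no larger than
// `grain`, then calls body(piece). The left half runs as an async task.
template <typename Body>
void split_and_run(IndexRange r, std::size_t grain, const Body& body) {
    if (grain == 0) grain = 1;
    if (r.size() <= grain) {
        body(r);  // small enough: run the loop body on this piece
        return;
    }
    const std::size_t mid = r.first + r.size() / 2;
    auto left = std::async(std::launch::async, [=, &body] {
        split_and_run(IndexRange{r.first, mid}, grain, body);
    });
    split_and_run(IndexRange{mid, r.last}, grain, body);
    left.get();  // wait for the left half before returning
}

// Body in the same shape as ApplyRSProcess: state held in members,
// work done in a const operator() that receives a sub-range.
class ScaleBlocks {
public:
    ScaleBlocks(std::vector<int>* out, int factor) : _out(out), _factor(factor) {}
    void operator()(const IndexRange& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i) {
            (*_out)[i] = static_cast<int>(i) * _factor;  // writes slot i only
        }
    }
private:
    std::vector<int>* _out;
    int _factor;
};
```

A call such as split_and_run(IndexRange{0, out.size()}, 100, ScaleBlocks(&out, 3)) fills a vector named out in pieces of at most 100 elements. The grain parameter plays the role Tan mentions above: it sets how small a piece may get before the body runs on it serially.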