QuickThread: A New C++ Multicore Library
NUMA (Non-Uniform Memory Access) architectures are becoming popular in HPC (High-Performance Computing) scenarios. Therefore, it is very important to work with efficient and optimized memory allocators. QuickThread is a new commercial C++ multicore programming library loaded with many optimizations for NUMA architectures, bringing a new option to create high-performance parallelized code.
A few months ago, Jim Dempsey, CEO and Chief Architect of QuickThread Programming, LLC, invited me to give him feedback on QuickThread's beta versions. I accepted his offer, as I'm always attracted to new development tools, languages, libraries and paradigms related to multicore programming. As a result, I've had early access to many of the features found in its first official release.
One of the most interesting features found in QuickThread is that it is designed to take full advantage of the underlying hardware, considering all the cache levels (L1, L2 and L3) and the NUMA nodes. The developer can experiment with different configurations in order to maximize the performance of certain algorithms running on very complex and heterogeneous hardware.
QuickThread extends the concept of thread affinity to a new level. It offers developers more control than other libraries because it considers more low-level details of the underlying hardware. For example, it allows a developer to allocate data objects from a particular NUMA node.
In addition, QuickThread focuses on offering developers a simple way to refactor their existing serial code without needing to create additional classes. Developers can work on the same code base and add parallelism by replacing existing code snippets. However, as always, it is necessary to consider all the new complexities introduced by concurrent code. QuickThread makes it simpler to replace an existing loop with a parallelized loop. Nonetheless, the developer still has to write code capable of running concurrently without generating undesired side effects.
Most of the optimizations found in QuickThread work best with Intel C++ Compiler Professional Edition for Windows. As it is a C++ library, it can also take advantage of many of the tools offered by Intel Parallel Studio to optimize parallelized code. It competes with Intel Threading Building Blocks, also known as TBB, because it offers an alternative way to parallelize existing C++ code.
QuickThread allows developers to create both 32-bit (x86-32) and 64-bit (x86-64 or EM64T) applications, and it supports the following affinity schemes:
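To make the idea of swapping a serial loop for a parallelized one more concrete, here is a minimal sketch in standard C++ only. It does not use QuickThread's API. The helper chunked_parallel_for and the functions scale_serial and scale_parallel are hypothetical names invented for this illustration. The loop body is passed as a lambda, so no additional class is needed, and each index is written by exactly one thread, which is what keeps the code free of undesired side effects.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (NOT part of QuickThread): runs body(i) for every i in
// [begin, end), splitting the range into contiguous chunks, one per thread.
template <typename Body>
void chunked_parallel_for(std::size_t begin, std::size_t end,
                          unsigned team_size, Body body) {
    assert(team_size > 0 && begin <= end);
    const std::size_t total = end - begin;
    const std::size_t chunk = (total + team_size - 1) / team_size;  // ceiling
    std::vector<std::thread> team;
    for (unsigned t = 0; t < team_size; ++t) {
        const std::size_t lo = std::min(end, begin + t * chunk);
        const std::size_t hi = std::min(end, lo + chunk);
        if (lo == hi) break;  // no iterations left for further threads
        team.emplace_back([lo, hi, &body] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    for (auto& th : team) th.join();  // wait for the whole team
}

// Original serial loop.
std::vector<double> scale_serial(const std::vector<double>& in, double k) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * k;
    return out;
}

// Same loop body moved into a lambda. Each index is written by exactly one
// thread, so the threads never touch the same element.
std::vector<double> scale_parallel(const std::vector<double>& in, double k,
                                   unsigned team_size) {
    std::vector<double> out(in.size());
    chunked_parallel_for(0, in.size(), team_size,
                         [&](std::size_t i) { out[i] = in[i] * k; });
    return out;
}
```

Calling scale_parallel(data, 2.0, 4) should produce the same vector as scale_serial(data, 2.0). If the loop body updated a shared variable instead, such as a running total, this refactoring would introduce a data race, which is exactly the kind of complexity the developer still has to handle.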
• Thread affinity.
• Data binding affinity.
• NUMA-aware affinity.
It provides a tasking system using thread pools with the goal of producing a minimal overhead mechanism for distributing work in multicore and manycore systems. The reduced overhead and the efficient allocator are two of the most impressive features found in this library and can make a big difference compared to other less efficient libraries. In fact, QuickThread can optimize the scheduler at run-time when many different cache organizations and levels appear.
The reduced overhead and the efficient scheduler are even more important when the parallelized code runs in multi-socket configurations with multiple cores in each socket. I do believe QuickThread has many interesting features for these environments.
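For readers who haven't worked with a tasking system built on a thread pool, the following is a deliberately small sketch of the general technique in standard C++. It is not QuickThread's scheduler and has none of its cache or NUMA awareness. The class SimplePool and the function count_completed are hypothetical names used only for this illustration.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool, for illustration only (NOT QuickThread's design):
// a fixed set of worker threads pulls tasks from one shared queue.
class SimplePool {
public:
    explicit SimplePool(unsigned workers) {
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }
    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& t : threads_) t.join();  // workers drain the queue first
    }
    SimplePool(const SimplePool&) = delete;
    SimplePool& operator=(const SimplePool&) = delete;

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                wake_.wait(lock,
                           [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }

    std::mutex m_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};

// Submits `count` tiny tasks and returns how many of them actually ran.
int count_completed(unsigned workers, int count) {
    std::atomic<int> done{0};
    {
        SimplePool pool(workers);
        for (int i = 0; i < count; ++i) pool.submit([&done] { ++done; });
    }  // the destructor joins the workers after the queue is drained
    return done.load();
}
```

Every task in this sketch goes through a single mutex-protected queue, so the lock itself becomes a source of overhead as the number of cores grows. That cost is the kind of overhead a library with a more carefully designed scheduler tries to keep to a minimum.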
QuickThread offers the following parallel constructs:
• parallel_distribute: Schedules a task team to work on different portions of the same task.
• parallel_for: Schedules a task team to run across an iteration space (divided up evenly among team members or chunked up among team members). It offers a classic for loop parallelization without the need to write an additional class.
• parallel_for_each: Schedules a task team across an iteration space divided up on demand by each team member.
• parallel_invoke: Invokes multiple different tasks provided by C++0x Lambda functions.
• parallel_list: Schedules a task team to process a singly linked list of objects.
• parallel_pipeline: Schedules a task team to process a sequence of steps contained within a vector.
• parallel_reduce: Schedules a task team across an iteration space divided up on demand by each team member, while performing a reduction operation.
• parallel_task: Schedules a single task for its execution.
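The difference between an iteration space divided evenly and one divided on demand, combined with a reduction, can be sketched in standard C++ as follows. This is only an illustration of the general technique under my own assumptions, not QuickThread's parallel_reduce or its signature. The function sum_on_demand and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper (NOT QuickThread's parallel_reduce): sums `data` with a
// team of threads. Each thread claims the next chunk of indices only when it
// is ready for more work, then the per-thread partial sums are combined.
long long sum_on_demand(const std::vector<int>& data, unsigned team_size,
                        std::size_t chunk) {
    if (team_size == 0) team_size = 1;
    // Clamp the chunk size to [1, data.size()] so index arithmetic stays sane.
    chunk = std::max<std::size_t>(1, std::min(chunk, data.size()));

    std::atomic<std::size_t> next{0};              // first unclaimed index
    std::vector<long long> partial(team_size, 0);  // one slot per thread
    std::vector<std::thread> team;

    for (unsigned t = 0; t < team_size; ++t) {
        team.emplace_back([&, t] {
            long long local = 0;
            for (;;) {
                // Atomically claim the next chunk: [lo, hi).
                const std::size_t lo = next.fetch_add(chunk);
                if (lo >= data.size()) break;  // iteration space exhausted
                const std::size_t hi = std::min(data.size(), lo + chunk);
                for (std::size_t i = lo; i < hi; ++i) local += data[i];
            }
            partial[t] = local;  // each thread writes only its own slot
        });
    }
    for (auto& th : team) th.join();

    // Final reduction step, done serially over the partial results.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

With on-demand division, a thread that finishes its chunk early simply claims another one, which helps when iterations take uneven amounts of time. The price is one atomic operation per chunk, so the chunk size is a trade-off between load balance and overhead.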
In its first official version, QuickThread offers support for lambda functions (C++0x). It also offers support for FORTRAN, although that support is not fully implemented in this first official release. One of its main drawbacks is that the document comparing it with Intel Threading Building Blocks is written for developers with previous TBB experience and offers no introduction for beginners. Therefore, if you haven't worked with TBB before, you will likely find it a bit difficult to understand. In that case, you can read the QuickThread programmer's reference instead.
The library offers outstanding performance. Therefore, it is a very interesting option when you're looking for high performance in C++ with simple code.
If you're interested in the features offered by QuickThread, you can read its "Programmers Reference Guide" and "A Comparative Analysis with Intel Threading Building Blocks".



