NUMA (short for "Non-Uniform Memory Access") architectures are becoming popular in high-performance computing (HPC), making efficient, optimized memory allocators very important. QuickThread is a new commercial C++ multicore programming library loaded with many optimizations for NUMA architectures, bringing a new option for creating high-performance parallelized code.
A few months ago, Jim Dempsey, CEO and Chief Architect of QuickThread Programming, invited me to give him some feedback on QuickThread's beta versions. I accepted his invitation, as I'm always attracted to new development tools, languages, libraries, and paradigms related to multicore programming. Thus, I had early access to many of the features found in its first official release.
One of the most interesting features of QuickThread is its design to take full advantage of the underlying hardware, considering all the cache levels (L1, L2, and L3) and the NUMA nodes. The developer can experiment with different configurations in order to maximize the performance of certain algorithms running on very complex and heterogeneous hardware.
QuickThread extends the concept of thread affinity to a new level. It offers developers more control than other libraries because it considers more low-level details of the underlying hardware. For example, it allows a developer to allocate data objects from a particular NUMA node.
In addition, QuickThread focuses on offering developers a simple way to refactor their existing serial code without needing to create additional classes. Developers can work on the same code base and add parallelism by replacing existing code snippets. However, as always, it is necessary to consider all the new complexities introduced by concurrent code. QuickThread makes it simpler to replace an existing loop with a parallelized one. Nonetheless, the developer has to write code capable of running concurrently without generating undesired side effects.
Most of the optimizations found in QuickThread work best with Intel C++ Compiler Professional Edition for Windows. As a C++ library, it can also take advantage of many of the tools offered by Intel Parallel Studio to optimize parallelized code. It competes with Intel Threading Building Blocks (TBB), offering a different alternative for parallelizing existing C++ code.
QuickThread allows developers to create both 32-bit (x86-32) and 64-bit (x86-64 or EM64T) applications, and it supports the following affinity schemes:
- Thread affinity.
- Data binding affinity.
- NUMA-aware affinity.
It provides a tasking system based on thread pools, with the goal of minimizing the overhead of distributing work across multicore and manycore systems. The reduced overhead and the efficient allocator are two of the most impressive features of this library and can make a big difference compared with less efficient libraries. In fact, QuickThread can adapt its scheduler at run-time to many different cache organizations and levels.
The reduced overhead and the efficient scheduler are even more important when the parallelized code runs in multi-socket configurations with multiple cores per socket. I believe QuickThread has many interesting features for these environments.
QuickThread offers the following parallel constructs:
- parallel_distribute: Schedules a task team to work on different portions of the same task.
- parallel_for: Schedules a task team to run across an iteration space, divided evenly among team members or in chunks. It offers classic for-loop parallelization without the need to write an additional class.
- parallel_for_each: Schedules a task team across an iteration space divided on demand among team members.
- parallel_invoke: Invokes multiple different tasks provided by C++0x Lambda functions.
- parallel_list: Schedules a task team to process a singly linked list of objects.
- parallel_pipeline: Schedules a task team to process a sequence of steps contained within a vector.
- parallel_reduce: Schedules a task team across an iteration space divided on demand among team members while performing a reduction operation.
- parallel_task: Schedules a single task for its execution.
In its first official version, QuickThread offers support for C++0x Lambda functions. It also offers Fortran support, although this is not fully implemented in the first official release. One of its main drawbacks is that the comparative analysis with Intel Threading Building Blocks is written for developers with previous TBB experience, without an introduction for beginners. Therefore, if you haven't worked with TBB before, you will likely find it a bit difficult to follow. In that case, you can read the QuickThread programmer's reference instead.
The library offers outstanding performance. Therefore, it is a very interesting option when you're looking for high performance in C++ with simple code.
If you're interested in the features offered by QuickThread, you can read its "Programmers Reference Guide" and "A Comparative Analysis with Intel Threading Building Blocks".