Max Domeika is a senior staff software engineer in the Developer Products Division at Intel and author of Software Development for Embedded Multi-Core Systems.
Dual-core and quad-core processors from Intel give developers an opportunity to scale performance while optimizing power consumption. To fully exploit this opportunity, developers must understand the inherent parallelism in their applications. This article presents an overview of parallelism constructs and programming techniques, focusing on common threading issues and performance tuning. Intel Parallel Studio, a software development tool suite for Microsoft Windows, helps unlock the power of parallelism for your software projects. It comprises three tools -- Intel Parallel Amplifier, Intel Parallel Composer, and Intel Parallel Inspector -- that support the design, implementation, debugging, and tuning phases of multicore application development.
To assist developers in identifying opportunities for parallelism, Intel offers Intel Parallel Amplifier, a performance analysis tool that determines which code regions in the application consume the most CPU time. Two views of the data generated by this tool are useful when analyzing code for threading opportunities -- file-level and call graph.
Focusing on file-level hotspots in your application lets you understand how much time is spent in each function of your program. Once the most time-consuming functions are identified, drill down to the source code to determine whether threading can be effectively implemented. Some resource-intensive functions may not lend themselves to parallel execution. If you find yourself faced with a hotspot that cannot be threaded, the call graph view is the next step. This view graphically depicts the call tree through an application. Even when your hotspot is not amenable to threading, the call graph may identify a function further up the call tree that can be threaded. Threading a function further up the call tree improves performance by allowing multiple threads to call the hot function simultaneously.
The implementation of parallelism in a system can take many forms; one commonly used type is shared memory parallelism, which implies:
- Multiple threads execute concurrently.
- The threads share the same address space. This contrasts with multiple processes, which can execute in parallel but each with its own address space.
- Threads coordinate their work.
- Threads are scheduled by the underlying operating system and require OS support.
To illustrate the keys to effective parallelism, I present a real-world example -- multiple workers mowing a lawn. The first consideration is how to divide the work evenly. This even division of labor has the effect of keeping each worker as active as possible. Second, the workers should each have their own lawn mower; not doing so would significantly reduce the effectiveness of the multiple workers. Finally, access to items such as the fuel can and clipping container needs to be coordinated. The keys to parallelism illustrated through this example are generalized as follows:
- Identify the concurrent work.
- Divide the work evenly.
- Create private copies of commonly used resources.
- Synchronize access to unique shared resources.
Three classifications of parallel technologies available in Intel Parallel Composer are task programming libraries, domain-specific threaded libraries, and compiler threading support. Intel Threading Building Blocks is an example of a programming library that abstracts control of low-level threads behind a C++ template library. Domain-specific threaded libraries consist of optimized multithreaded functions specific to a domain such as image processing. Intel Parallel Composer includes Intel Integrated Performance Primitives, which contain optimized routines targeting image processing, signal processing, cryptography, and several other domains. The third technology is the threading support built into Intel Parallel Composer in the form of OpenMP and automatic parallelization.