OpenMP offers a simple and portable way to parallelize existing applications or to develop new threaded ones. The performance of a threaded application built with OpenMP depends largely on the following factors:
- The underlying performance of the single-threaded code.
- The percentage of the program that is run in parallel and its scalability.
- CPU utilization, effective data sharing, data locality and load balancing.
- The amount of synchronization and communication among the threads.
- The overhead of creating, resuming, managing, suspending, destroying, and synchronizing threads, which grows with the number of serial-to-parallel and parallel-to-serial transitions.
- Memory conflicts caused by shared memory or falsely shared memory.
- Performance limitations of shared resources such as memory, write combining buffers, bus bandwidth, and CPU execution units.
Essentially, threaded code performance boils down to two issues: how well does the single-threaded version run, and how well can the work be divided up among multiple processors with the least amount of overhead?
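To make the false-sharing factor concrete, here is a minimal sketch of one common remedy: give each thread its own counter and pad it so that no two counters share a cache line. The 64-byte line size, the `padded_counter` type, and the `count_even` function are assumptions made for illustration, not something the text prescribes. The stubs let the same file build without OpenMP.

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stubs so the file also builds with OpenMP switched off. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

#define CACHE_LINE 64  /* assumed cache-line size in bytes; check your CPU */

/* One counter per thread, padded so that two counters are always
   at least one full cache line apart. */
typedef struct {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
} padded_counter;

/* Counts the even values in data[0..n-1]; returns -1 if allocation fails. */
long count_even(const int *data, long n)
{
    int nslots = omp_get_max_threads();  /* upper bound on the team size */
    padded_counter *slots = calloc((size_t)nslots, sizeof *slots);
    if (slots == NULL)
        return -1;

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++)
            if (data[i] % 2 == 0)
                slots[id].count++;  /* each thread writes only its own line */
    }

    long total = 0;
    for (int t = 0; t < nslots; t++)
        total += slots[t].count;
    free(slots);
    return total;
}
```

Without the `pad` member, neighboring counters would sit in the same cache line and every increment could invalidate that line in the other cores' caches, even though no data is logically shared.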
Performance always begins with a well-designed parallel algorithm or well-tuned application. The wrong algorithm, even one written in hand-optimized assembly language, is just not a good place to start. Creating a program that runs well on two cores or processors is not as desirable as creating one that runs well on any number of cores or processors. Remember that, by default, the number of threads in an OpenMP program is chosen by the compiler and runtime library -- not you -- so programs that work well regardless of the number of threads are far more desirable.
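As a small illustration of code that does not depend on the thread count, the sketch below computes a dot product with a `reduction` clause. The function name `dot` and its signature are hypothetical. Nothing in the loop refers to a thread ID or a fixed number of threads, so the runtime is free to use one thread or many.

```c
/* Dot product of a[0..n-1] and b[0..n-1].
   The reduction clause gives every thread a private copy of sum and
   combines the copies at the end, whatever the team size turns out to be. */
double dot(const double *a, const double *b, long n)
{
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Because floating-point addition is not associative, results for general inputs may differ in the last bits from one thread count to another. That is usually acceptable, but worth knowing when comparing against the single-threaded version.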
Once the algorithm is in place, it is time to make sure that the code runs efficiently on the Intel Architecture, and a single-threaded version can be a big help. By turning off the OpenMP compiler option, you can generate a single-threaded version and run it through the usual set of optimizations. A good reference for optimizations is The Software Optimization Cookbook (Gerber 2006). Once you have the single-threaded performance that you want, it is time to generate the multi-threaded version and start doing some analysis.
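One way to keep a single source file that builds both ways is to guard runtime-library calls with the standard `_OPENMP` macro, which a compiler defines only when OpenMP is enabled (for example with `-fopenmp` on GCC). The sketch below assumes a hypothetical helper named `team_size`. When OpenMP is switched off, the pragmas are ignored and the guarded block is skipped, so the helper simply reports one thread.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the number of threads a parallel region actually gets,
   or 1 when the file is compiled without OpenMP support. */
int team_size(void)
{
    int n = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* One thread records the team size; the rest wait at the
           implicit barrier at the end of the single construct. */
        #pragma omp single
        n = omp_get_num_threads();
    }
#endif
    return n;
}
```

Loops that use only pragmas need no guard at all. The guard matters only where the code calls functions declared in `omp.h`.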
First, look at the amount of time spent in the operating system's idle loop. The Intel VTune Performance Analyzer is a great tool to help with the investigation. Idle time can indicate unbalanced loads, lots of blocked synchronization, and serial regions. Fix those issues, then go back to the VTune Performance Analyzer to look for excessive cache misses and memory issues such as false sharing. Solve these basic problems, and you will have a well-optimized parallel program that runs well on multi-core systems as well as multiprocessor SMP systems.
Optimization is really a combination of patience, trial and error, and practice. Write little test programs that mimic the way your application uses the computer's resources to get a feel for which approaches are faster than others. Be sure to try the different scheduling clauses on your parallel loops.
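A little test program of this kind might look like the sketch below, which runs the same deliberately uneven loop under three `schedule` kinds. The `work` function and the chunk size of 8 are invented for illustration. Timing code is left out so the example also builds without OpenMP, but wrapping each call in your preferred timer is the natural next step.

```c
/* Hypothetical uneven workload: iteration i costs roughly i steps,
   so later iterations are far more expensive than early ones. */
static long work(long i)
{
    long acc = 0;
    for (long k = 0; k <= i; k++)
        acc += k % 7;
    return acc;
}

/* Static: iterations are split into equal-sized contiguous blocks up front. */
long run_static(long n)
{
    long total = 0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (long i = 0; i < n; i++)
        total += work(i);
    return total;
}

/* Dynamic: threads grab chunks of 8 iterations as they become free. */
long run_dynamic(long n)
{
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (long i = 0; i < n; i++)
        total += work(i);
    return total;
}

/* Guided: chunks start large and shrink as the loop nears its end. */
long run_guided(long n)
{
    long total = 0;
    #pragma omp parallel for schedule(guided) reduction(+:total)
    for (long i = 0; i < n; i++)
        total += work(i);
    return total;
}
```

All three variants compute the same total, so any difference you measure comes from load balance and scheduling overhead alone. With a workload this skewed, static scheduling tends to leave the threads holding the early iterations idle, but measure on your own system before drawing conclusions.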
Keep the following key points in mind while programming with OpenMP:
- The OpenMP programming model provides an easy and portable way to parallelize serial code with an OpenMP-compliant compiler.
- OpenMP consists of a rich set of pragmas, environment variables, and a runtime API for threading.
- The environment variables and APIs should be used sparingly because they can affect performance detrimentally. The pragmas represent the real added value of OpenMP.
- With the rich set of OpenMP pragmas, you can incrementally parallelize loops and straight-line code blocks such as sections without re-architecting your applications. The Intel task queuing extension makes OpenMP even more powerful by covering more application domains for threading.
- If your application's performance is saturating a core or processor, threading it with OpenMP will almost certainly increase the application's performance on a multi-core or multiprocessor system.
- You can easily use pragmas and clauses to create critical sections, identify private and shared variables, copy variable values, and control the number of threads operating in one section.
- OpenMP automatically uses an appropriate number of threads for the target system so, where possible, developers should consider using OpenMP to ease their transition to parallel code and to make their programs more portable and simpler to maintain. Native and quasi-native options, such as the Windows threading API and Pthreads, should be considered only when this is not possible.
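Several of the points above (critical sections, private and shared variables, and control over the number of threads) can be seen together in one short sketch. The function `max_value` and the cap of four threads are assumptions chosen for illustration. In most code you would omit `num_threads` and let the runtime decide.

```c
/* Returns the largest value in data[0..n-1]; assumes n >= 1. */
int max_value(const int *data, long n)
{
    int best = data[0];  /* shared: one copy, visible to all threads */
    int local_best;      /* private: each thread gets its own copy   */

    /* num_threads(4) requests at most four threads for this region only. */
    #pragma omp parallel num_threads(4) shared(best) private(local_best)
    {
        local_best = data[0];  /* private copies start uninitialized */

        #pragma omp for nowait
        for (long i = 1; i < n; i++)
            if (data[i] > local_best)
                local_best = data[i];

        /* Only one thread at a time may update the shared result. */
        #pragma omp critical
        {
            if (local_best > best)
                best = local_best;
        }
    }
    return best;
}
```

Each thread enters the critical section only once, after its share of the loop, so the synchronization cost stays small no matter how long the array is.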
Shameem Akhter is a platform architect and Jason Roberts a senior software engineer at Intel. They are the authors of Multi-Core Programming.