As the computer architecture paradigm shifts from increasing clock speeds to producing many-core chips, software designers can no longer rely on the "free lunch" performance gains those clock-speed increases provided. To achieve performance gains on the many-core systems that will continue to roll out from chip manufacturers, software must be explicitly parallelized.
What Can Make Implementing Parallelism Expensive?
The design phase of new applications has to take this into account to create competitive products that will take advantage of new chip architectures and increasing core counts. However, there are a large number of existing applications that execute serial code and must be parallelized to avoid becoming outdated. Implementing this parallelism can be very expensive for a number of reasons. Existing serial code has usually been optimized for performance by increasing algorithm efficiency, decreasing memory accesses, etc., without any thought to future parallelization. These optimizations can often leave behind artifacts such as memory access patterns which make it difficult to create independent tasks. If ignored, these artifacts may introduce threading issues such as data races into the parallel program. The intrinsic non-determinism of parallel applications, which leads to these types of errors, makes debugging and testing newly parallelized code difficult and expensive. Additionally, effort spent in parallelizing code that has a minimal contribution to the overall runtime of the application can have a high cost with very little benefit. This is because the overall speedup which can be achieved through parallelization is limited by the runtime of the remaining serial code.
How Do You Get the Best Bang for Your Buck?
Contrary to popular opinion, adding parallelism does not need to be expensive. The key to getting top-drawer results with minimal effort is to spend that effort where it will pay off. Too many developers spend a lot of time trying to get parallel code to work correctly so they can measure the benefit. They spend all the effort on the hard part of the problem before knowing whether it's worth it. A better choice is to "measure twice, cut once". Measure the serial program, and only consider adding parallelism where it might pay off. Then, in those places that could pay off, do a little more measuring. Use software tools to model the parallelism in the serial code before committing to it. First, model the performance details to make sure you will get the benefit you expect. Then, model the correctness problems you might run into. If the correctness issues are small, you can move forward with confidence. If they are daunting, you can choose to move elsewhere before you spend a lot of effort. If you proceed, at least you know in advance what the value of the effort will be.
Focus Effort Where It Matters: Measure the Serial Program
To ensure that effort spent parallelizing code will be cost effective you must consider Amdahl's Law, which asserts that overall speedup from parallelization is limited by the percent of the runtime that still executes serially. For example, if 20% of an application can be parallelized and run on an infinite number of processors (effectively making its runtime 0), the overall speedup of the application will still only be 1.25. The speedup is limited by the 80% still running serially. To get the largest benefit from parallelization, focus should be placed on portions of the code that contribute the most to the overall runtime of the application. Parallelizing these "hotspots" will provide the best speedup.
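Amdahl's Law can be written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the runtime that can be parallelized and n is the number of cores. A minimal sketch of the 20%/1.25x example above (the function name is ours, not from any library):

```python
def amdahl_speedup(p, n):
    """Overall speedup when a fraction p of the runtime is split across n cores."""
    return 1.0 / ((1.0 - p) + p / n)

# 20% of the program parallelized across effectively infinite cores:
# the remaining 80% of serial work caps the speedup at 1.25x.
print(round(amdahl_speedup(0.20, 10**9), 2))  # 1.25
```

Plugging in other values shows the same lesson from the other direction: even parallelizing 80% of a program perfectly across 4 cores yields only 1 / (0.2 + 0.8/4) = 2.5x, not 4x.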
When deciding what regions of an application may see the largest benefit from parallelism, there are a few common characteristics to look for. Regions of code that are very CPU-intensive can benefit greatly from dividing the workload across processors, as long as there are no limiting dependencies between the data. Also, regions of code with long-latency operations such as UI requests and I/O can benefit from parallelism by hiding the latency with useful operations from other threads. Experience, common sense, and quantitative measurements can all factor into determining where a program spends its time.
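Quantitative measurement usually means a profiler. As an illustrative sketch, Python's standard-library cProfile module ranks functions by time and points straight at the hotspot (the `hotspot` and `bookkeeping` function names here are invented for the example):

```python
import cProfile
import io
import pstats

def hotspot(n):
    # CPU-intensive region: dominates the runtime
    return sum(i * i for i in range(n))

def bookkeeping():
    # cheap region: parallelizing this would cost effort for no speedup
    return len("log entry")

def main():
    hotspot(200_000)
    bookkeeping()

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Rank functions by cumulative time; the hotspot rises to the top.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

The same workflow applies with any native profiler (VTune, gprof, perf): measure first, then spend parallelization effort only on the functions at the top of the ranking.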
Clean Up Serial Artifacts Before Dealing with Parallelism
Before thinking hard about making the program parallel, think first about the serial program. If the algorithm is inefficient, there is likely to be a much higher payoff in switching to a better algorithm than in parallelizing a poor one. The reason is simple: a multicore processor may have 8 or even 16 cores, but the number of cores is fixed. A more efficient algorithm can save time proportional to the size of the input -- so for large inputs this can be much larger than the number of cores.
If the location of the hotspot surprised you, make sure you understand the program's behavior. Resolve any unintended behavior first. Then re-measure and see if the hotspot is still in the same place. When you are convinced that the serial code is an efficient implementation, then it is time to consider parallelizing the hotspot.
The next thing to consider is whether the algorithm being used is appropriate for parallelization. Some problems can be solved by several different algorithms. Sorting, for instance, has a family of algorithms -- merge sort, heap sort, and quicksort -- with the same asymptotic run time, but some of them are easier to parallelize than others. Before you parallelize the existing hotspot code, ask yourself if there is a different algorithm that accomplishes the same task and might be more straightforward to parallelize.
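Merge sort is a good example of why the choice matters: its two recursive calls operate on disjoint halves of the input and share no state, so they are naturally independent tasks, whereas heap sort continually updates one shared heap. A serial sketch with the parallelizable split marked in comments:

```python
def merge(left, right):
    """Combine two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def merge_sort(a):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    # These two recursive calls read disjoint halves and share no data,
    # so each could be handed to a separate core as an independent task;
    # only the final merge needs both results.
    left = merge_sort(a[:mid])
    right = merge_sort(a[mid:])
    return merge(left, right)

print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```

The serial structure already exposes the tasks; parallelizing it is a matter of running the two recursive calls concurrently, with no locking needed until the merge.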
Lastly, running code in parallel efficiently requires that the threads share as little data as possible. This is much easier when good programming practices are followed. Declare data as locally as possible -- for example, use thread-local storage instead of global variables -- and make sure that memory is freed when it is no longer needed. These steps limit the lifetime of storage to the scope of the computation that uses it, which eliminates many potential sharing problems and makes the job of parallelizing the code easier.
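As a sketch of the "declare data locally" advice: each worker below accumulates into a thread-private variable and touches shared state exactly once, under a lock, when merging its result (the `parallel_sum` helper is our own illustration, not a library function):

```python
import threading

def parallel_sum(values, nthreads=4):
    partials = []
    lock = threading.Lock()

    def worker(chunk):
        local = 0                  # thread-private: no sharing inside the loop
        for v in chunk:
            local += v
        with lock:                 # one brief, protected merge per thread
            partials.append(local)

    step = max(1, len(values) // nthreads)
    chunks = [values[i:i + step] for i in range(0, len(values), step)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)

print(parallel_sum(list(range(1000))))  # 499500
```

Had each worker incremented one shared total directly, every addition would have needed the lock -- the local accumulator confines sharing to a single merge step per thread.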
Make Sure You Get What You Pay For Before You Pay: Model the Parallelism
In addition to identifying hotspots, programmers should perform some initial analysis on the application to ensure that parallelization efforts will not be in vain. Modeling parallel performance gains and possible issues using deterministic serial code before implementing real parallelism can help programmers avoid common performance issues and provide insight into the steps required to thread the application.
Parallel programming is based on the belief that while an application may have a lot of work to do, much of the work can be done at the same time (in parallel) without causing correctness errors. These portions of the application that can run in parallel are called tasks. Tasks can be scheduled to run at the same time on different threads or processor cores to speed up the application. To fully understand the principles of parallel performance, you must understand some basic terminology:
- Site: A location in an application where parallelism is implemented
- Task: A portion of a site that can run in parallel with other portions of the same site
Adding parallel constructs to an application (e.g., threads, locks) always comes with some associated runtime overhead. It takes some number of clock cycles to spawn a thread, acquire a lock, etc. To determine whether adding parallelism will speed up an application, this overhead has to be understood and taken into account. It isn't worth creating threads that do less work than it costs to create them. Similarly, if two threads compete for a resource protected by a lock, each spends idle cycles waiting for the other to release it, and those cycles are wasted.
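The spawn cost is easy to measure directly. The sketch below times how long it takes to start and join a thread that does nothing; any task whose body runs for less time than this figure is not worth a thread of its own (exact numbers vary widely by machine and threading paradigm):

```python
import threading
import time

def thread_spawn_cost(trials=100):
    """Average seconds to create, start, and join a no-op thread."""
    start = time.perf_counter()
    for _ in range(trials):
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
    return (time.perf_counter() - start) / trials

cost = thread_spawn_cost()
print(f"per-thread overhead: ~{cost * 1e6:.0f} microseconds")
```

Thread pools and task schedulers (such as the one in Intel Threading Building Blocks) exist largely to amortize this cost: threads are created once and reused, so only the much cheaper task-dispatch overhead is paid per task.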
Model the Performance
It can be very costly to implement parallelism only to find out that the overhead is eating away all of the performance gains. For this reason, it is very useful to model sites, tasks, and locks in the serial application to determine the value of adding parallel code. Intel Parallel Advisor uses simple source code annotations to allow developers to model the performance of an application if it were parallelized without having to implement real parallelism. The code is still run serially and will support any existing test infrastructure already in place.
After users add annotations to represent potential parallel sites and tasks in their application, Advisor uses runtime information to determine the potential performance gains on multicore architectures if the code were truly parallelized in this way. This performance modeling takes into account many factors, including the total runtime of the application versus the runtime of the tasks, the target number of cores on which the application will run, and the overhead of adding a threading paradigm.
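Advisor's actual annotations are source-level constructs described in its documentation; purely to illustrate the underlying idea, here is a hypothetical shim that records task durations during a serial run and projects a best-case parallel runtime for a target core count (every name in this sketch is invented):

```python
import time

class SiteModel:
    """Hypothetical stand-in for parallel-site annotations: runs tasks
    serially while recording how long each one takes."""

    def __init__(self):
        self.task_times = []

    def run_task(self, fn, *args):
        start = time.perf_counter()
        result = fn(*args)               # still executes serially
        self.task_times.append(time.perf_counter() - start)
        return result

    def projected_speedup(self, total_runtime, cores):
        # Best case: task time spreads evenly over cores; the remainder
        # of the runtime stays serial (Amdahl's Law again).
        in_site = sum(self.task_times)
        serial = total_runtime - in_site
        projected = serial + in_site / cores
        return total_runtime / projected

site = SiteModel()
t0 = time.perf_counter()
for n in (50_000, 50_000, 50_000, 50_000):
    site.run_task(lambda m: sum(range(m)), n)   # four candidate tasks
total = time.perf_counter() - t0
print(f"projected speedup on 4 cores: {site.projected_speedup(total, 4):.2f}x")
```

Because the program still runs serially, the run is deterministic and the existing test suite keeps working -- which is precisely the appeal of modeling before threading.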
There are many parallel paradigms available today, such as Intel Threading Building Blocks and native threads, and no single implementation works best in all situations. These annotations are lightweight and easy to use, which gives developers the ability to model the performance of many different parallel implementations. This can increase the number of designs that can be tested while still speeding up the process of determining which implementation provides the best performance.
Model the Correctness Problems
With the help of lightweight annotations and tools to analyze the annotated serial code, many common correctness issues of parallel applications, such as race conditions, can be detected and avoided before any parallelism is added. One of the most time-consuming steps in developing parallel code is debugging, because parallel code is intrinsically non-deterministic: bugs may appear and disappear randomly with identical test cases, and developers are at the mercy of the task scheduler to reproduce them. By modeling parallelism in the serial code with annotations, test results are deterministic and repeatable. Advisor has the ability to interpret the model and detect potential correctness issues that could occur when the code is actually parallelized. This technique mitigates the time-consuming task of debugging parallel code.
Using serial modeling tools allows a programmer to investigate how to add parallelism to an existing application without a lot of effort. With these tools, you can:
- Focus on adding parallelism where it matters
- Check that parallel overheads will not overwhelm the performance improvement from threading
- Check that any data shared between tasks is understood, and accesses are handled appropriately
Following this approach, you avoid the pitfalls that make going parallel expensive -- you can find out early if there's going to be a problem, and resolve the issues before moving on.
When you are ready to parallelize the code, you already have a design you can believe in. You may still introduce some problems during the transition to parallel code -- or they may slip in later, during program maintenance. Then it's time to use one of the tools that focuses on parallel applications. By designing the parallelism in and getting the "big picture" correct first, the need for these late-stage tools is reduced, and the total effort of parallelizing is much smaller, because the effort is focused where it matters, and where there is a good chance of success.