Parallel programming has been around for decades, though before the advent of multicore processors it was more of an esoteric discipline. Now that numerous programmers have tripped over the common stumbling blocks, recognizing these problems and knowing how to manage them lets you avoid repeating their mistakes. Many of these problems arise from the initial design of the program's threading model and cannot be easily patched later. This article lists some of these common threading problems, their effects, and ways to avoid them.
Too Many Threads
It may seem that if a little threading is good, then a lot must be better. In fact, having too many threads can seriously degrade program performance. The impact comes in two ways. First, partitioning a fixed amount of work among too many threads gives each thread too little work, so that the overhead of starting and terminating threads swamps the useful work. Second, having too many concurrent software threads incurs overhead from having to share fixed hardware resources.
When there are more software threads than hardware threads, the operating system typically resorts to round robin scheduling. The scheduler gives each software thread a short turn, called a time slice, to run on one of the hardware threads. When a software thread's time slice runs out, the scheduler preemptively suspends the thread in order to run another software thread on the same hardware thread. The software thread freezes in time until it gets another time slice.
Time slicing ensures that all software threads make some progress. Otherwise, some software threads might hog all the hardware threads and starve other software threads. However, this equitable distribution of hardware threads incurs overhead. When there are too many software threads, the overhead can severely degrade performance. There are several kinds of overhead, and it helps to know the culprits so you can spot them when they appear.
The most obvious overhead is the process of saving and restoring a thread's register state. Suspending a software thread requires saving the register values of the hardware thread, so the values can be restored later, when the software thread resumes on its next time slice. Typically, thread schedulers allocate big enough time slices so that the save/restore overheads for registers are insignificant, so this obvious overhead is in fact not much of a concern.
A more subtle overhead of time slicing is saving and restoring a thread's cache state. Modern processors rely heavily on cache memory, which can be about 10 to 100 times faster than main memory. Accesses that hit in cache are not only much faster; they also consume no bandwidth of the memory bus. Caches are fast, but finite. When the cache is full, a processor must evict data from the cache to make room for new data. Typically, the choice for eviction is the least recently used data, which more often than not is data from an earlier time slice. Thus threads tend to evict each other's data. The net effect is that too many threads hurt performance by fighting each other for cache.
A similar overhead, at a different level, is thrashing virtual memory. Most systems use virtual memory, where the processors have an address space bigger than the actual available memory. Virtual memory resides on disk, and the frequently used portions are kept in real memory. Similar to caches, the least recently used data is evicted from memory when necessary to make room. Each software thread requires virtual memory for its stack and private data structures. As with caches, time slicing causes threads to fight each other for real memory and thus hurts performance. In extreme cases, there can be so many threads that the program runs out of even virtual memory.
The cache and virtual memory issues described arise from sharing limited resources among too many software threads. A very different, and often more severe, problem is convoying, in which software threads pile up waiting to acquire a lock. Consider what happens when a thread's time slice expires while the thread is holding a lock. All threads waiting for the lock must now wait for the holding thread to wake up and release it. The problem is even worse if the lock implementation is fair, meaning that the lock is granted in first-come, first-served order: if a waiting thread is suspended, then all threads waiting behind it are blocked from acquiring the lock.
The solution that usually works best is to limit the number of "runnable" threads to the number of hardware threads, or possibly to the number of outer-level caches. For example, a dual-core processor has two physical cores, each with Hyper-Threading Technology, and each with its own cache. This configuration supports four hardware threads and two outer-level caches. Using all four runnable threads will work best unless the threads need so much cache that they fight over it, in which case running only two threads may be best. The only way to be sure is to experiment. Never hard code the number of threads; leave it as a tuning parameter.
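The tuning-parameter advice can be sketched in C++. The environment variable name NUM_WORKER_THREADS below is a made-up knob for illustration, not any standard; the fallback uses std::thread::hardware_concurrency, which reports the number of hardware threads but may return 0 when the value is not computable:

```cpp
#include <cstdlib>
#include <string>
#include <thread>

// Choose the number of worker threads. The environment variable name
// NUM_WORKER_THREADS is a hypothetical tuning knob, so the value can be
// changed per machine without recompiling.
int choose_thread_count() {
    if (const char* env = std::getenv("NUM_WORKER_THREADS")) {
        int n = std::atoi(env);
        if (n > 0) return n;
    }
    // Fall back to the number of hardware threads; this call may
    // return 0 if the value is unknown, so default to 2 in that case.
    unsigned hw = std::thread::hardware_concurrency();
    return hw ? static_cast<int>(hw) : 2;
}
```

A benchmark run can then sweep NUM_WORKER_THREADS over, say, 2 and 4 on the dual-core example above to find the best setting experimentally.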
Runnable threads, not blocked threads, cause time-slicing overhead. When a thread is blocked waiting for an external event, such as a mouse click or disk I/O request, the operating system takes it off the round-robin schedule. Hence, a blocked thread does not cause time-slicing overhead. A program may have many more software threads than hardware threads, and still run efficiently if most of the OS threads are blocked.
A helpful organizing principle is to separate compute threads from I/O threads. Compute threads should be the threads that are runnable most of the time. Ideally, the compute threads never block on external events, and instead feed from task queues that provide work. The number of compute threads should match the processor resources. The I/O threads are threads that wait on external events most of the time, and thus do not contribute to having too many threads.
Because building efficient task queues takes some expertise, it is usually best to use existing software to do this. Common useful practices are as follows:
- Let OpenMP do the work. OpenMP lets the programmer specify loop iterations instead of threads. OpenMP deals with managing the threads. As long as the programmer does not request a particular number of threads, the OpenMP implementation will strive to use the optimal number of software threads.
- Use a thread pool, which maintains a set of long-lived software threads and eliminates the overhead of creating and destroying threads for short-lived tasks. Associated with the pool is a collection of tasks, which are serviced by the software threads in the pool. Each software thread finishes a task before taking on another. For example, Windows has a routine QueueUserWorkItem. Clients add tasks by calling QueueUserWorkItem with a callback and a pointer that define the task. The pool's threads feed from this queue. For managed code, Windows .NET has a class ThreadPool. Java has a class Executor for similar purposes. Unfortunately, there is no standard thread pool support in POSIX threads.
- Experts may wish to write their own task scheduler. The method of choice is called work stealing, where each thread has its own private collection of tasks. When a thread runs out of tasks, it steals from another thread's collection. Work stealing yields good cache usage and load balancing. While a thread is working on its own tasks, it tends to be reusing data that is hot in its cache. When it runs out of tasks and has to steal work, it balances the load. The trick to effective task stealing is to bias the stealing towards large tasks, so that the thief can stay busy for a while. The early Cilk scheduler is a good example of how to write an effective task-stealing scheduler.
Data Races, Deadlocks, and Live Locks
Unsynchronized access to shared memory can introduce race conditions, where the program results depend nondeterministically on the relative timings of two or more threads. Figure 1 shows two threads trying to add to a shared variable x, which has an initial value of 0. Depending upon the relative speeds of the threads, the final value of x can be 1, 2, or 3.
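Since Figure 1 is not reproduced here, the race can be sketched in C++ as one thread adding 1 and another adding 2, each as a separate read and write, which is exactly the decomposition that makes the race possible. std::atomic is used for the individual accesses only to keep the sketch free of undefined behavior; the interleaving problem is the same for plain loads and stores:

```cpp
#include <atomic>
#include <thread>

// Two threads update x via separate read and write steps, so one
// thread's update can be lost if the other writes in between.
int race_once() {
    std::atomic<int> x{0};
    auto add = [&x](int value) {
        int temp = x.load();    // read
        x.store(temp + value);  // write; another thread may have
                                // updated x between these two steps
    };
    std::thread t1(add, 1);
    std::thread t2(add, 2);
    t1.join();
    t2.join();
    return x.load();  // may be 1, 2, or 3 depending on interleaving
}
```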
Parallel programming would be a lot easier if races were as obvious as in Figure 1. But the same race can be hidden by language syntax in a variety of ways, as shown by the examples in Figure 2. Update operations such as += are normally just shorthand for "temp = x; x = temp+1", and hence can result in interleaving. Sometimes the shared location is accessed by different expressions. Sometimes the shared location is hidden by function calls. Even if each thread uses a single instruction to fetch and update the location, there could be interleaving, because the hardware might break the instruction into interleaved reads and writes.
Intel Thread Checker is a powerful tool for detecting potential race conditions. It can see past all the varieties of camouflage shown in Figure 2 because it deals in terms of actual memory locations, not their names or addressing expressions.
Sometimes race conditions are deliberate and useful. For example, threads may be reading a location that is updated asynchronously with a "latest current value." In such a situation, care must be taken that the writes and reads are atomic. Otherwise, garbled data may be written or read. For example, reads and writes of structure types are often done a word at a time or a field at a time. Types longer than the natural word size, such as 80-bit floating-point, might not be read or written atomically, depending on the architecture. Likewise, misaligned loads and stores, when supported, are usually not atomic. If such an access straddles a cache line, the processor performs the access as two separate accesses to the two constituent cache lines.
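A minimal C++ sketch of this "latest current value" pattern, assuming for illustration a single double reading published by one writer and polled by readers: std::atomic guarantees each read and write is indivisible, so a reader never sees a torn value, a guarantee a plain long double or misaligned field may not have:

```cpp
#include <atomic>

// Hypothetical sensor value updated asynchronously by a writer thread
// and polled by readers. std::atomic makes each access indivisible.
std::atomic<double> latest_reading{0.0};

void publish(double v) {
    // release ordering makes the value visible to acquire loads
    latest_reading.store(v, std::memory_order_release);
}

double read_latest() {
    return latest_reading.load(std::memory_order_acquire);
}
```

Readers may still observe a slightly stale value, which is acceptable for this pattern; what atomicity rules out is a garbled mix of two writes.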
Data races can arise not only from unsynchronized access to shared memory, but also from synchronized access that was synchronized at too low a level. Figure 3 shows such an example. The intent is to use a list to represent a set of keys. Each key should be in the list at most once. Even if the individual list operations have safeguards against races, the combination suffers a higher level race. If two threads both attempt to insert the same key at the same time, they may simultaneously determine that the key is not in the list, and then both would insert the key. What is needed is a lock that protects not just the list, but that also protects the invariant "no key occurs twice in list."
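Since Figure 3 is not reproduced here, the following C++ sketch (with hypothetical names KeySet and insert_if_absent) shows the fix described above: one lock held across the entire check-then-insert sequence protects the invariant "no key occurs twice", not just the individual container operations:

```cpp
#include <mutex>
#include <set>
#include <string>

// Set of keys where the lock guards the whole check-then-insert
// sequence, so two threads cannot both decide a key is absent.
class KeySet {
public:
    // Returns true if the key was inserted, false if already present.
    bool insert_if_absent(const std::string& key) {
        std::lock_guard<std::mutex> lock(m_);  // held across test AND insert
        return keys_.insert(key).second;
    }
private:
    std::mutex m_;
    std::set<std::string> keys_;
};
```

Had the lock instead been acquired separately for the membership test and the insertion, the higher-level race from Figure 3 would remain even though each operation was individually safe.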
Adding the necessary lock to correct Figure 3 exposes a frustrating performance problem with locks. Building locks into low-level components is often a waste of time, because the higher-level components that use them will need higher-level locks anyway. The lower-level locks then become pointless overhead. Fortunately, in such a scenario the high-level locking leaves the low-level locks uncontended, and most lock implementations optimize the uncontended case. Hence the performance impact is somewhat mitigated, but for best performance the superfluous locks should be removed. Of course, there are times when components should provide their own internal locking. This topic is discussed later in the discussion of thread-safe libraries.