Memory use by threads can have unforeseen effects on performance. To avoid the problems that careless memory allocation and management can cause, you need to be mindful of how exactly memory is consumed.
Allocating from the Heap
Just like regular programs, threads use a stack to allocate local variables. In addition to the data elements they can place on the stack, threads can allocate memory dynamically from the heap (using malloc() in C or new in C++). Unlike the stack, which is private to each thread, the heap is shared by all threads in the program. Consequently, when a thread allocates memory from the heap, the operation must be thread-safe, which requires a lock. In fact, for that moment the entire heap is frozen: no other allocations or releases are permitted. The lock placed on the heap acts much like a mutex: for that moment, only one thread has access. As a result, threads that frequently allocate and release memory from the heap hurt performance because they are constantly blocking or being blocked by other threads. (This will appear as a hot spot in your performance analysis.)
This problem is generally solved by grabbing a large chunk of memory off the heap at thread start-up and then managing that space within the thread. As with so many activities in parallel programming, this choice provides performance benefits at the cost of added complexity. The additional cost is less efficient use of memory, as each thread will hold portions of the heap that it might not be using. The latter problem is not grave, because systems today are so memory-rich. However, the complexity-performance trade-off needs further consideration. First, let's examine performance.
The principal benefit has already been explained. In addition to avoiding the memory-allocation lock, your own functions can reduce allocation overhead because they do not have to be general-purpose, unlike the operating system's functions. In the best cases, allocating memory from a pre-allocated chunk simply requires advancing a pointer. Hence, the allocation can be very efficient.
The management of allocated memory, however, can be very complex if memory efficiency is important. The greatest problems are keeping track of what memory is available, what memory is marked for deletion, and allocating new chunks efficiently. As parts of the memory pool are allocated and released, the free memory becomes broken up into hundreds of chunks of different sizes. A good memory manager has to keep track of these chunks and reuse them for future allocations. In the event that it cannot allocate needed memory, it must go back to the heap and grab a new memory pool that it will manage in addition to the existing one. The code to manage all this memory activity must be completely bug-free and highly efficient if performance is to be maintained.
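To make the pointer-advancing approach concrete, here is a minimal sketch of a per-thread bump allocator (all names are hypothetical). Because each thread owns its own pool, no lock is needed; the trade-off, as discussed above, is that individual blocks cannot be freed here, only the whole pool at once:

```c
#include <stddef.h>
#include <stdlib.h>

/* A minimal per-thread bump allocator (hypothetical names).
 * Each thread owns one pool, so no lock is needed; allocation
 * is just a pointer bump. Individual blocks cannot be freed;
 * the whole pool is released at once. */
typedef struct {
    char  *base;   /* start of the pool */
    size_t size;   /* total pool size in bytes */
    size_t used;   /* bytes handed out so far */
} pool_t;

int pool_init(pool_t *p, size_t size) {
    p->base = malloc(size);       /* one trip to the shared heap */
    p->size = size;
    p->used = 0;
    return p->base != NULL;
}

void *pool_alloc(pool_t *p, size_t n) {
    n = (n + 7) & ~(size_t)7;     /* keep 8-byte alignment */
    if (p->used + n > p->size)
        return NULL;              /* caller must grab a new pool */
    void *result = p->base + p->used;
    p->used += n;
    return result;
}

void pool_destroy(pool_t *p) {
    free(p->base);                /* single release back to the heap */
}
```

Note that when pool_alloc() returns NULL, a real implementation would go back to the heap for a fresh pool, which is exactly the added bookkeeping described above.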
Some established techniques can help in this regard. The first is to know a lot about the memory allocation patterns of your program. For example, if the main data structure of a program is an open hash table (that is, one that can add buckets dynamically), then you know you will allocate numerous data items of the same size. Learn to approximate how many of these items you'll need and pre-allocate them as a group, so that they can be allocated and freed from the same pool.
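When every item is the same size, the bookkeeping collapses to a simple free list: freed items are chained together and handed back out on the next allocation. A rough sketch (names hypothetical, 32-byte items assumed for illustration):

```c
#include <stddef.h>
#include <stdlib.h>

/* Fixed-size object pool (hypothetical names). All items are the
 * same size, so a freed item is threaded onto a singly linked
 * free list and reused by the next allocation -- no searching,
 * no fragmentation. */
typedef union node {
    union node *next;   /* valid while the node is on the free list */
    char payload[32];   /* the item itself while allocated */
} node_t;

typedef struct {
    node_t *items;      /* the pre-allocated block of nodes */
    node_t *free_list;  /* head of the chain of available nodes */
} fixed_pool_t;

int fixed_pool_init(fixed_pool_t *p, size_t count) {
    p->items = malloc(count * sizeof(node_t));
    if (p->items == NULL) return 0;
    p->free_list = NULL;
    for (size_t i = 0; i < count; i++) {  /* chain all nodes */
        p->items[i].next = p->free_list;
        p->free_list = &p->items[i];
    }
    return 1;
}

void *fixed_pool_alloc(fixed_pool_t *p) {
    node_t *n = p->free_list;
    if (n != NULL)
        p->free_list = n->next;   /* pop the head */
    return n;                     /* NULL when the pool is exhausted */
}

void fixed_pool_free(fixed_pool_t *p, void *item) {
    node_t *n = item;
    n->next = p->free_list;       /* push back onto the list */
    p->free_list = n;
}
```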
Another technique used by memory management software is to group requests of similar sizes in similar areas of a memory pool. For example, if a thread requires many allocations of different sizes, try dividing its pre-allocated memory pool into different areas for the small, medium, and large chunks. This approach minimizes the negative effects of fragmentation and accelerates the allocation process.
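The first step in that scheme is simply mapping each request to a size class. A sketch, with purely illustrative thresholds:

```c
#include <stddef.h>

/* Map a request size to one of three size classes so that small,
 * medium, and large allocations are served from separate regions
 * of the thread's pool. The thresholds here are illustrative, not
 * tuned values. */
enum size_class { CLASS_SMALL, CLASS_MEDIUM, CLASS_LARGE };

enum size_class classify(size_t n) {
    if (n <= 64)   return CLASS_SMALL;   /* e.g., list nodes */
    if (n <= 1024) return CLASS_MEDIUM;  /* e.g., strings, small buffers */
    return CLASS_LARGE;                  /* everything else */
}
```

Each class then gets its own region of the pre-allocated pool, so a burst of small allocations never carves holes into the space reserved for large ones.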
Doing memory management yourself is a practice widely adopted in parallel programming, especially in server software and in desktop software that requires large amounts of memory (such as imaging software). Fortunately, open-source memory managers exist and can remove some of the difficulty of writing and testing this code. One package, called Hoard, has received kudos. Its implementation, which is aimed specifically at multithreaded programs, redefines malloc() to point to its own implementation. This arrangement means that virtually no code changes are needed in existing programs to solve the problem of thread contention for the heap.
You can also implement your own solution. If you choose to write your own memory allocation scheme, you should read the paper Reconsidering Custom Memory Allocation [PDF]. It provides good advice on the do's and don'ts of such a project. Several commercial products exist as well, such as SmartHeap from MicroQuill.
Thread-local storage (TLS) is a related idea that enables a thread to allocate storage for its exclusive use, without fear of access by other threads. Windows has a set of functions that enable a thread to create its own personal heap, which does not have to be locked every time memory is allocated or released. (See the HeapCreate() family of functions.) In Pthreads, the comparable facility is thread-specific data, managed with pthread_key_create(), pthread_setspecific(), and pthread_getspecific().
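As a sketch of the Pthreads side (arena size and names are hypothetical), each thread can lazily create a private arena and store the pointer as thread-specific data; the destructor registered with the key frees the arena automatically when the thread exits:

```c
#include <pthread.h>
#include <stdlib.h>

/* One key shared by all threads; each thread stores its own
 * arena pointer under it. */
static pthread_key_t arena_key;

static void arena_destroy(void *arena) {
    free(arena);   /* runs automatically at thread exit */
}

/* Call once, before any thread uses my_arena(). */
void arenas_init(void) {
    pthread_key_create(&arena_key, arena_destroy);
}

/* Lazily create this thread's private 64-KB arena (size is
 * illustrative). No lock is needed: the value is per-thread. */
void *my_arena(void) {
    void *arena = pthread_getspecific(arena_key);
    if (arena == NULL) {
        arena = malloc(64 * 1024);
        pthread_setspecific(arena_key, arena);
    }
    return arena;
}
```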
False Sharing

While the problem of threads contending for heap access is universal in parallel programming, false sharing is a system-dependent issue. It arises from the way memory is loaded into cache. When the processor loads data from RAM into cache, it does so in units called cache lines. On the Northwood family of processors (the Pentium 4 and Xeon processors through 2008), the size of the cache line (the minimum amount of data read from memory) is 128 bytes. On later processors, cache lines are 64 bytes, but read two at a time. This minimum unit means that if any byte in a cache line is accessed, the whole cache line is loaded into cache.
When data is loaded into cache, a special relationship is set up between all processors that have this block in cache. This relationship maintains a condition known as cache coherency: all copies of an item in cache must contain exactly the same data. So, whenever a cache line is updated, all caches that contain the line are signaled to refresh their copies.
The problem of false sharing arises when two threads access non-shared data fields within the same cache line. That is, two data items used by separate threads sit within the same cache line. When this occurs, every time one thread modifies its data in the cache line, the other thread must update its copy, even if it has no interest in the data item that was modified. This problem is known as false sharing, because the caches are updating on the belief that they share the data, when in fact they don't. Since cache coherency is fundamental to system integrity, when the signal arrives to update the cache, the processor must respond immediately. If the two threads access their respective fields frequently, the system bus can become swamped with update commands, and true processing will grind to a crawl.
The obvious solution is to make sure no two fields used by different threads are within the same cache line. However, this solution is insufficient, because two threads might inadvertently access data items within the same cache line. For example, suppose a developer decides to perform data decomposition on an array by using two threads to update it. One thread updates every odd element, the other every even element. These two threads will be constantly updating elements within the same cache line and subject to the perils of false sharing. Because it can arise subtly, false sharing should be carefully guarded against.
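For per-thread fields, the fix can be expressed directly in the data layout: align each field to its own cache line. A minimal sketch, assuming a 64-byte line size (adjust for your target; as noted above, some older processors use 128 bytes):

```c
#include <stdalign.h>
#include <stddef.h>

/* Two counters, each updated by a different thread. */

/* Bad layout: both longs fall in the same cache line, so every
 * update by one thread invalidates the other thread's copy. */
struct shared_bad {
    long thread_a_count;
    long thread_b_count;
};

/* Better layout: aligning each field to a 64-byte boundary puts
 * them in separate cache lines, eliminating the false sharing. */
struct shared_good {
    alignas(64) long thread_a_count;
    alignas(64) long thread_b_count;
};
```

The cost is visible in sizeof: the padded struct occupies 128 bytes instead of 16, which is the memory-for-speed trade typical of this fix. For the odd/even array example, the analogous cure is to give each thread a contiguous block of the array rather than interleaved elements.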
False sharing can be a devastating problem on multiprocessor systems. It exists in a slightly different form on Hyper-Threading processors, even though the two logical processors share the L2 cache. When one thread modifies a cache line that the other thread is also accessing, the processor notifies the other thread ahead of time of the pending update to make sure no conflict in the data exists. This dialog, while somewhat different in nature from the chatter described earlier, constitutes unneeded communication for unshared data; that is, false sharing. And its effect on performance is equally devastating.
The 64KB Aliasing Problem
When Intel processors based on the NetBurst architecture load caches, they keep track of the cache data as offsets from a 64-KB boundary. That is, if two items each start 512 bytes in from a 64-KB boundary, they will collide in the cache, and the older one will be expelled by the new memory reference. This design works particularly well in a single-threaded model but, predictably, can lead to problems when multiple threads are in use. The problem is that when threads are started up, their stacks always start on a 64-KB boundary. Hence, the items allocated on a thread's stack will frequently collide in cache with those of another active thread.
When this collision occurs, the cache management circuitry must reload from memory the data it's looking for. Of course, it loads that data into cache, overlaying the data from the other thread. When that thread writes to a local variable, the process repeats itself. The result is needless fetches from memory, and on the NetBurst architecture, memory fetches are particularly expensive.
The way to avoid this is simple: allocate one large local variable on the stack and don't access it. All other local variables should be located after this unused space. Across threads, use different sizes for this initial unused data area. Now local variables will not conflict with each other, or at least will conflict significantly less. (In Windows, another option is to use the _alloca() API, which performs a similar function.)
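The technique can be sketched with the portable alloca() (names and padding sizes are hypothetical; alloca() is the Unix counterpart of the Windows _alloca() just mentioned):

```c
#include <alloca.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the stack-stagger trick. Each thread reserves a
 * different, thread-specific amount of stack before doing any real
 * work, so everything placed on the stack afterward lands at a
 * different offset from the 64-KB boundary than in other threads. */
uintptr_t thread_body(int thread_id) {
    /* 1 KB of padding per thread id -- an illustrative scheme */
    char *pad = alloca((size_t)(thread_id + 1) * 1024);
    pad[0] = 0;   /* touch the padding so it isn't optimized away */

    /* ... the thread's real locals and function calls follow ... */

    return (uintptr_t)pad;   /* returned only to show the stagger */
}
```

Calling this with different thread ids yields stack regions at different offsets, which is exactly what keeps the threads' hot locals from evicting each other in a 64-KB-aliased cache.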
The NetBurst architecture was phased out beginning in 2006, and the migration to the Core architecture was complete by 2008. So the aliasing problem exists only on processors shipped before 2008. However, because the problem is so costly when it occurs, it is mentioned here for readers working on older Intel systems.
In sum, threading performance is sensitive to a few memory issues. Fortunately, these issues can be fairly easily eliminated if you bear them in mind as you code.
This post is an update of material in Chapter 7 of Programming with Hyper-Threading Technology by Rich Gerber and Andrew Binstock (Intel Press, 2004). Portions copyright Intel Corp. False sharing is discussed in depth by Herb Sutter in this article.