Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Channels ▼


Memory Constraints on Thread Performance

Memory use by threads can have unforeseen effects on performance. To avoid the problems that careless memory allocation and management can cause, you need to be mindful of how exactly memory is consumed.

Allocating from the Heap

Just like regular programs, threads use a stack to allocate local variables. In addition to the data elements they can place on the stack, threads can allocate memory dynamically from the heap (using malloc() in C and new in C++). Unlike the stack which is private to each thread, the heap is shared by all threads in the program. Consequently, when a thread allocates memory from the heap, the operation needs to be thread-safe, which requires a lock. In fact, for that moment, the entire heap is frozen — no other allocations or releases are permitted. The lock placed on the heap is very much like a mutex-kind of mechanism: for that moment only one thread has access. The result of this is that threads that frequently allocate and release memory from the heap are hurting performance because they are constantly being blocked or blocking other threads. (This will appear as a cold spot in your performance analysis.)

This problem is generally solved by grabbing a large chunk of memory off the heap at thread start-up and then managing the space via the thread. As with so many activities in parallel programming, this choice provides performance benefits at the cost of added complexity. The additional cost is an efficient use of memory, as each thread will have allocated portions of the heap that it might not be using. The latter problem is not grave, because systems today are so memory rich. However, the complexity-performance trade-off needs further consideration. First let's examine performance.

The principal benefit has already been explained. In addition to avoiding memory-allocation lock, your own functions can reduce the amount of overhead in memory allocation because they do not have to be general-purpose, unlike the operation system functions. In the best cases, allocating memory from a pre-allocated chunk simply requires aiming a pointer at it. Hence, the allocation can be very efficient.

The management of allocated memory, however, can be very complex, if memory efficiency is important. The greatest problems are keeping track of what memory is available, what memory is marked for deletion, and allocating new chunks efficiently. As parts of the memory pool are allocated and released, the free memory becomes broken up over hundreds of chunks of different sizes. A good memory manager has to keep track of these chunks and reuse them on future allocations. In the event that it cannot allocate needed memory, it must go back to the heap and grab a new memory pool that it will manage in addition to the existing one. The code to manage all this memory activity must be completely bug-free and highly efficient, if performance is to be maintained.

Some established techniques can help in this regard. The first is to know a lot about the memory allocation of your program. For example, if the main data structure of a program is an open hash table (that is one that can add buckets dynamically), then you know that you will allocate numerous data items of the same size. Learn to approximate the number of these items you'll need and pre-allocate them as a group, so that they can be allocated and freed from the same pool.

Another technique used by memory management software is to group requests of similar sizes in similar areas of a memory pool. For example, if a thread requires many allocations of different sizes, try dividing its pre-allocated memory pool into different areas for the small, medium, and large chunks. This approach minimizes the negative effects of fragmentation and accelerates the allocation process.

Doing memory management yourself is a practice widely adopted in parallel programming, especially in server software and in desktop software that requires large amounts of memory (such as imaging software). Fortunately, open-source memory managers exist and can remove some of the difficulty of writing and testing this code. One package, called Hoard, has received kudos. Its implementation, which is specifically aimed at multithreaded programs, redefines malloc() to point to its own implementation. This arrangement means that virtually no code changes to existing programs are necessary to solve the problem of thread contention for the heap.

You can also implement your own solution. If you choose to write your own memory allocation scheme, you should read the paper Reconsidering Custom Memory Allocation [PDF]. It provides good advice on the do's and don'ts of such a project. Several commercial products exist as well, such as SmartHeap from MicroQuill.

Thread-local storage (TLS) is a related idea that enables a thread to allocate storage for its exclusive use, without fear of access by other threads. Windows has a set of functions that enable a thread to create its own personal heap, which does not have to be locked every time memory is allocated or released. (See the HeapCreate() family of functions.) In Pthreads, similar functions are found in Pthreads_Specific.

False Sharing

While the problem of threads contending for heap access is universal in parallel programming, false sharing is a system-dependent issue. It arises from the way memory is loaded into cache. When the processor loads data from RAM into cache, it does so in units called cache lines. On the Northwood family of processors (the Pentium 4 and the Xeon processors through 2008), the size of the cache line — the minimum amount of data read from memory — is 128 bytes. On later processors, cache lines are 64 bytes, but read two at a time. This minimum unit means that if any byte in the cache line is accessed, the whole cache line is loaded into cache.

When data is loaded into cache, a special relationship is set up between all processors that have this block in cache. This relationship aims at a condition known as cache coherency, which refers to the need that all copies of an item in cache must contain the exact same data. So, whenever a cache line is updated, all caches that contain the item are signaled to refresh the copy they have in cache.

The problem of false sharing arises when two threads are accessing non-shared data fields within the same cache line. That is two data items used by separate threads are within the same cache line. When this occurs, every time one thread modifies its data in the cache line, the other thread must update its copy — even if it has no interest in the data item that was modified. This problem is known a false sharing, because the caches are updating on the belief that they share the data, when in fact they don't. Since cache coherency is fundamental to system integrity, when the signal arrives to update the cache, the processor must respond immediately. If the two threads access their respective fields frequently, the system bus can become swamped with update commands, and true processing will grind to a crawl.

The obvious solution is to make sure no two fields used by threads are within the same cache line. However, this solution is insufficient, because two threads might inadvertently access data items within the same cache line. For example, suppose a developer decides to perform data decomposition on an array by using two threads to update it. One thread updates every odd element, the other every even element. These two threads will be constantly updating elements within the same cache line and subject to the perils of false sharing. Because it can arise subtly, false sharing should be carefully guarded against.

False sharing can be a devastating problem on multiprocessor systems. It exists in a slightly different form on Hyper-threading processor, even though they share the L2 cache. When one thread modifies a cache line that the other thread is also accessing, it notifies it ahead of time of the pending update to make sure no conflict in the data exists. This dialog while somewhat different in nature than the chatter described earlier constitutes unneeded communication for unshared data — that is, false sharing. And its effect on performance is equally devastating.

The 64KB Aliasing Problem

When Intel processors based on the NetBurst architecture load caches, they keep track of the cache data as offsets from a 64-KB boundary. That is, if two items start 512 bytes in from a 64-KB address, they will collide in the cache and the oldest one will be expelled by the new memory reference. This design works particularly well in a single-threaded model, but predictably can lead to problems where multiple threads are in use. The problem is that when threads are started up, their stacks always start on a 64-KB boundary. Hence, the items allocated on a thread's stack will frequently collide in cache with those of another active thread.

When this collision occurs, the cache management circuitry must reload from memory the data it's looking for. Of course, it loads it into cache overlaying the data from the other thread. When that thread writes to a local variable, the process will repeat itself. The result is needless fetches from memory. And on the NetBurst architecture, memory fetches are particularly expensive.

The way to avoid this is simple. Allocate one large local variable on the stack and don't access it. All other local variables should be located after this unused space. Across threads used different sizes for this initial unused data area. Now local variables will not conflict with each other, or at least conflict significantly less. (In Windows, another option is to use the _alloca() API, which performs a similar function.)

The NetBurst architecture was phased out beginning in 2006, and the migration to the Core architecture was complete by 2008. So the aliasing problem exists only on processors shipped before 2008. However, due to its cost, it is mentioned here for readers working on older Intel systems.

In sum, threading performance is sensitive to a few memory issues. Fortunately, these issues can be fairly easily eliminated if you bear them in mind as you code.

This post is an update of material in Chapter 7 of Programming with Hyper-Threading Technology by Rich Gerber and Andrew Binstock (Intel Press, 2004). Portions copyright Intel Corp. False sharing is discussed in depth by Herb Sutter in this article.

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.