Improving Scalability of Multithreaded Dynamic Memory Allocation

Multiprocessor/multithreaded environments add a new dimension to the familiar malloc facility. The "MT-hot" implementation Greg presents here lets multiple threads execute in parallel without major delays.


July 01, 2001
URL:http://www.drdobbs.com/cpp/improving-scalability-of-multithreaded-d/184404685


Greg is a member of technical staff at Sun Microsystems. He can be contacted at [email protected].


Dynamic memory allocation (also known as "malloc") continues to be an important issue for the programming community. Malloc is a layer between the application requesting memory from the system at run time, and the operating system (OS) supplying this memory. Malloc contains its own memory-management database and uses it to dole out the memory blocks that the application requests. Malloc is not really a part of the OS even though every modern OS supplies at least one malloc package as its part. Application developers are free to use any of the OS-supplied malloc packages or any other malloc implementation. Applications can also obtain dynamic memory from the system directly and manage it on their own. However, memory-management issues can get complicated, so most applications use the system-supplied malloc instead.

The malloc API is a part of the Standard C Library. Essentially, it consists of malloc(), realloc(), and free(). In C++ and Java, operators new and delete usually end up calling malloc() and free(), respectively. Therefore, the issues I'll discuss here apply equally to C, C++, and Java applications.
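
To make the API concrete, here is a minimal C sketch that exercises all three calls with the error checking each one requires. The helper dup_and_grow is a hypothetical name for illustration, not part of any library.

```c
#include <stdlib.h>
#include <string.h>

/* Copy a string into the heap, then grow the buffer in place.
   Demonstrates the malloc()/realloc()/free() contract: realloc may
   return a new pointer, and on failure the old block is still valid
   and must be freed by the caller. */
char *dup_and_grow(const char *s, size_t extra)
{
    size_t len = strlen(s) + 1;
    char *p, *q;

    p = (char *)malloc(len);            /* initial allocation */
    if (p == NULL)
        return NULL;
    memcpy(p, s, len);

    q = (char *)realloc(p, len + extra); /* resize; contents preserved */
    if (q == NULL) {
        free(p);                         /* realloc failure keeps p alive */
        return NULL;
    }
    return q;                            /* caller must free() the result */
}
```
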

The multiprocessor/multithreaded (MP/MT) environment adds a new dimension: On top of everything else, we have to consider how multiple threads use dynamic memory and interact with each other. MP/MT issues may dominate malloc performance; this is when "MT-hot" implementations will help. (MT-hot refers to algorithms not only working in the MT environment correctly, called "MT-safe," but also allowing multiple threads to execute in parallel without major delays associated with threads waiting for each other.) When I say scalability, I mean a degree of improvement in application performance when the number of CPUs in the system increases while everything else remains the same. The technology described in this article may be covered by one or more patents owned by Sun Microsystems.

The MT-hot malloc implementation I'll describe here has been available for SPARC/Solaris since 1996. Although the described implementation is specific to SPARC/Solaris, the ideas and methods are applicable to any MP/MT environment.

Single-Threaded Malloc Implementation Issues

A simple malloc/free implementation is presented in The C Programming Language, Second Edition, by Brian W. Kernighan and Dennis M. Ritchie (Prentice Hall, 1988). It uses a linked list to store the free blocks. It is a good demonstration of one way to implement malloc. However, using it for large applications is likely to be very slow.

There is no best malloc for all applications. Any general-purpose malloc implementation is a compromise to satisfy the conflicting goals of generality, simplicity, speed, locality of reference, memory fragmentation, and memory consumption. Which malloc is best to use always depends on how a particular application uses dynamic memory. Nevertheless, certain reasonable tradeoffs have been achieved in practice. One of them is malloc(3C), which is the default malloc implementation in Solaris. Solaris malloc(3C) has been successfully used for over a decade without serious problems.

Application developers need to decide which malloc to use for their particular applications. If the application calls malloc with a specific pattern (for example, asking for a constant size block very frequently), any general-purpose malloc implementation is unlikely to be the best solution.

Malloc implementations usually fall into two general categories: general-purpose allocators, which accept any calling pattern at some cost in speed and memory, and special-purpose allocators, which are tuned to a specific pattern such as frequent requests for a constant block size.

You can determine the malloc-calling pattern of an existing application using a library interposer; for more information, see "malloc_hist.so," described in my article "Building Library Interposers for Fun and Profit" (http://www.ddj.com/articles/2001/0165//documents/ddj0165m/). This tool produces a histogram of malloc usage for specific application sessions.

For a general-purpose malloc implementation, the fundamental performance bottleneck is the search through available free blocks to find a good candidate for the current malloc request. Most such implementations do not return freed memory blocks to the system. Instead, they store the free blocks internally and reuse them when the application makes additional memory requests. This is not the only possible design: malloc can be implemented to return free memory to the OS, but that would be hard to implement and is unnecessary in most cases.

Kernighan and Ritchie's implementation uses a linked list to store the free blocks. Each time malloc is called, it has to traverse that linked list to find a block that can be used to satisfy the current malloc request. This traversal can take a very long time in many cases.
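
The traversal cost is easy to see in a sketch. The following first-fit search is in the spirit of the K&R allocator but is not its actual code; struct fblock and first_fit are illustrative names.

```c
#include <stddef.h>

/* A free block on a singly linked free list. */
struct fblock {
    struct fblock *next;  /* next free block, or NULL at end of list */
    size_t size;          /* usable bytes in this block */
};

/* First-fit search: walk the list until a block is large enough.
   In the worst case this visits every free block, which is the
   linear-time bottleneck described in the text. */
struct fblock *first_fit(struct fblock *head, size_t want)
{
    struct fblock *p;
    for (p = head; p != NULL; p = p->next)
        if (p->size >= want)
            return p;     /* found a usable block */
    return NULL;          /* none fits; must request more memory from the OS */
}
```
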

The default Solaris malloc(3C) implementation uses the much more sophisticated algorithm described in "Self-Adjusting Binary Search Trees," by D. Sleator and R. Tarjan (Journal of the ACM, 32:652-686, 1985). It maintains a self-balancing binary search tree to store the free blocks. Despite the overhead to maintain and balance such a tree, it is fast because the search becomes a binary search operation based on the requested block size. For small blocks (8, 16, 24, and 32 bytes), it uses a separate linked list for each size. When adjacent memory blocks become free, malloc(3C) merges them into one. This reduces memory fragmentation.
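
The separate small-size lists can be sketched as a segregated free-list array: each class is a constant-time push/pop, so the tree is never touched for small blocks. This is an illustration of the technique, not Solaris source; all names here are invented, and the real allocator falls back to the tree when a class list is empty.

```c
#include <stddef.h>

#define NSMALL 4                        /* size classes 8, 16, 24, 32 bytes */

static void *small_list[NSMALL];        /* head of each class's free list */

/* Map a request size to a class index, or -1 if not a small block. */
int size_class(size_t n)
{
    if (n == 0 || n > 32)
        return -1;
    return (int)((n + 7) / 8) - 1;      /* 1..8 -> 0, 9..16 -> 1, ... */
}

/* Pop a block from its class list; NULL means fall back to the tree. */
void *small_alloc(size_t n)
{
    int c = size_class(n);
    void *blk;

    if (c < 0 || small_list[c] == NULL)
        return NULL;
    blk = small_list[c];
    small_list[c] = *(void **)blk;      /* first word of a free block links to the next */
    return blk;
}

/* Push a freed small block back onto its class list. */
void small_free(void *blk, size_t n)
{
    int c = size_class(n);
    *(void **)blk = small_list[c];
    small_list[c] = blk;
}
```
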

It obtains memory from the system in large chunks of 8 KB, then subdivides each of them as necessary to satisfy the application's requests. SPARC/Solaris malloc(3C) aligns each block on an 8-byte boundary. It also uses a hidden 8-byte header for each allocation. The header contains the block size. The presence of the header means that if the application often calls malloc requesting small blocks, a large memory overhead will result.

To illustrate the header overhead, consider the simple test program in Listing One, which dynamically allocates a million 8-byte blocks. It uses the SPARC/Solaris default malloc(3C) and the pmap(1) Solaris utility combined with egrep(1) to measure the amount of heap memory consumed. As you can see, this program consumes 16 million bytes of heap memory rather than the 8 million bytes you might expect: each 8-byte request carries an 8-byte header, doubling the cost. Obviously, a special-purpose allocator will do a better job with such a specialized malloc-calling pattern.

Making Malloc Suitable for MT Applications

When the application calls malloc routines simultaneously from multiple threads, something must be done to preserve the integrity of the data. Otherwise, while one thread is modifying the malloc data structures, another thread can use or modify some data there and cause data corruption.

One way to prevent such data corruption is to have each of the malloc API routines locked entirely using a global mutex lock. This way, while one thread is in the middle of a malloc API routine, any other thread cannot enter the code to any malloc API routine until the first thread is finished with its memory-allocating task.

This approach is used in the default Solaris malloc(3C). It works well in terms of correctness. However, performance and scalability suffer when many threads call the malloc routines frequently: the high contention on the single malloc lock can slow the application down severely.
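
The global-lock arrangement can be sketched in a few lines. The wrapper names locked_malloc and locked_free are illustrative, not the Solaris entry points; the point is that every allocator call from every thread serializes on one mutex.

```c
#include <pthread.h>
#include <stdlib.h>

/* One process-wide lock protects the entire allocator database. */
static pthread_mutex_t malloc_lock = PTHREAD_MUTEX_INITIALIZER;

void *locked_malloc(size_t n)
{
    void *p;
    pthread_mutex_lock(&malloc_lock);   /* all threads queue here */
    p = malloc(n);                      /* single-threaded work inside */
    pthread_mutex_unlock(&malloc_lock);
    return p;
}

void locked_free(void *p)
{
    pthread_mutex_lock(&malloc_lock);
    free(p);
    pthread_mutex_unlock(&malloc_lock);
}
```

This is MT-safe but not MT-hot: with N threads allocating heavily, N-1 of them are typically waiting on malloc_lock.
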

MT-Hot Implementation

Over the last few years, many MT-hot malloc packages have emerged in industry and academia; see, for example, ptmalloc, part of the GNU C library, by Wolfram Gloger (http://www.malloc.de/en/index.html) and "Hoard: A Scalable Memory Allocator for Multithreaded Applications," by Emery D. Berger, et al., (http://www.cs.utexas.edu/users/emery/hoard/asplos.pdf/). Also, Solaris 7 introduced mtmalloc(3t). Comparing the MT-hot malloc implementation described here with the other MT-hot malloc packages is beyond the scope of this article.

As far as I know, I wrote the first MT-hot malloc at Sun Microsystems in late 1995 for PTC Pro/ENGINEER (a major Mechanical CAD/CAM system). That application experienced a severe malloc lock-contention problem in its multithreaded code. In June 1996, Sun applied for a patent for this technology (see U.S. patent 6,058,460, "Memory Allocation In a Multithreaded Environment," by Greg Nakhimovsky, at http://www.uspto.gov/). This implementation made assumptions that fit the application at hand; those assumptions have turned out to hold for many other (although certainly not all) applications. They include the following: the number of threads that use malloc heavily is close to the number of available CPUs; each thread mostly allocates, reallocates, and frees its own memory rather than memory allocated by other threads; and the malloc-hungry threads are bound threads with sequential thread IDs.

In 1999, I improved a few details of the original 1996 implementation. The following description corresponds to the latest version. For convenience, I call this implementation "mt_hot_malloc." Current Sun users of mt_hot_malloc also know it as "mtmalloc2."

This implementation is quite simple in principle. I did this on purpose. The idea was to fix the single mutex-lock contention problem, getting the most benefit with the least complexity. More complicated solutions to this problem are certainly possible, but they are likely to be less efficient in many cases, and their debugging and performance analyzing is likely to be more difficult.

The central feature of the mt_hot_malloc implementation is this: instead of a single globally locked memory pool, it maintains several independent memory pools, each protected by its own mutex lock, and it maps each thread to one of the pools based on the thread's ID (see Figure 1). Threads with different IDs are therefore spread across different pools and rarely wait on the same lock.

This method is easy to implement. It also avoids problems associated with using thread-specific data (which can call malloc internally, and is also slow and complicated).

It is up to the application to create a reasonable number of simultaneous threads, each of which uses malloc heavily. Preferably, the number of such threads should be equal to the number of available CPUs.

This method also works well when the application creates threads, then destroys them, then creates new threads, and so on. The thread IDs will continue increasing, but the mapping algorithm will point the new threads to the same malloc pools previously used by the old threads.

To achieve the best scalability, make sure that all malloc-hungry threads have sequential thread IDs. If your application creates threads that are not using malloc much, you may want to make sure their IDs are away from the malloc-using ones. For example, you can create all of the malloc-hungry threads first, then create all threads of the other kind.
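
The mapping just described can be sketched as a modulo on the thread ID. NPOOLS, pool_index, and pool_for_thread are illustrative names under my assumptions, not the mt_hot_malloc source; the modulo mapping is one simple scheme consistent with the recycled-ID behavior described above.

```c
#include <pthread.h>
#include <stddef.h>

#define NPOOLS 8    /* number of independent pools (illustrative) */

struct pool {
    pthread_mutex_t lock;   /* one lock per pool, not one global lock;
                               initialized with pthread_mutex_init at startup */
    /* ...per-pool free-block database would live here... */
};

static struct pool pools[NPOOLS];

/* Threads with sequential IDs land on distinct pools; when old threads
   are destroyed and new ones created, the recycled IDs map new threads
   back onto the pools the old threads used. */
unsigned pool_index(unsigned tid)
{
    return tid % NPOOLS;
}

struct pool *pool_for_thread(unsigned tid)
{
    return &pools[pool_index(tid)];
}
```
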

This malloc implementation provides better performance possibly at the price of some extra memory used, depending on how the application uses memory. This is a typical space versus time tradeoff.

The mt_hot_malloc binaries for SPARC/Solaris are available from DDJ (see "Resource Center," page 5). These mt_hot_malloc binaries were built under Solaris 2.5.1 (32-bit version) and under Solaris 7 (64-bit version), so they will work under those or later Solaris versions.

One way to use this package is to link mt_hot_malloc.o statically with the application. Another is to use the LD_PRELOAD environment variable to make the application use libmt_hot_malloc.so instead of Solaris malloc(3C) located in /usr/lib/libc.so.1. For example (using the C-shell syntax), you can do the following:

% setenv LD_PRELOAD /full_path/libmt_hot_malloc.so

% [run your application here]

% unsetenv LD_PRELOAD

This way, you can use the mt_hot_malloc package without rebuilding the application in any way (assuming your application is not linked with its own malloc package statically). Any user can do it. (This package is not officially supported by Sun Microsystems, but you can contact me if you have any problem with it.)

merge_malloc_pools()

There is, however, a potential problem with the implementation just described. If the application uses multiple threads for only a part of an otherwise single-threaded run, the memory allocated to the additional pools (other than the main thread pool) becomes inaccessible after all the additional threads finish their execution. This may be considered a type of memory leak.

The solution I used in the mt_hot_malloc implementation was to add one more routine to the malloc API: merge_malloc_pools(). When and if the application is done with multiple threads, it can explicitly call this routine to transfer all memory from the additional pools to the main pool. This arrangement assumes that only the main thread may call merge_malloc_pools() and only after the application has freed all the memory in the additional pools.

The advantages of this solution are that it is simple, and application developers have direct control over memory in the additional pools. The disadvantage is that it changes the standard malloc API. An application that calls merge_malloc_pools() becomes nonportable; that is, it cannot use any other malloc implementation directly. (On the other hand, it is simple enough to compile the call to merge_malloc_pools() conditionally using the #ifdef construct, and make it a no-op for other malloc implementations.) In addition, merge_malloc_pools() can be a relatively expensive call.
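
The #ifdef workaround can look like this. MT_HOT_MALLOC is a hypothetical feature macro of my own invention; when it is undefined, the call compiles away to nothing and the code links against any malloc.

```c
/* Compile the nonstandard call only when building against mt_hot_malloc;
   make it a harmless no-op everywhere else. */
#ifdef MT_HOT_MALLOC
extern void merge_malloc_pools(void);
#else
#define merge_malloc_pools() ((void)0)  /* no-op for other mallocs */
#endif

/* Called by the main thread once all additional threads have exited
   and freed their memory, per the usage rule stated in the text. */
void reclaim_thread_pools(void)
{
    merge_malloc_pools();
}
```
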

The PTC MT developers (for whom this implementation was originally intended) consciously decided never to call merge_malloc_pools(). Even though their MT code was small relative to the total size of the application, it was executed frequently enough that they did not consider the memory occupied by the additional pools a waste. After all, that memory is reused every time the application enters an MT portion of the application.

In fact, merge_malloc_pools() has not even been documented for most current mt_hot_malloc users, so its implementation is not fully tested at the moment.

Internally, mt_hot_malloc gets large chunks of memory for the additional pools from the main thread pool. This way, only the main thread needs to obtain memory from the OS.

Therefore, all merge_malloc_pools() needs to do is free the large chunks of the additional pool memory, thus returning them to the original main thread pool. Also, this way the additional threads cannot fail due to lack of swap space while there is still memory in the main pool.

Another solution to this potential memory leak problem might be to call merge_malloc_pools() automatically when all the additional threads terminate or free all their memory blocks.

The advantages would be that there is no change in the malloc API, nor any additional burden on the application developers. The disadvantage is that it would make this implementation much less portable and more complicated. I would have to maintain a list of live threads and modify it as the application creates and destroys threads. Even more importantly, this method would not be flexible enough, which can lead to decreased performance and scalability. What if the application creates threads, destroys them when their tasks are done, and then repeats this process many times? The overhead of calling merge_malloc_pools() each time would be incurred many times, even when malloc could reuse the additional thread pools immediately. This is why I rejected the automatic solution.

Testing Scalability of an MT-Hot Malloc

Performance testing generally requires adherence to certain rules. One such rule is that only one parameter should be varied at a time. Otherwise, the results you get may not provide much useful information. Another rule is that the test environment should be as stable as possible.

Performing a scalability test of an MT-hot malloc package requires some additional considerations. Before you start such testing, you should realize that:

Most modern applications use memory in a highly dynamic way. They call malloc() many times, then realloc() for many of the memory blocks, then free(), then malloc() again. They repeat this many times forming a very complex malloc-calling pattern.

As a result of such usage, the number of free blocks in the malloc database becomes very large after a while. The sizes of those free blocks can also vary substantially. The time a malloc call takes becomes highly dependent on how fast malloc can search through its large database to find an appropriate free block to reuse. This is a situation where an MT-hot malloc implementation can become especially useful since threads can perform those searches in parallel.

The malloc_hist.so tool described in "Building Library Interposers For Fun and Profit" can be used to produce malloc calling histograms for your application. Such histograms produced for many real applications have revealed that most of them call malloc with a bell-curve pattern. In other words, a typical large application requests memory blocks of a certain size (say between 64 and 128 bytes) most frequently. The blocks smaller and larger than that are requested somewhat less often. Finally, the extremely small or large blocks are requested the least number of times.
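
For benchmarking, a bell-curve request pattern is easy to synthesize: summing a few uniform random values approximates a normal distribution (by the central limit theorem). The sketch below centers the sizes near the 64-128 byte range mentioned above; bell_size is an invented helper, not part of malloc_hist.so.

```c
#include <stdlib.h>

/* Return a block size drawn from an approximately bell-shaped
   distribution: six uniforms in 0..32 (mean 16 each) sum to a value
   in 0..192 that clusters around 96 bytes, so mid-sized requests are
   the most frequent and extreme sizes the rarest. */
size_t bell_size(void)
{
    int i, sum = 0;
    for (i = 0; i < 6; i++)
        sum += rand() % 33;
    return (size_t)sum;
}
```
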

What malloc_hist.so does not report, however, are the MT aspects: how many threads frequently call malloc and friends, whether their IDs are sequential, whether those threads are bound, and how many threads free memory blocks that other threads allocated.

For all these reasons, the best method to test performance and scalability of an MT-hot malloc package (or any malloc package) is to run it with your application and measure the results.

Short of that, for benchmarking purposes you can also use a test program that mimics the malloc-calling patterns of your application. Listing Two is one such malloc scalability test program. Its source code is available electronically; see "Resource Center."

In a primitive way, this test emulates the pattern that PTC Pro/ENGINEER uses with multiple threads: the MT code is reentrant, the threads are bound, each thread calls the malloc API routines heavily, and each thread uses only its own memory.

Tables 1, 2, 3, 4, and 5 show the results of this test for five SPARC/Solaris machines. Each test was run three times and the results averaged; you can see the variation by comparing the average time with its components. (In the tables: ST = single-threaded version; MT_sys = multithreaded version using system malloc(3C); MT_hot = multithreaded version using mt_hot_malloc; Speedup_factor = ST_time/MT_time; Perfect_speedup = the speedup factor corresponding to linear speedup; and Scalability = the percentage of linear scalability achieved, computed as Speedup_factor/Perfect_speedup*100%. All times are in seconds.)
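
As arithmetic, the table metrics reduce to two one-line formulas; the sample times in the test below are made up purely for illustration (for example, an 80-second ST run that takes 20 seconds with 8 CPUs has a speedup of 4 and 50 percent of linear scalability).

```c
/* Speedup is the single-threaded time divided by the multithreaded time. */
double speedup(double st_time, double mt_time)
{
    return st_time / mt_time;
}

/* Scalability expresses that speedup as a percentage of perfect linear
   speedup, which equals the number of CPUs. */
double scalability_pct(double sp, int ncpus)
{
    return sp / (double)ncpus * 100.0;
}
```
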

Figure 2 is a graph of speedup factors versus number of CPUs. As you can see, MT-hot malloc makes a big difference for scalability compared to the system malloc. Speedup factors less than 1.0 for the system malloc indicate slowdowns. This means that, for example, with 8 bound threads heavily using malloc routines on an 8-CPU machine, the system malloc is slowing the application down to such an extent that it is slower with 8 CPUs than with one.

Conclusion

Now that you know how to create and use an MT-hot malloc package, you can achieve higher scalability of your multithreaded applications and take better advantage of the MP hardware. You can also better evaluate various approaches to solving MT-hot malloc problems.

Acknowledgment

Thanks to my Sun Microsystems colleagues Tom Gould, Morgan Herrington, and Pramod Rustagi for their advice regarding this article. Pramod Rustagi also helped design the mt_hot_malloc implementation.

DDJ

Listing One

% more test1.c 
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
  char *str;
  int i;
  char buf[128];

  for (i = 0; i < 1000000; i++)
    str = (char *)malloc(8);   /* 8-byte payload plus 8-byte hidden header */
  sprintf(buf, "/usr/proc/bin/pmap -x %ld "
    "| /bin/egrep 'heap|Kbytes'", (long)getpid());
  system(buf);
  return 0;
}
% cc test1.c
% a.out

Address   Kbytes  Resident  Shared  Private  Permissions       Mapped File
00022000  15742    15752       -     15752    read/write/exec    [ heap ]


Listing Two

/* Originally written by David Palmer, PTC.
Modified by Greg Nakhimovsky, Sun Microsystems.
This program helps measure scalability of an MT-hot malloc.
The single-threaded (ST) version calls function pound() ncpus times in
sequence.
The multi-threaded (MT) versions call pound() from the main thread,
plus create (ncpus-1) additional bound threads which run the same
routine pound() in parallel. In the MT versions, function pound()
allocates all of its dynamic memory separately in each thread.
Therefore, the memory consumption of this program is approximately
ncpus times greater than that of the single-threaded version.
Build the ST version:
 cc -xtarget=ultra -xO4 -DST -o single_threaded mt_hot_malloc_test.c
Build the multithreaded version using Solaris malloc(3C):
 cc -xtarget=ultra -xO4 -DMT -mt -o mt_sys mt_hot_malloc_test.c \
   -lpthread
Build the MT version using mt_hot malloc:
 cc -xtarget=ultra -xO4 -DMT -mt -o mt_hot mt_hot_malloc_test.c \
   mt_hot_malloc.o -lpthread

Run:
single_threaded <iterations>
mt_sys <iterations>
mt_hot  <iterations>

ncpus is the number of available CPUs (determined dynamically).
Function pound() executes a loop each iteration of which allocates
many blocks of memory, fills them with data, then frees some of that
memory and reallocates it for blocks with different sizes.
All this is repeated "iterations" times.
*/

#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <math.h>
#include <sys/time.h>
#include <unistd.h>
#define MAX_BLOCKS 80
#define SIZE_BLOCK 80
#define BIG_BLOCK 100

static int iterations;
static void *pound();

int main(int argc, char *argv[])
{
  pthread_attr_t attr;
  pthread_t pthr_ids[64];
  int i;
  int num_main;
  int ncpus;
  hrtime_t start, stop;

  if (argc > 1)
    iterations = atoi(argv[1]);
  else
    iterations = 5000;

  ncpus = sysconf(_SC_NPROCESSORS_ONLN);
  printf("Detected %d CPUs\n", ncpus);
  start = gethrtime();

#ifdef MT
  if(pthread_attr_init(&attr))
  {
    printf("Error: pthread_attr_init() failed\n");
    exit(1);
  }
  /* make it a bound thread */
  if(pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM))
  {
    printf("Error: pthread_attr_setscope() failed\n");
    exit(1);
  }
  for( i = 0; i<ncpus-1; i++ )
  {
    if(pthread_create(&pthr_ids[i], &attr, (void *(*)(void *))pound, NULL))
    {
      printf("Error: Thread #%d cannot be created.\n", i);
      exit(1);
    }
    printf("Additional thread %d has been launched.\n", (int)pthr_ids[i]);
  }
  num_main = 1;
#else
  num_main = ncpus;
#endif

  for( i = 0; i<num_main; i++ )
  {
    printf("Calling pound() in the main thread: %d\n", i+1);
    pound();
  }

#ifdef MT
  /* wait until all threads are finished */
  for( i = 0; i<ncpus-1; i++ )
    if(pthread_join(pthr_ids[i], NULL))
    {
      printf("Error: pthread_join() failed\n");
      exit(1);
    }
#endif
  stop = gethrtime();
  printf("Elapsed time: %10.2f sec\n", 1.e-9*(stop - start) );
  return 0;
}
void *pound()
{
  char *blocks[MAX_BLOCKS];
  int i, j, count;
  char lastc = '!' + MAX_BLOCKS - 1;
  for( count=0; count < iterations; count++ )
  {
    for( i = 0; i < MAX_BLOCKS; i++ )	      /* Allocate all blocks */
    {
      blocks[i] = (char *)malloc(i + SIZE_BLOCK);
      for( j = 0; j < i + SIZE_BLOCK; j++ )   /* Fill with data */
	blocks[i][j] = '!' + i;
    }
    for( i = 1; i < MAX_BLOCKS; i += 2)    /* Free odd numbered blocks */
      free(blocks[i]);
    for( i = MAX_BLOCKS-1; i > 0; i -= 2)  /* Allocate odd numbered blocks */
    {
      blocks[i] = (char *)malloc(SIZE_BLOCK - i);
      for( j = 0; j < SIZE_BLOCK - i; j++ )
        blocks[i][j] = lastc - i;
    }
    for( i = MAX_BLOCKS-1; i > 0; i -= 2 ) /* Reallocate odd numbered blocks */
    {
      blocks[i] = (char *)realloc(blocks[i], SIZE_BLOCK + i);
      for( j = 0; j < SIZE_BLOCK + i; j++ )    /* Fill with data */
        blocks[i][j] = '!' + i;
    }
    for( i = 1; i <= BIG_BLOCK; i++ )         /* Reallocate one big block */
    {
      blocks[MAX_BLOCKS-1] = 
                      (char *)realloc(blocks[MAX_BLOCKS-1],SIZE_BLOCK * i);
      for( j = 0; j < SIZE_BLOCK * i; j++ )
        blocks[MAX_BLOCKS-1][j] = lastc;
    }
    for( i = 0; i < MAX_BLOCKS; i++ )         /* Free all blocks */
    {
      free(blocks[i]);
      blocks[i] = NULL;
    }
    for( i = 0; i < MAX_BLOCKS; i++ )	      /* Free NULL pointers (legal no-ops) */
      free(blocks[i]);
  }
  return NULL;
}


Figure 1: Mapping of threads to pools.


Figure 2: Speedup factor versus number of CPUs.


Table 1: 2-CPU Ultra-60 450 MHz, 5000 iterations/CPU.


Table 2: 4-CPU Ultra-80 450 MHz, 10000 iterations/CPU.


Table 3: 8-CPU Ultra-Enterprise 5000 250 MHz, 5000 iterations/CPU.


Table 4: 14-CPU Ultra-Enterprise 4500 400 MHz, 5000 iterations/CPU.


Table 5: 26-CPU Ultra-Enterprise 6500 400 MHz, 5000 iterations/CPU.
