Managing Application Thread Use

Multicore processors are increasingly replacing single-core processors, and developers are being confronted with new challenges when using them.


September 04, 2008
URL:http://www.drdobbs.com/tools/managing-application-thread-use/210300581

Levent is a staff software engineer in Intel's Performance, Analysis, and Threading Lab. He can be contacted at [email protected].


From high-end servers to desktops and laptops, multicore processors are increasingly replacing single-core processors. At the same time, however, application developers are confronted with new challenges when creating software that efficiently uses these multicore processors.

The specific challenge I examine in this article is the situation in which multiple CPU-bound multithreaded applications create "oversubscription" on a server and/or desktop system. The problem occurs when individual applications create as many threads as there are cores available, and when each of these threads performs CPU-intensive tasks. Consequently, oversubscription occurs, negatively impacting overall performance. In other words, without knowing the actual system load or what other applications are running on the system, the multithreaded applications compete with one another for system resources.

Framework Details

The current generation of nonreal-time operating systems provides some type of fair resource (processor/core) allocation or scheduling; thus, overall performance is impacted negatively when multiple multithreaded applications execute in parallel. Of course, there have been proposals, such as static mapping (done before applications are started) and dynamic mapping (calculated at runtime), about how operating systems could provide some level of quality of service (QoS) and how application QoS values could be mapped to processor capacity.

The concept I propose here involves neither solving the oversubscription problem nor providing a QoS service. Instead, my proposal is a lightweight thread usage balancing framework that can be implemented without any changes to operating systems or runtime environments. (However, implementing this functionality in the operating system would make the mechanism available to all applications on demand.) "Thread usage balancing" in this context refers to adjusting the number of threads each application uses based on the utilization of the cores/processors. Applications that are not already using the maximum number of threads allowed (normally the number of physical cores available on the system) only increase their thread count, up to that maximum, during their runtime.

Figure 1: Block diagrams showing the framework (loads are the applications).

As Figure 1 illustrates, there are two fundamental parts to the dynamic load-balancing framework:

  1. The load governor, which monitors processor utilization and reports the number of available cores to requesting applications.
  2. The thread count library, which applications link against to communicate with the governor and to adjust their thread counts.

Also, applications only increase their thread count, and cannot set it higher than the number of physical cores. This makes the initial thread count with which an application starts worth noting.

The Load Governor

The load governor (Figure 2) uses a basic metric, processor idle time, to compute system utilization. Based on the idle time, the governor identifies available cores and returns the number of available cores to requesting applications. For example, if the idle threshold is set to 5, the governor marks as available all cores that are idle 5 percent or more of the time (that is, 95 percent or less utilized) at the time of the snapshot. The processor statistics (system time, idle time, and so on) are collected periodically from /proc/stat (on Linux); this period can be adjusted as a parameter to the governor.
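
To make the governor's bookkeeping concrete, here is a minimal sketch of how per-core idle percentages could be derived from /proc/stat on Linux. The article does not show the governor's source, so the structure and names below (CpuSample, read_proc_stat, count_available_cores) are illustrative assumptions; only the /proc/stat counters and the idle-threshold rule come from the text.

#include <cctype>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <unistd.h>

struct CpuSample { unsigned long long idle; unsigned long long total; };

// Take one snapshot of /proc/stat; element i holds the counters for cpu<i>.
static std::vector<CpuSample> read_proc_stat()
{
    std::vector<CpuSample> cores;
    std::ifstream stat("/proc/stat");
    std::string line;
    while (std::getline(stat, line)) {
        // Per-core lines start with "cpu0", "cpu1", ...; skip the aggregate "cpu" line.
        if (line.compare(0, 3, "cpu") != 0 || line.size() < 4 ||
            !isdigit(static_cast<unsigned char>(line[3])))
            continue;
        std::istringstream in(line);
        std::string label;
        unsigned long long v, idle = 0, total = 0;
        in >> label;
        for (int field = 0; in >> v; ++field) {
            total += v;
            if (field == 3) idle = v;   // the fourth counter is idle time
        }
        cores.push_back({idle, total});
    }
    return cores;
}

// A core counts as available if its idle time over the sampling period is
// at least `threshold` percent (e.g., 5 means 95 percent or less utilized).
static int count_available_cores(int threshold, int period_sec)
{
    std::vector<CpuSample> before = read_proc_stat();
    sleep(period_sec);                  // governor sampling period
    std::vector<CpuSample> after = read_proc_stat();

    int available = 0;
    for (size_t i = 0; i < before.size() && i < after.size(); ++i) {
        unsigned long long d_total = after[i].total - before[i].total;
        unsigned long long d_idle  = after[i].idle  - before[i].idle;
        if (d_total > 0 && 100.0 * d_idle / d_total >= threshold)
            ++available;
    }
    return available;
}

A governor daemon would presumably run this check on its configured period and hand the resulting count to whichever application asks next.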

The goal of the governor's algorithm is not to prevent oversubscription, but rather to allow applications to finish as soon as possible. From the applications' perspective, the algorithm can be described as "slow start, fast finish," with the exception of the first multithreaded application that contacts the governor, because the first application is guaranteed to start with as many threads as there are cores. (The assumption here is that when the first application starts, all cores are idle enough to be marked available.)

Figure 2: Block diagram showing the internals of the load governor.

The Thread Count Library

The second part of this framework is a library that provides API calls for communicating with the governor and for adjusting the number of threads based on the number of available cores returned. The applications increment their thread count by the number of available cores returned, with the ceiling being the number of physically available cores.

The modifications to the original applications are minimal:

  1. Include two header files: one for the thread count and governor communication, and one for the timing calculations. The header file for the timing calculations is not necessary, but it is used during the simulation.
  2. Link with the library.
  3. Modify the code so that the thread count is updated just before entering a parallel region.

Also, the applications were instrumented to run many iterations rather than just one. Before entering a parallel region, each application updates its thread count, as in this sample OpenMP (www.openmp.org) code segment:


init_thread_count();          // initialize thread count via the library
int iter = 0;
while ( iter < ITERATION )
{
   set_thread_count();        // update thread count from the governor
   start_timer();
   // PARALLEL_REGION: original benchmark
   #pragma omp parallel for
   for (...)
   {
      WORK();
   }
   stop_timer();
   iter++;
}
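
The library side of these calls is not shown in the article; the following is a minimal sketch, under stated assumptions, of what init_thread_count() and set_thread_count() might do with OpenMP. The query_governor_available_cores() stub stands in for the unspecified governor communication; the capping rule (only increase, never beyond the physical core count) is taken from the text.

#include <algorithm>
#include <omp.h>

// Hypothetical placeholder for the governor IPC: ask how many cores are
// currently idle enough to be considered available.
static int query_governor_available_cores()
{
    return 0;   // stub; a real library would contact the governor here
}

static int current_threads = 0;

void init_thread_count()
{
    // Start with as many threads as the governor reports available; for the
    // first application this is typically all physical cores.
    current_threads = std::max(1, query_governor_available_cores());
    omp_set_num_threads(current_threads);
}

void set_thread_count()
{
    // Only ever increase the thread count, and never beyond the number of
    // physical cores reported by OpenMP.
    int available   = query_governor_available_cores();
    int max_threads = omp_get_num_procs();
    current_threads = std::min(current_threads + available, max_threads);
    omp_set_num_threads(current_threads);
}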

Methodology

To simulate multiple multithreaded applications running in parallel, I used benchmarks parallelized with OpenMP. I chose OpenMP because it is easy to prototype with and makes it easy to integrate the proposed framework. Some of the benchmarks were taken from the OpenMP source code repository (SCR) project (http://sourceforge.net/projects/ompscr) and others were written from scratch. All the benchmarks are CPU-bound and scale well.

The simulation is performed in two different scenarios:

  1. A uniform scenario, in which multiple instances of the same benchmark run in parallel.
  2. A mix scenario, in which different benchmarks run in parallel.

Again, the first benchmark that is started gets the maximum thread count, which is equal to the number of physically available processors.

The baseline measurement for both the uniform and mix scenarios is taken with no load governor running: every benchmark sets its number of threads to the number of processors available on the system and does not change it during runtime. In other words, if n different benchmarks, or n instances of any one benchmark, are executed simultaneously, there is a total of (n * the number of cores) threads.

Test Results

In this simulation, I used a DP server based on 2.67-GHz quad-core Intel Core 2 processors with a total of 16 GB of RAM, running Fedora Core 6. This platform provided 8 cores for the simulation. The load governor was set to run every 1 second to read the processor statistics from /proc/stat. Each benchmark communicated with the load governor every 10 seconds to update its thread count. The iteration count (ITERATION) was set to 100. The idle time threshold was set to 5, 10, 15, and 20 during the simulation for each test setup. The benchmarks I used in the simulation were: Prime, which finds the prime numbers within a range; Mandel, OpenMP SCR's Mandelbrot implementation, which computes an estimate of the Mandelbrot Set area using Monte Carlo sampling; MD, OpenMP SCR's molecular dynamics simulation; and Matvec, matrix-vector multiplication. All benchmarks were compiled with Intel C++ compiler version 10.0.

Uniform Test Results

This scenario was simulated with two and four instances of each benchmark.

The pseudocode for the uniform simulation is as follows; the number of instances in the simulation was set to 2 and 4, respectively.


for threshold in 5 10 15 20
do
  for load in {prime, mandel, md, matvec}
  do
    start governor -thresh <threshold> -method
    for i in <# of instances>
    do
      start load
      sleep <sometime>
    done
    wait_for_all_loads_to_finish
    stop governor
  done
done


The uniform simulation results (Figure 3) show that regardless of the number of instances of the benchmarks (loads), the aggregate elapsed time can be reduced anywhere from 5-20 percent. These results also show that if the number of instances of a benchmark running in parallel is increased, then aggregate elapsed time can be decreased further. Each instance of any benchmark is capable of fully utilizing all cores on the system; therefore, various threshold values used in the simulation didn't make a significant difference.

Figure 3: Uniform test results combined (4 instances of each benchmark; idle threshold 10).

Mix Run Test Results

In the real world, however, it is more common to have multiple different multithreaded applications running in parallel and competing for the limited number of cores on the system than to have several instances of the same application running in parallel. As multithreaded applications become more widely available, this is increasingly an everyday scenario. The idea behind the mix run is to simulate this case. During the simulation, a single instance of each multithreaded benchmark was executed simultaneously.

The pseudocode for the mix simulation is as follows; the number of instances here was set to 1 and then 2, respectively.


for threshold in 5 10 15 20
do
  start governor -thresh <threshold> -sched
  for load in {prime, mandel, md, matvec}
  do
    for i in <# of instances>
    do
      start load
      sleep <sometime>
    done
  done
  stop governor
done

Again, the first load started gets the maximum thread count, which is equal to the number of physically available processors/cores. This ensures that the first benchmark to start completes as fast as possible while the others run more slowly.

Figure 4: Single instances of each benchmark are executed in parallel.

In Figure 4, the total speed-up (that is, the decrease in elapsed time of all benchmarks running simultaneously) is 1.23x. Only one of the four benchmarks (matvec) was negatively impacted in this simulation run. After analyzing all the simulation results, it became clear that with this framework, the execution time of each benchmark, and the order in which the benchmarks are started, make a difference in the aggregate performance. The best improvement in aggregate elapsed time was achieved when the benchmark that takes the least time to complete was started first, and the second-fastest benchmark second. This can be explained in the following manner:

Let Li be any given benchmark (load), TLi the execution time of benchmark Li, and n the number of benchmarks. If the benchmarks' times to completion satisfy:

TL1 < ... < TL(i-1) < TLi < ... < TLn

then the best results are achieved when L1 is started first, then L2, and so on. However, it is not guaranteed that L2 will pick up the cores freed when L1 completes; any benchmark can pick up the available cores, because each communicates with the load governor independently, on its own timer. In the best mix run, a total improvement of 23 percent in aggregate elapsed time was achieved.

Conversely, given:

TLn > ... > TL(i-1) > TLi > ... > TL1

then starting Ln first gives the worst aggregate elapsed time. The explanation is that Ln takes all the available cores and yet still finishes last, so the other benchmarks start with a low number of threads and complete without ever having a chance to increase their thread count. This is the only condition under which a traditional (oversubscription) approach will outperform the proposed framework.

It was also noted that the more loaded the system or the longer the applications ran (higher iteration count), the better the aggregate performance.

The dynamic nature of the framework can be analyzed with the Intel Thread Profiler. Figure 5 shows the Thread Profiler analysis of one of the benchmarks during the test run. One can easily see that the number of OpenMP worker threads was increased every time more processors became available.


Figure 5: Intel Thread Profiler analysis showing how one benchmark increases its OpenMP worker threads dynamically within the framework.

Conclusion

Many applications are becoming more demanding of processor resources as they take advantage of parallelism. However, these applications are not aware of the other applications running alongside them on the same system. Clearly, this lack of system-wide knowledge hinders not only the applications' performance but also the overall system performance.

This framework showed that with a very basic metric and feedback mechanism, the overall performance of the applications can be significantly improved. While the lightweight framework described here is easy to implement and to take advantage of, it has some aspects that need improvement. In particular, further study is needed on how to eliminate the dependency on the execution order of the applications. This could be addressed by having the load governor keep some state information about applications' thread usage and timing; applications could then decrease the number of threads they use, rather than merely increasing it.
