Managing Application Thread Use

By Levent Akyil, September 04, 2008

Multicore processors are increasingly replacing single-core processors, and developers are being confronted with new challenges when using them.

Test Results

In this simulation, I used a dual-processor (DP) server based on 2.67-GHz Intel Core 2 quad-core processors, with a total of 16 GB of RAM, running Fedora Core 6. This platform provided 8 cores for the simulation. The load governor was set to run every second and read the processor statistics from /proc/stat. Each benchmark communicated with the load governor every 10 seconds to update its thread count. The iteration count (ITERATION) was set to 100. The idle time threshold was set to 5, 10, 15, and 20 in turn for each test setup. The benchmarks I used in the simulation were: Prime, which finds the prime numbers within a range; Mandel, OpenMP SCR's Mandelbrot implementation, which estimates the area of the Mandelbrot set using Monte Carlo sampling; MD, OpenMP SCR's molecular dynamics simulation; and Matvec, matrix-vector multiplication. All benchmarks were compiled using the Intel C++ compiler version 10.0.
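As an illustration of the sampling step, the sketch below derives an idle percentage from two readings of the aggregate "cpu" line in /proc/stat. It is a minimal sketch rather than the governor's actual code: the function names and the sample values are hypothetical, and it assumes that only the "idle" field counts as idle time (iowait is treated as busy).

```python
def parse_cpu_line(line):
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Field order: user nice system idle iowait irq softirq steal.
    # Later fields (guest time) are already counted in user, so stop at eight.
    ticks = [int(value) for value in fields[1:9]]
    idle = ticks[3]  # assumption: iowait is not treated as idle
    return idle, sum(ticks)


def idle_percent(previous_line, current_line):
    """Percentage of CPU time spent idle between two samples."""
    idle_before, total_before = parse_cpu_line(previous_line)
    idle_after, total_after = parse_cpu_line(current_line)
    elapsed = total_after - total_before
    if elapsed <= 0:
        return 0.0
    return 100.0 * (idle_after - idle_before) / elapsed


# Two hypothetical samples taken one second apart (the values are made up).
sample_1 = "cpu  1000 0 500 8000 100 0 0"
sample_2 = "cpu  1600 0 600 8100 100 0 0"
idle = idle_percent(sample_1, sample_2)
```

On a live system the two lines would come from reading the first line of /proc/stat at consecutive sampling points; the result can then be compared against the configured idle time threshold.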

Uniform Test Results

This scenario was simulated with two and four instances of each benchmark.

The pseudocode for the uniform simulation is as follows. The number of instances was set to 2 and then 4.


for threshold in 5 10 15 20
do
  for load in {prime, mandel, md, matvec}
  do
    # one governor session per benchmark and threshold
    start governor -thresh <threshold> -method
    for i in # of instances
    do
      start load
      sleep <sometime>
    done
    wait_for_all_loads_to_finish
    stop governor
  done
done

The uniform simulation results (Figure 3) show that regardless of the number of benchmark instances (loads), the aggregate elapsed time can be reduced by anywhere from 5 to 20 percent. These results also show that as the number of instances of a benchmark running in parallel increases, the aggregate elapsed time can be decreased further. Each instance of any benchmark is capable of fully utilizing all cores on the system; therefore, the various threshold values used in the simulation didn't make a significant difference.

Figure 3: Uniform test results combined (4 instances of each benchmark; idle threshold 10).

Mix Run Test Results

In the real world, however, it is more common to have multiple multithreaded applications running in parallel and competing for the limited number of cores available on the system, rather than running the same application in parallel. As multithreaded applications are more widely available, this is increasingly becoming an everyday scenario. Therefore, the idea behind the mix run is to simulate this case. During the simulation, a single instance of each multithreaded benchmark was executed simultaneously.

The pseudocode for the mix simulation is as follows. The number of instances was set to 1 and then 2.


for threshold in 5 10 15 20
do
  start governor -thresh -sched
  for load in {prime, mandel, md, matvec}
  do
    for i in # of instances
    do
      start load
      sleep <sometime>
    done
  done
  stop governor
done

Again, the first instance to start gets the maximum thread count, which is equal to the number of physically available processors/cores. This ensures that the first benchmark to start completes as fast as possible while the others run slower.

Figure 4: Single instances of each benchmark are executed in parallel.

In Figure 4, the total speed-up (that is, the decrease in elapsed time of all benchmarks running simultaneously) is 1.23x. Only one of the four benchmarks (matvec) was negatively impacted in this simulation run. After analyzing all the simulation results, it became clear that with this framework, the execution time of each benchmark and the order in which the benchmarks are started make a difference in the aggregate performance. The best improvement in aggregate elapsed time was achieved when the benchmark that takes the least time to complete was started first and the second-fastest benchmark was started second. This can be explained in the following manner:

Let L_i be any given benchmark (load), T_Li the execution time of benchmark L_i, and n the number of benchmarks. If the benchmarks' times to completion are ordered as:

T_L1 < ... < T_L(i-1) < T_Li < ... < T_Ln

then the best results are achieved when L_1 is started first, then L_2. However, it is not guaranteed that L_2 will pick up the available cores when L_1 completes; any benchmark can pick them up, because each benchmark communicates with the load governor independently and on its own timer. In the best mix run, a 23 percent improvement in aggregate elapsed time was achieved.

But if, again:

T_Ln > ... > T_Li > T_L(i-1) > ... > T_L1

then starting L_n first gives the worst aggregate elapsed time. The explanation for this is that L_n takes all the available cores and yet still finishes last, so the other benchmarks start with a low number of threads and complete without having a chance to increase their thread count. This is the only condition under which a traditional (oversubscription) framework will outperform the proposed framework.
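The effect of start order can be illustrated with a deliberately simplified model. The sketch below assumes that each benchmark runs alone on all cores, one after another in start order, which is stricter than the framework itself (where later benchmarks still run with fewer threads). Under that assumption, the aggregate elapsed time is the sum of the completion times. The run times are hypothetical and are not measurements from the simulation.

```python
from itertools import accumulate


def aggregate_elapsed(run_times):
    """Sum of completion times when jobs run back to back in the given order."""
    return sum(accumulate(run_times))


# Hypothetical full-machine run times in seconds, with T_L1 < T_L2 < T_L3 < T_L4.
run_times = {"L_1": 10.0, "L_2": 20.0, "L_3": 40.0, "L_4": 80.0}

shortest_first = sorted(run_times.values())
longest_first = sorted(run_times.values(), reverse=True)

best = aggregate_elapsed(shortest_first)   # completions at 10, 30, 70, 150
worst = aggregate_elapsed(longest_first)   # completions at 80, 120, 140, 150
```

The last benchmark finishes at the same moment in both orders; what changes is how long the shorter benchmarks wait behind the longer ones, which is why starting L_n first is the worst case in this model.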

It was also noted that the more loaded the system or the longer the applications ran (higher iteration count), the better the aggregate performance.

The dynamic nature of the framework can be analyzed with the Intel Thread Profiler. Figure 5 shows the Thread Profiler analysis of one of the benchmarks during the test run. One can easily see that the number of OpenMP worker threads increased every time more processors became available.

Figure 5: Intel Thread Profiler analysis showing how one benchmark increases its OpenMP worker threads dynamically within the framework.

Conclusion

Many applications are becoming demanding in how they use processor resources to take advantage of parallelism. However, these applications are not aware of the other applications running alongside them on the same system. Clearly, the lack of system-wide knowledge hinders not only the applications' performance but also the overall system performance.

This framework showed that with a very basic metric and feedback mechanism, the overall performance of the applications can be significantly improved. While the described lightweight framework is easy to implement and to take advantage of, it has some aspects that need improvement. In particular, further study is needed on how to eliminate the dependency on the execution order of the applications. This could be addressed by letting the load governor keep some state information about the thread usage of applications and their timing. In this manner, rather than merely increasing their thread count, applications could also decrease the number of threads they use.
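One possible shape for such a stateful governor is sketched below. It is an assumption about a future design, not part of the framework evaluated here: the class name, its methods, and the even-split policy are all hypothetical, and the sketch tracks only thread grants, not timing.

```python
class StatefulGovernor:
    """Hypothetical governor that remembers every application's thread grant."""

    def __init__(self, total_cores):
        self.total_cores = total_cores
        self.grants = {}  # application name -> threads currently granted

    def _rebalance(self):
        # Split the cores evenly; spare cores go to the earliest-registered apps.
        if not self.grants:
            return
        share, extra = divmod(self.total_cores, len(self.grants))
        for index, name in enumerate(self.grants):
            self.grants[name] = max(1, share + (1 if index < extra else 0))

    def register(self, name):
        self.grants[name] = 0
        self._rebalance()
        return self.grants[name]

    def unregister(self, name):
        self.grants.pop(name, None)
        self._rebalance()

    def threads_for(self, name):
        # Applications would poll this on their own timer and resize to match.
        return self.grants[name]


governor = StatefulGovernor(total_cores=8)
governor.register("long_job")     # starts first and is granted all 8 cores
governor.register("short_job")    # long_job's grant shrinks to 4
after_second = dict(governor.grants)
governor.unregister("short_job")  # long_job grows back to 8
after_finish = dict(governor.grants)
```

Because a running application's grant can shrink when another one registers, a long-running application that starts first no longer holds every core until it finishes.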
