
Managing Application Thread Use


Test Results

In this simulation, I used a dual-processor (DP) server based on 2.67-GHz Intel Core 2 quad-core processors, with 16 GB of RAM, running Fedora Core 6. This platform provided 8 cores for the simulation. The load governor was set to run every second and read the processor statistics from /proc/stat. Each benchmark communicated with the load governor every 10 seconds to update its thread count. The iteration count (ITERATION) was set to 100. The idle time threshold was set to 5, 10, 15, and 20 in turn for each test setup. The benchmarks I used in the simulation were: Prime, which finds the prime numbers within a range; Mandel, OpenMP SCR's Mandelbrot implementation, which estimates the area of the Mandelbrot set using Monte Carlo sampling; MD, OpenMP SCR's molecular dynamics simulation; and Matvec, matrix-vector multiplication. All benchmarks were compiled with the Intel C++ compiler version 10.0.
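The idle-time reading that the load governor takes from /proc/stat can be sketched as follows. This is a minimal illustration, not the governor used in the tests: the function names are hypothetical, and it assumes the standard Linux layout of the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal, ...).

```python
def parse_cpu_line(line):
    """Split the aggregate 'cpu' line of /proc/stat into integer tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def read_cpu_sample(path="/proc/stat"):
    """Read one sample; the aggregate 'cpu' line is the first line of the file."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


def idle_percent(prev, curr):
    """Percentage of CPU ticks spent idle between two samples."""
    # Use at most the first 8 counters (user..steal); the guest counters
    # that newer kernels append are already included in user and nice.
    deltas = [c - p for p, c in zip(prev[:8], curr[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[3] / total  # the 4th counter is idle time
```

A governor loop along these lines would call `read_cpu_sample()` once per second and compare `idle_percent(previous, current)` against the configured idle threshold.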

Uniform Test Results

This scenario was simulated with two and then four instances of each benchmark.

The pseudocode for the uniform simulation is as follows, where # of instances was set to 2 and then 4:


for threshold in 5 10 15 20
do
  for load in {prime, mandel, md, matvec}
  do
    start governor -thresh  -method
    for i in # of instances
    do
      start load
      sleep <sometime>
    done
    wait_for_all_loads_to_finish
    stop governor
  done
done


The uniform simulation results (Figure 3) show that, regardless of the number of instances of each benchmark (load), the aggregate elapsed time can be reduced by 5 to 20 percent. The results also show that increasing the number of instances of a benchmark running in parallel decreases the aggregate elapsed time further. Each instance of any benchmark is capable of fully utilizing all cores on the system; therefore, the different threshold values used in the simulation did not make a significant difference.

Figure 3: Uniform test results combined (4 instances of each benchmark; idle threshold 10).

Mix Run Test Results

In the real world, however, it is more common to have multiple multithreaded applications running in parallel and competing for the limited number of cores available on the system, rather than running the same application in parallel. As multithreaded applications are more widely available, this is increasingly becoming an everyday scenario. Therefore, the idea behind the mix run is to simulate this case. During the simulation, a single instance of each multithreaded benchmark was executed simultaneously.

The pseudocode for the mix simulation is as follows, where # of instances was set to 1 and then 2:


for threshold in 5 10 15 20
do
  start governor -thresh -sched
  for load in {prime, mandel, md, matvec}
  do
    for i in # of instances
    do
      start load
      sleep <sometime>
    done
  done
  wait_for_all_loads_to_finish
  stop governor
done

Again, the first benchmark instance to start gets the maximum thread count, which is equal to the number of physically available processors/cores. This ensures that the first benchmark to start completes as fast as possible while the others run slower.
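One way to picture this behavior is a simple grant rule applied each time a benchmark polls the governor. The sketch below is an assumption made for illustration, not the framework's actual policy: the function name, the rounding of idle cores, and the one-thread minimum are all hypothetical. It only encodes what the text states, namely that thread counts grow when enough idle time is observed and are capped at the number of cores.

```python
def next_thread_count(current, idle_pct, threshold, cores):
    """Thread count to use after polling the governor (hypothetical rule).

    current   -- threads the application uses now (0 if it has just started)
    idle_pct  -- system-wide idle percentage reported by the governor
    threshold -- idle threshold, for example 5, 10, 15, or 20
    cores     -- number of physical cores on the system
    """
    if idle_pct < threshold:
        # Not enough idle time: keep what we have, but run at least 1 thread.
        return max(1, current)
    # Assumption: the idle percentage maps linearly to whole idle cores.
    idle_cores = int(cores * idle_pct / 100.0)
    # Thread counts only grow in this sketch and never exceed the core count.
    return max(1, min(cores, current + idle_cores))
```

Under this rule, the first benchmark on an idle 8-core system gets 8 threads, while a benchmark that starts once the system is busy begins with a single thread and grows only after cores become idle.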

Figure 4: Single instances of each benchmark are executed in parallel.

In Figure 4, the total speed-up (that is, decrease in elapsed time of all benchmarks running simultaneously) is 1.23x. Only one out of four benchmarks (matvec) was negatively impacted in this simulation run. After analyzing all the simulation results, it became clear that with this framework, the execution time of each benchmark, and the order in which they are started, made a difference in the aggregate performance. The best aggregate elapsed time improvement was achieved when the benchmark that takes the least amount of time to complete was started first, and the second fastest benchmark as the second. This can be explained in the following manner:

Let $L_i$ be any given benchmark (load), $T_{L_i}$ the execution time of that benchmark, and $n$ the number of benchmarks. If the benchmarks' times to completion are ordered as

$$T_{L_1} < \dots < T_{L_{i-1}} < T_{L_i} < \dots < T_{L_n}$$

then the best results are achieved when $L_1$ is started first, then $L_2$. However, it is not guaranteed that $L_2$ will pick up the available cores when $L_1$ completes; any benchmark can pick them up, because each benchmark communicates with the load governor independently, on its own timer. In the best mix run, a 23 percent improvement in aggregate elapsed time was achieved.

Conversely, with the same ordering,

$$T_{L_n} > \dots > T_{L_i} > T_{L_{i-1}} > \dots > T_{L_1}$$

starting $L_n$ first gives the worst aggregate elapsed time. The explanation is that $L_n$ takes all the available cores and still finishes last, so the other benchmarks start with a low number of threads and complete without having a chance to increase their thread count. This is the only condition under which a traditional (oversubscription) framework outperforms the proposed framework.
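The effect of start order can be illustrated with a toy model. Everything in the sketch below is an assumption made for illustration rather than a description of the tested framework: work is measured in core-seconds and scales perfectly, the first job starts with all cores and every other job with one thread, thread counts never decrease, freed threads go to the earliest-started job still running, and "aggregate elapsed time" is taken to be the sum of the per-job elapsed times.

```python
def simulate(work, cores=8):
    """Return the completion time of each job, started in the given order."""
    remaining = [float(w) for w in work]
    threads = [cores] + [1] * (len(work) - 1)
    finish = [0.0] * len(work)
    active = list(range(len(work)))
    now = 0.0
    while active:
        total = sum(threads[i] for i in active)
        # With more threads than cores, every thread gets a fair share.
        scale = min(1.0, cores / total)
        rate = {i: threads[i] * scale for i in active}
        eta = {i: remaining[i] / rate[i] for i in active}
        dt = min(eta.values())  # time until the next job completes
        now += dt
        done = [i for i in active if eta[i] - dt <= 1e-9]
        for i in active:
            remaining[i] -= rate[i] * dt
        for i in done:
            finish[i] = now
        active = [i for i in active if i not in done]
        if active:
            # Hand the freed threads to the earliest-started running job.
            freed = sum(threads[i] for i in done)
            first = active[0]
            threads[first] = min(cores, threads[first] + freed)
    return finish


def aggregate(work, cores=8):
    """Sum of per-job elapsed times, assuming all jobs start at time zero."""
    return sum(simulate(work, cores))
```

In this model, `aggregate([80, 160, 240, 320])` (shortest job first) comes out at roughly 218, while `aggregate([320, 240, 160, 80])` (longest job first) comes out at roughly 341, even though the last job finishes at the same time in both cases. The model reproduces only the direction of the effect described in the text, not the measured numbers.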

It was also noted that the more loaded the system or the longer the applications ran (higher iteration count), the better the aggregate performance.
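The figures quoted above, a 1.23x speed-up and a 23 percent improvement, can be related with a little arithmetic. The text does not say which convention the percentage follows, so the sketch below only shows the two common readings; the helper names are hypothetical. If speed-up is defined as old time divided by new time, the corresponding reduction in elapsed time is 1 - 1/speed-up, whereas the improvement measured relative to the new time is speed-up - 1.

```python
def reduction_from_speedup(speedup):
    """Percent reduction in elapsed time for a given old/new speed-up."""
    return 100.0 * (1.0 - 1.0 / speedup)


def speedup_from_reduction(percent):
    """Old/new speed-up implied by a percent reduction in elapsed time."""
    return 1.0 / (1.0 - percent / 100.0)


def gain_relative_to_new(speedup):
    """Percent improvement measured relative to the new (shorter) time."""
    return 100.0 * (speedup - 1.0)
```

For example, `reduction_from_speedup(1.23)` is about 18.7, `gain_relative_to_new(1.23)` is about 23, and `speedup_from_reduction(23)` is about 1.30.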

The dynamic nature of the framework can be analyzed with the Intel Thread Profiler. Figure 5 shows the Thread Profiler analysis of one of the benchmarks during the test run. One can easily see that the number of OpenMP worker threads increased every time more processors became available.


Figure 5: Intel Thread Profiler analysis showing how one benchmark increases its OpenMP worker threads dynamically within the framework.

Conclusion

Many applications are becoming more demanding in how they use processor resources to take advantage of parallelism. However, these applications are not aware of the other applications running alongside them on the same system. This lack of system-wide knowledge hinders not only the applications' performance but also the overall system performance.

This framework showed that, with a very basic metric and feedback mechanism, the overall performance of the applications can be significantly improved. While the lightweight framework described here is easy to implement and to take advantage of, some aspects of it need improvement. In particular, further study is needed on how to eliminate the dependency on the execution order of the applications. One way to address this is to let the load governor keep some state information about the thread usage of applications and their timing. In this manner, rather than only increasing their thread count, applications could also decrease the number of threads they use.

