The Load Governor
The load governor (Figure 2) uses a basic metric, processor idle time, to compute system utilization. Based on the idle time, the governor identifies available cores and returns their number to the requesting applications. For example, if the idle threshold is set to 5, the governor marks every core that is idle 5 percent or more of its time (that is, 95 percent or less utilized) at the time of the snapshot as available. The processor statistics (system time, idle time, and so on) are collected periodically from /proc/stat under Linux; this period can be adjusted as a parameter to the governor.
The goal of the governor's algorithm is not to prevent oversubscription, but rather to let applications finish as soon as possible. Its effect on the applications can be described as "slow start, fast finish," with the exception of the first multithreaded application that contacts the governor, because the first application is guaranteed to start with as many threads as there are cores. (The assumption here is that when the first application starts, the utilization of every core is below the defined threshold.)
The Thread Count Library
The second part of the framework is the library, which provides API calls to communicate with the governor and to adjust the number of threads based on the number of available cores received in response. The applications increment their thread count by the number of available cores returned, with the ceiling being the number of physically available cores.
The modifications to the original applications are minimal:
- Include two header files: one for the thread count and governor communication, and one for the timing calculations. The timing header is not strictly necessary but is used during the simulation.
- Link with the library.
- Modify the code so that the thread count is updated just before entering a parallel region.
Also, the applications were instrumented to run many iterations rather than just one. Before entering a parallel region, each application updates its thread count, as in this sample OpenMP (www.openmp.org) code segment:
init_thread_count();        // initialize thread count
while (iter < ITERATION) {
    set_thread_count();     // update thread count
    start_timer();
    // PARALLEL_REGION: original benchmark
    #pragma omp parallel for
    for (...) {
        WORK();
    }
    stop_timer();
}
Methodology
To simulate multiple multithreaded applications running in parallel, I used benchmarks parallelized via OpenMP. I chose OpenMP because it is easy to prototype with and makes it easy to implement the proposed framework. Some of the benchmarks were taken from the OpenMP source code repository (SCR) project (http://sourceforge.net/projects/ompscr) and others were written from scratch. All the benchmarks are CPU-bound and scale well.
The simulation is performed in two different scenarios:
- In the uniform run, two or more instances of the same benchmark were executed.
- In the mix run, one or more instances of all benchmarks were run simultaneously.
Again, the first benchmark that is started gets the maximum thread count, which is equal to the number of physically available processors.
The baseline measurement for both the uniform and mix scenarios is taken with no load governor running: each benchmark sets its thread count to the number of processors available on the system and does not change it during runtime. In other words, if n different benchmarks or n instances of any benchmark are executed simultaneously, there is a total of (n * the number of cores) threads.