Code Characterization, Parallelization
SMOKE was originally developed as a multi-threaded computer game demo implementing most of the major features of modern 3D games. It has been used to demonstrate efficient parallelization techniques applicable to a wide range of CPU-intensive computer games. As always, there is room for improvement, and indeed initial runs of the SMOKE code under Intel Thread Profiler revealed a sub-optimal overall concurrency level. Figure 1 is a screenshot of the actual profile obtained. The profile view (top pane) shows a concurrency level of slightly greater than 2 on 4 cores. The timeline view (bottom pane) shows the time-based execution flow of the parallel code, namely the initialization step, the main computation loop, and the application wrap-up before termination. For SMOKE, this view also shows a region of considerable contention (marked in yellow by the tool). This part of the timeline corresponds to the main computation loop of the SMOKE code.
With the initial profile in hand, the next step is to take a closer look at the timeline view. This is conveniently done using the zoom feature of Intel Thread Profiler. You can select and magnify any region of the timeline view and, in the process, not only get a better view of the behavior of the threads but also automatically get the average concurrency level for the selected region. A zoomed view of just a few iterations of the main computation loop (Figure 2) made two things obvious:
- The concurrency level did not change from iteration to iteration, implying that a very small subset of iterations could be chosen for detailed analysis.
- Each iteration ends with a serializing critical section, in which changes to the game scene are distributed among worker threads before they start working on the next frame.
Ultimately, any analysis with Intel Thread Profiler has only one main goal -- to uncover issues with an application's multi-threading design or implementation and associate those issues with particular lines in source code. As such, the last step of our investigation is to isolate the source code corresponding to the serialization mentioned above. Intel Thread Profiler lets you do this via a menu item (obtained by right-clicking anywhere near, but not on, the yellow line of a transition) called "Transition Source View". In this case, the view shows the transition from four worker threads to one main thread as the main thread enters the critical section. The source view of the tool also displays the function call stack, and walking up this stack for the main thread, we arrive at two function calls at the top level of the main computation loop. These function calls distribute the changes made to various game objects in the iteration just completed. This is as far as Intel Thread Profiler can take the analysis, and now the scrupulous work of reading the code begins.
Further investigation of the source showed that the DistributeChanges() functions responsible for updating game world objects and scenes cannot be parallelized as written. Both functions contain data dependencies and use shared data structures. Their underlying data structures, and the way threads work with them, had to be modified before this functionality could be parallelized. The changes to the relevant data structures were approached in the following manner:
- SMOKE uses an associative container (a map) to collect the change notifications made to game world objects. Find-and-erase operations on a thread-safe map introduce noticeable overhead. This overhead can be avoided if the map is replaced by a vector in which each thread owns a distinct range of elements, holding only the notifications that thread will process.
- The number of change notifications is not known in advance, so in order to divide the vector up between the threads, each thread first generates its notifications in Thread Local Storage (TLS), counts them, grows the shared vector atomically to the necessary length, and then copies its notifications from TLS into its own range of the common container concurrently with the other threads. Beyond the atomic update of the vector length, no synchronization is required.
- After the above changes were implemented, it turned out that the functions processing the different types of change notifications were not thread-safe either. Guarding access to certain common resources fixed this problem.
The changes outlined above were introduced into the code, and Intel Thread Profiler was used to assess the effect. Figure 3 depicts a zoomed-in view of a previously single-threaded section of the application run time. A majority of the run time is now spent at concurrency levels of three or four.
The yellow bar in the Profile View represents time spent on synchronization between worker threads (i.e., overhead). However, it should be noted that even with this overhead, the changes deliver an improvement to overall code performance. As with any synchronization point, the mechanism chosen to effect the synchronization is critical. The most commonly used mechanisms are atomic operations, user-level and kernel-level mutexes, fair mutexes, and reader-writer mutexes. Each has its own advantages and disadvantages, and some experimentation is required when choosing a mechanism for a particular application and workload. Here Intel Thread Profiler can assist in the following way: the timeline view of the region of interest shows that for this application and workload, worker threads "ping-pong" on the mutex often and for very short periods of time. This indicates that user-level spin mutexes could be a good fit. Naturally, to be sure of the choice, one has to benchmark the application across the candidate mutex types and the most commonly processed workloads on each of the target platforms; this accounts for hardware and software variation in the many environments the code may run in. In the case of SMOKE, such benchmarks confirmed the advantage of user-level synchronization, for which either an Intel TBB spin_mutex or a critical section with a spin count would be a good candidate.
Having addressed the serialization-between-iterations issue, we now turn to something fundamental in parallel processing -- load balance. For the case at hand, the main loop is composed of a few different types of tasks, each of which takes a different time to execute and offers a different level of potential parallelism. For example, the Rendering task usually takes a long time and, barring any low-level parallelism, runs as a single instance per frame. AI tasks, on the other hand, usually take relatively little time to complete, with several instances possible per frame. AI tasks represent every "thinking" object in the game, and usually all of them can be executed in parallel across all available worker threads. Temporally, Physics tasks fall between AI and Rendering tasks: they are of "average" size and can appear in large numbers, but the collision computation associated with them usually implies communication, which limits the achievable concurrency level.
To understand the implications of all this in practice, we once again rely on the timeline view of Intel Thread Profiler. We utilize a standard feature that allows you to mark certain activities for each worker thread. This allows a detailed analysis of task sizes, their execution order, and therefore worker-thread load balance. Figure 4 illustrates the situation for SMOKE.
To address this situation, we use the task scheduler available in the Intel TBB library. The scheduler maps available tasks onto a thread pool in a manner designed to maximize concurrency. As mentioned in the introduction, the pool of threads is created and managed by the library transparently to the user, so the user focuses only on expressing the code in terms of tasks. Tasks are distributed to thread-local queues, and when a thread runs out of tasks in its own queue it can steal from another thread's queue. It is this work-stealing feature that enhances concurrency [2-4].
This is only part of the story, because the order in which tasks are spawned (and thus submitted to the scheduler) can also be expected to affect execution times. Indeed, if AI tasks are spawned first, Intel TBB will balance the load via task stealing, and the threads of the pool will finish this group of tasks at almost the same time. Spawning the Rendering task last, however, means that when all worker threads finally finish the AI tasks, only one thread can pick up the serial Rendering task; the rest of the pool must wait for it to complete.
With these considerations in mind, the Intel TBB task scheduler was used to improve the load balance on the CPUs. For the present work, explicit task-to-thread affinity was used to effect task prioritization (this could also have been done with Intel TBB's affinity partitioner). In this approach, an affinity ID (a relative Intel TBB thread ID) is set on a task, in effect making it a prioritized task: the thread with that ID picks up the task as soon as it runs out of local work (tasks in its local queue). Thus the high-priority tasks, which in our case are Rendering and Physics, are assigned affinity IDs. As soon as an iteration starts, one worker thread picks up the Rendering task and another picks up the main Physics task, while the AI, Geometry, and Particles tasks are divided among the remaining worker threads by normal stealing. The result is a more finely balanced load and reduced wait times.