Andrei Marochko is a Senior Development Engineer in the Performance Analysis and Threading (PAT) group in Intel's Developer Products Division. Bradley Werth is a Senior Software Engineer in Intel's Entertainment Technical Marketing Engineering group. Anton Pegushin is a Senior Technical Consulting Engineer for Intel's Threading and Performance Analysis group specializing in threading and in the use of Intel Threading Building Blocks. Michael D'Mello is a Senior Technical Consulting Engineer in Intel's Developer Products Division.
The main frame-processing loop of a typical computer game is composed of a few functional blocks, typically Artificial Intelligence (AI), Physics, Particles, and Rendering. A popular strategy for parallelizing this loop is to represent the functional blocks as stages of a pipeline, which can then be run in parallel across the available processors, with the normal dependencies between pipeline stages still applying. Naturally, in this approach theoretical scalability is limited by the number and relative weights of the stages in the pipeline. Most implementations of this approach assign one stage to each physical processor. As a result, on machines where the number of processors exceeds the number of pipeline stages, this approach gains no benefit from the additional processor resources. On the plus side, the method can be implemented to maintain data locality with respect to the processor cache as each data item moves through the pipeline.
In a purely serial implementation of the main computational loop, the heterogeneous tasks that make up the body of the loop execute in a fixed order. In the parallel case, you can relax this ordering and attempt a more asynchronous execution of the tasks, as long as the fidelity of the output is unaffected or is maintained to some acceptable degree. Such an approach to parallelizing the frame loop is intriguing in that scalability is now limited only by the availability of tasks ready to execute and cores to process them -- subject, of course, to any constraints that apply between the different tasks. Further, you can decompose each of the high-level tasks into subtasks and exploit any fine-grained parallelism that may be available. If you employ a thread pool to avoid oversubscribing the CPUs, the problem then becomes one of scheduling a pool of tasks onto a pool of threads subject to whatever constraints apply.
In this article, we examine the realizable benefits of the aforementioned task-parallel approach as applied to the SMOKE computer game demo. The task-parallel implementation is done with the Intel Threading Building Blocks (TBB) C++ template library. In a nutshell, the library provides generic parallel algorithms and concurrent containers [5, 6] which enable users to write parallel programs without directly creating and managing threads. Indeed, with this library, users need only focus on representing code in terms of tasks. This step can be done implicitly via library-provided high-level algorithms or explicitly via derivations of a base task class, also provided by the library. All aspects of thread management and the mapping of tasks to threads are handled by the library in a manner transparent to the user. Internally, the library treats tasks as user-level objects that are scheduled for execution by the Intel TBB task scheduler. The task scheduler maintains a pool of native threads and a set of queues (one queue per thread) of tasks ready for execution. At initialization, the Intel TBB task scheduler creates an appropriate number of threads in the pool (by default, one per hardware thread). During code execution the scheduler distributes tasks to threads using a randomized work-stealing algorithm. This decentralized work-stealing mechanism (each thread has its own queue of tasks) is what enables the scheduler to achieve near-optimal load balance and high scalability of parallel programs [3, 4]. All Intel TBB algorithms are tested and tuned for the current generation of multi-core processors, and they are designed to scale as the core count continues to increase.
A detailed review of Intel TBB features specifically as they apply to computer games has already been provided by one of us in the past. In this article, we focus only on how Intel TBB has been implemented in an optimized version of the SMOKE gaming demo. The source code we discuss here is publicly available on the Intel Software Network.