Fluid Animate Particle Simulation
Today we examine the fluidanimate simulation program from the PARSEC benchmark suite available at Princeton University.
In this simulation there is a volume filled particles in an initial state. The dimensions of the volume, number of particles and interactions of the particles with each other and external forces (gravity). The initial state is set such that the external force will cause significant disturbance in the fluid and afterwards the fluid transitions towards a stable state. Two of the simulations will be examined in this blog: simsmall with 36,504 particles and simnative with 501,642 particles. A screenshot of the simulation with 36,504 particles follows:
For fluid simulations, the fluid "particle" motion is dependent on the external forces (e.g. gravity) plus forces generated by force differentials which are a function of local density (proximity with other particles). In a full featured simulation there may be multiple properties, each with different force contributions (external force, viscosity flow forces, mechanical, charge, etc…). This simulation program considers only two: external force and a pressure force.
In a perfect simulation the internal forces are a function of the local density, which in turn is a function of the proximity of a particle with all of the other particles in the system. The computation requirements of each particle interacting with all other particles become excessive as the number of particles grow. Therefore, some fidelity in the simulation is traded off against reduction in computation time by considering interactions between particles within a specified radius of each other. The technique employed is to partition the volume into much smaller cells and then only consider particles within same cell and neighboring cells when computing the separation of the particles. In the 500K particle simulation there are 229,900 cells. Using a 3x3x3 array of neighboring cells the volumetric reduction is 27/229900 ~= 1/32842. Due to the external force, the particles are not evenly distributed through the full simulation domain as can be seen in the above screenshot. The vast majority of particles reside in 1/8 the simulation volume resulting in the higher density 3x3x3 array containing ~ 1/4100 particles of the total. The number of permutations is thus greatly reduced by a factor of about 1000.
Prior to my involvement, the fluidanimate simulation contained implementations using a Serial method, a Pthreads parallel implementation, and a TBB parallel implementation. My involvement was to add a QuickThread implementation.
I must state up front that the design requirements of the QuickThread implementation was to show the reader of the code how to write more efficient code in solving the particle simulation problem as opposed to illustrating a different threading model executing a not nearly as optimal solution. While the strategy used to implement the QuickThread solution can be retrofitted into the Serial, Pthreads and TBB implementations, this author will leave this as an exercise for others.
In this blog I outline the strategy differences. The source code of the implementation can be obtained by first obtaining the PARSEC benchmark suite (sans the QuickThread code revisions) then contacting me for the QuickThread extensions (firstname.lastname@example.org). Visit www.quickthreadprogramming.com for additional information about QuickThread. Eventually the QuickThread extensions will be included in the PARSEC benchmark suite for general distribution.
The following is a comparative chart of running the 35K particle simulation on an Intel Core i7 920 system with 4 cores and HyperThreading (total of 8 hardware threads). The best out of 3 runs charted for each threading model.
For those readers not familiar with HyperThreading processors the short description is this system has 8 integer execution units but only 4 floating point execution units (two threads share each floating-point execution unit). The scaling curve will drop when you exceed the number of floating point execution units. Larger core/processor count systems were not explored as of this article.
In examining the chart you can see in the 8 thread run QuickThread has a 30% advantage over Pthreads-w32 and a 38% advantage over TBB. (where as Pthreads has a 6.4% advantage over TBB). The QuickThread strategy has significant advantage over the other strategies. Let's examine the strategy differences to see how this is this advantage is attained.
There are several distinct differences between the implementations. While I do not have the data to quantify the effect of each implementation difference I can generalize them in the following paragraphs.
One distinction is the QuickThread scheduler uses Affinity Pinned worker threads. And when the thread pool under subscribes the number of hardware threads (4, 2, and 1 thread runs on 8 thread system) the default affinity pinning is to separated hardware threads. Therefore QuickThread can take advantage of better cache utilization.
All of the threading models use a similar strategy of starting a team of threads and then having each thread call the frame advance routines with each thread executing on a different slice of the cells. This particular application has less to do with the efficiencies of thread scheduling other than for the Affinity Pinning effect that the QuickThread implementation might exhibit some additional benefit.
In email discussions with Christian Bienia of Princeton University, he mentioned when fluidanimate.cpp was adapted by Intel for use with TBB (Threading Building Blocks), a design change was made which greatly improved performance. This design change was introduced back into the Pthreads and Serial versions. From my understanding of our emails, the original design had each particle allocated from the heap then collected into cells as a linked list. The Intel revision allocated the cells, each with an array of particles, a variable portion of which represented the subset of active particles of the system contained within the cell. Christian later issued an update where when the capacity of the fixed array of particles in the cell is exceeded, that an additional array is allocated and linked in to the cell list of arrays of particles. This is similar in some respect to how TBB implemented its concurrent container.
The above change was made under the (not always true) assumption that an array of items is always faster than a linked list of items. This statement bears some explanation. While the changes from linked list of particles to cell-by-cell array(s) of particles significantly improved performance (Christian said this), and in the process contributes to the truism of arrays being faster than linked lists. This contribution isn't the complete picture. I can confidently state this because the 38% faster QuickThread implementation is using linked lists of particles.
This presents a paradox in which the substantially faster QuickThread code is using a linked list technique that was demonstrated to be slower than an array technique as used by the TBB enhancements. How so?
The explanation of which is the Intel programmer(s) used a generalization of arrays are faster than linked lists rather than looking at why the particular implementation of linked lists was slower. While this programmer, closely examined the linked list, as an opportunity to improve performance over an array implementation.
When looking at the array implementation, and when particles move from cell to cell, the particle data must be copied from cell to cell as well as an accounting for the change of in-use array elements within the cell. Particles contain 13 floating point variables (either float or double) that require copying as they migrate from cell to cell. Exported particles may require squishing of the internal cell array of particles. This is overhead, meaning an opportunity for optimization.
When looking at the linked list implementation, when particles move from cell to cell, a single linked list has a node (particle) unlinked from one cell list and lined to a different cell list. The unlink, relink operation is significantly faster can copy and squish.
When particles do not migrate the array technique has a definite advantage (or so it would seem), when they do migrate, the linked list might have an advantage (or so it would seem). This said, why is the linked list technique as used by QuickThread superior to the array technique used by TBB and Pthreads?
The answer comes from how the oldest form of linked list technique performed its allocations. As stated earlier, each particle has 13 variables which consume 104 bytes of memory (using doubles). Not having the original linked list code I am making a guess as to what occurred. Due to various platforms supported by the benchmark suite, the current code (pre-QuickThead) attempts alignments to 128 bytes for optimal cache line usage. (64 byte alignment would be sufficient for ia32 and Intel64 platforms.) When allocating memory off the heap, each allocation contains a header which is not part of the data to be used by the application. Rather it is use when the object is returned to the heap. For aligned allocations, there is an extended header which permits use of the header as pad to attain requested alignment for the pointer returned to the application. Due to the nature of how this works out, each aligned to 128 byte object (104 bytes actually used) will consume 256 bytes of RAM.
This memory consumption works against the application cache utilization two ways. First, assuming 64 byte cache line, the 104 byte record is not utilizing 24 of 128 bytes sitting in cache (in two lines). 24/104 additional cache utilization is available. Second, most cache systems use a cache paging system (not necessarily the same size page for virtual memory) whereby the virtual memory address space is partitioned into cache pages and where a limited number of pages can be mapped into the cache system at any one time (e.g. 4-way set associative, 8-way, …). With the aligned allocation, cache pages are utilized at 104/256 or 40% utilized. On large datasets (e.g. 500,000 particles) this requires a large number of cache page management operations. And with 60% of the cache page unused.
The solution to the linked list problem, and the direction towards better cache management methods, is to follow procedures that are contrary to commonly accepted programming principals. First, do not align the data and pad the data. Second, allocate the entire linked list in one allocation and then thread the list. These two steps gain ~ 20/108 cache line utilization (4 extra bytes for pointer on 32-bit system), and perhaps more importantly ~60% better cache page utilization. Add this to the reduction in overhead for particle migration and you see significantly better performance.
An additional technique employed by the QuickThead code, and which improves cache utilization, is derived from an observation of how cell(particles) to cell(particles) calculations are performed. When the interactions within a 3x3x3 (or smaller) array of cells is performed, duplicate calculations are eliminated by calculations to be performed with cells in the 3x3x3 array of neighboring cells who's cell address is less than the current cell. This avoids calculations between cell A and B being made together with calculations between cell B and A. The code used by Serial, TBB and Pthreads-w32 drive the current focus cell in the X, Y, Z directions from 0:nX, 0:nY, 0:nZ which means the current cell of focus is migrating away from cells who's particles data have just been read in to the L1 cache. The QuickThread implementation runs the iterations in reverse nX:0, nY:0, nZ:0 an by doing so, some of the cell data from the prior iteration has a better probability of residing in L1 cache and an almost certainty of residing in L2 cache.
In closing, there are questions remaining: Is this the best technique to use, and if not, what technique is better. The answer to these are related and not so simple to state. In running various prototype techniques it becomes apparent that some techniques are superior in the smaller particle count problems, while other techniques are superior in the larger particle count simulations. The general technique used by all threading models in this report advanced the frame in 5 stages with barriers in between each stage. An experimental version seems to show promise on large simulations (500K and larger) which has eliminated the barriers and which permits phase skewing between the 5 stages amongst the various threads. This technique has higher management overhead and therefore payback occurs only on larger simulation models. This technique may be discussed in a future blog article.