This article is the first of three about applying parallel patterns to a real program. Along the way, I'll point out some excellent parallel pattern documents on the Web and an unusual video game of mine. In this installment, I discuss background material and the overall parallelization strategy. In Part 2, I cover threading; in Part 3, vectorization and cache optimizations.

The game is Seismic Duck, which I recently rewrote for Windows. It's a complete rewrite of something I wrote for Macs in the mid-1990s. It's a freeware game about reflection seismology, which is imaging underground structures by sending sound waves into the ground and interpreting the echoes. The program teaches the topic by letting the user experiment interactively. It sounds technical, but even children "get it" after playing with it for a while. I'm fond of writing games based on peculiar subjects; for example, Frequon Invaders is a Fourier-transform game. Reflection seismology is a subject I worked on at Shell in the early 1990s.

This article series is about how and why Seismic Duck was parallelized with Intel Threading Building Blocks (TBB) differently than a related demo in the TBB distribution. Seismic Duck runs three independent core computations:

- Seismic wave propagation and rendering.
- Gas/oil/water flow through a reservoir and rendering.
- Seismogram rendering.

I run all three in parallel, using **tbb::parallel_invoke**, and vectorize the code where I can. At a high level, it follows the Three Layer Cake pattern (of which I'm one of the authors). However, that level of parallelism was not enough to get a good animation speed on my machine. The bottleneck was the wave propagation simulation, which is the focus of this article series.

Some background on the numerical simulation of waves is necessary. I used the popular "staggered grid" finite-difference time-domain (FDTD) method. There are five 2D arrays, each representing a scalar field over a 2D grid. Three of the arrays represent variables that step through time. The other two represent rock properties. The arrays are:

- **B[i][j]**. A constant array with vertex-centered values. It's constant in the sense that it changes only when the subsurface geology changes.
- **A[i][j]**. A constant array with vertex-centered values related to rock properties. The FDTD method actually requires edge-centered values, but storing those would double the memory bandwidth for reading them, because each interior grid point has both a horizontal edge and a vertical edge. As shown later, memory bandwidth, not computation, is the bottleneck. So to reduce memory references, each edge-centered value is computed by averaging the two nearest vertex-centered values. Doing so saves memory references because there are twice as many edges as vertices. The vertices actually store 1/2 their physical value, so that each edge value can be computed as the sum of its two endpoint values.
- **U[i][j]**. A variable array with vertex-centered values representing how much the rock is compressed at that point. It is variable in the sense that it changes every time step.
- **Vx[i][j], Vy[i][j]**. Two variable arrays that represent the x and y components of a point's velocity. This should not be confused with the velocity of sound through the rock. These two grids are edge-centered: **Vx[i][j]** represents a physical value between grid points **[i][j]** and **[i][j+1]**, and **Vy[i][j]** represents a physical value between grid points **[i][j]** and **[i+1][j]**.

Figure 1 shows one square of the grid and how the array elements relate to the scalar fields. Notice how **Vx** and **Vy** are edge centered.

Arrays **Vx** and **Vy** are not only staggered halfway in space; they are also staggered halfway in time. At the beginning of a time step, **Vx** and **Vy** are half a time step behind **U**. The simulation advances **Vx** and **Vy** a full step, so that they become half a time step ahead of **U**. Then the simulation advances **U** a full time step. The algorithm is sometimes called "leapfrog" because of how the arrays advance over each other in time.

The advantage of staggering and leap frogging is that it delivers results accurate to 2nd order for the cost of a 1st order approach. The update operations are beautifully simple:

```
forall i, j {
    Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
    Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
}
forall i, j {
    U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
}
```

The TBB demo "seismic" and Seismic Duck use two very different approaches to parallelizing these updates. I wrote the original versions of both. (Anton Malakhov has since made many improvements to my original TBB version.) The reason for different approaches is different purposes:

- The TBB demo is supposed to show how to use a TBB feature (**parallel_for**) and a basic parallel pattern. I kept it as simple as possible.
- Seismic Duck is written for high performance, at the expense of increased complexity. It uses several parallel patterns. It also has more ambitious numerical modeling, notably the use of perfectly matched layers to reduce artificial reflections from the simulation boundaries.

The pattern behind the TBB demo is Odd-Even Communication Group in time and a geometric decomposition pattern in space. The code flip-flops between updating the pair of arrays **Vx** and **Vy** , and updating array **U**. (Note to readers of that code: The variable names are **S**, **T**, and **V** instead of **Vx**, **Vy**, and **U**.) Each **forall** can be performed with a **tbb::parallel_for** loop. From a patterns perspective, this is a geometric decomposition pattern. These patterns make a good introduction to parallel programming, and require minimal changes for parallelization.

The big drawback of the Odd-Even pattern is memory bandwidth. When Seismic Duck is played on a typical 21-inch widescreen monitor, each grid has dimensions 531x1520. (You do not see the entire grid in the game -- there is a hidden 16-pixel-thick border for the perfectly matched layers.) That works out to about 16 MByte for all five grids. High-end servers have caches that large, but I'm targeting current desktop machines, which have caches in the 1-4 MByte range. Fortunately, a few consecutive rows of each array do fit in cache.

Thus using the Odd-Even pattern, each grid point is loaded once from main memory per time step. Here is a breakdown of work per grid point for each sweep:

- First sweep (update **Vx** and **Vy**):
  - 6 memory references (read A, U, Vx, Vy; write Vx, Vy)
  - 6 floating-point additions
  - 2 floating-point multiplications
- Second sweep (update **U**):
  - 4 memory references (read B, Vx, Vy, U; write U)
  - 4 floating-point additions
  - 1 floating-point multiplication

The multiplications are insignificant because the hardware can overlap them with the additions. The key consideration is the one floating-point addition per memory reference. For brevity, from now on I'll call this ratio C (for Compute density). A ratio of C=1 is a serious bottleneck. My Core 2 Duo system can deliver eight floating-point additions per clock (two cores, each using 4-wide SIMD). But the memory bandwidth limits my system to two single-precision memory references every three clocks. (I used the STREAM benchmark to determine this.) So C for the hardware is 8/(2/3) = 12.

To summarize, C=12 for the hardware but C=1 for the Odd-Even code. Thus the Odd-Even version delivers only 1/12th of my machine's theoretical peak performance. Another approach was needed. In Part 2, I show how I raised C to 2.5 and multi-threaded the wave simulation using TBB. Part 3 describes how I raised C to 7.5 (and higher) and vectorized the code.