Multicore Programming Patterns for Real-Time Math
From a software architecture perspective, the developer should look to incorporate a parallel pattern that best suites the real-time HPC problem. Before choosing the appropriate pattern, both application characteristics and hardware architecture should be considered. For example, is the application computation or I/O bound? How will cache structure, overall memory hierarchy, and system bus speeds affect the ability to meet real-time requirements? Because of the wide range of scenarios that are dependent on the specific application, the LabVIEW language includes hardware-specific optimizations, timing, and querying capabilities to help the developer utilize the multicore architecture in the most efficient manner possible. (From a historical perspective, LabVIEW originated as a programming tool for test and measurement applications, and therefore it was very important to include timing and synchronization capabilities in the form of native constructs in the language.)
As we will observe, these patterns map well to real-world applications that contain characteristics that are common for real-time HPC applications: (a) the patterns execute code in a loop that may be continuous, and (b) the patterns are communicating with I/O. By I/O, in this sense, we are talking about analog to digital conversion or digital to analog conversion that would be used to control real-world phenomena or control system. (For many control engineers, 1 millisecond (ms) is viewed as the longest acceptable "round trip time," so any software component that induces > 1 ms of jitter would make the system unfeasible.)
Entire books are dedicated to programming patterns, and for completeness we will at a high-level consider three such patterns that can be applied to a real-time HPC application (without delving into the intricacies):
- Pipelining
- Data parallelism
- N-dimensional grid
Pipelining is known as the "assembly line" technique, as in Figure 4. This approach should be considered in streaming applications and anytime a CPU-intensive algorithm must be modified in sequence, where each step takes considerable time.
Like an assembly line, each stage focuses on one unit of work, with the result passed to the next stage until the end of the line is reached.
By applying a pipelining strategy to an application that will be executed on a multicore CPU, you can break the algorithm into steps that have roughly the same unit of work and run each step on a separate core. The algorithm may be repeated on multiple sets of data or in data that is streaming continuously, as in Figure 5.
The key here is to break your algorithm up into steps that take equal time as each iteration is gated by the longest individual step in the overall process. Caveats to this technique arise when data falls out of cache, or when the penalty for inter-core communication exceeds the gain in performance.
A code example in LabVIEW is demonstrated in Figure 6. A loop structure is denoted by a black border with stages S1, S2, S3, and S4 representing the functions in the algorithm that must execute in sequence. Since LabVIEW is a structure dataflow language, the output of each function passes along the wire to the input of the next. A special feedback node, which appears as an arrow with a small dot underneath, is used to denote a separation of the functions into separate pipeline stages. A non-pipelined version of the same code would look very similar, without the feedback nodes. Real-time HPC examples that commonly employ this technique include streaming applications where fast Fourier transforms (FFTs) require manipulation one step at a time.
Data Parallelism is a technique that can be applied to large datasets by splitting up a large array or matrix into subsets, performing the operation, and then combining the result.
First consider the sequential implementation, whereby a single CPU would attempt to crunch the entire dataset by itself, as illustrated in Figure 7.
Instead, consider the example of the same dataset in Figure 8, but now split into four parts. This can be spread across the available cores to achieve a significant speed-up.
Now let's examine how this technique can be applied, practically speaking. In real-time HPC, a very common, efficient, and successful strategy in applications such as control systems is the parallel execution of matrix-vector multiplications of considerable size. The matrix is typically fixed, and it can be decomposed in advance. The vector is provided on a per-loop basis as the result of measurements gathered by sensors. The result of the matrix-vector could be used to control actuators, for example.
A matrix-vector multiplication distributed to 8 cores is shown in Figure 9 (execution is performed from left to right). The matrix is split before it enters the while-loop. Each block is multiplied by the vector and the resulting vectors are combined to form the final result.
The Structured Grid pattern is at the heart of many computations involving physical models, as illustrated in Figure 10. A 2D (or ND) grid is calculated every iteration and each updated grid value is a function of its neighbors. The parallel version would split the grid into sub-grids where each sub-grid is computed independently. Communication between workers is only the width of the neighborhood. Parallel efficiency is a function of the area to perimeter ratio.
For example, in the LabVIEW diagram in Figure 11, one can solve the heat equation, where the boundary conditions are constantly changing. The 16 visible icons represent tasks that can solve the Laplace equation of a certain grid size; these 16 tasks map onto 16 cores (the Laplace equation is a way to solve the heat equation). Once per iteration of the loop, boundary conditions are exchanged between those cores and a global solution is built up. The arrows represent data exchange between elements. Such a diagram can also be mapped onto a 1-, 2-, 4-, or 8-core computer. A very similar strategy could also be used for machines with greater numbers of cores as they become available.
A key element to any design pattern is how to map to the underlying memory hierarchy. The next section reviews important cache considerations that should be followed for optimal CPU performance in real-time HPC applications.


