In traditional embedded systems, CPU caches are viewed as a necessary evil. The evil side shows up as nondeterministic execution time: the less of a time-critical task's code and data resides in the cache at the moment the task is triggered, the longer the task takes to run. For demonstration purposes, we will profile cache performance to better understand some important characteristics. The technique uses a structure within LabVIEW called a "timed loop" (Figure 12).
The timed loop acts as a regular while loop, but with some special characteristics that lend themselves to profiling hardware. For example, the structure executes all code within the loop in a single thread. The timed loop can be configured with microsecond granularity, can be assigned a relative priority that is handled by the RTOS, can set processor affinity, and can react to hardware interrupts. Although the programming patterns shown in the previous section do not use the timed loop, it is also quite useful for real-time HPC applications: parallelism is harvested by using multiple timed loops, with queues passing data between them. The following describes benchmarks that were performed to understand cache performance.
Figure 13 shows the execution time of a single timed-loop iteration as a function of the amount of code/data flushed from the cache. The loop runs every 10 milliseconds, and we use an indirect way to flush the loop's code and data: a lower-priority task that runs after each iteration adds 1 to each element of an increasingly large array of doubles, evicting more and more of the time-critical task's data from the CPU cache. In the worst case, the iteration time grows from 4 to 30 microseconds, an increase by a factor of 7.5. Figure 13 also shows that decaching increases jitter. The same graph can be used to demonstrate the "necessary" part of the picture. Even though some embedded CPUs go as far as completely eliminating the cache to increase determinism, it is obvious that such measures also significantly reduce performance. Besides, few people are willing to give up one or two CPU generations' worth of performance, especially as the amounts of L1/L2/L3 cache are continuously increasing, providing enough room for most applications to run while incurring only minor cache penalties.
In real-time HPC, the cache is extremely important because of its ability to keep the CPU's computational resources busy. For example, let's assume that we want to add 1 to all elements of a single-precision array on a 3-GHz processor that can execute one single-precision floating-point operation every clock cycle, a task of only 3 GFLOPs. The memory bandwidth required to keep the FPU busy would have to be at least 24 GB/s (12 GB/s in each direction), more than the latest generation of processors with built-in memory controllers provides; the three-channel i7 CPU tops out at 18 GB/s. However, CPUs can perform more than one FLOP per cycle because they contain multiple FPUs. Using SSE instructions, one can add four single-precision floating-point numbers every cycle, so our array example would require at least 96 GB/s of memory bandwidth to prevent stalls. The red curve in Figure 14 contains benchmark results for the described application. Figure 14 shows GFLOPs as a function of array size on an i7 (3.2 GHz, 8 MB L3) for the operations x[i] = x[i] + 1 (red curve) and x[i] = A*x[i]^2 + B*x[i] + C (black curve), applied to each element of a single-precision floating-point array using SSE instructions. The second graph zooms into the first 512-KB region. Three steps are clearly visible: L1 (32 KB), L2 (256 KB), and L3 (8 MB). The benchmark was executed on the LabVIEW Real-Time platform and thus exhibits a minimum amount of jitter. When all data fits in L1, one CPU can achieve approximately 8.5 GFLOPs, requiring 72 GB/s of memory bandwidth. When running out of L2, the CPU delivers 4.75 GFLOPs, requiring 38 GB/s. Once the data no longer fits into the CPU caches, performance drops to 0.6 GFLOPs, completely bounded by the 4.8 GB/s memory bandwidth.
Zooming into the plot further reveals an additional step at the beginning of the red curve, which may point to another cache level at 8 KB.
The ratio between maximum and minimum performance is a whopping 14x. The situation gets worse on a quad-core CPU, since the application can easily be parallelized: in the best case, the four CPUs can deliver 36 GFLOPs, because the caches are independent, while in the worst case performance stays at 0.6 GFLOPs, because the memory bandwidth is shared among the processors. The resulting maximum/minimum performance ratio jumps to 56x. As verification, we ran another test in which more FLOPs are performed for each array element brought into the FPU. Instead of only adding one to each element, we calculate a second-order polynomial, which requires four FLOPs instead of the original one. Results are shown in Figure 14 (black curve). As expected, the maximum performance goes up to 15 GFLOPs, since memory is accessed less often. For the same reason, the performance difference between data located entirely in L1 versus L2 shrinks. As main memory latency and bandwidth become the gating factor, we again see a large drop-off in GFLOPs performance, though by a smaller factor of "only" 8x.
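The four-FLOP count comes from evaluating the polynomial in Horner form: two multiplies and two adds per element. A minimal sketch of that kernel (coefficient values are arbitrary):

```c
/* Sketch: raising arithmetic intensity. Horner form performs
   2 multiplies + 2 adds = 4 FLOPs per element loaded, versus 1 FLOP
   for the plain x[i] += 1 kernel. */
#include <stddef.h>

void poly_kernel(float *x, size_t n, float A, float B, float C) {
    for (size_t i = 0; i < n; i++) {
        float t = x[i];
        x[i] = (A * t + B) * t + C;   /* A*t^2 + B*t + C, Horner form */
    }
}
```

Because four FLOPs now share the same 8 bytes of memory traffic, the bandwidth needed per GFLOP drops by a factor of four, which is why the black curve in Figure 14 degrades more gracefully than the red one.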
The above simple example demonstrates that multiple cores with their fast caches can deliver a better-than-linear performance increase if a problem that did not originally fit into a single CPU's cache fits into the combined CPU caches after it has been parallelized. However, Figure 14 also implies that any unnecessary data transfer between the CPUs can seriously degrade performance, especially if data has to be moved to or from main memory. Cache misses introduced while parallelizing an application can not only eat up all performance improvements from the increased number of CPUs but can also leave the application running tens of times slower than on a single CPU.
So, what can one do if the real-time HPC application's data does not fit into the cache? The answer depends on the amounts of data used by the time-critical versus non-time-critical tasks. For example, the data used in a control algorithm must be readily available at all times and should be kept in cache at all costs, while the data used for user-interface display or for calculating real-time parameters can be flushed.
If the data used by the time-critical code fits into the cache, the application must overcome a cache strategy that can be summed up as "use it or lose it" by, well, using the data. In other words, accessing data even when it is not required will keep it in the cache. (In some cases, the CPU may offer explicit instructions for locking down parts of the cache, but that is the exception rather than the rule.) For example, if there is control data that is needed only as the plant approaches some critical point, accessing that data in every control step is a must. One could argue that executing, in every iteration, control code that may be required to run only once every 1000 iterations is overkill, since even the worst-case out-of-cache execution may be only 20x slower. Viewed purely from an HPC point of view, the argument is correct: the described approach yields 50x worse CPU utilization and proportionally slower execution. Unfortunately, in real-time HPC this line of argument is false, because a 20x slower execution of the control algorithm can result in serious damage; the real-time requirement states that every control action must be taken before a given deadline, which may be much shorter than the 20x longer worst-case out-of-cache execution time.
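The "use it or lose it" tactic amounts to touching the rarely needed data on every control step. A minimal sketch, with hypothetical names and a placeholder control law:

```c
/* Sketch: keep rarely used control data cache-resident by touching it
   in every control step. 'emergency_gains' stands in for data needed
   perhaps once every 1000 iterations. */
#include <stddef.h>

#define GAIN_N 4096
double emergency_gains[GAIN_N];
volatile double cache_warm_sink;   /* keeps the compiler honest */

/* Read one double per cache line (8 doubles = 64 bytes) so every line
   of the array is marked recently used. */
double keep_warm(const double *p, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i += 8) acc += p[i];
    return acc;
}

void control_step(double *out, const double *in, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = 0.5 * in[i];                      /* placeholder law */
    cache_warm_sink = keep_warm(emergency_gains, GAIN_N);
}
```

The extra pass costs a few microseconds per step, but it buys the guarantee that the emergency path, when finally taken, hits warm cache and meets its deadline.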
Another way to keep data in the CPU cache is to prevent any other thread from executing on a given processor. This is where the ability of an RTOS to reserve certain CPUs for the execution of a single task becomes extremely powerful. One does have to keep in mind that certain caches may be shared between multiple CPUs residing on the same physical chip (for example, the L3 cache in Intel's i7 architecture is shared between up to 8 CPUs), so reserving one core on a processor that churns a lot of data through its other cores will be ineffective.
Finally, what can one do if the real-time data does not fit in the cache? Short of redesigning the underlying algorithm to use less data, or further prioritizing the different real-time tasks and devising a scheduling scheme that keeps the most important data in cache, there is not much that can be done. The penalty can be reduced, however, if the algorithm can access its data in either direction. If the data is always accessed in the same order, then once it no longer fits into the cache, every single element access results in a cache miss. On the other hand, if the algorithm alternates between accessing data from first-to-last and last-to-first element, cache misses will be limited to the amount of data actually not fitting into the cache: the data accessed last in the previous pass is accessed first in the next one and is thus already located in the cache. While this approach will always reduce algorithm execution time in absolute terms, the relative performance benefit decreases as more and more data falls out of the cache.
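The direction-toggling idea can be sketched in a few lines of C (the kernel is illustrative; any pass over a large array works the same way):

```c
/* Sketch: alternate traversal direction on successive passes so the
   tail of the data left in cache by one pass is reused first by the
   next pass, limiting misses to the portion that truly does not fit. */
#include <stddef.h>

void add_one_toggling(double *x, size_t n, int forward) {
    if (forward)
        for (size_t i = 0; i < n; i++) x[i] += 1.0;
    else
        for (size_t i = n; i-- > 0; ) x[i] += 1.0;
}

/* The caller flips direction every iteration of the control cycle. */
void run_iterations(double *x, size_t n, int iters) {
    for (int k = 0; k < iters; k++)
        add_one_toggling(x, n, k % 2 == 0);
}
```

If the array exceeds the cache by, say, 10%, only roughly that 10% misses on each toggled pass, instead of the entire array missing on every same-direction pass.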
Figures 15 and 16 show benchmarks from a real-world application to which we applied all the cache management methods described above. The figures show the time required to multiply a symmetric matrix by a vector. The multiplication is part of a control algorithm that has to calculate 3000 actuator control values from 6000 sensor input values in less than 1 ms. (This industry example comes from the European Southern Observatory and is discussed in the final section of the article.)
Initially, we tried to use standard libraries for matrix-vector multiplication but could not achieve the desired performance. The algorithms were top notch, but they were not optimized for real-time HPC. So, we developed a new vector-matrix multiplication algorithm that takes advantage of the following:
- In control applications, the interaction matrix whose inverse is used to calculate actuator values does not change often. Consequently, it is possible to perform expensive offline preprocessing, repackaging the matrix into a form that exploits the CPU's L1/L2 structure and exercises the SSE vector units to their fullest during the online real-time calculation.
- By dedicating CPUs to the control task only, the control matrix stays in the cache, offering the highest level of performance.
- Splitting the vector-matrix multiplication into parallel tasks increases the amount of cache available for the problem.
- The new algorithm can access matrix data from both ends.
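A minimal serial sketch of these ideas follows. The block size, layout, and names are illustrative only; the production algorithm additionally exploits matrix symmetry, SSE-friendly repacking, and one dedicated core per block range.

```c
/* Sketch: cache-blocked matrix-vector multiply with direction toggling.
   Rows are processed in blocks sized (offline) to fit the cache, and
   the block order alternates between control cycles. */
#include <stddef.h>

#define BLOCK 64   /* rows per block; tuned offline to the cache size */

void matvec_block(const float *A, const float *x, float *y,
                  size_t n, size_t row0, size_t rows) {
    for (size_t i = row0; i < row0 + rows; i++) {
        float s = 0.0f;
        for (size_t j = 0; j < n; j++) s += A[i * n + j] * x[j];
        y[i] = s;
    }
}

void matvec_toggling(const float *A, const float *x, float *y,
                     size_t n, int forward) {
    size_t nblk = (n + BLOCK - 1) / BLOCK;
    for (size_t b = 0; b < nblk; b++) {
        size_t blk  = forward ? b : nblk - 1 - b;
        size_t row0 = blk * BLOCK;
        size_t rows = (row0 + BLOCK <= n) ? BLOCK : n - row0;
        matvec_block(A, x, y, n, row0, rows);
    }
}
```

In a parallel version, each core would own a contiguous range of blocks, so its share of the matrix stays resident in that core's cache across control cycles.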
Figure 15 shows the matrix-vector multiplication time (worst case) as a function of matrix size. Platform: dual quad-core 2.6-GHz Intel Xeon processors with a 12-MB cache each. The results were achieved using only 4 cores, 2 on each processor. Utilizing all 8 cores further reduced the multiplication time for the 3k x 3k matrix to 500 us.
Figure 16 depicts the same benchmark on a Nehalem processor (8-MB L3); the cache boundary lies at a matrix size of approximately 1900. The curve is similar to that for the Intel Xeon processor. The difference due to direction toggling is smaller because the increase in memory bandwidth is much larger than the increase in computational power. The excellent memory performance is visible in the 6k x 6k case: for a 20% CPU clock increase, there is a 160% increase in performance (210% for the non-toggling approach).
The results show that we are able to achieve a 0.7-ms multiplication time for the required 3k x 3k matrix. The 1-millisecond limit is reached at a matrix size of about 3.3k x 3.3k, which is also close to the total L2 cache size (2 processors x 12 MB = 24 MB L2). Increasing the matrix size 4 times (to 6k x 6k) lengthens the execution time 17 times, implying a 4x degradation in GFLOPs. The direction-toggling approach yields up to 50 percent relative performance improvement for data sizes slightly larger than the available L2. The speed-up shrinks in relative terms as the matrix grows larger.