Inspect Existing Code
Our first step was to inspect the "starting" application code to understand its intrinsic parallelism and how efficiently it used hardware resources. The following four tasks helped us better understand the structure of the application before porting it to a pair of Dual-Core Intel Xeon Processors LV 2.0 GHz.
Gather Information About the Code. We gathered information about the runtime performance and control characteristics of the code. The former indicates code segments that benefit most from the parallelization; the latter identifies locations to split the work. This analysis highlights the code's intrinsic parallelism and indicates how well the hardware resources (such as processor cache memory) handle the data structures.
Before beginning code parallelization, start with an optimized program. The largest gains come from parallelizing the code that is executed most often and consumes the greatest amount of execution time. If, through single-thread optimization, the percentage of time spent in different function calls changes, then the decision on which functions to parallelize may change as well. Once the code is optimized for the single-thread case, you can use performance-profiling tools to determine the best candidates for parallelization.
Our analysis of Amide and VolPack resulted in the data-flow diagram in Figure 2. This chart depicts the flow of data through major code blocks and three loops, and represents the software architecture of the application.
We collected Amide performance data as measured by the time required to render an image after users request to view the image from a new angle. Depending on the angle of rotation, Amide decides whether to search the image dataset from the x-, y-, or z-axis. We found the time required to display an image depended on the axis Amide selected to traverse. For example, a search along the z-axis tends to be roughly three times faster than along the x-axis, illustrated for 1000 rotations in Figure 3. We used image-rendering time as our performance metric to measure the effectiveness of our parallelization implementation.
This x- and z-axis render-time variance is explained by the memory stride of the 2D pixel search. For the z-axis, bytes are accessed sequentially with a stride of four bytes, which is relatively short. In other words, Amide reads regular, sequential data blocks that fit nicely into cache memory. This data regularity also lets the CPU more easily predict the data required for future instructions. Based on these predictions, the CPU preemptively moves data from memory into its cache to hide memory latency. This process, called "prefetching," lets the CPU preload its cache in anticipation of future accesses.
In contrast, the cache hit rate for searches along the x-axis was 26 percent; this result is associated with the relative long memory stride (976 bytes). With memory access strides of 1 KB or greater, the data prefetcher may become significantly less effective. Hardware perfecting is typically configured in BIOS and may be disabled manually.
Pinpoint External Dependencies. Next, we looked for external dependencies manifested by a high degree of idle time. When idle time is minimized, the functions consuming the most time tend to be compute-intensive pieces of the code. If code functions are waiting for something (the hard drive, network, user input, or whatever), parallelization will not speed up those external interfaces. Having convinced ourselves that the code was free of external dependencies, ran without interruption, and had no interaction with peripheral devices, we determined it was the best candidate for parallelization.
Identify Parallelization Opportunities. We had to decide on a parallelization approachdividing the data or dividing the control. In our case, the image data is stored in a one-dimensional array. We could split either by data (with each core responsible for accessing pixels in its fourth of the array) or by control (splitting one of the loops by its iterations). These methods are often equivalent because the same loop iteration accesses the same data each time this function is called. However, as we rotate the image, the number of loop iterations changes, as does the data that a particular loop iteration touches.
We decided that splitting the computation by data would lead to great load imbalances because each render does not touch all of the data, and only the innermost loop searches the data to find viewable pixels. From some angles, only one or two cores would be active. So we instead split the computation by the control. Each core was responsible for determining one-fourth of the pixels displayed by the monitor.
To split by control, we had to decide at what level to parallelize this code. In other words, if the code is nested, how deep within the nesting structure should the threads be created? And which loop (see Figure 2) should be split among the four cores? We evaluated three nesting levels corresponding to the three loops (lowest to highest):
Solid_Pixel_Search_Loop, New_Pixel_Loop, Image_Loop
The lowest threading level contains the function with the greatest workload (compAR11B.c, available electronically; see "Resource Center," page 5), which runs Solid_Pixel_Search_Loop. Within this loop, compAR11B.c accesses the main array and checks for pixels containing data. It stops when it finds a pixel with data. This type of loop is not appropriate for parallelization because, for any given execution, it executes for an undetermined number of iterations. This loop could consume one core while the other cores are idle, waiting for a synchronization event to occur.
The next level kicks off the Initialize Image function that runs the New_Pixel_Loop, with a deterministic number of iterations (for example, while it is not a constant number, the number is known at the beginning of the loop execution). This loop calls compAR11B.c during every iteration. This is an excellent candidate for parallelization because we can divide the iterations by the number of threads and distribute the workload evenly among the four cores. Another benefit from parallelizing this FOR LOOP is that, because the threads call compAR11B.c multiple times before coordinating with each other, synchronization overhead is minimized.
At the highest level, the Image_Loop could be parallelized so each thread renders a different image. Amide would then call this function multiple times, once for each image, displaying multiple images at the same time. This has two possible drawbacks:
- There will be idle cores whenever the number of images being processed is less than the number of cores.
- Each image requires substantially different amounts of time to render.
If one thread completes its image much earlier than the other, there is parallelism for only part of the time. Thus, load balancing is less consistent when the threads are split at this higher level.
Locate Data Dependencies. The fourth task was to examine the data structures to determine whether parallelism may create data dependencies, causing incorrect execution or negatively impacting performance. One source of data dependencies occurs when multiple threads write to the same data element. Without additional synchronization code, this can produce incorrect results. Synchronization code and data-locking mechanisms (such as semaphores) can lead to stalled threads.
In addition to ensuring that multiple threads are not writing to the same data element, it's important to minimize any false sharing of memory among the threads. False sharing occurs when two or more cores are updating different bytes of memory that happen to be located on the same cache line. Technically, the threads are not sharing the exact memory location, but because the processor deals with cache-line sizes of memory, the bytes end up getting shared anyway. Because multiple cores cannot cache the same line of memory at the same time, the shared cache line is continually sent back and forth between cores, creating cache misses and potentially huge memory latencies.
Amide calls VolPack to render a 2D display image by processing an image dataset. This requires searching the image dataset, but never changing it, and writing the output as a 2D display image to a graphics driver for display on a monitor. Both structures (the medical image dataset and the 2D display image) are stored as arrays.
Each loop iteration corresponds to one location in the 2D display image, and consecutive iterations write to neighboring locations. Thus, by giving each core a set of consecutive iterations, rather than assigning the iterations round robin, the cores never write to the same locationand they only write to neighboring locations (false sharing) on their first or last iteration.