Optimizing Software for Multicore Processors

With the potential for real performance gains, multicore processors present the challenge of deciding how to validate and optimize code.


May 07, 2007
URL:http://www.drdobbs.com/parallel/optimizing-software-for-multicore-proces/199300100

Edwin is a platform solution architect in Intel's Infrastructure Processor Division. He can be contacted at [email protected].


With the potential for significant performance improvements, multicore processors present software developers with the challenge of determining how best to validate and optimize code. Some of these challenges became apparent to us when we converted Amide, an open-source Linux application, from single- to multithreaded code. In this article, I discuss this project, focusing on threading code, code inspection, decomposition, and evaluation.

Amide (amide.sourceforge.net) is designed for viewing, analyzing, and registering volumetric medical imaging datasets. It supports features for rotating images, selecting layers to view, and fusing multiple images; see Figure 1. VolPack (graphics.stanford.edu/software/VolPack), a portable volume-rendering library written by Philippe Lacroute, is called by Amide to render images.


Figure 1: Image rendering by Amide.

Inspect Existing Code

Our first step was to inspect the "starting" application code to understand its intrinsic parallelism and how efficiently it used hardware resources. The following four tasks helped us better understand the structure of the application before porting it to a pair of Dual-Core Intel Xeon Processors LV 2.0 GHz.

Gather Information About the Code. We gathered information about the runtime performance and control characteristics of the code. The former indicates code segments that benefit most from the parallelization; the latter identifies locations to split the work. This analysis highlights the code's intrinsic parallelism and indicates how well the hardware resources (such as processor cache memory) handle the data structures.

Before beginning code parallelization, start with an optimized program. The largest gains come from parallelizing the code that is executed most often and consumes the greatest amount of execution time. If, through single-thread optimization, the percentage of time spent in different function calls changes, then the decision on which functions to parallelize may change as well. Once the code is optimized for the single-thread case, you can use performance-profiling tools to determine the best candidates for parallelization.

Our analysis of Amide and VolPack resulted in the data-flow diagram in Figure 2. This chart depicts the flow of data through major code blocks and three loops, and represents the software architecture of the application.

Figure 2: Amide/VolPack data-flow diagram.

We collected Amide performance data as measured by the time required to render an image after a user requests to view the image from a new angle. Depending on the angle of rotation, Amide decides whether to search the image dataset from the x-, y-, or z-axis. We found that the time required to display an image depended on which axis Amide selected to traverse. For example, a search along the z-axis tends to be roughly three times faster than along the x-axis, as illustrated for 1000 rotations in Figure 3. We used image-rendering time as our performance metric to measure the effectiveness of our parallelization implementation.


Figure 3: Time to render an image for 1000 rotations.
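
Measuring a metric like this requires nothing more than a wall-clock timer around the render loop. The sketch below is not Amide's actual timing harness; render_image is a hypothetical stand-in for the call being timed.

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the Amide/VolPack render call being timed. */
static void render_image(double angle)
{
        volatile double sink = 0.0;
        int i;
        for (i = 0; i < 100000; i++)
                sink += angle * i;      /* placeholder work */
}

int main(void)
{
        struct timespec start, end;
        double total;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = 0; i < 1000; i++)      /* 1000 rotations, as in Figure 3 */
                render_image(i * 0.36);
        clock_gettime(CLOCK_MONOTONIC, &end);

        total = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("total: %.3f s, average per render: %.4f s\n", total, total / 1000.0);
        return 0;
}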

This x- and z-axis render-time variance is explained by the memory stride of the 2D pixel search. For the z-axis, bytes are accessed sequentially with a stride of four bytes, which is relatively short. In other words, Amide reads regular, sequential data blocks that fit nicely into cache memory. This data regularity also lets the CPU more easily predict the data required for future instructions. Based on these predictions, the CPU preemptively moves data from memory into its cache to hide memory latency. This process, called "prefetching," lets the CPU preload its cache in anticipation of future accesses.

In contrast, the cache hit rate for searches along the x-axis was 26 percent; this result is associated with the relatively long memory stride (976 bytes). With memory access strides of 1 KB or greater, the data prefetcher may become significantly less effective. Hardware prefetching is typically configured in the BIOS and can be disabled manually.
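
The effect of stride on cache behavior is easy to reproduce outside Amide. The following sketch does not use Amide's data structures; it simply sums the same buffer twice, once with unit stride and once with a 976-byte stride like the x-axis case. On typical hardware the strided pass runs several times slower because the prefetcher and cache are far less effective.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (16 * 1024 * 1024)   /* 16M ints, 64 MB: larger than any cache */
#define STRIDE 244                  /* 244 ints x 4 bytes = 976-byte stride   */

/* Visit every element exactly once, in steps of 'stride' elements. */
static double sum_with_stride(const int *a, size_t n, size_t stride)
{
        double sum = 0.0;
        size_t start, i;

        for (start = 0; start < stride; start++)
                for (i = start; i < n; i += stride)
                        sum += a[i];
        return sum;
}

static double seconds(struct timespec t0, struct timespec t1)
{
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
        int *a = malloc(N * sizeof *a);
        struct timespec t0, t1;
        double sum;
        size_t i;

        if (a == NULL)
                return 1;
        for (i = 0; i < N; i++)
                a[i] = (int)(i & 0xff);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        sum = sum_with_stride(a, N, 1);         /* sequential, prefetch-friendly */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("stride 1:   sum=%.0f  %.2f s\n", sum, seconds(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        sum = sum_with_stride(a, N, STRIDE);    /* long stride, poor cache reuse */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("stride %d: sum=%.0f  %.2f s\n", STRIDE, sum, seconds(t0, t1));

        free(a);
        return 0;
}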

Pinpoint External Dependencies. Next, we looked for external dependencies manifested by a high degree of idle time. When idle time is minimized, the functions consuming the most time tend to be compute-intensive pieces of the code. If code functions are waiting for something (the hard drive, network, user input, or whatever), parallelization will not speed up those external interfaces. Having convinced ourselves that the code was free of external dependencies, ran without interruption, and had no interaction with peripheral devices, we determined it was a good candidate for parallelization.

Identify Parallelization Opportunities. We had to decide on a parallelization approach—dividing the data or dividing the control. In our case, the image data is stored in a one-dimensional array. We could split either by data (with each core responsible for accessing pixels in its fourth of the array) or by control (splitting one of the loops by its iterations). These methods are often equivalent because the same loop iteration accesses the same data each time this function is called. However, as we rotate the image, the number of loop iterations changes, as does the data that a particular loop iteration touches.

We decided that splitting the computation by data would lead to large load imbalances because each render does not touch all of the data, and only the innermost loop searches the data to find viewable pixels. From some angles, only one or two cores would be active. So we instead split the computation by control: each core was responsible for determining one-fourth of the pixels displayed on the monitor.

To split by control, we had to decide at what level to parallelize this code. In other words, if the code is nested, how deep within the nesting structure should the threads be created? And which loop (see Figure 2) should be split among the four cores? We evaluated three nesting levels corresponding to the three loops (lowest to highest):

Solid_Pixel_Search_Loop, New_Pixel_Loop, Image_Loop

The lowest threading level contains the function with the greatest workload (compAR11B.c, available electronically; see "Resource Center," page 5), which runs Solid_Pixel_Search_Loop. Within this loop, compAR11B.c accesses the main array and checks for pixels containing data. It stops when it finds a pixel with data. This type of loop is not appropriate for parallelization because, for any given execution, it executes for an undetermined number of iterations. This loop could consume one core while the other cores are idle, waiting for a synchronization event to occur.

The next level kicks off the Initialize Image function that runs the New_Pixel_Loop, which has a deterministic number of iterations (that is, while the count is not a constant, it is known at the start of the loop's execution). This loop calls compAR11B.c during every iteration. It is an excellent candidate for parallelization because we can divide the iterations by the number of threads and distribute the workload evenly among the four cores. Another benefit of parallelizing this for loop is that synchronization overhead is minimized, because the threads call compAR11B.c many times before they need to coordinate with one another.

At the highest level, the Image_Loop could be parallelized so each thread renders a different image. Amide would then call this function multiple times, once for each image, displaying multiple images at the same time. This approach has a notable drawback: if one thread completes its image much earlier than the others, there is parallelism for only part of the time, so load balancing is less consistent when the threads are split at this higher level.

Locate Data Dependencies. The fourth task was to examine the data structures to determine whether parallelism might create data dependencies, causing incorrect execution or degrading performance. One common dependency arises when multiple threads write to the same data element; without additional synchronization code, this can produce incorrect results. Synchronization code and data-locking mechanisms (such as semaphores) can, in turn, lead to stalled threads.
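
When threads genuinely must update shared data, the standard pthreads remedy is a mutex around the update. The sketch below is not from Amide or VolPack; it simply shows the locking pattern, and it hints at why serializing every update this way can stall threads, which is why we preferred giving each thread its own output range.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define ITERS 1000000

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread adds to a shared counter; without the mutex, the
   read-modify-write sequences interleave and updates are lost. */
static void *worker(void *arg)
{
        int i;
        (void)arg;
        for (i = 0; i < ITERS; i++) {
                pthread_mutex_lock(&counter_lock);
                counter++;                      /* protected shared write */
                pthread_mutex_unlock(&counter_lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t threads[NUM_THREADS];
        int i;

        for (i = 0; i < NUM_THREADS; i++)
                pthread_create(&threads[i], NULL, worker, NULL);
        for (i = 0; i < NUM_THREADS; i++)
                pthread_join(threads[i], NULL);

        printf("counter = %ld (expected %d)\n", counter, NUM_THREADS * ITERS);
        return 0;
}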

In addition to ensuring that multiple threads are not writing to the same data element, it's important to minimize any false sharing of memory among the threads. False sharing occurs when two or more cores update different bytes of memory that happen to be located on the same cache line. Technically, the threads are not sharing the exact memory location, but because the processor manages memory in cache-line-sized units, the bytes end up being shared anyway. Because a core must gain exclusive ownership of a cache line before writing to it, the shared line is continually invalidated and passed back and forth between cores, creating cache misses and potentially large memory latencies.
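
A common way to sidestep false sharing is to pad per-thread data out to a full cache line, so that no two cores ever write to the same line. This minimal sketch is not from Amide; the 64-byte line size and the GCC __attribute__ syntax are assumptions that match typical Intel processors and Linux toolchains.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define CACHE_LINE  64          /* assumed cache-line size, in bytes */
#define ITERS       10000000

/* Pad and align each per-thread accumulator to its own cache line so
   that updates from different cores never touch the same line. */
struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE)));         /* GCC-specific alignment */

static struct padded_counter counters[NUM_THREADS];

static void *worker(void *arg)
{
        int id = *(int *)arg;
        int i;

        for (i = 0; i < ITERS; i++)
                counters[id].value++;   /* each thread writes only its own line */
        return NULL;
}

int main(void)
{
        pthread_t threads[NUM_THREADS];
        int ids[NUM_THREADS];
        long total = 0;
        int i;

        for (i = 0; i < NUM_THREADS; i++) {
                ids[i] = i;
                pthread_create(&threads[i], NULL, worker, &ids[i]);
        }
        for (i = 0; i < NUM_THREADS; i++) {
                pthread_join(threads[i], NULL);
                total += counters[i].value;
        }
        printf("total = %ld\n", total);
        return 0;
}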

Amide calls VolPack to render a 2D display image by processing an image dataset. This requires searching the image dataset, but never changing it, and writing the output as a 2D display image to a graphics driver for display on a monitor. Both structures (the medical image dataset and the 2D display image) are stored as arrays.

Each loop iteration corresponds to one location in the 2D display image, and consecutive iterations write to neighboring locations. Thus, by giving each core a set of consecutive iterations, rather than assigning the iterations round robin, the cores never write to the same location, and they write to neighboring locations (the only opportunity for false sharing) only at their first or last iterations.
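
A minimal sketch of that contiguous split (independent of VolPack's actual argument structures) is shown below. A round-robin alternative such as for (i = id; i < count; i += NUM_THREADS) would instead have the four cores writing to adjacent output pixels, and therefore to shared cache lines, on nearly every iteration.

#include <stdio.h>

#define NUM_THREADS 4

/* Divide [0, count) into NUM_THREADS contiguous ranges; the first
   (count % NUM_THREADS) threads each take one extra iteration.
   Each thread then writes output pixels only in [start, stop). */
static void split_range(int count, int id, int *start, int *stop)
{
        int base   = count / NUM_THREADS;
        int extras = count % NUM_THREADS;

        *start = id * base + (id < extras ? id : extras);
        *stop  = *start + base + (id < extras ? 1 : 0);
}

int main(void)
{
        int start, stop, id;

        for (id = 0; id < NUM_THREADS; id++) {
                split_range(10, id, &start, &stop);
                printf("thread %d: iterations [%d, %d)\n", id, start, stop);
        }
        return 0;
}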

Decompose Code

VolPack performs the same operation on each pixel, and this characteristic, combined with the relatively simple data structure, provides for a straightforward parallelization strategy. We used POSIX threads to divide the New_Pixel_Loop into four subloops, each running on one of four cores.

In our implementation, Core 1 first executes serial code to load the image, which precedes the dataflow in Figure 2, while the other three cores are idle. Next, Core 1 uses POSIX threads to spawn three threads for cores 2-4. Our software experts took two days to inspect the code, parallelize the image-rendering code, and test the workload balance.

Create the Threads. Listing One creates one worker thread for each additional core using pthread_create; each thread executes the same function, vp_thread_task_loop, while the main thread continues on the first core. Listing One uses a variable, NUM_THREADS, that corresponds to the number of cores in the system. Because the thread count is not hardcoded, the code can easily be ported to systems with any number of cores.

/* Call site: create the worker threads once, the first time they are needed. */
{
        vp_begin_threads();
        vp_threads_begun = 1;
}

/* Spawn one worker thread per additional core; the main thread keeps the first core. */
void vp_begin_threads()
{
        int i;

        vp_pthreads_data = vp_create_thread_data();
        if (vp_pthreads_data == NULL)
        {
                printf("Unable to allocate memory for threads.\n");
                return;
        }

        for (i = 1; i < NUM_THREADS; i++)
        {
                vp_pthreads_args[i].data = vp_pthreads_data;
                vp_pthreads_args[i].id = i;
                pthread_create(&(vp_threads[i]), NULL, vp_thread_task_loop,
                                &(vp_pthreads_args[i]));
        }
}
Listing One: Creating the worker threads.

vp_thread_big_loop_args loop_args[NUM_THREADS];
int num = (kcount) >> THREAD_SHIFT;     /* base iterations per thread     */
int extras = (kcount) & THREAD_MASK;    /* leftover iterations to spread  */
int cur_num = kstart;

/* Fill in the loop parameters once, then copy them to every thread. */
loop_args[0].vpc = vpc;
loop_args[0].kstart = kstart;
loop_args[0].kinc = kinc;
loop_args[0].icount = icount;
loop_args[0].jcount = jcount;
loop_args[0].kcount = kcount;
loop_args[0].istride = istride;
loop_args[0].jstride = jstride;
loop_args[0].kstride = kstride;
loop_args[0].composite_func = composite_func;

/* Give each thread a contiguous block of iterations; the first 'extras'
   threads each take one extra iteration so the work divides evenly. */
for (i = 0; i < NUM_THREADS; i++)
{
        loop_args[i] = loop_args[0];
        loop_args[i].kmystart = cur_num;
        loop_args[i].id = i;
        cur_num += (num * kincr);
        cur_num += ((i < extras) * kincr);
        loop_args[i].kstop = cur_num;
}

/* Hand each worker its arguments, select LOOP_TASK, and wake it up. */
vp_pthreads_data->completed_threads = 0;
for (i = 1; i < NUM_THREADS; i++)
{
        vp_pthreads_data->inputs[i] = &(loop_args[i]);
        vp_pthreads_data->task_number = LOOP_TASK;
        pthread_cond_signal(vp_pthreads_data->task_cond[i]);
}
Listing Two: Partitioning the New_Pixel_Loop iterations and signaling the worker threads.

Start the Threads. After the threads have been created, Core 1 parallelizes Amide at the New_Pixel_Loop to run on four cores. Listing Two illustrates how this is accomplished. Each core is assigned its share of the work through a global array of argument structures, using the assignment:


loop_args[i].kmystart = cur_num;

One-quarter of the loop iterations is assigned to each core for processing the image volume through its loop_args[i] structure. The four cores are started by first passing each worker a pointer to its arguments with:


vp_pthreads_data->inputs[i] = &(loop_args[i]);


Next, the LOOP_TASK code that each core will execute is selected with:


vp_pthreads_data->task_number = LOOP_TASK;

Finally, each core is released to begin processing using:


pthread_cond_signal(vp_pthreads_data->task_cond[i]);
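
The article does not show vp_thread_task_loop itself, so the following is only a sketch of how such a worker loop is commonly structured: each worker sleeps on its condition variable until the master assigns work, runs the task, and reports completion so the master can wait for all workers. The structure and field names here (pending, done_cond, and so on) are assumptions, not VolPack's actual code.

#include <pthread.h>

#define NUM_THREADS 4
#define LOOP_TASK   1
#define EXIT_TASK   2

/* Simplified stand-in for the shared thread-control structure. */
struct pthreads_data {
        pthread_mutex_t  lock;
        pthread_cond_t   task_cond[NUM_THREADS]; /* work available for thread i */
        pthread_cond_t   done_cond;              /* a worker finished its task  */
        int              pending[NUM_THREADS];   /* work waiting for thread i   */
        int              task_number;            /* LOOP_TASK or EXIT_TASK      */
        void            *inputs[NUM_THREADS];    /* per-thread loop arguments   */
        int              completed_threads;
};

struct thread_args {
        struct pthreads_data *data;
        int                   id;
};

static void run_loop_task(void *input)
{
        (void)input;    /* render this thread's slice of the image */
}

static void *thread_task_loop(void *arg)
{
        struct thread_args   *a = arg;
        struct pthreads_data *d = a->data;
        int task;

        for (;;) {
                /* Sleep until the master assigns this thread some work. */
                pthread_mutex_lock(&d->lock);
                while (!d->pending[a->id])
                        pthread_cond_wait(&d->task_cond[a->id], &d->lock);
                d->pending[a->id] = 0;
                task = d->task_number;
                pthread_mutex_unlock(&d->lock);

                if (task == EXIT_TASK)
                        break;

                run_loop_task(d->inputs[a->id]);

                /* Report completion so the master can wait for all workers. */
                pthread_mutex_lock(&d->lock);
                d->completed_threads++;
                pthread_cond_signal(&d->done_cond);
                pthread_mutex_unlock(&d->lock);
        }
        return NULL;
}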


Evaluate Code Performance

Before examining the overall performance data, what should you reasonably expect? Given that we migrated Amide from a single-core to a four-core system, the highest realistic expectation is a linear four-fold performance improvement. This is a theoretical maximum because the cores share resources such as buses and memory, which can introduce execution delays, and there is additional overhead in keeping the caches in the system coherent. As a rule of thumb, these factors can cost 10 to 20 percent of the ideal performance. Therefore, we anticipated that the four-core system would perform in the range of 3.2 to 3.6 times faster than the single-core system.

First, we measured overall application performance relative to the original code, using the time it took VolPack to render images along the z- and x-axes (Table 1). These two rotations were rendered from the same image dataset along two different axes. The outcome of our migration from a single-core to a four-core system was a speedup of between 3.2 and 3.4 times, indicating that the parallelization was effective and reasonably well optimized.

                  1 Core       2 Cores      4 Cores      Speedup Ratio:
                  (seconds)    (seconds)    (seconds)    1 Core/4 Cores
z-axis search     1.85         0.96         0.55         3.4
x-axis search     5.48         3.18         1.71         3.2
Table 1: Overall performance results.

Next, we examined the cache hit rate. Table 2 shows the L2 cache hit rate for images rendered along the z- and x-axes on one, two, and four cores. Along the z-axis, the hit rate is fairly consistent at 76 to 78 percent. Along the x-axis, the hit rate is far lower, reaching only 34 percent even with four cores. This lower L2 cache hit rate is consistent with the slower x-axis rendering times in Table 1.

                  1 Core    2 Cores   4 Cores
z-axis search     78%       78%       76%
x-axis search     26%       28%       34%
Table 2: L2 cache hit rates.

Third, we looked at CPU utilization, measured with the Linux top command. During image rendering, all available cores run at nearly 100 percent utilization in the one-, two-, and four-core configurations, sharing the workload evenly.

Our fourth code metric was synchronization overhead, which measures the noncompute resources required to maintain the threads, resources that could otherwise be used to speed up the application. Figure 4, generated by the Intel Thread Profiler, shows Core 1 loading the image for 45 seconds, then spawning threads to cores 2, 3, and 4. The solid green areas represent individual image renders, which start at around 46, 48, 56, 59, 62, and 68 seconds.


Figure 4: Synchronization overhead: Parallel image rendering.

Synchronization overhead is displayed in red when its duration is greater than 30 microseconds. The only synchronization instance displayed in Figure 4 occurs when Core 1 creates the threads for Cores 2, 3, and 4 around time 46 seconds. We were pleased with this relatively low level of synchronization overhead.

Our last code metric was thread stall overhead, which measures core idle time spent waiting for system resources or work assignments. Figure 5 shows that a single thread executes for 66 seconds (CL.1); this is the sum of the time Core 1 spends loading the image and the idle time between rendering images. For the rest of the time, four threads are executing (CL.4), corresponding to image rendering. In other words, all four cores are busy during image rendering, which is exactly what we want.


Figure 5: Measuring thread stall overhead.

There are periods when only two or three threads are working (CL.2 and CL.3), but these are relatively short. CL.2 and CL.3 periods are better than CL.1 periods (one thread), but not as good as CL.4 (four threads). Thread profilers typically provide more detailed views of thread stalls and can map them to the responsible source code.

Conclusion

Developing parallel code for multicore processors warrants careful consideration of threading approaches and a thorough analysis of the resulting performance. The approach I describe here can be applied to any application transitioning from single-threaded to multithreaded to take advantage of multicore processor performance. This code migration case study showed it is possible to take an application designed to run on a single-core processor and migrate it to a four-core system in a matter of days, while realizing a performance increase of more than three times.
