Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Parallel

Optimizing Software for Multicore Processors


Evaluate Code Performance

Before examining the overall performance data, what should you reasonably expect? Given we migrated Amide from a single-core to a four-core system, the highest "realistic" expectation is a linear four-fold performance improvement. This is a theoretical maximum because the cores share resources such as busses and memory, which can introduce execution delays. There is also overhead associated with maintaining data coherency among the caches in the system. Applying a rule-of-thumb, these factors can lead to a 10-20 percent reduction in system performance. Therefore, on average, we anticipate a four-core system performs in the range of 3.2 to 3.6 times faster than a single-core system.

First, we measured overall application performance, relative to the original code, as measured by the time it took VolPack to render images along the z- and x-axis (Table 1). These two rotations were rendered from the same image dataset along two different axes. The outcome of our migration from a single-core system to a four-core system was a performance speedup of between 3.2 and 3.4 times. These results indicate this parallelization implementation was effective and reasonably well optimized.

  1 Core
(seconds)
2 Cores
(seconds)
4 Cores
(seconds)
Speedup Ratio:
1 Core/4 Cores
z-axis search 1.85 0.96 0.55 3.4
x-axis search 5.48 3.18 1.71 3.2
Table 1: Overall performance results.

Next, we examined the cache hit rate. Table 2 shows the L2 cache hit rate for the images rendered along the z- and x-axis run on one core, two cores, and four cores. Along the z-axis, the cache hit rate is fairly consistent for one, two, and four cores and about 76 percent. Along the x-axis, the L2 cache rate dips down to 34 percent with four cores. This lower L2 cache hit may be further evidence of the slower rendering time for x-axis renders in Table 1.

  1 Core 2 Core 3 Cores
z-axis search 78% 78% 76%
x-axis search 26% 28% 34%
Table 2: L2 cache hit rates.

Third, we looked at CPU utilization. During image rendering, all cores are running at nearly 100-percent utilization for configurations: one-core, two-core, and four-core. All available cores are sharing evenly in the workload. The CPU utilization was measured using the Linux's TOP command.

Our fourth code metric was synchronization overhead, which provides a measure of the noncompute resources required to maintain the threads that otherwise could be used to speed up the application. Figure 4, generated by the Intel Thread Profiler, shows Core 1 loading the image for 45 seconds, then spawning threads to cores 2, 3, and 4. The solid green areas represent individual image renders where the renders start around 46, 48, 56, 59, 62, and 68 seconds.

[Click image to view at full size]

Figure 4: Synchronization overhead: Parallel image rendering.

Synchronization overhead is displayed in red when its duration is greater than 30 microseconds. The only synchronization instance displayed in Figure 4 occurs when Core 1 creates the threads for Cores 2, 3, and 4 around time 46 seconds. We were pleased with this relatively low level of synchronization overhead.

Our last code metric was thread stall overhead, which provides a measure of the core idle time due to the cores waiting for system resources or work assignments. Figure 5 shows that a single thread executes for 66 seconds (CL.1); this is the sum of the time Core 1 loads the image and the idle time between rendering images. For the rest of the time, four threads are executing (CL.4) corresponding to image rendering. This means we have four cores all working away during image rendering and this is good.

[Click image to view at full size]

Figure 5: Measuring thread stall overhead.

There are short periods when only two or three threads are working (CL.2 and CL.3), but these are relatively short. CL.2 and CL.3 periods are better than CL.1 periods (one thread), but not as good as CL.4 (four threads). Thread profilers typically provide more detailed views of thread stalls, and they can map thread stalls to the responsible source code.

Conclusion

Developing parallel code for multicore processors warrants careful consideration of threading approaches and a thorough analysis of the resulting performance. The approach I describe here can be applied to any application transitioning from single-threaded to multithreaded to take advantage of multicore processor performance. This code migration case study showed it is possible to take an application designed to run on a single-core processor and migrate it to a four-core system in a matter of days, while realizing a performance increase of more than three times.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.