Deferred Mode Optimization


Performance Measurements

Code elegance matters, but customers and sales managers are rarely impressed by it. Let's look at some measurements of how DMIP performs compared to IPP.

I measured and compared the performance of the above edge detection algorithm, using IPP and DMIP, on four monochrome images (8 bits per pixel) of the following sizes: 64KB, 256KB, 4MB, and 11MB. The measurements were performed on a Dell Vostro 1500 laptop equipped with an Intel Core2 Duo T7300 CPU, which has two symmetric cores and a 4MB shared L2 cache.

I started the performance analysis by measuring the impact of L2 cache faults on both processes using Intel's VTune performance analyzer. The measurement model assumes that each cache fault costs 80 CPU cycles on average (this follows Intel's recommendations, although some claim it is a very conservative assumption and that the real number is closer to 200). To estimate the performance impact, I counted the number of L2 cache faults, multiplied it by 80 cycles per fault, and divided the result by the total cycles consumed by the process. This metric gives a rough estimate of the impact that L2 cache faults have on the process's performance. The equation is given below, where pi stands for performance impact, cf for the number of L2 cache faults, and cc for the total number of cycles consumed by the process:

$$\mathit{pi} = \frac{\mathit{cf} \times 80}{\mathit{cc}}$$

The results can be seen in Figure 2.

Figure 2: L2 misses performance impact.
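
The impact estimate described above is simple enough to sketch in a few lines. The function name and the counter readings below are hypothetical, chosen only to exercise the formula; they are not the measurements behind Figure 2. The 80-cycle and 200-cycle penalties are the two figures mentioned in the text.

```python
def l2_miss_impact(cache_faults: int, total_cycles: int,
                   cycles_per_fault: int = 80) -> float:
    """Estimate the fraction of a process's cycles lost to L2 cache faults."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    if cache_faults < 0 or cycles_per_fault < 0:
        raise ValueError("counts and penalties must be non-negative")
    return cache_faults * cycles_per_fault / total_cycles


# Hypothetical counter readings (assumed values, not taken from the article).
faults = 1_000_000
cycles = 400_000_000

conservative = l2_miss_impact(faults, cycles)       # 80 cycles per fault
pessimistic = l2_miss_impact(faults, cycles, 200)   # 200 cycles per fault
print(f"{conservative:.0%} to {pessimistic:.0%}")   # prints "20% to 50%"
```

Because the per-fault penalty is only an average, the result is best read as a range rather than a single number: the same counters give very different impact figures depending on which penalty is assumed.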

For small images, the IPP implementation exhibits far better spatial cache locality than the DMIP implementation, which pays a heavy price for the optimization overhead of splitting the image into fragments and applying the algorithm pipeline to each fragment. In fact, for small images this optimization is not necessary at all: the data (input, output, and temporary buffers) fits easily within the 4MB L2 cache. However, as the image size increases, the IPP implementation exhibits a significant increase in cache faults, while the DMIP implementation seems to converge to a relatively low cache-fault impact on performance.
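
To make the fragment idea concrete, here is a minimal sketch of the two execution orders. It is not DMIP's API: the function names, the two pointwise stages, and the fragment size of 64 are all assumptions made for illustration. It also handles only pointwise operations; a neighborhood filter such as edge detection would need overlapping fragment borders, which this sketch omits. Python will not reproduce the cache behavior, so the code shows only the difference in control flow.

```python
from typing import Callable, List, Sequence

Stage = Callable[[List[int]], List[int]]


def run_whole_image(pixels: Sequence[int], stages: Sequence[Stage]) -> List[int]:
    """Each stage sweeps the entire image before the next stage starts."""
    data = list(pixels)
    for stage in stages:
        data = stage(data)  # a full-size intermediate buffer per stage
    return data


def run_fragmented(pixels: Sequence[int], stages: Sequence[Stage],
                   fragment_size: int) -> List[int]:
    """Run the whole pipeline on one fragment before moving to the next."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    out: List[int] = []
    for start in range(0, len(pixels), fragment_size):
        fragment = list(pixels[start:start + fragment_size])
        for stage in stages:
            fragment = stage(fragment)  # intermediates stay fragment-sized
        out.extend(fragment)
    return out


# Two illustrative pointwise stages on 8-bit pixels (hypothetical pipeline).
def invert(p: List[int]) -> List[int]:
    return [255 - v for v in p]


def threshold(p: List[int]) -> List[int]:
    return [255 if v >= 128 else 0 for v in p]


image = [(i * 37) % 256 for i in range(1000)]
stages = [invert, threshold]
same = run_whole_image(image, stages) == run_fragmented(image, stages, 64)
print(same)  # prints True
```

Both orders produce the same output; what changes is the size of the intermediate data that is live at any moment. The fragmented version also does extra bookkeeping per fragment, which is the kind of overhead that does not pay off when the whole image already fits in cache.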

Having confirmed that DMIP does seriously reduce the number of cache faults (as far as large data loads are concerned), I moved on to measure the speedup achieved by DMIP compared to the IPP implementation. The speedup is computed by dividing the IPP timing by the DMIP timing. For example, a speedup of 2 means DMIP is twice as fast as IPP. Figure 3 shows the measured speedup for each image size.

Figure 3: DMIP Speedup compared to IPP.

The results are clearly in line with the theory and with the cache miss measurements: on small images, the DMIP implementation is two times slower than IPP. However, as the image size grows, the picture changes radically and the speedup increases: for the 256KB image the speedup is 1.5, and for the 4MB and 11MB images it peaks at roughly 2.5, more than twice as fast as IPP and a significant improvement.
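
The speedup ratio and its reading in both directions can be sketched as follows. The function names and the timings are hypothetical; the timings were picked only so that the ratios come out to 0.5 and 2.5, and are not the article's measured values.

```python
def speedup(ipp_seconds: float, dmip_seconds: float) -> float:
    """Speedup of DMIP over IPP: the IPP timing divided by the DMIP timing."""
    if ipp_seconds <= 0 or dmip_seconds <= 0:
        raise ValueError("timings must be positive")
    return ipp_seconds / dmip_seconds


def describe(ratio: float) -> str:
    """Turn a speedup ratio into a plain-language statement."""
    if ratio > 1:
        return f"DMIP is {ratio:.1f}x faster than IPP"
    if ratio < 1:
        return f"DMIP is {1 / ratio:.1f}x slower than IPP"
    return "DMIP and IPP take the same time"


# Hypothetical timings in seconds (assumed, for illustration only).
print(describe(speedup(1.0, 2.0)))    # prints "DMIP is 2.0x slower than IPP"
print(describe(speedup(50.0, 20.0)))  # prints "DMIP is 2.5x faster than IPP"
```

A ratio below 1 therefore means a slowdown: "two times slower" corresponds to a speedup of 0.5.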

Another interesting observation is that the DMIP implementation significantly outperforms IPP even though it consumes far fewer CPU resources. While the IPP process makes its threads work hard (the average load across both cores is 70% to 100%), the DMIP process uses approximately 50% of the cores' resources, although, importantly, both cores are active. This is notable because it is commonly assumed that the higher a process's CPU utilization, the better its throughput. As these measurements show, that assumption does not always hold.

Summary

Writing code that maintains a high degree of spatial cache locality is essential for designers of any performance-critical system, and especially for vision system designers who want high-throughput systems that scale as hardware offers more cores and larger caches and as high-end sensors produce ever larger image loads.

Intel's DMIP appears to be a good solution for those who face these challenges and need a deferred mode optimization technique: it is an object-oriented, extensible, and easy-to-use framework that dynamically adjusts to the CPU resources available at runtime. The framework is extensive and has many features that are not discussed here because they are outside the scope of this article. Another important property of Intel's DMIP is its functional API (used in the code example), which allows simple, elegant, and quick coding of complex image processing algorithms, much like a MATLAB script.

Still, as the measurements show, deferred mode sometimes exhibits reduced performance compared to "standard" implementations. As always, the recommendation is to measure the application's metrics, analyze the bottlenecks, and only then decide which optimization method to use.

