Code elegance is important, of course, but customers and sales managers are not really impressed by it. Let's look at some measurements comparing how DMIP performs relative to IPP.
I measured and compared the performance of the above edge detection algorithm, using IPP and DMIP, on four monochrome images (8 bits per pixel) of the following sizes: 64KB, 256KB, 4MB, and 11MB. The measurements were performed on a Dell Vostro 1500 laptop, equipped with an Intel Core 2 Duo T7300 CPU, which has two symmetric cores and a 4MB shared L2 cache.
I started the performance analysis by measuring the impact of L2 cache faults on both processes using Intel's VTune performance analyzer. The measurement model assumes that each cache fault costs 80 CPU cycles on average (this follows Intel's recommendations, although some claim this is a very conservative assumption and that the real number is closer to 200). To estimate the performance impact, I counted the number of L2 cache faults, multiplied it by 80 cycles per fault, and divided by the total cycles consumed by the process. This metric gives a rough estimate of the impact L2 cache faults have on the process's performance. The equation is shown below, where pi stands for performance impact, cf for the number of L2 cache faults, and cc for the total number of cycles consumed by the process:

pi = (cf × 80) / cc
The results can be seen in Figure 2.
For small images, the IPP implementation exhibits far better cache spatial locality than the DMIP implementation, which pays a heavy price for the overhead of splitting the image into fragments and applying the algorithm pipeline to each fragment. In fact, for small images this optimization is not necessary at all: the data (input, output, and temp buffers) easily fits within the 4MB L2 cache. However, as the image size increases, the IPP implementation exhibits a significant increase in cache faults, while the DMIP implementation converges to a relatively low cache-fault impact on performance.
After verifying that DMIP does indeed drastically reduce the number of cache faults (at least for large data loads), I moved on to measure the speedup achieved by DMIP over the IPP implementation. The speedup is computed by dividing the IPP timing by the DMIP timing; for example, a speedup of 2 means DMIP is twice as fast as IPP. Figure 3 shows the measured speedup for each image size.
The results are clearly in line with the theory and with the cache-fault measurements: on small images, the DMIP implementation runs twice as slow as IPP. However, as the image size grows, the picture changes radically and the speedup increases: for the 256KB image the speedup is 1.5, and for the 4MB and 11MB images it peaks at roughly 2.5 -- more than twice as fast as IPP, a significant improvement.
Another interesting observation is that the DMIP implementation significantly outperforms IPP while consuming far fewer CPU resources. While the IPP process works its threads hard (average utilization of 70% to 100% on both cores), the DMIP process uses approximately 50% of the cores' resources -- although, it's important to note, both cores are active. This is interesting because it is commonly assumed that the higher a process's CPU utilization, the better its throughput. As these results show, that assumption is not always true.
Writing code that maintains a high degree of cache spatial locality is mandatory for any performance-critical system designer, and especially for vision system designers who want to build high-throughput systems that scale as hardware offers more cores and larger caches, while high-end sensors produce ever larger image loads.
Intel's DMIP appears to offer a good solution for those facing these challenges and requiring a deferred-mode optimization technique: it is an object-oriented, extensible, and easy-to-use framework that dynamically adjusts to the runtime CPU resources. The framework itself is extensive and has many features that are beyond the scope of this article. Another important property of Intel's DMIP is its functional API (used in the code example), which allows simple, elegant, and quick coding of complex image processing algorithms, much like a MATLAB script.
Still, as the measurements show, deferred mode sometimes exhibits reduced performance compared to "standard" implementations. As always, the recommendation is to measure the application's metrics, analyze the bottlenecks, and only then choose the optimization method.