In recent years, vision system architects have struggled to maintain reasonable system throughput in the face of ever-increasing sensor resolutions and image sizes. Sensor manufacturers have been selling sensors with more than 10 megapixels (MP; one megapixel is one million pixels) for a few years now, and Kodak already offers a 50MP sensor, the KAF-50100. On top of that, many vision applications now require color or multi-spectral imaging, which further increases the data load with multiple channels per pixel. Together, these trends cause a serious increase in the computational load of vision systems. CPU capabilities are advancing as well, but mainly in the form of additional processing cores, since traditional CPU designs have hit a clock-speed barrier. To harness the power of many cores and allow software to scale well across them, software architects need to write parallel code. However, when a relatively large data load is involved, as in modern vision systems, simply writing parallel code does not guarantee good scaling or efficient use of multiple cores, because the major bottleneck resides elsewhere.
The Memory Access Bottleneck
To understand the bottleneck encountered by modern vision systems, let's examine four basic facts:
- Modern vision applications need to handle image sizes of 10MB and above. As noted above, cutting-edge vision systems must handle ever-increasing data loads due to high-resolution sensors, color images, and multi-spectral images.
- CPU cache sizes are small compared to typical vision system image sizes. For example, Intel's Core i7 contains "only" 8MB of L3 cache (shared across all cores) -- a huge cache by current standards, but still not big enough to hold a single full frame given a modern vision application's data load.
- Main memory access is approximately 100 times slower than cache access. The CPU cache was invented for an important reason: accessing main memory is very slow compared to a CPU operation. This means an application pays a heavy performance toll for frequent cache misses.
- Typical vision algorithms are structured as a sequence of operations on images, where the output of one operation becomes the input to the next.
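The scale mismatch in the facts above can be quantified with simple arithmetic. The sketch below uses hypothetical but representative figures (an 8-bit monochrome sensor, and the 8MB L3 cache from the Core i7 example):

```python
# Back-of-the-envelope comparison of frame size vs. cache size.
# Assumes 8-bit (1 byte) pixels; multiply by channel count for color.
MEGAPIXEL = 1_000_000

def frame_bytes(megapixels, bytes_per_pixel=1, channels=1):
    """Size in bytes of a single frame."""
    return int(megapixels * MEGAPIXEL * bytes_per_pixel * channels)

cache_bytes = 8 * 1024 * 1024      # 8MB L3 cache (Core i7 example)
frame = frame_bytes(10)            # 10MP, 8-bit mono = 10,000,000 bytes
print(frame > cache_bytes)         # a single frame already overflows the cache

# A pipeline stage typically needs input + output (+ temp) buffers:
working_set = 3 * frame
print(working_set / cache_bytes)   # several times the entire L3 cache
```

With a 50MP sensor or three color channels, the gap grows correspondingly wider.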
Let's combine these facts: vision applications usually perform successive operations on images (fact 4) that are larger than the cache (facts 1 and 2) and so cannot fit entirely within it. This causes a substantial number of cache misses every time an operator is applied to the image. Since cache misses are so expensive (fact 3), a substantial portion of the processing time is actually spent accessing main memory, which seriously degrades the application's throughput. It is also worth noting that using multiple cores does not relieve this kind of bottleneck -- it may actually make things worse: parallelizing the computation can put many cores in competition for the limited cache space, increasing the number of cache misses even further.
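The problematic structure can be sketched as follows. This is a minimal illustration, not a real vision library: the operators are hypothetical stand-ins, and the point is the traversal pattern, not the operators themselves.

```python
# Naive structure: each operator makes a full pass over the whole image
# before the next operator starts. For a larger-than-cache image, the data
# touched at the start of a pass is evicted long before the pass ends, so
# every pass re-fetches the image from main memory.

def threshold(image, level=128):
    return [[255 if p > level else 0 for p in row] for row in image]

def invert(image):
    return [[255 - p for p in row] for row in image]

def erode_stub(image):
    # placeholder for a real morphological operator
    return [[min(p, 254) for p in row] for row in image]

def process_naive(image):
    image = threshold(image)   # full pass 1 over the entire image
    image = invert(image)      # full pass 2: early rows already evicted
    image = erode_stub(image)  # full pass 3: same story again
    return image
```

Each arrow in the pipeline costs one complete trip through main memory, which is exactly the toll described above.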
Deferred Mode Optimization
How can vision application architects overcome this bottleneck and allow their applications to scale better? Reducing the image size is out of the question (algorithm accuracy would be degraded), and so is building CPUs with huge caches (they would be far too expensive); the only way out is to restructure the algorithm so that the cache is used more efficiently and the number of cache misses is minimized. In other words, the algorithm's cache spatial locality must be improved.
Examining the typical vision algorithm structure (fact 4) reveals that most vision algorithms are structured as successive operations on images, where the output of one operation is the input for the next. The cache misses arise simply because the images are very big, usually much bigger than the cache. Moreover, many typical operators require an additional output image and/or a temporary image buffer for intermediate calculations -- all of which should fit in the cache.
An interesting optimization technique is to divide the original image into fragments sized so that each fragment's entire data load fits comfortably into the cache, and then apply the whole algorithm pipeline to each fragment in turn, minimizing cache misses. After the pipeline has been applied to all fragments sequentially, the fragments are combined to produce the final result. This approach still incurs some cache misses, since each fragment must be loaded into the cache initially, but it is far more cache-friendly than the straightforward approach, so it reduces cache misses significantly and improves performance. This optimization technique is called "Deferred Mode".
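The restructured pipeline can be sketched as below. This is a simplified illustration under stated assumptions: the operators are hypothetical point-wise stand-ins, the strip height is arbitrary, and a real implementation must also handle operators with neighborhood dependencies (overlapping "halo" rows at fragment borders).

```python
# "Deferred Mode" sketch: split the image into horizontal strips small
# enough to fit in cache, run the WHOLE operator pipeline on each strip,
# then stitch the strips back together. Between pipeline stages the strip
# stays cache-resident, instead of round-tripping through main memory.

def threshold(image, level=128):
    return [[255 if p > level else 0 for p in row] for row in image]

def invert(image):
    return [[255 - p for p in row] for row in image]

PIPELINE = [threshold, invert]

def process_deferred(image, strip_rows=64):
    result = []
    for top in range(0, len(image), strip_rows):
        strip = image[top:top + strip_rows]  # fragment sized to fit in cache
        for op in PIPELINE:                  # whole pipeline on this fragment
            strip = op(strip)                # data stays hot between operators
        result.extend(strip)                 # combine fragments into the result
    return result
```

For point-wise operators like these, the per-strip result is identical to the whole-image result, so the fragment size becomes a pure performance knob rather than a correctness concern.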