Video Encode and Decode
Video encoding and decoding are similar to image encoding and decoding, except that video requires compression and decompression of successive frames. Video typically requires more storage than a single image, but the amount does not scale linearly with the number of frames encoded and decoded. Video compression and decompression take advantage of the relatively low entropy between successive frames. Two widely used standards are MPEG-2 and H.264.
MPEG-2 is intended for high-quality, high-bandwidth video. It is most prominent because it is used for DVD and HDTV video compression. Computationally, good encoding is expensive but can be done in real time by current processors. Decoding an MPEG-2 stream is relatively easy and can be done by almost any current processor or, obviously, by commercial DVD players.
MPEG-2 is a complicated format with many options. It includes seven profiles dictating aspect ratios and feature sets, four levels specifying resolution, bit rate, and frame rate, and three frame types. The bit stream code is complex and requires several tables. However, at its core are computationally complex but conceptually clear compression and decompression elements.
MPEG-2's components are very similar to those of JPEG. MPEG-2 is DCT based and uses Huffman coding on the quantized DCT coefficients. However, the bit-stream format is completely different, as are all the tables. Unlike JPEG, MPEG-2 also has a restricted, though very large, set of frame rates and sizes. The biggest difference, however, is the exploitation of redundancy between frames.
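To make the DCT concrete, here is a minimal sketch of the forward 8-point DCT-II, the one-dimensional building block of the 8x8 transform used by JPEG and MPEG-2 (the 2-D transform applies it to rows and then columns). The function name and pure-Python form are illustrative, not the layout any codec actually uses:

```python
import math

def dct_8(samples):
    """Forward 8-point DCT-II of a list of 8 samples.

    Returns 8 coefficients; coefficient 0 is the DC (average)
    term, the rest capture progressively finer detail.
    """
    n = 8
    coeffs = []
    for k in range(n):
        # Orthonormal scaling: the DC basis vector is weighted differently.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        s = sum(samples[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        coeffs.append(scale * s)
    return coeffs
```

A flat block of pixels produces a single nonzero DC coefficient, which is why smooth image regions compress so well after quantization.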
There are three types of frames in MPEG: I (intra) frames, P (predicted) frames, and B (bidirectional) frames. Frame type has several consequences, but the defining characteristic is how prediction is done. Intra frames do not refer to other frames, which makes them suitable as key frames; they are, essentially, self-contained compressed images. By contrast, P frames are predicted from the previous P or I frame, and B frames are predicted from both the previous and the next P or I frame. Individual blocks within these frames may still be coded as intra or non-intra, however.
MPEG is organized around a hierarchy of blocks, macroblocks, slices, and frames. Blocks are 8 pixels high by 8 pixels wide in a single channel. Macroblocks are a collection of blocks 16 pixels high by 16 pixels wide and contain all three channels. Depending on subsampling, a macroblock contains 6, 8, or 12 blocks. For example, a YCbCr 4:2:0 macroblock has four Y blocks, one Cb block, and one Cr block.
The key to the effectiveness of video coding is using earlier, and sometimes later, frames to predict a value for each pixel. Image compression can only use a block elsewhere in the same image as a base value for each pixel, but video compression can aspire to use an image of the same object in another frame. Instead of compressing pixels, which have high entropy, video compression can compress the differences between similar pixels, which have much lower entropy.
Objects and even backgrounds in video are not reliably stationary, however. To make these references to other video frames truly effective, the codec needs to account for motion between the frames. This is accomplished with motion estimation and compensation. Along with the video data, each block also has motion vectors that indicate how far that block has moved relative to a reference frame. Before taking the difference between the current and reference frames, the codec shifts the reference block by that amount. Calculating the motion vectors is called motion estimation, and accommodating the motion is called motion compensation.
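The simplest form of motion estimation is a full search: compare the current block against every candidate offset in a small window of the reference frame and keep the offset with the lowest sum of absolute differences (SAD). The sketch below assumes frames stored as row-major lists of lists; real encoders use far faster search strategies, and the function names are illustrative:

```python
def sad(cur, ref, bx, by, dx, dy, bs):
    """Sum of absolute differences between the current block at
    (bx, by) and the reference block offset by (dx, dy)."""
    total = 0
    for y in range(bs):
        for x in range(bs):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total

def estimate_motion(cur, ref, bx, by, bs=4, search=2):
    """Full-search motion estimation over a (2*search+1)^2 window.

    Returns the motion vector (dx, dy) with the lowest SAD.
    """
    h, w = len(ref), len(ref[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + bs <= w and
                    0 <= by + dy and by + dy + bs <= h):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, bs)
            if cost < best_cost:
                best_cost, best = cost, (dx, dy)
    return best
```

Motion compensation is then just the reverse step: the decoder fetches the reference block at the offset given by the vector and adds the coded difference to it.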
The two series of video codec names, H.26x and MPEG-x, overlap. MPEG-2 is named H.262 in the H.26x scheme. Likewise, another popular codec, H.264, is part of the MPEG-4 standard, also known as MPEG-4 Advanced Video Coding (AVC). Its intent, like that of all of MPEG-4, was to produce video compression of acceptable quality at a very low bit rate -- around half that of its predecessors MPEG-2 and H.263.
Image Processing and Object Recognition
Geometric transformations constitute a large and important segment of image processing operations. Any function that changes the size, shape, or orientation of the image, or the order of its pixels, can be grouped under this broad classification. The math employed by these transform operations uses two coordinate systems: the source image coordinate system and the destination image coordinate system. Both have an origin (0,0) defined by the data pointer, and the two systems are related by the geometric transform.
In most cases, the location in the source from which the data is to be drawn, indicated as (x', y'), does not lie exactly on a source pixel. Some form of interpolation is then used to calculate the value. The nearest-neighbor method chooses the pixel that is closest to (x', y'). Linear interpolation takes a weighted average of the four surrounding pixels. Cubic interpolation fits a third-order curve to the data to calculate the (x', y') value. Super-sampling interpolation averages over a wider range of pixels and is suitable for resizing images to a much smaller size, such as when creating a thumbnail image.
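Linear (bilinear) interpolation is the workhorse of these kernels, and its weighting scheme is easy to show directly. This is a minimal sketch assuming a row-major list-of-lists image and an interior location; the function name is illustrative:

```python
def bilinear(img, x, y):
    """Interpolate img at a non-integer location (x, y) from the
    four surrounding pixels, weighted by proximity."""
    x0, y0 = int(x), int(y)      # top-left of the surrounding 2x2 block
    fx, fy = x - x0, y - y0      # fractional offsets within that block
    p00 = img[y0][x0]
    p01 = img[y0][x0 + 1]
    p10 = img[y0 + 1][x0]
    p11 = img[y0 + 1][x0 + 1]
    top = p00 * (1 - fx) + p01 * fx          # blend along the top row
    bottom = p10 * (1 - fx) + p11 * fx       # blend along the bottom row
    return top * (1 - fy) + bottom * fy      # blend the two rows
```

At the exact center of four pixels, each contributes a quarter of the result; as (x', y') approaches a pixel, that pixel's weight approaches one, matching nearest-neighbor behavior in the limit.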
Other common transformations are summarized as follows:
- Resize. Resizing functions change an image from one size to another. The pixels in the first image are either stretched by duplication or interpolation, or they are compressed by dropping or interpolation. A single resize operation can stretch the image in one direction and compress it in the other.
- Rotation. Turn an image around the origin, around a designated point, or around the center.
- Affine Transform. The affine transform is a general two-dimensional transform that preserves parallel lines and is general enough to shear, resize, or shift an image.
- Perspective Transform. The perspective transform is a general three-dimensional transform. Properly applied, this transform can represent a projection of an image onto a plane of arbitrary orientation.
- Remap. The remap function is a completely general geometric transform. It takes a destination-to-source map the same size as the destination image, in which each pixel has a corresponding floating-point (x,y) coordinate pair. The operation calculates the value at that location according to the interpolation mode and sets the destination pixel to that value. The remap function is most useful for morphing and other video effects, for which the other geometric transforms are not flexible enough.
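All of these transforms share the same destination-to-source structure: walk the destination pixels, map each one back into the source, and sample there. The sketch below shows this for an affine transform with nearest-neighbor sampling, assuming a 2x3 matrix [[a, b, tx], [c, d, ty]] and row-major list-of-lists images; the function names are mine, not a library API:

```python
def affine_map(m, x, y):
    """Apply the 2x3 affine matrix m to the point (x, y)."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])

def warp_affine_nn(src, m, out_w, out_h):
    """Destination-to-source affine warp with nearest-neighbor sampling.

    For each destination pixel, m maps it back to a source location
    (x', y'); the nearest source pixel supplies the value, and
    locations outside the source are filled with 0.
    """
    h, w = len(src), len(src[0])
    dst = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            sx, sy = affine_map(m, x, y)
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                dst[y][x] = src[iy][ix]
    return dst
```

Resize, rotation, and perspective differ only in the mapping function; remap replaces the function with an explicit per-pixel coordinate table.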
Image processing functions are grouped into the following categories:
- Statistics: norm, mean, median, standard deviation, histograms
- Analysis functions and filters: erode and dilate, blur, Laplace, Sobel, distance transform, pyramid
- Feature detection: edge, corner, template matching
- Motion detection and understanding: motion templates
Many of these functions are closely associated with the Open Source Computer Vision Library (OpenCV). The histories of this library and the Intel IPP computer vision domain are intertwined, and OpenCV uses Intel IPP as its optimization layer.
The next subsections summarize key image processing functionality.
Edge Detection. Perhaps the most important low-level vision task is detection of edges in the image. Edges are visual discontinuities of brightness, color, or both. They are usually detected by an automated operation on a small region of pixels. Interpreted correctly, they convey higher-level scene information, particularly the boundaries of objects in the scene.
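The Sobel operator mentioned earlier is a typical example of such a small-region operation: two 3x3 kernels estimate the horizontal and vertical brightness gradients, and a large gradient magnitude marks an edge. A minimal sketch for a single interior pixel, assuming a row-major list-of-lists image (the function name is illustrative):

```python
def sobel_magnitude(img, x, y):
    """Gradient magnitude at interior pixel (x, y) via the 3x3
    Sobel operators; larger values indicate stronger edges."""
    gx_kernel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal change
    gy_kernel = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical change
    gx = gy = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            p = img[y + dy][x + dx]
            gx += gx_kernel[dy + 1][dx + 1] * p
            gy += gy_kernel[dy + 1][dx + 1] * p
    return (gx * gx + gy * gy) ** 0.5
```

In a flat region both kernels sum to zero, so the response is zero; across a brightness step the response spikes, which is the discontinuity an edge detector is after.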
Multi-Resolution Analysis. When trying to find an object in a scene, size is as important a characteristic as shape or color. Even if you know exactly what shape or color an object is, you need to know how many pixels wide and tall it is. One easy way of performing an analysis without knowing this size is to perform the search on multiple resolutions of an image. Such a set of resolutions of an image is often called an image pyramid.
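A pyramid is built by repeatedly shrinking the image, most simply by averaging 2x2 pixel blocks at each level. The sketch below assumes row-major list-of-lists images; production pyramids apply a small blur filter before downsampling, which plain averaging only approximates, and the function names are mine:

```python
def pyramid_down(img):
    """One pyramid level: halve the image by averaging 2x2 blocks."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(0, h - 1, 2):
        row = []
        for x in range(0, w - 1, 2):
            row.append((img[y][x] + img[y][x + 1] +
                        img[y + 1][x] + img[y + 1][x + 1]) / 4.0)
        out.append(row)
    return out

def build_pyramid(img, levels):
    """Return [full, half, quarter, ...] with `levels` entries."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(pyramid_down(pyr[-1]))
    return pyr
```

A search for an object of unknown size can then run the same fixed-size detector at every level, letting the pyramid rather than the detector absorb the scale variation.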
Template Matching. In computer vision, a template is a canonical representation of an object used for finding the object in a scene. There are many ways to match the template, such as taking the pixel-by-pixel normalized sum of the squared difference between template and image.
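An exhaustive squared-difference match is easy to sketch: slide the template over every position in the image and keep the position with the lowest cost. This version assumes row-major list-of-lists images and omits the normalization the text mentions, which real matchers add to tolerate brightness changes; the function name is illustrative:

```python
def match_template(img, tpl):
    """Slide tpl over img and return the (x, y) of the top-left
    corner with the smallest sum of squared differences."""
    ih, iw = len(img), len(img[0])
    th, tw = len(tpl), len(tpl[0])
    best, best_cost = (0, 0), float("inf")
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            cost = sum((img[y + j][x + i] - tpl[j][i]) ** 2
                       for j in range(th) for i in range(tw))
            if cost < best_cost:
                best_cost, best = cost, (x, y)
    return best
```

Combined with an image pyramid, the same match can be run at several scales to find the object regardless of its size in the scene.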
Processing of character strings is common to many applications and is embodied in several standard library functions across a number of operating systems. Accelerated versions of these functions exist for various platforms, taking advantage of specialized hardware capabilities such as SIMD instruction sets. Typical optimized string functions include copy, search, length, insertion, removal, comparison, uppercase, lowercase, and concatenation.
This article is based on material in the book "Break Away with Intel Atom Processors: A Guide to Architecture Migration" by Lori Matassa and Max Domeika, published by Intel Press (www.intel.com/intelpress/sum_ms2a.htm).