According to a leading market research firm, Internet video consumption grew by roughly 70% over the past year, from an average of 700 terabytes/day in 2006 to 1,200 terabytes/day in 2007. The firm expects consumption to reach 7,800 terabytes/day by 2011, a more than sixfold increase. Video content creation will expand even more rapidly: the same firm projects that Internet video uploads will grow from 500,000 uploads/day in 2007 to 4.8 million uploads/day in 2011.
The mobile handset is one of the leading capture platforms for video content. Enabled by rapidly declining non-volatile memory prices, mobile handsets can store, decode, and encode ever-longer video clips at ever-higher resolutions. The increasing availability of high-definition (HD) content and display options in the home is raising consumer expectations for video resolution on the handset. Whether the content is movie trailers, music videos, or user-generated clips, consumers increasingly want and expect their mobile devices to capture and play back HD video without adversely affecting battery life.
Handset developers have a range of video codec options to satisfy this demand: a) fully customized hardware blocks integrated into system-on-chip (SoC) designs, b) optimized software codecs running on RISC or DSP processors with enhanced instruction sets, or c) software running on standard processor cores such as the ARM9 or ARM11. The authors have run extensive simulations to determine the relative trade-offs in performance, especially power consumption, that can be expected from each type of video encoder and decoder implementation.
Design goals
To bring high-definition video to users' hands, mobile device designers must optimize the power consumption of every component. The display and wireless transmitter consume a large share of the total power, but every part of a mobile system must be evaluated for its contribution to the power budget. Application and baseband processors must meet stringent power requirements to win a socket on a mobile handset's circuit board. Video coding is usually the most power-hungry application a mobile processor has to run, so minimizing a chip's peak power consumption often comes down to finding the most power-efficient way to implement the video processing algorithms.
Digital mobile chips are generally implemented in low-voltage, low-leakage CMOS processes to minimize power consumption in both active and standby modes. These low-power process technologies have one drawback: they are slow. Because long gate delays rule out high clock frequencies, the processing architecture must execute the algorithms in a small number of clock cycles. This can be achieved by designing an architecture that provides parallelism at both the module level and the functional-unit level. In other words, the architecture has to execute different parts of an algorithm (e.g., IDCT, dequantization, motion compensation, and deblocking filtering in an MPEG-4 video decoder) and different operations within each module (additions, multiplications, memory accesses, etc.) concurrently, in a pipelined manner, as sketched below.
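The pipelining idea can be expressed as a simple C model. This is a conceptual sketch only, with hypothetical stage and type names rather than anything from the authors' design: in an actual hardwired decoder each stage is a dedicated hardware block, and the four calls inside one loop iteration execute in the same clock period on four different macroblocks.

```c
/* Conceptual sketch (not the authors' RTL) of module-level parallelism,
 * expressed as a four-stage macroblock pipeline.  In a hardwired decoder,
 * each stage below is a dedicated block and all four calls inside one loop
 * iteration happen in the same clock period, each on a different macroblock. */
#include <stddef.h>

typedef struct {
    int coeffs[16][16];   /* placeholder for residual data   */
    int mv_x, mv_y;       /* placeholder for a motion vector */
} MacroBlock;

/* Hypothetical stage functions; bodies are stubs for illustration only. */
static void entropy_decode(MacroBlock *mb)    { (void)mb; /* bitstream parsing */ }
static void dequant_idct(MacroBlock *mb)      { (void)mb; /* dequant + IDCT    */ }
static void motion_compensate(MacroBlock *mb) { (void)mb; /* reference fetch   */ }
static void deblock_filter(MacroBlock *mb)    { (void)mb; /* in-loop filtering */ }

void decode_frame(MacroBlock *mb, size_t num_mb)
{
    /* i counts pipeline "cycles": once the pipe is full, every stage is busy
     * on every cycle, which is what lets a hardwired block sustain the
     * required macroblock rate at a low clock frequency.  A sequential
     * processor would instead run the four stages back-to-back per block. */
    for (size_t i = 0; i < num_mb + 3; ++i) {
        if (i < num_mb)               entropy_decode(&mb[i]);        /* stage 1 */
        if (i >= 1 && i - 1 < num_mb) dequant_idct(&mb[i - 1]);      /* stage 2 */
        if (i >= 2 && i - 2 < num_mb) motion_compensate(&mb[i - 2]); /* stage 3 */
        if (i >= 3 && i - 3 < num_mb) deblock_filter(&mb[i - 3]);    /* stage 4 */
    }
}
```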
An extremely parallel architecture permits a low clock frequency, which in turn makes it possible to use a slow, low-leakage silicon process and to lower the supply voltage of the video decoder block in an SoC. Because dynamic power depends on the square of the supply voltage, this gives a huge advantage over designs that require a higher clock frequency.
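As a rough illustration, the snippet below applies the standard dynamic-power relation P ≈ α·C·V²·f to two made-up operating points. The effective capacitance, voltages, and clock rates are assumptions chosen for illustration, not figures from the authors' simulations; the point is only that frequency enters the relation linearly while voltage enters as a square.

```c
/* Back-of-the-envelope sketch of why voltage scaling matters: dynamic CMOS
 * power is roughly P = alpha * C * V^2 * f.  The operating points below are
 * illustrative assumptions, not measured figures from the article. */
#include <stdio.h>

static double dynamic_power(double c_eff, double volts, double freq_hz)
{
    return c_eff * volts * volts * freq_hz;   /* activity factor folded into c_eff */
}

int main(void)
{
    const double c_eff = 1.0e-9;                     /* arbitrary effective capacitance */
    double p_sw = dynamic_power(c_eff, 1.2, 300e6);  /* e.g. software codec: 1.2 V, 300 MHz */
    double p_hw = dynamic_power(c_eff, 0.9, 50e6);   /* e.g. hardwired codec: 0.9 V, 50 MHz */

    /* Clocking 6x slower at a lower voltage cuts dynamic power by ~10x here,
     * because frequency scales power linearly but voltage scales it quadratically. */
    printf("software-style point : %.1f mW\n", p_sw * 1e3);
    printf("hardwired-style point: %.1f mW\n", p_hw * 1e3);
    printf("ratio                : %.1fx\n", p_sw / p_hw);
    return 0;
}
```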

Figure 1. Hardwired Architecture for Video Encoding.
Architecture alternatives for mobile video coding
The three most common solutions for implementing video coding on mobile chips are hardwired (HW) video codecs, video-optimized DSPs, and general-purpose RISC processor cores (often enhanced with SIMD hardware). The level of parallelism is highest in the HW codecs and lowest in the RISC-based software codecs. The programmable architectures (DSP and RISC) differ mainly in the amount of functional-unit-level parallelism they offer; both lack module-level parallelism unless an awkward multi-core solution is used.
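To make the notion of functional-unit-level parallelism concrete, consider the sum-of-absolute-differences (SAD) kernel used in motion estimation, written here in its plain scalar form. The kernel is chosen for illustration and is not a benchmark from the article; the comments indicate, as an assumption about typical implementations, how each architecture class would execute it.

```c
/* Scalar SAD kernel, roughly one pixel per loop iteration on a plain RISC.
 * A SIMD-enhanced core can fold several pixels into one instruction (e.g.
 * four byte differences per operation), while a hardwired codec can evaluate
 * an entire 16-pixel row per clock alongside the other modules, which is
 * where the gap in required clock frequency comes from. */
#include <stdint.h>
#include <stdlib.h>

unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, size_t stride)
{
    unsigned sad = 0;
    for (size_t y = 0; y < 16; ++y) {
        for (size_t x = 0; x < 16; ++x)
            sad += (unsigned)abs((int)cur[x] - (int)ref[x]);
        cur += stride;
        ref += stride;
    }
    return sad;
}
```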
The clock frequency, and hence the power consumption, required to execute the same video coding algorithms varies across platforms according to the level of parallelism each architecture implements. This can be seen in the following chart, which shows the relative power consumption, clock frequency, and silicon area required by an H.264 decoder (VGA resolution at 30 frames/s) for each implementation style. The DSP figures are averages of numbers published by leading DSP core vendors. The chart clearly demonstrates the well-known performance advantage that algorithm-specific hardwired architectures hold over programmable solutions in any application; the drawback, naturally, is the lack of flexibility.