The authors are Intel engineers. Courtesy Intel Corp. All rights reserved.
Digital signal and image processing (DSP) is ubiquitous, from digital cameras to cell phones, HDTV to DVDs, and satellite radio to medical imaging, and the modern world is increasingly dependent on DSP algorithms. Traditionally, special-purpose silicon devices such as digital signal processors, ASICs, or FPGAs have been used for data manipulation, but general-purpose processors (GPPs) can now also handle DSP workloads. Code is generally easier and more cost-effective to develop and support on GPPs than on large DSPs or FPGAs. GPPs can also combine general-purpose processing with digital signal processing on the same chip, a major advantage for many complex algorithms. The Intel processor microarchitecture, instruction set, and performance libraries have features that can be exploited to deliver the performance and capability required by DSP applications. This article explores the main differences between traditional DSPs and modern Intel general-purpose processor architectures.
Vectorization is one example of how GPPs can meet DSP requirements. Although DSP algorithms tend to be mathematically intensive, they are often fairly simple in concept. Filters and Fast Fourier Transforms (FFTs), for example, can be implemented using simple multiply and accumulate instructions. Modern GPPs use Single Instruction Multiple Data (SIMD) techniques to increase their performance on these types of low-level DSP functions. The current Intel Core processor family and Intel Xeon processors have sixteen 128-bit vector registers, each of which can be treated as a group of 16, 8, 4, or 2 samples depending on the data format and precision required. For single-precision (32-bit) floating-point SIMD processing, for example, four floating-point (FP) numbers that need to be multiplied by a second value are loaded into vector register 1, with the multiplier(s) in register 2. The multiply operation is then executed on all four numbers in a single processor clock cycle. The current Intel Core2 processor family and Intel Xeon processors have a 4-wide instruction pipeline with two FP arithmetic logic units, so potentially 8 single-precision FP operations (two units, each working on four 32-bit values) can be done per clock cycle per core. This number will increase to 16 operations per clock when the Intel Advanced Vector Extensions (Intel AVX) instruction set architecture debuts in the 2010 "Sandy Bridge" generation of processors, since AVX SIMD registers will be 256 bits wide.
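The four-at-a-time multiply described above can be sketched with the SSE compiler intrinsics from `<xmmintrin.h>`. This is a minimal illustration rather than tuned production code: the function name `scale_samples` and its parameters are hypothetical, unaligned loads are used so that no alignment is assumed, and a scalar loop handles any leftover samples (and serves as a fallback on targets without SSE).

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps */
#endif

/* Hypothetical helper: out[i] = in[i] * gain[i] for n single-precision samples. */
void scale_samples(const float *in, const float *gain, float *out, size_t n)
{
    size_t i = 0;

    assert(in != NULL && gain != NULL && out != NULL);

#if defined(__SSE__)
    /* Main loop: four 32-bit floats fill one 128-bit vector register. */
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(in + i);     /* "register 1": four samples     */
        __m128 g = _mm_loadu_ps(gain + i);   /* "register 2": four multipliers */
        _mm_storeu_ps(out + i, _mm_mul_ps(x, g));  /* one packed multiply      */
    }
#endif

    /* Scalar tail for the remaining 0..3 samples (or all of them without SSE). */
    for (; i < n; ++i)
        out[i] = in[i] * gain[i];
}
```

Calling `scale_samples(in, gain, out, 6)` would process the first four samples with one packed multiply and the last two in the scalar tail. In practice, an optimizing compiler or a performance library may generate equivalent vector code from a plain loop.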
Parallelization is another example of how GPPs can meet DSP requirements. The Intel architecture, as a multicore architecture, is suited to executing multiple threads in parallel. In terms of DSP programming, there are several approaches to achieving parallelism:
- Pipelined execution: The algorithm is divided into stages, and each stage is assigned to a different thread.
- Concurrent execution: The input data is partitioned, and each thread processes its own portion of the input data through the whole algorithm. This is only possible if the portions can be processed independently without compromising the functionality of the algorithm.
Both approaches can also be combined to maximize performance and make efficient use of resources.
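As a rough illustration of the concurrent approach only, the sketch below partitions an input buffer into contiguous slices and gives each slice to its own POSIX thread. The per-sample gain stands in for "the whole algorithm", and the names `slice_t`, `process_slice`, `parallel_gain`, and the `MAX_THREADS` bound are all hypothetical choices made for this example, not part of any Intel library.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 8   /* illustrative upper bound, not a tuned value */

typedef struct {
    const float *in;
    float *out;
    size_t begin, end;  /* this thread's portion: [begin, end) */
    float gain;
} slice_t;

/* Thread body: run the (stand-in) algorithm over one slice only. */
static void *process_slice(void *arg)
{
    slice_t *s = (slice_t *)arg;
    for (size_t i = s->begin; i < s->end; ++i)
        s->out[i] = s->in[i] * s->gain;
    return NULL;
}

/* Returns 0 on success, -1 on a bad thread count or thread-creation failure. */
int parallel_gain(const float *in, float *out, size_t n, float gain, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    slice_t slice[MAX_THREADS];
    int started = 0, rc = 0;

    assert(in != NULL && out != NULL);
    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    size_t chunk = (n + (size_t)nthreads - 1) / (size_t)nthreads;  /* round up */
    for (int t = 0; t < nthreads; ++t) {
        size_t b = (size_t)t * chunk;
        size_t e = b + chunk;
        if (b > n) b = n;   /* trailing threads may get an empty slice */
        if (e > n) e = n;
        slice[t] = (slice_t){ in, out, b, e, gain };
        if (pthread_create(&tid[t], NULL, process_slice, &slice[t]) != 0) {
            rc = -1;
            break;
        }
        ++started;
    }
    /* Wait for every thread that was actually started. */
    for (int t = 0; t < started; ++t)
        pthread_join(tid[t], NULL);
    return rc;
}
```

Because each thread writes only to its own slice of `out`, no locking is needed here. An algorithm with dependencies between neighboring samples (a filter with history, for instance) would need overlapping input ranges or a pipelined design instead.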
When evaluating parallelism, the programmer should also consider the cache hierarchy. For maximum throughput, each thread's input and output data should ideally fit within its local caches (L1, L2), minimizing cache thrashing due to inter-core coherency overheads. At every stage of the algorithm, each thread should store its output data contiguously, in blocks whose size is a multiple of the cache line width. Inter-thread data dependencies should be minimized and pipelined to reduce stalls in the algorithm.
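One way to keep each thread's output in blocks that are a multiple of the cache line width is to round the per-thread slice size up to whole cache lines, so that no two threads write to the same line. The sketch below assumes a 64-byte line and a buffer that itself starts on a line boundary; both are assumptions to verify for the target processor, and the name `line_aligned_slice` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64   /* assumption: check the actual line size of the target CPU */
#define FLOATS_PER_LINE  (CACHE_LINE_BYTES / sizeof(float))

/*
 * Compute the half-open range [*begin, *end) of float indices that thread t
 * (0-based) of nthreads should own in a buffer of n floats. Every boundary
 * between two threads falls on a multiple of FLOATS_PER_LINE; only the last
 * non-empty slice may end in a partial line.
 */
void line_aligned_slice(size_t n, size_t nthreads, size_t t,
                        size_t *begin, size_t *end)
{
    assert(nthreads > 0 && t < nthreads);
    assert(begin != NULL && end != NULL);

    size_t lines = (n + FLOATS_PER_LINE - 1) / FLOATS_PER_LINE;  /* round up */
    size_t lines_per_thread = (lines + nthreads - 1) / nthreads; /* round up */
    size_t b = t * lines_per_thread * FLOATS_PER_LINE;
    size_t e = b + lines_per_thread * FLOATS_PER_LINE;

    *begin = b < n ? b : n;   /* clamp so trailing threads get empty slices */
    *end   = e < n ? e : n;
}
```

For 100 floats split across three threads, this yields the ranges [0, 48), [48, 96), and [96, 100): the first two threads each own exactly three full cache lines, and the remainder goes to the last thread.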


