Case Study: Wireless Baseband Signal Processing
In wireless communication systems, the physical layer (PHY), that is, baseband signal processing, is usually implemented in dedicated hardware (ASICs) or in a combination of DSPs and FPGAs because of its extremely high computational load. General-purpose processors (GPPs), such as those based on Intel architecture, have traditionally been reserved for the higher, less demanding layers of the protocol stack.
This section, however, shows that two of the most demanding next-generation baseband processing algorithms, the LTE Turbo encoder and channel estimation, can be implemented effectively on modern Intel processors. The following discussion assumes Intel architecture as the target platform and the parameters listed in Table 1.
LTE Turbo Encoder
The Turbo encoder is an algorithm that operates intensively at the bit level, which is one of the reasons it is usually offloaded to dedicated circuitry. As shown below, however, there are software architecture alternatives that can lead to an efficient realization on an Intel architecture platform.
The LTE standard specifies the Turbo encoding scheme depicted in Figure 1.
The scheme implements a Parallel Concatenated Convolutional Code (PCCC) using two 8-state constituent encoders in parallel, and comprises an internal interleaver. For each input bit, 3 bits are generated at the output.
The relationship between the input bit index i and the output (interleaved) bit index π(i) is defined by the following expression:

π(i) = (f1 · i + f2 · i²) mod K

where K is the input block size in bits (188 possible values, ranging from 40 to 6144). The constants f1 and f2 are predetermined by the standard and depend solely on K.
At the cost of a slightly larger memory footprint (710 KB in total), it is possible to pre-generate the π(i) LUTs for every allowed value of K. When processing a single data frame, only the portion of the table corresponding to the current K is used (at most 12 KB).
Computing the permutation indexes at runtime would instead require 4 multiplications, 1 division, and 1 addition, for a total of 6 integer operations per bit.
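The direct (non-LUT) computation described above can be sketched as follows. This is a minimal illustration, not the measured implementation; the function name is ours, and 64-bit intermediates are used because for K = 6144 the product f2 · i² can exceed the 32-bit range.

```c
#include <stdint.h>

/* Direct computation of the QPP interleaver index for one bit position.
 * f1, f2, and k are the parameters selected by the current block size K.
 * The final % is the division that the LUT approach removes. */
static inline uint32_t qpp_index(uint32_t i, uint32_t f1, uint32_t f2,
                                 uint32_t k)
{
    uint64_t t = (uint64_t)f1 * i + (uint64_t)f2 * i * i;
    return (uint32_t)(t % k);
}
```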
Each convolutional encoder implements a finite state machine (FSM) that cannot be completely vectorized or parallelized due to its recursive nature.
In terms of complexity, the implementation of this state machine requires 4 XOR operations per bit per encoder. If all the possible state transitions are expanded and stored in a LUT, the number of operations is 8 per byte per encoder.
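The 4 XORs per bit can be seen in a scalar reference implementation of one constituent encoder. The sketch below uses the LTE generator polynomials g0 = 1 + D² + D³ (feedback) and g1 = 1 + D + D³ (forward); the function name and the one-bit-per-byte I/O convention are our own illustrative choices.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference for one 8-state constituent encoder (systematic output
 * omitted). r1..r3 are the shift-register contents; in[] and out[] hold
 * one bit per byte. Exactly four XORs are executed per input bit. */
void rsc_encode(const uint8_t *in, uint8_t *out, size_t nbits)
{
    uint8_t r1 = 0, r2 = 0, r3 = 0;
    for (size_t i = 0; i < nbits; i++) {
        uint8_t a = in[i] ^ r2 ^ r3;   /* feedback: 2 XORs   */
        out[i]    = a ^ r1 ^ r3;       /* parity:   2 XORs   */
        r3 = r2; r2 = r1; r1 = a;      /* shift the register */
    }
}
```

The recursion on a (the feedback bit) is what prevents full vectorization across consecutive bits, as noted above.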
Total Computational Requirements
For an input rate of 57 Mbit/s, the Turbo encoder requires 57.34 MOP (Million Integer Operations) for processing a single 10-ms frame.
Internal Interleaver Implementation
To allow parallelization and vectorization, the algorithm was modified by replacing the mod operation with a comparison test and a subtraction. The inter-sample dependence was also restructured to allow an 8-way vectorized implementation.
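The compare-and-subtract idea can be shown in scalar form. Because π(i+1) − π(i) = (f1 + f2(2i+1)) mod K, and that increment itself grows by 2·f2 each step, the whole permutation can be generated with additions and conditional subtractions only; since every operand stays below K, a single subtraction replaces each mod. This sketch (names and signature are ours) is the scalar recurrence; an 8-way SIMD version runs eight such recurrences side by side, each advancing with a stride-8 increment.

```c
#include <stdint.h>
#include <stddef.h>

/* Fill pi[0..k-1] with (f1*i + f2*i*i) mod k using no multiply or divide
 * inside the loop: the mod is replaced by a compare-and-subtract. */
void qpp_fill(uint16_t *pi, uint32_t k, uint32_t f1, uint32_t f2)
{
    uint32_t p = 0;                    /* pi(0) = 0                 */
    uint32_t g = (f1 + f2) % k;        /* g(0) = (f1 + f2) mod k    */
    uint32_t step = (2 * f2) % k;      /* per-step increment of g   */
    for (size_t i = 0; i < k; i++) {
        pi[i] = (uint16_t)p;
        p += g;
        if (p >= k) p -= k;            /* replaces p %= k           */
        g += step;
        if (g >= k) g -= k;            /* replaces g %= k           */
    }
}
```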
For the parallel implementation, each thread receives a portion of the input data stream within the 6144-bit maximum range. The results (in CPU cycles per input byte) for the reference system described below are given in Table 2; they are included in the overall Turbo encoder performance measurements presented later. As can be seen from Table 2, performance scales almost linearly with the number of threads.
Convolutional Encoder Implementation
The implementation for this block comprises two steps:
- Generate all possible FSM state transitions and output values for each input value, regardless of the current state of the FSM. Generation of the output values is done in parallel.
- From the results generated in the previous step, select the one that corresponds to the actual state. All other generated results are discarded. The two convolutional encoders operate in parallel during this step.
A LUT is used that stores the pre-computed transition matrix (for each possible value of the input and FSM state). The size of this LUT depends on the number of bits sent to the encoders in each iteration.
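The two steps above can be sketched in a small table-driven form. This illustrative configuration processes 8 input bits per iteration with an 8-state × 256-entry table (4 KB per encoder), far smaller than the 128-KB table discussed below; the names, the entry packing, and the table size are our assumptions, and the table is generated from the bit-level recursion of the constituent encoder (feedback 1 + D² + D³, forward 1 + D + D³).

```c
#include <stdint.h>
#include <stddef.h>

/* lut[state][byte] packs 8 parity bits (low byte) and the next FSM state
 * (bits 8..10). Generated once; thereafter one table read replaces eight
 * bit-level iterations. */
static uint16_t lut[8][256];

void lut_init(void)
{
    for (uint32_t s = 0; s < 8; s++)
        for (uint32_t b = 0; b < 256; b++) {
            uint32_t r1 = s & 1, r2 = (s >> 1) & 1, r3 = (s >> 2) & 1;
            uint32_t par = 0;
            for (int i = 0; i < 8; i++) {       /* bit-level recursion */
                uint32_t in = (b >> i) & 1;
                uint32_t a = in ^ r2 ^ r3;      /* feedback            */
                par |= (a ^ r1 ^ r3) << i;      /* parity bit i        */
                r3 = r2; r2 = r1; r1 = a;
            }
            lut[s][b] = (uint16_t)(((r3 << 2 | r2 << 1 | r1) << 8) | par);
        }
}

/* Encode nbytes of packed input bits; returns the final FSM state. */
uint32_t lut_encode(const uint8_t *in, uint8_t *out, size_t nbytes)
{
    uint32_t state = 0;
    for (size_t i = 0; i < nbytes; i++) {
        uint16_t e = lut[state][in[i]];
        out[i] = (uint8_t)(e & 0xff);
        state = e >> 8;
    }
    return state;
}
```

Widening the per-iteration input beyond 8 bits grows the table geometrically, which is the performance/memory tradeoff Table 3 quantifies.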
Table 3 shows the number of cycles it takes to encode the input stream, as well as the memory footprint, for each choice of bits per iteration. On Intel architecture, the large cache size allows a more flexible tradeoff between performance and memory usage; in this case, a 128-KB table is used.