### Overall Performance Results for the Complete Turbo Encoder

From Tables 8 and 9, the internal interleaver takes 4.99 cycles per byte using 4 independent threads for an input block size of 6144 bits, while the encoder, which uses 2 threads, takes 4.76 cycles per byte. As there is no inter-block dependence, it is possible to run two encoders in parallel on the reference platform: dual Intel Core i7 processors at 2112 MHz (8 MB cache per CPU, 1.5 GB DDR3-800 per CPU), 64-bit CentOS 5.0, Intel C++ Compiler 10.0.0.64, and an 80 GB Samsung 5400-rpm hard disk.

As a result, a 10-ms frame (57 Mbps) is encoded in 159.1 microseconds, corresponding to a total CPU usage of 1.59 percent.
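The relationship between the quoted processing time and CPU usage is simple to verify: usage is the busy time divided by the frame period. A trivial helper (illustrative only, not code from the article):

```c
/* CPU usage implied by a per-frame processing time:
   busy time over frame period, as a percentage. */
static double cpu_usage_pct(double busy_us, double frame_ms)
{
    return busy_us / (frame_ms * 1000.0) * 100.0;
}
/* cpu_usage_pct(159.1, 10.0) is about 1.59, the figure quoted above */
```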

**Channel Estimation**

In the next generation of mobile wireless standards, estimation of the channel characteristics is necessary to provide high data throughputs. LTE includes a number of reference signals in its data frame that are used to compute the estimate, as shown in Figure 2.

These reference signals are sent every six subcarriers, with alternating frequency offsets, on the first and fourth OFDM symbols of each slot; two channel estimations are therefore computed per slot.

The estimation is a time average of the current reference frame and the five previous ones, which reduces noise distortion.

Figure 3 shows a high-level view of the channel estimator, comprising a complex reciprocal operation (rcp(z)), a complex multiplication for each set of reference values, an averaging operator (∑), and a polyphase interpolator (H(z)).

In terms of computational complexity per sample:

- Reciprocal calculation: 6 multiplications, 1 division and 1 addition.
- Complex multiplication: 4 multiplications and 2 additions.
- Averaging operation: 6 additions and 1 multiplication.
- Polyphase interpolator: 6 multiplications and 3 additions.
- Total number of operations: 30.

For a full 10-ms, 4x4 MIMO, 20-MHz frame, the algorithm computes 120 channel estimations, using only 340 samples per frame. Multiplying by the 30 operations per sample gives a total of 1.224 MFLOP per frame.
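The operation tally and the per-frame total can be checked directly from the counts listed above (illustrative arithmetic only; all numbers come from the text):

```c
/* Tally the per-sample operation counts listed above and
   scale to a full frame: 120 estimations x 340 samples x ops/sample. */
static long ops_per_frame(int *per_sample)
{
    int rcp    = 6 + 1 + 1;  /* reciprocal: 6 mul, 1 div, 1 add */
    int cmul   = 4 + 2;      /* complex multiplication */
    int avg    = 6 + 1;      /* averaging */
    int interp = 6 + 3;      /* polyphase interpolator */
    *per_sample = rcp + cmul + avg + interp;   /* sums to 30 */
    return 120L * 340L * (long)*per_sample;    /* 1,224,000 = 1.224 MFLOP */
}
```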

**Implementation**

The input data parameters are assumed as described in Table 4.

Only the complex multiplications and reciprocals are computed in floating point. Reciprocals in particular are implemented with SSE intrinsics for higher throughput. The performance results, in CPU cycles per reference input sample, are presented in Table 5.
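The article does not include source code, but a complex reciprocal of this kind can be sketched with SSE intrinsics as follows. It uses 1/z = conj(z)/|z|², with the magnitude-squared inverted via the fast `_mm_rcp_ps` approximation plus one Newton-Raphson refinement step; the helper name, split real/imaginary data layout, and four-samples-at-a-time granularity are assumptions for illustration:

```c
#include <xmmintrin.h>

/* Approximate 1/z for four complex samples at once, z = re + j*im.
   1/z = conj(z)/|z|^2; |z|^2 is inverted with _mm_rcp_ps (~12-bit
   accuracy) followed by one Newton-Raphson step x' = x*(2 - d*x). */
static void complex_rcp4(const float *re, const float *im,
                         float *out_re, float *out_im)
{
    __m128 r = _mm_loadu_ps(re);
    __m128 i = _mm_loadu_ps(im);

    /* |z|^2 = re^2 + im^2 */
    __m128 mag2 = _mm_add_ps(_mm_mul_ps(r, r), _mm_mul_ps(i, i));

    /* fast approximate reciprocal, then one refinement step */
    __m128 x = _mm_rcp_ps(mag2);
    x = _mm_mul_ps(x, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(mag2, x)));

    /* conj(z) * (1/|z|^2) */
    _mm_storeu_ps(out_re, _mm_mul_ps(r, x));
    _mm_storeu_ps(out_im, _mm_mul_ps(_mm_sub_ps(_mm_setzero_ps(), i), x));
}
```

`_mm_rcp_ps` trades exactness for speed, which is why a refinement step is worthwhile: one Newton-Raphson iteration roughly doubles the bits of precision at the cost of two multiplies and a subtract.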

For a 10-ms frame, assigning two cores per MIMO channel on our reference system (dual Intel Core i7 processors at 2112 MHz, 8 MB cache per CPU, 1.5 GB DDR3-800 per CPU, 64-bit CentOS 5.0, Intel C++ Compiler 10.0.0.64, 80 GB Samsung 5400-rpm hard disk), each thread computes 20 estimations per frame, resulting in 47.2 microseconds of processing time per frame and a total CPU usage of 0.48 percent.

### Overall Turbo Encoder and Channel Estimation Performance

Table 6 summarizes the performance results of the Intel architecture implementation for both algorithms. The first column states the computational complexity of the algorithm in terms of millions of (floating-point) operations per frame. The second shows the actual time taken by our reference system to process the data (using the 8 cores available). The final column is the total CPU usage for processing the 57 Mbps data stream.

While the actual partitioning of the system will depend on the amount of baseband processing offloaded and/or the throughput required, the results show that it is possible to move several portions of the baseband processing onto an Intel architecture-based platform.

### Conclusions

Modern Intel general-purpose processors incorporate a number of features of real value to DSP algorithm developers, including high clock speeds, large on-chip memory caches, and multi-issue SIMD vector processing units. Their multiple cores are often an advantage in highly parallelizable DSP workloads, and software engineers can write applications at whatever level of abstraction makes sense: they can use higher-level languages and take advantage of the compiler's automatic vectorization features, optimize further by linking in Intel IPP and MKL functions, and, where certain areas require it, use SSE intrinsics or the rich and growing set of specialized SSE and other assembly instructions directly. The wireless infrastructure study we have summarized indicates that current Intel architecture processors may now be suitable for a surprising amount of intensive DSP work.

*For more information, see the Intel Technology Journal (March 2009) issue "Advances in Embedded Systems Technology" (intel.com/technology/itj).*