Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

# A Simple and Efficient FFT Implementation in C++: Part II

In Part One of this article, I introduced a new efficient implementation of the Cooley-Tukey Fast Fourier Transform (FFT) algorithm, discussing recursion and the elimination of trigonometric function calls. In Part Two, I discuss the specialization of short FFTs, FFT selection at runtime, and present some comparative benchmarks and conclusions.

### Specialization of Short FFTs

The implemented template class recursion has P levels. Every FFT calculation process runs from level P to level 1, while the level 1 is empty (Listing Three, in Part One of this article). Some comprehensive books on FFT, for example [4], show that short length FFTs (P=1,2,3,4) could use fewer operations than the general algorithm. Such particular cases can be simply incorporated into the new implementation using partial specialization of the template class `DanielsonLanczos`. Listing Six represents those specializations for N=4 and N=2. Since every FFT goes down to the first specialized one, these additional specializations lead to a small overall performance improvement of about 1-5 percent.

Listing Six

```>
template<typename T>
class DanielsonLanczos<4,T> {
public:
void apply(T* data) {
T tr = data[2];
T ti = data[3];
data[2] = data[0]-tr;
data[3] = data[1]-ti;
data[0] += tr;
data[1] += ti;
tr = data[6];
ti = data[7];
data[6] = data[5]-ti;
data[7] = tr-data[4];
data[4] += tr;
data[5] += ti;

tr = data[4];
ti = data[5];
data[4] = data[0]-tr;
data[5] = data[1]-ti;
data[0] += tr;
data[1] += ti;
tr = data[6];
ti = data[7];
data[6] = data[2]-tr;
data[7] = data[3]-ti;
data[2] += tr;
data[3] += ti;
}
};

template<typename T>
class DanielsonLanczos<2,T> {
public:
void apply(T* data) {
T tr = data[2];
T ti = data[3];
data[2] = data[0]-tr;
data[3] = data[1]-ti;
data[0] += tr;
data[1] += ti;
}
};
```

You might ask why, when programming in C++, am I still using a C-style array instead of `std::complex<T>` or even `std::vector<T>`? Because the C++ standard library is not suited to high-performance computing, at least its open source distribution that I have. A simple trace into the sources of the header complex makes clear, that a simple operation on complex numbers like x=a+b written in C++ generates a temporary object of type `std::complex<T>` resulting in poor performance demonstrated by the dotted line 'CGFFT' on Figure 1 (see Part One of this article). This is an example where the expression templates technique in the header {complex} could be very helpful, but was not applied. Application of `std::complex<T>` could result in some shorter code, but the computational performance, which I try to maximize, would be lost.

### More Insights

 To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.