In Part One of this article, I introduced a new efficient implementation of the Cooley-Tukey Fast Fourier Transform (FFT) algorithm, discussing recursion and the elimination of trigonometric function calls. In Part Two, I discuss the specialization of short FFTs, FFT selection at runtime, and present some comparative benchmarks and conclusions.

### Specialization of Short FFTs

The implemented template class recursion has P levels. Every FFT calculation process runs from level P to level 1, while the level 1 is empty (Listing Three, in Part One of this article). Some comprehensive books on FFT, for example [4], show that short length FFTs (P=1,2,3,4) could use fewer operations than the general algorithm. Such particular cases can be simply incorporated into the new implementation using partial specialization of the template class `DanielsonLanczos`

. Listing Six represents those specializations for N=4 and N=2. Since every FFT goes down to the first specialized one, these additional specializations lead to a small overall performance improvement of about 1-5 percent.

**Listing Six**

> template<typename T> class DanielsonLanczos<4,T> { public: void apply(T* data) { T tr = data[2]; T ti = data[3]; data[2] = data[0]-tr; data[3] = data[1]-ti; data[0] += tr; data[1] += ti; tr = data[6]; ti = data[7]; data[6] = data[5]-ti; data[7] = tr-data[4]; data[4] += tr; data[5] += ti; tr = data[4]; ti = data[5]; data[4] = data[0]-tr; data[5] = data[1]-ti; data[0] += tr; data[1] += ti; tr = data[6]; ti = data[7]; data[6] = data[2]-tr; data[7] = data[3]-ti; data[2] += tr; data[3] += ti; } }; template<typename T> class DanielsonLanczos<2,T> { public: void apply(T* data) { T tr = data[2]; T ti = data[3]; data[2] = data[0]-tr; data[3] = data[1]-ti; data[0] += tr; data[1] += ti; } };

You might ask why, when programming in C++, am I still using a C-style array instead of `std::complex<T>`

or even `std::vector<T>`

? Because the C++ standard library is not suited to high-performance computing, at least its open source distribution that I have. A simple trace into the sources of the header complex makes clear, that a simple operation on complex numbers like x=a+b written in C++ generates a temporary object of type `std::complex<T>`

resulting in poor performance demonstrated by the dotted line 'CGFFT' on Figure 1 (see Part One of this article). This is an example where the expression templates technique in the header {complex} could be very helpful, but was not applied. Application of `std::complex<T>`

could result in some shorter code, but the computational performance, which I try to maximize, would be lost.