A Simple and Efficient FFT Implementation in C++:
Part II

The specialization of short FFTs, FFT selection at runtime, and comparative benchmarks


May 25, 2007
URL: http://www.drdobbs.com/cpp/a-simple-and-efficient-fft-implementatio/199702312

In Part One of this article, I introduced a new efficient implementation of the Cooley-Tukey Fast Fourier Transform (FFT) algorithm, discussing recursion and the elimination of trigonometric function calls. In Part Two, I discuss the specialization of short FFTs, FFT selection at runtime, and present some comparative benchmarks and conclusions.

Specialization of Short FFTs

The implemented template class recursion has P levels. Every FFT computation runs from level P down to level 1, where level 1 is empty (Listing Three, in Part One of this article). Comprehensive books on the FFT, for example [4], show that short transforms (P=1,2,3,4) can be computed with fewer operations than the general algorithm requires. Such special cases are easily incorporated into the new implementation using partial specialization of the template class DanielsonLanczos. Listing Six shows these specializations for N=4 and N=2. Since every FFT recursion terminates in the first specialized case it reaches, these additional specializations yield a small overall performance improvement of about 1 to 5 percent.

Listing Six

template<typename T>
class DanielsonLanczos<4,T> {
public:
   void apply(T* data) {
      // First stage: length-2 butterfly on complex elements 0 and 1
      // (each complex value is stored as an interleaved re/im pair).
      T tr = data[2];
      T ti = data[3];
      data[2] = data[0]-tr;
      data[3] = data[1]-ti;
      data[0] += tr;
      data[1] += ti;
      // Butterfly on elements 2 and 3, with the twiddle factor -i of
      // the second stage already folded into the subtraction path.
      tr = data[6];
      ti = data[7];
      data[6] = data[5]-ti;
      data[7] = tr-data[4];
      data[4] += tr;
      data[5] += ti;

      // Second stage: butterflies on element pairs (0,2) and (1,3).
      tr = data[4];
      ti = data[5];
      data[4] = data[0]-tr;
      data[5] = data[1]-ti;
      data[0] += tr;
      data[1] += ti;
      tr = data[6];
      ti = data[7];
      data[6] = data[2]-tr;
      data[7] = data[3]-ti;
      data[2] += tr;
      data[3] += ti;
   }
};

template<typename T>
class DanielsonLanczos<2,T> {
public:
   void apply(T* data) {
      // A single butterfly: the complete length-2 transform.
      T tr = data[2];
      T ti = data[3];
      data[2] = data[0]-tr;
      data[3] = data[1]-ti;
      data[0] += tr;
      data[1] += ti;
   }
};

You might ask why, when programming in C++, I still use a C-style array instead of std::complex<T> or even std::vector<T>. The reason is that the C++ standard library is not well suited to high-performance computing, at least not the open-source distribution I have. A simple trace into the sources of the header <complex> makes clear that a simple operation on complex numbers, such as x=a+b, generates a temporary object of type std::complex<T>, resulting in the poor performance demonstrated by the dotted line 'CGFFT' in Figure 1 (see Part One of this article). This is a case where the expression templates technique could have been very helpful in the header <complex>, but it was not applied. Using std::complex<T> would make the code somewhat shorter, but the computational performance, which I am trying to maximize, would be lost.
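To make the difference concrete, here is the same length-2 butterfly written both ways (a minimal sketch for illustration; the function names are mine and not part of the GFFT sources):

#include <complex>

// With std::complex, operator- returns a temporary std::complex<T>
// object, which the optimizer must elide to match the raw version.
void butterfly_complex(std::complex<double>* d) {
   std::complex<double> t = d[1];
   d[1] = d[0] - t;     // temporary created here
   d[0] += t;
}

// The raw-array version used by the GFFT manipulates the interleaved
// re/im values directly and creates no objects at all.
void butterfly_raw(double* d) {
   double tr = d[2], ti = d[3];
   d[2] = d[0] - tr;
   d[3] = d[1] - ti;
   d[0] += tr;
   d[1] += ti;
}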

FFT Selection at Runtime

The new generic FFT implementation (GFFT) depends on the constant parameter P, which must be specified before compilation. What should you do if the length of the data is first known at runtime? Write a big switch? No, that would not really solve the problem. To bridge this gap between compile-time and runtime parameter definition, I apply an object factory, as described in Chapter 8 of Alexandrescu's Modern C++ Design [1]. The source code of the template class Factory in the header Factory.h is part of the Loki library supplementing the book, and is also included in the full source code of this article.

The idea is quite simple: all the template classes with varying template parameters are derived from a single base class, such as AbstractFFT in Listing Seven, where the necessary member functions are declared virtual. The base class is passed to GFFT as an additional template parameter FactoryPolicy. This also allows you to substitute an empty base class EmptyFFT without virtual function declarations and thus avoid the virtual call penalty; this is the default case when GFFT is used without the object factory. To conform to the object factory requirements, GFFT also receives a unique identification number id=P and a static object generation function Create().

Listing Seven

template<typename T>
class AbstractFFT {
public:
   virtual void fft(T*) = 0;
   // Virtual destructor, so that objects obtained from the factory
   // can be deleted safely through a base class pointer.
   virtual ~AbstractFFT() { }
};

class EmptyFFT { };

template<unsigned P, typename T=double,
class FactoryPolicy=EmptyFFT>
class GFFT:public FactoryPolicy {
   // ...
public:
   enum { id = P };
   static FactoryPolicy* Create() {
      return new GFFT<P,T,FactoryPolicy>();
   }
   // ...
};
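When the transform length is fixed at compile time, GFFT can also be used directly with its default EmptyFFT policy, so that no factory and no virtual call are involved (a short sketch; it assumes the member function fft used in the runtime example below):

    double data[2048];   // 1024 complex points as interleaved re/im values
    GFFT<10> gfft;       // P=10 fixed at compile time, base class EmptyFFT
    gfft.fft(data);      // statically bound call, no virtual dispatch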

Each GFFT template class that should be available at runtime is registered in the object factory under its identification number using the FactoryInit template metaprogram from Listing Eight. The initializer receives the template classes in the form of a Typelist ([1], Chapter 3). The template metaprogram GFFTList (Listing Eight) constructs such a Typelist, given an FFT template class (e.g., GFFT) as well as a begin and end value of P. The creation of the FFT object factory in a program thus looks like this:

Loki::Factory<AbstractFFT<double>,unsigned int> gfft_factory;

The new factory bears information about the base class, but is still empty. The metaprograms GFFTList and FactoryInit add all the necessary GFFT template classes to the factory in a single line:

    FactoryInit<GFFTList<GFFT,10,27>
                ::Result>::apply(gfft_factory);

The factory now contains the GFFT implementations for P=10 through P=26 (the end value passed to GFFTList is exclusive, as its terminating specialization in Listing Eight shows). To create the needed object instance on demand at runtime and run the FFT, write:

    AbstractFFT<double>* gfft = gfft_factory.CreateObject(P);
    gfft->fft(data);
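Putting the pieces together, a complete runtime selection flow might look like this sketch (the driver function and the assumption that n is a power of two within the registered range are mine, for illustration):

// The transform length n = 2^P is known only at runtime.
void runtime_fft(double* data, unsigned long n) {
   // Deduce P from n.
   unsigned int P = 0;
   while ((1UL << P) < n) ++P;

   // Build the factory once and register GFFT<10> ... GFFT<26>.
   static Loki::Factory<AbstractFFT<double>,unsigned int> gfft_factory;
   static bool initialized = false;
   if (!initialized) {
      FactoryInit<GFFTList<GFFT,10,27>::Result>::apply(gfft_factory);
      initialized = true;
   }

   // Create the matching transform object and run it on the data,
   // which holds n complex values as interleaved re/im pairs.
   AbstractFFT<double>* gfft = gfft_factory.CreateObject(P);
   gfft->fft(data);
   delete gfft;
}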

Listing Eight

// Walks a Typelist and registers each FFT class with the factory
// under its identification number.
template<class TList>
struct FactoryInit;

template<class H, class T>
struct FactoryInit<Loki::Typelist<H,T> > {
   template<class Fact>
   static void apply(Fact& f) {
      f.Register(H::id,H::Create);
      FactoryInit<T>::apply(f);   // recurse on the tail of the list
   }
};

template<>
struct FactoryInit<Loki::NullType> {
   template<class Fact>
   static void apply(Fact&) { }   // end of the Typelist
};

// Builds the Typelist FFT<Begin>, FFT<Begin+1>, ..., FFT<End-1>;
// note that End itself is not included.
template<
template<unsigned,class,class> class FFT,
unsigned Begin, unsigned End,
typename T=double,
class FactoryPolicy=AbstractFFT<T> >
struct GFFTList {
   typedef Loki::Typelist<FFT<Begin,T,FactoryPolicy>,
           typename GFFTList<FFT,Begin+1,End,T,
                    FactoryPolicy>::Result> Result;
};

// Recursion end: Begin has reached End.
template<
template<unsigned,class,class> class FFT,
unsigned End, typename T, class FactoryPolicy>
struct GFFTList<FFT,End,End,T,FactoryPolicy> {
   typedef Loki::NullType Result;
};

As a result, the GFFT obtains a kind of policy-driven design, which makes the implementation very flexible with respect to its base class. The same procedure can be applied to other independent FFT components, for example, to the element reindexing (the function scramble) and to the Danielson-Lanczos section.

The element reindexing could also be implemented using recursive template classes. But to preserve only one template class definition per recursion level, the handled indexes (original and bit-reversed) would have to be non-constant and passed as parameters to some member function. Tests of such an implementation showed the same performance, so there is no reason to burden the compiler with another template recursion and waste compile time.
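For reference, here is a minimal runtime version of this bit-reversal reordering, in the spirit of the scramble function from Part One (a sketch; data holds N complex values as 2N interleaved reals, N a power of two):

#include <algorithm>   // std::swap

template<typename T>
void scramble(T* data, unsigned long N) {
   unsigned long j = 1;
   for (unsigned long i = 1; i < 2*N; i += 2) {
      if (j > i) {                          // swap the complex elements
         std::swap(data[j-1], data[i-1]);   // at positions (i-1)/2 and
         std::swap(data[j], data[i]);       // (j-1)/2
      }
      unsigned long m = N;           // advance j to the next
      while (m >= 2 && j > m) {      // bit-reversed index
         j -= m;
         m >>= 1;
      }
      j += m;
   }
}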

Comparative Benchmarks and Conclusions

The definition of the GFFT object factory means instantiation of all needed GFFT template classes during compilation, each comprising O(P) recursive template classes. A reasonable question: how long does the compilation take? GNU C++ 4.x, which was used to compile the performance tests shown in Figures 1 and 2, took about 10 seconds with optimization flags for the full range of P values in the object factory, because the overall number of distinct templates to be instantiated stays small. Microsoft Visual C++ 2003 and Intel C++ 8.x and later showed similar results, although not all C++ compilers handle the recursive templates of the GFFT this efficiently. I suppose some older compilers try to instantiate template classes anew for each member function call, even if the template classes are the same; there are O(2^P) such member function calls in the GFFT. As a result, the compilation hangs and finally fails once all memory has been exhausted. The unlucky examples are IBM C++ 6 (Visual Age) and GNU C++ 3.x, but not GNU C++ 2.96! These compiled the GFFT object factory quite fast without optimization options, but hung with optimization turned on.

A nice property of the GFFT that distinguishes it from other well-known FFT libraries is that no runtime computation of the roots of unity using sine and cosine functions is needed any more. The usual FFT implementations in libraries like FFTW and Intel MKL proceed in two steps:

    "Planning" including calculation of roots of unity Computing of FFT.

The second step can be repeated for different data of the same length. The GFFT omits the first step entirely, without any additional penalty. Figure 2 shows benchmarks of the GFFT against FFTW with the options FFTW_ESTIMATE and FFTW_MEASURE, as well as against the routine zfft1dc from Intel MKL 7.0, where the total CPU time of both steps was measured. While very good for small P, the performance of FFTW becomes poor for large P. Only the hardware-optimized Intel MKL achieves high performance comparable to the GFFT. Qualitatively the same benchmark results were obtained on an Itanium2 processor with the Intel C++ compiler.

Finally, I would like to draw a general conclusion about the differences between the GFFT and traditional implementations. What made the GFFT so efficient? The essential step was the assumption that P is a static constant. The purpose was not simply to play with template metaprogramming, but to give additional information to the compiler, so that it could optimize the code better. Indeed, many numerical algorithms have integer parameters that vary within a small range (like P in the FFT). Such a parameter can be assumed static and exploited in template metaprogramming as far as possible, taking care not to overload the compiler with too large a number of template classes to instantiate. The final implementation can be more efficient because the compiler receives more information and therefore more opportunities to optimize the code. You can use such an implementation directly with a static parameter, or compile it for all possible or needed parameter values within an object factory and use it as a library. The latter means you can select highly optimized pieces of code at runtime without dynamic binding inside the computation itself.

This article has described a C++ implementation technique using the example of the simple, classical Cooley-Tukey FFT algorithm. There are, of course, many advanced and more complicated FFT algorithms, such as the Winograd FFT, parallelized FFTs, and the like. These modern FFT approaches can also take advantage of template recursion and metaprogramming to improve performance significantly.


Figure 2a: Comparative benchmark, part 1.


Figure 2b: Comparative benchmark, part 2.

References

1. A. Alexandrescu. Modern C++ Design. Addison-Wesley, 2001.
2. J.W. Cooley, J.W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comp., Vol. 19, 1965, pp. 297-301.
3. R. Crandall, C. Pomerance. Prime Numbers: A Computational Perspective. Springer, 2005.
4. H.J. Nussbaumer. Fast Fourier Transform and Convolution Algorithms. Springer, Berlin, 1981.
5. W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery. Numerical Recipes in C++. Cambridge University Press, 2002.
6. T. Veldhuizen. Fast Fourier Transform (FFT) implemented using Template Metaprograms. http://www.oonumerics.org/blitz/examples/fft.html
7. T. Veldhuizen. Using C++ template metaprograms. C++ Report, Vol. 7, No. 4 (May 1995), pp. 36-43. http://osl.iu.edu/~tveldhui/papers/Template-Metaprograms/meta-art.html


Volodymyr teaches at Brandenburg University of Technology, Cottbus, Germany. He can be reached at [email protected].

Read Part I of this article here
