Channels ▼
RSS

Mobile

A Simple and Efficient FFT Implementation in C++:
Part I


Elimination of Trigonometric Function Calls

The original implementation in Listing One contains a nice property of roots of unity: their recurrence calculation. Listing Four is the next implementation step with the same recurrence formula. Computing only two sine functions and starting with 1, the next roots (wr,wi) are calculated recurrently from (wpr,wpi): (wr,wi)+=(wr,wi)*(wpr,wpi). The end-specialization of DanielsonLanczos template class stays unchanged.

Listing Four

template<unsigned N, typename T=double>
class DanielsonLanczos {
   DanielsonLanczos<N/2,T> next;
public:
   void apply(T* data) {
      next.apply(data);
      next.apply(data+N);

      T wtemp,tempr,tempi,wr,wi,wpr,wpi;
      wtemp = sin(M_PI/N);
      wpr = -2.0*wtemp*wtemp;
      wpi = -sin(2*M_PI/N);
      wr = 1.0;
      wi = 0.0;
      for (unsigned i=0; i<N; i+=2) {
        tempr = data[i+N]*wr - data[i+N+1]*wi;
        tempi = data[i+N]*wi + data[i+N+1]*wr;
        data[i+N] = data[i]-tempr;
        data[i+N+1] = data[i+1]-tempi;
        data[i] += tempr;
        data[i+1] += tempi;

        wtemp = wr;
        wr += wr*wpr - wi*wpi;
        wi += wi*wpr + wtemp*wpi;
      }
   }
};

Listing Four now represents a C++ version of the C-style algorithm from Listing One. The latter includes three loops with dependent indexes, which have been exchanged with template class recursion and a single loop in the C++ version. Although they are very similar, the tests show different performance on both 32-bit (Figure 1a) and 64-bit (Figure 1b) processors. The dashed line "CT" is the performance of the original C-style implementation. The line "GFFT1" corresponds to the implementation in Listing Three, and "GFFT2" to that in Listing Four. "GFFT" denotes the final generic FFT implementation, which will be reached in the second article in this series, after some improvements. Performance of the original CT-implementation is pretty high until some data length: P=14 on P4 Xeon and P=16 on AMD Opteron. For bigger P, the CPU time grows faster and the C-style implementation becomes very inefficient compared to the recursive templates implementation. What are these "magic" numbers? These powers of two correspond to the data length smaller than the L2 cache memory of the workstations: 512 KB on P4 Xeon and 2 MB on AMD Opteron. It means, for the next power of two, the data does not fit the cache and performance falls. This is not the case in the new implementation because it contains only one loop of constant length and compiler can better unroll the code.

[Click image to view at full size]

Figure 1a: Performance on 32-bit processors.

[Click image to view at full size]

Figure 1b: Performance on 64-bit processors.

The essential reduction of trigonometric function calculations from 2*2P to 2P made the FFT about four times faster. Moreover, it opens a nice observation, that all 2P sine functions receive static constants M_PI/N and 2*M_PI/N, which have been already known at compile-time. Therefore, the corresponding sine function values could be known at compile-time as well. Let the C++ compiler compute them! An implementation of such template metaprogramming is not new—Todd Veldhuizen described it in the articles [6,7]. The sine values are calculated using series expansion in a static member function of the template class SinCosSeries, outlined in Listing Five. All necessary arguments are template parameters. Defining a template metaprogram SinCosSeries to compute the series from member M to N, I can write compile-time Sin(A*M_PI/B) and Cos(A*M_PI/B) functions for both single and double precision data and substitute their "calls" in two lines of Listing Four containing sin():

      wtemp = -Sin<N,1,T>::value();

and

      wpi = -Sin<N,2,T>::value();

Actually, compile-time functions are not called. Their values are evaluated at compile-time and stored in the code as static constants. It means complete elimination of runtime trigonometric functions and 20%-60% additional performance.

Listing Five

template<unsigned M, unsigned N, unsigned B, unsigned A>
struct SinCosSeries {
   static double value() {
      return 1-(A*M_PI/B)*(A*M_PI/B)/M/(M+1)
               *SinCosSeries<M+2,N,B,A>::value();
   }
};

template<unsigned N, unsigned B, unsigned A>
struct SinCosSeries<N,N,B,A> {
   static double value() { return 1.; }
};

template<unsigned B, unsigned A, typename T=double>
struct Sin;

template<unsigned B, unsigned A>
struct Sin<B,A,float> {
   static float value() {
      return (A*M_PI/B)*SinCosSeries<2,24,B,A>::value();
   }
};
template<unsigned B, unsigned A>
struct Sin<B,A,double> {
   static double value() {
      return (A*M_PI/B)*SinCosSeries<2,34,B,A>::value();
   }
};

template<unsigned B, unsigned A, typename T=double>
struct Cos;

template<unsigned B, unsigned A>
struct Cos<B,A,float> {
   static float value() {
      return SinCosSeries<1,23,B,A>::value();
   }
};
template<unsigned B, unsigned A>
struct Cos<B,A,double> {
   static double value() {
      return SinCosSeries<1,33,B,A>::value();
   }
};

In part two of this article, I will discuss the specialization of short FFTs, FFT selection at runtime, and present some comparative benchmarks and conclusions.

References

1. A. Alexandrescu. Modern C++ Design. Addison-Wesley, 2001.
2. J.W. Cooley, J.W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comp. Vol. 19, 1965, pp. 297-301.
3. R. Crandall, C. Pomerance. Prime Numbers: A computational perspective. Springer, 2005.
4. H.J. Nussbaumer. Fast Fourier transform and convolution algorithms. Springer, Berlin, 1981.
5. W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P.Flannery. Numerical Recipes in C++. Cambridge university press, 2002.
6. T. Veldhuizen. Fast Fourier Transform (FFT) implemented using Template Metaprograms. http://www.oonumerics.org/blitz/examples/fft.html
7. T. Veldhuizen. Using C++ template metaprograms, C++ Report, Vol. 7 No. 4 (May 1995), pp. 36-43. http://osl.iu.edu/~tveldhui/papers/Template-Metaprograms/meta-art.html


Vlodymyr teaches at Brandenburg University of Technology, Cottbus, Germany. He can be reached at myrnyy@math.tu-cottbus.de.

Read Part II here.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video