In seismic processing software, wave migration algorithms often incorporate FFT as part of the solution. This algorithmic technique (the FFT approach) usually takes the following sequence: Apply the FFT, phase shift the data, and then apply the inverse FFT to obtain the next location of the traveling wave front. When this is computed on a processor via software, data is taken out of memory to compute the FFT, then the result is put back into memory. The resulting data is taken out of memory again to do the phase shift and the new result is put back into the memory. Lastly, the new result data is taken out so the inverse FFT can be computed and the final result is put back into memory.

In such a linear execution, there are too many memory accesses that can slow down the execution. Using hardware to accelerate the algorithm, begin by taking the data out of memory and feeding it to a pipeline that executes the FFT and feeds the result into the phase shifter. The phase shifter then directly feeds its result into the inverse FFT. The output of the inverse FFT is then put back into memory.

A typical application-level profile for a wave migration algorithm might look like:

- FFT 35%
- Inverse FFT 35%
- Phase Shift 25%
- I/O 3%
- Misc 2%

Traditionally, these algorithms are executed using floating-point math operations because the input data is gathered from 24-bit sensors and the output is often displayed as 24-bit color pixels. By implementing multiple floating-point units in an FPGA, the FPGA can outperform a CPU 5:1 on floating-point code, but can deliver a 10-100 time improvement on integer-based code.

Next, see if the code fits into the hardware. Take a look around the Web to find a variety of floating-point and integer FFT algorithms. Floating-point versions are larger and slower, but deliver more precise results. One example, a hybrid approach, can compute a 2Kpoint FFT at a rate of 400 Msamples/s (www.andraka.com/V4_FP_fft.htm). This is 4.5 times faster than a 2.2-GHz dual-core Opteron (www.fftw.org/speed/opteron-2.2GHz-64bit). Three such FFT engines can fit in a Xilinx SX55. A CORDIC algorithm can be used to do the phase shift, because implementing such an algorithm does not consume many of the FPGA's logic elements—less than 1000 look-up tables. Moreover, the computations can be pipelined to improve throughput.

While you may be impressed with speeds 4.5 times faster, you might still think that you can buy a 4-way Opteron-based system for the same price as an FPGA-based coprocessor with a dual-core Opteron to achieve the same performance gains. But because the functional units will be pipelined in an FPGA-based coprocessor, the phase shift and the inverse FFT come with the cost of some latency until the pipeline is filled—a few hundred nanoseconds. As a result, you achieve a speedup for the phase shift and the inverse FFT for free. The overall throughput improvement can then be estimated as follows:

4.5 + 4.5 + [(25/35)*4.5] = 12.2x (the overall speedup for the pieces we put in hardware)

The calculation *[(25/35)*4.5]* is an estimate of the performance of the phase shift and is based on the phase shift using similar math to the FFT, and the 25/35 is the proportion of the time taken on the CPU for the FFT.

Using Amdahl's Law, P equals the percent that can be parallelized and S is the speedup of that portion. Using this formula you find that:

1/(1-P)+(P/S) = 1/[(1-0.95) + (0.95/12.2)] = 7.8x application speedup