Most of the software written today involves instructions executed in sequence. To speed up execution, programmers have typically pushed hardware designers to build processors with increasingly higher clock rates, resulting in heavily pipelined processors that operate at clock rates of 3 GHz and more. To get the most out of every clock cycle, these processors resort to tricks such as large caches and out-of-order execution.
However, faster processors generate lots of heat. As a result, clock speeds have, for the most part, leveled off since the heat generated by faster circuits ends up constraining clock speeds. To continue the march towards faster execution, hardware designers have switched from single processors on a chip to dual, quad, and even more CPU cores on a single chip. The operating system then allocates processors to different applications, all running in parallel. The next challenge is to find ways to parallelize the code running in each application, then run that parallelized code on multiple engines either within the CPU or in a companion coprocessor that's optimized to execute that particular segment of parallelized code.
Hardware designers have turned to the latest generation of field programmable gate arrays (FPGAs) and the new open-standard Torrenza coprocessor interface on Opteron system platforms defined by Advanced Micro Devices, and the Intel QuickAssist Technology Acceleration Abstraction Layer (AAL) for the hardware portion of the parallelization goal. By downloading configurations into an FPGA tied to either a Torrenza or QuickAssist motherboard platform, designers can accelerate computationally complex algorithms such as encryption, compression, search, and sort up to 1000 times over general-purpose processors. Also, any algorithm that needs billions of integer or floating-point operations per second (image and audio processing, seismic exploration, bioinformatics, and the like) can be accelerated by an FPGA-based coprocessor.
To accelerate an algorithm, you must first identify the code within the application that can be parallelized, then figure out how to parallelize that code so it can be executed on an array of computational elements configured in an FPGA or ASIC. But the question is where to start. The logical starting point is to first profile the code to find its computationally intensive portions, then find ways to isolate the code so that it does not have many data dependencies. Once the code has been isolated, you need to find ways to optimize it so that it can be executed on the resources available on the coprocessor. It is difficult to optimize the code to fit on architectures like graphics-processing units and the Cell processor, where users are not given all the data for the device. FPGAs, on the other hand, make it easier for you to define an optimized architecture on a case-by-case basis.
The next step to accelerating an algorithm is to examine the communication channels between hardware and software. Make sure the hardware is not data starved (not enough data reaches the hardware to keep the hardware continually performing its computations), and at the same time, ensure that the hardware can keep up with the pace at which the software feeds data to the hardware. Finding the right balance in the hardware/software partitioning between system inputs and outputs is critical to smooth operation. If it takes longer to transmit data to the accelerator than it takes the CPU to calculate the answer, make sure that the hardware design uses all memory accesses most efficiently. You may even consider adding another memory port to feed more data to the compute array.
For example, if you are accelerating a fast-Fourier transform (FFT) followed by a phase shift and then followed by an inverse FFT in the software algorithm, you take the data out of a memory location, perform the FFT, and then put data back into memory. The data then comes out of the memory again and the algorithm executes the phase shift computations and places the data back into memory one more time. Finally, the inverse FFT retrieves the data, executes the algorithm, and then places the final result back in memory. Main memory accesses are expensive, so reusing data can dramatically improve performance by eliminating multiple store and access operations.
When performing numerical calculations, you should track the precision of the variables that really need to be implementedyou may not always need the full precision of single- or double-precision floating-point data types for every variable. A fixed-point multiplier and arithmetic unit combination usually requires less logic and, thus, more elements can be configured on an FPGA. The more elements, the more computations per cycle, and the faster the algorithm can execute. In FPGAs, integer or fixed-point math can often run 10 to 100 times faster than floating-point computations.
To parallelize the algorithms, you must first perform a data-flow, which helps you understand how data moves between the different logic and computational elements. Next, execute a latency analysis to determine where potential bottlenecks may occur and then find a balance between desired performance and the cost of implementing the design. To achieve a desired performance level, you use PCI Express or HyperTransport to provide high-bandwidth, low-latency communication channels.
Last, scour the literature for tricks that can accelerate computationssome tricks may actually be decades old but were impractical before the availability of multimegagate FPGAs. Consider different types of math for computations aside from basic integer and floating-point operations. Logarithmic computations or math based on residue number systems, for example, could execute faster than standard integer or floating-point-based code.