Web Development

SA-C: Single Assignment C

By Willem Bohm, Ross Beveridge, Bruce Draper, Charles Ross and and Monica Chawathe, May 01, 2003

SA-C, a high-level, C-like language with single assignment semantics, was designed for building applications that run on reconfigurable computing systems, such as Field Programmable Gate Arrays (FPGAs).

May03: Programmer's Toolchest

Willem, Ross, and Bruce are faculty members and Charles and Monica students in the computer science department at Colorado State University. They can be contacted at http://www.cs.colostate.edu/cameron/contact.html.

SA-C is a high-level, C-like language for creating applications that run on reconfigurable computing systems. Developed as part of the Cameron Project (http://www.cs.colostate.edu/cameron/), SA-C has single assignment semantics; that is, a variable is assigned a value exactly once—hence, its name "Single Assignment C" (SA-C). This replaces C's memory-based model with a functional, dataflow model. An SA-C program can be viewed as a graph where nodes correspond to operators, and edges to data paths. Dataflow graphs are ideal (data driven, timeless) abstractions for hardware circuits.

The most important aspect of SA-C is its treatment of for loops and their close interaction with arrays. A loop has three parts: one or more generators, a loop body, and one or more return values. The generators provide parallel array access operators that are easy for the compiler to analyze. In particular, window generators allow rectangular subarrays to be extracted from a source array. All possible subarrays of the specified size are produced. A loop can return arrays and reductions built from values that are produced in the loop iterations. SA-C provides integers and fixed-point numbers, both signed and unsigned. All of these types have user-defined width and precision.

In this article, we use SA-C to implement an image-processing application called "Probing" that runs on a reconfigurable system built on Field Programmable Gate Arrays (FPGAs)—chips made of large arrays of configurable logic blocks (CLBs) connected by programmable wires. In most FPGAs, the perimeter of the chip has I/O cells that interface with the external pins. Multiple FPGAs can be packaged with high-speed RAM on coprocessor boards. FPGAs are typically programmed using hardware description languages (such as VHDL) that most application programmers aren't familiar with.

SA-C is not the first or only high-level FPGA project. Others include:

Handel-C (http://www.iis.ee.ic.ac.uk/~frank/surp99/article1/amag97/), which extends C to express bitwidths, timing, and parallelism, and limits the language to exclude C features that do not lend themselves to hardware translation.
Streams-C (http://rcc.lanl.gov/Tools/Streams-C/), which focuses on extensions to C that facilitate expressing communication between parallel processes.
SystemC (http://www.systemc.org/), which provides C++ class libraries to add the functionality required for RCS programming to an existing language.

The SA-C compiler performs both conventional and FPGA-specific optimizations. The functional semantics of SA-C make analysis and optimizations easy due to the lack of side effects. One optimization is the full unrolling of loops, which is important when generating code for FPGAs because it spreads loop iterations in the FPGA area rather than in time. The SA-C compiler fully unrolls loops when the number of iterations of a loop can be determined at compile time. Array value propagation searches for array references with constant indices, then replaces such references with the values of the array elements. When the value is a compile-time constant, this enables constant propagation.

Another optimization is common subexpression elimination (CSE), a well-known compiler optimization that eliminates redundancies by looking for multiple subexpressions that compute the same value. The SA-C compiler performs standard CSE, but also performs temporal CSE, looking for values computed in one loop iteration that are also needed in later iterations. In such cases, the redundant computation can be eliminated by holding result values in a register chain.

A useful phenomenon often occurs with temporal CSE: One or more columns in the left part of the window are unreferenced, making it possible to eliminate those columns. Narrowing the window lessens the number of shift registers (and therefore space) required to store the window elements. The compiler may narrow the window space even further by shifting window computations as far to the right as possible and inserting shift registers to move the results back to the correct iteration.

Some operations, such as division, can be inefficient when implemented directly in hardware, and can be replaced by lookup tables. A pragma indicates that a function or an array needs to be compiled as a lookup table. Another low-level optimization, bitwidth narrowing, exploits the user-defined bitwidths of variables to infer the minimal required bitwidths of intermediate variables. The parts of the program that are to be executed on the FPGA-based reconfigurable hardware are translated, via dataflow graphs, to VHDL. Commercial tools then map VHDL-to-FPGA configuration codes.

Probing

To illustrate SA-C's use, we implement and compile an Automatic Target Recognition (ATR) algorithm to run on a commercially available reconfigurable processor. Probing is an ATR algorithm suited to this purpose. We used the algorithm presented here as part of a larger multisensor ATR system developed to find vehicles in visible light, infrared (IR), and laser-radar (LADAR) imagery.

A probe is a pair of input LADAR pixels and an associated Boolean result. Typically, a probe returns True when the absolute value of the difference in pixel values exceeds a threshold, answering the question: "Is one pixel inside an object of interest and the other outside?" A probe set is a set of probes arrayed along the silhouette of an object. When the proper probe set is placed over an object of interest, you expect a high percentage of the probes to return True. If the probe sets are sufficiently detailed and objects sufficiently distinct, then no other probe set will score as high.

Figure 1 is a typical ATR problem to which probing may be applied. Figure 1(a) is an image roughly corresponding to a LADAR field of view, while 1(b) is the LADAR image with the winning probe set overlaid on top of LADAR pixels.

In our multisensor ATR effort, probing of LADAR imagery is used to signal a more complex target verification process that relates visible light, IR, and LADAR imagery to a 3D target model. Unlike previous uses where probing was implemented to make a final determination of target type and viewing angle, in our application, probing generates a list of possible target types and viewing angles.

At first glance, an exhaustive search through the probe sets for all possible objects, all possible viewing angles, and all possible placements in an image seems a tremendously burdensome computation. However, it is also a computation ripe for optimization. Thus, it is both an algorithm of considerable practical interest and a powerful demonstration of what can be done using the compilation capabilities developed in the Cameron project.

Probing Implementation Details

Example 1 is pseudocode for the probing algorithm. The probing algorithm applies each probe set to each image position and returns two result images (the outer loop in Example 1). The two result images indicate the highest score of the probe sets in each position and the corresponding index of the winning probe set. Each threshold operation compares two pixel values and produces 0 if their absolute difference is below a threshold; otherwise, it returns 1.

The two inner loops computing the scores and probe set indices can be fully unrolled because the probe sets are statically known. This turns the code into a single loop with a giant loop body consisting of threshold operators, sum trees, division operators, and max trees; see Figure 2. This massive loop body allows for CSE, temporal CSE, and window narrowing optimizations.

Figure 3 demonstrates the effect of spatial CSE. Probes common to two (or more) probe sets are computed once, and their result is shared. If probe sets have several probes in common, then parts of the sum tree can be shared as well. Figure 4 is temporal CSE. The right-most of the four top probes is computed, and its result is reused in three computations that occur three, five, and seven iterations later. In Figure 4, this replaces three threshold expressions by a smaller 1-bit register chain of length 7. Figure 5 illustrates window narrowing. The window width is reduced from the width of the silhouette (in the case of Figure 5, 15) to the width of the widest probe (in this case, 4).

Bitwidth narrowing optimization significantly reduces the size of the sum trees. The inputs to the trees are 1-bit unsigned integers and the partial sums grow to, at most, 6 bits, as the number of probes per probe set is between 20 and 42.

Computing the score of a probe set in a window requires the hit count to be divided by the probe set size. Floating-point division is a costly operation on FPGAs, while fixed-point division causes truncation errors that alter the ordering of the scores. Since the goal is to find the maximum score, the division can be replaced by a table lookup that maps hit counts and probe set sizes to rank numbers. The maximum rank number of all probe sets indicates the winner. Scores below a certain threshold (say, 80 percent) are not significant and can be given rank 0. Removing insignificant scores drastically reduces the number of bits in the rank, the lookup tables, and the comparison operators. Because the size of every probe set is statically known, the two-dimensional lookup table can be replaced by a set of one-dimensional lookup tables, one for each probe size. The division is thus replaced by a small, one-dimensional table lookup using the hit count as a lookup index.

Probing Performance and Analysis

The test suite for the probing application consists of three vehicles: an M60 tank, M113 armored personnel carrier, and M901 armored personnel carrier with missile launcher. Each is represented by 81 probe sets (27 aspect angles at 3 depression angles), totalling 7573 probes in windows of sizes up to 13×34. The input is a 512×1024 tiled LADAR image of 12-bit pixels.

The reconfigurable platform used is a WildStar Board, produced by Annapolis Micro Systems (http://www.annapmicro .com/). The WildStar has three XCV2000E Virtex FPGAs made by Xilinx (http://www.xilinx.com/). The WildStar board is capable of operating at frequencies from 25 MHz to 180 MHz. It communicates over the PCI bus with the host computer at 33 MHz. Here, we compare the performance of SA-C code running on the WildStar to the performance of C code running on an 800-MHz Pentium III. (The XCV2000E and 800-MHz PIII are of similar age, and both are fabricated at 0.18 microns.) In both cases, execution times do not include the time required to read the image into local memory, but do include I/O time between the computational engine and local memory

The SA-C code is partitioned in a straightforward way: Each vehicle is mapped on to one FPGA. Each FPGA scans the input image and produces an image of winning scores and probe set indices for its vehicle. The host gathers the resulting data and merges it into a single result image. Figure 1(b) is an 8-bit version of the input image with the probe set of the highest scoring winner superimposed over it. Table 1, which lists dataflow-graph-level statistics for the test suite, shows the number of probes, number of additions in the sum trees, and window size, before and after optimization. This shows that optimization reduces the number of probes by a factor of 19, and halves the number of additions. About half of these additions are 1-bit additions. The number of columns in the window is compacted from the width of the largest probe set (34) to the horizontal width of the widest probe (4). Program execution time on the Wildstar is 81 milliseconds. Executing the same algorithm in C on the Pentium III using Microsoft's Visual C++ compiler optimized for speed takes 65 seconds. Hence, the Wildstar is about 800 times faster than the Pentium.

These times can be explained as follows: For the configuration generated by the SA-C compiler for the probing algorithm, the FPGAs run at 41.1 MHz. The program is completely memory I/O bound—during every clock cycle, each FPGA reads one 32-bit word containing two 12-bit pixels. Because there are (512-13+1)×(1024) columns to be read, each 13 pixels high, the FPGAs perform (512-13+1)×(1024)×(13/2)=3,328,000 reads. At 41.1 MHz, this takes 80.8 milliseconds. The Pentium performs (512-13+1)×(1024-34+1) window accesses in its outer loops. Computation on each window involves 7573 threshold operations. Hence, the inner loop that performs one threshold operation is executed (512-13+1)×(1024-34+1)×7573=3,752,421,500 times. In Example 2, the inner loop body in C, in_width and THRESH, are constants. Visual C++ infers that all the accesses to the probe set can be done by pointer increments and generates a 16-instruction inner loop body. The total number of instructions executed in the inner loop is, therefore, 3,752,421,500×16=60,038,744,000. If one instruction were executed per cycle, this would bring the execution time to about 75 seconds. As the execution time of the whole program is 65 seconds, the Pentium (a superscalar architecture) is actually executing more than one instruction per cycle.

Acknowledgments

The initial probing work was done at Colorado State University working with Alliant Techsystems under the DARPA Reconnaissance, Surveillance, and Target Acquisition program (RSTA). The SA-C work and the probing implementation was supported by DARPA under U.S. Air Force Research Laboratory contract F33615-98-C-1319.

DDJ

1 2 3 4 5 6 7 8 9 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Web Development