Numerical C and DSP

Numerical C, a new high-level language built on the Free Software Foundation's GNU C compiler (gcc), makes it easier to code mathematically intensive applications such as those used with digital-signal processing. Numerical C is a superset of ANSI C, differing in that the additional language constructs are geared towards mathematical programming paradigms.


August 01, 1994
URL:http://www.drdobbs.com/cpp/numerical-c-and-dsp/184409286


Numerical C and DSP

Numeric extensions for C programmers

Kevin Leary

Kevin is systems-engineering manager at Analog Devices. He can be contacted at [email protected].


Digital-signal processors have traditionally been difficult to program in high-level languages such as C. In part, this is because constructs for operations such as dot products (the computational core of FIR filters) were not necessarily appropriate for DSP architectures. For example, most programmers code loops in C by counting up, like this: for (i=0;i<N;i++). The index i is often used as an induction variable in the loop. DSP chips, however, often have looping hardware that counts down, so the count register is not readily available as an induction variable. The result is source code that can adversely affect the compiler's optimization of loops. This is significant because DSPs are intended to perform loops well. (In fact, loop performance is one of the key differentiators between DSP and RISC processors.)

Numerical C, a new high-level language built on the Free Software Foundation's GNU C compiler (gcc), makes it easier to program DSP applications and related mathematical algorithms. It is easy to use, and it also makes it easier to produce better code for the target.

Numerical C is a superset of ANSI-standard C. It differs from Standard C in that the additional language constructs are geared to mathematical-programming paradigms. These constructs enable the compiler to generate more-efficient code by giving the compiler more information about the algorithm and by enforcing a canonical form on the input program.

The effort to define Numerical C began with the formation of the Numerical C Extensions Group (NCEG) in March of 1989. This group quickly became part of the ongoing C standardization efforts of ANSI X3J11 and officially operated as ANSI X3J11.1. NCEG has been chartered by ANSI to standardize the math library and suggest any changes to X3J11 in the area of high-speed numerical computing for the next draft of the C language.

In particular, NCEG is focusing on math libraries, floating-point math, complex numbers, variable-length arrays (VLAs), and language extensions to handle parallel and vector architectures. (For more information, see "Numerical Extensions to C" by Robert Jervis, DDJ, August 1992.) The most important extensions that have been added to gcc include operators, data types, and iterators. In this article, I'll focus on complex numbers and iterators using Analog Devices' implementation of Numerical C for the ADSP-21060 SHARC DSP as an example. For more details on this processor, see the accompanying text box, entitled "The ADSP-21060 SHARC DSP." Note, however, that as part of the GNU releases, Numerical C is supported on SPARC, 386 (32-bit), and 486 (32-bit) platforms. As for other DSPs, Texas Instruments has publicly committed to supporting Numerical C on its floating-point DSPs.

Complex Numbers

Numerical algorithms in general, and DSP applications in particular, often rely on complex numbers and complex arithmetic as a basis for analysis. Numerical C supports complex numbers implicitly in order to allow better support for DSP applications.

In Numerical C, complex numbers are supported as an aggregated pair of values: One value represents the real component of the complex number, and the other value represents the imaginary part of the number--the so-called "Cartesian representation." Two primitives are used to access the real and imaginary values of a complex number--creal and cimag, both of which follow the same syntax and semantics as the sizeof operator in C. creal is used to extract the real part of the complex number, and cimag, the imaginary part.

The type qualifier complex is used to construct a complex data type. By combining complex with int, float, or double, you can construct a complex data type with different base types. For example, complex float x declares x to be a complex number in which its real and imaginary parts are of type float. Likewise, the construct complex int ca[10] declares an array of complex integers.

In Numerical C, all operators behave as you would expect, regardless of whether the operands are complex or real. The Numerical C compiler is responsible for generating the best possible code for implementing the complex operation on the target architecture.

Complex constants are constructed with the same syntax found in mathematical notation. The i suffix notation is used to build complex lexical constants; for example, 5+6i, 1i, and 3-.01i. The complex data type has been integrated directly into the compiler's type system, and complex numbers coexist cleanly with the normal, basic types in C. If you add a float to a complex number, the float first gets promoted to a complex float, then added. For instance, adding 5 to 6+4i would result in promoting 5 to 5+0i, and then adding it to 6+4i, resulting in 11+4i. Example 1 illustrates this by computing a "twiddle factor" for a discrete Fourier transform.

Iterators

In Standard C, arrays are declared and operated on explicitly--exact array sizes must be known at declaration time, and every element of an array that is to be operated on must be explicitly accessed. In Numerical C, on the other hand, arrays are declared explicitly but operated on implicitly. A tool to help this implicit operating on arrays is called an "iterator."

Iterators are represented syntactically as declarations. To declare an iterator, you use the new iter keyword. Iterators must be initialized when they are declared; the initialization value N is the iterator's upper bound. When an iterator is used in an expression, the expression is evaluated for all integer values in the half-open interval [0,N). That is, the iterator itself takes on every value in the set {0, 1, 2, ..., N-1}.

Example 2(a), for instance, loads the first ten elements of the vector A with a ramp, resulting in A[0]=0, A[1]=1, ..., A[8]=8, A[9]=9. As you can see in this example, all elements of the array A were explicitly declared and (with the help of the iterator I) implicitly operated upon. This example demonstrates two important benefits of Numerical C: the notation is more concise than an explicit loop, and the compiler receives more information about the operation, which it can use to optimize.

For the sake of comparison, Example 2(b) is Standard C code that is semantically equivalent to Example 2(a).

When an expression contains an iterator, the expression becomes a vector expression. Vector expressions can include piece-wise operations and reduction operations.

Piece-wise operations are the simplest operations. They typically involve use of the Standard C operators to perform standard arithmetic operations on arrays of data. Example 3(a), for instance, results in {Z[0]=X[0]*Y[0], Z[1]=X[1]*Y[1], ..., Z[9]=X[9]*Y[9]}. All Standard C operators work in this piece-wise manner on vector expressions.

Reductions are used to reduce a vector into a scalar quantity. The most common is the sum reduction, which works the same as Σ does in standard mathematical notation: sum adds up all the elements of the expression, resulting in a scalar value that is the sum of the vector expression. To take the dot product of X and Y, you reduce the vector expression X[I]*Y[I] to a scalar by summation, as in Example 3(b).

Order of Iteration and Multidimensions

Up to this point, the iteration space has been one-dimensional. Still, Numerical C is flexible enough to support a three-dimensional iteration space. Given the matrices C, A, and B, where A, B, and C are m x n, p x n, and m x p matrices respectively, the algorithm in Example 4(a) would be coded with iterators in Numerical C as Example 4(b).

Expressions that involve more-complicated iterators and require explicit control of the order of iteration use the for(<iter>) construct. Example 5(a) is the grammar extension for this. For instance, Example 5(b) is semantically identical to Example 5(c), while Example 5(d) is semantically identical to Example 5(e).

Other Extensions

Numerical C extensions included in the Analog Devices implementation provide support for dual data-storage architectures. Most DSP chips use a Harvard or modified Harvard architecture, which supports two or three memory spaces for the simultaneous fetching of up to two data operands and an instruction. Numerical C allows you to specify in which of two memory spaces data is to be located. On the ADSP-21060, these spaces are known as pm and dm; accordingly, the ADI implementation of Numerical C supports pm and dm type qualifiers, giving you full control over which memory data resides in.

Since Numerical C is based on the gcc core, it has inherited many other extensions to Standard C. These include support for variable-length arrays and run-time initializers for aggregates.

A Neural-Net Example

One Numerical C application we've written is a backprop neural-network system based on Neural Networks: Algorithms, Applications, and Programming Techniques, by James Freeman and David Skapura. If you're unfamiliar with neural-net concepts, refer to Freeman and Skapura's book, as well as any number of DDJ articles, among them "Neural Networks and Character Recognition," by Ken Karnofsky (June 1993), "A Neural-Network Audio Synthesizer," by Thorson et al. (February 1993), and "Bidirectional Associative Memory Systems in C++," by Adam Blum (April 1990).

The network in Listing One is a feed-forward network, where all signals move forward from the input to the outputs to simulate a back-propagation network. In simple terms, Listing One contains two arrays of coefficients modified by a training algorithm based on minimizing the LMS error. Training is the crux of the entire algorithm (and where the network gets the name "back propagation"): We compute the error at the output layer and adjust the weights on the output layer, then take a weighted multiple of the error and propagate it back to the hidden layer. Since the code is heavily commented, I won't go into detail about how the network works. Suffice it to say that this neural net, implemented for a DSP chip, is an example of how you can best use Numerical C.

Performance

The most important reason for using Numerical C is performance. To examine this issue, I'll briefly present an assembly-language code fragment generated by the ADSP-21060 implementation of Numerical C for a dot product.

Example 6(a) is the Numerical C source code used to generate the ADSP-21060 assembly-language code in Example 6(b). In this example, the most important features are the use of the chip's do loop instruction, the single-cycle inner loop, and the dual-memory fetch capability. Note that all ADSP-21060 instructions execute in a single cycle at 25 ns. Also note the algebraic syntax of the ADSP-21060 assembly language.

Conclusions

Numerical C makes writing application code simpler, and the code produced using the Numerical C compiler is more efficient than the code produced by standard C compilers.

Numerical C is now available as part of gcc, a publicly available compiler covered by the GNU software license agreement, which encourages the dissemination of source code. Analog Devices encourages the use of Numerical C on any platform and specifically encourages the porting of Numerical C to other DSP chips.

The ADSP-21060 SHARC DSP

The ADSP-21060 Super Harvard Architecture (SHARC) digital-signal processor is a 32/40-bit IEEE floating-point system on a chip offering 40-MIPS performance with a 25-ns instruction rate and single-cycle instruction execution. In particular, the processor and its instruction set are designed to support efficient C code generation. Through the Harvard architecture and three independent computation units, the processor can achieve 120-MFLOPS peak performance without an arithmetic pipeline. The 4-Mbit on-chip SRAM memory allows real-time signal processing without the off-chip data bottlenecks typically found in data-intensive systems. Dual data-address generators with indirect, immediate, modulo, and bit-reverse addressing and efficient program sequencing with zero-overhead looping contribute to sustaining the ADSP-21060's throughput.

As Figure 1 shows, the core of the ADSP-21060 includes three independent computation units, dual data-address generators, and a program sequencer with zero-overhead looping.

The ADSP-21060 takes advantage of its Super Harvard Architecture to yield high performance for signal processing applications. Table 1 provides a sampling of benchmarks.

The processor contains 4 Mbits of dual-ported internal SRAM, organized as two blocks of 2 Mbits. Each block is dual-ported for single-cycle, independent accesses by the core processor and the I/O processor or DMA controller (the 21060 achieves a 240-Mbyte/second DMA transfer rate). The dual-ported memory and separate on-chip buses allow two data transfers from the core and one from the I/O processor, all in a single cycle. The memory can be configured as a maximum of 128K words of 32-bit data, 256K words of 16-bit data, 80K words of 48-bit instructions (and 40-bit data), or combinations of different word sizes up to 4 Mbits. For example, each block can be organized into 64K 32-bit words for data or into 40K 48-bit words to support the ADSP-21060's 48-bit instruction word. These two word types can coexist in the same memory block. All memory can be accessed as 16-, 32-, or 48-bit words.

While each 2-Mbit block can store combinations of code and data, on-chip memory accesses are most efficient when one block stores data using the DM bus for transfers and the other block stores instructions and data, using the PM bus for transfers. Using the DM bus and PM bus with one dedicated to each memory block assures single-cycle execution with two data transfers. In this case, the instruction must be available in the cache. Single-cycle execution is also maintained when one of the data operands is transferred to or from off-chip, via the ADSP-21060's external port.

--K.L.

Figure 1 ADSP-21060 block diagram.

Example 1: Computation for Fourier-transform twiddle factor.

complex float Wn;
Wn = cexp (2.0*PI*1.0i/N);

Example 2: (a) is the semantic equivalent of (b).

(a)  int A[10];
     iter I = 10;
     A[I] = I;

(b)  int A[10];
     int i;
     for (i=0; i<10; i++)
         A[i] = i;

Example 3: (a) Piece-wise operations; (b) reducing vector expressions.

(a)  float Z[10], X[10], Y[10];
     iter I = 10;
     Z[I] = X[I]*Y[I];

(b)   sum(X[I]*Y[I]);

Example 4: (a) is coded with iterators in Numerical C as (b).

Example 5: (a) is the grammar extension for the for(<iter>) construct; (b) is semantically identical to (c); (d) is semantically identical to (e).

(a)  for (<iterator-variable>)
          statement;

(b)  float A[m][m], B[m];
     iter I=m, J=m;
     for (I) B[J] = B[I] + A[J][I]*I;

(c)  float A[m][m], B[m];
     int i, j;
     for (i=0; i<m; i++)
         for (j=0; j<m; j++)
             B[j] = B[i] + A[j][i]*i;

(d)  float A[m][m], B[m];
     iter I=m, J=m;
     for (J) B[J] = B[I] + A[J][I]*I;

(e)  float A[m][m], B[m];
     int i, j;
     for (j=0; j<m; j++)
         for (i=0; i<m; i++)
             B[j] = B[i] + A[j][i]*i;

Example 6: (a) is the Numerical C source code used to generate the ADSP-21060 assembly-language code in (b).

(a)  int fir, A[10];
     pm int B[10];
     iter I=10;
     fir = sum (A[I]*B[I]);

(b)  r4=dm(i4,m6), r2=pm(i12,m14);
     lcntr = 9,  do (pc,1) until lce;
        mrf=mrf+r4*r2(ssi), r4=dm(i4,m6), r2=pm(i12,m14);
     mrf=mrf+r4*r2(ssi);

Table 1: ADSP-21060 benchmarks (@ 40 MHz).

Operation                        Time        Cycles

1024-Pt. Complex FFT             0.46 msec   18,221
FIR Filter (per tap)             25 ns            1
IIR Filter (per biquad)          100 ns           4
Divide (y/x)                     150 ns           6
Inverse Square Root (1/sqrt(x))  225 ns           9

Listing One


#include <stdio.h>
#include <math.h>
#include <dspc.h>

static inline float sigmoid (float x){ return 1.f/(1.f+expf(-x));}
/* Simulating a backpropagation network: a feed-forward network. All signals
move forward from the input to the outputs. In xp, ip, op, the 'p' denotes
the pth exemplar. Multiply 'xp', our input vector, by the weights on the
hidden layer.
      * * * * * * *  input layer.
      \ | | | | | /
       + + + + + +   hidden layer.
      / | | | | | \
     o  o o o o o  o output layer
     ip is the output of the hidden layer; op is the output of the
     output layer.
The matrix wts_h maps the input layer to the hidden layer: each node in the
input layer has a weight from that node to every node in the hidden layer.
A zero in the weight matrix means there is no connection.

   ip[j] = SUM_i xp[i] * wts_h[j][i]
   op[k] = SUM_j ip[j] * wts_o[k][j]

The sigmoid function is used as a thresholding function on the output
signal of each layer. */
void
bpn_simulate(int nni, int nho, int nno, float xp[nni], float ip[nho], 
             float op[nno], float pm wts_h[nho][nni], float pm wts_o[nno][nho])
{
  iter i = nni, j = nho, k = nno;
  ip[j] = sigmoid (sum (xp[i]*wts_h[j][i]));
  op[k] = sigmoid (sum (ip[j]*wts_o[k][j]));
}
/* Training is the crux of the algorithm and where the network got its name--
backpropagation. Compute the error at the output layer and adjust the weights
on the output layer. Take some weighted multiple of the error and propagate
it back to the hidden layer. Given xp the input vector, yp the expected
output vector, and a set of weights wts_h, wts_o, the control variable eta
controls how fast the algorithm learns. The error yp-op is the difference
between what we expected and what we got. We find the gradient of the error
surface, build delta_o, then delta_h from delta_o, and modify the weights
with delta_h and delta_o. Return half of the gradient squared, to be used by
a control algorithm to determine how much more we need to train. */
float
bpn_train(int nni, int nho, int nno, float xp[nni], float yp[nno],
                float pm wts_h[nho][nni], float pm wts_o[nno][nho], float eta)
{
  iter i = nni, j = nho, k = nno;
  float s;
  float ip[nho], op[nno], delta_h[nho], delta_o[nno];
  /* simulate xp to get ip, op */
  ip[j] = sigmoid (sum (xp[i]*wts_h[j][i]));
  op[k] = sigmoid (sum (ip[j]*wts_o[k][j]));
  /* error * derivative of the sigmoid 1/(1+exp(-x)), simplified to op*(1-op) */
  delta_o[k] = (yp[k] - op[k]) * op[k] * (1 - op[k]);
  delta_h[j] = ip[j] * (1 - ip[j]) * sum (delta_o[k] * wts_o[k][j]);
  /* adjust the weights on the output layer. */
  wts_o[k][j] += eta * delta_o[k] * ip[j];
  /* adjust the weights on the hidden layer */
  wts_h[j][i] += eta * delta_h[j] * xp[i];
  /* return half of the gradient squared */
  return .5f * sum (delta_o[k] * delta_o[k]);
}
/* Simulate the network for every exemplar in the set; used to check the
outputs periodically during training. */
void
bpn_simulate_set(int nni, int nho, int nno, int nex, float exemplars[nex][nni],
                 float desired[nex][nno],
                 float pm wts_h[nho][nni], float pm wts_o[nno][nho])
{
  float op[nno],ip[nho];
  iter p = nex;
  for (p) { 
    bpn_simulate (nni,nho,nno,exemplars[p],ip,op,wts_h,wts_o);
#if 0
    printf("{"); printf (" %f,",exemplars[p][i]); printf("}\n");  
    printf("{"); printf (" %f,",desired[p][j]); printf("}\n");  
    printf("{"); printf (" %f,",op[j]); printf("}\n");  
    printf ("------------\n"); 
#endif
  }
}
void
bpn_training_loop(int nni, int nho, int nno,int nex, float exemplars[nex][nni],
   float outputs[nex][nno],float pm wts_h[nho][nni], float pm wts_o[nno][nho],
   float eta,float epsilon,int printp)
{
  float error = 100;
  int    iteration=0;
  iter   p = nex;
  while (error > epsilon)
    {
      error = 0;
      for (p) 
       error += bpn_train (nni, nho, nno,exemplars[p], outputs[p], wts_h, 
                                                                   wts_o, eta);
      error /= nex;
      if (printp)
         printf ("%5d: err=%10lf\n",iteration,error);
      iteration++;
      if (printp && (iteration % printp) == 0)
       bpn_simulate_set (nni, nho, nno, nex, exemplars, outputs, wts_h, wts_o);
    }
}


Copyright © 1994, Dr. Dobb's Journal
