## Writing efficient code for resource-constrained platforms

*Shehrzad is an engineer at Labcyte Inc. Portions of this article were adapted from his forthcoming book Embedded Image Processing on the C6000 DSP (Springer, 2005). He can be contacted at shehrzad_[email protected] hotmail.com.*


Writing efficient code for memory- and resource-constrained embedded platforms is difficult, and the entire process is compounded when dealing with data-intensive computations such as signal- and image-processing algorithms. Digital Signal Processors (DSPs) are less general-purpose than familiar desktop CPU architectures—IA32, PowerPC, AMD, and the like. However, DSPs are architected to excel at the computations typically found in signal- and image-processing applications. In this article, I examine how 64-bit architectural features of the Texas Instruments C6416, a fixed-point member of the TMS320C6000 (C6x) family of DSPs, engender significant performance boosts in common operations used in image processing. The C6416 is an updated and faster version of the older C62x fixed-point DSPs, and the C6x family also includes the floating-point C67x series. Specifically, I illustrate how packed data-processing optimizations that take advantage of double-word-wide memory accesses can be incorporated into C code using Code Composer Studio compiler intrinsics to enhance the performance of critical loops. Code Composer Studio (CCStudio) is TI's flagship IDE and in many respects looks and feels like Microsoft's Visual Studio. I've compiled all of the code accompanying this article using Version 2.20 of CCStudio, and tested it on a C6416 DSK (DSP Starter Kit).

My approach to DSP code optimization involves a series of stages:

- Develop a MATLAB prototype to gain an in-depth understanding of the algorithm and its underlying mathematics at a high level.
- Implement a straightforward reference C/C++ version of the algorithm in Visual Studio. This is usually a MATLAB-to-C/C++ port that generates the same output as Step 1. Because the C6416 is a fixed-point processor, I generally convert any algorithms from floating point to fixed point here (see "Fixed-Point Arithmetic for Embedded Systems," by Jean Labrosse, *C/C++ Users Journal*, February 1998, for a discussion of fixed-point representations of numbers).
- Get the code running on the DSP platform. If performance is unsatisfactory, begin the optimization process with techniques such as reducing the memory footprint and packed data processing via specialized compiler intrinsics.
- After identifying the remaining bottlenecks, rewrite the gating loops in so-called "linear assembly" or hand-optimized native C6x assembly.

Obviously, you must get to at least Step 3 during development of a DSP-based embedded system. It is essential to profile the code at this point to gauge whether the processing meets the stated time requirements (hopefully such requirements are available). After all, if the code is fast enough as is, there is not much to be gained from the optimizations of Steps 3 and 4, save for your own personal edification. Experimentation with the compiler and understanding its capabilities must be stressed; it is easier to fiddle with compiler optimizations than it is to code in assembly. The TI compiler has a "compiler-directed feedback" feature that emits information on utilization of the processor's functional units, software pipelining, and other useful data that may point you toward which compiler optimizations make sense in a given context (see "Code Efficiency & Compiler-Directed Feedback," by Jackie Brenner and Markus Levy, *DDJ*, December 2003, for additional information).

### Using Compiler Intrinsics

Intrinsics provide you with access to the hardware while letting you remain within the friendly confines of the C programming environment. They translate directly into assembly instructions for processor-specific features that standard ANSI C cannot express, and they are inlined, so there is no function-call overhead. A simple example is the *_abs* intrinsic. The C6x instruction set includes an instruction that computes the absolute value of a register in a single clock cycle, which is more efficient than the multicycle branching code generated for the equivalent C ternary expression; for instance:

int absolute_value = (a>0) ? a : -a;

In CCStudio, this C statement can be replaced by:

int absolute_value = _abs(a);

All C6x intrinsics begin with a leading underscore. Another basic operation encountered in image-processing algorithms is saturated arithmetic, whereby the result of an operation on two pixels is clamped to the maximum or minimum value of a predefined range. Rather than code a multicycle series of C statements that implement the saturated add, you should use the *_sadd* intrinsic (see Examples 2-6 in *TMS320C6000 Programmer's Guide*, literature number SPRU198g). Some of the C6x instructions are quite specialized, and many (not all) C6x assembly instructions have intrinsic equivalents; a full list is enumerated in the programmer's guide to the C6x. For example, on the floating-point C67x DSPs, there are instructions for single- and double-precision reciprocal approximations, *RCPSP* and *RCPDP*, respectively. These instructions, and their corresponding compiler intrinsics (*_rcpsp* and *_rcpdp*), can be used either to seed a Newton-Raphson iterative procedure that increases the accuracy of the reciprocal or, as accuracy requirements warrant, as the reciprocal itself. However, my focus here is not on the use of intrinsics for specialized operations, but rather on using intrinsics within image-processing loops to vectorize code with word-wide (32-bit) or, preferably, double word-wide (64-bit) optimizations that operate on packed data.
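To make the saturated add concrete, here is a portable C model of 32-bit signed saturation (the function name *sadd_model* is mine; the real *_sadd* compiles to a single SADD instruction, whereas this multi-statement version is exactly the kind of code the intrinsic replaces):

```c
#include <limits.h>

/* Portable model of a saturating 32-bit signed add: the result is clamped
 * to the representable range instead of wrapping around. */
int sadd_model(int a, int b)
{
    long long s = (long long)a + b;   /* widen so the true sum cannot overflow */
    if (s > INT_MAX) return INT_MAX;  /* clamp to the top of the range */
    if (s < INT_MIN) return INT_MIN;  /* clamp to the bottom of the range */
    return (int)s;
}
```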

### Packed Data Processing

Image-processing algorithms are notorious memory hogs, as it takes large amounts of memory to store image pixels. While this may not be of huge concern to nonembedded developers who have oodles of available RAM, embedded developers do not have the luxury of being so cavalier. Hence, the first order of business in the optimization of image-processing algorithms is to make maximum utilization of fast on-chip memory. This often entails splitting the image into pieces and paging these pieces from external memory to on-chip RAM, where there is a reduced memory-access latency.

Once the processing completes, the next order of business is to go the other way: page the processed pixels back out to external memory. Equally salient to image processing, and to signal processing in general, is the idea of "packed data processing," where a single instruction applies the identical operation to all elements in a data stream. This general concept is known as SIMD (Single Instruction, Multiple Data) processing in computer-architecture circles, and it is powerful because, more often than not, the operations acting on the data stream are independent of one another, leading to code-generation techniques that exploit this inherent parallelism. With the right optimization, code size is reduced and processor throughput is maximized. As you might imagine, Texas Instruments is not the only chip developer to incorporate such techniques into its processor cores. Intel brought similar SIMD instruction-set extensions to the IA-32 line with MMX, SSE, SSE2, and SSE3, as did AMD with 3DNow!. In a nutshell, packed data processing boils down to storing multiple elements in a single register, then using specialized processor instructions to operate on that data. Here, I use compiler intrinsics to access these processor instructions from within C code.
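The packed-data idea can be sketched in portable C without any intrinsics. The following SWAR ("SIMD within a register") snippet is a hypothetical illustration of the concept, not TI API: it packs four 8-bit pixels into one 32-bit word and adds a constant to all four lanes at once, the effect a dedicated packed-add instruction achieves in a single cycle:

```c
#include <stdint.h>

/* Pack four 8-bit pixels into one 32-bit word (a in the low byte). */
uint32_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint32_t)a | ((uint32_t)b << 8) |
           ((uint32_t)c << 16) | ((uint32_t)d << 24);
}

/* Add k to each packed byte (modulo 256 per lane) without letting a carry
 * in one pixel lane spill into its neighbor. */
uint32_t add4_model(uint32_t pixels, uint8_t k)
{
    uint32_t kk  = k * 0x01010101u;              /* replicate k into 4 lanes */
    uint32_t lo7 = (pixels & 0x7F7F7F7Fu) + (kk & 0x7F7F7F7Fu);
    return lo7 ^ ((pixels ^ kk) & 0x80808080u);  /* add high bits carry-free */
}
```

The masking is what dedicated packed-add hardware provides for free: lane boundaries that carries cannot cross.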

For example, consider the sum of products between two vectors in Example 1, which is critical in signal processing. This operation appears in various forms, a prominent example being the convolution of two sequences. Suppose *h* and *x* are 16-bit integer quantities, perhaps representing fixed-point numbers in Q15 format. Then a high-level description of what the processor is doing within a loop kernel that implements this vector product sum would look like Example 2. Since C6x registers are 32 bits wide, by reading the data in 16-bit (half-word) chunks at a time, you waste half of the storage capacity of registers 1 and 2. By packing the data to store multiple elements of the stream within registers 1 and 2, you can reduce the load pressure on this loop, as illustrated in the pseudocode in Example 3.

Actually, Example 3 is an embodiment of two optimizations that go hand in hand: packed data and loop unrolling. You alleviate the load pressure in the loop kernel by reducing the number of load instructions; for example, replacing what would be *LDH* (Load Half-Word) instructions in the first loop with *LDW* (Load Word) instructions in the second loop, thereby packing two 16-bit quantities into each 32-bit register. The same optimization holds (and is even more advantageous) if the data elements in the stream are 8-bit quantities; then each load places four data elements in a register, replacing four *LDB* (Load Byte) instructions with a single *LDW* instruction. Of course, specifying these sorts of loads and stores is not feasible in strict and portable ANSI C, and it is a risky proposition to rely completely on the compiler's optimization engine to generate code that takes full advantage of SIMD instructions operating on packed data. This is where intrinsics come in.
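The word-wide load can be sketched in portable C as follows. This is a hedged illustration, not the C6x code itself: the `memcpy` stands in for an *LDW* instruction, the shifts stand in for the half-word multiply instructions the compiler would use, and the function name is mine. It assumes an even element count:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Unrolled-by-2 dot product: each 32-bit load brings in two 16-bit
 * elements, halving the number of loads, and two accumulators break the
 * loop-carried dependency on a single running sum. */
int32_t dotprod_packed2(const int16_t *h, const int16_t *x, size_t n)
{
    int32_t sum0 = 0, sum1 = 0;
    size_t i;
    for (i = 0; i < n; i += 2) {
        uint32_t hw, xw;
        memcpy(&hw, &h[i], sizeof hw);   /* two h elements per "register" */
        memcpy(&xw, &x[i], sizeof xw);   /* two x elements per "register" */
        /* split each word into its 16-bit halves */
        int16_t h0 = (int16_t)(hw & 0xFFFF), h1 = (int16_t)(hw >> 16);
        int16_t x0 = (int16_t)(xw & 0xFFFF), x1 = (int16_t)(xw >> 16);
        sum0 += h0 * x0;    /* like _mpy  (low halves)  */
        sum1 += h1 * x1;    /* like _mpyh (high halves) */
    }
    return sum0 + sum1;
}
```

Note that the pairing of halves is consistent between the two arrays, so the final sum does not depend on byte order.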

All C6x DSPs have instructions and corresponding intrinsics that let you operate on 16-bit data stored in the high and low parts of a 32-bit register, as illustrated in Example 2. The C64x and C67x offer double word-wide access via the *LDDW* (Load Double Word) and *STDW* (Store Double Word) instructions and corresponding intrinsics. In this case, 64 bits worth of data are read into the register file, with elements packed into a pair of registers. A requirement is that arrays must now be aligned on certain boundaries to use these instructions. Arrays have to be word-aligned to use *LDW/STW*, and aligned on a double-word boundary to use *LDDW/STDW*. The C64x DSP builds upon this architectural feature by allowing nonaligned accesses of memory via various instructions such as *LDNW/STNW* (load/store nonaligned word) and *LDNDW/STNDW* (load/store nonaligned double word), which correspond to the intrinsics *_mem4* and *_memd8*, respectively. These nonaligned instructions constitute a significant advantage, especially in certain image- and video-processing scenarios when algorithms march along in 8-bit pixel (byte) boundaries. Without such instructions, you are locked out of the packed data optimization due to restrictions imposed by 32-bit or 64-bit alignment boundaries.

One of the simplest examples of using intrinsics to help the compiler take advantage of packed data is a function that zeros out an array, similar to the Standard C Library *memset* function; see Listing One. This function only works with arrays aligned on a double-word boundary, where *count* is a multiple of 8 and greater than or equal to 32. Given these restrictions, however, it offers advantages over the general-purpose *memset*, which by necessity must be more conservative than *memclear* to maintain generality (see Listing Two). The *_nassert* intrinsic in Listing One is an example of an intrinsic that does not generate any code; rather, it asserts to the compiler that the address of the pointer passed into the function is divisible by 8, in other words, aligned on a double-word boundary. To declare an aligned array, use the *DATA_ALIGN* pragma (see Listing Three). The *MUST_ITERATE* pragma is a means of conveying information to the compiler about the number of times a loop iterates, commonly referred to as the loop "trip count." Through this directive, you can specify the exact number of times a loop executes, whether the trip count is a multiple of some number, the minimum number of iterations through the loop, and so on. This pragma should be used wherever possible, especially when the minimum trip count is known, because this information lets the compiler be more aggressive when applying loop transformations. The form of the *MUST_ITERATE* pragma used in *memclear* specifies that the loop is guaranteed to execute at least 32 times; armed with this information, the compiler can proceed to unroll the loop. Loop unrolling is a code optimization in which the kernel of a loop is expanded by some factor X, and the loop stopping condition adjusted to N/X, with the intent of reducing the number of branches. By reducing the branch overhead, the efficiency of the loop is increased, and the instructions within the expanded loop kernel can be scheduled more effectively.

By stipulating the minimum number of iterations through the *memclear* loop, casting the input pointer to a long (64-bit) type, and guaranteeing alignment of *lptr* to a 64-bit boundary via *_nassert*, you give the compiler enough information to generate a loop that runs faster than an equivalent *memset*-like function. The compiler is now free to use the *LDDW/STDW* (load/store aligned double word) instructions to initialize the 64-bit quantity pointed to by *lptr*. The more conservative code compiles to assembly that uses *LDB/STB* (load/store byte) instructions to initialize the 8-bit values pointed to by *ptr*, and a series of these instructions is not as efficient as a series of *LDDW/STDW* instructions because of the reduced throughput of the data flowing through the DSP. Figure 1 highlights the difference in operation between *memclear* (Listing One) and *memset* (Listing Two). In the conservative but more general implementation, successive *LDB/STB* instructions are used for accessing and storing array elements. In a single iteration of *memclear*, each *LDDW* instruction loads 64 bits of data into a register pair. The assembly code would use two *MVK* (move constant) instructions to zero out both registers each time through the loop, then *STDW* to send the packed data back into the storage array pointed to by *lptr*.

### Optimization of the Center-of-Mass Calculation

The "isodata" clustering algorithm for automatic threshold detection (see the text box "Autonomous Threshold Detection") is used in image segmentation. This algorithm calls for computing the center-of-mass of two portions of an image histogram, where the histogram is split by the current threshold. The isodata algorithm also entails repeatedly performing these center-of-mass computations, as the iterative procedure continues until a "good" threshold value is found. The center-of-mass calculation is illustrated graphically in Figure 2. The threshold *T=150* bifurcates the histogram, with c_{1} the center-of-mass of the left portion of histogram and c_{2} the center-of-mass to the right of *T*. The mechanics of this algorithm can be transformed so that you can employ the packed data-processing optimization via intrinsic functions to improve the performance of the algorithm. Furthermore, this optimization is an example where the nonaligned double-word instructions present only in the C64x series lets you use 64-bit memory accesses where you otherwise would not be able to.

A description of the procedure for computing the center-of-mass of a region of the histogram is simple enough. Referring to Figure 2, computing the center-of-mass to the left of *T* requires multiplying each histogram count by its bin number, summing those products, and then dividing by the sum of the counts from 0 to *T*. In mathematical terminology, this translates to the equation in Example 4. Computation of the center-of-mass to the right of *T* is the same, except that the limits of summation change accordingly. Focusing on the numerator of this expression, note that you could state it as a dot product; in other words, the sum of products of two vectors: the actual histogram counts and the bin numbers for the portion of the histogram being summed over. So instead of loops like those in Listing Four, you can rewrite them and replace the variable *ii* in the loop kernels with an array-indexing operation. Listing Five shows this modification, where the array *pixval* (consisting of an integer ramp) is used in lieu of *ii*. You can now vectorize the operation because you are willing to sacrifice some memory (the *pixval* array for a 16-bit image would be large and would most likely preclude this optimization). If both the *hist* and *pixval* arrays are aligned on a double-word boundary, you might consider replacing half-word accesses with double-word accesses, reading four elements at a time and packing two elements per register. If you can assume that the starting value of *ii* is divisible by 4 and that *T-ii* is also divisible by 4 (64 bits divided by 16-bit elements), which is equivalent to saying the number of iterations through the loop is divisible by 4, then the loops in Listing Six would suffice on both the C67x and C64x DSPs.

The form of the *MUST_ITERATE* pragma in Listing Six tells the compiler that the number of iterations is divisible by 4, and this in conjunction with *_nassert* should be a sufficient trigger for the compiler to apply packed data optimizations using *LDDW* and *STDW*. Unfortunately, you can make no such assumption about the number of times through either loop, and by extension can make no such claim about the starting value of *ii* in the second loop. This fact prevents double word-wide accesses on the C67x; however, with the C64x, you can take advantage of the nonaligned double-word instructions *LDNDW* and *STNDW* to improve the loop performance. Listing Seven shows the contents of the *dotproduct* function that is used to compute the numerator of Example 4, assuming the existence of a global array *hist*. The use of *double* in the declarations of *h1_h2_h3_h4* and *b1_b2_b3_b4* may seem odd at first, but the type *double* in this context is just a placeholder signifying 64 bits of storage. The loop has been unrolled by a factor of 4, and the *_memd8_const* intrinsic, which generates code that uses *LDNDW*, is used to read 64 bits (two words' worth of data) into *h1_h2_h3_h4* and *b1_b2_b3_b4*. Next, both of these 64-bit elements, which in reality are stored in register pairs, are "split" into 32-bit upper and lower halves via the *_lo* and *_hi* intrinsics. At this point, you are left with a situation like that in Figure 3, with four 32-bit elements.

Although you could split the contents of the four registers again, you need not resort to that course of action. There are assembly instructions available that multiply the 16 LSBs (least significant bits) of one register by the 16 LSBs of another register. Similarly, there are assembly instructions for multiplying the 16 MSBs (most significant bits) of one register by the 16 MSBs of another register. There are actually four different variants of each of these instructions, pertaining to whether the data contained within the portion of the register are to be treated as signed or not, and whether the result should be truncated down to 16 bits or returned as a 32-bit entity. In general, multiplication of two 16-bit quantities yields a 32-bit result. As the comments in Listing Seven indicate, the *_mpyu* (16 unsigned LSBs multiplied together and returned as a 32-bit quantity) and *_mpyhu* (16 unsigned MSBs multiplied together and returned as a 32-bit quantity) intrinsics are used to perform four packed-data multiplications. These products are then added to the accumulators, four of which are used to avoid loop-carried dependencies that would inhibit parallelism.
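Portable models of these two intrinsics may make the lane arithmetic easier to follow (the *_model* names are mine; the semantics follow the description above):

```c
#include <stdint.h>

/* Model of _mpyu: multiply the 16 unsigned LSBs of each operand,
 * returning the full 32-bit product. */
uint32_t mpyu_model(uint32_t a, uint32_t b)
{
    return (a & 0xFFFFu) * (b & 0xFFFFu);  /* low half-words */
}

/* Model of _mpyhu: multiply the 16 unsigned MSBs of each operand,
 * returning the full 32-bit product. */
uint32_t mpyhu_model(uint32_t a, uint32_t b)
{
    return (a >> 16) * (b >> 16);          /* high half-words */
}
```

For example, with h1=3, h2=5 packed as 0x00050003 and b1=7, b2=11 packed as 0x000B0007, the low-half product is 21 and the high-half product is 55, the (h1)(b1) and (h2)(b2) terms of Listing Seven.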

A loop "epilogue" of sorts is required to clean up. Because you can make no claims about loop count being divisible by 4, you need to wrap up the dot product computation using a final "traditional" C loop, which iterates at most three times. Alternatively, you could zero pad the arrays such that you are guaranteed to iterate a number of times divisible by 4. We have now succeeded in fully vectorizing the loop kernel, given the memory bandwidth of the architecture, but in fact have yet to reach the denouement of this story. A hallmark of DSP architectures is the fabled multiply-and-accumulate, or MAC, instruction, and it is not surprising that the C64x provides just the intrinsic you need here to cut down on the number of operations within this loop kernel. TI provides another set of intrinsics that map to so-called "macro" instructions, and one of these is DOTP2. This instruction, accessed in C via *_dotp2*, performs a MAC-like operation on packed 16-bit data; that is, it returns the dot product between two pairs of packed 16-bit values, as in Figure 4 for one of the *_dotp2* "invocations" (remember, while they may look like function calls they are in reality inlined functions). With this final tweak, the number of add operations is further reduced, as *_dotp2* subsumes two of the four adds present in Listing Seven. Listing Eight is the fully optimized dot product function that can be used within an isodata image segmentation implementation.

### Conclusion

While the aligned word (LDW/STW) and double-word (LDDW/STDW) wide instructions are useful in the application of packed data processing code optimization techniques, the C64x line of DSPs augments this functionality with unaligned variants of these instructions. There are many image- and video-processing algorithms that call for marching across image dimensions in step sizes that are not even multiples of 4 or 8 bytes. Consequently, without these unaligned instructions, you're locked out of utilizing 64-bit computation methods. Two good papers on C64x-specific code optimization strategies as they pertain to digital-image processing are "VLIW SIMD Architecture Based Implementation of a Multi-Level Dot Diffusion Algorithm" by Ju and Song (sips03.snu.ac.kr/pdf/adv_prog.pdf) and "Implementation of a Digital Copier Using TMS320C6414 VLIW DSP Processor" by Hwang and Sung (mpeg.snu.ac.kr/pub/conf/c61.pdf). For a more detailed exposition on C6x code optimizations, refer to the TI documentation (http://www.ti.com/) or "Preferred Strategies for Optimizing Convolution on VLIW DSP Architectures" by Sankaran, Pervin, and Cantrell (http://www.crest.gatech.edu/conferences/cases2004/paper/sankaran.pdf).

**DDJ**

**Listing One**

    void memclear( void *ptr, int count )
    {
        long *lptr = ptr;
        _nassert((int)ptr % 8 == 0);
        #pragma MUST_ITERATE (32);
        for (count >>= 3; count > 0; count--)
            *lptr++ = 0;
    }

**Listing Two**

    void memset( void *ptr, int x, int count )
    {
        char *uch = ptr;
        for (; count > 0; count--)
            *uch++ = x;
    }

**Listing Three**

    #pragma DATA_ALIGN(double_word_aligned_array, 8)
    unsigned char double_word_aligned_array[256];
    #pragma DATA_ALIGN(word_aligned_array, 4)
    unsigned char word_aligned_array[256];

**Listing Four**

/* "left" center-of-mass numerator */ for (ii=0; ii<T; ii++) sumofprod1 += ii*hist[ii]; /* "right" center-of-mass numerator */ for (ii=T+1; ii<MP; ii++) /* MP=255 for 8-bit images */ sumofprod2 += ii*hist[ii];Back to article

**Listing Five**

    /* pixval = {0, 1, 2, ..., 255} for an 8-bit image */
    for (ii = 0; ii < T; ii++)        /* left */
        sumofprod1 += pixval[ii] * hist[ii];

    for (ii = T+1; ii < MP; ii++)     /* right */
        sumofprod2 += pixval[ii] * hist[ii];

**Listing Six**

    _nassert((int)pixval % 8 == 0);
    _nassert((int)hist % 8 == 0);
    #pragma MUST_ITERATE(,,4)
    for (ii = 0; ii < T; ii++)        /* left */
        sumofprod1 += pixval[ii] * hist[ii];

    _nassert((int)pixval % 8 == 0);
    _nassert((int)hist % 8 == 0);
    #pragma MUST_ITERATE(,,4)
    for (ii = T+1; ii < MP; ii++)     /* right */
        sumofprod2 += pixval[ii] * hist[ii];

**Listing Seven**

    unsigned long dotproduct(int lo, int hi)
    {
        /* 0, 1, 2, ..., 255 */
        static const unsigned short pixval[] = {0, 1, 2, /* 3, 4, ..., 252 */ 253, 254, 255};
        unsigned long sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum;
        const int N = hi - lo;
        int ii = 0, jj = lo;
        double h1_h2_h3_h4, b1_b2_b3_b4;
        unsigned int h1_h2, h3_h4, b1_b2, b3_b4;
        /* unrolled dot-product loop with nonaligned double-word reads;
         * stop short of N so the 64-bit reads stay in bounds */
        for (; ii < (N & ~3); ii += 4, jj += 4) {
            h1_h2_h3_h4 = _memd8_const(&hist[jj]);
            h1_h2 = _lo(h1_h2_h3_h4);
            h3_h4 = _hi(h1_h2_h3_h4);
            b1_b2_b3_b4 = _memd8_const(&pixval[jj]);
            b1_b2 = _lo(b1_b2_b3_b4);
            b3_b4 = _hi(b1_b2_b3_b4);
            sum1 += _mpyu(h1_h2, b1_b2);    /* (h1)(b1) */
            sum2 += _mpyhu(h1_h2, b1_b2);   /* (h2)(b2) */
            sum3 += _mpyu(h3_h4, b3_b4);    /* (h3)(b3) */
            sum4 += _mpyhu(h3_h4, b3_b4);   /* (h4)(b4) */
        }
        sum = sum1 + sum2 + sum3 + sum4;
        /* loop epilogue: if the iteration count were guaranteed to be
         * a multiple of 4, this would not be required */
        for (; ii < N; ii++, jj++)
            sum += hist[jj] * pixval[jj];
        return sum;
    }

**Listing Eight**

    unsigned long dotproduct(int lo, int hi)
    {
        /* 0, 1, 2, ..., 255 */
        static const unsigned short pixval[] = {0, 1, 2, /* 3, 4, ..., 252 */ 253, 254, 255};
        unsigned long sum1 = 0, sum2 = 0, sum;
        const int N = hi - lo;
        int ii = 0, jj = lo;
        double h1_h2_h3_h4, b1_b2_b3_b4;
        unsigned int h1_h2, h3_h4, b1_b2, b3_b4;
        /* unrolled dot-product loop with nonaligned double-word reads;
         * stop short of N so the 64-bit reads stay in bounds */
        for (; ii < (N & ~3); ii += 4, jj += 4) {
            h1_h2_h3_h4 = _memd8_const(&smoothed_hist[jj]);
            h1_h2 = _lo(h1_h2_h3_h4);
            h3_h4 = _hi(h1_h2_h3_h4);
            b1_b2_b3_b4 = _memd8_const(&pixval[jj]);
            b1_b2 = _lo(b1_b2_b3_b4);
            b3_b4 = _hi(b1_b2_b3_b4);
            sum1 += _dotp2(h1_h2, b1_b2);   /* (h1)(b1) + (h2)(b2); see Figure 4 */
            sum2 += _dotp2(h3_h4, b3_b4);   /* (h3)(b3) + (h4)(b4) */
        }
        sum = sum1 + sum2;
        /* loop epilogue: if the iteration count were guaranteed to be
         * a multiple of 4, this would not be required */
        for (; ii < N; ii++, jj++)
            sum += smoothed_hist[jj] * pixval[jj];
        return sum;
    }