Code Efficiency & Compiler-Directed Feedback

By Jackie Brenner and Markus Levy, December 01, 2003

With compiler-directed feedback, compiler output includes information you can leverage to tune application code for higher performance.

Dec03: Code Efficiency & Compiler-Directed Feedback

Increasing code efficiency by improving optimization

Jackie is the code-generation product technology manager and senior member of the Technical Staff at Texas Instruments. Markus is president of the Embedded Microprocessor Benchmark Consortium, senior analyst on the Instat/MDR team, and an editor of the Microprocessor Report. They can be contacted at [email protected] and [email protected], respectively.

Real-World Benchmarking

Feedback-directed optimizations are changes made to a program's execution based on observations made of previous code runs. Optimizing compilers that provide compiler-directed feedback (CDF) offer a different approach to feedback-directed optimization. With CDF, the compiler output created at build time (rather than observed at runtime) includes information you can leverage to tune application code for better performance. And when you couple a highly parallel architecture with a compiler that provides CDF capabilities that take advantage of the parallel nature of the processor, the results can be significant. In this article, we implement an algorithm in C and show how the CDF-supported code-generation tools that are part of the Texas Instruments Code Composer Studio IDE can significantly improve the algorithm's performance on processors such as the TMS320C64x DSP.

The TMS320C64x (Figure 1) provides two identical groups of four functional units that can execute an instruction per clock cycle, and two identical banks of 32 32-bit general-purpose registers. Independent data paths connect the functional units and register files, allowing eight instructions to execute in parallel. The compiler can use any of the 64 registers to hold either data or address pointers. Additionally, the instruction set is primarily RISC like; the individual instructions consist of relatively simple commands that the compiler can distribute among the eight parallel functional units.

These features play a prominent role in accelerating DSP code execution, which typically spends most of its time in loops, rather than control code. CDF generally focuses on these loops because that's where the biggest payback is derived. CDF provides details on the number of cycles needed for each loop along with information about parameters such as data dependencies and resource utilization.

Additionally, CDF provides even more value when you face partitioning challenges in parallel architectures—particularly in determining how to divide tasks so that all the architecture's functional units can be efficiently utilized. While compilers can be very aggressive in their optimizations, they still must be conservative enough so that they don't break the code. You must examine the data from a CDF and use it to let the compiler make more aggressive optimizations. The application of sound programming techniques and knowledge of a chip's architectural features makes it straightforward to take advantage of CDF to achieve performance increases.

Listing One is a standard "out-of-the-box" C implementation of the core computations for a typical autocorrelation function, while Listing Two is the compiler assembly output for this C implementation. Listing Three is the CDF report that results from compiling Listing One using the C64x compiler. First look at the bottom of the report at the value ii ("iteration interval"), which is 1 in this case. This says that the DSP runs one iteration of the loop in one cycle. Next, examine the value for .M, which tells how many times the multiply functional unit was used during that interval. In this case, its value equals "1" for the A-side and "0" for the B-side. Executing only one multiply instruction per clock cycle is a poor use of the device, especially with a DSP that can perform up to four 16×16-bit multiplies every clock cycle. This lack of resource utilization is equally apparent in the compiler assembly output (Listing Two), which has only one multiply (MPY, a 16×16-bit multiply).

Unrolling the Loop

How can you better take advantage of the chip's parallelism to gain more performance? Your first thought is probably to unroll the loop. In this process, you expand the single loop in the C code so it replicates by a factor of X (in this case, 4), then executes the loop N/X times where N was the original loop count. Obviously it increases code size, but it does reduce loop overhead in areas such as branching. More importantly, this approach allows better scheduling of low-level instructions.

By performing loop unrolling and recompiling the new C code (Listing Four), you can measure the performance boost. The compiler generates a usage report (Listing Five) for this module that indicates that the new assembly code (Listing Six) requires the chip to execute three cycles for one iteration of the expanded loop. This might seem counterproductive, but closer examination shows there is a net performance gain because the processor performs more work per cycle. Specifically, each of the two multiplier units performs two multiplies, obtaining four 16×16-bit multiplies in three cycles.

This seems good, but theoretically the chip can execute eight instructions in each of these three cycles, for a total of 24 instructions. However, the unrolled loop code scheduled only 12 instructions in those three cycles. Perhaps giving the compiler additional information about the loop—specifically the loop-count characteristics—can help.

Knowing the minimum number of iterations through the loop aids the compiler in choosing the best software pipelining and nested-loop transformations. In this case, you should insert:

#pragma MUST_ITERATE(4,,4);

immediately after {sum_0=sum_1=sum_2= sum_3=0; at the top of the loop in the unrolled code in Listing Four. The first parameter in the MUST_ITERATE pragma is "4," specifying that the loop executes a minimum of four times. (The complete listing using MUST_ITERATE is available electronically; see "Resource Center," page 5.) The next parameter, which specifies the maximum number of times the loop iterates, is left blank because the maximum number of times this loop executes is likely not known until runtime. The pragma's last parameter tells the compiler whether the loop count is a multiple of a known number.

Even though you might not be able to determine the exact number of iterations, you may be able to supply a multiple, in this case, a value of 4. This tells the compiler that the number of times the loop executes is always a multiple of four (in the previous optimization, you unrolled the loop four times). In most cases when the loop modulo is known, the compiler uses this information to automatically unroll the loop.

After recompiling the code, the CDF report (Listing Seven) provides some interesting information. First, it shows a Loop Unroll Multiple of 4, and the Known Maximum Trip Count drops by 75 percent to 8190. The routine now requires an eight-cycle loop, but schedules 12 multiplies. A look at the assembly code (Listing Eight), however, shows that the code performs 12 16×16-bit multiplies. How is this possible?

The C64x architecture contains instructions that can perform multiple operations in one functional unit per clock cycle. In this case, the compiler makes use of the DOTP2 instruction, which computes a dot product. This powerful instruction performs two 16×16-bit multiplies and adds the pair of resulting 32-bit products; see Figure 2. When examining the resulting assembly code, remember to count two multiplies for each of the two occurrences of this instruction.

There's a hint of even further optimization possibilities in the compiler output. Specifically, when reviewing the latest assembly code, you see it uses a large number of nonaligned memory accesses, namely the instructions LDNW (which allows unaligned loads of 4 bytes from memory) and LDNDW (which does the same for 8 bytes). One limitation of unaligned accesses is that the chip can perform only one such access per clock cycle versus two accesses for aligned data.

In embedded systems, however, you are responsible for the placement of code and data in memory and, therefore, can determine when the data of interest is on an aligned boundary. If you give the compiler this information, it will likely use aligned instructions that are more efficient at loading and storing data in memory. To relay this information to the compiler, use the _nassert() intrinsic, which generates no code, but tells the optimizer that the expression declared with the assert function is True. This gives the compiler a hint as to which optimizations it might employ.

In this case, add:

_nassert((int)x % 8 == 0);

_nassert((int)i % 4 == 0);

above the pragma from the previous step (after {sum_0=sum_1=sum_2=sum_3=0;, but before #pragma MUST_ITERATE(4,,4);). (The complete listing using _nassert is available electronically; see "Resource Center," page 5.) These intrinsics tell the compiler that the x pointer is aligned on a 64-bit double-word boundary, and that the i pointer is aligned on a 32-bit word boundary. The C64x supports double-word loads and stores; there are two 64-bit paths for loading data from memory to the register file. To maximize data throughput it's desirable to use a single load or store instruction to access multiple data values located consecutively in memory. With the _nassert() intrinsic, the compiler knows that it's safe to use an LDW or LDDW instruction, which loads two to four 16-bit values in one cycle.

After running the updated C code through the compiler, the resulting assembly code (Listing Nine) uses LDW and LDDW. The CDF report (Listing Ten) in the assembly file shows what improvements their addition brings. Most obvious, the number of cycles needed to run through the loop has dropped from eight to six. A further examination shows that the number of multiply operations per iteration has dropped from 12 to 10, split equally between the A-side and the B-side. At first, this might seem counterproductive, but you also must look at the resulting compiler output and see exactly which multiply instructions are used. The code includes four uses of the DOTP2 instruction and two uses of the MPY2 instruction—both of which can perform two multiplies in one cycle. These instructions account for 12 multiplies, while a handful of other single-multiply instructions bring the total to 16. Not only have you reduced the number of cycles, you've also increased the number of multiplies per cycle. With the aligned data access, you can bring in more data and make better use of the instructions.

Device-Specific Optimization

To this point, we've achieved some impressive performance gains by examining information from the compiler and using techniques applicable to almost any DSP or compiler. You can realize even more performance improvements if you add device-specific intrinsics. In particular, you can add intrinsics that direct the compiler to use special instructions at specific places in the algorithm. These intrinsics (see Listing Eleven; available electronically) use the same C variable names, and even though they're not portable to other architectures, they are register and resource agnostic.

The C64x architecture can bring data in as 64-bit values, and the chip needs 16 16-bit values every clock cycle if it is to perform eight 16×16 multiplies every clock cycle. You can achieve this goal with a series of DOTP2 instructions (Listing Twelve; available electronically) and a few intrinsics that perform some associated data handling.

The results from this optimization step demonstrate dramatic performance improvements (see Listing Thirteen; available electronically). The loop runs in four cycles and executes four 16×16-bit multiplies per cycle. The DSP is fully utilizing its resources; you've successfully optimized the device as far as possible for the loop. Indeed, Table 1 summarizes how far you've come down the optimization trail. Most of the techniques are straightforward and don't require you to be a DSP guru. The original out-of-the-box C code performs one multiply per cycle and requires 16,000 total cycles to get the job done. At the other extreme, after you've pulled all the tricks out of your programmer's bag, the code performs four multiplies per cycle and requires only 4000 total cycles—a reduction of 75 percent.

Likewise, Figure 3 presents three certified Telemark figures for the TMS320C6416-720 from EEMBC (for information on EEMBC, see the accompanying text box entitled "Real-World Benchmarking"). The first value, 19.5, is calculated when you run the EEMBC benchmark code unmodified. The second column, with a value of 379.1, reflects the Telemark for the "full-fury" test where you can pull out all the stops in writing C code such as in this article. Working strictly within C achieves a performance gain of roughly 32X.

The Telemark resulting from the handcrafting of a DSP expert working with assembler code delivers a Telemark of 628.6. Although the performance gain is 66 percent higher, attaining that extra gain takes tremendous time and effort. Meanwhile, you can achieve a 32X boost just through the intelligent use of a C compiler, which takes far less time and expertise.

Conclusion

While Table 1 and Figure 3 demonstrate the value of compiler-directed feedback, a current limitation of CDF is the process of reading and interpreting the large amount of data that compilers generate in their performance reports. That can be tedious, requiring you to study the compiler documentation to discover how to apply the data. Luckily, compilers that provide more interactive help with the feedback, suggesting various techniques that would work well based on its knowledge of the architecture, are on the horizon.

DDJ

Listing One

void E_autocor_cn
(const e_s16 *restrict x,               /* Input data samples         */
             e_s16         *restrict r, /* Output correlations        */
             e_s16 nx,                  /* Number of input samples    */
             e_s16 nr,                  /* Number of outputs          */
             e_s16 scale                /* Scale factor (right shift) */
)
{     n_int i, j;
      long sum;
      for (i = 0; i < nr; i++)
      {
          sum = 0;
          for (j = i; J < nx; j++)
              sum += x[j] * x[j - i];
          r[i] = sum >> (16 + scale);
      }
}

Back to Article

Listing Two

   [A0]    SUB    .S1    A0,1,A0
|| [!A0]   ADD    .L1    A7,A5:A4,A5:A4
||         MPY    .M1X   B5,A3,A7
|| [B0]    BDEC   .S2    L3,B0
||         LDH    .D1T1  *A6++,A3
||         LDH    .D2T2  *B4++,B5

Back to Article

Listing Three

;* SOFTWARE PIPELINE INFORMATION
;*    Known Minimum Trip Count           : 1
;*    Known Maximum Trip Count           : 32767 
;*    Known Max Trip Count               : 1  
;*    Loop Carried Dependency Bound (^)  : 0  
;*    Unpartitioned Resource Bound       : 1
;*    Partitioned Resource Bound (*)      : 1
;*    Resource Partition: 
;*                              A-side     B-side
;*    .L Units                   1*          0
;*    .S Units                   0           1*
;*    .D Units                   1*          1*
;*    .M Units                   1*          0
;*    .X cross paths             1*          0
;*    .T address paths           1*          1*
;*    Long read paths            1*          0
;*    Long write paths           0           0   
;*    Logical ops (.LS)          0           0   (.L or .S unit)
;*    Addition ops (.LSD)        0           0   (.L or .S or .D unit)
;*    Bound (.L .S .D)           1*          1* 
;*    Bound (.L .S .D .LS .LSD)  1*          1* 
;*
;*    Searching for software pipeline schedule at ...
;*       ii = 1 Schedule found with 8 iterations in parallel done

Back to Article

Listing Four

{ n_int i,j,
   long sum_0, sum_1, sum_2, sum_3;
   for (i=0; i < nr; i+=4)
   {sum_0 = sum_1 = sum_2 = sum_3 = 0;
      for(j=i+4; j < nx; j++)
      { sum_0 += x[j] * x[j-i-0];   
        sum_1 += x[j] * x[j-i-1];   
        sum_2 += x[j] * x[j-i-2];   
        sum_3 += x[j] * x[j-i-3];   
      }
   sum_0 += x[i+0]*x[0] + x[i+1]*x[1] + x[i+2]*x[2] +  x[i+3]*x[3];
   sum_1 += x[i+1]*x[0] + x[i+2]*x[1] + x[i+3]*x[2];
   sum_2 += x[i+2]*x[0] + x[i+3]*x[1];
   sum_3 += x[i+3]*x[0];
   r[i+0] = sum_0 >> (16 + scale);
   r[i+1] = sum_1 >> (16 + scale);
   r[i+2] = sum_2 >> (16 + scale);
   r[i+3] = sum_3 >> (16 + scale);
   }
}

Back to Article

Listing Five

;* SOFTWARE PIPELINE INFORMATION
;*    Known Minimum Trip Count           : 1
;*    Known Maximum Trip Count           : 32763 
;*    Known Max Trip Count               : 1  
;*    Loop Carried Dependency Bound (^)  : 1  
;*    Unpartitioned Resource Bound       : 2
;*    Partitioned Resource Bound (*)      : 2
;*    Resource Partition: 
;*                              A-side     B-side
;*    .L Units                   2*          2*
;*    .S Units                   1           0
;*    .D Units                   1           1
;*    .M Units                   2*          2*
;*    .X cross paths             2*          2*
;*    .T address paths           1           2*
;*    Long read paths            2*          2*
;*    Long write paths           0           0   
;*    Logical ops (.LS)          0           0   (.L or .S unit)
;*    Addition ops (.LSD)        0           0   (.L or .S or .D unit)
;*    Bound (.L .S .D)           2*          1 
;*    Bound (.L .S .D .LS .LSD)  2*          1 
;*
;*    Searching for software pipeline schedule at ...
;*       ii = 2 Did not find schedule 
;*       ii = 3 Schedule found with 4 iterations in parallel

Back to Article

Listing Six

       MPYHL     .M2X       A5,B16,B9
||      MPY      .M1X       B16,A4,A16
||[A0]    BDEC   .S1        L3,A0
||     LDNDW     .D1T1      *A3++(2),A5:A4
  [!A1]   ADD    .L1        A4,A9:A8,A9:A8
||[!A1]   ADD    .L2        B9,B7:B6,B7:B6
  [A1]    SUB    .D1        A1,1,A1
||[!A1]   ADD    .L2        B9,B5:B4,B5:B4
||[!A1]   ADD    .L1        A16,A7:A6,A7:A6
        MPY      .M1X       B6,A5,A4
       MPYHL     .M2X       A4,B16,B9
||      LDH      .D2T2      *B8++,B16

Back to Article

Listing Seven

;*  SOFTWARE PIPELINE INFORMATION
;*     Loop Unroll Multiple               : 4x
;*     Known Minimum Trip Count           : 1
;*     Known Maximum Trip Count           : 8190
;*     Known Max Trip Count               : 1  
;*     Loop Carried Dependency Bound (^)  : 3  
;*     Unpartitioned Resource Bound       : 7
;*     Partitioned Resource Bound (*)      : 7
;*     Resource Partition: 
;*                               A-side     B-side
;*     .L Units                   7*          7*
;*     .S Units                   0           1
;*     .D Units                   3           1
;*     .M Units                   5           7*
;*     .X cross paths             5           4
;*     .T address paths           4           4
;*     Long read paths            7*          7*
;*     Long write paths           0           0   
;*     Logical ops (.LS)          2           0   (.L or .S unit)
;*     Addition ops (.LSD)        4           0   (.L or .S or .D unit)
;*     Bound (.L .S .D)           5           4 
;*     Bound (.L .S .D .LS .LSD)  6           3 
;*
;*     Searching for software pipeline schedule at ...
;*       ii = 7 Did not find schedule 
;*       ii = 8 Schedule found with 3 iterations in parallel

Back to Article

Listing Eight

       MPY       .M2X       B29,A3,B27
||     MV        .D1X       B29,A24
||     MPY       .M1        A26,A25,A27
  [!A0]   ADD    .L1        A26,A7:A6,A7:A6
||[!A0]   ADD    .L2        B27,B7:B6,B7:B6
||       DOTP2   .M1        A24,A25,A27
||       PACKLH2 .S1        A3,A3,A26
  [!A0]   ADD    .L2        B27,B17:B16,B17:B16
||      MPYHL    .M1        A25,A24,A27
||[B0]   BDEC    .S2        L5,B0
||[!A0]   ADD    .L1X       B30,A17:A16,A17:A16
||      MPYHL    .M2        B29,B25,B27
  [!A0]   ADD    .L2        B31,B9:B8,B9:B8
||[!A0]   ADD    .L1        A27,A9:A8,A9:A8
||      MPYHL    .M2        B28,B24,B24
||        ADD    .S1        2,A30,A29
||     LDNDW     .D1T2      *A30,B35:B24
  [!A0]   ADD    .L1        A27,A23:A22,A23:A22
||[!A0]   ADD    .L2        B27,B19:B18,B19:B18
||        MV     .D1X       B28,A26
||        MPY2   .M2X       B28,B24,B27
||     LDNDW     .D2T2      *B1++(8),B29:B28
  [!A0]   ADD    .L1X       B26,A19:A18,A19:A18
||        ADD    .L2        B24,B21:B20,B21:B20
||      MPYH     .M2        B29,B25,B30
||      DOT2     .M1        A26,A24,A20
||       ADD     .S1        8,A30,A30
||     LDNDW     .D1T1      *-A30(6),A26:A24
  [A0]   SUB     .D1        A0,1,A0
||[!A0]  ADD     .L1X       B30,A21:A20,A21:A20
||       ADD     .L2        B27,B23:B22,B23:B22
||     MPYHL     .M1        A24,A26,A26
||      MPY2     .M2X       B28,A28,B31:B30

Back to Article

Listing Nine

   [B0]   BDEC   .S2        L3,B0
||       DOTP2   .M1        A20,A21,A20
||[!A0]   ADD    .L2X       A25,B17:B16,B17:B16
||[!A0]   ADD    .L1        A26,A5:A4,A5:A4
||        MV     .D1X       B26,A25
  [!A0]   ADD    .L2        B22,B7:B6,B7:B6
||[!A0]   ADD    .L1        A26,A23:A22,A23:A22
||       DOTP2   .M2X       B27,A20,B23
||       DOTP2   .M1        A25,A20,A25
||       PACKLH2 .S1        A20,A20,A26
||       LDDW    .D2T2      *B28++,B27:B26
  [!A0]   ADD    .L1        A24,A9:A8,A9:A8
||[!A0]   ADD    .L2        B23,B5:B4,B5:B4
||      MPYHL    .M2X       B27,A21,B22
||      DOTP2    .M1        A25,A28,A26
||       PACKLH2 .S1        A28,A28,A27
||       LDDW    .D1T1      *++A3,A21:A20
  [!A0]   ADD    .L1X       B22,A7:A6,A7:A6
||[!A0]   ADD    .L2        B25,B21:B20,B21:B20
||      MPYHL    .M1        A25,A20,A24
||       MPY     .M2X       B26,A24,B22
||       LDW     .D1T1      +-A3(4),A28
  [!A0]   ADD    .L1        A20,A19:A18,A19:A18
||        ADD    .L2        B22,B9:B8,B9:B8
||         MV    .S1X       B27,A20
||       MPY2    .M2X       B27,A26,B23:B22
||        LDH    .D1T1      *-A3(6),A24
  [A0]    SUB    .D1        A0,1,A0
||[!A0]   ADD    .L1X       B24,A17:A16,A17:A16
||        ADD    .L2        B23,B19:B18,B19:B18
||        MPYHL  .M1        A28,A20,A26
||        MPY2   .M2X       B26,A27,B25:B24

Back to Article

Listing Ten

;*  SOFTWARE PIPELINE INFORMATION
;*     Loop Unroll Multiple               : 4x
;*     Known Minimum Trip Count           : 1
;*     Known Maximum Trip Count           : 8190
;*     Known Max Trip Count               : 1  
;*     Loop Carried Dependency Bound (^)  : 3  
;*     Unpartitioned Resource Bound       : 6
;*     Partitioned Resource Bound (*)      : 6
;*     Resource Partition: 
;*                               A-side     B-side
;*     .L Units                   6*          6*
;*     .S Units                   0           1
;*     .D Units                   3           1
;*     .M Units                   5           5
;*     .X cross paths             4           6*
;*     .T address paths           3           1
;*     Long read paths            6*          6*
;*     Long write paths           0           0   
;*     Logical ops (.LS)          2           0   (.L or .S unit)
;*     Addition ops (.LSD)        4           0   (.L or .S or .D unit)
;*     Bound (.L .S .D)           4           4 
;*     Bound (.L .S .D .LS .LSD)  5           3 
;*
;*     Searching for software pipeline schedule at ...
;*       ii = 6 Schedule found with 3 iterations in parallel

Back to Article

1 2 3 4 5 6 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Code Efficiency & Compiler-Directed Feedback

Increasing code efficiency by improving optimization

Unrolling the Loop

Device-Specific Optimization

Conclusion

Listing One

Listing Two

Listing Three

Listing Four

Listing Five

Listing Six

Listing Seven

Listing Eight

Listing Nine

Listing Ten

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Code Efficiency & Compiler-Directed Feedback

Increasing code efficiency by improving optimization

Unrolling the Loop

Device-Specific Optimization

Conclusion

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content