Optimizing Microcontroller Performance

Software practitioners often face the challenge of enhancing software performance. James describes one such situation he encountered when developing control software for a Mitsubishi M37735-based cellular phone.


November 01, 1997
URL:http://www.drdobbs.com/embedded-systems/optimizing-microcontroller-performance/184410319

Dr. Dobb's Journal November 1997: Optimizing Microcontroller Performance

Getting around a memcpy bottleneck

James is a senior staff engineer with Ericsson Inc. He can be reached at flynn@egertp.ericsson.se.


As software developers, we often face the prospect of enhancing our software's performance. Usually, these challenges are the result of a few frequently executed routines. I recently encountered such a situation when developing control software for a Mitsubishi M37735-based cellular phone.

The problem I encountered was that the memcpy() that came with my compiler -- the IAR 7700 Micro Series compiler and IAR 7700 Assembler -- took too long for our interrupt-service routine requirements (which we wanted to write in C). This was due to a classic "good news/bad news" feature. The good news was that the IAR compiler was designed to support a variety of memory models (small, medium, compact, and banked) and target processors. The bad news was that the supplied run-time library routines had not been optimized for each target/memory model. However, after rewriting the routine in assembly to take advantage of the M37735's features, we were able to considerably cut execution time.

Tools on the Workbench

The compiler, which comes with the IAR 7700 Embedded Workbench development environment (http://www.iar.com/), is an optimized ANSI C compiler with extended keywords specific to the processor at hand. (In addition to the Mitsubishi M37000 and M37050 families, versions of the environment are available for processors from Intel, Motorola, Hitachi, Zilog, NEC, Atmel, Toshiba, and Texas Instruments.) The environment is hosted on DOS, Sun 4 Solaris, and the HP9000/700 series. With it, you can create projects, edit files, compile, assemble, link, build, and debug without leaving the environment. The Workbench also includes the C-SPY simulator/debugger, a relocatable macroassembler, a universal linker XLINK, a librarian XLIB, and an error-sensitive ASCII editor.

On the hardware side, the Mitsubishi M37735 is a member of Mitsubishi's single-chip microcomputer 7700 family core that has been optimized for embedded applications. It is a low-power static CMOS design with a 16-bit processor core that can also operate in 8-bit mode. Packaged with the processor are:

The total physical-addressing capacity of the Mitsubishi is 16 MB. This memory is arranged as 256 banks of 64 KB. In the 7735 family, the processor only has sufficient address lines and chip selects to address the first 15 banks of memory (0-0FFFFF); see Figure 1. Additionally, the first bank of memory has many predefined areas. These predefined areas are:

Figure 2 shows the registers available in the M37735. The A and B accumulators are selectable as either 8 or 16 bits wide. All data operations (arithmetic, data transfer, input/output, and so on) use these registers. The preferred accumulator is A because it generates fewer op-codes, thus minimizing execution time.

The index registers X and Y are also selectable as either 8 or 16 bits wide. These registers are used for all indexing operations as well as by selected instructions. The block transfer (MVN) is one such instruction. In the MVN instruction, the X register holds the low-order 16 bits of the source data address, and the third byte of the instruction holds the high-order 8 bits of the source address. Likewise, the Y register holds the low-order 16 bits of the destination data address, and the second byte of the instruction holds the high-order 8 bits of the destination address.
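The byte layout just described can be sketched in C. This is only an illustration, assuming 24-bit addresses held in a uint32_t; the names mvn_encoding, bank_of, offset_of, and encode_mvn are hypothetical and do not come from the IAR tools. The op-code value 0x54 is the one defined in Listing Two.

```c
#include <stdint.h>

#define MVN_OPCODE 0x54u  /* block-move op-code, as defined in Listing Two */

/* Hypothetical container for the three bytes of one MVN instruction. */
typedef struct {
    uint8_t bytes[3];
} mvn_encoding;

/* Bank: the high-order 8 bits of a 24-bit address. */
static uint8_t bank_of(uint32_t addr24)
{
    return (uint8_t)((addr24 >> 16) & 0xFFu);
}

/* Offset: the low-order 16 bits, which X (source) or Y (destination) holds. */
static uint16_t offset_of(uint32_t addr24)
{
    return (uint16_t)(addr24 & 0xFFFFu);
}

/* Build the instruction bytes: op-code, destination bank, source bank. */
static mvn_encoding encode_mvn(uint32_t dest24, uint32_t src24)
{
    mvn_encoding e;
    e.bytes[0] = MVN_OPCODE;
    e.bytes[1] = bank_of(dest24);  /* second byte: destination bank */
    e.bytes[2] = bank_of(src24);   /* third byte: source bank */
    return e;
}
```

For a destination of 0x021234 and a source of 0x0A5678, this sketch yields the bytes 0x54, 0x02, 0x0A, with Y loaded with 0x1234 and X with 0x5678.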

The stack pointer register is a 16-bit register that is used to access the hardware stack during subroutine calls and interrupts. Since this register has no associated bank register, it must reside in the bank 0 address space.

The program bank (PG) and data bank (DT) registers provide the upper 8 bits of address space used to index the 15 banks of memory for code and data, respectively.

The processor status register (PS) provides the same type of information that most status registers provide as well as providing control bits to manage data and index register widths (either 8 or 16 bits).

Of the Mitsubishi's registers, the direct page register (DPR) deserves special attention. In IAR's implementation of C, this register is used as the software frame pointer. Like the stack pointer register (S), the direct page register only addresses bank 0. An interesting attribute of the DPR is its ability to generate data addresses with a reduced number of clock cycles if its low-order byte is on a page boundary (00 hexadecimal). Since all direct addressing modes of the processor use this register, its operation must be well understood.

The Mitsubishi M37735 has 19 different interrupts and programmability for 7 different interrupt levels. While there is an inherent priority in the order of interrupt detection, each interrupt source also allows you to program its desired interrupt priority.

One interrupt, the watchdog timer, is essentially a programmable NMI. This interrupt is intended to be a check for errant program operation -- a watchdog timer. The period of this interrupt is programmable as the system clock divided by either 32 or 512.

The timers supported by the Mitsubishi are grouped into two types -- five are A timers and three are B timers. The principal difference between the two types is that A timers can be used for one-shot pulse generation and pulse-width modulation, whereas B timers can be used to perform pulse-period measurements but have no output pins associated with them. An interesting aspect of the B2 timer is that it can be internally connected (cascaded) to the B1 timer. This connection (combined with the provision that an external subclock can be selected to drive the B2 timer) allows for implementing a time-of-day clock.

Solving the memcpy() Problem

Since the compiler is capable of outputting code for a variety of memory models, IAR implemented the memcpy() function in C (so the various memory flavors could be simply compiled). While I have no idea what the compiler's actual C run-time library routine looks like, I expect (from looking at the assembly on an in-circuit emulator) that it was written with a generic for-loop structure like Listing One. While this is an easy solution for the run-time library supplier, it results in a routine that is painfully slow -- requiring more than a millisecond to copy a mere 16 bytes of data. After investigating this problem further, I discovered that Mitsubishi had, in fact, implemented an intrinsic block move instruction (MVN) that would quickly move a block of data between locations. The catch to this instruction is that, to use it, the destination and source bank addresses must be encoded as part of the executable instructions. Since memcpy() is a run-time library routine, it also has to be reentrant, meaning that no static or global memory can be used.

To implement a more efficient version of memcpy, I determined that the solution would be to write a routine that dynamically places the op-codes necessary on the frame, then continues execution there; in other words, a self-modifying (or self-writing) routine. Implementing the code on the caller's frame would ensure that the routine was reentrant (required for a run-time library routine) and that the execution speed increase would be significant.

Hardware Stacks and Software Frames

When calling functions in C (or most high-level languages), the compiler manages the hardware stack and software frame transparently. Implementing a routine that uses this memory differently requires a good understanding of how these two structures are used. In most compilers, the stack and frame occupy the same memory space, with both the software frame and hardware stack intermixed as in Figure 3.

This implementation is conceptually easy to follow, allows for detection of stack overruns, and is the model most of us are accustomed to. When implementing the stack/frame handling for the Mitsubishi, IAR chose a separate frame/stack model because of the way that the DPR register is implemented. This register is used as the software frame pointer and implements all direct addressing. Since the value that this register contains is assumed to be an address offset in bank 0, it is much easier to operate in a forward-moving memory address space. The previous memory model, though easy to follow, requires that the address space always be moved from high (0xFFFF) to low (0x0000). By separating the frame and stack, the DPR register always moves towards increasing addresses. A major potential problem with this model is that the frame could overrun the hardware stack, resulting in total anarchy. Figure 4 shows how memory for the separate stack and frame model is implemented.

Using this separate stack/frame model, as functions are called and local variables are created, space is allocated and reserved on the software frame -- based upon the previous frame address. These ancestor frame locations are known by definition as being located immediately following the calling function's return address. This allows entry code to correctly locate already allocated frame space.
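The separate stack and frame model, and the overrun hazard it carries, can be illustrated with a small C model of bank 0. This is a sketch under stated assumptions, not IAR's run-time code: the type bank0_model and the functions frame_alloc, frame_free, and stack_push are hypothetical names, and the check simply refuses any allocation or push that would let the upward-growing frame meet the downward-growing stack.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of bank 0: the frame grows up, the stack grows down. */
typedef struct {
    uint16_t frame_top;  /* first free frame byte (frames grow toward higher addresses) */
    uint16_t stack_ptr;  /* next free stack byte (stack grows toward lower addresses) */
} bank0_model;

/* Reserve `size` bytes of frame space; fail if that would reach the stack. */
static bool frame_alloc(bank0_model *m, uint16_t size, uint16_t *frame_base)
{
    if ((uint32_t)m->frame_top + size > m->stack_ptr) {
        return false;  /* frame would overrun the hardware stack */
    }
    *frame_base = m->frame_top;
    m->frame_top = (uint16_t)(m->frame_top + size);
    return true;
}

/* Release the most recently reserved `size` bytes of frame space. */
static void frame_free(bank0_model *m, uint16_t size)
{
    m->frame_top = (uint16_t)(m->frame_top - size);
}

/* Push one byte; fail if the stack would run into the frame. */
static bool stack_push(bank0_model *m)
{
    if (m->stack_ptr < m->frame_top) {
        return false;  /* stack would overrun the software frame */
    }
    m->stack_ptr = (uint16_t)(m->stack_ptr - 1u);
    return true;
}
```

On the real part no such check exists in hardware, which is why an overrun in either direction goes unnoticed until data or return addresses are corrupted.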

With this usage information, implementing my new and improved memcpy() would simply require me to reserve the space needed for my executable code, write my program snippet into this memory, and continue execution. Once the program snippet completes, it should clean up after itself by restoring the caller's frame pointer to the DPR register, then returning.

Implementation of Solution

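To use Mitsubishi's block move (MVN) instruction, the following basic register usage is needed (as described earlier and in the comments of Listing Two):

X register: low-order 16 bits of the source address
Y register: low-order 16 bits of the destination address
A accumulator: number of bytes to move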

As previously discussed, the upper 8 bits of address for the destination and source addresses are encoded as the second and third bytes of the MVN instruction. In assembly, the encoding looks like MVN destination-address bank, source-address bank.

After writing the assembly necessary to implement this code fragment, I determined that 15 bytes of data would be required. This would account for the assembled move instruction as well as the necessary code to clean up after the routine had completed execution. In keeping with the compiler's conventions, I reserved 15 bytes of space for my routine's software frame. To complete the encoding of the program snippet, the incoming parameters (source and destination address as well as number of bytes) are accessed in a variety of locations -- the Y and B registers as well as the caller's software frame. Listing Two is the actual implementation of the memcpy() routine.

Unlike the Intel family of processors, where the width of the data is determined by the op-code encoding, the Mitsubishi relies on a bit in the status register for operating data width. Since I wanted to ensure that I could copy more than 256 bytes at once, I was careful to set both the data and index register widths to 16-bit mode. Throughout this code, you can see where the processor is switched back and forth between 8-bit and 16-bit mode, depending on the size of the data on which the program is operating. When I execute the return to start executing our newly constructed program continuation, I explicitly set the data width back to 16-bit mode.

Finally, after all the required executable code has been placed on the software frame, the software frame address is placed on the hardware stack and a return instruction is executed. This makes execution start at the first byte of the code on the frame -- which happens to be the MVN instruction, complete with the proper source and destination banks encoded.
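To make the layout of the generated code easier to follow, here is a C sketch that fills a 10-byte buffer with the same byte sequence that Listing Two stores at dp:5 through dp:14. The op-code values are copied from Listing Two; the function name build_snippet and the constant SNIPPET_LEN are hypothetical, and the sketch only models the bytes, since standard C cannot place them on the frame and jump to them.

```c
#include <stddef.h>
#include <stdint.h>

/* Op-code values copied from the #define lines in Listing Two. */
enum {
    MVN_INST = 0x54,  /* block move */
    PLD_INST = 0x2B,  /* pop frame (direct page) register */
    PLY_INST = 0x7A,  /* pop Y register */
    RTL_INST = 0x6B,  /* return long */
    PLA_INST = 0x68,  /* pop A register */
    PLB_INST = 0x42,  /* first byte of the two-byte pop-B op-code */
    PLT_INST = 0xAB   /* pop data bank register */
};

#define SNIPPET_LEN 10u  /* bytes written at dp:5 through dp:14 */

/* Hypothetical helper: write the snippet into buf and return its length. */
static size_t build_snippet(uint8_t buf[SNIPPET_LEN],
                            uint8_t dest_bank, uint8_t src_bank)
{
    size_t i = 0;
    buf[i++] = MVN_INST;   /* dp:5  block move */
    buf[i++] = dest_bank;  /* dp:6  destination bank */
    buf[i++] = src_bank;   /* dp:7  source bank */
    buf[i++] = PLA_INST;   /* dp:8  restore LSW of s1 to A */
    buf[i++] = PLB_INST;   /* dp:9  restore MSW of s1 to B ... */
    buf[i++] = PLA_INST;   /* dp:10 ... second byte of that op-code */
    buf[i++] = PLT_INST;   /* dp:11 restore DT register */
    buf[i++] = PLY_INST;   /* dp:12 clean up stack */
    buf[i++] = PLD_INST;   /* dp:13 restore the caller's frame pointer */
    buf[i++] = RTL_INST;   /* dp:14 return to caller */
    return i;
}
```

Only the second and third bytes vary from call to call; everything else is fixed cleanup code, which is what makes building the snippet at run time cheap.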

Conclusion

Admittedly, this technique (and accompanying routine) is not for the faint of heart. But when addressing execution bottlenecks, extreme measures are sometimes necessary. In the end, we reduced execution time for a routine that previously took a prohibitively long time to a speedy [145+(number of bytes copied)/2×7] clock cycles.
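As a rough worked example of that figure, the following C sketch evaluates the expression for a given byte count. It assumes the expression is read left to right as 145 + (n / 2) x 7, which is an interpretation rather than something the text states outright, and the function name memcpy_cycles is hypothetical. Integer division truncates odd byte counts in this sketch.

```c
#include <stdint.h>

/* Assumed reading of the quoted figure: 145 + (n / 2) * 7 clock cycles. */
static uint32_t memcpy_cycles(uint32_t n_bytes)
{
    return 145u + (n_bytes / 2u) * 7u;
}
```

Under that reading, copying 16 bytes costs 145 + 8 x 7 = 201 clock cycles, and copying 256 bytes costs 145 + 128 x 7 = 1041 clock cycles.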


Listing One

void *memcpy(void *s1, const void *s2, size_t n){
    char *su1;
    const char *su2;

    for (su1 = s1, su2 = s2; 0 < n; ++su1, ++su2, --n)
        *su1 = *su2;
    return s1;
}


Listing Two

;; op-codes needed to build program snippet on the software frame 

#define MVN_INST 0x54   ;; MOVE INSTRUCTION
#define PLD_INST 0x2B   ;; POP FRAME REGISTER
#define PLY_INST 0x7A   ;; POP Y REGISTER
#define RTL_INST 0x6B   ;; RETURN LONG
#define PLA_INST 0x68   ;; POP A REGISTER
#define PLB_INST 0x42   ;; POP B REGISTER
#define PLT_INST 0xAB   ;; POP DATA BANK REGISTER

;***************************************************************
; memcpy
;   void memcpy(void *s1, const void *s2, size_t n)
;   This routine copies n characters from the object pointed to by s2
;   into the object pointed to by s1. If copying takes place between
;   objects that overlap, the behavior is undefined.
; NOTE: This routine is written to conform to IAR M37735 interface
;   requirements for the banked memory model.
;Arguments:
;   s1 = destination pointer
;   s2 = source pointer
;   n  = number of bytes to copy
;Returns:
;   Returns s1
;***************************************************************
;; On entry the following registers & frame locations are in use.
;;   Y    - LSW of s1
;;   B    - MSW of s1
;;   dp:2 - LSW of s2
;;   dp:4 - MSW of s2
;;   dp:0 - Number of bytes to copy
;;
;; this routine uses memory locations dp:6 - dp:14 for code

        .public memcpy
memcpy:
        lda   a, 4, s             ;; set new frame pointer up
        phd                       ;; but first save the old one
        tad                       ;; now, calculate the new one
        clc                       ;; by adding the users request
        adc   a, #15              ;; of 15 bytes. Next save the new
        pha                       ;; frame location for future callers
        pht                       ;; save the DT register
        phb                       ;; Save destination address so it can
        phy                       ;; be returned to the caller.
        ldx   dp:2                ;; X=source offset,Y=destination offset
        .data 8                   ;;
        sem                       ;; switch to 8-bit data mode and
        ldm   #MVN_INST, dp:5     ;; setup op-codes on the frame
        sta   b, dp:6             ;; store destination bank
        lda   a, dp:4             ;; get & store source bank
        sta   a, dp:7             ;;
        ldm   #PLA_INST, dp:8     ;; restore LSW of s1 to the A register
        ldm   #PLB_INST, dp:9     ;; restore MSW of s1 to the B register
        ldm   #PLA_INST, dp:10    ;; (whose op-code is two bytes)
        ldm   #PLT_INST, dp:11    ;; restore DT register
        ldm   #PLY_INST, dp:12    ;; clean up stack
        ldm   #PLD_INST, dp:13    ;; restore dptr register
        ldm   #RTL_INST, dp:14    ;; then return to caller
        .data 16                  ;;
        clm                       ;; now, return to 16-bit mode and
        tda                       ;; get the address of the current frame
        clc                       ;; reposition it so it points to start
        adc   a, #5               ;; of newly constructed code snippet
        .data 8                   ;; switch to 8-bit data mode for move
        sem                       ;; (moves take place 8 bits at a time)
        lda   b, #0               ;; setup the h/w stack with the frame
        phb                       ;; address (stacks are in bank 0)
        .data 16                  ;;
        clm                       ;;
        pha                       ;; to perform a RTL when done
        lda   a, dp:0             ;; A = number of bytes to move
        clp   m,x                 ;; data & index both 16-bit mode.
        rtl                       ;; now we can jump to our snippet


DDJ


Copyright © 1997, Dr. Dobb's Journal


By James Flynn


Figure 1: Mitsubishi memory map.



Figure 2: M37735 registers.



Figure 3: Software frame and hardware stack.



Figure 4: How memory for the separate stack and frame model is implemented.

