Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Channels ▼

Embedded Systems

Optimizing Microcontroller Performance

Dr. Dobb's Journal November 1997: Optimizing Microcontroller Performance

Getting around a memcpy bottleneck

James is a senior staff engineer with Ericsson Inc. He can be reached at [email protected] egertp.ericsson.se.

As software developers, we often face the prospect of enhancing our software's performance. Usually, these challenges are the result of a few frequently executed routines. I recently encountered such a situation when developing control software for a Mitsubishi M37735-based cellular phone.

The problem I encountered was that the memcpy() that came with my compiler -- the IAR 7700 Micro Series compiler and IAR 7700 Assembler -- took too long for our interrupt-service routine requirements (which we wanted to write in C). This was due to a classic "good news/bad news" feature. The good news was that the IAR compiler was designed to support a variety of memory models (small, medium, compact, and banked) and target processors. The bad news was that the supplied run-time library routines had not been optimized for each target/memory model. However, after rewriting the routine in assembly to take advantage of the M37735's features, we were able to considerably cut execution time.

Tools on the Workbench

The compiler, which comes with the IAR 7700 Embedded Workbench development environment (http://www.iar.com/), is an optimized ANSI C compiler with extended keywords specific to the processor at hand. (In addition to the Mitsubishi M37000 and M37050 families, versions of the environment are available for processors from Intel, Motorola, Hitachi, Zilog, NEC, Atmel, Toshiba, and Texas Instruments.) The environment is hosted on DOS, Sun 4 Solaris, and the HP9000/700 series. With it, you can create projects, edit files, compile, assemble, link, build, and debug without leaving the environment. The Workbench also includes the C-SPY simulator/debugger, a relocatable macroassembler, a universal linker XLINK, a librarian XLIB, and an error-sensitive ASCII editor.

On the hardware side, the Mitsubishi M37735 is a member of Mitsubishi's single-chip microcomputer 7700 family core that has been optimized for embedded applications. It is a low-power static CMOS design with a 16-bit processor core that can also operate in 8-bit mode. Packaged with the processor are:

  • 32 KB of on-chip ROM or EPROM.
  • 3968 bytes of on-chip RAM.
  • 19 types and 7 levels of interrupts.
  • Eight 16-bit timers.
  • Three UARTS (synchronous or asynchronous).
  • Eight 10-bit A/D converters.
  • 12-bit NMI watchdog timer.
  • 72 binary I/Os.

The total physical-addressing capacity of the Mitsubishi is 16 MB. This memory is arranged as 255 banks of 64 KB. In the 7735 family, the processor only has sufficient address lines and chip selects to address the first 15 banks of memory (0-0FFFFF); see Figure 1. Additionally, the first bank of memory has many predefined areas. These predefined areas are:

  • Address 0x00-0x7F contain the special-function registers that are used to access on-chip peripheral devices previously described.
  • 3968 bytes of on-chip RAM. Though it is not obvious, many of the addressing modes require that data reside in bank 0 RAM. Since the stack must also reside in bank 0 (it has no bank register associated with it) and because C heavily uses the stack, the limited memory makes it challenging to write code for this processor.
  • A 28-KB hole.
  • 32 KB of on-chip ROM.
  • Interrupt vectors for all 19 interrupts as well as the reset vector are located at 0xFFD6-0xFFFF

Figure 2 shows the registers available in the M37735. The A and B accumulators are 8- or 16-bit wide selectable. All data operations (arithmetic, data transfer, input/output, and so on) use these registers. The preferred accumulator is A because it generates fewer op-codes, thus minimizing execution time.

The index registers X and Y also are selectable as either 8- or 16-bit wide registers. These registers are used for all indexing operations as selected instructions. The block transfer (MVN) is one such instruction, which uses these index registers. In the MVN instruction, the X register indicates the low-order 16 bits of the source data address. The third byte of the MVN instruction is the high-order 8 bits of the source address. Like the X register, the Y register is used by the MVN instruction for the destination data address. The second byte of the MVN instruction contains the high-order 8 bits of the destination data address.

The stack pointer register is a 16-bit register that is used to access the hardware stack during subroutine calls and interrupts. Since this register has no associated bank register, it must reside in the bank 0 address space.

The program bank (PG) and data bank (DT) registers provide the upper 8 bits of address space used to index the 15 banks of memory for code and data, respectively.

The processor status register (PS) provides the same type of information that most status registers provide as well as providing control bits to manage data and index register widths (either 8 or 16 bits).

Of the Mitsubishi's registers, the DPR register deserves special attention. In IAR's implementation of C, this register is used as the software frame pointer. Like the stack pointer register (S), the direct page register only addresses bank 0. An interesting attribute of the DPR register is its ability to generate data addresses with a reduced number of clock cycles if the low-order byte is on a page boundary (0016). Since all direct addressing modes of the processor use this register, it operation must be well understood.

The Mitsubishi M37735 has 19 different interrupts and programmability for 7 different interrupt levels. While there is an inherent priority in the order of interrupt detection, each interrupt source also allows you to program its desired interrupt priority.

One interrupt, the watchdog timer, is essentially a programmable NMI. This interrupt is intended to be a check for errant program operation -- a watchdog timer. The period of this interrupt is programmable as the system clock divided by either 32 or 512.

The timers supported by the Mitsubishi are grouped into two types -- five are A timers and three are B timers. The principle differences between the two types are that A timers can be used for one-shot pulse generation and pulse-width modulation. The B timers can be used to perform pulse-period measurements but they have no output pins associated with them. An interesting aspect of the B2 timer is that it can be internally connected (cascaded) to the B1 timer. This connection (combined with the provision that an external subclock can be selected to drive the B2 timer) allows for implementing a time of day.

Solving the memcpy() Problem

Since the compiler is capable of outputting code for a variety of memory models, IAR implemented the memcpy() function in C (so the various memory flavors could be simply compiled). While I have no idea what the compiler's actual C run-time library routine looks like, I expect (from looking at the assembly on an in-circuit emulator) that it was written with a generic for-loop structure like Listing One While this is an easy solution for the run-time library supplier, it results in a routine that is painfully slow -- requiring more than a millisecond to copy a mere 16 bytes of data. After investigating this problem further, I discovered that Mitsubishi had, in fact, implemented an intrinsic block move instruction (MVN) that would quickly move a block of data between locations. The catch to this instruction is that, to use it, the destination and source bank address must be encoded as part of the executable instructions. Since memcpy() is a run-time library routine, it also has to be reentrant, meaning that no static or global memory can be used.

To implement a more efficient version of memcpy, I determined that the solution would be to write a routine that dynamically places the op-codes necessary on the frame, then continues execution there; in other words, a self-modifying (or self-writing) routine. Implementing the code on the caller's frame would ensure that the routine was reentrant (required for a run-time library routine) and that the execution speed increase would be significant.

Hardware Stacks and Software Frames

When calling functions in C (or most high-level languages), the compiler manages the hardware stack and software frame transparently. Implementing a routine that uses this memory differently requires a good understanding of how these two structures are used. In most compilers, the stack and frame occupy the same memory space, with both the software frame and hardware stack intermixed as in Figure 3.

This implementation is conceptually easy to follow, allows for detection of stack overruns, and is the model most of us are accustomed to. When implementing the stack/frame handling for the Mitsubishi, IAR chose a separate frame/stack model because of the way that the DPR register is implemented. This register is used as the software frame pointer and implements all direct addressing. Since the value that this register contains is assumed to be an address offset in bank 0, it is much easier to operate in a forward-moving memory address space. The previous memory model, though easy to follow, requires that the address space always be moved from high (0xFFFF) to low (0x0000). By separating the frame and stack, the DPR register always moves towards increasing addresses. A major potential problem with this model is that the frame could overrun the hardware stack, resulting in total anarchy. Figure 4 shows how memory for the separate stack and frame model is implemented.

Using this separate stack/frame mode, as functions are called and local variables are created, space is allocated and reserved on the software frame -- based upon the previous frame address. These ancestor frame locations are known by definition as being located immediately following the calling function's return address. This allows entry code to correctly locate already allocated frame space.

With this usage information, implementing my new and improved memcpy() would simply require me to reserve the space needed for my executable code, write my program snippet into this memory, and continue execution. Once the program snippet completes, it should clean up after itself by restoring the caller's frame pointer to the DPR register, then returning.

Implementation of Solution

To use Mitsubishi's block move (MVN) instruction, the following basic register usage is needed:

  • Accumulator A: the number of bytes to move.
  • Index X register: the source memory offset (low 16 bits of address).
  • Index Y register: the destination memory offset (low 16 bits of address).

As previously discussed, the upper 8 bits of address for the destination and source addresses are encoded as the second and third bytes of the MVN instruction. In assembly, the encoding looks like MVN destination-address bank, source-address bank.

After writing the assembly necessary to implement this code fragment, I determined that 15 bytes of data would be required. This would account for the assembled move instruction as well as the necessary code to clean-up after the routine had completed execution. In keeping with the compiler's conventions, I reserved 15 bytes of space for my routine's software frame. To complete the encoding of the program snippet, the incoming parameters (source and destination address as well as number of bytes) are accessed in a variety of locations -- the Y and B registers as well as the caller's software frame. Listing Two is the actual implementation of the memcpy() routine.

Unlike the Intel family of processors, where the width of the data is determined by the op-code encoding, the Mitsubishi relies on a bit in status register for operating data width. Since I wanted to ensure that I could copy more than 256 bytes at once, I was careful to set both the data and index register widths to 16-bit mode. Throughout this code, you can see where the processor is switched back and forth between 8-bit and 16-bit mode, depending on the size of the data on which the program is operating. When I execute the return to start executing our newly constructed program continuation, I explicitly set the data width back to 16-bit mode.

Finally, after all the required executable code has been placed on the software frame, the Software Frame address is placed on the hardware stack and a return instruction executed. This makes execution start at the first byte of the frame -- which happens to be the MVN instruction, complete with the proper source and destination banks encoded.


Admittedly, this technique (and accompanying routine) is not for the faint of heart. But when addressing execution bottlenecks, extreme measures are sometimes necessary. In the end, we reduced execution time for a routine that previously took a prohibitively long time to a speedy [145+(number of bytes copied)/2×7] clock cycles.

Listing One

void *memcpy(void *s1, const void *s2, size_t n){
    char *su1;
    const char *su2;

    for (su1 = s1, su2=s2; 0<n; ++su1,++su2, --n)
        *su1 = *su2;
    return s1;

Back to Article

Listing Two

;; op-codes needed to build program snippet on the software frame 

#define  MVN_INST	0x54     ;; MOVE INSTRUCTION
#define  PLD_INST 	0x2B     ;; POP FRAME REGISTER
#define  PLY_INST	0x7A     ;; POP Y REGISTER
#define  RTL_INST	0x6B     ;; RETURN LONG
#define  PLA_INST	0x68      ;; POP A REGISTER
#define  PLB_INST	0x42      ;; POP B REGISTER

;	memcpy
;  	void memcpy(void *s1, const void *s2, size_t n)
;	This routine copies n characters from the object pointed to by s2 
;	into the object pointed to by s1.  If copying takes place between 
;	objects that overlap, the behavior is undefined.
;	NOTE: This routine is written to conform to IAR M37735 interface
;		requirements for the banked memory model.
;	s1 = destination pointer 
;	s2 = source pointer
;	n  = number of bytes to copy
;	Returns s1
;; On entry the following registers & frame locations are in use.
;;    Y - LSW of s1
;;    B - MSW of s1
;;  dp:2- LSW of s2
;;  dp:4- MSW of s2
;;  dp:0- Number of bytes to copy
;; this routine uses memory locations dp:6 - dp:14 for code

	.public	memcpy
	lda	a, 4, s	;; set new frame pointer up
	phd		;; but first save the old one
	tad		;; now, calculate the new one 
	clc		;; by adding the users request
	adc	a, #15	;; of 15 bytes.  Next save the new
	pha		;; frame location for future callers 
	pht		;; save the DT register
	phb		;; Save destination address so it can 
	phy		;; be returned to the caller.
	ldx	dp:2	;; X=source offset,Y=destination offset
	.data	8	;; 
	sem		;; switch to 8-bit data mode and
	ldm	#MVN_INST, dp:5	;; setup op-codes on the frame
	sta	b, dp:6	;; store destination bank
	lda	a, dp:4	;; get & store source bank
	sta	a, dp:7	;; 
	ldm	#PLA_INST, dp:8	;; restore LSW of s1 to the A register
	ldm	#PLB_INST, dp:9	;; restore MSW of s1 to the B register
	ldm	#PLA_INST, dp:10	;; (whose op-code is two bytes)
	ldm	#PLT_INST, dp:11	;; restore DT register
	ldm	#PLY_INST, dp:12	;; clean up stack
	ldm	#PLD_INST, dp:13	;; restore dptr register
	ldm	#RTL_INST, dp:14	;; then return to caller
	.data	16	;;
	clm		;; now, return to 16-bit mode and
	tda		;; get the address of the current frame
	clc		;; reposition it so it points to start
	adc	a, #5	;; of newly constructed code snippet
	.data	8	;; switch to 8-bit data mode for move
	sem		;; (moves take place 8 bits at a time)
	lda	b, #0	;; setup the h/w stack with the frame 
	phb		;; address (stacks are in bank 0)
	.data	16	;;
	clm		;;
	pha		;; to perform a RTL when done
	lda	a, dp:0	;; A = number of bytes to move
	clp	m,x	;; data & index both 16-bit mode.
	rtl		;; now we can jump to our snippet

Back to Article


Copyright © 1997, Dr. Dobb's Journal

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.