Brian Mack is a senior software engineer at Analog Devices. Peter Govoni is a staff software engineer at Analog Devices and manager/technical lead for Blackfin simulation tools. They can be contacted at Brian.Mack@analog.com and Peter.Govoni@analog.com, respectively.
Direct Memory Access is a critical requirement for meeting the data throughput goals in embedded digital signal processing (DSP) applications. It is already a prevalent technology, but recent innovations in the processing capabilities of Direct Memory Access (DMA) controllers give you more flexibility, power, and efficiency to increase system throughput and meet design goals.
As the DSP processing power in embedded systems continues to increase, the memory, bus architecture, and DMA controller must also evolve to fully utilize the increased processing power. If the DMA controller does not keep pace, it becomes a bottleneck, constraining the maximum throughput of the system. DMA controllers are responding to this need by functioning more independently, carrying out more of the tasks related to DMA transfer without assistance from the processor.
Previous generations of DMA controllers required large numbers of CPU cycles and complex programming to accomplish rudimentary moves of blocks of data. Modern DMA controllers are easing these requirements while providing more advanced functionality.
What Is DMA?
DMA was created to take the burden of transferring data between two areas of memoryor between a peripheral and memoryoff the CPU. The DMA controller is programmed to perform certain data moves, and then works independently, freeing up the CPU to process the data in parallel. Ideally, the processor only needs to set up the DMA flow and respond to interrupts when the data transfer has been completed.
DMA is the second master to the CPU in a system. It handles the time-consuming transfers of large amounts of data from one location to another, freeing the CPU from doing this task. Transfers are programmed by configuring source/destination channels to reference certain peripherals or memory HASH(0x80bca8). Important components of a DMA channel are: the location of data, word size, and number of words being transferred. DMA controllers consist of enough channels to provide access to all of the peripherals and memories supported by the system. A pair of channels (one to read from the source and one to write to the destination) moves data between memory HASH(0x80bca8). One channel would be programmed in the Read direction, and the other in the Write direction. This pair shares a FIFO to buffer the data between them. These channels may be divided among multiple controllers. DMA controllers typically use interrupts to inform the processor that a transfer has completed.
The use of DMA controllerseven multiple DMA controllersin system-on-chip (SOC) designs with dedicated DMA buses is creating new requirements for DMA controllers. The DMA and CPU are the only masters in most systems, so the first improvement made was with arbitration. In older system designs, the DSP and DMA competed for a single bus. This created a bottleneck at the arbiter for the system bus; see Figure 1. More recent SOC designs provide DMA controller(s) with dedicated buses, creating the potential for higher throughput when using newer, more autonomous DMA controllers; see Figure 2.
Acquisition of data from the outside world is required by most embedded applications using DSPs. Additional arithmetic logic units (ALUs) or customizable instructions can be added to efficiently handle complex equations, but any improvement would be minimized if the memory transfer were not optimized.
To write the most efficient code and optimize your applications, you should look at the characteristics of the processor. Improved characteristics of DMA engines on DSP microprocessors give another avenue for improving data throughput from the outside world. The CPU and DMA controller work independently, but they must also correctly work in parallel. Having more options for system efficiency on the DMA controller makes that task much easier.
With multiple masters in a system, contention (or fighting) for resources between the masters is likely to occur. Contention in a system brings stalls, timing issues, thrashing, and starvation of devices that have a lower priority with regard to getting attention from the CPU. To resolve issues between these two masters, you would spend time tweaking the code until things worked right. Older DMA was limited with options to help with the tweaking, but newer, smarter DMA engines have additional resources for working out system problems. Older DMA controllers are part of the problem, so it makes sense to add circuitry to help out the controller, as their sole existence was to work as independently as possible from the CPU.
Common issues when working with previous-generation DMA controllers include:
- I/O writes from the CPU to the controller's memory-mapped registers (MMRs) can be time consuming.
- Inherent stalling of switching direction from reads to writes and vice versa between multiple channels.
- Problems with starving lower priority channels.
- DMA is a peripheral in a system that burdens the CPU because of its lack of intelligence.
Solutions for Current Problems
For a DMA controller to act with more intelligence, consider it to be a simple processor with a simple instruction set. It already has the capability to interface with memory, so it can get its own instructions from memory rather than making the CPU do all the work. DMA channels have a finite number of registers that need to be filled with values, each of which gives a description of how to transfer the data. The DMA can use its read memory circuitry to fetch the register values rather than burdening the CPU to write the values. These blocks of memory are called "descriptors." The first address written to the DMA controller points to the section of memory at which the descriptor starts. The DMA controller then reads each memory location consecutively and fills its own registers. One parameter read from memory is a pointer to a set up for the next transfer. With multiple consecutive transfers, linking of multiple descriptors lets the DMA controller run independently from the CPU. Creating descriptors of different sizes lets users do away with overwriting the same values, further streamlining the process. A descriptor with a size of only one transfer is possible when only the location of the data is changed.
Once a section of memory has been programmed with the DMA descriptors, the DMA process can start, and then run independently of the CPU for the whole application. This scheme (Figure 3) can be found on the ADSP-BF5xx Blackfin processor (http://www.analog.com/ processors/processors/blackfin/) from Analog Devices (the company we work for).
Following the values of the Flow parameter in Figure 3 shows that the Blackfin processor has three descriptor sizes, which refer to:
- Whether linked descriptors are required (Array).
- Descriptors that cross memory page boundaries (Large).
- Descriptors that don't have to expand across memory pages (Small).
The Size parameter refers to the number of registers to read into the DMA's MMR space. Note that at the end of the descriptor chain, only one register was transferred (lower 16 bits of the memory address) because this was the only value needed to finish the job. This means that the previous values were the same and did not need to be reread from memory. Streamlined descriptors can greatly improve the performance of the system.
Improvements in Arbitration
Arbitration is the decision-making process that the DMA uses to decide which DMA channel is next in line to transfer its data. These are executed in sequential priority order, with the lowest channel number having the highest priority. If the average bandwidth of each peripheral is a small fraction of the total bandwidth, then all requests should be granted as required. Otherwise, higher bandwidth peripherals should be given higher priority. Prioritization allows the higher priority channel to cut in front of other channels having a lower priority. Congestion could periodically occur in a system for a number of reasons:
- Multiple high-priority channels requesting data at the same time.
- Memory is stalled because of CPU intervention.
- Cache line fills.
Congestion could also be a problem if a higher priority device caused a lower priority channel to starve for attention, because lower priority channels would be locked out from delivering their data within the required interval. Curing this situation requires tedious code changes, or a hardware design that guarantees that the problem will not occur. DMA with improved flexibility solves this problem by checking the status of the DMA channel FIFO buffers to see if they are full on a write, or empty on a read. A channel found in this state would be given first priority for the next cycle only. The priority order would then resume until the next FIFO event occurred. This is an inherent property of the DMA controller, and does not require intervention by the user.
Traffic, whether it involves automobiles, people, or data, inherently runs smoother if you minimize the direction change. If, at a busy intersection, a police officer directing traffic allowed only one car to pass in each direction at a time, delays and frustration due to the constant stalls would quickly build. A flexible DMA system includes a register that lets traffic flow in one direction until a certain count is reached, and then switches the traffic flow to the other direction. If no traffic in the new direction was available, traffic could continue to flow in the previous direction. This is not a difficult algorithm, but it could save a lot of time when you realize that the requested data is not yet available. Instead of trying to optimize the code, the register values could be changed to increase the throughput. This could then cause the data to arrive too soon, interrupting the system during a time slot that may cause a problem. The register value could then be decreased until the DMA and CPU worked correctly in parallel. There are many factors that dictate the best values for the traffic registers, including the time between channels, data input frequency, number of channels, and direction of channels. All of these factors make up how the CPU and DMA controller work together in harmony. An enhanced DMA controller helps a designer to get to that point. Figure 4 depicts a peripheral bus.
Figures 5 and 6 show the level of improvement that can be obtained in a system performing two DMA transfers in opposite directions. Higher system throughput occurs when fewer events are logged. Figure 5 shows the number of events logged by an ADSP-BF561 simulator with bus prioritization turned off, while Figure 6 shows the number of events logged with bus prioritization turned on. The y-axis gives the number of events generated in each situation. The overall improvement when using prioritization can easily be seen. Bus prioritization was enabled on both memory and peripheral sides of the controller. FIFO events increase when the direction of transfers is forced for a time period, but throughput can be improved through proper switching of lower priority channels during such an event.
Data transferred from the outside world to an embedded system is of a multidimensional nature. When data is transferred from the outside world through a peripheral device, it must go through formatting before it can be used in a filtering and digitizing algorithm. This would require bringing in the data and then separating the different dimensions for multimedia algorithms. Using the previous-generation DMA controller with a single-dimension access and a unilateral stride would require several movements of the data to get it correctly formatted. Having a two-dimensional plot of the data and stride parameter saves many cycles in the overall system when acquiring this type of data. Listings 1 and 2 show how a smart DMA controller can improve this situation. Listing 1(a) produces the address offsets in 1(b) from the base address. Listing 2(a) receives a video data stream of bytes and produces the address offsets in 2(b) from the base address.
Putting all these techniques together shows how applications can benefit from programming the ADSP-BF5xx Blackfin DSP with object-oriented design methods to simplify and model a DMA transfer. The DMA concept has always been very simple, and should remain simple even with the newly added features, letting users take data and move it from point A to point B, with any complex enhancements handled autonomously and privately from the user interface.
Refer to the source code (available at http://www.cuj.com/code/) and Figure 7. The user interface is class CDMATransfer, which retains the simplicity of a DMA object by using encapsulation to hide the details of the DMA channels, first creating a CDMATransfer object, then performing the transfer. The DMA object supports a descriptor- or register-based transfer. CDescriptorDmaTransfer and CRegisterDmaTransfer are created with the design patterns from . To complete a register-based transfer, pass the size, count, iFromBuf, and iToBuf. To complete a descriptor-based transfer, pass the pointers to the Descriptor buffers. The descriptors located in memory hold all of the information for configuration and buffer location for each channel. Executing PerformTransfer() completes the transfer for both types.
The class DMAChannel supports all functionality for a single DMA channel on Blackfin processors. This class could support a DMA stream from any product. A stream could be a memory-to-memory transfer or peripheral-to-memory transfer. This object sets up a memory DMA stream, which consists of two channels (read and write) to control the source and destination transfers. An MDMA stream shares a four-deep FIFO to buffer the data between reads and writes. The CDmaTransfer class holds all of the information required to set up each channel to complete the transfer with the PerformTransfer() method. The chosen namespaces support different configurations at the DMAChannel level, and help describe the set up of the Blackfin processor in simple terms.
Listing 3 shows a DMA transfer in action. Here, a destination buffer is fed values from two different buffers. After the initial descriptor is read on both channels, the destination channel is no longer in descriptor mode, so it automatically reloads previous values on completion. The source channel stays in descriptor mode, alternately switching between the two descriptors that give the other location of the buffer. So, only two values need to be read from the descriptorthe lower half of the descriptor address and the lower half of the buffer address. This gives an effect of a buffer rapidly changing from one buffer to another. In the example, the two buffers include one buffer with all "*" and one buffer with all spaces. Viewing memory while the transfer is taking place will cause a turn on/turn off effect of the buffer changing from one character to another.
DMA controllers have evolved into intelligent devices that control efficient, prioritized movement and organization of data through a system with minimal burden on the CPU. DMA architecture should be an important consideration when choosing an embedded processor for applications where data movement is of critical importance in meeting system performance goals.