Real-Time Debug Response Requirements
Coordinating debug breakpoints and events between the processor cores of a heterogeneous processor is critical. The decision to break and halt execution on the other cores needs to happen within a few cycles of the break on the core that encountered the breakpoint. For this reason, many embedded designs employ a debug synchronization unit. The ability to stop and start all processor cores synchronously is extremely valuable for multi-core systems that rely on inter-process communication or shared memory. To keep this synchronization within a few cycles, a cross-trigger mechanism should be implemented on the chipset itself.
The configuration registers of the debug synchronization unit let the developer select the required cross-triggering behavior, that is, which cores are halted when a given core hits a breakpoint. If, on the other hand, the processor cores run widely separated and non-interfering tasks, it may be sufficient to synchronize stops and starts through the debug tools. The debugger can then handle breakpoint synchronization itself, but with somewhat less fidelity.
Inevitably, synchronizing through the debugger leads to hundreds of cycles of what is termed skid between the halting of the individual processor cores. The synchronous starting of processor cores can be achieved either with a cross-triggering mechanism or via the test access port (TAP) controller of each core.
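The cross-triggering behavior described above can be pictured with a small software model. The sketch below is only an illustration: the class name, the one-mask-per-core register layout, and the method names are all assumptions made for this example, and a real debug synchronization unit is programmed through vendor-specific registers.

```python
"""Toy model of a debug synchronization unit's cross-trigger routing.

Everything here (class name, register layout) is hypothetical.
"""


class CrossTriggerUnit:
    def __init__(self, num_cores):
        self.num_cores = num_cores
        # One hypothetical config register per source core. Bit N set in
        # register S means "halt core N when core S hits a breakpoint".
        self.halt_mask = [0] * num_cores
        self.halted = [False] * num_cores

    def route(self, source, targets):
        """Program which cores halt along with `source`."""
        mask = 0
        for core in targets:
            if not 0 <= core < self.num_cores:
                raise ValueError(f"no such core: {core}")
            mask |= 1 << core
        self.halt_mask[source] = mask

    def breakpoint_hit(self, source):
        """Halt the source core plus every core selected in its mask."""
        self.halted[source] = True
        for core in range(self.num_cores):
            if self.halt_mask[source] & (1 << core):
                self.halted[core] = True
        return [c for c, h in enumerate(self.halted) if h]

    def resume_all(self):
        """Model a synchronous restart of every halted core."""
        self.halted = [False] * self.num_cores


ctu = CrossTriggerUnit(num_cores=4)
ctu.route(source=0, targets=[1, 2])   # cores 1 and 2 share memory with core 0
print(ctu.breakpoint_hit(0))          # cores 0, 1 and 2 halt; core 3 keeps running
```

In this model core 3 is left running because it was not selected in core 0's mask, which corresponds to the case of a core with a separate, non-interfering task.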
AMP versus SMP Support in an Embedded OS
Let us now examine these multi-core environments from one level higher in the software stack. In many embedded multi-core designs it may be desirable to assign a few processor cores to a dedicated real-time task. In this case, a careful comparison should be made between asymmetric multiprocessing, bound multiprocessing, and symmetric multiprocessing.
In symmetric multiprocessing (SMP), a single copy of the main operating system executes on all processor cores. Once the OS is running, thread and workload distribution are almost completely handled by the OS. For effective debugging, it is important to know the unique thread identifier and which core is executing a particular thread.
An OS that supports SMP has insight into activities occurring on the system and allocates resources on the multi-core processors with little or no input from the embedded developer. The native threading layer of an OS will probably provide interfaces that enable safe data sharing between cores and threads. Since an SMP OS has this oversight over all activities on the system, it can dynamically allocate resources to specific applications rather than to processor cores, thereby enabling greater utilization of available processing power. It also lets system tracing tools gather operating statistics and application interactions for the multiprocessing system as a whole, providing valuable insight into how to optimize and debug applications.
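As a rough illustration of relating thread identifiers to cores under an SMP OS, the following sketch asks a Linux system which core each thread last ran on. It assumes Linux and relies on field 39 ("processor") of /proc/self/task/<tid>/stat as described in the proc(5) manual page; the helper names are made up for this example, and an embedded RTOS would expose this information through its own debug or trace interfaces.

```python
import threading


def last_cpu_of_thread(native_tid):
    """Return the CPU a thread last ran on, or None if it cannot be read.

    Linux-only sketch: reads field 39 ("processor") of
    /proc/self/task/<tid>/stat.
    """
    path = f"/proc/self/task/{native_tid}/stat"
    try:
        with open(path) as f:
            stat = f.read()
    except OSError:
        return None  # not Linux, or the thread has already exited
    # The command name (field 2) may contain spaces, so split after the
    # closing parenthesis; the remaining list starts at field 3.
    fields = stat.rsplit(")", 1)[1].split()
    return int(fields[39 - 3])


def report(results):
    """Record this thread's OS-level id and the core it last ran on."""
    tid = threading.get_native_id()
    results[threading.current_thread().name] = (tid, last_cpu_of_thread(tid))


results = {}
workers = [threading.Thread(target=report, args=(results,), name=f"worker-{i}")
           for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

for name, (tid, cpu) in sorted(results.items()):
    print(f"{name}: thread id {tid}, last ran on core {cpu}")
```

Because the SMP scheduler is free to migrate threads, the reported core is only a snapshot and can differ from run to run.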
In bound multiprocessing (BMP), a single OS manages all of the processor cores, but during application initialization a setting determined by the system designer forces all of an application's threads to execute only on a specified processor core. This effectively isolates a workload and can eliminate the cache thrashing that can reduce performance in an SMP system, because applications that share the same data set execute exclusively on the same processor core. BMP offers a simpler application debugging environment than SMP since all execution threads within an application run on a single processor core. It also helps legacy applications that use poor techniques for synchronizing shared data to execute correctly, again by letting them run on a single processor core. BMP can be very useful if you have one or two high-priority applications that need to be isolated, either for legacy reasons or for prioritization reasons. BMP support is very OS-specific. Examples of real-time operating systems that offer support include VxWorks and QNX. One drawback of BMP is that it does not permit the use of idle resources on an unused processor core, thus artificially restricting the performance gain available through parallelism.
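The binding step can be approximated on Linux with CPU affinity, which is used here only as an analogy and is not the VxWorks or QNX interface. The sketch below assumes a platform where Python exposes os.sched_setaffinity; the helper name bind_to_core and the choice of the lowest allowed core are assumptions for illustration. On Linux, calling this during initialization, before worker threads are created, lets those threads inherit the restricted mask.

```python
import os


def bind_to_core(core):
    """Restrict the calling process to one core; return the previous set.

    Uses os.sched_setaffinity, which Python exposes only on platforms
    that provide it (Linux among them). Returns None elsewhere.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    previous = os.sched_getaffinity(0)
    if core not in previous:
        raise ValueError(f"core {core} is not available to this process")
    os.sched_setaffinity(0, {core})
    return previous


if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)
    chosen = min(allowed)            # hypothetical choice: lowest allowed core
    saved = bind_to_core(chosen)
    print("bound to", os.sched_getaffinity(0))
    os.sched_setaffinity(0, saved)   # restore so the sketch has no side effects
```

The restore step at the end also shows the drawback mentioned above: while the mask is narrowed, the other cores cannot be used by this application even if they are idle.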
Asymmetric multiprocessing (AMP) is the software equivalent of heterogeneous multi-core platform development. The different processors each execute their own dedicated OS. Data sharing is limited to shared memory and defined messaging APIs. In the simplest scenario the dedicated OS layers are carbon copies of each other. More commonly, this approach is used for heterogeneous hardware designs. The main reason for choosing this approach may be the need to use special purpose operating systems on a special purpose chip. For example, a handheld device or a SoC implementation targeting in-vehicle infotainment (IVI) may comprise a digital signal processor (DSP) or GPS chip that executes its own RTOS or firmware code. An RTOS like Nucleus, VxWorks, or μITRON may be executing on a general purpose processor handling real-time background tasks like phone-call switching and telephone tower registration. This RTOS would also control the messaging API between the general purpose processor and the DSP. Lastly, a full Linux-based OS like Android or MeeGo may be executing the application user interface for the end user.
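To make the idea of a messaging API layered over shared memory more concrete, here is a minimal single-producer, single-consumer mailbox. It is a sketch under stated assumptions, not the API of any particular RTOS: the slot size, the ring layout, and the names Mailbox, send, and receive are invented for this example, and a bytearray stands in for memory that both cores can map. Real hardware would additionally need memory barriers and cache maintenance around the index updates.

```python
import struct

SLOT_SIZE = 16                  # bytes per message slot (hypothetical)
NUM_SLOTS = 4                   # ring capacity is NUM_SLOTS - 1 messages
HEADER = struct.Struct("<II")   # head index, tail index


class Mailbox:
    """Single-producer, single-consumer ring over a shared byte region."""

    def __init__(self, region):
        self.region = region    # stands in for memory both cores can map

    def _indices(self):
        return HEADER.unpack_from(self.region, 0)

    def send(self, payload):
        """Producer side: returns False when the ring is full."""
        if len(payload) > SLOT_SIZE - 1:
            raise ValueError("payload too large for one slot")
        head, tail = self._indices()
        next_head = (head + 1) % NUM_SLOTS
        if next_head == tail:
            return False
        offset = HEADER.size + head * SLOT_SIZE
        self.region[offset] = len(payload)
        self.region[offset + 1:offset + 1 + len(payload)] = payload
        # Publish the slot only after its contents are written.
        struct.pack_into("<I", self.region, 0, next_head)
        return True

    def receive(self):
        """Consumer side: returns None when the ring is empty."""
        head, tail = self._indices()
        if head == tail:
            return None
        offset = HEADER.size + tail * SLOT_SIZE
        length = self.region[offset]
        payload = bytes(self.region[offset + 1:offset + 1 + length])
        struct.pack_into("<I", self.region, 4, (tail + 1) % NUM_SLOTS)
        return payload


shared = bytearray(HEADER.size + NUM_SLOTS * SLOT_SIZE)
rtos_side = Mailbox(shared)     # e.g. the RTOS on the general purpose core
dsp_side = Mailbox(shared)      # e.g. firmware on the DSP

rtos_side.send(b"start-decode")
print(dsp_side.receive())
```

Only the producer writes the head index and only the consumer writes the tail index, which is what lets two independent operating systems share the ring without a common lock.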
To successfully debug application or device driver code, debug access to multiple processor cores may be required. One means of achieving this is to use two separate debuggers, each debugging the code that executes on a different processor core. If the code being debugged resides in shared memory, a problem can occur when one debugger sets breakpoints that a core controlled by the other debugger then encounters. What if the breakpoint instruction from one core is alien enough to the other core that it triggers an invalid instruction exception and needlessly crashes the execution of the entire application?
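The shared-memory breakpoint hazard can be shown with a toy model. The opcodes, architecture names, and the SharedCode class below are all hypothetical and chosen only to illustrate the failure mode: a software breakpoint patched into shared code for one core type is just an unknown byte pattern to a core of another type.

```python
# Hypothetical breakpoint opcodes for two made-up instruction sets.
BREAK_OPCODE = {"cpu": 0xCC, "dsp": 0x7F}


class SharedCode:
    """Code region in shared memory that more than one core may execute."""

    def __init__(self, image):
        self.memory = bytearray(image)
        self.saved = {}     # address -> (owner architecture, original byte)

    def set_breakpoint(self, arch, address):
        """Patch in the owner's break opcode, remembering who planted it."""
        if address in self.saved:
            raise RuntimeError(f"address {address:#x} already patched")
        self.saved[address] = (arch, self.memory[address])
        self.memory[address] = BREAK_OPCODE[arch]

    def clear_breakpoint(self, address):
        """Restore the original byte at a patched address."""
        _, original = self.saved.pop(address)
        self.memory[address] = original

    def fetch(self, arch, address):
        """Classify what a core of type `arch` sees when it executes `address`."""
        if address not in self.saved:
            return "normal instruction"
        owner, _ = self.saved[address]
        if owner == arch:
            return "breakpoint: halt and notify own debugger"
        return f"foreign opcode planted for {owner}: possible invalid instruction"


code = SharedCode(bytes(8))
code.set_breakpoint("cpu", 4)    # debugger A patches shared code for the CPU
print(code.fetch("cpu", 4))
print(code.fetch("dsp", 4))      # the DSP runs into a patch it does not understand
```

Recording the owner of each patch, as the saved table does here, is one way a pair of cooperating debuggers could at least detect the conflict before it turns into a crash.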
For the system-level debugger, an additional complication is that an AMP software stack design probably has a custom messaging API between the various processors. The application developer probably has no choice but to rely on the API behaving as intended. There will, however, be a system-level developer who first has to implement the messaging API. With highly customized embedded designs it may not be possible to adopt an existing API from elsewhere. The developer of this API may need to use multiple debuggers simultaneously, one for each architecture involved.
An improvement over employing multiple debuggers is a heterogeneous multi-core debugger that can monitor multiple processor cores simultaneously, backed by a hardware cross-trigger mechanism that synchronizes breakpoints and debugger execution control between the cores. At the application layer, such a debugger would export information from the API that can be used for messaging, bus monitoring, and debug. At the system level, the debugger would in effect be a bus signal probing tool that allows for signal timing optimization and handover correctness checks.
Debugging the boot sequence of an AMP software stack in a heterogeneous multi-core system is always going to be challenging, although luckily this is a task usually handled by the silicon vendor. Boot performance is frequently not quite as critical (in-car applications and emergency applications are exceptions), allowing for more freedom in signal timing.
Picking the right combination of application-level OS, specialized RTOS, and microengine or DSP API is critical for efficient debug. Developers are strongly encouraged to evaluate the available debug solutions as part of the decision process when choosing a chipset combination for the embedded design.
This article is based on material found in the book Break Away with Intel Atom Processors: A Guide to Architecture Migration by Lori Matassa and Max Domeika.