Cristian Dumitrescu is a Software Engineer with the Intel Architecture Group.
Packet-processing systems reside within a network node and handle the traffic transiting that node. In this two-part article, I examine some of the typical problems engineers face when debugging packet-processing systems. The techniques I describe represent a collection of built-in mechanisms that should be provisioned as early as the design phase to assist system debugging at runtime, with the purpose of detecting system errors during the first stages of development and testing. In Part 1 of this article, we focused on issues related to debugging functional problems. In Part 2, we turn to debugging stability and performance problems.
Stability problems are the most difficult to debug, as they cannot be triggered with basic low-rate functional tests. Exposing them requires stress tests that come close to the real traffic conditions the device has to handle in real-life scenarios.
These problems usually require high input traffic rates and a long time (minutes, hours, or even days) before they are triggered, so by that time many millions of packets have already transited the system without error, making it impossible to analyze system operation on a packet-by-packet basis. Sometimes they cannot even be reproduced with the equipment in the lab, forcing you to debug the system while it is live in the field. These problems usually hide a flaw that is not immediately fatal, meaning that the system runs correctly for some time before the accumulated impact cripples it.
Traditional debugging techniques do not work well for these problems. Running one of the cores under debugger control may be defeated by the fact that, while that core is stopped at a breakpoint, the rest of the system continues to process the packets that keep coming in, and thus keeps modifying the system state. This renders the state of the system no longer relevant for inspection: the system can no longer keep pace with the input stream, so the packet queues overflow, ultimately corrupting the system state.
As these problems are difficult to reproduce on low data rates, debugging under heavy traffic loads becomes a necessity.
Possible Root Causes
Change in the typical sequence of events. Corner cases that are not triggered at low rates may now be triggered. The sequence of events and actions that take place while processing the packets differs from the typical sequence, and race conditions you were not aware of are now exposed.
Flawed state machines. The implementation of the state machines so commonly used for packet processing may be flawed. Some cases are not robustly handled by the state machines, which then reach states other than those expected. In the worst case, the state machines may need a massive cleanup or even a complete redesign to remove unnecessary complexity.
Optimize your state machines for simplicity and readability, not for (often questionable) performance gain.
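As a sketch of this advice, consider a hypothetical session state machine in C (all state and event names are illustrative, not taken from any particular protocol): every (state, event) pair that is not explicitly expected falls through to a single defensive default that resets to a known safe state, instead of being silently ignored or leaving the session in limbo.

```c
#include <assert.h>

/* Hypothetical protocol session state machine.
 * The point is the defensive fall-through at the bottom: any
 * (state, event) pair not explicitly handled resets the session
 * to a known safe state instead of being silently dropped. */
enum state { ST_IDLE, ST_WAIT_ACK, ST_ESTABLISHED };
enum event { EV_SEND_REQ, EV_ACK, EV_CLOSE };

static enum state next_state(enum state s, enum event e) {
    switch (s) {
    case ST_IDLE:
        if (e == EV_SEND_REQ) return ST_WAIT_ACK;
        break;
    case ST_WAIT_ACK:
        if (e == EV_ACK)   return ST_ESTABLISHED;
        if (e == EV_CLOSE) return ST_IDLE;
        break;
    case ST_ESTABLISHED:
        if (e == EV_CLOSE) return ST_IDLE;
        break;
    }
    /* Unexpected (state, event) pair: robustly fall back to a
     * known safe state rather than reaching an undefined one. */
    return ST_IDLE;
}
```

Keeping the transition table this small and explicit makes it easy to review every state/event combination by hand, which is exactly the readability this tip argues for.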
Deadlock. Synchronization between the producers and the consumers of the same queue may be flawed, making the producers think the queue is full or the consumers think that the queue is empty. Other possible root causes may be the incorrect usage of semaphores or other synchronization primitives, or waiting for an event that never takes place.
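A minimal sketch of the producer/consumer agreement that prevents this class of deadlock, assuming a hypothetical single-producer/single-consumer ring queue (on real multicore hardware, memory barriers would also be required): both sides must use the exact same convention for "full" and "empty," here the classic one-slot-left-unused scheme.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-producer/single-consumer ring queue.
 * Producer and consumer MUST agree on one convention for full/empty;
 * here one slot is always left unused, so:
 *   empty: head == tail
 *   full : (head + 1) % QSIZE == tail
 * A producer using a different "full" test than the consumer's
 * "empty" test is a classic source of the deadlocks described above.
 * NOTE: real multicore code needs memory barriers around head/tail. */
#define QSIZE 8

struct spsc_queue {
    void *slot[QSIZE];
    volatile size_t head;   /* written only by the producer */
    volatile size_t tail;   /* written only by the consumer */
};

static int queue_is_empty(const struct spsc_queue *q) {
    return q->head == q->tail;
}

static int queue_is_full(const struct spsc_queue *q) {
    return (q->head + 1) % QSIZE == q->tail;
}

static int queue_push(struct spsc_queue *q, void *p) {
    if (queue_is_full(q))
        return -1;                    /* caller must handle the failure */
    q->slot[q->head] = p;
    q->head = (q->head + 1) % QSIZE;
    return 0;
}

static int queue_pop(struct spsc_queue *q, void **p) {
    if (queue_is_empty(q))
        return -1;
    *p = q->slot[q->tail];
    q->tail = (q->tail + 1) % QSIZE;
    return 0;
}
```

The effective capacity is QSIZE - 1; trading one slot for an unambiguous full/empty test removes the need for a shared element counter that both sides would have to update.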
Resource exhaustion. A critical resource is exhausted and never replenished as a result of incorrect usage, leading to incorrect operation. One of the most common scenarios is the permanent exhaustion of the buffer pools as a result of buffer leakage: buffers are allocated from the pool, but not all of them are correctly released back to it. These buffers are practically lost, as the software has simply "forgotten" about them and no longer makes use of them.
As a result, the buffer pool shrinks over time: fewer and fewer buffers are available in the pool at any given time, which leads to gradual performance degradation, as more and more input packets cannot be accommodated by the system and have to be dropped. The buffer pool eventually becomes empty and, as the pool is never replenished with buffers, no more packets ever get out of the system.
If the code branches that leak buffers are hit frequently, the problem is triggered relatively quickly. What usually happens, though, is that the leaky code branches are hit infrequently, as they typically handle rare error cases, so it takes a significant amount of time to reach buffer pool exhaustion.
For example, consider a system with a leaky Address Resolution Protocol (ARP) table aging process that runs once every 120 seconds, leaking one buffer on every run. If the buffer pool initially contains 1K buffers, the system has to run non-stop for about 34 hours (1,024 buffers × 120 seconds) until the buffer pool is exhausted and the output cut-off takes place.
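The leak pattern described above can be sketched as follows, using a hypothetical buffer pool API (buf_alloc, buf_free, and the pool layout are illustrative names, not a real library): the error path of the first function returns without releasing the buffer, while the fixed version accounts for the buffer on every exit path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size buffer pool; all names are illustrative. */
#define POOL_SIZE 1024

static char   storage[POOL_SIZE][64];
static void  *pool[POOL_SIZE];
static size_t pool_free;               /* number of free buffers */

static void pool_init(void) {
    for (size_t i = 0; i < POOL_SIZE; i++)
        pool[i] = storage[i];
    pool_free = POOL_SIZE;
}

static void *buf_alloc(void) {
    return pool_free ? pool[--pool_free] : NULL;
}

static void buf_free(void *b) {
    pool[pool_free++] = b;
}

/* Leak-prone pattern: an infrequent error path returns without
 * releasing the buffer, so the pool shrinks by one on every hit. */
static int process_packet_leaky(void *buf, int hdr_ok) {
    if (!hdr_ok)
        return -1;                     /* BUG: buf is never freed */
    buf_free(buf);                     /* normal path releases the buffer */
    return 0;
}

/* Fixed version: every exit path accounts for the buffer. */
static int process_packet(void *buf, int hdr_ok) {
    int status = hdr_ok ? 0 : -1;
    buf_free(buf);                     /* released on success AND on error */
    return status;
}
```

A code review rule that pays off here: for every function that takes ownership of a buffer, walk each return path and name who frees it.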
Because people tend to wait for the later stages of the project before applying stress tests like running the system under heavy traffic non-stop for a few days in a row, this problem is usually discovered late into the project.
Start the stability testing from day 1 of the project!
A related problem is an under-dimensioned buffer pool, which causes periodic packet drops as the pool reaches exhaustion. As opposed to the previous problem, the pool does get correctly replenished with previously allocated buffers and the output traffic does eventually resume, so this is a performance-tuning problem rather than a stability one.
Incorrect assumptions. The software does not consider the case when a message sent to a queue is dropped because the queue is full, assuming instead that delivery of the message is guaranteed. Consequently, if the software relies on the consumer actions associated with handling that message being performed in order to continue working correctly, it operates on false assumptions. The same applies if the software relies on receiving a response to this message: the response never comes, as the request message never made it to the other side.
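A sketch of the defensive pattern, assuming a hypothetical bounded message queue whose enqueue can fail (msg_enqueue, send_request, and the release callback are illustrative names): the sender checks the return code, reclaims the buffer when delivery fails, and reports the failure so the caller does not wait for a response that can never arrive.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bounded message queue; enqueue fails when full. */
#define MSGQ_SIZE 4

static void  *msgq[MSGQ_SIZE];
static size_t msgq_count;

static int msg_enqueue(void *m) {
    if (msgq_count == MSGQ_SIZE)
        return -1;                  /* queue full: message NOT delivered */
    msgq[msgq_count++] = m;
    return 0;
}

/* Test hook: counts how many buffers were reclaimed. */
static int  release_count;
static void release_buf(void *b) { (void)b; release_count++; }

/* Wrong: assuming guaranteed delivery,
 *     msg_enqueue(request);        // return value ignored
 * leaks the buffer on a full queue and then waits forever
 * for a response that will never come.
 *
 * Right: treat a failed enqueue as a dropped message. */
static int send_request(void *request, void (*release)(void *)) {
    if (msg_enqueue(request) != 0) {
        release(request);           /* reclaim the buffer immediately */
        return -1;                  /* caller must not expect a response */
    }
    return 0;
}
```

Pairing the failed enqueue with an immediate buffer release also closes off the leak scenario from the resource-exhaustion section above.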
Memory corruption. Some local variables might be left uninitialized before their value is read and used by the application. The software might not consider that reading from a message queue can fail because the queue is empty (queue underflow), or that writing to a message queue can fail because the queue is full (queue overflow). The software might incorrectly attempt to read/write more data from/to a buffer than the buffer size allows. As a result, inconsistent data is read or incorrect memory addresses are written, leading to the data-structure corruption problems that are so hard to debug.
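The buffer overrun case can be closed off with a simple bounds check at the single point where data enters the buffer, as in this sketch (the struct layout and function name are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical packet buffer with a fixed-size data area. Copying
 * without checking 'len' against the data area is exactly the
 * overrun that corrupts the adjacent data structures. */
#define BUF_DATA_SIZE 64

struct pkt_buf {
    size_t        len;                 /* valid bytes in data[] */
    unsigned char data[BUF_DATA_SIZE];
};

/* Returns 0 on success, -1 if the payload does not fit. */
static int buf_write(struct pkt_buf *b, const void *src, size_t len) {
    if (len > BUF_DATA_SIZE)
        return -1;                     /* refuse instead of overrunning */
    memcpy(b->data, src, len);
    b->len = len;
    return 0;
}
```

Funneling all writes through one checked accessor turns a silent corruption into an explicit error code that the statistics counters below can track.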
Incomplete handling of the returned error codes. The software might not be handling all possible error codes returned by the called functions, either by assuming success or by ignoring some of the error codes.
Make sure that all the possible values for the return codes are handled.
Static Code Analysis. Make sure that all the local variables are initialized and all the return codes are handled.
Run-time Monitor/Logger. Do not let the system errors go unnoticed!
The first step is to instrument the code with statistics counters tracking the resource usage and the various error conditions that can take place and make them available through the CLI. Examples of relevant counters include:
- Number of free buffers for each buffer pool
- Queue occupancy for each queue
- Current value of each semaphore in the system
- Number of free entries within the various tables maintained by the application (e.g., routing table, ARP table)
- Number of DMA errors and retries
- Bus occupancy (can be calculated as the number of transactions since the last monitor invocation multiplied by the length of each transaction)
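The counters above can be grouped into a statistics block that the fast path updates and a CLI handler prints on demand, as in this sketch (all field names are illustrative; the bus-occupancy helper implements the calculation from the last bullet):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-system statistics block; field names are
 * illustrative. Counters are incremented on the fast path and
 * read (not reset) by the CLI handler. */
struct sys_stats {
    uint64_t pool_free_bufs;      /* free buffers in each buffer pool */
    uint64_t queue_occupancy;     /* entries currently in the queue */
    uint64_t arp_free_entries;    /* free ARP table entries */
    uint64_t dma_errors;
    uint64_t dma_retries;
    uint64_t bus_transactions;    /* since the last monitor invocation */
};

static struct sys_stats stats;

/* Bus occupancy in bytes since the last monitor invocation:
 * transactions multiplied by the length of each transaction. */
static uint64_t bus_occupancy_bytes(uint64_t transactions,
                                    uint64_t bytes_per_transaction) {
    return transactions * bytes_per_transaction;
}

/* CLI callback: print a snapshot of the counters. */
static void cli_show_stats(FILE *out) {
    fprintf(out, "pool_free_bufs:  %llu\n",
            (unsigned long long)stats.pool_free_bufs);
    fprintf(out, "queue_occupancy: %llu\n",
            (unsigned long long)stats.queue_occupancy);
    fprintf(out, "dma_errors:      %llu\n",
            (unsigned long long)stats.dma_errors);
}
```

Keeping the counters in one structure makes it trivial for the run-time monitor described next to snapshot and compare them between invocations.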
The second step is to implement a run-time monitor. This is a callback function, periodically invoked on timer events, that checks for some of the most common error conditions that can take place in the system and, if any such condition is met, triggers the corresponding alarm. Examples of relevant error conditions that may seriously impact the system operation include:
- One or more buffer pools are consistently empty or almost empty
- Some queues are consistently almost full (occupancy more than 90%)
- Specific queues have been written but not read in the last period of time (consumer deadlock)
- Specific semaphores have remained busy for the last several monitor invocations
- Some tables maintained by the application are full or almost full
- The buckets of some hash tables have become too long (a possible indication that the hash function is not efficient)
- The number of DMA errors or retries is above its acceptable threshold
- Bus occupancy dangerously close to the upper limit
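A minimal sketch of such a monitor callback, with hypothetical thresholds, alarm names, and state layout: it checks three of the conditions listed above, returns a bitmask of alarms, and clears the per-period counters at the end of each tick.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical periodic monitor invoked from a timer event.
 * All thresholds and names are illustrative. */
#define QUEUE_SIZE      256
#define QUEUE_ALARM_PCT  90            /* "almost full" threshold */

enum alarm {
    ALARM_POOL_EMPTY     = 1 << 0,
    ALARM_QUEUE_FULL     = 1 << 1,
    ALARM_CONSUMER_STUCK = 1 << 2,
};

struct monitored_state {
    size_t pool_free;          /* free buffers in the pool */
    size_t queue_occupancy;    /* entries currently queued */
    size_t queue_reads;        /* reads since the last invocation */
    size_t queue_writes;       /* writes since the last invocation */
};

/* Returns a bitmask of alarms; clears the per-period counters. */
static unsigned monitor_tick(struct monitored_state *s) {
    unsigned alarms = 0;

    if (s->pool_free == 0)
        alarms |= ALARM_POOL_EMPTY;
    if (s->queue_occupancy * 100 > QUEUE_SIZE * QUEUE_ALARM_PCT)
        alarms |= ALARM_QUEUE_FULL;
    /* Written but never read since the last tick:
     * a likely sign of a deadlocked consumer. */
    if (s->queue_writes > 0 && s->queue_reads == 0)
        alarms |= ALARM_CONSUMER_STUCK;

    s->queue_reads = s->queue_writes = 0;  /* start a new period */
    return alarms;
}
```

Returning a bitmask instead of acting directly keeps the monitor cheap on the timer path; the alarm handling (logging, LED, trap) can run at lower priority.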
The run-time monitor can also log the full state of the system for further offline analysis, as a graphical representation over time can uncover some less obvious problems. The monitor should be disabled or reduced to a bare minimum during normal operation, as it consumes computing cycles.