Performance problems become the focus once the output packets are correct and the system has been proven stable, but the performance metrics (such as throughput or latency) do not meet the expected targets for some of the supported traffic scenarios.
Because the system is stable, tuning performance is arguably preferable to debugging stability problems: performance problems are at least usually straightforward to reproduce. Some people would say that, once you have reached this stage, you can relax a little.
Possible Root Causes
If the performance gap is large, it might be hiding a functional or stability-related problem. It may also point to a design problem.
Design with the targeted performance numbers in mind from design day 1!
Otherwise, the performance tweaking problem might be caused by the incorrect dimensioning of resources when matched against the performance requirements.
Revisit the design, if need be.
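To check the dimensioning, it helps to turn the throughput target into a per-packet cycle budget. The following is a minimal sketch, assuming an Ethernet link and a single core doing all the work; the function names are illustrative and not part of any library.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame overhead on the Ethernet wire: 7-byte preamble, 1-byte start
 * frame delimiter, and 12-byte inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Maximum packets per second for a given line rate and frame size.
 * frame_bytes includes the 4-byte FCS (64 for minimum-size frames). */
double max_packets_per_second(double line_rate_bps, uint32_t frame_bytes)
{
    assert(frame_bytes > 0);
    double bits_on_wire = 8.0 * (double)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES);
    return line_rate_bps / bits_on_wire;
}

/* CPU cycles available per packet on one core (the packet budget). */
double cycles_per_packet(double core_hz, double packets_per_second)
{
    assert(packets_per_second > 0.0);
    return core_hz / packets_per_second;
}
```

For a 10 Gb/s link carrying 64-byte frames, this gives about 14.88 million packets per second, so a single 2 GHz core has roughly 134 cycles to spend on each packet. If the design needs more than that, the resources are under-dimensioned for the target.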
Instrument the code with time stamps and profile the application to identify the problem regions that are candidates for optimization. Do not spend effort on blocks that contribute little to the overall performance problem of the system. Following the 80/20 rule of thumb, the blocks to pick for optimization are the roughly 20% of the blocks that consume about 80% of the packet budget.
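A minimal sketch of time-stamp instrumentation might look like the following. It assumes a POSIX system and uses clock_gettime() as a stand-in for whatever time-stamp source your platform offers; the stage names and helper functions are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L /* exposes clock_gettime() under strict -std=c11 */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical pipeline stages; replace with the blocks of your design. */
enum stage { STAGE_PARSE, STAGE_CLASSIFY, STAGE_FORWARD, STAGE_COUNT };

struct stage_stats {
    uint64_t total_ns; /* accumulated time spent in the stage */
    uint64_t packets;  /* number of packets measured */
};

static struct stage_stats stats[STAGE_COUNT];

/* Monotonic time stamp in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Account one packet's time in stage s.
 * Usage inside the packet loop (parse_packet() is a placeholder):
 *     uint64_t t0 = now_ns();
 *     parse_packet(pkt);
 *     stage_record(STAGE_PARSE, t0, now_ns());
 */
void stage_record(enum stage s, uint64_t start_ns, uint64_t end_ns)
{
    assert(s < STAGE_COUNT && end_ns >= start_ns);
    stats[s].total_ns += end_ns - start_ns;
    stats[s].packets++;
}

/* Fraction of all measured time that was spent in stage s. */
double stage_share(enum stage s)
{
    uint64_t total = 0;
    for (int i = 0; i < STAGE_COUNT; i++)
        total += stats[i].total_ns;
    return total ? (double)stats[s].total_ns / (double)total : 0.0;
}

/* Stage with the largest accumulated time: the first optimization candidate. */
enum stage hottest_stage(void)
{
    enum stage hot = STAGE_PARSE;
    for (int i = 1; i < STAGE_COUNT; i++)
        if (stats[i].total_ns > stats[hot].total_ns)
            hot = (enum stage)i;
    return hot;
}
```

Keep in mind that the time stamps themselves cost cycles, so on a tight packet budget it may be better to sample only a fraction of the packets.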
Analyze the data collected by the real-time monitor to look for performance bottlenecks.
Overlooking some of the platform-related architectural considerations might also have a negative impact on the system performance:
- Cache line: On Intel Architecture processors, the size of the cache line is 64 bytes. Try to maximize the number of cache hits by minimizing the number of cache lines needed to store your data structures. For example, if a data structure is less than or equal to 64 bytes in size, allocating it at a 64-byte aligned address makes it fit into a single cache line rather than span two cache lines. If two cache lines are used to store a structure that would normally fit into a single cache line, then every access to an instance touches twice as many cache lines, which lowers the chance that the whole structure is already in the cache.
- Zero copy: Make sure that the packet does not have to be copied from one buffer to another in your design, as the memory copy is expensive.
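The two considerations above can be sketched together in C11. This is only an illustration under stated assumptions: the pkt_desc structure, its fields, and the helper functions are hypothetical, and CACHE_LINE_SIZE is set to the 64-byte Intel Architecture value quoted above.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64

/* Hypothetical per-packet descriptor. Aligning the first member forces the
 * whole structure onto a cache-line boundary and pads it to 64 bytes. */
struct pkt_desc {
    alignas(CACHE_LINE_SIZE) uint8_t *buf; /* start of the packet buffer */
    uint16_t data_off;                     /* offset of the current header */
    uint16_t data_len;                     /* bytes from data_off to the end */
    uint16_t port;
    uint32_t flow_id;
};

static_assert(sizeof(struct pkt_desc) == CACHE_LINE_SIZE,
              "descriptor must occupy exactly one cache line");
static_assert(alignof(struct pkt_desc) == CACHE_LINE_SIZE,
              "descriptor must start on a cache-line boundary");

/* Allocate n descriptors on a 64-byte boundary; release with free(). */
struct pkt_desc *pkt_desc_alloc(size_t n)
{
    return aligned_alloc(CACHE_LINE_SIZE, n * sizeof(struct pkt_desc));
}

/* Zero copy: strip hdr_len bytes from the front of the packet by moving
 * the offset forward; the payload itself is never copied. */
int pkt_pull(struct pkt_desc *d, uint16_t hdr_len)
{
    if (hdr_len > d->data_len)
        return -1;
    d->data_off = (uint16_t)(d->data_off + hdr_len);
    d->data_len = (uint16_t)(d->data_len - hdr_len);
    return 0;
}

/* Pointer to the first byte the next processing stage should look at. */
uint8_t *pkt_data(const struct pkt_desc *d)
{
    return d->buf + d->data_off;
}
```

Each stage then receives a pointer to the descriptor and works on the packet in place, so the only data that moves between stages is the 64-byte descriptor, not the packet contents.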
In this article, I've presented several techniques for debugging functional, stability, and performance problems related to packet-processing systems.