Optimizing a Complete System
Our first-pass analysis suggests that we should optimize at the system level; that is, there are no outstanding CPU bottlenecks, IO over-subscribers, or code blocks using inordinate amounts of memory. In embedded systems, the amount of available resources is typically fixed, so for the purposes of this article we will not consider system improvements such as adding more memory or adding a disk to provide more block devices.
If monitoring memory usage with free shows that the system is swapping continuously, performance will suffer. Where possible, the applications that use the most memory (see "Investigating a Memory Issue" for how to find them) should be analysed at the code level to reduce their memory usage.
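As a rough illustration of how continuous swapping can be distinguished from swap space that is merely occupied, the sketch below compares two samples of the kernel's swap counters. It assumes a Linux system that exposes the pswpin and pswpout counters in /proc/vmstat; the function names and the five-second default interval are hypothetical choices made for this example, not part of any tool discussed in this article.

```python
import time


def parse_swap_counters(vmstat_text):
    """Return (pages swapped in, pages swapped out) from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def swap_rate(before, after, interval_s):
    """Pages per second swapped in and out between two samples."""
    return ((after[0] - before[0]) / interval_s,
            (after[1] - before[1]) / interval_s)


def sample_swap_rate(interval_s=5, path="/proc/vmstat"):
    """Hypothetical helper: take two samples interval_s seconds apart."""
    with open(path) as f:
        first = parse_swap_counters(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        second = parse_swap_counters(f.read())
    return swap_rate(first, second, interval_s)
```

Calling sample_swap_rate() repeatedly and seeing non-zero rates in every interval would suggest ongoing swapping; a large but static amount of used swap with rates near zero is usually less of a concern.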
When looking for CPU-intensive applications with top in a multi-core environment, pay attention to the per-core CPU occupancy breakdown that top provides. Identifying the main CPU user on a single core and making that application multi-threaded, so that its load is shared across cores, is a key step in optimizing any multi-core system. Where there are several CPU-intensive applications, the operating system scheduler will already have distributed them across the cores.
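The per-core breakdown can also be computed directly, which may help on targets where an interactive top session is inconvenient. The following is a minimal sketch that assumes the Linux /proc/stat layout (one "cpuN" line per core with user, nice, system, idle, iowait, irq, softirq and steal tick counts); the function names are hypothetical.

```python
def parse_per_core_times(stat_text):
    """Map 'cpuN' -> (busy_ticks, total_ticks) from /proc/stat text."""
    cores = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 ..."; the aggregate line is just "cpu".
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Columns: user, nice, system, idle, iowait, irq, softirq, steal.
        values = [int(v) for v in fields[1:9]]
        total = sum(values)
        # Treat both idle and iowait as "not busy".
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        cores[fields[0]] = (total - idle, total)
    return cores


def per_core_utilization(before, after):
    """Percent busy per core between two parse_per_core_times() samples."""
    usage = {}
    for core, (busy_after, total_after) in after.items():
        busy_before, total_before = before.get(core, (0, 0))
        elapsed = total_after - total_before
        if elapsed > 0:
            usage[core] = 100.0 * (busy_after - busy_before) / elapsed
        else:
            usage[core] = 0.0
    return usage
```

One core pinned near 100% while the others sit mostly idle is the pattern described above: a single-threaded application that is a candidate for multi-threading.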
When iotop and/or sar show high IO utilization or bottlenecks, optimizing the applications to use the device more efficiently (by adjusting transfer sizes, for instance) is most likely the only option in an embedded system where adding devices is not possible.
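To make "device utilization" concrete, the sketch below estimates how busy each block device was over an interval. It assumes the Linux /proc/diskstats format, in which the thirteenth field of each line is the cumulative number of milliseconds the device has spent doing IO; the function names are hypothetical, and the result is only an approximation of the utilization figure that sar reports.

```python
def parse_io_ticks(diskstats_text):
    """Map device name -> milliseconds spent doing IO (/proc/diskstats text)."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Fields: major, minor, name, then the counters; index 12 is the
        # cumulative time spent doing IO, in milliseconds.
        if len(fields) >= 13:
            ticks[fields[2]] = int(fields[12])
    return ticks


def device_utilization(before, after, interval_ms):
    """Percent of the interval that each device was busy, capped at 100."""
    return {
        dev: min(100.0, 100.0 * (after[dev] - before[dev]) / interval_ms)
        for dev in after
        if dev in before
    }
```

A device that stays close to 100% across successive intervals is saturated, and the applications driving it are the ones worth examining for more efficient access patterns.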
Investigating a Memory Issue
The first-pass analysis has identified a potential memory issue. We should use free or sar to monitor memory usage at selected intervals and see whether system memory usage increases consistently. During these measurements, also note swap usage to determine whether swapping is causing a bottleneck. Use top, sorted by virtual memory usage, to determine which application uses the most memory, whether its memory usage is increasing (a memory leak), and whether any applications have a large amount of memory swapped out. Once we have determined which application is leaking memory, we should use valgrind to search for the locations of the leaks. Because a memory leak can only be established over time, from multiple measurements, the system must be understood well enough to know when it has reached a stable state in which memory usage is not expected to change. Without this information, a developer may misinterpret normal system operation as a memory leak.
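The idea of judging a leak from multiple measurements can be sketched as a simple trend estimate over equally spaced samples. The example below assumes the Linux /proc/<pid>/status format, where VmRSS and VmSize are reported in kB; the function names are hypothetical, and a positive slope is only a hint that warrants a closer look with valgrind, not proof of a leak.

```python
def parse_status_kb(status_text, field="VmRSS"):
    """Return the value in kB of a field such as VmRSS or VmSize, or None."""
    prefix = field + ":"
    for line in status_text.splitlines():
        if line.startswith(prefix):
            return int(line.split()[1])
    return None


def growth_slope(samples):
    """Least-squares slope (units per sample) of equally spaced samples."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

Samples should be taken only after the application has reached its steady state; growth during start-up or cache warm-up would otherwise produce a misleading positive slope.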
Although an embedded developer may not be able to add main memory to alleviate excessive swapping or disk thrashing, it may be desirable to lock all of the memory used by an application into main memory so that it cannot be swapped out. While the system is using a large amount of swap space, a developer may observe (using gprof or LTT) that the wall-clock time required to access various regions of memory is greater than during periods when swap usage is low.
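On Linux, one way to lock an application's memory is the POSIX mlockall() call. The sketch below invokes it through ctypes purely for illustration; a C application would call mlockall() directly. The flag values are assumed to be those used on Linux for x86 and may differ on other architectures, the wrapper names are hypothetical, and the call commonly fails unless the process has sufficient privileges or a large enough locked-memory limit.

```python
import ctypes

# Assumed values from <sys/mman.h> on Linux for x86; other architectures
# may define these flags differently.
MCL_CURRENT = 1  # lock pages that are currently mapped
MCL_FUTURE = 2   # also lock pages mapped later


def lock_all_memory(include_future=True):
    """Ask the kernel to keep this process's pages resident in RAM.

    Returns (True, 0) on success or (False, errno) on failure.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    flags = MCL_CURRENT | (MCL_FUTURE if include_future else 0)
    if libc.mlockall(flags) != 0:
        return False, ctypes.get_errno()
    return True, 0


def unlock_all_memory():
    """Undo lock_all_memory(); returns True on success."""
    libc = ctypes.CDLL(None, use_errno=True)
    return libc.munlockall() == 0
```

Locking memory does not reduce the amount an application needs; it only shifts swapping pressure onto other processes, so it is best reserved for the latency-sensitive parts of the system.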
IO Bottleneck Issue
Identifying an IO bottleneck within a full system is arguably the most difficult issue to track down. In networking scenarios where the network device is the bottleneck, the problem cannot be clearly identified without external equipment to generate the appropriate network conditions in and out of the system. Performance analysis of networking IO is, however, beyond the scope of this article. With the tools at our disposal, the IO area where we can get sufficient information for analysis is block devices and, more specifically, disk IO. As described in the "Start at the 10,000 ft View" section, sar and/or iotop provide data indicating that the CPU is waiting on a block device that is 100% loaded. To investigate further, we should use iotop to get a per-process breakdown of IO usage and determine which process is the main user of the device. Once the top process has been identified, VTune can be used to analyse the sections of the application that contribute to bus/disk utilization.
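The per-process breakdown can be approximated from the kernel's per-process IO accounting. The sketch below assumes the Linux /proc/<pid>/io format, which includes read_bytes and write_bytes counters when IO accounting is enabled in the kernel; the function names are hypothetical, and reading these files for processes owned by other users generally requires root privileges.

```python
def parse_proc_io(io_text):
    """Parse /proc/<pid>/io text into a dict of counter name -> int."""
    counters = {}
    for line in io_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters


def rank_by_disk_io(before, after):
    """Sort pids by bytes read plus written between two samples.

    before and after map pid -> parse_proc_io() result; processes that
    appear in only one sample are skipped.
    """
    deltas = []
    for pid, now in after.items():
        prev = before.get(pid)
        if prev is None:
            continue
        moved = ((now.get("read_bytes", 0) - prev.get("read_bytes", 0)) +
                 (now.get("write_bytes", 0) - prev.get("write_bytes", 0)))
        deltas.append((pid, moved))
    return sorted(deltas, key=lambda item: item[1], reverse=True)
```

The process at the head of the returned list is the main device user over the sampled interval and the natural starting point for deeper analysis.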
CPU Bottlenecks
As stated in the "Start at the 10,000 ft View" section, we can use top or ps to sort applications by CPU usage and identify the primary CPU users. Then, using VTune on the selected application, we can drill down to module-, function- and instruction-level code to determine where the hot spots are. Careful analysis of the code (and perhaps the assembly code) to understand the bottlenecks should follow, so that algorithms or code can be updated accordingly. Once this is done, the procedure is repeated to refine the code further.
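The first step, ranking processes by CPU usage, can be sketched without top or ps by sampling per-process CPU time twice. The example assumes the Linux /proc/<pid>/stat layout, in which fields 14 and 15 are the user and system CPU time in clock ticks; the function names are hypothetical, and the sketch covers only the process-level ranking, not the function-level analysis that VTune provides.

```python
def parse_cpu_ticks(stat_text):
    """Return (command name, utime + stime in ticks) from /proc/<pid>/stat."""
    # The command name is parenthesised and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = stat_text.rpartition(")")
    name = head.partition("(")[2]
    fields = tail.split()
    # tail starts at field 3 (state), so fields 14 and 15 are at 11 and 12.
    return name, int(fields[11]) + int(fields[12])


def top_cpu_users(before, after, count=3):
    """Rank processes by CPU ticks consumed between two samples.

    before and after map pid -> parse_cpu_ticks() result. Returns up to
    count tuples of (pid, name, ticks used), busiest first.
    """
    used = []
    for pid, (name, ticks) in after.items():
        if pid in before:
            used.append((pid, name, ticks - before[pid][1]))
    used.sort(key=lambda item: item[2], reverse=True)
    return used[:count]
```

The processes at the top of this list are the ones to profile in detail.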
Analysis Flow
For clarity, and to summarize what we have discussed so far, the following flow diagram represents one possible methodology. It is by no means the only possible method; there are countless variations, but we hope it indicates one good way to proceed.
Conclusion
Throughout this article, we have discussed many of the available tools for performance analysis on Intel architecture and Linux. The tools discussed are by no means exhaustive, as the "Alternative Tools" section indicates. By combining these tools with some basic performance analysis methodologies, we hope that we have given the newcomer sufficient information to feel comfortable starting a performance analysis task. For veteran developers and testers, we hope this article is informative and helps them understand the approach and the tools at their disposal.


