Performance Diagnosis & .NET Applications

The authors present a tool that lets you identify .NET-related problems and resolve bottlenecks during performance analysis.


August 01, 2005
URL:http://www.drdobbs.com/windows/performance-diagnosis-net-applications/184406201


Ramkumar and Sachin are software architects for Infosys. They can be contacted at [email protected] and [email protected], respectively.




Performance analysis for any application must be managed at every stage of the software-development lifecycle. Each of these stages uses different performance-management tools: profilers during the coding and unit-testing stages, load-testing tools during system validation and QA, and monitoring tools thereafter. During performance testing, however, analysis becomes a stumbling block when the system is subjected to production-like workloads in a distributed operating environment. In this context, large amounts of information need to be monitored and analyzed to detect bottlenecks. The lack of automation in this diagnosis process motivated us to design and implement the tool for the .NET Framework that we present here. The tool lets you identify problematic areas, then helps you resolve bottlenecks in the .NET Framework during performance analysis. (The complete source code for the tool is available electronically; see "Resource Center," page 3.)

Our approach to diagnosis begins with the creation of a knowledge base comprising several performance patterns, which are indicators for detecting bottlenecks. We represent these performance patterns as a Bayesian network, a formalism that has been used extensively in medical diagnosis. A given scenario is then diagnosed against these performance patterns to report possible problem areas or to comment on the scalability of the application.

Performance Patterns

The process of diagnosis for systems under load involves collecting and understanding metrics related to system resources, the .NET managed-code layer, and the application layer that together constitute the .NET stack. When examined, these metrics yield tell-tale signs that can reveal performance bottlenecks. For example, consider the performance counter System\Processor Queue Length defined in Microsoft's .NET platform, for which a sustained queue of more than two threads is an indication of a processor bottleneck. This example illustrates the first category of performance patterns, which are based on individual thresholds. There are at least 30 such performance patterns in the .NET platform (see http://www.microsoft.com/downloads/details.aspx?FamilyId=8A2E454D-F30E-4E72-B53175384A0F1C47&displaylang=en).
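To make the first category concrete, a threshold-based pattern can be expressed as a simple rule over a stream of counter samples. The following Java sketch is purely illustrative (the class and method names are ours, not part of the tool's source); it flags the pattern only when the threshold is exceeded for a sustained run of samples, as in the processor-queue example above.

// Illustrative sketch of a threshold-based performance pattern.
// The counter name and threshold mirror the Processor Queue Length example.
public class ThresholdPattern {
    final String counter;     // e.g., "\\System\\Processor Queue Length"
    final double threshold;   // e.g., 2.0
    final int sustainFor;     // consecutive samples that must exceed the threshold

    ThresholdPattern(String counter, double threshold, int sustainFor) {
        this.counter = counter;
        this.threshold = threshold;
        this.sustainFor = sustainFor;
    }

    // True if the threshold is exceeded for 'sustainFor' consecutive samples.
    boolean matches(double[] samples) {
        int run = 0;
        for (double s : samples) {
            run = (s > threshold) ? run + 1 : 0;
            if (run >= sustainFor) return true;
        }
        return false;
    }
}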

Likewise, to determine the benefit of adding an extra processor, you need to correlate the queue length with the Processor\% Processor Time counter. This illustrates the second category of performance patterns, where problems are diagnosed based on the relationships between a set of metrics. The sources for this second category are the operational laws of queuing theory (see The Art of Computer Systems Performance Analysis by Raj Jain, John Wiley & Sons, 1990), which define relationships between overall performance metrics, load conditions, and system utilization, and performance-tuning references that recommend specific tips.
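As a hedged illustration of such a relationship-based pattern, the fragment below correlates the two counters just mentioned. The thresholds and the interpretation of each branch are simplifications of the referenced guidance, not rules taken from the tool itself.

// Illustrative correlation of two counters into a single diagnosis.
public class ProcessorPattern {
    static String diagnose(double avgQueueLength, double avgCpuPercent) {
        if (avgQueueLength > 2 && avgCpuPercent > 80) {
            // Both counters high: the CPU itself is the bottleneck.
            return "CPU bound: an additional or faster processor is likely to help";
        } else if (avgQueueLength > 2) {
            // Threads queue up although the CPU is not saturated.
            return "Queuing without CPU saturation: investigate blocking before adding hardware";
        }
        return "No processor bottleneck indicated";
    }
}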

Thus, by using these performance patterns, you can detect bottlenecks and assess the scalability of applications. In this article, we restrict our discussion to the design and implementation of a tool that handles both types of patterns and, for brevity, illustrate it with a subset of each. To come up with a comprehensive list of these performance patterns, the first step is to identify the metrics and their defined thresholds. This list can then be augmented with the relationships that exist between the metrics.

Metrics to be Captured

Two types of metrics need to be considered for analysis. The first is overall application metrics, such as throughput and response time, which can be obtained from load-testing tools such as Microsoft's ACT.

For all other metrics, we rely on the Windows system monitor (Microsoft's Perfmon/Logman), which has well-defined performance objects for each layer of the .NET stack. These performance objects are logical collections of counters that reflect the health of a resource. The threshold for each of these counters can also depend on the type of server on which it is measured. For example, Microsoft products such as Commerce Server, Web Server (IIS), and Database Server (SQL Server) have their own subsets of these objects, and their thresholds may differ. Table 1 presents a subset of these performance patterns. On a first pass, threshold-based performance patterns quickly identify bottlenecks at a high level. However, a more detailed diagnosis is often driven by examining the relationships between metrics defined by the second category of performance patterns (see the accompanying text box entitled "Operational Laws").

The scope of our examination here is problem diagnosis, where a performance pattern is recognized and a recommendation is made when a bottleneck occurs. These recommendations point to possible impact areas, such as hardware, source code, or software configuration. To illustrate, consider the counter Memory\Available Mbytes, which reports the amount of physical RAM available in megabytes, failing to satisfy its required threshold. The recommendation would be to add more main memory or decrease the code footprint. You then drill down along these recommendations to resolve the problem.
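The mapping from a matched pattern to an impact area and a recommendation can be captured in a small lookup, as in the following sketch of the Available Mbytes example. The enum values and method names are illustrative only and are not taken from the tool's source.

// Illustrative mapping from a matched pattern to an impact area and advice.
public class RecommendationSketch {
    enum ImpactArea { HARDWARE, SOURCE_CODE, CONFIGURATION }

    static String recommendForLowMemory(double availableMbytes, double thresholdMb) {
        if (availableMbytes < thresholdMb) {
            return ImpactArea.HARDWARE
                + ": add main memory, or reduce the application's memory footprint";
        }
        return "Available Mbytes within threshold: no action required";
    }

    public static void main(String[] args) {
        System.out.println(recommendForLowMemory(40, 100));  // e.g., 40 MB free vs. a 100-MB threshold
    }
}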

Conceptual Design of the Diagnosis Tool

The diagnosis engine relies on the collection of performance patterns for the .NET Framework. The strength of the tool lies in the knowledge base made up of these performance patterns. Hence, an important design requirement is the flexibility to maintain and update the performance patterns over time, which calls for a suitable representation of the knowledge base. Popular forms of representation are Bayesian networks (http://www.niedermayer.ca/papers/bayesian/bayes.html), conventional decision trees (http://www.aaai.org/AITopics/html/trees.html), and Case-Based Reasoning (CBR) (http://www.aiai.ed.ac.uk/links/cbr.html). In our design, we use Bayesian networks because they offer the most flexibility and ease in modeling the knowledge base for our needs. The Bayesian network in this context is an acyclic graph in which a set of nodes and their relationships represent performance patterns. Each node is associated with a predefined set of values, and its relationships with other nodes encode the corresponding outcomes. (For more details, see the accompanying text box "Bayesian Networks.")

As indicated in Figure 1, the conceptual stages in the design are: collecting the required metrics from the load-testing tool and the Windows performance logs, evaluating them against the performance patterns encoded in the Bayesian network, and generating a report of bottlenecks and recommendations.

Tool Implementation

Our tool has two important components: monitoring and diagnosis. The monitoring part deals with collecting all the required metrics during load testing. Here we use the load generator's monitoring facility, along with utilities supported by Windows. There are other good utilities in Windows XP, such as Logman (http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/nt_command_logman.mspx), that let you define all the metrics once and reuse the same script for subsequent iterations.

In this tool, we implemented the monitoring functionality via the ServerMonitor.java program (available electronically), which takes test configuration details, such as the types of servers, hostnames, and load levels, as inputs. Based on the type of server, the program determines the metrics to be captured. It then generates performance counter logs on the local machine using the Logman utility with the relevant metrics. When testing starts, the utility sends instructions to the relevant host machines to retrieve the specific metrics. While this remoting option offers flexibility, there are restrictions to be aware of: The host and local machines need to be on the same LAN, and the local machine needs certain security access permissions (http://support.microsoft.com/default.aspx?scid=kb;en-us;818032). The data gathered by these counter logs is stored in Comma Separated Values (CSV) format on the local machine, and the filenames follow a specific naming convention. By default, one folder is created for each load level, and each individual server has a CSV file in each of them. The set of folders containing the collected metrics for the different load levels forms one of the feeds for the diagnosis engine. The other feed consists of overall system-level metrics, such as application throughput and response times, which need to be recorded from the load-testing tools for each load level.
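The following fragment is a minimal sketch of the idea behind ServerMonitor.java (the actual program is available electronically). It builds a Logman command that creates a counter log sampling remote counters and writes CSV into a per-load-level folder; the counter list, folder layout, and sampling interval are assumptions for illustration, and only the standard Logman switches (-c, -si, -f, -o) are used.

// Sketch only: create and start a Logman counter log that samples counters from
// a remote host and writes a CSV file into one folder per load level locally.
import java.io.IOException;

public class ServerMonitorSketch {
    static void startCounterLog(String host, String serverType, int loadLevel)
            throws IOException, InterruptedException {
        String logName = serverType + "_" + host + "_load" + loadLevel;  // naming convention
        String outFile = "logs\\load" + loadLevel + "\\" + logName;      // one folder per load level
        String remote  = "\\\\" + host;                                  // \\host prefix for remote counters

        // Each counter path is its own argument; this is a small sample of the
        // "All Servers" counters from Table 1.
        run("logman", "create", "counter", logName,
            "-c", remote + "\\Processor(_Total)\\% Processor Time",
                  remote + "\\Memory\\Available Mbytes",
                  remote + "\\System\\Processor Queue Length",
            "-si", "15",          // sample every 15 seconds
            "-f", "csv",
            "-o", outFile);
        run("logman", "start", logName);
    }

    static void run(String... cmd) throws IOException, InterruptedException {
        new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}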

In our implementation of the diagnosis engine (which is based on the Bayesian network), we used the Java version of the Netica API (http://www.norsys.com/netica-j.html?popup). Specifically, we used its stub framework to add our performance patterns and create the network. We refer to this compiled Java source as "PerfDiag." Input to PerfDiag starts with the overall system metrics, along with an option to associate the folder containing the CSV files for each load level. These inputs are sufficient to trigger the diagnosis process.
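The outline below suggests how such a diagnosis step might look with Netica-J. It is a sketch under assumptions, not an excerpt from PerfDiag: the network file name, node names, and state names are invented, and the calls shown (Environ, Net, Streamer, getNode, finding().enterState, getBelief) follow the pattern of Norsys' published Netica-J examples and should be verified against the API version in use.

// Sketch only: load a Bayesian network of performance patterns, enter evidence
// derived from the counter logs, and read back the belief in a bottleneck node.
import norsys.netica.Environ;
import norsys.netica.Net;
import norsys.netica.Node;
import norsys.netica.Streamer;

public class PerfDiagSketch {
    public static void main(String[] args) throws Exception {
        Environ env = new Environ(null);                       // start a Netica session
        Net net = new Net(new Streamer("perfpatterns.dne"));   // hypothetical network file
        net.compile();

        // Hypothetical nodes and states; the real network defines its own.
        Node queue = net.getNode("ProcessorQueueLength");
        queue.finding().enterState("High");                    // evidence from the CSV logs

        Node cpu = net.getNode("CpuBottleneck");
        System.out.println("P(CPU bottleneck) = " + cpu.getBelief("Present"));

        net.finalize();
        env.finalize();
    }
}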

Figure 2 depicts a portion of the graphical model of the network. After the inputs are assessed by the diagnosis engine, a complete bottleneck report is generated. The next section details a small case study and the sample report.

Experimental Results

To demonstrate how you use this diagnosis tool, we present a performance analysis of an online job-recruitment system based on the .NET Framework. Requests were fired from a load generator at a load-balanced cluster of two web servers. Figure 3 shows the deployment diagram.

We conducted load testing for one representative business scenario, Search And Apply, in which users log in to the system, search or apply for a job, and log out. The load test was conducted at four different load levels. The ServerMonitor program created and stored the performance logs for these servers; the PerfDiag diagnosis engine then analyzed the log files and generated a consolidated report.

The first round of diagnosis revealed a problem with the test bed, as Little's Law was not being satisfied. The analyst determined that the root cause was insufficient access permissions when users logged in. This problem was fixed and the tests rerun. The second round of diagnosis revealed an important bottleneck in database disk performance. The recommendation pointed to improving the database disk organization. Other, minor recommendations related to garbage-collection tuning and improving file-system cache performance. The report also contained graphs indicating application scalability and performance under varying load conditions.

Conclusion

We have presented a new approach to automating performance diagnosis. This automation hinges on identifying performance patterns that are tell-tale signs of bottlenecks. The process involves retrieving the required metrics and steering the diagnosis with the knowledge of these performance patterns. To represent the knowledge base, we chose the Bayesian network framework because we found it the most suitable for the problem at hand. While we have customized this tool for .NET applications, we believe the same conceptual design can be extended to performance diagnosis on other platforms.

Acknowledgments

Thanks to Rajeshwari G. for the overall guidance, and to the rest of the Quality of Service group, SETLabs, Infosys Technologies Ltd., for their support.

DDJ


Figure 1: Conceptual design.


Figure 2: Bayesian network.


Figure 3: Pilot application.


Operational Laws

Let,

N: Number of users
X: Transaction throughput (in transaction/sec)
R: Transaction response time (in sec)
S: Service Demand (in sec)
Z: Think time (in sec)
U: Utilization of the resource (% of time the resource is busy)

Then,

The Utilization Law: U = X*S. This facilitates computation of the service demand from measured utilization and throughput (S = U/X).

Little's Law: N = X*(R+Z). Checking Little's Law is mandatory even before looking for bottlenecks in the application, as it validates the test bed. If Little's Law is violated, either the testing process or the captured test data is incorrect (a numeric sketch of both laws follows at the end of this sidebar).

The Art of Computer Systems Performance Analysis by Raj Jain
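As a numeric sketch of these two laws (the figures are invented for illustration): with N = 100 users, X = 20 transactions/sec, R = 2 sec, Z = 3 sec, and S = 0.03 sec, the Utilization Law gives U = X*S = 0.6 (60 percent busy), and Little's Law gives X*(R+Z) = 20*(2+3) = 100, which matches N, so the test bed is consistent.

// Illustrative check of the Utilization Law and Little's Law with made-up figures.
public class OperationalLaws {
    public static void main(String[] args) {
        double N = 100, X = 20, R = 2, Z = 3, S = 0.03;

        double U = X * S;                 // Utilization Law: 20 * 0.03 = 0.6
        double predictedN = X * (R + Z);  // Little's Law: 20 * (2 + 3) = 100

        System.out.printf("Utilization U = %.2f%n", U);
        System.out.printf("Little's Law predicts N = %.0f (measured N = %.0f)%n", predictedN, N);

        // A large mismatch (here more than 10 percent, an arbitrary tolerance)
        // indicates a faulty test bed or bad measurements.
        if (Math.abs(predictedN - N) / N > 0.1) {
            System.out.println("Little's Law violated: check the test setup");
        }
    }
}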


Bayesian Networks

A Bayesian network (BN) is a graphical representation based on probability theory. It is a directed acyclic graph with nodes, arcs, and tables of associated probability values. These probabilities may be used to reason or make inferences within the system. Further, BNs have distinct advantages over other methods, such as neural networks, decision trees, and rule bases, when it comes to modeling a diagnostic system. One of the many reasons Bayesian networks are preferred over decision trees is that a BN can be traversed in both directions, reasoning from causes to effects and from observed effects back to likely causes (a small worked example follows at the end of this sidebar). Recent developments in this area include new and more efficient inference methods, as well as universal tools for the design of BN-based applications. An extended list of software for BNs is at http://bayes.stat.washington.edu/almond/belief.html.

—http://www.niedermayer.ca/papers/bayesian/bayes.html
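To make the diagnostic reasoning concrete, here is a minimal two-node example of the kind of inference a Bayesian network performs, worked directly with Bayes' rule. The prior and the likelihoods are invented for illustration; a real network such as the one in PerfDiag has many nodes with conditional probability tables.

// Minimal Bayes' rule example: observing a "high" Processor Queue Length
// updates the probability that a CPU bottleneck is present.
public class BayesExample {
    public static void main(String[] args) {
        double pBottleneck = 0.2;                  // prior: P(CPU bottleneck)
        double pHighGivenBottleneck = 0.9;         // P(high queue | bottleneck)
        double pHighGivenNoBottleneck = 0.1;       // P(high queue | no bottleneck)

        // Total probability of observing a high queue.
        double pHigh = pHighGivenBottleneck * pBottleneck
                     + pHighGivenNoBottleneck * (1 - pBottleneck);

        // Posterior after the evidence is entered: about 0.69.
        double posterior = pHighGivenBottleneck * pBottleneck / pHigh;
        System.out.printf("P(bottleneck | high queue) = %.2f%n", posterior);
    }
}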


Server Type: Performance Object\Performance Counter

All Servers (both Web and Database):
  Processor\% Processor Time
  Memory\Available Mbytes
  Memory\Pages/sec
  Network Interface\Bytes Total/sec
  Network Interface\Output Queue Length
  System\Context Switches/sec
  System\Processor Queue Length
  Server\Bytes Total/sec
  Memory\Pool Nonpaged Bytes
  Cache\MDL Read Hits %
  Memory\Page Reads/sec

Web Server:
  .NET CLR Memory\# Gen 1 Collections
  .NET CLR Memory\# Gen 2 Collections
  .NET CLR Memory\% Time in GC
  ASP.NET Applications\Requests Timed Out
  ASP.NET Applications\Requests/sec
  ASP.NET\Request Wait Time
  ASP.NET\Requests Rejected

Database Server:
  PhysicalDisk\Avg. Disk Read Queue Length
  PhysicalDisk\Avg. Disk sec/Transfer
  PhysicalDisk\Avg. Disk Write Queue Length

Table 1: Mapping between various servers and their respective metrics.
