
Predicting Communications Bottlenecks


April 2001

Stathis is a researcher at Microsoft Research's Cambridge lab. He can be contacted at [email protected].


Meeting performance objectives is becoming a critical factor in the success of an application. Measuring the time a distributed or parallel application spends communicating over the network is one technique a programmer or system administrator can use to locate the source of performance degradation. Although monitoring sounds like a straightforward approach, it has a number of drawbacks: monitoring probes often influence the performance of the system they observe (intrusion), and it is difficult to isolate the performance characteristics of an application from the network load created by other contributors to network traffic.

Monitoring requires the existence of the complete application and system implementation, a constraining limitation in some cases. For example, a system administrator might want to experiment with a system configuration that is not currently available. In a similar fashion, a programmer might want to assess the performance characteristics of alternative designs in the early stages of development.

Performance prediction technologies can be used in these cases to understand performance, determine bottlenecks, and select among alternative designs or configurations. Predicting the communication delays of an application has been the subject of academic research for a long time, but the work has not managed to break through to wider audiences in the commercial world. Network models are typically highly mathematical, introduce unrealistic constraints, and cannot encapsulate the detailed communication patterns generated by an application.

Part of the work at Microsoft Research involves the creation of technologies that can be used by nonexperts to predict and analyze system performance. In this article, I'll present a technique for calculating the communication delays that occur during the execution of an application. The system described uses ideas originating from various modeling approaches (statistical, analytical, simulation) and takes into account detailed application communication patterns, network contention, and network background loads. The model can be applied to any packet switching network (SAN and LAN, for instance). Some of the key features of the technique include:

  • User customization. Users can define the workload format, network characteristics (topology, routing, and the like), and processing of output information.

  • Fast evaluation. The model combines analytical and statistical models with simulation. This hybrid approach provides evaluation times on the order of hundreds of thousands of traces per second, which is substantially faster than previous techniques.

  • Detailed performance analysis. The model provides detailed output information about the operation of the network during the execution of the workload. The model output can be either further analyzed to produce summary metrics or converted to an appropriate format for additional processing and/or visualization.

  • No performance expertise is required. The model uses intuitive and straightforward abstractions that do not require advanced mathematics or performance expertise.

I'll first focus on the model evaluation algorithm, then address the main abstraction introduced within the communication model. I'll use a PC cluster connected to a Myrinet switch (a gigabit class SAN) to demonstrate the steps for calculating the communication cost for sample workload traces. The next section presents the C++ implementation and instructions for customizing the model to the characteristics of your system. Finally, I'll include a few model results and discuss issues that might arise when modeling LANs.

Model Algorithm

Packet-switching networks subdivide long messages into short packets and then transmit the packets much as message-switching systems transmit whole messages. The basic assumption of the model is that a single link can be shared amongst many messages. I have used a Myrinet switch (http://www.myri.com/), see Figure 1, to demonstrate an application of the model. The switch is a multistage network that includes 16 ports to PC nodes and 16 ports for interswitch connections. To simplify the example, I employ only one switch, so the interswitch ports are inactive; the same model can be modified to work for a larger system with many switches. The peak bandwidth of the Myrinet is 1.28 Gbps and the latency is 100 ns per 8-port switch. I am using the HPVM-MPI Version 1.9 library from the University of Illinois to program the internode communications. The Message Passing Interface (MPI) is a standard set of communication primitives widely used by the high-performance community to program PC clusters and supercomputers. Packets include 20 bytes of header and up to 2020 bytes of data.

The model requires detailed workload traces as input. The workload traces contain the sequence of computations and communications that take place on each PC node during the execution of the application. A straightforward way to obtain the traces is to instrument the application and collect traces for a range of input and system configurations. The same workload traces can be used with different networking configurations; for example, I can use the traces of an application running on the PC cluster for any interconnection network (Giganet, Myrinet, Gigabit Ethernet, or whatever) provided I have access to the underlying network model.

In the early stages of software development, when the source code is not yet available, you can generate traces by developing a functional prototype based on the application design. The prototype need only emulate the software components that generate the communication pattern, usually a manageable part of the overall software design. These techniques work well for applications with static communication patterns. Other systems, such as web farms and database servers, exhibit dynamic and irregular communication patterns. For these systems you create a statistical profile of the communication pattern; for example, a server might receive requests at intervals drawn from a certain distribution and then spend some time processing each request. Simulating this statistical behavior for a number of iterations can derive traces for the model; a sketch of this approach appears below.

In the following description of the model algorithm, I assume that the workload traces include only synchronous, point-to-point communications and computation delays. It is straightforward to extend the model to process more complex communication primitives such as asynchronous messages, synchronization primitives, and collective communications. Example 2 illustrates a sample trace.
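
As an illustration of the statistical approach, the sketch below emits a synthetic trace stream for a single server node whose requests arrive at exponentially distributed intervals. The function name and the output format are mine, for illustration only; they do not reproduce the wrkascii format of Example 2.

#include <cstdio>
#include <cstdlib>
#include <cmath>

// Emit synthetic traces for one server node: requests arrive at exponentially
// distributed intervals (mean mean_gap) and each triggers a fixed-size
// synchronous reply to the client. Names and format are illustrative only.
void EmitServerTraces(int server, int client, long mean_gap,
                      long reply_bytes, int iterations)
{
  for( int i = 0; i < iterations; i++ ) {
    // Exponential inter-arrival sample: -mean * ln(U), U in (0,1]
    double u   = (rand() + 1.0) / (RAND_MAX + 1.0);
    long   gap = (long)(-mean_gap * log(u));
    printf("IDLE %d %ld\n", server, gap);                     // think/idle time
    printf("SCOM %d %d %ld\n", server, client, reply_bytes);  // reply message
  }
}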

The first step of the model evaluation is to determine the Routing Array (RA); see Example 1(a). An RA embodies the route that a message follows to reach its destination by recording the network links, and the direction on each link, that the message travels through. The RA is a 2D array in which the columns represent the links of the network and the rows correspond to the message directions. An RA is created for each active message, with all cells initialized to 0. Each message is routed through a number of network links in order to reach its destination; the cells of the RA that correspond to those links and the appropriate link directions are set to 1.

The function routing(s,t) is network specific and returns an RA given the source and target nodes of a message. Listing One presents the Myrinet routing function. In the case of the Myrinet, an RA includes 32 columns: 16 for the links that connect the PC nodes to the first-level switches, and 16 for the links that connect first-level switches to second-level switches. The 16 interswitch links are not represented in the RA because they are inactive in this specific cluster. In Figure 2, two messages are traveling through the network. The links and the direction of the messages are depicted with darker lines, and the number associated with each link is the number of messages that share the link. In Figure 2, one of the links is shared between the two messages.

In the case of a link shared by more than one message, the contention factor increases, resulting in an increased communication delay. The Contention Array (CA), characterizing the traffic of the network at a particular instant in time, is determined by summing the active message RAs, as in Example 1(b). Nmsg is the number of active messages.

The maximum contention factor for each message (that is, the point where a bottleneck occurs) can be determined by masking off the links that are not part of the message route and then taking the maximum value of the remaining CA cells, as in Example 1(c). Nlink is the number of links (32 in the case of the Myrinet) and Ndir is the number of possible message directions (two).
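
The two steps above reduce to a pair of short loops. The following sketch uses names of my own (NLINK, NDIR, BuildCA, MaxContention); it is not the CCMOD source, but it shows how the CA of Example 1(b) and the bottleneck search of Example 1(c) can be computed, indexing the arrays by link and direction as Listing One does.

#include <algorithm>

const int NLINK = 32;  // links in the Myrinet example
const int NDIR  = 2;   // directions per link

// Sum the RAs of all active messages into the Contention Array (Example 1(b)).
void BuildCA(unsigned int ra[][NLINK][NDIR], int nmsg,
             unsigned int ca[NLINK][NDIR])
{
  for( int l = 0; l < NLINK; l++ )
    for( int d = 0; d < NDIR; d++ ) {
      ca[l][d] = 0;
      for( int m = 0; m < nmsg; m++ )
        ca[l][d] += ra[m][l][d];
    }
}

// Maximum contention factor for one message (Example 1(c)): mask off links
// that are not on the message's route, then take the largest remaining CA cell.
unsigned int MaxContention(const unsigned int ra[NLINK][NDIR],
                           const unsigned int ca[NLINK][NDIR])
{
  unsigned int cmax = 0;
  for( int l = 0; l < NLINK; l++ )
    for( int d = 0; d < NDIR; d++ )
      if( ra[l][d] )  // this link/direction is on the message route
        cmax = std::max(cmax, ca[l][d]);
  return cmax;
}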

The third step in the evaluation of the model is the calculation of the communication delays. Assuming that the network is quiet (without any link contention), the communication delay can be estimated by a statistical model. The process for creating the statistical model begins by measuring the communication delays on the quiet network, using a ping-pong benchmark: one of the nodes sends a message to a target node and the target node returns the same message to the source. Measure the time for the ping-pong exchange and divide by two; this is the communication delay for that message length. Repeat the process varying the length of the message and plot the delay times versus the size of the message. Using a linear regression statistical model, fit a line to the measurements. The linear function can then predict the communication delay for any message length. It is advisable to create two linear functions, one for messages up to a packet size and one for larger messages. Figure 3 shows the measurements and the linear models for the Myrinet switch. I have included the source of the linear regression function in Listing Two, and Example 3(a) shows the regression model for the Myrinet switch. Do not use the raw network bandwidth to calculate the delays of the quiet network; the prediction will be inaccurate because it ignores the software overheads of the communication libraries, the operating system, and so on.
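
To make the fitting step concrete, here is one way the lreg() function from Listing Two might be used to build the two-segment quiet-network model. The arrays and their sizes are hypothetical placeholders for your own ping-pong measurements; only the fitting procedure and the 2020-byte packet payload come from the article.

// lreg() from Listing Two: fits Tcom(len) = b0 + b1*len to ping-pong data.
void lreg(int n,long* l,long* t,double& b0,double& b1);

// Hypothetical measurement buffers: message lengths in bytes and the
// corresponding one-way delays (half the measured ping-pong round trip).
const int Nsmall = 8, Nlarge = 8;
long len_small[Nsmall], delay_small[Nsmall]; // messages up to one packet (<= 2020 bytes of data)
long len_large[Nlarge], delay_large[Nlarge]; // multipacket messages

double b0_small, b1_small, b0_large, b1_large;

// Fit the two linear segments after the arrays have been filled with measurements.
void BuildQuietModel()
{
  lreg(Nsmall, len_small, delay_small, b0_small, b1_small);
  lreg(Nlarge, len_large, delay_large, b0_large, b1_large);
}

// Predicted delay on a quiet (contention-free) network for any message length.
double Tquiet(long len)
{
  return (len <= 2020) ? b0_small + b1_small * len
                       : b0_large + b1_large * len;
}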

The statistical model of the quiet network can be used to predict the communication delays considering the physical characteristics of the network, protocol, and length of the message. To extend the model to include the link contention and background load, the concept of Bandwidth Availability (BA) is introduced. The link bandwidth (B) determines the maximum achievable speed of communication. However, when the links are shared amongst many messages, the observed bandwidth of the link, B, is divided by the number of messages. Additionally, part of the bandwidth might be consumed by background load. This is the load that might be generated from external workloads, not modeled within the current system. For a message that utilizes more than one link, BA should be calculated for the link that has the highest contention factor, as in Example 3(b). This is the bottleneck link of the message and determines the effective bandwidth. BA takes values in the range of (0,1].

Example 3(c) calculates the communication delay for a message, taking into account the link contention and the background load.

In Figure 2, both messages are 1 KB in size. The communication delay of a message traveling through a quiet network, according to Tquiet() in Example 3(a), is 48.9 μsec. Assume that only one application is using the network, so the background load is 0. The BA for both messages is 0.5, since two messages travel through the bottleneck link. The communication delay, taking the link contention into account, is therefore 97.8 μsec for both messages.
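
The worked example can be reproduced with a few lines of code. The expressions below are my reading of Examples 3(b) and 3(c), not a transcription of them: the fraction of the bottleneck bandwidth left after the background load is divided by the contention factor, and the quiet-network delay is stretched by the reciprocal of BA. The numbers agree with the 1-KB example (48.9/0.5 = 97.8).

// Bandwidth Availability for a message (my reading of Example 3(b)):
// contention - maximum contention factor on the message's route (>= 1)
// background - fraction of the bottleneck link's bandwidth consumed by
//              external load, in the range [0,1)
double BA(int contention, double background)
{
  return (1.0 - background) / contention;  // in (0,1]
}

// Communication delay under contention (my reading of Example 3(c)):
// the quiet-network estimate is divided by the bandwidth availability.
double TcomContended(double tquiet, int contention, double background)
{
  return tquiet / BA(contention, background);
}

// The example from the text: Tquiet = 48.9, two messages share the
// bottleneck link, no background load -> BA = 0.5, delay = 97.8.
double example_delay = TcomContended(48.9, 2, 0.0);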

The final step of the model evaluation algorithm is to determine when the network traffic changes. The communication pattern changes when a message ends or a new message begins. The duration of the network steady state is the Event Horizon (EH). The cost model determines the communication delays, assuming that the status of the network remains steady. However, during the lifecycle of a message, the traffic pattern might change many times. For each traffic change, the model has to determine the number of bytes that have traveled during the EH. For this purpose, a new function is required that returns the number of bytes consumed given EH (an inverse of Tcom). The inverse communication delay function for the Myrinet is in Example 3(d). The model subtracts these bytes from the overall message length. In the next evaluation cycle the cost model will calculate the communication delay for the remaining message and the new network traffic pattern. Example 4 summarizes the model evaluation.
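
Example 4 summarizes the complete algorithm; the outline below is my own sketch of that evaluation cycle. The helper names are hypothetical stand-ins for the CCMOD internals described later, not the library's actual functions.

// Hypothetical helpers standing in for CCMOD internals (sketch only).
int  ActiveMessages();          // messages still in flight
void BuildRoutingArrays();      // routing() per active message, Example 1(a)
void BuildContentionArray();    // CA = sum of the RAs, Example 1(b)
void ComputeDelays();           // Tquiet, BA, and Tcom per message, Example 3
long EventHorizon();            // time until the traffic pattern next changes
void ConsumeBytes(long eh);     // inverse of Tcom (see T2Pack in Listing Four)
void RetireAndFetchTraces();    // retire finished messages, fetch next traces

// One reading of the evaluation cycle summarized in Example 4.
void Evaluate()
{
  while( ActiveMessages() > 0 ) {
    BuildRoutingArrays();
    BuildContentionArray();
    ComputeDelays();
    long eh = EventHorizon();
    ConsumeBytes(eh);
    RetireAndFetchTraces();
  }
}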

The Implementation

I've implemented the model as a C++ library called "CCMOD" (available electronically; see "Resource Center," page 5) that can be used either as a stand-alone tool or as a part of a performance prediction and analysis tool. The library provides easy customization of the network characteristics, the workload trace formats, and the type of result created during the evaluation. Figure 4 is a class diagram of the library. The library provides three abstract classes that are the interface to the user definitions. These classes are:

  • workload, which provides an interface to the user-specified workload trace format. The user definition might include the reading and preprocessing of any type of traces generated by an application instrumentation library, functional prototype, or statistical simulation. A derived class wrkascii is included with the library that processes ASCII workload traces such as the ones in Example 2. The source code with the definition of the class is included in Listing Three. The functionality of the class member functions is self-explanatory.

  • netsys, which provides an interface between the evaluation engine and the network configuration and performance characteristics. A derived class myrinet is included with the library that customizes the model for the Myrinet switch used in the example. Listing Four is the source code with the definition of the class.

  • otrace, which provides detailed output traces about the operation of the network during the evaluation process. The user supplies an otrace-derived class to process the traces according to the requirements of the performance study. The output traces can be processed to generate a wide range of metrics or converted to a standard trace format (SDDF, for instance) for analyzing the network performance with a visualization tool. The source code with the definition of the class is included in Listing Five. The evaluation engine signals the user-defined otrace class processing function (RecvSignal) during the evaluation process. Each signal identifies the evaluation stage and carries a number of arguments that provide additional information about it; for example, a signal is generated and passed to the user-supplied RecvSignal function after the calculation of the event horizon, together with the duration of the horizon. The signals and the corresponding arguments generated by the evaluation engine are described in the file readme.htm supplied with the source code of the model. A derived class otrdeb is included with the library that formats and prints all available output traces created during the evaluation.

Listing Six shows how to configure the communication model for the PC cluster example and how to call the evaluation engine to produce predictions.

Conclusion

We've used the model presented here internally in a number of performance studies, both as a stand-alone tool and as part of a larger performance toolset. Most of the applications we have analyzed exhibit regular communication patterns. For applications whose communication patterns are not regular, such as client-server database transactions or web applications, it is still possible to generate detailed workload traces by simulating their statistical behavior.

Customizing the model to LAN environments requires taking into consideration some additional performance issues. The most complex is the handling of the background load. LAN background loads are difficult to model because they exhibit irregular patterns. The majority of studies on the characterization of Ethernet traffic suggest using an exponential distribution to represent the background traffic. Included in Listing Seven is a function that produces background loads for the model based on the exponential distribution. Before using this function, you will need to measure the average background load. If you create a model for your LAN, I would suggest obtaining three sets of predictions for three possible background loads: an average case, a worst-case scenario, and a best-case scenario. This will give you a better idea of the bounds of the performance you can expect. It is always advisable to obtain performance bounds instead of single predictions for any performance study, but it is especially critical for models that include nondeterministic elements, such as the background load.
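
As a sketch of the three-scenario approach, the mean load passed to Listing Seven's background model can simply be varied per run. The scale factors below are illustrative assumptions, not measurements; substitute bounds that make sense for your LAN.

// Pick the mean background load (packets/sec) for a prediction run.
// The best-case and worst-case factors are assumptions for illustration only.
enum Scenario { BEST_CASE, AVERAGE_CASE, WORST_CASE };

long ScenarioLoad(long measured_avg, Scenario s)
{
  switch( s ) {
  case BEST_CASE:    return 0;                 // quiet LAN
  case WORST_CASE:   return 2 * measured_avg;  // pessimistic bound
  case AVERAGE_CASE:
  default:           return measured_avg;      // measured mean load
  }
}
// Use the returned value in place of the constant l in Listing Seven's
// GetBgrLoad(), then evaluate the model once per scenario.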

In the PC cluster example, the model consumed 130,000 traces per second running on a 450-MHz Pentium III PC. The accuracy of the predictions varies depending on the characteristics of the network and the application. For the Myrinet switch and regular communication patterns, the prediction error was less than 20 percent.

Finally, Figures 5 and 6 are performance graphs for an application running on a PC cluster, included to demonstrate the type of analysis that can be derived from the model output traces. The workload is Sweep3D, a complex wave-front benchmark used to evaluate advanced parallel architectures at Los Alamos National Laboratory. Figure 5 shows the knee capacity of the network; knee capacity is the throughput of the network, in packets/sec, for the link that has the highest load. The graph also includes the effect of a 30 percent background load. I have used this type of graph to determine the effect of background loads on the effective network capacity. Figure 6 is a space-time diagram for the same application. Space-time diagrams are often used by monitoring visualization tools to show the overall picture of the system operation: they include a timeline for each processor, show four basic types of events (idle, send, receive, and wait), and connect the timelines with lines to denote communications. They are very useful diagrams for understanding, and in many cases debugging, the communication pattern. Both diagrams were created by processing the model output traces with custom Mathematica modules.

DDJ

Listing One

#include <cmath>
#include <cstdlib>

const int Nproc = 16;        // # of PCs
const int LperG = 4;         // Links per switch group
const int SwitchLinks = 32;  // Links per switch
const int LevelLinks = 16;   // Links between switch levels
int S1toS2[16];              // 1st-level switch -> 2nd-level switch mapping

// Initialize routing algorithm
void RoutingInit(void)
{
  // Determine static 1st-level -> 2nd-level connections
  for( int i = 0; i < LevelLinks; i++ )
    S1toS2[i] = (int)floor(((float)rand()/(RAND_MAX+1.0)) * LperG)+1;
}
// Create Routing Array for message Psrc -> Pdst
void Routing(int Psrc,int Pdst, unsigned int (*Vr)[2])
{
  int SrcGr = Psrc/LperG;  // Src switch group
  int TrgGr = Pdst/LperG;  // Trg switch group
  // Initialize RA to zeros
  for( int i = 0; i < SwitchLinks; i++ )
    Vr[i][0] = Vr[i][1] = 0;
  // Routing
  Vr[Psrc][0] = 1;  // Psrc -> 1st-level switch
  // If Psrc and Pdst don't belong to the same 1st-level switch,
  // involve a 2nd-level switch
  if( SrcGr != TrgGr ) {
    // Use static map to select 2nd-level switch
    int selsw = S1toS2[Psrc];
    // Set busy link from source 1st-level switch to 2nd-level switch
    Vr[ SrcGr*LperG + selsw + LevelLinks-1][0] = 1;
    // Set busy link from 2nd-level switch to target 1st-level switch
    Vr[ TrgGr*LperG + selsw + LevelLinks-1][1] = 1;
  }
  Vr[Pdst][1] = 1;  // 1st-level switch -> Pdst
}


Listing Two

#include <cmath>

// Linear Regression Model
// Arguments
// n  - number of measurements
// l  - Message length array
// t  - Communication delay array
// b0 - (returned) regression parameter
// b1 - (returned) regression parameter
//
// Tcom(len) = b0 + b1 * len
//
void lreg(int n,long* l,long* t,double& b0,double& b1)
{
 double mx,my,sxy,sx2;
 mx = my = sxy = sx2 = 0.0;
 for( int i = 0; i < n; i++ ) {
  mx += l[i]; my += t[i];
  sxy += (double)l[i]*t[i];    // accumulate in double to avoid long overflow
  sx2 += pow((double)l[i],2.0);
 }
 mx = mx / n; my = my / n;

 b1 = (sxy-n*mx*my) / (sx2-n*pow(mx,2.0));
 b0 = my - b1 * mx;
}


Listing Three

#include <iosfwd>

// Workload customization class
class workload {
public:
  enum trace {     // Trace type
    TRACE_IDLE,    // Processor is idle
    TRACE_SCOM,    // Synchronous communication
    TRACE_END      // End of traces
  };
  // Print workload debug info
  virtual void  print(std::ostream&) = 0;
  // Get type of trace
  virtual trace GetTraceType(int procid) = 0;
  // Get comm trace data
  virtual void  GetTraceData(int procid,int &src,int &trg,long &len) = 0;
  // Get idle processor trace data
  virtual void  GetTraceData(int procid,long &time) = 0;
  // Fetch next trace
  virtual trace FetchNextTrace(int procid) = 0;
};


Listing Four

// Network customization class
class netsys {
 public:
  // Create Routing Array
  virtual void  Routing(int,int,unsigned char(*)[2]) = 0;
  // Return number of network links
  virtual int   GetLinkNo(void) = 0;
  // Return number of nodes connected
  virtual int   GetNproc(void) = 0;
  // Return size of packet in bytes
  virtual int   PacketSize(void) = 0;
  // Return network name
  virtual const char* Name(void) = 0;
  // Return background load array
  virtual float (*GetBgrLoad(long clock))[2] = 0;
  // Return communication cost for quiet network
  virtual long  Tcom(int hops,float packets) = 0;
  // Inverse of Tcom for calculating packets consumed
  // during an Event Horizon
  virtual float T2Pack(long eh,int hops, float total_packets,
                       float remaining_packets,long Tcom) = 0;
};


Listing Five

// Output trace customization class
class otrace {
 public:
  // Signals from evaleng -> otrace
  enum otrace_sig {
    CONFIG,GO_INIT,GO_STARTEVENT,
    GO_END,CREATEEVENT,NEWSCOM,
    NEWPROC,SYSCONT,MSGCONT,
    COMMCOST,EVENTHORIZON,
    UPPR_PROC,UPPR_SCOM,UPPR_PCNS,
    UPPR_END
  };
  // Receive an output trace signal and its arguments from the evaluation engine
  virtual void  RecvSignal(otrace_sig ...) = 0;
};


Listing Six

#include <iostream>
#include <fstream>
#include "ccmod.h"
#include "wrkascii.h"
#include "myrinet.h"
#include "otrdeb.h"

using namespace std;
int main(int argc,char** argv)
{
  const int Nproc = 16; // Number of PC nodes
  if( argc != 3 ) {
    cerr << "Usage: model.exe <workload traces> <output file>" << endl;
    return 1;
  }
  // Set up trace source if present
  ifstream fsrc(argv[1]);
  if( !fsrc ) {
    cerr << "Error opening trace file\n";
    return 1;
  }
  // Set up otrace file
  ofstream otrg(argv[2]);
  if( !otrg ) {
    cerr << "Error opening output trace file\n";
    return 1;
  }
  // Set up & evaluate model
  try {
    // Set up user-defined classes
    // ASCII traces (derived from workload)
    wrkascii w(Nproc,1,&fsrc);

    // Network model (derived from netsys)
    myrinet myrswitch(Nproc);

    // Output trace facility (derived from otrace)
    // Use DBNONE to turn off debug mode
    otrdeb odb(otrdeb::DBALL,&otrg);

    // Set up evaluation engine
    evaleng e(&w,&myrswitch,&odb);
    // Evaluate; Go() returns the PC node clock
    cout << e.Go() << endl;
  }
  // Error exception
  catch( ccmod_error& er ) {
    cerr << er.what() << endl;
    return(1);
  }
  return(0);
}


Listing Seven

#include <cmath>
#include <cstdlib>

// Inverse Exponential Distribution: sample an exponentially distributed
// background load and convert it to a fraction of the link bandwidth
static inline float _ExpD(long l, long b)
{
 float inv = -l * log((float)rand()/(RAND_MAX+1.0f));
 return( inv >= b ? 1 : inv/b );
}
// Function derives from netsys::GetBgrLoad
// (requires the definition of the user's ethernet class, derived from netsys)
// Return background load vector based on the exponential distribution
float (*ethernet::GetBgrLoad(long clock))[2]
{
  // Configure these parameters for your system
  // l - Measured average background load (packets/sec)
  // b - Max link bandwidth (packets/sec)
  const long l = 7000;
  const long b = 25000;
  // For each link, draw a background load fraction in each direction
  for(int i = 0; i < GetLinkNo(); i++) {
    Vbgr[i][0] = _ExpD(l,b);
    Vbgr[i][1] = _ExpD(l,b);
  }
  return(Vbgr);
}



