Jul00: Load Balancing for UNIX and Win32

Sakib is a doctoral fellow of the Indian Institute of Management, Calcutta, and is currently with Software Concept Laboratory of Infosys Technologies Limited. He can be contacted at abdulsakib@inf.com.


In most distributed applications, the workload should be distributed among processors according to their processing capabilities. In other words, the workload needs to be balanced across all available processors via software. Most load-balancing software is based on one of two techniques -- static and dynamic.

Static load balancing (SLB) does not depend on the state (workload) of the processors. Simple static load-balancing techniques include join-the-shortest queue (SQ) and minimum expected delay (MED). The SQ policy allocates an arriving task to the processor having the minimum load, while the MED policy allocates an arriving task to the processor having the minimum expected value of waiting time for currently scheduled tasks. Although these policies may work similarly for homogeneous processors, their behaviors differ for heterogeneous processors.
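To make the two policies concrete, here is a minimal sketch (the array names and the per-task service-time model are illustrative assumptions, not XYALB code). SQ looks only at queue lengths, while MED weights each queue by that processor's mean service time, which is why the two can disagree on heterogeneous processors:

```c
#include <assert.h>

/* Hypothetical sketch of the two static policies.
 * queue_len[i] -- number of tasks queued at processor i
 * svc_time[i]  -- mean service time per task on processor i (assumed known)
 */

/* SQ: pick the processor with the fewest queued tasks */
int pick_sq(const int queue_len[], int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (queue_len[i] < queue_len[best])
            best = i;
    return best;
}

/* MED: pick the processor with the minimum expected waiting time,
 * i.e., queue length weighted by that processor's service time */
int pick_med(const int queue_len[], const double svc_time[], int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (queue_len[i] * svc_time[i] < queue_len[best] * svc_time[best])
            best = i;
    return best;
}
```

With homogeneous service times the two functions always agree; with heterogeneous ones, a longer queue on a fast processor can still win under MED.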

Dynamic load balancing (DLB), on the other hand, relies on present and past states (workload history) of the processors, and distributes the load among the hosts based on these states. With DLB, an initial assignment of work to the processors is done through static load balancing. Subsequent adjustment of work depends on the workload profile of the processors, which involves performance monitoring, information exchange, decision making, and load exchange. DLB generally performs better than SLB, but has the burden of communication and further processing.

The timing of information exchange is decided by initiation strategies that are either time driven or load driven. In time-driven strategies, information is exchanged at fixed time intervals. In load-driven strategies, information exchange depends on the load distribution; for example, a possible synchronization point could be the moment at which any processor finishes the work assigned to it. Load-driven information exchange can be receiver initiated or sender initiated. In a receiver-initiated strategy, whenever the load of a host drops below a predetermined threshold, interrupts are sent to all participating hosts, which then share their load information. A new load distribution is calculated based on that information, and the receiver is possibly allocated some additional work. In a sender-initiated strategy, participating hosts are interrupted when the load of any workstation exceeds a predetermined threshold; information is exchanged and a new distribution is determined in which a portion of the sender's workload is shared by other hosts. Thresholds are decided dynamically based on workload, and a host can act as sender or receiver depending on its load.
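As a sketch of the load-driven case, a host might classify itself against two dynamically chosen thresholds; crossing the lower bound triggers a receiver-initiated exchange, crossing the upper one a sender-initiated exchange. The names and enum below are hypothetical, not XYALB identifiers:

```c
#include <assert.h>

/* Hypothetical sketch of load-driven initiation; in practice the
 * thresholds would be recomputed dynamically from observed workload. */
typedef enum { ROLE_NEUTRAL, ROLE_RECEIVER, ROLE_SENDER } role_t;

role_t initiation_role(double load, double lo_threshold, double hi_threshold)
{
    if (load < lo_threshold)
        return ROLE_RECEIVER;   /* underloaded: ask others for work */
    if (load > hi_threshold)
        return ROLE_SENDER;     /* overloaded: push work to others */
    return ROLE_NEUTRAL;        /* no exchange triggered */
}
```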

Information exchange can be either global (G) or local (L). In global strategies, all processors share performance information to derive the new distribution of work. With local strategies, processors are divided into a number of groups, and information on the load is shared only within a group. Global strategies in general have high communication overheads, though they are a good choice when you can group processors into clusters having uniform capabilities.

Load balancers can be centralized (C) (located at a master processor) or distributed (D) (submodules distributed among processors). All in all, there are four broad categories of DLB, depending on information sharing and load-balancing strategies:

  • Global centralized (GCDLB).
  • Local centralized (LCDLB).

  • Global distributed (GDDLB).

  • Local distributed (LDDLB).

With this in mind, a good load-balancing tool must still meet two conflicting requirements:

  • Use of detailed run-time system information. The effectiveness of a load-balancing strategy hinges on the accuracy of the information about the workload of the machines available. Obtaining this accuracy requires accessing information about available memory, swap space, CPU speed, and so on. The load-balancing tool should make judicious use of this system-level information. The flip side of this is that it makes the load-balancing software platform dependent, which conflicts with our second requirement.
  • Portability. For a number of reasons, the machines running a distributed application are often heterogeneous. Even if the machines at one site happen to share a platform, there is no guarantee that they will match the machines at other sites.

Although generic load-balancing programs are useful, they aren't easy to build. Consequently, in this article, I present XYALB, a load-balancing program designed to work on SunOS 4.1.1 and 4.1.3, Red Hat Linux 5.2 (kernel 2.0.36), and Windows 95/NT.

XYALB has its roots in YALB, a load-balancing program developed at the Institute for Operating Systems and Computer Networks (http://www.ibr.cs.tu-bs.de). YALB is a GDDLB tool written in C and tested on SunOS 4.1.1 and 4.1.3 only. My objective in developing XYALB was to provide an infrastructure that allows more experimentation than YALB with different load-balancing strategies on a variety of platforms. XYALB has three major modules, each of which runs as a daemon/service on each host in the load-balancing network:

  • loaddaemon, which keeps track of the local load of a host and periodically communicates this information to the loadmanager.
  • loadmanager, which periodically updates the list of host loads, either through broadcast or polling.

  • rexecmanager, which, whenever a job is submitted, selects (with the help of the loadmanager) the best machine for executing the job, then executes the job on that machine.

The complete source code for XYALB is available electronically from DDJ (see "Resource Center," page 5).

In addition to these modules, a few other programs are required for configuring and running the tool. For instance, once the daemons are running, you need to specify host- and system-specific parameters in a configuration file and run the programs yalb_host_config and yalb_system_config to set up the system. Then, you can use rexec to submit a job. The job, along with a load-balancing strategy, is specified as command-line arguments of rexec. There are also programs to monitor load status and shut down the system. Details on how to set up XYALB are available electronically.

As Figure 1 illustrates, the three modules in XYALB run as services/daemons. For convenience, I have renamed the services loaddaemon, loadmanager, and rexecmanager as ldaemon, lmanager, and remote, respectively. remote runs as a TCP application, while ldaemon and lmanager run as UDP applications. Immediately after startup, all three services execute initialization tasks (setting up shared memory, opening socket connections with other modules, and the like). To run XYALB, you need to run only ldaemon, which in turn runs lmanager and remote after a broadcast request to the ldaemons at other hosts. ldaemon then waits in a loop for requests and takes action depending on the request; see Table 1.

remote waits for new jobs. When it receives one, it checks whether it has any sockets available to serve the request and accordingly sends an acceptance or refusal. If it accepts, remote waits for the actual command from rexec, authenticates the user and the request, and (with the help of ldaemon) creates a child process with standard input, output, and error redirected to the remote host submitting the request (see Figure 1). The child process then runs the submitted job.

Load is measured based on the CPU run queue. Because job arrival is stochastic, taking the instantaneous load as a true measure may cause unnecessary job transfers. To avoid being misled by short-term load spikes and bursts, I apply a smoothing average to the load. If any job has terminated or started since the run queue was last calculated, the run queue is adjusted accordingly. Other parameters that can serve as indications of load are CPU usage and swap-space availability. The load measure is further adjusted by adding a factor that depends on CPU usage. Taking the load measure as the sole determining factor for the assignment of new jobs has the disadvantage that it may be low when the run queue is low but CPU usage is high, or vice versa. To avoid this situation, another measure (called "availability") is used: It marks a host unavailable if either factor is very high. The availability of a host is also decided by the free swap space at the host.
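The smoothing and availability checks might look like the following sketch. The constants and function names are illustrative assumptions, not XYALB's actual tuning values (those are in the electronic listings):

```c
#include <assert.h>

/* Illustrative constants -- not XYALB's actual tuning values */
#define SMOOTH_WT   0.25   /* weight given to the newest run-queue sample */
#define CPU_LIMIT   0.90   /* mark host unavailable above 90% CPU usage   */
#define MIN_SWAP_KB 4096L  /* ...or below 4 MB of free swap               */

/* Exponential smoothing damps short-lived spikes in the run queue */
double smooth_load(double prev, double sample)
{
    return SMOOTH_WT * sample + (1.0 - SMOOTH_WT) * prev;
}

/* A host is "available" only if no single factor is extreme */
int host_available(double cpu_usage, long free_swap_kb)
{
    return cpu_usage < CPU_LIMIT && free_swap_kb > MIN_SWAP_KB;
}
```

Note that a host with a low smoothed load can still be marked unavailable when CPU usage or swap exhaustion alone makes it a poor target, which is exactly the situation the availability measure guards against.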

The ANTARC Algorithm

Recall that my goal with XYALB was to build a framework for experimenting with different load-balancing algorithms and strategies. To demonstrate how to use XYALB in different situations, I implemented an algorithm based on the theoretical results of several papers. I call this algorithm "Adaptive Network Traffic-based Algorithm with Reduced Collision" (ANTARC).

In "Ancient and New Algorithms for Load Balancing in Lp Norm" (Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 1998), A. Avidor et al. showed that, for minimizing the squared sum of loads on hosts, an algorithm that assigns incoming load to the lightest host guarantees that the cost of the solution so obtained never exceeds the optimal objective value by more than 33 percent. They also established that no online algorithm can achieve better performance. The result is not surprising, since assigning load to the most lightly loaded machine has been shown to give good performance for many other scheduling problems with similar objective functions. However, Avidor et al. do not consider network delay, which is an important factor. Therefore, with ANTARC I try to assign load to one of the lightly loaded servers while taking network delay into account.

In "Optimal Static Load Balancing in Distributed Computer Systems" (Journal of the ACM, Vol. 32, 1985), A.N. Tantawi et al. consider network load and suggest an optimal static policy. Network delay in a LAN or satellite network is determined by total network traffic (not by the load on any individual host), and the delay for transferring a packet between any two nodes is the same. Based on the loads of hosts and networks, they partition the nodes into four types: sink, neutral, passive source, and active source. Classification of hosts into these categories is based on incremental delay. Taking a cue from this, I decided to experiment with a new algorithm that decides whether to execute a job locally or transfer it to another host based on the total expected incremental delay in response time. Rather than using a complicated formula (as does Tantawi), I use a simple function based on the average delay due to workload, the network delay, and a provided speed factor (alpha). However, you can make it more precise by calculating speed online.
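Under those assumptions, the local-versus-remote decision reduces to one comparison, sketched below with hypothetical names (in XYALB itself, the corresponding logic appears in Listing One using WT_NET and netdelay):

```c
#include <assert.h>

/* Hypothetical sketch of the cost comparison: alpha is the speed
 * factor, wt_net the average KB transferred per job, netdelay the
 * per-KB network delay. */
double expected_ready(double load, double alpha)
{
    return load / alpha;    /* expected ready time of one host */
}

/* Nonzero if running locally beats the best remote candidate once the
 * network-transfer penalty is added to the remote estimate. */
int prefer_local(double local_load, double local_alpha,
                 double best_remote_ready, double wt_net, double netdelay)
{
    return expected_ready(local_load, local_alpha)
           < best_remote_ready + wt_net * netdelay;
}
```

The network penalty term means a moderately loaded local host can still win over an idle remote one when the job moves a lot of data.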

Sometimes, because several hosts take transfer decisions at the same time, the same lowest loaded host may be chosen by a number of hosts, and as a result it becomes highly overloaded after the transfers. To avoid this situation, you should prevent the same host from being chosen by more than one host simultaneously. In "Using System State Information for Adaptive State Polling Policy in Distributed Load Balancing" (Second AIZU International Symposium on Parallel Algorithms, 1997), G. Lee proposed a global priority order (GPO) network to address this problem. However, his approach adds computational burden. I achieve the same objective by choosing one host randomly from the set of lightly loaded nodes. This does not ensure that load is transferred to the lowest loaded host; but as long as the load goes to a lightly loaded host and the lowest loaded host does not become heavily burdened, it suffices.
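A minimal sketch of that randomized choice follows; it is array-based for clarity, whereas XYALB's Listing One walks a linked list of selected hosts instead:

```c
#include <assert.h>
#include <stdlib.h>

/* Pick one host at random from the set of lightly loaded candidates,
 * so that concurrent senders are unlikely to all target the same host. */
int pick_random_low_host(const int low_hosts[], int n)
{
    return low_hosts[rand() % n];
}
```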

In Figure 2, which describes the ANTARC algorithm in detail, the threshold idle_host is decided before the algorithm is run. Whenever the load of a host is below this threshold, that host is assumed to be idle. WT_NET is also decided beforehand; it measures the intensity of network transfer for remote execution. You can choose a value equal to the average size (in KB) of data transferred per job. The variable netdelay measures the average round-trip time for transferring 1 KB of data between any pair of hosts. The expected ready time of a host is obtained by dividing its total load by its speed. The variable temp_cmpl_time holds the minimum expected response time over all hosts. Figure 3 describes how the list of lightly loaded hosts is chosen.

To experiment with the ANTARC algorithm, I inserted its implementation (see Listing One) inside lmanager.c. Consequently, XYALB uses ANTARC when rexec sends a message of type GET_OPT_HOST (see Table 2) to lmanager through the function get_hosts (Listing Two). (Listings Three and Four, available electronically, provide snippets of code for lmanager and rexec, respectively.) To make it possible to specify this algorithm when running a program, I added another block of code in main of rexec.c, which sets the variable what->kind to GET_OPT when you specify x (the option that chooses the new algorithm) to rexec; when what->kind is GET_OPT, message.kind is set to GET_OPT_HOST in the function get_hosts. Therefore, fine-tuning an already implemented algorithm or adding a new one is straightforward, letting you easily plug in new ideas and see the results.

Conclusion

Although XYALB can be used as an infrastructure on a number of platforms for experimenting with innovative ideas, it needs further refinement. For instance, it does not deal with migration issues and hence does not include strategies that are receiver initiated or sender initiated. However, all you need to do is add a framework for migration. Then you can use the load-monitoring portion of lmanager to decide whether there is an overloaded host that should be off-loaded, and whether there is any lightly loaded host that can share the load of other, more heavily loaded hosts.

Although the performance of different load-balancing strategies has been compared for setups with heterogeneous workstations, those comparisons are usually made analytically or through simulation, not through actual experiments on real-life applications. Also, performance models for systems with load balancing are often too complex and conceptual to be used in practice. With the XYALB infrastructure, it is easy to devise a suitable, practical strategy for distributing workload.

It should be possible to apply the present framework to web-server load balancing. XYALB can be used for DNS-based, dispatcher-based, and server-based approaches to web-server load balancing (see "Dynamic Load Balancing on Web-Server Systems," by V. Cardellini et al., IEEE Internet Computing, 1999). Adapting to a server-based approach is as simple as developing a GDDLB system. For the other two approaches, a DNS server or first-level dispatcher can decide the server by sending a request to any of the lmanagers of the load-balancing module running on the other servers.

DDJ

Listing One

if (appl_request.kind == GET_OPT_HOST) {
	poll_for_load (); /**** List of load-indices is updated****/
	if (local_host->avail && ((local_host->load + 
		SWAPLOAD (local_host)) < idle_host)) {
		/****local host idle, and hence job is run locally***/
		appl_reply.kind = APPLICATION_REPLY;
		appl_reply.data_number = 1;
		appl_reply.data_hostlist[0] = local_inaddr;
		appl_reply.data_factor[0] = local_host->load 
			+ SWAPLOAD (local_host);
		sendto (application, (char *) &appl_reply, sizeof 
			(rpc_message), 0, (struct sockaddr *) &from, fromlen);
	} 
	else{
		float temp_cmpl_time;
		struct load_indices *begin_selected_host;
		struct load_indices *selected_host = (struct load_indices *) 0;
		int num_selected_host=0;
		int netdelay=0;
		if (number_of_hosts >1){	
			for (help = host_list, count = 0;count < number_of_hosts;
				help = help->next_host, count++){
				if (memcmp (&help->host.sin_addr, &local_inaddr, 
				sizeof (struct in_addr)) != 0) break;
			} 
			/*get network response time*/
			netdelay=getNetDelay(help->host);
		} 
		/***calculate lowest expected ready time*******/
	   	for (help = host_list, count = 0, temp_cmpl_time = MAXLOAD;
			count<number_of_hosts; help = help->next_host, count++){
			if (temp_cmpl_time >= (help->load + SWAPLOAD (help))/
				help->alpha && help->avail && !help->is_dummy && 
				!help->is_dead) {
			  	temp_cmpl_time = (help->load + SWAPLOAD (help))/
					help->alpha;
			  	selected_host = help;
			} 
		} 
		/*it is still wise to schedule the job locally, as otherwise 
		 *response will be poor due to network traffic
		 */
		if(local_host->avail && ((local_host->load +
			SWAPLOAD(local_host))/local_host->alpha < temp_cmpl_time+
			WT_NET*netdelay)) {
			appl_reply.kind = APPLICATION_REPLY;
			appl_reply.data_number = 1;
			appl_reply.data_hostlist[0] = local_inaddr;
			appl_reply.data_factor[0] = local_host->load 
				+ SWAPLOAD (local_host);
			sendto (application, (char *) &appl_reply, sizeof
				(rpc_message),0,(struct sockaddr *) &from,fromlen);
	      	continue; 
		} 
		/**get hosts with expected ready time around  a small band**/
		int first_time=1;
		begin_selected_host=NULL;
		for (help = host_list, count = 0; count < number_of_hosts &&
			(memcmp(&help->host.sin_addr,&local_host->host.sin_addr,
			4)!=0);help = help->next_host, count++) {
			if (temp_cmpl_time+netdelay >= (help->load + SWAPLOAD
				(help))/help->alpha && help->avail && !help->is_dummy
				&& 	!help->is_dead) {
				num_selected_host++;
				if (first_time) {
					begin_selected_host=help;
			  		selected_host= help;
					first_time=0;
				} 
				else{
			  		selected_host->next_selected=help;
			   		selected_host=selected_host->next_selected;
				} 
			} 
		} 
		if (selected_host!=NULL) selected_host->next_selected=NULL;
		selected_host=begin_selected_host;
		if (!selected_host) {
		/**all hosts are highly loaded, hence schedule it locally**/
			appl_reply.kind = APPLICATION_REPLY;
			appl_reply.data_number = 1;
			appl_reply.data_hostlist[0] = local_inaddr;
			appl_reply.data_factor[0] = local_host->load
				+ SWAPLOAD (local_host);
			sendto (application, (char *) &appl_reply, sizeof
			(rpc_message),0, (struct sockaddr *) &from,fromlen);
			continue;			
		} 
		/********choose a random host from the above list*********/
		for (index = (rand () % num_selected_host); index > 0;
			index--){
			selected_host = selected_host->next_selected;
		} 
		if (selected_host->is_dead || selected_host->is_dummy || 
			!selected_host->avail || selected_host->load >= (float)
			(1 - (float) (improvment / 100.0)) * (local_host->load + 
			SWAPLOAD (local_host))) {
			for (count = 0 ; count < num_selected_host && 
				(selected_host->is_dead || selected_host->is_dummy || 
				!selected_host->avail ||selected_host->load >=(float)
				(1 - (float) (improvment / 100.0))* (local_host->load
				+ SWAPLOAD (local_host))); selected_host = 
				selected_host->next_selected, count++){
			} 
			if (count == num_selected_host) {
				/***local host selected***/
				selected_host = local_host;
			} 
			else{
				/***local host NOT selected***/
			} 
		} 
		appl_reply.kind = APPLICATION_REPLY;
		appl_reply.data_number = 1;
		appl_reply.data_hostlist[0] = selected_host->host;
		appl_reply.data_factor[0] = selected_host->load + 
		SWAPLOAD (selected_host);
		sendto (application, (char *) &appl_reply, sizeof 
			(rpc_message), 0, (struct sockaddr *) &from, fromlen);
	} 
}  /*END OF GET_OPT_HOST*/


Listing Two

int get_hosts (what, available_hosts)
request *what;
reply *available_hosts;
{
rpc_message message;	/* Message to be sent to load-manager */
rpc_message answer;		/* Reply from load-manager */
int success = FALSE;	/* successful communication */
int count;				/* Loop counter */
fd_set readfds;			/* Read-descriptor set for select() */
struct timeval timeout;	/* Timeout interval for requests */
stats_timeout timeout_info;

	/*do initialization, get option 'c'*/
. . . .
switch( c ) {
	. . . .
	case 'h':
		. . . . /*set host-list*/
		specified_host = TRUE;
		break;
	case 'x':
		lbstrat = TRUE;
		break;
	/*you can put option for your algorithm*/
	. . . .
} 
if( fastest_host ) {
	what->kind = GET_FASTEST;
} 
else if (lbstrat){
	what->kind=GET_OPT;
} 
/*modification you need to carry out
else if (. . .){
	what->kind=GET_AS_U_LIKE;
} */
else{
	what->kind = GET_TOP; 
} 
. . . .
/*you can put options for your algorithm, say,GET_AS_U_LIKE as below:
if (what->kind == GET_AS_U_LIKE){
    message.kind = GET_AS_U_LIKE;
    message.magic_number = MAGIC;
	message.data_number = what->number;
} 
else
******/
if (what->kind == GET_OPT){
    message.kind = GET_OPT_HOST;
    message.magic_number = MAGIC;
	message.data_number = what->number;
} 
else if (what->kind == GET_TOP){
	message.kind = APPLICATION_REQUEST;
	message.magic_number = MAGIC;
	message.data_number = what->number;
} 
. . . .
else{
	return ILLEGAL_REQUEST;
} 
. . . .

do{
	if (sendto (manager, (const void *)&message, sizeof(rpc_message),
		0,(struct sockaddr *) &name_of_manager, sizeof 
		(struct sockaddr))< 0) {
	  		return MANAGERNOTPRESENT;
	} 
	FD_ZERO (&readfds);
	FD_SET (manager, &readfds);
	tmp=select(MAXWIDTH, &readfds,(fd_set *)0, (fd_set *)0,&timeout);
	if (tmp<= 0) {
		success = FALSE;/* failure in get_hosts due to timeout*/ 
	} 
	else{
		if (FD_ISSET (manager, &readfds)) {
			recv (manager, (char *) &answer, sizeof (rpc_message),0);
			if (answer.magic_number == MAGIC) {
				if (answer.kind == APPLICATION_ERROR)
					return answer.data_error;
				if (answer.kind == APPLICATION_REPLY) {
					success = TRUE;/*Got a suitable host*/
					break;
				} 
			} 
		} 
	} 
	timeout_info.count++;
} while (!success && (timeout_info.count <= ATTEMPTS));

if (! success)   return MANAGERNOTPRESENT;
*available_hosts = *(reply *) & answer.data.message;
return 0;
} 


