Ramkumar is head of the Software Concept Laboratory at Infosys Technologies Ltd. He can be contacted at [email protected].
Servers running popular web sites may often be servicing hundreds or even thousands of connections (requests) simultaneously. In most web servers, incoming requests are serviced in a First-Come-First-Served (FCFS) manner. In many cases, however, it can be extremely useful, if not downright necessary, to have application-level control over scheduling policies. For example, not all user requests may have equal priority. Consider, for instance, an auction site that lets visitors browse and search items up for auction, and lets registered members bid. Clearly, you want to give bid requests higher priority than search requests. To take another example, many subscription-based sites offer free as well as paid memberships. The site would like to provide better quality of service to paid users; in fact, you could even imagine a scenario in which both categories of users get access to the same content, but paid users are promised better response times than free users.
Auction sites are an example of content-based priority, while subscription-based sites illustrate priority based on user type. Today's web servers support neither. To fill this void, I decided to adapt the Apache web server for this purpose. This was a natural choice because Apache runs about half the world's web sites, and its source code is freely available. Furthermore, Apache follows a process-per-connection philosophy, in which a limited number of server processes (typically about 100-150) serve an equal number of connections, while the remaining pending connections are made to wait. The opportunity for request scheduling arises only when you have a queue of waiting requests. Compare this to Microsoft's worker-thread model, in which all requests are served concurrently, even if the number of requests is significantly greater than the number of available threads. That model makes it difficult to schedule or reschedule requests. Of course, each model has its own pros and cons, but from the request-scheduling point of view, the Apache model is preferable.
In this article, I describe the key ideas behind request scheduling, as well as the key modifications I made to the Apache server, so that if you want to incorporate these ideas into your own server, you have the leads to do so.
The Modified Apache Architecture
Apache works with a pool of identical child processes (which I refer to here as "worker processes") that contend for access to a listening socket in order to accept connections. Apache has a mutual-exclusion mechanism (mutexes), which it uses to control the workers' access to the socket. At any point, only one worker holds the mutex and blocks on an accept() call, while the others await their turn. As soon as a connection is accepted, that worker takes up the connection and handles it to completion before returning to contend for the listening socket again. The maximum number of workers in the pool is specified by the Apache configuration parameter MaxClients.
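To make the idiom concrete, here is a minimal sketch of the serialized-accept loop each worker runs. This is not Apache's actual code; acquire_mutex(), release_mutex(), and handle_connection() are hypothetical stand-ins for Apache's platform-specific internals.

#include <sys/socket.h>
#include <unistd.h>

extern void acquire_mutex(void);         /* hypothetical mutex wrappers */
extern void release_mutex(void);
extern void handle_connection(int csd);  /* serves one request */

void worker_loop(int sd)                 /* sd: the listening socket */
{
    for (;;) {
        acquire_mutex();                 /* only one worker may listen */
        int csd = accept(sd, NULL, NULL);
        release_mutex();                 /* next worker's turn */
        if (csd < 0)
            continue;
        handle_connection(csd);          /* handle to completion... */
        close(csd);                      /* ...then contend again */
    }
}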
The TCP queue (sometimes called the "listen() queue") is limited in size by a parameter called SOMAXCONN. Many UNIX systems specify a fairly small SOMAXCONN value; SunOS 4.1.3 sets it to five, for example. Moreover, the operating system does not let you increase it very much. When an incoming request hits a full TCP queue, the request is simply rejected (the "connection refused" message in your browser). So if you want to work with a large number of client requests (reducing the number of refused connections even as you attempt to reschedule them), you need an application-level mechanism for queuing the requests.
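As a quick illustration (my own snippet, not Apache's code): even if you ask listen() for a huge backlog, the kernel silently clamps it, which is why an application-level queue is the only way to hold a large number of pending requests.

#include <stdio.h>
#include <sys/socket.h>

void start_listening(int sd)
{
    /* Ask for the largest backlog the system permits; anything larger
     * than SOMAXCONN would be silently clamped by the kernel anyway. */
    if (listen(sd, SOMAXCONN) < 0)
        perror("listen");
}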
Figure 1 shows the modified architecture that accomplishes this. In this scheme, one additional process is created to serve as a master listener-cum-scheduler (MLCS), which is solely responsible for accepting new connections. Each new connection that is accepted is subjected to a quick check for the type of user making the request, and possibly the size of the requested object (which is easy to estimate in the specific case of static files). Based on this check, the MLCS stores the socket descriptor in a suitable data structure. I chose to code this data structure in Python. (I refer to it as the "Python queue" or just "the queue," although it is useful to remember that it may not literally be a queue.) Each time a child process becomes free, the MLCS picks a socket descriptor from the Python queue based on a scheduling rule and passes it to the child. The child then handles the connection to completion while the MLCS returns to its task of listening for new connections and managing the Python queue.
Figure 2 delves a little deeper into the architecture to show the constituent routines and functions. Apache has a function called child_main(), which is the main loop in which each child process resides for as long as it is alive. Each child is identified by a numeric identifier called child_num_arg. To stay within Apache's process creation and management framework, I decided to use one of these processes (the one with child_num_arg=0) as the MLCS. I therefore created a new version of child_main(), called new_child_main(), to be executed by regular worker processes, and modified the original child_main() function to act as the MLCS. Listing One shows how, when a process first enters child_main(), it checks whether the child_num_arg assigned to it by Apache is 0. If it is not, the process is shunted off to execute new_child_main().
Thus, it is the MLCS that sends socket descriptors to the remaining worker processes. To pass file and socket descriptors between separate processes (even ones related by a fork() call), you need to use a stream pipe. A stream pipe is created by a call to socketpair() with an argument that is a pointer to an integer array of size 2. When the call returns successfully, the array contains a pair of file descriptors that represent the two ends of the pipe. Advanced Programming in the UNIX Environment, by W. Richard Stevens (Addison-Wesley, 1992), provides detailed information (plus code) on how to use a stream pipe to pass file descriptors between processes, so for brevity I will not elaborate on the mechanics here.
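For reference, the following is a minimal sketch of descriptor passing in the spirit of Stevens' send_fd()/recv_fd(); this is my own condensed version, not the book's code. The key idea is an SCM_RIGHTS control message sent over the stream pipe with sendmsg(); the kernel installs a duplicate of the descriptor in the receiving process.

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_fd(int pipe_fd, int fd_to_send)
{
    char byte = 0;
    struct iovec iov;
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg;
    struct cmsghdr *cmsg;

    iov.iov_base = &byte;                /* must carry at least one data byte */
    iov.iov_len = 1;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;        /* "pass these descriptors" */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return (sendmsg(pipe_fd, &msg, 0) < 0) ? -1 : 0;
}

int recv_fd(int pipe_fd)
{
    char byte;
    struct iovec iov;
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    iov.iov_base = &byte;
    iov.iov_len = 1;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    if (recvmsg(pipe_fd, &msg, 0) <= 0)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;                           /* a new descriptor in this process */
}

The pipe itself comes from socketpair(AF_UNIX, SOCK_STREAM, 0, fd); the MLCS holds fd[0], and the workers inherit the other end across fork().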
The MLCS Core
The core of the MLCS (see Listing Two) is a piece of code in the child_main() function that constantly shuttles among three tasks:
- Listening to the TCP socket for incoming connections.
- Picking pending requests off the Python queue.
- Checking for available worker processes and passing socket descriptors to them.
The last task, interacting with worker processes, is done using the stream pipe. Each worker process, when it becomes free, sends a flag (meaning "I'm available") to the MLCS through one end of the stream pipe, which the MLCS reads off the other end. Of course, more than one worker process could be trying to get access to the stream pipe at once. I made only a few minor changes to the Apache mutex mechanism so that the object to which access is controlled is now the stream pipe rather than the TCP socket. This way, once a worker manages to send the flag, it retains access to the stream pipe until it receives a socket descriptor from the MLCS (fetched from the Python queue), after which it relinquishes access to another worker, and the cycle repeats.
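Putting these pieces together, the worker side of the handshake might look like the sketch below. The names are again hypothetical stand-ins, the value of CONTROL_SIZ is an assumption, and recv_fd() is the helper sketched earlier.

#include <sys/socket.h>
#include <unistd.h>

#define CONTROL_SIZ 4                    /* assumed flag size; matches "FREE" */

extern int fd[2];                        /* the stream pipe; workers use fd[1] */
extern void acquire_mutex(void);         /* hypothetical mutex wrappers */
extern void release_mutex(void);
extern int  recv_fd(int pipe_fd);        /* receives a descriptor (see above) */
extern void handle_connection(int csd);

void worker_wait_for_food(void)
{
    char flag[CONTROL_SIZ + 1] = "FREE"; /* the "I'm available" message */
    for (;;) {
        acquire_mutex();                 /* one worker on the pipe at a time */
        send(fd[1], flag, CONTROL_SIZ + 1, 0);
        int csd = recv_fd(fd[1]);        /* block until MLCS sends a socket */
        release_mutex();                 /* let the next idle worker in */
        if (csd < 0)
            continue;
        handle_connection(csd);          /* serve the request to completion */
        close(csd);
    }
}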
The select() call is used to block on the TCP socket (the variable sd) as well as the MLCS end of the stream pipe (the variable fd[0]). When the process unblocks due to an event on either of these descriptors, it first checks whether a worker has sent an "I'm available" notification. The rest of the processing in the child_main() function proceeds as follows:
- If there is a free worker, the feedWorker() function (see Listing Three) is called. This calls getNextRequest() in the Python script that checks whether the worker can be fed with a file descriptor from the Python queue. If yes, it feeds it one and returns the number of remaining queued requests.
- If feedWorker() reports an empty queue (zero remaining requests), then the MLCS blocks on the accept() call waiting for a new request to arrive on the TCP socket.
- Suppose that the worker has been fed. If there is no waiting event on the TCP socket (!FD_ISSET(sd, &fdpair)), the MLCS returns to its original blocked state. If, however, there is a waiting request on the TCP socket, then it is plucked off.
Once the MLCS has plucked a request off the TCP queue, it does some default processing with it using Apache's original code, and then calls py_peek_and_store_request(). As its name suggests, this function uses recv() with the MSG_PEEK flag to peek at the request URL and copy a portion of it into a string (see Listing Four). It then calls py_PrioritizeAndStoreRequest() with this information, which is the interface to the storeRequest() function in the embedded Python script. From the URL fragment, the script can determine the requested object and/or the user type (which may need to be explicitly encoded into the URL by your client application).
The Embedded Python Scripts
When you think about scheduling or prioritization rules for an application such as an auction site, you quickly realize that these rules cannot be hardcoded into the server. You need a mechanism that lets the rules be specified at run time. Enter Python. Python is a clean, powerful, object-oriented scripting language that can easily be embedded into native code (in this case, C). Details of how to embed Python are available in several places, including the Python manual, so I will not get into them here. Instead, I will jump straight to the details of the script: the functions it must supply for the embedding, and what these functions do.
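To give a flavor of what the embedding glue looks like, here is a hedged sketch of the C side loading the script and invoking one of its functions through Python's C API. The module name rrpyap matches Listing Five; error handling is trimmed, and the actual code in the modified server differs in detail.

#include <Python.h>

static PyObject *pModule;

void py_init(void)
{
    Py_Initialize();
    pModule = PyImport_ImportModule("rrpyap");   /* loads rrpyap.py */
    /* call readFileList() once so the script can initialize itself */
    PyObject *r = PyObject_CallMethod(pModule, "readFileList", NULL);
    Py_XDECREF(r);
}

int py_PrioritizeAndStoreRequest(int csd, const char *urlfrag)
{
    /* invokes storeRequest(csd, url) in the script */
    PyObject *r = PyObject_CallMethod(pModule, "storeRequest",
                                      "(is)", csd, urlfrag);
    if (r == NULL)
        return -1;                       /* a Python-side exception occurred */
    Py_DECREF(r);
    return 0;
}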
My modified Apache code expects three functions in the Python script: storeRequest(), getNextRequest(), and readFileList(). With the scheduling infrastructure in place, here is what I needed to do to implement a scheduling policy:
- Have the URL contain information about the type of request or the type of user. (This was easy for static files because the URL would contain the filename.)
- Define (in Python) a data structure to hold incoming requests (integer socket descriptors), and declare a global variable Qnum that contains the number of waiting requests.
- Write the function readFileList() to specify the request and/or user types and their priorities; in general, this information would be read from a configuration file. (When the Python module is first loaded, readFileList() is called to perform any required initialization.)
- Write the function storeRequest() to examine the contents of the PEEKed URL fragment sent to it, extract the encoded information about the request or user type, insert the request (that is, the socket descriptor) into the data structure at an appropriate place based on the scheduling logic, and update Qnum.
- Write the function getNextRequest() such that each time it is called, it extracts a socket descriptor from the data structure based on the scheduling logic and updates Qnum.
Choosing a Scheduling Rule
The sample Python script in Listing Five implements a simple scheduling policy that enforces the rule "bidders get high priority" for an auction site. To do this, you define two request types, "bid" and "regular." The Python data structure is a pair of queues, one per request type. Each time storeRequest() is called, it checks whether the URL contains the key BID or REGULAR, and stores the socket descriptor in the appropriate queue. When getNextRequest() is called, it plucks a socket descriptor from the BID queue if that queue is nonempty, and otherwise from the REGULAR queue.
Is this a good strategy? If there is a hot auction in progress and you happen to fire off a search request, chances are that under such a strict priority rule, your search request will wait forever for the stream of bid requests to be exhausted. This phenomenon is called "request starvation."
In my experiments, the data structure was actually a hash table; each entry in the hash table corresponded to a user type or a filename and was associated with its own FCFS queue. I used a probabilistic scheduling rule that works by assigning a probability to each user type or file type, based on which the getNextRequest() function decides which pending request to service. For example, in the auction-site case, you might assign probabilities of 0.9 to BID and 0.1 to REGULAR. The data structure is the same as before: two queues, one for BID requests and one for REGULAR requests. However, getNextRequest() picks a request in a randomized manner: With probability 0.9, it picks a BID; with probability 0.1, a REGULAR request. With this probabilistic rule, on average one REGULAR request gets served for every nine BIDs, and no request can starve, because every nonempty queue is eventually picked. The ratio translates directly into mean response times: If there are equal numbers of bidders and regular users, you can convince yourself that the bidders see a mean response time that is 1/9th that of regular users. This scheduling policy is similar to lottery scheduling, a concept in operating-system design first published (by Waldspurger and Weihl) in 1994.
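In C terms, the random pick at the heart of this rule is just a cumulative-probability draw, as the following sketch shows. The real decision logic lives in the Python getNextRequest(); the names here are illustrative.

#include <stdlib.h>

enum { QUEUE_BID, QUEUE_REGULAR };

/* Return which queue to serve next: BID with probability p_bid
 * (e.g., 0.9), REGULAR otherwise. If the chosen queue turns out to
 * be empty, the caller should fall back to the other one. */
int pick_queue(double p_bid)
{
    double r = (double)rand() / ((double)RAND_MAX + 1.0);  /* in [0,1) */
    return (r < p_bid) ? QUEUE_BID : QUEUE_REGULAR;
}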
Results
To test the modified Apache with the probabilistic rule, I set the MaxClients parameter to 100 and ran it with up to 340 simultaneous clients, simulated using the WebStone tool. For brevity I will not describe all my results here, but one result drives home the effectiveness of request scheduling, especially under high load. This test was done with three user types requesting three different files, of sizes 1 KB (small), 43 KB (medium), and 760 KB (large). I used a probabilistic scheduling strategy and set the probabilities for the users to 1/7 (light user), 2/7 (medium user), and 4/7 (heavy user), indicating that I wanted the heavy users' requests to be picked, on average, twice as often as the medium users' and four times as often as the light users'. (There were equal numbers of each user type.) If the scheduling worked right, the heavy users would see half the response time of the medium users and one fourth that of the light users. This sounds difficult to achieve, but Figure 3 shows how well it worked. For up to around 175 logged-in clients, the scheduling policy had only a small impact (because there were not enough pending requests in the Python queue for scheduling to matter), but by the time I approached 300 clients, the ratios of the response times were exactly what I was looking for: two and four, with an accuracy of better than 5 percent!
Conclusion
To handle heavy loads, many web sites overprovision their servers. Request prioritization is a way of utilizing server resources smartly without overprovisioning. Request scheduling is one way of implementing request prioritization; its advantages are that the scheduling strategy can be implemented at the application level and, with a probabilistic strategy, it actually becomes more effective as the load increases.
DDJ
Listing One
static void child_main(int child_num_arg)
{
    . . .
    if (child_num_arg != 0) {
        new_child_main(child_num_arg);   // regular workers go this way
        return;
    }
    . . .                                // child 0 continues as the MLCS
}
Listing Two
static void child_main(int child_num_arg)
{
    ...
    int n_fdpair;
    extern int fd[];
    fd_set fdpair;

    FD_ZERO(&fdpair);
    FD_SET(sd, &fdpair);
    FD_SET(fd[0], &fdpair);
    n_fdpair = (sd > fd[0]) ? (sd + 1) : (fd[0] + 1);
    ap_select(n_fdpair, &fdpair, NULL, NULL, NULL);  // block indefinitely
    if (FD_ISSET(fd[0], &fdpair))        // some worker is free, wants food
        if ((QNum = feedWorker()) >= 0 && !FD_ISSET(sd, &fdpair))
            continue;
    // At this point, either you have fed the worker and there is a
    // waiting connection, or the queue is empty (or an error occurred),
    // so you might as well block on accept. Or, of course, there is
    // no free worker, but there IS a connection.
    if (QNum >= socket_limit)
        continue;                        // don't accept
    ...                                  // Apache code for error checking, etc.
    QNum = py_peek_and_store_request(csd);
}
Listing Three
int feedWorker(void)
{
    char buf[CONTROL_SIZ + 1];
    int next_csd;

    next_csd = py_GetNextRequest();      // sets QNum
    if (next_csd < 0)
        return -1;                       // do not read control data off fd[0]
    recv(fd[0], (void *)buf, CONTROL_SIZ + 1, 0);  // consume the worker's "I'm available" flag
    if (send_fd(fd[0], next_csd) < 0)
        return -1;
    if (close(next_csd) < 0) {           // the MLCS's copy is no longer needed
        fprintf(stderr, "Error closing sent fd\n");
        return -1;
    }
    return QNum;
}
Listing Four
recv(csd, peekbuf, PEEKLEN, MSG_PEEK);
peekbuf[PEEKLEN - 1] = '\0';
py_PrioritizeAndStoreRequest(csd, peekbuf);
Listing Five
# file rrpyap.py
csdList = {1: [], 2: []}   # priority -> list of socket descriptors
Qnum = 0                   # total number of queued requests
PTable = {}                # table of relative priorities

def readFileList():
    # this information can be read from a configuration file instead
    global PTable
    PTable = {"BID": 1, "REGULAR": 2}

def storeRequest(csd, url):
    global Qnum
    rType = scanURL(url)
    csdList[PTable[rType]].append(csd)
    Qnum = Qnum + 1

def getNextRequest():
    # the second return value may be used to indicate an error in
    # accessing the queue; it is usually zero
    global Qnum
    if Qnum > 0:
        Qnum = Qnum - 1
    else:
        return -1, 0, 0
    if len(csdList[1]) > 0:
        return csdList[1].pop(0), 0, Qnum
    else:
        return csdList[2].pop(0), 0, Qnum

def scanURL(url):
    # the request type is encoded in the URL itself
    if url.find("BID") >= 0:
        return "BID"
    return "REGULAR"