Peter, who holds a Ph.D. in astrophysics, is a programmer at the University of Calgary. He can be reached at [email protected].
Many organizations are now providing remote users with online, web-based information. In corporations, this information ranges from human-resource policy manuals to data sheets. Libraries, on the other hand, offer users access to third-party online web-based electronic journals (such as the Journal of Mathematical Computation; http://www.jstor.org/journals/00255718.html) and databases (like those provided by the Online Computer Library Center; http://bart.prod.oclc.org/). Most commercial information vendors require that clients access their web servers from a valid IP address, which, at the Library of the University of Calgary, means a university IP address or campus-wide user ID/password. This is fine when users are on the campus, because on-campus machines usually have valid IP addresses. However, more and more users -- distance-learning students, retired professors, staff members, and the like -- are using their own off-campus ISPs to connect to the Internet. Users may also want to use a public workstation in a public library to access a service. However, when legitimate users of the university library connect directly to the Internet from an off-campus IP address, the vendor web server typically rejects the access request.
To address this problem, I designed and implemented webrelay -- a freely available multithreaded HTTP relay server. (The source code and related files for webrelay are available electronically; see "Resource Center," page 7.) In a nutshell, webrelay authenticates a client to make sure the client is a legitimate user before connecting it to the vendor web server. The vendor's server then sees the request as coming from the relay server itself, which always has a valid IP address or campus-wide user identification.
Design Considerations
One of my design goals for webrelay was that it be as transparent as possible to both end users and library administrators. This precluded use of conventional HTTP proxy servers. Experience shows that with conventional HTTP proxy servers, end users must configure their browsers to use that specific proxy. If a user's ISP already has a proxy, it is difficult for the user to set up the browser to use the proxy designated by, say, the university. Furthermore, when browsing a web server other than those of specific third-party vendors, users have to turn that proxy off to avoid unnecessary user authentication imposed by the proxy. This is because, with a conventional proxy server, the library has no easy way to restrict proxying to only those vendor web servers to which it subscribes.
Webrelay is designed to mirror whatever remote web servers you want to include. Users do not have to configure their browsers in any special way, because users will not see the remote web server. To the user, the webrelay server is the real target server. The administrator of the library has complete control over what services are included in webrelay and whether authentication is mandatory for a given web server.
When webrelay mirrors a set of remote web servers, it maps the URL of a remote web server to a virtual directory of the webrelay with the form of http://webrelay.host.name:port/DB=db_key/, where the value of db_key is an abbreviation of the real URL that a vendor advertises to patrons. The DB=db_key is a virtual directory, because there is no such physical directory on the host of the webrelay. The mapping and a corresponding mandatory-authentication flag can be defined in a configuration file by the administrator. Users are introduced to these virtual directories by hyperlinks embedded in the top homepage of the library, which is under control of the library administrators. If a user comes in from an off-campus IP address or the virtual directory has its mandatory-authentication flag set to True, the request is channeled to the User Validation Engine (UVE); see Figure 1.
Another design consideration involves how you establish a session and maintain session data in a basically stateless HTTP protocol. One option is to use Netscape cookies, but in our case this wasn't a good mechanism since cookies are designed explicitly for user tracking. When users access those services from a public workstation in a public library, it is difficult to manage cookies for individual users, because other users may have used that workstation at different times. The webrelay server would have to manage a set of cookies for communication with the end user, as well as another set or sets of cookies that might have been issued by a remote web server.
With this in mind, I decided to take a very different approach. After users are successfully authenticated, they are assigned a unique session key and registered with the Session Control Engine (SCE). As the user browses through a web site, the SCE tracks the update time, records any cookies that are sent by the remote web server, and manages any other pertinent session data.
After a new session is established, I use the session key to replace the db_key. The virtual directory now consists of the session key and possibly a hostname of another web server that the vendor may select. From then on every embedded URL in any page downloaded by users must be converted to have its base point to the webrelay hostname and port number, plus the virtual directory. This is done on-the-fly by a Relay and URL Conversion Engine (RURLCE) before the page can be sent to the user. This ensures that subsequent requests always have the correct session key included. Furthermore, within the session all the requests will be forced through webrelay.
The other related aspect of the design is how to efficiently handle multiple connections, while at the same time avoiding relying on any interprocess communication means for sharing the session data. I chose to use multiple threads to handle separate connections. Different threads can share the session data in the same address space of a single process. Compared with traditional multiprocess programming with interprocess communication means, threads in a multithread process facilitate more efficient session control, simpler coding, and better scalability.
User Validation Engine (UVE)
When the first request for a given vendor's web server is sent to webrelay, the program decodes the virtual directory to get the db_key. Based on the db_key, webrelay finds the real URL of the vendor's web server and the mandatory-authentication flag from a lookup table stored in memory that has been loaded from a configuration file at the start up of the program. In addition to IP address checking as required by the majority of the vendors, the library sometimes requires mandatory authentication for a given vendor. Why? Because there are instances when a fee is required for a document delivery service associated with that vendor.
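The lookup described above can be sketched in a few lines of C. This is only an illustration, not the code shipped with webrelay: the names vendor_entry, vendor_table, extract_db_key, and lookup_vendor are hypothetical, the table contents are sample data, and a real table would be filled from the configuration file, whose format is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: one row per subscribed vendor. */
typedef struct {
    const char *db_key;      /* abbreviation used in the virtual directory */
    const char *real_url;    /* URL the vendor advertises */
    int must_authenticate;   /* mandatory-authentication flag */
} vendor_entry;

/* Sample data only; the real table is loaded from the configuration file. */
static const vendor_entry vendor_table[] = {
    { "jstor", "http://www.jstor.org/",       0 },
    { "oclc",  "http://bart.prod.oclc.org/",  1 },
};

/* Copy the db_key out of a request path such as "/DB=jstor/index.html".
   Returns 0 on success, -1 if the path has no well-formed DB= directory
   or the key does not fit in the output buffer. */
int extract_db_key(const char *path, char *out, size_t outlen)
{
    const char *start, *end;
    size_t n;

    assert(path != NULL && out != NULL);
    if (strncmp(path, "/DB=", 4) != 0)
        return -1;
    start = path + 4;
    end = strchr(start, '/');
    n = end ? (size_t)(end - start) : strlen(start);
    if (n == 0 || n >= outlen)
        return -1;
    memcpy(out, start, n);
    out[n] = '\0';
    return 0;
}

/* Linear search is enough for a handful of vendors.
   Returns NULL when the key is unknown. */
const vendor_entry *lookup_vendor(const char *db_key)
{
    size_t i;

    assert(db_key != NULL);
    for (i = 0; i < sizeof vendor_table / sizeof vendor_table[0]; ++i)
        if (strcmp(vendor_table[i].db_key, db_key) == 0)
            return &vendor_table[i];
    return NULL;
}
```

Given the request path /DB=jstor/index.html, extract_db_key yields "jstor", and lookup_vendor returns the entry holding the vendor's real URL and its authentication flag.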
If the client's IP address is correct (from our campus, in other words) and no mandatory authentication is required for the destination web server, webrelay simply redirects the client to the destination. From then on, the client does transactions directly with the vendor. This eliminates relay traffic for such on-campus users. Otherwise, webrelay checks to see if a session has been established. If not, the UVE sends out an authentication challenge to the client. You can choose to use either the basic or custom authentication scheme; the latter is preferred. In the case of basic authentication, the client sends out the user ID/password for all subsequent requests, which defeats the purpose of our session-control mechanism, where the SCE needs no more than a session key to keep track of all requests. With custom authentication, the UVE sends out the challenge in an HTML logon form asking the client to submit its credentials (we require a user ID/password for now). Once the UVE receives the credentials from the client, it checks with a remote authentication server where user IDs/passwords are stored and retrieved. We use a commercial server for that purpose. Available electronically is a testing module that takes a username and password; if the username is the same as the password, the user is regarded as a legitimate user. You should customize that code to interface with whatever authentication server you choose.
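The decision the UVE makes for each request can be summarized as a small function. The sketch below is an assumption-laden illustration rather than the actual webrelay code: the enum uve_action and the functions uve_decide and test_authenticate are hypothetical names, and test_authenticate merely mimics the behavior of the testing module described above.

```c
#include <assert.h>
#include <string.h>

typedef enum {
    UVE_REDIRECT,   /* send the client straight to the vendor */
    UVE_RELAY,      /* a session exists: relay the request */
    UVE_CHALLENGE   /* no session yet: ask for credentials */
} uve_action;

/* Decide what to do with a request, given three yes/no facts about it. */
uve_action uve_decide(int on_campus, int must_authenticate, int has_session)
{
    if (on_campus && !must_authenticate)
        return UVE_REDIRECT;
    return has_session ? UVE_RELAY : UVE_CHALLENGE;
}

/* Stand-in for the testing module: a user is accepted when the
   username equals the password. Replace with a call to a real
   authentication server. Returns 1 if accepted, 0 otherwise. */
int test_authenticate(const char *user, const char *password)
{
    assert(user != NULL && password != NULL);
    return strcmp(user, password) == 0;
}
```

An on-campus request for a vendor with no mandatory authentication is redirected; every other request either continues an existing session or triggers a challenge.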
In a case where multiple users share a public workstation, a user may use the browser's Back button to go back to the logon form that was filled out by a previous user who vacated the workstation. To prevent users from stealing other users' authentication credentials for gaining access, the UVE sets a timestamp on the logon page it issues. The form is invalidated after a certain period of time, say, five minutes. Of course, this does not completely solve the problem.
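A minimal sketch of the expiry check might look like the following. The logon_form structure, the LOGON_FORM_TTL constant, and the function name are hypothetical; how the timestamp is embedded in the page and returned with the form is not covered here.

```c
#include <assert.h>
#include <time.h>

#define LOGON_FORM_TTL 300.0   /* seconds; the five minutes used as an example */

/* Hypothetical record of a logon form issued by the UVE. */
typedef struct {
    time_t issued_at;          /* timestamp set when the page was sent */
} logon_form;

/* Returns 1 if the submitted form is still within its lifetime, else 0.
   A form whose timestamp lies in the future is rejected as well. */
int logon_form_is_valid(const logon_form *form, time_t now)
{
    double age;

    assert(form != NULL);
    age = difftime(now, form->issued_at);
    return age >= 0.0 && age <= LOGON_FORM_TTL;
}
```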
Session Control Engine (SCE)
If a client is successfully authenticated, webrelay registers the client with the SCE. The SCE assigns a unique session key to that session and stores the session start time and other pertinent information. A session key consists of a timestamp concatenated with the hex digits of the client's IP address. The session control information is stored in a hashtable with a separate-chaining linked list to resolve any collisions that might occur. The SCE uses the session key for lookup, update, retrieval, or delete operations from the hashtable.
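One way to build such a key is shown below. The exact layout webrelay uses is not spelled out above, so the format here (a decimal timestamp followed by eight hex digits of the IPv4 address) and the function name make_session_key are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Write a session key made of a decimal timestamp followed by the
   eight hex digits of a 32-bit client address.
   Returns the key length, or -1 if the buffer is too small. */
int make_session_key(char *out, size_t outlen,
                     unsigned long timestamp, unsigned long client_ip)
{
    int n;

    assert(out != NULL);
    n = snprintf(out, outlen, "%lu%08lx",
                 timestamp, client_ip & 0xffffffffUL);
    return (n > 0 && (size_t)n < outlen) ? n : -1;
}
```

Two clients that log on in the same second still receive different keys as long as their addresses differ; two logons from the same address in the same second would collide, so a production version might append a counter.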
Fine-grained synchronization, using the mutexes of the POSIX pthread library, protects the shared session control data in a multithreaded environment. Each node carries its own mutex, so a thread locks only the node it is working on. While only one thread can hold the lock on a given node at any moment, numerous threads may hold locks on other nodes at the same time. This is certainly more efficient than coarse-grained synchronization methods, but harder to code (see Thread Primer: A Guide to Multithreaded Programming, by Bill Lewis and Daniel J. Berg, Prentice Hall 1996).
Cookie handling is an important aspect of the SCE. Webrelay has to take over cookie management for the client, because the cookie issued by a vendor's web server is meant for webrelay, which is seen as a client by the vendor's web server. If webrelay were to pass that cookie to its client directly, the client would have thought that the cookie had been associated with webrelay, rather than the vendor's web server. When sending a subsequent request, the client would have fetched any cookies that are associated with webrelay. The vendor's web server would think that was not a correct cookie and refuse connection. Listing One shows how the SCE stores a cookie into the session control data, while Listing Two shows how it fetches the corresponding cookie on behalf of the client to be sent back to the vendor's web server.
The other important aspect of the SCE is the control of idle sessions. This is handled by a garbage sweeper behaving like a daemon thread. It wakes up every 300 seconds to scan the entire hashtable to check when a client last accessed the vendor's web server. If the last time the client downloaded a page or a file was more than, say, 15 minutes ago, the session is considered idle too long and becomes a candidate for removal from the SCE. One catch, though, is that before the idle session can be removed from memory, the SCE must make sure that there is no other thread that is reading from or writing to that node in the hashtable. This is taken care of by a reference count. The reference count is initialized to zero at the beginning. Whenever a thread starts (stops) reading from or writing to that node, its reference count increments (decrements) by one. Only when the reference count is zero can the garbage sweeper remove that node from the SCE.
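The interplay between the per-node lock, the reference count, and the sweeper's idle test can be sketched as follows. This is a simplified illustration under stated assumptions, not the webrelay source: the sess_node type and the node_* functions are hypothetical, and here the reference count is guarded by the node's own mutex, which may differ from how the real hashtable manages it.

```c
#include <assert.h>
#include <pthread.h>
#include <time.h>

typedef struct {
    pthread_mutex_t lock;    /* protects the fields below */
    int refcount;            /* threads currently using this node */
    time_t last_updated;     /* last access by the client */
} sess_node;

void node_init(sess_node *n, time_t now)
{
    pthread_mutex_init(&n->lock, NULL);
    n->refcount = 0;
    n->last_updated = now;
}

/* Called by a worker thread before it reads or writes the node. */
void node_acquire(sess_node *n, time_t now)
{
    pthread_mutex_lock(&n->lock);
    n->refcount++;
    n->last_updated = now;
    pthread_mutex_unlock(&n->lock);
}

/* Called by the worker thread when it is done with the node. */
void node_release(sess_node *n)
{
    pthread_mutex_lock(&n->lock);
    assert(n->refcount > 0);
    n->refcount--;
    pthread_mutex_unlock(&n->lock);
}

/* Sweeper's test: idle longer than max_idle seconds and unused. */
int node_is_removable(sess_node *n, time_t now, double max_idle)
{
    int ok;

    pthread_mutex_lock(&n->lock);
    ok = n->refcount == 0 && difftime(now, n->last_updated) > max_idle;
    pthread_mutex_unlock(&n->lock);
    return ok;
}
```

In a complete implementation the sweeper would also have to prevent a new thread from acquiring the node between this test and the actual unlinking, for example by holding the lock of the hash bucket while it removes the node.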
An idle session not only consumes computer resources, but also is itself a really annoying problem surrounding use of a public workstation in a public library shared by many random users. If the SCE does not purge the idle session, other users might be able to use the session left over by a previous user without being subject to any authentication. The garbage sweeper helps to alleviate this problem.
Relay and URL Conversion Engine (RURLCE)
As Figure 2 illustrates, the RURLCE consists of a REQuest Header Analyzer (REQHA), RESponse Header Analyzer (RESHA), and Response Entity-Body Converter (REBC). The REBC is able not only to convert a static HTML page, but also to convert dynamic pages generated by a JavaScript.
- REQuest Header Analyzer (REQHA). The REQHA analyzes the request header. It fetches the virtual directory from the first header line. If the virtual directory contains a string of "DB=db_key", it asks the UVE to start the authentication process. Once a new session is started, the REQHA gets the real URL of the web server that a vendor has advertised in its contract from the lookup table. With the real URL, the REQHA constructs a new first request header line using the path of the real URL, and a new "Host:" header line with the real hostname and port number of the vendor's web server.
When the virtual directory does not start with a string of "DB=", it then must contain a session key, or a session key followed by a hostname and port number. The requested URL would look like
http://webrelay.host.name:port/ses_key/targetpath
or
http://webrelay.host.name:port/ses_key=another.host.name:targetport/targetpath
If the virtual directory contains only ses_key, that means the targeted machine remains the same as the web server advertised in the contract by the vendor. The REQHA sends the session key to the SCE, which does all the session control checking, and also decodes the requested URL to get the target path on the vendor's web server with the virtual directory removed. The REQHA then uses the session key to obtain the db_key from the session control data, from which it can find the real hostname and port number of the vendor's web server. If the session key in the virtual directory is followed by another hostname and port number, that means the vendor now delegates the other web server to handle the request. In this case, the vendor's original web server listed in the contract is no longer relevant. The session key is still used by the SCE to do various checking on the session validity, while the REQHA uses the designated hostname and port number in the virtual directory to construct the first request header line and the "Host:" header line.
The REQHA also removes any "Cookie:" request header line, because the cookie fetched by the client is not necessarily associated with the vendor's web server, but rather with webrelay. The REQHA will always ask the SCE to see if there is a relevant cookie stored in the session control data that was issued by the vendor's web server. If there is one, the SCE will retrieve the appropriate cookies based on matching domains and paths (Listing Two). The fetched cookies will be used by the REQHA to construct a new "Cookie:" header line.
If webrelay is started to use the Basic Authentication scheme, the REQHA will fetch the authentication data from the header and send it to the UVE for user validation.
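Telling the two session-key forms of the virtual directory apart is a matter of splitting the first path component at an optional "=". The sketch below shows one way to do it; the vdir_info structure, its buffer sizes, and the function parse_session_vdir are hypothetical and are not taken from the webrelay source.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    char ses_key[64];
    char host[128];   /* "hostname:port", or empty if the vendor's
                         original server remains the target */
    char path[256];   /* target path on the vendor's server */
} vdir_info;

/* Split "/ses_key/targetpath" or "/ses_key=host:port/targetpath".
   Returns 0 on success, -1 on a malformed or oversized request path
   (including a path that still carries a "DB=" directory). */
int parse_session_vdir(const char *req_path, vdir_info *out)
{
    const char *dir, *slash, *eq;
    size_t dirlen, keylen, hostlen;

    assert(req_path != NULL && out != NULL);
    if (req_path[0] != '/' || strncmp(req_path, "/DB=", 4) == 0)
        return -1;
    dir = req_path + 1;
    slash = strchr(dir, '/');
    dirlen = slash ? (size_t)(slash - dir) : strlen(dir);
    eq = memchr(dir, '=', dirlen);
    keylen = eq ? (size_t)(eq - dir) : dirlen;
    hostlen = eq ? dirlen - keylen - 1 : 0;
    if (keylen == 0 || keylen >= sizeof out->ses_key ||
        hostlen >= sizeof out->host)
        return -1;
    if (slash && strlen(slash) >= sizeof out->path)
        return -1;

    memcpy(out->ses_key, dir, keylen);
    out->ses_key[keylen] = '\0';
    if (hostlen > 0)
        memcpy(out->host, eq + 1, hostlen);
    out->host[hostlen] = '\0';
    strcpy(out->path, slash ? slash : "/");
    return 0;
}
```

When host comes back empty, the caller would look up the vendor's original server through the session data; otherwise it would use host directly to build the request line and the "Host:" header.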
- RESponse Header Analyzer (RESHA). The RESHA analyzes the response headers returned by the vendor's web server. It extracts any cookie in the "Set-Cookie:" header line issued by the remote web server and calls the SCE to store that cookie (Listing One). If there is a "Location:" header line, the RESHA extracts the redirect URL from that header line and calls the automatic redirection module to do a redirection right away. That automatic redirection module also asks the SCE to take care of the cookies before sending out the redirection request. The RESHA extracts the "Content-Type:" header line to get the content type for later use by the RURLCE engine. It also extracts the content length as stated in the "Content-Length:" header line. The content length will be used by the RURLCE engine to facilitate reading the entity-body from the remote web server. The content length will usually need to be updated after the conversion of the entity body before sending back to the client.
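Picking individual values out of response header lines, as the RESHA does, can be illustrated with a small helper. The functions header_value and content_length below are hypothetical names for this sketch; strncasecmp is the POSIX case-insensitive comparison, which is used because header names are not case sensitive.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* If line begins with "name:", return a pointer to the value that
   follows (leading blanks skipped); otherwise return NULL. */
const char *header_value(const char *line, const char *name)
{
    size_t n;

    assert(line != NULL && name != NULL);
    n = strlen(name);
    if (strncasecmp(line, name, n) != 0 || line[n] != ':')
        return NULL;
    line += n + 1;
    while (*line == ' ' || *line == '\t')
        ++line;
    return line;
}

/* Parse a "Content-Length:" header line.
   Returns the length, or -1 if the line is some other header
   or carries no valid non-negative number. */
long content_length(const char *line)
{
    const char *v = header_value(line, "Content-Length");
    char *end;
    long len;

    if (v == NULL)
        return -1;
    len = strtol(v, &end, 10);
    return (end != v && len >= 0) ? len : -1;
}
```

The same helper can be pointed at "Location:", "Set-Cookie:", or "Content-Type:" lines; what is done with each value afterward is specific to the engine.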
- Response Entity-Body Converter (REBC). The REBC is the most complex component of this project. It is essential that every original URL in a page fetched by webrelay be rewritten to map to the virtual directory of the host where webrelay is running. It is not that difficult to do this for a static page. However, more and more vendors have started using Javascript to produce dynamic pages. It isn't easy to make sure that a dynamic page, generated by a Javascript or whatever other means, correctly maps to the virtual directory. You are dealing with a full-fledged programming language in the case of Javascript. Furthermore, decisions on how to do the rewriting have to be based on not only a lexical but also a contextual analysis. Nevertheless, the REBC I have developed in this project is able to do a fairly good job of supporting the third-party services the University of Calgary has subscribed to.
The REBC basically consists of a converter for a static page, and a set of functions to deal with a dynamic page, containing mainly Javascripts.
The converter for a static page scans the page to look for various HTML tags and the corresponding attribute that may have a URL as its value. We distinguish three different situations: a relative path of a relative URL (without a leading slash), an absolute path of a relative URL (with a leading slash), and an absolute URL. First the REBC either inserts a base URL or modifies the existing base URL in the HEAD section of the page to ensure that the new base URL points to the virtual directory of webrelay. This base URL almost eliminates the need to rewrite a relative path in a URL, because that relative path will be relative to the directory part of the base URL. However, it has to rewrite an absolute path of a relative URL, because the virtual directory in the newly constructed base URL interferes with the standard algorithm for figuring out the correct absolute URL from a relative URL. If a proper rewriting is not done, the standard algorithm would result in an absolute URL where the virtual directory would be left out. Consequently, when the client clicks on that hyperlink, the session key contained in the virtual directory would be lost. For example, suppose the original relative URL is in the form of
/dir1/dir2/file.html
while the inserted base URL is:
http://webrelay.host.name:port/ses_key/targetdir/targetfile
The resulting URL based on the standard algorithm will become:
http://webrelay.host.name:port/dir1/dir2/file.html
and the ses_key is lost. Therefore, the REBC must rewrite this URL as:
http://webrelay.host.name:port/ses_key/dir1/dir2/file.html
The REBC also has to rewrite any absolute URL to change the hostname and port number and to insert the virtual directory in front of the path. For example, suppose the original absolute URL is:
http://another.host.name:targetport/dir1/dir2/file.html
The resulting URL after rewriting should look like:
http://webrelay.host.name:port/ses_key=another.host.name:targetport/dir1/dir2/file.html
When the converter for a static page is parsing the page, it also finds out other information for later use by the converter for a dynamic page. For instance, it scans over any invocation of a Javascript function to obtain the function name as well as an argument that passes a URL. This information is passed to the converter for a dynamic page. Whether this argument value should be rewritten depends on the relationship of this function argument to other elements in an assignment statement inside the definition of the corresponding Javascript function that will be analyzed by the dynamic page converter.
The dynamic page converter deals with Javascript function arguments, user-defined variables, navigator objects, forms, and event handlers. A balance must be struck by working on only the necessary items, instead of using a full-scale language parser. For example, you may be interested only in the location and window objects, to which a URL could be assigned, and leave other navigator objects untouched. If location.href is assigned a value that is taken from an Option list of the Form->Select element, then the URLs of the Options of the corresponding Form selected by a client must be rewritten. If, however, a URL in an Option list of a Form is going to be used by a CGI script defined in the action attribute of that Form, then one should not rewrite that URL at all, because the CGI script will be run on the vendor's server machine, rather than on the client.
Inside a definition of a Javascript function, if an argument is used as the first term of an assignment statement, and that argument is passed a URL value, the dynamic page converter informs the static page converter to rewrite that URL. When a property of a location object appears on the right side of an assignment statement, the dynamic page converter does a careful analysis of relationships of various terms with the location object and decides how to rewrite the assignment as a whole. The tricky thing here is that the location object refers to the "real" location of the original page dispatched by the vendor's web server, and its value must be rewritten to point to webrelay with the virtual directory inserted in front of the target path.
In the case of an assignment statement for a user-defined Javascript variable, the insertion of a rewritten base URL in the HEAD section of a page helps resolve ambiguity between a string literal and a filename, because the REBC does not have to explicitly rewrite the relative path (a filename alone constitutes a relative URL), which is taken care of by the inserted rewritten base URL.
The rewriting done by the REBC on-the-fly makes sure that the converted page presented to the client contains all hyperlinks that point to webrelay and have the right session key included. This ensures that subsequent requests sent by the client be forced to go through webrelay, and the session key can be used by webrelay to track the session.
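The three static-page cases described earlier (relative path, absolute path, and absolute URL) reduce to a short function once a URL has been isolated from its tag. The sketch below is an illustration only: rewrite_url and its parameters are hypothetical, it handles only the lowercase "http://" scheme prefix, and locating URLs inside HTML or Javascript, which is the hard part, is left out entirely.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Rewrite one URL so that it points at the relay.
   relay   is "hostname:port" of the webrelay server,
   ses_key is the session key of the current session.
   Returns 0 on success, -1 if the result does not fit in out. */
int rewrite_url(const char *url, const char *relay, const char *ses_key,
                char *out, size_t outlen)
{
    int n;

    assert(url != NULL && relay != NULL && ses_key != NULL && out != NULL);
    if (strncmp(url, "http://", 7) == 0) {
        /* Absolute URL: keep the original host:port inside the
           virtual directory, after the session key. */
        const char *hostport = url + 7;
        const char *path = strchr(hostport, '/');
        int hostlen = path ? (int)(path - hostport) : (int)strlen(hostport);

        n = snprintf(out, outlen, "http://%s/%s=%.*s%s",
                     relay, ses_key, hostlen, hostport, path ? path : "/");
    } else if (url[0] == '/') {
        /* Absolute path: insert the virtual directory in front. */
        n = snprintf(out, outlen, "http://%s/%s%s", relay, ses_key, url);
    } else {
        /* Relative path: the rewritten base URL already covers it. */
        n = snprintf(out, outlen, "%s", url);
    }
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

With relay set to webrelay.host.name:port and the session key ses_key, the path /dir1/dir2/file.html becomes http://webrelay.host.name:port/ses_key/dir1/dir2/file.html, matching the example given for the absolute-path case.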
Conclusion
Webrelay works efficiently to handle thousands of hits per day and is scalable, supporting as many remote vendor web servers as you want. It is easy for a nontechnical person to configure. All you have to do is add or delete web servers in the configuration file, or decide whether you want mandatory authentication for any given web server. Its session control data is stored in memory in the same address space of a single process, so that multiple threads can access the data efficiently. The session control module permits legitimate university users to use the services that the university subscribes to at any time from any ISP. They are asked for authentication only once, at the start of access to a given web server. In subsequent transactions, there is no need for the client to send in authentication credentials in the case of the custom authentication. The session control engine checks the session dutifully. Both static and dynamic page converting are supported, which makes the mechanism successful.
Acknowledgments
I'd like to thank Bob Revak and Mary Westell for their inspirations, Eric Tull for his meticulous testing of the code, and Dean Mah and Matthew Ling for useful discussions. Thanks also to Kurt Zhang, then at the University of Waterloo, for discussions on various aspects of POSIX multithread programming.
DDJ
Listing One
/* Update cookie in the session control data */
int sess_manager_update_cookie(char *seskey, unsigned int keylen,
                               accept_info *aip, relay_info *rip)
{
  chain_node_t *cnp;
  int status;

  cnp = sess_manager_find(seskey, keylen);
  if(cnp != NULL) {
    sess_info_t *sip;
    Spthread_mutex_lock(&cnp->lock);
    sip = (sess_info_t *) cnp->data;
    if(sip != NULL && sip->ClientIPAddr) {
      if ((aip->cliaddr->sin_addr.s_addr == sip->ClientIPAddr)) {
        time_t ct;
        time(&ct);
        if ( difftime(ct, cnp->LastUpdated) <= sess_manager_refresh ) {
          char *cookie_cookie = NULL;
          char *cookie_name = NULL;
          char *cookie_path = NULL;
          char *cookie_domain = NULL;
          char *p;
          int i, j, len1, len2;

          /* Session still valid.*/
          cnp->LastUpdated = ct;
          /* rip->cookie: NAME=VALUE; PATH=/path1/path2, while
             cookie_cookie contains NAME=VALUE, and cookie_path
             is /path1/path2 */
          cookie_cookie = parse_cookie(rip->cookie, &cookie_path,
                                       &cookie_domain);
          if ((p = strchr(cookie_cookie, '=')) != NULL)
            /* cookie_name is NAME */
            cookie_name = strdupdelim(cookie_cookie, p);
          if(sip->cookie_path[0] == NULL && sip->cookie_name[0] == NULL
             && sip->cookie_value[0] == NULL) {
            /* There is no existing cookies in the SIP yet.
               Simply insert the new cookie into it. */
            sip->cookie_path[0] = xstrdup(cookie_path);
            sip->cookie_name[0] = xstrdup(cookie_name);
            sip->cookie_domain[0] = xstrdup(cookie_domain);
            sip->cookie_value[0] = xstrdup(cookie_cookie);
          } else {
            /* Match the existing cookies already stored in SIP */
            for(i=0; i<MAX_NUM_COOKIE && sip->cookie_path[i] != NULL; ++i) {
              len1 = strlen(cookie_path);
              if(!strncasecmp(cookie_path, sip->cookie_path[i], len1)) {
                for (j=i; j<MAX_NUM_COOKIE && sip->cookie_name[j]!=NULL;++j ) {
                  len2 = strlen(cookie_name);
                  if(!strncasecmp(cookie_name, sip->cookie_name[j], len2)
                     && !strncasecmp(cookie_path, sip->cookie_path[j], len1)) {
                    /* Overwrite this cookie */
                    FREE_MAYBE(sip->cookie_value[j]);
                    /* We store NAME=VALUE together as one single cookie */
                    sip->cookie_value[j] = xstrdup(cookie_cookie);
                    break;
                  }
                }
                /* No match of cookie_name, regarded as a new cookie of
                   the same path. Now we ADD this new cookie at j */
                sip->cookie_path[j] = xstrdup(cookie_path);
                sip->cookie_name[j] = xstrdup(cookie_name);
                sip->cookie_domain[j] = xstrdup(cookie_domain);
                sip->cookie_value[j] = xstrdup(cookie_cookie);
                break;
              }
            }
            if (sip->cookie_path[i] == NULL && sip->cookie_name[i] == NULL) {
              /* No match either of cookie_name nor cookie_path. This is
                 a new cookie of a new path. Now we add this new cookie
                 at i */
              sip->cookie_path[i] = xstrdup(cookie_path);
              sip->cookie_name[i] = xstrdup(cookie_name);
              sip->cookie_domain[i] = xstrdup(cookie_domain);
              sip->cookie_value[i] = xstrdup(cookie_cookie);
            }
          }
          FREE_MAYBE(cookie_name);
          FREE_MAYBE(cookie_path);
          FREE_MAYBE(cookie_domain);
          FREE_MAYBE(cookie_cookie);
          cnp->data = (void *) sip;
          status = SES_OK;
        } else
          status = SES_TIMEOUT;
      } else
        status = SES_CLIENT_ENDS;
    } else
      status = SES_CLIENT_ENDS;
    Spthread_mutex_unlock(&cnp->lock);
    chain_hash_release(cnp);
  } else
    status = SES_CLIENT_ENDS;
  return status;
}
Listing Two
/* Retrieve cookie from session control data */
int sess_manager_retrieve_cookie(char *seskey, unsigned int keylen,
                                 accept_info *aip, relay_info *rip)
{
  int i, len1, len2, len;
  chain_node_t *cnp;
  int status;

  cnp = sess_manager_find(seskey, keylen);
  if(cnp != NULL) {
    sess_info_t *sip;
    Spthread_mutex_lock(&cnp->lock);
    sip = (sess_info_t *) cnp->data;
    if(sip != NULL && sip->ClientIPAddr) {
      if ((aip->cliaddr->sin_addr.s_addr == sip->ClientIPAddr)) {
        time_t ct;
        time(&ct);
        if ( difftime(ct, cnp->LastUpdated) <= sess_manager_refresh ) {
          char *targethost = NULL;
          char *targetpath = NULL;
          int i, len, len1, len2, old_len, num_entries;
          int ck_dom_len, targethost_len;

          /* Session still valid.*/
          if(rip->redir_targethost != NULL)
            targethost = xstrdup(rip->redir_targethost);
          else
            targethost = xstrdup(rip->targethost);
          if(rip->redir_targetpath != NULL)
            targetpath = xstrdup(rip->redir_targetpath);
          else
            targetpath = xstrdup(rip->targetpath);
          FREE_MAYBE(rip->cookie);
          old_len = 0;
          num_entries = 0;
          for(i = 0;i<MAX_NUM_COOKIE && sip->cookie_path[i] != NULL; ++i) {
            /* First match the domain */
            targethost_len = strlen(targethost);
            if(sip->cookie_domain[i] != NULL)
              ck_dom_len = strlen(sip->cookie_domain[i]);
            else
              goto Match_path;
            /* Consume chars one by one from the end of the
               cookie_domain */
            while(--ck_dom_len >= 0 && --targethost_len >= 0) {
              if(sip->cookie_domain[i][ck_dom_len] !=
                 targethost[targethost_len])
                break;
            }
            if(ck_dom_len > 0) {
              /* No match of domain, search the next entry */
              continue;
            }
            /* Match the path */
          Match_path:
            len1 = (strlen(sip->cookie_path[i]) < strlen(targetpath)) ?
                   strlen(sip->cookie_path[i]) : strlen(targetpath);
            if(!strncasecmp(targetpath, sip->cookie_path[i], len1)) {
              num_entries++;
              len2 = strlen(sip->cookie_value[i]);
              if(num_entries == 1) {
                len = len2;
                rip->cookie = Smalloc(len);
                memcpy(rip->cookie, sip->cookie_value[i], len);
              } else {
                len = old_len + 1 + 1 + len2;
                rip->cookie = Srealloc(rip->cookie, len);
                memcpy(rip->cookie + old_len, "; ", 2);
                memcpy(rip->cookie + old_len + 2, sip->cookie_value[i], len2);
              }
              old_len = len;
            }
          }
          if(num_entries > 0) {
            rip->cookie = Srealloc(rip->cookie, len + 1);
            rip->cookie[len] = '\0';
          }
          FREE_MAYBE(targethost);
          FREE_MAYBE(targetpath);
          status = SES_OK;
        } else {
          status = SES_TIMEOUT;
        }
      } else
        status = SES_CLIENT_ENDS;
    } else
      status = SES_CLIENT_ENDS;
    Spthread_mutex_unlock(&cnp->lock);
    chain_hash_release(cnp);
  } else
    status = SES_CLIENT_ENDS;
  return status;
}