Dr. Dobb's is part of the Informa Tech Division of Informa PLC




Webrelay: A Multithreaded HTTP Relay Server


Feb00: Webrelay: A Multithreaded HTTP Relay Server

Peter, who holds a Ph.D. in astrophysics, is a programmer at the University of Calgary. He can be reached at [email protected].


Many organizations are now providing remote users with online, web-based information. In corporations, this information ranges from human-resource policy manuals to data sheets. Libraries, on the other hand, offer users access to third-party online web-based electronic journals (such as the Journal of Mathematical Computation; http://www.jstor.org/journals/00255718.html) and databases (like those provided by the Online Computer Library Center; http://bart.prod.oclc.org/). Most commercial information vendors require that clients access their web servers from a valid IP address, which, at the Library of the University of Calgary, means a university IP address or campus-wide user ID/password. This is fine when users are on the campus, because on-campus machines usually have valid IP addresses. However, more and more users -- distance-learning students, retired professors, staff members, and the like -- are using their own off-campus ISPs to connect to the Internet. Users may also want to use a public workstation in a public library to access a service. However, when legitimate users of the university library connect directly to the Internet from an off-campus IP address, the vendor web server typically rejects the access request.

To address this problem, I designed and implemented webrelay -- a freely available multithreaded HTTP relay server. (The source code and related files for webrelay are available electronically; see "Resource Center," page 7.) In a nutshell, webrelay authenticates a client to make sure the client is a legitimate user before connecting it to the vendor web server. The vendor's server then sees the request as coming from the relay server itself, which always has a valid IP address or campus-wide user identification.

Design Considerations

One of my design goals for webrelay was that it be as transparent as possible to both end users and library administrators. This precluded use of conventional HTTP proxy servers. Experience shows that with conventional HTTP proxy servers, end users must configure their browsers to use that specific proxy. If a user's ISP already has a proxy, it is difficult for the user to set up the browser to use the proxy designated by, say, the university. Furthermore, when browsing a web server other than those of specific third-party vendors, users have to turn that proxy off to avoid unnecessary user authentication imposed by the proxy. This is because, with a conventional proxy server, the library has no easy way to restrict proxying to only those vendor web servers to which it subscribes.

Webrelay is designed to mirror whatever remote web servers you want to include. Users do not have to configure their browsers in any special way, because users will not see the remote web server. To the user, the webrelay server is the real target server. The administrator of the library has complete control over what services are included in webrelay and whether authentication is mandatory for a given web server.

When webrelay mirrors a set of remote web servers, it maps the URL of a remote web server to a virtual directory of the webrelay with the form of http://webrelay.host.name:port/DB=db_key/, where the value of db_key is an abbreviation of the real URL that a vendor advertises to patrons. The DB=db_key is a virtual directory, because there is no such physical directory on the host of the webrelay. The mapping and a corresponding mandatory-authentication flag can be defined in a configuration file by the administrator. Users are introduced to these virtual directories by hyperlinks embedded in the top homepage of the library, which is under control of the library administrators. If a user comes in from an off-campus IP address or the virtual directory has its mandatory-authentication flag set to True, the request is channeled to the User Validation Engine (UVE); see Figure 1.
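The mapping from a virtual directory to a vendor URL can be illustrated with a short Python sketch. It is not taken from the webrelay source: the table contents, the flag values, and the names Target, TARGETS, split_virtual_dir, and resolve are all hypothetical, and the real program loads its table from a configuration file whose format is not described here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Target:
    base_url: str        # real URL of the vendor's web server
    auth_required: bool  # mandatory-authentication flag


# Hypothetical table; the flag values are invented for illustration.
TARGETS = {
    "oclc": Target("http://bart.prod.oclc.org/", auth_required=False),
    "jstor": Target("http://www.jstor.org/", auth_required=True),
}


def split_virtual_dir(path: str) -> Optional[Tuple[str, str]]:
    """Split '/DB=db_key/rest' into (db_key, rest); None if it does not match."""
    prefix = "/DB="
    if not path.startswith(prefix):
        return None
    key, _, rest = path[len(prefix):].partition("/")
    return (key, rest) if key else None


def resolve(path: str) -> Optional[Tuple[Target, str]]:
    """Return the target entry and the real URL for a virtual path."""
    parts = split_virtual_dir(path)
    if parts is None:
        return None
    key, rest = parts
    target = TARGETS.get(key)
    if target is None:
        return None
    return target, target.base_url + rest
```

With this table, a request for /DB=jstor/journals/x.html resolves to http://www.jstor.org/journals/x.html, and the entry's flag tells the caller whether the request must go through the UVE regardless of the client's IP address.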

Another design consideration involves how you establish a session and maintain session data in a basically stateless HTTP protocol. One option is to use Netscape cookies, but in our case this wasn't a good mechanism since cookies are designed explicitly for user tracking. When users access those services from a public workstation in a public library, it is difficult to manage cookies for individual users, because other users may have used that workstation at different times. The webrelay server would have to manage a set of cookies for communication with the end user, as well as another set or sets of cookies that might have been issued by a remote web server.

With this in mind, I decided to take a very different approach. After users are successfully authenticated, they are assigned a unique session key and registered with the Session Control Engine (SCE). As the user browses through a web site, the SCE tracks the update time, records any cookies that are sent by the remote web server, and manages any other pertinent session data.

After a new session is established, I use the session key to replace the db_key. The virtual directory now consists of the session key and possibly a hostname of another web server that the vendor may select. From then on every embedded URL in any page downloaded by users must be converted to have its base point to the webrelay hostname and port number, plus the virtual directory. This is done on-the-fly by a Relay and URL Conversion Engine (RURLCE) before the page can be sent to the user. This ensures that subsequent requests always have the correct session key included. Furthermore, within the session all the requests will be forced through webrelay.
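The on-the-fly conversion might look roughly like the following Python sketch. The article does not give the exact layout of the session virtual directory, so the /SK=session_key/hostname/ form used here is an assumption, as are the relay host, the port, and the function names. The sketch handles only quoted href, src, and action attributes and only http URLs; other schemes pass through unchanged.

```python
import re
from urllib.parse import urljoin, urlsplit

RELAY_BASE = "http://webrelay.host.name:8080"  # hypothetical host and port

# Quoted href/src/action attribute values only (a simplification).
ATTR_RE = re.compile(r'''(\b(?:href|src|action)\s*=\s*)(["'])(.*?)\2''',
                     re.IGNORECASE | re.DOTALL)


def to_relay_url(url: str, page_url: str, session_key: str) -> str:
    """Rewrite one embedded URL so that it points back at the relay."""
    absolute = urljoin(page_url, url)   # resolve relative references
    parts = urlsplit(absolute)
    if parts.scheme != "http":
        return url                      # leave mailto:, javascript:, etc. alone
    tail = parts.path or "/"
    if parts.query:
        tail += "?" + parts.query
    if parts.fragment:
        tail += "#" + parts.fragment
    # Assumed layout: /SK=<session key>/<vendor host>/<original path>
    return f"{RELAY_BASE}/SK={session_key}/{parts.netloc}{tail}"


def convert_page(html: str, page_url: str, session_key: str) -> str:
    """Rewrite every matching attribute value in a downloaded page."""
    def repl(match):
        prefix, quote, value = match.group(1), match.group(2), match.group(3)
        return prefix + quote + to_relay_url(value, page_url, session_key) + quote
    return ATTR_RE.sub(repl, html)
```

Because every rewritten link carries the session key, the next request from the browser identifies its session without any cookie being sent to the client.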

The other related aspect of the design is how to handle multiple connections efficiently without relying on interprocess communication to share the session data. I chose to use multiple threads to handle separate connections. Different threads can share the session data in the same address space of a single process. Compared with traditional multiprocess programming with interprocess communication, threads in a multithreaded process offer more efficient session control, simpler coding, and better scalability.

User Validation Engine (UVE)

When the first request for a given vendor's web server is sent to webrelay, the program decodes the virtual directory to get the db_key. Based on the db_key, webrelay finds the real URL of the vendor's web server and the mandatory-authentication flag from a lookup table stored in memory that has been loaded from a configuration file at the start up of the program. In addition to IP address checking as required by the majority of the vendors, the library sometimes requires mandatory authentication for a given vendor. Why? Because there are instances when a fee is required for a document delivery service associated with that vendor.

If the client's IP address is correct (from our campus, in other words) and no mandatory authentication is required for the destination web server, webrelay simply redirects the client to the destination. From then on, the client does transactions directly with the vendor. This eliminates all traffic that involves on-campus users. Otherwise, webrelay checks to see if a session has been established. If not, the UVE sends out an authentication challenge to the client. You can choose to use either the basic or custom authentication scheme; the latter is preferred. In the case of basic authentication, the client sends out the user ID/password for all subsequent requests, which defeats the purpose of our session-control mechanism, where the SCE needs no more than a session key to keep track of all requests. With custom authentication, the UVE sends out the challenge in an HTML logon form asking the client to submit its credentials (we require a user ID/password for now). Once the UVE receives the credentials from the client, it checks them against a remote authentication server where user IDs/passwords are stored. We use a commercial server for that purpose. Available electronically is a testing module that takes a username and password; if the username is the same as the password, the user is regarded as a legitimate user. You should customize the code to interface with whatever authentication server you choose.
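The decision the UVE makes for each incoming request can be summarized in a few lines of Python. This is a sketch of the logic as described above, not the webrelay code: the campus network range is a placeholder taken from the documentation address space, and the names Action, decide, is_on_campus, and demo_authenticate are hypothetical.

```python
import ipaddress
from enum import Enum


class Action(Enum):
    REDIRECT = "redirect the client straight to the vendor"
    RELAY = "relay the request within the existing session"
    CHALLENGE = "send the logon form"


# Placeholder range; a real deployment would list the campus networks.
CAMPUS_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_on_campus(client_ip: str) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in CAMPUS_NETWORKS)


def decide(client_ip: str, auth_required: bool, has_session: bool) -> Action:
    """Choose what to do with a request for a given vendor."""
    if is_on_campus(client_ip) and not auth_required:
        return Action.REDIRECT      # vendor accepts the client's own address
    if has_session:
        return Action.RELAY         # already authenticated
    return Action.CHALLENGE         # must log on first


def demo_authenticate(username: str, password: str) -> bool:
    """Stand-in that mimics the testing module: accept when both are equal."""
    return username == password
```

In a real installation demo_authenticate would be replaced by a call to the site's authentication server.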

In a case where multiple users share a public workstation, a user may use the browser's Back button to go back to the logon form that was filled out by a previous user who vacated the workstation. To prevent users from stealing other users' authentication credentials for gaining access, the UVE sets a timestamp on the logon page it issues. The form is invalidated after a certain period of time, say, five minutes. Of course, this does not completely solve the problem.
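One way to picture the expiring logon form is the Python sketch below. How webrelay actually embeds and checks the timestamp is not described, so this assumes the issue time travels with the form (for example, in a hidden field) and is compared with the clock when the form is submitted. The function names are hypothetical, and the stamp here is unsigned, so a determined user could alter it.

```python
import time
from typing import Optional

FORM_LIFETIME = 5 * 60  # seconds; the five-minute figure from the text


def issue_form_stamp(now: Optional[float] = None) -> str:
    """Return the value to embed in the logon form when it is issued."""
    return str(int(time.time() if now is None else now))


def form_is_fresh(stamp: str, now: Optional[float] = None) -> bool:
    """Accept a submitted form only if it was issued within FORM_LIFETIME."""
    try:
        issued = int(stamp)
    except ValueError:
        return False                  # missing or mangled stamp
    current = time.time() if now is None else now
    return 0 <= current - issued <= FORM_LIFETIME
```

A form retrieved with the Back button more than five minutes after it was issued is then rejected, and the user receives a fresh challenge.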

Session Control Engine (SCE)

If a client is successfully authenticated, webrelay registers the client with the SCE. The SCE assigns a unique session key to that session and stores the session start time and other pertinent information. A session key consists of a timestamp concatenated with the hex digits of the client's IP address. The session control information is stored in a hashtable with a separate-chaining linked list to resolve any collisions that might occur. The SCE uses the session key for lookup, update, retrieval, or delete operations from the hashtable.
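The session key and the chained hashtable can be sketched in Python as follows. The exact key format is an assumption (decimal seconds followed by eight hex digits of an IPv4 address), the hash function is a deliberately simple placeholder, and the class and function names are hypothetical. Locking is left out here to keep the data structure itself visible.

```python
import ipaddress
import time
from typing import Optional


def make_session_key(client_ip: str, now: Optional[float] = None) -> str:
    """Timestamp concatenated with the hex digits of the client's IP address."""
    stamp = int(time.time() if now is None else now)
    ip_hex = format(int(ipaddress.IPv4Address(client_ip)), "08x")
    return f"{stamp}{ip_hex}"


class Node:
    def __init__(self, key, data):
        self.key, self.data, self.next = key, data, None


class SessionTable:
    """Hashtable that resolves collisions with a linked list per bucket."""

    def __init__(self, buckets: int = 64):
        self.buckets = [None] * buckets

    def _index(self, key: str) -> int:
        return sum(key.encode()) % len(self.buckets)  # placeholder hash

    def put(self, key, data):
        i = self._index(key)
        node = self.buckets[i]
        while node:                      # update in place if key exists
            if node.key == key:
                node.data = data
                return
            node = node.next
        new = Node(key, data)            # otherwise push onto the chain
        new.next = self.buckets[i]
        self.buckets[i] = new

    def get(self, key):
        node = self.buckets[self._index(key)]
        while node:
            if node.key == key:
                return node.data
            node = node.next
        return None

    def delete(self, key) -> bool:
        i = self._index(key)
        prev, node = None, self.buckets[i]
        while node:
            if node.key == key:
                if prev is None:
                    self.buckets[i] = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False
```

Under these assumptions, a client at 192.0.2.10 authenticated at time 949363200 receives the key 949363200c000020a.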

The shared session-control data is protected in the multithreaded environment by fine-grained synchronization using mutexes from the POSIX pthread library. Each mutex guards the pointer to a single node, so only one thread can hold the pointer to a given node at any moment, while other threads may hold pointers to other nodes at the same time. This is certainly more efficient than coarse-grained synchronization methods, but harder to code (see Threads Primer: A Guide to Multithreaded Programming, by Bill Lewis and Daniel J. Berg, Prentice Hall 1996).

Cookie handling is an important aspect of the SCE. Webrelay has to take over cookie management for the client, because a cookie issued by a vendor's web server is meant for webrelay, which the vendor's web server sees as its client. If webrelay passed that cookie directly to its own client, the client would associate the cookie with webrelay rather than with the vendor's web server. On a subsequent request, the client would send whatever cookies it has associated with webrelay. The vendor's web server would not recognize these as correct cookies and would refuse the connection. Listing One shows how the SCE stores a cookie into the session control data, while Listing Two shows how it fetches the corresponding cookie on behalf of the client to be sent back to the vendor's web server.
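The general idea of keeping vendor cookies on the relay side can be sketched in Python. This is an independent illustration, not a reproduction of Listing One or Listing Two: the session is modeled as a plain dictionary, the function names are hypothetical, and cookie attributes such as path, domain, and expiry are ignored for brevity.

```python
from http.cookies import SimpleCookie
from typing import Dict, List


def store_cookies(session: Dict, set_cookie_headers: List[str]) -> None:
    """Record cookies from the vendor's Set-Cookie headers in the session."""
    jar = session.setdefault("cookies", {})
    for header in set_cookie_headers:
        parsed = SimpleCookie()
        parsed.load(header)
        for name, morsel in parsed.items():
            jar[name] = morsel.value   # a later cookie replaces an earlier one


def cookie_header(session: Dict) -> str:
    """Build the Cookie header value to send back to the vendor."""
    jar = session.get("cookies", {})
    return "; ".join(f"{name}={value}" for name, value in jar.items())
```

The relay would call store_cookies on each vendor response and attach the result of cookie_header to each outgoing request, so the browser never sees the vendor's cookies at all.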

The other important aspect of the SCE is the control of idle sessions. This is handled by a garbage sweeper behaving like a daemon thread. It wakes up every 300 seconds to scan the entire hashtable to check when a client last accessed the vendor's web server. If the last time the client downloaded a page or a file was more than, say, 15 minutes ago, the session is considered to have been idle too long, and is a candidate to be removed from the SCE. One catch, though, is that before the idle session can be removed from memory, the SCE must make sure that there is no other thread that is reading from or writing to that node in the hashtable. This is taken care of by a reference count. The reference count is initialized to zero. Whenever a thread starts (stops) reading from or writing to that node, its reference count increments (decrements) by one. Only when the reference count is zero can the garbage sweeper remove that node from the SCE.
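The interplay between the sweeper and the reference count can be sketched in Python. For brevity this version guards the whole table with a single lock, which is the coarse-grained approach rather than the per-node mutexes webrelay uses, and it keeps sessions in a dictionary instead of a hand-built hashtable. All class and function names are hypothetical.

```python
import threading
import time
from typing import List, Optional

IDLE_LIMIT = 15 * 60     # seconds a session may stay idle
SWEEP_INTERVAL = 300     # seconds between sweeps


class Session:
    def __init__(self, now: float):
        self.refs = 0            # threads currently using this session
        self.last_access = now


class SessionRegistry:
    def __init__(self):
        self._lock = threading.Lock()   # one lock for the whole table
        self._sessions = {}

    def register(self, key: str, now: Optional[float] = None) -> None:
        with self._lock:
            self._sessions[key] = Session(time.time() if now is None else now)

    def acquire(self, key: str, now: Optional[float] = None) -> Optional[Session]:
        """Look up a session and pin it so the sweeper cannot remove it."""
        with self._lock:
            session = self._sessions.get(key)
            if session is not None:
                session.refs += 1
                session.last_access = time.time() if now is None else now
            return session

    def release(self, session: Session) -> None:
        with self._lock:
            session.refs -= 1

    def sweep_once(self, now: Optional[float] = None) -> List[str]:
        """Remove idle sessions that no thread is using; return their keys."""
        current = time.time() if now is None else now
        with self._lock:
            stale = [k for k, s in self._sessions.items()
                     if s.refs == 0 and current - s.last_access > IDLE_LIMIT]
            for k in stale:
                del self._sessions[k]
        return stale


def start_sweeper(registry: SessionRegistry, stop: threading.Event) -> threading.Thread:
    """Run sweep_once every SWEEP_INTERVAL seconds until stop is set."""
    def run():
        while not stop.wait(SWEEP_INTERVAL):
            registry.sweep_once()
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

A worker thread brackets its use of a session with acquire and release; a session that is idle but still pinned survives the sweep and is removed on a later pass once its count is back to zero.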

An idle session not only consumes computer resources, but is also a serious problem on a public workstation in a public library shared by many different users. If the SCE does not purge the idle session, other users might be able to use the session left over by a previous user without being subject to any authentication. The garbage sweeper helps to alleviate this problem.

Relay and URL Conversion Engine (RURLCE)

As Figure 2 illustrates, the RURLCE consists of a REQuest Header Analyzer (REQHA), RESponse Header Analyzer (RESHA), and Response Entity-Body Converter (REBC). The REBC is able not only to convert a static HTML page, but also to convert dynamic pages generated by JavaScript.

