The Internet In Your Pocket


Simon is a freelance programmer and author, whose titles include Beginning Perl (Wrox Press, 2000) and Extending and Embedding Perl (Manning Publications, 2002). He's the creator of over 30 CPAN modules and a former Parrot pumpking. Simon can be reached at simon@simon-cozens.org.


It so often happens: you make the interesting discoveries when you're trying to get something else done. I was planning to write about a fantastic thing I'd coded up which proxied Google, presenting the results a bit more nicely, keeping a record of which results people clicked on for particular searches, doing various domain-specific disambiguation to determine whether "Jaguar" was a car, an animal or an operating system, and so on.

Unfortunately, I never got it finished before the vacation, where I don't really have enough bandwidth to get it tested. So to try to alleviate the problem, I wrote another kind of proxy—one to store every kind of request and response it sees, and then play back the responses to the same requests again later. This means that whenever I am connected to the Internet, I can throw out a few requests, collect the results in a database, and bring them back to the machine I'm developing on.

Now this might sound like what the browser's cache does, or something we could use Squid to achieve, and it is a little like that, but it has three useful properties: first, it's rude. It doesn't care about cache directives which ask the cache to fetch the page again if it has expired. Once it's stored a page, it'll give it to you again if you request the same page, no matter how old the cached version or how dynamic the page ought to be. This means it works nicely for completely disconnected operation.

Second, it stores every kind of request and response. Browser caches typically don't cache any pages where there are POST requests sending data to the server; my proxy does. Finally, it's portable. I can move a single database file around between different machines, and I have my snapshot of the Internet in my pocket.

This is useful not only for the kind of development I'm doing, but also for module testing. For instance, if you're writing a module which accesses something on the web, you might ship a database of known-good data to test from. That way your module can be tested while the end-user is disconnected from the Internet and, when you're testing something like an interface to a search engine, the tests are protected from the highly dynamic nature of likely result sets.

A Proxy Primer

Let's remind ourselves how proxies work in general, before we pick up the Perl tools to help us write our storage and replay proxies.

In the normal course of affairs, a web browser puts together an HTTP request and sends it to a web server. The server responds with an HTTP response. Both messages have headers (for instance, saying when the page was generated, what type of data it is, and so on) and a body, the contents of the web page or any POSTed form data.
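
To make the headers-and-body split concrete, here is a minimal sketch that pulls a raw HTTP message apart using nothing but core Perl. The helper name parse_http_message and the sample response are invented for illustration; a real program would more likely use the HTTP::Request and HTTP::Response classes.

```perl
use strict;
use warnings;

# Split a raw HTTP message into its start line, headers and body.
# parse_http_message is a hypothetical helper, not part of any CPAN module.
sub parse_http_message {
    my ($raw) = @_;

    # The headers end at the first blank line; everything after it is the body.
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    my ($start_line, @header_lines) = split /\r?\n/, $head;

    my %headers;
    for my $line (@header_lines) {
        my ($name, $value) = $line =~ /^([^:]+):\s*(.*)$/ or next;
        $headers{ lc $name } = $value;    # header names are case-insensitive
    }

    return {
        start_line => $start_line,
        headers    => \%headers,
        body       => defined $body ? $body : '',
    };
}

# A made-up response, as a server might send it.
my $response = parse_http_message(
      "HTTP/1.1 200 OK\r\n"
    . "Content-Type: text/html\r\n"
    . "Content-Length: 31\r\n"
    . "\r\n"
    . "<html><body>Hello</body></html>"
);

print "status line: $response->{start_line}\n";
print "type:        $response->{headers}{'content-type'}\n";
print "body:        $response->{body}\n";
```

A request has the same shape; only the first line differs, carrying the method and target instead of the status.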

When a proxy gets involved, the browser sends the request to the proxy instead of to the remote server; the proxy might decide to respond to it itself, or it might pass on the request to the web server as before. The proxy will rewrite some of the headers, and may choose to mess with the body if it wants to. The proxy then receives the response from the server, modifies it if it wants to, and finally passes it back to the client. That might sound like a lot of work, but we have CPAN!
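
One visible difference on the wire is the request line. When talking directly to a server, the client sends only the path; when talking to a proxy, it sends the full URL so that the proxy knows where to forward the request. The sketch below shows both forms. The build_request helper is hypothetical, and it handles only plain http:// URLs and GET requests.

```perl
use strict;
use warnings;

# build_request is a hypothetical helper for illustration only.
sub build_request {
    my ($url, $via_proxy) = @_;

    my ($host, $path) = $url =~ m{^http://([^/]+)(/.*)?$}
        or die "unsupported URL: $url";
    $path = '/' unless defined $path;

    # Direct: the request line carries only the path.
    # Via a proxy: it carries the absolute URL.
    my $target = $via_proxy ? "http://$host$path" : $path;

    return "GET $target HTTP/1.1\r\nHost: $host\r\n\r\n";
}

print build_request('http://www.example.com/search?q=jaguar', 0);
print build_request('http://www.example.com/search?q=jaguar', 1);
```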

There are two major ways to write web proxies in Perl using CPAN modules: first, we can use the HTTP::Proxy module, which basically does everything for us, or, if we're writing more complicated proxies, we can spin our own proxy together using POE and the POE::Component::Server::HTTP and POE::Component::Client::HTTP modules. HTTP::Proxy is much simpler, so we'll begin with that.

A Dummy Proxy

The simplest proxy is one which does nothing at all to interfere with the request/response cycle. It just passes on the request to the server, and passes the response back to the client. Such a simple proxy can be useful if, for instance, you have a network of computers which is disconnected from the Internet apart from one gateway machine. You don't want to allow complete Internet access, but you do want the computers on the network to access the web. The solution is to get the gateway to act as a web proxy. The computers on the private network connect to the gateway, and the gateway can connect to the outside world.

Here's how to write such a gateway proxy in HTTP::Proxy:

    use HTTP::Proxy;
    HTTP::Proxy->new( host => "10.0.0.2" )->start;

This will start an HTTP proxy on port 8080 on the internal IP address 10.0.0.2. It will forward HTTP connections to the relevant server on the outside world, and then pass back the response to the client.

HTTP::Proxy also allows us to attach filters onto the stages of operation of this basic proxy: to mess with the headers and body sent to the remote server, and to mess with the headers and body of the response from it. For instance, here's an example from the HTTP::Proxy documentation which removes various headers which might give away information about the browser:

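    use HTTP::Proxy::HeaderFilter::simple;

    # $proxy is an HTTP::Proxy object created with HTTP::Proxy->new.
    # Current HTTP::Proxy releases call the sub with ($self, $headers,
    # $message), so the headers object is $_[1].
    $proxy->push_filter(
        mime    => undef,
        request => HTTP::Proxy::HeaderFilter::simple->new(
            sub { $_[1]->remove_header(qw( User-Agent From Referer Cookie )) },
        ),
        response => HTTP::Proxy::HeaderFilter::simple->new(
            sub { $_[1]->remove_header(qw( Set-Cookie )); },
        )
    );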

This says that we want to filter all MIME types, rather than the default text/*, and that we should construct a filter to go onto the request side of proxying which removes the User-Agent, From, Referer and Cookie headers before it goes on to the remote server, and that responses coming back from the server should have the Set-Cookie header stripped.

The Store Proxy

For the first of our matched pair of proxies, we don't want to change the request or the response, but we do want to store away the response when we see it. HTTP::Proxy generally sends data to the body filters in chunks as it arrives, but we want to wait until the full response has been received before doing anything. We do this by pushing the HTTP::Proxy::BodyFilter::complete filter onto the response stack:

    $proxy->push_filter(
        mime => undef,
        response => HTTP::Proxy::BodyFilter::complete->new,
    );
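
The idea of holding back chunks until the whole body has arrived can be modelled without HTTP::Proxy at all. The sketch below is a simplified stand-in, not the module's implementation: make_accumulator and its $is_last flag are invented for illustration.

```perl
use strict;
use warnings;

# make_accumulator is a hypothetical stand-in, not HTTP::Proxy code.
# It returns a filter that swallows each chunk and releases the whole
# body only when told that the last chunk has arrived.
sub make_accumulator {
    my $buffer = '';
    return sub {
        my ($chunk, $is_last) = @_;
        $buffer .= $chunk;
        return '' unless $is_last;    # pass nothing downstream yet
        my $whole = $buffer;
        $buffer = '';                 # reset for the next response
        return $whole;
    };
}

my $filter = make_accumulator();

# Three chunks of one response; only the last call yields any data.
my @seen = map { $filter->(@$_) }
    ( [ '<html>', 0 ], [ '<body>hi</body>', 0 ], [ '</html>', 1 ] );

print scalar(@seen), " calls, final output: $seen[-1]\n";
```

Any filter placed after such an accumulator sees empty data until the end, and then the complete body in one piece.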

The next filter we're going to push on will serialize the request and the response.

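    use DB_File;
    use Storable qw(dclone freeze);
    use HTTP::Proxy::BodyFilter::simple;

    # Sort hash keys when freezing, so that equal requests give equal keys.
    $Storable::canonical = 1;

    $proxy->push_filter(
        mime => undef,
        response => HTTP::Proxy::BodyFilter::simple->new(sub {
            return unless $proxy->response;
            # Copy the request and strip the headers that vary between clients.
            my $request = dclone($proxy->request);
            $request->headers->remove_header($_)
                for qw/user-agent accept accept-language
                   accept-charset x-forwarded-for via/;
            # Key: the frozen request. Value: the frozen response.
            tie my %clicked, "DB_File", "cache.db"
                or die "Cannot open cache.db: $!";
            $clicked{freeze($request)} = freeze($proxy->response);
            untie %clicked;
        })
    );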

This begins by making a copy of the request, and removing some of the headers which are incidental to the request. We want any other requests we make to the same URL with the same data in the body to look the same as the current request object, so we get rid of all the headers which would make it distinctive. This means when we freeze the request with Storable::freeze we can use it as a hash key, and freezing another request like it will come to the same hash key. Similarly, we freeze the response object so that we can retrieve it later; using a Berkeley DB means that we have a file we can move between machines easily.
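
Using a frozen structure as a hash key only works if equal structures freeze to identical bytes. Storable promises that for hashes only when $Storable::canonical is set, because otherwise the keys are written in whatever order Perl happens to hold them. The following sketch shows the normalization step on its own, under some assumptions: plain hashes stand in for HTTP::Request objects, and request_key is a hypothetical helper.

```perl
use strict;
use warnings;
use Storable qw(freeze);

# Ask Storable to sort hash keys so equal structures freeze to equal bytes.
$Storable::canonical = 1;

# Headers treated as incidental, as in the store filter.
my @incidental = qw(user-agent accept accept-language
                    accept-charset x-forwarded-for via);

# request_key is a hypothetical helper; a plain hash stands in for HTTP::Request.
sub request_key {
    my ($request) = @_;
    my %headers = %{ $request->{headers} };    # copy, so the caller's data is untouched
    delete @headers{@incidental};
    return freeze({
        method  => $request->{method},
        url     => $request->{url},
        headers => \%headers,
        body    => $request->{body},
    });
}

my %common = (
    method => 'POST',
    url    => 'http://www.example.com/search',
    body   => 'q=jaguar',
);
my $form_type = 'application/x-www-form-urlencoded';

my $from_browser = { %common,
    headers => { 'user-agent' => 'Mozilla/5.0', 'content-type' => $form_type } };
my $from_script  = { %common,
    headers => { 'content-type' => $form_type, 'via' => '1.1 gateway',
                 'user-agent'   => 'libwww-perl' } };
my $other_query  = { %common, body => 'q=panther',
    headers => { 'content-type' => $form_type } };

print "browser and script share a key\n"
    if request_key($from_browser) eq request_key($from_script);
print "a different POST body gets a different key\n"
    if request_key($from_browser) ne request_key($other_query);
```

Because the body is part of the key, two POSTs to the same URL with different form data are stored separately.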

The Replay Proxy

The replay proxy is very similar. We need to use the same Berkeley database:

    use DB_File;
    tie my %clicked, "DB_File", "cache.db";

We need to be able to both freeze and thaw objects: to freeze the request into a hash key, and to thaw the response from out of the hash again.

    use Storable qw(freeze thaw dclone);
    $Storable::canonical = 1;   # sort hash keys so equal requests freeze to equal keys

So when the request comes in to the proxy, we want to look at it and see if we've seen it before. This will be a body filter, because we want to wait until the whole request is available:

    $proxy->push_filter(
        mime => undef,
        request => HTTP::Proxy::BodyFilter::simple->new( sub {

Our filter needs to do the same thing to the request as it did in the store filter:

            my $request = dclone($proxy->request);
            $request->headers->remove_header($_)
                for qw/user-agent accept accept-language 
                   accept-charset x-forwarded-for via/;

And now, if we've seen this request before, we can retrieve the response and return it immediately:

            return unless my $response = $clicked{freeze($request)};
            $proxy->response(thaw($response));
        })
    );

If we don't set response in a filter—that is, if we don't find the request in the database—then the request carries on to the remote server as normal. Of course, where we're disconnected, this will return an error, but it does enable us to intercept particular requests, which is what we wanted all along.
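
The hit-or-fall-through behaviour can be tried out in isolation. In this sketch an ordinary hash stands in for the hash tied to cache.db, requests and responses are plain hashes, and the helper names (cache_key, store_response, replay_or_forward) are invented; the $offline callback merely imitates what happens when the remote server cannot be reached.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

my %cache;    # stands in for the hash tied to cache.db

# Hypothetical helpers; requests and responses are plain hashes here.
sub cache_key {
    my ($req) = @_;
    return freeze([ @{$req}{qw(method url body)} ]);
}

sub store_response {
    my ($req, $res) = @_;
    $cache{ cache_key($req) } = freeze($res);
}

# Answer from the cache if possible; otherwise hand the request to
# $forward, which represents contacting the remote server.
sub replay_or_forward {
    my ($req, $forward) = @_;
    my $frozen = $cache{ cache_key($req) };
    return $frozen ? thaw($frozen) : $forward->($req);
}

my $search = { method => 'GET', body => '',
               url    => 'http://www.example.com/search?q=jaguar' };
my $unseen = { %$search, url => 'http://www.example.com/other' };

store_response($search, { status => 200, body => '<html>results</html>' });

# While disconnected, forwarding can only produce an error.
my $offline = sub { return { status => 500, body => 'cannot connect' } };

my $hit  = replay_or_forward($search, $offline);
my $miss = replay_or_forward($unseen, $offline);

print "stored request: $hit->{status}\n";
print "unseen request: $miss->{status}\n";
```

Nothing here checks how old the stored response is, which is exactly the "rude" behaviour described at the start.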

Conclusion

With this pair of proxies in place, we can run the store proxy on a machine which is directly connected to the Internet, store all our test data into a database, and then take the database home to an unconnected machine. From there we can do our development, hitting the same sites and getting the same responses as though we were connected, and ensure that our module gives the results we want. This works just as well for testing web-based modules.

Next time I'll be taking the technological temperature of the Perl community by reporting what's hot and popular at YAPC::Europe in Braga, and then (I hope) we'll look at using Perl to make Google a bit smarter.

TPJ

