Jun03: Web Scraping Proxy

Howard is a technology consultant at AT&T Labs-Research. He can be contacted at http://www.research.att.com/info/hpk/.


Considering that vast amounts of information are available on the Web, it is often useful to obtain data from web sites for local consumption or processing. For simple tasks, such as viewing the local weather report, users can simply use a browser to access a web site. Other tasks, such as logging on to retailers and viewing order status, may require accessing many pages. A third type of task is to obtain data from web sites for processing with other programs. Since these sites will not be viewed by a person, these tasks can be automated by a process referred to as "web scraping."

One popular tool for automating these tasks is LWP, a collection of Perl modules that provides a convenient programming interface for writing web clients (see Perl & LWP, by Sean Burke, O'Reilly & Associates, 2002, ISBN 0596001789). LWP lets programs send GET/POST requests to servers and retrieve the results. Database results are often returned as HTML tables, and table-parsing modules such as HTML::TableExtract (used in the listings) make it easy to access such data.
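For example, a minimal LWP client looks like this (the URL here is only a placeholder):

use HTTP::Request;
use LWP::UserAgent;

$ua = new LWP::UserAgent();
$request = new HTTP::Request('GET' => "http://www.example.com/weather?city=Newark");
$webdoc = $ua->request($request);
die $webdoc->status_line unless $webdoc->is_success();
print $webdoc->content();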

But as Cameron Laird observed, "the hard part of web scraping...comes in issuing requests that elicit useful pages from web servers. This is all reverse engineering, which means simply 'disciplined guesswork'" (http://cedar.intel.com/cgi-bin/ids.dll/content/content.jsp?cntKey=Generic+Editorial::ws_scraping&cntType=IDS_EDITORIAL). This reverse engineering is straightforward when the desired page is obtained from a GET query and cookies aren't involved. In this case, all the information you need to obtain the page is encoded in the URL. So, to write a web scraping program, you simply access the page from a browser and copy the URL displayed by the browser to the web scraping program.

POST queries are more difficult to decipher because their arguments are not displayed by the browser. You can examine the HTML source for the page, find the query, and convert it to the form used by the LWP POST request. Some pages use JavaScript to construct the query, making it more difficult to discern the query's arguments.
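For instance, given a hypothetical form found in the page source, the equivalent LWP call built with HTTP::Request::Common might look like this (the action URL and field names are illustrative only):

use HTTP::Request::Common qw(POST);
use LWP::UserAgent;

# Hypothetical form from the page source:
#   <form method="post" action="http://www.example.com/status">
#     <input type="text" name="order">
#     <input type="hidden" name="session" value="12345">
#   </form>
$ua = new LWP::UserAgent();
$request = POST "http://www.example.com/status", [
    'order'   => "1-2345678",
    'session' => "12345",
];
$webdoc = $ua->request($request);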

Cookies present an additional difficulty when writing web scraping programs. LWP handles cookies sent as part of a web page's HTTP header. However, web pages sometimes use JavaScript to set cookies. Because LWP does not interpret JavaScript, these cookies are not automatically interpreted by the LWP package. While it is usually straightforward to write Perl code to examine the JavaScript code, extract the cookie settings, and add the cookie to the LWP browsing session, it is tedious to have to examine every page to see whether it uses JavaScript to set cookie values.
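When you do find such a page, you can add the cookie to the jar yourself. Here is a minimal sketch, assuming the JavaScript assigns document.cookie a literal string; the domain and the regular expression are illustrative only:

use HTTP::Cookies;
use HTTP::Request;
use LWP::UserAgent;

$ua = new LWP::UserAgent();
$jar = HTTP::Cookies->new();
$ua->cookie_jar($jar);
$webdoc = $ua->request(new HTTP::Request('GET' => "http://www.example.com/"));

# Suppose the page contains, say:
#   document.cookie = "session-token=abc123; path=/";
# Pull the name/value pair out of the page source and add it by hand.
if ($webdoc->content() =~ /document\.cookie\s*=\s*['"]([^=]+)=([^;'"]+)/) {
    # set_cookie(version, key, value, path, domain, port,
    #            path_spec, secure, maxage, discard)
    $jar->set_cookie(0, $1, $2, "/", ".example.com", undef, 0, 0, 86400, 0);
}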

Jon Udell suggests using Proxomitron, a proxy server that runs on a PC, to help discover the queries and responses necessary to access a site (http://www.byte.com/documents/s=493/byt20010214s0005/index3.htm). Users do this by configuring their browser to route web access via the Proxomitron, then manually browsing the site of interest while viewing the Proxomitron's log window, which displays the headers of messages that it intercepts.

This technique does help. It is easy to obtain a transcript of the pages that are accessed, including those pages obtained by redirects. It also makes it easy to see cookies go back and forth between the servers and the browser. However, to use the cookie information to find cookie-generating JavaScript pages, you must tediously watch for new cookies that were not produced by the HTTP headers. Proxomitron has further limitations, such as requiring a Windows host.

In this article, I present WSP, a web scraping proxy server designed to help write LWP applications. As with Proxomitron, users browse the Web while routing all requests through WSP. WSP monitors the traffic and emits the Perl LWP calls needed to duplicate the actions it sees. It ignores requests for images, since these are almost never needed in web scraping applications. When it sees the browser send cookies that were not set by the server's HTTP headers, it emits a warning and examines any JavaScript code on the previous page, looking for where the cookie was set.

The Web Scraping Proxy

WSP is a Perl program that uses the socket library to communicate with the browser client and the server (the complete source code and related files are available electronically; see "Resource Center," page 5). It listens for browser requests on port 5364. (A command-line argument can be used to change the port if port 5364 is already in use.) The first line of the browser request indicates the connection type. If it begins with CONNECT, then it is a secure SSL request; otherwise it's an ordinary HTTP request.
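The dispatch logic is roughly as follows. This is a sketch of the idea, not WSP's actual code:

use IO::Socket::INET;

$port = $ARGV[0] || 5364;   # default port; override on the command line
$listener = IO::Socket::INET->new(
    LocalPort => $port,
    Listen    => 5,
    Reuse     => 1,
) or die "cannot listen on port $port: $!";

while ($client = $listener->accept()) {
    $first_line = <$client>;
    if ($first_line =~ /^CONNECT /) {
        # secure SSL request: tunnel to the server and decrypt both sides
    }
    else {
        # ordinary HTTP request: parse the request line and relay it
    }
    close($client);
}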

The request indicates the name of the remote server and port number that the client wishes to contact. WSP establishes a connection to the remote server. If this is a secure connection, it uses the SSLeay (http://symlabs.com/Net_SSLeay/) library to encrypt communications on both the client and server connections.

WSP eavesdrops on the communications between the client and server and produces a transcript of the pages visited, emitting Perl code for GET/POST requests. It also reports the cookies that are exchanged and saves copies of web pages that look interesting.

An instance of WSP is designed for use by a single programmer who is writing LWP code to access web sites. Since it is a Perl program, it should run on any machine that supports Perl. It need not run on the same machine as the browser. I do recommend that you run WSP in an initially empty directory, to make it easy to find the files that it produces. It writes its transcript to stdout, so this should be directed to a file.

By default, WSP listens for browser requests on port 5364. If another user is running WSP on the same host, then you will get an error that the port is in use. Choose some other port number and try running WSP with the number as a command-line argument.
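Assuming the script is saved as wsp.pl (the name here is for illustration; use whatever you call the downloaded source), a typical invocation redirects the transcript to a file:

perl wsp.pl > transcript.pl          # listen on the default port 5364
perl wsp.pl 8080 > transcript.pl     # listen on port 8080 instead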

The web browser should be started after WSP. Empty the browser's web page cache and clear its cookie cache. Then set it to use the proxy server on the host where WSP is running with the port that WSP is listening on. If your browsing includes secure pages (accessed by the HTTPS protocol), then set the browser to use the secure proxy server at the same host and port.

Suppose you want to access a list of books you ordered at amazon.com, say, to import into an expense voucher. Start by accessing http://www.amazon.com/ from a browser. In Listing One, WSP shows that two pages were accessed (using the HTTP 302 redirect) and that several cookies were set.

For each page, WSP emits the data sent from the browser to the web site while requesting the page. It shows the Request and Referer fields as well as the cookies sent. It then displays the GET request as a Perl LWP call, followed by lines describing the response. Since this page produces cookies presumably needed in the rest of the session, the web scraping program should first access the URL http://www.amazon.com/. You don't need to write any code to handle these cookies because the LWP package handles them automatically.

You then click on "My Account" and a page comes up that asks what you want to do; for instance, see the books you ordered. Since this page did not set any cookies, it must be conveying this information as part of the URL of the page that it links to. You can ignore this page when writing the web scraping program.

Finally, a page comes up that lets you enter an e-mail address and password. This is a secure page, and you get a warning from the browser indicating that the certificate it received came from WSP, not the site being browsed. This is because WSP intercepts the communications between the browser and the web site, so I accept the certificate. Listing Two is WSP's output for this page.

Again, the first lines of the output indicate data sent from the browser to the web site while requesting the page. It shows the Request and Referer fields as well as the cookies sent. It then displays the POST request as a Perl LWP call. The lines that follow describe the response, which consists of several tables that include the information requested. WSP saved a copy of the page in a file named "w3." You can examine this file to determine how to write Perl code to extract the data that you need.
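For example, HTML::TableExtract can pull the order data out of the saved page. A sketch, assuming hypothetical column headers (substitute the real column titles from the page):

use HTML::TableExtract;

# Read the page that WSP saved as "w3".
open(FH, '<', 'w3') or die "cannot open w3: $!";
$html = do { local $/; <FH> };
close(FH);

# The header names below are hypothetical; use the actual column titles.
$te = HTML::TableExtract->new(headers => ['Title', 'Order Placed', 'Total']);
$te->parse($html);
foreach $ts ($te->tables) {
    foreach $row ($ts->rows) {
        print join("\t", map { defined $_ ? $_ : "" } @$row), "\n";
    }
}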

For POST requests, early versions of WSP would simply indicate the POST URL and the arguments sent by the browser. While this is sufficient for duplicating the browser session, it is sometimes not useful for writing the web scraping program. The issue is that some arguments would have initial values sent from the web server that were unique to each session and were expected to be returned unchanged by the browser. For instance, some sites encode an order number into an argument. So, if the web scraping program always sent the arguments that WSP observed in this web session, the server would signal an error because it received unexpected arguments.

For this reason, WSP determines which arguments were changed by the browser; that is, which arguments differ between the HTML form sent to the browser and the values the browser sent back. It produces a list of the arguments that have changed. To send such a request, the web scraping program examines the received form and sends back either the changed value reported by WSP or the initial value specified in the form. Listings Three and Four are fragments of sample programs that demonstrate these two methods for passing POST arguments.
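The comparison itself is easy to reproduce with HTML::Form. A sketch of the idea (not WSP's actual code), assuming $html holds the page containing the form and %submitted holds the name/value pairs the browser actually sent:

use HTML::Form;

$form = HTML::Form->parse($html, "http://www.example.com/");
foreach $input ($form->inputs) {
    $name = $input->name;
    next unless defined $name;
    $default = defined $input->value ? $input->value : "";
    if (exists $submitted{$name} && $submitted{$name} ne $default) {
        print "\$post_args->{'$name'} = \"$submitted{$name}\";  # was \"$default\"\n";
    }
}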

Conclusion

Someday, interchange standards such as XML will provide an easy way to extract information from the World Wide Web. Until then, it will be necessary to use web scraping to obtain information from HTML web pages. Tools such as WSP help programmers write these web scraping applications because they significantly decrease the amount of effort needed to create the applications.

DDJ

Listing One

# Request: http://www.amazon.com/ 
$request = new HTTP::Request('GET' => "http://www.amazon.com/");
# HTTP/1.1 302 Found
# Set-Cookie: skin=; domain=.amazon.com; path=/; 
#                                expires=Wed, 01-Aug-01 12:00:00 GMT
# Request: http://www.amazon.com:80/exec/obidos/subst/home/home.html 
$request = new HTTP::Request('GET' => 
            "http://www.amazon.com:80/exec/obidos/subst/home/home.html");
# HTTP/1.1 302 Found
# Set-Cookie: session-id=102-0343904-7396174; path=/; domain=.amazon.com; 
#                                  expires=Tuesday, 06-Aug-2002 08:00:00 GMT
# Set-Cookie: session-id-time=1028620800; path=/; domain=.amazon.com; 
#                                  expires=Tuesday, 06-Aug-2002 08:00:00 GMT
# Request: 
#   http://www.amazon.com/exec/obidos/subst/home/home.html/102-0343904-7396174 
# Cookie: 'session-id', '102-0343904-7396174'
# Cookie: 'session-id-time', '1028620800'
$request = new HTTP::Request('GET' => 
"http://www.amazon.com/exec/obidos/subst/home/home.html/102-0343904-7396174");
# Set-Cookie: ubid-main=430-4160616-7432656; path=/; domain=.amazon.com; 
#                                 expires=Tuesday, 01-Jan-2036 08:00:01 GMT
# Set-Cookie: obidos_path=continue-shopping-url=/subst/home/home.html/
#   102-0343904-7396174&continue-shopping-post-data=
#   &continue-shopping-description=generic.gateway.default; 
#   path=/; domain=.amazon.com
# Table 1: 9 rows; table nesting: 3
# Table 2: 84 rows; table nesting: 5
# Table 3: 4 rows; table nesting: 2
# Table 4: 1 rows
# Contains JavaScript
# Saving web page as w0

# Request: http://www.amazon.com/exec/obidos/account-access-login/
#                             ref=top_nav_ya_gateway/102-0343904-7396174 
# Referer: http://www.amazon.com/exec/obidos/subst/home/home.html/
#                             102-0343904-7396174
# Cookie: 'session-id', '102-0343904-7396174'
# Cookie: 'session-id-time', '1028620800'
# Cookie: 'ubid-main', '430-4160616-7432656'
# Cookie: 'obidos_path', 'continue-shopping-url=/subst/home/home.html/
#     102-0343904-7396174&continue-shopping-post-data=
#      &continue-shopping-description=generic.gateway.default'
$request = new HTTP::Request('GET' => "http://www.amazon.com/exec/obidos/
           account-access-login/ref=top_nav_ya_gateway/102-0343904-7396174");
# Table 1: 7 rows; table nesting: 3
# Table 2: 39 rows; table nesting: 3
# Table 3: 1 rows
# Contains JavaScript
# Saving web page as w1


Listing Two

# Request: https://www.amazon.com/exec/obidos/flex-sign-in-done/
#                                                        103-6178643-7537408 
# Referer: http://www.amazon.com/exec/obidos/flex-sign-in/
#       103-6178643-7537408?page=help%2Fya-sign-in-secure.html
#       &response=order-history-filtered&method=POST&opt=ab&return-url=
#       order-history-filtered&ss-order-filter=year-2002&Go.x=13&Go.y=9
# Cookie: 'session-id', '103-6178643-7537408'
# Cookie: 'session-id-time', '1028620800'
# Cookie: 'ubid-main', '430-2320918-7404815'
# Cookie: 'obidos_path', 'continue-shopping-url=/subst/home/home.html/
#       103-6178643-7537408&continue-shopping-post-data=
#       &continue-shopping-description=generic.gateway.default'
$request = POST "https://www.amazon.com/exec/obidos/flex-sign-in-done/" .
     "103-6178643-7537408", [
    'Go.x' => "13",
    'Go.y' => "9",
    'method' => "POST",
    'opt' => "ab",
    'page' => "help/ya-sign-in-secure.html",
    'response' => "order-history-filtered",
    'return-url' => "order-history-filtered",
    'ss-order-filter' => "year-2002",
    'email' => "hpk1024\@hotmail.com",
    'action' => "sign-in",
    'next-page' => "help/ya-register-secure.html",
    'password' => "mypassword",
    'x' => "159",
    'y' => "7",
] ;
# DIFFERENCES between form from server and submitted form:
$post_args = { };
$post_args->{'password'} = " mypassword ";  # was ""
$post_args->{'x'} = "159";  # was ""
$post_args->{'email'} = " hpk1024\@hotmail.com";  # was ""
$post_args->{'y'} = "7";  # was ""
# end DIFFERENCES
# Set-Cookie: x-main=zXhR??@ELakCfL?rLjUW?yCkcMYNSl4d; path=/; 
#               domain=.amazon.com; expires=Tuesday, 01-Jan-2036 08:00:01 GMT
# Set-Cookie: auth-browser-session-main=ss; path=/; domain=.amazon.com
# Set-Cookie: x-main=zXhR??@ELakCfL?rLjUW?yCkcMYNSl4d; path=/; 
#               domain=.amazon.com; expires=Tuesday, 01-Jan-2036 08:00:01 GMT
# Table 1: 5 rows; table nesting: 3
# Table 2: 1 rows
# Table 3: 4 rows; table nesting: 3
# Table 4: 2 rows; table nesting: 2
# Table 5: 7 rows; table nesting: 5
# Table 6: 1 rows
# Table 7: 1 rows
# Contains JavaScript
# Saving web page as w3


Listing Three

use HTML::TableExtract;
use HTTP::Cookies;
use HTTP::Request::Common qw(POST GET);
use LWP::UserAgent;
 
$ua = new LWP::UserAgent();
$jar = HTTP::Cookies->new();
$ua->cookie_jar($jar);
$ua->agent("Microsoft Internet Explorer/5.5");
 
$request = new HTTP::Request('GET' => "http://www.amazon.com/");
 
$webdoc = $ua->request($request);
die $webdoc->status_line unless $webdoc->is_success();
 
## [...several GET requests omitted...]

$request = POST "https://www.amazon.com/exec/obidos/flex-sign-in-done/" .
         "103-6178643-7537408", [
        'Go.x' => "13",
        'Go.y' => "9",
        'method' => "POST",
        'opt' => "ab",
        'page' => "help/ya-sign-in-secure.html",
        'response' => "order-history-filtered",
        'return-url' => "order-history-filtered",
        'ss-order-filter' => "year-2002",
        'email' => " hpk1024\@hotmail.com ",
        'action' => "sign-in",
        'next-page' => "help/ya-register-secure.html",
        'password' => " mypassword",
        'x' => "159",
        'y' => "7",
] ;
$webdoc = $ua->request($request);
die $webdoc->status_line unless $webdoc->is_success();
 
# obtain information from $webdoc->content()
# HTML::TableExtract() might be useful here


Listing Four

use HTML::Form;
use HTML::TableExtract;
use HTTP::Cookies;
use HTTP::Request::Common qw(POST GET);
use LWP::UserAgent;
 
$ua = new LWP::UserAgent();
$jar = HTTP::Cookies->new();
$ua->cookie_jar($jar);
$ua->agent("Microsoft Internet Explorer/5.5");
$request = new HTTP::Request('GET' => "http://www.amazon.com/");
$webdoc = $ua->request($request);
die $webdoc->status_line unless $webdoc->is_success();
 
## [...several GET requests omitted...]

$post_args = { };
$post_args->{'password'} = " mypassword ";  # was ""
$post_args->{'x'} = "159";  # was ""
$post_args->{'email'} = " hpk1024\@hotmail.com";  # was ""
$post_args->{'y'} = "7";  # was ""

# use form from previous GET request
my $form = HTML::Form->parse($webdoc->content(), "http://www.amazon.com");
for $input ($form->inputs)
{
    unless (defined $post_args->{$input->name})
    {
        my @fnv = $input->form_name_value();
        while (my $fname = shift @fnv)
        {
            $post_args->{$fname} = shift @fnv;
        }
    }
 }
$request = POST "https://www.amazon.com/exec/obidos/flex-sign-in-done/" .
         "103-6178643-7537408", $post_args;
$webdoc = $ua->request($request);
die $webdoc->status_line unless $webdoc->is_success();
 
# obtain information from $webdoc->content()
# HTML::TableExtract() might be useful here


