
Developer's Reading List

April 17, 2012

C++ concurrency, web crawlers, Google testing, and more: This month's reading list is packed with great books on interesting topics.

Webbots, Spiders, and Screen Scrapers, 2nd Ed.

by Michael Schrenk

I expect that every one of us who codes for pleasure has at some point considered the fun it might be to write our own Web crawler — either our own little search engine, or a spider to check on what is happening on a specific site, or a webbot to rummage around and find all sorts of interesting downloadable resources from a site. It turns out that there is a cadre of folks who write just such software for a living. One of them is the author, who has written numerous such crawlers and bots and is willing to share the secrets of their construction.

He starts with the basics of a crawler, which are explained very clearly. He quickly moves to refining the project to handle a variety of obstacles: login screens, forms and form submissions, downloading images (and converting them to thumbnails), etc. In every explanation, you benefit from Schrenk's experience. For example, he discusses why you should not use regular expressions when parsing incoming data and how to correctly detect phone numbers in crawled data. Throughout the book, you have the sense of being guided by a seasoned hand.
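To give a flavor of the point about avoiding regular expressions, here is a minimal sketch of link extraction using a real HTML parser. It is written in Python with only the standard library, not the PHP and cURL the book uses, and it is not taken from the book; `LinkCollector`, `extract_links`, and `SAMPLE_PAGE` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser copes with quoting styles and attribute order,
        # which is where hand-written regular expressions tend to break.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link target found in the given HTML text."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# A stand-in for a downloaded page; a real crawler would fetch this.
SAMPLE_PAGE = """
<html><body>
  <a href="/books/">Books</a>
  <a href='https://example.org/about'>About</a>
  <a name="top">No link here</a>
</body></html>
"""
```

Calling `extract_links(SAMPLE_PAGE, "https://example.com/index.html")` returns the two link targets resolved against the base URL and skips the anchor that has no `href`. A crawler would then queue those URLs for fetching.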

That seasoned guidance is especially evident when it comes to the social and political aspects of web crawling. He explains both sides: the rules of the road and what's expected by sites that don't want to be crawled, and how to cover your tracks if you need to access a site with a cranky web administrator. This section is particularly rich in hard-won advice.
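One widely used "rule of the road" is the robots.txt file, in which a site states what it does not want crawled. Whether and how the book treats robots.txt is an assumption here; the sketch below simply shows how a polite crawler might consult such a file, using Python's standard `urllib.robotparser` rather than the book's PHP. The rules in `ROBOTS_TXT`, the `example-bot` agent name, and the `build_rules` helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download
# this from the site's /robots.txt before fetching anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a queryable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

# Ask before fetching: is this URL allowed for our user agent?
allowed_private = rules.can_fetch("example-bot", "https://example.com/private/data.html")
allowed_books = rules.can_fetch("example-bot", "https://example.com/books/")

# Seconds the site asks crawlers to wait between requests, if stated.
delay = rules.crawl_delay("example-bot")
```

With these sample rules, the path under `/private/` is refused, the `/books/` path is allowed, and the requested delay between fetches is five seconds.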

A final section of the book discusses sniping, which is the automated triggering of an event, generally as late as possible so as to preclude response. The typical case is making an automated bid on eBay a second before the auction closes.
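The timing side of sniping comes down to simple arithmetic: pick a trigger moment shortly before the deadline and wait until then. The sketch below illustrates only that calculation, in Python rather than the book's PHP; it does not talk to eBay or any other site, and `snipe_time`, `seconds_until`, and the one-second lead are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def snipe_time(closes_at, lead=timedelta(seconds=1)):
    """Return the moment to act: `lead` before the closing time."""
    return closes_at - lead


def seconds_until(trigger_at, now):
    """Seconds to wait before acting; zero if the moment has passed."""
    return max(0.0, (trigger_at - now).total_seconds())


# Hypothetical auction closing at noon UTC.
closes_at = datetime(2012, 4, 17, 12, 0, 0, tzinfo=timezone.utc)
trigger_at = snipe_time(closes_at)

# Ten seconds before the trigger moment, the bot would sleep 10 s.
now = datetime(2012, 4, 17, 11, 59, 49, tzinfo=timezone.utc)
wait = seconds_until(trigger_at, now)
```

In practice the lead time would have to absorb network latency and clock differences between the bot and the server, which this sketch ignores.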

The crawlers and bots themselves are fashioned from open-source products, notably PHP and cURL. cURL is really the primary engine, and PHP is used to modify searches and crawls appropriately. This is the first book I've read in which PHP was the language of instruction, and I was surprised to see how simple it was to follow the code. As the author explains, PHP is an ideal language because the problem domain fits it perfectly, and almost anyone can quickly get to the point of modifying the code for their needs without starting from scratch.

Overall, I found this a very clear, very readable, and thorough presentation of the topic. Given that this is the second edition of this volume, others have evidently already realized that Schrenk has written what is probably the definitive introduction to this topic, one that makes the whole field of crawlers, spiders, and bots an approachable and interesting area to explore. Highly recommended.

— Andrew Binstock





