Developer's Reading List
April 17, 2012
C++ concurrency, web crawlers, Google testing, and more: This month's reading list is packed with great books on interesting topics.
Webbots, Spiders, and Screen Scrapers, 2nd Ed.
by Michael Schrenk
I expect that every one of us who codes for pleasure has at some point considered how much fun it might be to write our own Web crawler: a little search engine of our own, a spider to check on what is happening on a specific site, or a webbot to rummage around and find all sorts of interesting downloadable resources on a site. It turns out that there is a cadre of folks who write just such software for a living. One of them is the author, who has written numerous such crawlers and bots and is willing to share the secrets of their construction.
He starts with the basics of a crawler, which are explained very clearly. He quickly moves on to refining the project to handle a variety of obstacles: login screens, forms and form submissions, downloading images (and converting them to thumbnails), etc. In every explanation, you benefit from Schrenk's experience: for example, in his discussion of why you should not use regular expressions when parsing incoming data, or of how to correctly detect phone numbers in crawled data. You really have the sense throughout the book of being guided by a seasoned hand.
This is especially true when it comes to the social and political aspects of web crawling. He explains both sides: the rules of the road and what's expected by sites that don't want to be crawled, and how to cover your tracks if you need to access a site with a cranky web administrator. This section is particularly rich in hard-won advice.
A final section of the book discusses sniping, which is the automated triggering of an event, generally as late as possible so as to preclude response. The typical case is making an automated bid on eBay a second before the auction closes.
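The timing idea behind sniping can be sketched in a few lines. This is not code from the book (which uses PHP); it is a minimal Python illustration, and the function name `seconds_until_snipe`, the one-second lead, and the latency allowance are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_snipe(close_time, now,
                        lead=timedelta(seconds=1),
                        latency=timedelta(0)):
    """Return how long to wait before triggering the event.

    close_time -- when the auction (or other deadline) ends
    now        -- the current time, passed in so the function is testable
    lead       -- how far ahead of the deadline the event should land
    latency    -- assumed network delay between sending and arrival
    """
    fire_at = close_time - lead - latency
    # Never return a negative wait: if the moment has passed, fire at once.
    return max(0.0, (fire_at - now).total_seconds())


close = datetime(2012, 4, 17, 12, 0, 0, tzinfo=timezone.utc)
one_minute_before = close - timedelta(seconds=60)
wait = seconds_until_snipe(close, one_minute_before)
```

A real sniper would also have to deal with clock drift between the client and the site, which this sketch ignores.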
The crawlers and bots themselves are fashioned from open-source products, notably PHP and CURL. CURL is really the primary engine, and PHP is used to modify searches and crawls appropriately. This is the first book I've read in which PHP was the language of instruction, and I was surprised to see how simple it was to follow the code. As the author explains, PHP is an ideal language because the problem domain fits it perfectly and almost anyone can quickly get to the point of modifying the code for their needs, without starting from scratch.
Overall, I found this a very clear, very readable, and thorough presentation of the topic. That the volume has reached a second edition suggests that other readers have already recognized that Schrenk has written what is probably the definitive introduction to this topic, and has made the whole field of crawlers, spiders, and bots an approachable and interesting area to explore. Highly recommended.
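To give a flavor of the parsing step at the heart of a crawler, here is a minimal sketch of link extraction. It is not Schrenk's code: the book works in PHP with CURL, whereas this example assumes Python and its standard library, and the names `LinkCollector` and `extract_links` are hypothetical. In the spirit of the advice about regular expressions noted earlier, it uses an HTML parser rather than pattern matching.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names in lowercase.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the links found in an HTML string, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


page = '<a href="/a">x</a><A HREF=\'b.html\'>y</A><img src="c.png">'
links = extract_links(page, "http://example.com/dir/page.html")
```

A crawler would fetch each page, pass it to something like `extract_links`, and queue the results for later visits; fetching, politeness delays, and duplicate detection are left out here.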
— Andrew Binstock

