
Keywords: Perl, spidering, screen-scraping, Java, Python, PHP

Title: Spidering Hacks

Editors: Kevin Hemenway and Tara Calishain

Publisher: O'Reilly

ISBN: 0596005776

Media: Book

Level: Intermediate Perl

Verdict: Excellent for Perl hackers interested in spidering

There comes a time when surfing just isn't enough. It may be that you get sick of checking the Amazon rank of your favourite book, or you find a stash of images or media files that you want to download, or maybe you find yourself endlessly cycling through the same set of sites day in, day out looking for specific pieces of data (stock prices, weather reports, news items, knitting patterns …). When that day dawns, you start looking seriously at the different spidering options available to do away with the drudgery or to expand your reach. It should also be the day you reach for 'Spidering Hacks', particularly if you're a Perl user or are prepared to dive in and learn.

Like 'Google Hacks' and the rest of the Hacks series, this book presents 100 bite-sized chunks of code or technique to tackle specific activities. In this book these range from the simple (how to download a set of image files) to the complex (cross-referencing the output from one site with another to generate a third set of data). Whatever the complexity, each hack is clearly explained, with the code samples balanced by instructions, examples and notes on how to hack the hack.

As already mentioned, the hacks in this book mostly use Perl, though scattered here and there you'll find some Java, Python and PHP. If you really hate Perl, then this is not the place for you. On the other hand, the authors assume only a rudimentary knowledge of Perl, and there is no requirement for any knowledge of network programming. After the opening chapter, which gives guidance on being a good spidering citizen (i.e. how to respect the sites you are sucking data from), there is a second chapter which details how to create a spidering toolkit (i.e. how to find and install the set of modules that many of the hacks depend on).

With a toolkit in place and a knowledge of good behaviour, it's straight into the different hacks. These are organised by topic: collecting media files, gleaning data from databases (with many examples for Yahoo!, Amazon, Google, Alexa and other popular information sources), maintaining your collections (more automation with cron or other scheduling tools) and a final chapter on giving something back (creating a web service, generating RSS feeds and so on).

The bulk of the hacks are in chapter four, which looks at extracting data from databases. Aside from the obvious sources such as Amazon and Google, these include online banks, FedEx package tracking and more. A range of techniques is used to grab and filter the data, so even if a data source you want to use isn't listed, the chances are that one of these hacks can be refactored to do what you want.

If Perl is not your thing, then the very light sprinkling of non-Perl hacks probably isn't enough to make this an essential purchase. If you're a Perl hacker interested in spidering, there is without doubt a ton of stuff for you here.
