Keywords: PHP, web development, spiders, software agents
Title: Webbots, Spiders, and Screen Scrapers
Author: Michael Schrenk
Publisher: No Starch Press
ISBN: 978-1593273972
Media: Book
Level: Introductory - some PHP knowledge required
Verdict: Good as a general introduction but not right for anyone looking for depth
As anyone who has ever perused the logs of an active web site knows, huge amounts of traffic come not from people casually surfing the web, but from search engine spiders, bots and other automated sources. This traffic might be invisible to the user, but it's there all the same, and it is growing. For developers wanting to get in on the action, this book sounds like an ideal place to start, as it is devoted to spiders, web bots, screen scrapers and the like. The question is, does it deliver on that promise?
First things first. While it's possible to write automated web agents in all manner of programming languages, the language used throughout this book is PHP. So, if you've no inclination to use PHP the book is not for you. That said, the level of PHP is fairly basic, so if you've had some exposure to it before then it should be fine. Note that there are no instructions on downloading and installing PHP and a development environment should you not have these already in place.
The second thing to note is that the author has developed a library of tools that builds on top of the standard PHP cURL library. This is good in that it wraps a lot of the complications, so that the developer can focus on the projects in the book and on higher-level functionality. The drawback is that the reader doesn't get to grips with the standard tools that most developers will be using in practice, and isn't exposed to the complexities that crop up in real-world development.
With those provisos in mind, the book does a reasonable job of introducing the landscape of automated web agents, detailing the kinds of functions each typically performs and then diving into code that implements an example. Part one of the book looks at fundamental concepts and techniques, including downloading web pages, parsing and automated form submission. Part two looks at some sample projects, such as price-monitoring web bots, image capturing and link verification. Part three goes into more detail on specific technical issues, such as authentication, advanced cookie management and cryptography. The final section of the book takes a look at some other areas that need attention, including the design of stealthy agents, the use of proxies and the use of robots.txt to keep spiders out.
Overall the book is interesting and readable, and the code is straightforward and easy to follow even for those without a solid grounding in PHP. This is in part due to the use of the author's own libraries, and this is the key point on which the perceived value of the book hinges. If you are looking for solid technical depth and want to know more about PHP/cURL and some of the very technical challenges of writing fault-tolerant and stealthy web agents, the use of the author's libraries is a major downside. However, for those completely new to the topic this really is a good introduction: you can pick up a lot along the way, and the book has a fair breadth of information even if it doesn't have the depth required at guru level.