Website Screen Scraping Using Zend Framework

February 21, 2011 Zend Framework

ZF is a component-based framework, so we can use only the packages a specific task needs. For example, if we are not building a site and don’t need MVC, dispatchers, routers and so on, we can include just the packages required for the task.

Assume we need to build a screen scraper for a site or a group of sites. We’d need Zend_Dom_Query, with its convenient XPath and CSS query methods, and Zend_Json, since many sites exchange data over AJAX as JSON.
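To give a feel for the two components, here is a minimal sketch. It assumes the ZF1 library directory is on the include_path; the markup and the JSON string are made up for illustration and stand in for a downloaded page and an AJAX response.

```php
<?php
// Assumes the Zend Framework 1 library directory is on the include_path.
require_once 'Zend/Dom/Query.php';
require_once 'Zend/Json.php';

// Hypothetical markup standing in for a downloaded page.
$html = '<html><body><ul id="news">'
      . '<li class="item"><a href="/a">First</a></li>'
      . '<li class="item"><a href="/b">Second</a></li>'
      . '</ul></body></html>';

$dom = new Zend_Dom_Query($html);

// CSS selector query; each result item is a DOMElement.
$titles = array();
foreach ($dom->query('#news li.item a') as $link) {
    $titles[] = $link->nodeValue;
}

// The same nodes selected with XPath, collecting the href attributes.
$hrefs = array();
foreach ($dom->queryXpath('//ul[@id="news"]//a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}

// Hypothetical AJAX response body; decode() returns an associative array by default.
$data = Zend_Json::decode('{"status":"ok","items":[1,2,3]}');
```

Both query methods return an iterable result set, so the choice between CSS and XPath mostly comes down to which expression is easier to write for a given page.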

So, we start by collecting only the packages we need. The ZF classes we use load their own base classes in turn, so we also need Zend_Loader, which registers its own autoload function. Here is everything we need for the task:

Zend
│   Json.php
│   Loader.php
├───Dom
│   │   Exception.php
│   │   Query.php
│   │
│   └───Query
│           Css2Xpath.php
│           Result.php
├───Json
│   │   Decoder.php
│   │   Encoder.php
│   │   Exception.php
│   │   Expr.php
│   │   Server.php
│   │
│   └───Server
│       │   Cache.php
│       │   Error.php
│       │   Exception.php
│       │   Request.php
│       │   Response.php
│       │   Smd.php
│       │
│       ├───Request
│       │       Http.php
│       │
│       ├───Response
│       │       Http.php
│       │
│       └───Smd
│               Service.php
└───Loader
    │   Autoloader.php
    │   Exception.php
    │   PluginLoader.php
    │
    ├───Autoloader
    │       Interface.php
    │       Resource.php
    │
    └───PluginLoader
            Exception.php
            Interface.php

Let’s start coding. If we are going to scrape several sites, we need a class containing the methods for all of them, plus some common methods for handling cURL operations and service checks. Actually, it would be better to create a base class with the common methods and extend it with one scraper class per site, but let’s leave that for the future.
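A minimal sketch of such a class might look like the following. The class name SiteScraper and the exact shape of getContent() are assumptions made for illustration; only the cURL calls themselves are standard PHP.

```php
<?php
// Hypothetical class name, used here only for illustration.
class SiteScraper
{
    // Cookie jar kept in the current folder.
    protected $cookieFile = 'cookie.txt';

    /**
     * Fetch a URL with cURL; sends a form-encoded POST when $postData is an array.
     * Returns the response body as a string, or false on failure.
     */
    public function getContent($url, $postData = null)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
        curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookieFile);   // save cookies here
        curl_setopt($ch, CURLOPT_COOKIEFILE, $this->cookieFile);  // send saved cookies
        if ($postData !== null) {
            curl_setopt($ch, CURLOPT_POST, true);
            curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postData));
        }
        $body = curl_exec($ch);
        curl_close($ch);
        return $body;
    }
}
```

Site-specific methods would then call getContent() and parse whatever it returns.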

The class uses a ‘cookie.txt’ file in the current folder to hold cookies. But what if that file cannot be read or written? Let’s add a constructor that performs this check:
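One possible version of that check is sketched below. The class name, the optional constructor argument, and the choice of RuntimeException are assumptions; the idea is simply to fail early rather than lose the session cookies silently.

```php
<?php
// Hypothetical class name; only the constructor is shown.
class SiteScraper
{
    protected $cookieFile;

    public function __construct($cookieFile = 'cookie.txt')
    {
        // Create the file if it is missing, so the checks below test a real file.
        if (!file_exists($cookieFile) && @touch($cookieFile) === false) {
            throw new RuntimeException("Cannot create cookie file: $cookieFile");
        }
        if (!is_readable($cookieFile) || !is_writable($cookieFile)) {
            throw new RuntimeException("Cookie file is not readable and writable: $cookieFile");
        }
        $this->cookieFile = $cookieFile;
    }
}
```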

OK, so the class will be used as follows:
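As a rough illustration of the calling side, the sketch below uses a stand-in class that returns canned data, so the lines can run on their own. The class name, the method name getExampleSiteItems(), and the shape of the returned rows are all hypothetical.

```php
<?php
// Stand-in for the real scraper; names and data are hypothetical.
class SiteScraper
{
    public function getExampleSiteItems()
    {
        // The real method would download and parse pages; canned data here.
        return array(
            array('title' => 'First', 'url' => '/a'),
        );
    }
}

$scraper = new SiteScraper();
foreach ($scraper->getExampleSiteItems() as $item) {
    echo $item['title'], ' => ', $item['url'], PHP_EOL;
}
```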

To implement the method, we have to study the site with a tool such as Firebug, the Charles shareware proxy, or the Tamper Data Firefox add-on, which let us intercept and analyze HTTP headers and content. This is beyond the scope of this article, but it is almost always possible to emulate a browser’s behaviour. While working with the site you may notice that it validates some additional headers, encodes POST data in a non-standard manner, and so on. So we should tweak the getContent() method:
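A sketch of what such a tweaked getContent() could look like is given below. The class name, the buildOptions() helper, the extra parameters, and the User-Agent value are assumptions for illustration; the cURL option constants are standard PHP. Passing the POST data as a string sends it untouched, which helps with sites that expect a non-standard encoding.

```php
<?php
// Hypothetical class name; buildOptions() is a helper introduced for this sketch.
class SiteScraper
{
    protected $cookieFile = 'cookie.txt';

    // Browser-like User-Agent; the exact value is an illustrative assumption.
    protected $userAgent = 'Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20100101 Firefox/4.0';

    /**
     * Build the cURL option array for one request.
     * $postData: null for GET, an array for a form-encoded POST,
     * or a string that is sent exactly as given.
     */
    public function buildOptions($url, $postData = null, array $headers = array(), $referer = null)
    {
        $options = array(
            CURLOPT_URL            => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_USERAGENT      => $this->userAgent,
            CURLOPT_COOKIEJAR      => $this->cookieFile,
            CURLOPT_COOKIEFILE     => $this->cookieFile,
            CURLOPT_HTTPHEADER     => $headers, // e.g. 'X-Requested-With: XMLHttpRequest'
        );
        if ($referer !== null) {
            $options[CURLOPT_REFERER] = $referer;
        }
        if ($postData !== null) {
            $options[CURLOPT_POST] = true;
            $options[CURLOPT_POSTFIELDS] = is_array($postData)
                ? http_build_query($postData)
                : $postData;
        }
        return $options;
    }

    /** Perform the request; returns the body as a string, or false on failure. */
    public function getContent($url, $postData = null, array $headers = array(), $referer = null)
    {
        $ch = curl_init();
        curl_setopt_array($ch, $this->buildOptions($url, $postData, $headers, $referer));
        $body = curl_exec($ch);
        curl_close($ch);
        return $body;
    }
}
```

Keeping the option building separate from the request makes it easy to compare what we send against what the browser sent in the intercepted traffic.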

Let’s have a look at the method itself:
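A site-specific method could be sketched roughly as follows. Everything about the target page is hypothetical: the method names, the table.listing selector, and the two-column layout are placeholders, and getContent() is reduced to a simple stand-in so the block is self-contained. It assumes the ZF1 library directory is on the include_path.

```php
<?php
// Assumes the Zend Framework 1 library directory is on the include_path.
require_once 'Zend/Dom/Query.php';

// Hypothetical class and method names.
class SiteScraper
{
    // Simplified stand-in for the cURL-based download method.
    public function getContent($url)
    {
        return file_get_contents($url);
    }

    /** Download a listing page and return its rows as arrays. */
    public function getExampleSiteItems($url)
    {
        $html = $this->getContent($url);
        if ($html === false) {
            return array();
        }
        return $this->parseItems($html);
    }

    /** Extract name/price pairs from a table with class "listing". */
    public function parseItems($html)
    {
        $dom   = new Zend_Dom_Query($html);
        $items = array();
        foreach ($dom->query('table.listing tr') as $row) {
            // Each $row is a DOMElement, so plain DOM methods work on it.
            $cells = $row->getElementsByTagName('td');
            if ($cells->length < 2) {
                continue; // header row or malformed row
            }
            $items[] = array(
                'name'  => trim($cells->item(0)->nodeValue),
                'price' => trim($cells->item(1)->nodeValue),
            );
        }
        return $items;
    }
}
```

Splitting the download from the parsing lets us test the selectors against a saved copy of the page without hitting the site each time.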

To make the ZF classes load, we need to register the autoloader that ships with Zend_Loader (Zend_Loader_Autoloader). This can be done either in the constructor or before the class definition.
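A minimal sketch of that bootstrap, placed before the class definition, is shown here. It assumes the Zend folder from the tree above lives in a hypothetical library directory next to the script; adjust the path to your layout.

```php
<?php
// Hypothetical location: the directory containing the "Zend" folder.
set_include_path(__DIR__ . '/library' . PATH_SEPARATOR . get_include_path());

require_once 'Zend/Loader/Autoloader.php';

// getInstance() creates the singleton and registers it via spl_autoload_register().
Zend_Loader_Autoloader::getInstance();

// From here on, Zend_* classes load on first use without explicit require_once calls.
$dom = new Zend_Dom_Query('<p>Hello</p>');
```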

