Website Screen Scraping Using Zend Framework

Zend Framework (ZF) is a component-based framework, so we can use only the packages a specific task requires. For example, if we are not building a site and don’t need MVC, dispatchers, routers and so on, we can include just the packages the task needs.

Assume we need to build a screen scraper for a site or a group of sites. We’d need Zend_Dom_Query, with its convenient XPath and CSS query methods, and Zend_Json, since many sites exchange data over AJAX as JSON.
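
As a rough illustration of the two jobs these components do, here is a minimal sketch that uses PHP’s bundled DOMDocument, DOMXPath and json_decode() instead of the ZF classes, so it runs without the library. With ZF the same query would be written as $dom = new Zend_Dom_Query($html); $dom->query('li.price'). The HTML fragment, the JSON reply and the extractPrices() helper are made up for the example.

```php
<?php
// Hypothetical helper: returns the text of every <li> carrying the class "price".
function extractPrices($html)
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so silence the parser warnings.
    @$doc->loadHTML($html);

    $xpath = new DOMXPath($doc);
    // XPath equivalent of the CSS selector "li.price".
    $query = '//li[contains(concat(" ", normalize-space(@class), " "), " price ")]';

    $prices = array();
    foreach ($xpath->query($query) as $node) {
        $prices[] = trim($node->nodeValue);
    }
    return $prices;
}

// Made-up page fragment.
$html = '<html><body><ul><li class="price">9.99</li><li class="price old">18.99</li></ul></body></html>';
$prices = extractPrices($html);

// A made-up JSON reply decodes straight into a PHP array.
$reply = json_decode('{"server":"Ragnaros","faction":"Horde"}', true);
```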

So, we start by collecting only the packages we need. The ZF classes we use load their base classes in turn, so we also need Zend_Loader, which registers its own autoload function. Here is all we need for the task:

Zend
│   Json.php
│   Loader.php
├───Dom
│   │   Exception.php
│   │   Query.php
│   │
│   └───Query
│           Css2Xpath.php
│           Result.php
├───Json
│   │   Decoder.php
│   │   Encoder.php
│   │   Exception.php
│   │   Expr.php
│   │   Server.php
│   │
│   └───Server
│       │   Cache.php
│       │   Error.php
│       │   Exception.php
│       │   Request.php
│       │   Response.php
│       │   Smd.php
│       │
│       ├───Request
│       │       Http.php
│       │
│       ├───Response
│       │       Http.php
│       │
│       └───Smd
│               Service.php
└───Loader
    │   Autoloader.php
    │   Exception.php
    │   PluginLoader.php
    ├───Autoloader
    │       Interface.php
    │       Resource.php
    └───PluginLoader
            Exception.php
            Interface.php

Let’s start coding. If we are going to scrape several sites, we need a class containing the methods for all of them, plus some common methods for handling cURL operations and service checks. Actually, it would be better to create a base class with the common methods and extend it with one scraper class per site, but let’s leave that for the future.
 
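<?php
class Wow_Scrap
{
    protected $_ch;
    protected $_cookieFilename = 'cookie.txt';
    // Headers sent with every request; fill in whatever the target site expects
    protected $_defaultHeaders = array();

    protected function getContent($url, $method = 'get', $data = array())
    {
        $this->_ch = curl_init($url);
 
        curl_setopt($this->_ch,CURLOPT_POST,($method == 'get' ? 0 : 1));
        curl_setopt($this->_ch,CURLOPT_RETURNTRANSFER,1);
        curl_setopt($this->_ch,CURLOPT_COOKIEJAR,$this->_cookieFilename);
        curl_setopt($this->_ch,CURLOPT_COOKIEFILE,$this->_cookieFilename);
        curl_setopt($this->_ch,CURLOPT_HTTPHEADER,$this->_defaultHeaders);
 
        if(!empty($data))
        {
            // Encode postdata as a standard URL-encoded query string
            curl_setopt($this->_ch, CURLOPT_POSTFIELDS, http_build_query($data));
        }
 
        $res = curl_exec($this->_ch);
 
        curl_close($this->_ch);
 
        return $res;
    }
}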

As you can see, we use a 'cookie.txt' file in the current folder to hold cookies. But what if it is not readable or writable? Let’s add a constructor that checks this:

 
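<?php
class Wow_Scrap
{
    protected $_ch;
    protected $_fileHandle;
    protected $_cookieFilename = 'cookie.txt';
    // Headers sent with every request; fill in whatever the target site expects
    protected $_defaultHeaders = array();

    public function __construct()
    {
        // fopen() returns false when the file cannot be opened for writing
        if(($this->_fileHandle = fopen($this->_cookieFilename,'w')) === false)
        {
            throw new Exception('Cookie file is not writeable!');
        }
    }
 
    protected function getContent($url, $method = 'get', $data = array())
    {
        $this->_ch = curl_init($url);
 
        curl_setopt($this->_ch,CURLOPT_POST,($method == 'get' ? 0 : 1));
        curl_setopt($this->_ch,CURLOPT_RETURNTRANSFER,1);
        curl_setopt($this->_ch,CURLOPT_COOKIEJAR,$this->_cookieFilename);
        curl_setopt($this->_ch,CURLOPT_COOKIEFILE,$this->_cookieFilename);
        curl_setopt($this->_ch,CURLOPT_HTTPHEADER,$this->_defaultHeaders);
 
        if(!empty($data))
        {
            // Encode postdata as a standard URL-encoded query string
            curl_setopt($this->_ch, CURLOPT_POSTFIELDS, http_build_query($data));
        }
 
        $res = curl_exec($this->_ch);
 
        curl_close($this->_ch);
 
        return $res;
    }
}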
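
One thing to keep in mind is that fopen() in 'w' mode truncates the file, so cookies saved by a previous run are wiped whenever the object is created. If that matters, the check could be done without opening the file at all. The following is only a sketch of that alternative; isCookieJarUsable() is a hypothetical helper, not part of the class above.

```php
<?php
// Hypothetical helper: reports whether cURL will be able to read and write
// the cookie jar, without opening (and truncating) the file.
function isCookieJarUsable($filename)
{
    if (file_exists($filename)) {
        return is_readable($filename) && is_writable($filename);
    }
    // The file does not exist yet, so its directory must allow creating it.
    return is_writable(dirname($filename));
}

// Possible use inside a constructor:
// if (!isCookieJarUsable($this->_cookieFilename)) {
//     throw new Exception('Cookie file is not writeable!');
// }
```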

OK, so the class will be used like this:

 
$scrap = new Wow_Scrap;
// $scrap->method($some_input_data);
// where $some_input_data holds the input parameters for the selected site, such as search terms
// e.g.
 
$scrap->Scrape_gold4power('Ragnaros','Horde');
 
// meaning: grab some content from the 'gold4power' site for the 'Ragnaros' WoW server and the 'Horde' faction

To implement the method we have to study the site with a tool such as Firebug, the Charles shareware proxy or the Tamper Data Firefox add-on, which lets us intercept and analyze HTTP headers and content. All of this is beyond the scope of the article, but I have to say that one can almost always emulate a browser’s behaviour. While working with the site you may notice that it validates some additional headers, encodes POST data in some non-standard manner, and so on. So we should tweak the getContent() method:

 
   // added $callback_encode - the function used for encoding postdata, by default the built-in 'http_build_query'
   // added $additionalHeaders, which are merged with $this->_defaultHeaders
    protected function getContent($url, $method = 'get', $data = array(),$callback_encode = 'http_build_query', $additionalHeaders = array())
    {
        $headers = array_merge($this->_defaultHeaders,$additionalHeaders);
 
        $this->_ch = curl_init($url);
 
        curl_setopt($this->_ch,CURLOPT_POST,($method == 'get' ? 0 : 1));
        curl_setopt($this->_ch,CURLOPT_RETURNTRANSFER,1);
        curl_setopt($this->_ch,CURLOPT_COOKIEJAR,$this->_cookieFilename);
        curl_setopt($this->_ch,CURLOPT_COOKIEFILE,$this->_cookieFilename);
        curl_setopt($this->_ch,CURLOPT_HTTPHEADER,$headers);
 
        if(!empty($data))
        {
            curl_setopt($this->_ch, CURLOPT_POSTFIELDS, call_user_func($callback_encode,$data));
        }
 
        $res = curl_exec($this->_ch);
 
        curl_close($this->_ch);
 
        return $res;
    }
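
To make the effect of the two new parameters concrete, here is a small stand-alone sketch of the encoding and header-merging steps. encodePostData() is a hypothetical wrapper around the same call_user_func() call, json_encode() stands in for Zend_Json_Encoder::encode() so the snippet runs without ZF, and the header values are only examples.

```php
<?php
// Hypothetical stand-alone version of the encoding step inside getContent().
function encodePostData($callback, array $data)
{
    // $callback may be a function name or an array('Class', 'method') pair.
    return call_user_func($callback, $data);
}

$fields = array('Game' => 'WOW', 'Server' => 'Ragnaros-Horde');

// Classic form encoding: Game=WOW&Server=Ragnaros-Horde
$form = encodePostData('http_build_query', $fields);

// JSON body: {"Game":"WOW","Server":"Ragnaros-Horde"}
$json = encodePostData('json_encode', $fields);

// Header lists have numeric keys, so array_merge() simply appends
// the additional headers to the default ones.
$headers = array_merge(
    array('Accept: */*'),
    array('AjaxPro-Method: CreateItemList')
);
```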

Let’s have a look at the method itself:

 
    public function Scrape_gold4power($Server, $Faction)
    {
        // to get the cookie and store it in JARFILE
        $baseUrl = 'https://www.gold4power.com/';
        $this->getContent($baseUrl);
 
        // url from Firebug that replies with our content
        $url = $baseUrl.'ajaxpro/gold.buy.list_wow,gold.ashx';
        // the Referer is the page the AJAX call is made from, built from the base url
        $referer = $baseUrl.'World-of-Warcraft-US/'.$Server.'-'.$Faction.'.html';
        // getContent() call with all params, additional headers,
        // and Zend_Json_Encoder::encode() function as postdata encoding callback
        $data = $this->getContent(
            $url,
            'post',
            array('Game' => 'WOW', 'Server' => $Server.'-'.$Faction, 'ChangeMonkey' => '1'),
            array('Zend_Json_Encoder','encode'),
            array('AjaxPro-Method: CreateItemList','Referer: '.$referer)
        );
        // the server replies in someJSFunction(JSON_data) and we need only data, so we should cut out someJSFunction( and )
        // note: these patterns remove everything up to the last '(' and every ')', so they assume the data itself contains no parentheses
        $data = preg_replace(array('/[\s\S]*?\(/im','/\)/im'),'',$data);
        // beware of invalid JSON, for this very example it returns [...],[....] and no wrapping structure, it must be either object {} or array [], so we wrap it with [] and decode
        $data = Zend_Json_Decoder::decode('['.$data.']');
 
        // perform any necessary data manipulations and return it
        $data = $data[1];
        // closure instead of create_function(), which was removed in PHP 8
        array_walk($data, function (&$item) { $item = array($item[1] => $item[3]); });
 
        return $data;
    }
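
The unwrap-and-decode step can be tried in isolation. The sketch below uses a made-up reply in the shape described above and json_decode() in place of Zend_Json_Decoder::decode(), so it runs without ZF. decodeWrappedReply() is a hypothetical helper. It cuts at the first "(" and the last ")" rather than using the regular expressions from the method, which makes it tolerant of parentheses inside the data.

```php
<?php
// Hypothetical helper: unwraps "someFunction(<payload>)" and decodes the payload.
function decodeWrappedReply($raw)
{
    $start = strpos($raw, '(');
    $end = strrpos($raw, ')');
    if ($start === false || $end === false || $end < $start) {
        return null;
    }
    $payload = substr($raw, $start + 1, $end - $start - 1);
    // The payload is a bare "[...],[...]" list, so wrap it in [] to get valid JSON.
    return json_decode('[' . $payload . ']', true);
}

// Made-up reply; the real site's data is not shown in the article.
$raw = 'someJSFunction(["header"],[[1,"1000 Gold",0,"9.99"],[2,"2000 Gold",0,"18.99"]])';
$decoded = decodeWrappedReply($raw);

// Keep the second list and map column 1 to column 3, as Scrape_gold4power() does.
$result = array();
foreach ($decoded[1] as $item) {
    $result[] = array($item[1] => $item[3]);
}
```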

To make the ZF classes load, we need to instantiate Zend_Loader_Autoloader. This can be done either in the constructor or before the class definition.

 
ini_set('include_path',ini_get('include_path').PATH_SEPARATOR.realpath('./'));
 
require_once('Zend/Loader/Autoloader.php');
 
$loader = Zend_Loader_Autoloader::getInstance();
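
This works because ZF class names map onto file paths: underscores become directory separators, so Zend_Dom_Query lives in Zend/Dom/Query.php somewhere on the include path. The sketch below shows roughly what such an autoloader does. It is a simplified, hand-rolled illustration with a hypothetical classToPath() helper, not the actual Zend_Loader_Autoloader code.

```php
<?php
// Hypothetical helper mirroring the ZF naming convention:
// underscores in a class name become directory separators.
function classToPath($class)
{
    return str_replace('_', '/', $class) . '.php';
}

// A hand-rolled autoloader built on that convention.
spl_autoload_register(function ($class) {
    // Look the file up on the include path; false means it was not found.
    $file = stream_resolve_include_path(classToPath($class));
    if ($file !== false) {
        require_once $file;
    }
});
```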
