Scraping products from Walmart with PHP, Guzzle, Crawler and Doctrine

February 2, 2016 PHP As Is, PHP Recipes

You know that web scraping is a useful technique for extracting data from websites, especially when there is no API or when there is a ton of data that can't be obtained any other way. The simplest and best-known way is to use cURL, but it is "easy" only in the sense of relying on a familiar, universal third-party tool. Our goal is to make scraping as simple as possible for the programmer.
We are going to do this in four steps with the Guzzle, Symfony DOM Crawler Component and Doctrine DBAL packages. As an example we'll take the site http://www.walmart.com.
We want to get all categories and goods from its catalogue and end up with a CSV file of the goods.

1. Install libraries

The easiest way to install all the libraries with their requirements is to use Composer. We assume you already have it installed.

1.1. Guzzle

The first thing we need for scraping is an HTTP client. We choose Guzzle.
So let's open the command line and start with the command:
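Judging by the symfony/event-dispatcher dependency mentioned below, the post targets the 3.x line of Guzzle, so the command would be:

composer require guzzle/guzzle

(On a new project you would install the current guzzlehttp/guzzle package instead and adapt the client code accordingly.)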

If the installation is successful, Composer will report the packages it has installed.

As you can see, one additional package was installed: symfony/event-dispatcher. It doesn't need your special attention, but it is a required dependency of Guzzle.

1.2. Symfony DOM Crawler Component

We also need something that will help us scrape the necessary data quickly and easily. Here we'll use the Symfony DOM Crawler Component.
Type the next command into the command line:
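composer require symfony/dom-crawler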

Composer will again confirm the installation if everything is OK.

This package downloads an additional package too, and it also suggests installing symfony/css-selector. It is very useful for scraping, so we'll accept this suggestion and run the following command:
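composer require symfony/css-selector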


1.3. Doctrine

After getting the data we'll need to save it to the DB, so a database abstraction layer will come in handy, for example Doctrine.
It is installed the same way as in the previous steps:
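composer require doctrine/dbal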


At the end of this part, let's look at our project's directory:
(Screenshot: project directory with the installed libraries)

Here we see that all libraries are in the vendor directory, and the composer.json and composer.lock files were created. Now we can move on to the next step.

2. Load HTML code of the website page

We create a file, for example scraper.php, in the root of the project's directory. First, we need to include the file vendor/autoload.php to make everything work:
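A minimal start for scraper.php:

<?php
// scraper.php, in the project root next to composer.json
require __DIR__ . '/vendor/autoload.php';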

2.1. Include Guzzle

Now we are ready to declare that the Guzzle client will be used. We also need to think about the exceptions that may be thrown by this client.
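With the 3.x Guzzle assumed above, that means imports along these lines:

use Guzzle\Http\Client;
use Guzzle\Http\Exception\RequestException;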

2.2. Create request

First of all, let's define some variables: the URL of the site and the URI of the page we want to scrape. Look at http://www.walmart.com: in the menu there is a link to the page with all departments, http://www.walmart.com/all-departments.
It is the best page for scraping categories, so that's the one we'll use.
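For example (the variable names are just our choice):

// Base URL of the site and the URI of the page we are going to scrape
$url = 'http://www.walmart.com';
$uri = '/all-departments';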

It is also necessary to set the User-Agent header. Without it, the site responds with an "Access Denied" error.
Let's copy the value from the browser; we will use Chrome's default user-agent header.
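Any recent desktop Chrome string will do; a typical one from that period:

// Pretend to be a regular desktop Chrome browser
$userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36';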

To make a request we need to create an object of the HTTP client and call its get() method.

2.3. Get response

Once the request is made, we can get the response from http://www.walmart.com. We wrap the request code in a try/catch block to handle connection issues properly. The true option is required when reading the body because we don't want to echo the page to the browser; we want to get it as a string.
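Put together, and still assuming the 3.x Guzzle API, the request and response handling looks roughly like this:

try {
    // Create the client with the base URL and request the departments page,
    // sending our User-Agent header with the request
    $client = new Client($url);
    $request = $client->get($uri, ['User-Agent' => $userAgent]);
    $response = $request->send();

    // getBody(true) returns the body as a plain string instead of a stream object
    $body = $response->getBody(true);

    // ... the scraping code from part 3 continues inside this try block ...
} catch (RequestException $e) {
    die('Request failed: ' . $e->getMessage());
}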

Now we have something to scrape, so let's move on to the next part.

3. Scrape the page

3.1. Include Crawler

At the top of the script we state that we are going to use the DOM Crawler Component and the CSS Selector.
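Only the Crawler class needs a use statement; the CSS Selector component is picked up automatically when we call filter():

use Symfony\Component\DomCrawler\Crawler;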

3.2. Get HTML block with categories

Now let's continue our try block. To scrape the received page's body we create a Crawler object and pass the body variable to it.
It is time to see what part of the page we have to extract to get the categories' titles. Inspecting it with Chrome DevTools, we see that we need the first div with the class 'all-depts-links' (see screenshot below). It will be our filter.

(Screenshot: the container with the categories)

Now we extract every child node of this div and push it into an array using the each() method with an anonymous function. After that we don't need this Crawler object anymore, so we'll unset it.
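In code, this step might look like the following (the variable names are our own choice):

// still inside the try block from part 2
$crawler = new Crawler($body);

// Take the first div.all-depts-links and collect the HTML of each of its children
$categoriesHtml = $crawler->filter('div.all-depts-links')->eq(0)->children()->each(
    function (Crawler $node) {
        return $node->html();
    }
);

// The Crawler object is no longer needed
unset($crawler);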

If we dump this array we'll see 13 elements, each a string of HTML.
Looking at it, we see that we have a three-level tree of categories: categories, their subcategories and sub-subcategories.

(Screenshot: page source code with the categories)

3.3. Get categories' titles and subcategories' HTML

Categories' titles are located in the headings, so the CSS selector for them is '.all-depts-links-heading > a'. Each subcategory sits in a separate li node, so the filter for the subcategory lists is 'ul'.
Getting only the titles from the categories' headings is possible with the method text(); to get the subcategories' HTML code we'll apply the method html().
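Putting it together, something like this (the array layout is one possible choice):

$categories = [];
foreach ($categoriesHtml as $categoryHtml) {
    $categoryCrawler = new Crawler($categoryHtml);
    $categories[] = [
        // Category title from the heading link
        'title' => trim($categoryCrawler->filter('.all-depts-links-heading > a')->text()),
        // Raw HTML of each subcategory list, kept for the next step
        'subcategoriesHtml' => $categoryCrawler->filter('ul')->each(
            function (Crawler $node) {
                return $node->html();
            }
        ),
    ];
    unset($categoryCrawler);
}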

And now we have an array with the categories' titles and the HTML code of their subcategories.

3.4. Get subcategories’ and their sub-subcategories’ data

At this step we should end up with an array holding all the data: categories' titles and their subcategories at the first level, subcategories' data and their sub-subcategories at the second level, and sub-subcategories' data at the third level.

From the screenshot above we know that CSS filter for the subcategories is 'li > a.all-depts-links-dept', and for the sub-subcategories it is 'li > a.all-depts-links-category'.

Let's think about the data we need to get from subcategories. The first field is, of course, the title. But since we are going to get products from each subcategory, we also need to know how to reach its page, so the second field will be a link, namely the URI of the subcategory's page. We need the same for sub-subcategories.
The code for this section will look like this:
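Roughly (building on the $categories array from the previous step; the key names are our own choice):

foreach ($categories as $i => $category) {
    $subcategories = [];
    foreach ($category['subcategoriesHtml'] as $subcategoryHtml) {
        $subcategoryCrawler = new Crawler($subcategoryHtml);

        // Subcategory title and link
        $deptLink = $subcategoryCrawler->filter('li > a.all-depts-links-dept');
        if (!count($deptLink)) {
            continue;
        }

        $subcategories[] = [
            'title' => trim($deptLink->text()),
            'uri'   => $deptLink->attr('href'),
            // Sub-subcategories: title and link for each one
            'subsubcategories' => $subcategoryCrawler
                ->filter('li > a.all-depts-links-category')
                ->each(function (Crawler $node) {
                    return [
                        'title' => trim($node->text()),
                        'uri'   => $node->attr('href'),
                    ];
                }),
        ];
        unset($subcategoryCrawler);
    }
    $categories[$i]['subcategories'] = $subcategories;
    unset($categories[$i]['subcategoriesHtml']);
}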

3.5. Get goods

In the Walmart catalogue some of the subcategories don’t have sub-subcategories. That’s why our task is to get goods:

  1. from the subcategories without sub-subcategories;
  2. from the sub-subcategories.

All goods are located inside a list with the classes 'tile-list tile-list-grid'. Every item has its own li node, and all of an item's data sits inside the li's child div.
Based on these observations we can form the goods filter: 'ul.tile-list.tile-list-grid > li > div'.
One more thing before we dive into the code.
While writing and testing this script we noticed that some links redirect to others. The HTTP client couldn't always follow these redirects automatically and threw an exception with a message like: "Was unable to parse malformed url: http://url/to/what/it/was/redirected". So we catch this exception, extract the URL from it and retry with the new URL.
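Here is a sketch of that loop. getPageBody() is a helper name of our own for the request-plus-retry behaviour described above, and the preg_match() call is one way to fish the redirect target out of the exception message:

function getPageBody($client, $url, $userAgent)
{
    try {
        return $client->get($url, ['User-Agent' => $userAgent])->send()->getBody(true);
    } catch (\Exception $e) {
        // If the message looks like "Was unable to parse malformed url: ...",
        // take the URL from it and retry once
        if (preg_match('#https?://\S+#', $e->getMessage(), $matches)) {
            return $client->get($matches[0], ['User-Agent' => $userAgent])->send()->getBody(true);
        }
        return null;
    }
}

$goodsFilter = 'ul.tile-list.tile-list-grid > li > div';

foreach ($categories as $ci => $category) {
    foreach ($category['subcategories'] as $si => $subcategory) {
        // Case 1: a subcategory without sub-subcategories, so scrape its own page
        if (empty($subcategory['subsubcategories'])) {
            $html = getPageBody($client, $subcategory['uri'], $userAgent);
            if ($html !== null) {
                $goodsCrawler = new Crawler($html);
                $categories[$ci]['subcategories'][$si]['goods'] =
                    $goodsCrawler->filter($goodsFilter)->each(function (Crawler $node) use ($html) {
                        return getGoodsData($html, $node);
                    });
                unset($goodsCrawler);
            }
            continue;
        }

        // Case 2: scrape every sub-subcategory page
        foreach ($subcategory['subsubcategories'] as $ssi => $subsubcategory) {
            $html = getPageBody($client, $subsubcategory['uri'], $userAgent);
            if ($html === null) {
                continue;
            }
            $goodsCrawler = new Crawler($html);
            $categories[$ci]['subcategories'][$si]['subsubcategories'][$ssi]['goods'] =
                $goodsCrawler->filter($goodsFilter)->each(function (Crawler $node) use ($html) {
                    return getGoodsData($html, $node);
                });
            unset($goodsCrawler);
        }
    }
}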

In the code above we used the as-yet-undefined function getGoodsData($html, $node). Before we write it, we need to decide what data we want to get for each item. Looking at some sub-subcategory's page (see screenshot below) we can see the main attributes a product has. Sometimes, though, some fields are absent: in the screenshots below there is no price, or there are no rating and reviews.

(Screenshot: an item's attributes)

We should guard against these cases, so it makes sense to set default values for these variables. It is also necessary to check whether these spans are present in the received HTML code.
Let's also prepare filters for the data:

(Screenshot: CSS selectors for price, title, rating, reviews and image)

Now we can write the function.
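Here is a sketch of it. The selector strings below are placeholders that show the shape of the presence checks; take the real ones from the screenshot above.

function getGoodsData($html, Crawler $node)
{
    // $html is the full page HTML, passed through to match the signature used above;
    // this sketch only needs the tile node itself.

    // Default values for the fields that may be missing on a tile
    $data = [
        'title'   => '',
        'price'   => 0,
        'rating'  => 0,
        'reviews' => 0,
        'image'   => '',
    ];

    // Placeholder selectors: replace them with the ones from the screenshot above
    $titleNode = $node->filter('a.js-product-title');
    if (count($titleNode)) {
        $data['title'] = trim($titleNode->text());
    }

    $priceNode = $node->filter('span.price-display');
    if (count($priceNode)) {
        $data['price'] = trim($priceNode->text());
    }

    $ratingNode = $node->filter('span.stars-rating');
    if (count($ratingNode)) {
        $data['rating'] = trim($ratingNode->text());
    }

    $reviewsNode = $node->filter('span.stars-reviews-count');
    if (count($reviewsNode)) {
        $data['reviews'] = trim($reviewsNode->text());
    }

    $imageNode = $node->filter('img');
    if (count($imageNode)) {
        $data['image'] = $imageNode->attr('src');
    }

    return $data;
}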

4. Push data into DB

4.0. Create DB

Let's create a database called 'walmart'. We can do this easily within phpMyAdmin or the MySQL console. Here is an SQL script:
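For example (the utf8 charset is simply a sensible default):

CREATE DATABASE walmart CHARACTER SET utf8 COLLATE utf8_general_ci;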

4.1. Include Doctrine

To include Doctrine into our script we’ll do three steps:
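The Composer autoloader is already required at the top of the script; besides that, we import the two DBAL classes we are going to use:

use Doctrine\DBAL\Configuration;
use Doctrine\DBAL\DriverManager;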

4.2. Configure and set connection

The next step is to create a Doctrine configuration object and establish the connection.
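Something along these lines, with the credentials adjusted to your MySQL setup:

$config = new Configuration();
$connectionParams = [
    'dbname'   => 'walmart',
    'user'     => 'root',
    'password' => '',
    'host'     => 'localhost',
    'driver'   => 'pdo_mysql',
    'charset'  => 'utf8',
];
$conn = DriverManager::getConnection($connectionParams, $config);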

4.3. Create tables

First, let's write the SQL code into an SQL file.
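A possible schema that matches the data collected above (the exact column types are a matter of taste); let's save it as tables.sql:

CREATE TABLE categories (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL
);

CREATE TABLE subcategories (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    category_id INT UNSIGNED NOT NULL,
    title       VARCHAR(255) NOT NULL,
    link        VARCHAR(255) NOT NULL
);

CREATE TABLE subsubcategories (
    id             INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    subcategory_id INT UNSIGNED NOT NULL,
    title          VARCHAR(255) NOT NULL,
    link           VARCHAR(255) NOT NULL
);

CREATE TABLE goods (
    id                INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    subcategory_id    INT UNSIGNED DEFAULT NULL,
    subsubcategory_id INT UNSIGNED DEFAULT NULL,
    title             VARCHAR(255) NOT NULL,
    price             VARCHAR(32)  DEFAULT NULL,
    rating            VARCHAR(16)  DEFAULT NULL,
    reviews           VARCHAR(16)  DEFAULT NULL,
    image             VARCHAR(255) DEFAULT NULL
);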

Now we execute it in the command line:
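Assuming the file from the previous step is called tables.sql and the same credentials as above:

mysql -u root -p walmart < tables.sql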

Let's verify that our tables were created.

(Screenshot: the database tables)

4.4. Insert goods into DB

At last we get to the final part.
Here we are going to walk through the array and insert every category, subcategory, sub-subcategory and goods item. Notice that we need to create variables for the subcategory and sub-subcategory indices and increment them on every iteration, because the array indices start from zero again at every level.
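A sketch of that loop using DBAL's insert() method; the manual $categoryId/$subcategoryId/$subsubcategoryId counters play the role of those indices, and the column names match the schema sketched in 4.3:

$categoryId = 0;
$subcategoryId = 0;
$subsubcategoryId = 0;

foreach ($categories as $category) {
    $categoryId++;
    $conn->insert('categories', ['id' => $categoryId, 'title' => $category['title']]);

    foreach ($category['subcategories'] as $subcategory) {
        $subcategoryId++;
        $conn->insert('subcategories', [
            'id'          => $subcategoryId,
            'category_id' => $categoryId,
            'title'       => $subcategory['title'],
            'link'        => $subcategory['uri'],
        ]);

        // Goods attached directly to a subcategory (no sub-subcategories)
        if (!empty($subcategory['goods'])) {
            foreach ($subcategory['goods'] as $item) {
                $conn->insert('goods', array_merge($item, ['subcategory_id' => $subcategoryId]));
            }
        }

        if (empty($subcategory['subsubcategories'])) {
            continue;
        }

        foreach ($subcategory['subsubcategories'] as $subsubcategory) {
            $subsubcategoryId++;
            $conn->insert('subsubcategories', [
                'id'             => $subsubcategoryId,
                'subcategory_id' => $subcategoryId,
                'title'          => $subsubcategory['title'],
                'link'           => $subsubcategory['uri'],
            ]);

            if (empty($subsubcategory['goods'])) {
                continue;
            }
            foreach ($subsubcategory['goods'] as $item) {
                $conn->insert('goods', array_merge($item, ['subsubcategory_id' => $subsubcategoryId]));
            }
        }
    }
}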

Now we have all the goods from the first page of every category in our database. To finish, let's export the 'goods' table to a CSV file.

Dustin Dinsmore

March 5, 2018 at 4:14 pm

"2. Load HTML code of the website page. We create file, for example, scraper.php in the root of project's directory. Firstly, we need to include file vendor/autoload.php to make everything work."
I am using Symfony 4.... This does not seem right. Where do you put this file exactly? Because it seems to me it should go in the src folder?

    admin (Post Author)

    March 6, 2018 at 7:04 am

    Hi Dustin, thanks for commenting! This tutorial was created with the thought that you don't use any MVC framework for the task, i.e. only Guzzle + doctrine/dbal + the Symfony DomCrawler (which is just a component of the Symfony framework). You can definitely use sf4 to create a similar scraper, but in that case I'd do it as a Symfony console command (read more here: https://symfony.com/doc/current/console.html). Note that the framework does all the autoloading tasks for you, so if you prefer to go with a console command, you just need to specify the namespaces used; no need to do any require/includes. Hope you can create such a command for your Symfony project!

