Scraping products from Walmart with PHP, Guzzle, Crawler and Doctrine

February 2, 2016 PHP As Is, PHP Recipes

You know that web scraping is a useful technique for extracting data from websites, especially when there is no API or when there is a ton of data that can't be obtained any other way. The simplest and best-known way is to use cURL, but it is "easy" only in the sense of relying on a familiar, universal third-party tool. Our goal is to make scraping as simple as possible for the programmer.
We are going to do this in four steps with the Guzzle, Symfony DOM Crawler Component and Doctrine DBAL packages. As an example we'll take the site http://www.walmart.com.
We want to get all categories and goods from its catalogue and end up with a CSV file of the goods.

1. Install libraries

The easiest way to install all the libraries with their requirements is to use Composer. We assume you already have it installed.

1.1. Guzzle

The first thing we need for scraping is an HTTP client. We choose Guzzle.
So let's open the command line and start with the command:
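Judging by the symfony/event-dispatcher dependency mentioned below, the post targets the 3.x line of Guzzle, so the command would be:

composer require guzzle/guzzle

(On a new project you would install the current guzzlehttp/guzzle package instead and adapt the client code accordingly.)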

If the installation is successful, Composer will report the packages it has installed.

As you can see, one additional package was installed: symfony/event-dispatcher. It doesn't need your special attention, but it is a required dependency of Guzzle.

1.2. Symfony DOM Crawler Component

We also need something that will help us scrape the necessary data quickly and easily. Here we'll use the Symfony DOM Crawler Component.
Type the next command into the command line:
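composer require symfony/dom-crawler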

Composer will again confirm the installation if everything is OK.

This package downloads an additional package too, and it also suggests installing symfony/css-selector. It is very useful for scraping, so we'll accept this suggestion and run the following command:
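composer require symfony/css-selector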


1.3. Doctrine

After getting the data we'll need to save it to the DB, so a database abstraction layer will come in handy, for example Doctrine.
It is installed the same way as in the previous steps:
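composer require doctrine/dbal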


At the end of this part, let's look at our project's directory:
(Screenshot: project directory with the installed libraries)

Here we see that all libraries are in the vendor directory, and the composer.json and composer.lock files were created. Now we can move on to the next step.

2. Load HTML code of the website page

We create a file, for example scraper.php, in the root of the project's directory. First, we need to include the file vendor/autoload.php to make everything work:
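A minimal start for scraper.php:

<?php
// scraper.php, in the project root next to composer.json
require __DIR__ . '/vendor/autoload.php';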

2.1. Include Guzzle

Now we are ready to declare that the Guzzle client will be used. We also need to think about the exceptions that may be thrown by this client.
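With the 3.x Guzzle assumed above, that means imports along these lines:

use Guzzle\Http\Client;
use Guzzle\Http\Exception\RequestException;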

2.2. Create request

First of all, let's define some variables: the URL of the site and the URI of the page we want to scrape. Look at http://www.walmart.com: in the menu there is a link to the page with all departments, http://www.walmart.com/all-departments.
It is the best page for scraping categories, so that's the one we'll use.
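For example (the variable names are just our choice):

// Base URL of the site and the URI of the page we are going to scrape
$url = 'http://www.walmart.com';
$uri = '/all-departments';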

It is also necessary to set the User-Agent header. Without it, the site responds with an "Access Denied" error.
Let's copy the value from the browser; we will use Chrome's default user-agent header.
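Any recent desktop Chrome string will do; a typical one from that period:

// Pretend to be a regular desktop Chrome browser
$userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36';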

To make a request we need to create an object of the HTTP client and call its get() method.

2.3. Get response

Once the request is made, we can get the response from http://www.walmart.com. We wrap the request code in a try/catch block to handle connection issues properly. The true option is required when reading the body because we don't want to echo the page to the browser; we want to get it as a string.
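Put together, and still assuming the 3.x Guzzle API, the request and response handling looks roughly like this:

try {
    // Create the client with the base URL and request the departments page,
    // sending our User-Agent header with the request
    $client = new Client($url);
    $request = $client->get($uri, ['User-Agent' => $userAgent]);
    $response = $request->send();

    // getBody(true) returns the body as a plain string instead of a stream object
    $body = $response->getBody(true);

    // ... the scraping code from part 3 continues inside this try block ...
} catch (RequestException $e) {
    die('Request failed: ' . $e->getMessage());
}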

Now we have something to scrape, so let's move on to the next part.

3. Scrape the page

3.1. Include Crawler

At the top of the script we state that we are going to use the DOM Crawler Component and the CSS Selector.
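Only the Crawler class needs a use statement; the CSS Selector component is picked up automatically when we call filter():

use Symfony\Component\DomCrawler\Crawler;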

3.2. Get HTML block with categories

Now let's continue our try block. To scrape the received page's body we create a Crawler object and pass the body variable to it.
It is time to see what part of the page we have to extract to get the categories' titles. Inspecting it with Chrome DevTools, we see that we need the first div with the class 'all-depts-links' (see screenshot below). It will be our filter.

(Screenshot: the container with the categories)

Now we extract every child node of this div and push it into an array using the each() method with an anonymous function. After that we don't need this Crawler object anymore, so we'll unset it.
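In code, this step might look like the following (the variable names are our own choice):

// still inside the try block from part 2
$crawler = new Crawler($body);

// Take the first div.all-depts-links and collect the HTML of each of its children
$categoriesHtml = $crawler->filter('div.all-depts-links')->eq(0)->children()->each(
    function (Crawler $node) {
        return $node->html();
    }
);

// The Crawler object is no longer needed
unset($crawler);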

If we dump this array we'll see 13 elements, each a string of HTML.
Looking at it, we see that we have a three-level tree of categories: categories, their subcategories and sub-subcategories.

(Screenshot: page source code with the categories)

3.3. Get categories' titles and subcategories' HTML

Categories' titles are located in the headings, so the CSS selector for them is '.all-depts-links-heading > a'. Each subcategory sits in a separate li node, so the filter for the subcategory lists is 'ul'.
Getting only the titles from the categories' headings is possible with the method text(); to get the subcategories' HTML code we'll apply the method html().
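Putting it together, something like this (the array layout is one possible choice):

$categories = [];
foreach ($categoriesHtml as $categoryHtml) {
    $categoryCrawler = new Crawler($categoryHtml);
    $categories[] = [
        // Category title from the heading link
        'title' => trim($categoryCrawler->filter('.all-depts-links-heading > a')->text()),
        // Raw HTML of each subcategory list, kept for the next step
        'subcategoriesHtml' => $categoryCrawler->filter('ul')->each(
            function (Crawler $node) {
                return $node->html();
            }
        ),
    ];
    unset($categoryCrawler);
}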

And now we have an array with the categories' titles and the HTML code of their subcategories.

3.4. Get subcategories’ and their sub-subcategories’ data

At this step we should end up with an array holding all the data: categories' titles and their subcategories at the first level, subcategories' data and their sub-subcategories at the second level, and sub-subcategories' data at the third level.

From the screenshot above we know that CSS filter for the subcategories is 'li > a.all-depts-links-dept', and for the sub-subcategories it is 'li > a.all-depts-links-category'.

Let's think about the data we need to get from subcategories. The first field is, of course, the title. But since we are going to get products from each subcategory, we also need to know how to reach its page, so the second field will be a link, namely the URI of the subcategory's page. We need the same for sub-subcategories.
The code for this section will look like this:
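Roughly (building on the $categories array from the previous step; the key names are our own choice):

foreach ($categories as $i => $category) {
    $subcategories = [];
    foreach ($category['subcategoriesHtml'] as $subcategoryHtml) {
        $subcategoryCrawler = new Crawler($subcategoryHtml);

        // Subcategory title and link
        $deptLink = $subcategoryCrawler->filter('li > a.all-depts-links-dept');
        if (!count($deptLink)) {
            continue;
        }

        $subcategories[] = [
            'title' => trim($deptLink->text()),
            'uri'   => $deptLink->attr('href'),
            // Sub-subcategories: title and link for each one
            'subsubcategories' => $subcategoryCrawler
                ->filter('li > a.all-depts-links-category')
                ->each(function (Crawler $node) {
                    return [
                        'title' => trim($node->text()),
                        'uri'   => $node->attr('href'),
                    ];
                }),
        ];
        unset($subcategoryCrawler);
    }
    $categories[$i]['subcategories'] = $subcategories;
    unset($categories[$i]['subcategoriesHtml']);
}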

3.5. Get goods

In the Walmart catalogue some of the subcategories don’t have sub-subcategories. That’s why our task is to get goods:

  1. from the subcategories without sub-subcategories;
  2. from the sub-subcategories.

All goods are located inside a list with the classes 'tile-list tile-list-grid'. Every item has its own li node, and all of an item's data sits inside the li's child div.
Based on these observations we can form the goods filter: 'ul.tile-list.tile-list-grid > li > div'.
One more thing before we dive into the code.
While writing and testing this script we noticed that some links redirect to others. The HTTP client couldn't always follow these redirects automatically and threw an exception with a message like: "Was unable to parse malformed url: http://url/to/what/it/was/redirected". So we catch this exception, extract the URL from it and retry with the new URL.
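Here is a sketch of that loop. getPageBody() is a helper name of our own for the request-plus-retry behaviour described above, and the preg_match() call is one way to fish the redirect target out of the exception message:

function getPageBody($client, $url, $userAgent)
{
    try {
        return $client->get($url, ['User-Agent' => $userAgent])->send()->getBody(true);
    } catch (\Exception $e) {
        // If the message looks like "Was unable to parse malformed url: ...",
        // take the URL from it and retry once
        if (preg_match('#https?://\S+#', $e->getMessage(), $matches)) {
            return $client->get($matches[0], ['User-Agent' => $userAgent])->send()->getBody(true);
        }
        return null;
    }
}

$goodsFilter = 'ul.tile-list.tile-list-grid > li > div';

foreach ($categories as $ci => $category) {
    foreach ($category['subcategories'] as $si => $subcategory) {
        // Case 1: a subcategory without sub-subcategories, so scrape its own page
        if (empty($subcategory['subsubcategories'])) {
            $html = getPageBody($client, $subcategory['uri'], $userAgent);
            if ($html !== null) {
                $goodsCrawler = new Crawler($html);
                $categories[$ci]['subcategories'][$si]['goods'] =
                    $goodsCrawler->filter($goodsFilter)->each(function (Crawler $node) use ($html) {
                        return getGoodsData($html, $node);
                    });
                unset($goodsCrawler);
            }
            continue;
        }

        // Case 2: scrape every sub-subcategory page
        foreach ($subcategory['subsubcategories'] as $ssi => $subsubcategory) {
            $html = getPageBody($client, $subsubcategory['uri'], $userAgent);
            if ($html === null) {
                continue;
            }
            $goodsCrawler = new Crawler($html);
            $categories[$ci]['subcategories'][$si]['subsubcategories'][$ssi]['goods'] =
                $goodsCrawler->filter($goodsFilter)->each(function (Crawler $node) use ($html) {
                    return getGoodsData($html, $node);
                });
            unset($goodsCrawler);
        }
    }
}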

In the code above we used the as-yet-undefined function getGoodsData($html, $node). Before we write it, we need to decide what data we want to get for each item. Looking at some sub-subcategory's page (see screenshot below) we can see the main attributes a product has. Sometimes, though, some fields are absent: in the screenshots below there is no price, or there are no rating and reviews.

(Screenshot: an item's attributes)

We should guard against these cases, so it makes sense to set default values for these variables. It is also necessary to check whether these spans are present in the received HTML code.
Let's also prepare filters for the data:

(Screenshot: CSS selectors for price, title, rating, reviews and image)

Now we can write the function.
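Here is a sketch of it. The selector strings below are placeholders that show the shape of the presence checks; take the real ones from the screenshot above.

function getGoodsData($html, Crawler $node)
{
    // $html is the full page HTML, passed through to match the signature used above;
    // this sketch only needs the tile node itself.

    // Default values for the fields that may be missing on a tile
    $data = [
        'title'   => '',
        'price'   => 0,
        'rating'  => 0,
        'reviews' => 0,
        'image'   => '',
    ];

    // Placeholder selectors: replace them with the ones from the screenshot above
    $titleNode = $node->filter('a.js-product-title');
    if (count($titleNode)) {
        $data['title'] = trim($titleNode->text());
    }

    $priceNode = $node->filter('span.price-display');
    if (count($priceNode)) {
        $data['price'] = trim($priceNode->text());
    }

    $ratingNode = $node->filter('span.stars-rating');
    if (count($ratingNode)) {
        $data['rating'] = trim($ratingNode->text());
    }

    $reviewsNode = $node->filter('span.stars-reviews-count');
    if (count($reviewsNode)) {
        $data['reviews'] = trim($reviewsNode->text());
    }

    $imageNode = $node->filter('img');
    if (count($imageNode)) {
        $data['image'] = $imageNode->attr('src');
    }

    return $data;
}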

4. Push data into DB

4.0. Create DB

Let's create a database called 'walmart'. We can do this easily within phpMyAdmin or the MySQL console. Here is an SQL script:
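For example (the utf8 charset is simply a sensible default):

CREATE DATABASE walmart CHARACTER SET utf8 COLLATE utf8_general_ci;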

4.1. Include Doctrine

To include Doctrine into our script we’ll do three steps:
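The Composer autoloader is already required at the top of the script; besides that, we import the two DBAL classes we are going to use:

use Doctrine\DBAL\Configuration;
use Doctrine\DBAL\DriverManager;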

4.2. Configure and set connection

The next step is to create a Doctrine configuration object and establish the connection.
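Something along these lines, with the credentials adjusted to your MySQL setup:

$config = new Configuration();
$connectionParams = [
    'dbname'   => 'walmart',
    'user'     => 'root',
    'password' => '',
    'host'     => 'localhost',
    'driver'   => 'pdo_mysql',
    'charset'  => 'utf8',
];
$conn = DriverManager::getConnection($connectionParams, $config);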

4.3. Create tables

First, let's write the SQL code into an SQL file.
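A possible schema that matches the data collected above (the exact column types are a matter of taste); let's save it as tables.sql:

CREATE TABLE categories (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL
);

CREATE TABLE subcategories (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    category_id INT UNSIGNED NOT NULL,
    title       VARCHAR(255) NOT NULL,
    link        VARCHAR(255) NOT NULL
);

CREATE TABLE subsubcategories (
    id             INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    subcategory_id INT UNSIGNED NOT NULL,
    title          VARCHAR(255) NOT NULL,
    link           VARCHAR(255) NOT NULL
);

CREATE TABLE goods (
    id                INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    subcategory_id    INT UNSIGNED DEFAULT NULL,
    subsubcategory_id INT UNSIGNED DEFAULT NULL,
    title             VARCHAR(255) NOT NULL,
    price             VARCHAR(32)  DEFAULT NULL,
    rating            VARCHAR(16)  DEFAULT NULL,
    reviews           VARCHAR(16)  DEFAULT NULL,
    image             VARCHAR(255) DEFAULT NULL
);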

Now we execute it in the command line:
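Assuming the file from the previous step is called tables.sql and the same credentials as above:

mysql -u root -p walmart < tables.sql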

Let's verify that our tables were created.

(Screenshot: the database tables)

4.4. Insert goods into DB

At last we get to the final part.
Here we are going to walk through the array and insert every category, subcategory, sub-subcategory and goods item. Notice that we need to create variables for the subcategory and sub-subcategory indices and increment them on every iteration, because the array indices start from zero again at every level.
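A sketch of that loop using DBAL's insert() method; the manual $categoryId/$subcategoryId/$subsubcategoryId counters play the role of those indices, and the column names match the schema sketched in 4.3:

$categoryId = 0;
$subcategoryId = 0;
$subsubcategoryId = 0;

foreach ($categories as $category) {
    $categoryId++;
    $conn->insert('categories', ['id' => $categoryId, 'title' => $category['title']]);

    foreach ($category['subcategories'] as $subcategory) {
        $subcategoryId++;
        $conn->insert('subcategories', [
            'id'          => $subcategoryId,
            'category_id' => $categoryId,
            'title'       => $subcategory['title'],
            'link'        => $subcategory['uri'],
        ]);

        // Goods attached directly to a subcategory (no sub-subcategories)
        if (!empty($subcategory['goods'])) {
            foreach ($subcategory['goods'] as $item) {
                $conn->insert('goods', array_merge($item, ['subcategory_id' => $subcategoryId]));
            }
        }

        if (empty($subcategory['subsubcategories'])) {
            continue;
        }

        foreach ($subcategory['subsubcategories'] as $subsubcategory) {
            $subsubcategoryId++;
            $conn->insert('subsubcategories', [
                'id'             => $subsubcategoryId,
                'subcategory_id' => $subcategoryId,
                'title'          => $subsubcategory['title'],
                'link'           => $subsubcategory['uri'],
            ]);

            if (empty($subsubcategory['goods'])) {
                continue;
            }
            foreach ($subsubcategory['goods'] as $item) {
                $conn->insert('goods', array_merge($item, ['subsubcategory_id' => $subsubcategoryId]));
            }
        }
    }
}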

Now we have all the goods from the first page of every category in our database. To finish, let's export the 'goods' table to a CSV file.

Dustin Dinsmore

March 5, 2018 at 4:14 pm

"2. Load HTML code of the website page. We create file, for example, scraper.php in the root of project's directory. Firstly, we need to include file vendor/autoload.php to make everything work."
I am using Symfony 4.... This does not seem right. Where do you put this file exactly? Because it seems to me it should go in the src folder?

    admin (Post Author)

    March 6, 2018 at 7:04 am

    Hi Dustin, thanks for commenting! This tutorial was created with the thought that you don't use any MVC framework for the task, i.e. only Guzzle + doctrine/dbal + the Symfony DomCrawler (which is just a component of the Symfony framework). You can definitely use sf4 to create a similar scraper, but in that case I'd do it as a Symfony console command (read more here: https://symfony.com/doc/current/console.html). Note that the framework does all the autoloading tasks for you, so if you prefer to go with a console command, you just need to specify the namespaces used; no need to do any require/includes. Hope you can create such a command for your Symfony project!

