Page Scraping

This tutorial will walk you through the process of page scraping.

It covers what page scraping is, the legalities and ethics involved, and an example of page scraping with a PHP script.

Extracting the Data

As a computer can't think like a human, it can't pull the required sales rank from the page without a little help. We have to tell it where on the page the sales rank can be found.

If you look at the page on Amazon, you'll see that the sales rank is approximately halfway down the page. The relevant section is shown in the screenshot below.

The next step is to look at the HTML source for the page. Select View Source in your browser, and then search for Rank: within the source code. This will take you to the correct point, where you should see HTML similar to the line shown below.

<b>Amazon.com Sales Rank: </b> 18,957 <br>

This markup should stay the same each time the page is requested, apart from the sales rank number, which may differ each time. So we can tell PHP exactly where to look for the data we require. To do this we use "Regular Expressions", which are a way of describing patterns in strings of characters. This is done with the code below:

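<?php
    // $file holds the HTML of the page as a string.
    // On a match, $match[1] receives the text captured by (.*).
    if (preg_match("/<b>Amazon.com\sSales\sRank:\s<\/b>\s(.*)\s/i", $file, $match)) {
        $result = $match[1];

        echo $result;
    }
?>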

Regular expressions are fairly complex and could easily fill many tutorials on their own, so we'll just skim over the basics. First, we'll look at the regular expression used:

/<b>Amazon.com\sSales\sRank:\s<\/b>\s(.*)\s/i

This looks complicated, but it can be created easily if it's done step by step. We start with the base HTML source code that we need to extract the data from:

<b>Amazon.com Sales Rank: </b> 18,957

A regular expression passed to preg_match() must be wrapped in a pair of delimiters, which tell PHP where the expression starts and ends. The / character is the usual choice, so we'll add it around our source code above to make:

/<b>Amazon.com Sales Rank: </b> 18,957 /i

We've also added the modifier i after the closing delimiter, to tell PHP that the search is not case sensitive: the regular expression should match uppercase or lowercase letters.
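The effect of the i modifier can be seen with a small test. This is only a sketch: the variable names are made up for illustration, and a shortened pattern is used so that no other part of the expression gets in the way.

```php
<?php
// One line of HTML in the format shown earlier in the tutorial.
$html = '<b>Amazon.com Sales Rank: </b> 18,957 <br>';

// The pattern is written in lowercase, but the page uses "Sales Rank:".
$withoutI = preg_match('/sales rank:/', $html);   // case sensitive
$withI    = preg_match('/sales rank:/i', $html);  // case insensitive

var_dump($withoutI, $withI);
```

Without the modifier the lowercase pattern finds nothing; with it, the same pattern matches.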

Next we need to alter the tag </b>, because PHP would treat its slash as the closing delimiter and read everything after it as modifiers. To stop this we add a \ character in front of the /, which tells PHP to treat it as literal text, so that it just looks for </b>. This is shown below:

/<b>Amazon.com Sales Rank: <\/b> 18,957 /i
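As an aside, PHP accepts delimiters other than /, and picking one that does not appear in the HTML avoids the escaping altogether. The sketch below assumes # as the delimiter, which is a common choice rather than anything this tutorial relies on; the variable names are illustrative.

```php
<?php
// One line of HTML in the format shown earlier in the tutorial.
$html = '<b>Amazon.com Sales Rank: </b> 18,957 <br>';

// Slash delimiters: the slash inside </b> has to be escaped.
$slashDelimited = '/<b>Amazon.com Sales Rank: <\/b> 18,957 /i';

// Hash delimiters: the slash inside </b> can be left alone.
$hashDelimited = '#<b>Amazon.com Sales Rank: </b> 18,957 #i';

$slashResult = preg_match($slashDelimited, $html);
$hashResult  = preg_match($hashDelimited, $html);

var_dump($slashResult, $hashResult);
```

Both patterns describe the same text, so both calls report a match. The rest of this tutorial keeps the / delimiter.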

Our next job is to replace each space with \s, which tells PHP to match a single whitespace character (a space, a tab or a line break). This makes the expression:

/<b>Amazon.com\sSales\sRank:\s<\/b>\s18,957\s/i
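To see what \s does and does not match, it can be tried against a few test strings. The strings and variable names below are made up for illustration. The last line uses \s+, which is not part of the tutorial's expression, to show how a run of whitespace could be allowed.

```php
<?php
// \s matches exactly one whitespace character.
$pattern = '/Sales\sRank:/';

$withSpace     = preg_match($pattern, "Sales Rank:");   // one space
$withTab       = preg_match($pattern, "Sales\tRank:");  // one tab
$withNewline   = preg_match($pattern, "Sales\nRank:");  // one line break
$withTwoSpaces = preg_match($pattern, "Sales  Rank:");  // two spaces

// \s+ matches one or more whitespace characters.
$flexible = preg_match('/Sales\s+Rank:/', "Sales  Rank:");

var_dump($withSpace, $withTab, $withNewline, $withTwoSpaces, $flexible);
```

The first three calls match, the two-space string does not, and the \s+ version matches it again.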

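Our final job is to tell PHP which section to extract, in this case the sales number. We do this by substituting the sales rank number with (.*). The parentheses mark the part to capture, and .* matches any run of characters other than a line break. Because .* is greedy, it takes as much of the line as it can while still leaving a whitespace character for the final \s to match, so the capture ends just before the last whitespace character it can reach on that line, not the first. Our finished expression is: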

/<b>Amazon.com\sSales\sRank:\s<\/b>\s(.*)\s/i
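One thing worth checking is how much text (.*) actually captures. The sketch below runs the finished expression against a made-up two-line sample, then compares it with a stricter variant that captures only digits and commas and escapes the dot in Amazon.com. The stricter pattern and the variable names are assumptions for illustration, not part of the original script.

```php
<?php
// Two lines of sample HTML; the second line is invented for the test.
$file = "<b>Amazon.com Sales Rank: </b> 18,957 <br>\n<b>Next:</b> more text";

// The tutorial's expression: (.*) runs to the end of the line, and the
// final \s then matches the line break.
preg_match('/<b>Amazon.com\sSales\sRank:\s<\/b>\s(.*)\s/i', $file, $greedy);

// A stricter variant: capture only digits and commas.
preg_match('/<b>Amazon\.com\sSales\sRank:\s<\/b>\s([\d,]+)/i', $file, $strict);

var_dump($greedy[1], $strict[1]);
```

With this sample the first capture is "18,957 <br>", while the stricter one is "18,957". When the result is echoed into a web page the extra <br> is not visible, which is why it is easy to miss.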

We can now use this with the PHP preg_match() function, passing it the regular expression, the variable $file that holds the data, and $match, which is where the function will store the results. preg_match() returns 1 if the expression matched, 0 if it did not, and false if an error occurred. It fills $match with an array, where the first element ($match[0]) contains the full text matched by the expression, and the second element ($match[1]) contains the text captured by the parentheses, which is the extracted data.
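The return value and the contents of $match can be inspected with a short test. This is a sketch using a single sample line in place of the real page, and the variable names other than $file and $match are made up for illustration.

```php
<?php
$pattern = '/<b>Amazon.com\sSales\sRank:\s<\/b>\s(.*)\s/i';

// A single sample line standing in for the downloaded page.
$file = '<b>Amazon.com Sales Rank: </b> 18,957 <br>';

$found = preg_match($pattern, $file, $match);

// $match[0] is everything the expression matched,
// $match[1] is the part captured by the parentheses.
var_dump($found, $match);

// When nothing matches, preg_match() returns 0 and the array is empty.
$missing = preg_match($pattern, 'no sales rank on this page', $none);

var_dump($missing, $none);
```

Checking the return value before reading $match[1] avoids a warning about a missing array element when the page layout changes.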

Finally we echo the data to the screen.

If you're interested in learning more about regular expressions, there are a number of books available dedicated to them. Alternatively there are many tutorials on the Web which go much deeper into explaining how they work, and how to build your own.

Gareth Downes-Powell

Gareth has a range of skills, covering many computer and internet related subjects. He is proficient in many different languages including ASP and PHP, and is responsible for the setup and maintenance of both Windows and Linux servers on a daily basis.


In his daily web development work he uses the complete range of Macromedia software, including Dreamweaver MX, Flash MX, Fireworks MX and Director to build a number of websites and applications. Gareth has a close relationship with Macromedia, and as a member of Team Macromedia Dreamweaver, he has worked closely in the development of Dreamweaver, and was a beta tester for Dreamweaver MX.


On a daily basis he provides support for users in the Macromedia forums, answering questions and providing help on a range of different web related subjects. He has also written a number of free and commercial extensions for Dreamweaver MX, to further extend its capabilities using its native JavaScript APIs or C++.


As a web host, Gareth has worked with a range of different servers and operating systems, with the Linux OS as his personal favourite. Most of his development work is done using a combination of Linux, Apache and MySQL and he has written extensively about setting up this type of system, and also running Apache and MySQL under Windows.


Comments

scraping and asp

April 21, 2004 by briant sylvain78
The problem I have is how to do scraping for jobs... job scraping on an ASP website.

This tutorial is out of date I think!

May 22, 2006 by j ll

When I try to get this code working, a blank page is returned (i.e. no info between the body tags).

The Amazon information that is used in this example no longer works as Amazon have changed the format. Could you please provide an example that works? It's a deadly bit of code if it works. Thank you, John L

Not quite fixed but I tried

July 6, 2009 by Jason Guritz

The script works if..

 

Just take another look at the tutorial and the new page

Or change it to what I have here.

preg_match("/<b>Amazon.com\sSales\sRank:<\/b>\s(.*)\s/i",$file,$match);
