Page Scraping

This tutorial walks you through page scraping: what it is, the legal and ethical issues involved, and an example of page scraping with a PHP script.

Page Scraping - Legalities and Ethics

Page scraping is a fantastic technique, but it can also be abused. For example, you could display someone else's information as if it were your own, similar to the way that you can open someone else's pages in your own web page using frames.

If you're using the information for your own personal use, such as extracting share prices from a list so that you see only the shares in your portfolio, then it should be fine. In reality, it's no different to viewing the original page in your browser.

If, however, you want to display information from another site on your web site, you must ask permission from the site that has the original content. Otherwise, you risk infringing copyright and could end up on the wrong end of a lawsuit. Page scraping leaves an entry in the server log, just as a request from your browser does, so the original content owner will be able to see what you've been up to. Many regular log entries pointing to your web site are sure to alert the site owner to what you're doing. Use the technique sensibly and ethically, and you shouldn't have any problems.

Creating the Example

Now that we've looked at the legalities of page scraping, we can move on to actually creating the code, which is surprisingly simple.

As an example, we'll look at one of the situations mentioned earlier: reading a book's sales ranking from Amazon. We're going to extract the sales ranking of the book "Dreamweaver MX: Advanced PHP Web Development".

The code needs to do the following:

1. Read in the web page from Amazon

2. Extract and display the sales rank

We'll look at these parts in order.

Reading a Web Page from a URL

To start with, we need to find the URL of the page from which we want to extract the information. You can do this by going to http://www.amazon.com and searching for the book. If you look at the URL, you can see that it's in the following format:

http://www.amazon.com/exec/obidos/ASIN/1904151191/

where 1904151191 is the book's ISBN. To check a different book, you just need to change the ISBN in the URL. Now that we have a URL to work with, we can start on the code to read that particular page:

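<?php

// URL of the Amazon page for the book (the number is the book's ISBN)
$url = "http://www.amazon.com/exec/obidos/ASIN/1904151191/";

// Start with an empty string so that .= has something to append to
$file = "";

// Open a read-only connection to the page
$filepointer = fopen($url, "r");
if ($filepointer) {
    // Keep reading, one line (at most 4095 bytes) at a time, until the end
    while (!feof($filepointer)) {
        $buffer = fgets($filepointer, 4096);
        $file .= $buffer;
    }
    fclose($filepointer);
} else {
    die("Could not create a connection to Amazon.com");
}
?>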

First of all, we place the URL of the page that we're going to read in a variable, $url. Next we use the PHP fopen() command, passing it the URL and the parameter "r", which indicates that we want to read from the page only (this is the only option you have when you are opening a web page). The fopen() command creates a connection to the server at Amazon, and returns a pointer to the connection. Every time that we want to use this connection, we have to pass this pointer so that PHP knows which connection to work with, especially if you have more than one connection open at a time.
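One practical point about fopen() and URLs: PHP only opens http:// addresses with fopen() when the allow_url_fopen configuration setting is switched on. The following is a minimal sketch of how you might check that setting before trying to connect. The variable name $urlWrappersEnabled and the messages are made up for this illustration.

```php
<?php
// ini_get() returns the setting as a string such as "1" or "",
// so convert it to a real boolean before testing it.
$urlWrappersEnabled = filter_var(ini_get("allow_url_fopen"), FILTER_VALIDATE_BOOLEAN);

if ($urlWrappersEnabled) {
    echo "fopen() is allowed to open URLs on this server.";
} else {
    echo "allow_url_fopen is off, so fopen() cannot open URLs here.";
}
```

If the setting is off, it has to be changed in the server's PHP configuration; it can't be turned on from inside the script.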

Using an if statement, we check that the connection has been opened and is ready for us to use. The code inside the if statement is only run if the connection was successful; otherwise we use the PHP die() command to display an error message and stop the script.

Next we use a while loop, which tells PHP to keep running the code inside the loop while the PHP feof() command returns false. As soon as PHP has read all the data, the command feof() returns true and the loop stops running as there's no more data to read. We read the data using the PHP fgets() command, passing it the pointer of our open connection and the parameter 4096. This tells the command to read up to the end of the current line, or at most 4095 bytes (one less than the value we pass), whichever comes first. We add each piece of data to the variable $file. At the end of the loop, we use the PHP command fclose() to close the connection referenced by our pointer.
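The same read loop works on any stream that PHP can open, not just a web page, which makes it easy to try out without contacting Amazon. The sketch below wraps the loop in a function and runs it on an in-memory stream. The function name read_stream() and the sample HTML are invented for this illustration and are not part of the original script.

```php
<?php
// Hypothetical helper: reads everything from an already opened stream.
function read_stream($handle)
{
    $contents = "";
    while (!feof($handle)) {
        // Read one line, or at most 4095 bytes, per call
        $line = fgets($handle, 4096);
        if ($line === false) {
            // Nothing more could be read, so stop
            break;
        }
        $contents .= $line;
    }
    return $contents;
}

// Use an in-memory stream as a stand-in for the connection to the web server
$handle = fopen("php://memory", "r+");
fwrite($handle, "<html>\n<body>Example</body>\n</html>\n");
rewind($handle);

$html = read_stream($handle);
fclose($handle);

echo $html;
```

The extra check for false guards against fgets() failing before feof() reports the end of the data.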

We now have all the HTML code for the Amazon web page in the variable $file ready for us to extract the sales rank.
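As a rough idea of how the extraction step could work, the sketch below uses preg_match() to pull a sales rank out of a piece of HTML. The markup in $file is an invented sample, not Amazon's real page, and the pattern is only an assumption loosely based on the one suggested in the comments below. Amazon's actual HTML has changed over time, so the pattern would need adjusting to match the page you download.

```php
<?php
// Invented sample standing in for the downloaded page; the rank is a placeholder.
$file = '<li><b>Amazon.com Sales Rank:</b> #12,345 in Books</li>';

$rank = null;

// Look for the label, then capture the digits and commas that follow it
if (preg_match('/<b>Amazon\.com\sSales\sRank:<\/b>\s*#?([\d,]+)/i', $file, $match)) {
    // Remove the thousands separators and convert to a number
    $rank = (int) str_replace(',', '', $match[1]);
    echo "Sales rank: " . $rank;
} else {
    echo "Sales rank not found";
}
```

Checking the return value of preg_match() matters here: if the page layout changes and the pattern no longer matches, the script reports that instead of displaying an empty result.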

Gareth Downes-Powell

Gareth has a range of skills, covering many computer and internet related subjects. He is proficient in many different languages including ASP and PHP, and is responsible for the setup and maintenance of both Windows and Linux servers on a daily basis.


In his daily web development work he uses the complete range of Macromedia software, including Dreamweaver MX, Flash MX, Fireworks MX and Director to build a number of websites and applications. Gareth has a close relationship with Macromedia, and as a member of Team Macromedia Dreamweaver, he has worked closely in the development of Dreamweaver, and was a beta tester for Dreamweaver MX.


On a daily basis he provides support for users in the Macromedia forums, answering questions and providing help on a range of different web related subjects. He has also written a number of free and commercial extensions for Dreamweaver MX, to further extend its capabilities using its native JavaScript APIs or C++.


As a web host, Gareth has worked with a range of different servers and operating systems, with the Linux OS as his personal favourite. Most of his development work is done using a combination of Linux, Apache and MySQL and he has written extensively about setting up this type of system, and also running Apache and MySQL under Windows.


Comments

scraping and asp

April 21, 2004 by briant sylvain78
The problem I have is how to do scraping for jobs... job scraping on an ASP website.

This tutorial is out of date I think!

May 22, 2006 by j ll

When I try to get this code working, a blank page is returned (i.e. no info between the body tags).

The Amazon information that is used in this example no longer works, as Amazon have changed the format. Could you please provide an example that works? It's a deadly bit of code if it works. Thank you, John L

Not quite fixed but I tried

July 6, 2009 by Jason Guritz

The script works if..

 

Just take another look at the tutorial and the new page

Or change it to what I have here.

preg_match("/<b>Amazon.com\sSales\sRank:<\/b>\s(.*)\s/i",$file,$match);
