Page Scraping

This tutorial walks you through the process of page scraping: what it is, the legalities and ethics involved, and an example of page scraping with a PHP script.

What is Page Scraping?

Page scraping is a technique that allows you to pull information from another web page, so that the data can be manipulated from within your own script. From your script, you can connect to another URL and request a page, exactly as a browser would do. The web server will send back the page you asked for, which you can then manipulate or extract specific information from.
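As a rough illustration of requesting a page from a script, the sketch below uses PHP's built-in file_get_contents() with a stream context. The function name fetchPage() and the user-agent string are made up for this example, and it assumes allow_url_fopen is enabled on your server; it is one possible approach, not necessarily the one used later in the tutorial.

```php
<?php
// Minimal sketch: request a page and return its body as a string.
// fetchPage() and the user-agent value are illustrative assumptions.
function fetchPage(string $url, int $timeoutSeconds = 10): string
{
    // HTTP options apply when $url is an http:// or https:// address.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExamplePageScraper/1.0',
        ],
    ]);

    $html = file_get_contents($url, false, $context);
    if ($html === false) {
        throw new RuntimeException("Could not fetch $url");
    }

    return $html;
}
```

Calling fetchPage('https://example.com/') would return that page's HTML as a string, ready to be searched or compared.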

So why would you want to do this? Here are a couple of examples where page scraping could be used.

Checking a Web Page to See if it's Been Updated

This is a classic use of page scraping, and there are many web monitors set up using this technique. Most of us have a favorite site that we go back to time and time again, for example, to read the latest articles and news. But how do you know whether the site has been updated since you last checked it? Every so often, you have to go to the site and look. Sometimes it will have been updated and you'll read the new content; otherwise you'll probably close the site and check again later.

Wouldn't it be nice if you were alerted automatically when the site changed? This is where page scraping comes in: you could get your script to "read" a web page and compare it with a stored copy. If the page has changed, your script could e-mail you and let you know that new information has been added. You'll be informed as soon as the information changes, so you can always read the new content straight away and there's no more time wasted by manually checking to see if the page has been updated.
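The compare-with-a-stored-copy idea can be sketched as follows. Rather than keeping the whole page, this version stores a SHA-256 hash of it, which is a substitution made here for brevity. The function name pageHasChanged() and the hash-file path are hypothetical, and sending the e-mail is left as a comment.

```php
<?php
// Minimal sketch: detect whether a page differs from the last copy we saw.
// pageHasChanged() and the $hashFile path are illustrative assumptions.
function pageHasChanged(string $html, string $hashFile): bool
{
    $newHash = hash('sha256', $html);

    // No stored hash yet means this is the first check, so nothing to compare.
    $oldHash = is_file($hashFile) ? trim(file_get_contents($hashFile)) : null;

    // Remember the current version for the next run.
    file_put_contents($hashFile, $newHash);

    return $oldHash !== null && $oldHash !== $newHash;
}

// Usage idea: fetch the page however you prefer, then
// if (pageHasChanged($html, '/path/to/page.hash')) { /* e-mail yourself */ }
```

Note that pages containing dates, adverts, or other content that changes on every request will always look "changed", so in practice you may want to hash only the part of the page you care about.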

Retrieving Figures, such as Stock Quotes and Prices

Another use for page scraping is for retrieving small snippets of information from other web pages. For example, you could build a script that reads the prices of shares from an appropriate site and displays only the shares that you're actually interested in.

I was prompted to write this article by a recent problem I had, which I solved using page scraping. I was interested in the Amazon sales ranking for some of the books that I've written, to see how it varies and what the highest and lowest sales rankings are. Previously, I had opened Amazon in my browser, searched for the book, and then waded through all the information looking for the sales rank number. This was less than ideal, as finding the information each time was time-consuming.

To solve this, I created a small PHP script which pulls the relevant pages from Amazon, "reads" the sales rank number, and logs it to a database. I then created a frontend web page showing all the statistics that I want, based on the values stored in the database. The script is set up to run every hour, and so keeps a constant check on the sales ranks. I can see how they vary on an hour-to-hour or day-to-day basis.
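To give a feel for the "reading" step, here is a minimal sketch that pulls a rank number out of a piece of HTML with preg_match(). The sample markup is hypothetical: it only imitates the kind of label a product page might contain, and a real page's markup can change at any time, which would break the pattern. The function name extractSalesRank() is also made up for this example.

```php
<?php
// Minimal sketch: extract a sales rank number from page HTML.
// The markup this pattern expects is an assumption, not Amazon's real format.
function extractSalesRank(string $html): ?int
{
    $pattern = '/<b>Amazon\.com\s+Sales\s+Rank:<\/b>\s*#?([\d,]+)/i';

    if (preg_match($pattern, $html, $match) !== 1) {
        return null; // Label not found; the page layout may have changed.
    }

    // Strip thousands separators before converting, e.g. "12,345" becomes 12345.
    return (int) str_replace(',', '', $match[1]);
}

// Hypothetical sample input for illustration only.
$sample = '<li><b>Amazon.com Sales Rank:</b> #12,345 in Books</li>';
var_dump(extractSalesRank($sample)); // int(12345)
```

Returning null when the label is missing lets the calling script skip the database insert instead of logging a bogus rank.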

Page scraping is ideal for this sort of job, and allows you to create customized information from a certain web page, so that you can see only the information you're interested in, without having to wade through other content which doesn't interest you at all.

Gareth Downes-Powell

Gareth has a range of skills, covering many computer and internet related subjects. He is proficient in many different languages including ASP and PHP, and is responsible for the setup and maintenance of both Windows and Linux servers on a daily basis.

In his daily web development work he uses the complete range of Macromedia software, including Dreamweaver MX, Flash MX, Fireworks MX and Director to build a number of websites and applications. Gareth has a close relationship with Macromedia, and as a member of Team Macromedia Dreamweaver, he has worked closely in the development of Dreamweaver, and was a beta tester for Dreamweaver MX.

On a daily basis he provides support for users in the Macromedia forums, answering questions and providing help on a range of different web related subjects. He has also written a number of free and commercial extensions for Dreamweaver MX, to further extend its capabilities using its native JavaScript APIs or C++.

As a web host, Gareth has worked with a range of different servers and operating systems, with the Linux OS as his personal favourite. Most of his development work is done using a combination of Linux, Apache and MySQL and he has written extensively about setting up this type of system, and also running Apache and MySQL under Windows.


Comments

scraping and asp

April 21, 2004 by briant sylvain78

The problem I have is how to do scraping for jobs... job scraping on an ASP website.

This tutorial is out of date I think!

May 22, 2006 by j ll

When I try to get this code working, a blank page is returned (i.e. no info between the body tags).

The Amazon information that is used in this example no longer works, as Amazon have changed the format. Could you please provide an example that works? It's a deadly bit of code if it works. Thank you, John L

Not quite fixed but I tried

July 6, 2009 by Jason Guritz

The script works if..

Just take another look at the tutorial and the new page

Or change it to what I have here.

preg_match("/<b>Amazon\.com\sSales\sRank:<\/b>\s(.*)\s/i", $file, $match);
