This tutorial walks you through page scraping: what it is, the legalities and ethics involved, and an example of page scraping with a PHP script.
What is Page Scraping?
Page scraping is a technique that allows you to pull information from another web page, so that the data can be manipulated from within your own script. From your script, you can connect to another URL and request a page, exactly as a browser would do. The web server will send back the page you asked for, which you can then manipulate or extract specific information from.
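As a rough illustration of the request step just described, the sketch below fetches a page using PHP's built-in HTTP stream wrapper. The helper name fetchPage(), the User-Agent string and the URL in the usage comment are assumptions made for this example, and the code assumes allow_url_fopen is enabled; the cURL extension would be an equally reasonable choice.

```php
<?php
// Fetch a page much as a browser would: send a GET request and return the HTML.
// fetchPage() is a hypothetical helper name used only for this sketch.
function fetchPage(string $url, int $timeoutSeconds = 10): string
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'timeout' => $timeoutSeconds,
            // Illustrative User-Agent; identify your own script honestly.
            'header'  => "User-Agent: ExampleScraper/1.0\r\n",
        ],
    ]);

    // file_get_contents() returns false on failure and emits a warning,
    // which is suppressed here so the exception is the single error path.
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $html;
}

// Usage (placeholder URL, not taken from the article):
// $html = fetchPage('https://example.com/');
```

Once the HTML is in a string, everything else (comparing, searching, extracting) is ordinary string or DOM processing.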
So why would you want to do this? Here are a couple of examples where page scraping could be used.
Checking a Web Page to See if it's Been Updated
This is a classic use of page scraping, and many web monitors are built on this technique. Most of us have a favorite site that we return to time and time again, for example to read the latest articles and news. But how do you know whether the site has been updated since you last checked it? Every so often, you have to visit the site and look. Sometimes it will have changed and you'll read the new content; otherwise you'll probably close the site and check again later.
Wouldn't it be nice if you were alerted automatically when the site changed? This is where page scraping comes in: your script can "read" a web page and compare it with a stored copy. If the page has changed, the script can e-mail you to let you know that new information has been added. You'll be informed the next time the script runs after a change, so you can read the new content promptly without wasting time checking the page by hand.
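The comparison step could be sketched as follows. Note one deliberate substitution: rather than keeping a full stored copy of the page, this version stores only a SHA-256 hash of it, which is enough to detect that something changed. The function name hasPageChanged(), the hash file path and the e-mail address in the usage comment are all hypothetical.

```php
<?php
// Returns true when $html differs from the version seen on the previous call.
// Only a SHA-256 hash is kept on disk, not the page itself (an assumption of
// this sketch; storing the full copy would also let you show what changed).
function hasPageChanged(string $html, string $hashFile): bool
{
    $newHash = hash('sha256', $html);
    $oldHash = is_file($hashFile)
        ? trim((string) file_get_contents($hashFile))
        : null;

    // Remember the current version for the next run.
    file_put_contents($hashFile, $newHash);

    // First run: there is nothing to compare against, so report "no change".
    if ($oldHash === null) {
        return false;
    }
    return $oldHash !== $newHash;
}

// Usage (placeholder path and address):
// if (hasPageChanged($html, '/tmp/favorite-site.hash')) {
//     mail('you@example.com', 'Page updated', 'The page has new content.');
// }
```

Pages that embed timestamps, adverts or other per-request content will hash differently on every fetch, so in practice you may want to hash only the part of the page you care about.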
Retrieving Figures, such as Stock Quotes and Prices
Another use for page scraping is for retrieving small snippets of information from other web pages. For example, you could build a script that reads the prices of shares from an appropriate site and displays only the shares that you're actually interested in.
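A minimal sketch of this kind of extraction is shown below, using PHP's DOM extension and an XPath query. The HTML layout is purely an assumption: each share is taken to be a table row with a cell of class "symbol" and a cell of class "price". A real site will use different markup, and the extractPrices() name is hypothetical.

```php
<?php
// Extract prices for selected share symbols from an HTML table.
// ASSUMPTION: rows look like
//   <tr><td class="symbol">ABC</td><td class="price">12.34</td></tr>
function extractPrices(string $html, array $wantedSymbols): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($doc);
    $prices = [];
    $rows   = $xpath->query('//tr[td[@class="symbol"] and td[@class="price"]]');
    foreach ($rows as $row) {
        $symbol = trim($xpath->evaluate('string(td[@class="symbol"])', $row));
        if (in_array($symbol, $wantedSymbols, true)) {
            $raw = trim($xpath->evaluate('string(td[@class="price"])', $row));
            // Drop thousands separators before converting to a number.
            $prices[$symbol] = (float) str_replace(',', '', $raw);
        }
    }
    return $prices;
}
```

Filtering by symbol inside the loop keeps only the shares you asked for, which is the "displays only the shares that you're actually interested in" part of the idea.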
I was prompted to write this article by a recent problem I had, which I solved using page scraping. I was interested in the Amazon sales ranking for some of the books that I've written, to see how it varies and what the highest and lowest sales rankings are. Previously, I would open Amazon in my browser, search for the book, and then wade through all the information looking for the sales rank number. This was less than ideal, as it was time-consuming to find the information each time.
To solve this, I created a small PHP script which pulls the relevant pages from Amazon, "reads" the sales rank number, and logs it to a database. I then created a frontend web page showing all the statistics that I want, based on the values stored in the database. The script is set up to run every hour, and so keeps a constant check on the sales ranks. I can see how they vary on an hour-to-hour or day-to-day basis.
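The "read the number, then log it" steps might look something like the sketch below. It is not the script described above: the text pattern "Sales Rank: 12,345", the table and column names, and both function names are assumptions chosen for illustration, and the actual page markup may differ or change over time.

```php
<?php
// Pull the sales rank out of a product page's HTML.
// ASSUMPTION: the visible text contains something like "Sales Rank: 12,345".
function extractSalesRank(string $html): ?int
{
    if (preg_match('/Sales Rank:\s*#?([\d,]+)/i', strip_tags($html), $m)) {
        return (int) str_replace(',', '', $m[1]);
    }
    return null; // pattern not found, e.g. the page layout changed
}

// Append one reading so that the history can be queried later.
// Table and column names are hypothetical.
function logSalesRank(PDO $db, string $bookId, int $rank): void
{
    $db->exec('CREATE TABLE IF NOT EXISTS sales_rank_log (
        book_id    TEXT    NOT NULL,
        sales_rank INTEGER NOT NULL,
        checked_at TEXT    NOT NULL
    )');
    $stmt = $db->prepare(
        'INSERT INTO sales_rank_log (book_id, sales_rank, checked_at)
         VALUES (?, ?, ?)'
    );
    $stmt->execute([$bookId, $rank, gmdate('Y-m-d H:i:s')]);
}

// Usage (placeholder database path and book identifier):
// $db   = new PDO('sqlite:/path/to/ranks.sqlite');
// $rank = extractSalesRank($html);
// if ($rank !== null) {
//     logSalesRank($db, 'my-book', $rank);
// }
```

Returning null when the pattern is missing keeps a layout change on the remote site from writing bogus zeros into the log. An hourly run is typically arranged with a scheduler such as cron, which is outside the scope of this sketch.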
Page scraping is ideal for this sort of job: it lets you build a customized view of a web page that shows only the information you're interested in, without wading through content that doesn't interest you at all.