Web Scraping With PHP

In this article I will show you how to use PHP to scrape a web page. If you prefer learning from video, there is a version of this tutorial on YouTube at https://youtu.be/Uc5mfudMTKE. Personally, I like reading an article because it tends to take less time, since you can skim. Pick whichever format works best for you!

This article assumes that you have a basic understanding of PHP and programming concepts, and that you have access to a server capable of running PHP. If you do not, you can install WAMP on Windows 10 by following my installation video. In a way, scraping involves reverse engineering a web page, so it also helps to be familiar with HTML.

Although there are other ways to scrape a web page with PHP, this article will focus on the Simple HTML DOM Parser. I chose this library because I have experience with it, it is easy to use, and it has great documentation.

Installing the Library

The first thing you need to do is download the scraping library from SourceForge. You can do this by going to http://simplehtmldom.sourceforge.net/ and clicking on "Download latest version from SourceForge".

Once you have downloaded the library from SourceForge, unzip the compressed folder. Then move the "simple_html_dom.php" file to the folder that you will be building the web scraper in.

Writing the Scraping Code

Now that you have the library installed, you can begin writing your scraping code.

 <?php

   # This imports and gives us access to the scraping library

   include('simple_html_dom.php');

?>

Now that you have access to the scraping library, you can use the file_get_html function to create a DOM object from a URL.

 <?php

   # This imports and gives us access to the scraping library

   include('simple_html_dom.php');
 
   # Create HTML DOM object from url

   $html = file_get_html('https://google.com');

?>

You can then pull specific elements from this DOM object by calling the find method and passing in the tag name of the element you would like to grab. If you want only a single instance of a particular tag, also pass a zero-based index. If you want an array of every matching tag, omit the index.

# Create HTML DOM object from url
$html = file_get_html('https://google.com');

# Gets the first (index 0) title element from the DOM object and echoes it to the webpage
echo $html->find('title',0);

# If we don't pass an index we get an array of all the anchor elements from the DOM object
$array_of_anchors = $html->find('a');

# We can echo all of the anchor elements from the array above by using a simple for loop
for( $i = 0; $i < sizeof($array_of_anchors); $i++ ){

   # echo each anchor by using the $i iterator to pull the anchor in each index position
   echo $array_of_anchors[$i];

}
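To see the difference between the indexed and non-indexed forms of find without hitting a live site, here is a minimal sketch that parses an inline string with str_get_html, the library's string-based counterpart to file_get_html. The markup and variable names are made up for illustration, and the script assumes simple_html_dom.php sits in the same folder.

```php
<?php
# Assumes simple_html_dom.php is in the same folder as this script
include('simple_html_dom.php');

# Made-up markup for illustration; str_get_html parses a string instead of fetching a URL
$html = str_get_html('<html><head><title>Demo page</title></head><body><a href="/one">One</a><a href="/two">Two</a></body></html>');

# With an index, find returns a single element, or null when nothing matches
$title = $html->find('title', 0);
$missing = $html->find('table', 0);

# Without an index, find returns an array, which is empty when nothing matches
$anchors = $html->find('a');

echo $title->plaintext, "\n";
echo count($anchors), " anchors found\n";
echo ($missing === null) ? "no table on this page\n" : "table found\n";
```

Because the indexed form can return null, it is worth checking the result before reading properties from it.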

In addition to selecting elements based on their tag name, you can select elements based on class or ID.

 $html = file_get_html('https://google.com');
 
$array_of_hidden_divs = $html->find('div[class="hidden"]');
 
$array_of_thumbnails = $html->find('img[id="thumbnail"]');
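The library's documentation also describes CSS-style shorthand for class and ID selectors. The sketch below compares the attribute form used above with the shorthand form on a made-up inline snippet; the markup and variable names are assumptions for illustration only.

```php
<?php
# Assumes simple_html_dom.php is in the same folder as this script
include('simple_html_dom.php');

# Made-up markup for illustration
$html = str_get_html('<html><body><div class="hidden">secret</div><div>shown</div><img id="thumbnail" src="thumb.png"></body></html>');

# Attribute form, as used in the article
$hidden_by_attribute = $html->find('div[class="hidden"]');

# Shorthand form: .name for a class, #name for an ID
$hidden_by_shorthand = $html->find('div.hidden');
$thumbnail = $html->find('img#thumbnail', 0);

echo count($hidden_by_attribute), ' ', count($hidden_by_shorthand), "\n";
echo $thumbnail->src, "\n";
```

Both forms should select the same elements here, so the choice is mostly a matter of readability.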

When you pass an index, the find method returns a single element object that has its own find method. This means that you can call find on the result to grab elements nested inside it.

 $html = file_get_html('https://google.com');
 
$ul = $html->find('ul',0);
 
$array_of_li = $ul->find('li');
 
# This is the same as above, but in a single line

$array_of_li = $html->find('ul',0)->find('li');
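One caveat with the single-line form: if the page has no ul element, the first find returns null and the chained call fails with an error. A minimal sketch of a guarded version follows, using made-up inline markup that deliberately contains no list.

```php
<?php
# Assumes simple_html_dom.php is in the same folder as this script
include('simple_html_dom.php');

# Made-up markup for illustration, with no <ul> in it
$html = str_get_html('<html><body><p>No lists here</p></body></html>');

# Look up the parent first and check it before searching inside it
$ul = $html->find('ul', 0);
$array_of_li = ($ul !== null) ? $ul->find('li') : array();

echo count($array_of_li), " list items\n";
```

Falling back to an empty array keeps any later loop over $array_of_li working without a special case.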

You can extract certain data such as the text of an element, or the hyperlink reference of an anchor tag, or the source of an image.

 $html = file_get_html('https://google.com');
 
$button_text = $html->find('button',0)->plaintext;
 
$anchor_href = $html->find('a',0)->href;
 
$image_source = $html->find('img',0)->src;
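Putting these pieces together, a common task is collecting the text and target of every link on a page. The sketch below is one way to do that, assuming made-up inline markup; the a[href] selector limits the match to anchors that actually carry an href attribute, and the $links array is just an illustrative structure.

```php
<?php
# Assumes simple_html_dom.php is in the same folder as this script
include('simple_html_dom.php');

# Made-up markup for illustration; the middle anchor has no href
$html = str_get_html('<html><body><a href="/one">One</a><a name="top">No link</a><a href="/two">Two</a></body></html>');

# Collect the text and href of every anchor that has an href attribute
$links = array();
foreach ($html->find('a[href]') as $anchor) {
    $links[] = array(
        'text' => $anchor->plaintext,
        'href' => $anchor->href,
    );
}

print_r($links);
```

A foreach loop avoids the index bookkeeping of the earlier for loop and reads the same way for images and the src property.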

I hope this helps you with your PHP web scraping needs. Feel free to ask questions if you need any clarification. I highly recommend reading the library's documentation.

Apr 6 '19 #1
