Web Scraping With PHP

In this article I will show you how to use PHP to scrape a web page. A video version of this tutorial is available on YouTube at https://youtu.be/Uc5mfudMTKE if you prefer that format. Personally, I prefer reading an article because it tends to take less time, since you can skim. Pick whichever format works best for you!

This article assumes that you have a basic understanding of PHP and programming concepts, and that you have access to a server capable of running PHP. If you do not, you can install WAMP on Windows 10 by following my installation video. In a way, scraping involves reverse engineering a web page, so it helps to be familiar with HTML.

Although there are other ways to scrape a web page with PHP, this article focuses on the Simple HTML DOM Parser. I chose this library because I have experience with it, it is easy to use, and it has great documentation.
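For comparison, here is a minimal sketch of one of those other ways: PHP's bundled DOM extension (DOMDocument). It is not the library this article uses, and the sample HTML string is made up so the snippet runs without a network connection.

```php
<?php
# Sketch only: PHP's bundled DOM extension, not Simple HTML DOM Parser.
# $sampleHtml is made-up data so the example runs offline.
$sampleHtml = '<html><head><title>Example page</title></head>'
    . '<body><a href="/first">First</a> <a href="/second">Second</a></body></html>';

$doc = new DOMDocument();
# Real-world HTML is often malformed; collect parser warnings instead of printing them
libxml_use_internal_errors(true);
$doc->loadHTML($sampleHtml);
libxml_clear_errors();

# getElementsByTagName() returns a node list; item(0) is the first match or null
$titleNode = $doc->getElementsByTagName('title')->item(0);
$title = $titleNode !== null ? $titleNode->textContent : '';

# length holds the number of matching elements
$anchorCount = $doc->getElementsByTagName('a')->length;

echo $title, PHP_EOL;
echo $anchorCount, PHP_EOL;
```

The rest of the article sticks with the Simple HTML DOM Parser, whose find method is shorter to write than the calls above.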

Installing the Library

The first thing you need to do is download the scraping library from SourceForge. You can do this by going to http://simplehtmldom.sourceforge.net/ and clicking on "Download latest version from SourceForge".



Once you have downloaded the library from SourceForge, unzip the compressed folder. Then move the "simple_html_dom.php" file to the folder that you will be building the web scraper in.



Writing the Scraping Code

Now that you have the library installed, you can begin writing your scraping code.

    <?php
       # This imports and gives us access to the scraping library
       include('simple_html_dom.php');
    ?>

Now that you have access to the scraping library, you can use the file_get_html function to create a DOM object from a URL.

    <?php
       # This imports and gives us access to the scraping library
       include('simple_html_dom.php');

       # Create HTML DOM object from URL
       $html = file_get_html('https://google.com');
    ?>

You can then pull specific elements from this DOM object by calling the find method and passing in the tag name of the element you would like to grab. Pass an index as the second argument to grab a single instance of that tag. Omit the index to get an array of every matching tag.

    # Create HTML DOM object from URL
    $html = file_get_html('https://google.com');

    # Gets the 0th title element from the DOM object and echoes it to the webpage
    echo $html->find('title', 0);

    # If we don't pass an index we get an array of all the anchor elements from the DOM object
    $array_of_anchors = $html->find('a');

    # We can echo all of the anchor elements from the array above by using a simple for loop
    for ($i = 0; $i < count($array_of_anchors); $i++) {
       # echo each anchor by using the $i iterator to pull the anchor in each index position
       echo $array_of_anchors[$i];
    }

In addition to selecting elements based on their tag name, you can select elements based on their class or ID.

    $html = file_get_html('https://google.com');

    # All div elements with the class "hidden"
    $array_of_hidden_divs = $html->find('div[class="hidden"]');

    # All img elements with the ID "thumbnail"
    $array_of_thumbnails = $html->find('img[id="thumbnail"]');

When you pass an index, the find method returns a single element object rather than an array. That element has its own find method, so you can call find on the result to grab child elements.

    $html = file_get_html('https://google.com');

    $ul = $html->find('ul', 0);

    $array_of_li = $ul->find('li');

    # This is the same as above, but in a single line
    $array_of_li = $html->find('ul', 0)->find('li');

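One caveat with chaining: if the indexed lookup matches nothing, there is no element to call the next method on, and the script stops with an error. In the Simple HTML DOM Parser, find with an index returns null in that case. The sketch below shows the guard pattern, using PHP's bundled DOM extension instead of the library so that it runs without any downloads. The function name listItemTexts and the sample HTML are made up for illustration.

```php
<?php
# Sketch only: built-in DOM extension; listItemTexts() is a hypothetical helper.
function listItemTexts(string $html): array
{
    $doc = new DOMDocument();
    # Collect parser warnings instead of printing them
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    # item(0) is null when the page has no <ul>, so check before chaining
    $ul = $doc->getElementsByTagName('ul')->item(0);
    if ($ul === null) {
        return [];
    }

    $texts = [];
    foreach ($ul->getElementsByTagName('li') as $li) {
        $texts[] = trim($li->textContent);
    }
    return $texts;
}

print_r(listItemTexts('<ul><li>One</li><li>Two</li></ul>'));
print_r(listItemTexts('<p>No list here</p>'));
```

The same check applies to the library code above: store the result of find('ul', 0) in a variable and confirm it is not null before calling find('li') on it.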
You can extract specific data from an element, such as its text, the hyperlink reference (href) of an anchor tag, or the source (src) of an image.

    $html = file_get_html('https://google.com');

    $button_text = $html->find('button', 0)->plaintext;

    $anchor_href = $html->find('a', 0)->href;

    $image_source = $html->find('img', 0)->src;

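Once you have extracted values, it is often handy to gather them into one array so they can be stored or printed together. The following is a rough sketch of that idea, again using PHP's bundled DOM extension rather than the library so that it runs offline. The sample HTML and the array keys are made up.

```php
<?php
# Sketch only: built-in DOM extension; sample HTML and array keys are made up.
$sampleHtml = '<div><button>Buy now</button>'
    . '<a href="/about">About us</a>'
    . '<img src="/img/logo.png" alt="Logo"></div>';

$doc = new DOMDocument();
# Collect parser warnings instead of printing them
libxml_use_internal_errors(true);
$doc->loadHTML($sampleHtml);
libxml_clear_errors();

$button = $doc->getElementsByTagName('button')->item(0);
$anchor = $doc->getElementsByTagName('a')->item(0);
$image  = $doc->getElementsByTagName('img')->item(0);

# A missing element becomes null instead of causing an error
$record = [
    'button_text'  => $button !== null ? trim($button->textContent) : null,
    'anchor_href'  => $anchor !== null ? $anchor->getAttribute('href') : null,
    'image_source' => $image !== null ? $image->getAttribute('src') : null,
];

print_r($record);
```

With the Simple HTML DOM Parser, the plaintext, href and src properties shown earlier play the roles of textContent and getAttribute here.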
I hope this helps you accomplish your PHP Web Scraping needs. Feel free to ask questions if you need any clarification. I highly recommend reading the documentation.
Apr 6 '19 #1