Web Scraping With PHP

In this article I will show you how to use PHP to scrape a web page. There is also a video version of this tutorial on YouTube at https://youtu.be/Uc5mfudMTKE if you prefer learning in a video format. Personally, I like reading an article because it tends to take less time, since you can skim. Pick whichever format works best for you!

This article assumes that you have a basic understanding of PHP and programming concepts, and that you have access to a server capable of running PHP. If you do not, you can install WAMP on Windows 10 by watching my installation video. Scraping is, in a way, reverse engineering a web page, so it helps to be familiar with HTML.

Although there are other ways to scrape a web page with PHP, this article focuses on the Simple HTML DOM Parser. I chose this library because it is the one I have experience with, it is easy to use, and its documentation is great.

Installing the Library

The first thing you need to do is download the scraping library from SourceForge. You can do this by going to http://simplehtmldom.sourceforge.net/ and clicking on "Download latest version from SourceForge".



Once you have downloaded the library from SourceForge, unzip the compressed folder. Then move the "simple_html_dom.php" file to the folder that you will be building the web scraper in.



Writing the Scraping Code

Now that you have the library installed, you can begin writing the scraping code.

    <?php
        # Import the scraping library so its functions are available
        include('simple_html_dom.php');
    ?>

Now that you have access to the scraping library, you can use the file_get_html function to create a DOM object from a URL.

    <?php
        # Import the scraping library so its functions are available
        include('simple_html_dom.php');

        # Create an HTML DOM object from a URL
        $html = file_get_html('https://google.com');
    ?>

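Loading a page can fail, for example when the site is unreachable. As far as I can tell from the library's source, file_get_html returns false in that case, so it is worth checking the result before calling find on it. The snippet below is a minimal sketch of that check; fetch_dom is a hypothetical helper name of my own, not part of the library.

```php
<?php
require_once 'simple_html_dom.php';

// fetch_dom() is a hypothetical helper for this example, not a library function.
// It returns the DOM object on success, or null when the page could not be loaded.
function fetch_dom($url)
{
    // The @ hides the warning PHP raises when the request fails;
    // the failure is reported through the return value instead.
    $html = @file_get_html($url);

    return $html === false ? null : $html;
}

$html = fetch_dom('https://google.com');

if ($html === null) {
    echo 'Could not load the page.';
} else {
    # Safe to query the DOM object here
    echo $html->find('title', 0);
}
```

Returning null keeps the calling code to a single check before any scraping starts.
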
You can then pull specific elements from this DOM object by calling the find method and passing in the tag name of the element you would like to grab. To grab a single instance of a tag, also pass its index (starting at 0). To grab an array of all matching tags, leave the index out.

    # Create an HTML DOM object from a URL
    $html = file_get_html('https://google.com');

    # Get the title element at index 0 from the DOM object and echo it to the webpage
    echo $html->find('title', 0);

    # Without an index, find returns an array of all the anchor elements in the DOM object
    $array_of_anchors = $html->find('a');

    # Echo every anchor element in the array with a simple for loop
    for ($i = 0; $i < count($array_of_anchors); $i++) {
        # Use the $i counter to pull the anchor at each index position
        echo $array_of_anchors[$i];
    }

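Since find returns a plain PHP array when no index is passed, a foreach loop works as well and avoids tracking the index by hand. Here is a small sketch that uses str_get_html, the library's function for parsing markup from a string, so you can try it without a network connection. The sample markup is made up for this example.

```php
<?php
require_once 'simple_html_dom.php';

// Made-up sample markup, parsed from a string instead of a URL
$html = str_get_html('<p><a href="/one">One</a> <a href="/two">Two</a></p>');

$hrefs = [];

// Visit each anchor element in turn
foreach ($html->find('a') as $anchor) {
    $hrefs[] = $anchor->href;
}

// Print the collected links separated by commas
echo implode(', ', $hrefs);
```

Parsing a short string like this is also a handy way to test a selector before pointing the scraper at a real page.
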
In addition to selecting elements based on their tag name, you can select elements based on class or ID.

    $html = file_get_html('https://google.com');

    # All div elements whose class is "hidden"
    $array_of_hidden_divs = $html->find('div[class="hidden"]');

    # All img elements whose id is "thumbnail"
    $array_of_thumbnails = $html->find('img[id="thumbnail"]');

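As I understand the documentation, the library also accepts the CSS-style shorthand for these selectors: a dot for a class and a hash sign for an ID. The sketch below runs both against made-up markup parsed with str_get_html; treat it as an illustration and not a complete list of the supported selectors.

```php
<?php
require_once 'simple_html_dom.php';

// Made-up sample markup for this example
$html = str_get_html(
    '<div class="hidden">secret</div><img id="thumbnail" src="thumb.png">'
);

// Shorthand for div[class="hidden"]; returns an array of matches
$hidden_divs = $html->find('div.hidden');

// Shorthand for an ID selector; an ID should be unique on a page,
// so index 0 is passed to get the single element directly
$thumbnail = $html->find('#thumbnail', 0);

echo count($hidden_divs) . ' hidden div, thumbnail source: ' . $thumbnail->src;
```
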
When you pass an index, the find method returns a single element object that has its own find method. This means that you can call find on the result to grab that element's child elements.

    $html = file_get_html('https://google.com');

    # Grab the first ul element on the page
    $ul = $html->find('ul', 0);

    # Grab every li element inside that ul
    $array_of_li = $ul->find('li');

    # This is the same as above, but in a single line
    $array_of_li = $html->find('ul', 0)->find('li');

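One thing to watch out for when chaining: if the page has no matching element at the index you ask for, find returns null (at least in the versions of the library I have used), and calling find on null stops the script with an error. A minimal sketch of a guard, again using made-up markup parsed with str_get_html:

```php
<?php
require_once 'simple_html_dom.php';

// Made-up sample markup that contains no ul element
$html = str_get_html('<div><p>No lists here</p></div>');

// There is no ul, so this is null and not an element object
$ul = $html->find('ul', 0);

// Only search for li elements when the ul was actually found
$array_of_li = $ul ? $ul->find('li') : [];

echo 'Found ' . count($array_of_li) . ' list items.';
```

Falling back to an empty array means the rest of the script can loop over the result without a special case.
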
You can also extract specific data from an element, such as its text, the hyperlink reference of an anchor tag, or the source of an image, by reading the plaintext, href, and src properties.

    $html = file_get_html('https://google.com');

    # Text of the first button element
    $button_text = $html->find('button', 0)->plaintext;

    # Hyperlink reference of the first anchor element
    $anchor_href = $html->find('a', 0)->href;

    # Source of the first image element
    $image_source = $html->find('img', 0)->src;

I hope this helps you with your PHP web scraping needs. Feel free to ask questions if you need any clarification. I highly recommend reading the library's documentation.