This article assumes that you have a basic understanding of PHP and programming concepts, and that you have access to a server capable of running PHP. If you do not, you can install WAMP on Windows 10 by watching my installation video. Scraping amounts to reverse engineering a web page, so it also helps to be familiar with HTML.
Although there are other ways to scrape a web page with PHP, this article will focus on the Simple HTML DOM Parser. I chose this library because I have experience with it, it is easy to use, and it is well documented.
Installing the Library
The first thing you need to do is download the scraping library from SourceForge. You can do this by going to http://simplehtmldom.sourceforge.net/ and clicking on "Download latest version from SourceForge".
Once you have downloaded the library from SourceForge, unzip the compressed folder. Then move the "simple_html_dom.php" file to the folder that you will be building the web scraper in.
Writing the Scraping Code
Now that you have the library installed, you can begin writing your scraping code.
First, include the library at the top of your PHP file:
- <?php
- # This imports and gives us access to the scraping library
- include('simple_html_dom.php');
- ?>
Next, create an HTML DOM object from the URL of the page you want to scrape:
- <?php
- # This imports and gives us access to the scraping library
- include('simple_html_dom.php');
- # Create HTML DOM object from url
- $html = file_get_html('https://google.com');
- ?>
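Loading a remote page can fail, for example when the server is unreachable. As far as I know, file_get_html() returns false in that case, so it is worth checking the result before calling find() on it. The sketch below assumes that behaviour; load_page() is a hypothetical helper name of my own, not part of the library.

```php
<?php
require_once 'simple_html_dom.php';

// Hypothetical helper (not part of the library): returns the DOM object,
// or null when the page could not be loaded.
function load_page($url)
{
    // The @ hides PHP's own warning so the failure can be handled below.
    // Assumption: file_get_html() returns false when loading fails.
    $html = @file_get_html($url);
    return $html === false ? null : $html;
}

$html = load_page('https://google.com');
if ($html === null) {
    echo 'Could not load the page.';
} else {
    echo $html->find('title', 0);
    $html->clear(); // release the memory held by the DOM object
}
```

Returning null instead of false keeps the check at the call site to a single comparison.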
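Use the find() method to pull elements out of the DOM object. Pass an index to get a single element, or omit the index to get an array of every match: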
- # Create HTML DOM object from url
- $html = file_get_html('https://google.com');
- # Gets the first title element (index 0) from the DOM object and echoes it to the webpage
- echo $html->find('title',0);
- # If we don't pass an index we can get an array of all the anchor elements from the DOM object
- $array_of_anchors = $html->find('a');
- # We can echo all of the anchor elements from the array above by using a simple for loop
- for( $i = 0; $i < count($array_of_anchors); $i++ ){
- # echo each anchor by using the $i iterator to pull the anchor in each index position
- echo $array_of_anchors[$i];
- }
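The for loop above can also be written with foreach, which avoids the index bookkeeping. This is a minimal sketch that parses a made-up HTML string with the library's str_get_html() function instead of downloading a page, so the markup and variable names are assumptions for illustration only.

```php
<?php
require_once 'simple_html_dom.php';

// Made-up markup standing in for a downloaded page
$html = str_get_html(
    '<html><head><title>Demo</title></head><body>'
    . '<a href="/one">One</a><a href="/two">Two</a>'
    . '</body></html>'
);

// find() without an index returns an array, so foreach works directly
$anchors = $html->find('a');
foreach ($anchors as $anchor) {
    // Echoing an element prints its HTML, as in the for loop version
    echo $anchor . "\n";
}
```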
You can also select elements by attribute, such as a class or an id:
- $html = file_get_html('https://google.com');
- $array_of_hidden_divs = $html->find('div[class="hidden"]');
- $array_of_thumbnails = $html->find('img[id="thumbnail"]');
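To see what the attribute selectors above match, it helps to run them against a small, known piece of HTML. The sketch below uses a made-up HTML string, and the CSS-style shorthands div.hidden and img#thumbnail should be equivalent to the bracket form as far as I know; treat that as an assumption and check it against the library's manual.

```php
<?php
require_once 'simple_html_dom.php';

// Made-up markup for illustration
$html = str_get_html(
    '<div class="hidden">secret</div>'
    . '<div class="shown">public</div>'
    . '<img id="thumbnail" src="thumb.png">'
);

// Bracket form, as used above
$hidden_divs = $html->find('div[class=hidden]');

// Shorthand forms: .class and #id
$hidden_divs_short = $html->find('div.hidden');
$thumbnail = $html->find('img#thumbnail', 0);

foreach ($hidden_divs as $div) {
    echo $div->plaintext . "\n";
}
echo $thumbnail->src . "\n";
```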
Calls to find() can be chained to search inside an element you have already found:
- $html = file_get_html('https://google.com');
- $ul = $html->find('ul',0);
- $array_of_li = $ul->find('li');
- # This is the same as above, but in a single line
- $array_of_li = $html->find('ul',0)->find('li');
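One caveat with the single-line version: when the page has no ul element, find('ul',0) returns null, and calling find() on null stops the script with an error. A minimal sketch of a guard follows; list_items() is a hypothetical helper name and the HTML strings are made up for illustration.

```php
<?php
require_once 'simple_html_dom.php';

// Hypothetical helper: returns the li elements of the first ul,
// or an empty array when the page has no ul at all.
function list_items($html)
{
    $ul = $html->find('ul', 0);
    return $ul === null ? array() : $ul->find('li');
}

$with_list = str_get_html('<ul><li>One</li><li>Two</li></ul>');
$without_list = str_get_html('<div><p>No lists here</p></div>');

echo count(list_items($with_list)) . "\n";
echo count(list_items($without_list)) . "\n";
```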
Once you have an element, you can read its text and attributes as properties:
- $html = file_get_html('https://google.com');
- $button_text = $html->find('button',0)->plaintext;
- $anchor_href = $html->find('a',0)->href;
- $image_source = $html->find('img',0)->src;
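The same properties work inside a loop, which is how you would typically collect every link on a page. This sketch runs against a made-up HTML string and uses the a[href] selector, which as far as I know matches only anchors that actually have an href attribute; the $links array layout is my own choice for illustration.

```php
<?php
require_once 'simple_html_dom.php';

// Made-up markup: one real link, one anchor without an href
$html = str_get_html(
    '<a href="/one">One</a>'
    . '<a>No link</a>'
    . '<img src="logo.png" alt="Logo">'
);

$links = array();
// a[href] skips anchors that have no href attribute
foreach ($html->find('a[href]') as $anchor) {
    $links[] = array(
        'text' => trim($anchor->plaintext),
        'href' => $anchor->href,
    );
}

print_r($links);
```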