Bytes IT Community

A challenge? Help isolating links in a WebPage

Hello, I am writing a script that calls a URL and feeds the resulting
HTML into a function that strips out everything and returns ONLY the
links, so that I can build a link index of various pages.
I have been programming in PHP for over two years now and have never
encountered a problem like the one I am having now. To me this seems
like it should be just about the simplest thing in the world, but I
must admit I'm stumped BIG TIME!
For the sake of speed I chose to use preg_match_all to isolate the
links and return them in an array.
I have tried various regular expressions, and modifications of the
regular expressions I found on PHP.net and in scripts I've found lying
around, and have read through everything I can find on them,
including the material on PHP.net.
While researching I found an open-source class called Snoopy that has
nearly the functionality I want, so like any good programmer, I used
it as a starting point.
The default regular expression that Snoopy uses for this
functionality is

preg_match_all("'<\s*a\s.*?href\s*=\s*([\"\'])?(?(1)
(.*?)\\1|([^\s\>]+))'isx",$document,$links);

For the benefit of all those new to regular expressions, here it is
broken down with the author's comments:

'<\s*a\s.*?href\s*=\s*          # find <a href=
([\"\'])?                       # find single or double quote
(?(1) (.*?)\\1 | ([^\s\>]+))'   # if a quote was found, match up to the next
                                # matching quote, otherwise up to the next space
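For readers new to this construct: `(?(1)...)` is a PCRE conditional. If group 1 (the opening quote) took part in the match, the quoted branch is required; otherwise the bare-word branch is used. A minimal sketch of the same construct on a simplified pattern (the sample strings are made up):

```php
<?php
// (?(1)...) is a conditional: if group 1 (an opening quote) matched,
// require the quoted branch; otherwise take the bare branch.
$re = '/href=(["\'])?(?(1)(.*?)\1|(\S+))/';

preg_match($re, 'href="quoted.html" class="x"', $m1);
preg_match($re, 'href=bare.html class="x"', $m2);

echo $m1[2], "\n"; // quoted.html (group 2: the quoted branch)
echo $m2[3], "\n"; // bare.html   (group 3: the unquoted branch)
```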

Of course $document is the complete HTML result of the webpage I am
indexing.

This expression only returns where each link points to.

I need to obtain the complete link, from
<a href=mysite.com/mypage.html>My Page</a>.

Anyway, since I needed the complete link, I replaced that with this:

preg_match_all( '/\<a href.*?\>(.*)(<\/a\\1>)/',$document,$links);

Again, for those new to regular expressions, here goes:
'/\<a href.*?\>   #Look for <a href
(.*)              #Grab everything starting at the first match
(<\/a\\1>)/'      #And continue to the </a> end of the link; \\1
tells it to return ONLY that which matches the whole expression.
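One note for readers following along: `\\1` in a pattern is a backreference to whatever group 1 captured, so `</a\\1>` actually demands that the captured link text appear again between `</a` and `>`. A small demonstration on made-up strings:

```php
<?php
// \1 refers back to what group 1 captured, so </b\1> only matches
// when the captured text is repeated before the closing '>'
var_dump(preg_match('/<b>(.*)(<\/b\1>)/', '<b>hi</b>')); // int(0): no match
var_dump(preg_match('/<b>(.*)(<\/b\1>)/', '<b></b>'));   // int(1): group 1 is empty
```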
This appears to work fine, except that when I run it I seem to get only
the first 17-20 links on a given webpage, where the first expression may
return over 100. This told me something might be wrong, so I looked
a LOT closer at both expressions and the pages I'm dealing with and
realized that some of the links may use various case and spacing
combos. The second expression doesn't appear to match anything but
exact spacing and case. So I went back to the drawing board and came up
with this:
preg_match_all("'<\s*a\s.*?href.*?\>(.*)(<\/a\\1>)'",$document,$links);
Again, here it is broken down for those new to regular expressions:
'<\s*a\s.*?href.*?\>   #Find all <a href regardless of case or spacing
(.*)                   #Grab everything just matched
(<\/a\\1>)             #Find the closing </a> and stop

Using the same webpage as with the first two, this expression returns
only 12 results! It actually returns fewer than the first two.
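A likely contributor to the shrinking match count is that `(.*)` is greedy: when several links share a line, it can run past many `</a>` tags and fold several links into a single match. A minimal sketch on a made-up two-link string, using a simplified variant of the second pattern (without the `\\1` backreference):

```php
<?php
$html = '<a href="1.html">one</a> <a href="2.html">two</a>';

preg_match_all('#<a href.*?>(.*)</a>#', $html, $greedy);
preg_match_all('#<a href.*?>(.*?)</a>#', $html, $lazy);

// greedy (.*) runs to the LAST </a>, merging both links into one match
echo count($greedy[1]), "\n"; // 1
// lazy (.*?) stops at the first </a>, giving one match per link
echo count($lazy[1]), "\n";   // 2
```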

Right now I am really mad at regular expressions. Could someone
please not just give me the solution to the problem, but detail the
thought process used to arrive at that solution, and show what I'm
doing wrong here, so that next time I use the PCRE functions I can
reason correctly?

Look closely at my comments; they are by no means exact. This is how I
BELIEVE the regular expressions are being evaluated, and I am open to
criticism on that point.

Thanx in advance, and I certainly hope this gets an informative &
instructional thread going for the benefit of everyone new to Regular
Expressions.
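For comparison, one way to combine the tolerant `<\s*a\s` opening with capture of both the URL and the link text in a single lazy pattern. This is a sketch only, not hardened against all real-world HTML (attributes containing `>`, nested quotes, etc. will break it), and the sample document is made up:

```php
<?php
// Group 1: the href value (quoted or bare); group 2: the link text.
// Modifiers: case-insensitive (i), dot matches newlines (s).
$re = '#<\s*a\s[^>]*href\s*=\s*["\']?([^"\'\s>]+)["\']?[^>]*>(.*?)</a\s*>#is';

$document = '<A HREF="a.html">First</A> text <a title="t" href=b.html >Second</a>';
preg_match_all($re, $document, $links);

print_r($links[1]); // the URLs:      a.html, b.html
print_r($links[2]); // the link text: First, Second
```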
Jul 17 '05 #1
9 Replies


Steve wrote:
[...]


Why don't you do yourself a favour and use HTMLSax from Pear:
http://pear.php.net/package-info.php...ge=XML_HTMLSax

Regards

Hartmut

--
SnakeLab - Internet und webbasierte Software | /\ /

Hartmut König (mailto:h.******@snakelab.de) | /\/ \ /

___________ http://www.snakelab.de _______/\/\| /\/ \/

Do you know your Shop-clients ? ShopStat do ->\/_____________

Jul 17 '05 #2

Well, two reasons, really.
First off, I didn't know this thing existed, and I'd now probably have
to learn a new API :)
And second, I wanted to get a good discussion going on PCRE's
preg_match_all regular expressions.
But thank you, and I will take a closer look, since I'm running out of
dev time waiting on this one part.


Jul 17 '05 #3

I just downloaded it and took a peek.
It's total overkill for what I need, and I would have to recode
from the beginning to utilize it. I will, however, be using it on my
next project.
Also, as stated earlier, I really want to do this with a regular
expression.


Jul 17 '05 #4

Steve wrote:
[...]
Also, as stated earlier, I really want to do this with a regular
expression.


I used a mix of preg_match_all() and substr().

source at http://www.geocities.com/alterpedro/phps.html
result at http://www.geocities.com/alterpedro/php.html

--
I have a spam filter working.
To mail me include "urkxvq" (with or without the quotes)
in the subject line, or your mail will be ruthlessly discarded.
Jul 17 '05 #5

Pedro wrote:
I used a mix of preg_match_all() and substr().

and strpos(), and preg_replace().

source at http://www.geocities.com/alterpedro/phps.html
result at http://www.geocities.com/alterpedro/php.html


I have pasted the code to geocities, because it was much bigger
than I felt "safe" to post here. Much of its size was the
yahoo HTML chunk that I had to remove before the file
got accepted ... but I had thought of that and didn't want
to go back to some other way. Now I'm home, thinking clearer,
and the code is better :)

New Version!
[ I'll remove the geocities pages in a few days ]

<?php
function extract_URLs($s) {
    $res = array();
    preg_match_all('@(<a .*</a>)@Uis', $s, $a);
    foreach ($a[1] as $x) {
        $gtpos = strpos($x, '>');
        $y = substr($x, 0, $gtpos);
        if ($hrefpos = strpos($x, 'href=')) {
            $z = substr($y, $hrefpos+5);
            $z = preg_replace('/^(\S+)\s.*$/U', '$1', $z);
            if ($z[0] == '"' && substr($z, -1) == '"') $z = substr($z, 1, -1);
            if ($z[0] == "'" && substr($z, -1) == "'") $z = substr($z, 1, -1);
            $res[] = array(substr($x, $gtpos+1, -4), $z);
        }
    }
    unset($a);
    return $res;
}
###
### example usage:
###

$data = <<<EOT
<a href=z>zz</a> <a href="z" bold="yes">ZZ</a>
<a link="y">yy</a> <a title="x" href='aa'>aa</a>
text before, <a href="href.here"><b>bold text inside</b></a> and text after
<a href="image.png"><img src="image.png"/></a>
EOT;

$LINKS = extract_URLs($data);
foreach ($LINKS as $v) {
echo $v[0], ' --> [', $v[1], "]\n";
}
?>

:x
Jul 17 '05 #6

Pedro <he****@hotpop.com> wrote:
[...]

Looks great, and it does basically what I want...
Care to explain to the class how it works? Especially the regular expression part?

And thanx by the way.
Jul 17 '05 #7

Steve wrote:
Looks great, and it does basically what I want...
Care to explain to the class how it works? Especially the regular
expression part?

And thanx by the way.


Let's see how I go about that ... hope it makes sense :)
# extract URLs from a string; return an array of arrays;
# each inner array has the text and the URL
function extract_URLs($s) {
    # initialize return array
    $res = array();
    # grab all "<a ...</a>" bits
    preg_match_all('@(<a\s.*</a>)@Uis', $s, $a);
    #              |`----v-----'|||`- dot metacharacter matches all (\n included)
    #              |     |      ||`-- case-insensitive matches
    #              |     |      |`--- ungreedy, so that '<a href="1">1</a><a href="2">2</a>'
    #              |     |      |     does *NOT* match \___ all_of_this ___/
    #              |     |      `---- end pattern delimiter
    #              |     `----------- grab into $a[1]
    #              `----------------- pattern delimiter
    #
    # for the pattern inside the parentheses:
    #   <a\s  literal "<a" followed by whitespace, which stops the regex
    #         from matching "abbr", "acronym", "address", "applet", and "area"
    #   .*    any number of anything
    #         (except "</a>", because we're in ungreedy matching)
    #   </a>  literal "</a>"
    # for all "<a ...</a>" matches
    foreach ($a[1] as $x) {
        # find the first ">" -- certainly it is the one that ends the opening "<a "
        $gtpos = strpos($x, '>');
        # and isolate that part
        $y = substr($x, 0, $gtpos);
        # if there's a "href=" there, we have a good match!
        # gets rid of "title" in <a title="index" href="index.html">
        if ($hrefpos = strpos($y, 'href=')) {
            # put the URL, and trailing stuff (up to, but not including, the closing ">"), in $z
            $z = substr($y, $hrefpos+5);
            # remove everything after, and including, the first whitespace
            # (whitespace is not allowed in URLs);
            # gets rid of "title" in <a href="index.html" title="index">;
            # if there's no match, there also is no change
            $z = preg_replace('/^(\S+)\s.*$/U', '$1', $z);
            #   /    start of expression
            #   ^    start of string
            #   (    grab
            #   \S+  one or more non-whitespace characters
            #   )    into $1
            #   \s   discard the first whitespace
            #   .*   and everything following it
            #   $    up to the end of the string
            #   /U   end of expression, do ungreedy match (why? I can't remember :)
            # if the URL is delimited by '"' or "'", remove those
            if ($z[0] == '"' && substr($z, -1) == '"') $z = substr($z, 1, -1);
            if ($z[0] == "'" && substr($z, -1) == "'") $z = substr($z, 1, -1);
            # save result in array:
            # $x still is the whole string "<a href='index.html' title='index'>link text</a>"
            # $gtpos is the position of the first ">"
            # and the last 4 characters of $x are "</a>"
            #
            # $z is the URL from the href that has been dealt with previously
            $res[] = array(substr($x, $gtpos+1, -4), $z);
        }
    }
    # I don't like leaving "large" things abandoned
    unset($a);

    return $res;
}

compact new version function:

<?php
function extract_URLs($s) {
    ### version 3
    ### changes from version 2:
    ###   the character separating "<a" from "href" (or whatever) may be any whitespace
    ###   only need to test for "href=" in the <a ...> part
    $res = array();
    preg_match_all('@(<a\s.*</a>)@Uis', $s, $a);
    foreach ($a[1] as $x) {
        $gtpos = strpos($x, '>');
        $y = substr($x, 0, $gtpos);
        if ($hrefpos = strpos($y, 'href=')) {
            $z = substr($y, $hrefpos+5);
            $z = preg_replace('/^(\S+)\s.*$/U', '$1', $z);
            if ($z[0] == '"' && substr($z, -1) == '"') $z = substr($z, 1, -1);
            if ($z[0] == "'" && substr($z, -1) == "'") $z = substr($z, 1, -1);
            $res[] = array(substr($x, $gtpos+1, -4), $z);
        }
    }
    unset($a);
    return $res;
}
?>
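As a point of comparison with the parser route Hartmut suggested, PHP 5's bundled DOM extension can do the same extraction without any regular expressions; a sketch (the function name is made up):

```php
<?php
// Extract (text, href) pairs with the DOM extension instead of regexes.
// Case, spacing, and attribute order are handled by the HTML parser.
function extract_URLs_dom($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);   // '@': real-world HTML is rarely well-formed
    $res = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $res[] = array($a->textContent, $a->getAttribute('href'));
        }
    }
    return $res;
}

$pairs = extract_URLs_dom('<A Href="x.html">click</A> <a name="n">no link</a>');
print_r($pairs); // one pair: ("click", "x.html")
```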
Jul 17 '05 #8

Pedro <he****@hotpop.com> wrote:
[...]

<--Gives Pedro a Gold Star and says, Thank You for that detailed
report, you get a Gold Star!
Jul 17 '05 #9

Steve wrote:
Thank You for that detailed report.


You're very welcome.

Jul 17 '05 #10

This discussion thread is closed.