Bytes IT Community

using PHP to parse through HTML

P: n/a
Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
attributes of anchor tags and SRC attributes of IMG tags. Does anyone
know of any libraries/freeware to help parse through HTML to find these
things? Right now, I'm doing a lot of "strstr" calls, but there is
probably a better way to do what I need.

Thanks for any help, - Dave

Jul 17 '05 #1
8 Replies


P: n/a
On 19 Feb 2005 11:49:24 -0800, la***********@gmail.com wrote:
> Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
> attributes of anchor tags and SRC attributes of IMG tags. Does anyone
> know of any libraries/freeware to help parse through HTML to find these
> things? Right now, I'm doing a lot of "strstr" calls, but there is
> probably a better way to do what I need.


Haven't used it myself, but seen mentions of:

http://pear.php.net/package/XML_HTMLSax

... which looks possibly suitable from the description on the page.

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
Jul 17 '05 #2

P: n/a
la***********@gmail.com wrote in
news:11*********************@c13g2000cwb.googlegroups.com:
> Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
> attributes of anchor tags and SRC attributes of IMG tags. Does anyone
> know of any libraries/freeware to help parse through HTML to find these
> things? Right now, I'm doing a lot of "strstr" calls, but there is
> probably a better way to do what I need.


Take a look at preg_split()
http://www.php.net/manual/en/function.preg-split.php
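
As a rough illustration of the preg_split() idea, the sketch below splits a string on its tags and keeps only the opening A and IMG tags. The sample markup and variable names are assumptions made up for this example, and the pattern is deliberately naive: it will misfire on a ">" inside an attribute value, on comments, and on script blocks.

```php
<?php
// Hypothetical sample markup; not taken from the thread.
$html = '<p>See <a href="page.html">this page</a> and <img src="logo.png" alt="logo"></p>';

// Split on tags. PREG_SPLIT_DELIM_CAPTURE keeps each captured tag as its own
// piece; PREG_SPLIT_NO_EMPTY drops the empty strings between adjacent tags.
$pieces = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

// Keep only the opening <a ...> and <img ...> tags (case-insensitive).
$tags = array();
foreach ($pieces as $piece) {
    if (preg_match('/^<(a|img)\s/i', $piece)) {
        $tags[] = $piece;
    }
}

print_r($tags);
```

Each entry in $tags is still a whole tag, so the attribute values would have to be pulled out in a second step.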

--
Dave Patton
Canadian Coordinator, Degree Confluence Project
http://www.confluence.org/
My website: http://members.shaw.ca/davepatton/
Jul 17 '05 #3

P: n/a
Too bad none of the examples work. I untarred/uncompressed the file,
copied the folder to a public html directory, and then every time I try
to launch an example, I get errors like:

Warning: main(XML/HTMLSax/XML_HTMLSax_States.php): failed to open
stream: No such file or directory in
/usr/local/apache/htdocs/temp/XML/XML_HTMLSax.php on line 36

Fatal error: main(): Failed opening required
'XML/HTMLSax/XML_HTMLSax_States.php'
(include_path='.:/usr/local/lib/php') in
/usr/local/apache/htdocs/temp/XML/XML_HTMLSax.php on line 36

Andy Hassall wrote:
> On 19 Feb 2005 11:49:24 -0800, la***********@gmail.com wrote:
>> Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
>> attributes of anchor tags and SRC attributes of IMG tags. Does anyone
>> know of any libraries/freeware to help parse through HTML to find these
>> things? Right now, I'm doing a lot of "strstr" calls, but there is
>> probably a better way to do what I need.
>
> Haven't used it myself, but seen mentions of:
>
> http://pear.php.net/package/XML_HTMLSax
>
> ... which looks possibly suitable from the description on the page.
>
> --
> Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
> <http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool

Jul 17 '05 #4

P: n/a
"laredotornado" wrote:
> Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
> attributes of anchor tags and SRC attributes of IMG tags. Does anyone
> know of any libraries/freeware to help parse through HTML to find these
> things? Right now, I'm doing a lot of "strstr" calls, but there is
> probably a better way to do what I need.
>
> Thanks for any help, - Dave


strstr is the LAST thing you want to do in this case! I don't know
of libraries, but you can use preg_match to grab the tags that you
need.

If you are into PHP, learning preg_match and regular expressions in
general is almost a must. It will substantially increase the power
of your code.
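
A minimal sketch of what that could look like for the HREF and SRC case follows. The helper name extract_attribute() and the sample markup are assumptions for illustration, and it uses preg_match_all (the all-matches sibling of preg_match) so that every tag is collected. The pattern handles double-quoted, single-quoted, and unquoted values, but it is not a full HTML parser and will be fooled by comments, script blocks, or a ">" inside an attribute value. The (?| ... ) branch-reset group needs a reasonably recent PCRE, so it may not be available on a PHP 4 install.

```php
<?php
// Hypothetical helper: returns the values of $attr found on <$tag ...> elements.
function extract_attribute($html, $tag, $attr)
{
    // \b stops "a" from matching "abbr"; \s before the attribute name stops
    // "href" from matching "data-href". The branch-reset group (?| ... ) puts
    // the value in group 1 whichever quoting style was used.
    $pattern = '/<' . preg_quote($tag, '/') . '\b[^>]*?\s' . preg_quote($attr, '/')
             . '\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';

    if (preg_match_all($pattern, $html, $matches)) {
        return $matches[1];
    }
    return array();
}

// Hypothetical sample markup.
$html = '<a href="page.html">one</a> <A HREF=\'two.html\'>two</a> <img src="logo.png" alt="">';

print_r(extract_attribute($html, 'a', 'href'));
print_r(extract_attribute($html, 'img', 'src'));
```

One regular expression replaces the chain of strstr calls here, at the cost of a pattern that has to be read carefully.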

steve

--
Posted using the http://www.dbforumz.com interface, at author's request
Articles individually checked for conformance to usenet standards
Topic URL: http://www.dbforumz.com/PHP-parse-HT...ict199658.html
Visit Topic URL to contact author (reg. req'd). Report abuse: http://www.dbforumz.com/eform.php?p=677948
Jul 17 '05 #5

P: n/a
> strstr is the LAST thing you want to do in this case! I don't know
> of libraries, but you can use preg_match to grab the tags that you
> need.
>
> If you are into PHP, learning preg_match and regular expressions in
> general is almost a must. It will substantially increase the power
> of your code.
>
> steve

Sorry, can you elaborate on your first statement?
Are you saying that "strstr" is slower than "preg_match"? What about
"strpos"?

The reason I ask is, if it were faster to look for a character in a string
using "preg_match", then why wouldn't strpos/strstr use it themselves?

I need to look for 2 characters in some data (case sensitive). What would
be the fastest way of finding the first occurrence?

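$first = strpos( $data, $charA );
$sec = strpos( $data, $charB );
// strpos() returns false (not 0) when nothing is found, so test with ===
if ( $first === false ) return $sec;
if ( $sec === false ) return $first;
return ($first<$sec)?$first:$sec;

// would there be a faster way to achieve the above using "preg_match"?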

Simon
Jul 17 '05 #6

P: n/a
On 19 Feb 2005 20:22:22 -0800, la***********@zipmail.com wrote:
> Andy Hassall wrote:
>> On 19 Feb 2005 11:49:24 -0800, la***********@gmail.com wrote:
>>> Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
>>> attributes of anchor tags and SRC attributes of IMG tags. Does anyone
>>> know of any libraries/freeware to help parse through HTML to find these
>>> things? Right now, I'm doing a lot of "strstr" calls, but there is
>>> probably a better way to do what I need.
>>
>> Haven't used it myself, but seen mentions of:
>>
>> http://pear.php.net/package/XML_HTMLSax
>>
>> ... which looks possibly suitable from the description on the page.
>
> Too bad none of the examples work. I untarred/uncompressed the file,
> copied the folder to a public html directory

That's not how you're supposed to install PEAR modules; here's an example of how:

root@server:~# pear install http://pear.php.net/get/XML_HTMLSax-2.1.2.tgz
downloading XML_HTMLSax-2.1.2.tgz ...
Starting to download XML_HTMLSax-2.1.2.tgz (16,099 bytes)
.......done: 16,099 bytes
install ok: XML_HTMLSax 2.1.2

You could probably get away with unpacking to a public_html directory but
you'd need to fiddle with your include_path else you get errors like:
Warning: main(XML/HTMLSax/XML_HTMLSax_States.php): failed to open
stream: No such file or directory in
/usr/local/apache/htdocs/temp/XML/XML_HTMLSax.php on line 36


The examples work OK for me after installing through pear as above.
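
For the unpack-by-hand route, "fiddling with your include_path" could look roughly like the sketch below. The directory is a made-up placeholder: it has to be whichever directory actually contains the XML/HTMLSax/ subtree that the failing require names, which depends on how the archive was unpacked.

```php
<?php
// Hypothetical location; this directory must contain
// XML/HTMLSax/XML_HTMLSax_States.php for the package's own includes to resolve.
$pearDir = '/path/to/unpacked/packages';

// Append the directory to the current include_path rather than replacing it,
// so the default entries (such as "." and the system PEAR directory) still apply.
set_include_path(get_include_path() . PATH_SEPARATOR . $pearDir);

// Show the resulting search path for verification.
echo get_include_path(), "\n";
```

This only affects the running script; installing through the pear command as shown above avoids the need for it.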

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
Jul 17 '05 #7

P: n/a
I posted an example at: http://hotscripts.com/Detailed/44390.html

Jul 17 '05 #8

P: n/a
"Simon" wrote:

>> strstr is the LAST thing you want to do in this case! I don't know
>> of libraries, but you can use preg_match to grab the tags that you
>> need.
>>
>> If you are into PHP, learning preg_match and regular expressions in
>> general is almost a must. It will substantially increase the power
>> of your code.
>>
>> steve
>
> Sorry, can you elaborate on your first statement?
> Are you saying that "strstr" is slower than "preg_match"? What about
> "strpos"?
>
> The reason I ask is, if it were faster to look for a character in a string
> using "preg_match", then why wouldn't strpos/strstr use it themselves?
>
> I need to look for 2 characters in some data (case sensitive). What would
> be the fastest way of finding the first occurrence?
>
> $first = strpos( $data, $charA );
> $sec = strpos( $data, $charB );
> // strpos() returns false (not 0) when nothing is found, so test with ===
> if ( $first === false ) return $sec;
> if ( $sec === false ) return $first;
> return ($first<$sec)?$first:$sec;
>
> // would there be a faster way to achieve the above using "preg_match"?
>
> Simon


Simon, in 99% of cases speed does not matter, i.e. you can
achieve good speed regardless; it is not something I have ever had to worry
about in my code. The point is that with preg_match and regex, you
can achieve in one statement what would otherwise take 10 statements.
If you ever parse free text in any shape
or form, regex is the way to go. Your example above is simple, and if
that is all you need, fine. But as soon as the text has spurious
spaces, other characters that may or may not be present, and a whole
bunch of other conditions outside your control, you need a much more
powerful engine, and that is regex.
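
To make the comparison concrete, here is a sketch of the "first occurrence of either character" lookup done once with a plain string function and once with preg_match. The function names are assumptions for this example, and no speed claim is implied; which one is faster on given data would have to be measured.

```php
<?php
// Hypothetical helpers: both return the offset of the first occurrence of
// either character, or false when neither character is present.

function first_of_strcspn($data, $charA, $charB)
{
    // strcspn() returns the length of the leading run that contains none of
    // the listed characters, which is the offset of the first hit.
    $pos = strcspn($data, $charA . $charB);
    return ($pos === strlen($data)) ? false : $pos;
}

function first_of_preg($data, $charA, $charB)
{
    // Build a character class from the two characters; preg_quote() escapes
    // anything that would be special in the pattern.
    $pattern = '/[' . preg_quote($charA . $charB, '/') . ']/';

    // PREG_OFFSET_CAPTURE makes $m[0] a pair of (matched text, byte offset).
    if (preg_match($pattern, $data, $m, PREG_OFFSET_CAPTURE)) {
        return $m[0][1];
    }
    return false;
}

var_dump(first_of_strcspn('hello world', 'o', 'w'));
var_dump(first_of_preg('hello world', 'o', 'w'));
var_dump(first_of_preg('abc', 'x', 'y'));
```

For two fixed characters the string-function version is arguably the simpler one; the regex version starts to pay off once the thing being searched for is a pattern rather than a literal character.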

Jul 17 '05 #9

This discussion thread is closed
