using PHP to parse through HTML

Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
attributes of anchor tags and SRC attributes of IMG tags. Does anyone
know of any libraries/freeware to help parse through HTML to find these
things? Right now, I'm doing a lot of "strstr" calls, but there is
probably a better way to do what I need.

Thanks for any help, - Dave

Jul 17 '05 #1
On 19 Feb 2005 11:49:24 -0800, la***********@gmail.com wrote:
Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
attributes of anchor tags and SRC attributes of IMG tags. Does anyone
know of any libraries/freeware to help parse through HTML to find these
things. Right now, I'm doing a lot of "strstr" calls, but there is
probably a better way to do what I need.


Haven't used it myself, but seen mentions of:

http://pear.php.net/package/XML_HTMLSax

... which looks possibly suitable from the description on the page.

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
Jul 17 '05 #2
la***********@gmail.com wrote in
news:11*********************@c13g2000cwb.googlegroups.com:
Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
attributes of anchor tags and SRC attributes of IMG tags. Does anyone
know of any libraries/freeware to help parse through HTML to find these
things. Right now, I'm doing a lot of "strstr" calls, but there is
probably a better way to do what I need.


Take a look at preg_split()
http://www.php.net/manual/en/function.preg-split.php
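
As a rough sketch of how preg_split() might be applied here: splitting on a captured tag pattern hands back the tags and the text between them as separate pieces, which can then be inspected one at a time. The helper name split_html_tokens() and the sample markup are made up for illustration, and the pattern assumes that no literal ">" appears inside an attribute value.

```php
<?php
// Hypothetical helper: split an HTML string into tag tokens and text tokens.
// Assumes tags contain no literal ">" inside attribute values.
function split_html_tokens($html)
{
    // The parentheses make preg_split() return the delimiters (the tags) too;
    // PREG_SPLIT_NO_EMPTY drops the empty strings between adjacent tags.
    return preg_split('/(<[^>]*>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

$tokens = split_html_tokens('<p>See <a href="page.html">this</a></p>');
foreach ($tokens as $token) {
    if ($token[0] === '<') {
        echo "tag:  $token\n";
    } else {
        echo "text: $token\n";
    }
}
```

Each "tag" token would still need its attributes picked apart, so this only gets you halfway to the HREF and SRC values.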

--
Dave Patton
Canadian Coordinator, Degree Confluence Project
http://www.confluence.org/
My website: http://members.shaw.ca/davepatton/
Jul 17 '05 #3
Too bad none of the examples work. I untarred/uncompressed the file,
copied the folder to a public html directory, and every time I try
to launch an example I get errors like:

Warning: main(XML/HTMLSax/XML_HTMLSax_States.php): failed to open
stream: No such file or directory in
/usr/local/apache/htdocs/temp/XML/XML_HTMLSax.php on line 36

Fatal error: main(): Failed opening required
'XML/HTMLSax/XML_HTMLSax_States.php'
(include_path='.:/usr/local/lib/php') in
/usr/local/apache/htdocs/temp/XML/XML_HTMLSax.php on line 36
Andy Hassall wrote:
On 19 Feb 2005 11:49:24 -0800, la***********@gmail.com wrote:
Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
attributes of anchor tags and SRC attributes of IMG tags. Does anyone
know of any libraries/freeware to help parse through HTML to find these
things. Right now, I'm doing a lot of "strstr" calls, but there is
probably a better way to do what I need.


Haven't used it myself, but seen mentions of:

http://pear.php.net/package/XML_HTMLSax

... which looks possibly suitable from the description on the page.

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool

Jul 17 '05 #4
"laredotornado" wrote:
Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
attributes of anchor tags and SRC attributes of IMG tags. Does anyone
know of any libraries/freeware to help parse through HTML to find these
things. Right now, I'm doing a lot of "strstr" calls, but there is
probably a better way to do what I need.

Thanks for any help, - Dave


strstr is the LAST thing you want to do in this case! I don't know
of any libraries, but you can use preg_match to grab the tags that you
need.

If you are into php, learning preg_match and regular expressions in
general is almost a must. It will substantially increase the power
of your code.
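
To make the regex suggestion concrete, here is a minimal sketch using preg_match_all() to collect HREF values from anchor tags and SRC values from IMG tags. The function name extract_links() and the sample HTML are invented for the example. It only handles attribute values in double or single quotes, and like any regex approach it can be fooled by unusual markup (unquoted values, a ">" inside an attribute, attributes such as data-href), so treat it as a starting point rather than a full parser.

```php
<?php
// Hypothetical helper: collect HREF values from <a> tags and SRC values
// from <img> tags. Handles double- or single-quoted values only.
function extract_links($html)
{
    $result = array('href' => array(), 'src' => array());

    // (["\']) captures the opening quote and \1 requires the same closing quote.
    // The i flag ignores case, the s flag lets "." match newlines inside a value.
    if (preg_match_all('/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is', $html, $m)) {
        $result['href'] = $m[2];
    }
    if (preg_match_all('/<img\s[^>]*?src\s*=\s*(["\'])(.*?)\1/is', $html, $m)) {
        $result['src'] = $m[2];
    }
    return $result;
}

$html = '<A HREF="index.html">Home</A> <img alt="x" src=\'logo.png\'>';
print_r(extract_links($html));
```

With the sample string above, the 'href' list should contain index.html and the 'src' list logo.png.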

steve

--
Posted using the http://www.dbforumz.com interface, at author's request
Articles individually checked for conformance to usenet standards
Topic URL: http://www.dbforumz.com/PHP-parse-HT...ict199658.html
Visit Topic URL to contact author (reg. req'd). Report abuse: http://www.dbforumz.com/eform.php?p=677948
Jul 17 '05 #5
"steve" wrote:
strstr is the LAST thing you want to do in this case! I don't know
of libraries, but you can use preg_match to grab the tags that you
need.

If you are into php, learning preg_match and regular expressions in
general is almost a must. It will substantially increase the power
of your code.

steve

--

Sorry, can you elaborate on your first statement?
Are you saying that "strstr" is slower than "preg_match"? What about
"strpos"?

The reason I ask is, if it were faster to look for a character in a string
using "preg_match", then why wouldn't strpos/strstr use it themselves?

I need to look for 2 characters in some data (case sensitive). What would
be the fastest way of finding the first occurrence?

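$first = strpos( $data, $charA );
$sec = strpos( $data, $charB );
// strpos() returns false when nothing is found, so test with === false
if ( $first === false ) return $sec;
if ( $sec === false ) return $first;
return min( $first, $sec );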

// would there be a faster way to achieve the above using "preg_match"?

Simon
Jul 17 '05 #6
On 19 Feb 2005 20:22:22 -0800, la***********@zipmail.com wrote:
Andy Hassall wrote:
On 19 Feb 2005 11:49:24 -0800, la***********@gmail.com wrote:
>Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
>attributes of anchor tags and SRC attributes of IMG tags. Does anyone
>know of any libraries/freeware to help parse through HTML to find these
>things. Right now, I'm doing a lot of "strstr" calls, but there is
>probably a better way to do what I need.
Haven't used it myself, but seen mentions of:

http://pear.php.net/package/XML_HTMLSax

... which looks possibly suitable from the description on the page.


Too bad none of the examples work. I untarred/uncompressed the file,
copied the folder to a public html directory


That's not how you're supposed to install PEAR modules; here's an example of how:

root@server:~# pear install http://pear.php.net/get/XML_HTMLSax-2.1.2.tgz
downloading XML_HTMLSax-2.1.2.tgz ...
Starting to download XML_HTMLSax-2.1.2.tgz (16,099 bytes)
.......done: 16,099 bytes
install ok: XML_HTMLSax 2.1.2

You could probably get away with unpacking to a public_html directory but
you'd need to fiddle with your include_path else you get errors like:
Warning: main(XML/HTMLSax/XML_HTMLSax_States.php): failed to open
stream: No such file or directory in
/usr/local/apache/htdocs/temp/XML/XML_HTMLSax.php on line 36


The examples work OK for me after installing through pear as above.
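
For the unpack-by-hand route, the include_path fiddling might look roughly like the sketch below. The helper add_include_dir() is made up for illustration, and the directory is only a guess read off the paths in the warning (the folder that contains the unpacked XML/ directory); adjust it to wherever the package actually lives.

```php
<?php
// Hypothetical helper: append a directory to include_path if it is not there yet.
function add_include_dir($dir)
{
    $paths = explode(PATH_SEPARATOR, get_include_path());
    if (!in_array($dir, $paths)) {
        $paths[] = $dir;
        set_include_path(implode(PATH_SEPARATOR, $paths));
    }
    return get_include_path();
}

// Assumed location: the directory holding the unpacked XML/ folder,
// guessed from the paths shown in the warning.
add_include_dir('/usr/local/apache/htdocs/temp');

// With that in place, a require of 'XML/XML_HTMLSax.php' should also be able
// to resolve 'XML/HTMLSax/XML_HTMLSax_States.php'.
echo get_include_path(), "\n";
```

set_include_path() needs PHP 4.3.0 or later; on older versions ini_set('include_path', ...) does the same job.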

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
Jul 17 '05 #7
I posted an example at: http://hotscripts.com/Detailed/44390.html

Jul 17 '05 #8
"Simon" wrote:

strstr is the LAST thing you want to do in this case! I don't know
of libraries, but you can use preg_match to grab the tags that you
need.

If you are into php, learning preg_match and regular expressions in
general is almost a must. It will substantially increase the power
of your code.

steve

--

Sorry, can you elaborate on your first statement?
Are you saying that "strstr" is slower than "preg_match"? What about
"strpos"?

The reason I ask is, if it were faster to look for a character in a string
using "preg_match", then why wouldn't strpos/strstr use it themselves?

I need to look for 2 characters in some data (case sensitive). What would
be the fastest way of finding the first occurrence?

$first = strpos( $data, $charA );
$sec = strpos( $data, $charB );
// strpos() returns false when nothing is found, so test with === false
if ( $first === false ) return $sec;
if ( $sec === false ) return $first;
return min( $first, $sec );

// would there be a faster way to achieve the above using "preg_match"?

Simon


Simon, in 99% of cases speed does not matter, i.e. you can
achieve good speed either way; it is not something I have ever had to
worry about in my code. The point is that with preg_match and regex you
can achieve in one statement what would take 10 statements
without regex. If you ever parse free text in any shape
or form, regex is the way to go. Your example above is simple, and if
that is all you need, fine. But as soon as the text has spurious
spaces, other characters that may or may not be present, and a whole
bunch of other conditions outside your control, you need a much more
powerful engine, and that is regex.
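
As an illustration of the "one statement" point, the two-character search from the question could be sketched with a single preg_match() call. The function name first_of_either() is hypothetical, and this is one possible approach rather than a claim about which version is faster; that would need measuring.

```php
<?php
// Hypothetical helper: offset of the first occurrence of either character,
// or false when neither is present.
function first_of_either($data, $charA, $charB)
{
    // preg_quote() escapes characters that are special inside a pattern,
    // so "]" or "-" can be searched for safely.
    $pattern = '/[' . preg_quote($charA . $charB, '/') . ']/';
    if (preg_match($pattern, $data, $m, PREG_OFFSET_CAPTURE)) {
        return $m[0][1]; // byte offset of the match
    }
    return false;
}

var_dump(first_of_either('hello world', 'w', 'l')); // int(2)
var_dump(first_of_either('hello world', 'x', 'z')); // bool(false)
```

If no pattern matching is needed beyond "any of these characters", strcspn($data, $charA . $charB) is a non-regex alternative: it returns the length of the leading segment containing none of them, which equals strlen($data) when nothing is found.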

Jul 17 '05 #9
