using PHP to parse through HTML

Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
attributes of anchor tags and SRC attributes of IMG tags. Does anyone
know of any libraries/freeware to help parse through HTML to find these
things? Right now, I'm doing a lot of "strstr" calls, but there is
probably a better way to do what I need.

Thanks for any help, - Dave

Jul 17 '05 #1
On 19 Feb 2005 11:49:24 -0800, la***********@gmail.com wrote:
Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
attributes of anchor tags and SRC attributes of IMG tags. Does anyone
know of any libraries/freeware to help parse through HTML to find these
things. Right now, I'm doing a lot of "strstr" calls, but there is
probably a better way to do what I need.


Haven't used it myself, but seen mentions of:

http://pear.php.net/package/XML_HTMLSax

... which looks possibly suitable from the description on the page.

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
Jul 17 '05 #2
la***********@gmail.com wrote in
news:11*********************@c13g2000cwb.googlegroups.com:
Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
attributes of anchor tags and SRC attributes of IMG tags. Does anyone
know of any libraries/freeware to help parse through HTML to find these
things. Right now, I'm doing a lot of "strstr" calls, but there is
probably a better way to do what I need.


Take a look at preg_split()
http://www.php.net/manual/en/function.preg-split.php
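
As a rough sketch of how preg_split() could be applied here (the sample markup and variable names are made up for illustration, not taken from the original poster's pages), the page can be split into tags and text, and the anchor and image tags picked out afterwards:

```php
<?php
// Illustrative input; the markup is made up for this example.
$html = '<p>See <a href="page.html">this page</a> <img src="pic.png"></p>';

// Split on anything that looks like a tag, keeping the tags themselves
// (PREG_SPLIT_DELIM_CAPTURE) and dropping empty pieces (PREG_SPLIT_NO_EMPTY).
$pieces = preg_split('/(<[^>]+>)/', $html, -1,
                     PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

// Keep only the pieces that are opening <a ...> or <img ...> tags.
$tags = array();
foreach ($pieces as $piece) {
    if (preg_match('/^<(a|img)\b/i', $piece)) {
        $tags[] = $piece;
    }
}

print_r($tags);
```

This only isolates the tags; the attribute values still have to be pulled out of each one, and a ">" inside a quoted attribute value would confuse the simple pattern used here.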

--
Dave Patton
Canadian Coordinator, Degree Confluence Project
http://www.confluence.org/
My website: http://members.shaw.ca/davepatton/
Jul 17 '05 #3
Too bad none of the examples work. I untarred/uncompressed the file,
copied the folder to a public html directory and then every time I try
and launch an example, I get errors like

Warning: main(XML/HTMLSax/XML_HTMLSax_States.php): failed to open
stream: No such file or directory in
/usr/local/apache/htdocs/temp/XML/XML_HTMLSax.php on line 36

Fatal error: main(): Failed opening required
'XML/HTMLSax/XML_HTMLSax_States.php'
(include_path='.:/usr/local/lib/php') in
/usr/local/apache/htdocs/temp/XML/XML_HTMLSax.php on line 36
Andy Hassall wrote:
On 19 Feb 2005 11:49:24 -0800, la***********@gmail.com wrote:
Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
attributes of anchor tags and SRC attributes of IMG tags. Does anyone
know of any libraries/freeware to help parse through HTML to find these
things. Right now, I'm doing a lot of "strstr" calls, but there is
probably a better way to do what I need.


Haven't used it myself, but seen mentions of:

http://pear.php.net/package/XML_HTMLSax

... which looks possibly suitable from the description on the page.

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool

Jul 17 '05 #4
"laredotornado" wrote:
Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
attributes of anchor tags and SRC attributes of IMG tags. Does anyone
know of any libraries/freeware to help parse through HTML to find these
things. Right now, I'm doing a lot of "strstr" calls, but there is
probably a better way to do what I need.

Thanks for any help, - Dave


strstr is the LAST thing you want to do in this case! I don’t know
of libraries, but you can use preg_match to grab the tags that you
need.

If you are into php, learning preg_match and regular expressions in
general is almost a must.. it will substantially increase the power
of your code.
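
To make the suggestion concrete, here is a minimal sketch using preg_match_all, which collects every match rather than just the first. The helper name find_attr and the sample markup are assumptions made up for this example. The pattern copes with double-quoted, single-quoted and unquoted values, but it is not a full HTML parser, so unusual markup can still trip it up.

```php
<?php
// Illustrative input; the URLs are placeholders.
$html = '<a href="http://example.com/">Home</a>'
      . "<IMG alt='logo' SRC='images/logo.png'>"
      . '<a class="nav" href=about.html>About</a>';

// Hypothetical helper: return the values of one attribute on one tag name.
// The value may be double-quoted, single-quoted, or unquoted.
function find_attr($html, $tag, $attr)
{
    $pattern = '/<' . $tag . '\b[^>]*?\b' . $attr
             . '\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';
    $found = array();
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // Only one of the three capture groups takes part in a match,
            // and it is always the last element of the match array.
            $found[] = $m[count($m) - 1];
        }
    }
    return $found;
}

print_r(find_attr($html, 'a', 'href'));
print_r(find_attr($html, 'img', 'src'));
```

The /i modifier makes the tag and attribute names case-insensitive, so HREF, href, SRC and src are all found.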

steve

--
Posted using the http://www.dbforumz.com interface, at author's request
Articles individually checked for conformance to usenet standards
Topic URL: http://www.dbforumz.com/PHP-parse-HT...ict199658.html
Visit Topic URL to contact author (reg. req'd). Report abuse: http://www.dbforumz.com/eform.php?p=677948
Jul 17 '05 #5
>
strstr is the LAST thing you want to do in this case! I don't know
of libraries, but you can use preg_match to grab the tags that you
need.

If you are into php, learning preg_match and regular expressions in
general is almost a must.. it will substantially increase the power
of your code.

steve

--

Sorry, can you elaborate on your first statement?
Are you saying that "strstr" is slower than "preg_match"? What about
"strpos"?

The reason I ask is, if it were faster to look for a character in a string
using "preg_match", then why wouldn't strpos/strstr use it themselves?

I need to look for 2 characters in some data (case sensitive). What would
be the fastest way of finding the first occurrence?

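$first = strpos( $data, $charA );
$sec = strpos( $data, $charB );
// strpos() returns false when nothing is found, so test with === false
if ( $first === false ) return $sec;
if ( $sec === false ) return $first;
return ( $first < $sec ) ? $first : $sec;

// would there be a faster way to achieve the above using "preg_match"?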

Simon
Jul 17 '05 #6
On 19 Feb 2005 20:22:22 -0800, la***********@zipmail.com wrote:
Andy Hassall wrote:
On 19 Feb 2005 11:49:24 -0800, la***********@gmail.com wrote:
>Hi, I'm using PHP 4 and trying to parse through HTML to look for HREF
>attributes of anchor tags and SRC attributes of IMG tags. Does anyone
>know of any libraries/freeware to help parse through HTML to find these
>things. Right now, I'm doing a lot of "strstr" calls, but there is
>probably a better way to do what I need.
Haven't used it myself, but seen mentions of:

http://pear.php.net/package/XML_HTMLSax

... which looks possibly suitable from the description on the page.


Too bad none of the examples work. I untarred/uncompressed the file,
copied the folder to a public html directory


That's not how you're supposed to install PEAR modules; here's an example of how:

root@server:~# pear install http://pear.php.net/get/XML_HTMLSax-2.1.2.tgz
downloading XML_HTMLSax-2.1.2.tgz ...
Starting to download XML_HTMLSax-2.1.2.tgz (16,099 bytes)
.......done: 16,099 bytes
install ok: XML_HTMLSax 2.1.2

You could probably get away with unpacking to a public_html directory but
you'd need to fiddle with your include_path else you get errors like:
Warning: main(XML/HTMLSax/XML_HTMLSax_States.php): failed to open
stream: No such file or directory in
/usr/local/apache/htdocs/temp/XML/XML_HTMLSax.php on line 36


The examples work OK for me after installing through pear as above.
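
For the manual-unpack route, "fiddling with include_path" could look roughly like the sketch below. The directory is taken from the error message above and is an assumption about the layout: it is only right if that directory really contains the XML/HTMLSax/ folder the package asks for. The variable name $pear_dir is made up for the example.

```php
<?php
// Assumed location of the manually unpacked package: the directory that
// contains the XML/ folder. Adjust it to your own layout.
$pear_dir = '/usr/local/apache/htdocs/temp';

// Put that directory at the front of include_path so that paths such as
// 'XML/HTMLSax/XML_HTMLSax_States.php' are resolved relative to it.
// PATH_SEPARATOR is ':' on Unix and ';' on Windows.
ini_set('include_path', $pear_dir . PATH_SEPARATOR . ini_get('include_path'));

echo ini_get('include_path'), "\n";

// After this, the package's main file could be loaded in the usual way
// (left commented out because it needs the package to be present):
// require_once 'XML/XML_HTMLSax.php';
```

The setting only lasts for the current request, so it has to run before the first require. Installing through the pear command avoids the need for it altogether.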

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
Jul 17 '05 #7
I posted an example at: http://hotscripts.com/Detailed/44390.html

Jul 17 '05 #8
"Simon" wrote:

strstr is the LAST thing you want to do in this case! I don't know
of libraries, but you can use preg_match to grab the tags that you
need.

If you are into php, learning preg_match and regular expressions in
general is almost a must.. it will substantially increase the power
of your code.

steve

--

Sorry, can you elaborate on your first statement?
Are you saying that "strstr" is slower than "preg_match"? What about
"strpos"?

The reason I ask is, if it were faster to look for a character in a string
using "preg_match", then why wouldn't strpos/strstr use it themselves?

I need to look for 2 characters in some data (case sensitive). What would
be the fastest way of finding the first occurrence?

$first = strpos( $data, $charA );
$sec = strpos( $data, $charB );
// strpos() returns false when nothing is found, so test with === false
if ( $first === false ) return $sec;
if ( $sec === false ) return $first;
return ( $first < $sec ) ? $first : $sec;

// would there be a faster way to achieve the above using "preg_match"?

Simon


Simon, in 99% of cases speed does not matter, i.e. you can achieve
good speed regardless; it is not something I have ever had to worry
about in my code. The point is that with preg_match and regex, you
can achieve with one statement what would take 10 statements to achieve
without regex. If you ever parse free text in any shape or form, regex
is the way to go. Your example above is simple, and if that is all you
need, fine, but as soon as the text has spurious (sp?) spaces, other
characters that may or may not be present, and a whole bunch of other
conditions outside your control, you need a much more powerful engine,
and that is regex.
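
For Simon's two-character case, a single preg_match call can be sketched as below. The function names and the sample data are made up for illustration. A non-regex alternative using strcspn() is shown next to it for comparison. No claim is made about which is faster; that would need measuring on real data.

```php
<?php
// Made-up data for the example.
$data  = 'key=value;next';
$charA = ';';
$charB = '=';

// One-call regex version: a character class matches whichever of the two
// characters comes first, and PREG_OFFSET_CAPTURE reports its byte offset.
function first_of_regex($data, $charA, $charB)
{
    $class = '/[' . preg_quote($charA . $charB, '/') . ']/';
    if (preg_match($class, $data, $m, PREG_OFFSET_CAPTURE)) {
        return $m[0][1];
    }
    return false;
}

// Non-regex alternative: strcspn() returns the length of the leading run
// that contains none of the listed characters.
function first_of_strcspn($data, $charA, $charB)
{
    $pos = strcspn($data, $charA . $charB);
    return ($pos < strlen($data)) ? $pos : false;
}

var_dump(first_of_regex($data, $charA, $charB));
var_dump(first_of_strcspn($data, $charA, $charB));
```

Both functions return false when neither character is present, so the result should be compared with === false, just as with strpos().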

--
Posted using the http://www.dbforumz.com interface, at author's request
Articles individually checked for conformance to usenet standards
Topic URL: http://www.dbforumz.com/PHP-parse-HT...ict199658.html
Visit Topic URL to contact author (reg. req'd). Report abuse: http://www.dbforumz.com/eform.php?p=678383
Jul 17 '05 #9
