regular expression for parsing html using preg_match_all

Hi all,

I've been trying unsuccessfully to get the text from an HTML page. The
HTML tag that I'm interested in looks like this:

<a class=link
href="http://www.something.com/_something.php?type=cart">Shopping
Cart</a>
<div><em class=newentry><a href=http://nothing.com>New
Age</a></em></div>
From the above tag, I want to extract "Shopping Cart". I'm not very
good with RE. I tried this:
$lines = file_get_contents("http://theabovetag.com/page.html");
preg_match_all("/(<a\ class\=link\ href\=(.*)>)(<\/a>)/", $lines,
$matches1);

The above RE gives me "Shopping Cart" plus "New Age" as well. I just
want "Shopping Cart". What am I doing wrong? My RE is somehow ignoring
the </a> tag right after Shopping Cart and instead accepting the </a>
after New Age. Please help!

Jul 6 '06 #1
3 Replies



The cause is most likely the greediness of *. By default a quantifier
matches the *longest* possible string. To make it match the shortest
one instead, put '?' after it.
Given the string "<a>text</a>more</a>":
<a>.*</a> matches "<a>text</a>more</a>"
<a>.*?</a> matches "<a>text</a>"
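
The difference can be checked with a short script. This is only a minimal sketch using the sample string above; the variable names $subject, $greedy and $lazy are made up for the illustration.

```php
<?php
// Sample string taken from the explanation above.
$subject = '<a>text</a>more</a>';

// Greedy: .* takes as much as it can, so the match runs to the LAST </a>.
preg_match('/<a>.*<\/a>/', $subject, $greedy);

// Lazy: .*? takes as little as it can, so the match stops at the FIRST </a>.
preg_match('/<a>.*?<\/a>/', $subject, $lazy);

echo $greedy[0], "\n"; // <a>text</a>more</a>
echo $lazy[0], "\n";   // <a>text</a>
```

Note that '.' does not match a newline unless the s modifier is set, so on markup that spans several lines both variants may behave differently from this one-line sample.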

Jul 6 '06 #2

Well, what I basically want is:
<a class="something" href="http://something.com/abc.php">Shopping
Cart</a>

I want the RE to parse the HTML tag and check that it starts with '<a
class="something" href=', then IGNORE whatever is between 'href=' and
'>', and ends with '</a>'. I couldn't figure out how to "ignore" the
text in between.

Jul 7 '06 #3

Instead of a (greedy) * operator on '.', use a negated character class,
which matches everything up to a certain character:
/(<a\ class\=link\ href\=([^>]*)>)([^<]*)(<\/a>)/
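
Applied to the markup from the first post, the idea might look roughly like the sketch below. It is an adaptation, not the exact pattern above: it assumes the attributes may be separated by a line break (hence \s+ instead of an escaped space), it captures only the link text, and the names $html, $pattern and $texts are made up for the example.

```php
<?php
// Sample markup from the original question, attribute values as posted.
$html = <<<'HTML'
<a class=link
href="http://www.something.com/_something.php?type=cart">Shopping
Cart</a>
<div><em class=newentry><a href=http://nothing.com>New
Age</a></em></div>
HTML;

// \s+ tolerates a space or a line break between the attributes.
// [^>]* skips the href value without running past the end of the tag.
// ([^<]*) captures the link text up to the next tag.
$pattern = '/<a\s+class=link\s+href=[^>]*>([^<]*)<\/a>/';

preg_match_all($pattern, $html, $matches);

// Collapse the line break inside the captured text into a single space.
$texts = array_map(function ($t) {
    return trim(preg_replace('/\s+/', ' ', $t));
}, $matches[1]);

print_r($texts); // one entry: "Shopping Cart"
```

The second link is skipped because it has no class=link attribute. This only holds for link text without nested tags and for href values that contain no '>'.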
Jul 10 '06 #4
