What is wrong with my script?

I'm using the code below to extract the text between all the <br></br> tags.

But it does not print all of the text, and it also prints normal text that is not part of an HTML link tag.

For example, if you have <a href="test.html" ><b>The Testing Page is here</b></a>
<b> extrat text</b>
I want to extract only "The Testing Page is here".



Here, the variable $myfile contains the whole HTML page:

    while ($myfile =~ /<br.+?>(.*)<\/br>/xg) {
        print("a");
        print $1;
    }

Can someone help me figure out what I am doing wrong here?

More information: I am trying to extract all the text that is a link in the given HTML page.
Feb 6 '10 #1
3 Replies


There is no line-break tag in your example, and the line-break tag does not have a closing tag; it is self-closing. I will assume you mean the bold tag.

Your regex, as written, looks for a line-break tag, so right off the bat that needs to be fixed.

Furthermore, the way you have it written, it will only pick up link text that sits between bold tags. Not very flexible.

The pattern you want to look for is an anchor tag, followed by zero or more tags, followed by text of any length that ends when you hit the opening bracket of the next tag.
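
A minimal sketch of that pattern, in case it helps. The sample string is just the example from the question, and the @link_text array is a name made up for illustration; treat this as one possible reading of the description, not a complete HTML solution.

```perl
use strict;
use warnings;

# Sample input taken from the question; in the real script $myfile
# would hold the whole HTML page.
my $myfile = <<'HTML';
<a href="test.html" ><b>The Testing Page is here</b></a>
<b> extrat text</b>
HTML

my @link_text;   # hypothetical name, used only for this illustration

# <a\b[^>]*>     the opening anchor tag with any attributes
# (?:<[^>]+>)*   zero or more tags right after it, such as <b>
# ([^<]+)        the text, up to the opening bracket of the next tag
while ($myfile =~ /<a\b[^>]*>(?:<[^>]+>)*([^<]+)/gi) {
    push @link_text, $1;
}

print "$_\n" for @link_text;   # prints: The Testing Page is here
```

The bold text on the second line is skipped because it is not preceded by an anchor tag.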

But even with that, there is a problem if a tag is embedded in the middle of a sentence used as the link text. I'll leave that to you to figure out, though, if you care to.
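
For anyone who does care to, one possible approach (a sketch, not the only way) is to capture everything up to the closing anchor tag and then strip whatever tags remain inside it. The input string below is a made-up variation of the example with an <i> tag in the middle of the link text, and @link_text is again a hypothetical name. For real-world pages a proper HTML parser module is likely to be more robust than any regex.

```perl
use strict;
use warnings;

# Made-up input: the link text has a tag embedded in the middle of it.
my $myfile = '<a href="test.html">The <i>Testing</i> Page is here</a> <b> extrat text</b>';

my @link_text;   # hypothetical name, used only for this illustration

# Capture everything between <a ...> and the nearest </a>.
# /s lets . match newlines, /i ignores case, .*? keeps the match minimal.
while ($myfile =~ /<a\b[^>]*>(.*?)<\/a>/gis) {
    my $text = $1;
    $text =~ s/<[^>]+>//g;   # drop any tags left inside the link text
    push @link_text, $text;
}

print "$_\n" for @link_text;   # prints: The Testing Page is here
```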
Feb 7 '10 #2

numberwhun
You need to really examine what you are telling your code to extract and what you actually have in your data.

You are telling it to match everything between <br> and </br>, but those tags do not exist in your example. Instead, remove the 'r' and try matching the <b> </b> tag set.

Regards,

Jeff
Feb 8 '10 #3

nithinpes
If you use:

    $myfile =~ /<b>(.*)<\/b>/xg

$1 would contain "The Testing Page is here</b></a>
<b> extrat text" (given the /s modifier, or with both tags on one line; without /s, . does not match a newline).
This is because of the greedy nature of the * quantifier. To limit this behaviour and match the minimum number of characters before finding a </b>, use:

    $myfile =~ /<b>(.*?)<\/b>/xg

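
To see the difference side by side, here is a small sketch. It assumes the two tags from the question sit on a single line, so that . can reach the second </b>; the $greedy and $lazy variable names are made up for the illustration.

```perl
use strict;
use warnings;

# The example from the question, joined onto one line (an assumption
# made so that . can reach the second </b> without the /s modifier).
my $myfile = '<a href="test.html" ><b>The Testing Page is here</b></a> <b> extrat text</b>';

# In list context a match without /g returns the captured groups.
my ($greedy) = $myfile =~ /<b>(.*)<\/b>/;    # .* runs to the LAST </b>
my ($lazy)   = $myfile =~ /<b>(.*?)<\/b>/;   # .*? stops at the FIRST </b>

print "greedy: $greedy\n";   # greedy: The Testing Page is here</b></a> <b> extrat text
print "lazy:   $lazy\n";     # lazy:   The Testing Page is here
```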
Feb 8 '10 #4
