
February 28th, 2007, 06:15 PM
| | | PHP and regular expressions
Hi,
I'm new to this group and to regular expressions. I want to extract text
from a newspaper website using regular expressions and PHP.
I'm using this regular expression at the moment:
$regexp = "%<table cellspacing=\"0\" cellpadding=\"0\"(.+) <img
height=\"1\" alt=\"Today's News\" />%s";
Each news story is in between those tags, so if I extract those
chunks of HTML using
preg_match($regexp,$document,$matches);
where $document is a handle on the file, I can store them in $matches
for further processing. Alas, it does not work and I cannot figure out
why.
Can anyone help?
Thanks | 
February 28th, 2007, 09:35 PM
| | | Re: PHP and regular expressions
__where $document is a handle on the file__
the 2nd parameter should be a string
read the file to a string...
Do you really want the space around "(.+)" ?
you can clean this up by enclosing it in '' rather than ""
$regexp = '%<table cellspacing="0" cellpadding="0"(.+) <img
height="1" alt="Today\'s News" />%s';
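To make the string-versus-handle point concrete, here is a minimal sketch. The sample markup and the temporary file are made up for illustration; with a real page you would pass its path or URL to file_get_contents() instead. The pattern drops the spaces around the group and uses a non-greedy (.+?), which is an assumption about what was intended.

```php
<?php
// Hypothetical sample markup standing in for the newspaper page.
$html = '<table cellspacing="0" cellpadding="0"><tr><td>Story text</td></tr></table>'
      . '<img height="1" alt="Today\'s News" />';

// Write it to a temporary file so there is something to read back.
$path = tempnam(sys_get_temp_dir(), 'news');
file_put_contents($path, $html);

// A handle from fopen() is a resource, not a string.
// preg_match() needs the contents, so read the whole file into a string.
$document = file_get_contents($path);
unlink($path);

// Single-quoted pattern: only the apostrophe needs a backslash.
$regexp = '%<table cellspacing="0" cellpadding="0"(.+?)<img height="1" alt="Today\'s News" />%s';

$found = preg_match($regexp, $document, $matches);
if ($found === 1) {
    echo $matches[1], "\n"; // the chunk captured between the two tags
}
```

preg_match() returns 1 on a match, 0 on no match and false on error, so checking the return value before reading $matches tells you which case you are in.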
On Feb 28, 12:03 pm, "Damo" <cormacdeba...@gmail.com> wrote: Quote:
Hi,
I'm new to this group and regular expressions. I want to extract text
from a newspaper website using regular expressions and php
I'm using this regular expression at the moment
>
$regexp = "%<table cellspacing=\"0\" cellpadding=\"0\"(.+) <img
height=\"1\" alt=\"Today's News\" />%s";
>
Each news story is in between those tags , So if I extract those
chunks of html using
>
preg_match($regexp,$document,$matches);
>
where $document is a handle on the file. I can store them in %matches
for further processing. Alas it does not work and i cannot figure out
why
>
Can anyone help?
Thanks
| | 
February 28th, 2007, 10:35 PM
| | | Re: PHP and regular expressions
On Feb 28, 10:03 am, "Damo" <cormacdeba...@gmail.com> wrote: Quote:
>
I'm using this regular expression at the moment
>
$regexp = "%<table cellspacing=\"0\" cellpadding=\"0\"(.+) <img
height=\"1\" alt=\"Today's News\" />%s";
>
Each news story is in between those tags , So if I extract those
chunks of html using
>
preg_match($regexp,$document,$matches);
>
where $document is a handle on the file. I can store them in %matches
for further processing. Alas it does not work
| How exactly does it not work? What gets into $matches and what value
does preg_match() return?
First, do you really need the whitespace around (.+)?
Second, $document must be a string, not a handle on the file.
Third, your regular expression as written is greedy; is this intentional?
Cheers,
NC | 
March 1st, 2007, 10:15 PM
| | | Re: PHP and regular expressions Quote:
>
First, do you really need the whitespace around (.+)?
| Does the presence or absence of whitespace make a difference? As I said,
I'm new to regex. Quote: |
Second, $document must be a string, not a handle on the file.
| $document is a handle on a URL that I was reading in, so yes, it was
just a string. Quote: |
Third, your regular expression as written is greedy; is this intentional?
| There was no ? at the end of my regular expression, it was just (.+).
I got it working since. Thanks for all your help.
cheers | 
March 1st, 2007, 11:25 PM
| | | Re: PHP and regular expressions
Damo <cormacdebarra@gmail.com> wrote: Quote: Quote: |
>First, do you really need the whitespace around (.+)?
| Does the presence/absence of whitespace make a difference... As I said
I'm new to regex
| Yes, it will match that whitespace unless the /x modifier is set. Quote: Quote: |
>Second,$document must be a string, not a handle on the file.
| $document is a handle on a URL taht I was reading in , so ye it was
just a string
| Quote: Quote:
>Third, your regular expression as written is greedy; is this
>intentional?
| There was no ? at the end of my regular expression it was just (.+)
| Yes, so it's greedy. It will match as much as possible, right up to the
last place where the rest of the pattern can still match.
consider:
'<a>foo</a>bar<a>baz</a>foz'
'|<a>.+</a>|' will match '<a>foo</a>bar<a>baz</a>'
'|<a>.+?</a>|' will match '<a>foo</a>'
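The same comparison as a runnable sketch, using the subject string from above. The variable names are only for illustration.

```php
<?php
$subject = '<a>foo</a>bar<a>baz</a>foz';

// Greedy: .+ runs to the end, then backs up to the LAST </a>.
preg_match('|<a>.+</a>|', $subject, $greedy);

// Lazy: .+? stops at the FIRST </a> it can reach.
preg_match('|<a>.+?</a>|', $subject, $lazy);

// With the lazy pattern, preg_match_all() finds each link separately.
$count = preg_match_all('|<a>.+?</a>|', $subject, $all);

echo $greedy[0], "\n"; // <a>foo</a>bar<a>baz</a>
echo $lazy[0], "\n";   // <a>foo</a>
echo $count, "\n";     // 2
```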
For a lot of info about regular expressions:
<http://www.regularexpressions.info>
In your case, I'd possibly use:
$regexp = "%<table[^>]*>(.+?)<img%si";
(the /s modifier makes the dot match linebreaks, which is possibly the
breaking point for your regex; /i makes the match case-insensitive).
It depends heavily on the actual markup whether this will work, though...
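A small sketch of what the two modifiers do to that pattern. The markup here is invented (upper-case tags, a linebreak right after the table tag) purely to show the effect; real pages will differ.

```php
<?php
// Hypothetical markup: upper-case tag names and linebreaks inside the table.
$document = "<TABLE border=\"0\">\nfirst story\n<IMG src=\"x.gif\">";

// No modifiers: fails, because <table does not match <TABLE.
$plain = preg_match('%<table[^>]*>(.+?)<img%', $document);

// /i only: the case matches now, but the dot still refuses to cross "\n".
$caseOnly = preg_match('%<table[^>]*>(.+?)<img%i', $document);

// /s and /i together: the dot matches linebreaks too, so the match succeeds.
$both = preg_match('%<table[^>]*>(.+?)<img%si', $document, $m);

var_dump($plain, $caseOnly, $both); // int(0), int(0), int(1)
```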
--
Rik Wasmus | 
March 2nd, 2007, 05:15 AM
| | | Re: PHP and regular expressions
On Mar 1, 4:07 pm, "Damo" <cormacdeba...@gmail.com> wrote: Quote: Quote: |
Second,$document must be a string, not a handle on the file.
| >
$document is a handle on a URL that I was reading in , so ye it was just a string
| Do we have different definitions of what a handle is? | 
March 2nd, 2007, 08:25 AM
| | | Re: PHP and regular expressions
On Fri, 02 Mar 2007 06:03:00 +0100, BKDotCom <bkfake-google@yahoo.com>
wrote: Quote:
On Mar 1, 4:07 pm, "Damo" <cormacdeba...@gmail.com> wrote: Quote: Quote: |
Second,$document must be a string, not a handle on the file.
| >>
>$document is a handle on a URL that I was reading in , so ye it was
>just a string
| >
Do we have different definitions of what a handle is?
>
| Yup. http://nl3.php.net/manual/en/resource.php
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/ | 
March 2nd, 2007, 05:25 PM
| | | Re: PHP and regular expressions
I'm slowly getting the hang of regexes. Instead of starting a new
thread I thought I'd ask here. I have this regex:
$regexp = "%<li><a href=\"(.+?)\">.+[^>]+>([^<]+)</a>.+<span class=
\"date\">(.+?)</span>.+<div class=\"abstract\">(.+?)</div>%s";
It matches the URL of a link, the text of a link, a date and the
abstract of a story.
If I use preg_match_all and there are multiple matches to the pattern in
an HTML string, it only stores the last one in $matches.
This is how I'm printing them:
preg_match_all($regexp,$document,$matches);
$numMatches = count($matches[1]);
for ($i=0;$i <= $numMatches-1; $i++)
{
echo $matches[1][$i]."<br>";
echo $matches[2][$i]."<br>";
echo $matches[3][$i]."<br>";
echo $matches[4][$i]."<br>";
}
I have no idea why it's skipping the earlier matches.
Thanks | 
March 2nd, 2007, 06:25 PM
| | | Re: PHP and regular expressions
On Fri, 02 Mar 2007 18:16:38 +0100, Damo <cormacdebarra@gmail.com> wrote: Quote:
I'm slowly getting the hang of regex's, Instead ive starting a new
thread i thought id ask here. I have this regex:
>
$regexp = "%<li><a href=\"(.+?)\">.+[^>]+>([^<]+)</a>.+<span class=
| -------------------------------------^greedy dot--------^again Quote: |
\"date\">(.+?)</span>.+<div class=\"abstract\">(.+?)</div>%s";
| ------------------------^greedy dot Quote:
It matches a the URL of a link, the text of a link, a date and the
abstract of a story.
If I use preg_match_all and theres multiple matches to the pattern in
a html string, it inly stores the last one in $matches
| $regexp = '%<li>\s*
<a[^>]*?href="([^"]+)"[^>]*>
(.*?)
</a>
.*?
<span[^>]*?class="date"[^>]*>
(.+?)
</span>
.+?
<div[^>]*?class="abstract"[^>]*>
(.+?)
</div>%six';
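A sketch of how this behaves next to the original pattern, on made-up markup with two stories (the list, URLs and dates are invented). The multi-line pattern is written with a single dot on the .*? and .+? lines, and PREG_SET_ORDER is just one convenient way to group each story's captures together.

```php
<?php
// Hypothetical markup with two stories.
$document = <<<HTML
<ul>
<li><a href="/story/1">First headline</a> <span class="date">1 March 2007</span>
<div class="abstract">First abstract</div></li>
<li><a href="/story/2">Second headline</a> <span class="date">2 March 2007</span>
<div class="abstract">Second abstract</div></li>
</ul>
HTML;

// Original pattern: each greedy .+ swallows everything up to the LAST story.
$greedy = '%<li><a href="(.+?)">.+[^>]+>([^<]+)</a>.+<span class="date">(.+?)</span>'
        . '.+<div class="abstract">(.+?)</div>%s';
$greedyCount = preg_match_all($greedy, $document, $g);

// Non-greedy pattern; /x lets it span lines because whitespace is ignored.
$regexp = '%<li>\s*
    <a[^>]*?href="([^"]+)"[^>]*>
    (.*?)
    </a>
    .*?
    <span[^>]*?class="date"[^>]*>
    (.+?)
    </span>
    .+?
    <div[^>]*?class="abstract"[^>]*>
    (.+?)
    </div>%six';
$count = preg_match_all($regexp, $document, $stories, PREG_SET_ORDER);

// One array per story: [1] URL, [2] link text, [3] date, [4] abstract.
foreach ($stories as $story) {
    echo $story[1], ' | ', $story[2], ' | ', $story[3], ' | ', $story[4], "\n";
}
```

On this sample the greedy pattern reports a single match whose link text, date and abstract come from the second story, which would explain why only the last one seemed to be stored.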
--
Rik Wasmus |