
PHP and regular expressions

Damo
#1: Feb 28 '07
Hi,
I'm new to this group and to regular expressions. I want to extract text
from a newspaper website using regular expressions and PHP.
I'm using this regular expression at the moment:

$regexp = "%<table cellspacing=\"0\" cellpadding=\"0\"(.+) <img height=\"1\" alt=\"Today's News\" />%s";

Each news story sits between those tags, so I want to extract those
chunks of HTML using

preg_match($regexp, $document, $matches);

where $document is a handle on the file, and store them in $matches
for further processing. Alas, it does not work and I cannot figure out
why.

Can anyone help?
Thanks


BKDotCom
#2: Feb 28 '07

re: PHP and regular expressions


Quote:
> where $document is a handle on the file
The second parameter of preg_match() should be a string, so
read the file into a string first.

Do you really want the space around "(.+)"?
You can clean this up by enclosing the pattern in '' rather than "";
only the apostrophe then needs escaping:
$regexp = '%<table cellspacing="0" cellpadding="0"(.+) <img height="1" alt="Today\'s News" />%s';
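
A minimal sketch of that advice, assuming the page has already been saved to a local file. The temporary file and the sample markup are made up for illustration; file_get_contents() and preg_match() are standard PHP functions.

```php
<?php
// Hypothetical sample page, written to a temp file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'news');
file_put_contents(
    $path,
    '<table cellspacing="0" cellpadding="0"><tr><td>Story</td></tr></table> <img height="1" alt="Today\'s News" />'
);

// Read the whole file into a string; preg_match() expects a string subject.
$document = file_get_contents($path);
unlink($path);

// Single-quoted pattern: only the apostrophe needs a backslash.
$regexp = '%<table cellspacing="0" cellpadding="0"(.+) <img height="1" alt="Today\'s News" />%s';

$found = preg_match($regexp, $document, $matches);
echo $found, "\n";       // number of matches (0 or 1)
echo $matches[1], "\n";  // text captured by (.+)
```

For a remote page, file_get_contents() can also take a URL when allow_url_fopen is enabled.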


NC
#3: Feb 28 '07

re: PHP and regular expressions


On Feb 28, 10:03 am, "Damo" <cormacdeba...@gmail.com> wrote:
Quote:
> I'm using this regular expression at the moment
>
> $regexp = "%<table cellspacing=\"0\" cellpadding=\"0\"(.+) <img
> height=\"1\" alt=\"Today's News\" />%s";
>
> Each news story is in between those tags, so if I extract those
> chunks of HTML using
>
> preg_match($regexp,$document,$matches);
>
> where $document is a handle on the file, I can store them in $matches
> for further processing. Alas it does not work

How exactly does it not work? What gets into $matches, and what value
does preg_match() return?

Quote:
> Can anyone help?

First, do you really need the whitespace around (.+)? Second, $document
must be a string, not a handle on the file. Third, your regular
expression as written is greedy; is this intentional?
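
A minimal sketch of how those two questions could be answered in code. The input string and the pattern are placeholders, not the real page; preg_match() returns 1 on a match, 0 on no match and false on error.

```php
<?php
$document = '<p>no table here</p>';          // hypothetical input
$regexp   = '%<table[^>]*>(.+?)</table>%s';  // hypothetical pattern

$result = preg_match($regexp, $document, $matches);

if ($result === false) {
    // The pattern could not be run; preg_last_error() gives the error code.
    $status = 'error ' . preg_last_error();
} elseif ($result === 0) {
    $status = 'no match';
} else {
    $status = 'matched: ' . $matches[1];
}

echo $status, "\n";
var_dump($matches); // empty array when nothing matched
```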

Cheers,
NC

Damo
#4: Mar 1 '07

re: PHP and regular expressions


Quote:
> First, do you really need the whitespace around (.+)?
Does the presence or absence of whitespace make a difference? As I said,
I'm new to regex.
Quote:
> Second, $document must be a string, not a handle on the file.
$document is a handle on a URL that I was reading in, so yes, it was
just a string.
Quote:
> Third, your regular expression as written is greedy; is this intentional?
There was no ? at the end of my regular expression; it was just (.+).
Quote:
> Cheers,
> NC
I've got it working since. Thanks for all your help.
Cheers


Rik
#5: Mar 1 '07

re: PHP and regular expressions


Damo <cormacdebarra@gmail.com> wrote:
Quote:
>> First, do you really need the whitespace around (.+)?
> Does the presence/absence of whitespace make a difference... As I said
> I'm new to regex
Yes, a space in the pattern matches a literal space unless the /x
modifier is set.
Quote:
>> Second, $document must be a string, not a handle on the file.
> $document is a handle on a URL that I was reading in, so yes, it was
> just a string
Quote:
>> Third, your regular expression as written is greedy; is this
>> intentional?
> There was no ? at the end of my regular expression, it was just (.+)
Yes, so it's greedy. It will match as much as possible, up to the last
place where the rest of the pattern can still match.

consider:
'<a>foo</a>bar<a>baz</a>foz'

'|<a>.+</a>|' will match '<a>foo</a>bar<a>baz</a>'
'|<a>.+?</a>|' will match '<a>foo</a>'
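
The same comparison as a runnable sketch, using the sample string from above. The variable names are only for illustration.

```php
<?php
$subject = '<a>foo</a>bar<a>baz</a>foz';

// Greedy: .+ grabs as much as it can before the final </a>.
preg_match('|<a>.+</a>|', $subject, $greedy);

// Lazy: .+? stops at the first </a> it can reach.
preg_match('|<a>.+?</a>|', $subject, $lazy);

echo $greedy[0], "\n"; // <a>foo</a>bar<a>baz</a>
echo $lazy[0], "\n";   // <a>foo</a>

// With the lazy form, preg_match_all() finds each link separately.
$count = preg_match_all('|<a>.+?</a>|', $subject, $all);
echo $count, "\n";     // 2
```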

For a lot of info about regular expressions:
<http://www.regularexpressions.info>

In your case, I'd possibly use:

$regexp = "%<table[^>]*>(.+?)<img%si";

(the /s modifier makes the dot match line breaks, and /i makes the match
case-insensitive; one of the two is possibly the breaking point for your
regex).

Whether this will work depends heavily on the actual markup, though...
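
A small sketch of what the two modifiers do, run against made-up markup with upper-case tags and a story spread over several lines. The sample document is an assumption; the real page may look different.

```php
<?php
// Hypothetical markup: upper-case tags, content on several lines.
$document = "<TABLE cellspacing=\"0\">\n<tr><td>Story one</td></tr>\n</TABLE>\n<IMG height=\"1\">";

// With /s the dot also matches "\n"; with /i the tag case does not matter.
$withS = preg_match('%<table[^>]*>(.+?)<img%si', $document, $m1);

// Without /s the dot stops at the first line break, so nothing matches here.
$withoutS = preg_match('%<table[^>]*>(.+?)<img%i', $document, $m2);

echo $withS, "\n";    // 1
echo $withoutS, "\n"; // 0
```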
--
Rik Wasmus

BKDotCom
#6: Mar 2 '07

re: PHP and regular expressions


On Mar 1, 4:07 pm, "Damo" <cormacdeba...@gmail.com> wrote:
Quote:
>> Second, $document must be a string, not a handle on the file.
>
> $document is a handle on a URL that I was reading in, so yes, it was
> just a string
Do we have different definitions of what a handle is?

OmegaJunior
#7: Mar 2 '07

re: PHP and regular expressions


On Fri, 02 Mar 2007 06:03:00 +0100, BKDotCom <bkfake-google@yahoo.com>
wrote:
Quote:
> On Mar 1, 4:07 pm, "Damo" <cormacdeba...@gmail.com> wrote:
>>> Second, $document must be a string, not a handle on the file.
>>
>> $document is a handle on a URL that I was reading in, so yes, it was
>> just a string
>
> Do we have different definitions of what a handle is?

Yup.
http://nl3.php.net/manual/en/resource.php
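
To make the distinction concrete, here is a minimal sketch. It uses php://memory as a stand-in for the remote page so it runs offline; the markup and the pattern are made up for illustration.

```php
<?php
// A handle as returned by fopen() is a resource, not a string.
$handle = fopen('php://memory', 'w+');
fwrite($handle, '<table><tr><td>Story</td></tr></table>');
rewind($handle);

$isResource = is_resource($handle); // true
$isString   = is_string($handle);   // false

// Pull the contents out of the handle before handing them to preg_match().
$document = stream_get_contents($handle);
fclose($handle);

$found = preg_match('%<table[^>]*>(.+?)</table>%s', $document, $matches);
echo $matches[1], "\n"; // <tr><td>Story</td></tr>
```

A value returned by file_get_contents() is already a string, which would explain why the original code could have worked despite being described as a handle.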


Damo
#8: Mar 2 '07

re: PHP and regular expressions


I'm slowly getting the hang of regexes. Instead of starting a new
thread, I thought I'd ask here. I have this regex:

$regexp = "%<li><a href=\"(.+?)\">.+[^>]+>([^<]+)</a>.+<span class=\"date\">(.+?)</span>.+<div class=\"abstract\">(.+?)</div>%s";

It matches the URL of a link, the text of a link, a date and the
abstract of a story.
If I use preg_match_all and there are multiple matches for the pattern in
an HTML string, it only stores the last one in $matches.

This is how I'm printing them:

preg_match_all($regexp, $document, $matches);
$numMatches = count($matches[1]);
for ($i = 0; $i < $numMatches; $i++)
{
    echo $matches[1][$i]."<br>";
    echo $matches[2][$i]."<br>";
    echo $matches[3][$i]."<br>";
    echo $matches[4][$i]."<br>";
}
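
For reference, a small sketch of how preg_match_all() fills $matches, using a simpler made-up pattern and document. The loop above walks every entry of $matches[1], so if only one story prints, the pattern itself probably produced only one match. PREG_SET_ORDER is a standard flag; the variable names are only for illustration.

```php
<?php
$document = '<a href="/one">One</a> <a href="/two">Two</a>'; // hypothetical markup
$regexp   = '%<a href="([^"]+)">([^<]+)</a>%';

// Default order: $matches[1] holds every URL, $matches[2] every link text.
$count = preg_match_all($regexp, $document, $matches);
echo $count, "\n"; // number of complete matches

// PREG_SET_ORDER groups the captures per match, which reads well in a foreach.
preg_match_all($regexp, $document, $sets, PREG_SET_ORDER);
foreach ($sets as $set) {
    echo $set[1], ' => ', $set[2], "\n";
}
```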

I have no idea why it's skipping the earlier matches.
Thanks

Rik
#9: Mar 2 '07

re: PHP and regular expressions


On Fri, 02 Mar 2007 18:16:38 +0100, Damo <cormacdebarra@gmail.com> wrote:
Quote:
> I'm slowly getting the hang of regexes. Instead of starting a new
> thread, I thought I'd ask here. I have this regex:
>
> $regexp = "%<li><a href=\"(.+?)\">.+[^>]+>([^<]+)</a>.+<span class=
------------------------------------^greedy dot--------^again
> \"date\">(.+?)</span>.+<div class=\"abstract\">(.+?)</div>%s";
-----------------------^greedy dot
Quote:
> It matches the URL of a link, the text of a link, a date and the
> abstract of a story.
> If I use preg_match_all and there are multiple matches for the pattern in
> an HTML string, it only stores the last one in $matches

$regexp = '%<li>\s*
<a[^>]*?href="([^"]+)"[^>]*>
(.*?)
</a>
.*?
<span[^>]*?class="date"[^>]*>
(.+?)
</span>
.+?
<div[^>]*?class="abstract"[^>]*>
(.+?)
</div>%six';

--
Rik Wasmus
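
A minimal sketch of Rik's pattern run against made-up markup for two stories. The greedy dots flagged above can each run to the end of the document and then backtrack, so a single match ends up spanning several stories; with lazy quantifiers each story matches on its own. The sample HTML is an assumption, since the real page is not shown in this thread.

```php
<?php
// Hypothetical markup for two stories; the real page may differ.
$document =
    '<li><a href="/story1">First story</a>' . "\n" .
    '<span class="date">1 Mar 2007</span>' . "\n" .
    '<div class="abstract">Abstract one</div></li>' . "\n" .
    '<li><a href="/story2">Second story</a>' . "\n" .
    '<span class="date">2 Mar 2007</span>' . "\n" .
    '<div class="abstract">Abstract two</div></li>';

// /x ignores the whitespace and line breaks used to lay out the pattern.
$regexp = '%<li>\s*
<a[^>]*?href="([^"]+)"[^>]*>
(.*?)
</a>
.*?
<span[^>]*?class="date"[^>]*>
(.+?)
</span>
.+?
<div[^>]*?class="abstract"[^>]*>
(.+?)
</div>%six';

$count = preg_match_all($regexp, $document, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] = URL, $m[2] = link text, $m[3] = date, $m[4] = abstract
    echo implode(' | ', array_slice($m, 1)), "\n";
}
```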