
PHP and regular expressions

P: n/a
Hi,
I'm new to this group and to regular expressions. I want to extract text
from a newspaper website using regular expressions and PHP.
I'm using this regular expression at the moment:

$regexp = "%<table cellspacing=\"0\" cellpadding=\"0\"(.+) <img height=\"1\" alt=\"Today's News\" />%s";

Each news story sits between those tags, so if I extract those
chunks of HTML using

preg_match($regexp,$document,$matches);

where $document is a handle on the file, I can store them in $matches
for further processing. Alas, it does not work and I cannot figure out
why.

Can anyone help?
Thanks

Feb 28 '07 #1
8 Replies


P: n/a
> where $document is a handle on the file

The second parameter of preg_match() must be a string, so read the file
into a string first.

Do you really want the space around "(.+)"?
You can clean this up by enclosing the pattern in '' rather than "".
The only character that then needs a backslash is the apostrophe in "Today's":
$regexp = '%<table cellspacing="0" cellpadding="0"(.+) <img height="1" alt="Today\'s News" />%s';
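
A minimal sketch of the handle-versus-string point, assuming the page has been saved to a local file. The sample markup and the temporary file are made up for illustration and are not the poster's actual page, and the pattern is written without the space before <img. fopen() gives a resource, while file_get_contents() gives the string that preg_match() expects:

```php
<?php
// Made-up sample markup standing in for the newspaper page (an assumption).
$html = '<table cellspacing="0" cellpadding="0"><tr><td>Story one</td></tr></table>'
      . "\n"
      . '<img height="1" alt="Today\'s News" />';

// Save it to a temporary file so the example runs on its own.
$path = tempnam(sys_get_temp_dir(), 'news');
file_put_contents($path, $html);

// fopen() returns a resource (a "handle"), not the file's contents.
$handle = fopen($path, 'r');
$isHandle = is_resource($handle);
fclose($handle);

// file_get_contents() returns the contents as a string.
$document = file_get_contents($path);
unlink($path);

// Single-quoted pattern: only the apostrophe needs a backslash.
$regexp = '%<table cellspacing="0" cellpadding="0"(.+)<img height="1" alt="Today\'s News" />%s';

$found = preg_match($regexp, $document, $matches);
if ($found === 1) {
    // $matches[1] holds everything captured by (.+).
    echo $matches[1], "\n";
}
```

The same applies to a URL: read it into a string before handing it to preg_match().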


Feb 28 '07 #2

P: n/a
NC
On Feb 28, 10:03 am, "Damo" <cormacdeba...@gmail.com> wrote:
> I'm using this regular expression at the moment
>
> $regexp = "%<table cellspacing=\"0\" cellpadding=\"0\"(.+) <img
> height=\"1\" alt=\"Today's News\" />%s";
>
> Each news story is in between those tags , So if I extract those
> chunks of html using
>
> preg_match($regexp,$document,$matches);
>
> where $document is a handle on the file. I can store them in $matches
> for further processing. Alas it does not work

How exactly does it not work? What gets into $matches, and what value
does preg_match() return?

> Can anyone help?

First, do you really need the whitespace around (.+)? Second, $document
must be a string, not a handle on the file. Third, your regular
expression as written is greedy; is this intentional?
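
One way to answer the "what does preg_match() return" question is to check the return value explicitly. This is only a sketch: the subject string and the pattern are made-up stand-ins. preg_match() returns 1 on a match, 0 on no match, and false on an error:

```php
<?php
// Made-up subject with no table in it (an assumption for illustration).
$document = '<p>No tables here.</p>';
$regexp = '%<table[^>]*>(.+?)</table>%s';

$result = preg_match($regexp, $document, $matches);

if ($result === false) {
    // The pattern failed to run; preg_last_error() gives an error code.
    echo 'Regex error, code ', preg_last_error(), "\n";
} elseif ($result === 0) {
    // The pattern is fine but nothing in the subject matched.
    echo "No match\n";
} else {
    // $matches[0] is the whole match, $matches[1] the first group.
    print_r($matches);
}
```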

Cheers,
NC

Feb 28 '07 #3

P: n/a
>
> First, do you really need the whitespace around (.+)?

Does the presence or absence of whitespace make a difference? As I said,
I'm new to regex.

> Second, $document must be a string, not a handle on the file.

$document is a handle on a URL that I was reading in, so yes, it was
just a string.

> Third, your regular expression as written is greedy; is this intentional?

There was no ? at the end of my regular expression; it was just (.+).

> Cheers,
> NC

I got it working since. Thanks for all your help.
Cheers
Mar 1 '07 #4

P: n/a
Rik
Damo <co***********@gmail.com> wrote:
>> First, do you really need the whitespace around (.+)?
> Does the presence/absence of whitespace make a difference... As I said
> I'm new to regex

Yes, it will match that whitespace literally unless the /x modifier is set.

>> Second, $document must be a string, not a handle on the file.
> $document is a handle on a URL that I was reading in, so yes it was
> just a string

>> Third, your regular expression as written is greedy; is this
>> intentional?
> There was no ? at the end of my regular expression it was just (.+)

Yes, so it's greedy. Without the ?, the quantifier matches as much as
possible: it runs on to the last place where the rest of the pattern can
still match instead of stopping at the first.

consider:
'<a>foo</a>bar<a>baz</a>foz'

'|<a>.+</a>|' will match '<a>foo</a>bar<a>baz</a>'
'|<a>.+?</a>|' will match '<a>foo</a>'
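
The two patterns above can be tried directly. A small sketch using the same subject string (the variable names are just for illustration):

```php
<?php
$subject = '<a>foo</a>bar<a>baz</a>foz';

// Greedy: .+ runs to the last </a> it can reach.
preg_match('|<a>.+</a>|', $subject, $greedy);

// Lazy: .+? stops at the first </a>.
preg_match('|<a>.+?</a>|', $subject, $lazy);

echo $greedy[0], "\n"; // <a>foo</a>bar<a>baz</a>
echo $lazy[0], "\n";   // <a>foo</a>
```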

For a lot of info about regular expressions:
<http://www.regularexpressions.info>

In your case, I'd possibly use:

$regexp = "%<table[^>]*>(.+?)<img%si";

(the /s modifier makes the dot match line breaks, and the /i modifier
makes the match case-insensitive, in case the tags are written in upper case).

Whether this will work depends heavily on the actual markup, though...
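
A short sketch of what the modifiers mentioned in this reply do, on made-up input rather than the real page: s lets the dot cross line breaks, i ignores case, and x makes whitespace in the pattern insignificant.

```php
<?php
// Made-up subject: upper-case tags with line breaks inside.
$subject = "<TABLE>\nstory\n<IMG>";

$none  = preg_match('%<table>(.+?)<img%', $subject);        // no modifiers
$onlyI = preg_match('%<table>(.+?)<img%i', $subject);       // case-insensitive only
$onlyS = preg_match('%<table>(.+?)<img%s', $subject);       // dot matches newline only
$both  = preg_match('%<table>(.+?)<img%si', $subject, $m);  // both

var_dump($none, $onlyI, $onlyS, $both); // int(0) int(0) int(0) int(1)

// Whitespace in a pattern is matched literally unless x is set.
$literal  = preg_match('%a (.+) b%', 'a1b');       // 0: the spaces must be present
$extended = preg_match('%a (.+) b%x', 'a1b', $x);  // 1: the spaces are ignored
```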
--
Rik Wasmus
Mar 1 '07 #5

P: n/a
On Mar 1, 4:07 pm, "Damo" <cormacdeba...@gmail.com> wrote:
>> Second, $document must be a string, not a handle on the file.
>
> $document is a handle on a URL that I was reading in, so yes it was just a string

Do we have different definitions of what a handle is?

Mar 2 '07 #6

P: n/a
On Fri, 02 Mar 2007 06:03:00 +0100, BKDotCom <bk***********@yahoo.com> wrote:
> On Mar 1, 4:07 pm, "Damo" <cormacdeba...@gmail.com> wrote:
>>> Second, $document must be a string, not a handle on the file.
>>
>> $document is a handle on a URL that I was reading in, so yes it was
>> just a string
>
> Do we have different definitions of what a handle is?

Yup.
http://nl3.php.net/manual/en/resource.php
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Mar 2 '07 #7

P: n/a
I'm slowly getting the hang of regexes. Instead of starting a new
thread, I thought I'd ask here. I have this regex:

$regexp = "%<li><a href=\"(.+?)\">.+[^>]+>([^<]+)</a>.+<span class=\"date\">(.+?)</span>.+<div class=\"abstract\">(.+?)</div>%s";

It matches the URL of a link, the text of the link, a date, and the
abstract of a story.
If I use preg_match_all and there are multiple matches for the pattern in
an HTML string, it only stores the last one in $matches.

This is how I'm printing them:

preg_match_all($regexp,$document,$matches);
$numMatches = count($matches[1]);
for ($i=0;$i <= $numMatches-1; $i++)
{
echo $matches[1][$i]."<br>";
echo $matches[2][$i]."<br>";
echo $matches[3][$i]."<br>";
echo $matches[4][$i]."<br>";
}

I have no idea why it's skipping the earlier matches.
Thanks

Mar 2 '07 #8

P: n/a
Rik
On Fri, 02 Mar 2007 18:16:38 +0100, Damo <co***********@gmail.com> wrote:
> I'm slowly getting the hang of regex's, Instead ive starting a new
> thread i thought id ask here. I have this regex:
>
> $regexp = "%<li><a href=\"(.+?)\">.+[^>]+>([^<]+)</a>.+<span class=
-------------------------------------^greedy dot--------^again
> \"date\">(.+?)</span>.+<div class=\"abstract\">(.+?)</div>%s";
------------------------^greedy dot
> It matches a the URL of a link, the text of a link, a date and the
> abstract of a story.
> If I use preg_match_all and theres multiple matches to the pattern in
> a html string, it inly stores the last one in $matches

$regexp = '%<li>\s*
<a[^>]*?href="([^"]+)"[^>]*>
(.*?)
</a>
.*?
<span[^>]*?class="date"[^>]*>
(.+?)
</span>
.+?
<div[^>]*?class="abstract"[^>]*>
(.+?)
</div>%six';
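
To see the difference, here is a sketch that runs the original greedy pattern and a lazy, /x-formatted pattern along the lines of the one above against the same input. The list markup is made up for illustration and the real page may differ. PREG_SET_ORDER groups the results per match, which makes the printing loop simpler:

```php
<?php
// Made-up list markup (an assumption; the real page may differ).
$document =
    '<ul>' . "\n" .
    '<li><a href="/news/1">First story</a> <span class="date">1 Mar 2007</span> ' .
    '<div class="abstract">Abstract one</div></li>' . "\n" .
    '<li><a href="/news/2">Second story</a> <span class="date">2 Mar 2007</span> ' .
    '<div class="abstract">Abstract two</div></li>' . "\n" .
    '</ul>';

// Original pattern: the greedy .+ runs across both list items.
$greedy = '%<li><a href="(.+?)">.+[^>]+>([^<]+)</a>.+<span class="date">(.+?)</span>'
        . '.+<div class="abstract">(.+?)</div>%s';

// Lazy pattern; with x the line breaks and indentation are ignored.
$lazy = '%<li>\s*
    <a[^>]*?href="([^"]+)"[^>]*>
    (.*?)
    </a>
    .*?
    <span[^>]*?class="date"[^>]*>
    (.+?)
    </span>
    .+?
    <div[^>]*?class="abstract"[^>]*>
    (.+?)
    </div>%six';

$greedyCount = preg_match_all($greedy, $document, $swallowed, PREG_SET_ORDER);
$lazyCount   = preg_match_all($lazy, $document, $stories, PREG_SET_ORDER);

echo "greedy: $greedyCount, lazy: $lazyCount\n"; // greedy: 1, lazy: 2

// Each $story is one match: [1] URL, [2] link text, [3] date, [4] abstract.
foreach ($stories as $story) {
    echo $story[1], ' | ', $story[2], ' | ', $story[3], ' | ', $story[4], "\n";
}
```

On this input the greedy pattern finds a single match that starts in the first item and takes its link text, date and abstract from the second, which is consistent with only "the last one" showing up.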

--
Rik Wasmus
Mar 2 '07 #9
