  #1  
Old February 28th, 2007, 06:15 PM
Damo
Guest
 
Posts: n/a
Default PHP and regular expressions

Hi,
I'm new to this group and to regular expressions. I want to extract text
from a newspaper website using regular expressions and PHP.
I'm using this regular expression at the moment:

$regexp = "%<table cellspacing=\"0\" cellpadding=\"0\"(.+) <img height=\"1\" alt=\"Today's News\" />%s";

Each news story sits between those tags, so I extract those
chunks of HTML using

preg_match($regexp, $document, $matches);

where $document is a handle on the file. I can then store them in $matches
for further processing. Alas, it does not work and I cannot figure out
why.

Can anyone help?
Thanks

  #2  
Old February 28th, 2007, 09:35 PM
BKDotCom
Guest
 
Posts: n/a
Default Re: PHP and regular expressions

> where $document is a handle on the file

The second parameter should be a string, so read the file into a string first.

Do you really want the space after "(.+)"?
You can also clean this up by enclosing the pattern in '' rather than "", so that only the apostrophe needs escaping:
$regexp = '%<table cellspacing="0" cellpadding="0"(.+) <img height="1" alt="Today\'s News" />%s';
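
To make the handle-versus-string point concrete, here is a minimal sketch. It is not the original poster's code: the temporary file and its sample markup are made up for illustration, and the pattern is a lazy variant of the one above with the stray space removed.

```php
<?php
// Hypothetical sample markup standing in for the newspaper page.
$html = '<table cellspacing="0" cellpadding="0"><tr><td>Story one</td></tr></table>'
      . '<img height="1" alt="Today\'s News" />';

// Write it to a temporary file so there is something to open.
$path = tempnam(sys_get_temp_dir(), 'news');
file_put_contents($path, $html);

// fopen() returns a resource (a "handle"), not the file's contents.
$handle = fopen($path, 'r');
$isHandle = is_resource($handle);
fclose($handle);

// file_get_contents() returns the contents as a string,
// which is what preg_match() expects as its second argument.
$document = file_get_contents($path);
unlink($path);

$regexp = '%<table cellspacing="0" cellpadding="0"(.+?)<img height="1" alt="Today\'s News" />%s';
$found = preg_match($regexp, $document, $matches);

echo $found === 1 ? $matches[1] : 'no match', "\n";
```

file_get_contents() also accepts a URL when allow_url_fopen is enabled, so the same approach should carry over to reading the page directly.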


  #3  
Old February 28th, 2007, 10:35 PM
NC
Guest
 
Posts: n/a
Default Re: PHP and regular expressions

On Feb 28, 10:03 am, "Damo" <cormacdeba...@gmail.com> wrote:
Quote:
> I'm using this regular expression at the moment
>
> $regexp = "%<table cellspacing=\"0\" cellpadding=\"0\"(.+) <img
> height=\"1\" alt=\"Today's News\" />%s";
>
> Each news story is in between those tags, so I extract those
> chunks of HTML using
>
> preg_match($regexp,$document,$matches);
>
> where $document is a handle on the file. I can store them in $matches
> for further processing. Alas it does not work

How exactly does it not work? What gets into $matches, and what value does
preg_match() return?

Quote:
> Can anyone help?

First, do you really need the whitespace around (.+)? Second, $document
must be a string, not a handle on the file. Third, your regular expression
as written is greedy; is this intentional?

Cheers,
NC
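
One way to answer the "what does preg_match() return" question is to check the result explicitly. A minimal sketch, assuming a made-up document that contains no table at all:

```php
<?php
// Hypothetical input with nothing for the pattern to find.
$document = '<p>No tables here</p>';

// preg_match() returns 1 on a match, 0 on no match, and false on error.
$result = preg_match('%<table[^>]*>(.+?)</table>%s', $document, $matches);

if ($result === false) {
    // preg_last_error() reports which PCRE error occurred.
    echo 'regex error, code ', preg_last_error(), "\n";
} elseif ($result === 0) {
    echo "pattern is valid but did not match\n";
} else {
    var_dump($matches);
}
```

Distinguishing 0 from false with === matters here, because both are falsy in a plain if ($result) test.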

  #4  
Old March 1st, 2007, 10:15 PM
Damo
Guest
 
Posts: n/a
Default Re: PHP and regular expressions

Quote:
> First, do you really need the whitespace around (.+)?

Does the presence or absence of whitespace make a difference? As I said,
I'm new to regex.

Quote:
> Second, $document must be a string, not a handle on the file.

$document is a handle on a URL that I was reading in, so yes, it was
just a string.

Quote:
> Third, your regular expression as written is greedy; is this intentional?

There was no ? at the end of my regular expression; it was just (.+).

Quote:
> Cheers,
> NC

I got it working since. Thanks for all your help.
Cheers


  #5  
Old March 1st, 2007, 11:25 PM
Rik
Guest
 
Posts: n/a
Default Re: PHP and regular expressions

Damo <cormacdebarra@gmail.com> wrote:
Quote:
>> First, do you really need the whitespace around (.+)?
> Does the presence/absence of whitespace make a difference... As I said
> I'm new to regex

Yes, the pattern will match that whitespace literally unless the /x modifier is set.

Quote:
>> Second, $document must be a string, not a handle on the file.
> $document is a handle on a URL that I was reading in, so yes, it was
> just a string

Quote:
>> Third, your regular expression as written is greedy; is this
>> intentional?
> There was no ? at the end of my regular expression it was just (.+)

Yes, so it's greedy: it will match as much as possible, up to the last
place where the rest of the pattern can still match.

consider:
'<a>foo</a>bar<a>baz</a>foz'

'|<a>.+</a>|' will match '<a>foo</a>bar<a>baz</a>'
'|<a>.+?</a>|' will match '<a>foo</a>'
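
The same comparison as a runnable sketch; the variable names are only illustrative:

```php
<?php
$subject = '<a>foo</a>bar<a>baz</a>foz';

// Greedy: .+ runs as far as it can, so the match ends at the last </a>.
preg_match('|<a>.+</a>|', $subject, $greedy);

// Lazy: .+? stops at the first </a> it can reach.
preg_match('|<a>.+?</a>|', $subject, $lazy);

// With the lazy form, preg_match_all() finds each link separately.
preg_match_all('|<a>.+?</a>|', $subject, $all);

echo $greedy[0], "\n"; // <a>foo</a>bar<a>baz</a>
echo $lazy[0], "\n";   // <a>foo</a>
print_r($all[0]);      // two entries: <a>foo</a> and <a>baz</a>
```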

For a lot of info about regular expressions:
<http://www.regularexpressions.info>

In your case, I'd possibly use:

$regexp = "%<table[^>]*>(.+?)<img%si";

(the /s modifier makes the dot match line breaks; the /i modifier makes the
match case-insensitive, which is possibly the breaking point for your regex).

Whether this works depends heavily on the actual markup, though...
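
A small sketch of what the s, i and x modifiers each do, using a made-up three-line subject with an uppercase tag:

```php
<?php
// Hypothetical subject: uppercase tags, content spread over several lines.
$subject = "<TABLE>\nrow\n</TABLE>";

// No modifiers: the tag case differs, so nothing matches.
$plain = preg_match('%<table>(.+?)</table>%', $subject);

// i only: the case now matches, but "." still refuses to cross "\n".
$caseOnly = preg_match('%<table>(.+?)</table>%i', $subject);

// s and i together: "." also matches newlines, so the content is captured.
$both = preg_match('%<table>(.+?)</table>%si', $subject, $m);

// Spaces in the pattern are literal by default, so this fails...
$withSpaces = preg_match('%<table> (.+?) </table>%si', $subject);

// ...unless x is set, which makes the engine ignore pattern whitespace.
$extended = preg_match('%<table> (.+?) </table>%six', $subject);

var_dump($plain, $caseOnly, $both, $withSpaces, $extended);
```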
--
Rik Wasmus
  #6  
Old March 2nd, 2007, 05:15 AM
BKDotCom
Guest
 
Posts: n/a
Default Re: PHP and regular expressions

On Mar 1, 4:07 pm, "Damo" <cormacdeba...@gmail.com> wrote:
Quote:
>> Second, $document must be a string, not a handle on the file.
>
> $document is a handle on a URL that I was reading in, so yes, it was just a string

Do we have different definitions of what a handle is?

  #7  
Old March 2nd, 2007, 08:25 AM
OmegaJunior
Guest
 
Posts: n/a
Default Re: PHP and regular expressions

On Fri, 02 Mar 2007 06:03:00 +0100, BKDotCom <bkfake-google@yahoo.com> wrote:
Quote:
> On Mar 1, 4:07 pm, "Damo" <cormacdeba...@gmail.com> wrote:
>>> Second, $document must be a string, not a handle on the file.
>>
>> $document is a handle on a URL that I was reading in, so yes, it was
>> just a string
>
> Do we have different definitions of what a handle is?

Yup.
http://nl3.php.net/manual/en/resource.php


--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
  #8  
Old March 2nd, 2007, 05:25 PM
Damo
Guest
 
Posts: n/a
Default Re: PHP and regular expressions

I'm slowly getting the hang of regexes. Instead of starting a new
thread, I thought I'd ask here. I have this regex:

$regexp = "%<li><a href=\"(.+?)\">.+[^>]+>([^<]+)</a>.+<span class=\"date\">(.+?)</span>.+<div class=\"abstract\">(.+?)</div>%s";

It matches the URL of a link, the text of the link, a date and the
abstract of a story.
If I use preg_match_all and there are multiple matches for the pattern in
an HTML string, it only stores the last one in $matches.

This is how I'm printing them:

preg_match_all($regexp, $document, $matches);
$numMatches = count($matches[1]);
for ($i = 0; $i < $numMatches; $i++)
{
    echo $matches[1][$i] . "<br>";
    echo $matches[2][$i] . "<br>";
    echo $matches[3][$i] . "<br>";
    echo $matches[4][$i] . "<br>";
}

I have no idea why it's skipping the earlier matches.
Thanks

  #9  
Old March 2nd, 2007, 06:25 PM
Rik
Guest
 
Posts: n/a
Default Re: PHP and regular expressions

On Fri, 02 Mar 2007 18:16:38 +0100, Damo <cormacdebarra@gmail.com> wrote:
Quote:
> I'm slowly getting the hang of regexes. Instead of starting a new
> thread, I thought I'd ask here. I have this regex:
>
> $regexp = "%<li><a href=\"(.+?)\">.+[^>]+>([^<]+)</a>.+<span class=\"date\">(.+?)</span>.+<div class=\"abstract\">(.+?)</div>%s";

The three bare .+ in that pattern (after the href, after </a> and after </span>) are greedy dots.

Quote:
> It matches the URL of a link, the text of the link, a date and the
> abstract of a story.
> If I use preg_match_all and there are multiple matches for the pattern in
> an HTML string, it only stores the last one in $matches

$regexp = '%<li>\s*
<a[^>]*?href="([^"]+)"[^>]*>
(.*?)
</a>
.*?
<span[^>]*?class="date"[^>]*>
(.+?)
</span>
.+?
<div[^>]*?class="abstract"[^>]*>
(.+?)
</div>%six';
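
A sketch of why the greedy dots collapse everything into one match, and how the lazy pattern behaves instead. The sample markup is an assumption about what the page looks like, and the lazy pattern is the one above with each .*? and .+? written with a single leading dot:

```php
<?php
// Hypothetical markup shaped like the story list described in the question.
$document = <<<HTML
<ul>
<li><a href="/story/1">First headline</a> <span class="date">1 March 2007</span>
<div class="abstract">First abstract.</div></li>
<li><a href="/story/2">Second headline</a> <span class="date">2 March 2007</span>
<div class="abstract">Second abstract.</div></li>
</ul>
HTML;

// Original pattern: each bare .+ is greedy and, with /s, runs across stories.
$greedyPattern = '%<li><a href="(.+?)">.+[^>]+>([^<]+)</a>.+<span class="date">(.+?)</span>'
               . '.+<div class="abstract">(.+?)</div>%s';
$greedyCount = preg_match_all($greedyPattern, $document, $greedy);

// Lazy pattern: /x lets it span several lines, /s lets "." cross newlines.
$lazyPattern = '%<li>\s*
    <a[^>]*?href="([^"]+)"[^>]*>
    (.*?)
    </a>
    .*?
    <span[^>]*?class="date"[^>]*>
    (.+?)
    </span>
    .+?
    <div[^>]*?class="abstract"[^>]*>
    (.+?)
    </div>%six';

// PREG_SET_ORDER groups the captures per story instead of per group.
$lazyCount = preg_match_all($lazyPattern, $document, $stories, PREG_SET_ORDER);

echo "greedy: $greedyCount match, lazy: $lazyCount matches\n";
foreach ($stories as $story) {
    echo $story[1], ' | ', $story[2], ' | ', $story[3], ' | ', $story[4], "\n";
}
```

On this sample the greedy pattern yields a single match that pairs the first URL with the last story's date and abstract, which is consistent with the "only the last one" symptom.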

--
Rik Wasmus
 
