Replacing characters + stripping HTML

I have an HTML parser that reads product pages from various retailers, and I
want to optimize it somewhat:

I download all the HTML before I start parsing, and before parsing I want to:

- Get rid of all HTML parts that I don't need, e.g. <head>, <title>,
<script> etc.
I'm considering using eregi_replace for this. Does anyone have an example of how
to set this up? I tried this with no luck:

$string = eregi_replace("<head>*</head>","", $string);

- As some pages have special characters, I'd like to convert these to normal
characters to make it easier to set up a new parser. Right now I have this (which
I'm sure is not the fastest/best way):

$cont = str_replace(".",".",$cont);
$cont = str_replace(",",",",$cont);
$cont = str_replace("£","£",$cont);
$cont = str_replace("€","?",$cont);
$cont = str_replace("'","'",$cont);
$cont = str_replace("-","-",$cont);
$cont = str_replace("(","(",$cont);
$cont = str_replace(")",")",$cont);
$cont = str_replace("[","[",$cont);
$cont = str_replace("]","]",$cont);

Any ideas to improve on this ?

Thanks.

Martin.
Jul 16 '05 #1
9 Replies


"Martin" <ma****@home.se> writes:
- Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
<javascript> etc.
I'm considering using eregi_replace for this. Anyone have an example of how
to set this up ? I tried this with no luck:

$string = eregi_replace("<head>*</head>","", $string);
Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".

You want "<head>.*</head>"
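
For anyone reading this on a current PHP version: eregi_replace was removed in PHP 7, so the same idea has to be written with preg_replace. A minimal sketch, assuming the page is already in $string (the sample HTML is made up). It uses the lazy .*? rather than .*, so that a pattern such as the <script> one stops at the nearest closing tag instead of the last one on the page:

```php
<?php
// Hypothetical sample page; the real input would be the downloaded HTML.
$string = "<html><head>\n<title>Shop</title>\n<script>var x = 1;</script>\n</head>"
        . "<body><p>Kettle</p><script>track();</script></body></html>";

// Flags: i = case-insensitive, s = let "." match newlines as well.
// ".*?" is lazy, so each match ends at the nearest closing tag.
$string = preg_replace('~<head\b[^>]*>.*?</head>~is', '', $string);
$string = preg_replace('~<script\b[^>]*>.*?</script>~is', '', $string);

echo $string, "\n";
```

Regex-based stripping like this is fragile on messy markup; it is only meant to show the pattern syntax.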
- As some pages have special characters, I'd like to redo these to normal
characters for ease when setting up new parser. Right now I have this (which
I'm sure is not the fastest/best way):


PHP 4.3.0 and later
$cont = html_entity_decode($cont);
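
A small sketch of that call with explicit flags, assuming the page uses a mix of numeric and named entities (the sample string is made up, and ENT_HTML5 needs PHP 5.4 or later). ENT_QUOTES is included so that &#39; is decoded as well, and 'UTF-8' sets the encoding of the result:

```php
<?php
// Made-up input containing numeric and named entities.
$cont = 'Price&#58; &pound;19&#46;99 &#40;or &euro;23&#41; &#45; Kettle &#39;Deluxe&#39;';

// Decode every entity in one pass; the result is a UTF-8 string.
$cont = html_entity_decode($cont, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $cont, "\n";
```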

--
Chris
Jul 16 '05 #2


"Chris Morris" <c.********@durham.ac.uk> wrote in message
news:87************@dinopsis.dur.ac.uk...
"Martin" <ma****@home.se> writes:
- Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
<javascript> etc.
I'm considering using eregi_replace for this. Anyone have an example of how to set this up ? I tried this with no luck:

$string = eregi_replace("<head>*</head>","", $string);
Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".

You want "<head>.*</head>"


Thanks - oh, what a difference a dot can make :).
- As some pages have special characters, I'd like to redo these to normal characters for ease when setting up new parser. Right now I have this (which I'm sure is not the fastest/best way):


PHP 4.3.0 and later
$cont = html_entity_decode($cont);


Forgot to mention, but had tried that before - that does not have the
desired effect, I'm afraid. It doesn't convert the characters I convert
"manually"...
--
Chris

Jul 16 '05 #3
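
If html_entity_decode leaves some entities untouched, one fallback is to keep the manual list but pass it to a single strtr() call instead of chaining str_replace. The search strings in the list quoted above look like they lost their entity form when the post was rendered, so the numeric entities used as keys below are an assumption, not the original values:

```php
<?php
// Assumed entity spellings; adjust the keys to whatever the pages really contain.
$map = [
    '&#46;'   => '.',
    '&#44;'   => ',',
    '&#163;'  => '£',
    '&#8364;' => '€',
    '&#39;'   => "'",
    '&#45;'   => '-',
    '&#40;'   => '(',
    '&#41;'   => ')',
    '&#91;'   => '[',
    '&#93;'   => ']',
];

// Made-up sample input.
$cont = 'Kettle &#40;2 l&#41; &#45; &#163;19&#46;99';

// strtr() with an array applies all replacements in a single pass over the string.
$cont = strtr($cont, $map);

echo $cont, "\n";
```

Keeping the pairs in one array also makes it easy to add a new entity when a retailer's page turns up one that is not covered yet.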


"Martin" <ma****@home.se> wrote in message
news:TH****************@news1.bredband.com...
I have a HTML parser that reads product pages from various retailers - and I want to optimize it somewhat:

I download all HTML before I start the parsing - and to do that I want to:

- Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
<javascript> etc.
I'm considering using eregi_replace for this. Anyone have an example of how to set this up ? I tried this with no luck:

$string = eregi_replace("<head>*</head>","", $string);

- As some pages have special characters, I'd like to redo these to normal
characters for ease when setting up new parser. Right now I have this (which I'm sure is not the fastest/best way):

$cont = str_replace(".",".",$cont);
$cont = str_replace(",",",",$cont);
$cont = str_replace("£","£",$cont);
$cont = str_replace("€","?",$cont);
$cont = str_replace("'","'",$cont);
$cont = str_replace("-","-",$cont);
$cont = str_replace("(","(",$cont);
$cont = str_replace(")",")",$cont);
$cont = str_replace("[","[",$cont);
$cont = str_replace("]","]",$cont);

Any ideas to improve on this ?

Thanks.

Martin.


Maybe you're going about this the wrong way... Would it not be better to
grab the data you want, rather than strip the data you don't want?
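
As a rough illustration of the "grab what you want" approach, PHP's DOM extension can load the page and pull out single elements with XPath. The markup, class names and variable names below are invented for the example:

```php
<?php
// Made-up product markup; element and class names are assumptions.
$html = '<html><head><title>Shop</title></head><body>'
      . '<div class="product"><span class="name">Kettle</span>'
      . '<span class="price">19.99</span></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; collect warnings silently
$doc->loadHTML($html);
libxml_clear_errors();

// Query only the nodes of interest instead of deleting everything else.
$xpath = new DOMXPath($doc);
$name  = $xpath->query('//span[@class="name"]')->item(0)->textContent;
$price = $xpath->query('//span[@class="price"]')->item(0)->textContent;

echo $name, ': ', $price, "\n";
```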
Jul 16 '05 #4

I have a HTML parser that reads product pages from various retailers - and I want to optimize it somewhat:

I download all HTML before I start the parsing - and to do that I want to:
- Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
<javascript> etc.
I'm considering using eregi_replace for this. Anyone have an example of how
to set this up ? I tried this with no luck:

$string = eregi_replace("<head>*</head>","", $string);

- As some pages have special characters, I'd like to redo these to normal characters for ease when setting up new parser. Right now I have this (which I'm sure is not the fastest/best way):

$cont = str_replace(".",".",$cont);
$cont = str_replace(",",",",$cont);
$cont = str_replace("£","£",$cont);
$cont = str_replace("€","?",$cont);
$cont = str_replace("'","'",$cont);
$cont = str_replace("-","-",$cont);
$cont = str_replace("(","(",$cont);
$cont = str_replace(")",")",$cont);
$cont = str_replace("[","[",$cont);
$cont = str_replace("]","]",$cont);

Any ideas to improve on this ?

Thanks.

Martin.


Maybe you're going about this the wrong way... Would it not be better to grab the data you want, rather than strip the data you don't want?


Sure, but the problem is that I need most of the HTML and only want to drop a
few very specific parts. That is why I grab the entire page and then remove
what I don't need.
Jul 16 '05 #5

Chris Morris <c.********@durham.ac.uk> wrote in message news:<87************@dinopsis.dur.ac.uk>...
"Martin" <ma****@home.se> writes:
- Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
<javascript> etc.
I'm considering using eregi_replace for this. Anyone have an example of how
to set this up ? I tried this with no luck:

$string = eregi_replace("<head>*</head>","", $string);


Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".

You want "<head>.*</head>"


Can't get to www.php.net for some reason, so I can't formulate my
question as nicely as I'd like to. But:

On another line of thought, I would get all class names in a style
sheet like this:

"\..*{"

Would this be the correct search string for something like this:

.articleHeadline{

I'm trying to say, "look for a pattern that starts with a dot and then
goes through any number of characters and numbers until you get to a
curly opening bracket."
Jul 16 '05 #6

In article <da**************************@posting.google.com >,
lk******@geocities.com (lawrence) wrote:
I would get all class names in a style
sheet like this:

"\..*{"

Would this be the correct search string for something like this:

.articleHeadline{

I'm trying to say, "look for a pattern that starts with a dot and then
goes through any number of characters and numbers until you get to a
curly opening bracket."


Braces are special characters in regular expressions, so escape that left
curly brace too. Also, I'm not sure what the Regex library's
(ereg_*/eregi_*) rule is on greediness, but in PCRE (preg_*) at least
you'd want to specify that the zero-or-more quantifier is ungreedy.

"\..*?\{"

-or-

"\.[^{]*\{"

--
CC
Jul 16 '05 #7
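
To see how the patterns from this reply behave, here is a small sketch using preg_match_all on a made-up two-rule style sheet. The last pattern, with a capture group for just the class name, is an extra assumption about what is wanted and does not come from the posts above:

```php
<?php
// Made-up style sheet with two class rules.
$css = ".articleHeadline{ color: red; }\n.byline { color: gray; }\n";

// Negated character class: never runs past the first "{".
preg_match_all('/\.[^{]*\{/', $css, $negated);

// Lazy quantifier: stops at the nearest "{".
preg_match_all('/\..*?\{/', $css, $lazy);

// Greedy quantifier with the s flag: runs on to the LAST "{" in the input.
preg_match_all('/\..*\{/s', $css, $greedy);

// Capture only the class name (assumes simple names made of letters, digits, "_" and "-").
preg_match_all('/\.([A-Za-z_][\w-]*)\s*\{/', $css, $names);

print_r($negated[0]);
print_r($lazy[0]);
print_r($greedy[0]);
print_r($names[1]);
```

The negated character class is usually the safer of the two suggestions, since it cannot overrun a brace whichever flags are set.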

CC Zona <cc****@nospam.invalid> wrote in message news:<cc**************************@netnews.attbi.c om>...
In article <da**************************@posting.google.com >,
lk******@geocities.com (lawrence) wrote:
I would get all class names in a style
sheet like this:

"\..*{"

Would this be the correct search string for something like this:

.articleHeadline{

I'm trying to say, "look for a pattern that starts with a dot and then
goes through any number of characters and numbers until you get to a
curly opening bracket."


Braces are special characters in regular expressions, so escape that left
curly brace too. Also, I'm not sure what the Regex library's
(ereg_*/eregi_*) rule is on greediness, but in PCRE (preg_*) at least
you'd want to specify that the zero-or-more quantifier is ungreedy.

"\..*?\{"

-or-

And if I want to make some changes to all of my PHP files, and I need
to insert something into the first line of all my functions, I suppose
the pattern to select the line where the function is declared would be something
like this?

"function .*(.*).*\{"
Jul 16 '05 #8

In article <da**************************@posting.google.com >,
lk******@geocities.com (lawrence) wrote:
CC Zona <cc****@nospam.invalid> wrote in message
news:<cc**************************@netnews.attbi.c om>...
In article <da**************************@posting.google.com >,
lk******@geocities.com (lawrence) wrote:

Braces are special characters in regular expressions, so escape that left
curly brace too. Also, I'm not sure what the Regex library's
(ereg_*/eregi_*) rule is on greediness, but in PCRE (preg_*) at least
you'd want to specify that the zero-or-more quantifier is ungreedy.

"\..*?\{"
<snip> and if I want to make some changes to all of my PHP files,and I need
to insert something into the first line of all my functions, I suppose
to select the line where the function is declared would be something
like this? :

"function .*(.*).*\{"


No, definitely not like that. You're in danger of matching far too much,
and it will surely match something quite different from what you intended.
That says: "match string 'function ', then match zero or more of anything,
then match and capture zero or more of anything, then match zero or more of
anything until you smack up against the final match character which is a
left curly brace (likely the last one occurring in the haystack)".
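
A sketch of that overreach, using preg_match on a made-up source string (the s flag lets . cross line breaks), followed by one possible tighter pattern. The tighter pattern and the inserted comment line are illustrative assumptions; it only handles simple declarations whose parameter lists contain no nested parentheses:

```php
<?php
// Made-up source text containing two function declarations.
$src = <<<'PHP'
function add($a, $b) {
    return $a + $b;
}
function sub($a, $b) {
    if ($a > $b) { return $a - $b; }
    return 0;
}
PHP;

// The loose pattern from the question: one match that runs from the first
// "function " all the way to the last "{" in the text.
preg_match('/function .*(.*).*\{/s', $src, $loose);

// A tighter pattern: keyword, name, a parameter list without ")" inside it,
// then the opening brace. The m flag makes "^" match at every line start.
$tight = '/^[ \t]*function\s+\w+\s*\([^)]*\)\s*\{/m';
preg_match_all($tight, $src, $decls);

// Insert a line directly after each declaration ("$0" is the whole match).
$patched = preg_replace($tight, '$0' . "\n    // inserted line", $src);

echo $loose[0], "\n---\n";
print_r($decls[0]);
echo $patched, "\n";
```

Running a replacement like this on a copy of the files first is a cheap way to check what it really matches.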

Using regular expressions without understanding the syntax can cause you a lot
of grief. For your sake, I urge you to spend some time learning POSIX
regular expression syntax (which is used by ereg/eregi) now before you
accidentally overwrite something that's important to you. I'm suggesting
POSIX, even though I personally prefer PCRE, because most people seem to
find POSIX easier to start with, rather than diving headfirst into the
richer (though somewhat more complex) possibilities of PCRE. O'Reilly has
a (*the*) wonderful guide to regex, written by Jeffrey Friedl. Don't be
intimidated by its size: he's covering everything from the very basic to
the highly sophisticated, and doing it for many more flavors of regex than
are available in PHP. You'll learn a lot just from Friedl's first few
chapters, even if you never read farther than that.

Oh yeah, and ALWAYS test your regular expressions thoroughly before
trusting them to production use.

Good luck!

--
CC
Jul 16 '05 #9

CC Zona wrote:
Using regular expressions without understanding the syntax can cause you a lot
of grief. For your sake, I urge you to spend some time learning POSIX
regular expression syntax (which is used by ereg/eregi) now before you
accidentally overwrite something that's important to you. I'm suggesting
POSIX, even though I personally prefer PCRE, because most people seem to
find POSIX easier to start with, rather than diving headfirst into the
richer (though somewhat more complex) possibilities of PCRE. O'Reilly has
a (*the*) wonderful guide to regex, written by Jeffrey Friedl. Don't be
intimidated by its size: he's covering everything from the very basic to
the highly sophisticated, and doing it for many more flavors of regex than
are available in PHP. You'll learn a lot just from Friedl's first few
chapters, even if you never read farther than that.


I'm reading that book right now, and I highly
recommend it to anyone wanting to master regular
expressions. It's superb. A definite classic :8].

Jul 16 '05 #10
