Replacing characters + stripping HTML

I have an HTML parser that reads product pages from various retailers - and I
want to optimize it somewhat:

I download all HTML before I start the parsing - and to do that I want to:

- Get rid of all HTML parts that I don't need, e.g. <head>, <title>,
<javascript> etc.
I'm considering using eregi_replace for this. Does anyone have an example of how
to set this up? I tried this with no luck:

$string = eregi_replace("<head>*</head>","", $string);

- As some pages have special characters, I'd like to convert these to normal
characters to make setting up a new parser easier. Right now I have this (which
I'm sure is not the fastest/best way):

$cont = str_replace(".",".",$cont);
$cont = str_replace(",",",",$cont);
$cont = str_replace("£","£",$cont);
$cont = str_replace("€","?",$cont);
$cont = str_replace("'","'",$cont);
$cont = str_replace("-","-",$cont);
$cont = str_replace("(","(",$cont);
$cont = str_replace(")",")",$cont);
$cont = str_replace("[","[",$cont);
$cont = str_replace("]","]",$cont);

Any ideas to improve on this ?

Thanks.

Martin.
Jul 16 '05 #1
"Martin" <ma****@home.se> writes:
> - Get rid of all HTML parts that I don't need, e.g. <head>, <title>,
> <javascript> etc.
> I'm considering using eregi_replace for this. Does anyone have an example of how
> to set this up? I tried this with no luck:
>
> $string = eregi_replace("<head>*</head>","", $string);

Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".

You want "<head>.*</head>"
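
A minimal sketch of the same idea using the PCRE functions, since the ereg_* family was deprecated in PHP 5.3 and removed in PHP 7. The helper name strip_unwanted_blocks() and the sample page are made up for illustration, and the pattern assumes the tags are reasonably well formed; it is not a full HTML parser.

```php
<?php
// Hypothetical helper: removes whole <head>, <script> and <style> elements.
function strip_unwanted_blocks(string $html): string
{
    // i = case-insensitive, s = let "." match newlines as well.
    // \b[^>]* allows attributes on the opening tag.
    // ".*?" is lazy, so each match stops at the first matching closing tag.
    // \1 refers back to the tag name captured by the first group.
    $pattern = '#<(head|script|style)\b[^>]*>.*?</\1\s*>#is';
    $result = preg_replace($pattern, '', $html);

    // preg_replace() returns null on failure; fall back to the input then.
    return $result ?? $html;
}

// Assumed sample input.
$page = '<html><head><title>Shop</title></head><body>'
      . '<script type="text/javascript">var x = 1;</script>'
      . '<p>Price: 10</p></body></html>';

echo strip_unwanted_blocks($page), "\n";
// prints: <html><body><p>Price: 10</p></body></html>
```

The lazy quantifier matters once a page contains more than one <script> block: a greedy ".*" would swallow everything between the first opening tag and the last closing tag.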
> - As some pages have special characters, I'd like to convert these to normal
> characters to make setting up a new parser easier. Right now I have this (which
> I'm sure is not the fastest/best way):

PHP 4.3.0 and later
$cont = html_entity_decode($cont);
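
On current PHP versions the flags and the target charset can be passed explicitly, which makes the call decode named and numeric character references alike. The sample string below is an assumption made up for illustration, as is the choice of UTF-8 as the output encoding.

```php
<?php
// Assumed sample input: a mix of named and numeric character references.
$cont = 'Price: &pound;10 &amp; &euro;12 &#40;approx&#46;&#41; it&#39;s';

// ENT_QUOTES also decodes single-quote references such as &#39;.
// ENT_HTML5 selects the HTML5 entity table.
// 'UTF-8' is the (assumed) encoding of the decoded result.
$decoded = html_entity_decode($cont, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n";
// prints: Price: £10 & €12 (approx.) it's
```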

--
Chris
Jul 16 '05 #2

"Chris Morris" <c.********@durham.ac.uk> wrote in message
news:87************@dinopsis.dur.ac.uk...
"Martin" <ma****@home.se> writes:
> > - Get rid of all HTML parts that I don't need, e.g. <head>, <title>,
> > <javascript> etc.
> > <snip>
> > $string = eregi_replace("<head>*</head>","", $string);
>
> Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".
>
> You want "<head>.*</head>"

Thanks - oh, what a difference a dot can make :).
> > - As some pages have special characters, I'd like to convert these to normal
> > characters <snip>
>
> PHP 4.3.0 and later
> $cont = html_entity_decode($cont);


Forgot to mention, but had tried that before - that does not have the
desired effect, I'm afraid. It doesn't convert the characters I convert
"manually"...
Jul 16 '05 #3
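
If the decoder in use leaves some references untouched, one fallback is a single lookup table passed to strtr() instead of a chain of str_replace() calls. This is only a sketch: the left-hand strings in $map are assumed examples of numeric character references, not the original poster's actual list, and normalize_chars() is a hypothetical helper name.

```php
<?php
// Hypothetical lookup table; the keys are assumed examples only.
$map = [
    '&#46;'   => '.',
    '&#44;'   => ',',
    '&#163;'  => "\u{A3}",    // pound sign
    '&#8364;' => "\u{20AC}",  // euro sign
    '&#39;'   => "'",
];

// strtr() with an array performs all replacements in one pass and never
// re-scans text that it has already replaced.
function normalize_chars(string $text, array $map): string
{
    return strtr($text, $map);
}

echo normalize_chars('&#163;5&#44;00 &#46;&#46;&#46;', $map), "\n";
// prints: £5,00 ...
```

Keeping the pairs in one array also makes it easy to add a new character when a new retailer's pages turn up something unexpected.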

"Martin" <ma****@home.se> wrote in message
news:TH****************@news1.bredband.com...
> I have an HTML parser that reads product pages from various retailers - and I
> want to optimize it somewhat:
> <snip>

Maybe you're going about this the wrong way... Would it not be better to
grab the data you want, as opposed to stripping the data you don't want?
Jul 16 '05 #4
> > I have a HTML parser that reads product pages from various retailers -
> > <snip>
>
> Maybe you're going about this the wrong way... Would it not be better
> to grab the data you want, as opposed to stripping the data you don't want?

Sure - the problem is that I need most of the HTML, apart from some very
specific parts. That is why I grab the entire page and then remove what I
don't need.
Jul 16 '05 #5
Chris Morris <c.********@durham.ac.uk> wrote in message news:<87************@dinopsis.dur.ac.uk>...
"Martin" <ma****@home.se> writes:
> > - Get rid of all HTML parts that I don't need, e.g. <head>, <title>,
> > <javascript> etc.
> > <snip>
> > $string = eregi_replace("<head>*</head>","", $string);
>
> Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".
>
> You want "<head>.*</head>"

I can't get to www.php.net for some reason, so I can't formulate my
question as nicely as I'd like to ask it. But:

On another line of thought, I would get all class names in a style
sheet like this:

"\..*{"

Would this be the correct search string for something like this:

.articleHeadline{

I'm trying to say, "look for a pattern that starts with a dot and then
goes through any number of characters and numbers until you get to a
curly opening bracket."
Jul 16 '05 #6
In article <da**************************@posting.google.com >,
lk******@geocities.com (lawrence) wrote:
> I would get all class names in a style
> sheet like this:
>
> "\..*{"
>
> Would this be the correct search string for something like this:
>
> .articleHeadline{
>
> I'm trying to say, "look for a pattern that starts with a dot and then
> goes through any number of characters and numbers until you get to a
> curly opening bracket."

Braces are special characters in regular expressions, so escape that left
curly brace too. Also, I'm not sure what the Regex library's
(ereg_*/eregi_*) rule is on greediness, but in PCRE (preg_*) at least
you'd want to specify that the zero-or-more quantifier is ungreedy.

"\..*?\{"

-or-

"\.[^{]*\{"

--
CC
Jul 16 '05 #7
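
The second pattern suggested above can be tried with preg_match_all() by wrapping the class name in a capture group. This is a sketch under assumptions: the sample stylesheet is made up, and a dot inside a declaration value (for example "1.5em") would also be picked up, so real stylesheets need a stricter pattern or a proper CSS parser.

```php
<?php
// Assumed sample stylesheet for illustration.
$css = ".articleHeadline{ font-weight: bold; }\n.byline { color: gray; }\n";

// \.         a literal dot
// ([^{]*?)   lazily capture anything that is not an opening brace
// \s*\{      optional whitespace, then a literal opening brace
preg_match_all('/\.([^{]*?)\s*\{/', $css, $matches);

// $matches[1] holds the text captured by the first group of every match.
print_r($matches[1]);
```

With the sample input, $matches[1] contains "articleHeadline" and "byline".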
CC Zona <cc****@nospam.invalid> wrote in message news:<cc**************************@netnews.attbi.com>...
> In article <da**************************@posting.google.com>,
> lk******@geocities.com (lawrence) wrote:
> > "\..*{"
> <snip>
> Braces are special characters in regular expressions, so escape that left
> curly brace too.
> <snip>
> "\..*?\{"

And if I want to make some changes to all of my PHP files, and I need
to insert something into the first line of all my functions, I suppose
the pattern to select the line where the function is declared would be
something like this?

"function .*(.*).*\{"
Jul 16 '05 #8
In article <da**************************@posting.google.com >,
lk******@geocities.com (lawrence) wrote:
> CC Zona <cc****@nospam.invalid> wrote in message
> news:<cc**************************@netnews.attbi.com>...
> > "\..*?\{"
>
> <snip> and if I want to make some changes to all of my PHP files, and I need
> to insert something into the first line of all my functions, I suppose
> to select the line where the function is declared would be something
> like this? :
>
> "function .*(.*).*\{"

No, definitely not like that. You're in danger of matching far too much
and it will surely match something quite different than what you intended.
That says: "match string 'function ', then match zero or more of anything,
then match and capture zero or more of anything, then match zero or more of
anything until you smack up against the final match character which is a
left curly brace (likely the last one occurring in the haystack)".

Using regular expressions without understanding the syntax can cause you a lot
of grief. For your sake, I urge you to spend some time learning POSIX
regular expression syntax (which is used by ereg/eregi) now before you
accidentally overwrite something that's important to you. I'm suggesting
POSIX, even though I personally prefer PCRE, because most people seem to
find POSIX easier to start with, rather than diving headfirst into the
richer (though somewhat more complex) possibilities of PCRE. O'Reilly has
a (*the*) wonderful guide to regex, written by Jeffrey Friedl. Don't be
intimidated by its size: he's covering everything from the very basic to
the highly sophisticated, and doing it for many more flavors of regex than
are available in PHP. You'll learn a lot just from Friedl's first few
chapters, even if you never read farther than that.

Oh yeah, and ALWAYS test your regular expressions thoroughly before
trusting them to production use.
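
In that spirit, here is a sketch of a tighter PCRE pattern tried against sample input before it is let loose on real files. The sample source and the inserted marker are made up for illustration, and the pattern is an assumption about what "the line where the function is declared" should match: it will not cope with default values that contain a closing parenthesis, such as array().

```php
<?php
// Assumed sample source text to test against.
$source = <<<'PHP'
function add($a, $b) {
    return $a + $b;
}
function greet($name)
{
    echo $name;
}
PHP;

// function\s+   the keyword followed by whitespace
// (\w+)         capture the function name
// \s*\(         a literal opening parenthesis (escaped, unlike the attempt above)
// ([^)]*)       capture the parameter list: anything up to the next ")"
// \)\s*\{       a literal ")" and the opening brace, possibly on the next line
$pattern = '/function\s+(\w+)\s*\(([^)]*)\)\s*\{/';

preg_match_all($pattern, $source, $m);
print_r($m[1]); // the captured function names

// $0 in the replacement stands for the whole match, so the new line is
// appended directly after each opening brace.
$patched = preg_replace($pattern, '$0' . "\n    /* inserted */", $source);
echo $patched, "\n";
```

Checking both the list of captured names and the patched output on a small sample like this makes it much harder to overwrite something by accident.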

Good luck!

--
CC
Jul 16 '05 #9
CC Zona wrote:
> Using regular expressions without understanding the syntax can cause you a lot
> of grief. For your sake, I urge you to spend some time learning POSIX
> regular expression syntax (which is used by ereg/eregi) now before you
> accidentally overwrite something that's important to you.
> <snip>
> O'Reilly has
> a (*the*) wonderful guide to regex, written by Jeffrey Friedl. Don't be
> intimidated by its size: he's covering everything from the very basic to
> the highly sophisticated, and doing it for many more flavors of regex than
> are available in PHP. You'll learn a lot just from Friedl's first few
> chapters, even if you never read farther than that.

I'm reading that book right now, and I highly
recommend it to anyone wanting to master regular
expressions. It's superb. A definite classic :8].

Jul 16 '05 #10
