Replacing characters + stripping HTML

Martin

I have an HTML parser that reads product pages from various retailers - and I
want to optimize it somewhat:

I download all the HTML before I start parsing - and before parsing I want to:

- Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
<javascript> etc.
I'm considering using eregi_replace for this. Anyone have an example of how
to set this up ? I tried this with no luck:

$string = eregi_replace("<head>*</head>","", $string);

- As some pages have special characters, I'd like to redo these to normal
characters for ease when setting up new parser. Right now I have this (which
I'm sure is not the fastest/best way):

$cont = str_replace(".",".",$cont);
$cont = str_replace(",",",",$cont);
$cont = str_replace("£","£",$cont);
$cont = str_replace("€","?",$cont);
$cont = str_replace("'","'",$cont);
$cont = str_replace("-","-",$cont);
$cont = str_replace("(","(",$cont);
$cont = str_replace(")",")",$cont);
$cont = str_replace("[","[",$cont);
$cont = str_replace("]","]",$cont);

Any ideas to improve on this ?

Thanks.

Martin.

Jul 16 '05 #1

Chris Morris

"Martin" <ma****@home.se> writes:

> - Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
> <javascript> etc.
> I'm considering using eregi_replace for this. Anyone have an example of how
> to set this up ? I tried this with no luck:
>
> $string = eregi_replace("<head>*</head>","", $string);

Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".

You want "<head>.*</head>"

> - As some pages have special characters, I'd like to redo these to normal
> characters for ease when setting up new parser. Right now I have this (which
> I'm sure is not the fastest/best way):

PHP 4.3.0 and later:
$cont = html_entity_decode($cont);

--
Chris

Jul 16 '05 #2
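
Editor's sketch of the fix from #2, for readers on current PHP: the ereg_* functions were removed in PHP 7, so this uses preg_replace instead. The sample page in $string and the choice of tags to drop are assumptions made for illustration, not something from the thread. The "s" flag lets "." match newlines and ".*?" stops at the first closing tag.

```php
<?php
// Hypothetical sample input; a real page would be the downloaded HTML.
$string = "<html><head>\n<title>Shop</title>\n</head><body>"
        . "<script type=\"text/javascript\">var x = 1;</script>"
        . "<p>Price: 10</p></body></html>";

// One pattern per block to drop.
// i = case-insensitive, s = "." also matches newlines,
// .*? = ungreedy, so each match ends at the nearest closing tag.
$patterns = array(
    '#<head\b[^>]*>.*?</head>#is',
    '#<script\b[^>]*>.*?</script>#is',
    '#<style\b[^>]*>.*?</style>#is',
);

// preg_replace accepts an array of patterns and applies them in order.
$clean = preg_replace($patterns, '', $string);

echo $clean, "\n";
```

The \b after the tag name keeps "<head" from also matching "<header>". Regular expressions remain a rough tool for HTML, so this is only reasonable for simple, non-nested blocks like these.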

Martin

"Chris Morris" <c.********@durham.ac.uk> wrote in message
news:87************@dinopsis.dur.ac.uk...

> "Martin" <ma****@home.se> writes:
> > - Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
> > <javascript> etc.
> > I'm considering using eregi_replace for this. Anyone have an example of how
> > to set this up ? I tried this with no luck:
> >
> > $string = eregi_replace("<head>*</head>","", $string);
>
> Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".
>
> You want "<head>.*</head>"

Thanks - oh, what a difference a dot can make :).

> > - As some pages have special characters, I'd like to redo these to normal
> > characters for ease when setting up new parser. Right now I have this
> > (which I'm sure is not the fastest/best way):
>
> PHP 4.3.0 and later
> $cont = html_entity_decode($cont);

Forgot to mention, but I had tried that before - it does not have the
desired effect, I'm afraid. It doesn't convert the characters I convert
"manually"...

> --
> Chris

Jul 16 '05 #3
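
Editor's sketch for the character problem in #3. It assumes the pages contain numeric character references such as &#46; (the thread does not show the exact input, so the sample string and the table entries below are hypothetical). In current PHP versions html_entity_decode handles numeric as well as named references when given a character set; if a hand-written table is still wanted, strtr applies it in a single pass instead of one str_replace call per character.

```php
<?php
// Hypothetical sample text; the numeric references are an assumption
// about what the retailer pages contain.
$cont = 'Price&#58; &pound;10&#46;50 &#40;inc&#46; VAT&#41;';

// Option 1: let PHP decode named and numeric references in one call.
$decoded = html_entity_decode($cont, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $decoded, "\n";

// Option 2: keep a hand-written table, but apply it in a single pass
// with strtr() rather than calling str_replace() once per character.
$map = array(
    '&#46;' => '.',
    '&#44;' => ',',
    '&#40;' => '(',
    '&#41;' => ')',
);
$manual = strtr($cont, $map);
echo $manual, "\n";
```

Option 2 only touches what is listed in $map, which can be useful when some references should stay encoded.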

Randell D.

"Martin" <ma****@home.se> wrote in message
news:TH****************@news1.bredband.com...

> I have an HTML parser that reads product pages from various retailers - and I
> want to optimize it somewhat:
>
> I download all the HTML before I start parsing - and before parsing I want to:
>
> - Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
> <javascript> etc.
> I'm considering using eregi_replace for this. Anyone have an example of how
> to set this up ? I tried this with no luck:
>
> $string = eregi_replace("<head>*</head>","", $string);
>
> - As some pages have special characters, I'd like to redo these to normal
> characters for ease when setting up new parser. Right now I have this (which
> I'm sure is not the fastest/best way):
>
> $cont = str_replace(".",".",$cont);
> $cont = str_replace(",",",",$cont);
> $cont = str_replace("£","£",$cont);
> $cont = str_replace("€","?",$cont);
> $cont = str_replace("'","'",$cont);
> $cont = str_replace("-","-",$cont);
> $cont = str_replace("(","(",$cont);
> $cont = str_replace(")",")",$cont);
> $cont = str_replace("[","[",$cont);
> $cont = str_replace("]","]",$cont);
>
> Any ideas to improve on this ?
>
> Thanks.
>
> Martin.

Maybe you're going about this the wrong way... Would it not be better to
grab the data you want, as opposed to stripping the data you don't want?

Jul 16 '05 #4

Martin

> Maybe you're going about this the wrong way... Would it not be better to
> grab the data you want, as opposed to stripping the data you don't want?

Sure - the problem is that I need most of the HTML, but there are some very
specific parts I don't want. So I grab the entire page and then remove what
I don't need.

Jul 16 '05 #5
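
Editor's sketch of a middle ground between #4 and #5: parse the page into a tree, remove the few elements that are not wanted, and keep the rest. This uses PHP's DOM extension (DOMDocument), which nobody in the thread mentions, so treat it as an alternative technique rather than anyone's suggestion here. The sample markup and the tag list are hypothetical.

```php
<?php
// Hypothetical sample input; a real page would be the downloaded HTML.
$html = '<html><head><title>Shop</title></head><body>'
      . '<script>var x = 1;</script><p>Price: 10</p></body></html>';

// Retailer pages are rarely valid HTML, so collect parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

foreach (array('head', 'script', 'style') as $tag) {
    // The node list is live, so copy it to an array before removing nodes.
    $nodes = iterator_to_array($doc->getElementsByTagName($tag));
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
}

// Serialize only the <body> subtree that is left.
$body = $doc->getElementsByTagName('body')->item(0);
$out = $doc->saveHTML($body);
echo $out, "\n";
```

Unlike a regular expression, the parser copes with attributes, odd whitespace, and nested tags, at the cost of re-serializing the markup.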

lawrence

Chris Morris <c.********@durham.ac.uk> wrote in message news:<87************@dinopsis.dur.ac.uk>...

> "Martin" <ma****@home.se> writes:
> > - Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
> > <javascript> etc.
> > I'm considering using eregi_replace for this. Anyone have an example of how
> > to set this up ? I tried this with no luck:
> >
> > $string = eregi_replace("<head>*</head>","", $string);
>
> Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".
>
> You want "<head>.*</head>"

I can't get to www.php.net for some reason, so I can't formulate my
question as nicely as I'd like. But:

On another line of thought, I would get all class names in a style
sheet like this:

"\..*{"

Would this be the correct search string for something like this:

.articleHeadline{

I'm trying to say, "look for a pattern that starts with a dot and then
goes through any number of characters and numbers until you get to a
curly opening bracket."

Jul 16 '05 #6

CC Zona

In article <da**************************@posting.google.com >,
lk******@geocities.com (lawrence) wrote:

> I would get all class names in a style
> sheet like this:
>
> "\..*{"
>
> Would this be the correct search string for something like this:
>
> .articleHeadline{
>
> I'm trying to say, "look for a pattern that starts with a dot and then
> goes through any number of characters and numbers until you get to a
> curly opening bracket."

Braces are special characters in regular expressions, so escape that left
curly brace too. Also, I'm not sure what the Regex library's
(ereg_*/eregi_*) rule is on greediness, but in PCRE (preg_*) at least
you'd want to specify that the zero-or-more quantifier is ungreedy.

"\..*?\{"

-or-

"\.[^{]*\{"

--
CC

Jul 16 '05 #7
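
Editor's sketch of the greediness point in #7, using the PCRE (preg_*) functions. The style sheet text is a made-up sample, and the stricter class-name pattern at the end is an assumption about what counts as a class name, not something from the thread.

```php
<?php
// Hypothetical one-line style sheet with two rules.
$line = '.a{} .b{}';

// Greedy: ".*" runs to the end of the line and backs up to the LAST "{".
preg_match('/\..*\{/', $line, $greedy);

// Negated class: "[^{]*" cannot cross a "{", so the match ends at the first one.
preg_match('/\.[^{]*\{/', $line, $tight);

echo $greedy[0], "\n";
echo $tight[0], "\n";

// Collecting class names from a slightly larger hypothetical sample.
// Requiring a letter or underscore after the dot skips values like "0.9em".
$css = ".articleHeadline{ color: red; }\n.byline { font-size: 0.9em; }\n";
preg_match_all('/\.([A-Za-z_][\w-]*)\s*\{/', $css, $matches);
print_r($matches[1]);
```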

lawrence

CC Zona <cc****@nospam.invalid> wrote in message news:<cc**************************@netnews.attbi.c om>...

> In article <da**************************@posting.google.com >,
> lk******@geocities.com (lawrence) wrote:
> > I would get all class names in a style
> > sheet like this:
> >
> > "\..*{"
> >
> > Would this be the correct search string for something like this:
> >
> > .articleHeadline{
> >
> > I'm trying to say, "look for a pattern that starts with a dot and then
> > goes through any number of characters and numbers until you get to a
> > curly opening bracket."
>
> Braces are special characters in regular expressions, so escape that left
> curly brace too. Also, I'm not sure what the Regex library's
> (ereg_*/eregi_*) rule is on greediness, but in PCRE (preg_*) at least
> you'd want to specify that the zero-or-more quantifier is ungreedy.
>
> "\..*?\{"
>
> -or-

And if I want to make some changes to all of my PHP files, and I need
to insert something into the first line of all my functions, I suppose
that selecting the line where the function is declared would be something
like this?

"function .*(.*).*\{"

Jul 16 '05 #8

CC Zona

In article <da**************************@posting.google.com >,
lk******@geocities.com (lawrence) wrote:

> CC Zona <cc****@nospam.invalid> wrote in message
> news:<cc**************************@netnews.attbi.c om>...
> > In article <da**************************@posting.google.com >,
> > lk******@geocities.com (lawrence) wrote:
> >
> > Braces are special characters in regular expressions, so escape that left
> > curly brace too. Also, I'm not sure what the Regex library's
> > (ereg_*/eregi_*) rule is on greediness, but in PCRE (preg_*) at least
> > you'd want to specify that the zero-or-more quantifier is ungreedy.
> >
> > "\..*?\{"
>
> <snip>
> and if I want to make some changes to all of my PHP files,and I need
> to insert something into the first line of all my functions, I suppose
> to select the line where the function is declared would be something
> like this? :
>
> "function .*(.*).*\{"

No, definitely not like that. You're in danger of matching far too much,
and it will surely match something quite different from what you intended.
That says: "match string 'function ', then match zero or more of anything,
then match and capture zero or more of anything, then match zero or more of
anything until you smack up against the final match character which is a
left curly brace (likely the last one occurring in the haystack)".

Using regular expressions without understanding the syntax can cause you a
lot of grief. For your sake, I urge you to spend some time learning POSIX
regular expression syntax (which is used by ereg/eregi) now, before you
accidentally overwrite something that's important to you. I'm suggesting
POSIX, even though I personally prefer PCRE, because most people seem to
find POSIX easier to start with, rather than diving headfirst into the
richer (though somewhat more complex) possibilities of PCRE. O'Reilly has
a (*the*) wonderful guide to regex, written by Jeffrey Friedl. Don't be
intimidated by its size: he's covering everything from the very basic to
the highly sophisticated, and doing it for many more flavors of regex than
are available in PHP. You'll learn a lot just from Friedl's first few
chapters, even if you never read farther than that.

Oh yeah, and ALWAYS test your regular expressions thoroughly before
trusting them to production use.

Good luck!

--
CC

Jul 16 '05 #9
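
Editor's sketch of a tighter pattern for the question in #8, following the advice in #9 to avoid open-ended ".*" and to test before trusting. It uses PCRE rather than the POSIX functions. The sample source and the inserted comment line are hypothetical, and the pattern is deliberately simple: it assumes the parameter list contains no ")" (so a default value such as array() would break it).

```php
<?php
// Hypothetical source text to patch; nowdoc keeps the "$" signs literal.
$source = <<<'PHP'
function add($a, $b) {
    return $a + $b;
}
function noop()
{
}
PHP;

// "function", a name made of word characters, a parameter list with no ")"
// inside it, then the opening brace (possibly on the next line).
$pattern = '/\bfunction\s+(\w+)\s*\(([^)]*)\)\s*\{/';

// First just list what would be matched, without changing anything.
preg_match_all($pattern, $source, $found);
print_r($found[1]);

// Then insert a line after each opening brace.
// $0 is the whole match, $1 the captured function name.
$replacement = '$0' . "\n" . '    // inserted for $1';
$patched = preg_replace($pattern, $replacement, $source);
echo $patched, "\n";
```

For anything beyond a quick one-off edit, PHP's tokenizer (token_get_all) is likely a safer way to locate function declarations than a regular expression.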

Adam i Agnieszka Gasiorowski FNORD

CC Zona wrote:

> Using regular expressions without understanding the syntax can cause you a
> lot of grief. For your sake, I urge you to spend some time learning POSIX
> regular expression syntax (which is used by ereg/eregi) now, before you
> accidentally overwrite something that's important to you. I'm suggesting
> POSIX, even though I personally prefer PCRE, because most people seem to
> find POSIX easier to start with, rather than diving headfirst into the
> richer (though somewhat more complex) possibilities of PCRE. O'Reilly has
> a (*the*) wonderful guide to regex, written by Jeffrey Friedl. Don't be
> intimidated by its size: he's covering everything from the very basic to
> the highly sophisticated, and doing it for many more flavors of regex than
> are available in PHP. You'll learn a lot just from Friedl's first few
> chapters, even if you never read farther than that.

I'm reading that book right now, and I highly
recommend it to anyone wanting to master regular
expressions. It's superb. A definite classic :8].

--
Seks, seksić, seksolatki... news:pl.soc.seks.moderowana
http://hyperreal.info { iWanToDie } WiNoNa ) (
http://szatanowskie-ladacznice.0-700.pl foReVeR( * )
Poznaj jej zwiewne kształty... http://www.opera.com 007

Jul 16 '05 #10
