Replacing characters + stripping HTML

Martin

I have an HTML parser that reads product pages from various retailers - and I
want to optimize it somewhat:

I download all HTML before I start the parsing - and to do that I want to:

- Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
<javascript> etc.
I'm considering using eregi_replace for this. Anyone have an example of how
to set this up ? I tried this with no luck:

$string = eregi_replace("<head>*</head>","", $string);

- As some pages have special characters, I'd like to redo these to normal
characters for ease when setting up a new parser. Right now I have this (which
I'm sure is not the fastest/best way):

$cont = str_replace("." ,".",$cont);
$cont = str_replace("," ,",",$cont);
$cont = str_replace("£" ,"£",$cont);
$cont = str_replace("€" ,"?",$cont);
$cont = str_replace("'" ,"'",$cont);
$cont = str_replace("-" ,"-",$cont);
$cont = str_replace("(" ,"(",$cont);
$cont = str_replace(")" ,")",$cont);
$cont = str_replace("[" ,"[",$cont);
$cont = str_replace("]" ,"]",$cont);

Any ideas to improve on this ?

Thanks.

Martin.

Jul 16 '05 #1


Chris Morris

"Martin" <ma****@home.se> writes:

> - Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
> <javascript> etc.
> I'm considering using eregi_replace for this. Anyone have an example of how
> to set this up ? I tried this with no luck:
>
> $string = eregi_replace("<head>*</head>","", $string);

Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".

You want "<head>.*</head>"

> - As some pages have special characters, I'd like to redo these to normal
> characters for ease when setting up a new parser. Right now I have this (which
> I'm sure is not the fastest/best way):

PHP 4.3.0 and later:
$cont = html_entity_decode($cont);

--
Chris

Jul 16 '05 #2
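
For anyone reading this on a current PHP: the ereg family (eregi_replace included) was deprecated in PHP 5.3 and removed in PHP 7.0, so here is a rough sketch of the same idea with preg_replace. The sample markup and variable contents are made up for illustration, and a regex is still a fragile way to cut up real-world HTML, so treat this as a starting point only.

```php
<?php
// Hypothetical page content, used only to demonstrate the patterns.
$string = "<html><head><title>Shop</title></head>"
        . "<body><script>var x = 1;</script><p>Price: 10</p></body></html>";

// '.*?' is the ungreedy form of '.*', so each match stops at the first
// closing tag. The 'i' flag ignores case (as eregi did) and the 's' flag
// lets '.' match newlines, because a real <head> spans many lines.
$patterns = array(
    '#<head\b[^>]*>.*?</head>#is',
    '#<script\b[^>]*>.*?</script>#is',
);

// preg_replace accepts an array of patterns and applies them in order.
$stripped = preg_replace($patterns, '', $string);

echo $stripped, "\n";
```

The `\b` after the tag name keeps `<head` from also matching `<header>`, and `[^>]*` allows for attributes on the opening tag.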

Martin

"Chris Morris" <c.********@durham.ac.uk> wrote in message
news:87************@dinopsis.dur.ac.uk...

> "Martin" <ma****@home.se> writes:
> > - Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
> > <javascript> etc.
> > I'm considering using eregi_replace for this. Anyone have an example of
> > how to set this up ? I tried this with no luck:
> >
> > $string = eregi_replace("<head>*</head>","", $string);
>
> Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".
>
> You want "<head>.*</head>"

Thanks - oh, what a difference a dot can make :).

> > - As some pages have special characters, I'd like to redo these to normal
> > characters for ease when setting up a new parser. Right now I have this
> > (which I'm sure is not the fastest/best way):
>
> PHP 4.3.0 and later
> $cont = html_entity_decode($cont);

Forgot to mention, but I had tried that before - it does not have the
desired effect, I'm afraid. It doesn't convert the characters I convert
"manually"...

> --
> Chris

Jul 16 '05 #3
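
Martin does not say which sequences were left unconverted, so the following is only a sketch under an assumption: that the page contains a mix of named entities (such as `&pound;`) and numeric ones (such as `&#46;`). On current PHP versions, html_entity_decode with explicit flags and a UTF-8 charset handles both kinds. The sample input is invented. The second half shows that str_replace also takes arrays, which would collapse the ten separate calls into one if only a fixed list has to be mapped.

```php
<?php
// Hypothetical input; these entities are assumed, not taken from the thread.
$cont = "Price&#58; &pound;10&#44;50 &#40;approx&#46; &euro;12&#41; &#39;sale&#39;";

// One call decodes named and numeric entities alike. ENT_QUOTES also
// converts &#39; and &quot;, and the charset fixes the output encoding.
$decoded = html_entity_decode($cont, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// If only a fixed handful of sequences must be mapped, str_replace accepts
// parallel arrays: each $search entry is replaced by the matching $replace.
$search  = array('&#46;', '&#44;', '&#40;', '&#41;');
$replace = array('.',     ',',     '(',     ')');
$partial = str_replace($search, $replace, $cont);

echo $decoded, "\n";
echo $partial, "\n";
```

The array form keeps the search/replace pairs in one place, which is easier to extend when a new retailer's pages turn up another odd sequence.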

Randell D.

"Martin" <ma****@home.se> wrote in message
news:TH****************@news1.bredband.com...

> I have an HTML parser that reads product pages from various retailers - and I
> want to optimize it somewhat:
>
> I download all HTML before I start the parsing - and to do that I want to:
>
> - Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
> <javascript> etc.
> I'm considering using eregi_replace for this. Anyone have an example of how
> to set this up ? I tried this with no luck:
>
> $string = eregi_replace("<head>*</head>","", $string);
>
> - As some pages have special characters, I'd like to redo these to normal
> characters for ease when setting up a new parser. Right now I have this (which
> I'm sure is not the fastest/best way):
>
> $cont = str_replace("." ,".",$cont);
> $cont = str_replace("," ,",",$cont);
> $cont = str_replace("£" ,"£",$cont);
> $cont = str_replace("€" ,"?",$cont);
> $cont = str_replace("'" ,"'",$cont);
> $cont = str_replace("-" ,"-",$cont);
> $cont = str_replace("(" ,"(",$cont);
> $cont = str_replace(")" ,")",$cont);
> $cont = str_replace("[" ,"[",$cont);
> $cont = str_replace("]" ,"]",$cont);
>
> Any ideas to improve on this ?
>
> Thanks.
>
> Martin.

Maybe you're going about this the wrong way... Would it not be better to
grab the data you want, as opposed to stripping the data you don't want?

Jul 16 '05 #4

Martin

> > I have an HTML parser that reads product pages from various retailers -
> > and I want to optimize it somewhat:
> >
> > I download all HTML before I start the parsing - and to do that I want to:
> >
> > - Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
> > <javascript> etc.
> > I'm considering using eregi_replace for this. Anyone have an example of
> > how to set this up ? I tried this with no luck:
> >
> > $string = eregi_replace("<head>*</head>","", $string);
> >
> > - As some pages have special characters, I'd like to redo these to normal
> > characters for ease when setting up a new parser. Right now I have this
> > (which I'm sure is not the fastest/best way):
> >
> > $cont = str_replace("." ,".",$cont);
> > $cont = str_replace("," ,",",$cont);
> > $cont = str_replace("£" ,"£",$cont);
> > $cont = str_replace("€" ,"?",$cont);
> > $cont = str_replace("'" ,"'",$cont);
> > $cont = str_replace("-" ,"-",$cont);
> > $cont = str_replace("(" ,"(",$cont);
> > $cont = str_replace(")" ,")",$cont);
> > $cont = str_replace("[" ,"[",$cont);
> > $cont = str_replace("]" ,"]",$cont);
> >
> > Any ideas to improve on this ?
> >
> > Thanks.
> >
> > Martin.
>
> Maybe you're going about this the wrong way... Would it not be better
> to grab the data you want, as opposed to stripping the data you don't want?

Sure - the problem is that I need most of the HTML, but there are some very
specific parts I don't want. So I grab the entire page, and then want to
remove what I don't need.

Jul 16 '05 #5
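
Since the goal is "keep most of the page, drop a few specific elements", a DOM parser is one alternative to regex surgery. The sketch below assumes PHP's bundled DOM extension is available; the markup and the list of tag names are invented for the example, so adjust both to the real pages.

```php
<?php
// Hypothetical markup standing in for a downloaded product page.
$html = '<html><head><title>Shop</title></head>'
      . '<body><script>var x = 1;</script><p>Price: 10</p></body></html>';

$doc = new DOMDocument();
// Retailer pages are rarely valid HTML, so collect parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

foreach (array('head', 'script') as $tagName) {
    // The node list is live, so copy it to an array before removing nodes.
    $nodes = iterator_to_array($doc->getElementsByTagName($tagName));
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
}

// Serialize only the <body> subtree that is left.
$body = $doc->getElementsByTagName('body')->item(0);
$clean = $doc->saveHTML($body);

echo $clean, "\n";
```

Unlike a pattern match, this does not care how the unwanted elements are spelled, nested, or spread over lines, at the cost of parsing the whole document first.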

lawrence

Chris Morris <c.********@durham.ac.uk> wrote in message news:<87************@dinopsis.dur.ac.uk>...

> "Martin" <ma****@home.se> writes:
> > - Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
> > <javascript> etc.
> > I'm considering using eregi_replace for this. Anyone have an example of
> > how to set this up ? I tried this with no luck:
> >
> > $string = eregi_replace("<head>*</head>","", $string);
>
> Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".
>
> You want "<head>.*</head>"

I can't get to www.php.net for some reason, so I can't formulate my
question as nicely as I'd like to ask it. But:

On another line of thought, I would get all class names in a style
sheet like this:

"\..*{"

Would this be the correct search string for something like this:

.articleHeadline{

I'm trying to say, "look for a pattern that starts with a dot and then
goes through any number of characters and numbers until you get to a
curly opening bracket."

Jul 16 '05 #6

CC Zona

In article <da**************************@posting.google.com>,
lk******@geocities.com (lawrence) wrote:

> I would get all class names in a style
> sheet like this:
>
> "\..*{"
>
> Would this be the correct search string for something like this:
>
> .articleHeadline{
>
> I'm trying to say, "look for a pattern that starts with a dot and then
> goes through any number of characters and numbers until you get to a
> curly opening bracket."

Braces are special characters in regular expressions, so escape that left
curly brace too. Also, I'm not sure what the Regex library's
(ereg_*/eregi_*) rule is on greediness, but in PCRE (preg_*) at least
you'd want to specify that the zero-or-more quantifier is ungreedy.

"\..*?\{"

-or-

"\.[^{]*\{"

--
CC

Jul 16 '05 #7
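
To make the greediness point concrete, here is a small sketch comparing the three patterns with preg_match_all. The one-line stylesheet is invented for the example. Note that a real stylesheet can contain other dots (for instance in `1.5em`), so these patterns are an illustration of the quantifiers rather than a complete CSS class extractor.

```php
<?php
// Hypothetical one-line stylesheet, used only for illustration.
$css = ".articleHeadline{color:red}.byline{font-size:10px}";

// Greedy: '.*' runs to the end of the line, then backs up to the LAST '{',
// so both rules are swallowed by a single match.
$greedyCount = preg_match_all('/\..*\{/', $css, $greedy);

// Ungreedy: '.*?' stops at the first '{' after each dot.
$lazyCount = preg_match_all('/\..*?\{/', $css, $lazy);

// Negated class: same effect without relying on '?', and the parentheses
// capture just the class name.
preg_match_all('/\.([^{]*)\{/', $css, $names);

echo $greedyCount, ' ', $lazyCount, "\n";
print_r($names[1]);
```

The negated-class form is the more portable of the two fixes, because POSIX-style syntax does not define ungreedy quantifiers.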

lawrence

CC Zona <cc****@nospam.invalid> wrote in message news:<cc**************************@netnews.attbi.com>...

> In article <da**************************@posting.google.com>,
> lk******@geocities.com (lawrence) wrote:
> > I would get all class names in a style
> > sheet like this:
> >
> > "\..*{"
> >
> > Would this be the correct search string for something like this:
> >
> > .articleHeadline{
> >
> > I'm trying to say, "look for a pattern that starts with a dot and then
> > goes through any number of characters and numbers until you get to a
> > curly opening bracket."
>
> Braces are special characters in regular expressions, so escape that left
> curly brace too. Also, I'm not sure what the Regex library's
> (ereg_*/eregi_*) rule is on greediness, but in PCRE (preg_*) at least
> you'd want to specify that the zero-or-more quantifier is ungreedy.
>
> "\..*?\{"
>
> -or-

And if I want to make some changes to all of my PHP files, and I need
to insert something into the first line of all my functions, I suppose
the pattern to select the line where the function is declared would be
something like this?

"function .*(.*).*\{"

Jul 16 '05 #8

CC Zona

In article <da**************************@posting.google.com>,
lk******@geocities.com (lawrence) wrote:

> CC Zona <cc****@nospam.invalid> wrote in message
> news:<cc**************************@netnews.attbi.com>...
> > In article <da**************************@posting.google.com>,
> > lk******@geocities.com (lawrence) wrote:
> >
> > Braces are special characters in regular expressions, so escape that left
> > curly brace too. Also, I'm not sure what the Regex library's
> > (ereg_*/eregi_*) rule is on greediness, but in PCRE (preg_*) at least
> > you'd want to specify that the zero-or-more quantifier is ungreedy.
> >
> > "\..*?\{"

<snip>

> and if I want to make some changes to all of my PHP files, and I need
> to insert something into the first line of all my functions, I suppose
> to select the line where the function is declared would be something
> like this? :
>
> "function .*(.*).*\{"

No, definitely not like that. You're in danger of matching far too much,
and it will surely match something quite different from what you intended.
That says: "match string 'function ', then match zero or more of anything,
then match and capture zero or more of anything, then match zero or more of
anything until you smack up against the final match character which is a
left curly brace (likely the last one occurring in the haystack)".

Using regular expressions without understanding the syntax can cause you a
lot of grief. For your sake, I urge you to spend some time learning POSIX
regular expression syntax (which is used by ereg/eregi) now, before you
accidentally overwrite something that's important to you. I'm suggesting
POSIX, even though I personally prefer PCRE, because most people seem to
find POSIX easier to start with, rather than diving headfirst into the
richer (though somewhat more complex) possibilities of PCRE. O'Reilly has
a (*the*) wonderful guide to regex, written by Jeffrey Friedl. Don't be
intimidated by its size: he's covering everything from the very basic to
the highly sophisticated, and doing it for many more flavors of regex than
are available in PHP. You'll learn a lot just from Friedl's first few
chapters, even if you never read farther than that.

Oh yeah, and ALWAYS test your regular expressions thoroughly before
trusting them to production use.

Good luck!

--
CC

Jul 16 '05 #9
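
As one possible tighter alternative to the pattern discussed above, the sketch below anchors the match to the start of a line and replaces each `.*` with a class that cannot run past the part it is meant to cover. It is written for PCRE (preg_replace), not ereg. The sample source and the `trace()` call being inserted are hypothetical, and the pattern deliberately ignores cases such as default arguments that contain parentheses, so it should be tested against a copy of the real files first.

```php
<?php
// Hypothetical source text; nowdoc syntax keeps the '$' signs literal.
$source = <<<'PHP'
function add($a, $b) {
    return $a + $b;
}

function greet($name)
{
    return "Hello {$name}";
}
PHP;

// ^[ \t]*      optional indentation at the start of a line ('m' flag)
// function\s+  the keyword followed by whitespace
// \w+          the function name
// \([^)]*\)    the parameter list, stopping at the first ')'
// \s*\{        the opening brace, on the same line or the next one
$pattern = '/^[ \t]*function\s+\w+\s*\([^)]*\)\s*\{/m';

// '$0' is the whole match; trace() is a made-up helper for illustration.
$replacement = '$0' . "\n    trace(__FUNCTION__);";

$result = preg_replace($pattern, $replacement, $source, -1, $count);

echo $count, " declaration(s) changed\n";
echo $result, "\n";
```

The fifth argument of preg_replace receives the number of replacements made, which gives a quick sanity check against the number of functions expected in the file before anything is written back.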

Adam i Agnieszka Gasiorowski FNORD

CC Zona wrote:

> Using regular expressions without understanding the syntax can cause you a
> lot of grief. For your sake, I urge you to spend some time learning POSIX
> regular expression syntax (which is used by ereg/eregi) now, before you
> accidentally overwrite something that's important to you. I'm suggesting
> POSIX, even though I personally prefer PCRE, because most people seem to
> find POSIX easier to start with, rather than diving headfirst into the
> richer (though somewhat more complex) possibilities of PCRE. O'Reilly has
> a (*the*) wonderful guide to regex, written by Jeffrey Friedl. Don't be
> intimidated by its size: he's covering everything from the very basic to
> the highly sophisticated, and doing it for many more flavors of regex than
> are available in PHP. You'll learn a lot just from Friedl's first few
> chapters, even if you never read farther than that.

I'm reading that book right now, I highly
recommend it to anyone wanting to master regular
expressions. It's superb. A definite classic :8].


Jul 16 '05 #10
