Replacing characters + stripping HTML

I have an HTML parser that reads product pages from various retailers, and I
want to optimize it somewhat.

I download all the HTML before I start parsing, and before parsing I want to:

- Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
<javascript> etc.
I'm considering using eregi_replace for this. Anyone have an example of how
to set this up ? I tried this with no luck:

$string = eregi_replace(" <head>*</head>","", $string);

- As some pages have special characters, I'd like to redo these to normal
characters for ease when setting up new parser. Right now I have this (which
I'm sure is not the fastest/best way):

$cont = str_replace("." ,".",$cont);
$cont = str_replace("," ,",",$cont);
$cont = str_replace("£" ,"£",$cont);
$cont = str_replace("€" ,"?",$cont);
$cont = str_replace("'" ,"'",$cont);
$cont = str_replace("-" ,"-",$cont);
$cont = str_replace("(" ,"(",$cont);
$cont = str_replace(")" ,")",$cont);
$cont = str_replace("[" ,"[",$cont);
$cont = str_replace("]" ,"]",$cont);

Any ideas to improve on this ?

Thanks.

Martin.
Jul 16 '05 #1
"Martin" <ma****@home.se > writes:
> - Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
> <javascript> etc.
> I'm considering using eregi_replace for this. Anyone have an example of how
> to set this up ? I tried this with no luck:
>
> $string = eregi_replace(" <head>*</head>","", $string);

Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".

You want "<head>.*</head>"

> - As some pages have special characters, I'd like to redo these to normal
> characters for ease when setting up new parser. Right now I have this (which
> I'm sure is not the fastest/best way):

PHP 4.3.0 and later
$cont = html_entity_decode($cont);

--
Chris
Jul 16 '05 #2
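
For readers on current PHP: the ereg functions were removed in PHP 7, so the suggestion above can be sketched with preg_replace instead. This is a minimal illustration, not the poster's code. The sample $html string and the choice of <head> and <script> as the tags to strip are assumptions, and the lazy .*? is swapped in for the greedy .* so that each match stops at the first closing tag it reaches.

```php
<?php
// Hypothetical page content, for illustration only.
$html = "<html><HEAD>\n<title>Shop</title>\n<script>var x = 1;</script>\n</head>"
      . "<body><p>Price: 10</p><script>track();</script></body></html>";

// i = case-insensitive (what eregi_* provided), s = "." also matches newlines,
// .*? = lazy, so each match ends at the first closing tag.
$patterns = [
    '/<head\b[^>]*>.*?<\/head>/is',
    '/<script\b[^>]*>.*?<\/script>/is',
];
$clean = preg_replace($patterns, '', $html);

echo $clean, "\n";
```

Regex-based stripping like this is fragile on malformed markup, so it is best treated as a quick pre-filter rather than a full HTML parser.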

"Chris Morris" <c.********@dur ham.ac.uk> wrote in message
news:87******** ****@dinopsis.d ur.ac.uk...
> "Martin" <ma****@home.se > writes:
> > - Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
> > <javascript> etc.
> > I'm considering using eregi_replace for this. Anyone have an example of how to set this up ? I tried this with no luck:
> >
> > $string = eregi_replace(" <head>*</head>","", $string);
>
> Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".
>
> You want "<head>.*</head>"

Thanks - oh, what a difference a dot can do :).

> > - As some pages have special characters, I'd like to redo these to normal characters for ease when setting up new parser. Right now I have this (which I'm sure is not the fastest/best way):
>
> PHP 4.3.0 and later
> $cont = html_entity_decode($cont);

Forgot to mention, but had tried that before - that does not have the
desired effect, I'm afraid. It doesn't convert the characters I convert
"manually"...

> --
> Chris

Jul 16 '05 #3
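
The exact search strings in the str_replace list did not survive in this thread, so the following is only a sketch under the assumption that they were HTML character references such as &#46; or &pound;. It shows two options on current PHP: html_entity_decode with explicit flags and charset (ENT_HTML5 needs PHP 5.4 or later), which decodes named and numeric references alike, and a single str_replace call with parallel arrays for cases where only a fixed set of strings should change. The sample $cont value is hypothetical.

```php
<?php
// Hypothetical input; the entity forms are assumed for illustration.
$cont = 'Price: &pound;10&#46;50 &#40;approx&#46; &euro;15&#41; &#91;sale&#93;';

// Decodes named and numeric character references in one pass.
$decoded = html_entity_decode($cont, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// If only a fixed set of strings should change, one str_replace call with
// parallel arrays replaces the chain of individual calls.
$search  = ['&#46;', '&#44;', '&#40;', '&#41;', '&#91;', '&#93;'];
$replace = ['.',     ',',     '(',     ')',     '[',     ']'];
$manual  = str_replace($search, $replace, $cont);

echo $decoded, "\n", $manual, "\n";
```

The array form leaves anything not listed (here &pound; and &euro;) untouched, which may be what is wanted when the later parsing step expects those entities.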

Maybe you're going around this in the wrong way... Would it not be better to
grab the data you want, as opposed to strip the data you don't want?
Jul 16 '05 #4
Sure - problem is I need most of the HTML - but some very specific parts I
don't want - therefore I grab the entire page - but want to remove what I
don't need.
Jul 16 '05 #5
Chris Morris <c.********@dur ham.ac.uk> wrote in message news:<87******* *****@dinopsis. dur.ac.uk>...
> "Martin" <ma****@home.se > writes:
> > - Get rid of all HTML parts that I don't need, i.e. <head>, <title>,
> > <javascript> etc.
> > I'm considering using eregi_replace for this. Anyone have an example of how
> > to set this up ? I tried this with no luck:
> >
> > $string = eregi_replace(" <head>*</head>","", $string);
>
> Replaces "<head" followed by 0 or more ">" followed by "</head>" with "".
>
> You want "<head>.*</head>"

I can't get to www.php.net for some reason, so I can't formulate my
question as nicely as I'd like to ask it. But:

On another line of thought, I would get all class names in a style
sheet like this:

"\..*{"

Would this be the correct search string for something like this:

.articleHeadline{

I'm trying to say, "look for a pattern that starts with a dot and then
goes through any number of characters and numbers until you get to a
curly opening bracket."
Jul 16 '05 #6
In article <da************ **************@ posting.google. com>,
lk******@geocit ies.com (lawrence) wrote:
> I would get all class names in a style
> sheet like this:
>
> "\..*{"
>
> Would this be the correct search string for something like this:
>
> .articleHeadline{
>
> I'm trying to say, "look for a pattern that starts with a dot and then
> goes through any number of characters and numbers until you get to a
> curly opening bracket."

Braces are special characters in regular expressions, so escape that left
curly brace too. Also, I'm not sure what the Regex library's
(ereg_*/eregi_*) rule is on greediness, but in PCRE (preg_*) at least
you'd want to specify that the zero-or-more quantifier is ungreedy.

"\..*?\{"

-or-

"\.[^{]*\{"

--
CC
Jul 16 '05 #7
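
The two patterns suggested above can be tried side by side with preg_match_all. This is a minimal sketch: the $css sample is hypothetical, and the capture group and the trim step are additions for pulling out just the name between the dot and the brace.

```php
<?php
// Hypothetical stylesheet text, kept free of decimal values such as "1.5em",
// which either pattern would mistake for the start of a class selector.
$css = ".articleHeadline{ color: red; }\n.byline { font-size: 10px; }";

// Negated class: a dot, anything that is not "{", then the escaped "{".
preg_match_all('/\.([^{]*)\{/', $css, $byClass);

// Lazy quantifier: a dot, as little as possible, then the escaped "{".
preg_match_all('/\.(.*?)\{/', $css, $byLazy);

$classNames = array_map('trim', $byClass[1]);
$lazyNames  = array_map('trim', $byLazy[1]);

print_r($classNames);
```

On this input both patterns capture the same names. They can differ on real stylesheets, since "." does not match a newline by default while the negated class does.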
CC Zona <cc****@nospam. invalid> wrote in message news:<cc******* *************** ****@netnews.at tbi.com>...
> In article <da************ **************@ posting.google. com>,
> lk******@geocit ies.com (lawrence) wrote:
> > I would get all class names in a style
> > sheet like this:
> >
> > "\..*{"
> >
> > Would this be the correct search string for something like this:
> >
> > .articleHeadline{
> >
> > I'm trying to say, "look for a pattern that starts with a dot and then
> > goes through any number of characters and numbers until you get to a
> > curly opening bracket."
>
> Braces are special characters in regular expressions, so escape that left
> curly brace too. Also, I'm not sure what the Regex library's
> (ereg_*/eregi_*) rule is on greediness, but in PCRE (preg_*) at least
> you'd want to specify that the zero-or-more quantifier is ungreedy.
>
> "\..*?\{"
>
> -or-

and if I want to make some changes to all of my PHP files, and I need
to insert something into the first line of all my functions, I suppose
to select the line where the function is declared would be something
like this? :

"function .*(.*).*\{"
Jul 16 '05 #8
In article <da************ **************@ posting.google. com>,
lk******@geocit ies.com (lawrence) wrote:
> CC Zona <cc****@nospam. invalid> wrote in message
> news:<cc******* *************** ****@netnews.at tbi.com>...
> > In article <da************ **************@ posting.google. com>,
> > lk******@geocit ies.com (lawrence) wrote:
> >
> > Braces are special characters in regular expressions, so escape that left
> > curly brace too. Also, I'm not sure what the Regex library's
> > (ereg_*/eregi_*) rule is on greediness, but in PCRE (preg_*) at least
> > you'd want to specify that the zero-or-more quantifier is ungreedy.
> >
> > "\..*?\{"
>
> <snip> and if I want to make some changes to all of my PHP files, and I need
> to insert something into the first line of all my functions, I suppose
> to select the line where the function is declared would be something
> like this? :
>
> "function .*(.*).*\{"


No, definitely not like that. You're in danger of matching far too much,
and it will surely match something quite different from what you intended.
That says: "match string 'function ', then match zero or more of anything,
then match and capture zero or more of anything, then match zero or more of
anything until you smack up against the final match character which is a
left curly brace (likely the last one occurring in the haystack)".

Using regular expressions without understanding the syntax can cause you a lot
of grief. For your sake, I urge you to spend some time learning POSIX
regular expression syntax (which is used by ereg/eregi) now before you
accidentally overwrite something that's important to you. I'm suggesting
POSIX, even though I personally prefer PCRE, because most people seem to
find POSIX easier to start with, rather than diving headfirst into the
richer (though somewhat more complex) possibilities of PCRE. O'Reilly has
a (*the*) wonderful guide to regex, written by Jeffrey Friedl. Don't be
intimidated by its size: he's covering everything from the very basic to
the highly sophisticated, and doing it for many more flavors of regex than
are available in PHP. You'll learn a lot just from Friedl's first few
chapters, even if you never read farther than that.

Oh yeah, and ALWAYS test your regular expressions thoroughly before
trusting them to production use.

Good luck!

--
CC
Jul 16 '05 #9
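
The over-matching described above is easy to reproduce. The sketch below is an illustration under assumptions: the $source sample is hypothetical, preg_* stands in for the ereg functions (removed in PHP 7), the s modifier is added so that "." crosses newlines as it would in a whole-file search, and the tighter pattern is one possible alternative rather than a complete parser for PHP function declarations.

```php
<?php
// Hypothetical file content (nowdoc, so nothing inside is interpolated).
$source = <<<'PHP'
function add($a, $b) {
    return $a + $b;
}
function greet($name) {
    if ($name) { echo "Hi $name"; }
}
PHP;

// The pattern from the question: each greedy ".*" grabs as much as it can,
// so the match runs from the first "function " to the LAST "{" in the text.
preg_match('/function .*(.*).*\{/s', $source, $loose);

// A tighter sketch: the name is \w+, the parameter list stops at the first
// ")", and the m modifier anchors each match to the start of a line.
preg_match_all('/^\s*function\s+(\w+)\s*\(([^)]*)\)\s*\{/m', $source, $tight);

echo strlen($loose[0]), " characters swallowed by the loose pattern\n";
print_r($tight[1]);
```

Here the loose pattern swallows both declarations in a single match, while the tighter one reports each function name and parameter list separately.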
CC Zona wrote:
> Using regular expressions without understanding the syntax can cause you a lot
> of grief. For your sake, I urge you to spend some time learning POSIX
> regular expression syntax (which is used by ereg/eregi) now before you
> accidentally overwrite something that's important to you. I'm suggesting
> POSIX, even though I personally prefer PCRE, because most people seem to
> find POSIX easier to start with, rather than diving headfirst into the
> richer (though somewhat more complex) possibilities of PCRE. O'Reilly has
> a (*the*) wonderful guide to regex, written by Jeffrey Friedl. Don't be
> intimidated by its size: he's covering everything from the very basic to
> the highly sophisticated, and doing it for many more flavors of regex than
> are available in PHP. You'll learn a lot just from Friedl's first few
> chapters, even if you never read farther than that.


I'm reading that book right now, I highly
recommend it to anyone wanting to master regular
expressions. It's superb. A definite classic :8].

Jul 16 '05 #10
