How to strip styles embedded within html tags?

Hi,
I'm a complete PHP n00b slowly finding my way around.
I'm using the following function that I found on php.net to strip
out html and return only the text. It works well except when
there are styles embedded within the tags,
eg: <h3 id="pageName">Have a great day!! </h3>
This throws an error, whereas
<h3 >Thank you for your purchase! </h3> works like a charm.
It also falls over when crappy code has <h3>&nbsp;</h3> between
the tags.

What do I need to add to the below function to get it to work on
cases like above?
regards,
Steve

The function is:

function html2txt($txt){
    $search = array('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                    '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                    '@<style[^>]*?>.*?</style>@siU',  // Strip style tags properly
                    '@<![\s\S]*?--[ \t\n\r]*>@',      // Strip multi-line comments including CDATA
                    "@</?[^>]*>*@"
    );
    $text = preg_replace($search, '', $txt);
    return $text;
}

Oct 18 '06 #1
"Steve" <be*******@NOSPAM.gmail.com> wrote in message
news:4b*******************@fe78.usenetserver.com...
> Hi,
> I'm a complete PHP n00b slowly finding my way around
> I'm using the following function that I found on php.net to strip out html
> and return only the text.

You know, there is a built-in function strip_tags() that does exactly that,
so if your only goal is to strip tags, you might just as well use that. If
you're just trying to learn php and regex, then you should go the hard way
and write the strip function by yourself. :)
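
For reference, a minimal sketch of the built-in in use. The sample string is modelled on the markup in the question, and the variable names are made up for illustration. Note that strip_tags() only removes tags: entities such as &nbsp; are left in place.

```php
<?php
// Hypothetical input modelled on the markup quoted in the question.
$html = '<h3 id="pageName">Have a great day!! </h3><h3>&nbsp;</h3>';

// strip_tags() drops the tags, attributes included, and keeps the text.
// Entities are not decoded, so "&nbsp;" survives as literal text.
$text = strip_tags($html);

var_dump($text);
```

If the entities need to go as well, html_entity_decode() can be applied to the result as a separate step.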

--
"A programmer is an organism that turns caffeine into code" - lpk
http://outolempi.net/ahdistus/ - An occasionally updated webcomic
sp**@outolempi.net | rot13(xv***@bhgbyrzcv.arg)
Oct 18 '06 #2

"Kimmo Laine" <sp**@outolempi.net> wrote in message
news:tA*****************@reader1.news.jippii.net...
> "Steve" <be*******@NOSPAM.gmail.com> wrote in message
> news:4b*******************@fe78.usenetserver.com...
>> Hi,
>> I'm a complete PHP n00b slowly finding my way around
>> I'm using the following function that I found on php.net to
>> strip out html and return only the text.
>
> You know, there is a built-in function strip_tags() that does
> exactly that, so if your only goal is to strip tags, you might
> just as well use that. If you're just trying to learn php and
> regex, then you should go the hard way and write the strip
> function by yourself. :)

Hi,
Yes, but strip_tags doesn't address my problem, does it? I was
using it before I came across that function. As for learning
regex, I agree, but right now I need to be able to do this ASAP
:)
cheers

Oct 18 '06 #3
Rik
Steve wrote:
> "Kimmo Laine" <sp**@outolempi.net> wrote in message
> news:tA*****************@reader1.news.jippii.net...
>> "Steve" <be*******@NOSPAM.gmail.com> wrote in message
>> news:4b*******************@fe78.usenetserver.com...
>>> Hi,
>>> I'm a complete PHP n00b slowly finding my way around
>>> I'm using the following function that I found on php.net to
>>> strip out html and return only the text.
>>
>> You know, there is a built-in function strip_tags() that does
>> exactly that, so if your only goal is to strip tags, you might
>> just as well use that. If you're just trying to learn php and
>> regex, then you should go the hard way and write the strip
>> function by yourself. :)
>
> Hi,
> Yes, but strip_tags doesn't address my problem, does it? I was
> using it before I came across that function. As for learning
> regex, I agree, but right now I need to be able to do this ASAP
> :)

If you just simply want to strip ALL html tags:
preg_replace('/<[^>]*>/s','',$html);
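
A quick sketch of that one-liner applied to the markup from the original question. The wrapper name strip_all_tags() is invented here for illustration; only the pattern comes from the post.

```php
<?php
// Hypothetical wrapper around the pattern suggested above.
function strip_all_tags(string $html): string
{
    // '<', then any run of characters that are not '>', then '>'.
    // preg_replace() returns null on error, hence the fallback.
    return preg_replace('/<[^>]*>/s', '', $html) ?? '';
}

// The attribute inside the opening tag is no obstacle.
var_dump(strip_all_tags('<h3 id="pageName">Have a great day!! </h3>'));

// Entities between the tags are left as they are.
var_dump(strip_all_tags('<h3>&nbsp;</h3>'));
```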

--
Grtz,

Rik Wasmus
Oct 18 '06 #4
Steve:
> I'm using the following function that I found on php.net to strip
> out html and return only the text.

PHP has a built-in function strip_tags():

http://www.php.net/manual/en/function.strip-tags.php

But strip_tags() doesn't really do what it says on the tin. It doesn't
know what is and what isn't a tag, and it commits the cardinal sin of
lumping other markup constructs - e.g., comment declarations - under
the rubric of "tag". Neither does it know about the minutiae of HTML,
such as markup minimisation. Whatever looks like a tag *is* a tag in
its eyes, and vice versa. Add tag-soup and non-arbitrary markup to the
mix, and strip_tags() causes real problems.
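
A small sketch of the "lumping" point, using made-up sample strings. The PHP manual states that HTML comments are stripped along with tags, so a comment declaration gets the same treatment as an element tag.

```php
<?php
// Hypothetical samples: one real tag pair, one comment declaration.
$withTag     = '<p>Hello</p>';
$withComment = 'before<!-- a comment declaration, not a tag -->after';

// Both constructs are removed by the same call.
$fromTag     = strip_tags($withTag);
$fromComment = strip_tags($withComment);

var_dump($fromTag, $fromComment);
```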
> $search = array('@<script[^>]*?>.*?</script>@si', // Strip out javascript
Since '>' can occur in attribute values, the scan-forward-until-'>'
technique is primitive and can generate false positives. Without
parsing the rest of the tag, a '>' could be character data or it could
close the tag. Who knows? But in all probability I shouldn't think this
would cause a problem. I mean, major browser vendors got away
with this technique, why can't you?
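
To make the false positive concrete, here is a sketch with a made-up input whose attribute value contains '>'. The second pattern is only one possible way to skip over quoted attribute values; it is an assumption added for illustration, not something proposed in this thread, and it is still no substitute for a real parser.

```php
<?php
// Hypothetical input: the title attribute contains a literal '>'.
$html = '<a title="x > y" href="#">link</a>';

// Scan-forward-until-'>': the tag is considered closed inside the attribute.
$naive = preg_replace('/<[^>]*>/s', '', $html);

// Illustrative alternative: inside a tag, consume quoted strings whole so
// that a '>' between quotes does not end the tag.
$quoteAware = preg_replace('/<(?:[^>"\']|"[^"]*"|\'[^\']*\')*>/s', '', $html);

var_dump($naive);      // part of the attribute leaks into the text
var_dump($quoteAware); // only the link text remains
```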

Inverting a quantifier twice, as the pattern below does, means it
reverts to its default greediness:
> '@<style[^>]*?>.*?</style>@siU', // Strip style tags properly
The U pattern modifier inverts quantifier greediness pattern-wide, and
a trailing '?' inverts it again for that one quantifier, so the second
star ('.*?') ends up greedy. Not what you want: it will swallow
everything up to the last '</style>'. The fix is to either remove the U
pattern modifier or remove the second '?'.
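
A sketch of the difference, using a made-up input with two style blocks and some text between them. The variable names are illustrative only.

```php
<?php
// Hypothetical input: text that should survive sits between two style blocks.
$html = '<style>a{}</style>keep<style>b{}</style>';

// Original pattern: under U, '.*?' is greedy and runs to the LAST </style>.
$original = preg_replace('@<style[^>]*?>.*?</style>@siU', '', $html);

// Fix 1: drop the U modifier, so '.*?' is lazy as usual.
$withoutU = preg_replace('@<style[^>]*?>.*?</style>@si', '', $html);

// Fix 2: keep U but drop the second '?', so '.*' is lazy under U.
$withoutQ = preg_replace('@<style[^>]*?>.*</style>@siU', '', $html);

var_dump($original, $withoutU, $withoutQ);
```

With the original pattern the text between the two blocks is lost; both fixes keep it.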
> '@<![\s\S]*?--[ \t\n\r]*>@', // Strip multi-line comments including CDATA
This is made up.

Comments are defined:

[91] comment declaration =
mdo,
( comment,
( s |
comment )* )?,
mdc

[92] comment =
com,
SGML character*,
com

In English, the declaration opens with '<!' (MDO) and closes with '>'
(MDC). In between is an optional comment followed by zero or more
comments or "whitespaces". Comments themselves are composed of '--'
(COM) followed by zero or more SGML characters followed by COM. As a
regular expression (untested):

/<!(--.*?--([ \r\n\t]|--.*?--)*)?>/s

> "@</?[^>]*>*@"
You say error, I say broken by design.
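
Since the comment-declaration pattern further up was posted untested, here is a small sketch that exercises it on a few made-up inputs. The function name strip_comment_declarations() is invented for illustration.

```php
<?php
// Hypothetical helper wrapping the comment-declaration pattern given above.
function strip_comment_declarations(string $html): string
{
    // MDO '<!', optionally a comment followed by whitespace or further
    // comments, then MDC '>'. Each comment is '--' ... '--'.
    $pattern = '/<!(--.*?--([ \r\n\t]|--.*?--)*)?>/s';
    return preg_replace($pattern, '', $html) ?? '';
}

// One comment per declaration, then two comments in a single declaration.
var_dump(strip_comment_declarations('a<!-- one -->b<!-- two -- -- three -->c'));

// A '>' inside the comment does not end the declaration early.
var_dump(strip_comment_declarations('x<!-- a > b -->y'));

// Other declarations, such as a doctype, are not comments and are left alone.
var_dump(strip_comment_declarations('<!DOCTYPE html>z'));
```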

--
Jock

Oct 18 '06 #5
