
How to strip styles embedded within html tags?

 
  #1  
Steve, October 18th, 2006, 10:05 AM

Hi,
I'm a complete PHP n00b slowly finding my way around.
I'm using the following function that I found on php.net to strip
out html and return only the text. It works well except when
there are styles embedded within the tags,
eg: <h3 id="pageName">Have a great day!! </h3>
This throws an error, whereas
<h3 >Thank you for your purchase! </h3> works like a charm.
It also falls over when crappy code has <h3>&nbsp;</h3> between
the tags.

What do I need to add to the below function to get it to work on
cases like above?
regards,
Steve

The function is:

function html2txt($txt){
    $search = array(
        '@<script[^>]*?>.*?</script>@si',  // Strip out javascript
        '@<[\/\!]*?[^<>]*?>@si',           // Strip out HTML tags
        '@<style[^>]*?>.*?</style>@siU',   // Strip style tags properly
        '@<![\s\S]*?--[ \t\n\r]*>@',       // Strip multi-line comments including CDATA
        "@</?[^>]*>*@"
    );
    $text = preg_replace($search, '', $txt);
    return $text;
}




  #2  
Kimmo Laine, October 18th, 2006, 11:35 AM

"Steve" <beachie41@NOSPAM.gmail.com> wrote in message
news:4bnZg.15962$lU6.10673@fe78.usenetserver.com...
> Hi,
> I'm a complete PHP n00b slowly finding my way around
> I'm using the following function that I found on php.net to strip out html
> and return only the text.

You know, there is a built-in function strip_tags() that does exactly that,
so if your only goal is to strip tags, you might just as well use that. If
you're just trying to learn php and regex, then you should go the hard way
and write the strip function by yourself. :)
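
For reference, a minimal sketch of what strip_tags() does with the three samples from the original post. The variable names are only illustrative; the markup strings are the ones quoted in the question.

```php
<?php
// Sample markup taken from the original question.
$withAttribute = '<h3 id="pageName">Have a great day!! </h3>';
$plain         = '<h3 >Thank you for your purchase! </h3>';
$onlyEntity    = '<h3>&nbsp;</h3>';

// strip_tags() removes the tags, attributes included, and keeps the text.
$a = strip_tags($withAttribute);
$b = strip_tags($plain);

// Entities are not decoded, so &nbsp; survives as literal text.
$c = strip_tags($onlyEntity);

var_dump($a, $b, $c);
```

Note that the attribute on the first sample makes no difference, and that the entity in the third sample has to be dealt with separately if it is unwanted.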

--
"Ohjelmoija on organismi joka muuttaa kofeiinia koodiksi" - lpk
http://outolempi.net/ahdistus/ - Satunnaisesti päivittyvä nettisarjis
spam@outolempi.net | rot13(xvzzb@bhgbyrzcv.arg)


  #3  
Steve, October 18th, 2006, 12:55 PM

"Kimmo Laine" <spam@outolempi.net> wrote in message
news:tAoZg.8539$7C4.2594@reader1.news.jippii.net...
> "Steve" <beachie41@NOSPAM.gmail.com> wrote in message
> news:4bnZg.15962$lU6.10673@fe78.usenetserver.com...
>> Hi,
>> I'm a complete PHP n00b slowly finding my way around
>> I'm using the following function that I found on php.net to
>> strip out html and return only the text.
>
> You know, there is a built-in function strip_tags() that does
> exactly that, so if your only goal is to strip tags, you might
> just as well use that. If you're just trying to learn php and
> regex, then you should go the hard way and write the strip
> function by yourself. :)

Hi,
Yes, but strip_tags doesn't address my problem, does it? I was
using it before I came across that function. As for learning
regex, I agree, but right now I need to be able to do this ASAP
:)
cheers



  #4  
Rik, October 18th, 2006, 02:25 PM

Steve wrote:
> "Kimmo Laine" <spam@outolempi.net> wrote in message
> news:tAoZg.8539$7C4.2594@reader1.news.jippii.net...
>> "Steve" <beachie41@NOSPAM.gmail.com> wrote in message
>> news:4bnZg.15962$lU6.10673@fe78.usenetserver.com...
>>> Hi,
>>> I'm a complete PHP n00b slowly finding my way around
>>> I'm using the following function that I found on php.net to
>>> strip out html and return only the text.
>>
>> You know, there is a built-in function strip_tags() that does
>> exactly that, so if your only goal is to strip tags, you might
>> just as well use that. If you're just trying to learn php and
>> regex, then you should go the hard way and write the strip
>> function by yourself. :)
>
> Hi,
> Yes, but strip_tags doesn't address my problem, does it? I was
> using it before I came across that function. As for learning
> regex, I agree, but right now I need to be able to do this ASAP
> :)

If you just simply want to strip ALL html tags:
preg_replace('/<[^>]*>/s','',$html);
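
Applied to the samples from the first post, that one-liner might be wrapped up roughly as below. This is only a sketch: the function name strip_all_tags() is made up for illustration, and the entity decoding and non-breaking-space cleanup are an added assumption about what "text only" should mean for the <h3>&nbsp;</h3> case.

```php
<?php
// Hypothetical helper name; the regex is the one suggested above.
function strip_all_tags(string $html): string
{
    // Remove anything that looks like a tag, attributes included.
    $text = preg_replace('/<[^>]*>/s', '', $html) ?? '';

    // Assumption: decode entities such as &nbsp; instead of leaving them as text.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // A decoded &nbsp; is U+00A0, which trim() ignores, so map it to a space first.
    return trim(str_replace("\u{00A0}", ' ', $text));
}

echo strip_all_tags('<h3 id="pageName">Have a great day!! </h3>'), "\n";
echo strip_all_tags('<h3 >Thank you for your purchase! </h3>'), "\n";
var_dump(strip_all_tags('<h3>&nbsp;</h3>'));
```

The "\u{00A0}" escape needs PHP 7 or later.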

--
Grtz,

Rik Wasmus


  #5  
John Dunlop, October 18th, 2006, 03:15 PM

Steve:
> I'm using the following function that I found on php.net to strip
> out html and return only the text.

PHP has a built-in function strip_tags():

http://www.php.net/manual/en/function.strip-tags.php

But strip_tags() doesn't really do what it says on the tin. It doesn't
know what is and what isn't a tag, and it commits the cardinal sin of
lumping other markup constructs - e.g., comment declarations - under
the rubric of "tag". Neither does it know about the minutiae of HTML,
such as markup minimisation. Whatever looks like a tag, *is* a tag in
its eyes, and vice versa. Add tag-soup and non-arbitrary markup to the
mix, and strip_tags() causes real problems.
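
A small sketch of the "whatever looks like a tag is a tag" behaviour described above. The input strings are invented for illustration; only strip_tags() itself comes from the thread.

```php
<?php
// A comment declaration is removed just as if it were a tag.
$comment = strip_tags('x<!-- note -->y');

// Plain text that merely looks like a tag is swallowed as well:
// everything from '<b' up to the next '>' is dropped.
$lookalike = strip_tags('if a <b and c> d');

var_dump($comment, $lookalike);
```

In the second case the words between '<' and '>' were never markup, yet they disappear from the result.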

> $search = array('@<script[^>]*?>.*?</script>@si', // Strip out javascript

Since '>' can occur in attribute values, the scan-forward-until-'>'
technique is primitive and can generate false positives. Without
parsing the rest of the tag, a '>' could be character data or it could
close the tag. Who knows? But I shouldn't think this would cause a
problem in all probability. I mean, major browser vendors got away
with this technique, why can't you?
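
To make the false positive concrete, here is a sketch with an invented <img> tag whose alt text contains a '>'. The quote-aware pattern is an assumption added for comparison, not something proposed in the thread, and it is still not a real parser.

```php
<?php
// Invented sample: the alt attribute value contains a '>'.
$html = '<img alt="a > b" src="x.png">text';

// Scan-forward-until-'>' stops at the '>' inside the attribute value,
// so the rest of the tag leaks into the output.
$naive = preg_replace('/<[^>]*>/s', '', $html);

// Assumed alternative: consume quoted attribute values whole, so a '>'
// inside quotes cannot end the tag early.
$quoteAware = preg_replace('/<(?:"[^"]*"|\'[^\']*\'|[^\'">])*>/s', '', $html);

var_dump($naive, $quoteAware);
```

The naive pattern leaves ` b" src="x.png">text` behind, while the quote-aware one leaves only `text`. Unbalanced quotes in tag soup will still confuse the second pattern.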

Inverting a quantifier's greediness twice, as the pattern below does,
means it reverts to its default greediness:

> '@<style[^>]*?>.*?</style>@siU', // Strip style tags properly

The U pattern modifier inverts quantifier greediness pattern-wide, but
the '?' inverts it again for the star it follows, meaning the second
star is now greedy again. Not what you want. The fix is to remove
either the U pattern modifier or the second '?'.
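
The double inversion is easy to see with two style blocks in one string. In this sketch the sample input is invented; the first pattern is the one from the original function and the other two are the fixes described above.

```php
<?php
// Invented sample: two style blocks with text in between.
$html = '<style>a{}</style>keep<style>b{}</style>end';

// Original pattern: U makes quantifiers lazy, the '?' flips '.*' back to
// greedy, so the match runs from the first <style> to the LAST </style>.
$broken = preg_replace('@<style[^>]*?>.*?</style>@siU', '', $html);

// Fix 1: drop the U modifier and keep the lazy '.*?'.
$fixedNoU = preg_replace('@<style[^>]*>.*?</style>@si', '', $html);

// Fix 2: keep U and drop the '?', so '.*' is lazy because of U.
$fixedWithU = preg_replace('@<style[^>]*>.*</style>@siU', '', $html);

var_dump($broken, $fixedNoU, $fixedWithU);
```

With the original pattern the word "keep" is lost along with both style blocks; either fix removes each block separately and preserves it.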

> '@<![\s\S]*?--[ \t\n\r]*>@', // Strip multi-line comments including CDATA

This is made up.

Comments are defined:

[91] comment declaration =
        mdo,
        ( comment,
          ( s | comment )* )?,
        mdc

[92] comment =
        com,
        SGML character*,
        com

In English, the declaration opens with '<!' (MDO) and closes with '>'
(MDC). In between is an optional comment followed by zero or more
comments or "whitespaces". Comments themselves are composed of '--'
(COM) followed by zero or more SGML characters followed by COM. As a
regular expression (untested):

/<!(--.*?--([ \r\n\t]|--.*?--)*)?>/s
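
A quick sketch that exercises the expression above on a few invented inputs, next to the plain '<[^>]*>' pattern for comparison. Only the sample strings and variable names are assumptions.

```php
<?php
// Pattern copied from the post; single quotes keep the backslashes for PCRE.
$commentDecl = '/<!(--.*?--([ \r\n\t]|--.*?--)*)?>/s';

// One comment, two comments in one declaration, and a comment containing '>'.
$single   = preg_replace($commentDecl, '', 'a<!-- one -->b');
$double   = preg_replace($commentDecl, '', 'x<!-- a -- -- b -->y');
$withGt   = preg_replace($commentDecl, '', 'p<!-- 1 > 0 -->q');

// The scan-forward-until-'>' pattern stops at the '>' inside the comment.
$naiveGt  = preg_replace('/<[^>]*>/s', '', 'p<!-- 1 > 0 -->q');

// Other markup declarations are deliberately not matched.
$doctype  = preg_replace($commentDecl, '', '<!DOCTYPE html>z');

var_dump($single, $double, $withGt, $naiveGt, $doctype);
```

The comment-aware pattern removes the whole declaration in the first three cases, the naive pattern leaves ` 0 -->` behind, and the DOCTYPE is left alone because it is not a comment declaration.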

> "@</?[^>]*>*@"

You say error, I say broken by design.

--
Jock

 
