Stripping HTML tags from a TEXTAREA field

Jeff North

Hi,
I'm using a control called HTMLArea which allows a person to enter
text and converts the format instructions to html tags. Most of my
users know nothing about html so this is perfect for my use.
http://www.interactivetools.com/products/htmlarea/
This only works with IE5.5+.

What I need to do is to take this html formatted text and only display
part of the text on a web page (much like a news article which shows
only part of the story line).

I need to be able to remove all of the html tags to correctly display
the data.

Is there a regex/replace instruction(s) that I can use to do this?

Many thanks
---------------------------------------------------------------
jn****@yourpantsbigpond.net.au : Remove your pants to reply
---------------------------------------------------------------

Jul 20 '05 #1


Michael Winter

On Mon, 19 Jan 2004 23:59:47 GMT, Jeff North
<jn****@yourpantsbigpond.net.au> wrote:

Is there a regex/replace instruction(s) that I can use to do this?

This will simply delete anything that resembles a tag in a string. This
means that the user cannot include anything within angle brackets, even if
the text does not form HTML.

string.replace( /<\S+>/g, '' );

For example,

var testString = '<textarea>this should still be here</textarea>';
testString = testString.replace( /<\S+>/g, '' );

will give testString the new value of "this should still be here".
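
To see what this pattern does and does not catch, it helps to run it over a few inputs. The following is only a sketch: the helper name stripNoSpaceTags and the second and third sample strings are made up for illustration and are not part of the original post.

```javascript
// Hypothetical helper: applies the /<\S+>/g pattern from the post above.
function stripNoSpaceTags(text) {
  // replace() returns a new string; the original string is left unchanged.
  return text.replace(/<\S+>/g, '');
}

// Content containing spaces: both tags are removed.
var a = stripNoSpaceTags('<textarea>this should still be here</textarea>');
// a === 'this should still be here'

// A tag with an attribute contains whitespace, so the opening tag survives.
var b = stripNoSpaceTags('<font face="arial">some text</font>');
// b === '<font face="arial">some text'

// Content with no whitespace at all is swallowed together with its tags.
var c = stripNoSpaceTags('<b>bold</b>');
// c === ''
```

The last case is easy to overlook: \S+ is greedy, so with nothing but non-space characters between the first '<' and the last '>' the whole run counts as one "tag".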

Mike

--
Michael Winter
M.******@blueyonder.co.invalid (replace ".invalid" with ".uk" to reply)

Jul 20 '05 #2

Evertjan.

Jeff North wrote on 20 jan 2004 in comp.lang.javascript:

I'm using a control called HTMLArea which allows a person to enter
text and converts the format instructions to html tags. Most of my
users know nothing about html so this is perfect for my use.
http://www.interactivetools.com/products/htmlarea/
This only works with IE5.5+.

What I need to do is to take this html formatted text and only display
part of the text on a web page (much like a news article which shows
only part of the story line).

I need to be able to remove all of the html tags to correctly display
the data.

Is there a regex/replace instruction(s) that I can use to do this?

Only for IE:

<div id=temp></div>
<SCRIPT>
t="<span>example <b>of</b> html text</span>"
temp.innerHTML=t
t=temp.innerText
temp.innerHTML=""
alert(t)
</SCRIPT>
--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)

Jul 20 '05 #3

Michael Winter

On Tue, 20 Jan 2004 00:12:49 GMT, Michael Winter
<M.******@blueyonder.co.invalid> wrote:

string.replace( /<\S+>/g, '' );

Oops. That should be something more like:

string.replace( /<.+>/g, '' );

Sorry,
Mike

--
Michael Winter
M.******@blueyonder.co.invalid (replace ".invalid" with ".uk" to reply)

Jul 20 '05 #4

Lasse Reichstein Nielsen

Michael Winter <M.******@blueyonder.co.invalid> writes:

On Tue, 20 Jan 2004 00:12:49 GMT, Michael Winter
<M.******@blueyonder.co.invalid> wrote:
string.replace( /<\S+>/g, '' );

Oops. That should be something more like:

string.replace( /<.+>/g, '' );

Sorry,

I thought it was deliberate.
The first would correctly clean up "I am <b>so very tired</b>".
The second would leave it as "I am ".
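
The difference is easiest to see by looking at what each pattern actually matches. A small sketch, using only the example sentence above (the variable names are arbitrary):

```javascript
var sample = 'I am <b>so very tired</b>';

// /<\S+>/ cannot cross whitespace, so here it finds the two tags separately.
var separate = sample.match(/<\S+>/g);     // ['<b>', '</b>']

// /<.+>/ is greedy: it runs from the first '<' to the last '>' on the line.
var greedy = sample.match(/<.+>/g);        // ['<b>so very tired</b>']

var first = sample.replace(/<\S+>/g, '');  // 'I am so very tired'
var second = sample.replace(/<.+>/g, '');  // 'I am '
```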

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'

Jul 20 '05 #5

Jeff North

On Tue, 20 Jan 2004 00:31:58 GMT, in comp.lang.javascript Michael
Winter <M.******@blueyonder.co.invalid> wrote:

| On Tue, 20 Jan 2004 00:12:49 GMT, Michael Winter
| <M.******@blueyonder.co.invalid> wrote:
|
| > string.replace( /<\S+>/g, '' );
|
| Oops. That should be something more like:
|
| string.replace( /<.+>/g, '' );

Thanks Mike for your help. It's most appreciated.

A couple of small problems :-)
The first example didn't remove all of the tags. It mainly left the
font opening tag but successfully removed the closing tag.

The second example wiped the entire text.

So this is what I came up with
-----------------------
var txt2= new String();
var tmp = new String();
while( !rs.EOF )
{
tmp = rs.Fields.Item("Contents").Value;
tmp = tmp.replace( /<\S+>/gi, ' ' );
tmp = tmp.replace( /<.+>/gi, ' ' );
txt2 += tmp;
rs.moveNext();
}
----------------------
Works like a charm :-)
Why it works I have no idea; my regex knowledge is practically zero.
Could there be a difference between the client-side and server-side
implementations within IIS?
---------------------------------------------------------------
jn****@yourpantsbigpond.net.au : Remove your pants to reply
---------------------------------------------------------------

Jul 20 '05 #6

Jeff North

On 20 Jan 2004 00:15:37 GMT, in comp.lang.javascript "Evertjan."
<ex**************@interxnl.net> wrote:

| Jeff North wrote on 20 jan 2004 in comp.lang.javascript:
|
| > I'm using a control called HTMLArea which allows a person to enter
| > text and converts the format instructions to html tags. Most of my
| > users know nothing about html so this is perfect for my use.
| > http://www.interactivetools.com/products/htmlarea/
| > This only works with IE5.5+.
| >
| > What I need to do is to take this html formatted text and only display
| > part of the text on a web page (much like a news article which shows
| > only part of the story line).
| >
| > I need to be able to remove all of the html tags to correctly display
| > the data.
| >
| > Is there a regex/replace instruction(s) that I can use to do this?
|
| Only for IE:
|
| <div id=temp></div>
| <SCRIPT>
| t="<span>example <b>of</b> html text</span>"
| temp.innerHTML=t
| t=temp.innerText
| temp.innerHTML=""
| alert(t)
| </SCRIPT>

An interesting technique. Unfortunately I need it to be non-browser
specific.
---------------------------------------------------------------
jn****@yourpantsbigpond.net.au : Remove your pants to reply
---------------------------------------------------------------

Jul 20 '05 #7

Evertjan.

Jeff North wrote on 20 jan 2004 in comp.lang.javascript:

So this is what I came up with
-----------------------
var txt2= new String();
var tmp = new String();
while( !rs.EOF )
{
tmp = rs.Fields.Item("Contents").Value;
tmp = tmp.replace( /<\S+>/gi, ' ' );
tmp = tmp.replace( /<.+>/gi, ' ' );

(The /i case-insensitive flag is superfluous: neither pattern contains a
literal letter.)

txt2 += tmp;
rs.moveNext();
}

Besides my IE-only posting, which IMHO gives the best results and could be
used behind a browser test, you could try a regex that stops at the first
closing bracket:

tmp = tmp.replace( /<[^>]+>/g, ' ' );

Or more modern with the '?' nongreedy operator:

tmp = tmp.replace( /<.+?>/gi, ' ' );

Both will fail in this string:

<img src='x.gif' alt='not visible > hi there < not visible'>
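
A short sketch of both patterns on ordinary markup and on the awkward string above (variable names are arbitrary; an empty replacement is used here to keep the results easy to read):

```javascript
var ordinary = '<span>example <b>of</b> html text</span>';
var tricky = "<img src='x.gif' alt='not visible > hi there < not visible'>";

// On ordinary markup the two patterns give the same result.
var okClass = ordinary.replace(/<[^>]+>/g, '');  // 'example of html text'
var okLazy = ordinary.replace(/<.+?>/g, '');     // 'example of html text'

// Both stop at the first '>' they meet, even inside a quoted attribute value,
// so part of the alt text leaks into the output.
var badClass = tricky.replace(/<[^>]+>/g, '');   // ' hi there '
var badLazy = tricky.replace(/<.+?>/g, '');      // ' hi there '
```

Neither pattern knows about quotes, which is why a regex can only approximate real HTML parsing.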

--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)

Jul 20 '05 #8

Jeff North

On 20 Jan 2004 09:57:01 GMT, in comp.lang.javascript "Evertjan."
<ex**************@interxnl.net> wrote:

| Jeff North wrote on 20 jan 2004 in comp.lang.javascript:
|
| > So this is what I came up with
| > -----------------------
| > var txt2= new String();
| > var tmp = new String();
| > while( !rs.EOF )
| > {
| > tmp = rs.Fields.Item("Contents").Value;
| > tmp = tmp.replace( /<\S+>/gi, ' ' );
| > tmp = tmp.replace( /<.+>/gi, ' ' );
|
| the /i case insensitive is superfluous
I thought that too but added it as a precaution :-)
Would this add any significant processing time? The strings I'm using
can get pretty long.
| > txt2 += tmp;
| > rs.moveNext();
| >}
|
| Next to my IEonly posting, which gives IMHO the best results and could be
| used in a browser testing code, you could try a nongreedy regex:
|
| tmp = tmp.replace( /<[^>]+>/g, ' ' );
|
| Or more modern with the '?' nongreedy operator:
|
| tmp = tmp.replace( /<.+?>/gi, ' ' );
|
| Both will fail in this string:
|
| <img src='x.gif' alt='not visible > hi there < not visible'>

No wonder I could never understand regex :-)
Are there any good tutorials available for regex (plus lots of examples
to use)?
---------------------------------------------------------------
jn****@yourpantsbigpond.net.au : Remove your pants to reply
---------------------------------------------------------------

Jul 20 '05 #9

Michael Winter

On Tue, 20 Jan 2004 03:39:18 +0100, Lasse Reichstein Nielsen
<lr*@hotpop.com> wrote:

Michael Winter <M.******@blueyonder.co.invalid> writes:
On Tue, 20 Jan 2004 00:12:49 GMT, Michael Winter
<M.******@blueyonder.co.invalid> wrote:
string.replace( /<\S+>/g, '' );

Oops. That should be something more like:

string.replace( /<.+>/g, '' );

Sorry,

I thought it was deliberate.
The first would correctly clean up "I am <b>so very tired</b>".
The second would leave it as "I am ".

It would. However, the first would not remove any tags that contain spaces.
That might not be an issue, but the second version doesn't (seem to) do any
harm.

Mike

--
Michael Winter
M.******@blueyonder.co.invalid (replace ".invalid" with ".uk" to reply)

Jul 20 '05 #10

Michael Winter

On Tue, 20 Jan 2004 08:54:17 GMT, Jeff North
<jn****@yourpantsbigpond.net.au> wrote:

On Tue, 20 Jan 2004 00:31:58 GMT, in comp.lang.javascript Michael
Winter <M.******@blueyonder.co.invalid> wrote:
| On Tue, 20 Jan 2004 00:12:49 GMT, Michael Winter
| <M.******@blueyonder.co.invalid> wrote:
|
| > string.replace( /<\S+>/g, '' );
|
| Oops. That should be something more like:
|
| string.replace( /<.+>/g, '' );
> The first example didn't remove all of the tags. It mainly left the
> font opening tag but successfully removed the closing tag.

The first wouldn't remove tags that contained any whitespace, so tags with
attributes, or XHTML-style empty tags (<br />, for example) would remain.
That's what prompted the second suggestion.

> The second example wiped the entire text.

I tested it with strings that I thought would cause unwanted results, but
they came out fine. I was surprised (with a little more thought after
posting it) that the entire text wasn't wiped. I just found out why[1].

The best safe result I can get is:

.replace( /<[^<>]+>/g, '' )

The only problem is that if angle brackets appear inside tags, the tag
won't be removed properly. Such an occurrence isn't really likely,
unless someone wants to explicitly exploit this hole.

> tmp = tmp.replace( /<\S+>/gi, ' ' );
> tmp = tmp.replace( /<.+>/gi, ' ' );

I think I can explain why this works in your tests. The expression /<.+>/
matches "<anything>", where "anything" is literally that: letters,
numbers, punctuation, symbols, etc. If a tag is paired, like this:

<em id="example">This is emphasised</em>

the "em id=....</em" matches the '.' token in the regular expression. The
earlier expression, /<\S+>/ would remove the closing tag, leaving:

<em id="example">This is emphasised

which is then correctly handled by the greedy second expression. However,
if you try this:

The word, <em>this</em> is emphasised

you'll only get:

The word, is emphasised

back. That is why you should try the third suggestion, /<[^<>]+>/g,
despite its flaw.
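
The two cases can be checked side by side. This is just a sketch; the function names twoPass and singlePass are invented here for illustration:

```javascript
// The two-pass approach as posted (each match is replaced by one space).
function twoPass(text) {
  return text.replace(/<\S+>/gi, ' ').replace(/<.+>/gi, ' ');
}

// The single-pass pattern suggested as the third option.
function singlePass(text) {
  return text.replace(/<[^<>]+>/g, '');
}

var paired = '<em id="example">This is emphasised</em>';
var inline = 'The word, <em>this</em> is emphasised';

var r1 = twoPass(paired);     // ' This is emphasised '
var r2 = twoPass(inline);     // 'The word,   is emphasised' ("this" is lost)
var r3 = singlePass(paired);  // 'This is emphasised'
var r4 = singlePass(inline);  // 'The word, this is emphasised'
```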

What a mess this is becoming. :)

Mike
[1] The reason is inconsequential, but it made the testing unfair.

--
Michael Winter
M.******@blueyonder.co.invalid (replace ".invalid" with ".uk" to reply)

Jul 20 '05 #11

Evertjan.

Jeff North wrote on 20 jan 2004 in comp.lang.javascript:

| the /i case insensitive is superfluous [..]|
| tmp = tmp.replace( /<[^>]+>/g, ' ' );
|
| Or more modern with the '?' nongreedy operator:
|
| tmp = tmp.replace( /<.+?>/gi, ' ' );
|
| Both will fail in this string:
|
| <img src='x.gif' alt='not visible > hi there < not visible'>

> No wonder I could never understand regex :-)

Yes, those i's in /gi have a tendency to reappear by themselves ;-)

/<[^>]+>/g

start with a <
accept all following chars except > ([^>]), with a minimum of 1 (+)
and a > at the end
/g: do this ad nauseam

/<.+?>/g

start with a <
accept all following chars (.), with a minimum of 1 (+), up to (?) and
including the first > that follows
/g: do this ad nauseam
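
One practical difference between the two forms, shown as a rough sketch with made-up sample strings: '.' does not match a line break, while a negated class such as [^>] does.

```javascript
var oneLine = '<p>Dear <b>All</b>,</p>';
var multiLine = '<font\nface="arial">Dear All</font>';

// On a single line the two patterns behave identically.
var classOne = oneLine.replace(/<[^>]+>/g, '');    // 'Dear All,'
var lazyOne = oneLine.replace(/<.+?>/g, '');       // 'Dear All,'

// A tag that spans two lines is matched by [^>]+ but not by .+?,
// because '.' stops at the line break (unless the newer 's' flag is used).
var classMulti = multiLine.replace(/<[^>]+>/g, '');  // 'Dear All'
var lazyMulti = multiLine.replace(/<.+?>/g, '');     // '<font\nface="arial">Dear All'
```

This matters for text coming out of an editor control, where long tags are often wrapped across lines.
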

> Are there any good tutorials available for regex (plus lots of examples
> to use)?

<http://www.google.com/search?q=regex.tutorial> 819 hits
<http://www.google.com/search?q=regex.examples> 491 hits

--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)

Jul 20 '05 #12

Jeff North

On Tue, 20 Jan 2004 16:22:14 GMT, in comp.lang.javascript Michael
Winter <M.******@blueyonder.co.invalid> wrote:

| On Tue, 20 Jan 2004 08:54:17 GMT, Jeff North
| <jn****@yourpantsbigpond.net.au> wrote:
|
| > On Tue, 20 Jan 2004 00:31:58 GMT, in comp.lang.javascript Michael
| > Winter <M.******@blueyonder.co.invalid> wrote:
| >
| >> | On Tue, 20 Jan 2004 00:12:49 GMT, Michael Winter
| >> | <M.******@blueyonder.co.invalid> wrote:
| >> |
| >> | > string.replace( /<\S+>/g, '' );
| >> |
| >> | Oops. That should be something more like:
| >> |
| >> | string.replace( /<.+>/g, '' );
| >
| > The first example didn't remove all of the tags. It mainly left the
| > font opening tag but successfully removed the closing tag.
|
| The first wouldn't remove tags that contained any whitespace, so tags with
| attributes, or XHTML-style empty tags (<br />, for example) would remain.
| That's what prompted the second suggestion.
|
| > The second example wiped the entire text.
|
| I tested it with strings that I thought would cause unwanted results, but
| they came out fine. I was surprised (with a little more thought after
| posting it) that the entire text wasn't wiped. I just found out why[1].
|
| The best safe result I can get is:
|
| .replace( /<[^<>]+>/g, '' )
|
| The only problem is that if angle brackets appear inside tags, the tag
| won't be removed properly. Such an occurrence isn't really likely,
| unless someone wants to explicitly exploit this hole.
|
| > tmp = tmp.replace( /<\S+>/gi, ' ' );
| > tmp = tmp.replace( /<.+>/gi, ' ' );
|
| I think I can explain why this works in your tests. The expression /<.+>/
| matches "<anything>", where "anything" is literally that: letters,
| numbers, punctuation, symbols, etc. If a tag is paired, like this:
|
| <em id="example">This is emphasised</em>
|
| the "em id=....</em" matches the '.' token in the regular expression. The
| earlier expression, /<\S+>/ would remove the closing tag, leaving:
|
| <em id="example">This is emphasised
|
| which is then correctly handled by the greedy second expression. However,
| if you try this:
|
| The word, <em>this</em> is emphasised
|
| you'll only get:
|
| The word, is emphasised
|
| back. That is why you should try the third suggestion, /<[^<>]+>/g,
| despite its flaw.
|
| What a mess this is becoming. :)
|
| Mike
|
|
| [1] The reason is inconsequential, but it made the testing unfair.

Mike and Evertjan, thanks for all your time and effort; it is greatly
appreciated.

Mike, I tried your 3rd suggestion and it appears to work (so I won't
annoy you anymore LOL).

Here is what I've ended up with and some sample text. I know that
there is probably a more elegant way of doing this but I think that
this is almost self-documenting and easily modifiable:
----------------------------------
//--- read data from database
//--- strip out html tags and convert symbols to characters.
//--- var msg is called in client-side script.
var msg = new String( rsDir.Fields.Item("contents").Value );
msg = msg.replace(/\n/g,"");
msg = msg.replace(/\r/g,"");

//--- any double quote -> single quote
msg = msg.replace(/"/gi,"\'");
msg = msg.replace(/–/g,"-");

//--- any left/right quotes to a single quote
msg = msg.replace(/’/g,"\'");
msg = msg.replace(/“/g,"\'");
msg = msg.replace(/”/g,"\'");

//--- remove non-breaking spaces
msg = msg.replace(/&nbsp;/gi," ");

//--- strip html tags from text (courtesy of Michael Winter at
//--- comp.lang.javascript newsgroup)
msg = msg.replace( /<[^<>]+>/g, '' );
..
..
..
..
<script>
function ShowMsg()
{
//--- display a message. Do not break/split a word.
var ct = 200; //--- max. characters
var msg = new String();
msg = "<%=msg%>";
//--- move back to first space character.
while( ct > 0 && msg.charAt(ct) != " ") ct--;

document.write( msg.substr(0,ct) + "..." );
}
</script>
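
One thing to watch in the loop above: when the message is shorter than the limit, charAt(ct) returns an empty string, so the loop walks back to the last space and the final word is dropped. A possible variant is sketched below; the function name teaser and the sample string are assumptions made for illustration only.

```javascript
// Hypothetical helper: cut text to at most maxChars without splitting a word.
function teaser(text, maxChars) {
  // Short text needs no cutting and no ellipsis.
  if (text.length <= maxChars) return text;
  var cut = maxChars;
  // Walk back to the nearest space at or before the limit.
  while (cut > 0 && text.charAt(cut) !== ' ') cut--;
  // No space found at all: fall back to a hard cut at the limit.
  if (cut === 0) cut = maxChars;
  return text.substring(0, cut) + '...';
}

// Strip the tags first, then shorten the plain text.
var plain = '<P>Dear <B>All</B>, 2003 will soon be a memory.</P>'
  .replace(/<[^<>]+>/g, '');

var shortened = teaser(plain, 20);   // 'Dear All, 2003 will...'
var untouched = teaser('Short', 20); // 'Short'
```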

------------ sample text ------------
<P><FONT face="arial, helvetica, sans-serif">Dear
All,</FONT></P>\r\n<P><FONT face="arial, helvetica,
sans-serif">2003 will soon be nothing more than a memory. But to
my mind, this last year will continue to live on as an "annus
mirabilis" -  year of wonders. </FONT></P>\r\n<P><FONT
face="arial, helvetica, sans-serif">And it has been
wonderful - our staff and students really covered themselves in glory
during 2003, with awards and accolades coming from virtually every
quarter. But we all know that awards only tell part of the story. What
made this last year “truly wonderful” was the fact that
the Institute achieved so much, in spite of a host of challenges and
uncertainties. We were able to succeed because of one simple fact
– our fantastic staff. All staff regularly did more with less
and continued to provide the very best in vocational education and
training. Thank you for all your hard work.</FONT></P>\r\n<P><FONT
face="arial, helvetica, sans-serif">In many ways, the coming
year will mark the beginning of profound changes to the way in which
Sydney Institute operates. Staff numbers will increase. Reporting
lines and responsibilities will change. Our business and work culture
will have to adapt to new circumstances, personalities and
opportunities. It will be a challenge. However, I am confident we will
meet these challenges in the same way TAFE has coped with change for
over 110 years – with professionalism and dedication. Those
qualities made 2003 a year to remember and I know that 2004
won’t be any different.</FONT></P>\r\n<P><FONT face="arial,
helvetica, sans-serif">Thank you again for all your efforts
during this last year. I look forward to 2004 with anticipation. I
hope you have a safe and happy holiday
season.</FONT></P>\r\n<P><BR><FONT face="arial, helvetica,
sans-serif">
-------------------------------
---------------------------------------------------------------
jn****@yourpantsbigpond.net.au : Remove your pants to reply
---------------------------------------------------------------

Jul 20 '05 #13

John

Jeff North <jn****@yourpantsbigpond.net.au> wrote in
news:n4********************************@4ax.com:

On 20 Jan 2004 00:15:37 GMT, in comp.lang.javascript "Evertjan."
<ex**************@interxnl.net> wrote:
| Jeff North wrote on 20 jan 2004 in comp.lang.javascript:
|
| > I'm using a control called HTMLArea which allows a person to enter
| > text and converts the format instructions to html tags. Most of my
| > users know nothing about html so this is perfect for my use.
| > http://www.interactivetools.com/products/htmlarea/
| > This only works with IE5.5+.
| >
| > What I need to do is to take this html formatted text and only
| > display part of the text on a web page (much like a news article
| > which shows only part of the story line).
| >
| > I need to be able to remove all of the html tags to correctly
| > display the data.
| >
| > Is there a regex/replace instruction(s) that I can use to do this?
|
| Only for IE:
|
| <div id=temp></div>
| <SCRIPT>
| t="<span>example <b>of</b> html text</span>"
| temp.innerHTML=t
| t=temp.innerText
| temp.innerHTML=""
| alert(t)
| </SCRIPT>

An interesting technique. Unfortunately I need it to be non-browser
specific.
---------------------------------------------------------------
jn****@yourpantsbigpond.net.au : Remove your pants to reply
---------------------------------------------------------------

How about:
<SCRIPT>

function strip(el)
{
var retVal="";
for(var z=0; z < el.childNodes.length; z++)
if(el.childNodes[z].nodeName=="#text")
retVal+=el.childNodes[z].nodeValue;
else
retVal+=strip(el.childNodes[z]);
return retVal;
}

e=document.createElement("div");
e.innerHTML="<span>example <b>of</b> html text</span>";
t=strip(e);
alert(t)
</SCRIPT>
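
The recursion in strip() can be followed without a browser by running the same walk over a hand-built tree. This is only a sketch: collectText, text and elem are invented names, and the plain objects merely imitate the childNodes, nodeType and nodeValue properties of real DOM nodes.

```javascript
// Collect the text of every text node below `node`, depth first.
// nodeType 3 is the DOM value for a text node.
function collectText(node) {
  if (node.nodeType === 3) return node.nodeValue;
  var out = '';
  var kids = node.childNodes || [];
  for (var i = 0; i < kids.length; i++) out += collectText(kids[i]);
  return out;
}

// Stand-ins for DOM nodes, representing
// <span>example <b>of</b> html text</span>
function text(s) { return { nodeType: 3, nodeValue: s, childNodes: [] }; }
function elem(kids) { return { nodeType: 1, childNodes: kids }; }

var tree = elem([text('example '), elem([text('of')]), text(' html text')]);
var result = collectText(tree); // 'example of html text'
```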

Jul 20 '05 #14

Evertjan.

John wrote on 13 feb 2004 in comp.lang.javascript:

An interesting technique. Unfortunately I need it to be non-browser
specific.

How about:
<SCRIPT>

function strip(el)
{
var retVal="";
for(var z=0; z < el.childNodes.length; z++)
if(el.childNodes[z].nodeName=="#text")
retVal+=el.childNodes[z].nodeValue;
else
retVal+=strip(el.childNodes[z]);
return retVal;
}

e=document.createElement("div");
e.innerHTML="<span>example <b>of</b> html text</span>";
t=strip(e);
alert(t)
</SCRIPT>

I thought the purpose was to eliminate innerHTML?

--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)

Jul 20 '05 #15

John

"Evertjan." <ex**************@interxnl.net> wrote in
news:Xn********************@194.109.133.29:

John wrote on 13 feb 2004 in comp.lang.javascript:
An interesting technique. Unfortunately I need it to be non-browser
specific.

How about:
<SCRIPT>

function strip(el)
{
var retVal="";
for(var z=0; z < el.childNodes.length; z++)
if(el.childNodes[z].nodeName=="#text")
retVal+=el.childNodes[z].nodeValue;
else
retVal+=strip(el.childNodes[z]);
return retVal;
}

e=document.createElement("div");
e.innerHTML="<span>example <b>of</b> html text</span>";
t=strip(e);
alert(t)
</SCRIPT>

I thought the purpose was to eliminate innerHTML?

Doesn't work very well anyway!!!
I was up late and drunk ;-)

Jul 20 '05 #16
