Removing content between specified tokens using JavaScript

"rajarao" <ra******@yahoo.com> writes:

I want to remove the content embedded in <script> and </script> tags
submitted via text box.
My java script should remove the content embedded between <script> and
</script> tag.
my current code is

function RemoveHTMLScript(strText)
{
var regEx = /<script\w*<\/script>/g

This matches "<script" followed by zero or more "word
characters". Word characters don't include ">", so this is unlikely
to work.

return strText.replace(regEx, "");
}

let us say,
strText = "Hi <script> .... .... ..... </script> How are u";
the expected output is "Hi How are u"

More likely "Hi  How are u" (with two spaces), if one needs to be
pedantic, as evidently I do :)
Regular expression solution is preferred

First thing to consider is what to do if the text is:

"abc<script>...</script>def<script>...</script>ghi"

You would probably want this to be simplified to "abcdefghi". However,
if you use a simple regular expression matching from <script> to
</script>, it will match from the first <script> to the last </script>,
returning only "abcghi".

To avoid this, you need a non-greedy matching by the regular
expression, something only available in recent browsers. You don't say
whether this code should be executed on a web page or on a server,
but if it is on a server, you control the version of Javascript, and
can rely on non-greedy matching if available.

Try this RegExp then:
/<\s*script.+?<\/\s*script\s*>/ig
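
To see the difference concretely, here is a small sketch that runs a greedy pattern and the non-greedy one above against the sample string from this post. The variable names are only for illustration. The last pattern is an addition of this sketch, not part of the suggestion above: "." does not match line breaks, so a script spanning several lines needs something like [\s\S] instead.

```javascript
var sample = "abc<script>...</script>def<script>...</script>ghi";

// Greedy: .+ runs on to the last </script>, swallowing "def" as well.
var greedy = /<script.+<\/script>/ig;

// Non-greedy: .+? stops at the first </script> that allows a match.
var nonGreedy = /<\s*script.+?<\/\s*script\s*>/ig;

var greedyResult = sample.replace(greedy, "");       // "abcghi"
var nonGreedyResult = sample.replace(nonGreedy, ""); // "abcdefghi"

// Variant for scripts that span several lines: [\s\S] also matches newlines.
var multiLine = /<\s*script[\s\S]+?<\/\s*script\s*>/ig;
var multiLineResult = "a<script>\nx();\n</script>b".replace(multiLine, ""); // "ab"
```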

If non-greedy regular expressions are not available, you can find the
instances manually using indexOf. It's not very effective, though,
since it doesn't ignore case and whitespace. It can be made to work,
but it's not nearly as much fun :)
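
For what it is worth, the indexOf route could look roughly like the sketch below. The function name stripScriptsWithIndexOf is made up for this example. Case is handled by searching a lower-cased copy of the text, which assumes that lower-casing does not change the length of the string (true for plain ASCII markup); whitespace inside the tag names is not handled at all, and an unterminated script is left untouched.

```javascript
// Hypothetical helper: removes <script ...> ... </script ...> using indexOf only.
function stripScriptsWithIndexOf(text) {
  var lower = text.toLowerCase(); // searched instead of text, to ignore case
  var result = "";
  var pos = 0;                    // start of the text not yet copied
  while (true) {
    var open = lower.indexOf("<script", pos);
    if (open === -1) { break; }
    var close = lower.indexOf("</script", open);
    if (close === -1) { break; }  // unterminated script: keep the rest as is
    var end = lower.indexOf(">", close);
    if (end === -1) { break; }
    result += text.substring(pos, open); // keep the text before the script
    pos = end + 1;                       // continue after the end tag
  }
  return result + text.substring(pos);
}

var cleaned = stripScriptsWithIndexOf(
  "abc<SCRIPT>x</Script>def<script>y</script >ghi"); // "abcdefghi"
```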
/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'

Jul 23 '05 #2

Lasse Reichstein Nielsen wrote:

"rajarao" <ra******@yahoo.com> writes:
Regular expression solution is preferred
First thing to consider is what to do if the text is:

"abc<script>...</script>def<script>...</script>ghi"

You would probably want this to be simplified to "abcdefghi". However,
if you use a simple regular expression matching from <script> to
</script>, it will match from the first <script> to the last </script>,
returning only "abcghi".

To avoid this, you need a non-greedy matching by the regular
expression, something only available in recent browsers. You don't say
whether this code should be executed on a web page or on a server,
but if it is on a server, you control the version of Javascript, and
can rely on non-greedy matching if available.

Try this RegExp then:
/<\s*script.+?<\/\s*script\s*>/ig

Is there really a UA out there that is so b0rken as to parse "< script>" as
"<script>" and "</ script>" as "</script>"? The SGML declaration of HTML
clearly forbids that for all elements. "<" is STAGO (Start Tag Open) and
"</" is ETAGO (End Tag Open) where both must not be followed by white
space.

If non-greedy regular expressions are not available, you can find the
instances manually using indexOf. It's not very effective, though,
since it doesn't ignore case and whitespace. It can be made to work,
but it's not nearly as much fun :)

That is why one wants to use

/<script[^>]*>[^<>]*<\/script>/ig

then. Since this is not the first time I encountered the problem,
I am going to extend my stripTags() method[1] so that you can strip
only specific tags and also their content if you want.
PointedEars
___________
[1] <http://pointedears.de.vu/scripts/string.js>

Jul 23 '05 #3

Thomas 'PointedEars' Lahn <Po*********@web.de> writes:

Is there really a UA out there that is so b0rken as to parse "< script>" as
"<script>" and "</ script>" as "</script>"?

Probably :) But I don't know of any.

That is why one wants to use

/<script[^>]*>[^<>]*<\/script>/ig

That rules out:
---
<script type="text/javascript">
if (screen.innerWidth < 1000) { alert("your resolution sucks");}
</script>
---
since it contains a "<" inside the script.
You should match up to "</" for correctness, or up to "</script"
for compliance with browsers.
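
A small sketch of the point, using the script from this post. The second pattern is only one possible way to match up to "</script"; it relies on non-greedy matching, and its name, like the other variable names here, is invented for the example.

```javascript
var page = 'a<script type="text/javascript">\n' +
           'if (screen.innerWidth < 1000) { alert("your resolution sucks");}\n' +
           '</script>b';

// Content may not contain "<" or ">": the "<" in the comparison blocks the match.
var noAngleBrackets = /<script[^>]*>[^<>]*<\/script>/ig;

// Content is anything up to the first "</script" (non-greedy).
var upToCloseTag = /<script[^>]*>[\s\S]*?<\/script[^>]*>/ig;

var unchanged = page.replace(noAngleBrackets, ""); // same as page
var stripped = page.replace(upToCloseTag, "");     // "ab"
```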

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com

Jul 23 '05 #4

Lasse Reichstein Nielsen wrote:

Thomas 'PointedEars' Lahn <Po*********@web.de> writes:
That is why one wants to use

/<script[^>]*>[^<>]*<\/script>/ig

That rules out:
---
<script type="text/javascript">
if (screen.innerWidth < 1000) { alert("your resolution sucks");}
</script>
---
since it contains a "<" inside the script.

True.

You should match up to "</" for correctness, or up to "</script"
for compliance with browsers.

You mean

/<script[^>]*>.*(?!<\/script>).*<\/script>/ig

and the like?

The problem is that such matches would require negative lookahead
(/(?!...)/), which would require ECMAScript 3 support, and I wanted to avoid
this since my solution was meant as a backwards-compatible alternative to
yours. But even if I used that and thus lost backwards compatibility,
I think it could still fail if someone uses "</" or "</script" or
"<\/script" within script code for some reason.

Your non-greedy RegExp requires ECMAScript 3 support as well, and yet fails
if someone uses "</script>" or even "<\/script>" within the script code. So
neither the OP nor anyone "can rely on non-greedy matching if available".
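
A side note on the lookahead expression quoted above, offered as a sketch rather than as anyone's intended solution: a lookahead only constrains the single position where it stands, so with a greedy ".*" on either side the expression still runs to the last "</script>". To stop at the first one, the lookahead has to be checked before every character that is consumed. The variable names below are invented for the example.

```javascript
var sample = "abc<script>...</script>def<script>...</script>ghi";

// Lookahead checked once, between two greedy .* parts: still matches
// from the first <script> to the last </script>.
var singleLookahead = /<script[^>]*>.*(?!<\/script>).*<\/script>/ig;

// Lookahead checked before each consumed character: the content can
// never run past the first "</script".
var perCharLookahead = /<script[^>]*>(?:(?!<\/script)[\s\S])*<\/script[^>]*>/ig;

var singleResult = sample.replace(singleLookahead, "");   // "abcghi"
var perCharResult = sample.replace(perCharLookahead, ""); // "abcdefghi"
```

Both forms need lookahead support, so the backwards-compatibility concern above applies to either.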

Alas, until someone proves the opposite, it remains an intrinsic property of
nested expressions and languages created by such expressions like markup
languages that successful parsing of them using Regular Expressions is just
impossible in general. There are cases where RegExp parsing of such context
can be successful, though; the more detailed/strict its structure/syntax is
defined and the less nested its subexpressions are, the higher is the
statistical probability of successful RegExp parsing of it. Remember we
already had this discussion here a few months ago.
PointedEars

Jul 23 '05 #5

Thomas 'PointedEars' Lahn <Po*********@web.de> writes:

Lasse Reichstein Nielsen wrote:
You should match up to "</" for correctness, or up to "</script"
for compliance with browsers.
You mean

/<script[^>]*>.*(?!<\/script>).*<\/script>/ig

and the like?

The problem is that such matches would require negative lookahead
(/(?!...)/)

If it is to be easy, it requires either negative lookahead or
non-greedy matching:

/<script.*?>.*?<\/script\s*>/ig

However, neither gives any power to regular expressions that they
didn't have already, so you can make a regular expression without either
that matches the same expression. It's just likely to be huge.

A non-greedy match until the string abcd (/.*?abcd/) can be written as

[^a]*a(((a|ba|bca)*([^ba]|b[^ca]|bc[^da])[^a]*a)*bcd)

[^a]*a                    until the first a
(a|ba|bca)*               the next a comes before bcd is complete: restart
([^ba]|b[^ca]|bc[^da])    neither bcd nor a: either [^ba], or b[^ca], or bc[^da]
[^a]*a                    then find the next a and restart
bcd                       or bcd => finished

A similar non-greedy match for ".*?</script" would be:

[^<]*<((<|\/<|\/s<|\/sc<|\/scr<|\/scri<|\/scrip<)*
([^\/<]|\/[^s<]|\/s[^c<]|\/sc[^r<]|\/scr[^i<]|\/scri[^p<]|\/scrip[^t<])
[^<]*<)*\/script

The structure is simple, so you can generate it automatically (provided
the string doesn't contain repeats of the first character!):

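// Escape the characters that are special in a regular expression.
function reEscape(string) {
  return string.replace(/([\[\]{}()*+?.\\\/^$|])/g, "\\$1");
}

// Build the source of a regular expression that matches up to and
// including the first occurrence of string. Assumes the first
// character of string does not occur again later in string.
function matchUntilRE(string) {
  if (string.length == 0) { return; }
  if (string.length == 1) {
    return "[^" + reEscape(string) + "]*" + reEscape(string);
  }
  var buf = []; // used as a string buffer
  var firstChar = reEscape(string.charAt(0));
  buf.push("[^", firstChar, "]*", firstChar);
  buf.push("((");
  // Partial matches interrupted by the first character: restart.
  for (var i = 0; i < string.length - 1; i++) {
    if (i > 0) { buf.push("|"); }
    buf.push(reEscape(string.substring(1, i + 1)), firstChar);
  }
  buf.push(")*(");
  // Partial matches interrupted by any other character.
  for (var j = 0; j < string.length - 1; j++) {
    if (j > 0) { buf.push("|"); }
    buf.push(reEscape(string.substring(1, j + 1)),
             "[^", reEscape(string.charAt(j + 1)), firstChar, "]");
  }
  buf.push(")");
  buf.push("[^", firstChar, "]*", firstChar);
  buf.push(")*");
  buf.push(reEscape(string.substring(1)));
  return buf.join("");
}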

(Yey, it gives me exactly the same as the one I created manually :)
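
The hand-written expression for "</script" above can be tried out as follows. This is only a sketch: the three lines are joined into one source string, and the wrapper around it ("<script" in front, "[^>]*>" behind, the "gi" flags, and the name removeScripts) is an assumption of this example, not something given in the post.

```javascript
// The three lines of the hand-written expression, joined.
var untilCloseTag =
  "[^<]*<((<|\\/<|\\/s<|\\/sc<|\\/scr<|\\/scri<|\\/scrip<)*" +
  "([^\\/<]|\\/[^s<]|\\/s[^c<]|\\/sc[^r<]|\\/scr[^i<]|\\/scri[^p<]|\\/scrip[^t<])" +
  "[^<]*<)*\\/script";

// Hypothetical wrapper: "<script", everything up to the first "</script",
// then the rest of the end tag.
var scriptElement = new RegExp("<script" + untilCloseTag + "[^>]*>", "gi");

function removeScripts(text) {
  return text.replace(scriptElement, "");
}

var twoScripts = removeScripts(
  "abc<script>...</script>def<script>...</script>ghi"); // "abcdefghi"
var withLessThan = removeScripts(
  "a<script>if (x < 1) { y(); }</script>b");            // "ab"
```

One edge case seems worth knowing about before relying on it: when a "<" (or a partial "</scr") sits immediately before the real "</script", as in "a<script>x<</script>b", the expression as written finds no match, because the restart group must always be followed by a mismatching character.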

I don't see how a non-greedy match until </script can fail.

Your non-greedy RegExp requires ECMAScript 3 support as well, and yet fails
if someone uses "</script>" or even "<\/script>" within the script code.

Fails how? The first is not permitted inside script code (it should
end the script right there), the latter is, and should not be matched
by a search for "</script".

The only problem I see here is the decision whether to search for
</ or </script. I'd go for the latter, for the same reason browsers
do it: it is sufficient, and allows erroneous scripts without breaking.

Alas, until someone proves the opposite, it remains an intrinsic property of
nested expressions and languages created by such expressions like markup
languages that successful parsing of them using Regular Expressions is just
impossible in general.

Yes, but we are not parsing the HTML here.

There are cases where RegExp parsing of such context
can be successful, though; the more detailed/strict its structure/syntax is
defined and the less nested its subexpressions are, the higher is the
statistical probability of successful RegExp parsing of it.

Exactly. And the script element does not contain markup so it cannot
be nested. It stops at the *first* following occurrence of "</script",
which is something RE's can test for successfully.

Likewise, you can use regexps to find all tags in a document, because
tags are not nested (elements are).
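
As a rough illustration of that last point, a single flat pattern is enough to list the tags in a piece of markup. The pattern below is a simplification chosen for this sketch: it would be confused by a ">" inside an attribute value, by comments, and by script content, so it is not a general tag scanner.

```javascript
// A start or end tag: "<", optional "/", a letter, then anything up to ">".
var tagPattern = /<\/?[a-z][^>]*>/gi;

var tags = "<p>Hi <b>there</b></p>".match(tagPattern);
// tags is ["<p>", "<b>", "</b>", "</p>"]
```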
/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com

Jul 23 '05 #6

Lasse Reichstein Nielsen <lr*@hotpop.com> writes:

[lookahead and non-greedy matching]

However, neither gives any power to regular expressions that they
didn't have already, so you can make a regular expression without either
that matches the same expression. It's just likely to be huge.

I'm confusing two things here.

It is correct that non-greedy matching doesn't allow regular
expressions to match anything they couldn't without. They don't even
need to be rewritten to match the same strings, just use the greedy
operators instead. What non-greedy matching does is, when there is
*more* than one way to match a string, the returned match will be the
shortest possible.

A non-greedy match until the string abcd (/.*?abcd/) can be written as
[^a]*a(((a|ba|bca)*([^ba]|b[^ca]|bc[^da])[^a]*a)*bcd)

That is incorrect. This expression matches the string up to and including
the first occurrence of abcd. That is not the same as a non-greedy .*?,
which can match past the first occurrence if needed.

Matching up to the first occurrence is what we need in this case, but
it is not the same as non-greedy matching.
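
The difference shows up as soon as something has to follow the match. In this sketch both expressions are anchored with ^ and $ so that they must cover the whole string; the anchors and the variable names are additions for the example.

```javascript
// Non-greedy: .*? may grow past the first "abcd" if that is needed
// to make the rest of the expression match.
var lazy = /^.*?abcd$/;

// "Up to the first occurrence": can never go past the first "abcd".
var firstOnly = /^[^a]*a(((a|ba|bca)*([^ba]|b[^ca]|bc[^da])[^a]*a)*bcd)$/;

var lazyMatches = lazy.test("abcdabcd");           // true
var firstOnlyMatches = firstOnly.test("abcdabcd"); // false

// Without the anchors, both return the same text for this input.
var lazyText = /.*?abcd/.exec("xxabcdyyabcd")[0];
var firstText =
  /[^a]*a(((a|ba|bca)*([^ba]|b[^ca]|bc[^da])[^a]*a)*bcd)/.exec("xxabcdyyabcd")[0];
// both are "xxabcd"
```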

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com

Jul 23 '05 #7

Lasse Reichstein Nielsen wrote:

Matching up to the first occurrence is what we need in this case,

No, it is not, as we are trying to parse a markup language, consisting of
nested subexpressions. The first occurrence of the close tag after the open
tag is not necessarily the correct one as I already pointed out.
PointedEars

Jul 23 '05 #8

Lasse Reichstein Nielsen wrote:

Thomas 'PointedEars' Lahn <Po*********@web.de> writes:
Lasse Reichstein Nielsen wrote:
You should match up to "</" for correctness, or up to "</script"

[...]
I don't see how a non-greedy match until </script can fail.
Your non-greedy RegExp requires ECMAScript 3 support as well, and yet fails
if someone uses "</script>" or even "<\/script>" within the script code.

Fails how? The first is not permitted inside script code (it should
end the script right there), the latter is, and should not be matched
by a search for "</script".

Note that although specified in SGML that ETAGO ends an element rather than
its entire end tag, not all UAs follow the spec in this regard so one could
use the non-conforming syntax and get away with it, e.g. placing malicious
code within a bulletin board posting viewed with IE. Such cases need to be
covered.

[...]
Alas, until someone proves the opposite, it remains an intrinsic property of
nested expressions and languages created by such expressions like markup
languages that successful parsing of them using Regular Expressions is just
impossible in general.

Yes, but we are not parsing the HTML here.

IBTD.
PointedEars

Jul 23 '05 #9

Thomas 'PointedEars' Lahn <Po*********@web.de> writes:

Lasse Reichstein Nielsen wrote:
Matching up to the first occurrence is what we need in this case,
No, it is not, as we are trying to parse a markup language, consisting of
nested subexpressions.

But we are not. We are trying "to remove the content embedded in
<script> and </script> tags". Script tags have CDATA as content type,
so they are not containing nested HTML tags.

It is true that regular expressions cannot match recursive tree structures
(HTML is really a special case of the "matched parenthesis" problem, the
traditional non-recursive language).
The first occurrence of the close tag after the open
tag is not necessarily the correct one as I already pointed out.

Yes it is. In HTML, the script tag ends at the first occurrence of
"</". Browsers don't follow the HTML specification and end script tags
at the first occurrence of the literal character sequence "</script".
There is no way to include that literal sequence inside a script tag.
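
To illustrate with a sketch: a script that wants to write an end tag has to break up the sequence, for example as "<\/script>", and a search for "</script" then correctly skips it. If the sequence is written literally, the match ends there, just as the browser's script element would. The pattern and the sample strings are assumptions made up for this example.

```javascript
var scriptElement = /<script[^>]*>[\s\S]*?<\/script[^>]*>/gi;

// Escaped: "</script" does not occur inside the script content.
var escaped =
  'A<script>document.write("<script src=x.js><\\/script>");</script>B';

// Literal: the first "</script" ends the element in the middle of the code.
var literal =
  'A<script>document.write("<script src=x.js></script>");</script>B';

var escapedResult = escaped.replace(scriptElement, ""); // "AB"
var literalResult = literal.replace(scriptElement, ""); // 'A");</script>B'
```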

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com

Jul 23 '05 #10

Lasse Reichstein Nielsen <lr*@hotpop.com> writes:

(HTML is really a special case of the "matched parenthesis" problem, the
traditional non-recursive language).

non-REGULAR, of course. It's definitely recursive :)

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com

Jul 23 '05 #11

Lasse Reichstein Nielsen wrote:

Thomas 'PointedEars' Lahn <Po*********@web.de> writes:
Lasse Reichstein Nielsen wrote:
Matching up to the first occurrence is what we need in this case,
No, it is not, as we are trying to parse a markup language, consisting of
nested subexpressions.

But we are not. We are trying "to remove the content embedded in
<script> and </script> tags". Script tags have CDATA as content type,

True if you mean the content model of the HTML "script" element.

so they are not containing nested HTML tags.

False. CDATA is content that is not parsed by an HTML UA and
thus it does not contribute to the parse tree. It can contain
(nested)

<script type="text/javascript">
document.write('<strong><em>tags</em></strong>'); // [1]
</script>

anyway.

[1] Yes, I know that this is invalid HTML but it works in
non-conforming UAs and this is for demo only, anyway.

The first occurrence of the close tag after the open
tag is not necessarily the correct one as I already pointed out.

Yes it is. In HTML, the script tag ends at the first occurrence of
"</".

True.

Browsers don't follow the HTML specification and end script tags
at the first occurrence of the literal character sequences "</script".

s/tags/elements/

ACK, my bad.

There is no way to include that literal sequence inside a script tag.

Well, you *can* include it in a "script" element's content but it does
not *work* as intended (a script error due to incomplete code is highly
likely). Yet garbage content remains if scriptwise parsing/replacement
follows that misguided paradigm. That is clearly a Bad Thing.

So (again) no RegExp presented in this thread (incl. mine) is suitable to
solve the problem (which this discussion is about after all). Instead one
should write a markup parser prototype or use a (DOM) object that provides
such a functionality.
PointedEars

Jul 23 '05 #12
