Removing content between specified tokens using JavaScript

rajarao

Hi,
I want to remove the content embedded between <script> and </script> tags
in text submitted via a text box.
My JavaScript should remove everything between the <script> and
</script> tags.
My current code is:

function RemoveHTMLScript(strText)
{
var regEx = /<script\w*<\/script>/g
return strText.replace(regEx, "");
}
Let us say
strText = "Hi <script> .... .... ..... </script> How are u";
The expected output is "Hi How are u".

A regular expression solution is preferred.
Thanks and regards,
Raja rao

Jul 23 '05 #1


Lasse Reichstein Nielsen

"rajarao" <ra******@yahoo.com> writes:

I want to remove the content embedded in <script> and </script> tags
submitted via text box.
My java script should remove the content embedded between <script> and
</script> tag.
my current code is

function RemoveHTMLScript(strText)
{
var regEx = /<script\w*<\/script>/g

This matches "<script" followed by zero or more "word
characters". Word characters don't include ">", so this is unlikely
to work.

return strText.replace(regEx, "");
}
let us say,
strText = "Hi <script> .... .... ..... </script> How are u";
the expected out put is "Hi How are u"
More likely "Hi  How are u" (with two spaces), if one needs to be pedantic,
as evidently I do :)
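
A quick way to see why the \w* pattern never fires on the sample text. This is only a sketch; the variable names and the contrived second input are mine, not from the original post:

```javascript
// Pattern from the original post: "<script", then only word characters, then "</script>".
var originalPattern = /<script\w*<\/script>/g;
var sample = "Hi <script> .... .... ..... </script> How are u";

// ">", spaces and "." are not word characters, so nothing is replaced.
var unchanged = sample.replace(originalPattern, "");

// Contrived (hypothetical) input where only letters sit between the two halves.
var contrived = "Hi <scriptabc</script> there".replace(originalPattern, "");

console.log(unchanged === sample); // true
console.log(contrived);            // "Hi  there"
```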
Regular expression solution is preferred

First thing to consider is what to do if the text is:

"abc<script>...</script>def<script>...</script>ghi"

You would probably want this to be simplified to "abcdefghi". However,
if you use a simple regular expression matching from <script> to
</script>, it will match from the first <script> to the last </script>,
returning only "abcghi".
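
The greedy behaviour described above can be reproduced with a short sketch (the pattern and variable names here are illustrative, not taken from the post):

```javascript
var text = "abc<script>...</script>def<script>...</script>ghi";

// ".*" is greedy: it runs to the end of the line and backtracks
// only as far as the LAST "</script>".
var greedy = /<script>.*<\/script>/ig;
var greedyResult = text.replace(greedy, "");

console.log(greedyResult); // "abcghi" ("def" is lost as well)
```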

To avoid this, you need non-greedy matching in the regular
expression, something only available in recent browsers. You don't say
whether this code should be executed on a web page or on a server,
but if it is on a server, you control the version of JavaScript, and
can rely on non-greedy matching if available.

Try this RegExp then:
/<\s*script.+?<\/\s*script\s*>/ig
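
A small sketch of how the non-greedy pattern behaves. One thing worth knowing: in JavaScript "." does not match line breaks, so a script that spans several lines is left alone. The [\s\S] variant below is my own adjustment for that case, not part of the suggestion above:

```javascript
// The pattern as posted, and a variant (an assumption of mine) that also crosses line breaks.
var posted = /<\s*script.+?<\/\s*script\s*>/ig;
var multiLine = /<\s*script[\s\S]+?<\/\s*script\s*>/ig;

var twoScripts = "abc<script>...</script>def<script>...</script>ghi";
var withNewlines = "a<script>\nalert(1);\n</script>b";

var r1 = twoScripts.replace(posted, "");      // each script removed separately
var r2 = withNewlines.replace(posted, "");    // "." stops at "\n", so nothing matches
var r3 = withNewlines.replace(multiLine, ""); // [\s\S] crosses the line breaks

console.log(r1); // "abcdefghi"
console.log(r2 === withNewlines); // true
console.log(r3); // "ab"
```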

If non-greedy regular expressions are not available, you can find the
instances manually using indexOf. It's not very effective, though,
since it doesn't ignore case and whitespace. It can be made to work,
but it's not nearly as much fun :)
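
For what it's worth, a minimal sketch of the indexOf approach. The function name is hypothetical, and case is handled by searching a lowercased copy, which assumes lowercasing does not change the length of the string (true for ASCII input):

```javascript
// Removes everything from "<script" up to the ">" that closes the next "</script".
function removeScriptsByIndexOf(text) {
  var lower = text.toLowerCase(); // used only for searching; assumed same length as text
  var out = "";
  var pos = 0;
  while (true) {
    var start = lower.indexOf("<script", pos);
    if (start === -1) break;
    var end = lower.indexOf("</script", start);
    if (end === -1) break;            // unterminated script: leave the rest untouched
    var close = lower.indexOf(">", end);
    if (close === -1) break;
    out += text.substring(pos, start); // keep the text before the script
    pos = close + 1;                   // continue after the end tag
  }
  return out + text.substring(pos);
}

var cleaned = removeScriptsByIndexOf(
  "abc<SCRIPT>...</Script >def<script type='x'>a<b</script>ghi");
console.log(cleaned); // "abcdefghi"
```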
/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'

Jul 23 '05 #2

Thomas 'PointedEars' Lahn

Lasse Reichstein Nielsen wrote:

"rajarao" <ra******@yahoo.com> writes:
Regular expression solution is preferred
First thing to consider is what to do if the text is:

"abc<script>...</script>def<script>...</script>ghi"

You would probably want this to be simplified to "abcdefghi". However,
if you use a simple regular expression matching from <script> to
</script>, it will match from the first <script> to the last </script>,
returning only "abcghi".

To avoid this, you need a non-greedy matching by the regular
expression, something only available in recent browsers. You don't say
whether this code should be executed on a web page or on a server,
but if it is on a server, you control the version of Javascript, and
can rely on non-greedy matching if available.

Try this RegExp then:
/<\s*script.+?<\/\s*script\s*>/ig

Is there really a UA out there that is so b0rken to parse "< script>" as
"<script>" and "</ script>" as "</script>"? The SGML declaration of HTML
clearly forbids that for all elements. "<" is STAGO (Start Tag Open) and
"</" is ETAGO (End Tag Open) where both must not be followed by white
space.
If non-greedy regular expressions are not available, you can find the
instances manually using indexOf. It's not very effective, though,
since it doesn't ignore case and whitespace. It can be made to work,
but it's not nearly as much fun :)

That is why one wants to use

/<script[^>]*>[^<>]*<\/script>/ig

then. Since this is not the first time I encountered the problem,
I am going to extend my stripTags() method[1] so that you can strip
only specific tags and also their content if you want.
PointedEars
___________
[1] <http://pointedears.de.vu/scripts/string.js>

Jul 23 '05 #3

Lasse Reichstein Nielsen

Thomas 'PointedEars' Lahn <Po*********@web.de> writes:

Is there really a UA out there that is so b0rken to parse "< script>" as
"<script>" and "</ script>" as "</script>"?
Probably :) But I don't know of any.

That is why one wants to use

/<script[^>]*>[^<>]*<\/script>/ig

That rules out:
---
<script type="text/javascript">
if (screen.innerWidth < 1000) { alert("your resolution sucks");}
</script>
---
since it contains a "<" inside the script.
You should match up to "</" for correctness, or up to "</script"
for compliance with browsers.
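
The difference can be checked with a short sketch. The second pattern is one possible way to "match up to </script" using non-greedy matching; it and the sample input are illustrative assumptions, not something posted in this thread:

```javascript
// Pattern that forbids "<" and ">" inside the script body.
var noAngleBrackets = /<script[^>]*>[^<>]*<\/script>/ig;
// Illustrative alternative: anything, as little as possible, up to "</script".
var upToCloseTag = /<script[^>]*>[\s\S]*?<\/script\s*>/ig;

var page = 'a<script type="text/javascript">if (w < 1000) { alert("small"); }</script>b';

var e1 = page.replace(noAngleBrackets, ""); // the "<" in the code blocks the match
var e2 = page.replace(upToCloseTag, "");

console.log(e1 === page); // true
console.log(e2);          // "ab"
```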

/L

Jul 23 '05 #4

Thomas 'PointedEars' Lahn

Lasse Reichstein Nielsen wrote:

Thomas 'PointedEars' Lahn <Po*********@web.de> writes:
That is why one wants to use

/<script[^>]*>[^<>]*<\/script>/ig
That rules out:
---
<script type="text/javascript">
if (screen.innerWidth < 1000) { alert("your resolution sucks");}
</script>
---
since it contains a "<" inside the script.

True.
You should match up to "</" for correctness, or up to "</script"
for compliance with browsers.

You mean

/<script[^>]*>.*(?!<\/script>).*<\/script>/ig

and the like?
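
A note on the lookahead form, as a sketch only. Written as above, the lookahead is tested at a single position between two greedy ".*", so the pattern still runs to the last end tag. One conventional way to test it at every position is to wrap it around each consumed character; that second pattern is my own illustration and needs lookahead support just the same:

```javascript
var text = "abc<script>1</script>def<script>2</script>ghi";

// As written above: the lookahead is checked once, so the match is effectively greedy.
var singleLookahead = /<script[^>]*>.*(?!<\/script>).*<\/script>/ig;
// Illustrative alternative: consume a character only if "</script" does not start there.
var perCharLookahead = /<script[^>]*>(?:(?!<\/script)[\s\S])*<\/script\s*>/ig;

var f1 = text.replace(singleLookahead, "");
var f2 = text.replace(perCharLookahead, "");

console.log(f1); // "abcghi"
console.log(f2); // "abcdefghi"
```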

The problem is that such matches would require negative lookahead
(/(?!...)/) which would require ECMAScript 3 support and I wanted to avoid
this since my solution was meant as a backwards-compatible alternative to
yours. But even if I used that and thus lost backwards compatibility,
I think it could still fail if someone uses "</" or "</script" or
"<\/script" within script code for some reason.

Your non-greedy RegExp requires ECMAScript 3 support as well, and yet fails
if someone uses "</script>" or even "<\/script>" within the script code. So
neither the OP nor anyone "can rely on non-greedy matching if available".

Alas, until someone proves the opposite, it remains an intrinsic property of
nested expressions and languages created by such expressions like markup
languages that successful parsing of them using Regular Expressions is just
impossible in general. There are cases where RegExp parsing of such context
can be successful, though; the more detailed/strict its structure/syntax is
defined and the less nested its subexpressions are, the higher is the
statistical probability of successful RegExp parsing of it. Remember we
already had this discussion here a few months before.
PointedEars

Jul 23 '05 #5

Lasse Reichstein Nielsen

Thomas 'PointedEars' Lahn <Po*********@web.de> writes:

Lasse Reichstein Nielsen wrote:
You should match up to "</" for correctness, or up to "</script"
for compliance with browsers.
You mean

/<script[^>]*>.*(?!<\/script>).*<\/script>/ig

and the like?

The problem is that such matches would require negative lookahead
(/(?!...)/)
If it is to be easy, it requires either negative lookahead or
non-greedy matching:
/<script.*?>.*?<\/script\s*>/ig

However, neither gives any power to regular expressions that they
didn't have already, so you can make a regular expression without either
that matches the same expression. It's just likely to be huge.

A non-greedy match until the string abcd (/.*?abcd/) can be written as

[^a]*a(((a|ba|bca)*([^ba]|b[^ca]|bc[^da])[^a]*a)*bcd)

[^a]*a                    until first a
(a|ba|bca)*               next a is before bcd: restart
([^ba]|b[^ca]|bc[^da])    not bcd and not a: either [^ba], or b[^ca], or bc[^da]
[^a]*a                    then find next a and restart
bcd                       or bcd => finished

A similar non-greedy match for ".*?</script" would be:

[^<]*<((<|\/<|\/s<|\/sc<|\/scr<|\/scri<|\/scrip<)*
([^\/<]|\/[^s<]|\/s[^c<]|\/sc[^r<]|\/scr[^i<]|\/scri[^p<]|\/scrip[^t<])
[^<]*<)*\/script
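
A quick check of the hand-written pattern on typical input, as a sketch. The variable names and sample strings are mine; the pattern source is copied from the three lines above and joined without whitespace:

```javascript
// Source of the pattern above, built as a string so it can be reused.
var untilCloseScriptSource =
  "[^<]*<((<|\\/<|\\/s<|\\/sc<|\\/scr<|\\/scri<|\\/scrip<)*" +
  "([^\\/<]|\\/[^s<]|\\/s[^c<]|\\/sc[^r<]|\\/scr[^i<]|\\/scri[^p<]|\\/scrip[^t<])" +
  "[^<]*<)*\\/script";

// On its own it matches up to and including the first "</script".
var firstMatch = new RegExp(untilCloseScriptSource)
  .exec("x < 1 </script> more </script")[0];

// Combined with a start tag and the closing ">" it strips whole script blocks,
// without using non-greedy quantifiers or lookahead.
var scriptBlock = new RegExp("<script[^>]*>" + untilCloseScriptSource + "\\s*>", "g");
var stripped = "a<script>if (x < 1) {}</script>b<script>y</script>c"
  .replace(scriptBlock, "");

console.log(firstMatch); // "x < 1 </script"
console.log(stripped);   // "abc"
```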

The structure is simple, so you can generate it automatically (provided
the string doesn't contain repeats of the first character!):

function reEscape(string) {
  return string.replace(/([[+*?.(){\\\/])/g,"\\$1"); // did I miss any?
}

function matchUntilRE(string) {
  if (string.length == 0) { return; }
  if (string.length == 1) { return "[^"+reEscape(string)+"]*" +
                                   reEscape(string); }
  var buf = []; // StringBuffer
  var firstChar = reEscape(string.charAt(0));
  buf.push("[^",firstChar,"]*",firstChar);
  buf.push("((");
  for(var i=0;i<string.length-1;i++) {
    if (i>0) { buf.push("|"); }
    buf.push(reEscape(string.substring(1,i+1)),firstChar);
  }
  buf.push(")*(");
  for(var i=0;i<string.length-1;i++) {
    if (i>0) { buf.push("|"); }
    buf.push(reEscape(string.substring(1,i+1)),
             "[^",reEscape(string.charAt(i+1)),firstChar,"]");
  }
  buf.push(")");
  buf.push("[^",firstChar,"]*",firstChar);
  buf.push(")*");
  buf.push(reEscape(string.substring(1)));
  return buf.join("");
}

(Yey, it gives me exactly the same as the one I created manually :)

I don't see how a non-greedy match until </script can fail.
Your non-greedy RegExp requires ECMAScript 3 support as well, and yet fails
if someone uses "</script>" or even "<\/script>" within the script code.
Fails how? The first is not permitted inside script code (it should
end the script right there), the latter is, and should not be matched
by a search for "</script".

The only problem I see here is the decision whether to search for
</ or </script. I'd go for the latter, for the same reason browsers
do it: it is sufficient, and allows erroneous scripts without breaking.
Alas, until someone proves the opposite, it remains an intrinsic property of
nested expressions and languages created by such expressions like markup
languages that successful parsing of them using Regular Expressions is just
impossible in general.
Yes, but we are not parsing the HTML here.
There are cases where RegExp parsing of such context
can be successful, though; the more detailed/strict its structure/syntax is
defined and the less nested its subexpressions are, the higher is the
statistical probability of successful RegExp parsing of it.

Exactly. And the script element does not contain markup so it cannot
be nested. It stops at the *first* following occurrence of "</script",
which is something RE's can test for successfully.

Likewise, you can use regexps to find all tags in a document, because
tags are not nested (elements are).
/L

Jul 23 '05 #6

Lasse Reichstein Nielsen

Lasse Reichstein Nielsen <lr*@hotpop.com> writes:

[lookahead and non-greedy matching]

However, neither gives any power to regular expressions that they
didn't have already, so you can make a regular expression without either
that matches the same expression. It's just likely to be huge.
I'm confusing two things here.

It is correct that non-greedy matching doesn't allow regular
expressions to match anything they couldn't match without it. The
expressions don't even need to be rewritten to match the same strings,
just use the greedy operators instead. What non-greedy matching does is,
when there is more than one way to match a string, the returned match
will be the shortest possible.
A non-greedy match until the string abcd (/.*?abcd/) can be written as [^a]*a(((a|ba|bca)*([^ba]|b[^ca]|bc[^da])[^a]*a)*bcd)

That is incorrect. This expression matches the string up to and including
the first occurrence of abcd. That is not the same as a non-greedy .*?,
which can match past the first occurrence if needed.

Matching up to the first occurrence is what we need in this case, but
it is not the same as non-greedy matching.
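
The distinction shows up as soon as something must follow the match. A small sketch, with anchors added by me to force the rest of the string to be consumed:

```javascript
// Non-greedy: prefers the shortest match, but will extend past the first "abcd" if it must.
var nonGreedy = /^.*?abcd$/;
// "Up to the first abcd" pattern from above, anchored the same way.
var firstOccurrence = /^[^a]*a(((a|ba|bca)*([^ba]|b[^ca]|bc[^da])[^a]*a)*bcd)$/;

var twice = "1abcd2abcd";
var once = "1abcd";

console.log(nonGreedy.test(twice));       // true:  .*? grows to "1abcd2"
console.log(firstOccurrence.test(twice)); // false: it cannot pass the first "abcd"
console.log(nonGreedy.test(once));        // true
console.log(firstOccurrence.test(once));  // true
```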

/L

Jul 23 '05 #7

Thomas 'PointedEars' Lahn

Lasse Reichstein Nielsen wrote:

Matching up to the first occurrence is what we need in this case,

No, it is not, as we are trying to parse a markup language, consisting of
nested subexpressions. The first occurrence of the close tag after the open
tag is not necessarily the correct one as I already pointed out.
PointedEars

Jul 23 '05 #8

Thomas 'PointedEars' Lahn

Lasse Reichstein Nielsen wrote:

Thomas 'PointedEars' Lahn <Po*********@web.de> writes:
Lasse Reichstein Nielsen wrote:
You should match up to "</" for correctness, or up to "</script"

[...]
I don't see how a non-greedy match until </script can fail.
Your non-greedy RegExp requires ECMAScript 3 support as well, and yet fails
if someone uses "</script>" or even "<\/script>" within the script code.

Fails how? The first is not permitted inside script code (it should
end the script right there), the latter is, and should not be matched
by a search for "</script".

Note that although specified in SGML that ETAGO ends an element rather than
its entire end tag, not all UAs follow the spec in this regard so one could
use the non-conforming syntax and get away with it, e.g. placing malicious
code within a bulletin board posting viewed with IE. Such needs to be covered.
[...]
Alas, until someone proves the opposite, it remains an intrinsic property of
nested expressions and languages created by such expressions like markup
languages that successful parsing of them using Regular Expressions is just
impossible in general.

Yes, but we are not parsing the HTML here.

IBTD.
PointedEars

Jul 23 '05 #9

Lasse Reichstein Nielsen

Thomas 'PointedEars' Lahn <Po*********@web.de> writes:

Lasse Reichstein Nielsen wrote:
Matching up to the first occurrence is what we need in this case,
No, it is not, as we are trying to parse a markup language, consisting of
nested subexpressions.

But we are not. We are trying "to remove the content embedded in
<script> and </script> tags". Script tags have CDATA as content type,
so they are not containing nested HTML tags.

It is true that regular expressions cannot match recursive tree structures
(HTML is really a special case of the "matched parentheses" problem, the
traditional non-regular language).
The first occurrence of the close tag after the open
tag is not necessarily the correct one as I already pointed out.

Yes it is. In HTML, the script tag ends at the first occurrence of
"</". Browsers don't follow the HTML specification and end script tags
at the first occurrence of the literal character sequence "</script".
There is no way to include that literal sequence inside a script tag.
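
To illustrate with a sketch (the pattern and inputs are my own examples): a search for the first "</script" skips an escaped "<\/script" inside a string literal, and cuts at a literal "</script>" in the same place a browser would end the script.

```javascript
var stripScripts = /<script[^>]*>[\s\S]*?<\/script\s*>/ig;

// The script text contains "<\/script" (with a backslash), which is not "</script".
var escaped = 'a<script>document.write("<script src=x><\\/script>");</script>b';
// Here the string literal contains a real "</script>", which ends the script early.
var literal = 'a<script>document.write("</script>");</script>b';

var g1 = escaped.replace(stripScripts, "");
var g2 = literal.replace(stripScripts, "");

console.log(g1); // "ab"
console.log(g2); // 'a");</script>b'
```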

/L

Jul 23 '05 #10
