
Split <SCRIPT> tags

 
  #1  
July 17th, 2006, 08:25 AM
bilaribilari@yahoo.com

Hi all,
I am using Tidy (C) for parsing html pages. I encountered a page that
has some script as follows:

<script>
....
var abc = "<script>some stuff here</" + "script>";
....
</script>

Tidy fails to parse this correctly. It ignores everything after the
script. As a result I lose the body and the links and all.
Is there any way I can correctly interpret a closing </script> tag as
the end of the script rather than trying to parse stuff inside the
script?

Thanks in advance!
B.


  #2  
July 17th, 2006, 10:05 AM
Lasse Reichstein Nielsen

bilaribilari@yahoo.com writes:
Quote:
I am using Tidy (C) for parsing html pages. I encountered a page that
has some script as follows:
<script>
...
var abc = "<script>some stuff here</" + "script>";
This is trying to avoid "</script>" being seen as the end of the script.
Since browsers are more lenient than the HTML standard allows for,
they only end the script element at "</script>". However, the standard
requires that the script element is ended at the first occurrence of
"</".
The solution, as suggested in the HTML specification, is to break
up "<" and "/" like:
var abc = "<script>some stuff here<\/script>";
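As a quick illustration of the escape above: inside a JavaScript string literal, `\/` is simply `/`, so the escaped form builds the same string as the concatenation while keeping the characters `</` out of the script source. This is only a sketch; the names `concatenated`, `escaped`, `escapedSource` and `concatSource` are made up for the example.

```javascript
// Both spellings build the same string at run time. The difference lies only
// in the source text that an HTML parser sees inside the script element.
const concatenated = "<script>some stuff here</" + "script>";
const escaped = "<script>some stuff here<\/script>";

console.log(concatenated === escaped); // true

// String.raw keeps the backslash, so these hold the source lines as written.
const escapedSource = String.raw`var abc = "<script>some stuff here<\/script>";`;
const concatSource = String.raw`var abc = "<script>some stuff here</" + "script>";`;

// The escaped source contains no "</" at all; the concatenated one still does.
console.log(escapedSource.includes("</")); // false
console.log(concatSource.includes("</"));  // true
```

Note that the concatenation trick only hides the full "</script>" sequence, which is enough for lenient browsers but not for a parser that stops at any "</".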
Quote:
Tidy fails to parse this correctly.
That might also be the case, although I doubt it.

Without having access to Tidy or your page, I don't know what the
exact error might be. It might consider the nested <script> an opening,
but that would be an error in parsing HTML, and I think Tidy should
be able to do that correctly.

More likely, code occurring after the incorrect script closing above
causes Tidy to begin some other, unknown element, e.g.:
if (x<y) ...
would look like the opening of a "y" element, and it is parsed as
HTML since it occurs after the closing of the script element by the
first "</".
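To make the difference between the two rules concrete, here is a minimal sketch that finds where script content would end under each. It follows the rule as worded above (the first "</"); as far as I recall, Appendix B.3.2 of HTML 4.01 narrows this to "</" followed by a letter, which this sketch does not model. The function names `strictEnd` and `lenientEnd` and the test page are hypothetical, and this is not how Tidy itself is implemented.

```javascript
// End of script content under the rule as worded in the post:
// the first "</" after the start tag, whatever follows it.
function strictEnd(html, contentStart) {
  return html.indexOf("</", contentStart);
}

// End of script content the way lenient browsers behave:
// only at a literal "</script" (case-insensitive).
function lenientEnd(html, contentStart) {
  const re = /<\/script/gi;
  re.lastIndex = contentStart;
  const m = re.exec(html);
  return m ? m.index : -1;
}

const page =
  '<script>var abc = "<script>some stuff here</" + "script>"; ' +
  'if (x<y) run();</script><a href="1.html">link</a>';
const contentStart = page.indexOf(">") + 1; // just past the outer <script>

const strict = strictEnd(page, contentStart);
const lenient = lenientEnd(page, contentStart);

// What a strict parser treats as script content:
console.log(page.slice(contentStart, strict));
// What it then hands to the HTML parser as markup, including "<y":
console.log(page.slice(strict, lenient));
```

Under the strict rule the script is cut short inside the string literal, and the remaining JavaScript, including `<y`, is read as markup.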
Quote:
It ignores everything after the script. As a result I lose the body
and the links and all.
How does this "ignoring" show itself? What do you mean by "lose"?
Quote:
Is there any way I can correctly interpret a closing </script> tag
as the end of the script rather than trying to parse stuff inside
the script?
The above fix for ending the script element correctly might help.

/L
--
Lasse Reichstein Nielsen - lrn@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
  #3  
July 17th, 2006, 10:25 AM
bilaribilari@yahoo.com

Quote:
This is trying to avoid "</script>" being seen as the end of the script.
Since browsers are more lenient than the HTML standard allows for,
they only end the script element at "</script>". However, the standard
requires that the script element is ended at the first occurrence of
"</".
The solution, as suggested in the HTML specification, is to break
up "<" and "/" like:
var abc = "<script>some stuff here<\/script>";
The page I am trying to parse is not my own page. It's from another
site, which I cannot modify. A sample page with the same structure is
this:
<html>

<head>
<title>Test for tidy</title>
</head>

<body>
<script type="text/javascript">
<!--
var d = "<script>this is within a nested script tag</"+"script>";
-->
</script>
<a href="1.html">This is link 1</a>
<a href="2.html">This is link 2</a>
<a href="3.html">This is link 3</a>
<a href="4.html">This is link 4</a>
<a href="5.html">This is link 5</a>

</body>

</html>
Quote:
Without having access to Tidy or your page, I don't know what the
exact error might be. It might consider the nested <script> an opening,
but that would be an error in parsing HTML, and I think Tidy should
be able to do that correctly.
The problem is that the <script> tag inside the quotes is treated as an
opening tag and the </script> before the links as its closing tag. So
the outer <script> never gets closed. Tidy's cleaned-up document comes out
like this:
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 14 February 2006), see www.w3.org">
<title>Test for tidy</title>
</head>
<body>
<script type="text/javascript">
<!--
var d = "<script>this is within a nested script
tag</"+"script>";
-->
<\/script>
<a href="1.html">This is link 1<\/a>
<a href="2.html">This is link 2<\/a>
<a href="3.html">This is link 3<\/a>
<a href="4.html">This is link 4<\/a>
<a href="5.html">This is link 5<\/a>

<\/body>

<\/html>
</script>
</body>
</html>

Quote:
Quote:
It ignores everything after the script. As a result I lose the body
and the links and all.
How does this "ignoring" show itself? What do you mean by "lose"?
As you can see above, by 'lose' I mean that the links are now part of
the script element and not the body, as they were earlier. I have code
further on that extracts the text from the body, but in this case, since
it's inside the <script> tag, it gets 'lost'.
Quote:
Quote:
Is there any way I can correctly interpret a closing </script> tag
as the end of the script rather than trying to parse stuff inside
the script?
The above fix for ending the script element correctly might help.
Actually, I was thinking of modifying the Tidy code in some way to
completely ignore the script tags, because I don't need them at all.

Thanks.
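One way to ignore scripts without touching Tidy's source would be to strip the script elements from the page before handing it to Tidy. The sketch below shows the idea only: `stripScripts` is a hypothetical helper, not a Tidy function or option, and it assumes the lenient browser rule, where script content runs up to the next literal "</script>". It also assumes that no attribute value in the opening tag contains ">".

```javascript
// Remove every <script ...> ... </script> element from an HTML string.
// Content is skipped up to the next literal "</script>", so "<script>"
// text inside a string literal is never treated as a nested opening tag.
function stripScripts(html) {
  const open = /<script\b[^>]*>/gi;
  let out = "";
  let pos = 0;
  let m;
  while ((m = open.exec(html)) !== null) {
    out += html.slice(pos, m.index);      // keep the markup before the script
    const close = /<\/script\s*>/gi;
    close.lastIndex = open.lastIndex;     // search after the opening tag
    const c = close.exec(html);
    // An unclosed script swallows the rest of the input.
    pos = c ? close.lastIndex : html.length;
    open.lastIndex = pos;                 // resume after the closing tag
  }
  return out + html.slice(pos);
}

const sample = [
  "<body>",
  '<script type="text/javascript">',
  "<!--",
  'var d = "<script>this is within a nested script tag</"+"script>";',
  "-->",
  "</script>",
  '<a href="1.html">This is link 1</a>',
  '<a href="2.html">This is link 2</a>',
  "</body>",
].join("\n");

const cleaned = stripScripts(sample);
console.log(cleaned);
```

On the sample page from this post, the links stay in the body and nothing script-related is left for Tidy to misread.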

 
