This is trying to avoid "</script>" being seen as the end of the script.
Since browsers are more lenient than the HTML standard allows for,
they only end the script element at "</script>". However, the standard
requires that the script element is ended at the first occurrence of
"</".
The solution, as suggested in the HTML specification, is to break
up "<" and "/" like:
var abc = "<script>some stuff here<\/script>";
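For illustration only, here is a minimal sketch of applying that escaping automatically to a piece of script source before it is written into a page. The helper name escapeEndTags is hypothetical and does not come from the HTML specification or from Tidy. It assumes every "</" in the source sits inside a string literal, where "\/" means the same as "/".

```javascript
// Hypothetical helper (not part of any library): rewrite every "</" as "<\/"
// so an HTML parser never sees an end-tag opener inside the script content.
// Assumption: each "</" occurs inside a JavaScript string literal, where the
// escape "\/" is read back as a plain "/".
function escapeEndTags(source) {
  return source.replace(/<\//g, "<\\/");
}

var original = 'var abc = "<script>some stuff here</script>";';
var escaped = escapeEndTags(original);

// The escaped text no longer contains the sequence "</".
console.log(escaped);
```

The escaped and unescaped literals evaluate to the same string when the script runs, so only the HTML parser's view of the page changes.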
The page I am trying to parse is not my own page. It's from some other
site which I cannot modify. A sample page having the same structure is
this:
<html>
<head>
<title>Test for tidy</title>
</head>
<body>
<script type="text/javascript">
<!--
var d = "<script>this is within a nested script tag</"+"script>";
-->
</script>
<a href="1.html">This is link 1</a>
<a href="2.html">This is link 2</a>
<a href="3.html">This is link 3</a>
<a href="4.html">This is link 4</a>
<a href="5.html">This is link 5</a>
</body>
</html>
Without having access to Tidy or your page, I don't know what the
exact error might be. It might consider the nested <script> an opening tag,
but that would be an error in parsing HTML, and I think Tidy should
be able to do that correctly.
The problem is that the <script> tag inside the quotes is treated as an
opening tag and the </script> before the links is the closing tag. So
the outer <script> never gets closed. Tidy's cleaned up doc comes up
like this:
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 14 February 2006), see www.w3.org">
<title>Test for tidy</title>
</head>
<body>
<script type="text/javascript">
<!--
var d = "<script>this is within a nested script
tag</"+"script>";
-->
<\/script>
<a href="1.html">This is link 1<\/a>
<a href="2.html">This is link 2<\/a>
<a href="3.html">This is link 3<\/a>
<a href="4.html">This is link 4<\/a>
<a href="5.html">This is link 5<\/a>
<\/body>
<\/html>
</script>
</body>
</html>
It ignores everything after the script. As a result I lose the body
and the links and all.
How does this "ignoring" show itself? What do you mean by "lose"?
As you can see above, by 'lose' I mean that the links are now part of
the script element and not the body as they were earlier. I have code
further on that extracts the text from the body, but in this case, since
it's in the <script> tag, it gets 'lost'.
Is there any way I can correctly interpret a closing </script> tag
as the end of the script rather than trying to parse stuff inside
the script?
The above fix for ending the script element correctly might help.
Actually, I was thinking of modifying the Tidy code in some way to
completely ignore the script tags because I don't need them at all.
Thanks.
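As an alternative to changing Tidy itself, the script elements could be dropped from the markup before it is handed to the parser. The following is only a rough sketch under stated assumptions: stripScripts is a hypothetical helper, not a Tidy function. It ends each script element at the first literal "</script>", the way browsers do. It is a regular expression rather than a real HTML parser, so it assumes no ">" appears inside the attributes of the opening tag.

```javascript
// Hypothetical pre-processing step (not part of Tidy): remove every script
// element from the markup before parsing it.
function stripScripts(html) {
  // Match from an opening <script ...> tag to the first literal </script>.
  // The lazy [\s\S]*? stops at the earliest closing tag, so a "<script>"
  // that appears inside a string literal is swallowed with the rest.
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

// A shortened version of the sample page from this thread.
var page = [
  '<body>',
  '<script type="text/javascript">',
  '<!--',
  'var d = "<script>this is within a nested script tag</"+"script>";',
  '-->',
  '</script>',
  '<a href="1.html">This is link 1</a>',
  '</body>'
].join("\n");

var cleaned = stripScripts(page);
console.log(cleaned);
```

With the script removed up front, the links stay in the body and the nested "<script>" inside the string never reaches the parser.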