Bytes IT Community
Split <SCRIPT> tags

Hi all,
I am using Tidy (C) for parsing html pages. I encountered a page that
has some script as follows:

<script>
....
var abc = "<script>some stuff here</" + "script>";
....
</script>

Tidy fails to parse this correctly. It ignores everything after the
script. As a result I lose the body and the links and all.
Is there any way I can correctly interpret a closing </script> tag as
the end of the script rather than trying to parse stuff inside the
script?

Thanks in advance!
B.

Jul 17 '06 #1


bi**********@yahoo.com writes:
> I am using Tidy (C) for parsing html pages. I encountered a page that
> has some script as follows:
>
> <script>
> ...
> var abc = "<script>some stuff here</" + "script>";
This is trying to avoid "</script>" being seen as the end of the script.
Since browsers are more lenient than the HTML standard allows for,
they only end the script element at "</script>". However, the standard
requires that the script element is ended at the first occurrence of
"</".

The solution, as suggested in the HTML specification, is to break
up "<" and "/" with a backslash, like:

var abc = "<script>some stuff here<\/script>";
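
To illustrate why the backslash is harmless, here is a minimal sketch, assuming it is run as a standalone script (for example with Node.js) rather than embedded in an HTML page. The variable names are only for illustration. In a JavaScript string literal, "\/" is just another way of writing "/", so all three spellings produce the same string at run time; they differ only in whether the source text contains "<" directly followed by "/".

```javascript
// Three spellings of the same string value. Only the first one has
// "</script>" written out literally in the source text.
const literal = "<script>some stuff here</script>";
const split = "<script>some stuff here</" + "script>";
const escaped = "<script>some stuff here<\/script>";

console.log(literal === split);   // true
console.log(literal === escaped); // true
```

Of the three, only the escaped form avoids having "</" anywhere in the script source, which is what the strict reading of the standard cares about.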
> Tidy fails to parse this correctly.

That might also be the case, although I doubt it.

Without having access to Tidy or your page, I don't know what the
exact error might be. It might consider the nested <script> an opening
tag, but that would be an error in parsing HTML, and I think Tidy should
be able to do that correctly.

More likely, code occurring after the incorrect script closing above
causes Tidy to begin some other, unknown element, e.g.:

if (x<y) ...

would look like the opening of a "y" element, and it is parsed as
HTML since it occurs after the closing of the script element by the
first "</".
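
The difference between the two readings can be sketched as follows. This is only an illustration under stated assumptions: the helper names strictScriptEnd and lenientScriptEnd and the one-line test page are made up for the example, and neither function reflects how Tidy or any browser is actually implemented.

```javascript
// Strict reading described above: script content ends at the first "</".
function strictScriptEnd(html, contentStart) {
  return html.indexOf("</", contentStart);
}

// Lenient, browser-like reading: content ends only at "</script" (any case).
function lenientScriptEnd(html, contentStart) {
  const re = /<\/script/gi;
  re.lastIndex = contentStart; // start searching inside the script content
  const m = re.exec(html);
  return m ? m.index : -1;
}

// Made-up one-line page with the same shape as the problem script.
const page =
  '<script>var abc = "<script>x</" + "script>"; if (x<y) go();</script><a href="1.html">1</a>';
const start = page.indexOf(">") + 1; // first character after the opening tag

const strictEnd = strictScriptEnd(page, start);
const lenientEnd = lenientScriptEnd(page, start);

console.log(page.slice(start, strictEnd));  // script content under the strict rule
console.log(page.slice(start, lenientEnd)); // script content under the lenient rule
console.log(page.slice(strictEnd));         // what a strict parser goes on to read as HTML
```

Under the strict rule the script is cut off inside the string literal, so the rest of the JavaScript, including "(x<y)", is left over to be read as markup.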
> It ignores everything after the script. As a result I lose the body
> and the links and all.

How does this "ignoring" show itself? What do you mean by "lose"?

> Is there any way I can correctly interpret a closing </script> tag
> as the end of the script rather than trying to parse stuff inside
> the script?

The above fix for ending the script element correctly might help.

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
Jul 17 '06 #2

> This is trying to avoid "</script>" being seen as the end of the script.
> Since browsers are more lenient than the HTML standard allows for,
> they only end the script element at "</script>". However, the standard
> requires that the script element is ended at the first occurrence of
> "</".
> The solution, as suggested in the HTML specification, is to break
> up "<" and "/" like:
> var abc = "<script>some stuff here<\/script>";

The page I am trying to parse is not my own page. It's from some other
site which I cannot modify. A sample page having the same structure is
this:
<html>

<head>
<title>Test for tidy</title>
</head>

<body>
<script type="text/javascript">
<!--
var d = "<script>this is within a nested script tag</"+"script>";
-->
</script>
<a href="1.html">This is link 1</a>
<a href="2.html">This is link 2</a>
<a href="3.html">This is link 3</a>
<a href="4.html">This is link 4</a>
<a href="5.html">This is link 5</a>

</body>

</html>
> Without having access to Tidy or your page, I don't know what the
> exact error might be. It might consider the nested <script> an opening
> tag, but that would be an error in parsing HTML, and I think Tidy should
> be able to do that correctly.

The problem is that the <script> tag inside the quotes is treated as an
opening tag and the </script> before the links is treated as its closing
tag. So the outer <script> never gets closed. Tidy's cleaned-up doc comes
out like this:
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 14 February 2006), see www.w3.org">
<title>Test for tidy</title>
</head>
<body>
<script type="text/javascript">
<!--
var d = "<script>this is within a nested script
tag</"+"script>";
-->
<\/script>
<a href="1.html">This is link 1<\/a>
<a href="2.html">This is link 2<\/a>
<a href="3.html">This is link 3<\/a>
<a href="4.html">This is link 4<\/a>
<a href="5.html">This is link 5<\/a>

<\/body>

<\/html>
</script>
</body>
</html>

>> It ignores everything after the script. As a result I lose the body
>> and the links and all.
>
> How does this "ignoring" show itself? What do you mean by "lose"?

As you can see above, by 'lose' I mean that the links are now a part of
the script tag and not the body as they were earlier. I have code
further on that extracts the text from the body, but in this case, since
it's in the <script> tag, it gets 'lost'.

>> Is there any way I can correctly interpret a closing </script> tag
>> as the end of the script rather than trying to parse stuff inside
>> the script?
>
> The above fix for ending the script element correctly might help.

Actually, I was thinking of modifying the Tidy code in some way to
completely ignore the script tags because I don't need them at all.
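
An alternative to patching Tidy itself might be to strip the script elements from the page text before handing it to the parser. The following is a rough sketch of that idea in JavaScript, purely for illustration: stripScripts is a hypothetical helper, not a Tidy function, and the same scan would have to be written in C to sit in front of the Tidy library. It uses the lenient rule (a script ends only at "</script>"), and it assumes that no attribute value in the opening tag contains ">" and that every script element is actually closed.

```javascript
// Hypothetical pre-filter: remove every <script ...>...</script> element,
// treating only "</script>" (any case) as the end of the script.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

// Cut-down version of the sample page from this post.
const sample = [
  "<body>",
  '<script type="text/javascript">',
  "<!--",
  'var d = "<script>this is within a nested script tag</"+"script>";',
  "-->",
  "</script>",
  '<a href="1.html">This is link 1</a>',
  '<a href="2.html">This is link 2</a>',
  "</body>",
].join("\n");

console.log(stripScripts(sample));
```

Because the split "</" + "script>" inside the string never spells out "</script>", the match runs to the real closing tag and the links stay in the body.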

Thanks.

Jul 17 '06 #3
