Handling Erroneous HTML Comments

Hi,
I'm interested in finding out how erroneous comment syntax within an
HTML document should be handled by a parser, according to SGML rules.
At present, every browser handles comments differently, and only
Mozilla and very recent builds of a few other browsers have anything
even closely resembling proper SGML comment handling.

In particular, take this comment for example:
<!-- Hello -- World -- ! -->

According to SGML rules, "World" is not within a comment, but,
theoretically, what should a parser do upon encountering that?

Mozilla, for example, handles the whole thing as a single comment. But,
when OpenSP encounters the 'W', it seems to implicitly close the comment
declaration, drop the 'W' completely and continue from 'orld ...' as
though it were not commented out.

I'd tend to take OpenSP as being more conforming than Mozilla in most
cases, but is this behaviour actually defined in SGML at all?

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Jan 23 '06 #1
8 Replies


Lachlan Hunt wrote:
I'm interested in finding out how erroneous comment syntax within an
HTML document should be handled by a parser, according to SGML rules.

There's no good answer in the WWW context, since the World (Wide Web
browsers and lookalikes) do not play by SGML rules, and therefore HTML
documents on the WWW cannot be expected to do so either.

It really depends on the purpose of parsing. If the parser is part of a
validator, it should naturally play by SGML rules exactly, but it might
also _warn_ about some constructs that comply with SGML rules but are
not good practice since the Web is not safe for SGML. It might even warn
about any occurrence of ">" inside a comment.

If the parser is part of a converter, browser, or other tool that is
meant to process the bulk of actual WWW pages, it should be rather
permissive in syntax. Generally, it is better to display text visibly
even when it might be meant to be comment than to hide text as comment
when it might be meant to be real content.

In particular, take this comment for example:
<!-- Hello -- World -- ! -->

According to SGML rules, "World" is not within a comment, but,
theoretically, what should a parser do upon encountering that?

A parser should report an error, and the caller (the software that
invoked the parser) should then take an appropriate action. In a
browser, I would say that the safest approach is to treat the situation as
if it were
<!-- Hello --> World -- ! -->
since it is reasonably sure that " Hello " was meant to be a comment,
and we cannot know about the rest.
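
The recovery suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `recover_first_comment` is made up for this example, the delimiters are those of the reference concrete syntax, and no browser is claimed to work this way.

```python
from typing import Tuple

MDO = "<!"  # markup declaration open (reference concrete syntax)
COM = "--"  # comment delimiter (reference concrete syntax)


def recover_first_comment(markup: str) -> Tuple[str, str]:
    """Hypothetical helper: keep the first comment, leave the rest visible."""
    if not markup.startswith(MDO + COM):
        raise ValueError("not a comment declaration")
    body_start = len(MDO + COM)
    close = markup.find(COM, body_start)
    if close == -1:
        raise ValueError("comment is never closed")
    comment = markup[body_start:close]
    # Behave as if an mdc (">") had followed the first comment close,
    # so everything after that close is treated as ordinary content.
    visible = markup[close + len(COM):]
    return comment, visible


comment, visible = recover_first_comment("<!-- Hello -- World -- ! -->")
print(repr(comment))  # ' Hello '
print(repr(visible))  # ' World -- ! -->'
```

The trade-off is the one described above: text that may have been meant as a comment is shown rather than hidden.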
I'd tend to take OpenSP as being more conforming than Mozilla in most
cases, but is this behaviour actually defined in SGML at all?


As far as I can see, clause 10.3 of ISO 8879 specifies that a comment
declaration consists of
- mdo ("<!" in the reference concrete syntax)
- optionally, a sequence of comments optionally followed by whitespace
- mdc (">" in the reference concrete syntax)
and a comment starts and ends with com ("--" in the reference concrete
syntax). I cannot find any room for "World" there, so it is a low-level
syntax error, and a reportable markup error. Thus, there are no
requirements on processing the document in general. A _validating_
parser is required to report the error, but clause 15.4.1 explicitly
says that there are no requirements on handling the error, beyond
reporting it. And there is no requirement on _how_ it should be reported.
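
As a rough illustration of the grammar just described (mdo, then zero or more comments each optionally followed by whitespace, then mdc), here is a small Python checker. It is a sketch, not a conforming SGML parser: the names `parse_comment_declaration` and `CommentSyntaxError` are invented for this example, and reporting the error as an exception with an offset is an assumption, since the standard does not say how errors should be reported.

```python
from typing import List

MDO, COM, MDC = "<!", "--", ">"  # reference concrete syntax delimiters


class CommentSyntaxError(ValueError):
    """Carries the offset of the first character that breaks the grammar."""

    def __init__(self, message: str, offset: int):
        super().__init__(f"{message} at offset {offset}")
        self.offset = offset


def parse_comment_declaration(text: str) -> List[str]:
    """Return the comments of one declaration, or raise CommentSyntaxError."""
    if not text.startswith(MDO):
        raise CommentSyntaxError("expected mdo", 0)
    pos = len(MDO)
    comments = []
    while True:
        if text.startswith(MDC, pos):
            return comments  # declaration closed properly
        if not text.startswith(COM, pos):
            # Anything other than a comment or mdc has no place here.
            raise CommentSyntaxError("expected com or mdc", pos)
        end = text.find(COM, pos + len(COM))
        if end == -1:
            raise CommentSyntaxError("unterminated comment", pos)
        comments.append(text[pos + len(COM):end])
        pos = end + len(COM)
        # Whitespace is allowed after a comment, not before the first one.
        while pos < len(text) and text[pos].isspace():
            pos += 1


print(parse_comment_declaration("<!-- Hello -- -- World -->"))
# [' Hello ', ' World ']

try:
    parse_comment_declaration("<!-- Hello -- World -- ! -->")
except CommentSyntaxError as err:
    print(err)  # expected com or mdc at offset 14
```

Offset 14 is the "W" of "World", the first character with no place in the grammar.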
Jan 23 '06 #2

Jukka K. Korpela wrote:
Lachlan Hunt wrote:
I'm interested in finding out how erroneous comment syntax within an
HTML document should be handled by a parser, according to SGML rules.

...
In particular, take this comment for example:
<!-- Hello -- World -- ! -->

According to SGML rules, "World" is not within a comment, but,
theoretically, what should a parser do upon encountering that?


A parser should report an error, and the caller (the software that
invoked the parser) should then take an appropriate action. In a
browser, I would say that the safest approach is to treat the situation as
if it were
<!-- Hello --> World -- ! -->


I agree, that seems like the most appropriate action to take and is
fairly close to what OpenSP seems to do, which is to treat it like this:
<!-- Hello -- >orld -- ! -->

I'd tend to take OpenSP as being more conforming than Mozilla in most
cases, but is this behaviour actually defined in SGML at all?


As far as I can see, clause 10.3 of ISO 8879 specifies that a comment
declaration consists of
- mdo ("<!" in the reference concrete syntax)
- optionally, a sequence of comments optionally followed by whitespace
- mdc (">" in the reference concrete syntax)
and a comment starts and ends with com ("--" in the reference concrete
syntax). I cannot find any room for "World" there, so it is a low-level
syntax error, and a reportable markup error. Thus, there are no
requirements on processing the document in general. A _validating_
parser is required to report the error, but clause 15.4.1 explicitly
says that there are no requirements on handling the error, beyond
reporting it. And there is no requirement on _how_ it should be reported.


So, in other words, it basically says "This is an error, tell someone
who cares. But I'm not going to tell you what to do with it." :-)

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Jan 23 '06 #3

Lachlan Hunt wrote:
Hi,
I'm interested in finding out how erroneous comment syntax within an
HTML document should be handled by a parser, according to SGML rules. At
present, every browser handles comments differently, and only
Mozilla and very recent builds of a few other browsers have anything
even closely resembling proper SGML comment handling.

In particular, take this comment for example:
<!-- Hello -- World -- ! -->

According to SGML rules, "World" is not within a comment, but,
theoretically, what should a parser do upon encountering that?

Mozilla, for example, handles the whole thing as a single comment. But,
when OpenSP encounters the 'W', it seems to implicitly close the comment
declaration, drop the 'W' completely and continue from 'orld ...' as
though it were not commented out.

I'd tend to take OpenSP as being more conforming than Mozilla in most
cases, but is this behaviour actually defined in SGML at all?


Please see Section 3.2.4 of the HTML 4.01 specification. This describes
the difference between <! ("markup declaration open") and the
immediately following -- ("comment open"). It also indicates that a
second -- can only be "comment close" and that white space can occur
between "comment close" and "markup declaration close" (>) but not
between "markup declaration open" and "comment open". Thus, any --
within a comment that does not mean "comment close" must be an error.

--

David E. Ross
<http://www.rossde.com/>

Concerned about someone (e.g., Pres. Bush) snooping
into your E-mail? Use PGP.
See my <http://www.rossde.com/PGP/>
Jan 24 '06 #4


On Mon, 23 Jan 2006, David E. Ross wrote:
Lachlan Hunt wrote:
[ample quotage now snipped for brevity]
I'd tend to take OpenSP as being more conforming than Mozilla in
most cases, but is this behaviour actually defined in SGML at all?

Please see Section 3.2.4 of the HTML 4.01 specification. This
describes the difference between <! ("markup declaration open") and
the immediately following -- ("comment open"). It also indicates
that a second -- can only be "comment close" and that white space
can occur between "comment close" and "markup declaration close" (>)
but not between "markup declaration open" and "comment open".
Thus, any -- within a comment that does not mean "comment close"
must be an error.


I don't think there's any disagreement that it must be an error. But
AFAIK the HTML specification doesn't make any real proposal of how a
browser might recover from, or fix-up, that kind of error. AIUI,
Lachlan was discussing whether the offending (i.e. surplus) text ought to
be rendered in the display or not.

As for SGML, it has some much more interesting uses for the "<!"
syntax, amongst which, comments are often somewhat incidental (take a
look at an HTML DTD, for example). But I'm not aware of SGML
requiring any particular kind of error recovery here.

regards
Jan 24 '06 #5

Deciding to do something for the good of humanity, "David E. Ross"
<no****@nowhere.not> declared in comp.infosystems.www.authoring.html:
It also indicates that a
second -- can only be "comment close" and that white space can occur
between "comment close" and "markup declaration close" (>) but not
between "markup declaration open" and "comment open". Thus, any --
within a comment that does not mean "comment close" must be an error.


Kind of. "--" within a comment *is* a "comment close", as you said
yourself ("can only be...").

So the error is not the "--" itself, but rather anything other than
whitespace between it and the "markup declaration close".
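
To make the point concrete, the following Python sketch treats every "--" as a toggle between "inside a comment" and "outside a comment", and then flags whatever non-whitespace text ends up outside. The helper name `split_on_com` is hypothetical, and the sketch assumes it is handed one complete declaration whose final ">" is the markup declaration close.

```python
from typing import List, Tuple


def split_on_com(declaration: str) -> List[Tuple[str, str]]:
    """Label each piece between '<!' and the final '>' as comment or outside."""
    if not (declaration.startswith("<!") and declaration.endswith(">")):
        raise ValueError("expected '<!' ... '>'")
    inner = declaration[2:-1]
    labelled = []
    # Every "--" flips the state, so pieces alternate outside/comment/outside...
    for index, piece in enumerate(inner.split("--")):
        kind = "comment" if index % 2 == 1 else "outside"
        labelled.append((kind, piece))
    return labelled


pieces = split_on_com("<!-- Hello -- World -- ! -->")
for kind, piece in pieces:
    print(kind, repr(piece))

# The errors are the non-whitespace pieces that fall outside any comment.
errors = [piece for kind, piece in pieces if kind == "outside" and piece.strip()]
print(errors)  # [' World ']
```

On this reading " Hello " and " ! " are both comments, and only " World " is in error. An even number of pieces would mean the last comment was never closed, which this sketch does not check.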

--
Mark Parnell
=====================================================
Att. Google Groups users - this is your last warning:
http://www.safalra.com/special/googlegroupsreply/
Jan 24 '06 #6

David E. Ross wrote:
Lachlan Hunt wrote:
Hi,
I'm interested in finding out how erroneous comment syntax within an
HTML document should be handled by a parser, according to SGML rules.
...
In particular, take this comment for example:
<!-- Hello -- World -- ! -->

According to SGML rules, "World" is not within a comment, but,
theoretically, what should a parser do upon encountering that?
...


Please see Section 3.2.4 of the HTML 4.01 specification. This describes
the difference between <! ("markup declaration open") and the
immediately following -- ("comment open"). It also indicates that a
second -- can only be "comment close" and that white space can occur
between "comment close" and "markup declaration close" (>) but not
between "markup declaration open" and "comment open". Thus, any --
within a comment that does not mean "comment close" must be an error.


Yes, I'm well aware of what the conforming syntax is, but as I'm sure
everyone is well aware, authors frequently write invalid markup
(including invalid comments) and browsers not only need to be able to
handle it, but ideally handle it in an interoperable way (preferably
according to the specification).

The reason I ask is because there has been recent discussion [1] between
various browser vendors (including at least Mozilla, Opera, Safari,
Konqueror and Prince) about whether or not true SGML-style comment
parsing should be widely implemented (even though Mozilla has done so for
nearly 7 years and the rest did it to pass Acid 2) and I wanted to find out
what SGML actually said about handling such errors or whether it was
really left undefined.

Unfortunately, for backwards compatibility reasons, the decision to
remove support for true SGML-style comments was made, in favour of
redefining a much simpler and more backwards compatible syntax as part
of the WHATWG's "HTML 5" work. It now seems that a conforming HTML 5
comment will be defined to match the XML syntax, but with different
parsing and error handling requirements. The good thing, however, is
that browsers may soon have better interoperability on the matter than
they do at present.
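
For contrast with the SGML rules, here is a minimal Python sketch of the simpler approach: a comment runs from "<!--" to the first "-->", and XML's restriction on the body is applied only as a conformance check. The function name `read_simple_comment` is invented for this example, and the code does not attempt to reproduce the error handling of any specification.

```python
from typing import Optional, Tuple


def read_simple_comment(text: str) -> Optional[Tuple[str, bool]]:
    """Return (body, xml_conforming) for a leading '<!--' ... '-->' comment."""
    if not text.startswith("<!--"):
        return None
    end = text.find("-->", 4)
    if end == -1:
        return None  # no closing delimiter found; recovery is not modelled
    body = text[4:end]
    # XML forbids "--" inside a comment body and a body ending in "-".
    conforming = "--" not in body and not body.endswith("-")
    return body, conforming


print(read_simple_comment("<!-- Hello -- World -- ! -->"))
# (' Hello -- World -- ! ', False)
print(read_simple_comment("<!-- Hello World -->"))
# (' Hello World ', True)
```

Parsed this way, the example from the start of the thread is a single comment, as Mozilla was reported to treat it, but it is still flagged as non-conforming.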

[1] http://ln.hixie.ch/?start=1137799947&count=1

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Jan 24 '06 #7

On Tue, 24 Jan 2006, Lachlan Hunt wrote:
[1] http://ln.hixie.ch/?start=1137799947&count=1


Nearby I read this .sig-worthy comment:

"to a rough approximation, all the content on the Web is errorneous,
invalid, or non-conformant." - Hixie.

http://ln.hixie.ch/?start=1137740632&count=1

*What* an enormous amount of effort has been wasted over the years in
trying to make sense of this tag-soup rubbish.

*If only* browsers had agreed from the outset to refuse to display
anything with formal errors in it.

I can't really make out yet whether "HTML5" is supposed to be a better
HTML, or just an "improved" recipe for tag-soup. If it's going to be
the latter, then it seems as if one of the original promises of XML (to
make a clean break with tag soup and guarantee formal correctness for
ever more) has failed. Ho hum.
Jan 24 '06 #8

Alan J. Flavell wrote:
I can't really make out yet whether "HTML5" is supposed to be a better
HTML, or just an "improved" recipe for tag-soup. If it's going to be
the latter, then it seems as if one of the original promises of XML (to
make a clean break with tag soup and guarantee formal correctness for
ever more) has failed. Ho hum.


Well, I think it's kind of both. It's supposedly better in the sense
that it has increased semantics, including new elements/attributes and
clarifying the semantic definitions of existing elements. It's an
"improved recipe for tag-soup" in the sense that the error handling is
being better defined in the interests of increasing interoperability,
so that when an author does inevitably write rubbish, at least the
browsers will (theoretically) produce the same result.

The bad thing, though, is that it has essentially dropped the association
with SGML and there will be no official DTD published for it, but it
does seem that, for the most part, a conforming document will be
compatible with SGML tools (like a validator), even though the defined
parsing/error handling will differ in many significant ways.

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Jan 24 '06 #9
