#1
January 23rd, 2006, 09:15 AM
Lachlan Hunt
Handling Erroneous HTML Comments

Hi,
I'm interested in finding out how erroneous comment syntax within an
HTML document should be handled by a parser, according to SGML rules.
At present, every browser handles comments differently, and only
Mozilla and very recent builds of a few other browsers have anything
even closely resembling proper SGML comment handling.

In particular, take this comment for example:
<!-- Hello -- World -- ! -->

According to SGML rules, "World" is not within a comment, but,
theoretically, what should a parser do upon encountering that?

Mozilla, for example, handles the whole thing as a single comment. But,
when OpenSP encounters the 'W', it seems to implicitly close the comment
declaration, drop the 'W' completely and continue from 'orld ...' as
though it were not commented out.

I'd tend to take OpenSP as being more conforming than Mozilla in most
cases, but is this behaviour actually defined in SGML at all?

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
#2
January 23rd, 2006, 10:25 AM
Jukka K. Korpela
Re: Handling Erroneous HTML Comments

Lachlan Hunt wrote:

> I'm interested in finding out how erroneous comment syntax within an
> HTML document should be handled by a parser, according to SGML rules.

There's no good answer in the WWW context, since the World (Wide Web
browsers and lookalikes) do not play by SGML rules, and therefore HTML
documents on the WWW cannot be expected to do so either.

It really depends on the purpose of parsing. If the parser is part of a
validator, it should naturally play by SGML rules exactly, but it might
also _warn_ about some constructs that comply with SGML rules but are
not good practice since the Web is not safe for SGML. It might even warn
about any occurrence of ">" inside a comment.

If the parser is part of a converter, browser, or other tool that is
meant to process the bulk of actual WWW pages, it should be rather
permissive in syntax. Generally, it is better to display text visibly
even when it might be meant to be comment than to hide text as comment
when it might be meant to be real content.

> In particular, take this comment for example:
> <!-- Hello -- World -- ! -->
>
> According to SGML rules, "World" is not within a comment, but,
> theoretically, what should a parser do upon encountering that?

A parser should report an error, and the caller (the software that
invoked the parser) should then take an appropriate action. In a
browser, I would say that the safest approach is to treat the situation as
if it were
<!-- Hello --> World -- ! -->
since it is reasonably sure that " Hello " was meant to be a comment,
and we cannot know about the rest.
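
To make that recovery concrete, here is a minimal Python sketch. The
function name recover_first_comment is hypothetical, and the sketch only
models the rule described above (end the declaration after the first
complete comment, show the rest); it is not taken from any browser.

```python
from typing import Tuple


def recover_first_comment(text: str) -> Tuple[str, str]:
    """Split text starting with '<!--' into (comment_body, visible_rest).

    Assumption: the declaration is treated as if it ended right after
    the first complete '-- ... --' pair. If a '>' follows (after optional
    whitespace) it is consumed as the normal close; otherwise everything
    after the pair is kept as ordinary content.
    """
    if not text.startswith("<!--"):
        raise ValueError("not a comment declaration")
    end = text.find("--", 4)              # first closing '--'
    if end == -1:
        raise ValueError("unterminated comment")
    body = text[4:end]
    rest = text[end + 2:]
    stripped = rest.lstrip(" \t\r\n")
    if stripped.startswith(">"):          # well-formed: drop the '>'
        rest = stripped[1:]
    return body, rest


body, rest = recover_first_comment("<!-- Hello -- World -- ! -->")
print(repr(body))   # ' Hello '
print(repr(rest))   # ' World -- ! -->'
```

The text that stays visible is exactly what would follow the inserted ">"
in the rewritten example.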

> I'd tend to take OpenSP as being more conforming than Mozilla in most
> cases, but is this behaviour actually defined in SGML at all?

As far as I can see, clause 10.3 of ISO 8879 specifies that a comment
declaration consists of
- mdo ("<!" in the reference concrete syntax)
- optionally, a sequence of comments optionally followed by whitespace
- mdc (">" in the reference concrete syntax)
and a comment starts and ends with com ("--" in the reference concrete
syntax). I cannot find any room for "World" there, so it is a low-level
syntax error, and a reportable markup error. Thus, there are no
requirements on processing the document in general. A _validating_
parser is required to report the error, but clause 15.4.1 explicitly
says that there are no requirements on handling the error, beyond
reporting it. And there is no requirement on _how_ it should be reported.
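
The structure summarised above can be sketched as a small strict parser.
This is only an illustration in Python, assuming the reference concrete
syntax delimiters named above; CommentSyntaxError and
parse_comment_declaration are hypothetical names, and the error wording
is arbitrary, since nothing prescribes how the error is reported.

```python
class CommentSyntaxError(ValueError):
    """Raised when the text does not fit the comment declaration grammar."""


def parse_comment_declaration(text, pos=0):
    """Parse one comment declaration starting at text[pos].

    Grammar, as summarised in the post: mdo ('<!'), then zero or more
    comments ('--' ... '--') each optionally followed by whitespace,
    then mdc ('>'). Returns (comment_bodies, index_after_declaration).
    """
    if not text.startswith("<!", pos):
        raise CommentSyntaxError(f"expected '<!' at offset {pos}")
    i = pos + 2
    comments = []
    while i < len(text):
        if text[i] == ">":                       # mdc closes the declaration
            return comments, i + 1
        if not text.startswith("--", i):         # no room for anything else
            raise CommentSyntaxError(f"unexpected {text[i]!r} at offset {i}")
        close = text.find("--", i + 2)           # the comment's closing '--'
        if close == -1:
            raise CommentSyntaxError(f"unterminated comment at offset {i}")
        comments.append(text[i + 2:close])
        i = close + 2
        while i < len(text) and text[i] in " \t\r\n":
            i += 1                               # optional whitespace
    raise CommentSyntaxError("declaration not closed before end of input")


print(parse_comment_declaration("<!-- a -- -- b -->"))
try:
    parse_comment_declaration("<!-- Hello -- World -- ! -->")
except CommentSyntaxError as exc:
    print("error:", exc)
```

On the example from this thread the sketch stops at the 'W', which is the
first character that is neither the start of another comment nor the
closing ">".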
#3
January 23rd, 2006, 10:45 AM
Lachlan Hunt
Re: Handling Erroneous HTML Comments

Jukka K. Korpela wrote:
> Lachlan Hunt wrote:
>
>> I'm interested in finding out how erroneous comment syntax within an
>> HTML document should be handled by a parser, according to SGML rules.
> ...
>> In particular, take this comment for example:
>> <!-- Hello -- World -- ! -->
>>
>> According to SGML rules, "World" is not within a comment, but,
>> theoretically, what should a parser do upon encountering that?
>
> A parser should report an error, and the caller (the software that
> invoked the parser) should then take an appropriate action. In a
> browser, would say that the safest approach is to treat the situation as
> if it were
> <!-- Hello --> World -- ! -->

I agree, that seems like the most appropriate action to take and is
fairly close to what OpenSP seems to do, which is to treat it like this:
<!-- Hello -- >orld -- ! -->
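
For comparison, the behaviour OpenSP appears to show can be modelled
roughly as below. This Python sketch only reproduces the observation
above (the offending character is discarded and the declaration is
treated as closed at that point); it is not based on OpenSP's source, the
function name is hypothetical, and the handling of an unterminated
comment is an arbitrary assumption.

```python
def content_after_bad_declaration(text):
    """Return the text left as content after a comment declaration.

    Assumption: at the first character that is neither '--' nor '>',
    that character is dropped and the declaration is treated as closed.
    """
    if not text.startswith("<!"):
        raise ValueError("expected '<!'")
    i = 2
    while i < len(text):
        if text[i] == ">":
            return text[i + 1:]            # well-formed end of declaration
        if text.startswith("--", i):
            close = text.find("--", i + 2)
            if close == -1:
                return ""                  # assumption: rest is swallowed
            i = close + 2
            while i < len(text) and text[i] in " \t\r\n":
                i += 1                     # whitespace after a comment
            continue
        return text[i + 1:]                # drop the offending character
    return ""


print(content_after_bad_declaration("<!-- Hello -- World -- ! -->"))
# orld -- ! -->
```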

>> I'd tend to take OpenSP as being more conforming than Mozilla in most
>> cases, but is this behaviour actually defined in SGML at all?
>
> As far as I can see, clause 10.3 of ISO 8879 specifies that a comment
> declaration consists of
> - mdo ("<!" in the reference concrete syntax)
> - optionally, a sequence of comments optionally followed by whitespace
> - mdc (">" in the reference concrete syntax)
> and a comment starts and ends with com ("--" in the reference concrete
> syntax). I cannot find any room for "World" there, so it is a low-level
> syntax error, and a reportable markup error. Thus, there are no
> requirements on processing the document in general. A _validating_
> parser is required to report the error, but clause 15.4.1 explicitly
> says that there are no requirements on handling the error, beyond
> reporting it. And there is no requirement on _how_ it should be reported.

So, in other words, it basically says "This is an error, tell someone
who cares. But I'm not going to tell you what to do with it." :-)

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
#4
January 24th, 2006, 12:25 AM
David E. Ross
Re: Handling Erroneous HTML Comments

Lachlan Hunt wrote:
> Hi,
> I'm interested in finding out how erroneous comment syntax within an
> HTML document should be handled by a parser, according to SGML rules. At
> present, every browser handles comments in different ways, with only
> Mozilla and very recent builds of a few others browsers have anything at
> least closely resembling proper SGML comment handling.
>
> In particular, take this comment for example:
> <!-- Hello -- World -- ! -->
>
> According to SGML rules, "World" is not within a comment, but,
> theoretically, what should a parser do upon encountering that?
>
> Mozilla, for example, handles the whole thing as a single comment. But,
> when OpenSP encounters the 'W', it seems to implicitly close the comment
> declaration, drop the 'W' completely and continue from 'orld ...' as
> though it were not commented out.
>
> I'd tend to take OpenSP as being more conforming than Mozilla in most
> cases, but is this behaviour actually defined in SGML at all?
>

Please see Section 3.2.4 of the HTML 4.01 specification. This describes
the difference between <! ("markup declaration open") and the
immediately following -- ("comment open"). It also indicates that a
second -- can only be "comment close" and that white space can occur
between "comment close" and "markup declaration close" (>) but not
between "markup declaration open" and "comment open". Thus, any --
within a comment that does not mean "comment close" must be an error.
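
These whitespace rules are compact enough to check with a regular
expression. The Python sketch below is only an approximation under stated
assumptions: the names COMMENT_DECLARATION and
is_valid_comment_declaration are made up for illustration, and \s is used
as a stand-in for SGML whitespace.

```python
import re

# '<!' immediately followed by zero or more comments, then '>'.
# Each comment is '--', a body that never contains '--', then '--',
# optionally followed by whitespace.
COMMENT_DECLARATION = re.compile(r"<!(?:--(?:(?!--).)*--\s*)*>", re.DOTALL)


def is_valid_comment_declaration(text):
    """True if the whole of text is one well-formed comment declaration."""
    return COMMENT_DECLARATION.fullmatch(text) is not None


for sample in ("<!-- a comment -->",
               "<!-- a comment -- >",            # space before '>' is fine
               "<! -- a comment -->",            # space after '<!' is not
               "<!-- Hello -- World -- ! -->"):  # text between comments
    print(sample, is_valid_comment_declaration(sample))
```

Only the first two samples are accepted.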

--

David E. Ross
<http://www.rossde.com/>

Concerned about someone (e.g., Pres. Bush) snooping
into your E-mail? Use PGP.
See my <http://www.rossde.com/PGP/>
#5
January 24th, 2006, 12:45 AM
Alan J. Flavell
Re: Handling Erroneous HTML Comments


On Mon, 23 Jan 2006, David E. Ross wrote:

> Lachlan Hunt wrote:

[ample quotage now snipped for brevity]

> > I'd tend to take OpenSP as being more conforming than Mozilla in
> > most cases, but is this behaviour actually defined in SGML at all?

> Please see Section 3.2.4 of the HTML 4.01 specification. This
> describes the difference between <! ("markup declaration open") and
> the immediately following -- ("comment open"). It also indicates
> that a second -- can only be "comment close" and that white space
> can occur between "comment close" and "markup declaration close" (>)
> but not between "markup declaration open" and "comment open".
> Thus, any -- within a comment that does not mean "comment close"
> must be an error.

I don't think there's any disagreement that it must be an error. But
AFAIK the HTML specification doesn't make any real proposal of how a
browser might recover from, or fix-up, that kind of error. AIUI,
Lachlan was discussing whether the offending, i.e. surplus, text ought to
be rendered in the display or not.

As for SGML, it has some much more interesting uses for the "<!"
syntax, amongst which, comments are often somewhat incidental (take a
look at an HTML DTD, for example). But I'm not aware of SGML
requiring any particular kind of error recovery here.

regards
#6
January 24th, 2006, 12:55 AM
Mark Parnell
Re: Handling Erroneous HTML Comments

Deciding to do something for the good of humanity, "David E. Ross"
<nobody@nowhere.not> declared in comp.infosystems.www.authoring.html:

> It also indicates that a
> second -- can only be "comment close" and that white space can occur
> between "comment close" and "markup declaration close" (>) but not
> between "markup declaration open" and "comment open". Thus, any --
> within a comment that does not mean "comment close" must be an error.

Kind of. "--" within a comment *is* a "comment close", as you said
yourself ("can only be...").

So the error is not the "--" itself, but rather anything other than
whitespace between it and the "markup declaration close".

--
Mark Parnell
=====================================================
Att. Google Groups users - this is your last warning:
http://www.safalra.com/special/googlegroupsreply/
#7
January 24th, 2006, 07:35 AM
Lachlan Hunt
Re: Handling Erroneous HTML Comments

David E. Ross wrote:
> Lachlan Hunt wrote:
>> Hi,
>> I'm interested in finding out how erroneous comment syntax within an
>> HTML document should be handled by a parser, according to SGML rules.
>> ...
>> In particular, take this comment for example:
>> <!-- Hello -- World -- ! -->
>>
>> According to SGML rules, "World" is not within a comment, but,
>> theoretically, what should a parser do upon encountering that?
>> ...
>
> Please see Section 3.2.4 of the HTML 4.01 specification. This describes
> the difference between <! ("markup declaration open") and the
> immediately following -- ("comment open"). It also indicates that a
> second -- can only be "comment close" and that white space can occur
> between "comment close" and "markup declaration close" (>) but not
> between "markup declaration open" and "comment open". Thus, any --
> within a comment that does not mean "comment close" must be an error.

Yes, I'm well aware of what the conforming syntax is, but as I'm sure
everyone is well aware, authors frequently write invalid markup
(including invalid comments) and browsers not only need to be able to
handle it, but ideally handle it in an interoperable way (preferably
according to the specification).

The reason I ask is because there has been recent discussion [1] between
various browser vendors (including at least Mozilla, Opera, Safari,
Konqueror and Prince) about whether or not true SGML-style comment
parsing should be widely implemented (even though Mozilla has done so for
nearly 7 years and the rest did it to pass Acid 2) and I wanted to find out
what SGML actually said about handling such errors or whether it was
really left undefined.

Unfortunately, for backwards compatibility reasons, the decision to
remove support for true SGML-style comments was made, in favour of
redefining a much simpler and more backwards compatible syntax as part
of the WHATWG's "HTML 5" work. It now seems that a conforming HTML 5
comment will be defined to match the XML syntax, but with different
parsing and error handling requirements. The good thing, however, is
that browsers may soon have better interoperability on the matter than
they do at present.
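
To illustrate the split between conforming syntax and error handling,
here is a rough Python sketch. The first function follows the XML 1.0
comment rule (no "--" inside the body, and the body may not end in "-").
The second is only an assumed lenient scan that ends a comment at the
first "-->", not the algorithm from any HTML 5 draft. Both function names
are hypothetical.

```python
def is_xml_style_comment(text):
    """True if text is exactly one comment under the XML 1.0 rule."""
    if len(text) < 7 or not (text.startswith("<!--") and text.endswith("-->")):
        return False
    body = text[4:-3]
    return "--" not in body and not body.endswith("-")


def lenient_comment_end(text, start=0):
    """Index just past the first '-->' after the opening '<!--'.

    Assumption: if no '-->' is found, the comment runs to end of input.
    """
    end = text.find("-->", start + 4)
    return len(text) if end == -1 else end + 3


sample = "<!-- Hello -- World -- ! -->"
print(is_xml_style_comment(sample))                 # False: not conforming
print(lenient_comment_end(sample) == len(sample))   # True: one comment
```

Under these assumptions the example is non-conforming, yet the lenient
scan still treats all of it as a single comment, which is the result
reported for Mozilla earlier in the thread.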

[1] http://ln.hixie.ch/?start=1137799947&count=1

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
#8
January 24th, 2006, 12:35 PM
Alan J. Flavell
Re: Handling Erroneous HTML Comments

On Tue, 24 Jan 2006, Lachlan Hunt wrote:

> [1] http://ln.hixie.ch/?start=1137799947&count=1

Nearby I read this .sig-worthy comment:

"to a rough approximation, all the content on the Web is errorneous,
invalid, or non-conformant." - Hixie.

http://ln.hixie.ch/?start=1137740632&count=1

*What* an enormous amount of effort has been wasted over the years in
trying to make sense of this tag-soup rubbish.

*If only* browsers had agreed from the outset to refuse to display
anything with formal errors in it.

I can't really make out yet whether "HTML5" is supposed to be a better
HTML, or just an "improved" recipe for tag-soup. If it's going to be
the latter, then it seems as if one of the original promises of XML (to
make a clean break with tag soup and guarantee formal correctness for
ever more) has failed. Ho hum.
#9
January 24th, 2006, 01:15 PM
Lachlan Hunt
Re: Handling Erroneous HTML Comments

Alan J. Flavell wrote:
> I can't really make out yet whether "HTML5" is supposed to be a better
> HTML, or just an "improved" recipe for tag-soup. If it's going to be
> the latter, then it seems as one of the original promises of XML (to
> make a clean break with tag soup and guarantee formal correctness for
> ever more) has failed. Ho hum.

Well, I think it's kind of both. It's supposedly better in the sense
that it has increased semantics, including new elements/attributes and
clarifying the semantic definitions of existing elements. It's an
"improved recipe for tag-soup" in the sense that the error handling is
being more well defined in the interests of increasing interoperability,
so that when an author does inevitably write rubbish, at least the
browsers will (theoretically) produce the same result.

The bad thing though, is that it has essentially dropped the association
with SGML and there will be no official DTD published for it, but it
does seem that, for the most part, a conforming document will be
compatible with SGML tools (like a validator), even though the defined
parsing/error handling will differ in many significant ways.

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
 
