legacy comments


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi,

I am currently reviewing some HTML parsing software.

One of the source code comments reads:
# Scan to end of comment.
# Comments are defined any of a number of ways.
# IE 5.0: <!-- followed by >
# "HTML The Definitive Guide": <!-- text with at least one space in it -->
# Netscape: <!-- --> comments nest
# w3c: whitespace can appear between -- and > of comment close

Does anyone know of post 1998 HTML documents that use the IE or
Netscape "features"?
Thanks for any hints and comments.

Thomas

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.9.9 (GNU/Linux)

iD8DBQFBpXeS3w+/yD4P9tIRAjW9AKDPiBf/lQ5N6w6ac+ok9Q2a29SzagCeNPgE
1DG2XNq7bSYI/omcUrC6tkA=
=GSzX
-----END PGP SIGNATURE-----
Jul 23 '05 #1
In article <jl************@kuehne.cn>,
Thomas Kuehne <st*************@example.com> writes:
> -----BEGIN PGP SIGNED MESSAGE-----

Isn't that singularly pointless?

> I am currently reviewing some HTML parsing software.

Does it claim to follow HTML (SGML) rules, XHTML (XML) rules, or tag-soup
(whatever takes the author's fancy) rules?

> # Scan to end of comment.
> # Comments are defined any of a number of ways.
> # IE 5.0: <!-- followed by >

That bears no relation to any form of HTML.

> # "HTML The Definitive Guide": <!-- text with at least one space in it -->

Why the space? The start and end are right for XML.

> # Netscape: <!-- --> comments nest

Comments nest? Interesting thought. It could almost be a
misinterpretation for doing the right thing - though that seems unlikely.

> # w3c: whitespace can appear between -- and > of comment close

Indeed, under SGML rules it can, but there's more to it than that.
Seems like the author of that software hasn't grasped SGML comments.

> Does anyone know of post 1998 HTML documents that use the IE or
> Netscape "features"?


XML-style comments are valid both as HTML and XHTML as well as
broken-parser-safe, and seem to be the norm. The only serious
brokenness often seen in the wild is use of -- within what the
author intends to be a comment.
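
As a rough illustration of the "XML-style" rule, here is a minimal Python sketch. The function name is_xml_style_comment is made up for this example. It only encodes the XML constraint that the text between <!-- and --> contains no -- and does not end in a dash; it does not try to model what any particular browser accepts.

```python
def is_xml_style_comment(comment: str) -> bool:
    """Return True if `comment` is a single XML-style comment.

    Hypothetical helper for illustration: checks the delimiters and
    rejects "--" inside the comment text, the breakage described above.
    """
    # Need at least "<!--" plus "-->" (7 characters) with both delimiters.
    if len(comment) < 7:
        return False
    if not (comment.startswith("<!--") and comment.endswith("-->")):
        return False
    body = comment[4:-3]
    # "--" inside the body, or a body ending in "-" (which yields "--->"),
    # is not allowed in XML and confuses SGML-style parsers as well.
    return "--" not in body and not body.endswith("-")


print(is_xml_style_comment("<!-- fine -->"))        # True
print(is_xml_style_comment("<!-- a -- b -->"))      # False
print(is_xml_style_comment("<!-- trailing --->"))   # False
```
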

--
Nick Kew

Nick's manifesto: http://www.htmlhelp.com/~nick/
Jul 23 '05 #2

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Nick Kew schrieb am Thu, 25 Nov 2004 09:22:30 +0000:
>> I am currently reviewing some HTML parsing software.
>
> Does it claim to follow HTML (SGML) rules, XHTML (XML) rules, or tag-soup
> (whatever takes the author's fancy) rules?

It states: "supports HTML".
The software in question uses a very plain parser that only extracts
the plain text enclosed by CODE tags and then starts the real
processing.

From what I can see: the soup rules! (Not only tag-soup but also entity-soup.)

>> # Netscape: <!-- --> comments nest
>
> Comments nest? Interesting thought. It could almost be a
> misinterpretation for doing the right thing - though that seems unlikely.

I've never read that comments could be nested inside of comments.
Have I missed something while reading the HTML & XHTML docs?

>> Does anyone know of post 1998 HTML documents that use the IE or
>> Netscape "features"?
>
> XML-style comments are valid both as HTML and XHTML as well as
> broken-parser-safe, and seem to be the norm. The only serious
> brokenness often seen in the wild is use of -- within what the
> author intends to be a comment.


Glad to hear that, now I can remove/cleanup a lot of the parsing code.

Thomas
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.9.9 (GNU/Linux)

iD8DBQFBpcS93w+/yD4P9tIRAjtUAJ4/xzgZGBhUTJzS0l7IgnI/ZAi1rACglE5v
Vwz/mhRNJ/WqumkUo7gpEd0=
=rAbX
-----END PGP SIGNATURE-----
Jul 23 '05 #3
On Thu, 25 Nov 2004 07:11:31 +0100, Thomas Kuehne
<st*************@example.com> wrote:
[...]
> # Comments are defined any of a number of ways.

No; "comments" has a very specific definition.

> # IE 5.0: <!-- followed by >

Bullshit!

> # "HTML The Definitive Guide": <!-- text with at least one space in it -->

Ambiguous. It may come out right but it is not as per definition.

> # Netscape: <!-- --> comments nest

Bullshit!

> # w3c: whitespace can appear between -- and > of comment close

Yes, that's correct.

> Does anyone know of post 1998 HTML documents that use the IE or
> Netscape "features"?

No, we try our best to forget about those.

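An SGML comment is a special case of a "MARKUP DECLARATION"
that can be inserted in markup as follows. (XML keeps only the simplest
form: a single <!-- ... --> with no -- inside and no white space before
the closing >.)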

<! = MDO = Markup Declaration Open

MDO must be directly followed by a NAME-START character,
or by a COM, where COM is defined as '--' i.e. two ASCII dashes.

A conforming processor that has found a correct MDO+COM in its data
stream shall treat anything that follows as "disregardable data", i.e.
it's a comment, up to the next occurrence of a COM.

Between balanced COMs there can be an arbitrary number of any
characters.

After the last balancing COM there can be an arbitrary number of
declared white space characters until the final MDC "Markup Declaration
Close" is found in the data stream. MDC = '>'

And that would be where the "comment" ends.
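
The scan described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the parser under review: the names scan_comment_declaration and CommentSyntaxError are invented for the example, "declared white space" is assumed to be space, tab, CR and LF, and anything other than white space between comments is treated as an error.

```python
class CommentSyntaxError(ValueError):
    """Raised when the text is not a well-formed comment declaration."""


WHITESPACE = " \t\r\n"  # assumption: the declared white space characters


def scan_comment_declaration(text, pos=0):
    """Scan the comment declaration starting at text[pos].

    Returns (comments, end): the list of comment texts found between
    balanced COMs, and the offset just past the closing MDC.
    """
    if not text.startswith("<!--", pos):
        raise CommentSyntaxError("expected MDO directly followed by COM")
    i = pos + 2  # skip the MDO "<!"
    comments = []
    while True:
        if text.startswith("--", i):            # opening COM
            close = text.find("--", i + 2)      # the next COM closes it
            if close == -1:
                raise CommentSyntaxError("unterminated comment")
            comments.append(text[i + 2:close])
            i = close + 2
        elif i < len(text) and text[i] in WHITESPACE:
            i += 1                              # white space after a comment
        elif i < len(text) and text[i] == ">":
            return comments, i + 1              # MDC ends the declaration
        else:
            raise CommentSyntaxError(
                f"unexpected data or end of input at offset {i}")


print(scan_comment_declaration("<!-- a -- -- b -- >"))
# ([' a ', ' b '], 19)
```
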

Syntax description...

<! = MDO = Markup Declaration Open
-- = COM = Comment start or end
>  = MDC = Markup Declaration Close

Example...

<!-- this text is a good comment --
-- and so is this text too --
but this text is outside of a comment area
-- once again a good comment text --
>

Note the white space between that last COM and the MDC.

Well now, what about this...

<!--- Is this a good comment? -->

Yes it is, the content of the comment is...

- Is this a good comment?

(note that the third dash becomes content of the commentary text)

Further; is this a good comment?

<!---- Is this a good comment? -->

Nope, it's not since now the parser will find...

<! = MDO
-- = COM
-- = COM
Is this a good comment? -->

... where the text is outside of the comment area.

Another example...

<!-- Is this a good comment? --->

No it's not since it leaves a "hanging dash" to be handled by the parser
as in this parsing example...

<!
--
Is this a good comment?
--
-
>

That last "hanging" dash will give a parse error since it's not defined
to be a member of the NAMESTART or NAME character groups.
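
The examples above can also be checked mechanically. What follows is a hedged sketch using Python's re module; the pattern name SGML_COMMENT_DECL and the helper is_sgml_comment_declaration are made up for this illustration, and the pattern encodes only the rules stated in this post (MDO, then zero or more COM-delimited comments each optionally followed by white space, then MDC).

```python
import re

# One comment declaration, per the rules described above.
SGML_COMMENT_DECL = re.compile(
    r"<!"                     # MDO
    r"(?:--(?:(?!--).)*--"    # COM, text containing no "--", COM
    r"\s*)*"                  # optional white space after each comment
    r">",                     # MDC
    re.DOTALL,                # comment text may span lines
)


def is_sgml_comment_declaration(text):
    """True if the whole of `text` is one well-formed comment declaration.

    Note: the empty declaration "<!>" also matches this pattern.
    """
    return SGML_COMMENT_DECL.fullmatch(text) is not None


for sample in (
    "<!--- Is this a good comment? -->",    # third dash is comment text
    "<!---- Is this a good comment? -->",   # empty comment, then stray text
    "<!-- Is this a good comment? --->",    # hanging dash before MDC
    "<!-- one -- -- two -- >",              # two comments, space before MDC
):
    print(is_sgml_comment_declaration(sample), sample)
# True, False, False, True (in that order)
```

The negative lookahead (?!--) matters here: a plain lazy .*? would let the regex engine backtrack across a -- and accept the four-dash example.
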

How is it that such a simple thing could become one of the most misused
things on the WWW? I mean, MS has "innovated" an unintended use of SGML
comments... (there should be a law... :-)

--
Rex
Jul 23 '05 #4
Thomas Kuehne wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1

Are you aware that each message starts with the above and ends with...

> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.9.9 (GNU/Linux)
>
> iD8DBQFBpcS93w+/yD4P9tIRAjtUAJ4/xzgZGBhUTJzS0l7IgnI/ZAi1rACglE5v
> Vwz/mhRNJ/WqumkUo7gpEd0=
> =rAbX
> -----END PGP SIGNATURE-----


Rather dumb, wouldn't you say? I can't find your newsreader in your
headers, but there must be a way to fix it.
Jul 23 '05 #5
Nick Kew wrote:
> In article <jl************@kuehne.cn>,
> Thomas Kuehne <st*************@example.com> writes:
>> # Scan to end of comment.
>> # Comments are defined any of a number of ways.
>> # IE 5.0: <!-- followed by >
>
> That bears no relation to any form of HTML.

Not standard HTML, though it might refer to WinIE conditional comments,
which do follow that general syntax.

--
Reply email address is a bottomless spam bucket.
Please reply to the group so everyone can share.
Jul 23 '05 #6
