Hi all,
I'm trying to resolve what appears to me to be an inconsistency in the
XML 1.0 recommendation involving entities encoded in UTF-16 and the
requirement for a byte order mark.
Section 4.3.3 has the following text:
http://www.w3.org/TR/REC-xml/#charencoding
"Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin
with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000],
section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH
NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not
part of either the markup or the character data of the XML document. XML
processors MUST be able to use this character to differentiate between
UTF-8 and UTF-16 encoded documents."
This seems unambiguous to me -- all entities encoded in UTF-16 must have
a BOM. Since omitting it is only an error rather than a fatal error, does
that mean that a processor can recover by attempting to decode the byte
sequence to verify that it is UTF-16?
But then, later it also says:
"In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML processor
in an encoding other than that named in the declaration, or for an
entity which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8. Note that since ASCII
is a subset of UTF-8, ordinary ASCII entities do not strictly need an
encoding declaration."
I'm assuming this means an entity encoded in UTF-16 that has neither a
BOM nor an encoding declaration must be rejected, since that is a fatal
error.
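To make that reading concrete, here is a minimal Python sketch of how I
understand the two quoted passages to combine. The function name and its
parameters are hypothetical, and it assumes there is no external
transport information and that any encoding declaration names the
encoding actually used. It is only my interpretation, not what any
parser implements.

```python
import codecs

# BOMs that can legitimately open a UTF-8 or UTF-16 entity.
BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)


def classify(data: bytes, actual_encoding: str, has_encoding_decl: bool) -> str:
    """Return 'ok', 'error' or 'fatal error' under a strict reading of 4.3.3.

    actual_encoding is the encoding the bytes are really in, which a real
    processor would have to discover; here it is simply passed in.
    """
    encoding = actual_encoding.lower()
    has_bom = data.startswith(BOMS)

    # Second passage: neither BOM nor encoding declaration, and not UTF-8.
    if not has_bom and not has_encoding_decl and encoding != "utf-8":
        return "fatal error"

    # First passage: UTF-16 entities MUST begin with a BOM (plain error).
    if encoding == "utf-16" and not has_bom:
        return "error"

    return "ok"
```

Under this reading, classify("<a/>".encode("utf-16-be"), "utf-16", False)
gives 'fatal error', while the same bytes preceded by codecs.BOM_UTF16_BE
give 'ok'.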
The reason I'm confused is that many popular XML parsers seem to accept
entities encoded in UTF-16 that do not have a BOM but do have an encoding
declaration. Moreover, at least two popular XML parsers (Xerces-C and
MSXML) even accept UTF-16 entities with neither a BOM nor an encoding
declaration.
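One possible explanation, offered only as a guess and not as a
description of how Xerces-C or MSXML actually work, is the
autodetection described in the non-normative Appendix F of the
recommendation, which recognises BOM-less 16-bit encodings from the
first four bytes of "<?". A rough Python sketch (the function name is
hypothetical):

```python
from typing import Optional


def sniff_utf16_without_bom(data: bytes) -> Optional[str]:
    """Guess a BOM-less UTF-16 byte order from the leading '<?' bytes.

    Follows the byte patterns listed in Appendix F of XML 1.0; returns a
    Python codec name, or None if neither pattern matches.
    """
    head = data[:4]
    if head == b"\x00<\x00?":   # 00 3C 00 3F: big-endian 16-bit code units
        return "utf-16-be"
    if head == b"<\x00?\x00":   # 3C 00 3F 00: little-endian 16-bit code units
        return "utf-16-le"
    return None
```

This only works when the entity starts with "<?", for example an XML
declaration without an encoding pseudo-attribute. It returns None for a
BOM-less UTF-16 entity that starts directly with an element.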
Any thoughts or pointers to previous discussion will be much appreciated.
Thanks!
Dave