UTF-16 entities and BOMs

Hi all,

I'm trying to resolve what appears to me to be an inconsistency in the
XML 1.0 recommendation involving entities encoded in UTF-16 and the
requirement for a byte order mark.

Section 4.3.3 has the following text:

http://www.w3.org/TR/REC-xml/#charencoding

"Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin
with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000],
section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH
NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not
part of either the markup or the character data of the XML document. XML
processors MUST be able to use this character to differentiate between
UTF-8 and UTF-16 encoded documents."

This seems unambiguous to me -- all entities encoded in UTF-16 must have
a BOM. Since a violation is only an error, not a fatal error, does that
mean that a processor can recover from it by attempting to decode the
byte sequence to verify that it is UTF-16?

But then, later it also says:

"In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML processor
in an encoding other than that named in the declaration, or for an
entity which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8. Note that since ASCII
is a subset of UTF-8, ordinary ASCII entities do not strictly need an
encoding declaration."

I'm assuming this means an entity encoded in UTF-16 that has neither a
BOM nor an encoding declaration must be rejected, since that's a fatal
error.
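
To make that reading concrete, here is a minimal sketch in Python of how the two quoted passages seem to combine for a UTF-16 entity when no external transport information is available. The names `Verdict` and `classify_utf16_entity` are made up for illustration; they come from neither the spec nor any parser, and the table only encodes my interpretation, not an official one.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    ERROR = "error (processor may recover)"
    FATAL = "fatal error (processor must stop)"


def classify_utf16_entity(has_bom: bool, has_encoding_decl: bool) -> Verdict:
    """Status of a UTF-16 entity under one reading of section 4.3.3.

    Assumes no encoding information from HTTP, MIME, or similar.
    """
    if has_bom:
        # "Entities encoded in UTF-16 MUST ... begin with the Byte Order Mark"
        return Verdict.OK
    if has_encoding_decl:
        # The MUST is violated, but the fatal-error clause does not apply.
        return Verdict.ERROR
    # Neither a BOM nor an encoding declaration, and the encoding is not UTF-8.
    return Verdict.FATAL
```

Under this reading, only the last case obliges a processor to reject the entity.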

The reason I'm confused is that many popular XML parsers seem to accept
entities encoded in UTF-16 that have no BOM but do have an encoding
declaration. Moreover, at least two popular XML parsers (Xerces-C and
MSXML) accept UTF-16 entities with neither a BOM nor an encoding
declaration.

Any thoughts or pointers to previous discussion will be much appreciated.

Thanks!

Dave
Mar 24 '06 #1
David Bertoni wrote:
> Any thoughts or pointers to previous discussion will be much appreciated.


Hi, Dave. See the appendix "Autodetection of Character Encodings
(Non-Normative)". (Appendix F in 1.0, E in 1.1.) There, they point out
that in fact the byte order and general encoding group can be deduced
without the byte-order mark.
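
As a rough illustration of that appendix, here is a minimal Python sketch that looks only at the first four bytes of an entity. The function name `sniff_encoding_family` and the returned labels are hypothetical, and the sketch covers only part of the appendix's table: it leaves out the unusual UCS-4 byte orders, UCS-4 without a BOM, and EBCDIC. The no-BOM cases work only because an entity with a declaration has to start with the characters "<?xml".

```python
def sniff_encoding_family(data: bytes) -> str:
    """Guess the encoding family from the first bytes of an XML entity."""
    head = data[:4]

    # With a byte order mark. The 4-byte UCS-4 marks are tested first,
    # because FF FE 00 00 would otherwise look like a UTF-16LE mark.
    if head in (b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00"):
        return "UCS-4 (BOM)"
    if head.startswith(b"\xfe\xff"):
        return "UTF-16BE (BOM)"
    if head.startswith(b"\xff\xfe"):
        return "UTF-16LE (BOM)"
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 (BOM)"

    # Without a byte order mark: rely on the entity starting with "<?xml".
    if head == b"\x00<\x00?":  # 00 3C 00 3F
        return "16-bit big-endian (no BOM)"
    if head == b"<\x00?\x00":  # 3C 00 3F 00
        return "16-bit little-endian (no BOM)"
    if head == b"<?xm":        # 3C 3F 78 6D
        return "8-bit ASCII-compatible (no BOM)"

    # Nothing recognizable: the spec's default is UTF-8.
    return "unknown; assume UTF-8"
```

For example, `sniff_encoding_family("<?xml version='1.0'?>".encode("utf-16-le"))` falls into the little-endian no-BOM branch, since Python's `utf-16-le` codec writes no byte order mark. The processor would still need to read the encoding declaration to learn the exact encoding.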

Tim Bray, in his Annotated XML Spec, notes that the rule saying "Parsed
entities which are stored in an encoding other than UTF-8 or UTF-16 must
begin with a text declaration containing an encoding declaration" was
included specifically to ensure that autodetection has a chance of
working. As he puts it: "We recognize that although the Web provides a
method for a server to tell the client what kind of encoding is being
used, sometimes it breaks down, and sometimes there's no server (like
when you're reading something straight off a disk). In these situations,
everything works much better if entities give the processor some help in
figuring out how things are encoded."

In other words, folks producing XML documents MUST produce the BOM when
required ... but in case they don't, a processor apparently MAY attempt
to read past that error. "It is a fatal error when an XML processor
encounters an entity with an encoding that it is unable to process," but
if it can find a way to process it despite the BOM being missing that's
apparently copacetic.

That's how I interpret it, anyway. If you want an official answer, I'd
suggest pinging the W3C directly and asking them to write either a
clarification or an erratum which deals with this.

(The XML 1.0 errata do provide some additional words re when the BOM is
and isn't expected to be present -- most of which were incorporated into
1.1, I think -- but I don't see a direct answer to this question.)

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Mar 24 '06 #2
