UTF-16 entities and BOMs

Hi all,

I'm trying to resolve what appears to me to be an inconsistency in the
XML 1.0 recommendation involving entities encoded in UTF-16 and the
requirement for a byte order mark.

Section 4.3.3 has the following text:

http://www.w3.org/TR/REC-xml/#charencoding

"Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin
with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000],
section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH
NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not
part of either the markup or the character data of the XML document. XML
processors MUST be able to use this character to differentiate between
UTF-8 and UTF-16 encoded documents."
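
As a minimal sketch of what "use this character to differentiate" could look like in practice, the following checks the leading bytes of an entity against the BOM byte sequences for UTF-8 and UTF-16. The function name `sniff_bom` is hypothetical, and the sketch deliberately ignores UCS-4/UTF-32 signatures, which the quoted text does not mention.

```python
import codecs

# BOM byte sequences for the two encodings named in the quoted text.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else None.

    Hypothetical helper for illustration only; not taken from any parser.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None
```

For example, `sniff_bom("\ufeff<a/>".encode("utf-16-be"))` identifies the entity as big-endian UTF-16, while a BOM-less entity yields `None` and has to be classified some other way.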

This seems unambiguous to me -- all entities encoded in UTF-16 must have
a BOM. Since this is only an error, does that mean that a processor can
recover from it by then attempting to decode the byte sequence to verify
that it is UTF-16?

But then, later it also says:

"In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML processor
in an encoding other than that named in the declaration, or for an
entity which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8. Note that since ASCII
is a subset of UTF-8, ordinary ASCII entities do not strictly need an
encoding declaration."

I'm assuming this means an entity encoded in UTF-16 that has neither a
BOM nor an encoding declaration must be rejected, since it's a fatal error.
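
To make the two cases in the quoted passage concrete, here is a rough sketch that encodes one literal reading of that rule as a predicate. The function and parameter names are hypothetical, the encoding-name comparison is a simplification (real parsers handle aliases), and this reflects only the quoted sentence, not the separate "MUST begin with the Byte Order Mark" requirement.

```python
def is_fatal_per_quoted_rule(has_bom, declared, actual, external_info=False):
    """Literal reading of the quoted 4.3.3 sentence (illustrative assumption).

    has_bom:       True if the entity starts with a Byte Order Mark
    declared:      encoding named in the encoding declaration, or None
    actual:        encoding the bytes are really in
    external_info: True if HTTP/MIME etc. supplied encoding information
    """
    if external_info:
        # The quoted rule only applies when no transport protocol
        # supplies encoding information, so it reports nothing here.
        return False

    def norm(name):
        return name.lower().replace("_", "-")

    if declared is not None:
        # Case 1: the declaration names a different encoding than the one used.
        return norm(declared) != norm(actual)
    if not has_bom:
        # Case 2: neither BOM nor declaration, so only UTF-8 is allowed
        # (ASCII passes because it is a subset of UTF-8).
        return norm(actual) not in {"utf-8", "us-ascii", "ascii"}
    return False
```

Under this reading, `is_fatal_per_quoted_rule(False, None, "UTF-16")` is true, whereas a BOM-less UTF-16 entity that does carry a matching encoding declaration is not a fatal error by this sentence alone.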

The reason I'm confused is that many popular XML parsers seem to accept
entities encoded in UTF-16 that do not have a BOM, but have an encoding
declaration. Moreover, at least two popular XML parsers (Xerces-C and
MSXML) accept UTF-16 entities with neither a BOM nor an encoding
declaration.

Any thoughts or pointers to previous discussion will be much appreciated.

Thanks!

Dave
Mar 24 '06 #1
David Bertoni wrote:
> Any thoughts or pointers to previous discussion will be much appreciated.


Hi, Dave. See the appendix "Autodetection of Character Encodings
(Non-Normative)". (Appendix F in 1.0, E in 1.1.) There, they point out
that in fact the byte order and general encoding group can be deduced
without the byte-order mark.
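
A minimal sketch of that BOM-less deduction, assuming the entity begins with an XML or text declaration (`<?xml`): the first four bytes then take one of a few recognizable shapes. The function name is hypothetical, and only three of the patterns listed in the appendix are covered (no UCS-4 or EBCDIC).

```python
def guess_family_without_bom(data: bytes):
    """Guess the encoding family from the bytes of a leading '<?xm'.

    Hypothetical helper; returns None when the entity does not start with
    a recognizable '<?' pattern, in which case a processor would fall back
    to UTF-8 or report an error.
    """
    prefix = data[:4]
    if prefix == b"\x00<\x00?":   # 00 3C 00 3F: 16-bit units, big-endian
        return "utf-16-be"
    if prefix == b"<\x00?\x00":   # 3C 00 3F 00: 16-bit units, little-endian
        return "utf-16-le"
    if prefix == b"<?xm":         # 3C 3F 78 6D: UTF-8 or another ASCII-compatible encoding
        return "utf-8"
    return None
```

The guess only identifies a family: the processor still has to read the encoding declaration to learn the exact encoding. A BOM-less UTF-16 entity with no declaration at all (say, one starting directly with `<a/>`) is not matched by these patterns.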

Tim Bray, in his Annotated XML Spec, notes that the rule saying "Parsed
entities which are stored in an encoding other than UTF-8 or UTF-16 must
begin with a text declaration containing an encoding declaration" was
included specifically to ensure that autodetection has a chance of
working. As he puts it: "We recognize that although the Web provides a
method for a server to tell the client what kind of encoding is being
used, sometimes it breaks down, and sometimes there's no server (like
when you're reading something straight off a disk). In these situations,
everything works much better if entities give the processor some help in
figuring out how things are encoded."

In other words, folks producing XML documents MUST produce the BOM when
required ... but in case they don't, a processor apparently MAY attempt
to read past that error. "It is a fatal error when an XML processor
encounters an entity with an encoding that it is unable to process," but
if it can find a way to process it despite the BOM being missing that's
apparently copacetic.
That's how I interpret it, anyway. If you want an official answer, I'd
suggest pinging the W3C directly and asking them to write either a
clarification or an erratum which deals with this.

(The XML 1.0 errata do provide some additional words re when the BOM is
and isn't expected to be present -- most of which were incorporated into
1.1, I think -- but I don't see a direct answer to this question.)

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Mar 24 '06 #2
