UTF-16 entities and BOMs - .NET Framework

David Bertoni

Hi all,

I'm trying to resolve what appears to me to be an inconsistency in the XML 1.0
recommendation involving entities encoded in UTF-16 and the requirement
for a byte order mark.

Section 4.3.3 has the following text:

http://www.w3.org/TR/REC-xml/#charencoding

"Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin
with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000],
section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH
NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not
part of either the markup or the character data of the XML document. XML
processors MUST be able to use this character to differentiate between
UTF-8 and UTF-16 encoded documents."

This seems unambiguous to me -- all entities encoded in UTF-16 must have
a BOM. Since this is only an error, does that mean that a processor can
recover from it by then attempting to decode the byte sequence to verify
that it is UTF-16?
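
To make the question concrete, a recovery of that kind might look roughly like the Python sketch below. The function name is made up for illustration, and this is only an assumption about what "attempting to decode" could mean, not how any particular parser behaves: decode strictly in each byte order and accept the result only if it starts with an XML declaration.

```python
from typing import Optional, Tuple


def recover_utf16_without_bom(data: bytes) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: return (codec, text) if BOM-less data decodes
    as UTF-16 and begins with '<?xml', otherwise None."""
    for codec in ("utf-16-be", "utf-16-le"):
        try:
            # Strict decoding rejects truncated input and lone surrogates.
            text = data.decode(codec, errors="strict")
        except UnicodeDecodeError:
            continue
        # Decoding alone proves little, since the wrong byte order often
        # decodes too; require the expected start of an XML declaration.
        if text.startswith("<?xml"):
            return codec, text
    return None
```

An entity with no XML declaration at all (which is legal) would not be recognized by this check, so it covers only the easiest case.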

But then, later it also says:

"In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML processor
in an encoding other than that named in the declaration, or for an
entity which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8. Note that since ASCII
is a subset of UTF-8, ordinary ASCII entities do not strictly need an
encoding declaration."

I'm assuming this means an entity encoded in UTF-16 that has neither a
BOM, nor an encoding declaration must be rejected, since it's a fatal error.

The reason I'm confused is that many popular XML parsers seem to accept
entities encoded in UTF-16 that do not have a BOM, but have an encoding
declaration. Moreover, at least two popular XML parsers (Xerces-C and
MSXML) accept UTF-16 entities with neither a BOM nor an encoding declaration.

Any thoughts or pointers to previous discussion will be much appreciated.

Thanks!

Dave

Mar 24 '06 #1


Joe Kesselman

David Bertoni wrote:

Any thoughts or pointers to previous discussion will be much appreciated.

Hi, Dave. See the appendix "Autodetection of Character Encodings
(Non-Normative)". (Appendix F in 1.0, E in 1.1.) There, they point out
that in fact the byte order and general encoding group can be deduced
without the byte-order mark.
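
For what it's worth, the idea in that appendix can be sketched in a few lines of Python. This is only an illustration: the function name and the returned labels are invented here, and it covers just the UTF-8, UTF-16 and UCS-4 BOM cases plus the BOM-less UTF-8/UTF-16 cases, leaving out the BOM-less UCS-4 and EBCDIC rows. It relies on an XML declaration starting with the characters "<?xml".

```python
def sniff_encoding(data: bytes) -> str:
    """Hypothetical sniffer: guess the encoding family from the first bytes."""
    head = data[:4]
    # Cases with a byte order mark. UCS-4 is checked first because
    # FF FE 00 00 would otherwise look like a UTF-16LE BOM.
    if head == b"\x00\x00\xfe\xff":
        return "UCS-4BE (BOM)"
    if head == b"\xff\xfe\x00\x00":
        return "UCS-4LE (BOM)"
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 (BOM)"
    if head.startswith(b"\xfe\xff"):
        return "UTF-16BE (BOM)"
    if head.startswith(b"\xff\xfe"):
        return "UTF-16LE (BOM)"
    # Cases without a byte order mark: look for '<?' as it would
    # appear in each byte order, or '<?xm' in an ASCII-compatible encoding.
    if head == b"\x00<\x00?":
        return "UTF-16BE (no BOM)"
    if head == b"<\x00?\x00":
        return "UTF-16LE (no BOM)"
    if head == b"<?xm":
        return "UTF-8 or other ASCII-compatible (no BOM)"
    return "unknown"
```

A result from the BOM-less branches is only a first guess; the processor would still need to read the encoding declaration to settle on the actual encoding.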

Tim Bray, in his Annotated XML Spec, notes that the rule saying "Parsed
entities which are stored in an encoding other than UTF-8 or UTF-16 must
begin with a text declaration containing an encoding declaration" was
included specifically to ensure that autodetection has a chance of
working. As he puts it: "We recognize that although the Web provides a
method for a server to tell the client what kind of encoding is being
used, sometimes it breaks down, and sometimes there's no server (like
when you're reading something straight off a disk). In these situations,
everything works much better if entities give the processor some help in
figuring out how things are encoded."

In other words, folks producing XML documents MUST produce the BOM when
required ... but in case they don't, a processor apparently MAY attempt
to read past that error. "It is a fatal error when an XML processor
encounters an entity with an encoding that it is unable to process," but
if it can find a way to process it despite the BOM being missing that's
apparently copacetic.

That's how I interpret it, anyway. If you want an official answer, I'd
suggest pinging the W3C directly and asking them to write either a
clarification or an erratum which deals with this.

(The XML 1.0 errata do provide some additional words re when the BOM is
and isn't expected to be present -- most of which were incorporated into
1.1, I think -- but I don't see a direct answer to this question.)

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry

Mar 24 '06 #2
