UTF-16 entities and BOMs

Hi all,

I'm trying to resolve what appears to me to be an inconsistency in the XML 1.0
recommendation involving entities encoded in UTF-16 and the requirement
for a byte order mark.

Section 4.3.3 has the following text:


"Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin
with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000],
section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH
NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not
part of either the markup or the character data of the XML document. XML
processors MUST be able to use this character to differentiate between
UTF-8 and UTF-16 encoded documents."

This seems unambiguous to me -- all entities encoded in UTF-16 must have
a BOM. Since omitting it is only an error, not a fatal error, does that
mean that a processor can recover from it by then attempting to decode
the byte sequence to verify that it is UTF-16?

But then, later it also says:

"In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML processor
in an encoding other than that named in the declaration, or for an
entity which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8. Note that since ASCII
is a subset of UTF-8, ordinary ASCII entities do not strictly need an
encoding declaration."

I'm assuming this means an entity encoded in UTF-16 that has neither a
BOM nor an encoding declaration must be rejected, since that is a fatal error.
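
One way to make that reading concrete is the small Python sketch below. It is only an illustration of the two quoted passages as read above, not an authoritative statement of the spec. The function name `classify_entity` and the result labels "ok", "error" and "fatal" are made up for this example, and it assumes no external transport information is available and that encoding names are already normalised.

```python
from typing import Optional


def classify_entity(actual: str, has_bom: bool, declared: Optional[str]) -> str:
    """Return 'ok', 'error' or 'fatal' under one reading of XML 1.0 section 4.3.3.

    Hypothetical helper: 'actual' is the encoding the bytes really use,
    'declared' is the name in the encoding declaration (None if absent).
    Assumes no HTTP/MIME information and already-normalised encoding names.
    """
    # Declaration present but the bytes are in a different encoding: fatal.
    if declared is not None and declared != actual:
        return "fatal"
    # Neither BOM nor declaration, and the encoding is not UTF-8: fatal.
    if not has_bom and declared is None and actual != "utf-8":
        return "fatal"
    # UTF-16 entities MUST begin with a BOM. The quoted text does not call
    # this violation fatal, so it is treated here as a plain error.
    if actual == "utf-16" and not has_bom:
        return "error"
    return "ok"


# UTF-16 with a declaration but no BOM: an error a processor may recover from.
print(classify_entity("utf-16", has_bom=False, declared="utf-16"))
# UTF-16 with neither BOM nor declaration: fatal under this reading.
print(classify_entity("utf-16", has_bom=False, declared=None))
```

Under this reading the two cases differ only in severity, which is exactly the distinction the question turns on.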

The reason I'm confused is that many popular XML parsers seem to accept
entities encoded in UTF-16 that do not have a BOM, but do have an encoding
declaration. Going further, at least two popular XML parsers (Xerces-C and
MSXML) accept UTF-16 entities with neither a BOM nor an encoding declaration.

Any thoughts or pointers to previous discussion will be much appreciated.


Mar 24 '06 #1
David Bertoni wrote:
Any thoughts or pointers to previous discussion will be much appreciated.

Hi, Dave. See the appendix "Autodetection of Character Encodings
(Non-Normative)". (Appendix F in 1.0, E in 1.1.) There, they point out
that in fact the byte order and general encoding group can be deduced
without the byte-order mark.
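
A rough Python sketch of that kind of autodetection is below. It covers only a subset of the byte patterns in the appendix (the unusual UCS-4 byte orders and EBCDIC are left out), the function name `sniff_xml_encoding` is invented for the example, and the returned names are Python codec names chosen for convenience (with "utf-32" standing in for UCS-4), not anything the spec mandates. The no-BOM cases rely on the entity starting with "<?xml", and the final fallback to UTF-8 is an assumption of this sketch.

```python
from typing import Tuple

# (leading bytes, Python codec name, whether the bytes are a BOM).
# Four-byte UCS-4 signatures come first because FF FE 00 00 would
# otherwise be mistaken for a little-endian UTF-16 BOM.
_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be", True),
    (b"\xff\xfe\x00\x00", "utf-32-le", True),
    (b"\xfe\xff", "utf-16-be", True),
    (b"\xff\xfe", "utf-16-le", True),
    (b"\xef\xbb\xbf", "utf-8", True),
    # No BOM: recognise the bytes of "<" or "<?" or "<?xm" instead.
    (b"\x00\x00\x00\x3c", "utf-32-be", False),
    (b"\x3c\x00\x00\x00", "utf-32-le", False),
    (b"\x00\x3c\x00\x3f", "utf-16-be", False),
    (b"\x3c\x00\x3f\x00", "utf-16-le", False),
    (b"\x3c\x3f\x78\x6d", "utf-8", False),  # or another ASCII-compatible encoding
]


def sniff_xml_encoding(data: bytes) -> Tuple[str, bool]:
    """Guess (codec name, BOM found) from the first bytes of an entity."""
    for prefix, codec, is_bom in _SIGNATURES:
        if data.startswith(prefix):
            return codec, is_bom
    # Nothing matched: fall back to UTF-8 (an assumption of this sketch).
    return "utf-8", False


doc = '<?xml version="1.0"?><a/>'
# UTF-16 without a BOM is still recognisable from the "<?" byte pattern.
print(sniff_xml_encoding(doc.encode("utf-16-be")))
print(sniff_xml_encoding(b"\xff\xfe" + doc.encode("utf-16-le")))
```

For the ASCII-compatible case a real processor would still go on to read the encoding declaration to find out which encoding is actually in use.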

Tim Bray, in his Annotated XML Spec, notes that the rule saying "Parsed
entities which are stored in an encoding other than UTF-8 or UTF-16 must
begin with a text declaration containing an encoding declaration" was
included specifically to ensure that autodetection has a chance of
working. As he puts it: "We recognize that although the Web provides a
method for a server to tell the client what kind of encoding is being
used, sometimes it breaks down, and sometimes there's no server (like
when you're reading something straight off a disk). In these situations,
everything works much better if entities give the processor some help in
figuring out how things are encoded."

In other words, folks producing XML documents MUST produce the BOM when
required ... but in case they don't, a processor apparently MAY attempt
to read past that error. "It is a fatal error when an XML processor
encounters an entity with an encoding that it is unable to process," but
if it can find a way to process it despite the BOM being missing that's
apparently copacetic.

That's how I interpolate it, anyway. If you want an official answer, I'd
suggest pinging the W3C directly and asking them to write either a
clarification or an erratum which deals with this.

(The XML 1.0 errata do provide some additional words re when the BOM is
and isn't expected to be present -- most of which were incorporated into
1.1, I think -- but I don't see a direct answer to this question.)

() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Mar 24 '06 #2
