using unicode in XML - .NET Framework

Naresh Agarwal

Hi

XML uses UTF-8 by default. Is that correct?

Also, can we use Unicode in XML?

thanks,
Naresh

Jul 20 '05 #1


Bjorn Brox

Naresh Agarwal wrote:

> Hi
>
> XML uses UTF-8 by default. Is that correct?
>
> Also, can we use Unicode in XML?

UTF-8 is the most common way to encode Unicode.

--
Bjorn Brox, CORENA Norge AS, http://www.corena.no/, ICQ 17872043
Industritunet, Dyrmyrgt. 35, N-3611 Kongsberg, NORWAY
Phone: +47 32717210, Fax: +47 32717201, Mobile: +47 92638590

Jul 20 '05 #2

Edwin Dankert

> XML uses UTF-8 by default. Is that correct?

That is correct when the document has neither a BOM nor an encoding declaration. [1]

> Also, can we use Unicode in XML?

UTF-8, UTF-16 and UTF-32 are three encodings defined by the Unicode [2]
people. Each of them can encode the whole Unicode character set.

UTF-8 takes 1 to 4 bytes for a character (the original design allowed up to 6, but RFC 3629 limits it to 4).
UTF-16 takes 2 or 4 bytes for a character.
UTF-32 always takes 4 bytes for a character.
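
The byte counts above are easy to check for yourself. Here is a minimal Python sketch; the function name encoded_sizes and the sample characters are only picked for illustration, and the "-le" codec variants are used so that Python does not add a BOM to the count.

```python
# Illustrative only: count the bytes each encoding needs for one character.
# The "-le" codec variants are used so Python does not prepend a BOM.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch: str) -> tuple[int, ...]:
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32, in that order."""
    return tuple(len(ch.encode(enc)) for enc in ENCODINGS)


if __name__ == "__main__":
    # An ASCII letter, an accented Latin letter, the euro sign, and an emoji
    # from outside the Basic Multilingual Plane.
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))
    # U+0041 (1, 2, 4)
    # U+00E9 (2, 2, 4)
    # U+20AC (3, 2, 4)
    # U+1F600 (4, 4, 4)
```

Only characters outside the Basic Multilingual Plane need 4 bytes in UTF-16 (a surrogate pair); everything else fits in 2.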

[1] http://www.w3.org/TR/2000/REC-xml-20001006#charencoding
[2] http://www.unicode.org/

Regards,
Edwin Dankert
Cladonia Ltd.
http://www.cladonia.com/

Jul 20 '05 #3

Mike Brown

"Naresh Agarwal" <na******@informatica.com> wrote

> XML uses UTF-8 by default. Is that correct?

People say this quite often. You'd think it were true. Usually when they say
it, they are thinking "if I don't put an encoding declaration in the prolog,
the XML parser is going to assume the document is utf-8 encoded, right?" And
that may seem to be true most of the time, but some better understanding is
in order.

First, understand that XML, being on one level just a string of abstract
Unicode characters, may be represented in any encoding. That is, the
"physical" bytes of the document (or rather, the bytes of each 'entity'
[file]) can represent Unicode characters according to any character map you
wish to use -- e.g., iso-8859-1, utf-8, us-ascii, shift-jis, whatever.
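
To make that concrete, here is a rough Python sketch that writes the same one-element document in three encodings and parses each one with the standard library's xml.etree.ElementTree. The helper names (build_document, parse_text) and the sample word are made up for the example, and it assumes the parser in use supports all three encodings, which the expat-based parser shipped with Python does.

```python
import xml.etree.ElementTree as ET

TEXT = "caf\u00e9"  # sample content containing one non-ASCII character


def build_document(encoding: str) -> bytes:
    """Serialize the same abstract document to bytes in the given encoding."""
    xml = f'<?xml version="1.0" encoding="{encoding}"?><word>{TEXT}</word>'
    # Python's "utf-16" codec also prepends a byte order mark.
    return xml.encode(encoding)


def parse_text(data: bytes) -> str:
    """Parse the bytes and return the text content of the root element."""
    return ET.fromstring(data).text


if __name__ == "__main__":
    for encoding in ("iso-8859-1", "utf-8", "utf-16"):
        data = build_document(encoding)
        # The byte counts differ, but the parsed characters are identical.
        print(encoding, len(data), parse_text(data) == TEXT)
```

The bytes on disk are different in each case, yet the application sees the same string of Unicode characters after parsing.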

However, an XML parser is only *required* to support two encodings: utf-8
and utf-16, each of which provides a way to map all 1.1 million Unicode
code points to specific byte sequences (1 to 4 bytes each in utf-8, 2 or 4
in utf-16)... whereas other encodings typically use just one byte per
character and are thus only good for representing a very small subset of
Unicode's repertoire. You will find that most parsers do at least support
us-ascii and iso-8859-1 in addition to the required utf-8 and utf-16, since
these are fairly common encodings.

The XML spec, which you should have handy and should read when you want to
find answers like this, requires that an XML parser determine the encoding
of a document by checking for declarations and hints in a number of places
which I will not list here, since it's not an easily summarizable list. One
of the things it will look for, though, in the absence of
externally-supplied encoding info, is the presence of a byte order mark
(BOM) at the start of the file. In UTF-16, the byte stream is prefaced by a
pair of bytes that are the encoded form of the "zero-width no-break space"
character, Unicode code point U+FEFF. These bytes, which will typically (but
not necessarily) be in the order 0xFF 0xFE on Intel platforms, signal to the
parser that the document is UTF-16 encoded and tell it whether the bytes are
in big-endian (0xFE 0xFF) or little-endian (0xFF 0xFE) order. (A UTF-8 file
may also start with the same character, encoded as 0xEF 0xBB 0xBF, but there
it says nothing about byte order.)

So, contrary to popular belief, it is quite possible to save a document with
no encoding declaration in its prolog, using UTF-16 encoding (such as in
Windows Notepad, if you choose "Unicode" from the "Save As" dialog), and the
parser will not in this case "default to UTF-8", but will instead recognize
the BOM as a UTF-16 declaration, of sorts, and it will decode the document
properly.
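
As a rough illustration of the BOM check, here is a Python sketch. The sniff_utf16_bom and make_utf16_document helpers are hypothetical and deliberately simplified: a real parser follows the full detection procedure in the XML spec, which also covers UTF-32, documents without a BOM, and externally supplied encoding information. The last two lines of the demo hand BOM-prefixed documents with no encoding declaration to the expat-based parser behind xml.etree.ElementTree.

```python
import codecs
import xml.etree.ElementTree as ET


def sniff_utf16_bom(data: bytes):
    """Return the UTF-16 variant signalled by a leading BOM, or None.

    Simplified sketch: UTF-32 and BOM-less documents are not handled.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes 0xFF 0xFE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes 0xFE 0xFF
        return "utf-16-be"
    return None


def make_utf16_document(little_endian: bool) -> bytes:
    """Build a UTF-16 document with a BOM and no encoding declaration."""
    xml = "<greeting>hello</greeting>"
    if little_endian:
        return codecs.BOM_UTF16_LE + xml.encode("utf-16-le")
    return codecs.BOM_UTF16_BE + xml.encode("utf-16-be")


if __name__ == "__main__":
    for little_endian in (True, False):
        doc = make_utf16_document(little_endian)
        print(sniff_utf16_bom(doc), ET.fromstring(doc).text)
```

In both byte orders the parser decodes the document correctly rather than assuming UTF-8, because the BOM tells it what it is looking at.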

Jul 20 '05 #4
