using unicode in XML

Hi

XML uses UTF-8 by default. Is that correct?

Also, can we use Unicode in XML?

thanks,
Naresh
Jul 20 '05 #1
Naresh Agarwal wrote:
> Hi
>
> XML uses UTF-8 by default. Is that correct?
>
> Also, can we use Unicode in XML?

UTF-8 is the most common way to encode Unicode.

--
Bjorn Brox, CORENA Norge AS, http://www.corena.no/, ICQ 17872043
Industritunet, Dyrmyrgt. 35, N-3611 Kongsberg, NORWAY
Phone: +47 32717210, Fax: +47 32717201, Mobile: +47 92638590

Jul 20 '05 #2
> XML uses UTF-8 by default. Is that correct?

That is correct, provided the document has no BOM, no encoding
declaration and no externally supplied encoding information. [1]

> Also, can we use Unicode in XML?

Yes. UTF-8, UTF-16 and UTF-32 are three encodings defined by the Unicode
people [2]. Each of them can represent the whole Unicode character set.

UTF-8 takes 1 to 4 bytes per character (the original definition allowed up to 6).
UTF-16 takes 2 or 4 bytes per character.
UTF-32 takes 4 bytes per character.
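
To make the byte counts above concrete, here is a minimal sketch in Python (my choice of language; nothing in this thread depends on it). The sample characters and the helper name encoded_lengths are arbitrary picks for illustration. The "-le" codec variants are used so that no BOM is counted.

```python
# Sample characters (arbitrary): ASCII, Latin-1 range, rest of the BMP, beyond the BMP.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

for ch in SAMPLES:
    # Print the code point rather than the character to stay console-safe.
    print("U+%04X" % ord(ch), encoded_lengths(ch))
```

Only the last sample, which lies outside the Basic Multilingual Plane, needs 4 bytes in both UTF-8 and UTF-16.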

[1] http://www.w3.org/TR/2000/REC-xml-20001006#charencoding
[2] http://www.unicode.org/

Regards,
Edwin Dankert
Cladonia Ltd.
http://www.cladonia.com/
Jul 20 '05 #3
"Naresh Agarwal" <na******@informatica.com> wrote
> XML uses UTF-8 by default. Is that correct?


People say this quite often. You'd think it were true. Usually when they say
it, they are thinking "if I don't put an encoding declaration in the prolog,
the XML parser is going to assume the document is utf-8 encoded, right?" And
that may seem to be true most of the time, but some better understanding is
in order.

First, understand that XML, being on one level just a string of abstract
Unicode characters, may be represented in any encoding. That is, the
"physical" bytes of the document (or rather, the bytes of each 'entity'
[file]) can represent Unicode characters according to any character map you
wish to use -- e.g., iso-8859-1, utf-8, us-ascii, shift-jis, whatever.

However, an XML parser is only *required* to support two encodings: utf-8
and utf-16. Each of these can represent all of Unicode's roughly 1.1 million
code points, using 1 to 4 bytes per character in utf-8 and 2 or 4 bytes in
utf-16, whereas many other encodings use just one byte per character and
are thus only good for representing a very small subset of Unicode's
repertoire. You will find that most parsers do at least support us-ascii and
iso-8859-1 in addition to the required utf-8 and utf-16, since these are
fairly common encodings.
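
As a rough illustration of the point that the same characters can travel in different byte encodings, here is a small sketch using Python's standard xml.etree.ElementTree module (Python and this particular parser are my assumptions, not something the original question mentions). The helper name encode_and_parse and the sample document are made up for the example.

```python
import xml.etree.ElementTree as ET

def encode_and_parse(encoding):
    """Serialize one small document in `encoding`, then parse the bytes back."""
    text = '<?xml version="1.0" encoding="%s"?><word>caf\u00e9</word>' % encoding
    data = text.encode(encoding)  # the "physical" bytes of the entity
    return data, ET.fromstring(data).text

for enc in ("utf-8", "utf-16", "iso-8859-1"):
    data, parsed = encode_and_parse(enc)
    # ascii() keeps the output printable on any console.
    print(enc, len(data), "bytes ->", ascii(parsed))
```

The byte strings differ from one encoding to the next, but the parser hands back the same four characters each time.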

The XML spec, which you should have handy and should read when you want to
find answers like this, requires that an XML parser determine the encoding
of a document by checking for declarations and hints in a number of places
which I will not list here, since it's not an easily summarizable list. One
of the things it will look for, though, in the absence of
externally-supplied encoding info, is the presence of a byte order mark
(BOM) at the start of the file. The BOM is best known from UTF-16, although
UTF-8 and UTF-32 have BOM byte sequences of their own. In UTF-16, the byte
stream is prefaced by a pair of bytes that are the encoded form of the
"zero-width no-break space" character, Unicode code point U+FEFF. These
bytes, which will typically (but not necessarily) be in the order 0xFF 0xFE
on Intel platforms, signal to the parser that the document is UTF-16
encoded and whether the bytes are in big-endian or little-endian order.
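
For what it's worth, the BOM check can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular parser implements it, and it covers just the BOM step, not the full detection procedure in the spec. The function name sniff_bom is hypothetical; the BOM constants come from Python's standard codecs module.

```python
import codecs

def sniff_bom(data):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    # UTF-32 is checked before UTF-16 because the UTF-32-LE BOM (FF FE 00 00)
    # begins with the same two bytes as the UTF-16-LE BOM (FF FE).
    candidates = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xff\xfe<\x00r\x00"))  # little-endian UTF-16 BOM
print(sniff_bom(b"<root/>"))             # no BOM at all
```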

So, contrary to popular belief, it is quite possible to save a document with
no encoding declaration in its prolog, using UTF-16 encoding (such as in
Windows Notepad, if you choose "Unicode" from the "Save As" dialog), and the
parser will not in this case "default to UTF-8", but will instead recognize
the BOM as a UTF-16 declaration, of sorts, and it will decode the document
properly.
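
You can try this yourself. The sketch below assumes Python and its standard xml.etree.ElementTree parser (my choice for the demonstration); the sample document and variable names are made up. It builds the same declaration-free document twice, once as UTF-16 with a BOM and once as plain UTF-8, and parses both.

```python
import codecs
import xml.etree.ElementTree as ET

# A document with no XML declaration, hence no encoding declaration either.
DOC = "<greeting>caf\u00e9</greeting>"

# Roughly what an editor saving as little-endian UTF-16 would write: BOM, then the text.
utf16_bytes = codecs.BOM_UTF16_LE + DOC.encode("utf-16-le")
# The same document as UTF-8 without a BOM.
utf8_bytes = DOC.encode("utf-8")

utf16_text = ET.fromstring(utf16_bytes).text
utf8_text = ET.fromstring(utf8_bytes).text

# ascii() keeps the output printable on any console.
print(ascii(utf16_text), ascii(utf8_text))
```

Both parses should yield the same text, which is the point: the BOM, not a UTF-8 default, decides how the first byte string is decoded.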
Jul 20 '05 #4
