Unicode and utf 8 /utf 16

Hi all,

Can someone tell me the difference between Unicode and UTF-8 or UTF-18, and
which one supports the larger character set?

Which one should I use to support UCS-2 characters?

I want to use UCS-2 characters in StreamReader and StreamWriter.

How are Unicode and UTF characters stored?

Please help me.

Thanks in advance.

Jun 29 '06 #1
archana wrote:
> Can someone tell me the difference between Unicode and UTF-8 or UTF-18, and
> which one supports the larger character set?
>
> Which one should I use to support UCS-2 characters?
>
> I want to use UCS-2 characters in StreamReader and StreamWriter.
>
> How are Unicode and UTF characters stored?


See http://www.pobox.com/~skeet/csharp/unicode.html

I'm always hazy about the difference between UCS-2 and UTF-16 - it's
almost certainly to do with surrogate pairs, if there is a difference -
but you can get a UTF-16 encoding with Encoding.Unicode.

Jon

Jun 29 '06 #2
Unicode is a character set, just like UCS.

UTF-8 and UTF-16 are UCS Transformation Formats. As Unicode and UCS are
effectively synonymous, UTF-8 and UTF-16 are used to encode Unicode strings.

In UTF-16, characters in the Basic Multilingual Plane (BMP) are encoded as a
single 16-bit unit (two bytes).
UTF-16 and UCS-2 are identical for all characters that UCS-2 handles.
You can treat UCS-2 data as UTF-16 without any problems.

In UTF-8, ASCII characters are encoded as 8-bit sequences (one byte).
All other characters are encoded as sequences of two to four bytes.

As the character type in .NET is a 16-bit Unicode character, it's
synonymous with the UCS BMP that UCS-2 handles.

In conclusion, in .NET the Unicode and UCS BMP character sets are the
same, and UCS-2 and UTF-16 are the same.

There is no encoding in UCS that corresponds to UTF-8. If you export
data to something that only handles UCS, you have to use UTF-16.
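
To make the byte counts mentioned above concrete, here is a small sketch. It is written in Python rather than .NET purely for illustration; the sample characters and the helper name encoded_sizes are my own choices, not anything from the thread.

```python
# Compare how many bytes a single character takes in UTF-8 and in UTF-16.
# The sample characters below are arbitrary picks from different ranges.
SAMPLES = {
    "A": "U+0041, ASCII",
    "\u00e9": "U+00E9, Latin-1 range",
    "\u20ac": "U+20AC, elsewhere in the BMP",
    "\U0001F600": "U+1F600, outside the BMP",
}


def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-le" is used so that no byte order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch, note in SAMPLES.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"{note}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

The last sample needs four bytes in both encodings: in UTF-16 it is stored as two 16-bit units, which is exactly the case UCS-2 cannot express.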
Jun 29 '06 #3
Jon Skeet [C# MVP] wrote:
> archana wrote:
>> Can someone tell me the difference between Unicode and UTF-8 or UTF-18, and
>> which one supports the larger character set?
>>
>> Which one should I use to support UCS-2 characters?
>>
>> I want to use UCS-2 characters in StreamReader and StreamWriter.
>>
>> How are Unicode and UTF characters stored?
>
> See http://www.pobox.com/~skeet/csharp/unicode.html
>
> I'm always hazy about the difference between UCS-2 and UTF-16 - it's
> almost certainly to do with surrogate pairs, if there is a difference -
> but you can get a UTF-16 encoding with Encoding.Unicode.
>
> Jon


From what I can gather, the only difference is that UTF-16 is capable
of encoding the full range of Unicode characters (up to U+10FFFF), while
UCS-2 only handles the 16-bit range specified as the UCS BMP (Basic
Multilingual Plane).

As the Char datatype in .NET is a 16 bit data type, it doesn't handle
any characters that UCS-2 doesn't handle. As I understand it, that would
make UTF-16 and UCS-2 synonymous in .NET.
Jun 29 '06 #4
Göran Andersson <gu***@guffa.com> wrote:
> From what I can gather, the only difference is that UTF-16 is capable
> of encoding the full range of Unicode characters (up to U+10FFFF), while
> UCS-2 only handles the 16-bit range specified as the UCS BMP (Basic
> Multilingual Plane).
>
> As the Char datatype in .NET is a 16 bit data type, it doesn't handle
> any characters that UCS-2 doesn't handle. As I understand it, that would
> make UTF-16 and UCS-2 synonymous in .NET.


.NET chars have surrogate pair forms (check out Char.IsHighSurrogate()
and Char.IsLowSurrogate()) combining two characters to form a single
abstract character. Thus, the number of physical characters in a .NET
string may be greater than the number of actual, abstract characters.
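
A rough illustration of that point, sketched in Python because it is easy to run anywhere (this is not .NET code). Python's len() counts code points, so the UTF-16 code units, which are what a .NET Char holds, are recovered by encoding the string. The helpers utf16_units, is_high_surrogate and is_low_surrogate are hypothetical names of mine that only mimic the range checks the .NET methods above perform.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_high_surrogate(unit):
    """True if the 16-bit unit lies in D800..DBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    """True if the 16-bit unit lies in DC00..DFFF."""
    return 0xDC00 <= unit <= 0xDFFF


# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
text = "a\U0001D11Eb"
units = utf16_units(text)

print("abstract characters (code points):", len(text))
print("16-bit code units:", len(units))
print("units:", [hex(u) for u in units])
```

The string holds three abstract characters but needs four 16-bit units, because the clef is stored as a high surrogate followed by a low surrogate.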

-- Barry

--
http://barrkel.blogspot.com/
Jun 29 '06 #5
> As the Char datatype in .NET is a 16 bit data type, it doesn't handle
> any characters that UCS-2 doesn't handle. As I understand it, that would
> make UTF-16 and UCS-2 synonymous in .NET.

No. UTF-16 is a superset of UCS2. And .NET is UTF-16, not UCS2.

Short example
You decide initially that 10 digits is enough to encode a certain character
set.
So you can have
0 1 2 3 4 5 6 7 8 9

Later on, you discover this is not true, and you need a way to represent
more. But you have some areas that are not allocated yet in your encoding, so
you can reuse that:
0 1 [ 2 3 4 | 5 6 7 ] 8 9
Let's call the 2-4 range "high surrogate" and the 5-7 "low surrogate"

Then you can represent stuff like this:
0 1 8 9 = 4 values
(you are not allowed to use the surrogate area for real characters)
but you can also represent characters using two code units:
25 26 27 35 36 37 45 46 47 = 9 values
And you have a way to map 25 => 10, 26=>11, ... 47=>18

So you end up being able to represent 13 values!

This is 10 + HighSurrogate * LowSurrogate =
= 10 + 9 = 19 = covered range
And the number of useful codes for encoding (you cannot use surrogates):
= 10 + HighSurrogate * LowSurrogate - HighSurrogate - LowSurrogate
= 19 - 3 - 3 = 13 = number of characters that you can now encode
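
The toy scheme above can be sketched in a few lines of Python (chosen only as a convenient notation; the names HIGH, LOW, SINGLE and toy_decode are made up for this illustration). A high digit followed by a low digit decodes to a value from 10 to 18, using the 25 => 10 ... 47 => 18 mapping given above.

```python
HIGH = [2, 3, 4]  # toy "high surrogates"
LOW = [5, 6, 7]   # toy "low surrogates"
# Digits that still stand for themselves: 0 1 8 9
SINGLE = [d for d in range(10) if d not in HIGH and d not in LOW]


def toy_decode(digits):
    """Decode a list of toy code units (digits) into values 0..18."""
    values = []
    i = 0
    while i < len(digits):
        d = digits[i]
        if d in HIGH:
            # A high digit must be followed by a low digit.
            if i + 1 >= len(digits) or digits[i + 1] not in LOW:
                raise ValueError("unpaired high surrogate digit")
            values.append(10 + (d - HIGH[0]) * len(LOW) + (digits[i + 1] - LOW[0]))
            i += 2
        elif d in LOW:
            raise ValueError("unpaired low surrogate digit")
        else:
            values.append(d)
            i += 1
    return values


encodable = len(SINGLE) + len(HIGH) * len(LOW)
print("values that can be encoded:", encodable)
print(toy_decode([0, 2, 5, 9, 4, 7]))
```

The decoder rejects a lone 2-7 digit, which is the toy version of "you are not allowed to use the surrogate area for real characters".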

Now, for Unicode before the introduction of surrogates you had
0000 - FFFF
But when it turned out that more than FFFF code points were needed,
the mechanism described above was created (at another scale):
0000 0001 0002 0003 ... D7FF [ D800 - DBFF | DC00 - DFFF ] E000 ... FFFF
D800 - DBFF = high surrogates
DC00 - DFFF = low surrogates

So what you can represent is:
0000 0001 0002 0003 ... D7FF E000 ... FFFF
and you add the stuff above the BMP with one high and one low surrogate:
D800 DC00  D800 DC01  ....  D800 DFFF
D801 DC00  D801 DC01  ....  D801 DFFF
...
DBFF DC00  DBFF DC01  ....  DBFF DFFF

Covered range:
FFFF + ( DBFF - D800 + 1 ) x (DFFF - DC00 + 1 ) =
FFFF + 0400 x 0400 = 10FFFF
Wow! Exactly what is covered by UTF-16! Coincidence?
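
The same arithmetic, written out as a Python sketch (not .NET code; the function names to_surrogate_pair and from_surrogate_pair are assumptions made for this example). It uses the standard UTF-16 rule: subtract 10000 hex, then split the remaining 20 bits into two 10-bit halves.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the range 10000..10FFFF")
    offset = code_point - 0x10000  # 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The first and the last pair give the two ends of the range above the BMP.
print(hex(from_surrogate_pair(0xD800, 0xDC00)))
print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))
print([hex(u) for u in to_surrogate_pair(0x1F600)])
```

The last pair, DBFF DFFF, comes out as 10FFFF, matching the covered range computed above.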

Number of code points available for encoding (0000 through 10FFFF is 110000
values, minus the two surrogate ranges):
110000 - 0400 - 0400 = 10F800 = 1112064 (decimal)
If you read http://www.unicode.org/book/uc20ch1.html you will find
that "more than 1 million characters can be encoded".
Well, the 1112064 value is the "technically possible" value, but you should
exclude reserved areas, private use areas and others.
Anyway, long story short: UCS2 = before UTF-8/surrogates mechanism was
introduced.
When an application is surrogate aware, you can say it is UTF-16.
If it is not surrogate aware, then it is probably UCS2.

And .NET is UTF-16
=================================================
To answer the original questions:

> Can someone tell me the difference between Unicode and UTF-8 or UTF-18, and
> which one supports the larger character set?

There is no UTF-18, it is UTF-16.
Unicode is a "coded character set", basically mapping characters to numbers
(A=0x41, B=0x42 and so on).
UTF-8 and UTF-16 are different ways of representing this mapping.
And there is no coverage difference.

You can compare it (in a way) to number systems with different bases.
If you say A=0x41, B=0x42 in hex
or if you say A=65, B=66 in decimal
or if you say A=0101, B=0102 in octal
it is the same thing.
So your utf-8, utf-16 question is a bit like asking "hex or decimal, which
one can represent more numbers?" Answer: they are the same.
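
A quick way to check the "no coverage difference" claim is to round-trip the same code points through each encoding form. The sketch below uses Python (not .NET) and a hand-picked list of boundary code points; the names BOUNDARY_CODE_POINTS, round_trips and coverage_table are assumptions of this example, and a sample list is of course not a proof.

```python
# Code points at the edges of the ranges where the encoded length changes.
BOUNDARY_CODE_POINTS = [
    0x0000, 0x007F, 0x0080, 0x07FF, 0x0800,
    0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF,
]
ENCODINGS = ["utf-8", "utf-16", "utf-32"]


def round_trips(code_point, encoding):
    """True if the character survives encode followed by decode."""
    ch = chr(code_point)
    return ch.encode(encoding).decode(encoding) == ch


def coverage_table():
    """Map each encoding to whether all sample code points round-trip."""
    return {
        enc: all(round_trips(cp, enc) for cp in BOUNDARY_CODE_POINTS)
        for enc in ENCODINGS
    }


for enc, ok in coverage_table().items():
    print(enc, "round-trips all sample code points:", ok)
```

Every encoding handles the whole list, including 10FFFF; they differ only in how many bytes each code point takes.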
See some official standard here:
http://www.unicode.org/versions/Unic...h02.pdf#G13708
and here:
http://www.unicode.org/reports/tr17/index.html
or here
http://scripts.sil.org/cms/scripts/p...i&item_id=IWS-Chapter04a#96f19a02

> Which one should I use to support UCS-2 characters?
> I want to use UCS-2 characters in StreamReader and StreamWriter.

Use UTF-16. It is a superset of UCS-2 and is the one supported by all the .NET
API.
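
In .NET that means handing a UTF-16 encoding (Encoding.Unicode, as mentioned earlier in the thread) to the reader and the writer. The same idea, sketched in Python only because it is easy to try out: the file is written and read back with an explicit "utf-16" encoding. The helper name round_trip_utf16 and the sample text are made up for this example.

```python
import os
import tempfile


def round_trip_utf16(text):
    """Write text to a temp file as UTF-16, return (raw_bytes, text_read_back)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # newline="" keeps line endings untouched in both directions.
        with open(path, "w", encoding="utf-16", newline="") as writer:
            writer.write(text)
        with open(path, "rb") as raw_reader:
            raw = raw_reader.read()
        with open(path, "r", encoding="utf-16", newline="") as reader:
            return raw, reader.read()
    finally:
        os.remove(path)


sample = "h\u00e9llo \U0001F600"  # BMP characters plus one above the BMP
raw, read_back = round_trip_utf16(sample)
print("bytes on disk:", len(raw))
print("starts with a byte order mark:", raw[:2] in (b"\xff\xfe", b"\xfe\xff"))
print("text survived:", read_back == sample)
```

Plain UCS-2 data is read correctly this way too, since every UCS-2 character has the same 16-bit value in UTF-16.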

> How are Unicode and UTF characters stored?

The story is long, but I would send you to the standard (free):
http://www.unicode.org/versions/Unic...bookmarks.html
And if you have to get deep into this, I would recommend
http://www.amazon.com/gp/product/0201700522
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Jun 30 '06 #6
> Anyway, long story short: UCS2 = before UTF-8/surrogates mechanism was
> introduced.

Correction: UCS2 = before UTF-16/surrogates mechanism

--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Jul 1 '06 #7
