Unicode and UTF-8 / UTF-16

Hi all,

Can someone tell me the difference between Unicode and UTF-8 or utf 18, and
which one supports more characters?

Which one should I use to support UCS-2 characters?

I want to use UCS-2 characters in StreamReader and StreamWriter.

How are Unicode and UTF characters stored?

Please help me.

thanks in advance.

Jun 29 '06 #1
archana wrote:
> Can someone tell me the difference between Unicode and UTF-8 or utf 18, and
> which one supports more characters?
>
> Which one should I use to support UCS-2 characters?
>
> I want to use UCS-2 characters in StreamReader and StreamWriter.
>
> How are Unicode and UTF characters stored?


See http://www.pobox.com/~skeet/csharp/unicode.html

I'm always hazy about the difference between UCS-2 and UTF-16 - it's
almost certainly to do with surrogate pairs, if there is a difference -
but you can get a UTF-16 encoding with Encoding.Unicode.

Jon

Jun 29 '06 #2
Unicode is a character set, just like UCS.

UTF-8 and UTF-16 are UCS Transformation Formats. As Unicode and UCS are
effectively synonymous, UTF-8 and UTF-16 are used to encode Unicode strings.

In UTF-16 the characters are encoded as 16 bit code units (two bytes each);
characters beyond the 16 bit range take two code units (a surrogate pair).
UTF-16 and UCS-2 are identical for all characters that UCS-2 handles.
You can treat UCS-2 data as UTF-16 without any problems.

In UTF-8 the ASCII characters are encoded as 8 bit sequences (one
byte). Other characters take two, three or four bytes.
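
A quick way to check these sizes is to encode a few characters and count the bytes. The sketch below is in Python purely for illustration (the thread itself is about .NET), and the sample characters and the helper name `encoded_lengths` are my own choices, not something from the posts above. It shows UTF-8 using one to four bytes per character and UTF-16 using two or four.

```python
# Illustrative sample characters (not from the thread), written as
# escapes so the source file stays pure ASCII.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" is used so that no byte order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in SAMPLES:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Only the last sample lies outside the 16 bit range, which is why it is the only one that needs four bytes in UTF-16.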

As the character type in .NET is a 16 bit Unicode character, it's
synonymous with the UCS BMP (Basic Multilingual Plane) that UCS-2 handles.

In conclusion, in .NET the Unicode and UCS BMP character sets are the
same, and UCS-2 and UTF-16 are the same.

There is no encoding in UCS that corresponds to UTF-8. If you export
data to something that only handles UCS, you have to use UTF-16.

Jun 29 '06 #3
Jon Skeet [C# MVP] wrote:
> archana wrote:
> > Can someone tell me the difference between Unicode and UTF-8 or utf 18, and
> > which one supports more characters?
> >
> > Which one should I use to support UCS-2 characters?
> >
> > I want to use UCS-2 characters in StreamReader and StreamWriter.
> >
> > How are Unicode and UTF characters stored?
>
> See http://www.pobox.com/~skeet/csharp/unicode.html
>
> I'm always hazy about the difference between UCS-2 and UTF-16 - it's
> almost certainly to do with surrogate pairs, if there is a difference -
> but you can get a UTF-16 encoding with Encoding.Unicode.
>
> Jon


From what I can gather, the only difference is that UTF-16 is capable
of encoding the full range of Unicode characters (up to U+10FFFF), while UCS-2
only handles the 16 bit range specified as the UCS BMP (Basic
Multilingual Plane).

As the Char datatype in .NET is a 16 bit data type, it doesn't handle
any characters that UCS-2 doesn't handle. As I understand it, that would
make UTF-16 and UCS-2 synonymous in .NET.
Jun 29 '06 #4
Göran Andersson <gu***@guffa.com> wrote:
> From what I can gather, the only difference is that UTF-16 is capable
> of encoding the full range of Unicode characters (up to U+10FFFF), while UCS-2
> only handles the 16 bit range specified as the UCS BMP (Basic
> Multilingual Plane).
>
> As the Char datatype in .NET is a 16 bit data type, it doesn't handle
> any characters that UCS-2 doesn't handle. As I understand it, that would
> make UTF-16 and UCS-2 synonymous in .NET.


.NET chars have surrogate pair forms (check out Char.IsHighSurrogate()
and Char.IsLowSurrogate()) combining two characters to form a single
abstract character. Thus, the number of physical characters in a .NET
string may be greater than the number of actual, abstract characters.
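
To see the mismatch in numbers, here is a small sketch, again in Python rather than C# and with helper names of my own. It counts 16 bit code units (which is what a .NET string's Length reports) next to code points for the same text. The sample string is an assumption chosen because U+1D11E lies outside the BMP.

```python
def utf16_code_units(text):
    """Number of 16 bit UTF-16 code units needed for the text."""
    # Each code unit is two bytes; "utf-16-le" adds no byte order mark.
    return len(text.encode("utf-16-le")) // 2

def code_points(text):
    """Number of Unicode code points (Python's len() counts these)."""
    return len(text)

sample = "a\U0001D11E"  # 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF
print(utf16_code_units(sample), code_points(sample))
```

The clef takes a high and a low surrogate, so the two counts differ by one for this string.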

-- Barry

--
http://barrkel.blogspot.com/
Jun 29 '06 #5
> As the Char datatype in .NET is a 16 bit data type, it doesn't handle
> any characters that UCS-2 doesn't handle. As I understand it, that would
> make UTF-16 and UCS-2 synonymous in .NET.

No. UTF-16 is a superset of UCS2. And .NET is UTF-16, not UCS2.

Short example
You decide initially that 10 digits is enough to encode a certain character
set.
So you can have
0 1 2 3 4 5 6 7 8 9

Later on, you discover this is not true, and you need a way to represent
more. But you have some areas that are not allocated yet in your encoding, so
you can reuse that:
0 1 [ 2 3 4 | 5 6 7 ] 8 9
Let's call the 2-4 range "high surrogate" and the 5-7 "low surrogate"

Then you can represent stuff like this:
0 1 8 9 = 4 values
(you are not allowed to use the surrogate area for real characters)
but you can also represent characters using two code units:
25 26 27 35 36 37 45 46 47 = 9 values
And you have a way to map 25 => 10, 26=>11, ... 47=>18

So you end up being able to represent 13 values!

The covered range is:
10 + HighSurrogate * LowSurrogate
= 10 + 9 = 19
And the number of useful codes for encoding (you cannot use the surrogates on their own) is:
10 + HighSurrogate * LowSurrogate - HighSurrogate - LowSurrogate
= 19 - 3 - 3 = 13 = number of characters that you can now encode
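
The toy scheme is small enough to run. The following is only a sketch of it under my own naming (`decode`, `representable`, `HIGH`, `LOW` are hypothetical, not from any library), using the mapping given above where 25 stands for 10 and 47 for 18.

```python
# Toy model from the post: ten code units, 0-9, with 2-4 reserved as
# "high surrogates" and 5-7 as "low surrogates".
HIGH = (2, 3, 4)
LOW = (5, 6, 7)

def decode(units):
    """Turn a sequence of toy code units into the values they represent."""
    values = []
    i = 0
    while i < len(units):
        unit = units[i]
        if unit in HIGH:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or units[i + 1] not in LOW:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            # 25 -> 10, 26 -> 11, ... 47 -> 18, as in the post.
            values.append(10 + (unit - HIGH[0]) * len(LOW) + (low - LOW[0]))
            i += 2
        elif unit in LOW:
            raise ValueError("unpaired low surrogate")
        else:
            values.append(unit)  # 0, 1, 8 and 9 stand for themselves
            i += 1
    return values

def representable():
    """List every value the toy scheme can encode."""
    singles = [u for u in range(10) if u not in HIGH and u not in LOW]
    pairs = [decode([h, l])[0] for h in HIGH for l in LOW]
    return singles + pairs

print(representable())
print(len(representable()))
```

The list has four single-unit values and nine paired ones, which is the 13 worked out above.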
Now, for Unicode before surrogates were introduced you had
0000 - FFFF
But when it turned out that more than FFFF code points were needed,
the mechanism described above was created (at another scale):
0000 0001 0002 0003 ... D7FF [ D800 - DBFF | DC00 - DFFF ] E000 ... FFFF
D800 - DBFF = high surrogates
DC00 - DFFF = low surrogates

So what you can represent is:
0000 0001 0002 0003 ... D7FF E000 ... FFFF
and you add the stuff above the BMP with one high and one low surrogate:
D800 DC00 D800 DC01 .... D800 DFFF
D801 DC00 D801 DC01 .... D801 DFFF
....
DBFF DC00 DBFF DC01 .... DBFF DFFF

Covered range:
FFFF + ( DBFF - D800 + 1 ) x (DFFF - DC00 + 1 ) =
FFFF + 0400 x 0400 = 10FFFF
Wow! Exactly what is covered by UTF-16! Coincidence?
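
The same mapping at full scale can be written in a few lines. This is a sketch using the usual UTF-16 surrogate arithmetic; the function names are mine, and Python is used only because it is easy to run, not because the thread mentions it.

```python
def to_surrogates(code_point):
    """Split a code point above FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 10000-10FFFF use surrogate pairs")
    offset = code_point - 0x10000
    # Top 10 bits go to the high surrogate, bottom 10 bits to the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")
print(f"{from_surrogates(0xDBFF, 0xDFFF):X}")
```

The first pair (D800 DC00) maps to 10000 and the last pair (DBFF DFFF) maps to 10FFFF, so the 0400 x 0400 pairs exactly fill the range above the BMP.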

Number of code points available for encoding (the range 0000 - 10FFFF holds
110000 values, counting 0000 itself):
110000 - 0400 - 0400 = 10F800 =
1112064 (decimal)
If you read http://www.unicode.org/book/uc20ch1.html you will find
that "more than 1 million characters can be encoded"
Well, the 1112064 value is the "technically possible" value, but you should
exclude reserved areas, private use areas and others.
Anyway, long story short: UCS2 = before UTF-8/surrogates mechanism was
introduced.
When an application is surrogate aware, you can say it is utf-16.
If it is not surrogate aware, then it is probably ucs2

And .NET is UTF-16
=================================================
To answer the original questions:
> Can someone tell me the difference between Unicode and UTF-8 or utf 18, and
> which one supports more characters?

There is no utf-18, it is utf-16.
Unicode is a "coded character set", basically mapping characters to numbers
(A=0x41, B=0x42 and so on).
UTF-8 and UTF-16 are different ways of representing this mapping.
And there is no coverage difference.

You can compare it (in a way) to writing the same numbers in different bases.
If you say A=0x41, B=0x42 in hex
or if you say A=65, B=66 in decimal
or if you say A=0101, B=0102 in octal
it is the same thing.
So your utf-8, utf-16 question is a bit like asking "hex or decimal, which
one can represent more numbers?" Answer: they are the same.
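
One way to convince yourself is to push the same code points through both encodings and check that each comes back unchanged. A minimal sketch follows; the list of sample code points and the helper `round_trips` are assumptions of mine, picked to cover ASCII, the BMP and the range above it.

```python
# Sample code points (illustrative): ASCII, Latin-1, BMP, and above the BMP.
SAMPLE_CODE_POINTS = [0x41, 0xE9, 0x20AC, 0xFFFD, 0x10000, 0x1F600, 0x10FFFF]

def round_trips(code_point, encoding):
    """True if the code point survives encode followed by decode."""
    ch = chr(code_point)
    return ch.encode(encoding).decode(encoding) == ch

for cp in SAMPLE_CODE_POINTS:
    print(f"U+{cp:04X}", round_trips(cp, "utf-8"), round_trips(cp, "utf-16-le"))
```

The byte sequences differ between the two encodings, but the set of characters they can carry does not.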
See some official standard here:
http://www.unicode.org/versions/Unic...h02.pdf#G13708
and here:
http://www.unicode.org/reports/tr17/index.html
or here
http://scripts.sil.org/cms/scripts/p...i&item_id=IWS-Chapter04a#96f19a02

> Which one should I use to support UCS-2 characters?
> I want to use UCS-2 characters in StreamReader and StreamWriter.

Use utf-16. It is a superset of ucs2 and is the one supported by all the .NET
API.
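
In .NET this would mean handing Encoding.Unicode (mentioned earlier in the thread) to the StreamReader and StreamWriter constructors. As a runnable stand-in, here is the same idea sketched in Python: write text to a file as UTF-16, look at the first two bytes, and read it back. The file name, sample text and the helper `write_then_read` are hypothetical, and Python's "utf-16" codec is only an analogue of the .NET classes, not the same API.

```python
import os
import tempfile

def write_then_read(text):
    """Write text as UTF-16, then return (first two bytes, text read back)."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.txt")  # hypothetical file name
        # The "utf-16" codec writes a byte order mark before the text.
        with open(path, "w", encoding="utf-16") as writer:
            writer.write(text)
        with open(path, "rb") as raw:
            first_two = raw.read(2)
        # On reading, the byte order mark is consumed, not returned as text.
        with open(path, "r", encoding="utf-16") as reader:
            return first_two, reader.read()

bom, text = write_then_read("caf\u00e9 \U0001F600")
print(bom.hex(), len(text))
```

The text contains one character above the BMP, so the round trip also shows that a UTF-16 reader hands surrogate pairs back as the original character.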

> How are Unicode and UTF characters stored?

The story is long, but I would send you to the standard (free):
http://www.unicode.org/versions/Unic...bookmarks.html
And if you have to get deep into this, I would recommend
http://www.amazon.com/gp/product/0201700522
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Jun 30 '06 #6
> Anyway, long story short: UCS2 = before UTF-8/surrogates mechanism was
> introduced.

Correction: UCS2 = before UTF-16/surrogates mechanism

--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Jul 1 '06 #7
