VK wrote:
[snip]
>>It's a character encoding: characters are encoded as an integer
>>within a certain "codespace", namely the range 0..10FFFF.
>Unicode is a charset (set of characters)
The terms "character set" and "character encoding" are synonymous;
Unicode, however, is not defined in terms of the former.
>with each character unit represented by words (in the programming
>sense) with the smallest word consisting of 2 bytes (16 bits).
If by "character unit" you mean code point, that's nonsense. A code
point is an integer, simple as that. How it is represented varies.
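
For what it's worth, that is easy to demonstrate. The snippet below is
just an illustration in Python (my choice of tool, nothing from the
thread), showing one and the same code point serialised under three
different encoding forms:

  >>> ch = '\u00e9'                  # the code point U+00E9, i.e. the integer 0xE9
  >>> ch.encode('utf-8').hex()       # two octets under UTF-8
  'c3a9'
  >>> ch.encode('utf-16-be').hex()   # one 16-bit code unit under UTF-16
  '00e9'
  >>> ch.encode('utf-32-be').hex()   # one 32-bit code unit under UTF-32
  '000000e9'

Same integer, three different representations.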
>This way the range doesn't go from 0: there is not such character in
>Unicode.
    In the Unicode Standard, the codespace consists of the integers
    from 0 to 10FFFF [base 16], comprising 1,114,112 code points
    available for assigning the repertoire of abstract characters.
        -- 2.4 Code Points and Characters,
           The Unicode Standard, Version 4.1.0
>Unicode starts from the character 0x0000.
The Unicode codespace starts from the integer 0. The first assigned
character exists at code point 0.
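
Both points can be checked mechanically; a Python session (again, just
a convenient reference tool, not part of the original exchange) will do:

  >>> 0x10FFFF + 1                   # size of the codespace quoted above
  1114112
  >>> import unicodedata
  >>> unicodedata.category('\x00')   # code point 0 is assigned: NULL, a control character
  'Cc'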
>Again you are thinking and talking about character entities, bytes,
>Unicode and UTF-8 at once:
No, I'm not. I used terms that are distinctly abstract.
It seems to me that you are mistaking a notational convention -
referring to characters with the form U+xxxx - for some sort of definition.
>which is not helpful if one tries to understand the matter.
Quite. Why then do you try so hard to misrepresent technical issues?
[snip]
"lower-ASCII" in the sense "0-127 characters" or "US ASCII" is good
enough for the matter.
I'm not really going to debate the issue, so long as you understand what
I mean when I refer to ASCII.
>>whereas UTF-8 is an 8-bit, variable-width format.
>Again you are mixing charsets and bytes.
No, I'm not.
>UTF-8 is a transport encoding representing Unicode characters using
>"US ASCII" only character sequences.
My point was that, given your own definition of (US-)ASCII above, this
sort of statement is absurd. The most significant bit matters in the
octets generated by the UTF-8 encoding scheme - every scalar value
greater than 7F is serialised to two or more octets, each of which has
the MSB set - yet you are describing it in terms of an encoding in
which only the lowest 7 bits are used to represent characters.
For example, U+0430 is represented by the octets D0 and B0. In binary,
these octets are 11010000 and 10110000, respectively. If UTF-8 uses "US
ASCII only character sequences", and you agree that US-ASCII is strictly
7-bit, do you care to explain that evident contradiction?
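
Anyone who wants to verify the arithmetic can do so with a couple of
lines of Python (my tool of choice for the demonstration; nothing here
depends on the language):

  >>> octets = '\u0430'.encode('utf-8')   # U+0430, CYRILLIC SMALL LETTER A
  >>> [format(b, '08b') for b in octets]
  ['11010000', '10110000']
  >>> all(b & 0x80 for b in octets)       # the MSB is set in every octet
  True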
>>a document might be stored on disk using UTF-8, and then transmitted
>>verbatim across a network.
>Technically well possible but for what reason? ...
Efficiency. Most Western texts will be smaller when the UTF-8 encoding
scheme is employed (rather than UTF-16), as the 0..7F code points are
the most common, encompassing letters, digits, and punctuation.
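
As a rough illustration (Python again, and the exact byte counts
obviously depend on the sample text):

  >>> text = 'Mostly ASCII: letters, digits and punctuation.'
  >>> len(text.encode('utf-8')), len(text.encode('utf-16-le'))
  (46, 92)

Half the size under UTF-8, since every character in that sample falls
in the 0..7F range.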
>Such document is not viewable without specially written parser and
>not directly usable for Internet.
Oh dear. Of all of the documents that use one of the Unicode encoding
schemes on the Web, I should think that the /vast/ majority of them use
UTF-8. As for "specially written parser", XML processors are required to
accept UTF-8 input and browsers at least as far back as NN4 also do so.
[snip]
Mike