Thomas 'PointedEars' Lahn wrote:
Csaba Gabor wrote:
I just have one question at this point. As I mentioned in my original
post, String.fromCharCode(2500) == "\u09C4" => %E0%A7%84
Those are percent-escaped representations of the three UTF-8 code ...
<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.
Thanks Thomas, I like links.
It let me figure out the Unicode / UTF-8 mapping.
He's got a function, convertCP2UTF8 (spaceSeparatedHexValues) that does
essentially:
n = ...unicodeValue...
if (n <= 0x7F) return dec2hex2(n);
else if (n <= 0x7FF) return dec2hex2(0xC0 | ((n>>6) & 0x1F)) + ' ' +
    dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0xFFFF) return dec2hex2(0xE0 | ((n>>12) & 0x0F)) + ' ' +
    dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
    dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0x10FFFF) return dec2hex2(0xF0 | ((n>>18) & 0x07)) + ' ' +
    dec2hex2(0x80 | ((n>>12) & 0x3F)) + ' ' +
    dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
    dec2hex2(0x80 | (n & 0x3F));
else return '!erreur ' + dec2hex(n);
(Each expression starts on the same line as its return; in real
JavaScript a line break right after return ends the statement.)
In words: if your non-negative integer (the char code) is not less than
17*16^4 (that is, 0x110000), report an error,
and if it is 7 bits or fewer (in the range [0,2^7), that is), just
return the two-digit hex representation.
Otherwise, let k be the number of bits in your number. That is to say,
k is the smallest integer such that 2^k is greater than your number
(e.g. [2^(k-1),2^k)->k; [128,256)->8; [8,16)->4; [4,8)->3; [2,4)->2;
1->1; 0->0). Now, starting at the low end, section the number into
m=ceiling((k-1)/5) groups: 6 bits in each of the m-1 low groups, with
any leftover bits in the final (high) group, zero-padded on the left
to 7-m bits. Prefix each of the low groups with (bits) 10 (that is
to say, OR them with (hex) 80). Prefix the high group with the m+1
bits corresponding to 2^(m+1)-2. That is to say, with 2 groups prefix
the high one with (bits) 110, with 3 groups use 1110, and with 4
groups use 11110.
Thus, if your number has 7 bits or fewer, it takes two hex digits (one
byte) to represent. With 8 to 11 bits (inclusive) it takes four hex
digits, with 12 to 16 bits it takes six, and with 17 to 21 bits it
takes eight hex digits.
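
The description in words can be tried out directly. What follows is only a sketch of my reading of it, not the code from the linked page: bitLength, utf8BytesByGroups and toHex are names I made up, and the sketch does not reject surrogate values (0xD800 to 0xDFFF).

```javascript
// Number of significant bits in n (0 -> 0, 1 -> 1, 2..3 -> 2, 4..7 -> 3, ...).
function bitLength(n) {
  var k = 0;
  while (n > 0) { k++; n = Math.floor(n / 2); }
  return k;
}

// Encode one char code by sectioning it into 6-bit groups (hypothetical name).
function utf8BytesByGroups(n) {
  if (n < 0 || n >= 17 * Math.pow(16, 4)) {
    throw new RangeError('not a valid char code: ' + n);
  }
  var k = bitLength(n);
  if (k <= 7) return [n];                  // 7 bits or fewer: one byte, as is
  var m = Math.ceil((k - 1) / 5);          // number of groups = number of bytes
  var bytes = [];
  for (var i = 1; i < m; i++) {            // the m-1 low groups, 6 bits each
    bytes.unshift(0x80 | (n & 0x3F));      // prefix with bits 10
    n = n >> 6;
  }
  var prefix = Math.pow(2, m + 1) - 2;     // 110, 1110 or 11110
  bytes.unshift((prefix << (7 - m)) | n);  // high group takes the remaining 7-m bits
  return bytes;
}

// Two uppercase hex digits per byte, separated by spaces.
function toHex(bytes) {
  return bytes.map(function (b) {
    return (b < 16 ? '0' : '') + b.toString(16).toUpperCase();
  }).join(' ');
}
```

For instance, toHex(utf8BytesByGroups(2500)) gives "E0 A7 84", the same bytes as the percent-escaped form quoted at the top.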
Example: 2500 -> 0x9C4 ->
1001 1100 0100 so k=12 and m=3 ->
(0000) 100111 000100 (that first group got no bits so it is implied) ->
(1110)0000 (10)100111 (10)000100 ->
E0 A7 84
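
The worked example can be cross-checked against the built-in functions, since encodeURIComponent percent-escapes the UTF-8 bytes of its argument. A quick check (the variable names are mine):

```javascript
var ch = String.fromCharCode(2500);          // the same character as "\u09C4"
var escaped = encodeURIComponent(ch);        // percent-escaped UTF-8 bytes
var roundTrip = decodeURIComponent('%E0%A7%84').charCodeAt(0);
```

Here escaped comes out as "%E0%A7%84" and roundTrip as 2500, matching the bytes E0 A7 84 worked out by hand.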
With this it's also easy to see how to work from UTF-8 to Unicode.
Given a byte, scan from the high (left) side for the first 0 bit.
If the high bit is 0, you are done and you have a "normal" character.
Otherwise, the character is specified by the next m bytes (including
the one the scan started with), where m is the number of 1s
encountered before finding that first 0 bit (110 -> 2 bytes,
1110 -> 3, 11110 -> 4). Knock out all the bits up to and including
that first 0 bit, and the top 2 bits of each of the remaining bytes,
and concatenate the remaining bits to get the char code.
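
The reverse direction can be sketched the same way. This is a minimal sketch under my own assumptions: utf8Decode is a made-up name, it takes the number of leading 1 bits in the first byte as the byte count of the sequence, and it does not check that the following bytes really start with bits 10 or that the shortest possible form was used.

```javascript
// Decode the UTF-8 sequence that starts at bytes[pos] (hypothetical helper).
// Returns the char code and how many bytes were consumed.
function utf8Decode(bytes, pos) {
  var first = bytes[pos];
  var ones = 0;
  // Count the 1 bits before the first 0 bit, scanning from the high side.
  for (var mask = 0x80; first & mask; mask >>= 1) ones++;
  if (ones === 0) return { codePoint: first, length: 1 };  // "normal" character
  if (ones === 1 || ones > 4) {
    throw new Error('not a valid first byte: ' + first);
  }
  var cp = first & (0x7F >> ones);             // knock out the prefix bits
  for (var i = 1; i < ones; i++) {
    cp = (cp << 6) | (bytes[pos + i] & 0x3F);  // drop the top 2 bits, append 6
  }
  return { codePoint: cp, length: ones };
}
```

With the bytes from the example, utf8Decode([0xE0, 0xA7, 0x84], 0).codePoint is 2500.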
Thus, we see the correspondence between UTF-8 and Unicode.
Csaba
I found the following sites useful for seeing mappings and glyphs:
http://www.unicode.org/charts/About.html and
http://www.macchiato.com/unicode/chart/