Thomas 'PointedEars' Lahn wrote:

Csaba Gabor wrote:

I just have one question at this point. As I mentioned in my original

post, String.fromCharCode(2500) == "\u09C4" => %E0%A7%84

Those are percent-escaped representations of the three UTF-8 code ...

<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

Thanks Thomas, I like links.

It let me figure out the Unicode / UTF-8 mapping.

He's got a function, convertCP2UTF8 (spaceSeparatedHexValues) that does

essentially:

  n = ...unicodeValue...
  // up to 7 bits: one byte, 0xxxxxxx
  if (n <= 0x7F) return dec2hex2(n);
  // 8 to 11 bits: two bytes, 110xxxxx 10xxxxxx
  else if (n <= 0x7FF) return (
    dec2hex2(0xC0 | ((n>>6) & 0x1F)) + ' ' +
    dec2hex2(0x80 | (n & 0x3F)));
  // 12 to 16 bits: three bytes, 1110xxxx 10xxxxxx 10xxxxxx
  else if (n <= 0xFFFF) return (
    dec2hex2(0xE0 | ((n>>12) & 0x0F)) + ' ' +
    dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
    dec2hex2(0x80 | (n & 0x3F)));
  // 17 to 21 bits: four bytes, 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  else if (n <= 0x10FFFF) return (
    dec2hex2(0xF0 | ((n>>18) & 0x07)) + ' ' +
    dec2hex2(0x80 | ((n>>12) & 0x3F)) + ' ' +
    dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
    dec2hex2(0x80 | (n & 0x3F)));
  else return '!erreur ' + dec2hex(n);
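For anyone who wants to paste something into a console, here is a minimal self-contained sketch of the same four branches. The names codePointToUtf8Hex and hex2 are mine (hex2 is only my guess at what dec2hex2 does, namely format a value as two uppercase hex digits), and I throw on out-of-range input instead of returning an error string.

```javascript
// Hypothetical stand-in for dec2hex2: two uppercase hex digits.
function hex2(n) {
  return n.toString(16).toUpperCase().padStart(2, "0");
}

// Hypothetical name. Same four branches as the sketch above.
// Like that sketch, it does not reject surrogates (0xD800..0xDFFF).
function codePointToUtf8Hex(n) {
  if (!Number.isInteger(n) || n < 0 || n > 0x10FFFF) {
    throw new RangeError("not a code point: " + n);
  }
  let bytes;
  if (n <= 0x7F) {
    bytes = [n];
  } else if (n <= 0x7FF) {
    bytes = [0xC0 | ((n >> 6) & 0x1F), 0x80 | (n & 0x3F)];
  } else if (n <= 0xFFFF) {
    bytes = [
      0xE0 | ((n >> 12) & 0x0F),
      0x80 | ((n >> 6) & 0x3F),
      0x80 | (n & 0x3F),
    ];
  } else {
    bytes = [
      0xF0 | ((n >> 18) & 0x07),
      0x80 | ((n >> 12) & 0x3F),
      0x80 | ((n >> 6) & 0x3F),
      0x80 | (n & 0x3F),
    ];
  }
  return bytes.map(hex2).join(" ");
}

console.log(codePointToUtf8Hex(2500)); // E0 A7 84
```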

In words: if your non-negative integer (the char code) is not less than
17*16^4 (that is, 0x110000), report an error. If it has 7 bits or fewer
(that is, it lies in the range [0,2^7)), just return its two-digit hex
representation.

Otherwise, let k be the number of bits in your number. That is to say,
k is the smallest integer such that 2^k is greater than your number
(e.g. [2^(k-1),2^k)->k; [128,256)->8; [8,16)->4; [4,8)->3; [2,4)->2;
1->1; 0->0). Now, starting at the low end, section the number into
m=ceiling((k-1)/5) groups: m-1 groups of 6 bits each, with whatever
bits are left over (possibly none) in the final (high) group. Prefix
each of the low groups with (bits) 10 (that is to say, OR them with
(hex) 80). Prefix the high group with the m+1 bits corresponding to
2^(m+1)-2, padding with 0 bits in between so that it fills a byte.
That is to say, prefix the high group with (bits) 110 when m=2, with
1110 when m=3, or with 11110 when m=4.
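The k and m recipe can be coded directly, without the four explicit branches. This is only a sketch under my own naming (utf8ByGroups is hypothetical); it uses 32 - Math.clz32(n) as the bit count k and returns the bytes as plain numbers.

```javascript
// Hypothetical name. Encodes code point n by the "k bits, m groups" recipe.
function utf8ByGroups(n) {
  if (!Number.isInteger(n) || n < 0 || n >= 17 * 16 ** 4) {
    throw new RangeError("not a code point: " + n);
  }
  // k = number of bits: smallest k with 2^k > n (so 0 -> 0).
  const k = 32 - Math.clz32(n);
  if (k <= 7) return [n];

  // m = number of groups, which is also the number of bytes.
  const m = Math.ceil((k - 1) / 5);
  const bytes = [];
  let rest = n;
  // Peel off m-1 low groups of 6 bits, each prefixed with bits 10.
  for (let i = 0; i < m - 1; i++) {
    bytes.unshift(0x80 | (rest & 0x3F));
    rest >>= 6;
  }
  // High group: the m+1 prefix bits of 2^(m+1)-2, moved to the top of the byte.
  const prefix = (2 ** (m + 1) - 2) << (7 - m);
  bytes.unshift(prefix | rest);
  return bytes;
}

console.log(utf8ByGroups(2500).map((b) => b.toString(16)).join(" ")); // e0 a7 84
```

The shift by 7 - m is what places the m+1 prefix bits at the top of an 8-bit byte, leaving 7 - m bits for the leftover high group.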

Thus, if your number has 7 bits or fewer, it takes two hex digits (one
byte) to represent. From 8 to 11 bits (inclusive) it takes four hex
digits, from 12 to 16 bits (inclusive) it takes six, and from 17 to 21
bits (inclusive) it takes eight.
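The bit-count table is small enough to write down as a function. A rough sketch, with utf8HexDigits being a name I made up; it also enforces the 0x10FFFF ceiling, since not every 21-bit value is allowed.

```javascript
// Hypothetical name. Number of hex digits in the UTF-8 form of code point n.
function utf8HexDigits(n) {
  if (!Number.isInteger(n) || n < 0 || n > 0x10FFFF) {
    throw new RangeError("not a code point: " + n);
  }
  const k = 32 - Math.clz32(n); // number of bits in n
  if (k <= 7) return 2;   // one byte
  if (k <= 11) return 4;  // two bytes
  if (k <= 16) return 6;  // three bytes
  return 8;               // 17 to 21 bits: four bytes
}

console.log(utf8HexDigits(2500)); // 6
```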

Example: 2500 -> 0x9C4 ->

1001 1100 0100 so k=12 and m=3 ->

(0000) 100111 000100 (that first group got no bits so it is implied) ->

(1110)0000 (10)100111 (10)000100 ->

E0 A7 84
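The same steps can be replayed in script and checked against encodeURIComponent. A small sketch; traceThreeByte is a hypothetical helper of mine that only handles the three-byte range (0x800 to 0xFFFF).

```javascript
// Hypothetical helper: shows the intermediate steps for a three-byte code point.
function traceThreeByte(n) {
  // High group (4 bits), then two 6-bit groups.
  const groups = [(n >> 12) & 0x0F, (n >> 6) & 0x3F, n & 0x3F];
  // Prefix 1110 on the high group, 10 on the other two.
  const bytes = [0xE0 | groups[0], 0x80 | groups[1], 0x80 | groups[2]];
  return {
    bits: n.toString(2),
    groups: groups.map((g, i) => g.toString(2).padStart(i === 0 ? 4 : 6, "0")),
    escaped: bytes
      .map((b) => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
      .join(""),
  };
}

console.log(traceThreeByte(2500).bits);             // 100111000100
console.log(traceThreeByte(2500).groups.join(" ")); // 0000 100111 000100
console.log(traceThreeByte(2500).escaped);          // %E0%A7%84
console.log(encodeURIComponent(String.fromCharCode(2500))); // %E0%A7%84
```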

With this it's also easy to see how to work from UTF-8 back to Unicode.
Given a byte, scan from the high (left) side for the first 0 bit.
If the high bit itself is 0, you are done and you have a "normal"
(ASCII) character. Otherwise, the character is specified by the next m
bytes (including the one the scan started with), where m is the number
of 1s encountered before finding that first 0 bit. Knock out all the
bits up to and including that first 0 bit, and the top 2 bits of each
of the remaining m-1 bytes, and concatenate the remaining bits to get
the char code.
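Going the other way can be sketched like this. The name utf8BytesToCodePoint is hypothetical, the input is assumed to be an array of byte values starting at a lead byte, and the checks are minimal (in particular, overlong encodings are not rejected).

```javascript
// Hypothetical name. Decodes one character from the front of a byte array.
function utf8BytesToCodePoint(bytes) {
  const lead = bytes[0];
  // Count the 1 bits before the first 0 bit, scanning from the high side.
  let ones = 0;
  while (ones < 8 && (lead & (0x80 >> ones))) ones++;

  if (ones === 0) return lead; // high bit is 0: a plain 7-bit character
  if (ones === 1 || ones > 4) throw new RangeError("not a valid lead byte");
  if (bytes.length < ones) throw new RangeError("truncated sequence");

  // Knock out the prefix (the 1s and the first 0) of the lead byte.
  let cp = lead & (0x7F >> ones);
  // Knock out the top 2 bits of each following byte and append its 6 bits.
  for (let i = 1; i < ones; i++) {
    if ((bytes[i] & 0xC0) !== 0x80) throw new RangeError("bad continuation byte");
    cp = (cp << 6) | (bytes[i] & 0x3F);
  }
  return cp;
}

console.log(utf8BytesToCodePoint([0xE0, 0xA7, 0x84])); // 2500
```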

Thus, we see the correspondence between UTF-8 and Unicode.

Csaba

I found the following sites useful for seeing mappings and glyphs:

http://www.unicode.org/charts/About.html and

http://www.macchiato.com/unicode/chart/