today i ran some tests...
i tested some unicode symbols that are above the 16-bit limit
(gothic: http://www.unicode.org/charts/PDF/U10330.pdf)
..
i played around with iconv and so on,
and in the end i created a utf-8 encoded text file
with the text "Marrakesh",
where the second 'a' was replaced with
GOTHIC_LETTER_AHSA (unicode value: 0x10330).
(i simply wrote "Marrakesh" to a text file, used iconv to convert it to
utf-32 big-endian, replaced the character in hexedit, and then converted
it back to utf-8 with iconv.)
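(btw, the same file can also be created directly from python, without the
iconv/hexedit round-trip; a minimal sketch in python 2 syntax, where
"utf8.txt" is simply the filename i read back below:)

# build the string in python and write it out as utf-8
text = u'Marr\U00010330kesh'      # second 'a' replaced by GOTHIC_LETTER_AHSA
f = open('utf8.txt', 'wb')
f.write(text.encode('utf-8'))
f.close()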
now i started python:
data = open("utf8.txt").read()
data 'Marr\xf0\x90\x8c\xb0kesh' text = data.decode("utf8")
text u'Marr\U00010330kesh'
so far it seemed ok.
then i did:
>>> len(text)
10
this is wrong. the length should be 9.
and why?
>>> text[0]
u'M'
>>> text[1]
u'a'
>>> text[2]
u'r'
>>> text[3]
u'r'
>>> text[4]
u'\ud800'
>>> text[5]
u'\udf30'
>>> text[6]
u'k'
so text[4] (which should be u'\U00010330')
was split into two 16-bit values (text[4] and text[5]).
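(to double-check, i computed the surrogate pair for 0x10330 by hand; a
minimal sketch, assuming those two 16-bit values really are utf-16
surrogates, and they do match what i see above:)

cp = 0x10330
v = cp - 0x10000
high = 0xD800 + (v >> 10)     # -> 0xd800
low = 0xDC00 + (v & 0x3FF)    # -> 0xdf30
print hex(high), hex(low)     # prints 0xd800 0xdf30, i.e. u'\ud800' u'\udf30'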
i don't understand.
if the representation of 'text' is correct, why is the length wrong?
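(two things i tried while investigating; a minimal sketch, python 2 syntax:
sys.maxunicode should tell whether the interpreter stores unicode strings as
16-bit units, and codepoint_len is just a workaround helper i wrote to count
real code points by skipping the low surrogate of each pair:)

import sys
print hex(sys.maxunicode)   # 0xffff on a "narrow" build, 0x10ffff on a "wide" one

def codepoint_len(u):
    # a low surrogate (U+DC00..U+DFFF) is the second half of a pair,
    # so don't count it as a separate character
    return len([ch for ch in u if not (u'\udc00' <= ch <= u'\udfff')])

text = u'Marr\U00010330kesh'
print codepoint_len(text)   # -> 9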
btw, i understand that it's a very exotic character, but i tried for
example kwrite and gedit, and neither of them was able to display the
symbol, though both correctly identified it as ONE unknown symbol.
thanks,
gabor