473,795 Members | 2,630 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

python-unicode doesn't support >65535 symbols?

hi,

today i made some tests...

i tested some unicode symbols, that are above the 16bit limit
(gothic:http://www.unicode.org/charts/PDF/U10330.pdf)
..

i played around with iconv and so on,
so at the end i created an utf8 encoded text file,
with the text "Marrakesh" ,
where the second 'a' wes replaced with
GOTHIC_LETTER_A HSA (unicode-value:0x10330).

(i simply wrote the text file "Marrakesh" , used iconv to convert it to
utf32big-endian, and replaced the character in hexedit, then converted
with iconv back to utf8).

now i started python:
data = open("utf8.txt" ).read()
data 'Marr\xf0\x90\x 8c\xb0kesh' text = data.decode("ut f8")
text u'Marr\U0001033 0kesh'

so far it seemed ok.
then i did:
len(text) 10

this is wrong. the length should be 9.
and why?
text[0] u'M' text[1] u'a' text[2] u'r' text[3] u'r' text[4] u'\ud800' text[5] u'\udf30' text[6] u'k'


so text[3] (which should be \U00010330),
was split to 2 16bit values (text[3] and text[4]).

i don't understand.
if tthe representation of 'text' is correct, why is the length wrong?

btw. i understand that it's a very exotic character, but i tried for
example kwrite and gedit, and none of the was able to display the
symbol, but both successfully identified it as ONE unknown symbol.

thanks,
gabor


Jul 18 '05 #1
4 2716
gabor <ga***@z10n.net > writes:
i played around with iconv and so on,
so at the end i created an utf8 encoded text file,
with the text "Marrakesh" ,
where the second 'a' wes replaced with
GOTHIC_LETTER_A HSA (unicode-value:0x10330).

(i simply wrote the text file "Marrakesh" , used iconv to convert it to
utf32big-endian, and replaced the character in hexedit, then converted
with iconv back to utf8).

now i started python:
data = open("utf8.txt" ).read()
data 'Marr\xf0\x90\x 8c\xb0kesh' text = data.decode("ut f8")
text u'Marr\U0001033 0kesh'

so far it seemed ok.
then i did:
len(text) 10

this is wrong. the length should be 9.
I suspect you have a "narrow unicode" build of Python. You can make
yourself a "wide unicode" build easily enough.
and why?
text[0] u'M' text[1] u'a' text[2] u'r' text[3] u'r' text[4] u'\ud800' text[5] u'\udf30' text[6] u'k'


so text[3] (which should be \U00010330),
was split to 2 16bit values (text[3] and text[4]).

i don't understand.
if tthe representation of 'text' is correct, why is the length wrong?


I expect that this has to do with surrogates or some other unicode
thing that's beyond my understanding.. .

Cheers,
mwh

--
It's actually a corruption of "starling". They used to be carried.
Since they weighed a full pound (hence the name), they had to be
carried by two starlings in tandem, with a line between them.
-- Alan J Rosenthal explains "Pounds Sterling" on asr
Jul 18 '05 #2
gabor <ga***@z10n.net > wrote:
so text[3] (which should be \U00010330),
was split to 2 16bit values (text[3] and text[4]).
The default encoding for native Unicode strings in Python in UTF-16, which
cannot hold the extended planes beyond 0xFFFF in a single character. Instead,
it uses two 'surrogate' characters. Bit of a nasty hack, but that's what
Unicode does and there's nothing can be done about it now.

Python can be compiled to use UCS-4 for native Unicode strings if you prefer.
Then every conceptual 'character' from the Unicode repertoire will be one
item long. It'll eat more memory too of course.
if tthe representation of 'text' is correct, why is the length wrong?


The representation of 'text' you are seeing is just the nicely-readable
version output by Python 2.2+. Despite the \U sequence, it is actually still
stored internally as two UTF-16 surrogates. You'll see this if you enter
'\U00012345' into Python 2.0 or 2.1, which don't use the \U form to output
strings.

--
Andrew Clover
mailto:an*@doxd esk.com
http://www.doxdesk.com/
Jul 18 '05 #3
Andrew Clover wrote:
gabor <ga***@z10n.net > wrote:
so text[3] (which should be \U00010330),
was split to 2 16bit values (text[3] and text[4]).


The default encoding for native Unicode strings in Python in UTF-16,
which cannot hold the extended planes beyond 0xFFFF in a single
character.


That's not quite right. UTF-16 encodes unicode characters as either single
16 bit values and pairs of 16 bit values. However, one character is still
one character.

Python makes the mistake of exposing the internal representation instead of
the logical value of unicode objects. This means that, aside from space
optimization, unicode objects have no advantage over UTF-8 encoded plain
strings for storing unicode text.
--
Rainer Deyke - ra*****@eldwood .com - http://eldwood.com
Jul 18 '05 #4
"Rainer Deyke" <ra*****@eldwoo d.com> writes:
Python makes the mistake of exposing the internal representation instead of
the logical value of unicode objects. This means that, aside from space
optimization, unicode objects have no advantage over UTF-8 encoded plain
strings for storing unicode text.


That is not true. First, it is not "Python", but a specific Python
configuration - in "wide Unicode" builds, it uses UCS-4 internally.

In either build, len() and indexing addresses code units, not
characters: that is true.

However, it is not true that there is no advantage over UTF-8 encoded
byte strings. Instead, there are several advantages:
- In a UCS-4 build, Unicode characters and code units are in a 1:1
relationship
- In a UCS-2 build, Unicode characters and code units are in a 1:1
relationship as long as the application only ever processes BMP
characters.
- In either case, a Unicode object has inherent information about the
character set, which a UTF-8 byte string does not have. IOW, you know
what a Unicode object is, but you don't know (inherently) whether a
byte string is UTF-8.

Regards,
Martin

Jul 18 '05 #5

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
2464
by: eScrewDotCom | last post by:
eScrew Welcome to eScrew! eScrew is eScrew and this is eScrew story. eScrew will tell you eScrew story if you promise eScrew to consider eScrew story as joke. eScrew story is very funny. eScrew story is so funny that eScrew will have to take break from time to time because eScrew needs some rest from laughing. Oh boy, here it comes... eScrew funny laugh laughing screaming crying must stop can not take any more this is killing eScrew...
2
2385
by: Simon Newton | last post by:
Hi, I've just starting out embedding Python in C and get the following error when I try and import a module that in turn imports the math module. ImportError: /usr/lib/python2.4/lib-dynload/math.so: undefined symbol: PyExc_FloatingPointError The module is as follows:
11
1877
by: Mike Meyer | last post by:
Out of random curiosity, is there a PEP/thread/? that explains why Python symbols are restricted to 7-bit ascii? <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/home/mwm/ Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
47
3371
by: Pierre Barbier de Reuille | last post by:
Please, note that I am entirely open for every points on this proposal (which I do not dare yet to call PEP). Abstract ======== This proposal suggests to add symbols into Python. Symbols are objects whose representation within the code is more important than their actual value. Two symbols needs only to be
5
3231
by: wahn | last post by:
Hi, Here is a problem I came across and need some help with. I developed a little Python script with some classes which runs standalone and communicates with a database via sockets. So far everything works fine. I also successfully embedded the Python interpreter into a plugin for a commercial 3D package (in this case Houdini - http://www.sidefx.com/ - using the HDK). I can use the interpreter to run Python commands and get values back...
1
3082
by: freesteel | last post by:
I have posted about this problem before. SInce then I found a much better article to help with embedding python in a multithreaded application: http://www.linuxjournal.com/article/3641 I found this article very good and it clarified for me what needs doing. Now I have an example application that almost works, but it is far from reliable. If I run this several times I get crashes telling me that the heap is modified after deallocation....
3
3281
by: jefishman | last post by:
I have a Python (2.3.x) interpreter running embedded in a C++ application on a host machine. I would like to run a specific package on that host machine (numpy). I have managed to compile (cross-compile) the library, so that I have the python modules and the compiled .so files. I am relatively sure that these would import normally in a standard interpreter (they give bad magic on the build machine). However, I would like to be able...
852
28757
by: Mark Tarver | last post by:
How do you compare Python to Lisp? What specific advantages do you think that one has over the other? Note I'm not a Python person and I have no axes to grind here. This is just a question for my general education. Mark
4
1887
by: klappnase | last post by:
Hello, I use the tktreectrl Tk extension (http://tktreectrl.sourceforge.net) through the python wrapper module (http://klappnase.zexxo.net/TkinterTreectrl/index.html). With python-2.5 it seems that each time some text is inserted into the treectrl widget a segfault occurs (on linux, on windows2000 it seemed to work). Some people sent me mail describing the same problem, so I think it is not a problem with my installation of python-2.5....
13
6519
by: 7stud | last post by:
test1.py: -------------------- import shelve s = shelve.open("/Users/me/2testing/dir1/aaa.txt") s = "red" s.close() --------output:------ $ python test1.py
0
9519
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
10438
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
1
10164
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9042
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7540
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6780
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5437
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
5563
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
4113
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.