Bytes IT Community

python-unicode doesn't support >65535 symbols?

hi,

today i made some tests...

i tested some unicode symbols that are above the 16-bit limit
(gothic: http://www.unicode.org/charts/PDF/U10330.pdf).

i played around with iconv and so on,
so at the end i created an utf8 encoded text file,
with the text "Marrakesh",
where the second 'a' was replaced with
GOTHIC_LETTER_AHSA (unicode-value:0x10330).

(i simply wrote the text file "Marrakesh", used iconv to convert it to
utf32big-endian, and replaced the character in hexedit, then converted
with iconv back to utf8).

now i started python:

>>> data = open("utf8.txt").read()
>>> data
'Marr\xf0\x90\x8c\xb0kesh'
>>> text = data.decode("utf8")
>>> text
u'Marr\U00010330kesh'

so far it seemed ok.
then i did:

>>> len(text)
10

this is wrong. the length should be 9.
and why?
>>> text[0]
u'M'
>>> text[1]
u'a'
>>> text[2]
u'r'
>>> text[3]
u'r'
>>> text[4]
u'\ud800'
>>> text[5]
u'\udf30'
>>> text[6]
u'k'


so text[4] (which should be \U00010330)
was split into two 16-bit values (text[4] and text[5]).

i don't understand.
if the representation of 'text' is correct, why is the length wrong?
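(Editorial aside: the two values in the transcript fall straight out of the standard UTF-16 surrogate-pair arithmetic. A minimal sketch, runnable in any modern Python, that reproduces them for U+10330:)

```python
# Standard UTF-16 surrogate-pair arithmetic for a code point above
# U+FFFF; this yields the two 16-bit values seen at text[4] and text[5].
cp = 0x10330                # GOTHIC LETTER AHSA
v = cp - 0x10000            # 20-bit offset into the supplementary planes
high = 0xD800 + (v >> 10)   # high (lead) surrogate
low = 0xDC00 + (v & 0x3FF)  # low (trail) surrogate
print(hex(high), hex(low))  # 0xd800 0xdf30
```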

btw. i understand that it's a very exotic character, but i tried for
example kwrite and gedit, and none of them was able to display the
symbol, but both successfully identified it as ONE unknown symbol.

thanks,
gabor


Jul 18 '05 #1
4 Replies


gabor <ga***@z10n.net> writes:
> i played around with iconv and so on,
> so at the end i created an utf8 encoded text file,
> with the text "Marrakesh",
> where the second 'a' was replaced with
> GOTHIC_LETTER_AHSA (unicode-value:0x10330).
>
> (i simply wrote the text file "Marrakesh", used iconv to convert it to
> utf32big-endian, and replaced the character in hexedit, then converted
> with iconv back to utf8).
>
> now i started python:
>
> >>> data = open("utf8.txt").read()
> >>> data
> 'Marr\xf0\x90\x8c\xb0kesh'
> >>> text = data.decode("utf8")
> >>> text
> u'Marr\U00010330kesh'
>
> so far it seemed ok.
> then i did:
>
> >>> len(text)
> 10
>
> this is wrong. the length should be 9.

I suspect you have a "narrow unicode" build of Python. You can make
yourself a "wide unicode" build easily enough.

> and why?
>
> >>> text[4]
> u'\ud800'
> >>> text[5]
> u'\udf30'
>
> so text[4] (which should be \U00010330)
> was split into two 16-bit values (text[4] and text[5]).
>
> i don't understand.
> if the representation of 'text' is correct, why is the length wrong?

I expect that this has to do with surrogates or some other unicode
thing that's beyond my understanding...

Cheers,
mwh

--
It's actually a corruption of "starling". They used to be carried.
Since they weighed a full pound (hence the name), they had to be
carried by two starlings in tandem, with a line between them.
-- Alan J Rosenthal explains "Pounds Sterling" on asr
Jul 18 '05 #2

gabor <ga***@z10n.net> wrote:
> so text[4] (which should be \U00010330)
> was split into two 16-bit values (text[4] and text[5]).

The default encoding for native Unicode strings in Python is UTF-16, which
cannot hold the extended planes beyond 0xFFFF in a single character. Instead,
it uses two 'surrogate' characters. Bit of a nasty hack, but that's what
Unicode does and there's nothing that can be done about it now.

Python can be compiled to use UCS-4 for native Unicode strings if you prefer.
Then every conceptual 'character' from the Unicode repertoire will be one
item long. It'll eat more memory too of course.
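(Editorial note: a quick sketch to check which kind of build you are running; sys.maxunicode is the documented way to tell them apart.)

```python
import sys

# sys.maxunicode is 0xFFFF on a narrow (UTF-16) build and
# 0x10FFFF on a wide (UCS-4) build.
if sys.maxunicode == 0xFFFF:
    print("narrow build: supplementary characters use surrogate pairs")
else:
    print("wide build: one code point per item")
```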
> if the representation of 'text' is correct, why is the length wrong?


The representation of 'text' you are seeing is just the nicely-readable
version output by Python 2.2+. Despite the \U sequence, it is actually still
stored internally as two UTF-16 surrogates. You'll see this if you enter
'\U00012345' into Python 2.0 or 2.1, which don't use the \U form to output
strings.

--
Andrew Clover
mailto:an*@doxdesk.com
http://www.doxdesk.com/
Jul 18 '05 #3

Andrew Clover wrote:
> gabor <ga***@z10n.net> wrote:
> > so text[4] (which should be \U00010330)
> > was split into two 16-bit values (text[4] and text[5]).
>
> The default encoding for native Unicode strings in Python is UTF-16,
> which cannot hold the extended planes beyond 0xFFFF in a single
> character.


That's not quite right. UTF-16 encodes unicode characters as either single
16-bit values or pairs of 16-bit values. However, one character is still
one character.

Python makes the mistake of exposing the internal representation instead of
the logical value of unicode objects. This means that, aside from space
optimization, unicode objects have no advantage over UTF-8 encoded plain
strings for storing unicode text.
--
Rainer Deyke - ra*****@eldwood.com - http://eldwood.com
Jul 18 '05 #4

"Rainer Deyke" <ra*****@eldwood.com> writes:
> Python makes the mistake of exposing the internal representation instead
> of the logical value of unicode objects. This means that, aside from
> space optimization, unicode objects have no advantage over UTF-8 encoded
> plain strings for storing unicode text.


That is not true. First, it is not "Python", but a specific Python
configuration - in "wide Unicode" builds, it uses UCS-4 internally.

In either build, len() and indexing address code units, not
characters: that is true.

However, it is not true that there is no advantage over UTF-8 encoded
byte strings. Instead, there are several advantages:
- In a UCS-4 build, Unicode characters and code units are in a 1:1
relationship
- In a UCS-2 build, Unicode characters and code units are in a 1:1
relationship as long as the application only ever processes BMP
characters.
- In either case, a Unicode object has inherent information about the
character set, which a UTF-8 byte string does not have. IOW, you know
what a Unicode object is, but you don't know (inherently) whether a
byte string is UTF-8.
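(Editorial note: on either build you can still count actual code points by round-tripping through UTF-32, which stores exactly one 4-byte unit per code point. A sketch, assuming a Python whose codecs include utf-32, i.e. 2.6 or later:)

```python
# Count code points independently of the internal representation:
# UTF-32 uses exactly four bytes per code point.
text = u'Marr\U00010330kesh'
n = len(text.encode('utf-32-be')) // 4
print(n)  # 9, even where len(text) reports 10 on a narrow build
```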

Regards,
Martin

Jul 18 '05 #5
