
Short questions wrt Python & Unicode

KvS
Hi all,

I've been reading about Unicode in general, and about using it in Python in
particular, since it turns out to be less straightforward than I expected.
I wanted to ask two questions:

1) I'm writing a program that interacts with the user through wxPython
(unicode build) and stores & retrieves data using PySQLite. As far as I
know, both packages can handle Python unicode objects
(wxPython returns the values of text controls etc. as Python unicode
objects by default, and "TEXT" columns in PySQLite hold unicode entries).
Since both interface with me through Python unicode objects, I should be
able to pass the unicode objects one package produces to the other
package's functions without any worry, right?

2) How do I get a representation of a unicode object in terms of Unicode
code points? repr() doesn't do that; it sometimes parses or encodes the
code points:

>>> s=u"\u0040\u0166\u00e6"
>>> s
u'@\u0166\xe6'

(Does this latter \xe6 have to do with the internal representation of
unicode objects, maybe with this UCS-2 encoding?)

Thanks in advance!

- Kees

Jun 9 '06 #1
3 Replies


On 9/06/2006 10:04 PM, KvS wrote:
2) How do I get a representation of a unic. object in terms of Unicode
code points? repr() doesn't do that, it sometimes parses or encodes the
code points right:

|>>> s=u"\u0040\u0166\u00e6"
|>>> s
u'@\u0166\xe6'
|>>> ' '.join('U+%04X' % ord(c) for c in s)
'U+0040 U+0166 U+00E6'

If you'd prefer it more Pythonic than unicode.orgic, adjust the format
string and separator to suit your taste.
(does this latter \xe6 have to do with the internal representation of
unic. objects, maybe with this UCS-2 encoding?)


|>>> u'\xe6' == u'\u00e6' == unichr(0xe6)
True
|>>> hex(ord(u'\u00e6'))
'0xe6'

U+nnnnnn is represented internally as the integer 0xnnnnnn -- except if
it won't fit in one code unit (on a narrow build, anything above U+FFFF is
stored as a surrogate pair), but you can pretend that surrogate pairs
don't exist, for the moment :-)
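
The one-liner above can be wrapped in a small helper. This is only a sketch: the name code_points and its sep parameter are made up for illustration, and it assumes Python 3 (or a wide Python 2 build), where iterating over a string yields whole code points rather than surrogate halves.

```python
def code_points(s, sep=' '):
    """Return the code points of s in U+XXXX notation, joined by sep."""
    # %04X pads to at least four hex digits; larger code points simply
    # use as many digits as they need (e.g. U+1D11E).
    return sep.join('U+%04X' % ord(c) for c in s)


print(code_points(u'@\u0166\u00e6'))   # U+0040 U+0166 U+00E6
print(code_points(u'\U0001D11E'))      # U+1D11E on Python 3
```

On a narrow Python 2 build the second call would print the two surrogate halves instead, which is the "won't fit" case mentioned above.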

Cheers,
John

Jun 9 '06 #2

KvS wrote:
>>> s=u"\u0040\u0166\u00e6"
>>> s
u'@\u0166\xe6'

(does this latter \xe6 have to do with the internal representation of
unic. objects, maybe with this UCS-2 encoding?)


No, it's simply the shortest way to write U+00E6 in a Python Unicode
string literal when limited to ASCII only.
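
A quick way to check this for yourself, as a sketch that assumes Python 3, where ascii() plays the role that repr() plays for unicode objects in this thread. The variable names are only illustrative.

```python
s = u'\u0040\u0166\u00e6'

# Both spellings denote the same single character, U+00E6.
same = (u'\xe6' == u'\u00e6')

# ascii() escapes every non-ASCII character and picks the shortest
# escape form that can hold the code point: \x, then \u, then \U.
escaped = ascii(s)

print(same)      # True
print(escaped)   # '@\u0166\xe6'
```

So the choice between \xe6 and \u00e6 is purely a matter of how the value is written out; the string itself is the same either way.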

</F>

Jun 9 '06 #3

KvS

John Machin wrote:
On 9/06/2006 10:04 PM, KvS wrote:
2) How do I get a representation of a unic. object in terms of Unicode
code points? repr() doesn't do that, it sometimes parses or encodes the
code points right:

|>>> s=u"\u0040\u0166\u00e6"
|>>> s
u'@\u0166\xe6'


|>>> ' '.join('U+%04X' % ord(c) for c in s)
'U+0040 U+0166 U+00E6'

If you'd prefer it more Pythonic than unicode.orgic, adjust the format
string and separator to suit your taste.
(does this latter \xe6 have to do with the internal representation of
unic. objects, maybe with this UCS-2 encoding?)


|>>> u'\xe6' == u'\u00e6' == unichr(0xe6)
True
|>>> hex(ord(u'\u00e6'))
'0xe6'

U+nnnnnn is represented internally as the integer 0xnnnnnn -- except if
it won't fit, but you can pretend that surrogate pairs don't exist, for
the moment :-)

Cheers,
John


Thanks to you and Fredrik! What about q1? I know it's silly, since with
integers, for example, one doesn't give such an issue any thought at all;
it's just that this business of en-/decodings etc. makes things a bit
more blurry to me. It should be the case that a package may do
whatever it wants internally (en-/decoding etc.) to represent/manipulate
unicode strings, but should always communicate with the outside world via
the interchangeable & uniform Python unicode object, right?
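
One way to test that expectation for the database half is a simple round trip. The sketch below makes several assumptions: it uses the standard-library sqlite3 module rather than the PySQLite package itself, it targets Python 3, and the roundtrip function and the notes table are invented for illustration. The wxPython side is not exercised; the string literal stands in for whatever a text control would return.

```python
import sqlite3


def roundtrip(text):
    """Store text in a TEXT column and read it back unchanged (hopefully)."""
    conn = sqlite3.connect(':memory:')   # throwaway in-memory database
    try:
        conn.execute('CREATE TABLE notes (body TEXT)')
        # Parameter binding passes the string object straight through;
        # no manual encode()/decode() calls are involved.
        conn.execute('INSERT INTO notes (body) VALUES (?)', (text,))
        (stored,) = conn.execute('SELECT body FROM notes').fetchone()
        return stored
    finally:
        conn.close()


original = u'@\u0166\u00e6'
print(roundtrip(original) == original)   # True
```

If the comparison holds, whatever encoding the database uses internally never leaks out to the calling code, which is the property being asked about.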

Jun 9 '06 #4
