Bytes IT Community

Does strxfrm work with unicode strings?

I am trying to use strxfrm with unicode strings, but it does not work.
This is what I did:

import locale
s=u'\u00e9'
print s

locale.setlocale(locale.LC_ALL, '')
'French_Switzerland.1252'
locale.strxfrm(s)

Traceback (most recent call last):
  File "<pyshell#20>", line 1, in -toplevel-
    locale.strxfrm(s)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)


Can anyone see what I did wrong?

Jul 19 '05 #1
4 Replies


How about:

import locale
s=u'\u00e9'
print s
locale.setlocale(locale.LC_ALL, '')
locale.strxfrm( s.encode( "latin-1" ) )

---
HTH,
Gerald

ni************@genevoise.ch wrote:
> I am trying to use strxfrm with unicode strings, but it does not work.
> This is what I did:
>
> import locale
> s=u'\u00e9'
> print s
>
> locale.setlocale(locale.LC_ALL, '')
> 'French_Switzerland.1252'
> locale.strxfrm(s)
>
> Traceback (most recent call last):
>   File "<pyshell#20>", line 1, in -toplevel-
>     locale.strxfrm(s)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>
> Can anyone see what I did wrong?


--
GPG-Key: http://keyserver.veridis.com:11371/search?q=0xA140D634

Jul 19 '05 #2

Grüezi, Gerald ;-)

Well, OK, but I don't understand why I should first convert a pure
unicode string into a byte string.
The encoding (here, latin-1) seems an arbitrary choice.

Your solution works, but is it a workaround or the real way to use
strxfrm?
It seems a little artificial to me, but perhaps I haven't understood
something...

Does this mean that you cannot pass a unicode string to strxfrm?

Bonne journée!

Jul 19 '05 #3

Sali Nicolas :)),
please see below for my answers.

ni************@genevoise.ch wrote:
> Grüezi, Gerald ;-)
>
> Well, OK, but I don't understand why I should first convert a pure
> unicode string into a byte string.
> The encoding (here, latin-1) seems an arbitrary choice.

Well, "latin-1" is simply the one encoding that I know works on
my xterm and that I can type without spelling errors :)

> Your solution works, but is it a workaround or the real way to use
> strxfrm?
> It seems a little artificial to me, but perhaps I haven't understood
> something...

In Python 2.3.4 I had some strange encounters with the locale module.
In the end I considered it broken, at least when it came to currency
formatting.

> Does this mean that you cannot pass a unicode string to strxfrm?

This works here for my home-grown Python 2.4 on Jurassic Debian Woody:

import locale
s=u'\u00e9'
print s

print locale.setlocale(locale.LC_ALL, '')
print repr( locale.strxfrm( s.encode( "latin-1" ) ) )
print repr( locale.strxfrm( s.encode( "utf-8" ) ) )

The output is rather strange:


de_DE
"\x10\x01\x05\x01\x02\x01'@/locale"
"\x0c\x01\x0c\x01\x04\x01'@/locale"
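The two results probably differ because the two calls hand strxfrm different byte strings for the same character. A minimal sketch of that difference (the variable names are only for illustration, and the `b'...'` literals in the comments assume Python 2.6 or later):

```python
# The same unicode character encoded two ways.
s = u'\u00e9'

latin1_bytes = s.encode('latin-1')  # one byte: b'\xe9'
utf8_bytes = s.encode('utf-8')      # two bytes: b'\xc3\xa9'

print(repr(latin1_bytes))
print(repr(utf8_bytes))
```

If the de_DE locale uses a single-byte charset, strxfrm would see the UTF-8 form as two separate characters, which would explain the different keys.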

Another (not so) weird thing happens when I unset LANG.

bear@special:~ > unset LANG
bear@special:~ > python2.4 ttt.py
Traceback (most recent call last):
  File "ttt.py", line 3, in ?
    print s
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

Actually, it's weirder that printing works with LANG=de_DE.

Back to your question. A quick glance at the C-sources of the
_localemodule.c reveals:

if (!PyArg_ParseTuple(args, "s:strxfrm", &s))

So yes, strxfrm does not accept unicode!
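As far as I know, the "s" format code in Python 2 converts a unicode argument to a byte string using the default encoding, which is normally ASCII. That would explain the 'ascii' codec error in the original traceback. A minimal sketch of the same conversion done by hand (the name `message` is only for illustration, and the `except ... as` syntax assumes Python 2.6 or later):

```python
# Encoding a non-ASCII character with the ASCII codec fails
# the same way the strxfrm call in the traceback does.
s = u'\u00e9'

try:
    s.encode('ascii')
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)

print(message)
```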

I am inclined to consider this a bug.
At least it is not consistent with strcoll.
strcoll accepts either 2 strings or 2 unicode strings,
at least when HAVE_WCSCOLL was defined when Python
was compiled on your platform.
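If strcoll does accept unicode on your build, one possible workaround is to sort with strcoll as a comparison function instead of using strxfrm keys. A rough sketch, assuming Python 2.7 or later for `functools.cmp_to_key` (on 2.4 the equivalent would be `sorted(words, cmp=locale.strcoll)`); `locale_sorted` is a made-up helper name, not part of the locale module:

```python
import functools
import locale


def locale_sorted(words):
    """Sort text strings by comparing them pairwise with locale.strcoll.

    Hypothetical helper; call locale.setlocale() first to get
    locale-aware ordering instead of the default "C" ordering.
    """
    return sorted(words, key=functools.cmp_to_key(locale.strcoll))
```

This is slower than sorting on precomputed strxfrm keys, since strcoll runs for every comparison, but it avoids the encoding step.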

BTW: Which platform do you use?

HTH,
Gerald

PS: If you have access to irc, you can also ask at
irc://irc.freenode.net#python.de.

--
GPG-Key: http://keyserver.veridis.com:11371/search?q=0xA140D634

Jul 19 '05 #4

ni************@genevoise.ch wrote:
> Grüezi, Gerald ;-)
>
> Well, OK, but I don't understand why I should first convert a pure
> unicode string into a byte string.
> The encoding (here, latin-1) seems an arbitrary choice.

Yes. The correct choice would be 'cp1252', not 'latin-1',
since that's what your locale setting indicates.
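Rather than hard-coding either name, the encoding could be taken from the locale itself. A small sketch, not a definitive implementation: `strxfrm_text` is a hypothetical helper, and it assumes `locale.getpreferredencoding()` reports the charset of the active locale (for example cp1252 under French_Switzerland.1252). On Python 3, where strxfrm takes text strings directly, the encoding step is skipped.

```python
import locale
import sys


def strxfrm_text(text, encoding=None):
    """Return a collation key for a text (unicode) string.

    Hypothetical helper, not part of the locale module.
    """
    if sys.version_info[0] >= 3:
        # Python 3: strxfrm accepts text strings as they are.
        return locale.strxfrm(text)
    if encoding is None:
        # Assumption: the preferred encoding matches the locale's charset.
        encoding = locale.getpreferredencoding()
    # Python 2: strxfrm only takes byte strings, so encode first.
    return locale.strxfrm(text.encode(encoding))
```

After calling `locale.setlocale(locale.LC_ALL, '')`, it could be used as a sort key, e.g. `sorted(words, key=strxfrm_text)`.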

It seems to me that Python is on a journey from the ASCII
world to the Unicode world, and it will take a few more
versions before it gets there. Going from 2.2 to 2.3 was
a bumpy part of the ride, and it's still not smooth.

Just try to use raw_input with national characters. As far
as I remember it hasn't worked (on windows at least) since
2.2.

The clear improvement from 2.3 is that if you print unicode
strings to stdout, they will look correct both in the GUI
and in text mode (cmd.exe). That never worked before, since
Windows uses different code pages in the GUI and in text
mode (which is supposed to be DOS compatible).
Jul 19 '05 #5
