Strange problems with encoding

Hi newsgroup,

I am trying to replace German special characters in strings like
str = re.sub('ö', 'oe', str)

When I work with this, I always get the message
UniCode Error: ASCII decoding error : ordinal not in range(128)

Yes, I have googled; I searched the FAQ, the manual, the Python library
reference and every other source of information I know. I played with
Python's built-in encode to enforce the right encoding, but the error
stays the same. I've read a lot about Unicode and the internal string
conversions done by Python, but somehow I've missed the clue.
Nope, Python says Huuups... ordinal not in range(128), ;-(

Does anyone have any idea? It seems I am too stupid to read the
documentation carefully; perhaps I misunderstand something...

thanks for your help in advance

Sebastian
Jul 18 '05 #1


Sebastian Meyer wrote:
> Hi newsgroup,
>
> I am trying to replace German special characters in strings like
> str = re.sub('ö', 'oe', str)
>
> When I work with this, I always get the message
> UniCode Error: ASCII decoding error : ordinal not in range(128)


I'm experiencing something similar at the moment. I try to
base64-encode Unicode strings and I get the exact same error message.

>>> s = u'ö'
>>> s
u'\xf6'
>>> s.encode('base64')
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Python23\lib\encodings\base64_codec.py", line 24, in base64_encode
    output = base64.encodestring(input)
  File "C:\Python23\lib\base64.py", line 39, in encodestring
    pieces.append(binascii.b2a_base64(chunk))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
position 0: ordinal not in range(128)

When I don't specify that it's Unicode, it works:

>>> s = 'ö'
>>> s
'\xf6'
>>> s.encode('base64')
'9g==\n'

The reason I want to base64-encode these unicode strings is because I
get those as input and want to store them in a MySQL database using
SQLObject.
Jul 18 '05 #2

"Sebastian Meyer" <s.*****@technology-network.de> writes:
> Hi newsgroup,
>
> I am trying to replace German special characters in strings like
> str = re.sub('ö', 'oe', str)


1) str is the name of a builtin -- often a bad idea to use that as a
variable name.

2) I presume `str' is a unicode string? Try writing the literal as
u'ö' instead (and adding the appropriate coding cookie to your
source file if using Python 2.3). Or I guess you could write it

u'\N{LATIN SMALL LETTER O WITH DIAERESIS}'

Cheers,
mwh

--
Usenet is like a herd of performing elephants with diarrhea --
massive, difficult to redirect, awe-inspiring, entertaining, and
a source of mind-boggling amounts of excrement when you least
expect it. -- spaf (1992)
Jul 18 '05 #3
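
For readers on Python 3, where every str is already Unicode and the u'' prefix is optional, the suggestion above can be sketched roughly as follows. This assumes Python 3 rather than the 2.3 interpreter discussed in the thread, and the names O_UMLAUT and replace_o_umlaut are made up for illustration.

```python
import re

# Assumption: Python 3, where str is Unicode and no coding cookie is
# needed for a UTF-8 source file. The names below are illustrative only.
O_UMLAUT = "\N{LATIN SMALL LETTER O WITH DIAERESIS}"  # the character U+00F6


def replace_o_umlaut(text):
    """Replace every o-umlaut in text with the ASCII digraph 'oe'."""
    return re.sub(O_UMLAUT, "oe", text)


# "D\u00f6spaddel" is the test word used later in the thread.
result = replace_o_umlaut("D\u00f6spaddel")
```

Because pattern, replacement and subject are all the same string type here, there is no implicit ASCII conversion left to fail.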

On Thu, 06 Nov 2003 13:39:25 +0000, Michael Hudson wrote:
> "Sebastian Meyer" <s.*****@technology-network.de> writes:
>> Hi newsgroup,
>>
>> I am trying to replace German special characters in strings like
>> str = re.sub('ö', 'oe', str)
>
> 1) str is the name of a builtin -- often a bad idea to use that as a
> variable name.

It was only an example name for the variable; rest assured that I don't
use any builtins as variable names.
Maybe not a good example ... thanks for the hint.

> 2) I presume `str' is a unicode string? Try writing the literal as
> u'ö' instead (and adding the appropriate coding cookie to your
> source file if using Python 2.3). Or I guess you could write it
>
> u'\N{LATIN SMALL LETTER O WITH DIAERESIS}'

I'll try and report back...

> Cheers,
> mwh


Jul 18 '05 #4

"Sebastian Meyer" <s.*****@technology-network.de> wrote in message
news:pa***************************@technology-network.de...
> Hi newsgroup,
>
> I am trying to replace German special characters in strings like
> str = re.sub('ö', 'oe', str)
>
> When I work with this, I always get the message
> UniCode Error: ASCII decoding error : ordinal not in range(128)


Try adding

sys.setdefaultencoding( 'latin-1' )

to your site.py module, or rewrite your fragment as

import re
old = u'ö'   # "from" is a reserved word and cannot be a variable name
new = u'oe'
s = re.sub( old.encode('latin-1'), new.encode('latin-1'), s )

If you are running on Windows you might want to change 'latin-1' to 'mbcs',
as that seems to be the most forgiving codec, but it is Windows only.

Joe
Jul 18 '05 #5

Rudy Schockaert <ru*************@pandoraSTOPSPAM.be> writes:
> Sebastian Meyer wrote:
>> Hi newsgroup,
>> I am trying to replace German special characters in strings like
>> str = re.sub('ö', 'oe', str)
>> When I work with this, I always get the message
>> UniCode Error: ASCII decoding error : ordinal not in range(128)
>
> I'm experiencing something similar at the moment. I try to
> base64-encode Unicode strings and I get the exact same error message.


"base64-encoding Unicode strings" is not a particularly well defined
operation. "base64-encoding" is a way of turning *binary data* into a
particularly "safe" sequence of ascii characters.

Unicode (in some sense) is a family of ways of representing strings of
characters as binary data.

So to base-64 encode a Unicode string, you need to choose *which*
member of this family you're going to use, which is to say the
encoding. UTF-8 would seem a good bet.

But...

> >>> s = u'ö'
> >>> s
> u'\xf6'
> >>> s.encode('base64')
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
>   File "C:\Python23\lib\encodings\base64_codec.py", line 24, in base64_encode
>     output = base64.encodestring(input)
>   File "C:\Python23\lib\base64.py", line 39, in encodestring
>     pieces.append(binascii.b2a_base64(chunk))
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
> position 0: ordinal not in range(128)

>>> u'ö'.encode('utf-8').encode('base64')
'w7Y=\n'

> When I don't specify it's unicode it works:
> >>> s = 'ö'
> >>> s
> '\xf6'
> >>> s.encode('base64')
> '9g==\n'

Well, this works because your terminal seems to be latin-1:

>>> u'ö'.encode('latin-1').encode('base64')
'9g==\n'

What would you like to do with a character that isn't in latin-1?

> The reason I want to base64-encode these unicode strings is because I
> get those as input and want to store them in a MySQL database using
> SQLObject.

! Why can't you just encode them as utf-8 strings? (Or, thinking
about it, why doesn't SQLObject support unicode?)

Cheers,
mwh

--
I think if we have the choice, I'd rather we didn't explicitly put
flaws in the reST syntax for the sole purpose of not insulting the
almighty. -- /will on the doc-sig
Jul 18 '05 #6
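
The point made above, that base64 works on bytes and so the text must first be encoded with an explicitly chosen encoding, might look like this in Python 3. It is a sketch that assumes Python 3 and its base64 module (the 'base64' string codec used in the thread is Python 2 only); text_to_base64 and base64_to_text are hypothetical helper names.

```python
import base64


def text_to_base64(text, encoding="utf-8"):
    """Encode text to bytes with an explicit encoding, then to base64."""
    raw = text.encode(encoding)  # raises UnicodeEncodeError if not representable
    return base64.b64encode(raw).decode("ascii")


def base64_to_text(encoded, encoding="utf-8"):
    """Reverse text_to_base64; the same encoding must be used."""
    return base64.b64decode(encoded).decode(encoding)


o_umlaut = "\u00f6"
utf8_form = text_to_base64(o_umlaut)               # bytes C3 B6 before base64
latin1_form = text_to_base64(o_umlaut, "latin-1")  # byte F6 before base64
```

The two results differ because the intermediate bytes differ, which is why the encoding has to be chosen once and used for both directions.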

Sebastian Meyer wrote:
> Hi newsgroup,
>
> I am trying to replace German special characters in strings like
> str = re.sub('ö', 'oe', str)
>
> When I work with this, I always get the message
> UniCode Error: ASCII decoding error : ordinal not in range(128)


Works here, even with my older snake:

Python 2.2.1 (#1, Sep 10 2002, 17:49:17)
[GCC 3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub("ö", "oe", "Döspaddel")
'Doespaddel'
>>> re.sub("ö", "oe", u"Döspaddel")
u'Doespaddel'
>>> re.sub("ö", u"oe", u"Döspaddel")
u'Doespaddel'
>>> re.sub(u"ö", u"oe", u"Döspaddel")
u'Doespaddel'

To provoke a UnicodeError, I have to convert a unicode string with umlauts
to str without providing the encoding:

>>> str(u"Döspaddel")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)

I suspect that you have something similar hidden in your code (i.e.
characters >= 128 that are not converted). The remedy is to explicitly
encode with the appropriate encoding:

>>> u"Döspaddel".encode("latin-1")
'D\xf6spaddel'


Try to build a minimal script that shows the reported behaviour and fix it
or post it for more detailed advice. By the way, don't use str as a
variable name, it's the type of "ordinary" strings.

Peter

Jul 18 '05 #7
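
A rough Python 3 counterpart to the session above, assuming Python 3 instead of the 2.2 interpreter shown. There the failure no longer hides in str(), but it reappears as soon as text is encoded to ASCII; try_ascii is a made-up helper for the illustration.

```python
word = "D\u00f6spaddel"  # the test word from the post, with U+00F6


def try_ascii(text):
    """Return the ASCII bytes of text, or a short message if encoding fails."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        # exc.reason carries the familiar explanation from the thread.
        return "failed: " + exc.reason


ascii_attempt = try_ascii(word)       # fails: U+00F6 is outside ASCII
latin1_bytes = word.encode("latin-1")  # succeeds: U+00F6 becomes byte 0xF6
```

As in the post, the fix is to name the target encoding explicitly at the point where text becomes bytes.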

Joe Fromm wrote:

> Try adding
>
> sys.setdefaultencoding( 'latin-1' )
>
> to your site.py module, or rewrite your fragment as

At the end of site.py you can enable a piece of code that sets your
default encoding to the current locale of your computer:

if 1:
    # Enable to support locale aware default string encodings.
    import locale
    loc = locale.getdefaultlocale()
    if loc[1]:
        encoding = loc[1]

This works great for me.

Thanks for pointing me to site.py

P.S. I really need some weeks off so I can read all the available
documentation ;-)
Jul 18 '05 #8

>
> >>> u'ö'.encode('utf-8').encode('base64')
> 'w7Y=\n'

This works indeed. And thanks to Joe Fromm's hint (site.py) I don't have
to worry about it anymore.

> What would you like to do with a character that isn't in latin-1?

Actually, I don't care as long as the encode and decode on the same
machine give me back the original value.

>> The reason I want to base64-encode these unicode strings is because I
>> get those as input and want to store them in a MySQL database using
>> SQLObject.
>
> ! Why can't you just encode them as utf-8 strings? (Or, thinking
> about it, why doesn't SQLObject support unicode?)

The actual input strings don't really contain unicode text values, but
rather binary values I get as the result of calling win32.NetUserEnum.

The manual of SQLObject (great product btw) explains how you can easily
store binary data in a SQL table by encoding it when setting and
decoding it when getting the value. That is just what I was trying to do.
Jul 18 '05 #9
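
The encode-on-set, decode-on-get idea described above could be sketched like this in Python 3. It is only an illustration under stated assumptions: UserRecord and its attributes are hypothetical and are not SQLObject's API, and stored_column merely stands in for a text column in the database.

```python
import base64


class UserRecord:
    """Hypothetical record that keeps binary data in a text-only column."""

    def __init__(self):
        # Stand-in for a database text column; it only ever holds ASCII.
        self.stored_column = ""

    @property
    def blob(self):
        """Decode on get: turn the stored base64 text back into bytes."""
        return base64.b64decode(self.stored_column)

    @blob.setter
    def blob(self, value):
        """Encode on set: store the bytes as base64 text."""
        self.stored_column = base64.b64encode(value).decode("ascii")


record = UserRecord()
record.blob = b"\x00\xf6\xff binary"
```

Since the value is handled as bytes from start to finish, no text encoding (and no default encoding) is involved in the round trip.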

Rudy Schockaert <ru*************@pandoraSTOPSPAM.be> writes:

>> >>> u'ö'.encode('utf-8').encode('base64')
>> 'w7Y=\n'
>
> This works indeed. And thanks to Joe Fromm's hint (site.py) I don't
> have to worry about it anymore.

Well, I'm from the setdefaultencoding-is-evil camp, but it sounds like
you're in a pretty icky situation.

>> What would you like to do with a character that isn't in latin-1?
>
> Actually, I don't care as long as the encode and decode on the same
> machine give me back the original value.

Huh?

>>> The reason I want to base64-encode these unicode strings is because I
>>> get those as input and want to store them in a MySQL database using
>>> SQLObject.
>>
>> ! Why can't you just encode them as utf-8 strings? (Or, thinking
>> about it, why doesn't SQLObject support unicode?)
>
> The actual input strings don't really contain unicode text values, but
> rather binary values I get as the result of calling win32.NetUserEnum.


Oh, so they're not really unicode strings at all? Blech. That's
really really nasty. Binary data should really be represented as
(narrow) strings in Python. Perhaps the utf-16-le codec would be the
most appropriate...

Cheers,
mwh

--
Q: What are 1000 lawyers at the bottom of the ocean?
A: A good start.
(A lawyer told me this joke.)
-- Michael Ströder, comp.lang.python
Jul 18 '05 #10

On Thu, 06 Nov 2003 15:10:49 +0100, Sebastian Meyer wrote:
> On Thu, 06 Nov 2003 13:39:25 +0000, Michael Hudson wrote:
>> 2) I presume `str' is a unicode string? Try writing the literal as
>> u'ö' instead (and adding the appropriate coding cookie to your
>> source file if using Python 2.3). Or I guess you could write it
>>
>> u'\N{LATIN SMALL LETTER O WITH DIAERESIS}'
>
> I'll try and report back...

Okay, I've solved my problem... It seems that the method which tries
to insert the data I process into the database raises the error. The
data comes from XML files, and my class derived from
xml.sax.handler.ContentHandler returns Unicode strings. The database
routine tries to encode the values as ASCII and --**BOOOM**-- ... Exception.

I now replace the special characters via their Unicode names,
e.g. u'\N{LATIN SMALL LETTER O WITH DIAERESIS}' (thanks for the hint,
Michael), and now it all works fine... ;-))

thanks for the great help NG

Sebastian
Jul 18 '05 #11
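
One way to do the kind of replacement described above before handing values to an ASCII-only routine, sketched for Python 3. The table and the function name to_ascii_german are assumptions made for the example; the thread itself only shows the single 'ö' to 'oe' substitution.

```python
# Assumption: Python 3. The mapping covers the usual German special
# characters; the thread itself only mentions o-umlaut.
UMLAUT_TABLE = str.maketrans({
    "\u00e4": "ae",  # a-umlaut
    "\u00f6": "oe",  # o-umlaut
    "\u00fc": "ue",  # u-umlaut
    "\u00c4": "Ae",  # capital A-umlaut
    "\u00d6": "Oe",  # capital O-umlaut
    "\u00dc": "Ue",  # capital U-umlaut
    "\u00df": "ss",  # sharp s
})


def to_ascii_german(text):
    """Transliterate German special characters, then encode as ASCII.

    Any other non-ASCII character still raises UnicodeEncodeError, so
    unexpected input is not silently mangled.
    """
    return text.translate(UMLAUT_TABLE).encode("ascii")
```

For example, to_ascii_german("D\u00f6spaddel") gives the bytes b"Doespaddel", which an ASCII-only database layer can accept.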

Michael Hudson wrote:

> Well, I'm from the setdefaultencoding-is-evil camp, but it sounds like
> you're in a pretty icky situation.

I wasn't even aware there are two camps. What would be the reasons not
to use setdefaultencoding? As I have configured it now, it uses the
system's locale to set the encoding. I'm using the same machine to
retrieve data, manipulate it and store it in a database (on the same
machine). I would like to understand what could be wrong in this case.

>> Actually, I don't care as long as the encode and decode on the same
>> machine give me back the original value.
>
> Huh?

What I mean is that I encode the data when I store it in the DB and
decode it when I retrieve the data from the DB. I do this because
SQLObject doesn't support binary data. As long as the result that
comes back out is exactly the same as it was when it went in, I don't care.

>>>> The reason I want to base64-encode these unicode strings is because I
>>>> get those as input and want to store them in a MySQL database using
>>>> SQLObject.
>>>
>>> ! Why can't you just encode them as utf-8 strings? (Or, thinking
>>> about it, why doesn't SQLObject support unicode?)
>>
>> The actual input strings don't really contain unicode text values, but
>> rather binary values I get as the result of calling win32.NetUserEnum.
>
> Oh, so they're not really unicode strings at all? Blech. That's
> really really nasty. Binary data should really be represented as
> (narrow) strings in Python.

I'm just doing it the easy way, I guess. I get the data from the win32
call as Unicode data, even when it contains binary data. Perhaps I
will transform this data into a more useful format in a later phase, but
that'll depend on the need.

> Perhaps the utf-16-le codec would be the most appropriate...

This is really not my thing. I noticed that on my system the encoding is
now set to cp1252. What would be the difference if I switched to utf-16-le?

Thanks for your explanation.

Rudy
Jul 18 '05 #12

Rudy Schockaert wrote:

> At the end of site.py you can enable a piece of code that sets your
> default encoding to the current locale of your computer:
>
> if 1:
>     # Enable to support locale aware default string encodings.
>     import locale
>     loc = locale.getdefaultlocale()
>     if loc[1]:
>         encoding = loc[1]
>
> This works great for me.

instead of hacking your Python installation, I suggest using
explicit calls to the "encode" method wherever you need to
convert from Unicode to binary data on the way out.

> P.S. I really need some weeks off so I can read all the available
> documentation ;-)

it shouldn't take you more than 15-20 minutes to learn enough
about Unicode to be able to write Python code that processes
non-ASCII text in a reliable and portable way:

short version:
http://effbot.org/zone/unicode-objects.htm

long version:
http://www.joelonsoftware.com/articles/Unicode.html

</F>


Jul 18 '05 #13

>>P.S. I really need some weeks off so I can read all the available
>> documentation ;-)
>
> it shouldn't take you more than 15-20 minutes to learn enough
> about Unicode to be able to write Python code that processes
> non-ASCII text in a reliable and portable way:
>
> short version:
> http://effbot.org/zone/unicode-objects.htm
>
> long version:
> http://www.joelonsoftware.com/articles/Unicode.html
>
> </F>

I wasn't referring to Unicode ;-) but to the existence of site.py.
There is still so much I have to learn about Python that I will need
those weeks badly. I only got halfway through Alex's Python in a Nutshell
(splendid book btw), which I have already had since EuroPython :-(
Jul 18 '05 #14

Rudy Schockaert <ru*************@pandoraSTOPSPAM.be> writes:

> I wasn't even aware there are two camps. What would be the reasons not
> to use setdefaultencoding?

You lose portability (more correctly: you get a false sense of
portability). If you write an application that requires the
default encoding to be FOO-1, the application may work fine on system
A, and fail on system B. Telling the operator of system B to change
her default encoding may cause breakage of a different application on
system B, as B has BAR-2 as the default encoding; changing it to FOO-1
would break applications that require it to be BAR-2.

IOW, if you require conversions between Unicode and byte strings,
explicitly do them in your code. Explicit is better than implicit.

> As I configured it now it uses the systems locale to set the
> encoding. I'm using the same machine to retrieve data, manipulate it
> and store in a database (on the same machine). I would like to
> understand what could be wrong in this case.

If the next user logs in on the same system, and has a different
locale set, that user will misinterpret the data you have created.

> What I mean is that I encode the data when I store it in the DB and
> decode it when I retrieve the data from the DB. I do this because
> SQLObject doesn't support the binary data. As long as the result that
> comes back out is exactly the same as it was when it went in, I don't
> care.

Then you should *define* an encoding that your application uses,
e.g. UTF-8, and use that encoding throughout wherever required,
instead of asking the administrator to change a system setting.

Regards,
Martin
Jul 18 '05 #15
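
The advice above, to define one application encoding and convert explicitly, might be sketched as follows in Python 3. APP_ENCODING, to_storage and from_storage are names invented for this example, and cp1252 is used only because it is the locale encoding mentioned earlier in the thread.

```python
# Assumption: Python 3. The application, not the machine's locale,
# decides which encoding is used at the storage boundary.
APP_ENCODING = "utf-8"


def to_storage(text):
    """Convert text to bytes on the way out, always with APP_ENCODING."""
    return text.encode(APP_ENCODING)


def from_storage(raw):
    """Convert stored bytes back to text, always with APP_ENCODING."""
    return raw.decode(APP_ENCODING)


stored = to_storage("\u00f6")  # o-umlaut becomes the two bytes C3 B6

# What a reader relying on a different, locale-derived encoding would
# see: the same two bytes decoded as cp1252 give two wrong characters.
misread = stored.decode("cp1252")
```

The round trip through to_storage and from_storage is independent of who is logged in, while the misread value shows the kind of misinterpretation that a locale-dependent default invites.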
