How does unicode() work?

Here's a test snippet...

import sys
for k in sys.stdin:
    print '%s -> %s' % (k, k.decode('iso-8859-1'))

...but it barfs when actually fed with iso8859-1 characters. How is this
done right?

robert
Jan 9 '08 #1
7 Replies


Robert Latest wrote:
> ...but it barfs when actually fed with iso8859-1 characters.

Specifically, it says:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 0:
ordinal not in range(128)

which doesn't make sense to me, because I specifically asked for the
iso8859-1 decoder, not the 'ascii' one.

robert
Jan 9 '08 #2

Robert Latest wrote:
> Here's a test snippet...
>
> import sys
> for k in sys.stdin:
>     print '%s -> %s' % (k, k.decode('iso-8859-1'))
>
> ...but it barfs when actually fed with iso8859-1 characters. How is this
> done right?

it's '%s -> %s' % (byte string, unicode string) that barfs. try doing

import sys
for k in sys.stdin:
    print '%s -> %s' % (repr(k), k.decode('iso-8859-1'))

instead, to see what's going on.

</F>

Jan 9 '08 #3
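
For readers following along on Python 3 (an assumption; the thread itself is about Python 2), the same diagnostic might look like the sketch below. repr() shows the raw bytes without interpreting them, and the explicit decode turns them into text. The sample input is made up for illustration.

```python
# Python 3 sketch; the sample line is hypothetical, not taken from the thread.
raw = b"sch\xf6n\n"  # "schoen" with o-umlaut, encoded as ISO 8859-1

shown = repr(raw)                # escapes the non-ASCII byte instead of decoding it
text = raw.decode("iso-8859-1")  # bytes -> text, one byte per character

# %a prints an ASCII-safe form of the text, so this runs on any terminal.
print("%s -> %a" % (shown, text))
```

Seeing the \xf6 escape in the first column confirms that the input really is a single non-ASCII byte, which is the point of the repr() suggestion.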

On Wed, 2008-01-09 at 13:44 +0100, Fredrik Lundh wrote:
> Robert Latest wrote:
> > Here's a test snippet...
> >
> > import sys
> > for k in sys.stdin:
> >     print '%s -> %s' % (k, k.decode('iso-8859-1'))
> >
> > ...but it barfs when actually fed with iso8859-1 characters. How is this
> > done right?
>
> it's '%s -> %s' % (byte string, unicode string) that barfs. try doing
>
> import sys
> for k in sys.stdin:
>     print '%s -> %s' % (repr(k), k.decode('iso-8859-1'))
>
> instead, to see what's going on.

If that really is the line that barfs, wouldn't it make more sense to
repr() the unicode object in the second position?

import sys
for k in sys.stdin:
    print '%s -> %s' % (k, repr(k.decode('iso-8859-1')))

Also, I'm not sure if the OP has told us the truth about his code and/or
his error message. The implicit str() call done by formatting a unicode
object with %s would raise a UnicodeEncodeError, not the
UnicodeDecodeError that the OP is reporting. So either I need more
coffee or there is something else going on here that hasn't come to
light yet.

--
Carsten Haese
http://informixdb.sourceforge.net
Jan 9 '08 #4

Carsten Haese wrote:
> If that really is the line that barfs, wouldn't it make more sense to
> repr() the unicode object in the second position?
>
> import sys
> for k in sys.stdin:
>     print '%s -> %s' % (k, repr(k.decode('iso-8859-1')))
>
> Also, I'm not sure if the OP has told us the truth about his code and/or
> his error message. The implicit str() call done by formatting a unicode
> object with %s would raise a UnicodeEncodeError, not the
> UnicodeDecodeError that the OP is reporting. So either I need more
> coffee or there is something else going on here that hasn't come to
> light yet.

When mixing Unicode with byte strings, Python attempts to decode the
byte string, not encode the Unicode string.

In this case, Python first inserts the non-ASCII byte string in "%s ->
%s" and gets a byte string. It then attempts to insert the non-ASCII
Unicode string, and realizes that it has to convert the (partially
built) target string to Unicode for that to work. Which results in a
*UnicodeDecodeError*.
>>> "%s -> %s" % ("åäö", "åäö")
'\x86\x84\x94 -> \x86\x84\x94'
>>> "%s -> %s" % (u"åäö", u"åäö")
u'\xe5\xe4\xf6 -> \xe5\xe4\xf6'
>>> "%s -> %s" % ("åäö", u"åäö")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x86 ...

(the actual implementation differs a bit from the description above, but
the behaviour is identical).

</F>

Jan 9 '08 #5
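
The hidden step described above can be spelled out by hand. The following is only a sketch in Python 3 (an assumption; Python 3 no longer performs this implicit conversion and raises TypeError when bytes and text are mixed directly), so the ASCII decode that Python 2 did behind the scenes is written explicitly. The input byte is the 0xf6 from the original error message.

```python
# Python 3 sketch: reproduce by hand the decode that Python 2 attempted
# implicitly when a byte string was combined with a unicode string.
raw = b"\xf6"  # a single ISO 8859-1 byte, as in the reported error

try:
    raw.decode("ascii")  # the implicit step: default codec, not the one requested
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

decoded = raw.decode("iso-8859-1")  # the explicit decode works fine

print(message)
print(ascii(decoded))
```

The first line printed matches the error from the second post, which is why the message names the 'ascii' codec even though the snippet only ever asks for iso-8859-1.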

On Wed, 2008-01-09 at 15:33 +0100, Fredrik Lundh wrote:
> When mixing Unicode with byte strings, Python attempts to decode the
> byte string, not encode the Unicode string.

Ah, I did not realize that. I never mix Unicode and byte strings in the
first place, and now I know why. Thanks for clearing that up.

--
Carsten Haese
http://informixdb.sourceforge.net
Jan 9 '08 #6
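
One way to follow the "never mix" habit is to decode once, at the point where bytes enter the program, and work with text from then on. The sketch below assumes Python 3; decode_lines is a hypothetical helper name, and io.BytesIO stands in for sys.stdin.buffer so the example runs without piped input.

```python
import io


def decode_lines(byte_stream, encoding="iso-8859-1"):
    """Yield each line of a binary stream as text, decoded once at the boundary."""
    for raw_line in byte_stream:
        yield raw_line.decode(encoding)


# Stand-in for sys.stdin.buffer; the sample bytes are made up for illustration.
fake_stdin = io.BytesIO(b"sm\xf6rg\xe5s\nabc\n")
lines = list(decode_lines(fake_stdin))

for line in lines:
    word = line.rstrip("\n")
    # Both operands are text here, so formatting involves no implicit codec.
    print(ascii("%s -> %s" % (word, word.upper())))
```

In a real script the loop would read from sys.stdin.buffer instead, and the text would be encoded again only when it is written out.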

On Jan 10, 1:55 am, Carsten Haese <cars...@uniqsys.com> wrote:
> On Wed, 2008-01-09 at 15:33 +0100, Fredrik Lundh wrote:
> > When mixing Unicode with byte strings, Python attempts to decode the
> > byte string, not encode the Unicode string.
>
> Ah, I did not realize that. I never mix Unicode and byte strings in the
> first place, and now I know why. Thanks for clearing that up.

When mixing unicode strings with byte strings, Python attempts to
decode the str object to unicode, not encode the unicode object to
str. This is fine, especially when compared with the alternative, so
long as the str object is (loosely) ASCII. If the str object contains
a byte such that ord(byte) > 127, an exception will be raised.

When mixing floats with ints, Python attempts to decode the int to
float, not encode the float to int. This is fine, especially when
compared with the alternative, so long as the int is not humungous. If
the int is huge, you will lose precision without any warning or any
exception being raised.

Do you avoid mixing ints and floats?
Jan 9 '08 #7
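
The silent precision loss mentioned in the int/float comparison is easy to see with a small sketch. It assumes Python floats are IEEE 754 doubles, which is the case on common platforms; the specific value is chosen for illustration.

```python
# Sketch assuming IEEE 754 double-precision floats.
big = 2**53 + 1        # smallest positive int a double cannot represent exactly

as_float = float(big)  # explicit conversion rounds to the nearest double
mixed = big + 0.0      # mixing: the int is converted to float before adding

print(big)             # 9007199254740993
print(int(as_float))   # 9007199254740992
print(mixed == as_float)
```

No exception or warning is raised at any point; the last digit simply changes.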

John Machin wrote:
> When mixing unicode strings with byte strings, Python attempts to
> decode the str object to unicode, not encode the unicode object to
> str.

Thanks for the explanation. Of course I didn't want to mix Unicode and Latin
in one string, my snippet just tried to illustrate the point. I'm new to
Python -- I came from C, and C doesn't give a rat's ass about encoding. It just
dumps bytes and that's that.

robert
Jan 9 '08 #8

This discussion thread is closed
