unicode question

wolfgang haefelinger

Hi,

I wonder whether someone could explain to me a bit of what's going on here:

import sys

# I'm running Mandrake 10 and Windows XP.
print sys.version

## 2.3.3 (#2, Feb 17 2004, 11:45:40) [GCC 3.3.2 (Mandrake Linux 10.0 3.3.2-6mdk)]
## 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)]

print "sys.getdefaultencoding = ",sys.getdefaultencoding()
# This always prints "ascii".

## just a class
class Y:
    def __str__(self):
        return self.c

## define unicode character (ie. string)
gamma = u"\N{GREEK CAPITAL LETTER GAMMA}"

y = Y()
y.c = gamma

## works fine: prints a Greek capital gamma on the terminal on Windows (chcp 437).
## On Mandrake 10 nothing gets printed, but at least no exception gets thrown.
print gamma # (1)

## same as before ..
print y.__str__() # (2)

## encoding error
print y # (3) ??????????????

## ascii encoding error ..
sys.stdout.write(gamma) # (4)

I wonder especially about case 2. I can see that "print y" makes a call to
Y.__str__() . But Y.__str__() can be printed?? So what is 'print' exactly
doing?

Thanks for any help,
Wolfgang.

Jul 18 '05 #1

Martin v. Löwis

wolfgang haefelinger wrote:

I wonder especially about case 2. I can see that "print y" makes a call to
Y.__str__() . But Y.__str__() can be printed?? So what is 'print' exactly
doing?

It looks at sys.stdout.encoding. If this is set, and the thing to print
is a unicode string, it converts it to the stream encoding, and prints
the result of the conversion.

Regards,
Martin

Jul 18 '05 #2

Kent Johnson

Martin v. Löwis wrote:

wolfgang haefelinger wrote:
I wonder especially about case 2. I can see that "print y" makes a
call to
Y.__str__() . But Y.__str__() can be printed?? So what is 'print' exactly
doing?

It looks at sys.stdout.encoding. If this is set, and the thing to print
is a unicode string, it converts it to the stream encoding, and prints
the result of the conversion.

I hate to contradict an expert, but ISTM that it is
sys.getdefaultencoding() ('ascii') that is the problem, not
sys.stdout.encoding ('cp437')

gamma converts to cp437 just fine:

>>> gamma = u"\N{GREEK CAPITAL LETTER GAMMA}"
>>> sys.stdout.encoding
'cp437'
>>> gamma.encode(sys.stdout.encoding)
'\xe2'
>>> print gamma.encode(sys.stdout.encoding)
Γ
(prints a gamma)

Trying to encode gamma using the 'ascii' codec doesn't work:

>>> str(gamma)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0393' in position 0: ordinal not in range(128)
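
The two codec results above can be reproduced on a current interpreter. This is only a sketch and assumes Python 3, where every str is what the thread calls a unicode object and the encoded result is a bytes object:

```python
# Python 3 sketch (assumption: the thread itself used Python 2.3).
gamma = "\N{GREEK CAPITAL LETTER GAMMA}"

# cp437 has a code point for capital gamma, so this succeeds.
cp437_bytes = gamma.encode("cp437")
print(cp437_bytes)

# ASCII has no such character, so the same call fails.
try:
    gamma.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc
    print("ascii failed:", exc.reason)
```

The point is only that the failure depends on which codec is asked to do the conversion, not on the character itself.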

My guess is that internally, print keeps calling str() on its argument
until it gets a string object. So it calls y.__str__() yielding gamma,
then gamma.__str__() which raises the error.

If the default encoding is set to cp437 then it works fine:

>>> import sys
>>> sys.getdefaultencoding()
'cp437'
>>> gamma = u"\N{GREEK CAPITAL LETTER GAMMA}"
>>> str(gamma)
'\xe2'
>>> print gamma
Γ
(prints a gamma)
>>> print str(gamma)
Γ
(prints a gamma)

Kent

Jul 18 '05 #3

Martin v. Löwis

Kent Johnson wrote:

Martin v. Löwis wrote:
wolfgang haefelinger wrote:
I wonder especially about case 2. I can see that "print y" makes a
call to
Y.__str__() . But Y.__str__() can be printed?? So what is 'print'
exactly doing?

It looks at sys.stdout.encoding. If this is set, and the thing to print
is a unicode string, it converts it to the stream encoding, and prints
the result of the conversion.

I hate to contradict an expert, but ISTM that it is
sys.getdefaultencoding() ('ascii') that is the problem, not
sys.stdout.encoding ('cp437')

It seems we were answering different parts of the question. I answered
the part "What is 'print' exactly doing"; you answered the part as to
what the problem with str() conversion is (although I'm not sure whether
the OP has actually asked that question).

Also, the one case that is interesting here was not in your experiment:
try

print gamma

This should work, regardless of sys.getdefaultencoding(), as long as
sys.stdout.encoding supports the characters to be printed.

Regards,
Martin

Jul 18 '05 #4

wolfgang haefelinger

Hi Experts,

I'm actually not a Python expert so please bear with me and my naive
questions and remarks:

I was actually thinking that

print x

is just a kind of shortcut for writing (simplifying a bit):

import sys
if not (isinstance(x,str) or isinstance(x,unicode)) and x.__str__ :
    x = x.__str__()
sys.stdout.write(x)

Or in words: if x is not a string type but has method __str__ then

print x

behaves like

print x.__str__()

Given this assumption I'm wondering then why print x.__str__()
works but print x does not?

Is this a bug??

Cheers,
Wolfgang.

Jul 18 '05 #5

Martin v. Löwis

wolfgang haefelinger wrote:

I was actually thinking that

print x

is just a kind of shortcut for writing (simplifying a bit):

import sys
if not (isinstance(x,str) or isinstance(x,unicode)) and x.__str__ :
    x = x.__str__()
sys.stdout.write(x)

This is too simplified. For the context of this discussion,
it is rather

import sys
if isinstance(x, unicode) and sys.stdout.encoding:
    x = x.encode(sys.stdout.encoding)
x = str(x)
sys.stdout.write(x)

(This, of course, is still quite simplified. It ignores tp_print,
and it ignores softspaces.)

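
The three cases from the original post can be traced through that simplified model. The sketch below is an emulation written for Python 3, not the interpreter's real code: bytes stands in for Python 2's str, str stands in for unicode, and py2_str and py2_print are hypothetical helper names introduced only for illustration.

```python
def py2_str(obj, default_encoding="ascii"):
    """Rough model of Python 2 str(): the result is always a byte string."""
    if isinstance(obj, bytes):            # plays the role of Python 2 'str'
        return obj
    if isinstance(obj, str):              # plays the role of Python 2 'unicode'
        return obj.encode(default_encoding)
    return py2_str(obj.__str__(), default_encoding)


def py2_print(x, stream_encoding):
    """Bytes the simplified print would write (newline and softspace ignored)."""
    if isinstance(x, str) and stream_encoding:
        x = x.encode(stream_encoding)
    return py2_str(x)


class Y:
    def __str__(self):
        return self.c


gamma = "\N{GREEK CAPITAL LETTER GAMMA}"
y = Y()
y.c = gamma

case_1 = py2_print(gamma, "cp437")          # unicode object: stream encoding is used
case_2 = py2_print(y.__str__(), "cp437")    # same object, fetched by hand
try:
    case_3 = py2_print(y, "cp437")          # not unicode: goes through str() first
except UnicodeEncodeError as exc:
    case_3 = exc

print(case_1, case_2, type(case_3).__name__)
```

In this model the stream encoding is consulted only when the argument itself is a unicode object; an instance of Y takes the other branch and hits the default 'ascii' encoding.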
Or in words: if x is not a string type but has method __str__ then

print x

behaves like

print x.__str__()

No. There are many types for which this is not true; in this specific
case, it isn't true for Unicode objects.

Is this a bug??

No. You are just misunderstanding it.

Regards,
Martin

Jul 18 '05 #6

wolfgang haefelinger

Hi Martin,

if print is implemented like this then I begin to understand the problem.

Nevertheless, I regard

print y.__str__() ## works
print y ## fails??

as a very inconsistent behaviour.

Somehow I have the feeling that Python should give up the distinction
between unicode and str and just have a str type which is internally
unicode.
Anyway, thanks for answering
Wolfgang.

Jul 18 '05 #7

Bengt Richter

On Mon, 22 Nov 2004 08:04:08 GMT, "wolfgang haefelinger" <wh****@web.de> wrote:

Hi Martin,

if print is implemented like this then I begin to understand the problem.

Neverthelss, I regard

print y.__str__() ## works
print y ## fails??

as a very inconsistent behaviour.

Somehow I have the feeling that Python should give up the distinction
between unicode and str and just have a str type which is internally
unicode.
Anyway, thanks for answering
Wolfgang.

""Martin v. Löwis"" <ma****@v.loewis.de> wrote in message
news:41*********************@news.freenet.de...
wolfgang haefelinger wrote:
I was actually thinking that

print x

is just kind of shortcur for writing (simplifying bit):

import sys
if not (isinstance(x,str) or isinstance(x,unicode)) and x.__str__ :
x = x.__str__()
sys.stdout.write(x)

This is too simplifying. For the context of this discussion,
it is rather

import sys
if isinstance(x, unicode) and sys.stdout.encoding:
x = x.encode(sys.stdout.encoding)
x = str(x)
sys.stdout.write(x)

(this, of course, is still quite simplicated. It ignores tp_print,
and it ignores softspaces).
Or in words: if x is not a string type but has method __str__ then

print x

behaves like

print x.__str__()

No. There are many types for which this is not true; in this specific
case, it isn't true for Unicode objects.
Is this a bug??

No. You are just misunderstanding it.

Regards,
Martin

It's an old issue, and ISTM there is either a problem or it needs to be better explained.
My bet is on a problem ;-) ISTM the key is that a plain str type is a byte sequence but can
be interpreted as a byte-stream-encoded character sequence, and there are some seemingly
schizophrenic situations. E.g., start with a sequence of numbers, obviously just produced
by a polynomial formula having nothing to do with characters:

>>> numbers = [(lambda x: (-499*x**4 +4634*x**3 -13973*x**2 +13918*x +1824)/24)(x) for x in xrange(5)]
>>> numbers
[76, 246, 119, 105, 115]

Now if we convert those to str type characters with chr() and join them:

>>> s = ''.join(map(chr, numbers))

Then we have a sequence of bytes which could have had any numerical value in range(256). No character
encoding is assumed. Yet. If we now assume, say, a latin-1 encoding, we can decode the bytes into
unicode:

>>> u = s.decode('latin-1')
>>> type(u)
<type 'unicode'>

Now if we print that, sys.stdout.encoding should come into play:

>>> print u
Löwis

:-)

And we are ok, because we were explicit the whole way.
But if we don't decode s explicitly, it seems the system makes an assumption:
>>> print s
L÷wis

That is (if it survived) the 'cp437' character for byte '\xf6'. IOW, print seems
to assume that a plain str is encoded ready for output in sys.stdout.encoding in
a kind of reinterpret_cast of the str, or else a decode('cp437').encode('cp437')
optimized away.

>>> sys.stdout.encoding
'cp437'
>>> sys.getdefaultencoding()
'ascii'

If it were assuming s was encoded as ascii, it should really do s.decode('ascii').encode('cp437')
to get it printed, but for plain str literals it does not seem to do that. I.e.,

>>> s.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 1: ordinal not in range(128)

doesn't work, so it can't be doing that. It seems to print s as s.decode('cp437').encode('cp437')

>>> s.decode('cp437')
u'L\xf7wis'

but that is a wrong decoding (though the system can't be expected to know).

>>> print s.decode('cp437').encode('cp437')
L÷wis
>>> print s.decode('latin-1').encode('cp437')
Löwis
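
The same experiment can be replayed in Python 3, which is an assumption here since the session above is Python 2. There the byte sequence is a bytes object and nothing is decoded implicitly, so the choice of codec has to be spelled out. The function name poly is made up for the sketch; the coefficients are the ones from the lambda above.

```python
# Python 3 sketch: bytes plays the role of the thread's plain Python 2 str.
def poly(x):
    # Integer division is exact for x in range(5).
    return (-499*x**4 + 4634*x**3 - 13973*x**2 + 13918*x + 1824) // 24

raw = bytes(poly(x) for x in range(5))   # five byte values, no encoding attached

as_latin1 = raw.decode("latin-1")   # byte 0xf6 read as o with diaeresis
as_cp437 = raw.decode("cp437")      # the same byte read as a division sign

try:
    raw.decode("ascii")             # 0xf6 is outside ASCII altogether
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# ascii() keeps the output printable on any terminal encoding.
print(raw, ascii(as_latin1), ascii(as_cp437), ascii_failed)
```

The bytes never change; only the interpretation does, which is why the system cannot pick the "right" decoding without being told.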

What other decoding should be attempted, lacking an indication? sys.getdefaultencoding()
might be reasonable, but it seems to be locked into 'ascii' (I don't know how to set it)

>>> sys.getdefaultencoding = lambda: 'latin-1'
>>> sys.getdefaultencoding()
'latin-1'
>>> unicode('L\xf6wis')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 1: ordinal not in range(128)

So, bottom line, as Wolfgang effectively asked by his example, why does print try to coerce
the __str__ return value to ascii on the way to the output encoder, when there is encoding info
in the unicode object that it is happy to defer reencoding of for sys.stdout.encoding?

>>> s
'L\xf6wis'
>>> u
u'L\xf6wis'
>>> print s
L÷wis
>>> print u
Löwis
>>> class Y:
...     def __str__(self): return self.c
...
>>> y = Y()
>>> y.c = s
>>> print y
L÷wis
>>> y.c = u
>>> print y
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)
>>> print u
Löwis

Maybe the output of __str__ should be ok as a type basestring subclass for print, so
y.c = u
print y
above has the same result as
print u

It seems to be trying to do u.encode('ascii').decode('ascii').encode('cp437')
instead of directly u.encode('cp437') when __str__ is involved.

>>> print u'%s' % y
Löwis

works, and

>>> print '%s' % u
Löwis

works, and

>>> print y.__str__()
Löwis

and

>>> print y.c
Löwis

works,

>>> y.c
u'L\xf6wis'

but

>>> print '%s'%y
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)

and never mind print,

>>> '%s' % u
u'L\xf6wis'
>>> '%s' % y.__str__()
u'L\xf6wis'
>>> '%s' % y
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)

I guess it's that str.__mod__(self, other) can deal with a unicode other and get promoted, but
it must do str(other) instead of other.__str__(), or it would be able to promote the result in
the latter case too...

This seems like a possible change that could smooth things a bit, especially if print a,b,c
was then effectively the same as print ('%s'%a),('%s'%b),('%s'%c) with encoding promotion.

Regards,
Bengt Richter

Jul 18 '05 #8

Martin v. Löwis

wolfgang haefelinger wrote:

Nevertheless, I regard

print y.__str__() ## works
print y ## fails??

as a very inconsistent behaviour.

Notice that this also fails

x=str(y)

So it is really the string conversion that fails. Roughly the same
happens with

class X:
    def __str__(self):
        return -1

Here, instances of X also cannot be printed: str() is really supposed
to return a byte string object - not a number, not a unicode object.
As a special exception, __str__ can return a Unicode object, as long
as that result can be converted with the system default encoding into
a byte string object. So we really have

def str(o):
    if isinstance(o, types.StringType): return o
    if isinstance(o, types.UnicodeType): return o.encode(None)
    return str(o.__str__())

This is why the first print succeeds (it calls __str__ directly,
printing the Unicode object afterwards), and the second print fails
(trying to str()-convert its argument, which already fails - it
didn't get so far as to actually try to print something).
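
The asymmetry between calling __str__ directly and going through str() still exists in later interpreters. A minimal sketch, assuming Python 3 (where the rule is that __str__ must return a str rather than a byte string), using the class X from above:

```python
class X:
    def __str__(self):
        return -1            # wrong return type on purpose


x = X()

# A direct method call is an ordinary call: nothing checks the result.
direct = x.__str__()

# The built-in conversion checks what __str__ returned and refuses it.
try:
    converted = str(x)
except TypeError as exc:
    converted = exc

print(direct, type(converted).__name__)
```

So "x.__str__() works but str(x) fails" is a property of the conversion step, independent of any printing.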

Somehow I have the feeling that Python should give up the distinction
between unicode and str and just have a str type which is internally
unicode.

Yes, that should happen in P3k. But even then, there will be a
distinction between byte (plain) strings, and character (unicode)
strings.

Regards,
Martin

Jul 18 '05 #9

Martin v. Löwis

Bengt Richter wrote:

So, bottom line, as Wolfgang effectively asked by his example, why does print try to coerce
the __str__ return value to ascii on the way to the ouput encoder, when there is encoding info
in the unicode object that it is happy to defer reencoding of for sys.stdout.encoding?

[See my other posting:]
Because print invokes str() on its argument, unless the argument is
already a byte string (in which case it prints it directly), or a
Unicode string (in which case it encodes it with the stream encoding).
It is str(y) that fails, not the printing.

Regards,
Martin

Jul 18 '05 #10

Bengt Richter

On Tue, 23 Nov 2004 00:24:09 +0100, "Martin v. Löwis" <ma****@v.loewis.de> wrote:

Bengt Richter wrote:
So, bottom line, as Wolfgang effectively asked by his example, why does print try to coerce
the __str__ return value to ascii on the way to the ouput encoder, when there is encoding info
in the unicode object that it is happy to defer reencoding of for sys.stdout.encoding?
[See my other posting:]
Because print invokes str() on its argument, unless the argument is
already a byte string (in which case it prints it directly), or a

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^-- effectively an assumption that
bytestring.decode('some_unknown_encoding').encode( sys.stdout.encoding)
has already been done, it seems (I'm not arguing against).
Unicode string (in which case it encodes it with the stream encoding).
It is str(y) that fails, not the printing.

Yes, I think my turgid post did demonstrate that, among other things ;-)

So how about changing print so that it doesn't blindly use str(y), but instead
first tries to get y.__str__() in case the latter returns unicode?
Then print y can succeed the way print y.__str__() does now.

The same goes for str.__mod__ -- it apparently knows how to deal with '%s'% unicode(y)
so why shouldn't '%s'%y benefit when y.__str__ returns unicode?

I.e., str doesn't know that printing and '%s' can use unicode to good effect
if it is available, so for print and str.__mod__ blindly to use str() intermediately
throws away an opportunity to do better ISTM.

Regards,
Bengt Richter

Jul 18 '05 #11

Steve Holden

Am I the only person who found it scary that Bengt could apparently
casually drop in a polynomial that would decode to "Löwis"?

feel-dumb-just-being-in-the-same-newsgroup-ly y'rs - steve

--
http://www.holdenweb.com
http://pydish.holdenweb.com
Holden Web LLC +1 800 494 3119

Jul 18 '05 #12

Martin v. Löwis

Bengt Richter wrote:

Because print invokes str() on its argument, unless the argument is
already a byte string (in which case it prints it directly), or a
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^-- effectively an assumption that
bytestring.decode('some_unknown_encoding').encode( sys.stdout.encoding)
has already been done, it seems (I'm not arguing against).

Not really. sys.stdout really is a byte stream, which may or may
not *have* an encoding. Python tries to guess, and refuses to
in the face of ambiguity: e.g. if sys.stdout is a file, resulting
from

python mkimage.py > image.gif

then sys.stdout really does not *have* an encoding - but it still
is a byte stream. So copying the bytes to stdout is a
straight-forward thing to do.

Of course, "print" should only be used if the stream is meant to
transmit characters, and then the bytes written to the stream should
use the stream's encoding. This is indeed the assumption - but one
that the application author needs to make.
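
The distinction between a bare byte stream and a stream that carries an encoding can be shown with in-memory objects. This is a sketch under the assumption of Python 3's io module, which did not exist in this form when the thread was written; io.BytesIO stands in for a redirected stdout.

```python
import io

# A bare byte stream: bytes go in unchanged and no encoding is involved.
raw = io.BytesIO()
raw.write(b"GIF89a")

# The application decides that characters will be sent and picks the encoding.
text = io.TextIOWrapper(raw, encoding="cp437")
text.write("\N{GREEK CAPITAL LETTER GAMMA}")
text.flush()                      # push the encoded character down to raw

print(text.encoding, raw.getvalue())
```

The byte stream underneath never learns about characters; the encoding lives entirely in the layer the application puts on top.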

So how about changing print so that it doesn't blindly use str(y)

On the C level, this is already possible, through tp_print. Whether or
not this should be exposed to the Python level (or whether doing so
would just add to the confusion), I don't know.

but instead
first tries to get y.__str__() in case the latter returns unicode?
Then print y can succeed the way print y.__str__() does now.

As yet another alternative, print could invoke unicode(), if
there is a stream encoding. This would try __unicode__ first,
then fall back to call __str__. Patches in this direction would
be welcome - but the code implementing print is already quite
involved, so a redesign (with a PEP and everything) might also
be in order.
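
The lookup order described here can be sketched as follows. This is purely illustrative: to_text and emit are hypothetical names, the code is written for Python 3 (where __unicode__ has no special meaning and is treated as an ordinary method), and it is not how any released print is implemented.

```python
def to_text(obj):
    """Hypothetical helper: prefer __unicode__, fall back to __str__."""
    if isinstance(obj, str):
        return obj
    convert = getattr(type(obj), "__unicode__", None)
    if convert is not None:
        return convert(obj)
    return obj.__str__()


def emit(obj, stream_encoding):
    """Bytes such a print would send to a stream that has an encoding."""
    return to_text(obj).encode(stream_encoding)


class WithUnicode:
    def __unicode__(self):
        return "\N{GREEK CAPITAL LETTER GAMMA}"

    def __str__(self):
        return "G"


class OnlyStr:
    def __str__(self):
        return "plain"


print(emit(WithUnicode(), "cp437"), emit(OnlyStr(), "cp437"))
```

With this order, the text is encoded exactly once, with the stream encoding, and the default encoding never gets involved.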

In P3k, this part of the issue will go away, as str() then will
return Unicode strings.

I.e., str doesn't know that printing and '%s' can use unicode to good effect
if it is available, so for print and str.__mod__ blindly to use str() intermediately
throws away an opportunity to do better ISTM.

That is true. Of course, there is already so much backwards
compatibility in this that any change to behaviour (such as
trying unicode() before trying str()) might break things.

Regards,
Martin

Jul 18 '05 #13

Martin v. Löwis

Steve Holden wrote:

Am I the only person who found it scary that Bengt could apparently
casually drop in a polynomial that would decode to "Löwis"?

I'm not scared, but honored, of course.

Regards,
Martin

Jul 18 '05 #14

Bengt Richter

On Tue, 23 Nov 2004 20:37:04 +0100, "Martin v. Löwis" <ma****@v.loewis.de> wrote:

Steve Holden wrote:
Am I the only person who found it scary that Bengt could apparently
casually drop in a polynomial that would decode to "Löwis"?

Well, don't give me too much credit, though I admit enjoying a little unearned
flattered-ego buzz ;-) But it's not a big deal if you had recently implemented
an automatic lambda-printer-outer to solve for a polynomial function f such that
f(0)==k0, f(1)==k1, .. f(n)==kn. For a single number k0 that will be lambda x: k0
and for two numbers k0, k1 will be lambda x: k0 + x*(k1-k0) etc. It's a matter of
solving some simultaneous equations for the coefficient values, which I had done
in response to a previous thread. For that, I happened to have had some experience
from the '60s writing variations on an equation solver (back when we congratulated
ourselves on getting all (software-implemented) floating point ops other than divide
to execute in under a millisecond ;-) Here I was using an exact decimal module I happened
to have (also built in response to previous thread discussion ;-), so I didn't even have
to look for maximum abs pivot elements in the matrix for this one. And it didn't have to be fast.
So it was kind of a fun exercise. But anyway, it was all ready to go at this point, so
all I had to do was run coeffsx.py with the character ord values as args on the command line.
The opportunity to use it in a fun way to fake casual wizardry was just dumb luck ;-)
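
For the curious, the idea of solving the simultaneous equations exactly can be sketched like this. It is not coeffsx.py, whose source is not shown in the thread; it assumes Python 3 and uses fractions.Fraction in place of the exact decimal module mentioned above, with interpolating_coefficients and evaluate as made-up names.

```python
from fractions import Fraction


def interpolating_coefficients(values):
    """Coefficients c (lowest power first) with sum(c[j] * i**j) == values[i]."""
    n = len(values)
    # Augmented Vandermonde matrix: row i is [i**0, ..., i**(n-1) | values[i]].
    rows = [[Fraction(i) ** j for j in range(n)] + [Fraction(v)]
            for i, v in enumerate(values)]
    for col in range(n):
        # Exact arithmetic: any non-zero pivot will do, no max-abs search needed.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [entry / pivot_value for entry in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


def evaluate(coefficients, x):
    return sum(c * x ** j for j, c in enumerate(coefficients))


codes = [76, 246, 119, 105, 115]            # byte values of 'L\xf6wis'
coeffs = interpolating_coefficients(codes)
recovered = [int(evaluate(coeffs, x)) for x in range(len(codes))]
print([str(c) for c in coeffs], recovered)
```

Putting the coefficients over the common denominator 24 gives back the five-term polynomial posted earlier in the thread.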

I'm not scared, but honored, of course.

A bit late responding, but I couldn't think of a clever followup to that ;-)
But Just to play fair,

print ''.join([chr((lambda x: (
-6244372133*x**31 +3013910052086*x**30 -695396351572920*x**29
+102105752307741620*x**28 -10715303804974659632*x**27 +855734314951919397204*x**26
-54067713339116101354860*x**25 +2774121296568607137441900*x**24
-117725625258165396333623970*x**23 +4187405270602160539007125440*x**22
-126060225187601954901807327900*x**21 +3234908736910295469078183101700*x**20
-71121878980966418114205095297640*x**19 +1344268902923717571167117226451980*x**18
-21886601404074660751245403749948900*x**17 +307180698948793841846368910776059300*x**16
-3714719218772170154406066269371644945*x**15 +38641327091060849304069885597725238090*x**14
-344757809926306996671359721670334393500*x**13 +2627069115710241704477921121071756668600*x**12
-16998869426095431823754237370045113150352*x**11 +92697362475995606001274610327169882407584*x**10
-421837211162827653880286870838716820642880*x**9 +1581695033356657201434736494281105646218880*x**8
-4805817748883837636614530805204695373091328*x**7 +11572394080794032785251889126742747327087616*x**6
-21417820944419013080374525134500006003159040*x**5 +29141767437911436346798089144038222112768000*x**4
-27186086428826094346108431447644781404160000*x**3 +15339943556592952236643053124047771402240000*x**2
-3882253738078295379102517100266822041600000*x +230239482316981838896315760640000000)
/2740946218059307605908520960000000
)(x)) for x in xrange(32)])

Not-ready-to-be-mythologized-though-plenty-flatterable-ly y'rs

Regards,
Bengt Richter

Jul 18 '05 #15
