unicode question

Hi,

I wonder whether someone could explain to me what's going on here:

import sys

# I'm running Mandrake 10 and Windows XP.
print sys.version

## 2.3.3 (#2, Feb 17 2004, 11:45:40) [GCC 3.3.2 (Mandrake Linux 10.0 3.3.2-6mdk)]
## 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)]

print "sys.getdefaultencoding = ", sys.getdefaultencoding()
# This always prints "ascii".

## just a class
class Y:
    def __str__(self):
        return self.c

## define unicode character (ie. string)
gamma = u"\N{GREEK CAPITAL LETTER GAMMA}"

y = Y()
y.c = gamma

## works fine: prints greek capital gamma on terminal on windows (chcp 437).
## On Mandrake 10 nothing gets printed, but at least no exception gets thrown.
print gamma # (1)

## same as before ..
print y.__str__() # (2)

## encoding error
print y # (3) ??????????????

## ascii encoding error ..
sys.stdout.write(gamma) # (4)

I wonder especially about case 2. I can see that "print y" makes a call to
Y.__str__(). But the result of Y.__str__() can be printed?? So what exactly
is 'print' doing?

Thanks for any help,
Wolfgang.



Jul 18 '05 #1
wolfgang haefelinger wrote:
I wonder especially about case 2. I can see that "print y" makes a call to
Y.__str__(). But Y.__str__() can be printed?? So what is 'print' exactly
doing?


It looks at sys.stdout.encoding. If this is set, and the thing to print
is a unicode string, it converts it to the stream encoding, and prints
the result of the conversion.

Regards,
Martin
Jul 18 '05 #2
Martin v. Löwis wrote:
wolfgang haefelinger wrote:
I wonder especially about case 2. I can see that "print y" makes a call to
Y.__str__(). But Y.__str__() can be printed?? So what is 'print' exactly
doing?

It looks at sys.stdout.encoding. If this is set, and the thing to print
is a unicode string, it converts it to the stream encoding, and prints
the result of the conversion.


I hate to contradict an expert, but ISTM that it is
sys.getdefaultencoding() ('ascii') that is the problem, not
sys.stdout.encoding ('cp437')

gamma converts to cp437 just fine:
>>> gamma = u"\N{GREEK CAPITAL LETTER GAMMA}"
>>> sys.stdout.encoding
'cp437'
>>> gamma.encode(sys.stdout.encoding)
'\xe2'
>>> print gamma.encode(sys.stdout.encoding)
Γ
(prints a gamma)

Trying to encode gamma using the 'ascii' codec doesn't work:
>>> str(gamma)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0393' in position 0: ordinal not in range(128)

My guess is that internally, print keeps calling str() on its argument
until it gets a string object. So it calls y.__str__() yielding gamma,
then gamma.__str__() which raises the error.

If the default encoding is set to cp437 then it works fine:
>>> import sys
>>> sys.getdefaultencoding()
'cp437'
>>> gamma = u"\N{GREEK CAPITAL LETTER GAMMA}"
>>> str(gamma)
'\xe2'
>>> print gamma
Γ
(prints a gamma)
>>> print str(gamma)
Γ
(prints a gamma)

Kent

Jul 18 '05 #3
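
Kent's experiment can be reproduced on a current interpreter. The sketch below is a port to Python 3, which is an assumption since the thread uses Python 2.3: Python 3 `str` plays the role of Python 2 `unicode`, and `bytes` plays the role of Python 2 `str`. The helper name `try_encode` is hypothetical.

```python
# Python 3 port (an assumption; the thread itself runs Python 2.3).
gamma = "\N{GREEK CAPITAL LETTER GAMMA}"


def try_encode(text, encoding):
    """Return the encoded bytes, or the exception if the codec cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


cp437_result = try_encode(gamma, "cp437")   # cp437 has a slot for capital gamma
ascii_result = try_encode(gamma, "ascii")   # ascii does not

print(repr(cp437_result))
print(type(ascii_result).__name__)
```

The same character encodes cleanly with one codec and fails with the other, so the outcome depends on which encoding a given code path selects.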
Kent Johnson wrote:
Martin v. Löwis wrote:
wolfgang haefelinger wrote:
I wonder especially about case 2. I can see that "print y" makes a call to
Y.__str__(). But Y.__str__() can be printed?? So what is 'print'
exactly doing?


It looks at sys.stdout.encoding. If this is set, and the thing to print
is a unicode string, it converts it to the stream encoding, and prints
the result of the conversion.

I hate to contradict an expert, but ISTM that it is
sys.getdefaultencoding() ('ascii') that is the problem, not
sys.stdout.encoding ('cp437')


It seems we were answering different parts of the question. I answered
the part "What is 'print' exactly doing"; you answered the part as to
what the problem with str() conversion is (although I'm not sure whether
the OP has actually asked that question).

Also, the one case that is interesting here was not in your experiment:
try

print gamma

This should work, regardless of sys.getdefaultencoding(), as long as
sys.stdout.encoding supports the characters to be printed.

Regards,
Martin
Jul 18 '05 #4
Hi Experts,

I'm actually not a Python expert so please bear with me and my naive
questions and remarks:

I was actually thinking that

print x

is just a kind of shortcut for writing (simplifying a bit):

import sys
if not (isinstance(x, str) or isinstance(x, unicode)) and x.__str__:
    x = x.__str__()
sys.stdout.write(x)

Or in words: if x is not a string type but has method __str__ then

print x

behaves like

print x.__str__()

Given this assumption I'm wondering then why print x.__str__()
works but print x does not?

Is this a bug??

Cheers,
Wolfgang.
Jul 18 '05 #5
wolfgang haefelinger wrote:
I was actually thinking that

print x

is just a kind of shortcut for writing (simplifying a bit):

import sys
if not (isinstance(x, str) or isinstance(x, unicode)) and x.__str__:
    x = x.__str__()
sys.stdout.write(x)

This is too simplified. For the context of this discussion,
it is rather

import sys
if isinstance(x, unicode) and sys.stdout.encoding:
    x = x.encode(sys.stdout.encoding)
x = str(x)
sys.stdout.write(x)

(This, of course, is still quite simplified. It ignores tp_print,
and it ignores softspaces.)

Or in words: if x is not a string type but has method __str__ then

print x

behaves like

print x.__str__()

No. There are many types for which this is not true; in this specific
case, it isn't true for Unicode objects.

Is this a bug??


No. You are just misunderstanding it.

Regards,
Martin
Jul 18 '05 #6
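
Martin's pseudo-code can be turned into a small runnable model. The sketch below is written for Python 3 (an assumption, since the thread's interpreter is 2.3), with Python 3 `str` standing in for Python 2 `unicode` and `bytes` for Python 2 `str`. `model_str`, `model_print`, and the `__init__` added to `Y` are hypothetical helpers, and the default encoding is passed in as a parameter instead of being read from the interpreter.

```python
# Model of the simplified Python 2 `print` described above (illustration only).

def model_str(x, default_encoding="ascii"):
    """Model of Python 2 str(): the result must end up as a byte string."""
    if isinstance(x, bytes):
        return x
    if isinstance(x, str):
        # A unicode result is converted with the default encoding.
        return x.encode(default_encoding)
    return model_str(x.__str__(), default_encoding)


def model_print(x, stream_encoding, default_encoding="ascii"):
    """Return the bytes that would be written to a stream with this encoding."""
    if isinstance(x, str) and stream_encoding:
        x = x.encode(stream_encoding)
    return model_str(x, default_encoding)


class Y:
    def __init__(self, c):
        self.c = c

    def __str__(self):
        return self.c


gamma = "\N{GREEK CAPITAL LETTER GAMMA}"
y = Y(gamma)

direct = model_print(gamma, "cp437")             # case (1): unicode, stream encoding used
via_method = model_print(y.__str__(), "cp437")   # case (2): still a unicode argument
try:
    model_print(y, "cp437")                      # case (3): goes through model_str
    case3_error = None
except UnicodeEncodeError as exc:
    case3_error = exc

print(direct, via_method, type(case3_error).__name__)
```

In this model, cases (1) and (2) never reach the default encoding because the argument handed to `model_print` is already text. Case (3) hands over an object, so the stream-encoding branch is skipped and the conversion falls back to `'ascii'`.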
Hi Martin,

if print is implemented like this then I begin to understand the problem.

Nevertheless, I regard

print y.__str__() ## works
print y ## fails??

as a very inconsistent behaviour.

Somehow I have the feeling that Python should give up the distinction
between unicode and str and just have a str type which is internally
unicode.

Anyway, thanks for answering.
Wolfgang.
Jul 18 '05 #7




On Mon, 22 Nov 2004 08:04:08 GMT, "wolfgang haefelinger" <wh****@web.de> wrote:
Hi Martin,

if print is implemented like this then I begin to understand the problem.

Nevertheless, I regard

print y.__str__() ## works
print y ## fails??

as a very inconsistent behaviour.

It's an old issue, and ISTM there is either a problem or it needs to be better explained.
My bet is on a problem ;-) ISTM the key is that a plain str type is a byte sequence but can
be interpreted as a byte-stream-encoded character sequence, and there are some seemingly
schizophrenic situations. E.g., start with a sequence of numbers, obviously just produced
by a polynomial formula having nothing to do with characters:
>>> numbers = [(lambda x: (-499*x**4 +4634*x**3 -13973*x**2 +13918*x +1824)/24)(x) for x in xrange(5)]
>>> numbers
[76, 246, 119, 105, 115]

Now if we convert those to str type characters with chr() and join them:
>>> s = ''.join(map(chr, numbers))
Then we have a sequence of bytes which could have had any numerical value in range(256). No character
encoding is assumed. Yet. If we now assume, say, a latin-1 encoding, we can decode the bytes into
unicode:
>>> u = s.decode('latin-1')
>>> type(u)
<type 'unicode'>

Now if we print that, sys.stdout.encoding should come into play:
>>> print u
Löwis

:-)

And we are ok, because we were explicit the whole way.
But if we don't decode s explicitly, it seems the system makes an assumption:
>>> print s
L÷wis

That is (if it survived) the 'cp437' character for byte '\xf6'. IOW, print seems
to assume that a plain str is encoded ready for output in sys.stdout.encoding in
a kind of reinterpret_cast of the str, or else a decode('cp437').encode('cp437')
optimized away.
>>> sys.stdout.encoding
'cp437'
>>> sys.getdefaultencoding()
'ascii'

If it were assuming s was encoded as ascii, it should really do s.decode('ascii').encode('cp437')
to get it printed, but for plain str literals it does not seem to do that. I.e.,
>>> s.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 1: ordinal not in range(128)

doesn't work, so it can't be doing that. It seems to print s as s.decode('cp437').encode('cp437')
>>> s.decode('cp437')
u'L\xf7wis'

but that is a wrong decoding (though the system can't be expected to know).
>>> print s.decode('cp437').encode('cp437')
L÷wis
>>> print s.decode('latin-1').encode('cp437')
Löwis

What other decoding should be attempted, lacking an indication? sys.getdefaultencoding()
might be reasonable, but it seems to be locked into 'ascii' (I don't know how to set it)
>>> sys.getdefaultencoding = lambda: 'latin-1'
>>> sys.getdefaultencoding()
'latin-1'
>>> unicode('L\xf6wis')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 1: ordinal not in range(128)

So, bottom line, as Wolfgang effectively asked by his example, why does print try to coerce
the __str__ return value to ascii on the way to the output encoder, when there is encoding info
in the unicode object that it is happy to defer reencoding of for sys.stdout.encoding?
>>> s
'L\xf6wis'
>>> u
u'L\xf6wis'
>>> print s
L÷wis
>>> print u
Löwis
>>> class Y:
...     def __str__(self): return self.c
...
>>> y = Y()
>>> y.c = s
>>> print y
L÷wis
>>> y.c = u
>>> print y
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)
>>> print u
Löwis

Maybe the output of __str__ should be ok as a type basestring subclass for print, so
>>> y.c = u
>>> print y
above has the same result as
>>> print u

It seems to be trying to do u.encode('ascii').decode('ascii').encode('cp437')
instead of directly u.encode('cp437') when __str__ is involved.
>>> print u'%s' % y
Löwis

works, and
>>> print '%s' % u
Löwis

works, and
>>> print y.__str__()
Löwis

and
>>> print y.c
Löwis

works,
>>> y.c
u'L\xf6wis'

but
>>> print '%s'%y
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)

and never mind print,
>>> '%s' % u
u'L\xf6wis'
>>> '%s' % y.__str__()
u'L\xf6wis'
>>> '%s' % y
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)

I guess it's that str.__mod__(self, other) can deal with a unicode other and get promoted, but
it must do str(other) instead of other.__str__(), or it would be able to promote the result in
the latter case too...

This seems like a possible change that could smooth things a bit, especially if print a,b,c
was then effectively the same as print ('%s'%a),('%s'%b),('%s'%c) with encoding promotion.

Regards,
Bengt Richter
Jul 18 '05 #8
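
Bengt's byte-sequence experiment can be checked without a cp437 terminal. The sketch below is a Python 3 port (an assumption; the thread uses Python 2), so `bytes` stands in for the plain `str` and `//` replaces the integer `/` of Python 2. The variable names other than `numbers` and `s` are hypothetical.

```python
# The same five numbers, read as bytes under three different assumptions.
numbers = [(-499*x**4 + 4634*x**3 - 13973*x**2 + 13918*x + 1824) // 24
           for x in range(5)]
s = bytes(numbers)                 # plain bytes, no encoding implied

as_latin1 = s.decode("latin-1")    # byte 0xf6 read as U+00F6 (o with diaeresis)
as_cp437 = s.decode("cp437")       # the same byte read as U+00F7 (division sign)

try:
    s.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc              # 0xf6 is outside range(128)

# Writing to a cp437 stream: only the cp437 reading round-trips to the original bytes.
roundtrip = as_cp437.encode("cp437")
reencoded = as_latin1.encode("cp437")

print(numbers, ascii(as_latin1), ascii(as_cp437), roundtrip, reencoded)
```

The bytes themselves carry no encoding; the result depends entirely on which codec is applied at the moment of decoding.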
wolfgang haefelinger wrote:
Nevertheless, I regard

print y.__str__() ## works
print y ## fails??

as a very inconsistent behaviour.

Notice that this also fails

x=str(y)

So it is really the string conversion that fails. Roughly the same
happens with

class X:
    def __str__(self):
        return -1

Here, instances of X also cannot be printed: str() is really supposed
to return a byte string object - not a number, not a unicode object.
As a special exception, __str__ can return a Unicode object, as long
as that result can be converted with the system default encoding into
a byte string object. So we really have

def str(o):
    if isinstance(o, types.StringType): return o
    if isinstance(o, types.UnicodeType): return o.encode(None)
    return str(o.__str__())

This is why the first print succeeds (it calls __str__ directly,
printing the Unicode object afterwards), and the second print fails
(trying to str()-convert its argument, which already fails - it
didn't get so far as to actually try to print something).

Somehow I have the feeling that Python should give up the distinction
between unicode and str and just have a str type which is internally
unicode.


Yes, that should happen in P3k. But even then, there will be a
distinction between byte (plain) strings, and character (unicode)
strings.

Regards,
Martin
Jul 18 '05 #9
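
The rule that `__str__` must hand back a string can also be observed on later interpreters. As a sketch, assuming Python 3 (which the thread predates), `str()` rejects both a number and a byte string returned from `__str__`. The class names `ReturnsNumber` and `ReturnsBytes` and the helper `conversion_error` are hypothetical.

```python
class ReturnsNumber:
    def __str__(self):
        return -1          # same shape as Martin's class X


class ReturnsBytes:
    def __str__(self):
        return b"abc"      # a byte string is not a text string in Python 3


def conversion_error(obj):
    """Return the exception raised by str(obj), or None if it succeeds."""
    try:
        str(obj)
    except TypeError as exc:
        return exc
    return None


number_error = conversion_error(ReturnsNumber())
bytes_error = conversion_error(ReturnsBytes())
print(type(number_error).__name__, type(bytes_error).__name__)
```

Calling `obj.__str__()` directly performs no such check, which mirrors the difference between `print y.__str__()` and `print y` discussed in the thread.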
Bengt Richter wrote:
So, bottom line, as Wolfgang effectively asked by his example, why does print try to coerce
the __str__ return value to ascii on the way to the output encoder, when there is encoding info
in the unicode object that it is happy to defer reencoding of for sys.stdout.encoding?


[See my other posting:]
Because print invokes str() on its argument, unless the argument is
already a byte string (in which case it prints it directly), or a
Unicode string (in which case it encodes it with the stream encoding).
It is str(y) that fails, not the printing.

Regards,
Martin
Jul 18 '05 #10
