Q: The `print' statement over Unicode

Hi, people. I hope someone would like to enlighten me.

For any application handling Unicode internally, I'm usually careful
to convert those Unicode strings properly into 8-bit strings before
writing them out.
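
A minimal sketch of that "encode before writing" habit. It is written for Python 3 (where `str` plays the role of the `unicode` type used in this thread), not for the Python 2.3 under discussion. The helper name `write_encoded` and its UTF-8 fallback are illustrative assumptions, not something taken from the original post:

```python
import io


def write_encoded(binary_stream, text, encoding=None, fallback="utf-8"):
    """Encode text explicitly, then write the resulting bytes.

    `fallback` (an assumption of this sketch) is used when the caller
    has no encoding to offer, e.g. because a stream reports none.
    """
    data = text.encode(encoding or fallback, errors="replace")
    binary_stream.write(data + b"\n")
    return data


# Stand-in for a real binary output stream such as sys.stdout.buffer.
out = io.BytesIO()
write_encoded(out, "Fran\xe7ois", encoding="latin-1")
```

Doing the conversion by hand like this keeps the choice of codec visible in the program instead of leaving it to whatever the output stream happens to report.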

However, this morning, I mistakenly forgot to do so before using one
Unicode string (containing a non-ASCII character) as an argument to
the `print' statement, and I did _not_ get an error. This is rather
surprising to me. I reread the section of the Python reference manual
(version 2.3.4, this machine uses 2.3.3 currently), and it does not say
anything about special processing of Unicode strings.

In my understanding, when `print' is given an argument which is not
already a string (I read: 8-bit string), it first gets converted into
a string (I read: calling __str__). But if I call `str()' explicitly,
_then_ I get an error as expected. The question is, why is there no
error if I do not call `str()' explicitly?

For example, given file `question.py' with these contents:

# -*- coding: UTF-8 -*-
texte = unicode("Fran\xe7ois", 'latin1')
print type(texte), repr(texte), texte
print type(texte), repr(texte), str(texte)

doing `python question.py' yields:

<type 'unicode'> u'Fran\xe7ois' François
<type 'unicode'> u'Fran\xe7ois'
Traceback (most recent call last):
  File "question.py", line 4, in ?
    print type(texte), repr(texte), str(texte)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' \
    in position 4: ordinal not in range(128)

(last line wrapped for legibility).

So (trying to be crystal clear), why does the first `print' succeed on
its third argument, while the second `print' fails? How does `print'
convert that Unicode string to an 8-bit string for output, if not through
`str()'? What is missing from the documentation, or from my understanding
of it?

--
François Pinard http://pinard.progiciels-bpi.ca
Jul 19 '05 #1


AFAIK, print uses sys.stdout.encoding to encode the unicode string.
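
The difference between the two `print` lines can be imitated as follows. This is only a sketch in Python 3 syntax (the thread is about Python 2.3, whose `print` statement and `unicode` type no longer exist), and the ASCII fallback used when the stream reports no encoding is an assumption made for illustration:

```python
import sys

# Same text as in question.py; Python 3's str is what Python 2 called unicode.
texte = b"Fran\xe7ois".decode("latin1")

# Roughly what str(texte) attempted in Python 2: encode with the ASCII codec.
try:
    texte.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Roughly what print does per the reply above: use the stream's own encoding.
# Falling back to ASCII when none is reported is an assumption of this sketch.
stream_encoding = getattr(sys.stdout, "encoding", None) or "ascii"
encoded = texte.encode(stream_encoding, errors="replace")
```

The explicit ASCII attempt fails at position 4, matching the traceback in the question, while the encoding taken from the stream depends on the terminal and locale in use.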

Thomas
Jul 19 '05 #2

[Thomas Heller]
> François Pinard <pi****@iro.umontreal.ca> writes:
> > [...] given file `question.py' with this contents:
> >
> > # -*- coding: UTF-8 -*-
> > texte = unicode("Fran\xe7ois", 'latin1')
> > print type(texte), repr(texte), texte
> > print type(texte), repr(texte), str(texte)
> >
> > doing `python question.py' yields:
> >
> > <type 'unicode'> u'Fran\xe7ois' François
> > <type 'unicode'> u'Fran\xe7ois'
> > Traceback (most recent call last):
> >   File "question.py", line 4, in ?
> >     print type(texte), repr(texte), str(texte)
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' \
> >     in position 4: ordinal not in range(128)
> >
> > [...] why is the first `print' working over its third argument, but
> > not the second? How does `print' convert that Unicode string to a
> > 8-bit string for output, if not through `str()'? What is missing to
> > the documentation, or to my way of understanding it?
>
> AFAIK, print uses sys.stdout.encoding to encode the unicode string.


Much thanks for this information.

I was not aware of this file attribute. Looking around, I found a
quick description in the Library Reference, under "2.3.8 File Objects".
However, I did not find in the documentation the rules stating how
or when this attribute receives a value, and in particular here, for
the case of `sys.stdout'. The Reference Manual, under "6.6 The print
statement", is silent about how Unicode strings are handled.

Am I looking in the wrong places, or else, should not the standard
documentation more handily explain such things?

--
François Pinard http://pinard.progiciels-bpi.ca
Jul 19 '05 #3

François Pinard wrote:
Am I looking in the wrong places, or else, should not the standard
documentation more handily explain such things?


It should, but, alas, it doesn't. Contributions are welcome.

The algorithm to set sys.std{in,out}.encoding is in
sysmodule.c:_PySys_Init and pythonrun.c:Py_InitializeEx
and goes roughly as follows:

- On Windows, if isatty returns true, use GetConsoleCP and
  GetConsoleOutputCP.
- On Unix, if isatty returns true, langinfo.h is present,
  CODESET is defined, and nl_langinfo(CODESET) returns a
  non-empty string, use that.
- otherwise, .encoding will not be set.
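
The three rules above can be sketched in Python roughly as follows. This is an approximation for illustration only, not the interpreter's C code: the function name `guess_stream_encoding` and its `platform` parameter are hypothetical, only the output side (GetConsoleOutputCP) is handled on Windows, and no `setlocale` call is made, so on Unix the value returned may reflect the default C locale instead of the user's settings.

```python
import locale
import sys


def guess_stream_encoding(stream, platform=sys.platform):
    """Return an encoding name, or None where .encoding would stay unset."""
    try:
        is_tty = stream.isatty()
    except (AttributeError, ValueError):
        is_tty = False
    if not is_tty:
        # Pipes and files: none of the rules apply.
        return None

    if platform == "win32":
        # Windows rule: ask the console for its output code page.
        import ctypes
        code_page = ctypes.windll.kernel32.GetConsoleOutputCP()
        return "cp%d" % code_page if code_page else None

    # Unix rule: CODESET must be available and yield a non-empty string.
    codeset_const = getattr(locale, "CODESET", None)
    nl_langinfo = getattr(locale, "nl_langinfo", None)
    if codeset_const is not None and nl_langinfo is not None:
        codeset = nl_langinfo(codeset_const)
        if codeset:
            return codeset
    return None
```

Calling `guess_stream_encoding(sys.stdout)` from a terminal and then again with output redirected to a file shows the isatty dependence: in the redirected case the sketch returns None, corresponding to the last rule.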

Regards,
Martin
Jul 19 '05 #4

[Martin von Löwis]
> François Pinard wrote:
> > Am I looking in the wrong places, or else, should not the standard
> > documentation more handily explain such things?
>
> It should, but, alas, it doesn't. Contributions are welcome.

My contributions are not that welcome. If they were, the core team
would not try forcing me into using robots and bug trackers! :-)

> The algorithm to set sys.std{in,out}.encoding is in
> sysmodule.c:_PySys_Init and pythonrun.c:Py_InitializeEx
> and goes roughly as follows:
>
> - On Windows, if isatty returns true, use GetConsoleCP and
>   GetConsoleOutputCP.
> - On Unix, if isatty returns true, langinfo.h is present,
>   CODESET is defined, and nl_langinfo(CODESET) returns a
>   non-empty string, use that.
> - otherwise, .encoding will not be set.


Thanks. Your kind explanation, above, should make it, as is, somewhere
in the documentation -- until Xah decides to rewrite it, of course! :-).

--
François Pinard http://pinard.progiciels-bpi.ca
Jul 19 '05 #5

François Pinard wrote:
> My contributions are not that welcome. If they were, the core team
> would not try forcing me into using robots and bug trackers! :-)

Ok, then we need to wait for somebody else to contribute a documentation
patch.

> Thanks. Your kind explanation, above, should make it, as is, somewhere
> in the documentation


But how will that happen? Unless somebody contributes a documentation
patch, the documentation will not change magically!

Regards,
Martin
Jul 19 '05 #6

On Sat, 07 May 2005 12:10:46 -0400, François Pinard wrote:
[Martin von Löwis]
François Pinard wrote:
> Am I looking in the wrong places, or else, should not the standard
> documentation more handily explain such things?

It should, but, alas, it doesn't. Contributions are welcome.


My contributions are not that welcome. If they were, the core team
would not try forcing me into using robots and bug trackers! :-)


I'm not sure that the smiley completely de-fangs this comment.

Have you ever tried managing a project even a tenth the size of Python
*without* those tools? If you had any idea of the kind of continuous,
day-in, day-out *useless busywork* you were asking of the developers,
merely to save you a minute or two on the one occasion you have something
to contribute, you'd apologize for the incredibly unreasonable demand you
are making from people giving you an amazing amount of free stuff. (You'd
get a lot less of it, too; administration isn't coding, and excessive
administration makes the coding even less fun and thus less likely to be
done.) An apology would not be out of line, smiley or no.

I've never administered anything the size of Python. I have, however, been
up close and personal with a project that had about five developers
full-time, and administering *that* without bug trackers would have been a
nightmare. I can't even imagine trying to run Python by hand.... at least
not that and getting useful work done too.
Jul 19 '05 #7

Jeremy Bowers <je**@jerf.org> writes:
On Sat, 07 May 2005 12:10:46 -0400, François Pinard wrote:
[Martin von Löwis]
François Pinard wrote:

> Am I looking in the wrong places, or else, should not the standard
> documentation more handily explain such things?

It should, but, alas, it doesn't. Contributions are welcome.


My contributions are not that welcome. If they were, the core team
would not try forcing me into using robots and bug trackers! :-)


I'm not sure that the smiley completely de-fangs this comment.
Have you ever tried managing a project even a tenth the size of Python
*without* those tools? If you had any idea of the kind of continuous

[...]

I don't mean to put words into François' mouth, but IIRC he managed,
for example, GNU tar for some time and, while using some kind of
tracking system "under the covers", didn't impose it on his users.

IMVHO, that was very nice of him, but I'd be reluctant to attempt to
enforce this way of working on a hard-working and competent
contributor to an open source project to which I'm not a core
contributor myself.
John
Jul 19 '05 #8

On Sun, 08 May 2005 13:46:22 +0000, John J. Lee wrote:
I don't mean to put words into François' mouth, but IIRC he managed,
for example, GNU tar for some time and, while using some kind of
tracking system "under the covers", didn't impose it on his users.

IMVHO, that was very nice of him, but I'd be reluctant to attempt to
enforce this way of working on a hard-working and competent
contributor to an open source project to which I'm not a core
contributor myself.


Then I'd honor his consistency of belief, but still consider it impolite
in general, as asking someone to do tons of work overall to save you a bit
is almost always impolite.
Jul 19 '05 #9

Jeremy Bowers wrote:
Then I'd honor his consistency of belief, but still consider it impolite
in general, as asking someone to do tons of work overall to save you a bit
is almost always impolite.


This is not what he did, though - he did not break "the protocol" by
sending in patches by email (which indeed we would reject). Instead, he
said (before) that he cannot contribute because he is
unwilling or unable to use a bug tracker. This is an acceptable
position: contributors are volunteers, and he chooses not to volunteer.
He then has to accept (in the specific case) that the documentation is
imprecise/incomplete.

More precisely, he is correct that *his* contribution is not welcome,
contrary to my broad statement "contributions are welcome". The
narrower statement "contributions that follow the guidelines
are welcome" still stands.

Regards,
Martin
Jul 19 '05 #10
