Bytes IT Community

PEP-0263 and default encoding

Hi,

After upgrading my Python interpreter to 2.3.1 I constantly get
warnings like this:

DeprecationWarning: Non-ASCII character '\xe6' in file
mumble.py on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details

And while I understand the problem, I cannot fathom why Python
doesn't simply rely on the encoding I have specified in site.py,
which then calls sys.setdefaultencoding().

Would anyone care to explain why the community has chosen to
inconvenience the user for each python script with non-ASCII
characters, rather than using the default encoding given in the
site.py configuration file?

Cheers,

// Klaus

--
<> o mordo tua nuora, o aro un autodromo

Jul 18 '05 #1
18 Replies


Yes! Why?

And me (French), I add:
# -*- coding: cp1252 -*-
at the beginning of my scripts. But what if I want to write a script for
two, three, etc. languages?

or

How can I have ASCII AND non-ASCII characters in scripts?
* sorry for my bad English *
@-salutations
--
Michel Claveau


Jul 18 '05 #2

"News M Claveau /Hamster-P" <es****@mci.local> wrote in message
news:bl**********@news-reader3.wanadoo.fr...
Yes! Why?

And me (French), I add:
# -*- coding: cp1252 -*-
at the beginning of my scripts. But what if I want to write a script for two, three, etc. languages?

or

How can I have ASCII AND non-ASCII characters in scripts?
* sorry for my bad English *
Use UTF-8. That's what it's there for.

Remember that the actual Python program has to be in
ASCII - only the text in string literals can be in different
character sets. I'm not sure about comments.
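As an aside, the detection rules PEP 263 describes are exposed in modern Python 3 as `tokenize.detect_encoding`, which makes it easy to check how a given source file will be read; a small sketch (the sample sources are my own):

```python
import io
import tokenize

def source_encoding(data):
    """Return the encoding Python 3 would use for this source (bytes)."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding

# A PEP 263 coding line wins:
enc_coding_line = source_encoding(b"# -*- coding: cp1252 -*-\nx = 1\n")
# A UTF-8 BOM works too, with no coding line at all:
enc_bom = source_encoding(b"\xef\xbb\xbfx = 1\n")
# With neither, the default applies (UTF-8 in Python 3; ASCII back in 2.3):
enc_default = source_encoding(b"x = 1\n")
```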

John Roth


@-salutations
--
Michel Claveau

Jul 18 '05 #3

John Roth wrote:
Remember that the actual Python program has to be in ASCII -
only the text in string literals can be in different character
sets. I'm not sure about comments.
Python barfs if there are non-ASCII characters in comments and there
is no coding-line.

Still beats me why it doesn't use the sys.getdefaultencoding() instead
of inconveniencing me.
// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #4

Klaus Alexander Seistrup <sp**@magnetic-ink.dk> wrote in news:3f79171c-
d4**********************************@news.szn.dk:

Still beats me why it doesn't use the sys.getdefaultencoding() instead
of inconveniencing me.


I think the reasoning was that you might give your scripts to someone else
who has a different default encoding and it would then fail obscurely. A
script should be portable, and that means it can't depend on things like
the default encoding.

i.e., it's an attempt to satisfy both of these:
Explicit is better than implicit.
Errors should never pass silently.

--
Duncan Booth du****@rcp.co.uk
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?
Jul 18 '05 #5

Klaus Alexander Seistrup <ra**********@myorange.dk> writes:
And while I understand the problem, I cannot fathom why Python
doesn't simply rely on the encoding I have specified in site.py,
which then calls sys.setdefaultencoding().
There are several reasons. Procedurally, there was no suggestion to do
so while the PEP was being discussed, and it was posted both to
comp.lang.python and python-dev several times. At the time, the most
common comment was that Python should just reject any user-defined
encoding, and declare that source code files are always UTF-8,
period. The PEP gives some more flexibility over that position.

Methodically, requiring the encoding to be declared in the source code
is a good thing, as it allows code to be moved across systems,
which would not be that easy if the source encoding were part of the
Python installation. Explicit is better than implicit.
Would anyone care to explain why the community has chosen to
inconvenience the user for each python script with non-ASCII
characters, rather than using the default encoding given in the
site.py configuration file?


It is not clear to me why you want that. There are several
possible rationales:

1. You have a problem with existing code, and you are annoyed
by the warning. Just silence the warning in site.py.

2. You are writing new code, and you are annoyed by the encoding
declaration. Just save your code as UTF-8, using the UTF-8 BOM.

In neither case, relying on the system default encoding is necessary.
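For reference, the BOM mentioned here is just three bytes, exposed as a constant in the standard codecs module; a minimal in-memory sketch of the round trip:

```python
import codecs

# The UTF-8 signature the tokenizer recognizes is EF BB BF:
bom = codecs.BOM_UTF8

# Prefix it when saving a script as UTF-8 without a coding line:
source = u"print('æøå')\n"
data = bom + source.encode("utf-8")

# The 'utf-8-sig' codec strips the signature again when reading:
roundtrip = data.decode("utf-8-sig")
```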

Regards,
Martin
Jul 18 '05 #6

Klaus Alexander Seistrup <sp**@magnetic-ink.dk> writes:
Python barfs if there are non-ASCII characters in comments and there
is no coding-line.
Not necessarily. A UTF-8 BOM would do just as well.
Still beats me why it doesn't use the sys.getdefaultencoding() instead
of inconveniencing me.


See my other message.

Regards,
Martin
Jul 18 '05 #7

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:m3************@mira.informatik.hu-berlin.de...
Klaus Alexander Seistrup <ra**********@myorange.dk> writes:
2. You are writing new code, and you are annoyed by the encoding
declaration. Just save your code as UTF-8, using the UTF-8 BOM.
The problem with the UTF-8 BOM is that it precludes using the #! header
line under Linux/Unix. Otherwise, it's a great solution.

John Roth

Regards,
Martin

Jul 18 '05 #8

"John Roth" <ne********@jhrothjr.com> writes:
2. You are writing new code, and you are annoyed by the encoding
declaration. Just save your code as UTF-8, using the UTF-8 BOM.


The problem with the UTF-8 BOM is that it precludes using the #! header
line under Linux/Unix. Otherwise, it's a great solution.


Indeed, in an executable script, you would use the encoding
declaration - or you would restrict yourself to ASCII only in the
script file (which might, as its only action, invoke a function from a
library, in which case there really isn't much need for non-ASCII
characters).

OTOH, I do hope that Unix, some day, recognizes UTF-8-BOM-#-! as an
executable file.

Regards,
Martin

Jul 18 '05 #9

Martin v. Löwis wrote:
Just save your code as UTF-8, using the UTF-8 BOM.
Please, could you explain what you mean by "the UTF-8 BOM"?
In neither case, relying on the system default encoding is
necessary.
I'd still prefer Python to rely on the system default encoding.
I can't see why it's there if Python ignores it.
// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #10

Klaus Alexander Seistrup wrote:
Please, could you explain what you mean by "the UTF-8 BOM"?


Byte order mark. It's a clever gimmick Unicode uses, where a few
valid Unicode characters are set aside for being used in sequence to
help determine whether an encoded Unicode stream is little-endian or
big-endian.
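The idea can be illustrated in a few lines of Python; the byte sequences below are the standard Unicode signatures, and the helper name is my own invention:

```python
# Longest signatures first, since FF FE is also a prefix of the
# UTF-32 little-endian BOM (FF FE 00 00).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```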

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
__ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/ \ People say that life is the thing, but I prefer reading.
\__/ Logan Pearsall Smith
Jul 18 '05 #11

Duncan Booth skrev:
Still beats me why it doesn't use the sys.getdefaultencoding()
instead of inconveniencing me.
I think the reasoning was that you might give your scripts to
someone else who has a different default encoding and it would
then fail obscurely.


You're probably right that's part of the reason.
// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #12

Martin v. Löwis skrev:
I cannot fathom why Python doesn't simply rely on the encoding
I have specified in site.py, which then calls setdefaultencoding().
There are several reasons. Procedurally, there was no suggestion
to do so while the PEP was being discussed, and it was posted both
to comp.lang.python and python-dev several times.


It's a pity I didn't read c.l.python at that time, or I would have
protested.
It is not clear to me why you want that. There are several
possible rationales:

1. You have a problem with existing code, and you are annoyed
by the warning.
Yes, I literally have hundreds of scripts with non-ASCII characters
in them - even if it's just in the comments. Scripts that ran
silently, e.g. from crond. Now I have to manually correct each and
every script, even if I have stated in site.py that the default
encoding is iso-8859-1.
Just silence the warning in site.py.
Where and how in site.py can I do that?
2. You are writing new code, and you are annoyed by the encoding
declaration. Just save your code as UTF-8, using the UTF-8 BOM.
Would that help? I put a BOM in a python script just to test your
suggestion, and I got an unrelated exception: "SyntaxError: EOL
while scanning single-quoted string". That's even worse than a
DeprecationWarning.
In neither case, relying on the system default encoding is necessary.
I'd prefer Python to rely on the system default encoding unless I
have explicitly stated that the script is written using another
encoding.
// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #13

Erik Max Francis skrev:
Please, could you explain what you mean by "the UTF-8 BOM"?
Byte order marker. It's a clever gimmick Unicode uses, where a few
valid Unicode characters are set aside for being used in sequence to
help determine whether an encoded Unicode stream is little-endian or
big-endian.


Thanks, I also found a reference on unicode.org¹ that was useful.
// Klaus

¹) <http://www.unicode.org/unicode/faq/utf_bom.html>
--<> unselfish actions pay back better

Jul 18 '05 #14

Klaus Alexander Seistrup wrote:
...
Just silence the warning in site.py.


Where and how in site.py can I do that?


Module warnings is well worth studying. You can insert (just about
anywhere you prefer in your site.py or sitecustomize.py) the lines:

import warnings
warnings.filterwarnings('ignore', 'Non-ASCII character .*/peps/pep-0263',
                        DeprecationWarning)

this tells Python to ignore all warning messages of class DeprecationWarning
which match (case-insensitively) the regular expression given as the second
parameter of warnings.filterwarnings (choose the RE you prefer, of course --
here, I'm asking that the warning message to be ignored start with
"Non-ASCII character " and contain "/peps/pep-0263" anywhere afterwards,
but you may easily choose to be either more or less permissive than this).
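The effect of such a filter can be checked in isolation with `warnings.catch_warnings`; a sketch (the first warning text mimics the message from the original post):

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")       # record everything by default...
    warnings.filterwarnings(              # ...except the PEP 263 nag
        "ignore",
        "Non-ASCII character .*/peps/pep-0263",
        DeprecationWarning,
    )
    warnings.warn(
        "Non-ASCII character '\\xe6' in file mumble.py on line 2, "
        "but no encoding declared; see "
        "http://www.python.org/peps/pep-0263.html for details",
        DeprecationWarning,
    )
    warnings.warn("something else is deprecated", DeprecationWarning)

# Only the second warning survives the filter:
messages = [str(w.message) for w in caught]
```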
Alex

Jul 18 '05 #15

Alex Martelli wrote:
Module warnings is well worth studying. [...]

import warnings
warnings.filterwarnings('ignore', 'Non-ASCII character .*/peps/pep-0263',
                        DeprecationWarning)
Whauw, neat feature! Thanks a lot for the example.
// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #16

On Wed, 1 Oct 2003 11:58:22 +0000 (UTC), Klaus Alexander Seistrup <sp**@magnetic-ink.dk> wrote:
Erik Max Francis skrev:
Please, could you explain what you mean by "the UTF-8 BOM"?


Byte order marker. It's a clever gimmick Unicode uses, where a few
valid Unicode characters are set aside for being used in sequence to
help determine whether an encoded Unicode stream is little-endian or
big-endian.


Thanks, I also found a reference on unicode.org¹ that was useful.
// Klaus

¹) <http://www.unicode.org/unicode/faq/utf_bom.html>


A table of BOMs appears:

00 00 FE FF UTF-32, big-endian
FF FE 00 00 UTF-32, little-endian
FE FF UTF-16, big-endian
FF FE UTF-16, little-endian
EF BB BF UTF-8

but I'm not sure I trust everything on that page. E.g., at the bottom it says,

"Last updated: - Tuesday, December 09, 1902 16:15:05" ;-)

There appear to be a number of other typos as well, and some mysterious semantics, e.g., in

"""
Q: Can you summarize how I should deal with BOMs?

A: Here are some guidelines to follow:

1. A particular protocol (e.g. Microsoft conventions for
.txt files) may require use of the BOM on certain Unicode
data streams, such as files. When you need to conform to
such a protocol, use a BOM.

2. Some protocols allow optional BOMs in the case of
untagged text. In those cases,

o Where a text data stream is known to be plain text,
but of unknown encoding, BOM can be used as a
signature. If there is no BOM, the encoding could be
anything.

o Where a text data stream is known to be plain
Unicode text (but not which endian), then BOM can be
used as a signature. If there is no BOM, the text
should be interpreted as big-endian.

3. Where the precise type of the data stream is known (e.g.
Unicode big-endian or Unicode little-endian), the BOM should
not be used. [MD]
"""

(3) sounds a little funny, though I think I know what it's trying
to say.

I don't understand (2), unless it's just saying you can make up
ad hoc markup using BOMs to indicate a binary packing scheme totally
orthogonally to what the packed bits might mean as an encoded data stream.

BOMs have always suggested Unicode to me, so this was a liberating notion,
intended or not ;-) In which case, why not UTF-xxz BOMs for zlib zip-format
packing, etc., where xx could be the usual, e.g., UTF-16lez or UTF-8z. I'd
bet the latter could save some bandwidth and disk space on some non-english
web sites, if browsers supported it for UTF-8 unicode.

Actually, is there a standard for overall compressed HTML transfer already?
Or is it ignored in favor of letting lower levels do compression?
Haven't looked lately...

Regards,
Bengt Richter
Jul 18 '05 #17

On 30 Sep 2003 23:40:53 +0200, ma****@v.loewis.de (Martin v. Löwis) wrote:
Klaus Alexander Seistrup <ra**********@myorange.dk> writes:
And while I understand the problem, I cannot fathom why Python
doesn't simply rely on the encoding I have specified in site.py,
which then calls sys.setdefaultencoding().
There are several reasons. Procedurally, there was no suggestion to do
so while the PEP was being discussed, and it was posted both to
comp.lang.python and python-dev several times. At the time, the most
common comment was that Python should just reject any user-defined
encoding, and declare that source code files are always UTF-8,
period. The PEP gives some more flexibility over that position.

Methodically, requiring the encoding to be declared in the source code
is a good thing, as it allows code to be moved across systems,
which would not be that easy if the source encoding were part of the
Python installation. Explicit is better than implicit.

How about letting someone in Klaus' situation be explicit in another way? E.g.,

python -e iso-8859-1 the_unmarked_source.py

Hm, I guess to be consistent you would have to have some way to pass -e info
into any text-file-opening context, e.g., import, execfile, file, open, etc.
In such case, you'd want a default. Maybe it could come from site.py with
override by python -e, again with override by actual individual file-embedded
encoding info.

But being the encoding guru, you have probably explored all those thoughts and
decided what you decided ;-) Care to comment a little on the above, though?
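For what it's worth, the fallback logic proposed above might look something like this — purely hypothetical, since no `-e` option exists; the regex is a simplified version of the PEP 263 pattern (the real rule also requires the declaration to sit in a comment):

```python
import re

# PEP 263 says the coding declaration must appear on line 1 or 2.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def decode_source(data, default="iso-8859-1"):
    """Decode source bytes, preferring an embedded coding line over a
    caller-supplied default (standing in for a hypothetical -e flag)."""
    for line in data.splitlines(True)[:2]:
        match = CODING_RE.search(line)
        if match:
            return data.decode(match.group(1).decode("ascii"))
    return data.decode(default)
```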
Would anyone care to explain why the community has chosen to
inconvenience the user for each python script with non-ASCII
characters, rather than using the default encoding given in the
site.py configuration file?


It is not clear to me why you want that. There are several
possible rationales:

1. You have a problem with existing code, and you are annoyed
by the warning. Just silence the warning in site.py.

That's not the same as giving a proper encoding interpretation, is it?
(Though in comments it wouldn't matter much).

2. You are writing new code, and you are annoyed by the encoding
declaration. Just save your code as UTF-8, using the UTF-8 BOM.

YMMV with the editor you are using though, right?

In neither case, relying on the system default encoding is necessary.

Hm, for python -e to work, I guess all the library stuff would have to
be explicitly marked, or the effect would have to be deferred.
Hm2, is all internal text representation going to wind up wchar at some point?

Regards,
Bengt Richter
Jul 18 '05 #18

bo**@oz.net (Bengt Richter) writes:
How about letting someone in Klaus' situation be explicit in another
way? E.g., python -e iso-8859-1 the_unmarked_source.py
What would the exact meaning of this command line option be?
Hm, I guess to be consistent you would have to have some way to pass
-e info into any text-file-opening context, e.g., import, execfile,
file, open, etc.
Ah, so it should probably apply only to the file passed to Python on
the command line - some people might think it would apply to all
files, though.
In such case, you'd want a default. Maybe it could come from site.py
with override by python -e, again with override by actual individual
file-embedded encoding info.
This shows the problem of this approach: Now it becomes hidden in
site.py, and, as soon as you move the code to a different machine, the
problems come back.
1. You have a problem with existing code, and you are annoyed
by the warning. Just silence the warning in site.py.

That's not the same as giving a proper encoding interpretation, is it?
(Though in comments it wouldn't matter much).


No. However, it would restore the meaning that the code has in 2.2:
For comments and byte string literals, it would be the "as-is"
encoding; for Unicode literals, the interpretation would be latin-1.
2. You are writing new code, and you are annoyed by the encoding
declaration. Just save your code as UTF-8, using the UTF-8 BOM.

YMMV with the editor you are using though, right?


Somewhat, yes. However, I expect that most editors which
specifically support Python also support PEP-263, sooner or later.
Hm2, is all internal text representation going to wind up wchar at
some point?


It appears that string literals will continue to denote byte strings
for quite some time. There is the -U option, so you can try yourself
to see the effects of string literals denoting Unicode objects.

Clearly, a byte string type has to stay in the language. Not as
clearly, there might be a need for byte string literals. A PEP to this
effect was just withdrawn.

Regards,
Martin
Jul 18 '05 #19
