Bytes | Software Development & Data Engineering Community

PEP-0263 and default encoding

Hi,

After upgrading my Python interpreter to 2.3.1 I constantly get
warnings like this:

DeprecationWarning: Non-ASCII character '\xe6' in file
mumble.py on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details

And while I understand the problem, I cannot fathom why Python
doesn't simply rely on the encoding I have specified in site.py,
which then calls sys.setdefaultencoding().

Would anyone care to explain why the community has chosen to
inconvenience the user for each python script with non-ASCII
characters, rather than using the default encoding given in the
site.py configuration file?

Cheers,

// Klaus

--
<> o mordo tua nuora, o aro un autodromo

Jul 18 '05 #1
Yes! Why?

And I (French) add:

# -*- coding: cp1252 -*-

at the beginning of my scripts. But what if I want to write a script
for two, three, etc. languages?

Or: how can I have ASCII AND non-ASCII characters in the same script?

(Sorry for my bad English.)
@-salutations
--
Michel Claveau


Jul 18 '05 #2

"News M Claveau /Hamster-P" <es****@mci.local> wrote in message
news:bl**********@news-reader3.wanadoo.fr...

> Yes! Why?
>
> And I (French) add:
>
> # -*- coding: cp1252 -*-
>
> at the beginning of my scripts. But what if I want to write a script
> for two, three, etc. languages?
>
> Or: how can I have ASCII AND non-ASCII characters in the same script?
>
> (Sorry for my bad English.)
Use UTF-8. That's what it's there for.

Remember that the actual Python program has to be in
ASCII - only the text in string literals can be in different
character sets. I'm not sure about comments.
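To illustrate the advice above, here is a minimal sketch (in modern Python syntax; the words chosen are arbitrary examples): a single UTF-8 source file can carry literals from several languages at once, which a regional codec such as cp1252 cannot always do.

```python
# -*- coding: utf-8 -*-
# Sketch: one UTF-8 source file holding literals from two languages.
french = "déjà vu"
danish = "smørrebrød"

# Both literals round-trip through UTF-8 without loss:
for text in (french, danish):
    assert text.encode("utf-8").decode("utf-8") == text

print(french, danish)
```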

John Roth


@-salutations
--
Michel Claveau

Jul 18 '05 #3
John Roth wrote:

> Remember that the actual Python program has to be in ASCII -
> only the text in string literals can be in different character
> sets. I'm not sure about comments.

Python barfs if there are non-ASCII characters in comments and there
is no coding line.

Still beats me why it doesn't use sys.getdefaultencoding() instead
of inconveniencing me.

// Klaus

--
<> unselfish actions pay back better

Jul 18 '05 #4
Klaus Alexander Seistrup <sp**@magnetic-ink.dk> wrote in news:3f79171c-
d4**********************************@news.szn.dk:

> Still beats me why it doesn't use sys.getdefaultencoding() instead
> of inconveniencing me.


I think the reasoning was that you might give your scripts to someone else
who has a different default encoding and it would then fail obscurely. A
script should be portable, and that means it can't depend on things like
the default encoding.

i.e., it's an attempt to satisfy both of these:
Explicit is better than implicit.
Errors should never pass silently.

--
Duncan Booth du****@rcp.co.uk
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?
Jul 18 '05 #5
Klaus Alexander Seistrup <ra**********@myorange.dk> writes:

> And while I understand the problem, I cannot fathom why Python
> doesn't simply rely on the encoding I have specified in site.py,
> which then calls sys.setdefaultencoding().

There are several reasons. Procedurally, there was no suggestion to do
so while the PEP was being discussed, and it was posted both to
comp.lang.python and python-dev several times. At the time, the most
common comment was that Python should just reject any user-defined
encoding, and declare that source code files are always UTF-8,
period. The PEP gives some more flexibility over that position.

Methodically, requiring the encoding to be declared in the source code
is a good thing, as it allows code to be moved across systems,
which would not be that easy if the source encoding were part of the
Python installation. Explicit is better than implicit.
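The lookup the PEP describes can be sketched in a few lines: the interpreter searches the first two lines of the file for a magic comment. (The regular expression below is a simplified form of the one given in PEP 263, shown here in modern Python as an illustration, not the interpreter's actual implementation.)

```python
import re

# Simplified version of the PEP 263 declaration pattern.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def find_coding(lines):
    """Return the declared source encoding, or None if absent."""
    # Only the first or second line of the file may carry it.
    for line in lines[:2]:
        match = CODING_RE.search(line)
        if match:
            return match.group(1)
    return None

assert find_coding(["#!/usr/bin/env python",
                    "# -*- coding: iso-8859-1 -*-"]) == "iso-8859-1"
assert find_coding(["# an all-ASCII script"]) is None
```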
> Would anyone care to explain why the community has chosen to
> inconvenience the user for each python script with non-ASCII
> characters, rather than using the default encoding given in the
> site.py configuration file?


It is not clear to me why you want that. There are several
possible rationales:

1. You have problems with existing code, and you are annoyed
by the warning. Just silence the warning in site.py.

2. You are writing new code, and you are annoyed by the encoding
declaration. Just save your code as UTF-8, using the UTF-8 BOM.

In neither case is relying on the system default encoding necessary.

Regards,
Martin
Jul 18 '05 #6
Klaus Alexander Seistrup <sp**@magnetic-ink.dk> writes:

> Python barfs if there are non-ASCII characters in comments and there
> is no coding line.

Not necessarily. A UTF-8 BOM would do just as well.

> Still beats me why it doesn't use sys.getdefaultencoding() instead
> of inconveniencing me.

See my other message.

Regards,
Martin
Jul 18 '05 #7

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:m3************@mira.informatik.hu-berlin.de...
> Klaus Alexander Seistrup <ra**********@myorange.dk> writes:
>
> 2. You are writing new code, and you are annoyed by the encoding
> declaration. Just save your code as UTF-8, using the UTF-8 BOM.

The problem with the UTF-8 BOM is that it precludes using the #! header
line under Linux/Unix. Otherwise, it's a great solution.

John Roth

Regards,
Martin

Jul 18 '05 #8
"John Roth" <ne********@jhrothjr.com> writes:

> > 2. You are writing new code, and you are annoyed by the encoding
> > declaration. Just save your code as UTF-8, using the UTF-8 BOM.
>
> The problem with the UTF-8 BOM is that it precludes using the #! header
> line under Linux/Unix. Otherwise, it's a great solution.

Indeed, in an executable script, you would use the encoding
declaration - or you would restrict yourself to ASCII only in the
script file (which might, as its only action, invoke a function from a
library, in which case there really isn't much need for non-ASCII
characters).

OTOH, I do hope that Unix, some day, recognizes UTF-8-BOM-#-! as an
executable file.

Regards,
Martin

Jul 18 '05 #9
Martin v. Löwis wrote:

> Just save your code as UTF-8, using the UTF-8 BOM.

Please, could you explain what you mean by "the UTF-8 BOM"?

> In neither case is relying on the system default encoding
> necessary.

I'd still prefer Python to rely on the system default encoding.
I can't see why it's there if Python ignores it.

// Klaus

--
<> unselfish actions pay back better

Jul 18 '05 #10
Klaus Alexander Seistrup wrote:

> Please, could you explain what you mean by "the UTF-8 BOM"?

Byte order mark. It's a clever gimmick Unicode uses: a special
character (U+FEFF) is set aside so that, when it appears at the
start of an encoded stream, the order of its bytes reveals whether
the stream is little-endian or big-endian.
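Concretely, the byte sequences involved are exposed as constants in the standard library's codecs module; a short sketch (modern Python shown; the 'utf-8-sig' codec appeared in later Python versions, after this thread):

```python
import codecs

# The BOM character U+FEFF serialized under different encodings:
assert codecs.BOM_UTF16_BE == b"\xfe\xff"  # big-endian marker
assert codecs.BOM_UTF16_LE == b"\xff\xfe"  # little-endian marker
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"  # UTF-8 has no byte order,
                                           # but keeps a signature

# The 'utf-8-sig' codec writes and strips the signature transparently:
data = "hello".encode("utf-8-sig")
assert data.startswith(codecs.BOM_UTF8)
assert data.decode("utf-8-sig") == "hello"
```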

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
__ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/ \ People say that life is the thing, but I prefer reading.
\__/ Logan Pearsall Smith
Jul 18 '05 #11
Duncan Booth skrev:

> > Still beats me why it doesn't use sys.getdefaultencoding()
> > instead of inconveniencing me.
>
> I think the reasoning was that you might give your scripts to
> someone else who has a different default encoding and it would
> then fail obscurely.

You're probably right; that's part of the reason.

// Klaus

--
<> unselfish actions pay back better

Jul 18 '05 #12
Martin v. Löwis skrev:

> > I cannot fathom why Python doesn't simply rely on the encoding
> > I have specified in site.py, which then calls setdefaultencoding().
>
> There are several reasons. Procedurally, there was no suggestion
> to do so while the PEP was being discussed, and it was posted both
> to comp.lang.python and python-dev several times.

It's a pity I didn't read c.l.python at that time, or I would have
protested.

> It is not clear to me why you want that. There are several
> possible rationales:
>
> 1. You have problems with existing code, and you are annoyed
> by the warning.

Yes, I literally have hundreds of scripts with non-ASCII characters
in them - even if it's just in the comments. Scripts that ran
silently, e.g. from crond. Now I have to correct each and every
script manually, even though I have stated in site.py that the
default encoding is iso-8859-1.

> Just silence the warning in site.py.

Where and how in site.py can I do that?

> 2. You are writing new code, and you are annoyed by the encoding
> declaration. Just save your code as UTF-8, using the UTF-8 BOM.

Would that help? I put a BOM in a Python script just to test your
suggestion, and I got an unrelated exception: "SyntaxError: EOL
while scanning single-quoted string". That's even worse than a
DeprecationWarning.

> In neither case is relying on the system default encoding necessary.

I'd prefer Python to rely on the system default encoding unless I
have explicitly stated that the script is written using another
encoding.

// Klaus

--
<> unselfish actions pay back better

Jul 18 '05 #13
Erik Max Francis skrev:

> > Please, could you explain what you mean by "the UTF-8 BOM"?
>
> Byte order marker. It's a clever gimmick Unicode uses, where a few
> valid Unicode characters are set aside for being used in sequence to
> help determine whether an encoded Unicode stream is little-endian or
> big-endian.

Thanks, I also found a reference on unicode.org¹ that was useful.

// Klaus

¹) <http://www.unicode.org/unicode/faq/utf_bom.html>
--
<> unselfish actions pay back better

Jul 18 '05 #14
Klaus Alexander Seistrup wrote:

> > Just silence the warning in site.py.
>
> Where and how in site.py can I do that?

Module warnings is well worth studying. You can insert (just about
anywhere you prefer in your site.py or sitecustomize.py) the lines:

import warnings
warnings.filterwarnings('ignore', 'Non-ASCII character .*/peps/pep-0263',
                        DeprecationWarning)

This tells Python to ignore all warning messages of class
DeprecationWarning whose text matches (case-insensitively) the regular
expression given as the second parameter of warnings.filterwarnings.
Choose the RE you prefer, of course -- here, I'm asking that the
ignored warning messages start with "Non-ASCII character " and contain
"/peps/pep-0263" anywhere afterwards, but you may easily choose to be
either more or less permissive than this.

Alex
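The filtering behaviour Alex describes can be checked with a short, self-contained sketch (modern Python; the first warning text mimics the deprecation message from the original post):

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Filters added later take precedence; this one matches the
    # start of the message, case-insensitively.
    warnings.filterwarnings('ignore',
                            'Non-ASCII character .*/peps/pep-0263',
                            DeprecationWarning)
    warnings.warn("Non-ASCII character '\\xe6' in file mumble.py on "
                  "line 2; see http://www.python.org/peps/pep-0263.html",
                  DeprecationWarning)
    warnings.warn("some other deprecation", DeprecationWarning)

# Only the warning that did not match the filter was recorded:
assert len(caught) == 1
assert "other" in str(caught[0].message)
```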

Jul 18 '05 #15
Alex Martelli wrote:

> Module warnings is well worth studying. [...]
>
> import warnings
> warnings.filterwarnings('ignore', 'Non-ASCII character .*/peps/pep-0263',
>                         DeprecationWarning)

Wow, neat feature! Thanks a lot for the example.

// Klaus

--
<> unselfish actions pay back better

Jul 18 '05 #16
On Wed, 1 Oct 2003 11:58:22 +0000 (UTC), Klaus Alexander Seistrup <sp**@magnetic-ink.dk> wrote:

> Erik Max Francis skrev:
>
> > > Please, could you explain what you mean by "the UTF-8 BOM"?
> >
> > Byte order marker. It's a clever gimmick Unicode uses, where a few
> > valid Unicode characters are set aside for being used in sequence to
> > help determine whether an encoded Unicode stream is little-endian or
> > big-endian.
>
> Thanks, I also found a reference on unicode.org¹ that was useful.
>
> // Klaus
>
> ¹) <http://www.unicode.org/unicode/faq/utf_bom.html>

A table of BOMs appears:

00 00 FE FF UTF-32, big-endian
FF FE 00 00 UTF-32, little-endian
FE FF UTF-16, big-endian
FF FE UTF-16, little-endian
EF BB BF UTF-8

but I'm not sure I trust everything on that page. E.g., at the bottom it says,

"Last updated: - Tuesday, December 09, 1902 16:15:05" ;-)

There appear to be a number of other typos as well, and some mysterious semantics, e.g., in

"""
Q: Can you summarize how I should deal with BOMs?

A: Here are some guidelines to follow:

1. A particular protocol (e.g. Microsoft conventions for
.txt files) may require use of the BOM on certain Unicode
data streams, such as files. When you need to conform to
such a protocol, use a BOM.

2. Some protocols allow optional BOMs in the case of
untagged text. In those cases,

o Where a text data stream is known to be plain text,
but of unknown encoding, BOM can be used as a
signature. If there is no BOM, the encoding could be
anything.

o Where a text data stream is known to be plain
Unicode text (but not which endian), then BOM can be
used as a signature. If there is no BOM, the text
should be interpreted as big-endian.

3. Where the precise type of the data stream is known (e.g.
Unicode big-endian or Unicode little-endian), the BOM should
not be used. [MD]
"""

(3) sounds a little funny, though I think I know what it's trying
to say.

I don't understand (2), unless it's just saying you can make up
ad hoc markup using BOMs to indicate a binary packing scheme totally
orthogonally to what the packed bits might mean as an encoded data stream.

BOMs have always suggested Unicode to me, so this was a liberating notion,
intended or not ;-) In which case, why not UTF-xxz BOMs for zlib zip-format
packing, etc., where xx could be the usual, e.g., UTF-16LEz or UTF-8z? I'd
bet the latter could save some bandwidth and disk space on some non-English
web sites, if browsers supported it for UTF-8 Unicode.

Actually, is there a standard for overall compressed HTML transfer already?
Or is it ignored in favor of letting lower levels do compression?
Haven't looked lately...
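As an aside, the BOM table earlier in this post translates directly into a small sniffing sketch (modern Python; ordering matters, since some signatures are prefixes of others):

```python
# Longer signatures must be tried first, or UTF-32-LE data would be
# mistaken for UTF-16-LE (both start with FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf",     "utf-8"),
    (b"\xfe\xff",         "utf-16-be"),
    (b"\xff\xfe",         "utf-16-le"),
]

def sniff_bom(data):
    """Guess an encoding from a leading BOM; None if no BOM present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

assert sniff_bom(b"\xef\xbb\xbfhello") == "utf-8"
assert sniff_bom(b"\xff\xfe\x00\x00data") == "utf-32-le"
assert sniff_bom(b"plain ascii") is None
```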

Regards,
Bengt Richter
Jul 18 '05 #17
On 30 Sep 2003 23:40:53 +0200, ma****@v.loewis.de (Martin v. Löwis) wrote:

> Klaus Alexander Seistrup <ra**********@myorange.dk> writes:
>
> > And while I understand the problem, I cannot fathom why Python
> > doesn't simply rely on the encoding I have specified in site.py,
> > which then calls sys.setdefaultencoding().
>
> There are several reasons. Procedurally, there was no suggestion to do
> so while the PEP was being discussed, and it was posted both to
> comp.lang.python and python-dev several times. At the time, the most
> common comment was that Python should just reject any user-defined
> encoding, and declare that source code files are always UTF-8,
> period. The PEP gives some more flexibility over that position.
>
> Methodically, requiring the encoding to be declared in the source code
> is a good thing, as it allows code to be moved across systems,
> which would not be that easy if the source encoding were part of the
> Python installation. Explicit is better than implicit.

How about letting someone in Klaus' situation be explicit in another way? E.g.,

python -e iso-8859-1 the_unmarked_source.py

Hm, I guess to be consistent you would have to have some way to pass -e info
into any text-file-opening context, e.g., import, execfile, file, open, etc.
In such case, you'd want a default. Maybe it could come from site.py with
override by python -e, again with override by actual individual file-embedded
encoding info.
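No such "-e" option exists; purely as a hypothetical sketch (in modern Python), the wrapper behaviour proposed above might look like this - decode the source with the encoding named on the command line, then compile and execute the result:

```python
import os
import tempfile

def run_with_encoding(path, encoding):
    """Hypothetical '-e': decode the file, then execute it."""
    with open(path, "rb") as f:
        source = f.read().decode(encoding)
    namespace = {"__name__": "__main__"}
    exec(compile(source, path, "exec"), namespace)
    return namespace

# Demonstration with an undeclared iso-8859-1 source file
# ("\u00e6" is the ae character from the original warning):
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as f:
    f.write('word = "gr\u00e6nse"\n'.encode("iso-8859-1"))
try:
    ns = run_with_encoding(f.name, "iso-8859-1")
    assert ns["word"] == "gr\u00e6nse"
finally:
    os.unlink(f.name)
```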

But being the encoding guru, you have probably explored all those thoughts and
decided what you decided ;-) Care to comment a little on the above, though?
> > Would anyone care to explain why the community has chosen to
> > inconvenience the user for each python script with non-ASCII
> > characters, rather than using the default encoding given in the
> > site.py configuration file?
>
> It is not clear to me why you want that. There are several
> possible rationales:
>
> 1. You have problems with existing code, and you are annoyed
> by the warning. Just silence the warning in site.py.

That's not the same as giving a proper encoding interpretation, is it?
(Though in comments it wouldn't matter much.)

> 2. You are writing new code, and you are annoyed by the encoding
> declaration. Just save your code as UTF-8, using the UTF-8 BOM.

YMMV with the editor you are using, though, right?

> In neither case is relying on the system default encoding necessary.

Hm, for python -e to work, I guess all the library stuff would have to
be explicitly marked, or the effect would have to be deferred.
Hm2, is all internal text representation going to wind up wchar at some point?

Regards,
Bengt Richter
Jul 18 '05 #18
bo**@oz.net (Bengt Richter) writes:

> How about letting someone in Klaus' situation be explicit in another
> way? E.g., python -e iso-8859-1 the_unmarked_source.py

What would the exact meaning of this command line option be?

> Hm, I guess to be consistent you would have to have some way to pass
> -e info into any text-file-opening context, e.g., import, execfile,
> file, open, etc.

Ah, so it should probably apply only to the file passed to Python on
the command line - some people might think it would apply to all
files, though.

> In such case, you'd want a default. Maybe it could come from site.py
> with override by python -e, again with override by actual individual
> file-embedded encoding info.
This shows the problem of this approach: Now it becomes hidden in
site.py, and, as soon as you move the code to a different machine, the
problems come back.
> > 1. You have problems with existing code, and you are annoyed
> > by the warning. Just silence the warning in site.py.
>
> That's not the same as giving a proper encoding interpretation, is it?
> (Though in comments it wouldn't matter much.)

No. However, it would restore the meaning that the code had in 2.2:
for comments and byte string literals, it would be the "as-is"
encoding; for Unicode literals, the interpretation would be latin-1.

> > 2. You are writing new code, and you are annoyed by the encoding
> > declaration. Just save your code as UTF-8, using the UTF-8 BOM.
>
> YMMV with the editor you are using, though, right?

Somewhat, yes. However, I expect that most editors which
specifically support Python also support PEP 263, sooner or later.

> Hm2, is all internal text representation going to wind up wchar at
> some point?


It appears that string literals will continue to denote byte strings
for quite some time. There is the -U option, so you can try yourself
to see the effects of string literals denoting Unicode objects.

Clearly, a byte string type has to stay in the language. Not as
clearly, there might be a need for byte string literals. A PEP to this
effect was just withdrawn.

Regards,
Martin
Jul 18 '05 #19
