Bytes IT Community

PEP-0263 and default encoding

Hi,

After upgrading my Python interpreter to 2.3.1 I constantly get
warnings like this:

DeprecationWarning: Non-ASCII character '\xe6' in file
mumble.py on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details

And while I understand the problem, I cannot fathom why Python
doesn't simply rely on the encoding I have specified in site.py,
which then calls sys.setdefaultencoding().

Would anyone care to explain why the community has chosen to
inconvenience the user for each python script with non-ASCII
characters, rather than using the default encoding given in the
site.py configuration file?

Cheers,

// Klaus

--
<> o mordo tua nuora, o aro un autodromo

Jul 18 '05 #1
18 Replies


Yes! Why?

And me (French), I add:
# -*- coding: cp1252 -*-
at the beginning of my scripts. But what if I want to write a script for
two, three, etc. languages?

or

How can I have ASCII AND non-ASCII characters in scripts?
* sorry for my bad English *
@-salutations
--
Michel Claveau


Jul 18 '05 #2

"News M Claveau /Hamster-P" <es****@mci.local> wrote in message
news:bl**********@news-reader3.wanadoo.fr...
Yes! Why?

And me (French), I add:
# -*- coding: cp1252 -*-
at the beginning of my scripts. But what if I want to write a script for two, three, etc. languages?

or

How can I have ASCII AND non-ASCII characters in scripts?
* sorry for my bad English *
Use UTF-8. That's what it's there for.

Remember that the actual Python program has to be in
ASCII - only the text in string literals can be in different
character sets. I'm not sure about comments.
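As an aside, the detection rules PEP 263 describes are exposed in modern Python 3 as `tokenize.detect_encoding`, which makes it easy to check how a given source file will be read; a small sketch (the sample sources are my own):

```python
import io
import tokenize

def source_encoding(data):
    """Return the encoding Python 3 would use for this source (bytes)."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding

# A PEP 263 coding line wins:
enc_coding_line = source_encoding(b"# -*- coding: cp1252 -*-\nx = 1\n")
# A UTF-8 BOM works too, with no coding line at all:
enc_bom = source_encoding(b"\xef\xbb\xbfx = 1\n")
# With neither, the default applies (UTF-8 in Python 3; ASCII back in 2.3):
enc_default = source_encoding(b"x = 1\n")
```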

John Roth


@-salutations
--
Michel Claveau

Jul 18 '05 #3

John Roth wrote:
Remember that the actual Python program has to be in ASCII -
only the text in string literals can be in different character
sets. I'm not sure about comments.
Python barfs if there are non-ASCII characters in comments and there
is no coding-line.

Still beats me why it doesn't use the sys.getdefaultencoding() instead
of inconveniencing me.
// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #4

Klaus Alexander Seistrup <sp**@magnetic-ink.dk> wrote in news:3f79171c-
d4**********************************@news.szn.dk:

Still beats me why it doesn't use the sys.getdefaultencoding() instead
of inconveniencing me.


I think the reasoning was that you might give your scripts to someone else
who has a different default encoding and it would then fail obscurely. A
script should be portable, and that means it can't depend on things like
the default encoding.

i.e., it's an attempt to satisfy both of these:
Explicit is better than implicit.
Errors should never pass silently.

--
Duncan Booth du****@rcp.co.uk
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?
Jul 18 '05 #5

Klaus Alexander Seistrup <ra**********@myorange.dk> writes:
And while I understand the problem, I cannot fathom why Python
doesn't simply rely on the encoding I have specified in site.py,
which then calls sys.setdefaultencoding().
There are several reasons. Procedurally, there was no suggestion to do
so while the PEP was being discussed, and it was posted both to
comp.lang.python and python-dev several times. At the time, the most
common comment was that Python should just reject any user-defined
encoding, and declare that source code files are always UTF-8,
period. The PEP gives some more flexibility over that position.

Methodically, requiring the encoding to be declared in the source code
is a good thing, as it allows code to be moved across systems,
which would not be that easy if the source encoding were part of the
Python installation. Explicit is better than implicit.
Would anyone care to explain why the community has chosen to
inconvenience the user for each python script with non-ASCII
characters, rather than using the default encoding given in the
site.py configuration file?


It is not clear to me why you want that. There are several
possible rationales:

1. You have a problem with existing code, and you are annoyed
by the warning. Just silence the warning in site.py.

2. You are writing new code, and you are annoyed by the encoding
declaration. Just save your code as UTF-8, using the UTF-8 BOM.

In neither case, relying on the system default encoding is necessary.
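For reference, the BOM mentioned here is just three bytes, exposed as a constant in the standard codecs module; a minimal in-memory sketch of the round trip:

```python
import codecs

# The UTF-8 signature the tokenizer recognizes is EF BB BF:
bom = codecs.BOM_UTF8

# Prefix it when saving a script as UTF-8 without a coding line:
source = u"print('æøå')\n"
data = bom + source.encode("utf-8")

# The 'utf-8-sig' codec strips the signature again when reading:
roundtrip = data.decode("utf-8-sig")
```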

Regards,
Martin
Jul 18 '05 #6

Klaus Alexander Seistrup <sp**@magnetic-ink.dk> writes:
Python barfs if there are non-ASCII characters in comments and there
is no coding-line.
Not necessarily. A UTF-8 BOM would do just as well.
Still beats me why it doesn't use the sys.getdefaultencoding() instead
of inconveniencing me.


See my other message.

Regards,
Martin
Jul 18 '05 #7

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:m3************@mira.informatik.hu-berlin.de...
Klaus Alexander Seistrup <ra**********@myorange.dk> writes:
2. You are writing new code, and you are annoyed by the encoding
declaration. Just save your code as UTF-8, using the UTF-8 BOM.
The problem with the UTF-8 BOM is that it precludes using the #! header
line under Linux/Unix. Otherwise, it's a great solution.

John Roth

Regards,
Martin

Jul 18 '05 #8

"John Roth" <ne********@jhrothjr.com> writes:
2. You are writing new code, and you are annoyed by the encoding
declaration. Just save your code as UTF-8, using the UTF-8 BOM.


The problem with the UTF-8 BOM is that it precludes using the #! header
line under Linux/Unix. Otherwise, it's a great solution.


Indeed, in an executable script, you would use the encoding
declaration - or you would restrict yourself to ASCII only in the
script file (which might, as its only action, invoke a function from a
library, in which case there really isn't much need for non-ASCII
characters).

OTOH, I do hope that Unix, some day, recognizes UTF-8-BOM-#-! as an
executable file.

Regards,
Martin

Jul 18 '05 #9

Martin v. Löwis wrote:
Just save your code as UTF-8, using the UTF-8 BOM.
Please, could you explain what you mean by "the UTF-8 BOM"?
In neither case, relying on the system default encoding is
necessary.
I'd still prefer Python to rely on the system default encoding.
I can't see why it's there if Python ignores it.
// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #10

Klaus Alexander Seistrup wrote:
Please, could you explain what you mean by "the UTF-8 BOM"?


Byte order mark. It's a clever gimmick Unicode uses, where a few
valid Unicode characters are set aside for being used in sequence to
help determine whether an encoded Unicode stream is little-endian or
big-endian.
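The idea can be illustrated in a few lines of Python; the byte sequences below are the standard Unicode signatures, and the helper name is my own invention:

```python
# Longest signatures first, since FF FE is also a prefix of the
# UTF-32 little-endian BOM (FF FE 00 00).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```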

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
__ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/ \ People say that life is the thing, but I prefer reading.
\__/ Logan Pearsall Smith
Jul 18 '05 #11

Duncan Booth skrev:
Still beats me why it doesn't use the sys.getdefaultencoding()
instead of inconveniencing me.
I think the reasoning was that you might give your scripts to
someone else who has a different default encoding and it would
then fail obscurely.


You're probably right that's part of the reason.
// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #12

Martin v. Löwis skrev:
I cannot fathom why Python doesn't simply rely on the encoding
I have specified in site.py, which then calls setdefaultencoding().
There are several reasons. Procedurally, there was no suggestion
to do so while the PEP was being discussed, and it was posted both
to comp.lang.python and python-dev several times.


It's a pity I didn't read c.l.python at that time, or I would have
protested.
It is not clear to me why you want that. There are several
possible rationales:

1. You have a problem with existing code, and you are annoyed
by the warning.
Yes, I literally have hundreds of scripts with non-ASCII characters
in them - even if it's just in the comments. Scripts that ran
silently, e.g. from crond. Now I have to manually correct each and
every script, even if I have stated in site.py that the default
encoding is iso-8859-1.
Just silence the warning in site.py.
Where and how in site.py can I do that?
2. You are writing new code, and you are annoyed by the encoding
declaration. Just save your code as UTF-8, using the UTF-8 BOM.
Would that help? I put a BOM in a python script just to test your
suggestion, and I got an unrelated exception: "SyntaxError: EOL
while scanning single-quoted string". That's even worse than a
DeprecationWarning.
In neither case, relying on the system default encoding is necessary.
I'd prefer Python to rely on the system default encoding unless I
have explicitly stated that the script is written using another
encoding.
// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #13

Erik Max Francis skrev:
Please, could you explain what you mean by "the UTF-8 BOM"?
Byte order marker. It's a clever gimmick Unicode uses, where a few
valid Unicode characters are set aside for being used in sequence to
help determine whether an encoded Unicode stream is little-endian or
big-endian.


Thanks, I also found a reference on unicode.org¹ that was useful.
// Klaus

¹) <http://www.unicode.org/unicode/faq/utf_bom.html>
--<> unselfish actions pay back better

Jul 18 '05 #14

Klaus Alexander Seistrup wrote:
...
Just silence the warning in site.py.


Where and how in site.py can I do that?


Module warnings is well worth studying. You can insert (just about
anywhere you prefer in your site.py or sitecustomize.py) the lines:

import warnings
warnings.filterwarnings('ignore', 'Non-ASCII character .*/peps/pep-0263',
                        DeprecationWarning)

this tells Python to ignore all warning messages of class DeprecationWarning
which match (case-insensitively) the regular expression given as the second
parameter of warnings.filterwarnings (choose the RE you prefer, of course --
here, I'm asking that the warning message to be ignored start with
"Non-ASCII character " and contain "/peps/pep-0263" anywhere afterwards,
but you may easily choose to be either more or less permissive than this).
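The effect of such a filter can be checked in isolation with `warnings.catch_warnings`; a sketch (the first warning text mimics the message from the original post):

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")       # record everything by default...
    warnings.filterwarnings(              # ...except the PEP 263 nag
        "ignore",
        "Non-ASCII character .*/peps/pep-0263",
        DeprecationWarning,
    )
    warnings.warn(
        "Non-ASCII character '\\xe6' in file mumble.py on line 2, "
        "but no encoding declared; see "
        "http://www.python.org/peps/pep-0263.html for details",
        DeprecationWarning,
    )
    warnings.warn("something else is deprecated", DeprecationWarning)

# Only the second warning survives the filter:
messages = [str(w.message) for w in caught]
```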
Alex

Jul 18 '05 #15

Alex Martelli wrote:
Module warnings is well worth studying. [...]

import warnings
warnings.filterwarnings('ignore', 'Non-ASCII character .*/peps/pep-0263',
                        DeprecationWarning)
Whauw, neat feature! Thanks a lot for the example.
// Klaus

--<> unselfish actions pay back better

Jul 18 '05 #16

On Wed, 1 Oct 2003 11:58:22 +0000 (UTC), Klaus Alexander Seistrup <sp**@magnetic-ink.dk> wrote:
Erik Max Francis skrev:
Please, could you explain what you mean by "the UTF-8 BOM"?


Byte order marker. It's a clever gimmick Unicode uses, where a few
valid Unicode characters are set aside for being used in sequence to
help determine whether an encoded Unicode stream is little-endian or
big-endian.


Thanks, I also found a reference on unicode.org¹ that was useful.
// Klaus

¹) <http://www.unicode.org/unicode/faq/utf_bom.html>


A table of BOMs appears:

00 00 FE FF UTF-32, big-endian
FF FE 00 00 UTF-32, little-endian
FE FF UTF-16, big-endian
FF FE UTF-16, little-endian
EF BB BF UTF-8

but I'm not sure I trust everything on that page. E.g., at the bottom it says,

"Last updated: - Tuesday, December 09, 1902 16:15:05" ;-)

There appear to be a number of other typos as well, and some mysterious semantics, e.g., in

"""
Q: Can you summarize how I should deal with BOMs?

A: Here are some guidelines to follow:

1. A particular protocol (e.g. Microsoft conventions for
.txt files) may require use of the BOM on certain Unicode
data streams, such as files. When you need to conform to
such a protocol, use a BOM.

2. Some protocols allow optional BOMs in the case of
untagged text. In those cases,

o Where a text data stream is known to be plain text,
but of unknown encoding, BOM can be used as a
signature. If there is no BOM, the encoding could be
anything.

o Where a text data stream is known to be plain
Unicode text (but not which endian), then BOM can be
used as a signature. If there is no BOM, the text
should be interpreted as big-endian.

3. Where the precise type of the data stream is known (e.g.
Unicode big-endian or Unicode little-endian), the BOM should
not be used. [MD]
"""

(3) sounds a little funny, though I think I know what it's trying
to say.

I don't understand (2), unless it's just saying you can make up
ad hoc markup using BOMs to indicate a binary packing scheme totally
orthogonally to what the packed bits might mean as an encoded data stream.

BOMs have always suggested Unicode to me, so this was a liberating notion,
intended or not ;-) In which case, why not UTF-xxz BOMs for zlib zip-format
packing, etc., where xx could be the usual, e.g., UTF-16lez or UTF-8z. I'd
bet the latter could save some bandwidth and disk space on some non-english
web sites, if browsers supported it for UTF-8 unicode.

Actually, is there a standard for overall compressed HTML transfer already?
Or is it ignored in favor of letting lower levels do compression?
Haven't looked lately...

Regards,
Bengt Richter
Jul 18 '05 #17

On 30 Sep 2003 23:40:53 +0200, ma****@v.loewis.de (Martin v. Löwis) wrote:
Klaus Alexander Seistrup <ra**********@myorange.dk> writes:
And while I understand the problem, I cannot fathom why Python
doesn't simply rely on the encoding I have specified in site.py,
which then calls sys.setdefaultencoding().
There are several reasons. Procedurally, there was no suggestion to do
so while the PEP was being discussed, and it was posted both to
comp.lang.python and python-dev several times. At the time, the most
common comment was that Python should just reject any user-defined
encoding, and declare that source code files are always UTF-8,
period. The PEP gives some more flexibility over that position.

Methodically, requiring the encoding to be declared in the source code
is a good thing, as it allows code to be moved across systems,
which would not be that easy if the source encoding were part of the
Python installation. Explicit is better than implicit.

How about letting someone in Klaus' situation be explicit in another way? E.g.,

python -e iso-8859-1 the_unmarked_source.py

Hm, I guess to be consistent you would have to have some way to pass -e info
into any text-file-opening context, e.g., import, execfile, file, open, etc.
In such case, you'd want a default. Maybe it could come from site.py with
override by python -e, again with override by actual individual file-embedded
encoding info.

But being the encoding guru, you have probably explored all those thoughts and
decided what you decided ;-) Care to comment a little on the above, though?
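For what it's worth, the fallback logic proposed above might look something like this — purely hypothetical, since no `-e` option exists; the regex is a simplified version of the PEP 263 pattern (the real rule also requires the declaration to sit in a comment):

```python
import re

# PEP 263 says the coding declaration must appear on line 1 or 2.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def decode_source(data, default="iso-8859-1"):
    """Decode source bytes, preferring an embedded coding line over a
    caller-supplied default (standing in for a hypothetical -e flag)."""
    for line in data.splitlines(True)[:2]:
        match = CODING_RE.search(line)
        if match:
            return data.decode(match.group(1).decode("ascii"))
    return data.decode(default)
```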
Would anyone care to explain why the community has chosen to
inconvenience the user for each python script with non-ASCII
characters, rather than using the default encoding given in the
site.py configuration file?


It is not clear to me why you want that. There are several
possible rationales:

1. You have a problem with existing code, and you are annoyed
by the warning. Just silence the warning in site.py.

That's not the same as giving a proper encoding interpretation, is it?
(Though in comments it wouldn't matter much).

2. You are writing new code, and you are annoyed by the encoding
declaration. Just save your code as UTF-8, using the UTF-8 BOM.

YMMV with the editor you are using though, right?

In neither case, relying on the system default encoding is necessary.

Hm, for python -e to work, I guess all the library stuff would have to
be explicitly marked, or the effect would have to be deferred.
Hm2, is all internal text representation going to wind up wchar at some point?

Regards,
Bengt Richter
Jul 18 '05 #18

bo**@oz.net (Bengt Richter) writes:
How about letting someone in Klaus' situation be explicit in another
way? E.g., python -e iso-8859-1 the_unmarked_source.py
What would the exact meaning of this command line option be?
Hm, I guess to be consistent you would have to have some way to pass
-e info into any text-file-opening context, e.g., import, execfile,
file, open, etc.
Ah, so it should probably apply only to the file passed to Python on
the command line - some people might think it would apply to all
files, though.
In such case, you'd want a default. Maybe it could come from site.py
with override by python -e, again with override by actual individual
file-embedded encoding info.
This shows the problem of this approach: Now it becomes hidden in
site.py, and, as soon as you move the code to a different machine, the
problems come back.
1. You have a problem with existing code, and you are annoyed
by the warning. Just silence the warning in site.py.

That's not the same as giving a proper encoding interpretation, is it?
(Though in comments it wouldn't matter much).


No. However, it would restore the meaning that the code has in 2.2:
For comments and byte string literals, it would be the "as-is"
encoding; for Unicode literals, the interpretation would be latin-1.
2. You are writing new code, and you are annoyed by the encoding
declaration. Just save your code as UTF-8, using the UTF-8 BOM.

YMMV with the editor you are using though, right?


Somewhat, yes. However, I expect that most editors which
specifically support Python also support PEP-263, sooner or later.
Hm2, is all internal text representation going to wind up wchar at
some point?


It appears that string literals will continue to denote byte strings
for quite some time. There is the -U option, so you can try yourself
to see the effects of string literals denoting Unicode objects.

Clearly, a byte string type has to stay in the language. Not as
clearly, there might be a need for byte string literals. A PEP to this
effect was just withdrawn.

Regards,
Martin
Jul 18 '05 #19
