PEP 263 status check

John Roth

PEP 263 is marked finished in the PEP index, however
I haven't seen the specified Phase 2 in the list of changes
for 2.4 which is when I expected it.

Did phase 2 get cancelled, or is it just not in the
changes document?

John Roth

Jul 18 '05 #1

Martin v. Löwis

John Roth wrote:

PEP 263 is marked finished in the PEP index, however
I haven't seen the specified Phase 2 in the list of changes
for 2.4 which is when I expected it.

Did phase 2 get cancelled, or is it just not in the
changes document?

Neither, nor. Although this hasn't been discussed widely,
I personally believe it is too early yet to make lack of
encoding declarations a syntax error. I'd like to
reconsider the issue with Python 2.5.

OTOH, not many people have commented either way: would you
be outraged if a script that has given you a warning about
missing encoding declarations for some time fails with a
strict SyntaxError in 2.4? Has everybody already corrected
their scripts?

Regards,
Martin

Jul 18 '05 #2

Fernando Perez

"Martin v. Löwis" wrote:

I personally believe it is too early yet to make lack of
encoding declarations a syntax error. I'd like to

+1

Making this an all-out failure is pretty brutal, IMHO. You could change the
warning message to be more stringent about it soon becoming an error. But if
someone upgrades to 2.4 because of other benefits, and some large third-party
code they rely on (and which is otherwise perfectly fine with 2.4) fails
catastrophically because of these warnings becoming errors, I suspect they
will be very unhappy.

I see the need to nudge people in the right direction, but there's no need to
do it with a 10.000 Volt stick :)

Best,

f

Jul 18 '05 #3

John Roth

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:41**************@v.loewis.de...

John Roth wrote:
PEP 263 is marked finished in the PEP index, however
I haven't seen the specified Phase 2 in the list of changes
for 2.4 which is when I expected it.

Did phase 2 get cancelled, or is it just not in the
changes document?
Neither, nor. Although this hasn't been discussed widely,
I personally believe it is too early yet to make lack of
encoding declarations a syntax error. I'd like to
reconsider the issue with Python 2.5.

OTOH, not many people have commented either way: would you
be outraged if a script that has given you a warning about
missing encoding declarations for some time fails with a
strict SyntaxError in 2.4? Has everybody already corrected
their scripts?

Well, I don't particularly have that problem because I don't
have a huge number of scripts and for the ones I do it would be
relatively simple to do a scan and update - or just run them
with the unit tests and see if they break!

In fact, I think that a scan and update program in the tools
directory might be a very good idea - just walk through a
Python library, scan and update everything that doesn't
have a declaration.
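
A minimal sketch of the scanning half of such a tool, written in present-day Python 3 syntax rather than for 2.4, and not the tool that was later contributed. It assumes the declaration pattern given in PEP 263, which must appear in a comment on the first or second line, and treats the UTF-8 signature as a declaration. The function names are made up for illustration; the update half is left out because it would have to guess each file's encoding.

```python
import codecs
import os
import re

# Declaration pattern from PEP 263; it must match on line 1 or 2 of the file.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def declared_encoding(data):
    """Return the declared source encoding of raw file bytes, or None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"  # the UTF-8 signature counts as a declaration
    for line in data.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match and line.lstrip().startswith(b"#"):
            return match.group(1).decode("ascii")
    return None


def needs_declaration(data):
    """True if the source has non-ASCII bytes but no encoding declaration."""
    has_non_ascii = any(byte > 127 for byte in data)
    return has_non_ascii and declared_encoding(data) is None


def scan(root):
    """Yield paths of .py files under root that lack a needed declaration."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".py"):
                path = os.path.join(dirpath, name)
                with open(path, "rb") as handle:
                    if needs_declaration(handle.read()):
                        yield path
```

Calling `list(scan("Lib"))` on a library directory would list the files that a stricter Phase 2 would reject.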

The issue has popped in and out of my awareness a few
times; what brought it up this time was Hallvard's thread.

My specific question there was how the code handles the
combination of UTF-8 as the encoding and a non-ascii
character in an 8-bit string literal. Is this an error? The
PEP does not say so. If it isn't, what encoding will
it use to translate from unicode back to an 8-bit
encoding?

Another project for people who care about this
subject: tools. Of the half zillion editors, pretty printers
and so forth out there, how many check for the encoding
line and do the right thing with it? Which ones need to
be updated?

John Roth

Jul 18 '05 #4

Vincent Wehren

"John Roth" <ne********@jhrothjr.com> schrieb im Newsbeitrag
news:10*************@news.supernews.com...
|
| "Martin v. Löwis" <ma****@v.loewis.de> wrote in message
| news:41**************@v.loewis.de...
| > John Roth wrote:
| > > PEP 263 is marked finished in the PEP index, however
| > > I haven't seen the specified Phase 2 in the list of changes
| > > for 2.4 which is when I expected it.
| > >
| > > Did phase 2 get cancelled, or is it just not in the
| > > changes document?
| >
| > Neither, nor. Although this hasn't been discussed widely,
| > I personally believe it is too early yet to make lack of
| > encoding declarations a syntax error. I'd like to
| > reconsider the issue with Python 2.5.
| >
| > OTOH, not many people have commented either way: would you
| > be outraged if a script that has given you a warning about
| > missing encoding declarations for some time fails with a
| > strict SyntaxError in 2.4? Has everybody already corrected
| > their scripts?
|
| Well, I don't particularly have that problem because I don't
| have a huge number of scripts and for the ones I do it would be
| relatively simple to do a scan and update - or just run them
| with the unit tests and see if they break!

Here's another thought: the company I work for uses (embedded) Python as
a scripting language for their report writer (among other things). Users can
add little scripts to their document templates, which are used for printing
database data. This means there are literally hundreds of little Python
scripts embedded within the document templates, which themselves are stored
in whatever database is used as the backend. In such a case, "scan and
update" when upgrading gets a little more complicated ;)

|
| In fact, I think that a scan and update program in the tools
| directory might be a very good idea - just walk through a
| Python library, scan and update everything that doesn't
| have a declaration.
|
| The issue has popped in and out of my awareness a few
| times, what brought it up this time was Hallvard's thread.
|
| My specific question there was how the code handles the
| combination of UTF-8 as the encoding and a non-ascii
| character in an 8-bit string literal. Is this an error? The
| PEP does not say so. If it isn't, what encoding will
| it use to translate from unicode back to an 8-bit
| encoding?

Isn't this covered by:

"Embedding of differently encoded data is not allowed and will
result in a decoding error during compilation of the Python
source code."

--
Vincent Wehren
|
| Another project for people who care about this
| subject: tools. Of the half zillion editors, pretty printers
| and so forth out there, how many check for the encoding
| line and do the right thing with it? Which ones need to
| be updated?
|
| John Roth
| >
| > Regards,
| > Martin
|
|

Jul 18 '05 #5

Martin v. Löwis

John Roth wrote:

In fact, I think that a scan and update program in the tools
directory might be a very good idea - just walk through a
Python library, scan and update everything that doesn't
have a declaration.
Good idea. I'll see whether I can write something before 2.4,
but contributions are definitely welcome.
My specific question there was how the code handles the
combination of UTF-8 as the encoding and a non-ascii
character in an 8-bit string literal. Is this an error? The
PEP does not say so. If it isn't, what encoding will
it use to translate from unicode back to an 8-bit
encoding?
UTF-8 is not in any way special wrt. the PEP. Notice that
UTF-8 is *not* Unicode - it is an encoding of Unicode, just
like ISO-8859-1 or us-ascii (although the latter two only
encode a subset of Unicode). Yes, the byte string literals
will be converted back to an "8-bit encoding", but the 8-bit
encoding will be UTF-8! IOW, byte string literals are always
converted back to the source encoding before execution.
Another project for people who care about this
subject: tools. Of the half zillion editors, pretty printers
and so forth out there, how many check for the encoding
line and do the right thing with it? Which ones need to
be updated?

I know IDLE, Eric, Komodo, and Emacs do support encoding
declarations. I know PythonWin doesn't, although I once
had written patches to add such support. A number of editors
(like notepad.exe) do the right thing only if the document
has the UTF-8 signature.
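
For readers unfamiliar with the term, the UTF-8 signature is the three-byte sequence at the start of a file that encodes U+FEFF. A small sketch in Python 3 syntax, using only the standard `codecs` module; the helper name `has_utf8_signature` is made up here:

```python
import codecs

# A UTF-8 source file that starts with the signature (byte order mark).
source = codecs.BOM_UTF8 + 's = "\u00f6"\n'.encode("utf-8")


def has_utf8_signature(data):
    """True if the raw bytes start with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


# The "utf-8-sig" codec decodes UTF-8 and drops a leading signature.
text = source.decode("utf-8-sig")
```

An editor that only checks for the signature can take this route and never look at the declaration comment at all.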

Of course, editors don't necessarily need to actively
support the feature as long as the declared encoding is
the one they use, anyway. They won't display source in
other encodings correctly, but some of them don't have
the notion of multiple encodings, anyway.

Regards,
Martin

Jul 18 '05 #6

Martin v. Löwis

Vincent Wehren wrote:

Here's another thought: the company I work for uses (embedded) Python as
a scripting language for their report writer (among other things). Users can
add little scripts to their document templates, which are used for printing
database data. This means there are literally hundreds of little Python
scripts embedded within the document templates, which themselves are stored
in whatever database is used as the backend. In such a case, "scan and
update" when upgrading gets a little more complicated ;)
At the same time, it might also get simpler. If the user interface
to edit these scripts is encoding-aware, and/or the database to store
them in is encoding-aware, an automated tool would not need to guess
what the encoding in the source is.
| My specific question there was how the code handles the
| combination of UTF-8 as the encoding and a non-ascii
| character in an 8-bit string literal. Is this an error? The
| PEP does not say so. If it isn't, what encoding will
| it use to translate from unicode back to an 8-bit
| encoding?

Isn't this covered by:

"Embedding of differently encoded data is not allowed and will
result in a decoding error during compilation of the Python
source code."

No. It is perfectly legal to have non-ASCII data in 8-bit string
literals (aka byte string literals, aka <type 'str'>). Of course,
these non-ASCII data also need to be encoded in UTF-8. Whether UTF-8
is an 8-bit encoding, I don't know - it is more precisely described
as a multibyte encoding. At execution time, the byte string literals
then have the source encoding again, i.e. UTF-8.

Regards,
Martin

Jul 18 '05 #7

John Roth

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:41**************@v.loewis.de...

John Roth wrote:
My specific question there was how the code handles the
combination of UTF-8 as the encoding and a non-ascii
character in an 8-bit string literal. Is this an error? The
PEP does not say so. If it isn't, what encoding will
it use to translate from unicode back to an 8-bit
encoding?

UTF-8 is not in any way special wrt. the PEP.

That's what I thought.
Notice that
UTF-8 is *not* Unicode - it is an encoding of Unicode, just
like ISO-8859-1 or us-ascii (although the latter two only
encode a subset of Unicode).
I disagree, but I think this is a definitional issue.
Yes, the byte string literals
will be converted back to an "8-bit encoding", but the 8-bit
encoding will be UTF-8! IOW, byte string literals are always
converted back to the source encoding before execution.
If I understand you correctly, if I put, say, a mixture of
Cyrillic, Hebrew, Arabic and Greek into a byte string
literal, at run time that character string will contain the
proper unicode at each character position?

Or are you trying to say that the character string will
contain the UTF-8 encoding of these characters; that
is, if I do a subscript, I will get one character of the
multi-byte encoding?

The point of this is that I don't think that either behavior
is what one would expect. It's also an open invitation
for someone to make an unchecked mistake! I think this
may be Hallvard's underlying issue in the other thread.

John Roth

Jul 18 '05 #8

Michael Hudson

"John Roth" <ne********@jhrothjr.com> writes:

If I understand you correctly, if I put, say, a mixture of
Cyrillic, Hebrew, Arabic and Greek into a byte string
literal, at run time that character string will contain the
proper unicode at each character position?
Uh, I seem to be making a habit of labelling things you suggest
impossible :-)
Or are you trying to say that the character string will
contain the UTF-8 encoding of these characters; that
is, if I do a subscript, I will get one character of the
multi-byte encoding?

This is what happens, indeed.

Cheers,
mwh

--
This is the fixed point problem again; since all some implementors
do is implement the compiler and libraries for compiler writing, the
language becomes good at writing compilers and not much else!
-- Brian Rogoff, comp.lang.functional

Jul 18 '05 #9

Martin v. Löwis

John Roth wrote:

Or are you trying to say that the character string will
contain the UTF-8 encoding of these characters; that
is, if I do a subscript, I will get one character of the
multi-byte encoding?
Michael is almost right: this is what happens. Except that
what you get, I wouldn't call a "character". Instead, it
is always a single byte - even if that byte is part of
a multi-byte character.

Unfortunately, the things that constitute a byte string
are also called characters in the literature.

To be more specific: In an UTF-8 source file, doing

print "ö" == "\xc3\xb6"
print "ö"[0] == "\xc3"

would print two times "True", and len("ö") is 2.
OTOH, len(u"ö")==1.
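
The same comparison can be reproduced in Python 3 syntax, where text and bytes are separate types, so the encoding step that the Python 2 compiler performs implicitly has to be spelled out. This is only an equivalent sketch, not the code from the post:

```python
text = "\u00f6"               # one Unicode character, "ö" (Python 2: u"ö")
data = text.encode("utf-8")   # its UTF-8 bytes (Python 2: "ö" in a UTF-8 file)

print(data == b"\xc3\xb6")    # True
print(data[0:1] == b"\xc3")   # True: the first byte of a two-byte sequence
print(len(data), len(text))   # 2 1
```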
The point of this is that I don't think that either behavior
is what one would expect. It's also an open invitation
for someone to make an unchecked mistake! I think this
may be Hallvard's underlying issue in the other thread.

What would you expect instead? Do you think your expectation
is implementable?

Regards,
Martin

Jul 18 '05 #10

John Roth

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:41************@v.loewis.de...

John Roth wrote:
Or are you trying to say that the character string will
contain the UTF-8 encoding of these characters; that
is, if I do a subscript, I will get one character of the
multi-byte encoding?
Michael is almost right: this is what happens. Except that
what you get, I wouldn't call a "character". Instead, it
is always a single byte - even if that byte is part of
a multi-byte character.

Unfortunately, the things that constitute a byte string
are also called characters in the literature.

To be more specific: In an UTF-8 source file, doing

print "ö" == "\xc3\xb6"
print "ö"[0] == "\xc3"

would print two times "True", and len("ö") is 2.
OTOH, len(u"ö")==1.
The point of this is that I don't think that either behavior
is what one would expect. It's also an open invitation
for someone to make an unchecked mistake! I think this
may be Hallvard's underlying issue in the other thread.

What would you expect instead? Do you think your expectation
is implementable?

I'd expect that the compiler would reject anything that
wasn't either in the 7-bit ascii subset, or else defined
with a hex escape.
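
For comparison, the rule described here (ASCII or hex escapes only) is the one Python 3 applies to its bytes literals, while its text literals accept any character of the source encoding. A small sketch that checks this with the built-in `compile`; the helper name `compiles` is made up for illustration:

```python
def compiles(source):
    """Return True if the source text compiles, False on a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(compiles('s = b"\\xc3\\xb6"'))  # True: hex escapes are accepted
print(compiles('s = b"\u00f6"'))      # False: non-ASCII character in a bytes literal
print(compiles('s = "\u00f6"'))       # True: a text literal may contain it
```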

The reason for this is simply that wanting to put characters
outside of the 7-bit ascii subset into a byte character string
isn't portable. It just pushes the need for a character set
(encoding) declaration down one level of recursion.
There's already a way of doing this: use a unicode string,
so it's not like we need two ways of doing it.

Now I will grant you that there is a need for representing
the utf-8 encoding in a character string, but do we need
to support that in the source text when it's much more
likely that it's a programming mistake?

As far as implementation goes, it should have been done
at the beginning. Prior to 2.3, there was no way of writing
a program using the utf-8 encoding (I think - I might be
wrong on that) so there were no programs out there that
put non-ascii subset characters into byte strings.

Today it's one more forward migration hurdle to jump over.
I don't think it's a particularly large one, but I don't have
any real world data at hand.

John Roth

Jul 18 '05 #11

Martin v. Löwis

John Roth wrote:

What would you expect instead? Do you think your expectation
is implementable?

I'd expect that the compiler would reject anything that
wasn't either in the 7-bit ascii subset, or else defined
with a hex escape.

Are we still talking about PEP 263 here? If the entire source
code has to be in the 7-bit ASCII subset, then what is the point
of encoding declarations?

If you were suggesting that anything except Unicode literals
should be in the 7-bit ASCII subset, then this is still
unacceptable: Comments should also be allowed to contain non-ASCII
characters, don't you agree?

If you think that only Unicode literals and comments should be
allowed to contain non-ASCII, I disagree: At some point, I'd
like to propose support for non-ASCII in identifiers. This would
allow people to make identifiers that represent words from their
native language, which is helpful for people who don't speak
English well.

If you think that only Unicode literals, comments, and identifiers
should be allowed non-ASCII: perhaps, but this is out of scope
of PEP 263, which *only* introduces encoding declarations,
and explains what they mean for all current constructs.
The reason for this is simply that wanting to put characters
outside of the 7-bit ascii subset into a byte character string
isn't portable.
Define "is portable". With an encoding declaration, I can move
the source code from one machine to another, open it in an editor,
and have it display correctly. This was not portable without
encoding declarations (likewise for comments); with PEP 263,
such source code became portable.

Also, the run-time behaviour is fully predictable (which it
even was without PEP 263): At run-time, the string will have
exactly the same bytes that it does in the .py file. This
is fully portable.
It just pushes the need for a character set
(encoding) declaration down one level of recursion.
It depends on the program. E.g. if the program was to generate
HTML files with an explicit HTTP-Equiv charset=iso-8859-1,
then the resulting program is absolutely, 100% portable.
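
A sketch of that case in Python 3 syntax, where the encoding step is explicit: the page declares its charset and the bytes written out are in exactly that charset, so the result does not depend on the machine it runs on. The function name `render_page` and the sample text are made up for illustration.

```python
def render_page(body, charset="iso-8859-1"):
    """Return an HTML page as bytes, encoded in the charset it declares."""
    html = (
        "<html><head>"
        '<meta http-equiv="Content-Type" '
        'content="text/html; charset=%s">'
        "</head><body>%s</body></html>" % (charset, body)
    )
    return html.encode(charset)


page = render_page("Gr\u00fc\u00dfe")  # "Grüße"
```

In the Python 2 setting discussed here, a byte string literal in an iso-8859-1 source file already holds these bytes, so no explicit encode call is needed.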

For messages directly output to a terminal, portability
might not be important.
There's already a way of doing this: use a unicode string,
so it's not like we need two ways of doing it.
Using a Unicode string might not work, because a library might
crash when confronted with a Unicode string. You are proposing
to break existing applications for no good reason, and with
no simple fix.
Now I will grant you that there is a need for representing
the utf-8 encoding in a character string, but do we need
to support that in the source text when it's much more
likely that it's a programming mistake?
But it isn't! People do put KOI8-R into source code, into
string literals, and it works perfectly fine for them. There
is no reason to arbitrarily break their code.
As far as implementation goes, it should have been done
at the beginning. Prior to 2.3, there was no way of writing
a program using the utf-8 encoding (I think - I might be
wrong on that)
You are wrong. You were always able to put UTF-8 into byte
strings, even at a time where UTF-8 was not yet an RFC
(say, in Python 1.1).
so there were no programs out there that
put non-ascii subset characters into byte strings.
That is just not true. If it were true, there would be no
need to introduce a grace period in the PEP. However,
*many* scripts in the world use non-ASCII in string literals;
it was always possible (although the documentation was
wishy-washy on what it actually meant).
Today it's one more forward migration hurdle to jump over.
I don't think it's a particularly large one, but I don't have
any real world data at hand.

Trust me: the outcry for banning non-ASCII from string literals
would be, by far, louder than the one for a proposed syntax
on decorators. That would break many production systems, CGI
scripts would suddenly stop working, GUIs would crash, etc.

Regards,
Martin

Jul 18 '05 #12

Hallvard B Furuseth

An addition to Martin's reply:

John Roth wrote:

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:41************@v.loewis.de...
John Roth wrote:

To be more specific: In an UTF-8 source file, doing

print "ö" == "\xc3\xb6"
print "ö"[0] == "\xc3"

would print two times "True", and len("ö") is 2.
OTOH, len(u"ö")==1.
The point of this is that I don't think that either behavior
is what one would expect. It's also an open invitation
for someone to make an unchecked mistake! I think this
may be Hallvard's underlying issue in the other thread.
What would you expect instead? Do you think your expectation
is implementable?

I'd expect that the compiler would reject anything that
wasn't either in the 7-bit ascii subset, or else defined
with a hex escape.

Then you should also expect a lot of people to move to
another language - one whose designers live in the real
world instead of your Utopian Unicode world.
The reason for this is simply that wanting to put characters
outside of the 7-bit ascii subset into a byte character string
isn't portable.

Unicode isn't portable either.
Try to output a Unicode string to a device (e.g. your terminal)
whose character encoding is not known to the program.
The program will fail, or just output the raw utf-8 string or
something, or just guess some character set the program's author
is fond of.
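
The failure modes listed here can be sketched in Python 3 syntax. The device encoding is passed in as a parameter because the program has to get it from somewhere; the helper name `encode_for` and the sample text are made up for illustration.

```python
text = "Gr\u00fc\u00dfe"  # "Grüße"


def encode_for(device_encoding, s):
    """Encode s for a device, substituting '?' for unrepresentable characters."""
    return s.encode(device_encoding, errors="replace")


failure = None
try:
    text.encode("ascii")  # strict mode fails on the first non-ASCII character
except UnicodeEncodeError as exc:
    failure = exc

print(encode_for("ascii", text))    # b'Gr??e'
print(encode_for("latin-1", text))  # b'Gr\xfc\xdfe'
print(text.encode("utf-8"))         # b'Gr\xc3\xbc\xc3\x9fe'
```

Each of the three outputs is "right" for some terminal and wrong for the others, which is the portability problem being described.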

For that matter, tell me why my programs should spend any time
on converting between UTF-8 and the character set the
application actually works with just because you are fond of
Unicode. That might be a lot more time than just the time spent
parsing the program. Or tell me why I should spell quite normal
text strings with hex escaping or something, if that's what you
mean.

And tell me why I shouldn't be allowed to work easily with raw
UTF-8 strings, if I do use coding:utf-8.

--
Hallvard

Jul 18 '05 #13

John Roth

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:41**************@v.loewis.de...

John Roth wrote:
What would you expect instead? Do you think your expectation
is implementable?

I'd expect that the compiler would reject anything that
wasn't either in the 7-bit ascii subset, or else defined
with a hex escape.

Are we still talking about PEP 263 here? If the entire source
code has to be in the 7-bit ASCII subset, then what is the point
of encoding declarations?

Martin, I think you misinterpreted what I said at the
beginning. I'm only, and I need to repeat this, ONLY
dealing with the case where the encoding declaration
specifically says that the script is in UTF-8. No other
case.

I'm going to deal with your response point by point,
but I don't think most of this is really relevant. Your
response only makes sense if you missed the point that
I was talking about scripts that explicitly declared their
encoding to be UTF-8, and no other scripts in no
other circumstances.

I didn't mean the entire source was in 7-bit ascii. What
I meant was that if the encoding was utf-8 then the source
for 8-bit string literals must be in 7-bit ascii. Nothing more.
If you were suggesting that anything except Unicode literals
should be in the 7-bit ASCII subset, then this is still
unacceptable: Comments should also be allowed to contain non-ASCII
characters, don't you agree?
Of course.
If you think that only Unicode literals and comments should be
allowed to contain non-ASCII, I disagree: At some point, I'd
like to propose support for non-ASCII in identifiers. This would
allow people to make identifiers that represent words from their
native language, which is helpful for people who don't speak
English well.
Likewise. I never thought otherwise; in fact I'd like to expand
the available operators to include the set operators as well as
the logical operators and the "real" division operator (the one
you learned in grade school - the dash with a dot above and
below the line.)
If you think that only Unicode literals, comments, and identifiers
should be allowed non-ASCII: perhaps, but this is out of scope
of PEP 263, which *only* introduces encoding declarations,
and explains what they mean for all current constructs.
The reason for this is simply that wanting to put characters
outside of the 7-bit ascii subset into a byte character string
isn't portable.
Define "is portable". With an encoding declaration, I can move
the source code from one machine to another, open it in an editor,
and have it display correctly. This was not portable without
encoding declarations (likewise for comments); with PEP 263,
such source code became portable.

Also, the run-time behaviour is fully predictable (which it
even was without PEP 263): At run-time, the string will have
exactly the same bytes that it does in the .py file. This
is fully portable.
It's predictable, but as far as I'm concerned, that's
not only useless behavior, it's counterproductive
behavior. I find it difficult to imagine any case
where the benefit of having normal character
literals accidentally contain utf-8 multi-byte
characters outweighs the pain of having it happen
accidentally, and then figuring out why your program
is giving you weird behavior.

I would grant that there are cases where you
might want this behavior. I am pretty sure they
are in the distinct minority.

It just pushes the need for a character set
(encoding) declaration down one level of recursion.

It depends on the program. E.g. if the program was to generate
HTML files with an explicit HTTP-Equiv charset=iso-8859-1,
then the resulting program is absolutely, 100% portable.

It's portable, but that's not the normal case. See above.
For messages directly output to a terminal, portability
might not be important.
Portability is less of an issue for me than the likelihood
of making a mistake in coding a literal and then having
to debug unexpected behavior when one byte no longer
equals one character.

There's already a way of doing this: use a unicode string,
so it's not like we need two ways of doing it.

Using a Unicode string might not work, because a library might
crash when confronted with a Unicode string. You are proposing
to break existing applications for no good reason, and with
no simple fix.

There's no reason why you have to have a utf-8
encoding declaration. If you want your source to
be utf-8, you need to accept the consequences.
I fully expect Python to support the usual mixture
of encodings until 3.0 at least. At that point, everything
gets to be rewritten anyway.

Now I will grant you that there is a need for representing
the utf-8 encoding in a character string, but do we need
to support that in the source text when it's much more
likely that it's a programming mistake?

But it isn't! People do put KOI8-R into source code, into
string literals, and it works perfectly fine for them. There
is no reason to arbitrarily break their code.
As far as implementation goes, it should have been done
at the beginning. Prior to 2.3, there was no way of writing
a program using the utf-8 encoding (I think - I might be
wrong on that)

You are wrong. You were always able to put UTF-8 into byte
strings, even at a time where UTF-8 was not yet an RFC
(say, in Python 1.1).

Were you able to write your entire program in UTF-8?
I think not.

so there were no programs out there that
put non-ascii subset characters into byte strings.
That is just not true. If it were true, there would be no
need to introduce a grace period in the PEP. However,
*many* scripts in the world use non-ASCII in string literals;
it was always possible (although the documentation was
wishy-washy on what it actually meant).
Today it's one more forward migration hurdle to jump over.
I don't think it's a particularly large one, but I don't have
any real world data at hand.

Trust me: the outcry for banning non-ASCII from string literals
would be, by far, louder than the one for a proposed syntax
on decorators. That would break many production systems, CGI
scripts would suddenly stop working, GUIs would crash, etc.

..

Regards,
Martin

Jul 18 '05 #14

John Roth

"Hallvard B Furuseth" <h.**********@usit.uio.no> wrote in message
news:HB**************@bombur.uio.no...

An addition to Martin's reply:

John Roth wrote:
"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:41************@v.loewis.de...
John Roth wrote:

To be more specific: In an UTF-8 source file, doing

print "ö" == "\xc3\xb6"
print "ö"[0] == "\xc3"

would print two times "True", and len("ö") is 2.
OTOH, len(u"ö")==1.

The point of this is that I don't think that either behavior
is what one would expect. It's also an open invitation
for someone to make an unchecked mistake! I think this
may be Hallvard's underlying issue in the other thread.

What would you expect instead? Do you think your expectation
is implementable?
I'd expect that the compiler would reject anything that
wasn't either in the 7-bit ascii subset, or else defined
with a hex escape.

Then you should also expect a lot of people to move to
another language - one whose designers live in the real
world instead of your Utopian Unicode world.

Rudeness objection to your characterization.

Please see my response to Martin - I'm talking only,
and I repeat ONLY, about scripts that explicitly
say they are encoded in utf-8. Nothing else. I've
been in this business for close to 40 years, and I'm
quite well aware of backwards compatibility issues
and issues with breaking existing code.

Programmers in general have a very strong, and
let me repeat that, VERY STRONG assumption
that an 8-bit string contains one byte per character
unless there is a good reason to believe otherwise.
This assumption is built into various places, including
all of the string methods.

The current design allows accidental inclusion of
a character that is not in the 7bit ascii subset ***IN
A PROGRAM THAT HAS A UTF-8 CHARACTER
ENCODING DECLARATION*** to break that
assumption without any kind of notice. That in
turn will break all of the assumptions that the string
module and string methods are based on. That in
turn is likely to break lots of existing modules and
cause a lot of debugging time that could be avoided
by proper design.
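
A sketch of the kind of breakage meant here, in Python 3 syntax with UTF-8 bytes standing in for the Python 2 byte string. It only illustrates the one-byte-per-character assumption; the sample word is made up.

```python
data = "na\u00efve".encode("utf-8")  # "naïve" as UTF-8 bytes

print(len(data))  # 6 bytes for 5 characters

# Slicing by position can cut the two-byte sequence for "ï" in half.
head = data[:3]
broken = False
try:
    head.decode("utf-8")
except UnicodeDecodeError:
    broken = True

print(data.upper())                  # b'NA\xc3\xafVE': only ASCII bytes change
print(data.decode("utf-8").upper())  # NAÏVE
```

Byte-oriented methods keep working without an error; they just operate on bytes rather than characters, which is why the mistake can go unnoticed.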

One of Python's strong points is that it's difficult
to get into trouble unless you deliberately try (then
it's quite easy, fortunately.)

I'm not worried about this causing people to
abandon Python. I'm more worried about the
current situation causing enough grief that people
will decide that utf-8 source code encoding isn't
worth it.
And tell me why I shouldn't be allowed to work easily with raw
UTF-8 strings, if I do use coding:utf-8.
First, there's nothing that's stopping you. All that
my proposal will do is require you to do a one
time conversion of any strings you put in the
program as literals. It doesn't affect any other
strings in any other way at any other time.

I'll withdraw my objection if you can seriously
assure me that working with raw utf-8 in
8-bit character string literals is what most programmers
are going to do most of the time.

I'm not going to accept the very common need
of converting unicode strings to 8-bit strings so
they can be written to disk or stored in a data base
or whatnot (or reversing the conversion for reading.)
That has nothing to do with the current issue - it's
something that everyone who deals with unicode
needs to do, regardless of the encoding of the
source program.

John Roth

Jul 18 '05 #15

Martin v. Löwis

John Roth wrote:

Martin, I think you misinterpreted what I said at the
beginning. I'm only, and I need to repeat this, ONLY
dealing with the case where the encoding declaration
specifically says that the script is in UTF-8. No other
case.
From the viewpoint of PEP 263, there is absolutely *no*,
and I repeat NO difference between choosing UTF-8 and
choosing windows-1252 as the source encoding.
I'm going to deal with your response point by point,
but I don't think most of this is really relevant. Your
response only makes sense if you missed the point that
I was talking about scripts that explicitly declared their
encoding to be UTF-8, and no other scripts in no
other circumstances.
I don't understand why it is desirable to single out
UTF-8 as a source encoding. PEP 263 does no such thing,
except for allowing an additional encoding declaration
for UTF-8 (by means of the UTF-8 signature).
I didn't mean the entire source was in 7-bit ascii. What
I meant was that if the encoding was utf-8 then the source
for 8-bit string literals must be in 7-bit ascii. Nothing more.
PEP 263 never says such a thing. Why did you get this impression
after reading it?

*If* you understood that byte string literals can have the full
power of the source encoding, plus hex-escaping, I can't see what
made you think that power did not apply if the source encoding
was UTF-8.
Likewise. I never thought otherwise; in fact I'd like to expand
the available operators to include the set operators as well as
the logical operators and the "real" division operator (the one
you learned in grade school - the dash with a dot above and
below the line.)
That would be a different PEP, though, and I doubt Guido will be
in favour. However, this is OT for this thread.
It's predictable, but as far as I'm concerned, that's
not only useless behavior, it's counterproductive
behavior. I find it difficult to imagine any case
where the benefit of having normal character
literals accidentally contain utf-8 multi-byte
characters outweighs the pain of having it happen
accidentally, and then figuring out why your program
is giving you weird behavior.
Might be. This is precisely the issue that Hallvard is addressing.
I agree there should be a mechanism to check whether all significant
non-ASCII characters are inside Unicode literals.
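
A minimal sketch of what such a check could look like, assuming a modern Python 3 interpreter and its standard `tokenize` module. The helper name `non_ascii_byte_literals` is hypothetical and not part of any PEP or of Python itself. It only inspects the textual form of each literal: a literal without a `u` prefix is treated as a "byte string literal" in the sense used in this thread, even though Python 3 itself would treat it as a Unicode string.

```python
import io
import tokenize


def non_ascii_byte_literals(source):
    """Yield (line_number, literal) for string literals that have no
    u prefix but contain characters outside the 7-bit ASCII range."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        literal = tok.string
        # The prefix (u, r, b, ...) is everything before the first quote.
        quote_pos = min(
            i for i in (literal.find('"'), literal.find("'")) if i != -1
        )
        prefix = literal[:quote_pos].lower()
        has_non_ascii = any(ord(ch) > 127 for ch in literal)
        if "u" not in prefix and has_non_ascii:
            yield tok.start[0], literal


# Three assignments: a plain literal with non-ASCII text, a Unicode
# literal with the same text, and a plain literal using hex escapes.
sample = (
    'a = "gr\u00fc\u00df"\n'
    'b = u"gr\u00fc\u00df"\n'
    'c = "gr\\xfc\\xdf"\n'
)

flagged = list(non_ascii_byte_literals(sample))
print(flagged)
```

Only the first assignment is reported; the Unicode literal and the hex-escaped literal pass. Whether such a check should be a warning, a command line switch, or a per-file declaration is exactly the open question in this thread.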

I personally would prefer a command line switch over a per-file
declaration, but that would be the subject of Hallvard's PEP.
Under no circumstances would I disallow using the full source
encoding in byte strings, even if the source encoding is UTF-8.
There's no reason why you have to have a utf-8
encoding declaration. If you want your source to
be utf-8, you need to accept the consequences.
Even for UTF-8, you need an encoding declaration (although
the UTF-8 signature is sufficient for that matter). If
there is no encoding declaration whatsoever, Python will
assume that the source is us-ascii.
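
The two ways of declaring a source encoding mentioned here (the coding comment and the UTF-8 signature) can be observed with the standard library of a current Python 3, as a rough illustration. Note the assumption: this uses Python 3's `tokenize.detect_encoding`, whose default for an undeclared file is UTF-8, not the us-ascii or Latin-1 defaults of the 2.x releases discussed in this thread. The helper name `declared_encoding` is made up for the example.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding that Python 3's tokenizer infers
    from a coding comment or a UTF-8 signature (BOM)."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


with_cookie = b"# -*- coding: latin-1 -*-\nx = 1\n"
with_signature = b"\xef\xbb\xbfx = 1\n"   # UTF-8 signature, no comment
no_declaration = b"x = 1\n"

print(declared_encoding(with_cookie))      # iso-8859-1
print(declared_encoding(with_signature))   # utf-8-sig
print(declared_encoding(no_declaration))   # utf-8 (Python 3 default)
```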
I fully expect Python to support the usual mixture
of encodings until 3.0 at least. At that point, everything
gets to be rewritten anyway.
I very much doubt that, in two ways:
a) Python 3.0 will not happen, in any foreseeable future
b) if it happens, much code will stay the same, or only
require minor changes. I doubt that non-UTF-8 source
encoding will be banned in Python 3.
Were you able to write your entire program in UTF-8?
I think not.

What do you mean, your entire program? All strings?
Certainly you were. Why not?

Of course, before UTF-8 was an RFC, there were no
editors available, nor would any operating system
support output in UTF-8, so you would need to
organize everything on your own (perhaps it was
simpler on Plan-9 at that time, but I have never
really used Plan-9 - and you might have needed
UTF-1 instead, anyway).

Regards,
Martin

Jul 18 '05 #16

Martin v. Löwis

John Roth wrote:

I've
been in this business for close to 40 years, and I'm
quite well aware of backwards compatibility issues
and issues with breaking existing code.

Programmers in general have a very strong, and
let me repeat that, VERY STRONG assumption
that an 8-bit string contains one byte per character
unless there is a good reason to believe otherwise.
You clearly come from a Western business. In CJK
languages, people are very aware that characters can
have more than one byte. They consider UTF-8 as just
another multi-byte encoding, and used to consider it
as an encoding that Westerners made to complicate their
lives. That attitude appears to be changing now, but
UTF-8 is not a clear winner in the worlds where we
Westerners would expect it to be a clear winner.
The current design allows accidental inclusion of
a character that is not in the 7bit ascii subset ***IN
A PROGRAM THAT HAS A UTF-8 CHARACTER
ENCODING DECLARATION*** to break that
assumption without any kind of notice.
This is a problem only for the Western world. In the
CJK languages, such programs were broken a long time
ago. I don't think Python needs to be so Americo-centric
as to protect American programmers from programming
mistakes.
That in
turn will break all of the assumptions that the string
module and string methods are based on. That in
turn is likely to break lots of existing modules and
cause a lot of debugging time that could be avoided
by proper design.
Indeed. If the program is currently not broken, why
are you changing the source encoding? If you are
trying to support multiple languages, a properly-
designed application would use gettext instead
of putting non-ASCII into source code.

If you are writing a new application, and you
put non-ASCII into the source, in UTF-8, are you
not testing your application properly?
I'm not worried about this causing people to
abandon Python. I'm more worried about the
current situation causing enough grief that people
will decide that utf-8 source code encoding isn't
worth it.
Again, this is what Hallvard's PEP is for. It
does not apply to UTF-8 only, but I see no reason
why UTF-8 needs to be singled out.
I'll withdraw my objection if you can seriously
assure me that working with raw utf-8 in
8-bit character string literals is what most programmers
are going to do most of the time.

In what time scale? In the near term, most people will use
other source encodings. In the medium term, I expect
Unix will switch to UTF-8 throughout, at which point
using UTF-8 byte strings will work on every Unix
system - the scripts, by nature, won't work on non-Unix
systems, anyway. In the long term, I expect all Python
strings will be Unicode strings, unless explicitly
declared as byte strings.

Regards,
Martin

Jul 18 '05 #17

Terry Reedy

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:41**************@v.loewis.de...

If you think that only Unicode literals and comments should be
allowed to contain non-ASCII, I disagree: At some point, I'd
like to propose support for non-ASCII in identifiers. This would
allow people to make identifiers that represent words from their
native language, which is helpful for people who don't speak
English well.

Off the main topic of this thread, but...

While sympathizing with this notion, I have hitherto opposed it on the
basis that this would lead to code that could only be read by people within
each language group. But, rereading your idea, I realize that this
objection would be overcome by a reader that displayed for each Unicode
char (codepoint?) not its native glyph but a roman transliteration. As far
as I know, such transliterations, more or less standardized, exist at least
for all major alphabets and syllable systems. Indeed, I would find
Japanese code displayed as

for sushi in michiro.readlines():
    print fuji(sushi)

clearer than 'English' code using identifiers like Q8zB2_0Ol1!

If the Unicode group does not distribute a master roman transliteration
table at least for alphabetic symbols, I would consider it a lack that
hinders adoption of Unicode.

Some writing systems also have different number digits, which could also be
used natively and transliterated. A Unicode Python could also use a set of
user codepoints as an alternate coding of keywords for almost complete
nativification. I believe the math symbols are pretty universal (but could
be educated if not).

Terry J. Reedy

Jul 18 '05 #18

John Roth

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:41************@v.loewis.de...

John Roth wrote:
Martin, I think you misinterpreted what I said at the
beginning. I'm only, and I need to repeat this, ONLY
dealing with the case where the encoding declaration
specifically says that the script is in UTF-8. No other
case.
From the viewpoint of PEP 263, there is absolutely *no*,
and I repeat NO difference between choosing UTF-8 and
choosing windows-1252 as the source encoding.

I don't believe I ever said that PEP 263 said there was
a difference. If I gave you that impression, I will
apologize if you can show me where I did it.

I'm going to deal with your response point by point,
but I don't think most of this is really relevant. Your
response only makes sense if you missed the point that
I was talking about scripts that explicitly declared their
encoding to be UTF-8, and no other scripts in no
other circumstances.

I don't understand why it is desirable to single out
UTF-8 as a source encoding. PEP 263 does no such thing,
except for allowing an additional encoding declaration
for UTF-8 (by means of the UTF-8 signature).

As far as I'm concerned, what PEP 263 says is utterly
irrelevant to the point I'm trying to make.

The only connection PEP 263 has to the entire thread
(at least from my view) is that I wanted to check on
whether phase 2, as described in the PEP, was
scheduled for 2.4. I was under the impression it was
and was puzzled by not seeing it. You said it wouldn't
be in 2.4. Question answered, no further issue on
that point (but see below for an additional puzzlement.)

I didn't mean the entire source was in 7-bit ascii. What
I meant was that if the encoding was utf-8 then the source
for 8-bit string literals must be in 7-bit ascii. Nothing more.

PEP 263 never says such a thing. Why did you get this impression
after reading it?

I didn't get it from the PEP. I got it from what you said. Your
response seemed to make sense only if you assumed that I
had this totally idiotic idea that we should change everything
to 7-bit ascii. That was not my intention.

Let's go back to square one and see if I can explain my
concern from first principles.

8-bit strings have a builtin assumption that one
byte equals one character. This is something that
is ingrained in the basic fabric of many programming
languages, Python included. It's a basic assumption
in the string module, the string methods and all through
just about everything, and it's something that most
programmers expect, and IMO have every right
to expect.

Now, people violate this assumption all the time,
for a number of reasons, including binary data and
encoded data (including utf-8 encodings)
but they do so deliberately, knowing what they're
doing. These particular exceptions don't negate the
rule.

The problem I have is that if you use utf-8 as the
source encoding, you can suddenly drop multi-byte
characters into an 8-bit string ***BY ACCIDENT***.
This accident is not possible with single byte
encodings, which is why I am emphasizing that I
am only talking about source that is encoded in utf-8.
(I don't know what happens with far Eastern multi-byte
encodings.)

UTF-8 encoded source has this problem. Source
encoded with single byte encodings does not have
this problem. It's as simple as that. Accordingly
it is not my intention, and has never been my
intention, to change the way 8-bit string literals
are handled when the source program has a
single byte encoding.
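
To make the accident concrete, here is a small sketch in today's Python 3, where the byte-oriented `bytes` type plays roughly the role that the 8-bit string plays in this thread (that mapping is an assumption of the example, not something stated in the posts). The same five-character name is encoded once with a single-byte encoding and once with UTF-8:

```python
name = "L\u00f6wis"   # five characters, one of them outside ASCII

latin1_bytes = name.encode("latin-1")
utf8_bytes = name.encode("utf-8")

print(len(name))          # 5 characters
print(len(latin1_bytes))  # 5 bytes: one byte per character holds
print(len(utf8_bytes))    # 6 bytes: the invariant is broken

# Slicing by byte index can cut a multi-byte character in half.
first_two = utf8_bytes[:2]
try:
    first_two.decode("utf-8")
    cut_in_half = False
except UnicodeDecodeError:
    cut_in_half = True

print(cut_in_half)        # True
```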

We may disagree on whether this is enough of
a problem that it warrants a solution. That's life.

Now, my suggested solution of this problem was
to require that 8-bit string literals in source that was
encoded with UTF-8 be restricted to the 7-bit
ascii subset. The reason is that there are logically
three things that can be done here if we find a
character that is outside of the 7-bit ascii subset.

One is to do the current practice and violate the
one byte == one character invariant, the second
is to use some encoding to convert the non-ascii
characters into a single byte encoding, thus
preserving the one byte == one character invariant.
The third is to prohibit anything that is ambiguous,
which in practice means to restrict 8-bit literals
to the 7-bit ascii subset (plus hex escapes, of course.)

The second possibility begs the question of what
encoding to use, which is why I don't seriously
propose it (although if I understand Hallvard's
position correctly, that's essentially his proposal.)
*If* you understood that byte string literals can have the full
power of the source encoding, plus hex-escaping, I can't see what
made you think that power did not apply if the source encoding
was UTF-8.
I think I covered that adequately above. It's not that
it doesn't apply, it's that it's unsafe.

It's predictable, but as far as I'm concerned, that's
not only useless behavior, it's counterproductive
behavior. I find it difficult to imagine any case
where the benefit of having normal character
literals accidentally contain utf-8 multi-byte
characters outweighs the pain of having it happen
accidentally, and then figuring out why your program
is giving you weird behavior.

Might be. This is precisely the issue that Hallvard is addressing.
I agree there should be a mechanism to check whether all significant
non-ASCII characters are inside Unicode literals.

I think that means we're in substantive agreement (although
I see no reason to restrict comments to 7-bit ascii.)
I personally would prefer a command line switch over a per-file
declaration, but that would be the subject of Hallvard's PEP.
Under no circumstances would I disallow using the full source
encoding in byte strings, even if the source encoding is UTF-8.
I assume here you intended to mean strings, not literals. If
so, we're in agreement - I see absolutely no reason to even
think of suggesting a change to Python's run time string
handling behavior.

There's no reason why you have to have a utf-8
encoding declaration. If you want your source to
be utf-8, you need to accept the consequences.

Even for UTF-8, you need an encoding declaration (although
the UTF-8 signature is sufficient for that matter). If
there is no encoding declaration whatsoever, Python will
assume that the source is us-ascii.

I think I didn't say this clearly. What I intended to get across
is that there isn't any major reason for a source to be utf-8;
other encodings are for the most part satisfactory.
Saying something about the declaration seems to have muddied
the meaning.

The last sentence puzzles me. In 2.3, absent a declaration
(and absent a parameter on the interpreter) Python assumes
that the source is Latin-1, and phase 2 was to change
this to the 7-bit ascii subset (US-Ascii). That was the
original question at the start of this thread. I had assumed
that change was to go into 2.4, your reply made it seem
that it would go into 2.5 (maybe.) This statement makes
it seem that it is the current state in 2.3.

I fully expect Python to support the usual mixture
of encodings until 3.0 at least. At that point, everything
gets to be rewritten anyway.

I very much doubt that, in two ways:
a) Python 3.0 will not happen, in any foreseeable future

I probably should let this sleeping dog lie, however,
there is a general expectation that there will be a 3.0
at some point before the heat death of the universe.
I was certainly under that impression, and I've seen
nothing from anyone who I regard as authoritative until
this statement that says otherwise.
b) if it happens, much code will stay the same, or only
require minor changes. I doubt that non-UTF-8 source
encoding will be banned in Python 3.
Were you able to write your entire program in UTF-8?
I think not.
What do you mean, your entire program? All strings?
Certainly you were. Why not?

Of course, before UTF-8 was an RFC, there were no
editors available, nor would any operating system
support output in UTF-8, so you would need to
organize everything on your own (perhaps it was
simpler on Plan-9 at that time, but I have never
really used Plan-9 - and you might have needed
UTF-1 instead, anyway).

This doesn't make sense in context. I'm not talking
about some misty general UTF-8. I'm talking
about writing Python programs using the c-python
interpreter. Not jython, not IronPython, not some
other programming language.
Specifically, what would the Python 2.2 interpreter
have done if I handed it a program encoded in utf-8?
Was that a legitimate encoding? I don't know whether
it was or not. Clearly it wouldn't have been possible
before the unicode support in 2.0.

John Roth

Regards,
Martin

Jul 18 '05 #19

John Roth

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:41**************@v.loewis.de...

John Roth wrote:
I've
been in this business for close to 40 years, and I'm
quite well aware of backwards compatibility issues
and issues with breaking existing code.

Programmers in general have a very strong, and
let me repeat that, VERY STRONG assumption
that an 8-bit string contains one byte per character
unless there is a good reason to believe otherwise.
You clearly come from a Western business. In CJK
languages, people are very aware that characters can
have more than one byte. They consider UTF-8 as just
another multi-byte encoding, and used to consider it
as an encoding that Westerners made to complicate their
lives. That attitude appears to be changing now, but
UTF-8 is not a clear winner in the worlds where we
Westerners would expect it to be a clear winner.

I'm aware of that.

The current design allows accidental inclusion of
a character that is not in the 7bit ascii subset ***IN
A PROGRAM THAT HAS A UTF-8 CHARACTER
ENCODING DECLARATION*** to break that
assumption without any kind of notice.

This is a problem only for the Western world. In the
CJK languages, such programs were broken a long time
ago. I don't think Python needs to be so Americo-centric
as to protect American programmers from programming
mistakes.

American != non East Asian.

In fact, I would consider American programmers to
be the least prone to making this kind of mistake
simply because all standard characters are included
in the US-Ascii subset. It's much more likely to be
a European (or non North American) problem.
Even when writing in English, people's names will
have non-English characters, and they have a
tendency to leak into literals.
(Mexico considers themselves to be part of
Central America, for some political reason.)

That in
turn will break all of the assumptions that the string
module and string methods are based on. That in
turn is likely to break lots of existing modules and
cause a lot of debugging time that could be avoided
by proper design.

Indeed. If the program is currently not broken, why
are you changing the source encoding? If you are
trying to support multiple languages, a properly-
designed application would use gettext instead
of putting non-ASCII into source code.

If you are writing a new application, and you
put non-ASCII into the source, in UTF-8, are you
not testing your application properly?
I'm not worried about this causing people to
abandon Python. I'm more worried about the
current situation causing enough grief that people
will decide that utf-8 source code encoding isn't
worth it.

Again, this is what Hallvard's PEP is for. It
does not apply to UTF-8 only, but I see no reason
why UTF-8 needs to be singled out.
I'll withdraw my objection if you can seriously
assure me that working with raw utf-8 in
8-bit character string literals is what most programmers
are going to do most of the time.

In what time scale? In the near term, most people will use
other source encodings. In the medium term, I expect
Unix will switch to UTF-8 throughout, at which point
using UTF-8 byte strings will work on every Unix
system - the scripts, by nature, won't work on non-Unix
systems, anyway. In the long term, I expect all Python
strings will be Unicode strings, unless explicitly
declared as byte strings.

I asked Hallvard this question, not you. It makes sense
in the context of the statements of his I was responding to.

Your answer does not make sense. Hallvard's objection
was that he actually wanted to have non-ascii characters
put into byte literals in their utf-8 encoded forms (at least
as I understand it.)

If I thought about it, I could undoubtedly come up with
use cases where I would find this behavior useful. The
presupposition behind my statement was that those
use cases were overwhelmingly less likely than the
standard uses of byte string literals where a utf-8
encoded "character" would be a problem.

John Roth

Regards,
Martin

Jul 18 '05 #20

Martin v. Löwis

John Roth wrote:

I don't believe I ever said that PEP 263 said there was
a difference. If I gave you that impression, I will
apologize if you can show me where I did it.
In <10*************@news.supernews.com>, titled
" PEP 263 status check", you write

[quote]
My specific question there was how the code handles the
combination of UTF-8 as the encoding and a non-ascii
character in an 8-bit string literal. Is this an error?
[end quote]

So I assumed you were all the time talking about how this
is implemented, and how you expected it to be implemented,
and I assumed we agree that the implementation should
match the specification in PEP 263.
As far as I'm concerned, what PEP 263 says is utterly
irrelevant to the point I'm trying to make.
Then I don't know what the point is you are trying to
make. It appears that you are now saying that Python
does not work the way it should work. IOW, you are
proposing that it be changed, right? This sounds like
another PEP.
The only connection PEP 263 has to the entire thread
(at least from my view) is that I wanted to check on
whether phase 2, as described in the PEP, was
scheduled for 2.4. I was under the impression it was
and was puzzled by not seeing it. You said it wouldn't
be in 2.4. Question answered, no further issue on
that point (but see below for an additional puzzlement.)
Ok. A change of subject might have helped.
8-bit strings have a builtin assumption that one
byte equals one character.
Not at all. Some 8-bit strings don't denote characters
at all, and some 8-bit strings, at least in some regions
of the world, are deliberately using multi-byte character
encodings. In particular, UTF-8 is such an encoding.
It's a basic assumption
in the string module, the string methods and all through
just about everything, and it's something that most
programmers expect, and IMO have every right
to expect.
Not at all. Most string methods don't assume anything
about characters. Instead, they assume that the building
block of a byte string is a "byte", and operate on those.
Only some methods of the string objects assume that the
bytes denote characters; they typically assume that the
current locale provides the definition of the character
set.
Now, people violate this assumption all the time,
for a number of reasons, including binary data and
encoded data (including utf-8 encodings)
but they do so deliberately, knowing what they're
doing. These particular exceptions don't negate the
rule.
Not at all. These usages are deliberate, equally legitimate
applications of the string type. In Python, the string
type really is meant for binary data (unlike, say, C,
which has issues with NUL bytes).
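
As a rough illustration of a string type that operates on bytes and is happy to hold binary data, here is how Python 3's `bytes` type behaves. Treating `bytes` as the successor of the 8-bit string discussed here is an assumption of this sketch; in particular, `bytes.upper()` only touches ASCII letters, whereas the case methods of the 2.x string type could depend on the locale.

```python
data = b"abc\x00def\xff"      # embedded NUL and a non-ASCII byte

print(len(data))              # 8: length counts bytes, NUL included
print(data.find(b"def"))      # 4: searching works past the NUL byte
print(data.upper())           # b'ABC\x00DEF\xff': only ASCII letters change
```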
The problem I have is that if you use utf-8 as the
source encoding, you can suddenly drop multi-byte
characters into an 8-bit string ***BY ACCIDENT***.
Ok.
(I don't know what happens with far Eastern multi-byte
encodings.)
The same issues as UTF-8, plus some additional worse issues.
Now, my suggested solution of this problem was
to require that 8-bit string literals in source that was
encoded with UTF-8 be restricted to the 7-bit
ascii subset.
Ok. I disagree that this is desirable; if you really
want to see that happen, you should write a PEP.
The second possibility begs the question of what
encoding to use, which is why I don't seriously
propose it (although if I understand Hallvard's
position correctly, that's essentially his proposal.)
No. He proposes your third alternative (ban non-ASCII
characters in byte string literals), not just for UTF-8,
but for all encodings. Not for all files, though, but
only for selected files.

If
there is no encoding declaration whatsoever, Python will
assume that the source is us-ascii.

[...] The last sentence puzzles me. In 2.3, absent a declaration
(and absent a parameter on the interpreter) Python assumes
that the source is Latin-1, and phase 2 was to change
this to the 7-bit ascii subset (US-Ascii). That was the
original question at the start of this thread. I had assumed
that change was to go into 2.4, your reply made it seem
that it would go into 2.5 (maybe.) This statement makes
it seem that it is the current state in 2.3.
With "will assume", I actually meant future tense. Not
being a native speaker, I'm uncertain how to distinguish
this from the conditional form that you apparently understood.
Specifically, what would the Python 2.2 interpreter
have done if I handed it a program encoded in utf-8?
Was that a legitimate encoding?
Yes, the Python interpreter would have processed it.

print "Grüß Gott"

would have sent the greeting to the terminal.
I don't know whether
it was or not. Clearly it wouldn't have been possible
before the unicode support in 2.0.

Why do you think so? The above print statement has worked
since Python 1.0 or so. Before PEP 263, Python was unaware
of source encodings, and would literally copy the bytes
from the source code file into the string object - whether
they were latin-1, UTF-8, or some other encoding. The
only requirement was that the encoding needs to be an
ASCII superset, so that Python properly detects the end
of the string.
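
The ASCII-superset requirement can be sketched as follows, in Python 3 and under the assumption that the old parser simply scanned for the closing quote byte (0x22) without decoding anything. The function name `quote_byte_appears` and the sample characters are chosen for this example only. In an ASCII superset such as Latin-1 or UTF-8, a non-ASCII character never produces the byte 0x22; in UTF-16 it can.

```python
QUOTE = b'"'   # the byte 0x22 that terminates a string literal


def quote_byte_appears(text, codec):
    """Return True if encoding `text` (which contains no quote
    character) nevertheless produces the byte 0x22."""
    return QUOTE in text.encode(codec)


for_all = "\u2200"             # U+2200 FOR ALL
greeting = "Gr\u00fc\u00df Gott"

print(quote_byte_appears(greeting, "latin-1"))    # False
print(quote_byte_appears(for_all, "utf-8"))       # False
print(quote_byte_appears(for_all, "utf-16-be"))   # True: bytes 0x22 0x00
```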

Regards,
Martin

Jul 18 '05 #21

Martin v. Löwis

Terry Reedy wrote:

While sympathizing with this notion, I have hitherto opposed it on the
basis that this would lead to code that could only be read by people within
each language group. But, rereading your idea, I realize that this
objection would be overcome by a reader that displayed for each Unicode
char (codepoint?) not its native glyph but a roman transliteration.
I personally consider this objection irrelevant. Yes, it is desirable
that portable libraries use only pronounceable (in English) identifiers.
However, that is no justification for the language to make a policy
decision that all source code in the language needs to use pronounceable
identifiers. Instead, the author of each piece of code needs to make
a decision what kind of identifiers to use. Some people (e.g. children)
don't care a bit if somebody 20km away can read their source code, let
alone somebody 10000km away - those far-away people will never get
to see the code in the first place.

So I doubt there is much need for transliterating source code viewers.
At the same time, it might be a fun project to do.
Some writing systems also have different number digits, which could also be
used natively and transliterated. A Unicode Python could also use a set of
user codepoints as an alternate coding of keywords for almost complete
nativification. I believe the math symbols are pretty universal (but could
be educated if not).

Now, this is different story. To implement this, the Python parser needs
to be changed to contain locale information, and one carefully has to
make an implementation so that the same code will run the same way
independent of the locale in which it is executed. This requires that
information about all locales is included in all installations, which
is expensive to maintain.

In addition, alternate keywords might not help so much, since real
integration into the natural language would also require to change
the order of identifiers and keywords - something that I consider
unimplementable.

Regards,
Martin

Jul 18 '05 #22

John Roth

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:41***********************@news.freenet.de...

John Roth wrote:
I don't believe I ever said that PEP 263 said there was
a difference. If I gave you that impression, I will
apologize if you can show me where I did it.
In <10*************@news.supernews.com>, titled
" PEP 263 status check", you write

[quote]
My specific question there was how the code handles the
combination of UTF-8 as the encoding and a non-ascii
character in an 8-bit string literal. Is this an error?
[end quote]

So I assumed you were all the time talking about how this
is implemented, and how you expected it to be implemented,
and I assumed we agree that the implementation should
match the specification in PEP 263.

Ah! While my assumption was that the code had been
implemented correctly according to the specification,
and that the specification leaves a trap for the unwary
in one very significant (although also very narrow) case.

As far as I'm concerned, what PEP 263 says is utterly
irrelevant to the point I'm trying to make.

Then I don't know what the point is you are trying to
make. It appears that you are now saying that Python
does not work the way it should work. IOW, you are
proposing that it be changed, right? This sounds like
another PEP.

It could very well be another PEP.

8-bit strings have a builtin assumption that one
byte equals one character.
Not at all. Some 8-bit strings don't denote characters
at all, and some 8-bit strings, at least in some regions
of the world, are deliberately using multi-byte character
encodings. In particular, UTF-8 is such an encoding.

This is true, but it's also beside the point. Most *programmers*
(other than ones that use single-language multi-byte
encodings) make that assumption. If they didn't there
wouldn't be a problem.

Every tutorial I've ever seen on unicode spends a great
deal of time at the beginning explaining the difference
between bytes, characters, encodings and all that stuff.
If this was common knowledge, why would the authors
bother? They bother simply because it isn't common
knowledge, at least in the sense that it's wired into
developers' common coding intuitions and habits.

The problem I have is that if you use utf-8 as the
source encoding, you can suddenly drop multi-byte
characters into an 8-bit string ***BY ACCIDENT***.
Ok.

Now, my suggested solution of this problem was
to require that 8-bit string literals in source that was
encoded with UTF-8 be restricted to the 7-bit
ascii subset.

Ok. I disagree that this is desirable; if you really
want to see that happen, you should write a PEP.
The second possibility begs the question of what
encoding to use, which is why I don't seriously
propose it (although if I understand Hallvard's
position correctly, that's essentially his proposal.)

No. He proposes your third alternative (ban non-ASCII
characters in byte string literals), not just for UTF-8,
but for all encodings. Not for all files, though, but
only for selected files.

Which is what I don't like about it. It adds complexity
to the language and a feature that I don't think is really
necessary (restricting string literals for single-byte encodings.)
The other thing I don't like is that it still leaves the
trap for the unwary which I'm discussing.

If
there is no encoding declaration whatsoever, Python will
assume that the source is us-ascii. [...]
The last sentence puzzles me. In 2.3, absent a declaration
(and absent a parameter on the interpreter) Python assumes
that the source is Latin-1, and phase 2 was to change
this to the 7-bit ascii subset (US-Ascii). That was the
original question at the start of this thread. I had assumed
that change was to go into 2.4, your reply made it seem
that it would go into 2.5 (maybe.) This statement makes
it seem that it is the current state in 2.3.

With "will assume", I actually meant future tense. Not
being a native speaker, I'm uncertain how to distinguish
this from the conditional form that you apparently understood.

Ah. I understand now. I understood the final clause as a
form of present tense. To make it a future I'd probably
stick the word 'eventually' or 'in Release 2.5' in there:
"will eventually assume" or "In Release 2.5, Python will assume..."
Specifically, what would the Python 2.2 interpreter
have done if I handed it a program encoded in utf-8?
Was that a legitimate encoding?

Yes, the Python interpreter would have processed it.

print "Grüß Gott"

would have sent the greeting to the terminal.

I see your point here. It does round trip successfully.

John Roth
Regards,
Martin

Jul 18 '05 #23

Hallvard B Furuseth

John Roth wrote:

"Hallvard B Furuseth" <h.**********@usit.uio.no> wrote in message
news:HB**************@bombur.uio.no...
An addition to Martin's reply:
John Roth wrote:
"Martin v. Löwis" <ma****@v.loewis.de> wrote in message
news:41************@v.loewis.de...

To be more specific: In an UTF-8 source file, doing

print "ö" == "\xc3\xb6"
print "ö"[0] == "\xc3"

would print two times "True", and len("ö") is 2.
OTOH, len(u"ö")==1.
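
The same byte-versus-character distinction can be reproduced in a current interpreter. This is a sketch in Python 3, where `bytes` plays the role of the Python 2 8-bit `str` and `str` plays the role of `unicode`; the variable names are illustrative only.

```python
# Python 3 sketch: bytes corresponds to the Python 2 8-bit str,
# and str corresponds to the Python 2 unicode type.
text = "\u00f6"                  # the character ö as a Unicode string
raw = text.encode("utf-8")       # the bytes a UTF-8 source file would hold

is_two_byte_sequence = raw == b"\xc3\xb6"
first_byte_is_c3 = raw[:1] == b"\xc3"
byte_length = len(raw)           # counts bytes
char_length = len(text)          # counts characters

print(is_two_byte_sequence, first_byte_is_c3, byte_length, char_length)
```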

(...)
I'd expect that the compiler would reject anything that
wasn't either in the 7-bit ascii subset, or else defined
with a hex escape.
Then you should also expect a lot of people to move to
another language - one whose designers live in the real
world instead of your Utopian Unicode world.

Rudeness objection to your characterization.

Sorry, I guess that was a bit over the top. I've just gotten so fed up
with bad charset handling, including over-standardization, over the
years. And as you point out, I misunderstood the scope of your
suggestion. But you have been saying that people should always use
Unicode, and things like that.
Please see my response to Martin - I'm talking only,
and I repeat ONLY, about scripts that explicitly
say they are encoded in utf-8. Nothing else. I've
been in this business for close to 40 years, and I'm
quite well aware of backwards compatibility issues
and issues with breaking existing code.

Programmers in general have a very strong, and
let me repeat that, VERY STRONG assumption
that an 8-bit string contains one byte per character
unless there is a good reason to believe otherwise.
Often true in our part of the world. However, another VERY STRONG
assumption is that if we feed the computer a raw character string and
ensure that it doesn't do any fancy charset handling, the program won't
mess with the string and things will Just Work. Well, except that
programs that strip the 8th bit are a problem. By contrast, there is no
longer any telling what a program will do once it gets the idea that it
can be helpful about the character set.

The biggest problem with labeling anything as Unicode may be that it
will have to be converted back before it is output, but the program
often does not know which character set to convert it to. It might not
be running on a system where "the charset" is available in some standard
location. It might not be able to tell from the name of the locale. In
any case, the desired output charset might not be the same as that of
the current locale. So the program (or some module it is using) can
decide to guess, which can give very bad results, or it can fail, which
is no fun either. Or the programmer can set a default charset, even
though he does not know that the user will be using this charset. Or
the program can refuse to run unless the user configures the charset,
which is often nonsense.
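
The choices listed above (guess, fail, fall back to a default, or demand configuration) can be made concrete. The following Python 3 sketch is an illustration only; `encode_for_output` and its parameters are hypothetical names, not part of any library.

```python
import locale


def encode_for_output(text, charset=None, errors="strict"):
    """Encode text for output, making the charset decision explicit.

    charset=None means "guess from the locale", which is the guess
    that may be wrong for the actual output destination.
    """
    if charset is None:
        # Guess: ask the locale. It may have nothing to do with the
        # terminal, file, or peer that will actually receive the bytes.
        charset = locale.getpreferredencoding(False)
    # errors="strict" fails loudly; errors="replace" degrades quietly.
    return text.encode(charset, errors)


greeting = "bl\u00e5b\u00e6r"    # blåbær

as_latin1 = encode_for_output(greeting, "iso-8859-1")
as_utf8 = encode_for_output(greeting, "utf-8")
degraded = encode_for_output(greeting, "ascii", errors="replace")

try:
    encode_for_output(greeting, "ascii")
    strict_ascii_failed = False
except UnicodeEncodeError:
    strict_ascii_failed = True
```

None of the branches is right in general: the caller still has to know, or be told, which charset the destination expects.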

The rest of my reply to that grew to a rather large rant with very
little relevance to PEP 263, so I moved it to the end of this message.

Anyway, the fact remains that in quite a number of situations, the
simplest way to do charset handling is to keep various programs firmly
away from charset issues. If a program does not know which charset is
in use, the best way is to not do any charset handling. In the case of
Python strings, that means 'str' literals instead of u'Unicode'
literals. Then the worst that can happen if the program is run with an
unexpected charset/encoding is that the strings built into the program
will not be displayed correctly.

It would be nice to have a machinery to tag all strings, I/O channels
and so on with their charset/encoding and with what to do if a string
cannot be converted to that encoding, but lacking that (or lacking
knowledge of how to tag some data), no charset handling will remain
better than guesstimate charset handling in some situations.
This assumption is built into various places, including
all of the string methods.
I don't agree with that, but maybe it's a matter of how we view function
and type names, or something.
The current design allows accidental inclusion of
a character that is not in the 7bit ascii subset ***IN
A PROGRAM THAT HAS A UTF-8 CHARACTER
ENCODING DECLARATION*** to break that
assumption without any kind of notice. That in
turn will break all of the assumptions that the string
module and string methods are based on. That in
turn is likely to break lots of existing modules and
cause a lot of debugging time that could be avoided
by proper design.
For programs that think they work with Unicode strings, yes. For
programs that have no charset opinion, quite the opposite is true.

And tell me why I shouldn't be allowed to work easily with raw
UTF-8 strings, if I do use coding:utf-8.

First, there's nothing that's stopping you. All that
my proposal will do is require you to do a one
time conversion of any strings you put in the
program as literals. It doesn't affect any other
strings in any other way at any other time.

It is not a one-time conversion if it's inside a loop or a small
function which is called many times. It would have to be moved out
to a global variable or something, which makes the program a lot more
cumbersome.

Second, any time one has to write more complex expressions to achieve
something, it becomes easier to introduce bugs. In particular when
people's solution will sometimes be to write '\xc3\xb8' instead of 'ø'
and add a comment with the real string. If the comment is wrong, which
happens, the bug may survive for a long time.
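
A small Python 3 sketch of how an escape and its comment can drift apart. The two byte strings and their comments are made-up examples, not taken from any real program.

```python
# Hypothetical literals: the author meant "ø" in both cases and wrote
# the escape by hand, with a comment naming the intended character.
name_escaped = b"\xc3\xb8"       # ø (UTF-8)
name_wrong = b"\xc3\xa6"         # ø (UTF-8)  <- comment is wrong: this is æ

intended = "\u00f8"              # ø

# Only decoding reveals which escape really matches its comment.
escaped_matches = name_escaped.decode("utf-8") == intended
wrong_matches = name_wrong.decode("utf-8") == intended

print(escaped_matches, wrong_matches)
```

Both lines look equally plausible in the source, which is why such a bug can survive for a long time.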
I'll withdraw my objection if you can seriously
assure me that working with raw utf-8 in
8-bit character string literals is what most programmers
are going to do most of the time.
Of course it isn't. Nor is working with a lot of other Python features.
I'm not going to accept the very common need
of converting unicode strings to 8-bit strings so
they can be written to disk or stored in a data base
or whatnot (or reversing the conversion for reading.)
That's your choice, of course. It's not mine.
That has nothing to do with the current issue - it's
something that everyone who deals with unicode
needs to do, regardless of the encoding of the
source program.

I'm not even sure which issue is the 'current issue',
if it makes that irrelevant.

========

<rant>
I've been a programmer for about 20 years, and for most of that time
the solution to charset issues in my environment (Tops-20, Unix, no
multi-cultural issues) has been for the user to take care of the
matter.

At first, the computer thought it was using ASCII, we were using
terminals and printers with NS_4551-1 - not that I knew a name for it
- and that was that. (NS_4551-1 is ASCII with [\]{|} replaced with
ÆØÅæøå.) If we wanted to print an ASCII file, there might be a switch
to get an ASCII font, we might have an ASCII printer/terminal, or we
just learned to read ÆØÅ as [\] and vice versa. A C program which
should output a Norwegian string would use [\\] as ÆØÅ - or the other
way around, depending on how one displayed the program.
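
The two readings of the same six codes can be written down as a translation table. This Python 3 sketch covers only the six positions named above and makes no claim about the rest of NS_4551-1; the names are illustrative.

```python
# The same six codes read as ASCII or as NS_4551-1, as described above.
ASCII_CHARS = "[\\]{|}"
NORWEGIAN_CHARS = "\u00c6\u00d8\u00c5\u00e6\u00f8\u00e5"   # ÆØÅæøå

ascii_to_ns4551 = str.maketrans(ASCII_CHARS, NORWEGIAN_CHARS)
ns4551_to_ascii = str.maketrans(NORWEGIAN_CHARS, ASCII_CHARS)

# A C source line as an ASCII display would show it ...
c_source = 'puts("bl}b{r");'
# ... and as a Norwegian terminal of the time would show it.
on_norwegian_terminal = c_source.translate(ascii_to_ns4551)

print(on_norwegian_terminal)
```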

Then some programs began to become charset-aware, but they "knew" that
we were using ASCII, and began to e.g. label everyone's e-mail
messages with "X-Charset: ASCII" or something. So such labels came in
practice to mean 'any character set'. The solution was to ignore that
label and get on with life. Maybe a program had to be tweaked a bit
to achieve that, but usually not. And it might or might not be
possible to configure a program to label things correctly, but since
everyone ignored the label anyway, who cared?

Then 8-bit character sets and MIME arrived, and the same thing
happened again: 'Content-Type: text/plain; charset=iso-8859-1' came to
mean 'any character set or encoding'. After all, programmers knew
that this was the charset everyone was using if they were not using
ASCII. This time it couldn't even be blamed on poor programmers: If I
remember correctly, MIME says the default character set is ASCII, so
programs _have_ to label 8-bit messages with a charset even if they
have no idea which charset is in use. Programs can make the charset
configurable, of course, but most users didn't know or care about such
things, so that was really no help.

Fortunately, most programs just displayed the raw bytes and ignored
the charset, so it was easy to stay with the old solution of ignoring
charset labels and get on with life. Same with e.g. the X window
system: Parts of it (cut&paste buffers? Don't remember) were defined to
work with latin-1, but NS_4551-1 fonts worked just fine. Of course,
if we pasted æøå from an NS_4551-1 window to a latin-1 window we got
{|}, but that was what we had learned to expect anyway. I don't
remember if one had to do some tweaking to convince X not to get
clever, but I think not.

Locales arrived too, and they might be helpful - except several
implementations were so buggy that programs crashed or misbehaved if
one turned them on. Also, it might or might not be possible to deduce
which character set was in use from the names of the locales. So, on
many machines, ignore them and move on.

Then UTF-8 arrived, and things got messy. We actually began to need
to deal with different encodings as well as character sets.

UTF-8 texts labeled as iso-8859-1 (these still occur once in a while)
have to be decoded, it's not enough to switch the window's font if the
charset is wrong. Programs expecting UTF-8 would get a parse error on
iso-8859-1 input, it was not enough to change font.
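
Both failure modes are easy to reproduce. A Python 3 sketch, with illustrative variable names:

```python
original = "\u00e6\u00f8\u00e5"              # æøå

utf8_bytes = original.encode("utf-8")
latin1_bytes = original.encode("iso-8859-1")

# UTF-8 bytes read under an iso-8859-1 label: no error, just wrong text.
mislabelled = utf8_bytes.decode("iso-8859-1")

# iso-8859-1 bytes fed to a strict UTF-8 decoder: a hard failure.
try:
    latin1_bytes.decode("utf-8")
    utf8_decoder_rejected_input = False
except UnicodeDecodeError:
    utf8_decoder_rejected_input = True

# Undoing the mislabelling takes a real decode step, not a font change.
repaired = mislabelled.encode("iso-8859-1").decode("utf-8")

print(len(original), len(mislabelled), utf8_decoder_rejected_input)
```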

There is a Linux box I'm sometimes doing remote logins to which I
can't figure out how to display non-ASCII characters. It insists that
I'm using UTF-8. My X.11 font is latin-1. I can turn off the
locale settings, but then 8-bit characters are not displayed at all.
I'm sure there is some way to fix that, but I haven't bothered to find
out. I didn't need to dig around in manuals to find out that sort of
thing before.

I remember we had 3 LDAPv2 servers running for a while - one with
UTF-8, one with iso-8859-1, and one with T.61, which is the character
set which the LDAPv2 standard actually specified. Unless the third
server used NS_4551-1; I don't remember.

I've mentioned elsewhere that I had to downgrade Perl5.8 to a Unicode-
unaware version when my programs crashed. There was a feature to turn
off Unicode, but it didn't work. It seems to work in later versions.
Maybe it's even bug-free this time. I'm not planning to find out,
since we can't risk that these programs produce wrong output.

And don't get me started on Emacs MULE, a charset solution so poor
that from what I hear even Microsoft began to abandon it a decade
earlier (code pages). For a while the --unibyte helped, but after a
while that got icky. Oh well, most of the MULE bugs seem to be gone
now, after - is it 5 years?

The C language recently got both 8-bit characters and Unicode tokens
and literals (\unnnn). As far as I can tell, what it didn't get was
any provision for compilers and linkers which don't know which
character set is in use and therefore can't know which native
character should be translated to which Unicode character or vice
versa. So my guess is that compilers will just pick a character set
which seems likely if they aren't told. Or use the locale, which may
have nothing at all to do with which character set the program source
code is using. I may be wrong there, though; I only remember some of
the discussions on comp.std.c, I haven't checked the final standard.
</rant>

Of course, there are a lot of good sides to the story too - even locales
got cleaned up a lot, for example. And you'd get a very different story
from people in different environments (e.g. multi-cultural ones) or with
different operating systems even in Norway, but you already know that.

--
Hallvard

Jul 18 '05 #24

Hallvard B Furuseth

I wrote:

John Roth wrote:
Rudeness objection to your characterization.

Sorry, I guess that was a bit over the top. I've just gotten so fed
up with bad charset handling, including over-standardization, over the
years. And as you point out, I misunderstood the scope of your
suggestion. But you have been saying that people should always use
Unicode, and things like that.

Sorry again, I seem to have confused you with Peter. I should have
gotten a clue when you said "if you want your source to be utf-8, you
need to accept the consequences". Not exactly the words of a True
Believer in Unicode:-)

--
Hallvard

Jul 18 '05 #25

Hallvard B Furuseth

John Roth wrote:

The problem I have is that if you use utf-8 as the
source encoding, you can suddenly drop multi-byte
characters into an 8-bit string ***BY ACCIDENT***.
(...)
Now, my suggested solution of this problem was
to require that 8-bit string literals in source that was
encoded with UTF-8 be restricted to the 7-bit
ascii subset.
Then shouldn't your solution catch all multibyte encodings, not
just UTF-8?

Martin v. Löwis wrote:
[Hallvard] proposes your third alternative (ban non-ASCII
characters in byte string literals), not just for UTF-8,
but for all encodings. Not for all files, though, but
only for selected files.
John Roth wrote:
Which is what I don't like about it. It adds complexity
to the language and a feature that I don't think is really
necessary (restricting string literals for single-byte encodings.)
It's to prevent several errors:

* If the source file has one 'coding:' and the output destination has
another character set/encoding, then the wrong character set will be
output. Python offers two simple solutions to this:
- If the program is charset-aware, it can work with Unicode strings,
and the 8-bit string literal should be a Unicode literal.
- Otherwise, the program can stay away from Unicode and leave the
charset problem to the user.

* A worse case of the above: If the 8-bit output goes to a utf-8
destination, it won't merely give the wrong character, it will have
invalid format. So a program which reads the output may close the
connection it reads from, or fail to display the file at all, or -
if it is not robust - crash. I expect the same applies to other
multibyte encodings, and probably some single-byte encodings too.

* If the program is charset-aware and works with Unicode strings,
the Unicode handling blows up if it is passed an 8-bit str
(example copied from Anders' pychecker feature request):

    # -*- coding: latin-1 -*-
    x = "blåbærgrød"
    unicode(x)
    -->
    Traceback (most recent call last):
      File "/tmp/u.py", line 3, in ?
        unicode(x)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5
                        in position 2: ordinal not in range(128)

The problem is that even though the file is tagged with latin-1, the
string x does not inherit that tag. So the Unicode handling doesn't
know which character set, if any, the string contains.
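
The last two points can be reproduced in Python 3, where `bytes.decode` stands in for the Python 2 `unicode(x)` call (which used the default 'ascii' codec). This is a sketch under that assumption, with illustrative names.

```python
# The bytes the 8-bit literal holds in a latin-1 encoded source file.
raw = "bl\u00e5b\u00e6rgr\u00f8d".encode("latin-1")    # blåbærgrød

# Roughly what unicode(x) did in Python 2: decode with the ascii codec.
try:
    raw.decode("ascii")
    ascii_decode_failed = False
except UnicodeDecodeError as exc:
    ascii_decode_failed = True
    failing_position = exc.start
    failing_byte = raw[exc.start]

# Supplying the charset explicitly restores the tag the string lost.
decoded = raw.decode("latin-1")

# The same bytes are not valid UTF-8, so a UTF-8 consumer rejects them.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(ascii_decode_failed, failing_position, hex(failing_byte), valid_utf8)
```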
The other thing I don't like is that it still leaves the
trap for the unwary which I'm discussing.

Well, I would like to see a feature like this turned on by default
eventually (both for UTF-8 and other character sets), but for the time
being I'll stick to getting the feature into Python in the first place.

Though I do seem to have been too unambitious. For some reason I was
thinking it would be harder to get a new option into Python than a
per-file declaration.

--
Hallvard

Jul 18 '05 #26

Dieter Maurer

"Martin v. Löwis" <ma****@v.loewis.de> writes on Thu, 05 Aug 2004 23:49:07 +0200:

...
I personally believe it is too early yet to make lack of
encoding declarations a syntax error. I'd like to
reconsider the issue with Python 2.5.

I hope, it will never come...

The declaration is necessary for modules that are distributed
all over the world but superfluous for modules only used locally
(with fixed encoding).

Dieter

Jul 18 '05 #27

Martin v. Löwis

Dieter Maurer wrote:

I hope, it will never come...

Your hope will not be fulfilled. Some version of Python *will*
require that all non-ASCII source code contain an encoding
declaration. Read PEP 263, which has already been accepted.
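
What such a requirement amounts to can be sketched in a few lines of Python 3. This is a simplified illustration, not the interpreter's implementation: the pattern follows the one given in PEP 263, the helper names are hypothetical, and byte order marks are ignored.

```python
import re

# Declaration pattern along the lines of the one given in PEP 263.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def declared_encoding(source):
    """Return the encoding declared in a comment on line 1 or 2, or None."""
    for line in source.splitlines()[:2]:
        if not line.lstrip().startswith(b"#"):
            continue
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return None


def needs_declaration(source):
    """True if the source has non-ASCII bytes but no encoding declaration."""
    has_non_ascii = any(byte > 127 for byte in source)
    return has_non_ascii and declared_encoding(source) is None


declared = b'# -*- coding: latin-1 -*-\nx = "bl\xe5b\xe6r"\n'
undeclared = b'x = "bl\xe5b\xe6r"\n'
plain_ascii = b'print("hello")\n'

print(declared_encoding(declared))
print(needs_declaration(undeclared), needs_declaration(plain_ascii))
```

Under this sketch only the undeclared non-ASCII file would be rejected; pure ASCII sources never need a declaration.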

Regards,
Martin

Jul 18 '05 #28
