Out of random curiosity, is there a PEP/thread/? that explains why
Python symbols are restricted to 7-bit ascii?
<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
I'm not aware of any PEPs on the subject, but Google Groups turns up some past
threads. Here's one from February 2004: http://groups.google.com/group/comp....856af647ce71d5
I didn't immediately find the message of Guido's that everyone was talking about
as this thread began, though.
Jeff
Mike Meyer wrote: Out of random curiosity, is there a PEP/thread/? that explains why Python symbols are restricted to 7-bit ascii?
And of equally random curiosity :-), what alternative(s) can you suggest
would have been appropriate? (I note that Unicode, for example, dates
from around the time Python was first released. And I can't really
imagine a non-ugly alternative, which probably reveals something bad
about my imagination.)
-Peter
Hi !
I agree with you; I would love the ability to call functions named in Unicode.
@-salutations
Michel Claveau
Mike Meyer wrote: Out of random curiosity, is there a PEP/thread/? that explains why Python symbols are restricted to 7-bit ascii?
No PEP yet; I have been meaning to write one for several years now.
The principles would be
- sources must use encoding declarations
- valid identifiers would follow the Unicode consortium guidelines,
in particular: identifiers would be normalized in NFKC (I think),
adjusted in the ASCII range for backward compatibility (i.e.
not introducing any additional ASCII characters as legal identifier
characters)
- __dict__ would contain Unicode keys
- all objects should support Unicode getattr/setattr (potentially
raising AttributeError, of course)
- open issue: what to do on the C API (perhaps nothing, perhaps
allowing UTF-8)
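A rough sketch of what those principles imply. Note this is hedged: at the time of this thread it was hypothetical, though Python 3 later adopted very similar rules in PEP 3131, so the snippet below runs on a modern interpreter. NFKC normalization folds compatibility characters together, and attribute access already works with Unicode names:

```python
import unicodedata

# NFKC folds compatibility characters: the single ligature character
# "fi" (U+FB01) normalizes to the two letters "fi", so both spellings
# would denote the same identifier.
assert unicodedata.normalize("NFKC", "\ufb01le") == "file"

# Unicode getattr/setattr, with __dict__ holding Unicode keys:
class Namespace:
    pass

ns = Namespace()
setattr(ns, "größe", 42)
assert getattr(ns, "größe") == 42
assert "größe" in ns.__dict__
```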
Regards,
Martin
On Wed, 12 Oct 2005 10:56:44 +0200, "Martin v. Löwis" <ma****@v.loewis.de> wrote: Mike Meyer wrote: Out of random curiosity, is there a PEP/thread/? that explains why Python symbols are restricted to 7-bit ascii?
No PEP yet; I meant to write one for several years now.
The principles would be - sources must use encoding declarations - valid identifiers would follow the Unicode consortium guidelines, in particular: identifiers would be normalized in NFKC (I think), adjusted in the ASCII range for backward compatibility (i.e. not introducing any additional ASCII characters as legal identifier characters) - __dict__ will contain Unicode keys - all objects should support Unicode getattr/setattr (potentially raising AttributeError, of course) - open issue: what to do on the C API (perhaps nothing, perhaps allowing UTF-8)
Perhaps string equivalence in keys will be treated like numeric equivalence?
I.e., a key/name representation is established by the initial key/name binding, but
values can be retrieved by "equivalent" key/names with different representations
like unicode vs ascii or latin-1 etc.?
Regards,
Bengt Richter
Bengt Richter wrote: Perhaps string equivalence in keys will be treated like numeric equivalence? I.e., a key/name representation is established by the initial key/name binding, but values can be retrieved by "equivalent" key/names with different representations like unicode vs ascii or latin-1 etc.?
That would require that you know the encoding of a byte string; this
information is not available at run-time.
You could also try all possible encodings to see whether the strings
are equal under some choice of encoding for each one. This would
be both expensive and unlike numeric equivalence: with numbers,
you don't give a sequence of bytes all possible interpretations
to find some interpretation under which they are equivalent.
There is one special case, though: when comparing a byte string
and a Unicode string, the system default encoding (i.e. ASCII)
is assumed. This only really works if the default encoding
really *is* ASCII. Otherwise, equal strings might not hash
equal, in which case you wouldn't find them properly in a
dictionary.
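For what it's worth, Python 3 eventually resolved this ambiguity by refusing the comparison altogether: byte strings and text strings never compare equal, so an encoded key simply misses a text key in a dictionary, and no default encoding is guessed. A sketch in modern Python (not the Python 2.x of this thread):

```python
d = {"Löwis": 1}

# In Python 3, bytes and str never compare equal, so the
# latin-1-encoded key does not find the text key:
key = "Löwis".encode("latin-1")
assert key not in d
assert "Löwis" in d
```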
Regards,
Martin
On Sun, 16 Oct 2005 12:16:58 +0200, "Martin v. Löwis" <ma****@v.loewis.de> wrote: Bengt Richter wrote: Perhaps string equivalence in keys will be treated like numeric equivalence? I.e., a key/name representation is established by the initial key/name binding, but values can be retrieved by "equivalent" key/names with different representations like unicode vs ascii or latin-1 etc.? That would require that you know the encoding of a byte string; this information is not available at run-time.
Well, what will be assumed about name after the lines
#-*- coding: latin1 -*-
name = 'Martin Löwis'
?
I know type(name) will be <type 'str'> and in itself contain no encoding information now,
but why shouldn't the default assumption for literal-generated strings be what the coding
cookie specified? I know the current implementation doesn't keep track of the different
encodings that could reasonably be inferred from the source of the strings, but we are talking
about future stuff here ;-)
You could also try all possible encodings to see whether the strings are equal if you chose the right encoding for each one. This would be both expensive and unlike numeric equivalence: in numeric equivalence, you don't give a sequence of bytes all possible interpretations to find some interpretation in which they are equivalent, either.
Agreed, that would be a mess.
There is one special case, though: when comparing a byte string and a Unicode string, the system default encoding (i.e. ASCII) is assumed. This only really works if the default encoding really *is* ASCII. Otherwise, equal strings might not hash equal, in which case you wouldn't find them properly in a dictionary.
Perhaps the str (or future byte) type could have an encoding attribute
defaulting to None, meaning to treat its instances as current str instances.
Then setting the attribute to some particular encoding, like 'latin-1' (probably
internally normalized and optimized to be represented as a C pointer slot with a
NULL or a pointer to an appropriate codec or whatever), would make the str byte
string explicitly an encoded string, without changing the byte string data or
converting to a unicode encoding. With encoding information explicitly present
or absent, keys could have a normalized hash and comparison, maybe just normalizing
to the platform utf for dict encoding-tagged string keys by default.
If this were done, IWT the automatic result of
#-*- coding: latin1 -*-
name = 'Martin Löwis'
could be that name.encoding == 'latin-1'
whereas without the encoding cookie, the default encoding assumption
for the program source would be used, and set explicitly to 'ascii'
or whatever it is.
Functions that generate strings, such as chr(), could be assumed to create
a string with the same encoding as the source code for the chr(...) invocation.
Ditto for e.g. '%s == %c' % (65, 65)
And
s = u'Martin Löwis'.encode('latin-1')
would get
s.encoding == 'latin-1'
not
s.encoding == None
so that the encoding information could make
print s
mean
print s.decode(s.encoding)
(which of course would re-encode to the output device encoding for output, like the current
print s.decode('latin-1'), rather than failing like the current default assumption for s,
which is s.encoding==None, i.e., assume the default, which is likely print s.decode('ascii'))
Hm, probably
s.encode(None)
and
s.decode(None)
could mean retrieve the str byte data unchanged as a str string with encoding set to None
in the result either way.
Now when you read a file in binary without specifying any encoding assumption, you
would get a str string with .encoding==None, but you could effectively reinterpret-cast it
to any encoding you like by assigning the encoding attribute. The attribute
could be a property that causes decode/encode automatically to create data in the
new encoding. The None encoding, coming or going, would not change the data bytes, but
differing explicit encodings would cause decode/encode.
This could also support s1+s2 to mean generate a concatenated string
that has the same encoding attribute if s1.encoding==s2.encoding and otherwise promotes
each to the platform standard unicode encoding and concatenates those if they
are different (and records the unicode encoding chosen in the result's encoding
attribute).
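A minimal sketch of that concatenation rule in modern Python. All names here are invented for illustration, and the sketch simplifies one point: on an encoding mismatch it promotes to a text object rather than to bytes in a platform-standard unicode encoding. It is an illustration of the suggested semantics, not an actual or proposed type:

```python
class EncodedStr(bytes):
    """Hypothetical byte string carrying an optional .encoding tag."""

    def __new__(cls, data, encoding=None):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        other_enc = getattr(other, "encoding", None)
        if self.encoding == other_enc:
            # same (or equally unknown) encoding: byte concatenation,
            # result keeps the shared encoding tag
            return EncodedStr(bytes(self) + bytes(other), self.encoding)
        # differing encodings: promote both sides to text and concatenate
        left = bytes(self).decode(self.encoding or "ascii")
        right = bytes(other).decode(other_enc or "ascii")
        return left + right

s1 = EncodedStr("Löwis".encode("latin-1"), "latin-1")
s2 = EncodedStr(" rocks".encode("latin-1"), "latin-1")
assert (s1 + s2).encoding == "latin-1"     # matching encodings preserved

s3 = EncodedStr("é".encode("utf-8"), "utf-8")
assert s1 + s3 == "Löwisé"                 # mismatch promotes to text
```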
This is not a fully developed idea, and there has been discussion on the topic before
(even between us ;-) but I thought another round might bring out your current thinking
on it ;-)
Regards,
Bengt Richter
Bengt Richter wrote: Well, what will be assumed about name after the lines
#-*- coding: latin1 -*-
name = 'Martin Löwis'
?
Are you asking what is assumed about the identifier 'name', or the value
bound to that identifier? Currently, the identifier must be encoded in
latin1 in this source code, and it must only consist of letters, digits,
and the underscore.
The value of name will be a string consisting of the bytes
4d 61 72 74 69 6e 20 4c f6 77 69 73
I know type(name) will be <type 'str'> and in itself contain no encoding information now, but why shouldn't the default assumption for literal-generated strings be what the coding cookie specified?
That certainly is the assumption: string literals must be in the
encoding specified in the source encoding, in the source code file
on disk. If they aren't (and cannot be interpreted that way), you
get a syntax error.
I know the current implementation doesn't keep track of the different encodings that could reasonably be inferred from the source of the strings, but we are talking about future stuff here ;-)
Ah, so you want the source encoding to be preserved, say as an attribute
of the string literal. This has been discussed many times, and was
always rejected.
Some people reject it because it is overkill: if you want reliable,
stable representation of characters, you should use Unicode strings.
Others reject it because of semantic difficulties: how would such
strings behave under concatenation, if the encodings are different?
#-*- coding: latin1 -*-
name = 'Martin Löwis'
could be that name.encoding == 'latin-1'
That is not at all intuitive. I would have expected name.encoding
to be 'latin1'.
Functions that generate strings, such as chr(), could be assumed to create a string with the same encoding as the source code for the chr(...) invocation.
What is the source of the chr invocation? If I do chr(param), should I
use the source where param was computed, or the source where the call
to chr occurs? If the latter, how should the interpreter preserve the
encoding of where the call came from?
What about the many other sources of byte strings (like strings read
from a file, or received via a socket)?
This is not a fully developed idea, and there has been discussion on the topic before (even between us ;-) but I thought another round might bring out your current thinking on it ;-)
My thinking still is the same. It cannot really work, and it wouldn't do
any good with what little it could do. Just use Unicode strings.
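In practice, "just use Unicode strings" means decoding bytes to text once, at the boundary, and keeping text inside the program. Sketched in modern Python:

```python
# Bytes as they might arrive from a file or socket, in a known encoding
# ('ö' is 0xF6 in latin-1):
raw = b"Martin L\xf6wis"

# Decode once, at the boundary ...
text = raw.decode("latin-1")
assert text == "Martin Löwis"

# ... and encode again only on the way out ('ö' is C3 B6 in utf-8):
assert text.encode("utf-8") == b"Martin L\xc3\xb6wis"
```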
Regards,
Martin
On Tue, 18 Oct 2005 01:34:09 +0200, "Martin v. Löwis" <ma****@v.loewis.de> wrote: Bengt Richter wrote: Well, what will be assumed about name after the lines
#-*- coding: latin1 -*-
name = 'Martin Löwis'
? Are you asking what is assumed about the identifier 'name', or the value bound to that identifier? Currently, the identifier must be encoded in latin1 in this source code, and it must only consist of letters, digits, and the underscore.
The value of name will be a string consisting of the bytes 4d 61 72 74 69 6e 20 4c f6 77 69 73
Which is the latin-1 encoding. Ok, so far so good. We know it's latin1, but the knowledge
is lost to python. I know type(name) will be <type 'str'> and in itself contain no encoding information now, but why shouldn't the default assumption for literal-generated strings be what the coding cookie specified? That certainly is the assumption: string literals must be in the encoding specified in the source encoding, in the source code file on disk. If they aren't (and cannot be interpreted that way), you get a syntax error.
I meant the "literal-generated string" (the internal str instance compiled
from the latin1-encoded source string literal). I know the current implementation doesn't keep track of the different encodings that could reasonably be inferred from the source of the strings, but we are talking about future stuff here ;-) Ah, so you want the source encoding to be preserved, say as an attribute of the string literal. This has been discussed many times, and was always rejected.
Not of the string literal per se. That is only one (constant) expression resulting
in a str instance. I want (for the sake of this discussion ;-) the str instance
to have an encoding attribute when it can reliably be inferred, as e.g. when a coding
cookie is specified and the str instance comes from a constant literal string expression.
Some people reject it because it is overkill: if you want reliable, stable representation of characters, you should use Unicode strings.
Others reject it because of semantic difficulties: how would such strings behave under concatenation, if the encodings are different?
I mentioned that in parts you snipped (2nd half here):
"""
Now when you read a file in binary without specifying any encoding assumption, you
would get a str string with .encoding==None, but you could effectively reinterpret-cast it
to any encoding you like by assigning the encoding attribute. The attribute
could be a property that causes decode/encode automatically to create data in the
new encoding. The None encoding, coming or going, would not change the data bytes, but
differing explicit encodings would cause decode/encode.
This could also support s1+s2 to mean generate a concatenated string
that has the same encoding attribute if s1.encoding==s2.encoding and otherwise promotes
each to the platform standard unicode encoding and concatenates those if they
are different (and records the unicode encoding chosen in the result's encoding
attribute).
""" #-*- coding: latin1 -*- name = 'Martin Löwis'
could be that name.encoding == 'latin-1' That is not at all intuitive. I would have expected name.encoding to be 'latin1'.
That's pretty dead-pan. Not even a smiley ;-) Functions that generate strings, such as chr(), could be assumed to create a string with the same encoding as the source code for the chr(...) invocation. What is the source of the chr invocation? If I do chr(param), should I use the source where param was computed, or the source where the call to chr occurs?
The source file that the "chr(param)" appears in. Not where param was computed: the param is numeric, and has no reasonably inferrable encoding. (I don't propose to have ord pass it on for integers to carry ;-) (so ord in another module with a different source encoding could be the source, and an encoding conversion could happen with the integer as intermediary. But that's expected ;-)
If the latter, how should the interpreter preserve the encoding of where the call came from?
Not the latter, so not applicable. What about the many other sources of byte strings (like strings read from a file, or received via a socket)?
I mentioned that in parts you snipped. See above. This is not a fully developed idea, and there has been discussion on the topic before (even between us ;-) but I thought another round might bring out your current thinking on it ;-)
My thinking still is the same. It cannot really work, and it wouldn't do any good with what little it could do. Just use Unicode strings.
To hear "It cannot really work" causes me agitation, even if I know it's not worth
the effort to pursue it ;-)
Anyway, ok, I'll leave it at that, but I'm not altogether happy with having to write
#-*- coding: latin1 -*-
name = 'Martin Löwis'
print name.decode('latin1')
where I think
#-*- coding: latin1 -*-
name = 'Martin Löwis'
print name
should reasonably produce the same output. Though I grant you
#-*- coding: latin1 -*-
name = u'Martin Löwis'
print name
is not that hard to do. (Please excuse the use of your name, which has a handy non-ascii letter ;-)
Regards,
Bengt Richter
Bengt Richter wrote: <on tracking the encodings of literal-generated strings>
The big problem you'll hit is figuring out how to use these strings.
Which string ops preserve the encoding? Even the following is
problematic:
#-*- coding: utf-8 -*-
name = 'Martin Löwis'
brokenpart = name[:9]
Because brokenpart is not a correct utf-8 encoding of anything.
The problem is that there is no good way to propagate the
encoding without understanding the purpose of the operations
themselves.
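The breakage is easy to demonstrate in modern Python, where the utf-8 bytes are made explicit with an encode call (in the 2.x source above, the literal itself would already be those utf-8 bytes):

```python
name = "Martin Löwis".encode("utf-8")   # 13 bytes: "ö" is C3 B6
brokenpart = name[:9]                   # cuts between C3 and B6

# brokenpart ends with a lone lead byte, so it is not valid utf-8
try:
    brokenpart.decode("utf-8")
except UnicodeDecodeError:
    print("not valid utf-8")
```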
--Scott David Daniels sc***********@acm.org
Bengt Richter wrote: Others reject it because of semantic difficulties: how would such strings behave under concatenation, if the encodings are different? I mentioned that in parts you snipped (2nd half here):
This could also support s1+s2 to mean generate a concatenated string that has the same encoding attribute if s1.encoding==s2.encoding and otherwise promotes each to the platform standard unicode encoding and concatenates those if they are different (and records the unicode encoding chosen in the result's encoding attribute).
It remains semantically difficult. There are other alternatives, e.g.
(s1+s2).encoding could become None, instead of using your procedure.
Also, this specification is incomplete: what if either s1.encoding
or s2.encoding is None?
Then, what if recoding to the platform encoding fails? With ASCII
being the default encoding at the moment, it is very likely that
concatenations will fail if there are funny characters in either
string.
If you propose that this should raise an exception, it means that
normal string concatenations will then give you exceptions just
as often as (or even more often than) you get UnicodeErrors
currently. I doubt users would like that. This is not a fully developed idea, and there has been discussion on the topic before (even between us ;-) but I thought another round might bring out your current thinking on it ;-)
My thinking still is the same. It cannot really work, and it wouldn't do any good with what little it could do. Just use Unicode strings.
To hear "It cannot really work" causes me agitation, even if I know it's not worth the effort to pursue it ;-)
It is certainly implementable, yes. But it will then break a lot of
existing code.
Though I grant you
#-*- coding: latin1 -*- name = u'Martin Löwis' print name
is not that hard to do.
This is indeed what you should do. In Python 3, you can omit the u,
as the string type will go away (and be replaced with the Unicode type).
(Please excuse the use of your name, which has a handy non-ascii letter ;-)
No problem with that :-)
Regards,
Martin