Bytes | Software Development & Data Engineering Community

Why ascii-only symbols?

Out of random curiosity, is there a PEP/thread/? that explains why
Python symbols are restricted to 7-bit ascii?

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Oct 11 '05 #1
I'm not aware of any PEPs on the subject, but google groups turns up some past
threads. Here's one from February 2004:
http://groups.google.com/group/comp....856af647ce71d5
I didn't immediately find the message of Guido's that everyone was talking
about when this thread began, though.

Jeff


Oct 11 '05 #2
Mike Meyer wrote:
Out of random curiosity, is there a PEP/thread/? that explains why
Python symbols are restricted to 7-bit ascii?


And of equally random curiosity :-), what alternative(s) can you suggest
would have been appropriate? (I note that Unicode, for example, dates
from around the time Python was first released. And I can't really
imagine a non-ugly alternative, which probably reveals something bad
about my imagination.)

-Peter
Oct 11 '05 #3
Hi !

I agree with you; I would love the capacity to call functions named in Unicode.

@-salutations

Michel Claveau

Oct 11 '05 #4
Mike Meyer wrote:
Out of random curiosity, is there a PEP/thread/? that explains why
Python symbols are restricted to 7-bit ascii?


No PEP yet; I have been meaning to write one for several years now.

The principles would be
- sources must use encoding declarations
- valid identifiers would follow the Unicode consortium guidelines,
in particular: identifiers would be normalized in NFKC (I think),
adjusted in the ASCII range for backward compatibility (i.e.
not introducing any additional ASCII characters as legal identifier
characters)
- __dict__ will contain Unicode keys
- all objects should support Unicode getattr/setattr (potentially
raising AttributeError, of course)
- open issue: what to do on the C API (perhaps nothing, perhaps
allowing UTF-8)
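The NFKC normalization those Unicode guidelines call for can be previewed
with the unicodedata module. This is only an illustrative sketch of what
normalizing identifiers would mean, not part of any PEP:

```python
import unicodedata

# NFKC folds compatibility characters into canonical forms, so visually
# distinct spellings of "the same" identifier collapse into one key.
ligature = "\ufb01le"    # "file" spelled with the "fi" ligature U+FB01
fullwidth = "\uff41bc"   # "abc" with a fullwidth "a" U+FF41

print(unicodedata.normalize("NFKC", ligature))   # -> file
print(unicodedata.normalize("NFKC", fullwidth))  # -> abc
```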

Regards,
Martin
Oct 12 '05 #5
On Wed, 12 Oct 2005 10:56:44 +0200, "Martin v. Löwis" <ma****@v.loewis.de> wrote:
Mike Meyer wrote:
Out of random curiosity, is there a PEP/thread/? that explains why
Python symbols are restricted to 7-bit ascii?


No PEP yet; I have been meaning to write one for several years now.

The principles would be
- sources must use encoding declarations
- valid identifiers would follow the Unicode consortium guidelines,
in particular: identifiers would be normalized in NFKC (I think),
adjusted in the ASCII range for backward compatibility (i.e.
not introducing any additional ASCII characters as legal identifier
characters)
- __dict__ will contain Unicode keys
- all objects should support Unicode getattr/setattr (potentially
raising AttributeError, of course)
- open issue: what to do on the C API (perhaps nothing, perhaps
allowing UTF-8)


Perhaps string equivalence in keys will be treated like numeric equivalence?
I.e., a key/name representation is established by the initial key/name binding, but
values can be retrieved by "equivalent" key/names with different representations
like unicode vs ascii or latin-1 etc.?
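The numeric equivalence being alluded to is presumably ordinary dict
behaviour like the following (a plain illustration, not a proposal):

```python
# An int key and an "equivalent" float key retrieve the same value,
# because equal numbers compare and hash equal in Python.
d = {1: "one"}
print(d[1.0])                 # -> one
print(hash(1) == hash(1.0))   # -> True
```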

Regards,
Bengt Richter
Oct 16 '05 #6
Bengt Richter wrote:
Perhaps string equivalence in keys will be treated like numeric equivalence?
I.e., a key/name representation is established by the initial key/name binding, but
values can be retrieved by "equivalent" key/names with different representations
like unicode vs ascii or latin-1 etc.?


That would require that you know the encoding of a byte string; this
information is not available at run-time.

You could also try all possible encodings to see whether the strings
are equal if you chose the right encoding for each one. This would
be both expensive and unlike numeric equivalence: in numeric
equivalence, you don't give a sequence of bytes all possible
interpretations to find some interpretation in which they are
equivalent, either.

There is one special case, though: when comparing a byte string
and a Unicode string, the system default encoding (i.e. ASCII)
is assumed. This only really works if the default encoding
really *is* ASCII. Otherwise, equal strings might not hash
equal, in which case you wouldn't find them properly in a
dictionary.
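The failure mode described here can be sketched in later-Python terms
(the exact semantics differ between Python 2 and 3; this uses Python 3,
where bytes and text never compare equal at all):

```python
raw = "Löwis".encode("latin-1")   # b'L\xf6wis'

# The byte string and the text string carry the same characters, but
# without knowing the byte string's encoding they do not compare equal...
print(raw == "Löwis")                     # -> False
# ...until the bytes are decoded with the right codec.
print(raw.decode("latin-1") == "Löwis")   # -> True
```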

Regards,
Martin
Oct 16 '05 #7
On Sun, 16 Oct 2005 12:16:58 +0200, "Martin v. Löwis" <ma****@v.loewis.de> wrote:
Bengt Richter wrote:
Perhaps string equivalence in keys will be treated like numeric equivalence?
I.e., a key/name representation is established by the initial key/name binding, but
values can be retrieved by "equivalent " key/names with different representations
like unicode vs ascii or latin-1 etc.?
That would require that you know the encoding of a byte string; this
information is not available at run-time.

Well, what will be assumed about name after the lines

#-*- coding: latin1 -*-
name = 'Martin Löwis'

?
I know type(name) will be <type 'str'> and in itself contain no encoding information now,
but why shouldn't the default assumption for literal-generated strings be what the coding
cookie specified? I know the current implementation doesn't keep track of the different
encodings that could reasonably be inferred from the source of the strings, but we are talking
about future stuff here ;-)
You could also try all possible encodings to see whether the strings
are equal if you chose the right encoding for each one. This would
be both expensive and unlike numeric equivalence: in numeric
equivalence, you don't give a sequence of bytes all possible
interpretations to find some interpretation in which they are
equivalent, either.
Agreed, that would be a mess.
There is one special case, though: when comparing a byte string
and a Unicode string, the system default encoding (i.e. ASCII)
is assumed. This only really works if the default encoding
really *is* ASCII. Otherwise, equal strings might not hash
equal, in which case you wouldn't find them properly in a
dictionary.

Perhaps the str (or future bytes) type could have an encoding attribute
defaulting to None, meaning to treat its instances as current str instances.
Setting the attribute to some particular encoding, like 'latin-1' (probably
normalized internally, and optimized to be represented as a C pointer slot holding
NULL or a pointer to an appropriate codec or whatever), would mark the byte
string explicitly as an encoded string, without changing the byte string data or
converting to a unicode encoding. With encoding information explicitly present
or absent, keys could have a normalized hash and comparison, perhaps by normalizing
encoding-tagged string keys to the platform utf by default for dict use.

If this were done, IWT the automatic result of

#-*- coding: latin1 -*-
name = 'Martin Löwis'

could be that name.encoding == 'latin-1'

whereas without the encoding cookie, the default encoding assumption
for the program source would be used, and set explicitly to 'ascii'
or whatever it is.

Functions that generate strings, such as chr(), could be assumed to create
a string with the same encoding as the source code for the chr(...) invocation.
Ditto for e.g. '%s == %c' % (65, 65)
And
s = u'Martin Löwis'.encode('latin-1')
would get
s.encoding == 'latin-1'
not
s.encoding == None
so that the encoding information could make
print s
mean
print s.decode(s.encoding)
(which of course would re-encode to the output device encoding for output, like current
print s.decode('latin-1'), and not fail like the current default assumption for s's encoding,
which is s.encoding==None, i.e., assume the default, which is likely print s.decode('ascii'))

Hm, probably
s.encode(None)
and
s.decode(None)
could mean retrieve the str byte data unchanged as a str string with encoding set to None
in the result either way.

Now when you read a file in binary without specifying any encoding assumption, you
would get a str string with .encoding==None, but you could effectively reinterpret-cast it
to any encoding you like by assigning the encoding attribute. The attribute
could be a property that causes decode/encode automatically to create data in the
new encoding. The None encoding, coming or going, would not change the data bytes, but
differing explicit encodings would cause a decode/encode.

This could also support s1+s2 to mean generate a concatenated string
that has the same encoding attribute if s1.encoding==s2.encoding and otherwise promotes
each to the platform standard unicode encoding and concatenates those if they
are different (and records the unicode encoding chosen in the result's encoding
attribute).
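A rough sketch of that proposal as a hypothetical bytes subclass; the name
EncodedBytes and the .encoding attribute are made up here, and nothing like
this exists in the standard library:

```python
class EncodedBytes(bytes):
    """Hypothetical bytes-with-encoding-tag, per the proposal above."""

    def __new__(cls, data, encoding=None):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        enc = getattr(other, "encoding", None)
        if enc == self.encoding:
            # Same tag (or both None): keep the tag on the result.
            return EncodedBytes(bytes(self) + bytes(other), self.encoding)
        # Different tags: decode each side, re-encode in a common encoding,
        # and record the encoding chosen on the result.
        left = bytes(self).decode(self.encoding or "ascii")
        right = bytes(other).decode(enc or "ascii")
        return EncodedBytes((left + right).encode("utf-8"), "utf-8")

a = EncodedBytes("Martin ".encode("latin-1"), "latin-1")
b = EncodedBytes("Löwis".encode("latin-1"), "latin-1")
c = EncodedBytes("é".encode("utf-8"), "utf-8")

print((a + b).encoding)              # -> latin-1
print((a + b).decode("latin-1"))     # -> Martin Löwis
print((b + c).encoding)              # -> utf-8
```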

This is not a fully developed idea, and there has been discussion on the topic before
(even between us ;-) but I thought another round might bring out your current thinking
on it ;-)

Regards,
Bengt Richter
Oct 17 '05 #8
Bengt Richter wrote:
Well, what will be assumed about name after the lines

#-*- coding: latin1 -*-
name = 'Martin Löwis'

?
Are you asking what is assumed about the identifier 'name', or the value
bound to that identifier? Currently, the identifier must be encoded in
latin1 in this source code, and it must only consist of letters, digits,
and the underscore.

The value of name will be a string consisting of the bytes
4d 61 72 74 69 6e 20 4c f6 77 69 73
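Those twelve bytes can be checked directly (Python 3 spelling here; the
literal in the post above is a Python 2 str):

```python
# Encode the name in latin-1 and dump the bytes as two-digit hex.
name = "Martin Löwis".encode("latin-1")
print(" ".join(f"{b:02x}" for b in name))
# -> 4d 61 72 74 69 6e 20 4c f6 77 69 73
```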
I know type(name) will be <type 'str'> and in itself contain no encoding information now,
but why shouldn't the default assumption for literal-generated strings be what the coding
cookie specified?
That certainly is the assumption: string literals must be in the
encoding specified in the source encoding, in the source code file
on disk. If they aren't (and cannot be interpreted that way), you
get a syntax error.
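The "cannot be interpreted that way" case can be simulated by decoding by
hand; in the interpreter the failure would surface at compile time as a
syntax error, while here it shows up as a UnicodeDecodeError:

```python
# A byte sequence that is valid latin-1 but not valid UTF-8:
data = b"Martin L\xf6wis"

print(data.decode("latin-1"))   # the declared encoding fits -> Martin Löwis
try:
    data.decode("utf-8")        # declared encoding doesn't match the bytes
except UnicodeDecodeError:
    print("undecodable under utf-8")
```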
I know the current implementation doesn't keep track of the different
encodings that could reasonably be inferred from the source of the strings,
but we are talking about future stuff here ;-)
Ah, so you want the source encoding to be preserved, say as an attribute
of the string literal. This has been discussed many times, and was
always rejected.

Some people reject it because it is overkill: if you want reliable,
stable representation of characters, you should use Unicode strings.

Others reject it because of semantic difficulties: how would such
strings behave under concatenation, if the encodings are different?
#-*- coding: latin1 -*-
name = 'Martin Löwis'

could be that name.encoding == 'latin-1'
That is not at all intuitive. I would have expected name.encoding
to be 'latin1'.
Functions that generate strings, such as chr(), could be assumed to create
a string with the same encoding as the source code for the chr(...) invocation.
What is the source of the chr invocation? If I do chr(param), should I
use the source where param was computed, or the source where the call
to chr occurs? If the latter, how should the interpreter preserve the
encoding of where the call came from?

What about the many other sources of byte strings (like strings read
from a file, or received via a socket)?
This is not a fully developed idea, and there has been discussion on the topic before
(even between us ;-) but I thought another round might bring out your current thinking
on it ;-)


My thinking still is the same. It cannot really work, and it wouldn't do
any good with what little it could do. Just use Unicode strings.

Regards,
Martin
Oct 17 '05 #9
On Tue, 18 Oct 2005 01:34:09 +0200, "Martin v. Löwis" <ma****@v.loewis.de> wrote:
Bengt Richter wrote:
Well, what will be assumed about name after the lines

#-*- coding: latin1 -*-
name = 'Martin Löwis'

?
Are you asking what is assumed about the identifier 'name', or the value
bound to that identifier? Currently, the identifier must be encoded in
latin1 in this source code, and it must only consist of letters, digits,
and the underscore.

The value of name will be a string consisting of the bytes
4d 61 72 74 69 6e 20 4c f6 77 69 73


Which is the latin-1 encoding. Ok, so far so good. We know it's latin1, but the knowledge
is lost to Python.
I know type(name) will be <type 'str'> and in itself contain no encoding information now,
but why shouldn't the default assumption for literal-generated strings be what the coding
cookie specified?
That certainly is the assumption: string literals must be in the
encoding specified in the source encoding, in the source code file
on disk. If they aren't (and cannot be interpreted that way), you
get a syntax error.

I meant the "literal-generated string" (the internal str instance representation compiled
from the latin1-encoded source string literal).
I know the current implementation doesn't keep track of the different
encodings that could reasonably be inferred from the source of the strings,
but we are talking about future stuff here ;-)
Ah, so you want the source encoding to be preserved, say as an attribute
of the string literal. This has been discussed many times, and was
always rejected.

Not of the string literal per se. That is only one (constant) expression resulting
in a str instance. I want (for the sake of this discussion ;-) the str instance
to have an encoding attribute when it can reliably be inferred, as e.g. when a coding
cookie is specified and the str instance comes from a constant literal string expression.

Some people reject it because it is overkill: if you want reliable,
stable representation of characters, you should use Unicode strings.

Others reject it because of semantic difficulties: how would such
strings behave under concatenation, if the encodings are different?

I mentioned that in parts you snipped (2nd half here):
"""
Now when you read a file in binary without specifying any encoding assumption, you
would get a str string with .encoding==None, but you could effectively reinterpret-cast it
to any encoding you like by assigning the encoding attribute. The attribute
could be a property that causes decode/encode automatically to create data in the
new encoding. The None encoding, coming or going, would not change the data bytes, but
differing explicit encodings would cause a decode/encode.

This could also support s1+s2 to mean generate a concatenated string
that has the same encoding attribute if s1.encoding==s2.encoding and otherwise promotes
each to the platform standard unicode encoding and concatenates those if they
are different (and records the unicode encoding chosen in the result's encoding
attribute).
"""
#-*- coding: latin1 -*-
name = 'Martin Löwis'

could be that name.encoding == 'latin-1'
That is not at all intuitive. I would have expected name.encoding
to be 'latin1'.

That's pretty dead-pan. Not even a smiley ;-)
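For what it's worth, the two spellings are aliases of one and the same codec,
which the codecs module can confirm, so either form resolves identically:

```python
import codecs

# 'latin1', 'latin-1' and 'iso-8859-1' all resolve to the same codec;
# its canonical CodecInfo name is 'iso8859-1'.
print(codecs.lookup("latin1").name)    # -> iso8859-1
print(codecs.lookup("latin-1").name)   # -> iso8859-1
```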
Functions that generate strings, such as chr(), could be assumed to create
a string with the same encoding as the source code for the chr(...) invocation.

What is the source of the chr invocation? If I do chr(param), should I
use the source where param was computed, or the source where the call
to chr occurs? If the latter, how should the interpreter preserve the
encoding of where the call came from?

The source file that the "chr(param)" appears in. No, not the source where
param was computed: the param is numeric, and has no reasonably inferrable
encoding. (I don't propose to have ord pass it on for integers to carry ;-)
(so ord in another module with different source encoding could be the source
and an encoding conversion could happen with integer as intermediary. But
that's expected ;-) And not this latter, so not applicable.

What about the many other sources of byte strings (like strings read
from a file, or received via a socket)?

I mentioned that in parts you snipped. See above.
This is not a fully developed idea, and there has been discussion on the topic before
(even between us ;-) but I thought another round might bring out your current thinking
on it ;-)


My thinking still is the same. It cannot really work, and it wouldn't do
any good with what little it could do. Just use Unicode strings.

To hear "It cannot really work" causes me agitation, even if I know it's not worth
the effort to pursue it ;-)

Anyway, ok, I'll leave it at that, but I'm not altogether happy with having to write

#-*- coding: latin1 -*-
name = 'Martin Löwis'
print name.decode('latin1')

where I think

#-*- coding: latin1 -*-
name = 'Martin Löwis'
print name

should reasonably produce the same output. Though I grant you

#-*- coding: latin1 -*-
name = u'Martin Löwis'
print name

is not that hard to do. (Please excuse the use of your name, which has a handy non-ascii letter ;-)

Regards,
Bengt Richter
Oct 18 '05 #10
