Bytes | Software Development & Data Engineering Community

Why ascii-only symbols?

Out of random curiosity, is there a PEP/thread/? that explains why
Python symbols are restricted to 7-bit ascii?

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Oct 11 '05 #1
I'm not aware of any PEPs on the subject, but google groups turns up some past
threads. Here's one from February 2004:
http://groups.google.com/group/comp....856af647ce71d5
I didn't immediately find the message of Guido's that everyone was talking
about when this thread began, though.

Jeff


Oct 11 '05 #2
Mike Meyer wrote:
Out of random curiosity, is there a PEP/thread/? that explains why
Python symbols are restricted to 7-bit ascii?


And of equally random curiosity :-), what alternative(s) can you suggest
would have been appropriate? (I note that Unicode, for example, dates
from around the time Python was first released. And I can't really
imagine a non-ugly alternative, which probably reveals something bad
about my imagination.)

-Peter
Oct 11 '05 #3
Hi !

I agree with you; I would love the capacity to call functions named in Unicode.

@-salutations

Michel Claveau

Oct 11 '05 #4
Mike Meyer wrote:
Out of random curiosity, is there a PEP/thread/? that explains why
Python symbols are restricted to 7-bit ascii?


No PEP yet; I have been meaning to write one for several years now.

The principles would be
- sources must use encoding declarations
- valid identifiers would follow the Unicode consortium guidelines,
in particular: identifiers would be normalized in NFKC (I think),
adjusted in the ASCII range for backward compatibility (i.e.
not introducing any additional ASCII characters as legal identifier
characters)
- __dict__ will contain Unicode keys
- all objects should support Unicode getattr/setattr (potentially
raising AttributeError, of course)
- open issue: what to do on the C API (perhaps nothing, perhaps
allowing UTF-8)
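The NFKC normalization those Unicode guidelines call for can be previewed
with the unicodedata module. This is only an illustrative sketch of what
normalizing identifiers would mean, not part of any PEP:

```python
import unicodedata

# NFKC folds compatibility characters into canonical forms, so visually
# distinct spellings of "the same" identifier collapse into one key.
ligature = "\ufb01le"    # "file" spelled with the "fi" ligature U+FB01
fullwidth = "\uff41bc"   # "abc" with a fullwidth "a" U+FF41

print(unicodedata.normalize("NFKC", ligature))   # -> file
print(unicodedata.normalize("NFKC", fullwidth))  # -> abc
```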

Regards,
Martin
Oct 12 '05 #5
On Wed, 12 Oct 2005 10:56:44 +0200, "Martin v. Löwis" <ma****@v.loewis.de> wrote:
Mike Meyer wrote:
Out of random curiosity, is there a PEP/thread/? that explains why
Python symbols are restricted to 7-bit ascii?


No PEP yet; I have been meaning to write one for several years now.

The principles would be
- sources must use encoding declarations
- valid identifiers would follow the Unicode consortium guidelines,
in particular: identifiers would be normalized in NFKC (I think),
adjusted in the ASCII range for backward compatibility (i.e.
not introducing any additional ASCII characters as legal identifier
characters)
- __dict__ will contain Unicode keys
- all objects should support Unicode getattr/setattr (potentially
raising AttributeError, of course)
- open issue: what to do on the C API (perhaps nothing, perhaps
allowing UTF-8)


Perhaps string equivalence in keys will be treated like numeric equivalence?
I.e., a key/name representation is established by the initial key/name binding, but
values can be retrieved by "equivalent" key/names with different representations
like unicode vs ascii or latin-1 etc.?
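The numeric equivalence being alluded to is presumably ordinary dict
behaviour like the following (a plain illustration, not a proposal):

```python
# An int key and an "equivalent" float key retrieve the same value,
# because equal numbers compare and hash equal in Python.
d = {1: "one"}
print(d[1.0])                 # -> one
print(hash(1) == hash(1.0))   # -> True
```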

Regards,
Bengt Richter
Oct 16 '05 #6
Bengt Richter wrote:
Perhaps string equivalence in keys will be treated like numeric equivalence?
I.e., a key/name representation is established by the initial key/name binding, but
values can be retrieved by "equivalent" key/names with different representations
like unicode vs ascii or latin-1 etc.?


That would require that you know the encoding of a byte string; this
information is not available at run-time.

You could also try all possible encodings to see whether the strings
are equal if you chose the right encoding for each one. This would
be both expensive and unlike numeric equivalence: in numeric
equivalence, you don't give a sequence of bytes all possible
interpretations to find some interpretation in which they are
equivalent, either.

There is one special case, though: when comparing a byte string
and a Unicode string, the system default encoding (i.e. ASCII)
is assumed. This only really works if the default encoding
really *is* ASCII. Otherwise, equal strings might not hash
equal, in which case you wouldn't find them properly in a
dictionary.
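The failure mode described here can be sketched in later-Python terms
(the exact semantics differ between Python 2 and 3; this uses Python 3,
where bytes and text never compare equal at all):

```python
raw = "Löwis".encode("latin-1")   # b'L\xf6wis'

# The byte string and the text string carry the same characters, but
# without knowing the byte string's encoding they do not compare equal...
print(raw == "Löwis")                     # -> False
# ...until the bytes are decoded with the right codec.
print(raw.decode("latin-1") == "Löwis")   # -> True
```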

Regards,
Martin
Oct 16 '05 #7
On Sun, 16 Oct 2005 12:16:58 +0200, "Martin v. Löwis" <ma****@v.loewis.de> wrote:
Bengt Richter wrote:
Perhaps string equivalence in keys will be treated like numeric equivalence?
I.e., a key/name representation is established by the initial key/name binding, but
values can be retrieved by "equivalent " key/names with different representations
like unicode vs ascii or latin-1 etc.?
That would require that you know the encoding of a byte string; this
information is not available at run-time.

Well, what will be assumed about name after the lines

#-*- coding: latin1 -*-
name = 'Martin Löwis'

?
I know type(name) will be <type 'str'> and in itself contain no encoding information now,
but why shouldn't the default assumption for literal-generated strings be what the coding
cookie specified? I know the current implementation doesn't keep track of the different
encodings that could reasonably be inferred from the source of the strings, but we are talking
about future stuff here ;-)
You could also try all possible encodings to see whether the strings
are equal if you chose the right encoding for each one. This would
be both expensive and unlike numeric equivalence: in numeric
equivalence, you don't give a sequence of bytes all possible
interpretations to find some interpretation in which they are
equivalent, either.
Agreed, that would be a mess.
There is one special case, though: when comparing a byte string
and a Unicode string, the system default encoding (i.e. ASCII)
is assumed. This only really works if the default encoding
really *is* ASCII. Otherwise, equal strings might not hash
equal, in which case you wouldn't find them properly in a
dictionary.

Perhaps the str (or future bytes) type could have an encoding attribute
defaulting to None, meaning to treat its instances as current str instances.
Setting the attribute to some particular encoding, like 'latin-1' (probably
normalized internally, and optimized to be represented as a C pointer slot holding
NULL or a pointer to an appropriate codec or whatever), would mark the byte
string explicitly as an encoded string, without changing the byte string data or
converting to a unicode encoding. With encoding information explicitly present
or absent, keys could have a normalized hash and comparison, perhaps by normalizing
encoding-tagged string keys to the platform utf by default for dict use.

If this were done, IWT the automatic result of

#-*- coding: latin1 -*-
name = 'Martin Löwis'

could be that name.encoding == 'latin-1'

whereas without the encoding cookie, the default encoding assumption
for the program source would be used, and set explicitly to 'ascii'
or whatever it is.

Functions that generate strings, such as chr(), could be assumed to create
a string with the same encoding as the source code for the chr(...) invocation.
Ditto for e.g. '%s == %c' % (65, 65)
And
s = u'Martin Löwis'.encode('latin-1')
would get
s.encoding == 'latin-1'
not
s.encoding == None
so that the encoding information could make
print s
mean
print s.decode(s.encoding)
(which of course would re-encode to the output device encoding for output, like current
print s.decode('latin-1'), and not fail like the current default assumption for s's encoding,
which is s.encoding==None, i.e., assume the default, which is likely print s.decode('ascii'))

Hm, probably
s.encode(None)
and
s.decode(None)
could mean retrieve the str byte data unchanged as a str string with encoding set to None
in the result either way.

Now when you read a file in binary without specifying any encoding assumption, you
would get a str string with .encoding==None, but you could effectively reinterpret-cast it
to any encoding you like by assigning the encoding attribute. The attribute
could be a property that causes decode/encode automatically to create data in the
new encoding. The None encoding, coming or going, would not change the data bytes, but
differing explicit encodings would cause a decode/encode.

This could also support s1+s2 to mean generate a concatenated string
that has the same encoding attribute if s1.encoding==s2.encoding and otherwise promotes
each to the platform standard unicode encoding and concatenates those if they
are different (and records the unicode encoding chosen in the result's encoding
attribute).
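A rough sketch of that proposal as a hypothetical bytes subclass; the name
EncodedBytes and the .encoding attribute are made up here, and nothing like
this exists in the standard library:

```python
class EncodedBytes(bytes):
    """Hypothetical bytes-with-encoding-tag, per the proposal above."""

    def __new__(cls, data, encoding=None):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        enc = getattr(other, "encoding", None)
        if enc == self.encoding:
            # Same tag (or both None): keep the tag on the result.
            return EncodedBytes(bytes(self) + bytes(other), self.encoding)
        # Different tags: decode each side, re-encode in a common encoding,
        # and record the encoding chosen on the result.
        left = bytes(self).decode(self.encoding or "ascii")
        right = bytes(other).decode(enc or "ascii")
        return EncodedBytes((left + right).encode("utf-8"), "utf-8")

a = EncodedBytes("Martin ".encode("latin-1"), "latin-1")
b = EncodedBytes("Löwis".encode("latin-1"), "latin-1")
c = EncodedBytes("é".encode("utf-8"), "utf-8")

print((a + b).encoding)              # -> latin-1
print((a + b).decode("latin-1"))     # -> Martin Löwis
print((b + c).encoding)              # -> utf-8
```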

This is not a fully developed idea, and there has been discussion on the topic before
(even between us ;-) but I thought another round might bring out your current thinking
on it ;-)

Regards,
Bengt Richter
Oct 17 '05 #8
Bengt Richter wrote:
Well, what will be assumed about name after the lines

#-*- coding: latin1 -*-
name = 'Martin Löwis'

?
Are you asking what is assumed about the identifier 'name', or the value
bound to that identifier? Currently, the identifier must be encoded in
latin1 in this source code, and it must only consist of letters, digits,
and the underscore.

The value of name will be a string consisting of the bytes
4d 61 72 74 69 6e 20 4c f6 77 69 73
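Those twelve bytes can be checked directly (Python 3 spelling here; the
literal in the post above is a Python 2 str):

```python
# Encode the name in latin-1 and dump the bytes as two-digit hex.
name = "Martin Löwis".encode("latin-1")
print(" ".join(f"{b:02x}" for b in name))
# -> 4d 61 72 74 69 6e 20 4c f6 77 69 73
```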
I know type(name) will be <type 'str'> and in itself contain no encoding information now,
but why shouldn't the default assumption for literal-generated strings be what the coding
cookie specified?
That certainly is the assumption: string literals must be in the
encoding specified in the source encoding, in the source code file
on disk. If they aren't (and cannot be interpreted that way), you
get a syntax error.
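The "cannot be interpreted that way" case can be simulated by decoding by
hand; in the interpreter the failure would surface at compile time as a
syntax error, while here it shows up as a UnicodeDecodeError:

```python
# A byte sequence that is valid latin-1 but not valid UTF-8:
data = b"Martin L\xf6wis"

print(data.decode("latin-1"))   # the declared encoding fits -> Martin Löwis
try:
    data.decode("utf-8")        # declared encoding doesn't match the bytes
except UnicodeDecodeError:
    print("undecodable under utf-8")
```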
I know the current implementation doesn't keep track of the different
encodings that could reasonably be inferred from the source of the strings,
but we are talking about future stuff here ;-)
Ah, so you want the source encoding to be preserved, say as an attribute
of the string literal. This has been discussed many times, and was
always rejected.

Some people reject it because it is overkill: if you want reliable,
stable representation of characters, you should use Unicode strings.

Others reject it because of semantic difficulties: how would such
strings behave under concatenation, if the encodings are different?
#-*- coding: latin1 -*-
name = 'Martin Löwis'

could be that name.encoding == 'latin-1'
That is not at all intuitive. I would have expected name.encoding
to be 'latin1'.
Functions that generate strings, such as chr(), could be assumed to create
a string with the same encoding as the source code for the chr(...) invocation.
What is the source of the chr invocation? If I do chr(param), should I
use the source where param was computed, or the source where the call
to chr occurs? If the latter, how should the interpreter preserve the
encoding of where the call came from?

What about the many other sources of byte strings (like strings read
from a file, or received via a socket)?
This is not a fully developed idea, and there has been discussion on the topic before
(even between us ;-) but I thought another round might bring out your current thinking
on it ;-)


My thinking still is the same. It cannot really work, and it wouldn't do
any good with what little it could do. Just use Unicode strings.

Regards,
Martin
Oct 17 '05 #9
On Tue, 18 Oct 2005 01:34:09 +0200, "Martin v. Löwis" <ma****@v.loewis.de> wrote:
Bengt Richter wrote:
Well, what will be assumed about name after the lines

#-*- coding: latin1 -*-
name = 'Martin Löwis'

?
Are you asking what is assumed about the identifier 'name', or the value
bound to that identifier? Currently, the identifier must be encoded in
latin1 in this source code, and it must only consist of letters, digits,
and the underscore.

The value of name will be a string consisting of the bytes
4d 61 72 74 69 6e 20 4c f6 77 69 73


Which is the latin-1 encoding. Ok, so far so good. We know it's latin1, but the knowledge
is lost to Python.
I know type(name) will be <type 'str'> and in itself contain no encoding information now,
but why shouldn't the default assumption for literal-generated strings be what the coding
cookie specified?
That certainly is the assumption: string literals must be in the
encoding specified in the source encoding, in the source code file
on disk. If they aren't (and cannot be interpreted that way), you
get a syntax error.

I meant the "literal-generated string" (the internal str instance representation compiled
from the latin1-encoded source string literal).
I know the current implementation doesn't keep track of the different
encodings that could reasonably be inferred from the source of the strings,
but we are talking about future stuff here ;-)
Ah, so you want the source encoding to be preserved, say as an attribute
of the string literal. This has been discussed many times, and was
always rejected.

Not of the string literal per se. That is only one (constant) expression resulting
in a str instance. I want (for the sake of this discussion ;-) the str instance
to have an encoding attribute when it can reliably be inferred, as e.g. when a coding
cookie is specified and the str instance comes from a constant literal string expression.

Some people reject it because it is overkill: if you want reliable,
stable representation of characters, you should use Unicode strings.

Others reject it because of semantic difficulties: how would such
strings behave under concatenation, if the encodings are different?

I mentioned that in parts you snipped (2nd half here):
"""
Now when you read a file in binary without specifying any encoding assumption, you
would get a str string with .encoding==None, but you could effectively reinterpret-cast it
to any encoding you like by assigning the encoding attribute. The attribute
could be a property that causes decode/encode automatically to create data in the
new encoding. The None encoding, coming or going, would not change the data bytes, but
differing explicit encodings would cause a decode/encode.

This could also support s1+s2 to mean generate a concatenated string
that has the same encoding attribute if s1.encoding==s2.encoding and otherwise promotes
each to the platform standard unicode encoding and concatenates those if they
are different (and records the unicode encoding chosen in the result's encoding
attribute).
"""
#-*- coding: latin1 -*-
name = 'Martin Löwis'

could be that name.encoding == 'latin-1'
That is not at all intuitive. I would have expected name.encoding
to be 'latin1'.

That's pretty dead-pan. Not even a smiley ;-)
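For what it's worth, the two spellings are aliases of one and the same codec,
which the codecs module can confirm, so either form resolves identically:

```python
import codecs

# 'latin1', 'latin-1' and 'iso-8859-1' all resolve to the same codec;
# its canonical CodecInfo name is 'iso8859-1'.
print(codecs.lookup("latin1").name)    # -> iso8859-1
print(codecs.lookup("latin-1").name)   # -> iso8859-1
```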
Functions that generate strings, such as chr(), could be assumed to create
a string with the same encoding as the source code for the chr(...) invocation.

What is the source of the chr invocation? If I do chr(param), should I
use the source where param was computed, or the source where the call
to chr occurs? If the latter, how should the interpreter preserve the
encoding of where the call came from?

The source file that the "chr(param)" appears in. No, not the source where
param was computed: the param is numeric, and has no reasonably inferrable
encoding. (I don't propose to have ord pass it on for integers to carry ;-)
(so ord in another module with different source encoding could be the source
and an encoding conversion could happen with integer as intermediary. But
that's expected ;-) And not this latter, so not applicable.

What about the many other sources of byte strings (like strings read
from a file, or received via a socket)?

I mentioned that in parts you snipped. See above.
This is not a fully developed idea, and there has been discussion on the topic before
(even between us ;-) but I thought another round might bring out your current thinking
on it ;-)


My thinking still is the same. It cannot really work, and it wouldn't do
any good with what little it could do. Just use Unicode strings.

To hear "It cannot really work" causes me agitation, even if I know it's not worth
the effort to pursue it ;-)

Anyway, ok, I'll leave it at that, but I'm not altogether happy with having to write

#-*- coding: latin1 -*-
name = 'Martin Löwis'
print name.decode('latin1')

where I think

#-*- coding: latin1 -*-
name = 'Martin Löwis'
print name

should reasonably produce the same output. Though I grant you

#-*- coding: latin1 -*-
name = u'Martin Löwis'
print name

is not that hard to do. (Please excuse the use of your name, which has a handy non-ascii letter ;-)

Regards,
Bengt Richter
Oct 18 '05 #10
