PEP 3131: Supporting Non-ASCII Identifiers

"Martin v. Löwis"

PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
py*********@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural background
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?
- would you use them if it was possible to do so? in what cases?

Regards,
Martin

PEP: 3131
Title: Supporting Non-ASCII Identifiers
Version: $Revision: 55059 $
Last-Modified: $Date: 2007-05-01 22:34:25 +0200 (Di, 01 Mai 2007) $
Author: Martin v. Löwis <ma****@v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 1-May-2007
Python-Version: 3.0
Post-History:

Abstract
========

This PEP proposes supporting non-ASCII letters (such as accented
characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.

Rationale
=========

Python code is written by many people in the world who are not familiar
with the English language, or even well-acquainted with the Latin
writing system. Such developers often desire to define classes and
functions with names in their native languages, rather than having to
come up with an (often incorrect) English translation of the concept
they want to name.

For some languages, common transliteration systems exist (in particular,
for the Latin-based writing systems). For other languages, users have
greater difficulty using Latin to write their native words.

Common Objections
=================

Some objections are often raised against proposals similar to this one.

People claim that they will not be able to use a library if to do so
they have to use characters they cannot type on their keyboards.
However, it is the choice of the designer of the library to decide on
various constraints for using the library: people may not be able to use
the library because they cannot get physical access to the source code
(because it is not published), or because licensing prohibits usage, or
because the documentation is in a language they cannot understand. A
developer wishing to make a library widely available needs to make a
number of explicit choices (such as publication, licensing, language
of documentation, and language of identifiers). It should always be the
choice of the author to make these decisions - not the choice of the
language designers.

In particular, projects wishing to have wide usage might want
to establish a policy that all identifiers, comments, and documentation
are written in English (see the GNU coding style guide for an example of
such a policy). Restricting the language to ASCII-only identifiers does
not enforce comments and documentation to be English, or the identifiers
actually to be English words, so an additional policy is necessary,
anyway.

Specification of Language Changes
=================================

The syntax of identifiers in Python will be based on the Unicode
standard annex UAX-31 [1]_, with elaboration and changes as defined
below.

Within the ASCII range (U+0001..U+007F), the valid characters for
identifiers are the same as in Python 2.5. This specification only
introduces additional characters from outside the ASCII range. For
other characters, the classification uses the version of the Unicode
Character Database as included in the ``unicodedata`` module.

The identifier syntax is ``<ID_Start> <ID_Continue>*``.

``ID_Start`` is defined as all characters having one of the general
categories uppercase letters (Lu), lowercase letters (Ll), titlecase
letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers
(Nl), plus the underscore (XXX what are "stability extensions" listed in
UAX 31).

``ID_Continue`` is defined as all characters in ``ID_Start``, plus
nonspacing marks (Mn), spacing combining marks (Mc), decimal number
(Nd), and connector punctuations (Pc).
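
As a rough illustration of these two classes, the following Python 3 sketch
checks a string against the general categories listed above, using
``unicodedata.category``. The helper names (``is_id_start``,
``is_id_continue``, ``is_identifier``) are hypothetical and not part of this
proposal. The sketch ignores keywords and the open question about stability
extensions.

```python
import unicodedata

# General categories named in the ID_Start and ID_Continue definitions.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch):
    """True if ch may begin an identifier (hypothetical helper)."""
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def is_id_continue(ch):
    """True if ch may appear after the first character (hypothetical helper)."""
    return is_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


def is_identifier(name):
    """Check name against the <ID_Start> <ID_Continue>* syntax."""
    if not name:
        return False
    return is_id_start(name[0]) and all(is_id_continue(c) for c in name[1:])


for candidate in ["Löffelstiel", "ошибка", "売り場", "2abc", "a-b"]:
    print(candidate, is_identifier(candidate))
```

The classification depends on the version of the Unicode Character Database
shipped with the running interpreter, so results for recently added characters
may vary between Python versions.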

All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.
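
To see why normalization matters, consider that the same visible name can be
encoded in more than one way. The sketch below (Python 3, with the
hypothetical helper ``normalize_identifier``) shows two spellings of
``changé`` that differ as raw strings but agree after conversion to NFC.

```python
import unicodedata


def normalize_identifier(name):
    """Return name in normal form NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", name)


precomposed = "chang\u00e9"   # 'é' as the single code point U+00E9
decomposed = "change\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The raw strings differ, so a plain text search for one misses the other.
print(precomposed == decomposed)

# After NFC normalization both spellings name the same identifier.
print(normalize_identifier(precomposed) == normalize_identifier(decomposed))
```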

Policy Specification
====================

As an addition to the Python Coding style, the following policy is
prescribed: All identifiers in the Python standard library MUST use
ASCII-only identifiers, and SHOULD use English words wherever feasible.

As an option, this specification can be applied to Python 2.x. In that
case, ASCII-only identifiers would continue to be represented as byte
string objects in namespace dictionaries; identifiers with non-ASCII
characters would be represented as Unicode strings.

Implementation
==============

The following changes will need to be made to the parser:

1. If a non-ASCII character is found in the UTF-8 representation of the
source code, a forward scan is made to find the first ASCII
non-identifier character (e.g. a space or punctuation character)

2. The entire UTF-8 string is passed to a function to normalize the
string to NFC, and then verify that it follows the identifier syntax.
No such callout is made for pure-ASCII identifiers, which continue to
be parsed the way they are today.

3. If this specification is implemented for 2.x, reflective libraries
(such as pydoc) must be verified to continue to work when Unicode
strings appear in ``__dict__`` slots as keys.
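
Steps 1 and 2 can be sketched in Python 3 as follows. This is only an
illustration of the idea, not the actual tokenizer change: the function names
``scan_identifier`` and ``check_identifier`` are hypothetical, and
``str.isidentifier()`` stands in for the syntax check. That method follows the
rules of the running interpreter, which may differ in detail from this draft.

```python
import unicodedata


def scan_identifier(source: bytes, start: int) -> int:
    """Step 1: scan forward to the first ASCII non-identifier byte.

    Bytes >= 0x80 belong to multi-byte UTF-8 sequences and are skipped
    here; they are validated later, after decoding.
    """
    end = start
    while end < len(source):
        byte = source[end]
        if byte < 0x80 and not (chr(byte).isalnum() or byte == ord("_")):
            break
        end += 1
    return end


def check_identifier(source: bytes, start: int):
    """Step 2: normalize the candidate to NFC and verify its syntax."""
    end = scan_identifier(source, start)
    text = source[start:end].decode("utf-8")
    normalized = unicodedata.normalize("NFC", text)
    # Stand-in check; an assumption, not the rule set defined in this draft.
    if not normalized.isidentifier():
        raise SyntaxError("invalid identifier: %r" % text)
    return normalized, end


line = "Löffelstiel = 1".encode("utf-8")
name, end = check_identifier(line, 0)
print(name, line[end:])
```

Scanning on bytes keeps the pure-ASCII path unchanged: the decode and
normalization cost is paid only when a non-ASCII byte is present.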

References
==========

.. [1] http://www.unicode.org/reports/tr31/

Copyright
=========

This document has been placed in the public domain.

May 13 '07 #1

dustin

On Sun, May 13, 2007 at 05:44:39PM +0200, "Martin v. Löwis" wrote:

- should non-ASCII identifiers be supported? why?

The only objection that comes to mind is that adding such support may
make some distinct identifiers visually indistinguishable. IIRC the DNS
system has had this problem, leading to much phishing abuse.

I don't necessarily think that the objection is strong enough to reject
the idea -- programmers using non-ASCII symbols would be responsible for
the consequences of their character choice.

Dustin

May 13 '07 #2

"Martin v. Löwis"

The only objection that comes to mind is that adding such support may
make some distinct identifiers visually indistinguishable. IIRC the DNS
system has had this problem, leading to much phishing abuse.

This is a commonly-raised objection, but I don't understand why people
see it as a problem. The phishing issue surely won't apply, as you
normally don't "click" on identifiers, but rather type them. In a
phishing case, it is normally difficult to type the fake character
(because the phishing relies on you mistaking the character for another
one, so you would type the wrong identifier).

People have mentioned that this could be used to obscure your code - but
there are so many ways to write obscure code that I don't see a problem
in adding yet another way.

People also mentioned that they might mistake identifiers in a regular,
non-phishing, non-joking scenario, because they can't tell whether the
second letter of MAXLINESIZE is a Latin A or Greek Alpha. I find that
hard to believe - if the rest of the identifier is Latin, the A surely
also is Latin, and if the rest is Greek, it's likely an Alpha. The issue
is only with single-letter identifiers, and those are most common
as local variables. Then, it's an Alpha if there is also a Beta and
a Gamma as a local variable - if you have B and C also, it's likely A.

I don't necessarily think that the objection is strong enough to reject
the idea -- programmers using non-ASCII symbols would be responsible for
the consequences of their character choice.

Indeed.

Martin

May 13 '07 #3

André

On May 13, 12:44 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:

PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
python-3...@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural background
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?

I used to think differently. However, I would say a strong YES. They
would be extremely useful when teaching programming.

- would you use them if it was possible to do so? in what cases?

Only if I was teaching native French speakers.

Policy Specification
====================

As an addition to the Python Coding style, the following policy is
prescribed: All identifiers in the Python standard library MUST use
ASCII-only identifiers, and SHOULD use English words wherever feasible.

I would add something like:

Any module released for general use SHOULD use ASCII-only identifiers
in the public API.

Thanks for this initiative.

André

May 13 '07 #4

John Nagle

Martin v. Löwis wrote:

PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
py*********@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.

That may not be restrictive enough, because it permits multiple
different lexical representations of the same identifier in the same
text. Search and replace operations on source text might not find
all instances of the same identifier. Identifiers should be required
to be written in source text with a unique source text representation,
probably NFC, or be considered a syntax error.

I'd suggest restricting identifiers under the rules of UTS-39,
profile 2, "Highly Restrictive". This limits mixing of scripts
in a single identifier; you can't mix Hebrew and ASCII, for example,
which prevents problems with mixing right to left and left to right
scripts. Domain names have similar restrictions.

John Nagle

May 13 '07 #5

Paul Rubin

"Martin v. Löwis" <ma****@v.loewis.de> writes:

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?

No, and especially no without mandatory declarations of all variables.
Look at the problems of non-ascii characters in domain names and the
subsequent invention of Punycode. Maintaining code that uses those
identifiers in good faith will already be a big enough hassle, since
it will require installing and getting familiar with keyboard setups
and editing tools needed to enter those characters. Then there's the
issue of what happens when someone tries to slip a malicious patch
through a code review on purpose, by using homoglyphic characters
similar to the way domain name phishing works. Those tricks have also
been used to re-insert bogus articles into Wikipedia, circumventing
administrative blocks on the article names.

- would you use them if it was possible to do so? in what cases?

I would never insert them into a program. In existing programs where
they were used, I would remove them everywhere I could.

May 13 '07 #6

André

On May 13, 2:30 pm, John Nagle <n...@animats.com> wrote:

Martin v. Löwis wrote:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
python-3...@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.

That may not be restrictive enough, because it permits multiple
different lexical representations of the same identifier in the same
text. Search and replace operations on source text might not find
all instances of the same identifier. Identifiers should be required
to be written in source text with a unique source text representation,
probably NFC, or be considered a syntax error.

I'd suggest restricting identifiers under the rules of UTS-39,
profile 2, "Highly Restrictive". This limits mixing of scripts
in a single identifier; you can't mix Hebrew and ASCII, for example,
which prevents problems with mixing right to left and left to right
scripts. Domain names have similar restrictions.

John Nagle

Python keywords MUST be in ASCII ... so the above restriction can't
work. Unless the restriction is removed (which would be a separate
PEP).

André

May 13 '07 #7

Paul Rubin

"Martin v. Löwis" <ma****@v.loewis.de> writes:

This is a commonly-raised objection, but I don't understand why people
see it as a problem. The phishing issue surely won't apply, as you
normally don't "click" on identifiers, but rather type them. In a
phishing case, it is normally difficult to type the fake character
(because the phishing relies on you mistaking the character for another
one, so you would type the wrong identifier).

It certainly does apply, if you're maintaining a program and someone
submits a patch. In that case you neither click nor type the
character. You'd normally just make sure the patched program passes
the existing test suite, and examine the patch on the screen to make
sure it looks reasonable. The phishing possibilities are obvious.

May 13 '07 #8

André

On May 13, 12:44 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:

PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
python-3...@python.org

It should be noted that the Python community may use other forums, in
other languages. They would likely be a lot more enthusiastic about
this PEP than the usual crowd here (comp.lang.python).

André

May 13 '07 #9

Anton Vredegoor

Martin v. Löwis wrote:

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I am against this PEP for the following reasons:

It will split up the Python user community into different language or
interest groups without having any benefit as to making the language
more expressive in an algorithmic way.

Some time ago there was a discussion about introducing macros into the
language. Among the reasons why macros were excluded was precisely
because anyone could start writing their own kind of dialect of Python
code, resulting in fewer people being able to read what other programmers
wrote. And that last thing: 'Being able to easily read what other people
wrote' (sometimes that 'other people' is yourself half a year later, but
that isn't relevant in this specific case) is one of the main virtues in
the Python programming community. Correct me if I'm wrong please.

At that time I was considering giving up some user conformity because
the very powerful syntax extensions would make Python rival Lisp. It's
worth sacrificing something if one gets some other thing in return.

However since then we have gained metaclasses, iterators and generators
and even a C-like 'if' construct. Personally I'd also like to have a
'repeat-until'. These things are enough to keep us busy for a long time
and in some respects this new syntax is even more powerful/dangerous
than macros. But most importantly these extra burdens on the ease with
which one is to read code are offset by gaining more expressiveness in
the *coding* of scripts.

While I have little doubt that in the end some stubborn mathematician or
Frenchman will succeed in writing a preprocessor that would enable him
to indoctrinate his students into his specific version of reality, I see
little reason to actively endorse such foolishness.

The last argument I'd like to make is about the very possible reality
that in a few years the Internet will be dominated by the Chinese
language instead of by the English language. As a Dutchman I have no
special interest in English being the language of the Internet but
-given the status quo- I can see the advantages of everyone speaking the
*same* language. If it be Chinese, Chinese I will start to learn,
however inept I might be at it at first.

That doesn't mean however that one should actively open up to a kind of
contest as to which language will become the main language! On the
contrary one should hold out as long as possible to the united group one
has instead of dispersing into all kinds of experimental directions.

Do we harm the Chinese in this way one might ask by making it harder for
them to gain access to the net? Do we harm ourselves by not opening up
in time to the new status quo? Yes, in a way these are valid points, but
one should not forget that more advanced countries also have a
responsibility to lead the way by providing an example, one should not
think too lightly about that.

Anyway, I feel that it will not be possible to hold off these
developments in the long run, but great beneficial effects can still be
attained by keeping the language as simple and expressive as possible
and to adjust to new realities as soon as one of them becomes undeniably
apparent (which is something entirely different than enthusiastically
inviting them in and letting them fight it out against each other in your
own house) all the time taking responsibility to lead the way as long as
one has any consensus left.

A.

May 13 '07 #10
