
PEP 3131: Supporting Non-ASCII Identifiers

PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
py*********@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural background
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?
- would you use them if it was possible to do so? in what cases?

Regards,
Martin
PEP: 3131
Title: Supporting Non-ASCII Identifiers
Version: $Revision: 55059 $
Last-Modified: $Date: 2007-05-01 22:34:25 +0200 (Di, 01 Mai 2007) $
Author: Martin v. Löwis <ma****@v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 1-May-2007
Python-Version: 3.0
Post-History:
Abstract
========

This PEP suggests to support non-ASCII letters (such as accented
characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.

Rationale
=========

Python code is written by many people in the world who are not familiar
with the English language, or even well-acquainted with the Latin
writing system. Such developers often desire to define classes and
functions with names in their native languages, rather than having to
come up with an (often incorrect) English translation of the concept
they want to name.

For some languages, common transliteration systems exist (in particular,
for the Latin-based writing systems). For other languages, users have
greater difficulty using Latin to write their native words.

Common Objections
=================

Some objections are often raised against proposals similar to this one.

People claim that they will not be able to use a library if, to do so,
they have to use characters they cannot type on their keyboards.
However, it is the choice of the designer of the library to decide on
various constraints for using the library: people may not be able to use
the library because they cannot get physical access to the source code
(because it is not published), or because licensing prohibits usage, or
because the documentation is in a language they cannot understand. A
developer wishing to make a library widely available needs to make a
number of explicit choices (such as publication, licensing, language
of documentation, and language of identifiers). It should always be the
choice of the author to make these decisions - not the choice of the
language designers.

In particular, projects wishing to have wide usage might want
to establish a policy that all identifiers, comments, and documentation
are written in English (see the GNU coding style guide for an example of
such a policy). Restricting the language to ASCII-only identifiers does
not enforce comments and documentation to be English, or the identifiers
actually to be English words, so an additional policy is necessary,
anyway.

Specification of Language Changes
=================================

The syntax of identifiers in Python will be based on the Unicode
standard annex UAX-31 [1]_, with elaboration and changes as defined
below.

Within the ASCII range (U+0001..U+007F), the valid characters for
identifiers are the same as in Python 2.5. This specification only
introduces additional characters from outside the ASCII range. For
other characters, the classification uses the version of the Unicode
Character Database as included in the ``unicodedata`` module.

The identifier syntax is ``<ID_Start> <ID_Continue>*``.

``ID_Start`` is defined as all characters having one of the general
categories uppercase letters (Lu), lowercase letters (Ll), titlecase
letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers
(Nl), plus the underscore (XXX what are "stability extensions" listed in
UAX 31).

``ID_Continue`` is defined as all characters in ``ID_Start``, plus
nonspacing marks (Mn), spacing combining marks (Mc), decimal number
(Nd), and connector punctuations (Pc).
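The classification above can be sketched with the ``unicodedata`` module that the PEP designates as the reference. This is only an illustration of the category rules, not the actual tokenizer change (the open question about "stability extensions" is ignored here):

```python
import unicodedata

# General categories proposed for the start of an identifier.
ID_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Continuation characters additionally allow marks, digits, and
# connector punctuation.
ID_CONTINUE = ID_START | {"Mn", "Mc", "Nd", "Pc"}

def is_valid_identifier(name):
    """Check a candidate identifier against the proposed category rules."""
    if not name:
        return False
    if name[0] != "_" and unicodedata.category(name[0]) not in ID_START:
        return False
    return all(ch == "_" or unicodedata.category(ch) in ID_CONTINUE
               for ch in name[1:])

print(is_valid_identifier("Löffelstiel"))  # True
print(is_valid_identifier("ошибка"))       # True
print(is_valid_identifier("2fast"))        # False: Nd not allowed at start
```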

All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.
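The effect of NFC normalization can be demonstrated with ``unicodedata.normalize``: the same identifier may be typed as a precomposed character or as a base letter plus a combining mark, and normalization makes the two spellings compare equal:

```python
import unicodedata

precomposed = "chang\u00e9"     # 'é' as a single code point, U+00E9
decomposed = "change\u0301"     # 'e' followed by combining acute, U+0301

print(precomposed == decomposed)  # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```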

Policy Specification
====================

As an addition to the Python Coding style, the following policy is
prescribed: All identifiers in the Python standard library MUST use
ASCII-only identifiers, and SHOULD use English words wherever feasible.

As an option, this specification can be applied to Python 2.x. In that
case, ASCII-only identifiers would continue to be represented as byte
string objects in namespace dictionaries; identifiers with non-ASCII
characters would be represented as Unicode strings.

Implementation
==============

The following changes will need to be made to the parser:

1. If a non-ASCII character is found in the UTF-8 representation of the
source code, a forward scan is made to find the first ASCII
non-identifier character (e.g. a space or punctuation character)

2. The entire UTF-8 string is passed to a function to normalize the
string to NFC, and then verify that it follows the identifier syntax.
No such callout is made for pure-ASCII identifiers, which continue to
be parsed the way they are today.

3. If this specification is implemented for 2.x, reflective libraries
(such as pydoc) must be verified to continue to work when Unicode
strings appear in ``__dict__`` slots as keys.

References
==========

.. [1] http://www.unicode.org/reports/tr31/

Copyright
=========

This document has been placed in the public domain.
May 13 '07 #1
399 Replies


On Sun, May 13, 2007 at 05:44:39PM +0200, "Martin v. Löwis" wrote:
- should non-ASCII identifiers be supported? why?
The only objection that comes to mind is that adding such support may
make some distinct identifiers visually indistinguishable. IIRC the DNS
system has had this problem, leading to much phishing abuse.

I don't necessarily think that the objection is strong enough to reject
the idea -- programmers using non-ASCII symbols would be responsible for
the consequences of their character choice.

Dustin
May 13 '07 #2

The only objection that comes to mind is that adding such support may
make some distinct identifiers visually indistinguishable. IIRC the DNS
system has had this problem, leading to much phishing abuse.
This is a commonly-raised objection, but I don't understand why people
see it as a problem. The phishing issue surely won't apply, as you
normally don't "click" on identifiers, but rather type them. In a
phishing case, it is normally difficult to type the fake character
(because the phishing relies on you mistaking the character for another
one, so you would type the wrong identifier).

People have mentioned that this could be used to obscure your code - but
there are so many ways to write obscure code that I don't see a problem
in adding yet another way.

People also mentioned that they might mistake identifiers in a regular,
non-phishing, non-joking scenario, because they can't tell whether the
second letter of MAXLINESIZE is a Latin A or Greek Alpha. I find that
hard to believe - if the rest of the identifier is Latin, the A surely
also is Latin, and if the rest is Greek, it's likely an Alpha. The issue
is only with single-letter identifiers, and those are most common
as local variables. Then, it's an Alpha if there is also a Beta and
a Gamma as a local variable - if you have B and C also, it's likely A.
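For reference, the two characters in question really are distinct code points, so under the PEP they would form distinct identifiers however similar they render:

```python
import unicodedata

latin, greek = "A", "\u0391"   # Latin A vs. Greek capital Alpha

print(latin == greek)          # False
print(ord(latin), ord(greek))  # 65 913
print(unicodedata.name(greek)) # GREEK CAPITAL LETTER ALPHA
```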
I don't necessarily think that the objection is strong enough to reject
the idea -- programmers using non-ASCII symbols would be responsible for
the consequences of their character choice.
Indeed.

Martin

May 13 '07 #3

On May 13, 12:44 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
python-3...@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural background
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?
I used to think differently. However, I would say a strong YES. They
would be extremely useful when teaching programming.
- would you use them if it was possible to do so? in what cases?
Only if I was teaching native French speakers.
Policy Specification
====================

As an addition to the Python Coding style, the following policy is
prescribed: All identifiers in the Python standard library MUST use
ASCII-only identifiers, and SHOULD use English words wherever feasible.
I would add something like:

Any module released for general use SHOULD use ASCII-only identifiers
in the public API.

Thanks for this initiative.

André

May 13 '07 #4

Martin v. Löwis wrote:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
py*********@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").
All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.
That may not be restrictive enough, because it permits multiple
different lexical representations of the same identifier in the same
text. Search and replace operations on source text might not find
all instances of the same identifier. Identifiers should be required
to be written in source text with a unique source text representation,
probably NFC, or be considered a syntax error.
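The search-and-replace hazard is easy to demonstrate: a textual search for the NFC spelling of an identifier misses an occurrence written with a combining mark, even though the parser would treat both as the same identifier (the variable name here is made up for illustration):

```python
import unicodedata

# Two source spellings of the identifier 'changé' in one program text.
source = "chang\u00e9 = 1\nchange\u0301 = 2\n"

print(source.count("chang\u00e9"))  # 1: the plain search misses the
                                    # decomposed spelling
normalized = unicodedata.normalize("NFC", source)
print(normalized.count("chang\u00e9"))  # 2
```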

I'd suggest restricting identifiers under the rules of UTS-39,
profile 2, "Highly Restrictive". This limits mixing of scripts
in a single identifier; you can't mix Hebrew and ASCII, for example,
which prevents problems with mixing right to left and left to right
scripts. Domain names have similar restrictions.
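A rough sketch of such a mixed-script check follows. Note that ``unicodedata`` exposes no Script property; taking the first word of a character's Unicode name (LATIN, GREEK, HEBREW, ...) is only a crude approximation of the real UTS #39 rules, which are defined in terms of Scripts.txt:

```python
import unicodedata

def rough_scripts(identifier):
    """Crudely guess the set of scripts used in an identifier."""
    scripts = set()
    for ch in identifier:
        if ch == "_" or ch.isdigit():
            continue  # skip characters common to all scripts
        scripts.add(unicodedata.name(ch).split()[0])
    return scripts

print(rough_scripts("shalom"))        # {'LATIN'}
print(rough_scripts("shalom\u05e9"))  # mixes LATIN and HEBREW - a
                                      # "Highly Restrictive" profile
                                      # would reject this identifier
```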

John Nagle
May 13 '07 #5

"Martin v. Löwis" <ma****@v.loewis.de> writes:
So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?
No, and especially no without mandatory declarations of all variables.
Look at the problems of non-ascii characters in domain names and the
subsequent invention of Punycode. Maintaining code that uses those
identifiers in good faith will already be a big enough hassle, since
it will require installing and getting familiar with keyboard setups
and editing tools needed to enter those characters. Then there's the
issue of what happens when someone tries to slip a malicious patch
through a code review on purpose, by using homoglyphic characters
similar to the way domain name phishing works. Those tricks have also
been used to re-insert bogus articles into Wikipedia, circumventing
administrative blocks on the article names.
- would you use them if it was possible to do so? in what cases?
I would never insert them into a program. In existing programs where
they were used, I would remove them everywhere I could.
May 13 '07 #6

On May 13, 2:30 pm, John Nagle <n...@animats.com> wrote:
Martin v. Löwis wrote:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
python-3...@python.org
In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").
All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.

That may not be restrictive enough, because it permits multiple
different lexical representations of the same identifier in the same
text. Search and replace operations on source text might not find
all instances of the same identifier. Identifiers should be required
to be written in source text with a unique source text representation,
probably NFC, or be considered a syntax error.

I'd suggest restricting identifiers under the rules of UTS-39,
profile 2, "Highly Restrictive". This limits mixing of scripts
in a single identifier; you can't mix Hebrew and ASCII, for example,
which prevents problems with mixing right to left and left to right
scripts. Domain names have similar restrictions.

John Nagle
Python keywords MUST be in ASCII ... so the above restriction can't
work. Unless the restriction is removed (which would be a separate
PEP).

André

May 13 '07 #7

"Martin v. Löwis" <ma****@v.loewis.de> writes:
This is a commonly-raised objection, but I don't understand why people
see it as a problem. The phishing issue surely won't apply, as you
normally don't "click" on identifiers, but rather type them. In a
phishing case, it is normally difficult to type the fake character
(because the phishing relies on you mistaking the character for another
one, so you would type the wrong identifier).
It certainly does apply, if you're maintaining a program and someone
submits a patch. In that case you neither click nor type the
character. You'd normally just make sure the patched program passes
the existing test suite, and examine the patch on the screen to make
sure it looks reasonable. The phishing possibilities are obvious.
May 13 '07 #8

On May 13, 12:44 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
python-3...@python.org
It should be noted that the Python community may use other forums, in
other languages. They would likely be a lot more enthusiastic about
this PEP than the usual crowd here (comp.lang.python).

André
May 13 '07 #9

Martin v. Löwis wrote:
In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").
I am against this PEP for the following reasons:

It will split up the Python user community into different language or
interest groups without having any benefit as to making the language
more expressive in an algorithmic way.

Some time ago there was a discussion about introducing macros into the
language. Among the reasons why macros were excluded was precisely
because anyone could start writing their own kind of dialect of Python
code, resulting in less people being able to read what other programmers
wrote. And that last thing: 'Being able to easily read what other people
wrote' (sometimes that 'other people' is yourself half a year later, but
that isn't relevant in this specific case) is one of the main virtues in
the Python programming community. Correct me if I'm wrong please.

At that time I was considering to give up some user conformity because
the very powerful syntax extensions would make Python rival Lisp. It's
worth sacrificing something if one gets some other thing in return.

However since then we have gained metaclasses, iterators and generators
and even a C-like 'if' construct. Personally I'd also like to have a
'repeat-until'. These things are enough to keep us busy for a long time
and in some respects this new syntax is even more powerful/dangerous
than macros. But most importantly these extra burdens on the ease with
which one is to read code are offset by gaining more expressiveness in
the *coding* of scripts.

While I have little doubt that in the end some stubborn mathematician or
Frenchman will succeed in writing a preprocessor that would enable him
to indoctrinate his students into his specific version of reality, I see
little reason to actively endorse such foolishness.

The last argument I'd like to make is about the very possible reality
that in a few years the Internet will be dominated by the Chinese
language instead of by the English language. As a Dutchman I have no
special interest in English being the language of the Internet but
-given the status quo- I can see the advantages of everyone speaking the
*same* language. If it be Chinese, Chinese I will start to learn,
however inept I might be at it at first.

That doesn't mean however that one should actively open up to a kind of
contest as to which language will become the main language! On the
contrary one should hold out as long as possible to the united group one
has instead of dispersing into all kinds of experimental directions.

Do we harm the Chinese in this way, one might ask, by making it harder
for them to gain access to the net? Do we harm ourselves by not opening
up in time to the new status quo? Yes, in a way these are valid points, but
one should not forget that more advanced countries also have a
responsibility to lead the way by providing an example, one should not
think too lightly about that.

Anyway, I feel that it will not be possible to hold off these
developments in the long run, but great beneficial effects can still be
attained by keeping the language as simple and expressive as possible
and to adjust to new realities as soon as one of them becomes undeniably
apparent (which is something entirely different than enthusiastically
inviting them in and let them fight it out against each other in your
own house) all the time taking responsibility to lead the way as long as
one has any consensus left.

A.


May 13 '07 #10

Anton Vredegoor wrote:
>In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I am against this PEP for the following reasons:

It will split up the Python user community into different language or
interest groups without having any benefit as to making the language
more expressive in an algorithmic way.

We must distinguish between "identifiers named in a non-english language" and
"identifiers written with non-ASCII characters".

While the first is already allowed as long as the transcription uses only
ASCII characters, the second is currently forbidden and is what this PEP is about.

So, nothing currently keeps you from giving names to identifiers that are
impossible to understand by, say, Americans (ok, that's easy anyway).

For example, I could write

def zieheDreiAbVon(wert):
    return zieheAb(wert, 3)

and most people on earth would not have a clue what this is good for. However,
someone who is fluent enough in German could guess from the names what this does.

I do not think non-ASCII characters make this 'problem' any worse. So I must
ask people to restrict their comments to the actual problem that this PEP is
trying to solve.

Stefan
May 13 '07 #11

Martin v. Löwis wrote:
So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?
No, because "programs must be written for people to read, and only
incidentally for machines to execute". Using anything other than "lowest
common denominator" (ASCII) will restrict accessibility of code. This is
not a literature, that requires qualified translators to get the text
from Hindi (or Persian, or Chinese, or Georgian, or...) to Polish.

While I can read the code with Hebrew, Russian or Greek names
transliterated to ASCII, I would not be able to read such code in native.
For some languages, common transliteration systems exist (in particular,
for the Latin-based writing systems). For other languages, users have
larger difficulties to use Latin to write their native words.
This is one of least disturbing difficulties when it comes to programming.

--
Jarek Zgoda
http://jpa.berlios.de/
May 13 '07 #12

Martin v. Löwis wrote:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
py*********@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural background
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?
- would you use them if it was possible to do so? in what cases?

To make it clear: this PEP considers "identifiers written with non-ASCII
characters", not "identifiers named in a non-english language".

While the second is already allowed as long as the transcription uses only
ASCII characters, the first is currently forbidden and is what this PEP is about.

Now, I am not a strong supporter (most public code will use English
identifiers anyway) but we should not forget that Python supports encoding
declarations in source files and thus has much cleaner support for non-ASCII
source code than, say, Java. So, introducing non-ASCII identifiers is just a
small step further. Disallowing this does *not* guarantee in any way that
identifiers are understandable for English native speakers. It only guarantees
that identifiers are always *typable* by people who have access to latin
characters on their keyboard. A rather small advantage, I'd say.

The capability of a Unicode-aware language to express non-English identifiers
in a non-ASCII encoding totally makes sense to me.

Stefan
May 13 '07 #13

Paul Rubin wrote:
"Martin v. Löwis" <ma****@v.loewis.de> writes:
>- would you use them if it was possible to do so? in what cases?

I would never insert them into a program. In existing programs where
they were used, I would remove them everywhere I could.
Luckily, you will never be able to touch every program in the world.

Stefan
May 13 '07 #14

Jarek Zgoda wrote:
Martin v. Löwis wrote:
Uuups, is that a non-ASCII character in there? Why don't you keep them out of
an English speaking newsgroup?

>So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?

No, because "programs must be written for people to read, and only
incidentally for machines to execute". Using anything other than "lowest
common denominator" (ASCII) will restrict accessibility of code.
No, but it would make it a lot easier for a lot of people to use descriptive
names. Remember: we're all adults here, right?

While I can read the code with Hebrew, Russian or Greek names
transliterated to ASCII, I would not be able to read such code in native.
Then maybe it was code that was not meant to be read by you?

In the (not so small) place where I work, we tend to use descriptive names *in
German* for the code we write, mainly for reasons of domain clarity. The
*only* reason why we still use the (simple but ugly) ASCII-transcription
(ü->ue etc.) for identifiers is that we program in Java and Java lacks a
/reliable/ way to support non-ASCII characters in source code. Thanks to PEP
263 and 3120, Python does not suffer from this problem, but it suffers from
the bigger problem of not *allowing* non-ASCII characters in identifiers. And
I believe that's a rather arbitrary decision.

The more I think about it, the more I believe that this restriction should be
lifted. 'Any' non-ASCII identifier should be allowed where developers decide
that it makes sense.

Stefan
May 13 '07 #15

Stefan Behnel wrote:
Anton Vredegoor wrote:
>>In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").
I am against this PEP for the following reasons:

It will split up the Python user community into different language or
interest groups without having any benefit as to making the language
more expressive in an algorithmic way.

We must distinguish between "identifiers named in a non-english language" and
"identifiers written with non-ASCII characters".
[snip]
I do not think non-ASCII characters make this 'problem' any worse. So I must
ask people to restrict their comments to the actual problem that this PEP is
trying to solve.
Really? Because when I am reading source code, even if a particular
variable *name* is a sequence of characters that I cannot identify as a
word that I know, I can at least spell it out using Latin characters, or
perhaps even attempt to pronounce it (verbalization of a word, even if
it is an incorrect verbalization, I find helps me to remember a variable
and use it later).

On the other hand, the introduction of some 60k+ valid unicode glyphs
into the set of characters that can be seen as a name in Python would
make any such attempts by anyone who is not a native speaker (and even
native speakers in the case of the more obscure Kanji glyphs) an
exercise in futility.

As it stands, people who use Python (and the vast majority of other
programming languages) learn the 52 upper/lowercase variants of the
latin alphabet (and sometimes the 0-9 number characters for some parts
of the world). That's it. 62 glyphs at the worst. But a huge portion
of these people have already been exposed to these characters through
school, the internet, etc., and this isn't likely to change (regardless
of the 'impending' Chinese population dominance on the internet).

Indeed, the lack of the 60k+ glyphs as valid name characters can make
the teaching of Python to groups of people that haven't been exposed to
the Latin alphabet more difficult, but those people who are exposed to
programming are also typically exposed to the internet, on which Latin
alphabets dominate (never mind that html tags are Latin characters, as
are just about every daemon configuration file, etc.). Exposure to the
Latin alphabet isn't going to go away, and Python is very unlikely to be
the first exposure programmers have to the Latin alphabet (except for
OLPC, but this PEP is about a year late to the game to change that).
And even if Python *is* the first time children or adults are exposed to
the Latin alphabet, one would hope that 62 characters to learn to 'speak
the language of Python' is a small price to pay to use it.

Regarding different characters sharing the same glyphs, it is a problem.
Say that you are importing a module written by a mathematician that
uses an actual capital Greek alpha for a name. When a user sits down to
use it, they could certainly get NameErrors, AttributeErrors, etc., and
never understand why it is the case. Their fancy-schmancy unicode
enabled terminal will show them what looks like the Latin A, but it will
in fact be the Greek Α. Until they copy/paste, check its ord(), etc.,
they will be baffled. It isn't a problem now because A = Α is a syntax
error, but it can and will become a problem if it is allowed to.
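The namespace side of this scenario can be sketched directly: Python dictionaries key on code points, not glyphs, so the two visually identical names would coexist silently (the values here are placeholders):

```python
ns = {}
ns["A"] = "latin function"
ns["\u0391"] = "greek function"   # GREEK CAPITAL LETTER ALPHA

# Both entries survive: one binding per code point, not per glyph.
print(len(ns))     # 2
print(ns["A"])     # latin function
```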

But this issue isn't limited to different characters sharing glyphs!
It's also about being able to type names to use them in your own code
(generally very difficult if not impossible for many non-Latin
characters), or even be able to display them. And no number of
guidelines, suggestions, etc., against distributing libraries with
non-Latin identifiers will stop it from happening, and *will* fragment
the community as Anton (and others) have stated.

- Josiah
May 13 '07 #16

On Sun, 2007-05-13 at 21:01 +0200, Stefan Behnel wrote:
For example, I could write

def zieheDreiAbVon(wert):
    return zieheAb(wert, 3)

and most people on earth would not have a clue what this is good for. However,
someone who is fluent enough in German could guess from the names what this does.

I do not think non-ASCII characters make this 'problem' any worse. So I must
ask people to restrict their comments to the actual problem that this PEP is
trying to solve.
I think non-ASCII characters make the problem far, far worse. While I
may not understand what the function does from its name in your example,
allowing non-ASCII characters makes it worse by forcing all would-be
code readers to have all kinds of fonts installed just to view the
source code. This affects things like exception reports too. At least in your
example I know the exception occurred in zieheDreiAbVon. But if that
identifier is some UTF-8 string, how do I go about finding it in my text
editor, or even reporting the message to the developers? I don't happen
to have that particular keymap installed in my linux system, so I can't
even type the letters!

So given that people can already transliterate their language for use as
identifiers, I think avoiding non-ASCII character sets is a good idea.
ASCII is simply the lowest common denominator and is supported by *all*
configurations and locales on all developers' systems.
Stefan
May 13 '07 #17

Stefan Behnel wrote:
>While I can read the code with Hebrew, Russian or Greek names
transliterated to ASCII, I would not be able to read such code in native.

Then maybe it was code that was not meant to be read by you?
OK, then. As a code obfuscation measure this would fit perfectly.

--
Jarek Zgoda
http://jpa.berlios.de/
May 13 '07 #18

Josiah Carlson wrote:
It's also about being able to type names to use them in your own code
(generally very difficult if not impossible for many non-Latin
characters), or even be able to display them. And no number of
guidelines, suggestions, etc., against distributing libraries with
non-Latin identifiers will stop it from happening, and *will* fragment
the community as Anton (and others) have stated.
Ever noticed how the community is already fragmented into people working on
project A and people not working on project A? Why shouldn't the people
working on project A agree what language they write and spell their
identifiers in? And don't forget about project B, C, and all the others.

I agree that code posted to comp.lang.python should use english identifiers
and that it is worth considering to use english identifiers in open source
code that is posted to a public OS project site. Note that I didn't say "ASCII
identifiers" but plain english identifiers. All other code should use the
language and encoding that fits its environment best.

Stefan
May 13 '07 #19

P: n/a

"Stefan Behnel" <st******************@web.de> wrote in message
news:46**************@web.de...
| For example, I could write
|
| def zieheDreiAbVon(wert):
| return zieheAb(wert, 3)
|
| and most people on earth would not have a clue what this is good for.
However,
| someone who is fluent enough in German could guess from the names what
this does.
|
| I do not think non-ASCII characters make this 'problem' any worse.

It is ridiculous claims like this and the consequent refusal to admit,
address, and ameliorate the 50x worse problems that would be introduced
that lead me to oppose the PEP in its current form.

Terry Jan Reedy

May 13 '07 #20

P: n/a
Martin v. Löwis wrote:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
py*********@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り*
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural background
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported?
No.
why?
Because it will definitely make code-sharing impossible. Live with it
or else, but CS is English-speaking, period. I just can't understand
code with Spanish or German (two languages I have notions of)
identifiers, so let's not talk about other alphabets...

NB: I'm *not* a native English speaker, I do *not* live in an
English-speaking country, and my mother tongue requires non-ASCII encoding.
And I don't have special sympathy for the USA. And yes, I do write my
code - including comments - in English.
May 13 '07 #21

P: n/a
Stefan Behnel wrote:
Anton Vredegoor wrote:
>>>In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り*
(hoping that the latter one means "counter").

I am against this PEP for the following reasons:

It will split up the Python user community into different language or
interest groups without having any benefit as to making the language
more expressive in an algorithmic way.

We must distinguish between "identifiers named in a non-english language" and
"identifiers written with non-ASCII characters".

While the first is already allowed as long as the transcription uses only
ASCII characters, the second is currently forbidden and is what this PEP is about.

So, nothing currently keeps you from giving names to identifiers that are
impossible to understand by, say, Americans (ok, that's easy anyway).

For example, I could write

def zieheDreiAbVon(wert):
return zieheAb(wert, 3)

and most people on earth would not have a clue what this is good for.
Which is exactly why I don't agree with adding support for non-ASCII
identifiers. Using non-English identifiers should be strongly
discouraged, not openly supported.
However,
someone who is fluent enough in German could guess from the names what this does.

I do not think non-ASCII characters make this 'problem' any worse.
It does, by openly stating that it's ok to write unreadable code and
offering support for it.
So I must
ask people to restrict their comments to the actual problem that this PEP is
trying to solve.
Sorry, but we can't dismiss the side-effects. Learning enough
CS-oriented technical English to actually read and write code and
documentation is not such a big deal - even I managed to do so, and I'm
a bit impaired when it comes to foreign languages.

May 13 '07 #22

P: n/a
On May 13, 8:49 pm, Michael Torrie <torr...@chem.byu.edu> wrote:
On Sun, 2007-05-13 at 21:01 +0200, Stefan Behnel wrote:
For example, I could write
def zieheDreiAbVon(wert):
return zieheAb(wert, 3)
and most people on earth would not have a clue what this is good for. However,
someone who is fluent enough in German could guess from the names what this does.
I do not think non-ASCII characters make this 'problem' any worse. So I must
ask people to restrict their comments to the actual problem that this PEP is
trying to solve.

I think non-ASCII characters make the problem far, far worse. While I
may not understand what the function is by its name in your example,
allowing non-ASCII characters makes it worse by forcing all would-be
code readers to have all kinds of necessary fonts just to view the
source code. Things like reporting exceptions too. At least in your
example I know the exception occurred in zieheDreiAbVon. But if that
identifier is some UTF-8 string, how do I go about finding it in my text
editor, or even reporting the message to the developers? I don't happen
to have that particular keymap installed in my linux system, so I can't
even type the letters!

So given that people can already transliterate their language for use as
identifiers, I think avoiding non-ASCII character sets is a good idea.
ASCII is simply the lowest common denominator and is supported by *all*
configurations and locales on all developers' systems.
Perhaps there could be the option of typing and showing characters as
\uxxxx, e.g. \u00FC instead of ü (u-umlaut), or showing them in a
different colour if they're not in a specified set.
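That display option is easy to prototype. Here is a minimal sketch in today's Python 3 (the helper name and exact escape format are my own illustration, not anything proposed in the PEP):

```python
def escape_non_ascii(identifier):
    """Show every non-ASCII character in an identifier as a \\uXXXX escape."""
    return "".join(
        ch if ord(ch) < 128 else "\\u%04X" % ord(ch)
        for ch in identifier
    )

# The identifier "grün" displays as gr\u00FCn in an ASCII-only environment.
print(escape_non_ascii("gr\u00fcn"))
```

An editor or traceback printer could apply the same mapping before rendering, so the identifier stays identifiable even without the right fonts installed.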

May 13 '07 #23

P: n/a
Stefan Behnel wrote:
Martin v. Löwis wrote:
>>PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
py*********@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り*
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural background
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?
- would you use them if it was possible to do so? in what cases?

To make it clear: this PEP considers "identifiers written with non-ASCII
characters", not "identifiers named in a non-english language".
You cannot just claim that these are two totally distinct issues and get
away with it. The fact is that non-English identifiers are already a
bad thing when it comes to sharing and cooperation, and it's obvious
that non-ASCII glyphs can only make things worse - since it's obvious
that people willing to use such a "feature" *won't* do it to spell
English identifiers anyway.
While the first is already allowed as long as the transcription uses only
ASCII characters, the second is currently forbidden and is what this PEP is about.

Now, I am not a strong supporter (most public code will use English
identifiers anyway) but we should not forget that Python supports encoding
declarations in source files and thus has much cleaner support for non-ASCII
source code than, say, Java. So, introducing non-ASCII identifiers is just a
small step further.
I would certainly not qualify this as a "small" step.
Disallowing this does *not* guarantee in any way that
identifiers are understandable for English native speakers.
I'm not an English native speaker. And there's more than a subtle
distinction between "not guaranteeing" and "encouraging".
It only guarantees
that identifiers are always *typable* by people who have access to latin
characters on their keyboard. A rather small advantage, I'd say.

The capability of a Unicode-aware language to express non-English identifiers
in a non-ASCII encoding totally makes sense to me.
It does of course make sense (at least if you add support for non-English,
non-ASCII translation of the *whole* language - keywords, builtins and
the whole standard lib included). But it's still a very bad idea IMHO.
May 13 '07 #24

P: n/a
On May 13, 11:44 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
python-3...@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り*
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural background
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?
- would you use them if it was possible to do so? in what cases?

Regards,
Martin

PEP: 3131
Title: Supporting Non-ASCII Identifiers
Version: $Revision: 55059 $
Last-Modified: $Date: 2007-05-01 22:34:25 +0200 (Di, 01 Mai 2007) $
Author: Martin v. Löwis <mar...@v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 1-May-2007
Python-Version: 3.0
Post-History:

Abstract
========

This PEP suggests to support non-ASCII letters (such as accented
characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.

Rationale
=========

Python code is written by many people in the world who are not familiar
with the English language, or even well-acquainted with the Latin
writing system. Such developers often desire to define classes and
functions with names in their native languages, rather than having to
come up with an (often incorrect) English translation of the concept
they want to name.

For some languages, common transliteration systems exist (in particular,
for the Latin-based writing systems). For other languages, users have
larger difficulties to use Latin to write their native words.

Common Objections
=================

Some objections are often raised against proposals similar to this one.

People claim that they will not be able to use a library if to do so
they have to use characters they cannot type on their keyboards.
However, it is the choice of the designer of the library to decide on
various constraints for using the library: people may not be able to use
the library because they cannot get physical access to the source code
(because it is not published), or because licensing prohibits usage, or
because the documentation is in a language they cannot understand. A
developer wishing to make a library widely available needs to make a
number of explicit choices (such as publication, licensing, language
of documentation, and language of identifiers). It should always be the
choice of the author to make these decisions - not the choice of the
language designers.

In particular, projects wishing to have wide usage probably might want
to establish a policy that all identifiers, comments, and documentation
is written in English (see the GNU coding style guide for an example of
such a policy). Restricting the language to ASCII-only identifiers does
not enforce comments and documentation to be English, or the identifiers
actually to be English words, so an additional policy is necessary,
anyway.

Specification of Language Changes
=================================

The syntax of identifiers in Python will be based on the Unicode
standard annex UAX-31 [1]_, with elaboration and changes as defined
below.

Within the ASCII range (U+0001..U+007F), the valid characters for
identifiers are the same as in Python 2.5. This specification only
introduces additional characters from outside the ASCII range. For
other characters, the classification uses the version of the Unicode
Character Database as included in the ``unicodedata`` module.

The identifier syntax is ``<ID_Start> <ID_Continue>*``.

``ID_Start`` is defined as all characters having one of the general
categories uppercase letters (Lu), lowercase letters (Ll), titlecase
letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers
(Nl), plus the underscore (XXX what are "stability extensions" listed in
UAX 31).

``ID_Continue`` is defined as all characters in ``ID_Start``, plus
nonspacing marks (Mn), spacing combining marks (Mc), decimal number
(Nd), and connector punctuations (Pc).

All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.
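(The normalization rule quoted above can be checked directly; a small Python 3 illustration, not part of the PEP text itself:)

```python
import unicodedata

# "changé" written two ways: with a precomposed é (U+00E9) versus
# e + combining acute accent (U+0301).  As raw strings they differ...
composed = "chang\u00e9"
decomposed = "change\u0301"
assert composed != decomposed

# ...but after NFC normalization, as the PEP prescribes for
# identifiers, the two spellings compare equal.
nfc = lambda s: unicodedata.normalize("NFC", s)
assert nfc(composed) == nfc(decomposed)
print(nfc(decomposed) == composed)   # True
```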

Policy Specification
====================

As an addition to the Python Coding style, the following policy is
prescribed: All identifiers in the Python standard library MUST use
ASCII-only identifiers, and SHOULD use English words wherever feasible.

As an option, this specification can be applied to Python 2.x. In that
case, ASCII-only identifiers would continue to be represented as byte
string objects in namespace dictionaries; identifiers with non-ASCII
characters would be represented as Unicode strings.

Implementation
==============

The following changes will need to be made to the parser:

1. If a non-ASCII character is found in the UTF-8 representation of the
   source code, a forward scan is made to find the first ASCII
   non-identifier character (e.g. a space or punctuation character)

2. The entire UTF-8 string is passed to a function to normalize the
   string to NFC, and then verify that it follows the identifier syntax.
   No such callout is made for pure-ASCII identifiers, which continue to
   be parsed the way they are today.

3. If this specification is implemented for 2.x, reflective libraries
   (such as pydoc) must be verified to continue to work when Unicode
   strings appear in ``__dict__`` slots as keys.

References
==========

.. [1] http://www.unicode.org/reports/tr31/

Copyright
=========

This document has been placed in the public domain.
I don't think that supporting non-ASCII characters for identifiers
would cause any problem. Most people won't use it anyway. People who
use non-English identifiers for their project and hope for it to be
popular worldwide will probably just fail because of their foolish
coding style policy choice. I put that kind of choice in the same
ballpark as deciding to use Hungarian notation for Python code.

As for malicious patch submission, I think this is a non-issue.
Designing a tool to detect any non-ASCII identifier in a file
should be a trivial script to write.
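Something like the following would do it - a sketch using the stdlib tokenizer in present-day Python 3 (where such identifiers already parse); `non_ascii_names` is a made-up helper name:

```python
import io
import tokenize

def non_ascii_names(source):
    """Yield (line_number, name) for every identifier that is not pure ASCII."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type == tokenize.NAME and not tok.string.isascii():
            yield tok.start[0], tok.string

code = "z\u00e4hler = 1\ncounter = 2\n"
print(list(non_ascii_names(code)))   # [(1, 'zähler')]
```

A project could run this over incoming patches and reject any file that produces output.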

I say that if there is a demand for it, let's do it.

May 13 '07 #25

P: n/a
Jarek Zgoda <jz****@o2.usun.pl> writes:
Martin v. Löwis wrote:
>So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?

No, because "programs must be written for people to read, and only
incidentally for machines to execute". Using anything other than "lowest
common denominator" (ASCII) will restrict accessibility of code. This is
not literature, which requires qualified translators to get the text
from Hindi (or Persian, or Chinese, or Georgian, or...) to Polish.

While I can read the code with Hebrew, Russian or Greek names
transliterated to ASCII, I would not be able to read such code in its native form.
Who or what would force you to? Do you currently have to deal with Hebrew,
Russian or Greek names transliterated into ASCII? I don't, and I suspect this
whole panic about everyone suddenly having to deal with code written in Kanji,
Klingon, hieroglyphs etc. is unfounded -- such code would drastically
reduce its own "fitness" (much more so than the ASCII-transliterated Chinese,
Hebrew and Greek code I never seem to come across), so I think the chances
that it will be thrust upon you (or anyone else in this thread) are minuscule.
Plenty of programming languages already support unicode identifiers, so if
there is any rational basis for this fear it shouldn't be hard to come up with
-- where is it?

'as

BTW, I'm not sure if you don't underestimate your own intellectual faculties
if you think you couldn't cope with Greek or Russian characters. On the other hand
I wonder if you don't overestimate your ability to reasonably deal with code
written in a completely foreign language, as long as it's ASCII -- for anything
of nontrivial length, surely doing anything with such code would already be
orders of magnitude harder?

May 13 '07 #26

P: n/a
Josiah Carlson wrote:
On the other hand, the introduction of some 60k+ valid unicode glyphs
into the set of characters that can be seen as a name in Python would
make any such attempts by anyone who is not a native speaker (and even
native speakers in the case of the more obscure Kanji glyphs) an
exercise in futility.
So you gather up a list of identifiers and send it out for translation. Having
actual Kanji glyphs instead of a mix of transliterations and bad English will only
make that easier.

That won't even cost you anything, since you were already having docstrings
translated, along with comments and documentation, right?
But this issue isn't limited to different characters sharing glyphs!
It's also about being able to type names to use them in your own code
(generally very difficult if not impossible for many non-Latin
characters), or even be able to display them.
For display, tell your editor the utf-8 source file is really latin-1. For
entry, copy-paste.

- Anders
May 13 '07 #27

P: n/a
Bruno Desthuilliers <bd*****************@free.quelquepart.fr> wrote:
Disallowing this does *not* guarantee in any way that
identifiers are understandable for English native speakers.

I'm not an English native speaker. And there's more than a subtle
distinction between "not garantying" and "encouraging".
I agree with Bruno and the many others who have expressed disapproval
for this idea -- and I am not an English native speaker, either (and
neither, it seems to me, are many others who dislike this PEP). The
mild pleasure of using accented letters in code "addressed strictly to
Italian-speaking audiences and never intended to be of any use to
anybody not speaking Italian" (should I ever desire to write such code)
pales in comparison with the disadvantages, many of which have already
been analyzed or at least mentioned.

Homoglyphic characters _introduced by accident_ should not be discounted
as a risk, as, it seems to me, was done early in this thread after the
issue had been mentioned. In the past, it has happened to me to
erroneously introduce such homoglyphs in a document I was preparing with
a word processor, by a slight error in the use of the system-provided
way for inserting characters not present on the keyboard; I found out
when later I went looking for the name I _thought_ I had input (but I
was looking for it spelled with the "right" glyph, not the one I had
actually used which looked just the same) and just could not find it.

On that occasion, suspecting I had mistyped in some way or other, I
patiently tried looking for "pieces" of the word in question, eventually
locating it with just a mild amount of aggravation when I finally tried
a piece without the offending character. But when something similar
happens to somebody using a sufficiently fancy text editor to input
source in a programming language allowing arbitrary Unicode letters in
identifiers, the damage (the sheer waste of developer time) can be much
more substantial -- there will be two separate identifiers around, both
looking exactly like each other but actually distinct, and unbounded
amount of programmer time can be spent chasing after this extremely
elusive and tricky bug -- why doesn't a rebinding appear to "take", etc.
With some copy-and-paste during development and attempts at debugging,
several copies of each distinct version of the identifier can be spread
around the code, further hampering attempts at understanding.
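The failure mode Alex describes is easy to reproduce. A Python 3 sketch (the variable names are mine, and the Cyrillic letter stands in for any accidental homoglyph):

```python
import unicodedata

latin = "data"                    # all Latin letters
lookalike = "d\u0430ta"           # Cyrillic small a (U+0430) in the middle

# The two names render identically in most fonts, and NFC
# normalization does not merge them: they stay distinct identifiers.
assert latin != lookalike
assert (unicodedata.normalize("NFC", latin)
        != unicodedata.normalize("NFC", lookalike))

# In a namespace they silently become two separate bindings --
# exactly the "rebinding doesn't appear to take" bug described above.
namespace = {}
namespace[latin] = 1
namespace[lookalike] = 2
print(len(namespace))             # 2
```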
Alex
May 13 '07 #28

P: n/a
On Sun, 13 May 2007 17:44:39 +0200, "Martin v. Löwis" wrote:

[cut]

I'm from Italy, and I can say that some thoughts by Martin v. Löwis are
quite right. It's pretty easy to see code that uses "English" identifiers
and comments, but they're not really English - many times, they're just
"Englishized" versions of the Italian word. They might lure a real English
reader into an error rather than help him understand what the name really
stands for. It would be better to let the programmer pick the language he
or she prefers, without restrictions.

The patch problem doesn't seem a real issue to me, because it's the project
admin who can pick the encoding, and he could easily refuse any
patch that doesn't conform to the standards he wants.

BTW, there're a couple of issues that should be solved; even though I could
do with iso-8859-1, I usually pick utf-8 as the preferred encoding for my
files, because I found it more portable and more compatible with different
editors and IDE (I don't know if I just found some bugs in some specific
software, but I had problems with accented characters when switching
environment from Win to Linux, especially when reading/writing to and from
non-native FS, e.g. reading files from a ntfs disk from linux, or reading
an ext2 volume from Windows) on various platforms.

By the way, I would highly dislike anybody submitting a patch that contains
identifiers other than ASCII or iso-8859-1. Hence, I think there should be
a way, a kind of directive or something like that, to constrain the identifiers
charset to a 'subset' of the 'global' one.

Also, there should be a way to convert source files in any 'exotic'
encoding to a pseudo-intelligible encoding for any reader, a kind of
transliterating (is that a proper English word?) system out-of-the-box, not
requiring any other tool that's not included in the Python distro. This
will let people retain their usual working environments even though
they're dealing with source code with identifiers in a really different
charset.

--
Alan Franzoni <al***************@gmail.com>
-
Togli .xyz dalla mia email per contattarmi.
Remove .xyz from my address in order to contact me.
-
GPG Key Fingerprint (Key ID = FE068F3E):
5C77 9DC3 BD5B 3A28 E7BC 921A 0255 42AA FE06 8F3E
May 13 '07 #29

P: n/a
"Martin v. Löwis" <ma****@v.loewis.de> writes:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
py*********@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り*
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural background
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported?
Yes.
why?
Because not everyone speaks English, not all languages can be losslessly
transliterated to ASCII, and because it's unreasonable to drastically restrict the
domain of things that can be conveniently expressed in a language that's also
targeted at a non-professional programmer audience.

I'm also not aware of any horror stories from languages which do already allow
unicode identifiers.
- would you use them if it was possible to do so?
Possibly.
in what cases?
Maybe mathematical code (greek letters) or code that is very culture and
domain specific (say code doing Japanese tax forms).

'as
May 13 '07 #30

P: n/a
Michael Torrie wrote:
>
So given that people can already transliterate their language for use as
identifiers, I think avoiding non-ASCII character sets is a good idea.
Transliteration makes people choose bad variable names, I see it all the time
with Danish programmers. Say e.g. the most descriptive name for a process is
"kør forlæns" (run forwards). But "koer_forlaens" is ugly, so instead he'll
write "run_fremad", combining an English word with a slightly less appropriate
Danish word. Sprinkle in some English spelling errors and badly-chosen English
words, and you have the sorry state of the art that is today.

- Anders
May 13 '07 #31

P: n/a
On Sun, 13 May 2007 15:35:15 -0700, Alex Martelli wrote:
Homoglyphic characters _introduced by accident_ should not be discounted
as a risk
....
But when something similar
happens to somebody using a sufficiently fancy text editor to input
source in a programming language allowing arbitrary Unicode letters in
identifiers, the damage (the sheer waste of developer time) can be much
more substantial -- there will be two separate identifiers around, both
looking exactly like each other but actually distinct, and unbounded
amount of programmer time can be spent chasing after this extremely
elusive and tricky bug -- why doesn't a rebinding appear to "take", etc.
With some copy-and-paste during development and attempts at debugging,
several copies of each distinct version of the identifier can be spread
around the code, further hampering attempts at understanding.

How is that different from misreading "disk_burnt = True" as "disk_bumt =
True"? In the right (or perhaps wrong) font, like the ever-popular Arial,
the two can be visually indistinguishable. Or "call" versus "cal1"?

Surely the correct solution is something like pylint or pychecker? Or
banning the use of lower-case L and digit 1 in identifiers. I'm good with
both.
--
Steven.

May 13 '07 #32

P: n/a
Alex Martelli wrote:
>
Homoglyphic characters _introduced by accident_ should not be discounted
as a risk, as, it seems to me, was done early in this thread after the
issue had been mentioned. In the past, it has happened to me to
erroneously introduce such homoglyphs in a document I was preparing with
a word processor, by a slight error in the use of the system- provided
way for inserting characters not present on the keyboard; I found out
when later I went looking for the name I _thought_ I had input (but I
was looking for it spelled with the "right" glyph, not the one I had
actually used which looked just the same) and just could not find it.
There's any number of things to be done about that.
1. # -*- encoding: ascii -*-
(I'd like to see you sneak those homoglyphic characters past *that*.)
2. pychecker and pylint - I'm sure you realise what they could do for you.
3. Use a font that doesn't have those characters or deliberately makes them
distinct (that could help web browsing safety too).

I'm not discounting the problem, I just don't believe it's a big one. Can we
choose a codepoint subset that doesn't have these dupes?

- Anders
May 13 '07 #33

P: n/a
Alexander Schmolck <a.********@gmail.com> writes:
Plenty of programming languages already support unicode identifiers,
Could you name a few? Thanks.
May 13 '07 #34

P: n/a
On Sun, 13 May 2007 10:52:12 -0700, Paul Rubin wrote:
"Martin v. Löwis" <ma****@v.loewis.de> writes:
>This is a commonly-raised objection, but I don't understand why people
see it as a problem. The phishing issue surely won't apply, as you
normally don't "click" on identifiers, but rather type them. In a
phishing case, it is normally difficult to type the fake character
(because the phishing relies on you mistaking the character for another
one, so you would type the wrong identifier).

It certainly does apply, if you're maintaining a program and someone
submits a patch. In that case you neither click nor type the
character. You'd normally just make sure the patched program passes
the existing test suite, and examine the patch on the screen to make
sure it looks reasonable. The phishing possibilities are obvious.
Not to me, I'm afraid. Can you explain how it works? A phisher might be
able to fool a casual reader, but how does he fool the compiler into
executing the wrong code?

As for project maintainers, surely a patch using some unexpected Unicode
locale would fail the "looks reasonable" test? That could even be
automated -- if the patch uses an unexpected "#-*- coding: blah" line, or
includes characters outside of a pre-defined range, ring alarm bells.
("Why is somebody patching my Turkish module in Korean?")
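That automated check is only a few lines. A hypothetical sketch (the allowed-coding set is a stand-in project policy, and `audit_patch` is my own name):

```python
import re

ALLOWED_CODINGS = {"ascii", "utf-8"}

def audit_patch(text):
    """Return warnings for unexpected coding lines or non-ASCII characters."""
    warnings = []
    m = re.search(r"coding[:=]\s*([-\w.]+)", text)
    if m and m.group(1).lower() not in ALLOWED_CODINGS:
        warnings.append("unexpected coding declaration: " + m.group(1))
    for lineno, line in enumerate(text.splitlines(), 1):
        if any(ord(ch) > 127 for ch in line):
            warnings.append("non-ASCII character on line %d" % lineno)
    return warnings

print(audit_patch("# -*- coding: koi8-r -*-\nx = 1\n"))
# ['unexpected coding declaration: koi8-r']
```

Wired into a patch tracker, this would ring the alarm bells before any human review happens.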

--
Steven

May 13 '07 #35

P: n/a
In <ma***************************************@python.org>, Michael Torrie
wrote:
I think non-ASCII characters make the problem far, far worse. While I
may not understand what the function is by its name in your example,
allowing non-ASCII characters makes it worse by forcing all would-be
code readers to have all kinds of necessary fonts just to view the
source code. Things like reporting exceptions too. At least in your
example I know the exception occurred in zieheDreiAbVon. But if that
identifier is some UTF-8 string, how do I go about finding it in my text
editor, or even reporting the message to the developers? I don't happen
to have that particular keymap installed in my linux system, so I can't
even type the letters!
You find it in the sources by the line number from the traceback and the
letters can be copy'n'pasted if you don't know how to input them with your
keymap or keyboard layout.

Ciao,
Marc 'BlackJack' Rintsch
May 13 '07 #36

P: n/a
Thus spake "Martin v. Löwis" (ma****@v.loewis.de):
- should non-ASCII identifiers be supported? why?
No! I believe that:

- The security implications have not been sufficiently explored. I don't
want to be in a situation where I need to mechanically "clean" code (say,
from a submitted patch) with a tool because I can't reliably verify it by
eye. We should learn from the plethora of Unicode-related security
problems that have cropped up in the last few years.
- Non-ASCII identifiers would be a barrier to code exchange. If I know
Python I should be able to easily read any piece of code written in it,
regardless of the linguistic origin of the author. If PEP 3131 is
accepted, this will no longer be the case. A Python project that uses
Urdu identifiers throughout is just as useless to me, from a
code-exchange point of view, as one written in Perl.
- Unicode is harder to work with than ASCII in ways that are more important
in code than in human-language text. Human eyes don't care if two
visually indistinguishable characters are used interchangeably.
Interpreters do. There is no doubt that people will accidentally
introduce mistakes into their code because of this.

- would you use them if it was possible to do so? in what cases?
No.


Regards,

Aldo

--
Aldo Cortesi
al**@nullcube.com
http://www.nullcube.com
Mob: 0419 492 863
May 14 '07 #37

P: n/a
Steven D'Aprano <st***@REMOVE.THIS.cybersource.com.au> writes:
It certainly does apply, if you're maintaining a program and someone
submits a patch. In that case you neither click nor type the
character. You'd normally just make sure the patched program passes
the existing test suite, and examine the patch on the screen to make
sure it looks reasonable. The phishing possibilities are obvious.

Not to me, I'm afraid. Can you explain how it works? A phisher might be
able to fool a casual reader, but how does he fool the compiler into
executing the wrong code?
The compiler wouldn't execute the wrong code; it would execute the code
that the phisher intended it to execute. That might be different from
what it looked like to the reviewer.
May 14 '07 #38

P: n/a

"Alan Franzoni" <al*******************@geemail.invalid> wrote in message
news:1u*****************************@40tude.net...
On Sun, 13 May 2007 17:44:39 +0200, "Martin v. Löwis" wrote:
|Also, there should be a way to convert source files in any 'exotic'
encoding to a pseudo-intelligible encoding for any reader, a kind of
transliterating (is that a proper English word?) system out-of-the-box, not
requiring any other tool that's not included in the Python distro. This
will let people retain their usual working environments even though
they're dealing with source code with identifiers in a really different
charset.
=============================

When I proposed that PEP3131 include transliteration support, Martin
rejected the idea.

tjr

May 14 '07 #39

P: n/a
Paul Rubin wrote:
>Plenty of programming languages already support unicode identifiers,

Could you name a few? Thanks.
C#, Java, Ecmascript, Visual Basic.

Neil
May 14 '07 #40

P: n/a
On Mon, 14 May 2007 09:42:13 +1000, Aldo Cortesi wrote:
I don't
want to be in a situation where I need to mechanically "clean"
code (say, from a submitted patch) with a tool because I can't
reliably verify it by eye.
But you can't reliably verify by eye. That's orders of magnitude more
difficult than debugging by eye, and we all know that you can't reliably
debug anything but the most trivial programs by eye.

If you're relying on cursory visual inspection to recognize harmful code,
you're already vulnerable to trojans.
We should learn from the plethora of
Unicode-related security problems that have cropped up in the last
few years.
Of course we should. And one of the things we should learn is when and
how Unicode is a risk, and not imagine that Unicode is some sort of
mystical contamination that creates security problems just by being used.
- Non-ASCII identifiers would be a barrier to code exchange. If I
know
Python I should be able to easily read any piece of code written
in it, regardless of the linguistic origin of the author. If PEP
3131 is accepted, this will no longer be the case.
But it isn't the case now, so that's no different. Code exchange
regardless of human language is a nice principle, but it doesn't work in
practice. How do you use "any piece of code ... regardless of the
linguistic origin of the author" when you don't know what the functions
and classes and arguments _mean_?

Here's a tiny doc string from one of the functions in the standard
library, translated (more or less) to Portuguese. If you can't read
Portuguese at least well enough to get by, how could you possibly use
this function? What would you use it for? What does it do? What arguments
does it take?

def dirsorteinsercao(a, x, baixo=0, elevado=None):
"""da o artigo x insercao na lista a, e mantem-na a
supondo classificado e classificado. Se x estiver ja em a,
introduza-o a direita do x direita mais. Os args opcionais
baixos (defeito 0) e elevados (len(a) do defeito) limitam
a fatia de a a ser procurarado.
"""
# not a non-ASCII character in sight (unless I missed one...)

[Apologies to Portuguese speakers for the dog's breakfast I'm sure
Babelfish and I made of the translation.]

The particular function I chose is probably small enough and obvious
enough that you could work out what it does just by following the
algorithm. You might even be able to guess what it is, because Portuguese
is similar enough to other Latin languages that most people can guess
what some of the words might mean (elevados could be height, maybe?). Now
multiply this difficulty by a thousand for a non-trivial module with
multiple classes and dozens of methods and functions. And you might not
even know what language it is in.

No, code exchange regardless of natural language is a nice principle, but
it doesn't exist except in very special circumstances.
A Python
project that uses Urdu identifiers throughout is just as useless
to me, from a code-exchange point of view, as one written in Perl.
That's because you can't read it, not because it uses Unicode. It could
be written entirely in ASCII, and still be unreadable and impossible to
understand.
- Unicode is harder to work with than ASCII in ways that are more
important
in code than in human-language text. Human eyes don't care if two
visually indistinguishable characters are used interchangeably.
Interpreters do. There is no doubt that people will accidentally
introduce mistakes into their code because of this.
That's no different from typos in ASCII. There's no doubt that we'll give
the same answer we've always given for this problem: unit tests, pylint
and pychecker.

--
Steven.
May 14 '07 #41

P: n/a
On Sun, 13 May 2007 17:59:23 -0700, Paul Rubin wrote:
Steven D'Aprano <st***@REMOVE.THIS.cybersource.com.au> writes:
It certainly does apply, if you're maintaining a program and someone
submits a patch. In that case you neither click nor type the
character. You'd normally just make sure the patched program passes
the existing test suite, and examine the patch on the screen to make
sure it looks reasonable. The phishing possibilities are obvious.

Not to me, I'm afraid. Can you explain how it works? A phisher might be
able to fool a casual reader, but how does he fool the compiler into
executing the wrong code?

The compiler wouldn't execute the wrong code; it would execute the code
that the phisher intended it to execute. That might be different from
what it looked like to the reviewer.
How? Just repeating in more words your original claim doesn't explain a
thing.

It seems to me that your argument is, only slightly exaggerated, akin to
the following:

"Unicode identifiers are bad because phishers will no longer need to
write call_evil_func() but can write call_ƎvĬľ_func() instead."

Maybe I'm naive, but I don't see how giving phishers the ability to
insert a call to ƒunction() in some module is any more dangerous than
them inserting a call to function() instead.

If I'm mistaken, please explain why I'm mistaken, not just repeat your
claim in different words.
--
Steven.
May 14 '07 #42

P: n/a
Neil Hodgson <ny*****************@gmail.com> writes:
Plenty of programming languages already support unicode identifiers,
Could you name a few? Thanks.
C#, Java, Ecmascript, Visual Basic.
Java (and C#?) have mandatory declarations so homoglyphic identifiers aren't
nearly as bad a problem. Ecmascript is a horrible bug-prone language and
we want Python to move away from resembling it, not towards it. VB: well,
same as Ecmascript, I guess.
May 14 '07 #43

P: n/a
Steven D'Aprano <st****@REMOVE.THIS.cybersource.com.au> writes:
If I'm mistaken, please explain why I'm mistaken, not just repeat your
claim in different words.
if user_entered_password != stored_password_from_database:
    password_is_correct = False
...
if password_is_correct:
    log_user_in()

Does "password_is_correct" refer to the same variable in both places?
May 14 '07 #44

P: n/a
On Sun, 13 May 2007 20:12:23 -0700, Paul Rubin wrote:
Steven D'Aprano <st****@REMOVE.THIS.cybersource.com.au> writes:
>If I'm mistaken, please explain why I'm mistaken, not just repeat your
claim in different words.

if user_entered_password != stored_password_from_database:
    password_is_correct = False
...
if password_is_correct:
    log_user_in()

Does "password_is_correct" refer to the same variable in both places?
No way of telling without a detailed code inspection. Who knows what
happens in the ... ? If a black hat has access to the code, he could
insert anything he liked in there, ASCII or non-ASCII.

How is this a problem with non-ASCII identifiers? password_is_correct is
all ASCII. How can you justify saying that non-ASCII identifiers
introduce a security hole that already exists in all-ASCII Python?
--
Steven.
May 14 '07 #45

P: n/a
Steven D'Aprano <st****@REMOVE.THIS.cybersource.com.au> writes:
password_is_correct is all ASCII.
How do you know that? What steps did you take to ascertain it? Those
are steps you currently don't have to bother with.
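Those steps could at least be mechanized; a minimal sketch using the standard
tokenize module (the function name and the "patch" text are illustrative
only, not an endorsed tool):

```python
import io
import tokenize

def non_ascii_names(source):
    """Yield (line_number, name) for every identifier token in the
    source string that contains a character outside ASCII."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            yield tok.start[0], tok.string

# A two-line "patch" whose second name hides a Cyrillic first letter:
patch = "password_is_correct = True\nрassword_is_correct = False\n"
print(list(non_ascii_names(patch)))  # flags only the lookalike on line 2
```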
May 14 '07 #46

P: n/a
Paul Rubin wrote:
Neil Hodgson <ny*****************@gmail.com> writes:
>>>>Plenty of programming languages already support unicode identifiers,

Could you name a few? Thanks.

C#, Java, Ecmascript, Visual Basic.


Java (and C#?) have mandatory declarations so homoglyphic identifiers aren't
nearly as bad a problem. Ecmascript is a horrible bug-prone language and
we want Python to move away from resembling it, not towards it. VB: well,
same as Ecmascript, I guess.
That's the first substantive objection I've seen. In a language
without declarations, trouble is more likely. Consider the maintenance
programmer who sees a variable name and retypes it elsewhere, not realizing
the glyphs are different even though they look the same. In a language
with declarations, that generates a compile-time error. In Python, it
doesn't.
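
For what it's worth, PEP 3131 does specify NFKC normalization of
identifiers, which folds compatibility variants but not cross-script
lookalikes; a quick sketch of the difference:

```python
import unicodedata

# NFKC (the normalization PEP 3131 specifies for identifiers) folds
# compatibility characters, e.g. FULLWIDTH LATIN SMALL LETTER P:
assert unicodedata.normalize("NFKC", "ｐassword") == "password"

# ...but it does not fold cross-script homoglyphs: CYRILLIC SMALL
# LETTER ER (U+0440) stays distinct from Latin "p" after normalization.
assert unicodedata.normalize("NFKC", "рassword") != "password"
```

So the maintenance-programmer scenario above survives normalization intact.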

John Nagle
May 14 '07 #47

P: n/a
Thus spake Steven D'Aprano (st****@REMOVE.THIS.cybersource.com.au):
If you're relying on cursory visual inspection to recognize harmful code,
you're already vulnerable to trojans.
What a daft thing to say. How do YOU recognize harmful code in a patch
submission? Perhaps you blindly apply patches, and then run your test suite on
a quarantined system, with an instrumented operating system to allow you to
trace process execution, and then perform a few weeks' worth of analysis on the
data?

Me, I try to understand a patch by reading it. Call me old-fashioned.

Code exchange regardless of human language is a nice principle, but it
doesn't work in practice.
And this is clearly bunk. I have come across code with transliterated
identifiers and comments in a different language, and while understanding was
hampered it wasn't impossible.

That's no different from typos in ASCII. There's no doubt that we'll give
the same answer we've always given for this problem: unit tests, pylint
and pychecker.
A typo that can't be detected visually is a fundamentally different problem
from an ASCII typo, as many people in this thread have pointed out.

Regards,


Aldo

--
Aldo Cortesi
al**@nullcube.com
http://www.nullcube.com
Mob: 0419 492 863
May 14 '07 #48

P: n/a
It should be noted that the Python community may use other forums, in
other languages. They would likely be a lot more enthusiastic about
this PEP than the usual crowd here (comp.lang.python).
Please spread the news.

Martin
May 14 '07 #49

P: n/a
Steven D'Aprano <st***@REMOVE.THIS.cybersource.com.au> wrote:
automated -- if the patch uses an unexpected "#-*- coding: blah" line, or
No need -- a separate PEP (also by Martin) makes UTF-8 the default
encoding, and UTF-8 can encode any Unicode character you like.
Alex
May 14 '07 #50

399 Replies
