
PEP 3131: Supporting Non-ASCII Identifiers

P: n/a
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
py*********@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り上げ
(hoping that the latter one means "sale").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural backgrounds
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?
- would you use them if it was possible to do so? in what cases?

Regards,
Martin
PEP: 3131
Title: Supporting Non-ASCII Identifiers
Version: $Revision: 55059 $
Last-Modified: $Date: 2007-05-01 22:34:25 +0200 (Di, 01 Mai 2007) $
Author: Martin v. Löwis <ma****@v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 1-May-2007
Python-Version: 3.0
Post-History:
Abstract
========

This PEP proposes to support non-ASCII letters (such as accented
characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.

Rationale
=========

Python code is written by many people in the world who are not familiar
with the English language, or even well-acquainted with the Latin
writing system. Such developers often desire to define classes and
functions with names in their native languages, rather than having to
come up with an (often incorrect) English translation of the concept
they want to name.

For some languages, common transliteration systems exist (in particular,
for the Latin-based writing systems). For other languages, users have
greater difficulty using Latin letters to write their native words.

Common Objections
=================

Some objections are often raised against proposals similar to this one.

People claim that they will not be able to use a library if to do so
they have to use characters they cannot type on their keyboards.
However, it is the choice of the designer of the library to decide on
various constraints for using the library: people may not be able to use
the library because they cannot get physical access to the source code
(because it is not published), or because licensing prohibits usage, or
because the documentation is in a language they cannot understand. A
developer wishing to make a library widely available needs to make a
number of explicit choices (such as publication, licensing, language
of documentation, and language of identifiers). It should always be the
choice of the author to make these decisions - not the choice of the
language designers.

In particular, projects wishing to have wide usage probably want to
establish a policy that all identifiers, comments, and documentation
are written in English (see the GNU coding style guide for an example of
such a policy). Restricting the language to ASCII-only identifiers does
not force comments and documentation to be in English, or the identifiers
actually to be English words, so an additional policy is necessary
anyway.

Specification of Language Changes
=================================

The syntax of identifiers in Python will be based on the Unicode
standard annex UAX-31 [1]_, with elaboration and changes as defined
below.

Within the ASCII range (U+0001..U+007F), the valid characters for
identifiers are the same as in Python 2.5. This specification only
introduces additional characters from outside the ASCII range. For
other characters, the classification uses the version of the Unicode
Character Database as included in the ``unicodedata`` module.

The identifier syntax is ``<ID_Start> <ID_Continue>*``.

``ID_Start`` is defined as all characters having one of the general
categories uppercase letters (Lu), lowercase letters (Ll), titlecase
letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers
(Nl), plus the underscore (XXX what are "stability extensions" listed in
UAX 31).

``ID_Continue`` is defined as all characters in ``ID_Start``, plus
nonspacing marks (Mn), spacing combining marks (Mc), decimal number
(Nd), and connector punctuations (Pc).

All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.
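
For illustration, the classification and the NFC rule above can be checked
directly with the ``unicodedata`` module (a minimal, non-normative sketch)::

    import unicodedata

    # The general category decides whether a character may start or only
    # continue an identifier; NFC normalization unifies equivalent spellings.
    print unicodedata.category(u'\u00e9')   # 'Ll' -- letter, may start a name
    print unicodedata.category(u'\u0301')   # 'Mn' -- combining mark, continue only
    print unicodedata.category(u'7')        # 'Nd' -- digit, continue only

    composed = u'\u00e9'      # e with acute accent, precomposed
    decomposed = u'e\u0301'   # e followed by COMBINING ACUTE ACCENT
    print unicodedata.normalize('NFC', decomposed) == composed   # True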

Policy Specification
====================

As an addition to the Python Coding style, the following policy is
prescribed: All identifiers in the Python standard library MUST use
ASCII-only identifiers, and SHOULD use English words wherever feasible.

As an option, this specification can be applied to Python 2.x. In that
case, ASCII-only identifiers would continue to be represented as byte
string objects in namespace dictionaries; identifiers with non-ASCII
characters would be represented as Unicode strings.

Implementation
==============

The following changes will need to be made to the parser:

1. If a non-ASCII character is found in the UTF-8 representation of the
source code, a forward scan is made to find the first ASCII
non-identifier character (e.g. a space or punctuation character)

2. The entire UTF-8 string is passed to a function to normalize the
string to NFC, and then verify that it follows the identifier syntax.
No such callout is made for pure-ASCII identifiers, which continue to
be parsed the way they are today.

3. If this specification is implemented for 2.x, reflective libraries
(such as pydoc) must be verified to continue to work when Unicode
strings appear in ``__dict__`` slots as keys.

References
==========

.. [1] http://www.unicode.org/reports/tr31/
Copyright
=========

This document has been placed in the public domain.
May 13 '07
399 Replies


P: n/a
On May 16, 11:09 pm, Gregor Horvath <g...@gregor-horvath.com> wrote:
sjdevn...@yahoo.com schrieb:
On May 16, 12:54 pm, Gregor Horvath <g...@gregor-horvath.com> wrote:
Istvan Albert schrieb:
So the solution is to forbid Chinese XP ?
Who said anything like that? It's just an example of surprising and
unexpected difficulties that may arise even when doing trivial things,
and that proponents do not seem to want to admit to.
Should computer programming only be easily accessible to a small fraction
of privileged individuals who had the luck to be born in the correct
countries?
Should the unfounded and maybe xenophilous fear of losing power and
control of a small number of those already privileged be a guide for
development?
Now that right there is your problem. You are reading a lot more into
this than you should. Losing power, xenophilus(?) fear, privileged
individuals,

just step back and think about it for a second, it's a PEP and people
have different opinions, it is very unlikely that there is some
generic sinister agenda that one must be subscribed to

i.

May 17 '07 #351

P: n/a
I'd suggest restricting identifiers under the rules of UTS-39,
profile 2, "Highly Restrictive". This limits mixing of scripts
in a single identifier; you can't mix Hebrew and ASCII, for example,
which prevents problems with mixing right to left and left to right
scripts. Domain names have similar restrictions.
That sounds interesting; however, I cannot find the document
you refer to. In TR 39 (also called Unicode Technical Standard #39),
at http://unicode.org/reports/tr39/ there is no mention
of numbered profiles, or "Highly Restrictive".

Looking at the document, it seems 3.1., "General Security Profile
for Identifiers" might apply. IIUC, xidmodifications.txt would
have to be taken into account.

I'm not quite sure what that means; apparently, a number of
characters (listed as restricted) should not be used in
identifiers. OTOH, it also adds HYPHEN-MINUS and KATAKANA
MIDDLE DOT - which surely shouldn't apply to Python
identifiers, no? (at least HYPHEN-MINUS already has a meaning
in Python, and cannot possibly be part of an identifier).

Also, mixed-script detection might be considered, but it is
not clear to me how to interpret the algorithm in section
5, plus it says that this is just one of the possible
algorithms.
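
For illustration only, a very crude stand-in for mixed-script detection
can be sketched with the unicodedata module, using the first word of each
character's Unicode name as an approximation of its script (the function
name is made up, and the real algorithm uses the Script property, which
unicodedata does not expose):

import unicodedata

def rough_scripts(identifier):
    # Collect the leading word of each letter's Unicode name, e.g.
    # "LATIN", "CYRILLIC", "GREEK". More than one entry in the set
    # suggests a mixed-script identifier.
    scripts = set()
    for c in identifier:
        if c.isalpha():
            name = unicodedata.name(c, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

print rough_scripts(u"paypal")        # one script: LATIN
print rough_scripts(u"p\u0430ypal")   # two scripts: LATIN and CYRILLIC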

Finally, Confusable Detection is difficult to perform on
a single identifier - it seems you need two of them to
find out whether they are confusable.

In any case, I added this as an open issue to the PEP.

Regards,
Martin
May 17 '07 #352

P: n/a
On May 17, 9:07 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
up. I interviewed about 20 programmers (none of them Python users), and
most took the position "I might not use it myself, but it surely
can't hurt having it, and there surely are people who would use it".
Typically when you ask people about esoteric features that seemingly
don't affect them but might be useful to someone, the majority will
say yes. It's simply common courtesy; it's not like they have to do
anything.

At the same time it takes some mental effort to analyze and understand
all the implications of a feature, and without taking that effort
"something" will always beat "nothing".

After the first time that your programmer friends need to fix a trivial
bug in a piece of code that does not display correctly in the terminal
I can assure you that their mellow acceptance will turn to something
entirely different.

i.

May 17 '07 #353

P: n/a
On May 17, 4:56 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
....
(look me in the eye and tell me that "def" is
an English word, or that "getattr" is one)
That's not quite fair. They are not English
words, but they are derived from English and
have a mnemonic value to English speakers that
they don't (or only accidentally) have for
non-English speakers.

May 17 '07 #354

P: n/a
Istvan Albert schrieb:
>
After the first time that your programmer friends need to fix a trivial
bug in a piece of code that does not display correctly in the terminal
I can assure you that their mellow acceptance will turn to something
entirely different.
Is there any difference for you in debugging these code snippets?

class Türstock(object):
    höhe = 0
    breite = 0
    tiefe = 0

    def _get_fläche(self):
        return self.höhe * self.breite

    fläche = property(_get_fläche)

#-----------------------------------

class Tuerstock(object):
    hoehe = 0
    breite = 0
    tiefe = 0

    def _get_flaeche(self):
        return self.hoehe * self.breite

    flaeche = property(_get_flaeche)
I can tell you that for me and for my costumers this makes a big difference.

Whether this PEP gets accepted or not I am going to use German
identifiers and you have to be frightened to death by that fact ;-)

Gregor
May 17 '07 #355

P: n/a
Istvan Albert wrote:
On May 17, 9:07 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>up. I interviewed about 20 programmers (none of them Python users), and
most took the position "I might not use it myself, but it surely
can't hurt having it, and there surely are people who would use it".

Typically when you ask people about esoteric features that seemingly
don't affect them but might be useful to someone, the majority will
say yes. It's simply common courtesy; it's not like they have to do
anything.

At the same time it takes some mental effort to analyze and understand
all the implications of a feature, and without taking that effort
"something" will always beat "nothing".
Indeed. For example, getattr() and friends now have to accept Unicode
arguments, and presumably to canonicalize correctly to avoid errors, and
treat equivalent Unicode and ASCII names as the same (question: if two
strings compare equal, do they refer to the same name in a namespace?).
After the first time that your programmer friends need to fix a trivial
bug in a piece of code that does not display correctly in the terminal
I can assure you that their mellow acceptance will turn to something
entirely different.
And pretty quickly, too. If anyone but Martin were the author of the
PEP I'd have serious doubts, but if he thinks it's worth proposing
there's at least a chance that it will eventually be implemented.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
------------------ Asciimercial ---------------------
Get on the web: Blog, lens and tag your way to fame!!
holdenweb.blogspot.com squidoo.com/pythonology
tagged items: del.icio.us/steve.holden/python
All these services currently offer free registration!
-------------- Thank You for Reading ----------------

May 17 '07 #356

P: n/a
Gregor Horvath wrote:
Istvan Albert schrieb:
>After the first time that your programmer friends need to fix a trivial
bug in a piece of code that does not display correctly in the terminal
I can assure you that their mellow acceptance will turn to something
entirely different.

Is there any difference for you in debugging these code snippets?

class Türstock(object):
    höhe = 0
    breite = 0
    tiefe = 0

    def _get_fläche(self):
        return self.höhe * self.breite

    fläche = property(_get_fläche)

#-----------------------------------

class Tuerstock(object):
    hoehe = 0
    breite = 0
    tiefe = 0

    def _get_flaeche(self):
        return self.hoehe * self.breite

    flaeche = property(_get_flaeche)
I can tell you that for me and for my costumers this makes a big difference.
So you are selling to the clothing market? [I think you meant
"customers". God knows I have no room to be snitty about other people's
typos. Just thought it might raise a smile].
Whether this PEP gets accepted or not I am going to use German
identifiers and you have to be frightened to death by that fact ;-)
That's fine - they will be at least as meaningful to you as my English
ones would be to your countrymen who don't speak English.

I think we should remember that while programs are about communication
there's no requirement for (most of) them to be universally comprehensible.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
------------------ Asciimercial ---------------------
Get on the web: Blog, lens and tag your way to fame!!
holdenweb.blogspot.com squidoo.com/pythonology
tagged items: del.icio.us/steve.holden/python
All these services currently offer free registration!
-------------- Thank You for Reading ----------------

May 17 '07 #357

P: n/a
>At the same time it takes some mental effort to analyze and understand
>all the implications of a feature, and without taking that effort
"something" will always beat "nothing".
Indeed. For example, getattr() and friends now have to accept Unicode
arguments, and presumably to canonicalize correctly to avoid errors, and
treat equivalent Unicode and ASCII names as the same (question: if two
strings compare equal, do they refer to the same name in a namespace?).
Actually, that is not an issue: In Python 3, there is no data type for
"ASCII string" anymore, so all __name__ attributes and __dict__ keys
are Unicode strings - regardless of whether this PEP gets accepted
or not (which it just did).
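
A tiny Python 3 sketch of that point (Python 3 syntax; the non-ASCII names
are made-up examples and only work once the PEP is implemented):

class Zähler:
    größe = 0

print(type(Zähler.__name__))   # <class 'str'> - always a Unicode string
print(sorted(vars(Zähler)))    # __dict__ keys are str (Unicode) as well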

Regards,
Martin
May 17 '07 #358

P: n/a
On May 17, 2:30 pm, Gregor Horvath <g...@gregor-horvath.com> wrote:
Istvan Albert schrieb:
After the first time that your programmer friends need to fix a trivial
bug in a piece of code that does not display correctly in the terminal
I can assure you that their mellow acceptance will turn to something
entirely different.

Is there any difference for you in debugging these code snippets?

class Türstock(object):
[snip]
class Tuerstock(object):
After finding a platform where those are different, I have to say
yes. Absolutely. In my normal setup they both display as "class
Tuerstock" (three letters 'T' 'u' 'e' starting the class name). If,
say, an exception was raised, it'd be fruitless for me to grep or
search for "Tuerstock" in the first one, and I might wind up wasting a
fair amount of time if a user emailed that to me before realizing that
the stack trace was just wrong. Even if I had extended character
support, there's no guarantee that all the users I'm supporting do.
If they do, there's no guarantee that some intervening email system
(or whatever) won't munge things.

With the second one, all my standard tools would work fine. My user's
setups will work with it. And there's a much higher chance that all
the intervening systems will work with it.

May 17 '07 #359

P: n/a
Martin v. Löwis:
... regardless of whether this PEP gets accepted
or not (which it just did).
Which version can we expect this to be implemented in?

Neil
May 17 '07 #360

P: n/a
Neil Hodgson schrieb:
Martin v. Löwis:
>... regardless of whether this PEP gets accepted
or not (which it just did).

Which version can we expect this to be implemented in?
The PEP says 3.0, and the planned implementation also targets
that release.

Regards,
Martin
May 17 '07 #361

P: n/a
On May 16, 6:38 pm, r...@yahoo.com wrote:
On May 16, 11:41 am, "sjdevn...@yahoo.com" <sjdevn...@yahoo.com>
wrote:
Christophe wrote:
....snip...
Who displays stack frames? Your code. Whose code includes unicode
identifiers? Your code. Whose fault is it to create a stack trace
display procedure that cannot handle unicode? You.
Thanks but no--I work with a _lot_ of code I didn't write, and looking
through stack traces from 3rd party packages is not uncommon.

Are you worried that some 3rd-party package you have
included in your software will have some non-ascii identifiers
buried in it somewhere? Surely that is easy to check for?
Far easier than checking that it doesn't have some trojan
code in it, it seems to me.
What do you mean, "check for"? If, say, numeric starts using math
characters (as has been suggested), I'm not exactly going to stop
using numeric. It'll still be a lot better than nothing, just
slightly less better than it used to be.
And I'm often not creating a stack trace procedure, I'm using the
built-in python procedure.
And I'm often dealing with mailing lists, Usenet, etc where I don't
know ahead of time what the other end's display capabilities are, how
to fix them if they don't display what I'm trying to send, whether
intervening systems will mangle things, etc.

I think we all are in this position. I always send plain
text mail to mailing lists, people I don't know etc. But
that doesn't mean that email software should be constrained
to only 7-bit plain text with no attachments! I frequently use
such capabilities when they are appropriate.
Sure. But when you're talking about maintaining code, there's a very
high value to having all the existing tools work with it whether
they're wide-character aware or not.
If your response is, "yes, but look at the problems HTML
email, virus-infected attachments, etc. cause", the situation
is not the same. You have little control over what kind of
email people send you but you do have control over what
code, libraries, patches, you choose to use in your
software.

If you want to use ascii-only, do it! Nobody is making
you deal with non-ascii code if you don't want to.
Yes. But it's not like this makes things so horribly awful that it's
worth my time to reimplement large external libraries. I remain at -0
on the proposal; it'll cause some headaches for the majority of
current Python programmers, but it may have some benefits to a
sizeable minority and may help bring in new coders. And it's not
going to cause flaming catastrophic death or anything.

May 17 '07 #362

P: n/a
On Sun, 13 May 2007 17:44:39 +0200, Martin v. Löwis wrote:
The syntax of identifiers in Python will be based on the Unicode
standard annex UAX-31 [1]_, with elaboration and changes as defined
below.

Within the ASCII range (U+0001..U+007F), the valid characters for
identifiers are the same as in Python 2.5. This specification only
introduces additional characters from outside the ASCII range. For
other characters, the classification uses the version of the Unicode
Character Database as included in the ``unicodedata`` module.

The identifier syntax is ``<ID_Start> <ID_Continue>*``.

``ID_Start`` is defined as all characters having one of the general
categories uppercase letters (Lu), lowercase letters (Ll), titlecase
letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers
(Nl), plus the underscore (XXX what are "stability extensions" listed in
UAX 31).

``ID_Continue`` is defined as all characters in ``ID_Start``, plus
nonspacing marks (Mn), spacing combining marks (Mc), decimal number
(Nd), and connector punctuations (Pc).
[...]

.. [1] http://www.unicode.org/reports/tr31/
First, to Martin: Thanks for writing this PEP.

While I have been reading both sides of this debate and finding both
sides reasonable and understandable in the main, I have several
questions which seem to not have been raised in this thread so far.

Currently, in Python 2.5, identifiers are specified as starting with
an upper- or lowercase letter or underscore ('_') with the following
"characters" of the identifier also optionally being a numerical digit
("0"..."9").

This current state seems easy to remember even if felt restrictive by
many.

Contrariwise, the referenced document "UAX-31" is a bit obscure to me
(which is not eased by the fact that various browsers render non-ASCII
characters differently or not at all depending on the setup and font
sets available). Further, a cursory perusal of the unicodedata module
seems to refer me back to the Unicode docs.

I note that UAX-31 seems to allow "ideographs" as ``ID_Start``, for
example. From my relative state of ignorance, several questions come
to mind:

1) Will this allow me to use, say, a "right-arrow" glyph (if I can
find one) to start my identifier?

2) Could an ``ID_Continue`` be used as an ``ID_Start`` if using a RTL
(reversed or "mirrored") identifier? (Probably not, but I don't know.)

3) Is or will there be a definitive and exhaustive listing (with
bitmap representations of the glyphs to avoid the font issues) of the
glyphs that the PEP 3131 would allow in identifiers? (Does this
question even make sense?)

I have long programmed in RPL and have appreciated being able to use,
say, a "right arrow" symbol to start a name of a function (e.g., "->R"
or "->HMS" where the '->' is a single, right-arrow glyph).[1]

While it is not clear that identifiers I may wish to use would still
be prohibited under PEP 3131, I vote:

+0

__________________________________________
[1] RPL (HP's Dr. William Wickes' language and environment circa the
1980s) allows for a few specific "non-ASCII" glyphs as the start of a
name. I have solved my problem with my Python "appliance computer"
project by having up to three representations for my names: Python 2.x
acceptable names as the actual Python identifier, a Unicode text
display exposed to the end user, and also if needed, a bitmap display
exposed to the end user. So -- IAGNI. :-)

--
Richard Hanson

May 17 '07 #363


P: n/a
Martin v. Löwis wrote:
Neil Hodgson schrieb:
>Martin v. Löwis:
>>... regardless of whether this PEP gets accepted
or not (which it just did).
Which version can we expect this to be implemented in?

The PEP says 3.0, and the planned implementation also targets
that release.
Can we take it this change *won't* be backported to the 2.X series?

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
------------------ Asciimercial ---------------------
Get on the web: Blog, lens and tag your way to fame!!
holdenweb.blogspot.com squidoo.com/pythonology
tagged items: del.icio.us/steve.holden/python
All these services currently offer free registration!
-------------- Thank You for Reading ----------------

May 18 '07 #365

P: n/a
"Martin v. Löwis" <ma****@v.loewis.de> wrote:
>One possible reason is that the tools processing the program would not
know correctly what encoding the source file is in, and would fail
when they guessed the encoding incorrectly. For comments, that is not
a problem, as an incorrect encoding guess has no impact on the meaning
of the program (if the compiler is able to read over the comment
in the first place).
Possibly. One Java program I remember had Japanese comments encoded
in Shift-JIS. Will Python be better here? Will it support the source
code encodings that programmers around the world expect?
>Another possible reason is that the programmers were unsure whether
non-ASCII identifiers are allowed.
If that's the case, I'm not sure how you can improve on that in Python.

There are lots of possible reasons why all these programmers around
the world who want to use non-ASCII identifiers end up not using them.
One is simply that very few people ever really want to do so. However,
if you're to assume that they do, then you should look at the existing
practice in other languages to find out what they did right and what
they did wrong. You don't have to speculate.

Ross Ridge

--
l/ // Ross Ridge -- The Great HTMU
[oo][oo] rr****@csclub.uwaterloo.ca
-()-/()/ http://www.csclub.uwaterloo.ca/~rridge/
db //
May 18 '07 #366

P: n/a
sj*******@yahoo.com schrieb:
With the second one, all my standard tools would work fine. My user's
setups will work with it. And there's a much higher chance that all
the intervening systems will work with it.
Please fix your setup.
This is the 21st Century. Unicode is the default in Python 3000.
Wake up before it is too late for you.

Gregor
May 18 '07 #367

P: n/a
Currently, in Python 2.5, identifiers are specified as starting with
an upper- or lowercase letter or underscore ('_') with the following
"characters" of the identifier also optionally being a numerical digit
("0"..."9").

This current state seems easy to remember even if felt restrictive by
many.

Contrariwise, the referenced document "UAX-31" is a bit obscure to me
It's actually very easy. The basic principle will stay: the first
character must be a letter or an underscore, followed by letters,
underscores, and digits.

The question really is: what is a letter? What is an underscore?
What is a digit?
1) Will this allow me to use, say, a "right-arrow" glyph (if I can
find one) to start my identifier?
No. A right-arrow (such as U+2192, RIGHTWARDS ARROW) is a symbol
(general category Sm: Symbol, Math). See

http://unicode.org/Public/UNIDATA/UCD.html

for a list of general category values, and

http://unicode.org/Public/UNIDATA/UnicodeData.txt

for a textual description of all characters.

Now, there is a special case in that Unicode supports "combining
modifier characters", i.e. characters that are not characters
themselves, but modify previous characters, to add diacritical
marks to letters. Unicode has great flexibility in applying these,
to form characters that are not supported themselves. Among those,
there is U+20D7, COMBINING RIGHT ARROW ABOVE, which is of general
category Mn, Mark, Nonspacing.

In PEP 3131, such marks may not appear as the first character
(since they need to modify a base character), but as subsequent
characters. This allows you to form identifiers such as
v⃗ (which should render as a small letter v, with a vector
arrow on top).
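
To make the distinction concrete, a small sketch with the unicodedata
module (illustrative only; how the last line renders depends on the
terminal and font):

import unicodedata

print unicodedata.category(u"\u2192")   # 'Sm' - RIGHTWARDS ARROW, a math symbol
print unicodedata.category(u"\u20d7")   # 'Mn' - COMBINING RIGHT ARROW ABOVE
print u"v\u20d7"                        # a 'v' with a small arrow drawn above it
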
2) Could an ``ID_Continue`` be used as an ``ID_Start`` if using a RTL
(reversed or "mirrored") identifier? (Probably not, but I don't know.)
Unicode, and this PEP, always uses logical order, not rendering order.
What matters is in what order the characters appear in the source code
string.

RTL languages do pose a challenge, in particular since bidirectional
algorithms apparently aren't implemented correctly in many editors.
3) Is or will there be a definitive and exhaustive listing (with
bitmap representations of the glyphs to avoid the font issues) of the
glyphs that the PEP 3131 would allow in identifiers? (Does this
question even make sense?)
It makes sense, but it is difficult to implement. The PEP already
links to a non-normative list that is exhaustive for Unicode 4.1.
Future Unicode versions may add additional characters, so a
list that is exhaustive now might not be in the future. The
Unicode consortium promises stability, meaning that what is an
identifier now won't be reclassified as a non-identifier in the
future, but the reverse is not true, as new code points get
assigned.

As for the list I generated in HTML: It might be possible to
make it include bitmaps instead of HTML character references,
but doing so is a licensing problem, as you need a license
for a font that has all these characters. If you want to
look up a specific character, I recommend going to the Unicode
code charts, at

http://www.unicode.org/charts/

Notice that an HTML page that includes individual bitmaps
for all characters would take *ages* to load.

Regards,
Martin

P.S. Anybody who wants to play with generating visualisations
of the PEP, here are the functions I used:

import unicodedata

def isnorm(c):
    # True if the character is unchanged under NFC normalization.
    return unicodedata.normalize("NFC", c) == c

def start(c):
    if not isnorm(c):
        return False
    if unicodedata.category(c) in ('Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl'):
        return True
    if c == u'_':
        return True
    if c in u"\u2118\u212E\u309B\u309C":
        return True
    return False

def cont_only(c):
    if not isnorm(c):
        return False
    if unicodedata.category(c) in ('Mn', 'Mc', 'Nd', 'Pc'):
        return True
    if 0x1369 <= ord(c) <= 0x1371:
        return True
    return False

def cont(c):
    return start(c) or cont_only(c)

The isnorm() aspect excludes characters from the list which
change under NFC. This excludes a few compatibility characters
which are allowed in source code, but become indistinguishable
from their canonical form semantically.
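
For anyone experimenting with these, a small driver built on the functions
above (not part of the original set; the name is just illustrative) might be:

def is_identifier(s):
    # Non-empty, first character in ID_Start (or underscore),
    # remaining characters in ID_Continue.
    if not s:
        return False
    if not start(s[0]):
        return False
    return all(cont(c) for c in s[1:])

print is_identifier(u"L\xf6ffelstiel")   # True
print is_identifier(u"2fast")            # False - a digit cannot start a name
print is_identifier(u"\u0434\u0430")     # True - Cyrillic letters
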
May 18 '07 #368

P: n/a
Possibly. One Java program I remember had Japanese comments encoded
in Shift-JIS. Will Python be better here? Will it support the source
code encodings that programmers around the world expect?
It's not a question of "will it". It does today, starting from Python 2.3.
>Another possible reason is that the programmers were unsure whether
non-ASCII identifiers are allowed.

If that's the case, I'm not sure how you can improve on that in Python.
It will change on its own over time. "Not allowed" could mean "not
permitted by policy". Indeed, the PEP explicitly mandates a policy
that bans non-ASCII characters from source (whether in identifiers or
comments) for Python itself, and encourages other projects to define
similar policies. What projects pick up such a policy, or pick a
different policy (e.g. all comments must be in Korean) remains to
be seen.

Then, programmers will not be sure whether the language and the tools
allow it. For Python, it will be supported from 3.0, so people will
be worried initially whether their code needs to run on older Python
versions. When Python 3.5 comes along, people hopefully have lost
interest in supporting 2.x, so they will start using 3.x features,
including this one.

Now, it may be tempting to say "ok, so let's wait until 3.5, if people
won't use it before anyway". That is tricky logic: if we add it only
to 3.5, people won't be using it before 4.0. *Any* new feature
takes several years to get into wide acceptance, but years pass
surprisingly fast.
There are lots of possible reasons why all these programmers around
the world who want to use non-ASCII identifiers end up not using them.
One is simply that very few people ever really want to do so. However,
if you're to assume that they do, then you should look at the existing
practice in other languages to find out what they did right and what
they did wrong. You don't have to speculate.
That's indeed how this PEP came about. There were early adopters, like
Java, then experience gained from it (resulting in PEP 263, implemented
in Python 2.3 on the Python side, and resulting in UAX#39 on the Unicode
consortium side), and that experience now flows into PEP 3131.

If you think I speculated in reasoning why people did not use the
feature in Java: sorry for expressing myself unclearly. I know for
a fact that the reasons I suggested were actual reasons given by
actual people. I'm just not sure whether this was an exhaustive
list (because I did not interview every programmer in the world),
and what statistical relevance each of these reasons had (because
I did not conduct a scientific research to gain statistically
relevant data on usage of non-ASCII identifiers in different
regions of the world).

Regards,
Martin
May 18 '07 #369

P: n/a
"Martin v. Lwis" <ma****@v.loewis.dewrites:
Now I understand it is meaning 12 in Merriam-Webster's dictionary,
a) "to decline to bid, double, or redouble in a card game", or b)
"to let something go by without accepting or taking
advantage of it".
I never thought of it as having that meaning. I thought of it in the
sense of going by something without stopping, like "I passed a post
office on my way to work today".
May 18 '07 #370

P: n/a
"Martin v. Lwis" <ma****@v.loewis.dewrites:
If you doubt the claim, please indicate which of these three aspects
you doubt:
1. there are programmers who desire to define classes and functions
with names in their native language.
2. those developers find the code clearer and more maintainable than
if they had to use English names.
3. code clarity and maintainability is important.
I think it can damage clarity and maintainability and if there's so
much demand for it then I'd propose this compromise: non-ascii
identifiers are allowed but they produce a compiler warning message
(including from eval and exec). You can suppress the warning message
with a command line option.
May 18 '07 #371

P: n/a
"Martin v. Lwis" <ma****@v.loewis.dewrites:
Integration with existing tools *is* something that a PEP should
consider. This one does not do that sufficiently, IMO.
What specific tools should be discussed, and what specific problems
do you expect?
Emacs, whose unicode support is still pretty weak.
May 18 '07 #372

P: n/a

"Hendrik van Rooyen" <m...@m........co.zawrote:
>
Now look me in the eye and tell me that you find
the mix of proper German and English keywords
beautiful.
I can't admit that, but I find that using German
class and method names is beautiful. The rest around
it (keywords and names from the standard library)
are not English - they are Python.
MvL:
(look me in the eye and tell me that "def" is
an English word, or that "getattr" is one)
HvR:
LOL - true - but a broken down assembler programmer like me
does not use getattr - and def is short for define, and for and while
and in are not German.
After an intense session of omphaloscopy, I would like another bite
at this cherry.

I think my problem is something like this - when I see a line of code
like:

def frobnitz():

I do not actually see the word "def" - I see something like:

define a function with no arguments called frobnitz

This "expansion" process is involuntary, and immediate in my mind.

And this is immediately followed by an irritated reaction, like:

WTF is frobnitz? What is it supposed to do? What Idiot wrote this?

Similarly, when I encounter the word "getattr" - it is immediately
expanded to "get attribute" and this "expansion" is kind of
dependant on another thing, namely that my mind is in "English
mode" - I refer here to something that only happens rarely, but
with devastating effect, experienced only by people who can read
more than one language - I am referring to the phenomenon that you
look at an unfamiliar piece of writing on say a signboard, with the
wrong language "switch" set in your mind - and you cannot read it,
it makes no sense for a second or two - until you kind of step back
mentally and have a more deliberate look at it, when it becomes
obvious that its not say English, but Afrikaans, or German, or vice
versa.

So in a sense, I can look you in the eye and assert that "def" and
"getattr" are in fact English words... (for me, that is)

I suppose that this "one language track" - mindedness of mine
is why I find the mix of keywords and German or Afrikaans so
abhorrent - I cannot really help it, it feels as if I am eating a
sandwich, and that I bite on a stone in the bread. - It just jars.

Good luck with your PEP - I don't support it, but it is unlikely
that the Python-dev crowd and GvR would be swayed much
by the opinions of the egregious HvR.

Aesthetics aside, I think that the practical maintenance problems
(especially remote maintenance) are the rock on which this
ship could founder.

- Hendrik

--
Philip Larkin (English Poet) :
They fuck you up, your mom and dad -
They do not mean to, but they do.
They fill you with the faults they had,
and add some extra, just for you.
May 18 '07 #373

P: n/a
"Sion Arrowsmith" <si..a@....org.ukwrote:

Hvr:
>>Would not like it at all, for the same reason I don't like re's -
It looks like random samples out of alphabet soup to me.

What I meant was, would the use of "foreign" identifiers look so
horrible to you if the core language had fewer English keywords?
(Perhaps Perl, with its line-noise, was a poor choice of example.
Maybe Lisp would be better, but I'm not so sure of my Lisp as to
make such an assertion for it.)
I suppose it would jar less - but I avoid such languages, as the whole
thing kind of jars - I am not on the python group for nothing..

: - )

- Hendrik

May 18 '07 #374

P: n/a
A long and interesting discussion with different points of view.

Personally, even if the PEP goes in (and it's accepted), I'll continue to use
identifiers as I do currently. But I understand those who want to be able to
use characters from their own language.

* for people who are not expert developers (non-pros, or in a learning
context), to be able to use names having meaning, and for pro developers
wanting to give a clear domain-specific meaning - mainly for languages not
based on Latin characters, where the problem must be exacerbated.
They can already use Unicode in strings (including documentation strings).

* for exchanging with other programming languages having such identifiers...
when they are really used (I include binding of table/column names in
relational databases).

* (not read, but I think present) this will allow developers to lock the
code so that it could not be easily taken/delocalized anywhere by anybody.
In the discussion I've seen that the problem of mixing characters having
different Unicode numbers but the same representation (e.g. omega) is
resolved (by use of a Unicode attribute linked to representation, AFAIU).

I've seen (on fclp) a post about speed; it should be verified, but I'm not
sure we will lose speed with Unicode identifiers.

On Unicode editing, we have in 2007 enough decent editors supporting
Unicode (I configure my Windows/Linux editors to use UTF-8 by default).
I share the concern about being able to read code from a project which may
use such identifiers (I don't read Cyrillic, Kanji, or Hindi), but this will
just give freedom to users.

This can be a pain for me in some cases, but is this a valid argument for
forbidding it for other people who feel the need?
IMHO, what we should have if the PEP goes in:

* a reworking of traceback to have a general option (like -T) to ensure
tracebacks print only pure ASCII, to avoid encoding problems when
displaying errors on terminals.

* a possibility to specify for modules that they must *define* only
ASCII-based names, like a from __future__ import asciionly. To be able to
enforce this policy in projects which request it.

* and, as many wrote, enforce that the standard Python libraries use only
ASCII identifiers.

May 18 '07 #375

P: n/a
Hallöchen!

Martin v. Löwis writes:
>In <sl*****************@irishsea.home.craig-wood.com>, Nick Craig-Wood
wrote:
>>My initial reaction is that it would be cool to use all those
great symbols. A variable called OHM etc!

This is a nice candidate for homoglyph confusion. There's the
Greek letter omega (U+03A9) Ω and the SI unit symbol (U+2126) Ω,
and I think some omegas in the mathematical symbols area too.

Under the PEP, identifiers are converted to normal form NFC, and
we have

py> unicodedata.normalize("NFC", u"\u2126")
u'\u03a9'

So, OHM SIGN compares equal to GREEK CAPITAL LETTER OMEGA. It can't
be confused with it - it is equal to it by the proposed language
semantics.
So different unicode sequences in the source code can denote the
same identifier?

Tschö,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
Jabber ID: br*****@jabber.org
(See http://ime.webhop.org for ICQ, MSN, etc.)
May 18 '07 #376

P: n/a
Hallöchen!

Laurent Pointal writes:
[...]

Personnaly, even if the PEP goes (and its accepted), I'll continue
to use identifiers as currently. [...]
Me too (mostly), although I do like the PEP. While many people have
pointed out possible issues with the PEP, only a few have tried to
estimate its actual impact. I don't think that it will do harm to
Python code because the programmers will know when it's appropriate
to use it. The potential trouble is too obvious for being ignored
accidentally. And in the case of a bad programmer, you have more
serious problems than flawed identifier names, really.

But for private utilities for example, such identifiers are really a
nice thing to have. The same is true for teaching in some cases.
And the small simulation program in my thesis would have been better
with some α and φ. At least, the program would be closer to the
equations in the text then.
[...]

* a possibility to specify for modules that they must *define*
only ASCII-based names, like a from __future__ import asciionly. To
be able to enforce this policy in projects which request it.
Please don't. We're all adults. If a maintainer is really
concerned about such a thing, he should write a trivial program that
ensures it. After all, there are some other coding guidelines too
that could be enforced this way but aren't, for good reason.

Tschö,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
Jabber ID: br*****@jabber.org
(See http://ime.webhop.org for ICQ, MSN, etc.)
May 18 '07 #377

P: n/a
"Martin v. Löwis" <ma****@v.loewis.de> wrote:
>3) Is or will there be a definitive and exhaustive listing (with
bitmap representations of the glyphs to avoid the font issues) of the
glyphs that the PEP 3131 would allow in identifiers? (Does this
question even make sense?)
As for the list I generated in HTML: It might be possible to
make it include bitmaps instead of HTML character references,
but doing so is a licensing problem, as you need a license
for a font that has all these characters. If you want to
look up a specific character, I recommend going to the Unicode
code charts, at
http://www.unicode.org/charts/
My understanding is also that there are several East Asian
characters that display quite differently depending on whether
you are in Japan, Taiwan, or mainland China. So differently
that, for example, a Japanese person will not be able to recognize
a character rendered in the Taiwanese or mainland Chinese way.
--
Thomas Bellman, Lysator Computer Club, Linköping University, Sweden
"Adde parvum parvo magnus acervus erit" ! bellman @ lysator.liu.se
(From The Mythical Man-Month) ! Make Love -- Nicht Wahr!
May 18 '07 #378

P: n/a
Hendrik van Rooyen schrieb:
I suppose that this "one language track" - mindedness of mine
is why I find the mix of keywords and German or Afrikaans so
abhorrent - I cannot really help it, it feels as if I am eating a
sandwich, and that I bite on a stone in the bread. - It just jars.
Please come to Vienna and learn the local slang.
You would be surprised how beautiful and expressive a language mixed
together from a lot of very different languages can be. Same for music. It's
the secret of the success of the music from Vienna. It's just a mix of all
the different cultures once living in a big multicultural kingdom.

A mix of Python keywords and German identifiers feels very natural
to me. I live in cultural diversity and richness and love it.

Gregor
May 18 '07 #379

P: n/a
On May 17, 2:30 pm, Gregor Horvath <g...@gregor-horvath.com> wrote:
Is there any difference for you in debugging this code snippets?
class Türstock(object):
Of course there is, how do I type the ü ? (I can copy/paste for
example, but that gets old quick).

But you're making a strawman argument by using extended ASCII
characters that would work anyhow. How about debugging this (I wonder
will it even make it through?) :

class 6자회담관**조
6자회 = 0
6자회담관* *귀 명=10
(I don't know what it means, just copied over some words from a
Japanese news site, but the first thing it did was mess up my editor;
it would not type the colon anymore)

i.

May 18 '07 #380

P: n/a
"Istvan Albert" <istvan.@.comescribi:
How about debugging this (I wonder will it even make it through?) :

class 6???????
6?? = 0
6????? ?? ?=10
This question is more or less what a Korean who doesn't
speak English would ask if he had to debug a program
written in English.
(I don't know what it means, just copied over some words
from a japanese news site,
A Japanese speaking Korean, it seems. :-)

Javier
------------------------------------------
http://www.texytipografia.com
May 18 '07 #381

P: n/a
Istvan Albert schrieb:
On May 17, 2:30 pm, Gregor Horvath <g...@gregor-horvath.com> wrote:
>Is there any difference for you in debugging these code snippets?
>class Türstock(object):

Of course there is, how do I type the ü ? (I can copy/paste for
example, but that gets old quick).
I doubt that you can debug the code without Unicode chars. It seems that
you do not understand German and therefore you do not know what the
purpose of this program is.
Can you tell me if there is an error in the snippet without Unicode?

I would refuse to try to debug a program that I do not understand.
Avoiding Unicode does not help a bit in this regard.

Gregor
May 18 '07 #382

P: n/a
On 18 May, 18:42, "Javier Bezos" <see_below_no_s...@yahoo.es> wrote:
"Istvan Albert" <istvan.@.comescribi:
How about debugging this (I wonder will it even make it through?) :
class 6???????
6?? = 0
6????? ?? ?=10

This question is more or less what a Korean who doesn't
speak English would ask if he had to debug a program
written in English.
Perhaps, but the treatment by your mail/news software plus the
delightful Google Groups of the original text (which seemed intact in
the original, although I don't have the fonts for the content) would
suggest that not just social or cultural issues would be involved.
It's already more difficult than it ought to be to explain to people
why they have trouble printing text to the console, for example, and
if one considers issues with badly configured text editors putting the
wrong character values into programs, even if Python complains about
it, there's still going to be some explaining to do.

One thing that some people already dislike about Python is the
"editing discipline" required. Although I don't have much time for
people whose coding "skills" involve random edits using badly
configured editors, trashing the indentation and the appearance of the
code (regardless of the language involved), we do need to consider the
need to bring people "up to speed" gracefully by encouraging the
proper use of tools, and so on, all without making it seem really
difficult and discouraging people from learning the language.

Paul

May 18 '07 #383

P: n/a
Paul Boddie schrieb:
Perhaps, but the treatment by your mail/news software plus the
delightful Google Groups of the original text (which seemed intact in
the original, although I don't have the fonts for the content) would
suggest that not just social or cultural issues would be involved.
I do not see the point.
Whether my editor or newsreader displays the text correctly or not makes
no difference to me, since I do not understand a word of it anyway. It's a
meaningless stream of bits to me.
It's safe to assume that for people who find this meaningful,
their setup will display it correctly. Otherwise they could not work
with their computers anyway.

Until now I have not found a single computer in my German domain that
cannot display: .

Gregor
May 18 '07 #384

P: n/a
>This question is more or less what a Korean who doesn't
speak English would ask if he had to debug a program
written in English.

Perhaps, but the treatment by your mail/news software plus the
delightful Google Groups of the original text (which seemed intact in
the original, although I don't have the fonts for the content) would
suggest that not just social or cultural issues would be involved.
The fact my Outlook changed the text is irrelevant
for something related to Python. And just remember
how Google mangled the indentation of Python code
some time ago. This was a technical issue which has
been solved, and no doubt my laziness (I didn't
switch to Unicode) won't prevent non-ASCII identifiers
from being properly shown in general.

Javier
-----------------------------
http://www.texytipografia.com

May 18 '07 #385

P: n/a
On May 18, 1:47 pm, "Javier Bezos" <see_below_no_s...@yahoo.es> wrote:
This question is more or less what a Korean who doesn't
speak English would ask if he had to debug a program
written in English.
Perhaps, but the treatment by your mail/news software plus the
delightful Google Groups of the original text (which seemed intact in
the original, although I don't have the fonts for the content) would
suggest that not just social or cultural issues would be involved.

The fact my Outlook changed the text is irrelevant
for something related to Python.
On the contrary, it cuts to the heart of the problem. There are
hundreds of tools out there that programmers use, and mailing lists
are certainly an incredibly valuable tool--introducing a change that
makes code more likely to be silently mangled seems like a negative.

Of course, there are other benefits to the PEP, so I'm only barely
opposed. But dismissing the fact that Outlook and other quite common
tools may have severe problems with code seems naive (or disingenuous,
but I don't think that's the case here).

May 18 '07 #386

P: n/a
Gregor Horvath wrote:
Paul Boddie schrieb:
Perhaps, but the treatment by your mail/news software plus the
delightful Google Groups of the original text (which seemed intact in
the original, although I don't have the fonts for the content) would
suggest that not just social or cultural issues would be involved.

I do not see the point.
If my editor or newsreader does display the text correctly or not is no
difference for me, since I do not understand a word of it anyway. It's a
meaningless stream of bits for me.
But if your editor doesn't even bother to preserve those bits
correctly, it makes a big difference. When 6자회담관**조 becomes 6???????
because someone's tool did the equivalent of
unicode_obj.encode("iso-8859-1", "replace"), then the stream of bits
really does become meaningless. (We'll see if the former identifier
even resembles what I've just pasted later on, or whether it resembles
the latter.)
It's safe to assume that for people who find this meaningful,
their setup will display it correctly. Otherwise they could not work
with their computers anyway.
Sure, it's all about "editor discipline" or "tool discipline" just as
I wrote. I'm in favour of the PEP, generally, but I worry about the
long explanations required when people find that their programs are
now ill-formed because someone made a quick edit in a bad editor.

Paul

May 18 '07 #387

P: n/a
Istvan Albert:
But you're making a strawman argument by using extended ASCII
characters that would work anyhow. How about debugging this (I wonder
will it even make it through?) :

class 6자회담관**조
6자회 = 0
6자회담관* *귀 명=10
That would be invalid syntax since the third line is an assignment
with target identifiers separated only by spaces.

Neil
May 18 '07 #388

P: n/a
On May 13, 9:44 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
python-3...@python.org

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り上げ
(hoping that the latter one means "sale").
I notice that Guido has approved it, so I'm looking at what it would
take to support it for Python FIT. The actual issue (for me) is
translating labels for cell columns (and similar) into Python
identifiers. After looking at the firestorm, I've come to the
conclusion that the old methods need to be retained not only for
backwards compatibility but also for people who want to translate
existing fixtures.

The guidelines in PEP 3131 for standard library code appear to be
adequate for code that's going to be contributed to the community. I
will most likely emphasize those in my documentation.

Providing a method that would translate an arbitrary string into a
valid Python identifier would be helpful. It would be even more
helpful if it could provide a way of converting untranslatable
characters. However, I suspect that the translate (normalize?) routine
in the unicode module will do.

John Roth
Python FIT
May 19 '07 #389

P: n/a
sj*******@yahoo.com schrieb:
opposed. But dismissing the fact that Outlook and other quite common
tools may have severe problems with code seems naive (or disingenuous,
but I don't think that's the case here).
Of course there is broken software out there. There are even editors
that mix tabs and spaces ;-) Python did not introduce braces to solve
this problem but encouraged the use of appropriate tools. It seems to work
for 99% of us. Same here.
It is the 21st century. Tools that destroy Unicode byte streams are
seriously broken. Face it. You cannot halt progress because of some
broken software. Fix or drop it instead.

I do not think that this will be a big problem because only a very small
fraction of specialized local code will use Unicode identifiers anyway.

Unicode strings and comments are allowed today and I haven't heard of a
single issue of destroyed strings because of bad editors, although I
guess that Unicode strings in code are way more common than Unicode
identifiers would ever be.

Gregor
May 19 '07 #390

P: n/a
<@yahoo.com> wrote:
Perhaps, but the treatment by your mail/news software plus the
delightful Google Groups of the original text (which seemed intact in
the original, although I don't have the fonts for the content) would
suggest that not just social or cultural issues would be involved.

The fact my Outlook changed the text is irrelevant
for something related to Python.

On the contrary, it cuts to the heart of the problem. There are
hundreds of tools out there that programmers use, and mailing lists
are certainly an incredibly valuable tool--introducing a change that
makes code more likely to be silently mangled seems like a negative.
By that reasoning, the Python indentation should be
rejected too (quite interesting that you removed from my
post the part mentioning it). I can promise you there
are Korean groups, and there are no problems at
all with using Hangul (the Korean writing system).

Javier
-----------------------------
http://www.texytipografia.com
May 19 '07 #391

P: n/a
Providing a method that would translate an arbitrary string into a
valid Python identifier would be helpful. It would be even more
helpful if it could provide a way of converting untranslatable
characters. However, I suspect that the translate (normalize?) routine
in the unicode module will do.
Not at all. Unicode normalization only unifies different "spellings"
of the same character.
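As a minimal illustration (my own sketch, not part of the proposal):
normalization merges the composed and decomposed spellings of the same
character, but it does not produce an ASCII approximation.

import unicodedata

composed = u"\u00fc"        # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = u"u\u0308"     # 'u' followed by COMBINING DIAERESIS

print unicodedata.normalize("NFC", decomposed) == composed  # True: one spelling
print repr(unicodedata.normalize("NFKD", composed))         # u'u\u0308' -- still not ASCII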

For transliteration, no simple algorithm exists, as it generally depends
on the language. However, if you just want any kind of ASCII string,
you can use the Unicode error handlers (PEP 293). For example, the
program

import unicodedata, codecs

def namereplace(exc):
    if isinstance(exc,
                  (UnicodeEncodeError, UnicodeTranslateError)):
        s = u""
        for c in exc.object[exc.start:exc.end]:
            s += "N_" + unicode(unicodedata.name(c).replace(" ", "_")) + "_"
        return (s, exc.end)
    else:
        # exc is an exception instance, so report its class name
        # (exc.__name__ would raise AttributeError here).
        raise TypeError("can't handle %s" % exc.__class__.__name__)

codecs.register_error("namereplace", namereplace)

print u"Schl\xfcssel".encode("ascii", "namereplace")

prints SchlN_LATIN_SMALL_LETTER_U_WITH_DIAERESIS_ssel.

HTH,
Martin
May 19 '07 #392

P: n/a
>But you're making a strawman argument by using extended ASCII
>characters that would work anyhow. How about debugging this (I wonder
will it even make it through?) :

class 6자회담관**조
6자회 = 0
6자회담관* *귀 명=10

That would be invalid syntax since the third line is an assignment
with target identifiers separated only by spaces.
Plus, the identifier starts with a number (even though 6 is not DIGIT
SIX, but FULLWIDTH DIGIT SIX, it's still of category Nd, and can't
start an identifier).
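(For anyone who wants to verify this, a small sketch of my own using only
the standard unicodedata module; it assumes, as above, that the leading
character is FULLWIDTH DIGIT SIX:)

import unicodedata

ch = u"\uff16"                  # FULLWIDTH DIGIT SIX
print unicodedata.name(ch)      # FULLWIDTH DIGIT SIX
print unicodedata.category(ch)  # Nd: may continue, but not start, an identifier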

Regards,
Martin
May 19 '07 #393

P: n/a
On Fri, 18 May 2007 06:28:03 +0200, Martin v. Löwis wrote:

[excellent as always exposition by Martin]

Thanks, Martin.
P.S. Anybody who wants to play with generating visualisations
of the PEP, here are the functions I used:
[code snippets]

Thanks for those functions, too -- I've been exploring with them and
am slowly coming to some understanding.

-- Richard Hanson

"To many native-English-speaking developers well versed in other
programming environments, Python is *already* a foreign language --
judging by the posts here in c.l.py over the years." ;-)
__________________________________________________

May 19 '07 #394

P: n/a
Martin v. Löwis wrote:
I've reported this before, but happily do it again: I have lived many
years without knowing what a "hub" is, and what "to pass" means if
it's not the opposite of "to fail". Yet, I have used their technical
meanings correctly all these years.
I was not speaking of the more general (non-technical) meanings, but of
the technical ones. The claim which I challenged was that people learn
just the "use" (syntax) but not the "meaning" (semantics) of these
terms. I think you are actually supporting my argument ;)

--
René
May 19 '07 #395

P: n/a
Martin v. Löwis wrote:
>>Then get tools that match your working environment.
Integration with existing tools *is* something that a PEP should
consider. This one does not do that sufficiently, IMO.

What specific tools should be discussed, and what specific problems
do you expect?
Systems that cannot display code parts correctly. I expect problems with
unreadable tracebacks, for example.

Also: are existing tools that somehow process Python source code, e.g. to
test whether it meets certain criteria (pylint & co) or to aid in
creating documentation (epydoc & co) fully unicode-ready?

--
René
May 19 '07 #396

P: n/a
Martin v. Löwis wrote:
Python code is written by many people in the world who are not familiar
with the English language, or even well-acquainted with the Latin
writing system.
I believe that there is not a single programmer in the world who doesn't
know ASCII. It isn't hard to learn the Latin alphabet, and you have to know
it anyway to use the keywords and the other ASCII characters to write numbers,
punctuation etc. Most non-western alphabets have ASCII transcription rules
and contain ASCII as a subset. On the other hand non-ascii identifiers
lead to fragmentation and less understanding in the programming world so I
don't like them. I also don't like non-ascii domain names where the same
arguments apply.

Let the data be expressed with Unicode but the logic with ASCII.

--
Regards/Gruesse,

Peter Maas, Aachen
E-mail 'cGV0ZXIubWFhc0B1dGlsb2cuZGU=\n'.decode('base64')
May 19 '07 #397

P: n/a
On May 19, 3:33 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
That would be invalid syntax since the third line is an assignment
with target identifiers separated only by spaces.

Plus, the identifier starts with a number (even though 6 is not DIGIT
SIX, but FULLWIDTH DIGIT SIX, it's still of category Nd, and can't
start an identifier).
Actually both of these issues point to the real problem with this PEP.

I knew about them (note that the colon is also missing), alas I
couldn't fix them.
My editor could not remove a space or add a colon any more; it
would immediately change the rest of the characters to something
crazy.

(Of course someone might now feel compelled to state that this is an
editor problem, but the reality is that features need to adapt to
reality; moreover, had I used a different editor I would still have been
unable to write these characters.)

i.
May 20 '07 #398

P: n/a
Istvan Albert wrote:
On May 19, 3:33 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
That would be invalid syntax since the third line is an assignment
with target identifiers separated only by spaces.

Plus, the identifier starts with a number (even though 6 is not DIGIT
SIX, but FULLWIDTH DIGIT SIX, it's still of category Nd, and can't
start an identifier).

Actually both of these issues point to the real problem with this PEP.

I knew about them (note that the colon is also missing), alas I
couldn't fix them.
My editor could not remove a space or add a colon any more; it
would immediately change the rest of the characters to something
crazy.

(Of course someone might now feel compelled to state that this is an
editor problem, but the reality is that features need to adapt to
reality; moreover, had I used a different editor I would still have been
unable to write these characters.)
The reality is that the few users who care about having Chinese in their
code *will* be using an editor that supports it.

May 20 '07 #399

P: n/a
On May 17, 5:03 pm, "sjdevn...@yahoo.com" <sjdevn...@yahoo.com> wrote:
On May 16, 6:38 pm, r...@yahoo.com wrote:
Are you worried that some 3rd-party package you have
included in your software will have some non-ascii identifiers
buried in it somewhere? Surely that is easy to check for?
Far easier that checking that it doesn't have some trojan
code it it, it seems to me.

What do you mean, "check for"? If, say, numeric starts using math
characters (as has been suggested), I'm not exactly going to stop
using numeric. It'll still be a lot better than nothing, just
slightly less better than it used to be.
The PEP explicitly states that no non-ascii identifiers
will be permitted in the standard library. The opinions
expressed here seem almost unanimous that non-ascii
identifiers are a bad idea in any sort of shared public
code. Why do you think the occurrence of non-ascii
identifiers in Numpy is likely?
And I'm often not creating a stack trace procedure; I'm using the
built-in Python procedure.
And I'm often dealing with mailing lists, Usenet, etc where I don't
know ahead of time what the other end's display capabilities are, how
to fix them if they don't display what I'm trying to send, whether
intervening systems will mangle things, etc.
I think we all are in this position. I always send plain
text mail to mailing lists, people I don't know etc. But
that doesn't mean that email software should be constrained
to only 7-bit plain text with no attachments! I frequently use
such capabilities when they are appropriate.

Sure. But when you're talking about maintaining code, there's a very
high value to having all the existing tools work with it whether
they're wide-character aware or not.
I agree. On Windows I often use Notepad to edit
Python files. (There goes my credibility! :-)
So I don't like tab-only indent proposals that assume
I can set tabs to be an arbitrary number of spaces.
But tab-only indentation would affect every Python
program and every Python programmer.

In the case of non-ascii identifiers, the potential
gains for non-English speakers are so big, and (IMO)
the difficulty of working with non-ascii identifiers
times the probability of having to work with them is
so low, that the former clearly outweighs the latter.
If your response is, "yes, but look at the problems HTML
email, virus-infected attachments etc. cause", the situation
is not the same. You have little control over what kind of
email people send you but you do have control over what
code, libraries, patches, you choose to use in your
software.

If you want to use ascii-only, do it! Nobody is making
you deal with non-ascii code if you don't want to.

Yes. But it's not like this makes things so horribly awful that it's
worth my time to reimplement large external libraries. I remain at -0
on the proposal;
it'll cause some headaches for the majority of
current Python programmers, but it may have some benefits to a
sizeable minority
This is the crux of the matter, I think: that non-ascii identifiers
will spread like a virus, infecting program after program until every
piece of Python code is nothing but a mass of writhing, unintelligible
non-ascii characters. (OK, maybe I am overstating a little. :-)

I (and I think other proponents) don't think this is
likely to happen, and the benefits to non-English
speakers of being able to write maintainable code far
outweigh the very rare case when it does occur.
and may help bring in new coders. And it's not
going to cause flaming catastrophic death or anything.
May 20 '07 #400
