Allowing non-ASCII identifiers (François Pinard)

Doug Fort

This is an excerpt from a much longer post on the python-dev mailing list.
I'm responding here, to avoid cluttering up python-dev.

[François Pinard]
<snip>

Some English readers might not really imagine, but it is a constant
misery, having to mangle identifiers while documenting and thinking
in languages other than English, merely because the Python notion of
letter is limited to the English subset. Granted, keywords and standard
library use English, this is Python, and this is not at stake here!
However, there is a good part of code in local (or in-house) programs
which is thought as our crafted code, and even the linguistic change is
useful (to us) for segregating between what comes from the language and
what comes from us. The idea is extremely appealing of being able to
craft and polish our code (comments, strings, identifiers) to make it as
nice as it could get, while thinking in our native, natural language.

--
François Pinard http://www.iro.umontreal.ca/~pinard

</snip>

Monoglot English speakers, like me, might also benefit from reading
well-crafted Python code with non-English identifiers and comments. I learn
best by anchoring new ideas in a familiar context.

One of my (non-programmer) friends is improving his French by working
through the French versions of the Harry Potter novels.

Jul 18 '05 #1


Paul Prescod

Doug Fort wrote:

[François Pinard]
<snip>

Some English readers might not really imagine, but it is a constant
misery, having to mangle identifiers while documenting and thinking
in languages other than English, merely because the Python notion of
letter is limited to the English subset. Granted, keywords and standard
library use English, this is Python, and this is not at stake here!
However, there is a good part of code in local (or in-house) programs
which is thought as our crafted code, and even the linguistic change is
useful (to us) for segregating between what comes from the language and
what comes from us. The idea is extremely appealing of being able to
craft and polish our code (comments, strings, identifiers) to make it as

I wonder if the proposal would be more palatable if it were restricted
to 8-bit encodings (what we used to call "code pages"). This is at least
a first step in the right direction that would help westerners and could
be made to work even if Python were compiled without Unicode support.
(it is still possible to compile Python without Unicode isn't it?)

Paul Prescod

Jul 18 '05 #2

Martin v. Löwis

Paul Prescod wrote:

I wonder if the proposal would be more palatable if it were restricted
to 8-bit encodings (what we used to call "code pages"). This is at least
a first step in the right direction that would help westerners and could
be made to work even if Python were compiled without Unicode support.
(it is still possible to compile Python without Unicode isn't it?)

I doubt that it would matter much to those currently opposed; I know
that *I* would be opposed to such a strategy: Allowing arbitrary source
code encoding is no technical challenge whatsoever, and restricting
it to single-byte encodings is an arbitrary restriction.

I believe Guido's concern is more along the lines "How do I call a
function that has a ł in its name, or a Σ?", or, even, "How can I
find out what the function does, by looking at its name and doc
string, if that is in Polish or Greek?" The fact that there is
a single-byte encoding for either character doesn't really help
here.

So this is about social issues, coding policies, guidelines, etc -
not about technical issues.

Regards,
Martin

Jul 18 '05 #3

François Pinard

[Paul Prescod]

I wonder if the proposal would be more palatable if it were restricted
to 8-bit encodings (what we used to call "code pages"). This is at
least a first step in the right direction that would help westerners
and could be made to work even if Python were compiled without Unicode
support.

To repeat something I wrote to python-dev earlier today, it
already works by some kind of accident. A smallish main program
could do:

import locale
locale.setlocale(locale.LC_ALL, '')
import the_real_application  # placeholder for the name of your actual module

to activate your code page, given your environment is already set for
it. This will activate proper classification of characters in <ctype.c>
and then, Python seems to behave properly with non-ASCII identifiers
within the imported application.

It is an accident because it was not meant this way by Guido, at least
as far as I know. The trick might break at various places, who knows.
I did not test it seriously, and do not intend to rely on it, as Guido
might even choose to consider this as a bug to be corrected.

The plan rather seems to be to support non-ASCII identifiers widely
instead of parsimoniously, if Python ever does it, or not at all. The
decision has not been taken yet, Guido wants a PEP and a discussion
first.

In my experience, such discussions are often rough (or at least
demanding), because people have a lot of emotions on linguistic
issues, and do not always show the real relations between emotions and
rationalisations, which sometimes get convoluted.

(it is still possible to compile Python without Unicode isn't it?)

I would guess that Unicode in Python is central if you want codecs to
work, in particular for all code pages which Python currently supports.

--
François Pinard http://www.iro.umontreal.ca/~pinard

Jul 18 '05 #4

Paul Prescod

Martin v. Löwis wrote:

Paul Prescod wrote:

I wonder if the proposal would be more palatable if it were restricted
to 8-bit encodings (what we used to call "code pages"). This is at
least a first step in the right direction that would help westerners
and could be made to work even if Python were compiled without Unicode
support. (it is still possible to compile Python without Unicode isn't
it?)

I doubt that it would matter much to those currently opposed; I know
that *I* would be opposed to such a strategy: Allowing arbitrary source
code encoding is no technical challenge whatsoever, and restricting
it to single-byte encodings is an arbitrary restriction.

Jul 18 '05 #5

John Roth

"Paul Prescod" <pa**@prescod.net> wrote in message
news:ma***************************************@pyt hon.org...
Martin v. Löwis wrote:

Paul Prescod wrote:
I wonder if the proposal would be more palatable if it were restricted
to 8-bit encodings (what we used to call "code pages"). This is at
least a first step in the right direction that would help westerners
and could be made to work even if Python were compiled without Unicode
support. (it is still possible to compile Python without Unicode isn't
it?)

I doubt that it would matter much to those currently opposed; I know
that *I* would be opposed to such a strategy: Allowing arbitrary source
code encoding is no technical challenge whatsoever, and restricting
it to single-byte encodings is an arbitrary restriction.

You are right. Re-reading Guido's complaint I understand what you mean.
But I have heard the argument in the past that Unicode source files
would break introspection tools. If that isn't a concern this time
around then disregard my suggestion.

[JR]
I believe that unicode (actually UTF-8) source code files
are legitimate if you declare them properly in the encoding
line. In fact, UTF-8 is the example in the documentation.
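
A minimal sketch of what such an encoding line looks like, assuming Python 3 and its standard tokenize module. The variable names and the sample source are made up for illustration.

```python
import io
import tokenize

# A tiny source file, as bytes, whose first line declares its encoding.
source_bytes = (
    "# -*- coding: utf-8 -*-\n"
    "greeting = 'Fran\u00e7ois'\n"
).encode("utf-8")

# detect_encoding reads at most two lines and reports the declared encoding.
encoding, first_lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)

# Decode with the declared encoding and run the result in a scratch namespace.
namespace = {}
exec(compile(source_bytes.decode(encoding), "<example>", "exec"), namespace)
print(encoding, namespace["greeting"])
```

The declaration only says how the bytes of the file map to characters; it does not by itself settle which characters are allowed in identifiers.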

I'm all in favor of going to unicode all the way. I'd like to
have the proper mathematical symbols for logical and set
operations, as well as integer divide. They're all there in the
unicode character set, after all; why should we have to
settle for archaic character restrictions?

John Roth
[/JR]

Paul Prescod

Jul 18 '05 #6

AdSR

"John Roth" <ne********@jhrothjr.com> wrote in message news:<10*************@news.supernews.com>...

I'm all in favor of going to unicode all the way. I'd like to
have the proper mathematical symbols for logical and set
operations, as well as integer divide. They're all there in the
unicode character set, after all; why should we have to
settle for archaic character restrictions?

Java allows for Unicode identifiers and I'm yet to see a single source
file that uses anything but ASCII. Actually, so far I have only seen
non-ASCII in Polish Logo many years ago, and that was only for
educational purposes.

As a non-native English speaker, coming from Polish and Portuguese
background, I could argue in favor of non-ASCII identifiers, but I'm
against them. Do we really need those? Even if program output is in
Polish, all my code is "identified" and commented in English, which I
think of as a good habit. (With the exception of HTML, where comments
are closely related to content.)

I don't have any _really_ solid reasons against Unicode identifiers,
except for simplicity. It's just the way I feel about programming.

On a side note, one place where I think non-ASCII really should be
avoided are domain names, something that is being much debated
recently.

AdSR

Jul 18 '05 #7

Michael Hudson

ar**********@yahoo.com (AdSR) writes:

On a side note, one place where I think non-ASCII really should be
avoided are domain names, something that is being much debated
recently.

And something Python supports already :-)
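
Presumably this refers to the "idna" codec in the standard library; that reading is an assumption. A short sketch in Python 3, with a made-up domain name:

```python
# Hypothetical domain name, chosen only for illustration.
name = "b\u00fccher.example"

# Encode each label to its ASCII-compatible form, then decode it back.
ascii_form = name.encode("idna")
round_trip = ascii_form.decode("idna")
print(ascii_form, round_trip)
```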

Cheers,
mwh

--
Windows XP: Big cow. Stands there, not especially malevolent
but constantly crapping on your carpet. Eventually you have to
open a window to let the crap out or you die.
-- Jim's pedigree of operating systems, asr

Jul 18 '05 #8

Scott David Daniels

John Roth wrote:

...
I believe that unicode (actually UTF-8) source code files
are legitimate if you declare them properly in the encoding
line. In fact, UTF-8 is the example in the documentation.

I'm all in favor of going to unicode all the way. I'd like to
have the proper mathematical symbols for logical and set
operations, as well as integer divide. They're all there in the
unicode character set, after all; why should we have to
settle for archaic character restrictions?

Because some of us use archaic systems and/or fonts which are
incapable of displaying such symbols. Never mind whether we
can read them.

Also, we would have to solve the issue of multiple representations
for the same identifier (normalized identifiers)? There are four
equivalent representations:

(u'\N{Latin small letter e with acute}l'
u'\N{Latin small letter e with grave}ve')

(u'\N{Latin small letter e with acute}l'
u'e\N{Combining grave accent}ve')

(u'e\N{Combining acute accent}l'
u'\N{Latin small letter e with grave}ve')

(u'e\N{Combining acute accent}l'
u'e\N{Combining grave accent}ve')

Unicode says we should treat these four identically. Further,
they each have a distinct hash code, so a dictionary will not
necessarily even try to compare them to find them equal.
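
A minimal sketch of the problem, assuming Python 3 strings (the thread itself predates Python 3). The four spellings are built from explicit code points so that no editor can silently normalize them.

```python
# Precomposed and decomposed spellings of the two accented letters.
E_ACUTE = ("\u00e9", "e\u0301")   # e with acute, or e + combining acute accent
E_GRAVE = ("\u00e8", "e\u0300")   # e with grave, or e + combining grave accent

# The four spellings of the same word, in the order listed above.
spellings = [a + "l" + g + "ve" for a in E_ACUTE for g in E_GRAVE]

# As raw strings they are all different, so a dictionary keyed on one
# spelling does not find another.
distinct = set(spellings)
table = {spellings[0]: "student"}
found = spellings[3] in table
print(len(distinct), found)
```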
--
-Scott David Daniels
Sc***********@Acm.Org

Jul 18 '05 #9

Martin v. Löwis

Paul Prescod wrote:

You are right. Re-reading Guido's complaint I understand what you mean.
But I have heard the argument in the past that Unicode source files
would break introspection tools. If that isn't a concern this time
around then disregard my suggestion.

That might be a problem, indeed. OTOH, those tools likely also
break if you use non-ASCII byte strings for identifiers.

Regards,
Martin

Jul 18 '05 #10

Dietrich Epp

On Feb 10, 2004, at 8:59 AM, Scott David Daniels wrote:

Also, we would have to solve the issue of multiple representations
for the same identifier (normalized identifiers)? There are four
equivalent representations:

(u'\N{Latin small letter e with acute}l'
u'\N{Latin small letter e with grave}ve')

(u'\N{Latin small letter e with acute}l'
u'e\N{Combining grave accent}ve')

(u'e\N{Combining acute accent}l'
u'\N{Latin small letter e with grave}ve')

(u'e\N{Combining acute accent}l'
u'e\N{Combining grave accent}ve')

Unicode says we should treat these four identically. Further,
they each have a distinct hash code, so a dictionary will not
necessarily even try to compare them to find them equal.

You could require that all identifiers be the canonically decomposed
Unicode representations encoded into UTF-8. This would mean that no
matter which string is chosen from the above, the result is always the
same sequence of characters. This is how many filesystems use unicode,
i.e., Mac HFS+ works this way (but filesystems usually also require a
specific version of Unicode for backwards compatibility).

I personally think that Unicode identifiers would be catastrophic.
With Unicode on the web, if you can't represent some characters, you
can't read the web page. With programming, it could mean that you are
unable to use a particular module, altering the functionality for
people who can't enter certain codes. There is also the issue of which
characters to allow, because some characters look like numbers. Is
unicode 'IV' a number or an identifier? What about a circled 4? What
about unicode line breaks and paragraph breaks? What about opening and
closing quote marks? What about right-to-left characters? What about
ligatures? Non-breaking spaces? Function application?

I think the assumption some people have is that Unicode will only ever
be used for things that are like the roman alphabet: adding diacritical
marks, etc. It sounds like the most worthless extension ever, and the
only language I think of when I think of special characters is
Intercal.

Jul 18 '05 #11

Scott David Daniels

Dietrich Epp wrote:

You could require that all identifiers be the canonically decomposed
Unicode representations encoded into UTF-8. This would mean that no
matter which string is chosen from the above, the result is always the
same sequence of characters. This is how many filesystems use unicode,
i.e., Mac HFS+ works this way (but filesystems usually also require a
specific version of Unicode for backwards compatibility).

There are several "Normal forms" for Unicode letters. You'd need to
choose one.

I personally think that Unicode identifiers would be catastrophic.....
{lotsa examples, some good, some not-so-good elided}

I'm reluctant to endorse it because I _know_ I'll see "Why doesn't my
program work?" accompanied by characters I'm not used to distinguishing.

I think the assumption some people have is that Unicode will only ever
be used for things that are like the roman alphabet: adding diacritical
marks, etc. It sounds like the most worthless extension ever, and the
only language I think of when I think of special characters is Intercal.

And this is why I had to comment. You obviously never dealt with APL.
I actually used it without an APL type ball, which was painful in the
extreme. When I give language summaries, my quote for APL is,
"APL is the only language where you regularly see one programmer walk
into another's office (well, cube now, but in the day....) and say,
'I bet you cannot guess what this one-line program does.'"

--
-Scott David Daniels
Sc***********@Acm.Org

Jul 18 '05 #12

Martin v. Löwis

Scott David Daniels wrote:

Because some of us use archaic systems and/or fonts which are
incapable of displaying such symbols. Never mind whether we
can read them.

Right. However, policy whether to use non-ASCII identifiers
because of such issues should be with the source code authors,
not with the language implementation. Being able to use non-ASCII
identifiers does not mean you *have* to; not being able means
you *cannot*.

Also, we would have to solve the issue of multiple representations
for the same identifier (normalized identifiers)?

I would use NFC, because it has the best chances of being displayed
properly even on terminals that don't do combining characters.

For the language itself, the specific choice of normalization form
is irrelevant - any form would do (but I agree that normalization
should happen).

Unicode says we should treat these four identically. Further,
they each have a distinct hash code, so a dictionary will not
necessarily even try to compare them to find them equal.

If identifiers are Unicode-normalized, this is not an issue -
all copies of the normal form will hash identical.
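
A sketch of that idea, assuming Python 3 and the standard unicodedata module. The helper name normalize_identifier is made up for illustration and is not part of any proposal.

```python
import unicodedata


def normalize_identifier(name: str) -> str:
    """Return the NFC form of a name (hypothetical helper)."""
    return unicodedata.normalize("NFC", name)


# Four spellings of the same word: precomposed and combining variants.
raw = [
    "\u00e9l\u00e8ve",
    "\u00e9le\u0300ve",
    "e\u0301l\u00e8ve",
    "e\u0301le\u0300ve",
]

# After normalization only one spelling is left, so every variant finds
# the same dictionary entry.
normalized = {normalize_identifier(n) for n in raw}
namespace = {normalize_identifier(raw[0]): 1}
hit = normalize_identifier(raw[3]) in namespace
print(len(normalized), hit)
```

Swapping "NFC" for "NFD" gives the decomposed form instead; either works, as long as one form is used everywhere.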

Regards,
Martin

Jul 18 '05 #13

Martin v. Löwis

Dietrich Epp wrote:

You could require that all identifiers be the canonically decomposed
Unicode representations encoded into UTF-8.

That would be unpythonic: non-ASCII identifiers should be represented
as Unicode objects, not as UTF-8 byte strings.

I personally think that Unicode identifiers would be catastrophic. With
Unicode on the web, if you can't represent some characters, you can't
read the web page. With programming, it could mean that you are unable
to use a particular module, altering the functionality for people who
can't enter certain codes.

It is the case that some people would have problems invoking certain
functions. Why would that be a catastrophe? Authors of Python software
should make a choice whether they prefer readability of the source code,
or accessibility to everyone. Depending on the situation, one choice
or the other may be appropriate. Python should not police that decision
for the developer.

There is also the issue of which characters
to allow, because some characters look like numbers.

Yes. I would go with a list similar to the Java one, except with a
few obvious restrictions (e.g. disallow currency symbols: Python
does not allow the DOLLAR SIGN in identifiers, whereas Java does).

Is unicode 'IV' a number or an identifier?

It is certainly *not* a number. I propose to change the syntax of
identifiers, not of numbers. Whether this specific character Ⅳ is
an identifier or should give a syntax error is a choice one needs
to make, certainly. What would be your choice?

What about a circled 4? What about unicode
line breaks and paragraph breaks? What about opening and closing quote
marks? What about right-to-left characters? What about ligatures?
Non-breaking spaces? Function application?

The Unicode consortium gives guidance on all these questions. As I said,
I would closely follow the Java principles, which were derived from
the Unicode consortium guidance. Here is my proposal:

Legal non-ASCII identifiers are what legal non-ASCII
identifiers are in Java, except that Python may use
a different version of the Unicode character database.
Python would share the property that future versions
allow more characters in identifiers than older versions.

If you are too lazy to look up the Java definition,
here is a rough overview:
An identifier is "JavaLetter JavaLetterOrDigit*"

JavaLetter is a character of the classes Lu, Ll,
Lt, Lm, or Lo, or a currency symbol (for Python:
excluding $), or a connecting punctuation character
(which is unfortunately underspecified - will
research the implementation).

JavaLetterOrDigit is a JavaLetter, or a digit,
a numeric letter, a combining mark, a non-spacing
mark, or an ignorable control character.

I believe this specification allows you to answer your questions
yourself.
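
A rough sketch of that overview as a checker, assuming Python 3 and unicodedata.category. It follows the overview as worded in this post, not the Java specification itself. The function names are made up, and treating "ignorable control character" as general category Cf is an assumption.

```python
import unicodedata

# Categories accepted for the first character ("JavaLetter" in the overview):
# letters, currency symbols, connecting punctuation.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Sc", "Pc"}

# Extra categories accepted after the first character: decimal digits,
# numeric letters, combining marks, non-spacing marks, and format characters
# (Cf is an assumed stand-in for "ignorable control character").
CONTINUE_CATEGORIES = START_CATEGORIES | {"Nd", "Nl", "Mc", "Mn", "Cf"}


def is_start(ch: str) -> bool:
    # The proposal excludes the dollar sign for Python.
    return ch != "$" and unicodedata.category(ch) in START_CATEGORIES


def is_continue(ch: str) -> bool:
    return ch != "$" and unicodedata.category(ch) in CONTINUE_CATEGORIES


def is_proposed_identifier(name: str) -> bool:
    """Check a name against the rough overview (hypothetical helper)."""
    return (
        bool(name)
        and is_start(name[0])
        and all(is_continue(ch) for ch in name[1:])
    )


for candidate in ["resum\u00e9", "\u2163", "x\u2163", "\u2463", "a\u00a0b"]:
    print(repr(candidate), is_proposed_identifier(candidate))
```

Under this reading, the Roman numeral Ⅳ (category Nl) may continue an identifier but not start one, while a circled 4 (category No) and a non-breaking space (category Zs) are rejected anywhere.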
I think the assumption some people have is that Unicode will only ever
be used for things that are like the roman alphabet: adding diacritical
marks, etc. It sounds like the most worthless extension ever, and the
only language I think of when I think of special characters is Intercal.

That is certainly not my assumption. Instead, I expect that this
extension will primarily be used by developers whose native language
is Russian, Japanese, Chinese, Korean, or Arabic. At least, I've heard
developers from these cultures ask for the specific feature in the
past (I've also heard French and German people ask for the feature,
but that fits with your expectation).

Regards,
Martin

Jul 18 '05 #14

Neil Hodgson

Scott David Daniels:

Because some of us use archaic systems and/or fonts which are
incapable of displaying such symbols. Never mind whether we
can read them.

For such circumstances, I would like to see hex escape sequences allowed
in identifiers as in Java. That means that there is a representation of last
resort that can be used by those using less capable tools. A simple filter
could translate to and from this format for the extremely rare occasions it
would be needed.
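
Such a filter could be sketched as follows, assuming Python 3. The function names are made up, and the \UXXXXXXXX form for characters above U+FFFF is a Python-style assumption rather than anything taken from Java.

```python
import re


def to_escapes(source: str) -> str:
    """Replace every non-ASCII character with a hex escape."""
    pieces = []
    for ch in source:
        code = ord(ch)
        if code < 128:
            pieces.append(ch)
        elif code <= 0xFFFF:
            pieces.append("\\u%04x" % code)
        else:
            pieces.append("\\U%08x" % code)
    return "".join(pieces)


_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{8})")


def from_escapes(source: str) -> str:
    """Turn the hex escapes back into the characters they stand for."""
    return _ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), source
    )


escaped = to_escapes("import resum\u00e9")
print(escaped)
print(from_escapes(escaped))
```

One caveat: the reverse direction also rewrites escapes that were already in the file, for example inside string literals, so a real tool would need to tokenize the source rather than use a plain regular expression.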

Neil

Jul 18 '05 #15

Joe Mason

In article <c0*************@news.t-online.com>, Martin v. Löwis wrote:

It is the case that some people would have problems invoking certain
functions. Why would that be a catastrophe?

Oh, it wouldn't be. Not being catastrophic doesn't make it good.

Authors of Python software should make a choice whether they prefer
readability of the source code, or accessibility to everyone.

Yeah, they should, but they won't. They'll go nuts with the cool
features and not stop to think about the consequences. Those of us
stuck cleaning up after them will then be hindered by the cool features
that don't work. History has shown us this.

If non-ASCII characters are allowed, they'll be used frivolously.
Somebody will put "et tu, Bruté" in a comment, or start their career
planning package with "import resumé", and these otherwise working
programs would break for people without Unicode support.

Python should not police that decision for the developer.

Why not? It polices everything else. Isn't Python still the "only one
way to do it" language?

If you were suggesting this for Perl or Ruby, I'd be all in favour (in fact,
it'd be especially appropriate for Ruby). But in Python it's perfectly
appropriate to restrict something that many people would find useful in
favour of simplicity and consistency.

Joe

Jul 18 '05 #16

Paul Prescod

Dietrich Epp wrote:

I personally think that Unicode identifiers would be catastrophic.

This is an overstatement. One of the great things about Python is that
it borrows from other languages. VB and C# for sure and I think Java
allow non-ASCII identifiers and there was no catastrophe. VB has its
problems but Unicode identifiers are not a big one.

I am +0 on this proposal because I really doubt it will cause me big
problems and at least some foreign language speakers claim it will make
their lives much easier. If they post to c.l.py asking for help with
code I can't read I'll tell them I can't read it. If they write
extension modules I can't use I'll just ask them to put an ASCII API
alongside their Unicode one (language is likely to be a bigger
readability problem than encoding anyhow)

Paul Prescod

Jul 18 '05 #17

Martin v. Löwis

Joe Mason wrote:

If non-ASCII characters are allowed, they'll be used frivolously.
Somebody will put "et tu, Bruté" in a comment

People can (and do) already put their natural language into comments;
whether or not non-ASCII characters are allowed in identifiers is
irrelevant for that usage.

Also, people don't need "Unicode support" to read those comments.
They just need an editor that can display the character set that
the people wrote their comments in.

Assuming you speak the language in which the comments are written,
you very likely have a text editor which can display them. Or you
use IDLE.

Python should not police that decision for the developer.

Why not? It polices everything else. Isn't Python still the "only one
way to do it" language?

And that wouldn't change: There would be only a single way to do

import resumé

Currently, there is no way, which is less than "only one way".

If you were suggesting this for Perl or Ruby, I'd be all in favour (in fact,
it'd be especially appropriate for Ruby). But in Python it's perfectly
appropriate to restrict something that many people would find useful in
favour of simplicity and consistency.

And indeed, using non-ASCII characters in identifiers is simple and
consistent.

Regards,
Martin

Jul 18 '05 #18
