
Proposal: require 7-bit source str's

Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.

It can be used to ensure that the source file doesn't contain national
characters that the program will treat as characters in the current
locale's character set instead of in the source file's character set.

An environment variable or command line option to set this for all
files would also be very useful (and -*- str7bit:False -*- to override
it), so one can easily check someone else's code for trouble spots.

Possibly an s'' syntax or something would also be useful for non-
Unicode strings that intentionally contain national characters.

I dislike the '7bit' part of the name - it's misleading both because
one can get 8-bit strings e.g. with the '\x<hex>' notation (a feature,
not a bug) and because some 'valid' characters will be 8bit in
character sets like EBCDIC. However, I can't think of a better name.
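A sketch of what such a check could look like as an external tool in today's Python, using the stdlib tokenize module (the function name and the reporting format are invented for illustration):

```python
import io
import tokenize

def find_non_ascii_literals(source):
    """Return (line, literal) pairs for non-u'' string literals
    that contain a character outside 7-bit ASCII."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        text = tok.string
        # Find the opening quote so the prefix (r, b, u, ...) can be examined.
        quote = min(p for p in (text.find("'"), text.find('"')) if p >= 0)
        if 'u' in text[:quote].lower():
            continue  # u'' literals may carry non-ASCII by design
        if any(ord(ch) > 127 for ch in text[quote:]):
            hits.append((tok.start[0], text))
    return hits

print(find_non_ascii_literals("a = u'h\u00e6r'\nb = 'h\u00e6r'\n"))  # flags only line 2
```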

Comments?
Has it been discussed before?

--
Hallvard
Jul 18 '05 #1
30 Replies


Hallvard B Furuseth wrote:
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.


Could

# -*- coding: ascii -*-

be sufficient? Why would you reintroduce ambiguity with your s-prefixed
strings? The long-term goal would be unicode throughout, IMHO.

Peter
Jul 18 '05 #2


"Hallvard B Furuseth" <h.**********@usit.uio.no> wrote in message
news:HB**************@bombur.uio.no...
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.

It can be used to ensure that the source file doesn't contain national
characters that the program will treat as characters in the current
locale's character set instead of in the source file's character set.

An environment variable or command line option to set this for all
files would also be very useful (and -*- str7bit:False -*- to override
it), so one can easily check someone else's code for trouble spots.

Possibly an s'' syntax or something would also be useful for non-
Unicode strings that intentionally contain national characters.

I dislike the '7bit' part of the name - it's misleading both because
one can get 8-bit strings e.g. with the '\x<hex>' notation (a feature,
not a bug) and because some 'valid' characters will be 8bit in
character sets like EBCDIC. However, I can't think of a better name.

Comments?
Has it been discussed before?
Is this even an issue? If you specify utf-8 as the character
set, I can't see how non-unicode strings could have
anything other than 7-bit ascii, for the simple reason that
the interpreter wouldn't know which encoding to use.
(of course, hex escapes would still be legal, as well as
constructed strings and strings read in and so forth.)

On the other hand, I don't know that it actually does it this
way, and PEP 263 seems to be completely uninformative
on the issue.

John Roth

Jul 18 '05 #3

John Roth wrote:
"Hallvard B Furuseth" <h.**********@usit.uio.no> wrote in message
news:HB**************@bombur.uio.no...
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.

It can be used to ensure that the source file doesn't contain national
characters that the program will treat as characters in the current
locale's character set instead of in the source file's character set.
(...)


Is this even an issue? If you specify utf-8 as the character
set, I can't see how non-unicode strings could have
anything other than 7-bit ascii, for the simple reason that
the interpreter wouldn't know which encoding to use.


Sorry, I should have included an example.

# -*- coding:iso-8859-1; str7bit:True; -*-

A = u'hør' # ok
B = 'hør' # error because of str7bit.
print B

The 'coding' directive ensures this source code is translated correctly
to Unicode. However, string B is then translated back to the source
character set so it can be stored as a str object and not a unicode
object.

The print statement just outputs the bytes in B, it doesn't do any
character set handling. So if your terminal uses latin-2, it will
output the 'ø' as Latin small letter r with caron.

coding:utf-8 wouldn't help. B would remain a plain string, not a
Unicode string. The raw utf-8 bytes would be output.
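In today's Python the mis-display is easy to reproduce by decoding the same bytes with the two charsets (a minimal sketch of the example above):

```python
# 0xF8 encodes 'ø' in iso-8859-1 but 'ř' in iso-8859-2, so a latin-2
# terminal shows B's bytes as 'hřr'.
data = 'hør'.encode('iso-8859-1')
assert data == b'h\xf8r'
print(data.decode('iso-8859-2'))  # hřr
```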

--
Hallvard
Jul 18 '05 #4

Peter Otten wrote:
Hallvard B Furuseth wrote:
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.
Could

# -*- coding: ascii -*-

be sufficient?


No. It would be used together with coding: <non-ascii charset>. The
point is to ensure that all non-ASCII strings are u'' strings instead
of plain strings.
Why would you reintroduce ambiguity with your s-prefixed
strings?
For programs that work with non-Unicode output devices or files and
know which character set they use. Which is quite a lot of programs.
The long-term goal would be unicode throughout, IMHO.


Whose long-term goal for what? For things like Internet communication,
fine. But there are lot of less 'global' applications where other
character encodings make more sense.

In any case, a language's short-term and long-term goals alike should be
to support current programming, not programming as it 'should be done'
some day in the future.

--
Hallvard
Jul 18 '05 #5

Hallvard B Furuseth wrote:
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.

It can be used to ensure that the source file doesn't contain national
characters that the program will treat as characters in the current
locale's character set instead of in the source file's character set.


I doubt this helps as much as you'd like. You will need to change every
source file with that annotation. While you are at it, you could just
as well check every source file directly.

So if anything, I think this should be a global option. Or, better yet,
external checkers like pychecker could check for that.

Regards,
Martin
Jul 18 '05 #6

Peter Otten wrote:
Could

# -*- coding: ascii -*-

be sufficient?


No. He still wants to allow non-ASCII in Unicode literals and
comments.

Regards,
Martin
Jul 18 '05 #7

Hallvard B Furuseth wrote:
Peter Otten wrote:
Hallvard B Furuseth wrote:
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.
Could

# -*- coding: ascii -*-

be sufficient?


No. It would be used together with coding: <non-ascii charset>. The
point is to ensure that all non-ASCII strings are u'' strings instead
of plain strings.


OK.
Why would you reintroduce ambiguity with your s-prefixed
strings?


For programs that work with non-Unicode output devices or files and
know which character set they use. Which is quite a lot of programs.


I'd say a lot of programs work with non-unicode, but many don't know what
they are doing - i. e. you cannot move them into an environment with a
different encoding (if you do they won't notice).
The long-term goal would be unicode throughout, IMHO.


Whose long-term goal for what? For things like Internet communication,
fine. But there are lot of less 'global' applications where other
character encodings make more sense.


Here we disagree. Showing the right image for a character should be the job
of the OS and should safely work cross-platform. Why shouldn't I be able to
store a file with a greek or chinese name? I wasn't able to quote Martin's
surname correctly for the Python-URL. That's a mess that should be cleaned
up once per OS rather than once per user. I don't see how that can happen
without unicode (only). Even NASA blunders when they have to deal with
meters and inches.
In any case, a language's short-term and long-term goals alike should be
to support current programming, not programming as it 'should be done'
some day in the future.


Well, Python's integers already work like they 'should be done'. I'm no
expert, but I think Java is closer to the 'real thing' concerning strings.
Perl 6 is going for unicode, if only to overcome the limitations of their
operator set (they want the yen symbol as a zipping operator because it
looks like a zipper :-).
You have to make compromises and I think an external checker would be the
way to go in your case. If I were to add a switch to Python's string
handling it would be "all-unicode". But it may well be that I would curse
it after the first real-world use...

Peter

Jul 18 '05 #8

Martin v. Löwis wrote:
Hallvard B Furuseth wrote:
Now that the '-*- coding: <charset> -*-' feature has arrived,
I'd like to see an addition:

# -*- str7bit:True -*-

After the source file has been converted to Unicode, cause a parse
error if a non-u'' string contains a non-7bit source character.

It can be used to ensure that the source file doesn't contain national
characters that the program will treat as characters in the current
locale's character set instead of in the source file's character set.
I doubt this helps as much as you'd like. You will need to change every
source file with that annotation.


perl -i.bak -pe '
/\bstr7bit\b/ or
s/^(\s*#.*?-\*-.*?coding[=:]\s*[\w.-]+)(?=[;\s])/$1;str7bit:True/
' `find . -name '*.py' | xargs grep -l 'coding[=:]'`
While you are at it, you could just
as well check every source file directly.
True at first pass, but if Python catches it, a file will stay
clean once it has been cleaned up and marked as str7bit. That's
particularly useful when several people are working on the source.

A fix to your objection would be to instead warn about the
offending strings _unless_ the file is marked with str7bit:False,
but I figure that's a bit too drastic for the time being:-)
So if anything, I think this should be a global option.
-W::str7bitWarning?

Come to think of it, that would also make it possible for a Python
program to reject add-ons (modules, execfile etc) which contain
unmarked 8-bit strings.
Or, better yet,
external checkers like pychecker could check for that.


Well, I don't think that's better, but if it's rejected for Python
that'll be my next stop.

--
Hallvard
Jul 18 '05 #9

Hallvard B Furuseth wrote:
Well, I don't think that's better, but if it's rejected for Python
that'll be my next stop.


I can see how the global warning is provided (which then can be
configured into an error through the warnings module). However,
I don't want to introduce additional magic comments. I already
dislike the coding declarations for being comments, and would
have preferred if they had been spelled as

directive encoding "utf-8"

The coding declaration was only acceptable because
- a statement would have to go before the doc string, in
which case it would not have been a docstring anymore, and
- there was prior art (Emacs and VIM) for declaring encodings
to editors, inside comments

Your proposed annotation has no prior art. As it has
effects on the syntax of the language, it should not be in
a comment.

Regards,
Martin
Jul 18 '05 #10

"Hallvard B Furuseth" <h.**********@usit.uio.no> wrote:
Martin v. Löwis wrote:
Or, better yet,
external checkers like pychecker could check for that.


Well, I don't think that's better, but if it's rejected for Python
that'll be my next stop.


I'm getting fed up with UnicodeDecodeError exceptions myself, so I've
added a pychecker feature request for this on sourceforge.

- Anders

Jul 18 '05 #11

Martin v. Löwis wrote:
Hallvard B Furuseth wrote:
Well, I don't think that's better, but if it's rejected for Python
that'll be my next stop.
I can see how the global warning is provided (which then can be
configured into an error through the warnings module). However,
I don't want to introduce additional magic comments. I already
dislike the coding declarations for being comments, and would
have preferred if they had been spelled as

directive encoding "utf-8"


I can see that.
The coding declaration was only acceptable because
- a statement would have to go before the doc string, in
which case it would not have been a docstring anymore, and
Hmm...

>>> def foo():
...     """hør"""
...     pass
...
>>> help(foo)
Help on function foo in module __main__:

foo()
    hør

>>> def bar():
...     u"""hør"""
...     pass
...
>>> help(bar)
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8'
in position 59: ordinal not in range(128)

Even if the doc string is written in English, it may still need to
use non-English names.
- there was prior art (Emacs and VIM) for declaring encodings
to editors, inside comments

Your proposed annotation has no prior art. As it has
effects on the syntax of the language, it should not be in
a comment.


Well, I think it belongs logically together with the coding
declarations, but I see your point.

Still, how about 'directive str7bit', 'directive -W::str7bitWarning' or
something?

Or '@str7bit' / '@option -W::str7bitWarning', come to think of it. Like
your 'directive encoding "utf-8"', it affects the parsing of the file.
So I think this is a good time to use a special character, so the
construct will stand out.

And '@decorator_name' would be '@decorator decorator_name', of course.
But I'll go to the other thread with that.

--
Hallvard
Jul 18 '05 #12

Hallvard B Furuseth wrote:
The coding declaration was only acceptable because
- a statement would have to go before the doc string, in
which case it would not have been a docstring anymore, and

Hmm...


You misunderstood. I was talking about the module docstring

"Written by Martin v. Löwis"
directive encoding "utf-8"

Here, the declaration comes after the first non-ASCII character,
which does not work.

directive encoding "utf-8"
"Written by Martin v. Löwis"

Here, the string is not a docstring anymore, because it is
not the first expression in the module.
>>> help(bar)

Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8'
in position 59: ordinal not in range(128)


That's a bug in the help function.
Even if the doc string is written in English, it may still need to
use non-English names.
Certainly. This is why the encoding declaration can't be a statement.
Still, how about 'directive str7bit', 'directive -W::str7bitWarning' or
something?


See PEP 244. I would have liked a directive statement, but the PEP was
rejected (in favour of __future__ imports, at that time).

A future import might work, except that this is a commitment that
the future comes some day, which, for str7bit, would not be the case:
there will *always* be a possibility to put non-ASCII bytes into 8-bit
strings (of course, requiring that people use escape sequences for
them might be acceptable).

In any case, you probably have to write a PEP for this.

Regards,
Martin
Jul 18 '05 #13

Martin v. Löwis wrote:
Hallvard B Furuseth wrote:
The coding declaration was only acceptable because
- a statement would have to go before the doc string, in
which case it would not have been a docstring anymore, and

Hmm...


You misunderstood. I was talking about the module docstring

"Written by Martin v. Löwis"


So if the file has -*- coding: iso-8859-1 -*-, how does that doc string
look to someone using an iso-8859-2 locale?

I would think such doc strings should be u"""...""" strings.
(After the help function is fixed:-)
directive encoding "utf-8"
"Written by Martin v. Löwis"

Here, the string is not a docstring anymore, because it is
not the first expression in the module.
Just like a str7bit directive, in whatever form, would not catch the
missing u in front of the doc string.
Still, how about 'directive str7bit', 'directive -W::str7bitWarning' or
something?


See PEP 244. I would have liked a directive statement, but the PEP was
rejected (in favour of __future__ imports, at that time).

A future import might work, except that this is a commitment that
the future comes some day, which, for str7bit, would not be the case:
there will *always* be a possibility to put non-ASCII bytes into 8-bit
strings


Yup.
(of course, requiring that people use escape sequences for
them might be acceptable).
Argh! Please, no.
In any case, you probably have to write a PEP for this.


I will.

--
Hallvard
Jul 18 '05 #14

Peter Otten wrote:
Hallvard B Furuseth wrote:
Peter Otten wrote:
Hallvard B Furuseth wrote:

Why would you reintroduce ambiguity with your s-prefixed
strings?
For programs that work with non-Unicode output devices or files and
know which character set they use. Which is quite a lot of programs.


I'd say a lot of programs work with non-unicode, but many don't know what
they are doing - i. e. you cannot move them into an environment with a
different encoding (if you do they won't notice).


True, but for them it would probably be simpler to not use the str7bit
declaration, or to explicitly declare str7bit:False for the entire file.
The long-term goal would be unicode throughout, IMHO.


Whose long-term goal for what? For things like Internet communication,
fine. But there are lot of less 'global' applications where other
character encodings make more sense.


Here we disagree. Showing the right image for a character should be
the job of the OS and should safely work cross-platform.


Yes. What of it?

Programs that show text still need to know which character set the
source text has, so it can pass the OS the text it expects, or send a
charset directive to the OS, or whatever.
Why shouldn't I be able to store a file with a greek or chinese name?
If you want an OS that allows that, get an OS which allows that.
I wasn't able to quote Martin's
surname correctly for the Python-URL. That's a mess that should be cleaned
up once per OS rather than once per user. I don't see how that can happen
without unicode (only). Even NASA blunders when they have to deal with
meters and inches.
Yes, there are many non-'global' applications too where Unicode is
desirable. What of it?

Just because you want Unicode, why shouldn't I be allowed to use
other charcater encodings in cases where they are more practical?

For example, if one uses character set ns_4551-1 - ASCII with {|}[\]
replaced with æøåÆØÅ, sorting by simple byte ordering will sort text
correctly. Unicode text _can't_ be sorted correctly, because of
characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
with that, while German 'ö' should not match 'ø' and sorts with 'o'.
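The byte-ordering point can be simulated in a few lines; the translation table below is my reconstruction of the NS 4551-1 mapping described above:

```python
# NS 4551-1 puts æ ø å (Æ Ø Å) on the ASCII bytes for { | } ([ \ ]),
# so plain byte comparison gives the Norwegian order æ < ø < å.
ns4551 = str.maketrans('æøåÆØÅ', '{|}[\\]')
words = ['ål', 'ære', 'øre']
print(sorted(words))                                     # codepoint order: å < æ < ø (wrong)
print(sorted(words, key=lambda w: w.translate(ns4551)))  # æ < ø < å (correct)
```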
In any case, a language's short-term and long-term goals alike should be
to support current programming, not programming as it 'should be done'
some day in the future.


Well, Python's integers already work like they 'should be done'.


And they can be used that way now.
I'm no
expert, but I think Java is closer to the 'real thing' concerning strings.
I don't know Java.
Perl 6 is going for unicode, if only to overcome the limitations of their
operator set (they want the yen symbol as a zipping operator because it
looks like a zipper :-).
I don't know Perl 6, but Perl 5 is an excellent example of how not to do
this. So is Emacs' MULE, for that matter.

I recently had to downgrade to perl5.004 when perl5.8 broke my programs.
They worked fine until they were moved to a machine where someone had
set up the locale to use UTF-8. Then Perl decided that my data, which
has nothing at all to do with the locale, was Unicode data. I tried to
insert 'use bytes', but that didn't work. It does seem to work in newer
Perl versions, but it's not clear to me how many places I have to insert
some magic to prevent that. Nor am I interested in finding out: I just
don't trust the people who released such a piece of crap to leave my
non-Unicode strings alone. In particular since _most_ of the strings
are UTF-8, so I wonder if Perl might decide to do something 'friendly'
with them.
You have to make compromises and I think an external checker would be
the way to go in your case. If I were to add a switch to Python's
string handling it would be "all-unicode".
Meaning what?
But it may well be that I would curse it after the first real-world
use...


--
Hallvard
Jul 18 '05 #15

Hallvard B Furuseth wrote:
"Written by Martin v. Löwis"

So if the file has -*- coding: iso-8859-1 -*-, how does that doc string
look to someone using an iso-8859-2 locale?


Let's start all over. I'm referring to a time when there was no encoding
declaration, and PEP 263 was not written yet. At that time, I thought
that a proper encoding declaration (i.e. a statement) would be the
best thing to do. So in my example, there is no -*- coding: iso-8859-1
-*- in the file. Instead, there is a directive.

About the unrelated question: How should a docstring be displayed
to a user working in a different locale? Well, in theory, the docstring
should be converted from its source encoding to the encoding where
it is displayed. In practice, this is difficult to implement, and
requires access to the original source code. However, Francois Pinard
has suggested to add an __encoding__ attribute to each module,
which could be used to recode the docstring.

About your literal question: In the current implementation, the string
looks just fine, as this docstring is codepoint-by-codepoint identical
in iso-8859-1 and iso-8859-2.
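This is easy to check: 'ö' occupies code point 0xF6 in both charsets, so the bytes round-trip unchanged (a quick sketch):

```python
s = "Written by Martin v. Löwis"
# ö is 0xF6 in iso-8859-1 and iso-8859-2 alike.
assert s.encode('iso-8859-1').decode('iso-8859-2') == s
print("identical under both encodings")
```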
Just like a str7bit directive, in whatever form, would not catch the
missing u in front of the doc string.


Not necessarily. It would be possible to go back and find all strings
that fail to meet the requirement.

Notice that your approach only works for languages with single-byte
character sets anyway. Many multi-byte character sets use only
bytes < 128, and still they should get the warning you want to produce.
(of course, requiring that people use escape sequences for
them might be acceptable).

Argh! Please, no.


Think again. There absolutely is a need to represent byte arrays
in Python source code, e.g. for libraries that manipulate binary
data, e.g. generate MPEG files and so on. They do have a legitimate
need to represent arbitrary bytes in source code, with no intention
of these bytes being interpreted as characters.
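In later Python terms this is the b'' literal: with \x escapes the bytes stay arbitrary while the source text stays 7-bit (the MPEG start code below is just an illustration):

```python
# Arbitrary binary data written with \x escapes keeps the source ASCII.
mpeg_seq_header = b'\x00\x00\x01\xb3'  # MPEG-1 sequence-header start code
assert len(mpeg_seq_header) == 4 and mpeg_seq_header[-1] == 0xB3
print(mpeg_seq_header.hex())  # 000001b3
```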

Regards,
Martin
Jul 18 '05 #16

Martin v. Löwis wrote:
Hallvard B Furuseth wrote:
"Written by Martin v. Löwis"
So if the file has -*- coding: iso-8859-1 -*-, how does that doc string
look to someone using a iso-8859-2 locale?


Let's start all over. I'm referring to a time when there was no encoding
declaration, and PEP 263 was not written yet. At that time, I thought
that a proper encoding declaration (i.e. a statement) would be the
best thing to do. So in my example, there is no -*- coding: iso-8859-1
-*- in the file. Instead, there is a directive.

About the unrelated question: How should a docstring be displayed
to a user working in a different locale? Well, in theory, the docstring
should be converted from its source encoding to the encoding where
it is displayed. In practice, this is difficult to implement, and
requires access to the original source code. However, Francois Pinard
has suggested to add an __encoding__ attribute to each module,
which could be used to recode the docstring.


Sounds OK for normal use. It's not reliable, though: If files f1 and
f2 have different 'coding:'s and f1 does execfile(f2), f2's doc strings
won't match sys.modules[<something from f2>.__module__].__encoding__.

(Or maybe the other way around: I notice that the execfile sets
f1.__doc__ = <f2's doc string>. But I'll report that as a bug.)
About your literal question: In the current implementation, the string
looks just fine, as this docstring is codepoint-by-codepoint identical
in iso-8859-1 and iso-8859-2.
Whoops. Please pretend I said iso-8859-5 or something. I was thinking
of ø, not ö. Had just written about that in another posting.
Just like a str7bit directive, in whatever form, would not catch the
missing u in front of the doc string.


Not necessarily. It would be possible to go back and find all strings
that fail to meet the requirement.


That sounds like it could have a severe performance impact. However,
maybe the compiler can set a flag if there are any such strings when it
converts parsed strings from Unicode back to the file's encoding. Then
the str7bit directive can warn that the file contains one or more bad
strings. Or if the directive is executed while the file is being
parsed, it can catch such strings below the directive and give the less
informative warning if there are such string above the directive.

I can't say I like the idea, though. It assumes Python retains the
internal implementations of 'coding:' which is described in PEP 263:
Convert the source code to Unicode, then convert string literals back
to the source character set.
Notice that your approach only works for languages with single-byte
character sets anyway. Many multi-byte character sets use only
bytes < 128, and still they should get the warning you want to produce.


They will. That's why I specified to do this after conversion to
Unicode. But I notice my spec was unclear about that point.

New spec:

After the source file has been converted to Unicode, cause a
parse error if a non-u'' string contains a converted character
whose Unicode code point is >= 128.

Except...

None of this properly addresses encodings that are not ASCII supersets
(or subsets), like EBCDIC. Both Python and many Python programs seem to
make the assumption that the character set is ASCII-based, so plain
strings (with type str) can be output without conversion, while Unicode
strings must be converted to the output device's character set.
E.g. from Info node 'File Objects':

`encoding'
The encoding that this file uses. When Unicode strings are written
to a file, they will be converted to byte strings using this
encoding.

Nothing about converting 'str' strings. Solving that one seems far out
of scope for this PEP-to-be, so my proposal inherits the above
assumption. Though the problem may have to be discussed in order to get
a str7bit feature which does not get in the way of a clean solution to
character sets like EBCDIC.
(of course, requiring that people use escape sequences for
them might be acceptable).


Argh! Please, no.


Think again. There absolutely is a need to represent byte arrays
in Python source code, e.g. for libraries that manipulate binary
data, e.g. generate MPEG files and so on. They do have a legitimate
need to represent arbitrary bytes in source code, with no intention
of these bytes being interpreted as characters.


Sure. I wasn't protesting against people using of escape sequences.
I was protesting against requiring that people use them.

--
Hallvard
Jul 18 '05 #17

Hallvard B Furuseth wrote:
That sounds like it could have a severe performance impact. However,
maybe the compiler can set a flag if there are any such strings when it
converts parsed strings from Unicode back to the file's encoding.
Yes. Unfortunately, line information is gone by that time, so you can't
point to the place of the error anymore.
I can't say I like the idea, though. It assumes Python retains the
internal implementations of 'coding:' which is described in PEP 263:
Convert the source code to Unicode, then convert string literals back
to the source character set.
It's a pretty safe assumption, though. It is the only reasonable
implementation strategy.
Notice that your approach only works for languages with single-byte
character sets anyway. Many multi-byte character sets use only
bytes < 128, and still they should get the warning you want to produce.

They will. That's why I specified to do this after conversion to
Unicode. But I notice my spec was unclear about that point.


Ah, ok.
None of this properly addresses encodings that are not ASCII supersets
(or subsets), like EBCDIC. Both Python and many Python programs seem to
make the assumption that the character set is ASCII-based, so plain
strings (with type str) can be output without conversion, while Unicode
strings must be converted to the output device's character set.
Yes, Python assumes ASCII. There is some code for EBCDIC support,
but on those platforms, Unicode is not supported.
Sure. I wasn't protesting against people using of escape sequences.
I was protesting against requiring that people use them.


But isn't that the idea of the str7bit feature? How else would you
put non-ASCII bytes into a string literal while simultaneously turning
on the 7-bit feature?

Regards,
Martin
Jul 18 '05 #18

Martin v. Löwis wrote:
Hallvard B Furuseth wrote:
That sounds like it could have a severe performance impact. However,
maybe the compiler can set a flag if there are any such strings when it
converts parsed strings from Unicode back to the file's encoding.


Yes. Unfortunately, line information is gone by that time, so you can't
point to the place of the error anymore.


True. One could recompile with a str7bit option to catch it earlier,
or one could make str7bit a compiler directive - then the unidentified
string will in practice be the doc string above the directive.
I can't say I like the idea, though. It assumes Python retains the
internal implementations of 'coding:' which is described in PEP 263:
Convert the source code to Unicode, then convert string literals back
to the source character set.


It's a pretty safe assumption, though. It is the only reasonable
implementation strategy.


I disagree:

- For a number of source encodings (like utf-8:-) it should be easy
to parse and charset-convert in the same step, and only convert
selected parts of the source to Unicode.

- I think the spec is buggy anyway. Converting to Unicode and back
can change the string representation. But I'll file a separate
bug report for that.
Sure. I wasn't protesting against people using of escape sequences.
I was protesting against requiring that people use them.


But isn't that the idea of the str7bit feature? How else would you
put non-ASCII bytes into a string literal while simultaneously turning
on the 7-bit feature?


Sorry, I thought you were speaking of promising a __future__ when all
string literals are required to be 7-bit or u'' literals.

To use non-ASCII str literals with the str7bit feature turned on:
- insert a 'str7bit:False' declaration in the file, or
- use the s'8-bit str literal' syntax I suggested.

--
Hallvard
Jul 18 '05 #19

Hallvard B Furuseth wrote:
- For a number of source encodings (like utf-8:-) it should be easy
to parse and charset-convert in the same step, and only convert
selected parts of the source to Unicode.
Correct. However, that it works "for a number of source encodings"
is insufficient - if it doesn't work for all of them, it only
unreasonably complicates the code.

For some source encodings (namely the CJK ones), conversion to UTF-8
is absolutely necessary even for proper lexical analysis, as the
byte that represents a backslash in ASCII might be the first byte
of a two-byte sequence.
- I think the spec is buggy anyway. Converting to Unicode and back
can change the string representation. But I'll file a separate
bug report for that.
That is by design. The only effect of such a bug report will be that
the documentation clearly clarifies that. Users that need to make
sure the run-time representation of a string is the same of as the
source representation need to pick a source encoding that round-trips.
Sorry, I thought you were speaking of promising a __future__ when all
string literals are required to be 7-bit or u'' literals.


Yes, but that *will* cause a wide debate. Say, Python 3.5, to be
released 2017 or so. I could live with such a language, but I'm
certain many users can't, in any foreseeable future.

Regards,
Martin
Jul 18 '05 #20

Martin v. Löwis:
For some source encodings (namely the CJK ones), conversion to UTF-8
is absolutely necessary even for proper lexical analysis, as the
byte that represents a backslash in ASCII might be the first byte
of a two-byte sequence.


Do you have a link to such an encoding? I understand 0x5c, '\' is often
displayed as a yen sign, but haven't seen it as the start byte of a
multi-byte character.

Regarding the 's' string prefix in the proposal, adding more prefixes
damages ease of understanding particularly when used in combination. There
should be a very strong need before another is introduced: I'd really hate
to be trying to work out the meaning of:

r$tu"/Raw/ $interpolated, translated Unicode string"

Neil
Jul 18 '05 #21

Neil Hodgson wrote:
Do you have a link to such an encoding? I understand 0x5c, '\' is often
displayed as a yen sign, but haven't seen it as the start byte of a multi
byte character.
The ISO-2022 ones:
u"\u69f9\u6a0c".encode("iso-2022-jp")

'\x1b$B\\_\\n\x1b(B'

ESC $ B and ESC ( B are the codeset switch sequences, and \_ \n are
the actual encodings of the characters.
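The effect is easy to reproduce in modern Python, where the same codec
ships with CPython (a sketch):

```python
# U+69F9 and U+6A0C encode to the byte pairs 5C 5F and 5C 6E in ISO-2022-JP.
encoded = "\u69f9\u6a0c".encode("iso-2022-jp")
print(encoded)  # b'\x1b$B\\_\\n\x1b(B'

# 0x5c is '\' in ASCII, so a lexer scanning raw bytes for backslashes
# would misread the middle of these characters as escape sequences.
assert b"\x5c" in encoded
```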
Regarding the 's' string prefix in the proposal, adding more prefixes
damages ease of understanding particularly when used in combination. There
should be a very strong need before another is introduced: I'd really hate
to be trying to work out the meaning of:

r$tu"/Raw/ $interpolated, translated Unicode string"


Indeed. Perhaps some combinations can be ruled out, though.

Regards,
Martin
Jul 18 '05 #22

Martin v. Löwis wrote:
Hallvard B Furuseth wrote:
- For a number of source encodings (like utf-8:-) it should be easy
to parse and charset-convert in the same step, and only convert
selected parts of the source to Unicode.
Correct. However, that it works "for a number of source encodings"
is insufficient - if it doesn't work for all of them, it only
unreasonably complicates the code.


For UTF-8 source, the complication might simply be to not call a charset
conversion routine. For some other character sets - well, fixing the
problem below would probably introduce that complication anyway.
For some source encodings (namely the CJK ones), conversion to UTF-8
is absolutely necessary even for proper lexical analysis, as the
byte that represents a backslash in ASCII might be the first byte
of a two-byte sequence.


No. It's necessary to convert the source file to logical characters
and feed those to the parser in some way, and conversion to UTF-8 is
a simple way to do that.

I think the 'right way', as far as source character set handling is
concerned, would be to have the source reader and the language parser
cooperate: The reader translates the source file to logical source
characters which it feeds to the parser (UTF-8 is fine for that), and
the parser notifies the reader when it sees the start and end of a
source character string which should be given to the parser in its
original form (by some other means than feeding it to the parser as if
it was charset-converted source code, of course).

Now, that might conflict with Python's design goals, if it is supposed
to be possible to keep the reading and parsing steps separate. Or it
might just take more effort to rearrange the code than anyone is
interested in doing. But in either case it still looks like a bug to
me, even if it's at best a low-priority one.
- I think the spec is buggy anyway. Converting to Unicode and back
can change the string representation. But I'll file a separate
bug report for that.


That is by design. The only effect of such a bug report will be that
the documentation clearly clarifies that.


OK, I'll make it a doc bug.

--
Hallvard
Jul 18 '05 #23

Hallvard B Furuseth wrote:
The long-term goal would be unicode throughout, IMHO.

Whose long-term goal for what? For things like Internet communication,
fine. But there are lot of less 'global' applications where other
character encodings make more sense.
More sense? I doubt that. What does make sense is an api that abstracts from
the encoding. You can then reduce the points where data in limited, i.e.
non-unicode encodings is imported/exported as the adoption of unicode grows
without affecting the core of your app. IMHO chr(ord("a") - 32) is inferior
to "a".upper() even in an all-ascii environment.
Here we disagree. Showing the right image for a character should be
the job of the OS and should safely work cross-platform.
Yes. What of it?


I don't understand the question.

Programs that show text still need to know which character set the
source text has, so it can pass the OS the text it expects, or send a
charset directive to the OS, or whatever.
Why shouldn't I be able to store a file with a greek or chinese name?
If you want an OS that allows that, get an OS which allows that.


That was not the point. I was trying to say that the usefulness of a
standard grows with its adoption.
I wasn't able to quote Martin's
surname correctly for the Python-URL. That's a mess that should be
cleaned up once per OS rather than once per user. I don't see how that
can happen without unicode (only). Even NASA blunders when they have to
deal with meters and inches.


Yes, there are many non-'global' applications too where Unicode is
desirable. What of it?


I don't understand the question.
Just because you want Unicode, why shouldn't I be allowed to use
other character encodings in cases where they are more practical?
Again, my contention is that once the use of unicode has reached the tipping
point you will encounter no cases where other encodings are more practical.
For example, if one uses character set ns_4551-1 - ASCII with {|}[\]
replaced with æøåÆØÅ, sorting by simple byte ordering will sort text
correctly. Unicode text _can't_ be sorted correctly, because of
characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
with that, while German 'ö' should not match 'ø' and sorts with 'o'.
Why not sort depending on the locale instead of ordinal values of the
bytes/characters?
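The difference is easy to see in a sketch: plain sorted() compares code
points, while locale-aware collation would go through locale.strxfrm
(which locale is active is an environment question, so only the
code-point order is demonstrated here):

```python
words = ["ol", "zebra", "öl"]

# Code-point order puts 'ö' (U+00F6) after 'z', which no German or
# Swedish reader expects to see in a sorted list.
assert sorted(words) == ["ol", "zebra", "öl"]

# Locale-aware collation would instead use something like:
#   locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
#   sorted(words, key=locale.strxfrm)
# which only works if that locale is installed on the system.
```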
In any case, a language's both short-term and long-term goals should be
to support current programming, not programming like it 'should be done'
some day in the future.

At some point you have to ask yourself whether the dirty tricks that work
depending on the country you live in, its current orthography and the
current state of your favourite programming language do save you some time
at so many places in your program that one centralized api that does it
right is more efficient even today.
I don't know Perl 6, but Perl 5 is an excellent example of how not do to
this. So is Emacs' MULE, for that matter.

I recently had to downgrade to perl5.004 when perl5.8 broke my programs.
They worked fine until they were moved to a machine where someone had
set up the locale to use UTF-8. Then Perl decided that my data, which
has nothing at all to do with the locale, was Unicode data. I tried to
insert 'use bytes', but that didn't work. It does seem to work in newer
Perl versions, but it's not clear to me how many places I have to insert
some magic to prevent that. Nor am I interested in finding out: I just
don't trust the people who released such a piece of crap to leave my
non-Unicode strings alone. In particular since _most_ of the strings
are UTF-8, so I wonder if Perl might decide to do something 'friendly'
with them.


I see you know more Perl than me - well, my mentioning of the zipper was
rather a lightweight digression prompted by the ongoing decorator frenzy.
the way to go in your case. If I were to add a switch to Python's
string handling it would be "all-unicode".


Meaning what?


All strings are unicode by default. If you need byte sequences instead of
character sequences you would have to provide a b-prefixed string.

Peter

Jul 18 '05 #24

Peter Otten wrote:
Hallvard B Furuseth wrote:
> The long-term goal would be unicode throughout, IMHO.

Whose long-term goal for what? For things like Internet communication,
fine. But there are lot of less 'global' applications where other
character encodings make more sense.
More sense? I doubt that. What does make sense is an api that abstracts from
the encoding.
If the application knows which encoding it is so it can convert at all,
and is 'big enough' to bother with encoding back and forth, and the
encoding doesn't already provide what one needs such abstraction to do.
You can then reduce the points where data in limited, i.e.
non-unicode encodings is imported/exported as the adoption of unicode grows
without affecting the core of your app. IMHO chr(ord("a") - 32) is inferior
to "a".upper() even in an all-ascii environment.


If you mean 'limited' to some other character set than Unicode, that's
not much use if the application is designed for something which has that
'limited' character set/encoding anyway.
Here we disagree. Showing the right image for a character should be
the job of the OS and should safely work cross-platform.


Yes. What of it?


I don't understand the question.


I explained that in the next paragraph:
Programs that show text still need to know which character set the
source text has, so it can pass the OS the text it expects, or send a
charset directive to the OS, or whatever.
If you disagree with that, is that because you think of Unicode as The
One True Character Set which everything can assume is in use if not
otherwise specified? That's a long way from the world I'm living in.
Besides, even if you have 'everything is Unicode', that still doesn't
necessarily mean UTF-8. It could be UCS-4, or whatever. Unicode or no,
displaying a character does involve telling the OS what encoding is in
use. Or not telling it and trusting the application to handle it, which
is again what's being done outside the Unicode world.
Why shouldn't I be able to store a file with a greek or chinese name?


If you want an OS that allows that, get an OS which allows that.


That was not the point. I was trying to say that the usefulness of a
standard grows with its adoption.


And the thing about standards is that there are so many of them to
choose from. Enforcing a standard somewhere in an environment where
that is not the standard is not useful. Try the standard of driving on
the right side of the road in a country where everyone else drives on
the left side. Standards are supposed to serve us, it's not we who are
supposed to serve standards.
I wasn't able to quote Martin's
surname correctly for the Python-URL. That's a mess that should be
cleaned up once per OS rather than once per user. I don't see how that
can happen without unicode (only). Even NASA blunders when they have to
deal with meters and inches.


Yes, there are many non-'global' applications too where Unicode is
desirable. What of it?


I don't understand the question.


You claimed one non-global application where Unicode would have been
good, as an argument that there are no non-global applications where
Unicode would not be good.
Just because you want Unicode, why shouldn't I be allowed to use
other character encodings in cases where they are more practical?


Again, my contention is that once the use of unicode has reached the tipping
point you will encounter no cases where other encodings are more practical.


So because you are fond of Unicode, you want to force a quick transition
on everyone else and leave us to deal with the troubles of the
transition, even in cases where things worked perfectly fine without
Unicode.

But I'm pretty sure that "tipping point" where no cases of non-Unicode
are practical is pretty close to 100% usage of Unicode around the
world.
For example, if one uses character set ns_4551-1 - ASCII with {|}[\]
replaced with æøåÆØÅ, sorting by simple byte ordering will sort text
correctly. Unicode text _can't_ be sorted correctly, because of
characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
with that, while German 'ö' should not match 'ø' and sorts with 'o'.


Why not sort depending on the locale instead of ordinal values of the
bytes/characters?


I'm in Norway. Both Swedes and Germans are foreigners.
In any case, a language's both short-term and long-term goals should be
to support current programming, not programming like it 'should be done'
some day in the future.
At some point you have to ask yourself whether the dirty tricks that work
depending on the country you live in, its current orthography and the
current state of your favourite programming language do save you some time
at so many places in your program that one centralized api that does it
right is more efficient even today.


Just that you are fond of Unicode and think that's the Right Solution to
everything, doesn't make other ways of doing things a dirty trick.

As for dirty tricks, that's exactly what such premature standardization
leads to, and one reason I don't like it. Like Perl and Emacs which
have decided that if they don't know which character set is in use, then
it's the character set of the current locale (if they can deduce it) -
even though they have no idea if the data they are processing have
anything to do with the current locale. I wrote a long rant addressed
to the wrong person about that recently; please read article
<HB**************@bombur.uio.no> in the 'PEP 263 status check' thread.
If I were to add a switch to Python's
string handling it would be "all-unicode".


Meaning what?


All strings are unicode by default. If you need byte sequences instead of
character sequences you would have to provide a b-prefixed string.


I've been wondering about something like that myself, but it still
requires the program to be told which character set is in use so it can
convert back and forth between that and Unicode. To get that right,
Python would need to tag I/O streams and other stuff with their
character set/encoding. And either Python would have to guess when it
didn't know (like looking at the locale's name), or if it didn't
programmers would guess to get rid of the annoyance of encoding
exceptions cropping up everywhere. Then at a later date we'd have to
clean up all the code with the bogus guesses, so the problem would
really just have been transformed to another problem...
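For what it's worth, this is roughly the model later Pythons adopted:
I/O streams are tagged with an explicit encoding, and the bytes/text
conversion happens at the stream boundary (a sketch; the temp-file name
is throwaway):

```python
import os
import tempfile

# Write text through a stream tagged with an explicit encoding ...
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="iso-8859-1") as f:
    f.write("blåbær")

# ... and the bytes on disk are in that encoding, not the locale's.
with open(path, "rb") as f:
    raw = f.read()
assert raw == b"bl\xe5b\xe6r"
os.remove(path)
```

The guessing problem remains, of course: when no encoding is given,
something still has to pick a default.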

--
Hallvard
Jul 18 '05 #25


Hallvard B Furuseth wrote:
For example, if one uses character set ns_4551-1 - ASCII with {|}[\]
replaced with æøåÆØÅ, sorting by simple byte ordering will sort text
correctly. Unicode text _can't_ be sorted correctly, because of
characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
with that, while German 'ö' should not match 'ø' and sorts with 'o'.


Why not sort depending on the locale instead of ordinal values of the
bytes/characters?

I'm in Norway. Both Swedes and Germans are foreigners.


I agree with many things you said, but this example is bogus. If I
(as a German) use ns_4551-1, sorting is simple - and incorrect, because,
as you say, ö sorts with o in my language - yet the simple sorting of
ns_4551-1 doesn't. So sorting is *not* simple with ns_4551-1.

Likewise, sorting *is* possible with Unicode if you take the locale into
account. The order of characters doesn't have to be the numerical one,
and, as you explain, it might even depend on the locale. So if you
want a Swedish collation, use a Swedish locale; if you want a German
collation, use a German locale.

Regards,
Martin
Jul 18 '05 #27

Martin v. Löwis wrote:
Hallvard B Furuseth wrote:
For example, if one uses character set ns_4551-1 - ASCII with {|}[\]
replaced with æøåÆØÅ, sorting by simple byte ordering will sort text
correctly. Unicode text _can't_ be sorted correctly, because of
characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
with that, while German 'ö' should not match 'ø' and sorts with 'o'.

Why not sort depending on the locale instead of ordinal values of the
bytes/characters?
I'm in Norway. Both Swedes and Germans are foreigners.


I agree with many things you said, but this example is bogus. If I
(as a German) use ns_4551-1, sorting is simple - and incorrect, because,
as you say, ö sorts with o in my language - yet the simple sorting of
ns_4551-1 doesn't. So sorting is *not* simple with ns_4551-1.


Sorry, I seem to have left out a vital point here: I thought the correct -
or rather, least incorrect - ns_4551-1 character for German ö was o, not
ø. Then it works out. Oh well, one learns something every day. Time
to check if there are other examples, or if I can forget it... Gotta
try an easy one - would you also translate German ä to æ rather than a?
Likewise, sorting *is* possible with Unicode if you take the locale
into account. The order of characters doesn't have to be the numerical
one, and, as you explain, it might even depend on the locale. So if
you want a Swedish collation, use a Swedish locale; if you want a
German collation, use a German locale.


And if I want to get both right, I need a sort_name field which is
distinct from the display_name field. There you would be lowis, while
the Swede Törnquist would be törnquist. Or maybe lowis\tlöwis or
something; a kind of private implementation of strxfrm().
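The sort_name idea can be sketched directly with per-entry keys (the
names and folding rules follow the examples in this thread):

```python
# (display_name, sort_name) pairs: each entry carries its own collation key,
# chosen according to the owner's language.
people = [
    ("Törnquist", "törnquist"),  # Swedish: ö kept, sorts after z
    ("Löwis", "lowis"),          # German: ö folded to o
]

# Sort by the private key, display the original spelling.
ordered = [display for display, key in sorted(people, key=lambda p: p[1])]
assert ordered == ["Löwis", "Törnquist"]
```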

--
Hallvard
Jul 18 '05 #28

Hallvard B Furuseth wrote:
I agree with many things you said, but this example is bogus. If I
(as a German) use ns_4551-1, sorting is simple - and incorrect, because,
as you say, ö sorts with o in my language - yet the simple sorting of
ns_4551-1 doesn't. So sorting is *not* simple with ns_4551-1.

Ah, I missed the point that there is no ö in ns_4551-1. If so, then the
best way to represent the characters is to replace ö with "oe" and ä
with "ae"; replacing them merely with "o" and "a" would be considered
inadequate.
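The oe/ae folding is a one-line table in Python (a sketch; the ü and ß
entries are the conventional German folds, not from this thread):

```python
# Standard German ASCII folding: each umlaut becomes base letter + 'e'.
fold = str.maketrans({"ö": "oe", "ä": "ae", "ü": "ue", "ß": "ss"})

assert "Löwis".translate(fold) == "Loewis"
assert "Härte".translate(fold) == "Haerte"
```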
And if I want to get both right, I need a sort_name field which is
distinct from the display_name field. There you would be lowis, while
the Swede Törnquist would be törnquist. Or maybe lowis\tlöwis or
something; a kind of private implementation of strxfrm().


But you can have a strxfrm for Unicode as well! There is nothing
inherent in Unicode that prevents using the same approach.

Of course, the question always is what result you *want*: If you
have text that contains simultaneously Latin and Greek characters,
how would you like to collate it? Neither the German or Greek
collation rules are likely to help, as they don't consider the issue
of additional alphabets. If possible, you should assign a language
tag to each entry, and then sort first by language, then according
to the language's collation rules.

Regards,
Martin
Jul 18 '05 #29

Martin v. Löwis wrote:
Hallvard B Furuseth wrote:
I agree with many things you said, but this example is bogus. If I
(as a German) use ns_4551-1, sorting is simple - and incorrect, because,
as you say, ö sorts with o in my language - yet the simple sorting of
ns_4551-1 doesn't. So sorting is *not* simple with ns_4551-1.
Ah, I missed the point that there is no ö in ns_4551-1. If so, then the
best way to represent the characters is to replace ö with "oe" and ä
with "ae"; replacing them merely with "o" and "a" would be considered
inadequate.
Duh. Of course. We usually did that too when we had to write Norwegian
in ASCII. It bites sometimes, though - like when it hits the common '1
character = 1 byte' assumption which someone -- John Roth? -- mentioned.
Maybe that's why we are getting ø->o in e-mail addresses and such
things nowadays, to keep things simple.

In a way, it is rather nice to notice that I'm forgetting that stuff.
Maybe someday I won't even be able to read texts with {|} for æøå
without slowing down:-)
And if I want to get both right, I need a sort_name field which is
distinct from the display_name field. There you would be lowis, while
the Swede Törnquist would be törnquist. Or maybe lowis\tlöwis or
something; a kind of private implementation of strxfrm().


But you can have a strxfrm for Unicode as well! There is nothing
inherent in Unicode that prevents using the same approach.


Not after you have discarded the information which says whether to sort
ö as ø or o.
Of course, the question always is what result you *want*: If you
have text that contains simultaneously Latin and Greek characters,
how would you like to collate it? Neither the German or Greek
collation rules are likely to help, as they don't consider the issue
of additional alphabets.
True enough. But when you mix entirely different scripts, you have
worse problems anyway; you'll often need to transliterate your name to
the local script - or to something close to English, I guess. A written
name in a script the locals can't read isn't particularly useful.
If possible, you should assign a language tag to each entry, and then
sort first by language, then according to the language's collation
rules.


That sounds very wrong for lists that are sorted for humans to search,
unless I misunderstand you. That would place all Swedes after all
Norwegians in the phone book, for example. And if you aren't sure of
the nationality of someone, you'd have to look through all foreign
languages that are present.

--
Hallvard
Jul 18 '05 #30

Hallvard B Furuseth wrote:
If you disagree with that, is that because you think of Unicode as The
One True Character Set which everything can assume is in use if not
otherwise specified? That's a long way from the world I'm living in.
It's even worse. I think conceptually there is "One True Character Set" of
which unicode is the closest approximation -- yes, I know that this
position is "idealism" by its philosophical definition.
And the thing about standards is that there are so many of them to
choose from. Enforcing a standard somewhere in an environment where
that is not the standard is not useful. Try the standard of driving on
the right side of the road in a country where everyone else drives on
the left side. Standards are supposed to serve us, it's not we who are
supposed to server standards.


If you go to GB from the continent it is clear that you have to switch
lanes. You can still get it wrong but either completely or not at all.

Now consider a road you can drive on in many directions, say 100, with two
or three directions allowed simultaneously in one country. The best
available method to find out the correct direction would be to drive a few
kilometers and then get out of the car and look for damages in the car's
body. If there are dents you had an accident, so either you or another car
took the wrong lane...
How is it that many drive faithfully then? The dominant car-make has a
preference built-in. When they drive on the internet, everyone ignores the
signs and just drives on the same lane as anybody else...

By the way, I'm not "fond" of unicode. There may even be problems that
cannot be solved in principle by a universal standard (like your sorting
across three locales). I just think unicode would make a better default
than what we have now and many apps that will break in the transition are
broken now - you just didn't realize it.

Peter
Jul 18 '05 #31
