Hello,
here is something that surprises me.
#coding: iso-8859-1
s1=u"Frau Müller machte große Augen"
s2="Frau Müller machte große Augen"
if s1 == s2:
    pass
Running this code produces a UnicodeDecodeError:
Traceback (most recent call last):
  File "tmp.py", line 4, in ?
    if s1 == s2:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6: ordinal not in range(128)
I would have expected that "s1 == s2" gives True... or maybe False...
but raising an error here is unnecessary. I guess that the comparison
operator decides to convert s2 to a Unicode but forgets that I said
#coding: iso-8859-1 at the beginning of the file.
TIA for any comments.
Luc Saffre

lu********@gmail.com wrote:
Hello,
here is something that surprises me.
#coding: iso-8859-1
s1=u"Frau Müller machte große Augen"
s2="Frau Müller machte große Augen"
if s1 == s2:
    pass
Running this code produces a UnicodeDecodeError:
Traceback (most recent call last):
  File "tmp.py", line 4, in ?
    if s1 == s2:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6: ordinal not in range(128)
I would have expected that "s1 == s2" gives True... or maybe False...
but raising an error here is unnecessary. I guess that the comparison
operator decides to convert s2 to a Unicode but forgets that I said
#coding: iso-8859-1 at the beginning of the file.
The #coding declaration is not effective at runtime. It's
there strictly to guide the compiler in how to compile
byte strings.
The default encoding at run time is ascii unless
it's been set to something else, which is why the
error message specifies ascii.
John Roth
TIA for any comments.
Luc Saffre
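
A minimal sketch of the implicit step described above, written for Python 3 as an assumption (there `bytes` stands in for the Python 2 `str`, and decoding has to be requested explicitly). The helper name `try_decode` is made up for illustration. It suggests why the traceback names the 'ascii' codec at position 6, and that decoding with the encoding the file was written in gives the expected text.

```python
# Python 3 sketch: bytes plays the role of the Python 2 str type.
text = "Frau Müller machte große Augen"   # corresponds to s1 = u"..."
raw = text.encode("iso-8859-1")           # the bytes a Python 2 str literal would hold


def try_decode(data, encoding):
    """Return the decoded text, or the exception raised while decoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc


# Python 2 compared unicode with str by decoding the str with the
# run-time default encoding (ascii), not with the source encoding.
ascii_result = try_decode(raw, "ascii")
latin1_result = try_decode(raw, "iso-8859-1")

print(ascii_result)             # the decode error, mentioning byte 0xfc
print(latin1_result == text)    # True
```

In Python 2 the equivalent explicit fix would be to compare `s1` with `s2.decode("iso-8859-1")` instead of relying on the implicit conversion.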

On 2006-10-16, lu********@gmail.com <lu********@gmail.com> wrote:
Hello,
here is something that surprises me.
#coding: iso-8859-1
I think that's supposed to be:
# -*- coding: iso-8859-1 -*-
The special comment changes only the encoding of unicode
literals. In particular, it doesn't change the default encoding
of str literals.
s1=u"Frau Müller machte große Augen"
s2="Frau Müller machte große Augen"
if s1 == s2:
    pass
On my machine, the ü and ß in s2 are being stored in the code
points of my terminal's encoding, cp437. Unfortunately, cp437 code
points from 127-255 are not the same as those in iso-8859-1.
To fix this, I have to do the following:
>>> s1 == s2.decode('cp437')
True
Running this code produces a UnicodeDecodeError:
Traceback (most recent call last):
  File "tmp.py", line 4, in ?
    if s1 == s2:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6: ordinal not in range(128)
I would have expected that "s1 == s2" gives True... or maybe
False... but raising an error here is unnecessary. I guess that
the comparison operator decides to convert s2 to a Unicode but
forgets that I said #coding: iso-8859-1 at the beginning of the
file.
It's trying to interpret s2 as ascii, and failing, since code points
129 and 225 are out of range.
--
Neil Cerutti
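
The byte values mentioned above can be checked with a short sketch. It assumes Python 3 (where `str` is Unicode and `.encode()` yields bytes); the variable names are illustrative only. It shows that ü and ß map to different byte values in cp437 and iso-8859-1, so bytes typed under one encoding do not decode correctly under the other.

```python
u_umlaut = "\u00fc"   # ü
sharp_s = "\u00df"    # ß

# Byte value of each character under the two encodings discussed here.
for name in ("iso-8859-1", "cp437"):
    print(name, list(u_umlaut.encode(name)), list(sharp_s.encode(name)))
# iso-8859-1 [252] [223]
# cp437 [129] [225]

# Bytes produced under cp437 only round-trip when decoded as cp437.
cp437_bytes = "M\u00fcller".encode("cp437")
as_cp437 = cp437_bytes.decode("cp437")         # the original text
as_latin1 = cp437_bytes.decode("iso-8859-1")   # wrong character at index 1
```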
Thanks, John and Neil, for your explanations.
Still I find it rather difficult to explain to a Python beginner why
this error occurs.
Suggestion: shouldn't an error be raised already when I try to assign s2? A
normal string should never be allowed to contain characters that cannot
be encoded using the system encoding. This test could be made at
compile time and would render Python more didactic.
Luc

On 2006-10-19, lu********@gmail.com <lu********@gmail.com> wrote:
Suggestion: shouldn't an error raise already when I try to
assign s2?
There's been discussion on pydev about changing this, but for now
I believe a str is a sequence of bytes in Python, rather than a
string of characters. My current project (an implementation of
the Glk API in Python) would be more troublesome to write if I
had to store all my latin-1 character strings as lists or arrays
of bytes.
--
Neil Cerutti

lu********@gmail.com wrote:
Thanks, John and Neil, for your explanations.
Still I find it rather difficult to explain to a Python beginner why
this error occurs.
Suggestion: shouldn't an error be raised already when I try to assign s2? A
normal string should never be allowed to contain characters that cannot
be encoded using the system encoding. This test could be made at
compile time and would render Python more didactic.
This is impossible because of backward compatibility: your suggestion
would break a lot of existing programs. The change is planned to happen
in Python 3.0, where it's OK to break backward compatibility if needed.
-- Leo.

lu********@gmail.com wrote:
Suggestion: shouldn't an error raise already when I try to assign s2?
variables are not typed in Python. plain assignment will never raise an
exception.
</F>
I didn't mean that the *assignment* should raise an exception. I mean that
any string constant that cannot be decoded using
sys.getdefaultencoding() should be considered a kind of syntax error.
I agree of course with the argument of backward compatibility, which
means that my suggestion is for Python 3.0, not earlier.
And I admit that my suggestion lacks a solution for Neil Cerutti's use
of non-decodable simple strings. And I admit that there are certainly
more competent people than me to think about this question. I just
wanted to throw my penny into the pond :-)
Luc
Fredrik Lundh wrote:
lu********@gmail.com wrote:
Suggestion: shouldn't an error raise already when I try to assign s2?
variables are not typed in Python. plain assignment will never raise an
exception.
</F>

In <11*********************@e3g2000cwe.googlegroups.com>, lu********@gmail.com wrote:
I didn't mean that the *assignment* should raise exception. I mean that
any string constant that cannot be decoded using
sys.getdefaultencoding() should be considered a kind of syntax error.
Why? Python strings are *byte strings* and bytes have values in the range
0..255. Why would you restrict them to ASCII only?
Ciao,
Marc 'BlackJack' Rintsch
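
As a rough illustration of the "byte strings" point, here is a sketch that assumes Python 3, where the `bytes` type corresponds to the Python 2 `str` under discussion. It shows that every value from 0 to 255 is a legal byte, that only the first 128 decode as ASCII, and that iso-8859-1 maps each byte to the code point with the same number.

```python
# Every value from 0 to 255 is a valid byte.
all_bytes = bytes(range(256))

# The lower half is plain ASCII and decodes without trouble.
ascii_safe = all_bytes[:128].decode("ascii")

# iso-8859-1 maps byte n to code point n, so it never fails.
latin1_text = all_bytes.decode("iso-8859-1")

# Decoding the full range as ASCII fails at the first byte >= 128.
try:
    all_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(len(ascii_safe), len(latin1_text), ascii_ok)
```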
Marc 'BlackJack' Rintsch wrote:
Why? Python strings are *byte strings* and bytes have values in the range
0..255. Why would you restrict them to ASCII only?
Because getting an exception when comparing a string with a unicode
string is irritating.
But I don't insist on my PEP. The example just shows another
pitfall with Unicode and why I'd advise any beginner: never write
text constants that contain non-ASCII chars as simple strings; always
make them Unicode strings by prepending the "u".
Luc
Neil Cerutti wrote:
On 2006-10-16, lu********@gmail.com <lu********@gmail.com> wrote:
Hello,
here is something that surprises me.
#coding: iso-8859-1
I think that's supposed to be:
# -*- coding: iso-8859-1 -*-
Not quite. As PEP 263 says:
"""
More precisely, the first or second line must match the regular
expression "coding[:=]\s*([-\w.]+)".
"""

On 2006-11-10, lu********@gmail.com <lu********@gmail.com> wrote:
Marc 'BlackJack' Rintsch wrote:
>Why? Python strings are *byte strings* and bytes have values in the range 0..255. Why would you restrict them to ASCII only?
Because getting an exception when comparing a string with a unicode
string is irritating.
But I don't insist on my PEP. The example just shows just
another pitfall with Unicode and why I'll advise to any
beginner: Never write text constants that contain non-ascii
chars as simple strings, always make them Unicode strings by
prepending the "u".
That doesn't do any good if you aren't writing them in unicode
code points, though.
--
Neil Cerutti
To succeed in the world it is not enough to be stupid, you must also
be well-mannered. --Voltaire

On 2006-11-10, John Machin <sj******@lexicon.net> wrote:
Neil Cerutti wrote:
On 2006-10-16, lu********@gmail.com <lu********@gmail.com> wrote:
Hello,
here is something that surprises me.
#coding: iso-8859-1
I think that's supposed to be:
# -*- coding: iso-8859-1 -*-
Not quite. As PEP 263 says:
"""
More precisely, the first or second line must match the regular
expression "coding[:=]\s*([-\w.]+)".
"""
Yep. I was erroneously going by the example in the Unicode Howto.
Thanks for the correction.
--
Neil Cerutti
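
The expression quoted from PEP 263 can be tried against both spellings of the declaration. This is only a sketch: the helper name `declared_encoding` is made up, and the real tokenizer applies further rules (the line must be a comment on the first or second line of the file) that are not modelled here.

```python
import re

# The expression as quoted from PEP 263 in this thread.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def declared_encoding(line):
    """Return the encoding name found in a source line, or None."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None


print(declared_encoding("#coding: iso-8859-1"))            # iso-8859-1
print(declared_encoding("# -*- coding: iso-8859-1 -*-"))   # iso-8859-1
print(declared_encoding("# just a comment"))               # None
```

Both forms match, so the short `#coding: iso-8859-1` line in the original post is a valid declaration.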
Neil Cerutti wrote:
On 2006-11-10, lu********@gmail.com <lu********@gmail.com> wrote:
>Marc 'BlackJack' Rintsch wrote:
>>Why? Python strings are *byte strings* and bytes have values in the range 0..255. Why would you restrict them to ASCII only?
Because getting an exception when comparing a string with a unicode string is irritating.
But I don't insist on my PEP. The example just shows just another pitfall with Unicode and why I'll advise to any beginner: Never write text constants that contain non-ascii chars as simple strings, always make them Unicode strings by prepending the "u".
That doesn't do any good if you aren't writing them in unicode
code points, though.
You tell the interpreter what encoding your source code is in. It then
knows precisely how to decode your string literals into Unicode. How do
you write things in "Unicode code points"?
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

On 2006-11-10, Steve Holden <st***@holdenweb.com> wrote:
>>But I don't insist on my PEP. The example just shows just another pitfall with Unicode and why I'll advise to any beginner: Never write text constants that contain non-ascii chars as simple strings, always make them Unicode strings by prepending the "u".
That doesn't do any good if you aren't writing them in unicode code points, though.
You tell the interpreter what encoding your source code is in.
It then knows precisely how to decode your string literals into
Unicode. How do you write things in "Unicode code points"?
for = u"f\xfcr"
--
Neil Cerutti
Neil Cerutti wrote:
On 2006-11-10, Steve Holden <st***@holdenweb.com> wrote:
>But I don't insist on my PEP. The example just shows just another pitfall with Unicode and why I'll advise to any beginner: Never write text constants that contain non-ascii chars as simple strings, always make them Unicode strings by prepending the "u".
That doesn't do any good if you aren't writing them in unicode
code points, though.
You tell the interpreter what encoding your source code is in.
It then knows precisely how to decode your string literals into
Unicode. How do you write things in "Unicode code points"?
for = u"f\xfcr"
Unless you're using a Unicode-unfriendly editor or console, u"f\xfcr" is
the same as u"für":
>>> u"f\xfcr" is u"für"
True
True
So there is no need to write unicode strings in hexadecimal
representation of code points.
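
A small sketch of the same check, assuming Python 3 (where every string literal is Unicode) and a source file saved as UTF-8. It compares with `==` rather than `is`, since equality of the text is the point and object identity depends on how the interpreter happens to share constants.

```python
escaped = "f\xfcr"   # written with the code point in hexadecimal
literal = "für"      # written directly, relying on the source encoding

same = escaped == literal
code_points = [hex(ord(ch)) for ch in literal]

print(same)          # True
print(code_points)   # ['0x66', '0xfc', '0x72']
```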
-- Leo