
comparing Unicode and string

Hello,

here is something that surprises me.

#coding: iso-8859-1
s1=u"Frau Müller machte große Augen"
s2="Frau Müller machte große Augen"
if s1 == s2:
    pass

Running this code produces a UnicodeDecodeError:

Traceback (most recent call last):
  File "tmp.py", line 4, in ?
    if s1 == s2:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6: ordinal not in range(128)

I would have expected that "s1 == s2" gives True... or maybe False...
but raising an error here is unnecessary. I guess that the comparison
operator decides to convert s2 to Unicode but forgets that I said
#coding: iso-8859-1 at the beginning of the file.

TIA for any comments.

Luc Saffre

Oct 16 '06 #1
15 Replies



lu********@gmail.com wrote:
> Hello,
>
> here is something that surprises me.
>
> #coding: iso-8859-1
> s1=u"Frau Müller machte große Augen"
> s2="Frau Müller machte große Augen"
> if s1 == s2:
>     pass
>
> Running this code produces a UnicodeDecodeError:
>
> Traceback (most recent call last):
>   File "tmp.py", line 4, in ?
>     if s1 == s2:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6: ordinal not in range(128)
>
> I would have expected that "s1 == s2" gives True... or maybe False...
> but raising an error here is unnecessary. I guess that the comparison
> operator decides to convert s2 to Unicode but forgets that I said
> #coding: iso-8859-1 at the beginning of the file.

The #coding declaration is not effective at runtime. It's
there strictly to guide the compiler in how to compile
byte strings.

The default encoding at run time is ascii unless
it's been set to something else, which is why the
error message specifies ascii.

John Roth
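
A minimal sketch of what the failing comparison amounts to, written so that it also runs on current Python versions: the byte string is decoded with the default codec, assumed here to be ASCII, before it can be compared. The names `raw` and `try_decode` are made up for this illustration and do not come from the thread.

```python
# -*- coding: utf-8 -*-
# Sketch only: `raw` and `try_decode` are illustrative names.

text = u"Frau M\xfcller machte gro\xdfe Augen"

# The bytes that a Latin-1 encoded source file holds for the plain literal s2.
raw = text.encode("iso-8859-1")


def try_decode(data, encoding):
    """Return the decoded text, or the UnicodeDecodeError that was raised."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc


# Decoding with ASCII fails at the first byte outside 0..127.
ascii_result = try_decode(raw, "ascii")

# Decoding with the encoding the file was really written in succeeds.
latin1_result = try_decode(raw, "iso-8859-1")

print(repr(ascii_result))
print(latin1_result == text)
```

The error object reports byte 0xfc at position 6, the same details as in the traceback from the original post.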

> TIA for any comments.
>
> Luc Saffre
Oct 16 '06 #2

On 2006-10-16, lu********@gmail.com <lu********@gmail.com> wrote:
> Hello,
>
> here is something that surprises me.
>
> #coding: iso-8859-1

I think that's supposed to be:

# -*- coding: iso-8859-1 -*-

The special comment changes only the encoding of unicode
literals. In particular, it doesn't change the default encoding
of str literals.

> s1=u"Frau Müller machte große Augen"
> s2="Frau Müller machte große Augen"
> if s1 == s2:
>     pass

On my machine, the ü and ß in s2 are being stored in the code
points of my terminal's encoding, cp437. Unfortunately, cp437 code
points from 128-255 are not the same as those in iso-8859-1.

To fix this, I have to do the following:

>>> s1 == s2.decode('cp437')
True
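
As a rough illustration of the mismatch described above, the following sketch looks up the byte value of ü and ß in both encodings, using only the standard codecs. The variable names are made up for this example.

```python
# -*- coding: utf-8 -*-
# Sketch only: shows that one character maps to different byte values
# in cp437 and in iso-8859-1.

chars = [u"\xfc", u"\xdf"]  # u-umlaut and sharp s

table = {}
for ch in chars:
    table[ch] = (
        bytearray(ch.encode("cp437"))[0],       # byte value on a cp437 console
        bytearray(ch.encode("iso-8859-1"))[0],  # byte value in Latin-1
    )

# Bytes typed on a cp437 console only compare equal after decoding as cp437.
console_bytes = u"M\xfcller".encode("cp437")
same_when_decoded_right = console_bytes.decode("cp437") == u"M\xfcller"
same_when_decoded_wrong = console_bytes.decode("iso-8859-1") == u"M\xfcller"
```

In cp437 the two characters sit at 129 and 225, while Latin-1 puts them at 252 and 223, so decoding with the wrong codec silently produces different text rather than an error.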

> Running this code produces a UnicodeDecodeError:
>
> Traceback (most recent call last):
>   File "tmp.py", line 4, in ?
>     if s1 == s2:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6: ordinal not in range(128)
>
> I would have expected that "s1 == s2" gives True... or maybe
> False... but raising an error here is unnecessary. I guess that
> the comparison operator decides to convert s2 to Unicode but
> forgets that I said #coding: iso-8859-1 at the beginning of the
> file.

It's trying to interpret s2 as ascii, and failing, since the code
points 129 and 225 are out of range.

--
Neil Cerutti
Oct 16 '06 #3

Thanks, John and Neil, for your explanations.

Still I find it rather difficult to explain to a Python beginner why
this error occurs.

Suggestion: shouldn't an error raise already when I try to assign s2? A
normal string should never be allowed to contain characters that are
not codable using the system encoding. This test could be made at
compile time and would render Python more didactic.

Luc

Oct 19 '06 #4

On 2006-10-19, lu********@gmail.com <lu********@gmail.com> wrote:
> Suggestion: shouldn't an error raise already when I try to
> assign s2?

There's been discussion on pydev about changing this, but for now
I believe a str is a sequence of bytes in Python, rather than a
string of characters. My current project (an implementation of
the Glk API in Python) would be more troublesome to write if I
had to store all my latin-1 character strings as lists or arrays
of bytes.

--
Neil Cerutti
Oct 19 '06 #5


lu********@gmail.com wrote:
> Thanks, John and Neil, for your explanations.
>
> Still I find it rather difficult to explain to a Python beginner why
> this error occurs.
>
> Suggestion: shouldn't an error raise already when I try to assign s2? A
> normal string should never be allowed to contain characters that are
> not codable using the system encoding. This test could be made at
> compile time and would render Python more didactic.

This is impossible because of backward compatibility: your suggestion
would break a lot of existing programs. The change is planned to happen
in Python 3.0, where it's OK to break backward compatibility if needed.

-- Leo.
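
For readers coming to this thread later, here is a small sketch of how Python 3 ended up behaving. It assumes a Python 3 interpreter; the variable names are made up for this illustration.

```python
# Sketch only, assuming a Python 3 interpreter.

# Source text of a bytes literal that contains a non-ASCII character.
source = 'b"M\u00fcller"'

try:
    compile(source, "<example>", "eval")
    literal_accepted = True
except SyntaxError:
    # Python 3 rejects non-ASCII characters in bytes literals at compile time.
    literal_accepted = False

# Text and bytes simply compare unequal; no implicit decoding is attempted.
mixed_equal = ("Frau" == b"Frau")

print(literal_accepted, mixed_equal)
```

So the compile-time check asked for in this thread exists there for bytes literals, and the mixed comparison no longer raises.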

Oct 20 '06 #6

lu********@gmail.com wrote:
> Suggestion: shouldn't an error raise already when I try to assign s2?

variables are not typed in Python. plain assignment will never raise an
exception.

</F>

Oct 20 '06 #7

I didn't mean that the *assignment* should raise an exception. I mean that
any string constant that cannot be decoded using
sys.getdefaultencoding() should be considered a kind of syntax error.

I agree of course with the argument of backward compatibility, which
means that my suggestion is for Python 3.0, not earlier.

And I admit that my suggestion lacks a solution for Neil Cerutti's use
of non-decodable simple strings. And I admit that there are certainly
more competent people than me to think about this question. I just
wanted to throw my penny into the pond :-)

Luc

Fredrik Lundh wrote:
> lu********@gmail.com wrote:
>> Suggestion: shouldn't an error raise already when I try to assign s2?
>
> variables are not typed in Python. plain assignment will never raise an
> exception.
>
> </F>
Oct 23 '06 #8

In <11*********************@e3g2000cwe.googlegroups.com>,
lu********@gmail.com wrote:
> I didn't mean that the *assignment* should raise an exception. I mean that
> any string constant that cannot be decoded using
> sys.getdefaultencoding() should be considered a kind of syntax error.

Why? Python strings are *byte strings* and bytes have values in the range
0..255. Why would you restrict them to ASCII only?
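
A short sketch of the point about byte values, using only built-in codecs: every value from 0 to 255 is a legal byte, but only the lower half is ASCII. The variable names are illustrative.

```python
# Sketch only: every value 0..255 fits in a byte string,
# but only 0..127 can be decoded as ASCII.

all_values = bytearray(range(256))       # every value a byte can hold
raw = bytes(all_values)

ascii_part = raw[:128].decode("ascii")   # values 0..127 are valid ASCII
latin1_all = raw.decode("iso-8859-1")    # Latin-1 maps all 256 values

try:
    raw[128:129].decode("ascii")
    high_byte_is_ascii = True
except UnicodeDecodeError:
    high_byte_is_ascii = False
```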

Ciao,
Marc 'BlackJack' Rintsch
Oct 23 '06 #9

Marc 'BlackJack' Rintsch wrote:
> Why? Python strings are *byte strings* and bytes have values in the range
> 0..255. Why would you restrict them to ASCII only?

Because getting an exception when comparing a string with a unicode
string is irritating.

But I don't insist on my PEP. The example just shows another
pitfall with Unicode and why I'd advise any beginner: Never write
text constants that contain non-ASCII chars as simple strings; always
make them Unicode strings by prepending the "u".

Luc
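
When a byte string cannot be avoided, one possible way to follow this advice is to decode it explicitly before comparing. The helper `to_text` below is hypothetical, and its default encoding is an assumption that matches the original example.

```python
# -*- coding: utf-8 -*-
# Sketch only: `to_text` is a hypothetical helper, not a library function.


def to_text(value, encoding="iso-8859-1"):
    """Decode byte strings explicitly; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


s1 = u"Frau M\xfcller machte gro\xdfe Augen"
s2 = s1.encode("iso-8859-1")  # stands in for the plain byte-string literal

# Both sides are text by the time they are compared.
equal = to_text(s1) == to_text(s2)
```

The design choice is to name the encoding at the point of comparison instead of relying on the interpreter's default.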

Nov 10 '06 #10


Neil Cerutti wrote:
> On 2006-10-16, lu********@gmail.com <lu********@gmail.com> wrote:
>> Hello,
>>
>> here is something that surprises me.
>>
>> #coding: iso-8859-1
>
> I think that's supposed to be:
>
> # -*- coding: iso-8859-1 -*-

Not quite. As PEP 263 says:

"""
More precisely, the first or second line must match the regular
expression "coding[:=]\s*([-\w.]+)".
"""

Nov 10 '06 #11

On 2006-11-10, lu********@gmail.com <lu********@gmail.com> wrote:
> Marc 'BlackJack' Rintsch wrote:
>> Why? Python strings are *byte strings* and bytes have values in the range
>> 0..255. Why would you restrict them to ASCII only?
>
> Because getting an exception when comparing a string with a unicode
> string is irritating.
>
> But I don't insist on my PEP. The example just shows
> another pitfall with Unicode and why I'd advise any
> beginner: Never write text constants that contain non-ASCII
> chars as simple strings; always make them Unicode strings by
> prepending the "u".

That doesn't do any good if you aren't writing them in unicode
code points, though.

--
Neil Cerutti
To succeed in the world it is not enough to be stupid, you must also
be well-mannered. --Voltaire
Nov 10 '06 #12

On 2006-11-10, John Machin <sj******@lexicon.net> wrote:
> Neil Cerutti wrote:
>> On 2006-10-16, lu********@gmail.com <lu********@gmail.com> wrote:
>>> Hello,
>>>
>>> here is something that surprises me.
>>>
>>> #coding: iso-8859-1
>>
>> I think that's supposed to be:
>>
>> # -*- coding: iso-8859-1 -*-
>
> Not quite. As PEP 263 says:
>
> """
> More precisely, the first or second line must match the regular
> expression "coding[:=]\s*([-\w.]+)".
> """

Yep. I was erroneously going by the example in the Unicode Howto.
Thanks for the correction.
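
The pattern quoted from PEP 263 above can be tried out directly. This is only a sketch: the sample lines are made up, and a real interpreter applies further rules (for example, only the first two lines of a file are examined).

```python
import re

# Pattern as quoted from PEP 263 in this thread.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

# Made-up sample lines for illustration.
lines = [
    "#coding: iso-8859-1",
    "# -*- coding: iso-8859-1 -*-",
    "# vim: set fileencoding=utf-8 :",
    "# no declaration here",
]

found = []
for line in lines:
    match = CODING_RE.search(line)
    found.append(match.group(1) if match else None)

print(found)
```

Both the short form from the original post and the Emacs-style form match, which is why either spelling works as a declaration.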

--
Neil Cerutti
Nov 10 '06 #13

Neil Cerutti wrote:
> On 2006-11-10, lu********@gmail.com <lu********@gmail.com> wrote:
>> Marc 'BlackJack' Rintsch wrote:
>>> Why? Python strings are *byte strings* and bytes have values in the range
>>> 0..255. Why would you restrict them to ASCII only?
>>
>> Because getting an exception when comparing a string with a unicode
>> string is irritating.
>>
>> But I don't insist on my PEP. The example just shows
>> another pitfall with Unicode and why I'd advise any
>> beginner: Never write text constants that contain non-ASCII
>> chars as simple strings; always make them Unicode strings by
>> prepending the "u".
>
> That doesn't do any good if you aren't writing them in unicode
> code points, though.

You tell the interpreter what encoding your source code is in. It then
knows precisely how to decode your string literals into Unicode. How do
you write things in "Unicode code points"?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Nov 10 '06 #14

On 2006-11-10, Steve Holden <st***@holdenweb.com> wrote:
>>> But I don't insist on my PEP. The example just shows
>>> another pitfall with Unicode and why I'd advise any
>>> beginner: Never write text constants that contain non-ASCII
>>> chars as simple strings; always make them Unicode strings by
>>> prepending the "u".
>>
>> That doesn't do any good if you aren't writing them in unicode
>> code points, though.
>
> You tell the interpreter what encoding your source code is in.
> It then knows precisely how to decode your string literals into
> Unicode. How do you write things in "Unicode code points"?

for = u"f\xfcr"

--
Neil Cerutti
Nov 10 '06 #15

Neil Cerutti wrote:
> On 2006-11-10, Steve Holden <st***@holdenweb.com> wrote:
>>>> But I don't insist on my PEP. The example just shows
>>>> another pitfall with Unicode and why I'd advise any
>>>> beginner: Never write text constants that contain non-ASCII
>>>> chars as simple strings; always make them Unicode strings by
>>>> prepending the "u".
>>>
>>> That doesn't do any good if you aren't writing them in unicode
>>> code points, though.
>>
>> You tell the interpreter what encoding your source code is in.
>> It then knows precisely how to decode your string literals into
>> Unicode. How do you write things in "Unicode code points"?
>
> for = u"f\xfcr"

Unless you're using a Unicode-unfriendly editor or console, u"f\xfcr" is
the same as u"für":

>>> u"f\xfcr" == u"für"
True

So there is no need to write unicode strings in hexadecimal
representation of code points.
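
A small sketch of this claim. It assumes the file is saved as UTF-8 with a matching coding declaration, and that the editor stores ü as the single code point U+00FC; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch only; assumes this file is saved as UTF-8 and that the editor
# stores the umlaut as the single precomposed code point U+00FC.

escaped = u"f\xfcr"   # code point written as a hexadecimal escape
literal = u"für"      # the same character typed directly

same_value = (escaped == literal)
code_points = [hex(ord(ch)) for ch in literal]

print(same_value, code_points)
```

The check uses `==` because it compares values; `is` compares object identity, which may or may not hold for two equal literals depending on the interpreter.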

-- Leo

Nov 10 '06 #16
