comparing Unicode and string

Hello,

here is something that surprises me.

#coding: iso-8859-1
s1=u"Frau Müller machte große Augen"
s2="Frau Müller machte große Augen"
if s1 == s2:
    pass

Running this code produces a UnicodeDecodeError:

Traceback (most recent call last):
  File "tmp.py", line 4, in ?
    if s1 == s2:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6: ordinal not in range(128)

I would have expected that "s1 == s2" gives True... or maybe False...
but raising an error here is unnecessary. I guess that the comparison
operator decides to convert s2 to a Unicode but forgets that I said
#coding: iso-8859-1 at the beginning of the file.

TIA for any comments.

Luc Saffre

Oct 16 '06 #1

lu********@gmail.com wrote:
> I would have expected that "s1 == s2" gives True... or maybe False...
> but raising an error here is unnecessary. I guess that the comparison
> operator decides to convert s2 to a Unicode but forgets that I said
> #coding: iso-8859-1 at the beginning of the file.

The #coding declaration is not effective at runtime. It's
there strictly to tell the compiler how to decode the source
file; the bytes in a plain string literal are left as written.

The default encoding at run time is ascii unless
it's been set to something else, which is why the
error message specifies ascii.
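
To make the implicit step visible, here is a minimal sketch in Python 3 syntax, where bytes stands in for the Python 2 byte string s2. It assumes the comparison effectively decodes s2 with the default ascii codec before comparing; the variable names are only illustrative.

```python
# Python 3 sketch: bytes plays the role of the Python 2 byte string s2.
s1 = "Frau M\u00fcller machte gro\u00dfe Augen"

# Assumption: these are the bytes an iso-8859-1 source file holds for s2.
s2 = s1.encode("iso-8859-1")

# The comparison in the thread effectively tries an ascii decode first.
try:
    s2.decode("ascii")
    error_text = None
except UnicodeDecodeError as exc:
    error_text = str(exc)
print(error_text)

# Decoding explicitly with the source encoding makes both sides comparable.
same = s1 == s2.decode("iso-8859-1")
print(same)
```

The failing byte is 0xfc at position 6, the same byte and position reported in the traceback.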

John Roth

Oct 16 '06 #2
On 2006-10-16, lu********@gmail.com <lu********@gmail.com> wrote:
> Hello,
>
> here is something that surprises me.
>
> #coding: iso-8859-1

I think that's supposed to be:

# -*- coding: iso-8859-1 -*-

The special comment changes only the encoding of unicode
literals. In particular, it doesn't change the default encoding
of str literals.

> s1=u"Frau Müller machte große Augen"
> s2="Frau Müller machte große Augen"
> if s1 == s2:
>     pass

On my machine, the ü and ß in s2 are being stored in the code
points of my terminal's encoding, cp437. Unfortunately, cp437 code
points from 128-255 are not the same as those in iso-8859-1.

To fix this, I have to do the following:
>>> s1 == s2.decode('cp437')
True

> Running this code produces a UnicodeDecodeError:
>
> Traceback (most recent call last):
>   File "tmp.py", line 4, in ?
>     if s1 == s2:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6: ordinal not in range(128)
>
> I would have expected that "s1 == s2" gives True... or maybe
> False... but raising an error here is unnecessary. I guess that
> the comparison operator decides to convert s2 to a Unicode but
> forgets that I said #coding: iso-8859-1 at the beginning of the
> file.

It's trying to interpret s2 as ascii, and failing, since 129 and
225 code points are out of range.
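
The byte values mentioned here can be checked with a short sketch. It is written in Python 3 syntax (bytes in place of the Python 2 str) and only assumes the standard cp437 and iso-8859-1 codecs; the variable names are illustrative.

```python
text = "Frau M\u00fcller machte gro\u00dfe Augen"

cp437_bytes = text.encode("cp437")
latin1_bytes = text.encode("iso-8859-1")

# The u-umlaut sits at index 6 in both encodings, but as different byte values.
print(cp437_bytes[6], latin1_bytes[6])

# The sharp s likewise differs between the two encodings.
sharp_s_index = text.index("\u00df")
print(cp437_bytes[sharp_s_index], latin1_bytes[sharp_s_index])

# Decoding with the codec that produced the bytes restores the text;
# decoding cp437 bytes as iso-8859-1 does not.
right = cp437_bytes.decode("cp437") == text
wrong = cp437_bytes.decode("iso-8859-1") == text
print(right, wrong)
```

In cp437 the two characters come out as 129 and 225, while iso-8859-1 uses 252 (0xfc) and 223, which is why the explicit decode has to name the encoding the bytes were actually written in.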

--
Neil Cerutti
Oct 16 '06 #3
Thanks, John and Neil, for your explanations.

Still I find it rather difficult to explain to a Python beginner why
this error occurs.

Suggestion: shouldn't an error already be raised when I try to assign s2? A
normal string should never be allowed to contain characters that are
not codable using the system encoding. This test could be made at
compile time and would render Python more didactic.

Luc

Oct 19 '06 #4
On 2006-10-19, lu********@gmail.com <lu********@gmail.com> wrote:
> Suggestion: shouldn't an error raise already when I try to
> assign s2?

There's been discussion on pydev about changing this, but for now
I believe a str is a sequence of bytes in Python, rather than a
string of characters. My current project (an implementation of
the Glk API in Python) would be more troublesome to write if I
had to store all my latin-1 character strings as lists or arrays
of bytes.

--
Neil Cerutti
Oct 19 '06 #5

lu********@gmail.com wrote:
> Thanks, John and Neil, for your explanations.
>
> Still I find it rather difficult to explain to a Python beginner why
> this error occurs.
>
> Suggestion: shouldn't an error already be raised when I try to assign s2? A
> normal string should never be allowed to contain characters that are
> not codable using the system encoding. This test could be made at
> compile time and would render Python more didactic.

This is impossible because of backward compatibility: your suggestion
would break a lot of existing programs. The change is planned to happen
in Python 3.0, where it's OK to break backward compatibility if needed.
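
For reference, a small sketch of how this behaves under a Python 3 interpreter, where text and bytes are separate types. The names are illustrative, and the sketch shows observed behaviour rather than the design discussion itself.

```python
text = "Frau M\u00fcller"
data = text.encode("iso-8859-1")

# Comparing bytes with text does not raise; the two are simply unequal.
mixed_equal = (data == text)
print(mixed_equal)

# A bytes literal containing a non-ASCII character is rejected at compile time.
source = 's2 = b"Frau M\u00fcller"'
try:
    compile(source, "<example>", "exec")
    rejected = False
except SyntaxError:
    rejected = True
print(rejected)
```

So the mixed comparison quietly yields False, and a byte string constant with non-ASCII characters is treated as a syntax error.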

-- Leo.

Oct 20 '06 #6
lu********@gmail.com wrote:
> Suggestion: shouldn't an error raise already when I try to assign s2?

variables are not typed in Python. plain assignment will never raise an
exception.

</F>

Oct 20 '06 #7
I didn't mean that the *assignment* should raise an exception. I mean that
any string constant that cannot be decoded using
sys.getdefaultencoding() should be considered a kind of syntax error.

I agree of course with the argument of backward compatibility, which
means that my suggestion is for Python 3.0, not earlier.

And I admit that my suggestion lacks a solution for Neil Cerutti's use
of non-decodable simple strings. And I admit that there are certainly
more competent people than me to think about this question. I just
wanted to throw my penny into the pond :-)

Luc

Oct 23 '06 #8
In <11*********************@e3g2000cwe.googlegroups.com>,
lu********@gmail.com wrote:
> I didn't mean that the *assignment* should raise exception. I mean that
> any string constant that cannot be decoded using
> sys.getdefaultencoding() should be considered a kind of syntax error.

Why? Python strings are *byte strings* and bytes have values in the range
0..255. Why would you restrict them to ASCII only?
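
A quick sketch of that point, in Python 3 syntax with bytes standing in for the Python 2 str: a byte string can hold every value from 0 to 255, whether or not those values form ascii text. The names are illustrative.

```python
# Every possible byte value, 0 through 255, in one byte string.
raw = bytes(range(256))
print(len(raw), raw[0], raw[255])

# Only the first 128 values are ascii, so an ascii decode fails.
ascii_ok = True
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)

# iso-8859-1 maps each byte value to one code point, so it round-trips.
round_trip = raw.decode("iso-8859-1").encode("iso-8859-1") == raw
print(round_trip)
```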

Ciao,
Marc 'BlackJack' Rintsch
Oct 23 '06 #9
Marc 'BlackJack' Rintsch wrote:
> Why? Python strings are *byte strings* and bytes have values in the range
> 0..255. Why would you restrict them to ASCII only?

Because getting an exception when comparing a string with a unicode
string is irritating.

But I don't insist on my PEP. The example just shows another
pitfall with Unicode and why I'll advise any beginner: Never write
text constants that contain non-ascii chars as simple strings, always
make them Unicode strings by prepending the "u".

Luc

Nov 10 '06 #10

Neil Cerutti wrote:
> On 2006-10-16, lu********@gmail.com <lu********@gmail.com> wrote:
>> Hello,
>>
>> here is something that surprises me.
>>
>> #coding: iso-8859-1
>
> I think that's supposed to be:
>
> # -*- coding: iso-8859-1 -*-

Not quite. As PEP 263 says:

"""
More precisely, the first or second line must match the regular
expression "coding[:=]\s*([-\w.]+)".
"""

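The quoted pattern can be tried directly with the re module. This is only a sketch: the pattern is copied from the quote above, the first two sample lines are the two spellings discussed in this thread, and the third (a vim-style line) is an assumed extra example.

```python
import re

# Pattern as quoted from PEP 263 in this thread.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

samples = [
    "#coding: iso-8859-1",               # the original poster's spelling
    "# -*- coding: iso-8859-1 -*-",      # the Emacs-style spelling
    "# vim: set fileencoding=utf-8 :",   # assumed extra example
]

found = []
for line in samples:
    match = CODING_RE.search(line)
    found.append(match.group(1) if match else None)
print(found)
```

All three lines match, so the short "#coding: iso-8859-1" form is as valid as the longer one.
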
Nov 10 '06 #11
On 2006-11-10, lu********@gmail.com <lu********@gmail.com> wrote:
> Marc 'BlackJack' Rintsch wrote:
>> Why? Python strings are *byte strings* and bytes have values in the range
>> 0..255. Why would you restrict them to ASCII only?
>
> Because getting an exception when comparing a string with a unicode
> string is irritating.
>
> But I don't insist on my PEP. The example just shows
> another pitfall with Unicode and why I'll advise any
> beginner: Never write text constants that contain non-ascii
> chars as simple strings, always make them Unicode strings by
> prepending the "u".

That doesn't do any good if you aren't writing them in unicode
code points, though.

--
Neil Cerutti
To succeed in the world it is not enough to be stupid, you must also
be well-mannered. --Voltaire
Nov 10 '06 #12
On 2006-11-10, John Machin <sj******@lexicon.net> wrote:
> Neil Cerutti wrote:
>> On 2006-10-16, lu********@gmail.com <lu********@gmail.com> wrote:
>>> Hello,
>>>
>>> here is something that surprises me.
>>>
>>> #coding: iso-8859-1
>>
>> I think that's supposed to be:
>>
>> # -*- coding: iso-8859-1 -*-
>
> Not quite. As PEP 263 says:
>
> """
> More precisely, the first or second line must match the regular
> expression "coding[:=]\s*([-\w.]+)".
> """

Yep. I was erroneously going by the example in the Unicode Howto.
Thanks for the correction.

--
Neil Cerutti
Nov 10 '06 #13
Neil Cerutti wrote:
> On 2006-11-10, lu********@gmail.com <lu********@gmail.com> wrote:
>> Marc 'BlackJack' Rintsch wrote:
>>> Why? Python strings are *byte strings* and bytes have values in the range
>>> 0..255. Why would you restrict them to ASCII only?
>>
>> Because getting an exception when comparing a string with a unicode
>> string is irritating.
>>
>> But I don't insist on my PEP. The example just shows
>> another pitfall with Unicode and why I'll advise any
>> beginner: Never write text constants that contain non-ascii
>> chars as simple strings, always make them Unicode strings by
>> prepending the "u".
>
> That doesn't do any good if you aren't writing them in unicode
> code points, though.

You tell the interpreter what encoding your source code is in. It then
knows precisely how to decode your string literals into Unicode. How do
you write things in "Unicode code points"?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Nov 10 '06 #14
On 2006-11-10, Steve Holden <st***@holdenweb.com> wrote:
>>> But I don't insist on my PEP. The example just shows
>>> another pitfall with Unicode and why I'll advise any
>>> beginner: Never write text constants that contain non-ascii
>>> chars as simple strings, always make them Unicode strings by
>>> prepending the "u".
>>
>> That doesn't do any good if you aren't writing them in unicode
>> code points, though.
>
> You tell the interpreter what encoding your source code is in.
> It then knows precisely how to decode your string literals into
> Unicode. How do you write things in "Unicode code points"?

fuer = u"f\xfcr"

--
Neil Cerutti
Nov 10 '06 #15
Neil Cerutti wrote:
> On 2006-11-10, Steve Holden <st***@holdenweb.com> wrote:
>>>> But I don't insist on my PEP. The example just shows
>>>> another pitfall with Unicode and why I'll advise any
>>>> beginner: Never write text constants that contain non-ascii
>>>> chars as simple strings, always make them Unicode strings by
>>>> prepending the "u".
>>>
>>> That doesn't do any good if you aren't writing them in unicode
>>> code points, though.
>>
>> You tell the interpreter what encoding your source code is in.
>> It then knows precisely how to decode your string literals into
>> Unicode. How do you write things in "Unicode code points"?
>
> fuer = u"f\xfcr"

Unless you're using a Unicode-unfriendly editor or console, u"f\xfcr" is
the same as u"für":
>>> u"f\xfcr" == u"für"
True

So there is no need to write unicode strings in hexadecimal
representation of code points.
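
A minimal sketch of the same check as a script, assuming the file is saved as UTF-8 and run under Python 3 (where every string literal is Unicode). Note that == compares values, which is what matters here; the names are illustrative.

```python
# Escaped and literal spellings of the same three characters.
escaped = "f\xfcr"
literal = "für"

# Both spellings produce the same string value.
same_value = escaped == literal
print(same_value)

# The middle character is code point 0xfc either way.
middle = ord(literal[1])
print(hex(middle))
```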

-- Leo

Nov 10 '06 #16
