comparing Unicode and string

luc.saffre

Hello,

here is something that surprises me.

#coding: iso-8859-1
s1=u"Frau Müller machte große Augen"
s2="Frau Müller machte große Augen"
if s1 == s2:
    pass

Running this code produces a UnicodeDecodeError:

Traceback (most recent call last):
  File "tmp.py", line 4, in ?
    if s1 == s2:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6:
ordinal not in range(128)

I would have expected that "s1 == s2" gives True... or maybe False...
but raising an error here is unnecessary. I guess that the comparison
operator decides to convert s2 to a Unicode but forgets that I said
#coding: iso-8859-1 at the beginning of the file.

TIA for any comments.

Luc Saffre

Oct 16 '06 #1


John Roth

lu********@gmail.com wrote:

> Hello,
>
> here is something that surprises me.
>
> #coding: iso-8859-1
> s1=u"Frau Müller machte große Augen"
> s2="Frau Müller machte große Augen"
> if s1 == s2:
>     pass
>
> Running this code produces a UnicodeDecodeError:
>
> Traceback (most recent call last):
>   File "tmp.py", line 4, in ?
>     if s1 == s2:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6:
> ordinal not in range(128)
>
> I would have expected that "s1 == s2" gives True... or maybe False...
> but raising an error here is unnecessary. I guess that the comparison
> operator decides to convert s2 to a Unicode but forgets that I said
> #coding: iso-8859-1 at the beginning of the file.

The #coding declaration is not effective at runtime. It's
there strictly to guide the compiler in how to compile
byte strings.

The default encoding at run time is ascii unless
it's been set to something else, which is why the
error message specifies ascii.
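
The run-time behaviour described here can be reproduced in a small sketch. It assumes Python 3, where the implicit conversion no longer happens, so the ASCII decode that Python 2 attempts behind the comparison is spelled out by hand. The variable names are illustrative and not taken from the original post.

```python
# Sketch for Python 3: spell out the implicit ASCII decode that Python 2
# attempts when a unicode object is compared with a byte string.
text = "Frau Müller machte große Augen"

# The bytes that an iso-8859-1 source file holds for the plain literal s2.
raw = text.encode("iso-8859-1")

try:
    raw.decode("ascii")  # the run-time default codec named in the traceback
    failure = None
except UnicodeDecodeError as exc:
    failure = exc  # keep a reference; exc is cleared after the except block

# Naming the encoding explicitly makes the comparison well defined.
matches = raw.decode("iso-8859-1") == text
```

Here failure records the same codec, byte and position as the traceback in the original post, while the explicit decode compares equal.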

John Roth


Oct 16 '06 #2

Neil Cerutti

On 2006-10-16, lu********@gmail.com <lu********@gmail.com> wrote:

> Hello,
>
> here is something that surprises me.
>
> #coding: iso-8859-1

I think that's supposed to be:

# -*- coding: iso-8859-1 -*-

The special comment changes only the encoding of unicode
literals. In particular, it doesn't change the default encoding
of str literals.

> s1=u"Frau Müller machte große Augen"
> s2="Frau Müller machte große Augen"
> if s1 == s2:
>     pass

On my machine, the ü and ß in s2 are being stored in the code
points of my terminal's encoding, cp437. Unfortunately, cp437 code
points from 127-255 are not the same as those in iso-8859-1.

To fix this, I have to do the following:

>>> s1 == s2.decode('cp437')
True
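
The code page mismatch can be checked with a short sketch. It assumes Python 3 and uses illustrative variable names; it only shows which byte each code page uses for ü and ß, and what happens when bytes are decoded with the wrong one.

```python
# Sketch for Python 3: the same two characters as bytes in each code page.
word = "Müller große"

as_cp437 = word.encode("cp437")
as_latin1 = word.encode("iso-8859-1")

# (cp437 byte, iso-8859-1 byte) for each character.
u_umlaut = ("ü".encode("cp437"), "ü".encode("iso-8859-1"))
sharp_s = ("ß".encode("cp437"), "ß".encode("iso-8859-1"))

# Decoding with the wrong code page succeeds silently but yields other text.
misread = as_cp437.decode("iso-8859-1")
recovered = as_cp437.decode("cp437")
```

The cp437 bytes come out as 0x81 and 0xe1, that is 129 and 225, while iso-8859-1 uses 0xfc and 0xdf for the same characters.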

> Running this code produces a UnicodeDecodeError:
>
> Traceback (most recent call last):
>   File "tmp.py", line 4, in ?
>     if s1 == s2:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6:
> ordinal not in range(128)
>
> I would have expected that "s1 == s2" gives True... or maybe
> False... but raising an error here is unnecessary. I guess that
> the comparison operator decides to convert s2 to a Unicode but
> forgets that I said #coding: iso-8859-1 at the beginning of the
> file.

It's trying to interpret s2 as ASCII, and failing, since byte values
such as 129 and 225 are outside the ASCII range (0-127).

--
Neil Cerutti

Oct 16 '06 #3

luc.saffre

Thanks, John and Neil, for your explanations.

Still I find it rather difficult to explain to a Python beginner why
this error occurs.

Suggestion: shouldn't an error be raised already when I try to assign s2? A
normal string should never be allowed to contain characters that are
not codable using the system encoding. This test could be made at
compile time and would render Python more didactic.

Luc


Oct 19 '06 #4

Neil Cerutti

On 2006-10-19, lu********@gmail.com <lu********@gmail.com> wrote:

> Suggestion: shouldn't an error be raised already when I try to
> assign s2?

There's been discussion on pydev about changing this, but for now
I believe a str is a sequence of bytes in Python, rather than a
string of characters. My current project (an implementation of
the Glk API in Python) would be more troublesome to write if I
had to store all my latin-1 character strings as lists or arrays
of bytes.

--
Neil Cerutti

Oct 19 '06 #5

Leo Kislov

lu********@gmail.com wrote:

> Thanks, John and Neil, for your explanations.
>
> Still I find it rather difficult to explain to a Python beginner why
> this error occurs.
>
> Suggestion: shouldn't an error be raised already when I try to assign s2? A
> normal string should never be allowed to contain characters that are
> not codable using the system encoding. This test could be made at
> compile time and would render Python more didactic.

This is impossible because of backward compatibility: your suggestion
would break a lot of existing programs. The change is planned for
Python 3.0, where it's OK to break backward compatibility if needed.
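
For later readers, here is a hedged sketch of how Python 3 as released handles the two situations from this thread. It assumes a Python 3 interpreter started with default options. The string passed to compile() is an illustrative stand-in for a source file containing a non-ASCII byte-string literal.

```python
# Sketch: how Python 3 treats the two situations from the thread.
source = 's2 = b"Frau Müller"'

try:
    compile(source, "<example>", "exec")
    rejected = False
except SyntaxError:
    # Byte-string literals may only contain ASCII characters.
    rejected = True

# Text and bytes are distinct types; with default interpreter options the
# comparison is simply unequal instead of triggering an implicit decode.
same = "Frau Müller" == "Frau Müller".encode("iso-8859-1")
```

So the literal is refused at compile time, and comparing text with bytes no longer raises UnicodeDecodeError.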

-- Leo.

Oct 20 '06 #6

Fredrik Lundh

lu********@gmail.com wrote:

> Suggestion: shouldn't an error be raised already when I try to assign s2?

variables are not typed in Python. plain assignment will never raise an
exception.

</F>

Oct 20 '06 #7

luc.saffre

I didn't mean that the *assignment* should raise an exception. I mean that
any string constant that cannot be decoded using
sys.getdefaultencoding() should be considered a kind of syntax error.

I agree of course with the argument of backward compatibility, which
means that my suggestion is for Python 3.0, not earlier.

And I admit that my suggestion lacks a solution for Neil Cerutti's use
of non-decodable simple strings. And I admit that there are certainly
more competent people than me to think about this question. I just
wanted to throw my penny into the pond :-)

Luc


Oct 23 '06 #8

Marc 'BlackJack' Rintsch

In <11*********************@e3g2000cwe.googlegroups.com>,
lu********@gmail.com wrote:

> I didn't mean that the *assignment* should raise exception. I mean that
> any string constant that cannot be decoded using
> sys.getdefaultencoding() should be considered a kind of syntax error.

Why? Python strings are *byte strings* and bytes have values in the range
0..255. Why would you restrict them to ASCII only?
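
A short sketch of this point, assuming Python 3, where the byte-string type is called bytes. The names are illustrative. It shows that all 256 values are legal in a byte string and that only decoding to text can fail.

```python
# Sketch for Python 3, where the byte-string type is called bytes.
every_byte = bytes(range(256))

try:
    every_byte.decode("ascii")
    first_non_ascii = None
except UnicodeDecodeError as exc:
    first_non_ascii = exc.start  # index of the first byte ASCII rejects

# iso-8859-1 assigns a character to each of the 256 values.
as_text = every_byte.decode("iso-8859-1")
code_points = [ord(ch) for ch in as_text]
```

The ASCII decode stops at value 128, while iso-8859-1 maps every byte to the code point with the same number.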

Ciao,
Marc 'BlackJack' Rintsch

Oct 23 '06 #9

luc.saffre

Marc 'BlackJack' Rintsch wrote:

> Why? Python strings are *byte strings* and bytes have values in the range
> 0..255. Why would you restrict them to ASCII only?

Because getting an exception when comparing a string with a unicode
string is irritating.

But I don't insist on my PEP. The example just shows another
pitfall with Unicode and why I'll advise any beginner: Never write
text constants that contain non-ASCII chars as simple strings; always
make them Unicode strings by prepending the "u".

Luc

Nov 10 '06 #10

John Machin

Neil Cerutti wrote:

> On 2006-10-16, lu********@gmail.com <lu********@gmail.com> wrote:
>> Hello,
>>
>> here is something that surprises me.
>>
>> #coding: iso-8859-1
>
> I think that's supposed to be:
>
> # -*- coding: iso-8859-1 -*-

Not quite. As PEP 263 says:

"""
More precisely, the first or second line must match the regular
expression "coding[:=]\s*([-\w.]+)".
"""

Nov 10 '06 #11

Neil Cerutti

On 2006-11-10, lu********@gmail.com <lu********@gmail.com> wrote:

> Marc 'BlackJack' Rintsch wrote:
>> Why? Python strings are *byte strings* and bytes have values in the range
>> 0..255. Why would you restrict them to ASCII only?
>
> Because getting an exception when comparing a string with a unicode
> string is irritating.
>
> But I don't insist on my PEP. The example just shows
> another pitfall with Unicode and why I'll advise any
> beginner: Never write text constants that contain non-ASCII
> chars as simple strings; always make them Unicode strings by
> prepending the "u".

That doesn't do any good if you aren't writing them in unicode
code points, though.

--
Neil Cerutti
To succeed in the world it is not enough to be stupid, you must also
be well-mannered. --Voltaire

Nov 10 '06 #12

Neil Cerutti

On 2006-11-10, John Machin <sj******@lexicon.net> wrote:

> Neil Cerutti wrote:
>> On 2006-10-16, lu********@gmail.com <lu********@gmail.com> wrote:
>>> Hello,
>>>
>>> here is something that surprises me.
>>>
>>> #coding: iso-8859-1
>>
>> I think that's supposed to be:
>>
>> # -*- coding: iso-8859-1 -*-
>
> Not quite. As PEP 263 says:
>
> """
> More precisely, the first or second line must match the regular
> expression "coding[:=]\s*([-\w.]+)".
> """

Yep. I was erroneously going by the example in the Unicode Howto.
Thanks for the correction.
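
The pattern quoted from PEP 263 can be tried against both spellings of the declaration. This is a minimal sketch: declared_encoding is a hypothetical helper written for illustration, and it inspects a single line instead of reproducing the interpreter's own check of the first two lines of a file.

```python
import re

# Pattern as quoted from PEP 263 in the post above.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def declared_encoding(line):
    """Return the encoding named in a source line, or None if absent."""
    found = CODING_RE.search(line)
    return found.group(1) if found else None


short_form = declared_encoding("#coding: iso-8859-1")
emacs_form = declared_encoding("# -*- coding: iso-8859-1 -*-")
no_declaration = declared_encoding("s1 = u'abc'")
```

Both the short form from the original post and the Emacs-style form yield the same encoding name, so either spelling is accepted.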

--
Neil Cerutti

Nov 10 '06 #13

Steve Holden

Neil Cerutti wrote:

> On 2006-11-10, lu********@gmail.com <lu********@gmail.com> wrote:
>> Marc 'BlackJack' Rintsch wrote:
>>> Why? Python strings are *byte strings* and bytes have values in the range
>>> 0..255. Why would you restrict them to ASCII only?
>>
>> Because getting an exception when comparing a string with a unicode
>> string is irritating.
>>
>> But I don't insist on my PEP. The example just shows
>> another pitfall with Unicode and why I'll advise any
>> beginner: Never write text constants that contain non-ASCII
>> chars as simple strings; always make them Unicode strings by
>> prepending the "u".
>
> That doesn't do any good if you aren't writing them in unicode
> code points, though.

You tell the interpreter what encoding your source code is in. It then
knows precisely how to decode your string literals into Unicode. How do
you write things in "Unicode code points"?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Nov 10 '06 #14

Neil Cerutti

On 2006-11-10, Steve Holden <st***@holdenweb.com> wrote:

>>> But I don't insist on my PEP. The example just shows
>>> another pitfall with Unicode and why I'll advise any
>>> beginner: Never write text constants that contain non-ASCII
>>> chars as simple strings; always make them Unicode strings by
>>> prepending the "u".
>>
>> That doesn't do any good if you aren't writing them in unicode
>> code points, though.
>
> You tell the interpreter what encoding your source code is in.
> It then knows precisely how to decode your string literals into
> Unicode. How do you write things in "Unicode code points"?

fuer = u"f\xfcr"

--
Neil Cerutti

Nov 10 '06 #15

Leo Kislov

Neil Cerutti wrote:

> On 2006-11-10, Steve Holden <st***@holdenweb.com> wrote:
>
>>>> But I don't insist on my PEP. The example just shows
>>>> another pitfall with Unicode and why I'll advise any
>>>> beginner: Never write text constants that contain non-ASCII
>>>> chars as simple strings; always make them Unicode strings by
>>>> prepending the "u".
>>>
>>> That doesn't do any good if you aren't writing them in unicode
>>> code points, though.
>>
>> You tell the interpreter what encoding your source code is in.
>> It then knows precisely how to decode your string literals into
>> Unicode. How do you write things in "Unicode code points"?
>
> fuer = u"f\xfcr"

Unless you're using a Unicode-unfriendly editor or console, u"f\xfcr" is
the same as u"für":

>>> u"f\xfcr" is u"für"
True

So there is no need to write unicode strings in hexadecimal
representation of code points.
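
The equivalence can also be checked by value. The sketch below assumes Python 3, where every string literal is Unicode text and the u prefix is optional, and it assumes the file is saved as UTF-8. It compares with == because the point is equal content, not object identity.

```python
# Sketch for Python 3, where every string literal is Unicode text.
escaped = "f\xfcr"  # hexadecimal escape for code point U+00FC
literal = "für"     # the character typed directly into the source
named = "f\N{LATIN SMALL LETTER U WITH DIAERESIS}r"

all_equal = escaped == literal == named
```

All three spellings produce the same three-character string.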

-- Leo

Nov 10 '06 #16
