Problem with sets and Unicode strings

Hi!

The following program in a UTF-8 encoded file:
# -*- coding: UTF-8 -*-

FIELDS = ("Fächer", )
FROZEN_FIELDS = frozenset(FIELDS)
FIELDS_SET = set(FIELDS)

print u"Fächer" in FROZEN_FIELDS
print u"Fächer" in FIELDS_SET
print u"Fächer" in FIELDS
gives this output:

False
False
Traceback (most recent call last):
  File "test.py", line 9, in ?
    print u"Fächer" in FIELDS
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Why do the first two print statements succeed while the third one fails
with an exception?

Why does the use of set/frozenset remove the exception?
Thanks,
Dennis
Jun 27 '06 #1
On 6/27/06, Dennis Benzinger <De************ **@gmx.net> wrote:
> [...]
> Why do the first two print statements succeed while the third one fails
> with an exception?

Actually, all three statements fail to produce the correct result.

> Why does the use of set/frozenset remove the exception?

Because sets use a hash algorithm to find matches, whereas the last
statement directly compares a unicode string with a byte string. Byte
strings can only contain ASCII characters, which is why Python raises an
exception. The problem is very easy to fix: use unicode strings for
all non-ASCII strings.
Jun 27 '06 #2
Serge Orlov wrote:
> On 6/27/06, Dennis Benzinger <De************ **@gmx.net> wrote:
>> [...]
>> Why do the first two print statements succeed while the third one fails
>> with an exception?
>
> Actually, all three statements fail to produce the correct result.

So this is a bug in Python?

>> Why does the use of set/frozenset remove the exception?
>
> Because sets use a hash algorithm to find matches, whereas the last
> statement directly compares a unicode string with a byte string. Byte
> strings can only contain ASCII characters, which is why Python raises an
> exception. The problem is very easy to fix: use unicode strings for
> all non-ASCII strings.

No, byte strings contain characters which are at least 8-bit wide
<http://docs.python.org/ref/types.html>. But I don't understand what
Python is trying to decode and why the exception says something about
the ASCII codec, because my file is encoded with UTF-8.
Dennis
Jun 27 '06 #3
On 6/27/06, Dennis Benzinger <De************ **@gmx.net> wrote:
> Serge Orlov wrote:
>> On 6/27/06, Dennis Benzinger <De************ **@gmx.net> wrote:
>>> [...]
>>> Why do the first two print statements succeed while the third one fails
>>> with an exception?
>>
>> Actually, all three statements fail to produce the correct result.
>
> So this is a bug in Python?

No.

>>> Why does the use of set/frozenset remove the exception?
>>
>> Because sets use a hash algorithm to find matches, whereas the last
>> statement directly compares a unicode string with a byte string. Byte
>> strings can only contain ASCII characters, which is why Python raises an
>> exception. The problem is very easy to fix: use unicode strings for
>> all non-ASCII strings.
>
> No, byte strings contain characters which are at least 8-bit wide
> <http://docs.python.org/ref/types.html>.

Yes, but later it says that non-ASCII characters do not have a
universal meaning assigned to them. In other words, if you put byte
0xE4 into a byte string, all Python knows about it is that it's *some*
character. If you put character U+00E4 into a unicode string, Python
knows it's a "latin small letter a with diaeresis". Trying to compare
*some* character with a specific character is obviously undefined.
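
To make the "*some* character" point concrete, here is a minimal sketch. It is written for Python 3 (an assumption; the thread itself is about Python 2), and the choice of latin-1 and cp1251 as the two codecs is only for illustration. The same byte 0xE4 turns into two different characters depending on which encoding you assume, while U+00E4 in a text string is unambiguous.

```python
import unicodedata

raw = b"\xe4"  # a single byte with value 0xE4, no encoding attached

# The same byte maps to different characters under different codecs.
as_latin1 = raw.decode("latin-1")
as_cp1251 = raw.decode("cp1251")

print(unicodedata.name(as_latin1))   # LATIN SMALL LETTER A WITH DIAERESIS
print(unicodedata.name(as_cp1251))   # CYRILLIC SMALL LETTER DE

# A text string holding U+00E4 always means the same character.
print(unicodedata.name("\u00e4"))    # LATIN SMALL LETTER A WITH DIAERESIS
```

Without knowing the encoding there is no way to say which of the two the byte was meant to be.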
> But I don't understand what
> Python is trying to decode and why the exception says something about
> the ASCII codec, because my file is encoded with UTF-8.

Because byte strings can come from different sources (network, files,
etc.), not only from the source of your program, Python cannot assume
all of them are UTF-8. It assumes they are ASCII, because most
widespread text encodings are ASCII-based. Actually it's a guess,
since there are UTF-16, UTF-32 and other non-ASCII encodings. If you
want to experience life without guesses, put
sys.setdefaultencoding("undefined") into site.py.
Jun 27 '06 #4
Dennis Benzinger wrote:
> No, byte strings contain characters which are at least 8-bit wide
> <http://docs.python.org/ref/types.html>. But I don't understand what
> Python is trying to decode and why the exception says something about
> the ASCII codec, because my file is encoded with UTF-8.

Please read

http://www.amk.ca/python/howto/unicode

The string in all of the containers (FIELDS, FROZEN_FIELDS, FIELDS_SET) is a
regular byte string, not a Unicode string. The encoding declaration only
controls how the file is parsed. The string literal that you use for FIELDS is a
regular string literal, not a Unicode string literal, so the object it creates
is an 8-bit byte string. The tuple containment test is attempting to compare
your Unicode string object to the regular string object for equality. Python
does these comparisons by attempting to decode the regular string into a Unicode
string. Since there is no encoding information present on regular strings at
this point (since the encoding declaration in your file only controls parsing,
nothing else), Python assumes ASCII and throws an exception otherwise.
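
As a rough illustration of where "byte 0xc3 in position 1" comes from, the sketch below reproduces the decode step by hand. It assumes Python 3, where the decode has to be written explicitly; in the Python 2 program above it happens implicitly during the comparison. The variable names are made up for this example.

```python
# The bytes a Python 2 byte string literal "Fächer" holds in a UTF-8 file.
raw = "F\u00e4cher".encode("utf-8")
print(raw)  # b'F\xc3\xa4cher'

error = None
try:
    # This is the step the tuple containment test performs implicitly.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the encoding the bytes were actually written in works.
decoded = raw.decode("utf-8")
print(decoded == "F\u00e4cher")  # True
```

Position 0 is the plain ASCII "F"; position 1 is 0xc3, the first of the two bytes UTF-8 uses for "ä", and that is where the ASCII codec gives up.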

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Jun 27 '06 #5
Dennis Benzinger wrote:
> No, byte strings contain characters which are at least 8-bit wide
> <http://docs.python.org/ref/types.html>. But I don't understand what
> Python is trying to decode and why the exception says something about
> the ASCII codec, because my file is encoded with UTF-8.

[addendum to the other replies]

The file encoding directive is used by Python to convert u"xxx" strings
into unicode objects with the right conversion rules when compiling the code.
When a string is written simply as "xxx", it's an 8-bit string with NO
encoding data associated. When such strings must be converted, they are
assumed to use sys.getdefaultencoding() [generally ascii -
forced ascii in python 2.5].

So, a short reply: the UTF-8 directive has no effect on 8-bit strings;
use unicode strings to handle non-ASCII text correctly.
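
In the Python 2 program from the original post, that advice amounts to writing FIELDS = (u"Fächer", ). The sketch below shows the same idea in Python 3 (an assumption, since the thread predates it): decode byte strings once, with a known encoding, before putting them into any container. The helper to_text is hypothetical, not part of any library.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return value as text, decoding byte strings."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Byte strings as they might arrive from a UTF-8 source.
raw_fields = (b"F\xc3\xa4cher",)

# Decode once at the boundary, then work with text only.
FIELDS = tuple(to_text(field) for field in raw_fields)
FROZEN_FIELDS = frozenset(FIELDS)
FIELDS_SET = set(FIELDS)

print("F\u00e4cher" in FROZEN_FIELDS)  # True
print("F\u00e4cher" in FIELDS_SET)     # True
print("F\u00e4cher" in FIELDS)         # True
```

With text on both sides, the set, the frozenset and the tuple all agree.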

A+

Laurent.

Jun 28 '06 #6
Serge Orlov wrote:
> Yes, but later it says that non-ASCII characters do not have a
> universal meaning assigned to them. In other words, if you put byte
> 0xE4 into a byte string, all Python knows about it is that it's *some*
> character. If you put character U+00E4 into a unicode string, Python
> knows it's a "latin small letter a with diaeresis". Trying to compare
> *some* character with a specific character is obviously undefined.
> [...]

But <http://docs.python.org/ref/comparisons.html> says:

Strings are compared lexicographically using the numeric equivalents
(the result of the built-in function ord()) of their characters. Unicode
and 8-bit strings are fully interoperable in this behavior.

Doesn't this mean that Unicode and 8-bit strings can be compared and
this comparison is well defined? (even if it is not meaningful)

Thanks for your answers,
Dennis
Jun 28 '06 #7
Robert Kern wrote:
[...]


But I'd say that it's not intuitive that for sets x in y can be false
(without raising an exception!) while doing the same with a tuple
raises an exception. Where is this difference documented?
Thanks,
Dennis
Jun 28 '06 #8
> But <http://docs.python.org/ref/comparisons.html> says:
>
> Strings are compared lexicographically using the numeric equivalents
> (the result of the built-in function ord()) of their characters. Unicode
> and 8-bit strings are fully interoperable in this behavior.
>
> Doesn't this mean that Unicode and 8-bit strings can be compared and
> this comparison is well defined? (even if it is not meaningful)

Obviously not - otherwise you wouldn't have the problems you observed,
would you?

What happens, of course, is that in a string-to-unicode comparison the
string gets coerced to a unicode value - using the default encoding!

# -*- coding: latin1 -*-

print "ö".decode("latin1") == u"ö"   # explicit decode with the right codec: True
print "ö" == u"ö"                    # implicit coercion via the default (ASCII) codec fails

So - they are fully interoperable and the comparison is well defined - when
the coercion is successful.
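
A small model of that coercion, for readers on Python 3 (an assumption; Python 3 never coerces implicitly, so the step is spelled out here). The function coerced_equal is a hypothetical stand-in for what the Python 2 comparison does internally, not a real API.

```python
def coerced_equal(byte_string, text, default_encoding="ascii"):
    """Hypothetical model: decode with the default encoding, then compare.

    Raises UnicodeDecodeError when the bytes are not valid in
    default_encoding, which is the "coercion is not successful" case.
    """
    return byte_string.decode(default_encoding) == text


# Pure ASCII bytes: coercion succeeds and the comparison is well defined.
print(coerced_equal(b"abc", "abc"))          # True

# Explicit decode with the right codec: also fine.
print(b"\xf6".decode("latin1") == "\u00f6")  # True

# Non-ASCII bytes with the ASCII default: the coercion itself fails.
try:
    coerced_equal(b"\xf6", "\u00f6")
    coercion_failed = False
except UnicodeDecodeError:
    coercion_failed = True
print(coercion_failed)                       # True
```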

Diez
Jun 28 '06 #9
> But I'd say that it's not intuitive that for sets x in y can be false
> (without raising an exception!) while doing the same with a tuple
> raises an exception. Where is this difference documented?

2.3.7 Set Types -- set, frozenset

....

Set elements are like dictionary keys; they need to define both __hash__ and
__eq__ methods.
....

And it has to hold that

a == b => hash(a) == hash(b)

but NOT

hash(a) == hash(b) => a == b

Thus if the hashes vary, the set doesn't bother to actually compare the
values.
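
A minimal sketch of that shortcut, assuming Python 3 and CPython's set implementation. Probe is a hypothetical class that counts how often __eq__ is called. It deliberately breaks the rule above (equal objects, different hashes) so that the skipped comparison becomes visible.

```python
class Probe:
    """Hypothetical helper: fixed hash value, counts calls to __eq__."""

    def __init__(self, label, hash_value):
        self.label = label
        self.hash_value = hash_value
        self.eq_calls = 0

    def __hash__(self):
        return self.hash_value

    def __eq__(self, other):
        self.eq_calls += 1
        return isinstance(other, Probe) and self.label == other.label


stored = Probe("x", hash_value=1)
needle = Probe("x", hash_value=2)  # same label, but a different hash

# Set lookup: the hashes differ, so __eq__ is never consulted.
in_set = needle in {stored}
eq_calls_after_set = stored.eq_calls + needle.eq_calls
print(in_set, eq_calls_after_set)      # False 0

# Tuple lookup: no hashing involved, every element is compared with ==.
in_tuple = needle in (stored,)
eq_calls_after_tuple = stored.eq_calls + needle.eq_calls
print(in_tuple, eq_calls_after_tuple)  # True 1
```

The tuple test is the only one that runs the comparison, which is why it is also the only one that can raise from inside it.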

Diez
Jun 28 '06 #10
