UnicodeDecodeError help please?

Robin Haswell

Okay I'm getting really frustrated with Python's Unicode handling, I'm
trying everything I can think of an I can't escape Unicode(En|De)codeError
no matter what I try.

Could someone explain to me what I'm doing wrong here, so I can hope to
throw light on the myriad of similar problems I'm having? Thanks :-)

Python 2.4.1 (#2, May 6 2005, 11:22:24)
[GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import sys
sys.getdefaultencoding() 'utf-8' import htmlentitydefs
char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
print char Â© str = u"Apple"
print str Apple str + char Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte a = str+char Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte

Basically my app is a search engine - I'm grabbing content from pages
using HTMLParser and storing it in a database but I'm running in to these
problems all over the shop (from decoding the entities to calling
str.lower()) - I don't know what encoding my pages are coming in as, I'm
just happy enough to accept that they're either UTF-8 or latin-1 with
entities.

Any help would be great, I just hope that I have a brainwave over the
weekend because I've lost two days to Unicode errors now. It's even worse
that I've written the same app in PHP before with none of these problems -
and PHP4 doesn't even support Unicode.

Cheers

-Rob

Apr 7 '06 #1

Subscribe Post Reply

3482

Robert Kern

Robin Haswell wrote:

Okay I'm getting really frustrated with Python's Unicode handling, I'm
trying everything I can think of an I can't escape Unicode(En|De)codeError
no matter what I try.
Have you read any of the documentation about Python's Unicode support? E.g.,

http://effbot.org/zone/unicode-objects.htm
Could someone explain to me what I'm doing wrong here, so I can hope to
throw light on the myriad of similar problems I'm having? Thanks :-)

Python 2.4.1 (#2, May 6 2005, 11:22:24)
[GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
import sys
sys.getdefaultencoding()
'utf-8'

How did this happen? It's supposed to be 'ascii' and not user-settable.

import htmlentitydefs
char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
print char
Â©
str = u"Apple"
print str
Apple
str + char
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
a = str+char

Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte

The values in htmlentitydefs.entitydefs are encoded in latin-1 (or are numeric
entities which you still have to parse). So decode using the latin-1 codec.

--
Robert Kern
ro*********@gmail.com

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Apr 7 '06 #2

Fredrik Lundh

Robin Haswell wrote:

Could someone explain to me what I'm doing wrong here, so I can hope to
throw light on the myriad of similar problems I'm having? Thanks :-)

Python 2.4.1 (#2, May 6 2005, 11:22:24)
[GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
import sys
sys.getdefaultencoding() 'utf-8'
that's bad. do not hack the default encoding. it'll only make you sorry
when you try to port your code to some other python installation, or use
a library that relies on the factory settings being what they're supposed
to be. do not hack the default encoding.

back to your code:

import htmlentitydefs
char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
print char ©

that's a standard (8-bit) string:

type(char) <type 'str'> ord(char) 169 len(char) 1

one byte that contains the value 169. looks like ISO-8859-1 (Latin-1) to me.
let's see what the documentation says:

entitydefs
A dictionary mapping XHTML 1.0 entity definitions to their replacement
text in ISO Latin-1.

alright, so it's an ISO Latin-1 string.
str = u"Apple"
print str Apple

type(str) <type 'unicode'> len(str) 5

that's a 5-character unicode string.
str + char Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0:
unexpected code byte

you're trying to combine an 8-bit string with a Unicode string, and you've
told Python (by hacking the site module) to treat all 8-bit strings as if they
contain UTF-8. UTF-8 != ISO-Latin-1.

so, you can of course convert the string you got from the entitydefs dict
to a unicode string before you combine the two strings

unicode(char, "iso-8859-1") + str u'\xa9Apple'

but the htmlentitydefs module offers a better alternative:

name2codepoint
A dictionary that maps HTML entity names to the Unicode
codepoints. New in version 2.3.

which allows you to do
char = unichr(htmlentitydefs.name2codepoint["copy"])
char u'\xa9' char + str u'\xa9Apple'

without having to deal with things like
len(htmlentitydefs.entitydefs["copy"]) 1 len(htmlentitydefs.entitydefs["rarr"])

7
Basically my app is a search engine - I'm grabbing content from pages
using HTMLParser and storing it in a database but I'm running in to these
problems all over the shop (from decoding the entities to calling
str.lower()) - I don't know what encoding my pages are coming in as, I'm
just happy enough to accept that they're either UTF-8 or latin-1 with
entities.
UTF-8 and Latin-1 are two different things, so your (international) users
will hate you if you don't do this right.
It's even worse that I've written the same app in PHP before with none of
these problems - and PHP4 doesn't even support Unicode.

a PHP4 application without I18N problems? I'm not sure I believe you... ;-)

</F>

Apr 7 '06 #3

Paul Boddie

Robin Haswell wrote:

Okay I'm getting really frustrated with Python's Unicode handling, I'm
trying everything I can think of an I can't escape Unicode(En|De)codeError
no matter what I try.
If you follow a few relatively simple rules, the days of Unicode errors
will be over. Let's take a look!
Could someone explain to me what I'm doing wrong here, so I can hope to
throw light on the myriad of similar problems I'm having? Thanks :-)

Python 2.4.1 (#2, May 6 2005, 11:22:24)
[GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
import sys
sys.getdefaultencoding() 'utf-8'
Note that this only specifies the encoding assumed to be used in plain
strings when such strings are used to create Unicode objects. For some
applications this is sufficient, but where you may be dealing with many
different character sets (or encodings), having a default encoding will
not be sufficient. This has an impact below and in your wider problem.

import htmlentitydefs
char = htmlentitydefs.entitydefs["copy"] # this is an HTML © -a copyright symbol
print char ©

It's better here to use repr(char) to see exactly what kind of object
it is (or just give the name of the variable at the prompt). For me,
it's a plain string, despite htmlentitydefs defining the each name in
terms of its "Unicode codepoint". Moreover, for me the plain string
uses the "Latin-1" (or more correctly iso-8859-1) character set, and I
imagine that you get the same result.

str = u"Apple"
print str Apple str + char

Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte

Here, Python attempts to make a Unicode object from char, using the
default encoding (which is utf-8) and finds that char is a plain string
containing non-utf-8 character values, specifically a single iso-8859-1
character value. It consequently complains. This is quite unfortunate
since you probably expected Python to give you the entity definition
either as a Unicode object or a plain string of your chosen encoding.

Having never used htmlentitydefs before, I can only imagine that it
provides plain strings containing iso-8859-1 values in order to support
"legacy" HTML processing (given that semi-modern HTML favours &#xx;
entities, and XHTML uses genuine character sequences in the stated
encoding), and that getting anything other than such strings might not
be particularly useful.

Anyway, what you'd do here is this:

str + unicode(char, "iso-8859-1)

Rule #1: if you have plain strings and you want them as Unicode, you
must somewhere state what encoding those strings are in, preferably as
you convert them to Unicode objects. Here, we can't rely on the default
encoding being correct and must explicitly state a different encoding.
Generally, stating the encoding is the right thing to do, rather than
assuming some default setting that may differ across environments.
Somehow, my default encoding is "ascii" not "utf-8", so your code would
fail on my system by relying on the default encoding.

[...]
Basically my app is a search engine - I'm grabbing content from pages
using HTMLParser and storing it in a database but I'm running in to these
problems all over the shop (from decoding the entities to calling
str.lower()) - I don't know what encoding my pages are coming in as, I'm
just happy enough to accept that they're either UTF-8 or latin-1 with
entities.
Rule #2: get your content as Unicode as soon as possible, then work
with it in Unicode. Once you've made your content Unicode, you
shouldn't get UnicodeDecodeError all over the place, and the only time
you then risk an UnicodeEncodeError is when you convert your content
back to plain strings, typically for serialisation purposes.

Rule #3: get acquainted with what kind of encodings apply to the
incoming data. If you are prepared to assume that the data is either
utf-8 or iso-8859-1, first try making Unicode objects from the data
stating that utf-8 is the encoding employed, and only if that fails
should you consider it as iso-8859-1, since an utf-8 string can quite
happily be interpreted (incorrectly) as a bunch of iso-8859-1
characters but not vice versa; thus, you have a primitive means of
validation.
Any help would be great, I just hope that I have a brainwave over the
weekend because I've lost two days to Unicode errors now. It's even worse
that I've written the same app in PHP before with none of these problems -
and PHP4 doesn't even support Unicode.

Perhaps that's why you never saw any such problems, but have you looked
at the quality of your data?

Paul

Apr 7 '06 #4

Ben C

On 2006-04-07, Robin Haswell <ro*@digital-crocus.com> wrote:

Okay I'm getting really frustrated with Python's Unicode handling, I'm
trying everything I can think of an I can't escape Unicode(En|De)codeError
no matter what I try.

Could someone explain to me what I'm doing wrong here, so I can hope to
throw light on the myriad of similar problems I'm having? Thanks :-)

Python 2.4.1 (#2, May 6 2005, 11:22:24)
[GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
import sys
sys.getdefaultencoding() 'utf-8' import htmlentitydefs
char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
print char © str = u"Apple"
print str Apple str + char Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte a = str+char

Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte

Try this:

import htmlentitydefs

char = htmlentitydefs.entitydefs["copy"]
char = unicode(char, "Latin1")

str = u"Apple"
print str
print str + char

htmlentitydefs.entitydefs is "A dictionary mapping XHTML 1.0 entity
definitions to their replacement text in ISO Latin-1".

So you get "char" back as a Latin-1 string. Then we use the builtin
function unicode to make a unicode string (which doesn't have an
encoding, as I understand it, it's just unicode). This can be added to
u"Apple" and printed out.

It prints out OK on a UTF-8 terminal, but you can print it in other
encodings using encode:

print (str + char).encode("Latin1")

for example.

For your search engine you should look at server headers, metatags,
BOMs, and guesswork, in roughly that order, to determine the encoding of
the source document. Convert it all to unicode (using builtin function
unicode) and use that to build your indexes etc., and write results out
in whatever you need to write it out in (probably UTF-8).

HTH.

Apr 7 '06 #5

by: Ruslan | last post by:

Hi, everybody. In this excerpt of code enc = 'some_type_of_encoding' def _encode(v): if isinstance(v, UnicodeType): v = v.encode(v) return v

Python

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 10: ordinal not in range(128)

by: Robin Siebler | last post by:

I have no idea what is causing this error, or how to fix it. The full error is: Traceback (most recent call last): File "D:\ScriptRuntime\PS\Automation\Handlers\SCMTestToolResourceToolsBAT.py",...

Python

Bug in python (Weird UnicodeDecodeError)

by: dbri.tcc | last post by:

Hello I am getting somewhat random UnicodeDecodeError messages in my program. It is random in that I will be going through a pysqlite database of records, manipulate the results, and it will...

Python

what is this UnicodeDecodeError:....?

by: kath | last post by:

I have a number of excel files. In each file DATE is represented by different name. I want to read the date from those different file. Also the date is in different column in different file. To...

Python

[Help]UnicodeDecodeError

by: Karl | last post by:

error msg: Mod_python error: "PythonHandler mod_python.publisher" Traceback (most recent call last): File "/usr/lib/python2.3/site-packages/mod_python/apache.py", line 299, in HandlerDispatch...

Python

Long way around UnicodeDecodeError, or 'ascii' codec can't decode byte

by: Oleg Parashchenko | last post by:

Hello, I'm working on an unicode-aware application. I like to use "print" to debug programs, but in this case it was nightmare. The most popular result of "print" was: UnicodeDecodeError:...

Python

UnicodeDecodeError, how to elegantly deal with this?

by: Jorgen Bodde | last post by:

Hi All, I am relatively new to python unicode pains and I would like to have some advice. I have this snippet of code: def playFile(cmd, args): argstr = list() for arg in...

Python

Re: UnicodeDecodeError, how to elegantly deal with this?

by: Jorgen Bodde | last post by:

Hi Edwin, Filemask is obvious as it is assigned in the python code itself. It is "%file%". The idea is that the file clicked is substituted for the "%file%" by the replace action. The file that...

Python

[2.5.1] "UnicodeDecodeError: 'ascii' codec can't decode byte"?

by: Gilles Ganault | last post by:

Hello I'm getting this error while downloading and parsing web pages: ===== title = m.group(1) UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 48: ordinal not in...

Python

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...

C# / C Sharp

UnicodeDecodeError help please?

Similar topics