Treating a unicode string as latin-1

Simon Willison

Hello,

I'm using ElementTree to parse an XML file which includes some data
encoded as cp1252, for example:

<name>Bob\x92s Breakfast</name>

If this was a regular bytestring, I would convert it to utf8 using the
following:

>>> print 'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')

Bob’s Breakfast

But ElementTree gives me back a unicode string, so I get the following
error:

>>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 3: ordinal not in range(128)

How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

Thanks,

Simon Willison

Jan 3 '08 #1

Paul Hankin

On Jan 3, 1:31 pm, Simon Willison <si...@simonwillison.net> wrote:

How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

u'Bob\x92s Breakfast'.encode('latin-1')

--
Paul Hankin

Jan 3 '08 #2

Duncan Booth

Simon Willison <si***@simonwillison.net> wrote:

How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

Can you not just fix your xml file so that it uses the same encoding as it
claims to use? If the xml says it contains utf8 encoded data then it should
not contain cp1252 encoded data, period.

If you really must, then try encoding with latin1 and then decoding with
cp1252:

>>> print u'Bob\x92s Breakfast'.encode('latin1').decode('cp1252')

Bob’s Breakfast

The latin1 codec will convert unicode characters in the range 0-255 to the
same single-byte value.
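
A minimal sketch of that round trip, written for Python 3 (where `str` plays the role of the unicode type discussed in this thread). The helper name `repair_cp1252` is made up for illustration and is not part of any library:

```python
def repair_cp1252(text):
    """Undo a Latin-1 decode of bytes that were really cp1252.

    Hypothetical helper; assumes every character in `text` is in the
    range U+0000 to U+00FF.
    """
    # Latin-1 maps code points 0-255 straight back to the same byte values,
    # so this step recovers the original bytes exactly.
    raw = text.encode("latin-1")
    # Decoding those bytes with the encoding they were really written in
    # yields the intended characters.
    return raw.decode("cp1252")


fixed = repair_cp1252("Bob\x92s Breakfast")
print(ascii(fixed))            # 'Bob\u2019s Breakfast'
print(fixed.encode("utf-8"))   # b'Bob\xe2\x80\x99s Breakfast'
```

Two caveats: the encode step raises UnicodeEncodeError if the string contains anything above U+00FF, and the decode step raises UnicodeDecodeError for the five byte values that Python's cp1252 codec leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).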

Jan 3 '08 #3

Diez B. Roggisch

Simon Willison wrote:

Hello,

I'm using ElementTree to parse an XML file which includes some data
encoded as cp1252, for example:

<name>Bob\x92s Breakfast</name>

If this was a regular bytestring, I would convert it to utf8 using the
following:

>>> print 'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')

Bob’s Breakfast

But ElementTree gives me back a unicode string, so I get the following
error:

>>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 3: ordinal not in range(128)

How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

I don't get your problem. You get a unicode-object. Which means that it got
decoded by ET for you, as any XML-parser must do.

So - why don't you get rid of that .decode('cp1252') and happily encode it
to utf-8?

Diez

Jan 3 '08 #4

Jeroen Ruigrok van der Werven

-On [20080103 14:36], Simon Willison (si***@simonwillison.net) wrote:

>How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

Although it does not address the exact question, it does raise the question of
how you are using ElementTree. When I use the following:

test.xml

<entry>
<name>Bob\x92s Breakfast</name>
</entry>

parse.py

from xml.etree.ElementTree import ElementTree

xmlfile = open('test.xml')

tree = ElementTree()
tree.parse(xmlfile)
elem = tree.find('name')

print type(elem.text)

I get a string type back and not a unicode string.

However, if you are mixing encodings within the same file, e.g. cp1252 in a
UTF-8 encoded file, then you are creating a ton of problems.

--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
When moved to complain about others, remember that karma is endless and it
is loving that leads to love...

Jan 3 '08 #5

Fredrik Lundh

Simon Willison wrote:

But ElementTree gives me back a unicode string, so I get the following
error:

>>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 3: ordinal not in range(128)

How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

ET has already decoded the CP1252 data for you. If you want UTF-8, all
you need to do is to encode it:

>>> u'Bob\x92s Breakfast'.encode('utf8')

'Bob\xc2\x92s Breakfast'

</F>

Jan 3 '08 #6

Duncan Booth

Fredrik Lundh <fr*****@pythonware.com> wrote:

ET has already decoded the CP1252 data for you. If you want UTF-8, all
you need to do is to encode it:

>>> u'Bob\x92s Breakfast'.encode('utf8')

'Bob\xc2\x92s Breakfast'

I think he is claiming that the encoding information in the file is
incorrect and therefore it has been decoded incorrectly.

I would think it more likely that he wants to end up with u'Bob\u2019s
Breakfast' rather than u'Bob\x92s Breakfast' although u'Dog\u2019s dinner'
seems a probable consequence.
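
To make the distinction concrete, here is a small Python 3 sketch comparing the two strings. The variable names are illustrative only:

```python
import unicodedata

as_parsed = "Bob\x92s Breakfast"     # what the parser handed back
intended = "Bob\u2019s Breakfast"    # what the author of the file meant

# Both strings encode to UTF-8 without error, but to different bytes.
print(as_parsed.encode("utf-8"))     # b'Bob\xc2\x92s Breakfast'
print(intended.encode("utf-8"))      # b'Bob\xe2\x80\x99s Breakfast'

# U+0092 is a C1 control character, not punctuation.
print(unicodedata.category("\x92"))  # Cc
print(unicodedata.name("\u2019"))    # RIGHT SINGLE QUOTATION MARK
```

So encoding the parsed string directly succeeds, but it preserves a control character where an apostrophe was meant.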

Jan 3 '08 #7

Diez B. Roggisch

Duncan Booth schrieb:

Fredrik Lundh <fr*****@pythonware.com> wrote:

>ET has already decoded the CP1252 data for you. If you want UTF-8, all
you need to do is to encode it:

>>> u'Bob\x92s Breakfast'.encode('utf8')
'Bob\xc2\x92s Breakfast'

I think he is claiming that the encoding information in the file is
incorrect and therefore it has been decoded incorrectly.

I would think it more likely that he wants to end up with u'Bob\u2019s
Breakfast' rather than u'Bob\x92s Breakfast' although u'Dog\u2019s dinner'
seems a probable consequence.

If that's the case, he should read the file as string, de- and encode it
(probably into a StringIO) and then feed it to the parser.
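
A rough Python 3 sketch of that idea, under some assumptions: the file's bytes really are cp1252, its XML declaration claims something else (ISO-8859-1 in the sample below), and the function name `parse_mislabelled_xml` is invented for illustration. In Python 3 the in-memory buffer is an `io.BytesIO` rather than a StringIO:

```python
import io
import xml.etree.ElementTree as ET


def parse_mislabelled_xml(raw_bytes, real_encoding="cp1252"):
    """Parse XML whose bytes are really in `real_encoding`.

    Hypothetical helper: transcodes to UTF-8 in memory before parsing.
    """
    # Decode with the encoding the data was actually written in,
    # then re-encode as UTF-8.
    utf8_bytes = raw_bytes.decode(real_encoding).encode("utf-8")
    # The declaration inside the document may still name the wrong
    # encoding, so tell the parser explicitly that the buffer is UTF-8.
    parser = ET.XMLParser(encoding="utf-8")
    return ET.parse(io.BytesIO(utf8_bytes), parser=parser)


# Sample document (made up): declared as ISO-8859-1, actually cp1252.
sample = (b'<?xml version="1.0" encoding="iso-8859-1"?>'
          b'<entry><name>Bob\x92s Breakfast</name></entry>')

tree = parse_mislabelled_xml(sample)
name = tree.find("name").text
print(ascii(name))   # 'Bob\u2019s Breakfast'
```

To use it on a real file, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes in.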

Diez

Jan 3 '08 #8

Fredrik Lundh

Diez B. Roggisch wrote:

>I would think it more likely that he wants to end up with u'Bob\u2019s
Breakfast' rather than u'Bob\x92s Breakfast' although u'Dog\u2019s dinner'
seems a probable consequence.

If that's the case, he should read the file as string, de- and encode it
(probably into a StringIO) and then feed it to the parser.

some alternatives:

- clean up the offending strings:

http://effbot.org/zone/unicode-gremlins.htm

- turn the offending strings back to iso-8859-1, and decode them again:

u = u'Bob\x92s Breakfast'
u = u.encode("iso-8859-1").decode("cp1252")

- upgrade to ET 1.3 (available in alpha) and use the parser's encoding
option to override the file's encoding:

parser = ET.XMLParser(encoding="cp1252")
tree = ET.parse(source, parser)

</F>

Jan 3 '08 #9
