Treating a unicode string as latin-1

Hello,

I'm using ElementTree to parse an XML file which includes some data
encoded as cp1252, for example:

<name>Bob\x92s Breakfast</name>

If this was a regular bytestring, I would convert it to utf8 using the
following:
>>> print 'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
Bob’s Breakfast

But ElementTree gives me back a unicode string, so I get the following
error:
>>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 3: ordinal not in range(128)

How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

Thanks,

Simon Willison
Jan 3 '08 #1
On Jan 3, 1:31 pm, Simon Willison <si...@simonwillison.net> wrote:
> How can I tell Python "I know this says it's a unicode string, but I
> need you to treat it like a bytestring"?

u'Bob\x92s Breakfast'.encode('latin-1')

--
Paul Hankin
Jan 3 '08 #2
Simon Willison <si***@simonwillison.net> wrote:
> How can I tell Python "I know this says it's a unicode string, but I
> need you to treat it like a bytestring"?

Can you not just fix your xml file so that it uses the same encoding as it
claims to use? If the xml says it contains utf8 encoded data then it should
not contain cp1252 encoded data, period.

If you really must, then try encoding with latin1 and then decoding with
cp1252:
>>> print u'Bob\x92s Breakfast'.encode('latin1').decode('cp1252')
Bob’s Breakfast

The latin1 codec will convert unicode characters in the range 0-255 to the
same single-byte value.
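
For readers on Python 3, where every `str` is already Unicode, the same round trip can be sketched roughly as below. The helper name `repair_cp1252` is made up for illustration, and returning the text unchanged when the round trip fails is an assumption about what a caller would want, not something stated in this thread.

```python
def repair_cp1252(text):
    """Undo a latin-1 decode of data that was really cp1252.

    Hypothetical helper; not part of ElementTree or the standard library.
    """
    try:
        # latin-1 maps code points 0-255 straight back to the same byte values.
        raw = text.encode("latin-1")
    except UnicodeEncodeError:
        # A character above U+00FF means the text was not mis-decoded this way.
        return text
    try:
        return raw.decode("cp1252")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D
        # undefined; leave such text alone rather than guess.
        return text


print(ascii(repair_cp1252("Bob\x92s Breakfast")))  # 'Bob\u2019s Breakfast'
```

Text that is already correct passes through untouched, so the helper can be applied to every string coming out of the parser.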
Jan 3 '08 #3
Simon Willison wrote:
> Hello,
>
> I'm using ElementTree to parse an XML file which includes some data
> encoded as cp1252, for example:
>
> <name>Bob\x92s Breakfast</name>
>
> If this was a regular bytestring, I would convert it to utf8 using the
> following:
> >>> print 'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
> Bob’s Breakfast
>
> But ElementTree gives me back a unicode string, so I get the following
> error:
> >>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/cp1252.py", line 15, in decode
>     return codecs.charmap_decode(input,errors,decoding_table)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 3: ordinal not in range(128)
>
> How can I tell Python "I know this says it's a unicode string, but I
> need you to treat it like a bytestring"?

I don't get your problem. You get a unicode-object. Which means that it got
decoded by ET for you, as any XML-parser must do.

So - why don't you get rid of that .decode('cp1252') and happily encode it
to utf-8?

Diez
Jan 3 '08 #4
-On [20080103 14:36], Simon Willison (si***@simonwillison.net) wrote:
> How can I tell Python "I know this says it's a unicode string, but I
> need you to treat it like a bytestring"?

Although it does not address the exact question, it does raise the issue of how
you are using ElementTree. When I use the following:

test.xml

<entry>
<name>Bob\x92s Breakfast</name>
</entry>

parse.py

from xml.etree.ElementTree import ElementTree

xmlfile = open('test.xml')

tree = ElementTree()
tree.parse(xmlfile)
elem = tree.find('name')

print type(elem.text)

I get a string type back and not a unicode string.

However, if you are mixing encodings within the same file, e.g. cp1252 in a
UTF-8 encoded file, then you are creating a ton of problems.

--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
When moved to complain about others, remember that karma is endless and it
is loving that leads to love...
Jan 3 '08 #5
Simon Willison wrote:
> But ElementTree gives me back a unicode string, so I get the following
> error:
> >>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/cp1252.py", line 15, in decode
>     return codecs.charmap_decode(input,errors,decoding_table)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 3: ordinal not in range(128)
>
> How can I tell Python "I know this says it's a unicode string, but I
> need you to treat it like a bytestring"?

ET has already decoded the CP1252 data for you. If you want UTF-8, all
you need to do is to encode it:
>>> u'Bob\x92s Breakfast'.encode('utf8')
'Bob\xc2\x92s Breakfast'
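
To make the byte-level difference concrete, here is a small comparison written for Python 3 (an assumption; the thread itself uses Python 2.5). The variable names are illustrative only. It contrasts encoding the string exactly as the parser returned it with encoding the string that cp1252 byte 0x92 is meant to represent.

```python
# U+0092 is a C1 control character; U+2019 is the right single quotation
# mark that byte 0x92 stands for in cp1252.
as_parsed = "Bob\x92s Breakfast"
as_intended = b"Bob\x92s Breakfast".decode("cp1252")

print(as_parsed.encode("utf-8"))    # b'Bob\xc2\x92s Breakfast'
print(as_intended.encode("utf-8"))  # b'Bob\xe2\x80\x99s Breakfast'
```

Both results are valid UTF-8, so nothing fails at encode time; only the second one displays as an apostrophe.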

</F>

Jan 3 '08 #6
Fredrik Lundh <fr*****@pythonware.com> wrote:
> ET has already decoded the CP1252 data for you. If you want UTF-8, all
> you need to do is to encode it:
> >>> u'Bob\x92s Breakfast'.encode('utf8')
> 'Bob\xc2\x92s Breakfast'

I think he is claiming that the encoding information in the file is
incorrect and therefore it has been decoded incorrectly.

I would think it more likely that he wants to end up with u'Bob\u2019s
Breakfast' rather than u'Bob\x92s Breakfast' although u'Dog\u2019s dinner'
seems a probable consequence.
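
One rough way to tell the two cases apart programmatically is to look for C1 control characters (U+0080 to U+009F), which rarely belong in real text. The sketch below is Python 3, and `c1_controls` is a hypothetical name rather than an existing library function.

```python
def c1_controls(text):
    """Return (index, character) pairs for C1 control characters in text.

    Hypothetical helper: finding any such character suggests that cp1252
    bytes were decoded as latin-1 (or as an incorrectly declared encoding).
    """
    return [(i, ch) for i, ch in enumerate(text) if "\x80" <= ch <= "\x9f"]


print(c1_controls("Bob\x92s Breakfast"))    # [(3, '\x92')]
print(c1_controls("Bob\u2019s Breakfast"))  # []
```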
Jan 3 '08 #7
Duncan Booth wrote:
> Fredrik Lundh <fr*****@pythonware.com> wrote:
> > ET has already decoded the CP1252 data for you. If you want UTF-8, all
> > you need to do is to encode it:
> > >>> u'Bob\x92s Breakfast'.encode('utf8')
> > 'Bob\xc2\x92s Breakfast'
>
> I think he is claiming that the encoding information in the file is
> incorrect and therefore it has been decoded incorrectly.
>
> I would think it more likely that he wants to end up with u'Bob\u2019s
> Breakfast' rather than u'Bob\x92s Breakfast' although u'Dog\u2019s dinner'
> seems a probable consequence.

If that's the case, he should read the file as string, de- and encode it
(probably into a StringIO) and then feed it to the parser.
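
A minimal Python 3 sketch of that approach follows, using `io.BytesIO` where Python 2 code would use StringIO. The function name `parse_mislabeled_xml` is hypothetical. The sketch assumes the document either declares UTF-8 or carries no encoding declaration at all; a declaration naming some other encoding would have to be rewritten as well.

```python
import io
import xml.etree.ElementTree as ET


def parse_mislabeled_xml(raw_bytes, actual_encoding="cp1252"):
    """Parse XML whose bytes are really in actual_encoding.

    Hypothetical helper. Assumes the XML declaration, if present, says UTF-8.
    """
    # Decode with the encoding the data is really in...
    text = raw_bytes.decode(actual_encoding)
    # ...then re-encode so the bytes match what the parser expects.
    utf8_bytes = text.encode("utf-8")
    return ET.parse(io.BytesIO(utf8_bytes))


raw = b"<entry><name>Bob\x92s Breakfast</name></entry>"
tree = parse_mislabeled_xml(raw)
print(ascii(tree.find("name").text))  # 'Bob\u2019s Breakfast'
```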

Diez
Jan 3 '08 #8
Diez B. Roggisch wrote:
> > I would think it more likely that he wants to end up with u'Bob\u2019s
> > Breakfast' rather than u'Bob\x92s Breakfast' although u'Dog\u2019s dinner'
> > seems a probable consequence.
>
> If that's the case, he should read the file as string, de- and encode it
> (probably into a StringIO) and then feed it to the parser.

some alternatives:

- clean up the offending strings:

http://effbot.org/zone/unicode-gremlins.htm

- turn the offending strings back to iso-8859-1, and decode them again:

u = u'Bob\x92s Breakfast'
u = u.encode("iso-8859-1").decode("cp1252")

- upgrade to ET 1.3 (available in alpha) and use the parser's encoding
option to override the file's encoding:

parser = ET.XMLParser(encoding="cp1252")
tree = ET.parse(source, parser)
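
For comparison, the parser-override alternative might look like this with the `xml.etree.ElementTree` module bundled with Python 3, where `XMLParser` takes `encoding` as a keyword argument. This is a sketch under that assumption, with a made-up in-memory document standing in for the real file. It has not been checked against the ET 1.3 alpha mentioned above.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative stand-in for the real file: byte 0x92 is a cp1252 apostrophe.
raw = b"<entry><name>Bob\x92s Breakfast</name></entry>"

# The encoding given here overrides whatever the document itself declares.
parser = ET.XMLParser(encoding="cp1252")
tree = ET.parse(io.BytesIO(raw), parser)

name = tree.find("name").text
print(ascii(name))
```

Compared with repairing each string after parsing, this fixes the decoding once, at the source.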

</F>

Jan 3 '08 #9
