Treating a unicode string as latin-1

Hello,

I'm using ElementTree to parse an XML file which includes some data
encoded as cp1252, for example:

<name>Bob\x92s Breakfast</name>

If this was a regular bytestring, I would convert it to utf8 using the
following:
>>> print 'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
Bob’s Breakfast

But ElementTree gives me back a unicode string, so I get the following
error:
>>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 3: ordinal not in range(128)

How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

Thanks,

Simon Willison
Jan 3 '08 #1
On Jan 3, 1:31 pm, Simon Willison <si...@simonwillison.net> wrote:
> How can I tell Python "I know this says it's a unicode string, but I
> need you to treat it like a bytestring"?

u'Bob\x92s Breakfast'.encode('latin-1')

--
Paul Hankin
Jan 3 '08 #2
Simon Willison <si***@simonwillison.net> wrote:
> How can I tell Python "I know this says it's a unicode string, but I
> need you to treat it like a bytestring"?

Can you not just fix your XML file so that it uses the same encoding as it
claims to use? If the XML says it contains utf8 encoded data then it should
not contain cp1252 encoded data, period.

If you really must, then try encoding with latin1 and then decoding with
cp1252:
>>> print u'Bob\x92s Breakfast'.encode('latin1').decode('cp1252')
Bob’s Breakfast

The latin1 codec will convert unicode characters in the range 0-255 to the
same single-byte value.
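
As a rough Python 3 sketch of the round trip described above (the thread itself uses Python 2.5, where the literal would be written u'...'): latin-1 maps code points 0-255 to the identical byte values, so encoding with it recovers the bytes the parser originally saw, and those can then be decoded as cp1252. The helper name redecode_as_cp1252 is made up for illustration.

```python
def redecode_as_cp1252(text):
    """Undo a wrong latin-1 decode and re-decode the same bytes as cp1252."""
    # latin-1 maps U+0000..U+00FF to the bytes 0x00..0xFF one-to-one,
    # so this step recovers the raw bytes that were in the file.
    raw = text.encode("latin-1")
    # Decode the recovered bytes with the encoding they were really in.
    return raw.decode("cp1252")


wrong = "Bob\x92s Breakfast"        # contains U+0092, a C1 control character
fixed = redecode_as_cp1252(wrong)   # contains U+2019, right single quotation mark
print(repr(fixed))
```

Note that this raises UnicodeEncodeError if the text contains any character above U+00FF, and UnicodeDecodeError for the five byte values that Python's cp1252 codec leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).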
Jan 3 '08 #3
Simon Willison wrote:
> Hello,
>
> I'm using ElementTree to parse an XML file which includes some data
> encoded as cp1252, for example:
>
> <name>Bob\x92s Breakfast</name>
>
> If this was a regular bytestring, I would convert it to utf8 using the
> following:
> >>> print 'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
> Bob’s Breakfast
>
> But ElementTree gives me back a unicode string, so I get the following
> error:
> >>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/cp1252.py", line 15, in decode
>     return codecs.charmap_decode(input,errors,decoding_table)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 3: ordinal not in range(128)
>
> How can I tell Python "I know this says it's a unicode string, but I
> need you to treat it like a bytestring"?

I don't get your problem. You get a unicode object, which means that ET has
already decoded it for you, as any XML parser must do.

So - why don't you get rid of that .decode('cp1252') and happily encode it
to utf-8?

Diez
Jan 3 '08 #4
On [20080103 14:36], Simon Willison (si***@simonwillison.net) wrote:
> How can I tell Python "I know this says it's a unicode string, but I
> need you to treat it like a bytestring"?

Although this does not address the exact question, it does raise the issue of
how you are using ElementTree. When I use the following:

test.xml

<entry>
<name>Bob\x92s Breakfast</name>
</entry>

parse.py

from xml.etree.ElementTree import ElementTree

xmlfile = open('test.xml')

tree = ElementTree()
tree.parse(xmlfile)
elem = tree.find('name')

print type(elem.text)

I get a string type back and not a unicode string.

However, if you are mixing encodings within the same file, e.g. cp1252 data in
a UTF-8 encoded file, then you are creating a ton of problems.

--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org/ asmodai
イェルーン ラウフ*ック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
When moved to complain about others, remember that karma is endless and it
is loving that leads to love...
Jan 3 '08 #5
Simon Willison wrote:
> But ElementTree gives me back a unicode string, so I get the following
> error:
> >>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/encodings/cp1252.py", line 15, in decode
>     return codecs.charmap_decode(input,errors,decoding_table)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 3: ordinal not in range(128)
>
> How can I tell Python "I know this says it's a unicode string, but I
> need you to treat it like a bytestring"?

ET has already decoded the CP1252 data for you. If you want UTF-8, all
you need to do is to encode it:
>>> u'Bob\x92s Breakfast'.encode('utf8')
'Bob\xc2\x92s Breakfast'

</F>

Jan 3 '08 #6
Fredrik Lundh <fr*****@pythonware.com> wrote:
> ET has already decoded the CP1252 data for you. If you want UTF-8, all
> you need to do is to encode it:
>
> >>> u'Bob\x92s Breakfast'.encode('utf8')
> 'Bob\xc2\x92s Breakfast'

I think he is claiming that the encoding information in the file is
incorrect and therefore it has been decoded incorrectly.

I would think it more likely that he wants to end up with u'Bob\u2019s
Breakfast' rather than u'Bob\x92s Breakfast' although u'Dog\u2019s dinner'
seems a probable consequence.
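
To make the difference concrete, here is a small Python 3 comparison (the thread uses Python 2.5) of the UTF-8 bytes produced for the two candidate characters. The variable names are illustrative only.

```python
mis_decoded = "Bob\x92s Breakfast"    # what the parser handed back
intended = "Bob\u2019s Breakfast"     # what the author presumably meant

# U+0092 encodes to two bytes in UTF-8, while U+2019 encodes to three,
# so simply calling .encode('utf8') on the mis-decoded text does not
# produce the bytes for the curly apostrophe.
mis_decoded_utf8 = mis_decoded.encode("utf-8")
intended_utf8 = intended.encode("utf-8")

print(mis_decoded_utf8)
print(intended_utf8)
```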
Jan 3 '08 #7
Duncan Booth wrote:
> Fredrik Lundh <fr*****@pythonware.com> wrote:
>> ET has already decoded the CP1252 data for you. If you want UTF-8, all
>> you need to do is to encode it:
>>
>> >>> u'Bob\x92s Breakfast'.encode('utf8')
>> 'Bob\xc2\x92s Breakfast'
>
> I think he is claiming that the encoding information in the file is
> incorrect and therefore it has been decoded incorrectly.
>
> I would think it more likely that he wants to end up with u'Bob\u2019s
> Breakfast' rather than u'Bob\x92s Breakfast' although u'Dog\u2019s dinner'
> seems a probable consequence.

If that's the case, he should read the file as a string, decode and re-encode
it (probably into a StringIO), and then feed it to the parser.
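
A minimal Python 3 sketch of that approach, under some assumptions: the whole file is really cp1252, and it either has no XML declaration or declares UTF-8 (a declaration naming another encoding would have to be rewritten as well). The function name parse_mislabeled_xml and the sample bytes are hypothetical; io.BytesIO stands in for the StringIO mentioned above.

```python
import io
import xml.etree.ElementTree as ET


def parse_mislabeled_xml(raw_bytes, actual_encoding="cp1252"):
    """Re-encode bytes from their real encoding to UTF-8, then parse."""
    # Decode with the encoding the data is actually in ...
    text = raw_bytes.decode(actual_encoding)
    # ... and re-encode as UTF-8, which is what the parser will assume.
    utf8_bytes = text.encode("utf-8")
    return ET.parse(io.BytesIO(utf8_bytes))


# Stand-in for the contents of the file, read in binary mode.
raw = b"<entry><name>Bob\x92s Breakfast</name></entry>"
tree = parse_mislabeled_xml(raw)
name = tree.find("name").text
print(repr(name))
```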

Diez
Jan 3 '08 #8
Diez B. Roggisch wrote:
>> I would think it more likely that he wants to end up with u'Bob\u2019s
>> Breakfast' rather than u'Bob\x92s Breakfast' although u'Dog\u2019s dinner'
>> seems a probable consequence.
>
> If that's the case, he should read the file as string, de- and encode it
> (probably into a StringIO) and then feed it to the parser.

Some alternatives:

- clean up the offending strings:

http://effbot.org/zone/unicode-gremlins.htm

- turn the offending strings back to iso-8859-1, and decode them again:

u = u'Bob\x92s Breakfast'
u = u.encode("iso-8859-1").decode("cp1252")

- upgrade to ET 1.3 (available in alpha) and use the parser's encoding
option to override the file's encoding:

parser = ET.XMLParser(encoding="cp1252")
tree = ET.parse(source, parser)
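
For readers on current Python 3, the standard library's xml.etree.ElementTree.XMLParser takes an encoding keyword argument that overrides the encoding named in the file. The following is only a sketch of the last alternative: it assumes the underlying expat build accepts cp1252 as an override, and the sample bytes are made up.

```python
import xml.etree.ElementTree as ET

# Bytes that are really cp1252: 0x92 is the curly apostrophe.
raw = b"<entry><name>Bob\x92s Breakfast</name></entry>"

# Tell the parser which encoding to use instead of what the file claims.
parser = ET.XMLParser(encoding="cp1252")
root = ET.fromstring(raw, parser=parser)

name = root.find("name").text
print(repr(name))
```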

</F>

Jan 3 '08 #9
