I'm having a horrible time trying to get xml.dom.pulldom to consume a
UTF-8 encoded XML file. Here's what I've tried so far:
>>> xml_utf8 = """<?xml version="1.0" encoding="UTF-8" ?>
... <msg>Simon\xe2\x80\x99s XML nightmare</msg>
... """
>>> from xml.dom import pulldom
>>> parser = pulldom.parseString(xml_utf8)
>>> parser.next()
('START_DOCUMENT', <xml.dom.minidom.Document instance at 0x6f06c0>)
>>> parser.next()
('START_ELEMENT', <DOM Element: msg at 0x6f0710>)
>>> parser.next()
....
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
position 21: ordinal not in range(128)
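One thing worth checking here: position 21 is exactly where U+2019 would fall in a Text node's repr (`<DOM Text node "Simon` is 21 characters long), so the error may come from the interactive prompt echoing the node rather than from the parse itself. A minimal sketch, assuming that is what is happening, which reads `node.data` instead of displaying the event tuple. `collect_text` is a hypothetical helper, and `io.BytesIO` with `pulldom.parse` assumes Python 2.6 or later rather than the 2.5 used in this post:

```python
import io
from xml.dom import pulldom

# Same document as above, written as an explicit UTF-8 byte string.
xml_utf8 = (b'<?xml version="1.0" encoding="UTF-8" ?>\n'
            b'<msg>Simon\xe2\x80\x99s XML nightmare</msg>\n')


def collect_text(xml_bytes):
    """Hypothetical helper: join the data of every CHARACTERS event."""
    chunks = []
    for event, node in pulldom.parse(io.BytesIO(xml_bytes)):
        if event == pulldom.CHARACTERS:
            # Use the node's data directly; never rely on repr(node).
            chunks.append(node.data)
    return u"".join(chunks)


text = collect_text(xml_utf8)
```

If this runs cleanly, the parser is handling the UTF-8 input and only the display of the node is failing.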
xml.dom.minidom can handle the string just fine:
>>> from xml.dom import minidom
>>> dom = minidom.parseString(xml_utf8)
>>> dom.toxml()
u'<?xml version="1.0" ?><msg>Simon\u2019s XML nightmare</msg>'
If I pass a unicode string to pulldom instead of a UTF-8 encoded
bytestring it still breaks:
>>> xml_unicode = u'<?xml version="1.0" ?><msg>Simon\u2019s XML nightmare</msg>'
>>> parser = pulldom.parseString(xml_unicode)
....
/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/dom/pulldom.py in parseString(string, parser)
    346
    347     bufsize = len(string)
--> 348     buf = StringIO(string)
    349     if not parser:
    350         parser = xml.sax.make_parser()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
position 32: ordinal not in range(128)
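The traceback points at `StringIO(string)`, which suggests the unicode string is being coerced to ASCII before the parser ever sees it. A minimal sketch, assuming that is the cause, which encodes the text to UTF-8 bytes first and hands the parser a byte stream. `event_names` is a hypothetical helper, and `io.BytesIO` assumes Python 2.6 or later:

```python
import io
from xml.dom import pulldom

xml_unicode = u'<?xml version="1.0" ?><msg>Simon\u2019s XML nightmare</msg>'

# Encode explicitly. With no encoding declaration in the prolog,
# an XML parser treats the bytes as UTF-8 by default.
xml_bytes = xml_unicode.encode("utf-8")


def event_names(data):
    """Hypothetical helper: list the pulldom event types for a byte string."""
    return [event for event, node in pulldom.parse(io.BytesIO(data))]


events = event_names(xml_bytes)
```

The encoded bytes contain the same `\xe2\x80\x99` sequence as the first example, so both inputs should reach the parser in the same form.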
Is it possible to consume UTF-8 or unicode input using xml.dom.pulldom, or
should I try something else?
Thanks,
Simon Willison