UTF8 encoded XML file. Here's what I've tried so far:
>>> xml_utf8 = """<?xml version="1.0" encoding="UTF-8" ?>
<msg>Simon\xe2\x80\x99s XML nightmare</msg>
"""
>>> from xml.dom import pulldom
>>> parser = pulldom.parseString(xml_utf8)
>>> parser.next()
('START_DOCUMENT', <xml.dom.minidom.Document instance at 0x6f06c0>)
>>> parser.next()
('START_ELEMENT', <DOM Element: msg at 0x6f0710>)
>>> parser.next()
....
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
position 21: ordinal not in range(128)
xml.dom.minidom can handle the string just fine:
>>> from xml.dom import minidom
>>> dom = minidom.parseString(xml_utf8)
>>> dom.toxml()
u'<?xml version="1.0" ?><msg>Simon\u2019s XML nightmare</msg>'
If I pass a unicode string to pulldom instead of a utf8 encoded
bytestring it still breaks:
>>> xml_unicode = u'<?xml version="1.0" ?><msg>Simon\u2019s XML nightmare</msg>'
>>> parser = pulldom.parseString(xml_unicode)
....
/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/dom/pulldom.py in parseString(string, parser)
    346
    347 bufsize = len(string)
--> 348 buf = StringIO(string)
    349 if not parser:
    350 parser = xml.sax.make_parser()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
position 32: ordinal not in range(128)
Is it possible to consume UTF-8 or unicode input using xml.dom.pulldom, or
should I try something else?
Thanks,
Simon Willison