I'm having trouble using elementtree with an XML file that has some
gbk-encoded text. (I can't read Chinese, so I'm taking their word for
it that it's gbk-encoded.) I always have trouble with encodings, so I'm
sure I'm just screwing something simple up. Can anyone help me?
Here's the interactive session. Sorry it's a little verbose, but I
figured it would be better to include too much than not enough. I
basically expected et.ElementTree(file=...) to fail since no encoding
was specified, but I don't know what I'm doing wrong when I use
codecs.open(...)
Thanks in advance for the help!
>>> import elementtree.ElementTree as et
>>> import codecs
>>> et.ElementTree(file=filename)
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "C:\Program
Files\Python\lib\site-packages\elementtree\ElementTree.py", line 543, in
__init__
self.parse(file)
File "C:\Program
Files\Python\lib\site-packages\elementtree\ElementTree.py", line 583, in
parse
parser.feed(data)
File "C:\Program
Files\Python\lib\site-packages\elementtree\ElementTree.py", line 1242,
in feed
self._parser.Parse(data, 0)
ExpatError: not well-formed (invalid token): line 8, column 6
>>> et.ElementTree(file=codecs.open(filename, 'r', 'gbk'))
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "C:\Program
Files\Python\lib\site-packages\elementtree\ElementTree.py", line 543, in
__init__
self.parse(file)
File "C:\Program
Files\Python\lib\site-packages\elementtree\ElementTree.py", line 583, in
parse
parser.feed(data)
File "C:\Program
Files\Python\lib\site-packages\elementtree\ElementTree.py", line 1242,
in feed
self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode characters in position
133-135: ordinal not in range(128)
>>> text = open(filename).read()
>>> text
'<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n
<DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>\n( (IP-HLN
(LCP-TMP (IP (NP-PN-SBJ (NR \xb7\xfc\xc3\xf7\xcf\xbc)) \n\t\t (VP
(VV \xbb\xf1\xb5\xc3) \n\t\t\t (NP-OBJ (NN \xc5\xae\xd7\xd3)
\n\t\t\t\t (NN \xcc\xf8\xcc\xa8) \n\t\t\t\t (NN \xcc\xf8\xcb\xae)
\n\t\t\t\t (NN \xb9\xda\xbe\xfc)))) \n\t\t (LC \xba\xf3)) \n
(PU \xa3\xac) \n (NP-SBJ (NP-PN (NR
\xcb\xd5\xc1\xaa\xb6\xd3)) \n (NP (NN
\xbd\xcc\xc1\xb7))) \n (VP (ADVP (AD \xc8\xc8\xc7\xe9)) \n
(PP-DIR (P \xcf\xf2) \n\t\t (NP (PN \xcb\xfd))) \n
(VP (VV \xd7\xa3\xba\xd8))) \n (PU \xa1\xa3)) )
\n</S>\n<S ID=2567>\n( (FRAG (NR \xd0\xc2\xbb\xaa\xc9\xe7) \n
(NN \xbc\xc7\xd5\xdf) \n (NR \xb3\xcc\xd6\xc1\xc9\xc6) \n
(VV \xc9\xe3) )) \n</S>\n</HEADLINE>\n<TEXT>\n</TEXT>\n</BODY>\n</DOC>\n'
STeVe
The first and most important lesson to learn here is that well-formed
XML must contain an XML header that specifies the encoding used. This has
two consequences for you:
1) all XML parsers expect byte strings, as they first have to read the
header to know what encoding awaits them. So there is no use reading the
XML file with a codec - even if it is the right one. It will get converted
back to a byte string when fed to the parser, with the default codec being
used - resulting in the well-known Unicode error.
2) your XML is _not_ well-formed, as it doesn't contain an XML header!
You need to ask these guys to deliver the XML with a header. Of course, for
now it is OK to just prepend the text with something like <?xml
version="1.0" encoding="gbk"?>. But I'd still request that they deliver it
with that header - otherwise it is _not_ XML, but just something that
happens to look similar, isn't guaranteed to be well-formed, and thus
can't safely be fed to a parser.
HTH Diez
Thanks, that's very helpful. I'll definitely harass the people
producing these files to make sure they put encoding declarations in them.
Here's what I get with the prepending hack:
>>> et.fromstring('<?xml version="1.0" encoding="gbk"?>\n' +
...               open(filename).read())
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "C:\Program
Files\Python\lib\site-packages\elementtree\ElementTree.py", line 960, in XML
parser.feed(text)
File "C:\Program
Files\Python\lib\site-packages\elementtree\ElementTree.py", line 1242,
in feed
self._parser.Parse(data, 0)
ExpatError: unknown encoding: line 1, column 30
Are the XML encoding names different from the Python ones? The "gbk"
encoding seems to work okay from Python:
>>> open(filename).read().decode('gbk')
u'<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n
<DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>\n( (IP-HLN
(LCP-TMP (IP (NP-PN-SBJ (NR \u4f0f\u660e\u971e)) \n\t\t (VP (VV
\u83b7\u5f97) \n\t\t\t (NP-OBJ (NN \u5973\u5b50) \n\t\t\t\t (NN
\u8df3\u53f0) \n\t\t\t\t (NN \u8df3\u6c34) \n\t\t\t\t (NN
\u51a0\u519b)))) \n\t\t (LC \u540e)) \n (PU \uff0c) \n
(NP-SBJ (NP-PN (NR \u82cf\u8054\u961f)) \n (NP (NN
\u6559\u7ec3))) \n (VP (ADVP (AD \u70ed\u60c5)) \n
(PP-DIR (P \u5411) \n\t\t (NP (PN \u5979))) \n (VP
(VV \u795d\u8d3a))) \n (PU \u3002)) ) \n</S>\n<S ID=2567>\n(
(FRAG (NR \u65b0\u534e\u793e) \n (NN \u8bb0\u8005) \n
(NR \u7a0b\u81f3\u5584) \n (VV \u6444) ))
\n</S>\n</HEADLINE>\n<TEXT>\n</TEXT>\n</BODY>\n</DOC>\n'
STeve
I had similar trouble with cElementTree and cp1252 encodings. But
upgrading to a more recent version helped. Did you try parsing with e.g.
sax?
Diez
Hmm... The builtin xml.dom.minidom and xml.sax both also fail to find
the encoding:
>>> import xml.dom.minidom as dom
>>> dom.parseString('<?xml version="1.0" encoding="gbk"?>' +
...                 open(filename).read())
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "C:\Program
Files\Python\lib\site-packages\_xmlplus\dom\minidom.py", line 1925, in
parseString
return expatbuilder.parseString(string)
File "C:\Program
Files\Python\lib\site-packages\_xmlplus\dom\expatbuilder.py", line 942,
in parseString
return builder.parseString(string)
File "C:\Program
Files\Python\lib\site-packages\_xmlplus\dom\expatbuilder.py", line 223,
in parseString
parser.Parse(string, True)
ExpatError: unknown encoding: line 1, column 30
>>> import xml.sax as sax
>>> sax.parseString('<?xml version="1.0" encoding="gbk"?>' +
...                 open(filename).read(), sax.handler.ContentHandler())
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "C:\Program
Files\Python\lib\site-packages\_xmlplus\sax\__init__.py", line 47, in
parseString
parser.parse(inpsrc)
File "C:\Program
Files\Python\lib\site-packages\_xmlplus\sax\expatreader.py", line 109,
in parse
xmlreader.IncrementalParser.parse(self, source)
File "C:\Program
Files\Python\lib\site-packages\_xmlplus\sax\xmlreader.py", line 123, in
parse
self.feed(buffer)
File "C:\Program
Files\Python\lib\site-packages\_xmlplus\sax\expatreader.py", line 220,
in feed
self._err_handler.fatalError(exc)
File "C:\Program
Files\Python\lib\site-packages\_xmlplus\sax\handler.py", line 38, in
fatalError
raise exception
SAXParseException: <unknown>:1:30: unknown encoding
Steven Bethard wrote:
> I'm having trouble using elementtree with an XML file that has some gbk-encoded text. (I can't read Chinese, so I'm taking their word for it that it's gbk-encoded.) I always have trouble with encodings, so I'm sure I'm just screwing something simple up. Can anyone help me?
absolutely!
pyexpat has only limited support for non-standard encodings; the core
expat library only supports UTF-8, UTF-16, US-ASCII, and ISO-8859-1,
and the Python glue layer then adds support for all byte-to-byte
encodings supported by Python on top of that.
if you're using any other encoding, you need to recode the file on the
way in (just decoding to Unicode doesn't work, since the parser expects
an encoded byte stream). the approach shown on this page should work
http://effbot.org/zone/celementtree-encoding.htm
except that it uses the new XMLParser interface which isn't available in
ET 1.2.6, and the corresponding XMLTreeBuilder interface in ET doesn't
support the encoding override argument...
the easiest way to fix this is to modify the file header on the way in; if
the file has an <?xml encoding?> header, rip out the header and recode
from that encoding to utf-8 while parsing.
</F>
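For readers on a current Python, here is a minimal sketch of the "rip out the header and recode to UTF-8" idea described above. It assumes Python 3 and the standard-library xml.etree.ElementTree rather than the ElementTree 1.2.6 used in this thread; the helper name `recode_to_utf8` and the sample document are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration at the very start of a byte string.
XML_DECL = re.compile(rb'^\s*<\?xml[^>]*\?>')

def recode_to_utf8(data, source_encoding):
    """Drop any XML declaration, then recode the bytes to UTF-8.

    Hypothetical helper: with no declaration left, the parser falls
    back to its UTF-8 default, which now matches the actual bytes.
    """
    body = XML_DECL.sub(b'', data, count=1)
    return body.decode(source_encoding).encode('utf-8')

# Hypothetical sample: a GBK-encoded document that declares its encoding.
sample = '<?xml version="1.0" encoding="gbk"?>\n<DOC>\u8df3\u6c34</DOC>'.encode('gbk')

root = ET.fromstring(recode_to_utf8(sample, 'gbk'))
print(root.tag, root.text)
```

The same helper also works on a file with no declaration at all, since the substitution is then a no-op.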
Diez B. Roggisch wrote:
> 2) your xml is _not_ well-formed, as it doesn't contain a xml-header! You need ask these guys to deliver the xml with header. Of course for now it is ok to just prepend the text with something like <?xml version="1.0" encoding="gbk"?>. But I'd still request them to deliver it with that header - otherwise it is _not_ XML, but just something that happens to look similar and doesn't guarantee to be well-formed and thus can be safely fed to a parser.
good advice, but note that an envelope (e.g. an HTTP request or response
body) may override the encoding in the XML file itself. if this arrives in a
MIME message with the proper charset information, it's perfectly okay to
leave out the encoding from the file.
</F>
Hi,
> good advice, but note that an envelope (e.g a HTTP request or response body) may override the encoding in the XML file itself. if this arrives in a MIME message with the proper charset information, it's perfectly okay to leave out the encoding from the file.
It might be practical - still, an XML parser _should_ puke on you, and
certainly some will (elementtree not being one of those, I know :))
So even if it goes over the wire headerless, you should be prepending it
when dealing with the data later.
Regards,
Diez
> pyexpat has only limited support for non-standard encodings; the core expat library only supports UTF-8, UTF-16, US-ASCII, and ISO-8859-1, and the Python glue layer then adds support for all byte-to-byte encodings supported by Python on top of that.
Interesting.
Maybe 4suite is more complete? I'll give it a shot.
Diez
Diez B. Roggisch wrote:
> It might be practical - still, an XML parser _should_ puke on you, and certainly some will (elementtree not being one of those, I know :))
no, the parser must not choke on a file for which the encoding has been
overridden.
for example, the HTTP standard allows the transport layer to recode text/*
resources as long as it updates the charset properly, so if you e.g. send an XML
document as text/xml and charset=iso-8859-1, the transport layer can recode
that to charset=utf-8, *without* rewriting the XML header.
</F>
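A small sketch of what honouring such an external override can look like, assuming Python 3 and the standard-library xml.etree.ElementTree (not the ElementTree version discussed in this thread). The response body and the `content_type_charset` variable are invented for illustration; the `encoding` argument of `XMLParser` is documented to override the encoding given in the document itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical HTTP response: the XML declaration still says ISO-8859-1,
# but the transport recoded the body and now reports charset=utf-8.
content_type_charset = "utf-8"
body = '<?xml version="1.0" encoding="iso-8859-1"?><msg>caf\u00e9</msg>'.encode("utf-8")

# The encoding argument takes precedence over the stale declaration.
parser = ET.XMLParser(encoding=content_type_charset)
parser.feed(body)
root = parser.close()
print(root.text)
```

Without the override, the parser would trust the declaration and decode the two UTF-8 bytes of the accented letter as two separate ISO-8859-1 characters.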
I have to correct myself: I was under the impression that XML _has_ to
contain an XMLDecl (which is the header, possibly with encoding) to be
well-formed.
Interestingly enough, that doesn't have to be the case. A document can very well
be well-formed without a header. The constraints for well-formedness are
scattered throughout the spec, so I'm not sure what they say about the
encoding used in the absence of a header.
I am certain though that I've met parsers which weren't able to digest xml
without XMLDecl - which formed my impression. But then, that wasn't
correct.
Boy, that XML-stuff is always full of surprises - even after so many years
dealing with it..
Diez
Diez B. Roggisch wrote:
> Interestingly enough, that has not to be the case. A document can very well be well-formed without a header. The constraints for well-formedness are scattered throughout the spec, so I'm not sure what they say about the used encoding in absence of a header.
if there's no header, and no external override, the document must use either
UTF-8 or UTF-16, and for UTF-16, a leading byte order mark must be present
(ASCII is of course a subset of UTF-8, but e.g. ISO-8859-1 isn't).
reading http://www.w3.org/TR/2004/REC-xml-20.../#sec-guessing
may also help (at least if you read between the lines).
> Boy, that XML-stuff is always full of surprises - even after so many years dealing with it..
a specification written for humans would have saved the world a lot of
confusion...
</F>
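The default rule stated above (no header and no external override means UTF-8, or UTF-16 with a byte order mark) can be sketched roughly as follows. This is only an illustration in Python 3: the function name `default_xml_encoding` is made up, and it deliberately ignores the other cases covered by the spec's autodetection appendix.

```python
import codecs

def default_xml_encoding(data):
    """Guess the encoding of XML bytes that carry no declaration.

    Hypothetical helper covering only the rule from the post above:
    a UTF-16 byte order mark means UTF-16, anything else is taken
    to be UTF-8.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"

# Bytes like the ones in this thread: GBK text with no declaration.
gbk_bytes = "<DOC>\u8df3\u6c34</DOC>".encode("gbk")
encoding = default_xml_encoding(gbk_bytes)

try:
    gbk_bytes.decode(encoding)
    decodes = True
except UnicodeDecodeError:
    decodes = False

print(encoding, decodes)
```

A parser applying the default therefore has no way to read the GBK file correctly, which is why the file needs either a declaration, an external override, or recoding.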
Hmm... I downloaded the newest cElementTree (and I already had the
newest ElementTree), and here's what I get:
>>> def myparser(file, encoding):
...     f = codecs.open(file, "r", encoding)
...     p = ET.XMLParser(encoding="utf-8")
...     while 1:
...         s = f.read(65536)
...         if not s:
...             break
...         p.feed(s.encode("utf-8"))
...     return ET.ElementTree(p.close())
...
>>> tree = myparser(filename, 'gbk')
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "<interactive input>", line 8, in myparser
SyntaxError: not well-formed (invalid token): line 8, column 6
FWIW, the file used above doesn't have an <?xml encoding?> header:
>>> open(filename).read()
'<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n
<DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>\n( (IP-HLN
(LCP-TMP (IP (NP-PN-SBJ (NR \xb7\xfc\xc3\xf7\xcf\xbc)) \n\t\t (VP
(VV \xbb\xf1\xb5\xc3) \n\t\t\t (NP-OBJ (NN \xc5\xae\xd7\xd3)
\n\t\t\t\t (NN \xcc\xf8\xcc\xa8) \n\t\t\t\t (NN \xcc\xf8\xcb\xae)
\n\t\t\t\t (NN \xb9\xda\xbe\xfc)))) \n\t\t (LC \xba\xf3)) \n
(PU \xa3\xac) \n (NP-SBJ (NP-PN (NR
\xcb\xd5\xc1\xaa\xb6\xd3)) \n (NP (NN
\xbd\xcc\xc1\xb7))) \n (VP (ADVP (AD \xc8\xc8\xc7\xe9)) \n
(PP-DIR (P \xcf\xf2) \n\t\t (NP (PN \xcb\xfd))) \n
(VP (VV \xd7\xa3\xba\xd8))) \n (PU \xa1\xa3)) )
\n</S>\n<S ID=2567>\n( (FRAG (NR \xd0\xc2\xbb\xaa\xc9\xe7) \n
(NN \xbc\xc7\xd5\xdf) \n (NR \xb3\xcc\xd6\xc1\xc9\xc6) \n
(VV \xc9\xe3) )) \n</S>\n</HEADLINE>\n<TEXT>\n</TEXT>\n</BODY>\n</DOC>\n'
STeVe
Steven Bethard wrote:
> SyntaxError: not well-formed (invalid token): line 8, column 6
> FWIW, the file used above doesn't have an <?xml encoding?> header:
> >>> open(filename).read()
> '<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n <DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>
<S ID=2566> isn't a valid XML tag (the attribute value must be quoted)
if I recode the file into UTF-8 and fix the two S tags, the result displays
just fine in IE and Firefox (I get a few boxes/question marks, but I assume
that's a font problem).
</F>
In article <Jf********************@comcast.com>,
Steven Bethard <st************@gmail.com> wrote:
> SyntaxError: not well-formed (invalid token): line 8, column 6
> FWIW, the file used above doesn't have an <?xml encoding?> header:
> >>> open(filename).read()
> '<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n <DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>\n( (IP-HLN
The error seems correct: attribute values have to be quoted in XML, so
<S ID=2566>
is not well-formed XML.
Just
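The quoting rule is easy to check in isolation. A minimal sketch, assuming Python 3 and the standard-library xml.etree.ElementTree; the two one-line documents are invented for the test and are not taken from the file in this thread.

```python
import xml.etree.ElementTree as ET

unquoted = '<S ID=2566>text</S>'    # attribute value without quotes
quoted = '<S ID="2566">text</S>'    # same element, value quoted

try:
    ET.fromstring(unquoted)
    unquoted_ok = True
except ET.ParseError as exc:
    # expat reports this as "not well-formed (invalid token)"
    unquoted_ok = False
    print("rejected:", exc)

elem = ET.fromstring(quoted)
print(elem.get("ID"))
```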
Thanks (to both Fredrik and Just). You stare at XML too long and you
start to miss the obvious things too. =)
Everything works great now:
>>> text = open(filename).read()
>>> text = re.sub(r'<S ID=(\w+)', r'<S ID="\1"', text)
>>> text = text.decode('gbk').encode('utf-8')
>>> et.fromstring(text)
<Element 'DOC' at 00A2AF38>
=)
Steve
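For anyone repeating this on Python 3, a rough equivalent of the final fix might look like the sketch below. It assumes the standard-library xml.etree.ElementTree, whose `fromstring` accepts already-decoded text, so the extra re-encode to UTF-8 is not needed here. The function name `parse_gbk_doc` and the shortened sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def parse_gbk_doc(raw):
    """Decode GBK bytes, quote the bare S IDs, and parse the result."""
    text = raw.decode("gbk")
    # Same substitution as in the session above: <S ID=2566 -> <S ID="2566"
    text = re.sub(r'<S ID=(\w+)', r'<S ID="\1"', text)
    return ET.fromstring(text)

# Hypothetical, shortened stand-in for the file discussed in this thread.
raw = '<DOC>\n<S ID=2566>\n( (NN \u8df3\u6c34) )\n</S>\n</DOC>\n'.encode("gbk")

root = parse_gbk_doc(raw)
sentence = root.find("S")
print(sentence.get("ID"), sentence.text.strip())
```

Reading the real file would use `open(filename, "rb").read()` so that the bytes reach `decode("gbk")` untouched.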