elementtree and gbk encoding

Steven Bethard

I'm having trouble using elementtree with an XML file that has some
gbk-encoded text. (I can't read Chinese, so I'm taking their word for
it that it's gbk-encoded.) I always have trouble with encodings, so I'm
sure I'm just screwing something simple up. Can anyone help me?

Here's the interactive session. Sorry it's a little verbose, but I
figured it would be better to include too much than not enough. I
basically expected et.ElementTree(file=...) to fail since no encoding
was specified, but I don't know what I'm doing wrong when I use
codecs.open(...)

Thanks in advance for the help!

>>> import elementtree.ElementTree as et
>>> import codecs
>>> et.ElementTree(file=filename)
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 543, in __init__
    self.parse(file)
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 583, in parse
    parser.feed(data)
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 1242, in feed
    self._parser.Parse(data, 0)
ExpatError: not well-formed (invalid token): line 8, column 6
>>> et.ElementTree(file=codecs.open(filename, 'r', 'gbk'))
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 543, in __init__
    self.parse(file)
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 583, in parse
    parser.feed(data)
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 1242, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode characters in position
133-135: ordinal not in range(128)
>>> text = open(filename).read()
>>> text
'<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n
<DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>\n( (IP-HLN
(LCP-TMP (IP (NP-PN-SBJ (NR \xb7\xfc\xc3\xf7\xcf\xbc)) \n\t\t (VP
(VV \xbb\xf1\xb5\xc3) \n\t\t\t (NP-OBJ (NN \xc5\xae\xd7\xd3)
\n\t\t\t\t (NN \xcc\xf8\xcc\xa8) \n\t\t\t\t (NN \xcc\xf8\xcb\xae)
\n\t\t\t\t (NN \xb9\xda\xbe\xfc)))) \n\t\t (LC \xba\xf3)) \n
(PU \xa3\xac) \n (NP-SBJ (NP-PN (NR
\xcb\xd5\xc1\xaa\xb6\xd3)) \n (NP (NN
\xbd\xcc\xc1\xb7))) \n (VP (ADVP (AD \xc8\xc8\xc7\xe9)) \n
(PP-DIR (P \xcf\xf2) \n\t\t (NP (PN \xcb\xfd))) \n
(VP (VV \xd7\xa3\xba\xd8))) \n (PU \xa1\xa3)) )
\n</S>\n<S ID=2567>\n( (FRAG (NR \xd0\xc2\xbb\xaa\xc9\xe7) \n
(NN \xbc\xc7\xd5\xdf) \n (NR \xb3\xcc\xd6\xc1\xc9\xc6) \n
(VV \xc9\xe3) )) \n</S>\n</HEADLINE>\n<TEXT>\n</TEXT>\n</BODY>\n</DOC>\n'

STeVe

Mar 14 '06 #1


Diez B. Roggisch

Steven Bethard wrote:

> I'm having trouble using elementtree with an XML file that has some
> gbk-encoded text. (I can't read Chinese, so I'm taking their word for
> it that it's gbk-encoded.) I always have trouble with encodings, so I'm
> sure I'm just screwing something simple up. Can anyone help me?
>
> Here's the interactive session. Sorry it's a little verbose, but I
> figured it would be better to include too much than not enough. I
> basically expected et.ElementTree(file=...) to fail since no encoding
> was specified, but I don't know what I'm doing wrong when I use
> codecs.open(...)

The first and most important lesson to learn here is that well-formed
XML must contain an XML header that specifies the encoding used. This has
two consequences for you:

1) All XML parsers expect byte strings, as they first have to read the
header to know what encoding awaits them. So there is no use reading the
XML file with a codec, even if it is the right one. It will get converted
back to a byte string when fed to the parser, with the default codec being
used, resulting in the well-known Unicode error.

2) Your XML is _not_ well-formed, as it doesn't contain an XML header!
You need to ask these guys to deliver the XML with a header. Of course, for
now it is OK to just prepend the text with something like <?xml
version="1.0" encoding="gbk"?>. But I'd still request them to deliver it
with that header. Otherwise it is _not_ XML, but just something that
happens to look similar, isn't guaranteed to be well-formed, and thus
can't safely be fed to a parser.

HTH,
Diez

Mar 14 '06 #2
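
The UnicodeEncodeError from the second traceback can be reproduced in isolation. This is only a minimal sketch, assuming Python 3 (the thread itself uses Python 2, where, as described in point 1, the encode back to bytes happened implicitly with the default ASCII codec). The bytes are copied from the first (NR ...) token of the sample file; the variable names are illustrative.

```python
# Bytes copied from the sample file's first (NR ...) token; per the thread they are GBK.
raw = b"\xb7\xfc\xc3\xf7\xcf\xbc"

# Step 1: what reading through a 'gbk' codec does: bytes -> Unicode text.
text = raw.decode("gbk")

# Step 2: the step made explicit here: Unicode text -> bytes using the ASCII codec.
# Chinese characters have no ASCII representation, so this raises.
try:
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(repr(text))
print(error)
```

Decoding is not wrong in itself; the trouble is that the parser wants bytes again, and ASCII cannot carry the decoded characters.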

Steven Bethard

Diez B. Roggisch wrote:

> Steven Bethard wrote:
>> I'm having trouble using elementtree with an XML file that has some
>> gbk-encoded text. (I can't read Chinese, so I'm taking their word for
>> it that it's gbk-encoded.) I always have trouble with encodings, so
>> I'm sure I'm just screwing something simple up. Can anyone help me?
>>
>> Here's the interactive session. Sorry it's a little verbose, but I
>> figured it would be better to include too much than not enough. I
>> basically expected et.ElementTree(file=...) to fail since no encoding
>> was specified, but I don't know what I'm doing wrong when I use
>> codecs.open(...)
>
> The first and most important lesson to learn here is that well-formed
> XML must contain an XML header that specifies the encoding used. This has
> two consequences for you:
>
> 1) All XML parsers expect byte strings, as they first have to read the
> header to know what encoding awaits them. So there is no use reading the
> XML file with a codec, even if it is the right one. It will get converted
> back to a byte string when fed to the parser, with the default codec being
> used, resulting in the well-known Unicode error.
>
> 2) Your XML is _not_ well-formed, as it doesn't contain an XML header!
> You need to ask these guys to deliver the XML with a header. Of course, for
> now it is OK to just prepend the text with something like <?xml
> version="1.0" encoding="gbk"?>. But I'd still request them to deliver it
> with that header. Otherwise it is _not_ XML, but just something that
> happens to look similar, isn't guaranteed to be well-formed, and thus
> can't safely be fed to a parser.

Thanks, that's very helpful. I'll definitely harass the people
producing these files to make sure they put encoding declarations in them.

Here's what I get with the prepending hack:

>>> et.fromstring('<?xml version="1.0" encoding="gbk"?>\n' + open(filename).read())
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 960, in XML
    parser.feed(text)
  File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 1242, in feed
    self._parser.Parse(data, 0)
ExpatError: unknown encoding: line 1, column 30

Are the XML encoding names different from the Python ones? The "gbk"
encoding seems to work okay from Python:

>>> open(filename).read().decode('gbk')
u'<DOC>\n<DOCID>ART242</DOCID>\n<HEADER>\n
<DATE></DATE>\n</HEADER>\n<BODY>\n<HEADLINE>\n<S ID=2566>\n( (IP-HLN
(LCP-TMP (IP (NP-PN-SBJ (NR \u4f0f\u660e\u971e)) \n\t\t (VP (VV
\u83b7\u5f97) \n\t\t\t (NP-OBJ (NN \u5973\u5b50) \n\t\t\t\t (NN
\u8df3\u53f0) \n\t\t\t\t (NN \u8df3\u6c34) \n\t\t\t\t (NN
\u51a0\u519b)))) \n\t\t (LC \u540e)) \n (PU \uff0c) \n
(NP-SBJ (NP-PN (NR \u82cf\u8054\u961f)) \n (NP (NN
\u6559\u7ec3))) \n (VP (ADVP (AD \u70ed\u60c5)) \n
(PP-DIR (P \u5411) \n\t\t (NP (PN \u5979))) \n (VP
(VV \u795d\u8d3a))) \n (PU \u3002)) ) \n</S>\n<S ID=2567>\n(
(FRAG (NR \u65b0\u534e\u793e) \n (NN \u8bb0\u8005) \n
(NR \u7a0b\u81f3\u5584) \n (VV \u6444) ))
\n</S>\n</HEADLINE>\n<TEXT>\n</TEXT>\n</BODY>\n</DOC>\n'

STeve

Mar 14 '06 #3

Diez B. Roggisch

> Here's what I get with the prepending hack:
>
> >>> et.fromstring('<?xml version="1.0" encoding="gbk"?>\n' + open(filename).read())
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
>   File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 960, in XML
>     parser.feed(text)
>   File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 1242, in feed
>     self._parser.Parse(data, 0)
> ExpatError: unknown encoding: line 1, column 30
>
> Are the XML encoding names different from the Python ones? The "gbk"
> encoding seems to work okay from Python:

I had similar trouble with cElementTree and cp1252 encodings. But
upgrading to a more recent version helped. Did you try parsing with e.g.
sax?

Diez

Mar 14 '06 #4

Steven Bethard

Diez B. Roggisch wrote:

>> Here's what I get with the prepending hack:
>>
>> >>> et.fromstring('<?xml version="1.0" encoding="gbk"?>\n' + open(filename).read())
>> Traceback (most recent call last):
>>   File "<interactive input>", line 1, in ?
>>   File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 960, in XML
>>     parser.feed(text)
>>   File "C:\Program Files\Python\lib\site-packages\elementtree\ElementTree.py", line 1242, in feed
>>     self._parser.Parse(data, 0)
>> ExpatError: unknown encoding: line 1, column 30
>>
>> Are the XML encoding names different from the Python ones? The "gbk"
>> encoding seems to work okay from Python:
>
> I had similar trouble with cElementTree and cp1252 encodings. But
> upgrading to a more recent version helped. Did you try parsing with e.g.
> sax?

Hmm... The builtin xml.dom.minidom and xml.sax both also fail to find
the encoding:

>>> import xml.dom.minidom as dom
>>> dom.parseString('<?xml version="1.0" encoding="gbk"?>' + open(filename).read())
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Program Files\Python\lib\site-packages\_xmlplus\dom\minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "C:\Program Files\Python\lib\site-packages\_xmlplus\dom\expatbuilder.py", line 942, in parseString
    return builder.parseString(string)
  File "C:\Program Files\Python\lib\site-packages\_xmlplus\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
ExpatError: unknown encoding: line 1, column 30

>>> import xml.sax as sax
>>> sax.parseString('<?xml version="1.0" encoding="gbk"?>' + open(filename).read(), sax.handler.ContentHandler())
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Program Files\Python\lib\site-packages\_xmlplus\sax\__init__.py", line 47, in parseString
    parser.parse(inpsrc)
  File "C:\Program Files\Python\lib\site-packages\_xmlplus\sax\expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Program Files\Python\lib\site-packages\_xmlplus\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Program Files\Python\lib\site-packages\_xmlplus\sax\expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "C:\Program Files\Python\lib\site-packages\_xmlplus\sax\handler.py", line 38, in fatalError
    raise exception
SAXParseException: <unknown>:1:30: unknown encoding

Mar 15 '06 #5

Fredrik Lundh

Steven Bethard wrote:

> I'm having trouble using elementtree with an XML file that has some
> gbk-encoded text. (I can't read Chinese, so I'm taking their word for
> it that it's gbk-encoded.) I always have trouble with encodings, so I'm
> sure I'm just screwing something simple up. Can anyone help me?

absolutely!

pyexpat has only limited support for non-standard encodings; the core
expat library only supports UTF-8, UTF-16, US-ASCII, and ISO-8859-1,
and the Python glue layer then adds support for all byte-to-byte
encodings supported by Python on top of that.

if you're using any other encoding, you need to recode the file on the
way in (just decoding to Unicode doesn't work, since the parser expects
an encoded byte stream). the approach shown on this page should work

http://effbot.org/zone/celementtree-encoding.htm

except that it uses the new XMLParser interface which isn't available in
ET 1.2.6, and the corresponding XMLTreeBuilder interface in ET doesn't
support the encoding override argument...

the easiest way to fix this is to modify the file header on the way in; if
the file has an <?xml encoding?> header, rip out the header and recode
from that encoding to utf-8 while parsing.

</F>

Mar 15 '06 #6
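
The "rip out the header and recode to utf-8" approach from the post above could look roughly like this. It is a sketch only, assuming Python 3 and the standard library's xml.etree.ElementTree rather than the ET 1.2.6 package used in the thread. The helper name parse_recoded and its fallback_encoding parameter are made up for illustration, and the regular expressions assume the declaration itself is written in ASCII-compatible bytes (true for gbk).

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading XML declaration such as <?xml version="1.0" encoding="gbk"?>
_XML_DECL = re.compile(rb"^<\?xml[^>]*\?>")
_ENCODING = re.compile(rb"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")


def parse_recoded(data, fallback_encoding="gbk"):
    """Parse XML bytes by recoding them to UTF-8 before they reach expat.

    Hypothetical helper: the encoding comes from the XML declaration if one
    is present, otherwise from fallback_encoding.
    """
    encoding = fallback_encoding
    decl = _XML_DECL.match(data)
    if decl:
        found = _ENCODING.search(decl.group(0))
        if found:
            encoding = found.group(1).decode("ascii")
        data = data[decl.end():]  # rip out the old header
    # Decode with Python's codec, then hand expat UTF-8 bytes. With no
    # declaration present, the parser treats the input as UTF-8.
    utf8 = data.decode(encoding).encode("utf-8")
    return ET.fromstring(utf8)


sample = ('<?xml version="1.0" encoding="gbk"?>\n'
          '<DOC><S ID="2566">\u4f0f\u660e\u971e</S></DOC>').encode("gbk")
root = parse_recoded(sample)
print(root.find("S").get("ID"), ascii(root.find("S").text))
```

Note that the sample here quotes the ID attribute. The file shown in the thread writes <S ID=2566> without quotes, which XML does not allow, so that file would probably still be rejected as not well-formed until the attributes are quoted, whatever the encoding.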

Fredrik Lundh

Diez B. Roggisch wrote:

> 2) Your XML is _not_ well-formed, as it doesn't contain an XML header!
> You need to ask these guys to deliver the XML with a header. Of course, for
> now it is OK to just prepend the text with something like <?xml
> version="1.0" encoding="gbk"?>. But I'd still request them to deliver it
> with that header. Otherwise it is _not_ XML, but just something that
> happens to look similar, isn't guaranteed to be well-formed, and thus
> can't safely be fed to a parser.

good advice, but note that an envelope (e.g. an HTTP request or response
body) may override the encoding in the XML file itself. if this arrives in a
MIME message with the proper charset information, it's perfectly okay to
leave out the encoding from the file.

</F>

Mar 15 '06 #7

Diez B. Roggisch

Hi,

> good advice, but note that an envelope (e.g. an HTTP request or response
> body) may override the encoding in the XML file itself. if this arrives
> in a MIME message with the proper charset information, it's perfectly okay
> to leave out the encoding from the file.

It might be practical - still, an XML parser _should_ puke on you, and
certainly some will (elementtree not being one of those, I know :))

So even if it goes over the wire headerless, you should be prepending it
when dealing with the data later.

Regards,

Diez

Mar 15 '06 #8

Diez B. Roggisch

> pyexpat has only limited support for non-standard encodings; the core
> expat library only supports UTF-8, UTF-16, US-ASCII, and ISO-8859-1,
> and the Python glue layer then adds support for all byte-to-byte
> encodings supported by Python on top of that.

Interesting.

Maybe 4suite is more complete? I'll give it a shot.

Diez

Mar 15 '06 #9

Fredrik Lundh

Diez B. Roggisch wrote:

>> good advice, but note that an envelope (e.g. an HTTP request or response
>> body) may override the encoding in the XML file itself. if this arrives
>> in a MIME message with the proper charset information, it's perfectly okay
>> to leave out the encoding from the file.
>
> It might be practical - still, an XML parser _should_ puke on you, and
> certainly some will (elementtree not being one of those, I know :))

no, the parser must not choke on a file for which the encoding has been
overridden.

for example, the HTTP standard allows the transport layer to recode text/*
resources as long as it updates the charset properly, so if you e.g. send an XML
document as text/xml and charset=iso-8859-1, the transport layer can recode
that to charset=utf-8, *without* rewriting the XML header.

</F>

Mar 15 '06 #10
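
The recoding scenario in the last post can be sketched as follows. This assumes Python 3's xml.etree.ElementTree, whose XMLParser takes an encoding argument that overrides the encoding named in the document. The helper parse_with_envelope is hypothetical, and email.message.Message is used here only as a convenient way to pull the charset out of a Content-Type value.

```python
import xml.etree.ElementTree as ET
from email.message import Message


def parse_with_envelope(body, content_type):
    """Parse XML bytes, letting the envelope's charset override the XML header.

    Hypothetical helper; content_type is a header value such as
    'text/xml; charset=utf-8'.
    """
    envelope = Message()
    envelope["Content-Type"] = content_type
    charset = envelope.get_content_charset()  # None if no charset parameter
    # encoding=None leaves the decision to the document's own declaration.
    parser = ET.XMLParser(encoding=charset)
    parser.feed(body)
    return parser.close()


# A document declared as iso-8859-1 that a transport layer recoded to UTF-8
# without rewriting the XML header.
declared_latin1 = '<?xml version="1.0" encoding="iso-8859-1"?><a>caf\u00e9</a>'
recoded_body = declared_latin1.encode("utf-8")
root = parse_with_envelope(recoded_body, "text/xml; charset=utf-8")
print(ascii(root.text))
```

The override only helps for encodings the parser can handle itself; for something like gbk, the limits of pyexpat described earlier in the thread presumably still apply and the bytes would need recoding first.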
