Processing XML files in CJK encodings

gs
Python gurus,

I need to parse XML files in CJK encodings like GB2312 and Ja in UTF-8.
I was using xml.dom.minidom first. It works with Ja in UTF-8, but doesn't
work with GB2312. An article says,

http://mail.python.org/pipermail/xml...er/010034.html

Then I tried xml.parsers.xmlproc. It works fine with GB2312, but now it
doesn't work with Ja in UTF-8. Another article says,

http://mail.python.org/pipermail/xml...er/009802.html

Is there any way to parse both of them correctly?

Thanks,
-Gen
Jul 18 '05 #1
gs******@gmail.com (gs) wrote in message news:<6e**************************@posting.google. com>...
> Python gurus,
>
> I need to parse XML files in CJK encodings like GB2312 and Ja in UTF-8.
> I was using xml.dom.minidom first. It works with Ja in UTF-8, but doesn't
> work with GB2312. An article says,
>
> http://mail.python.org/pipermail/xml...er/010034.html
>
> Then I tried xml.parsers.xmlproc. It works fine with GB2312, but now it
> doesn't work with Ja in UTF-8. Another article says,
>
> http://mail.python.org/pipermail/xml...er/009802.html
>
> Is there any way to parse both of them correctly?


You say "doesn't work". Can you be more specific?

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
A hands-on introduction to ISO Schematron -
http://www-106.ibm.com/developerwork...ematron-i.html
Schematron abstract patterns -
http://www.ibm.com/developerworks/xm...y/x-stron.html
Wrestling HTML (using Python) -
http://www.xml.com/pub/a/2004/09/08/pyxml.html
Enterprise data goes high fashion -
http://www.adtmag.com/article.asp?id=10061
Principles of XML design: Considering container elements -
http://www-106.ibm.com/developerwork...x-contain.html
Hacking XML Hacks - http://www-106.ibm.com/developerwork...x-think26.html
A survey of XML standards -
http://www-106.ibm.com/developerwork...rary/x-stand4/
Jul 18 '05 #2
Gen <gs******@gmail.com> wrote:
> I need to parse XML files in CJK encodings like GB2312 and Ja in UTF-8.

I assume you've already got CJKCodecs (or Python 2.4, where it's
built in).
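
A quick way to confirm that the CJK codecs are actually installed is to ask the codec registry for them. This is only a sketch in present-day Python 3 (the snippets in this thread are Python 2), and the helper name available_codecs is made up for illustration:

```python
import codecs


def available_codecs(names):
    """Map each codec name to its canonical name, or None if it is missing."""
    found = {}
    for name in names:
        try:
            # codecs.lookup raises LookupError for an unknown encoding.
            found[name] = codecs.lookup(name).name
        except LookupError:
            found[name] = None
    return found


status = available_codecs(['gb2312', 'shift_jis', 'no-such-codec'])
for name, canonical in status.items():
    print(name, 'available' if canonical else 'MISSING')
```

If a codec you need shows up as missing, neither recoding nor a pure-Python parser will be able to decode that input.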

The main problem is that the expat parser (on which much Python XML
kit relies) doesn't understand the DBCS encodings. There are two ways
around this: either use an initial recoding step:

    from xml.dom import minidom
    # 'bytes' holds the raw GB2312-encoded document (a Python 2 str)
    xml = unicode(bytes, 'gb2312').encode('utf-8')
    doc = minidom.parseString(xml)

(If your input documents have an <?xml ... encoding="gb2312" ?>
declaration this will also have to be changed to encoding="utf-8" or
simply removed.)
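
The recoding step plus the declaration fix can be folded into one helper, so GB2312 and UTF-8 files go through the same path. What follows is a rough sketch in Python 3 rather than the Python 2 of this thread. The names declared_encoding and parse_recoded are hypothetical, and it assumes the input encoding is ASCII-compatible enough for the XML declaration to be read as plain bytes (true for GB2312, Shift-JIS and UTF-8, not for UTF-16):

```python
import re
from xml.dom import minidom

# Encoding pseudo-attribute inside a leading XML declaration (bytes pattern).
_DECL = re.compile(
    rb'^(<\?xml[^>]*?\bencoding\s*=\s*)(["\'])([A-Za-z][A-Za-z0-9._-]*)\2'
)


def declared_encoding(raw, default='utf-8'):
    """Return the encoding named in the XML declaration, or the default."""
    match = _DECL.match(raw)
    return match.group(3).decode('ascii') if match else default


def parse_recoded(raw):
    """Recode raw XML bytes to UTF-8, rewrite the declaration, and parse."""
    text = raw.decode(declared_encoding(raw))
    utf8 = text.encode('utf-8')
    # The declaration must now say utf-8, or the parser would be misled.
    utf8 = _DECL.sub(rb'\g<1>\g<2>utf-8\g<2>', utf8, count=1)
    return minidom.parseString(utf8)


gb_doc = '<?xml version="1.0" encoding="gb2312"?><doc>\u4e2d\u6587</doc>'
doc = parse_recoded(gb_doc.encode('gb2312'))
print(doc.documentElement.firstChild.data)
```

A document with no encoding declaration is treated as UTF-8 here, which matches the XML default when there is no byte order mark.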

OR, use a pure-Python XML parser, so it'll have access to CJKCodecs.
That means xmlproc+4DOM (validating) or pxdom (non-validating). This
is, in comparison to the recoding method, rather slow.

[Aside: have just released pxdom 1.2:

http://www.doxdesk.com/software/py/pxdom.html

I've processed a bunch of Shift-JIS material with this before without
problem.]
> Then I tried xml.parsers.xmlproc. It works fine with GB2312, but now it
> doesn't work with Ja in UTF-8.


Ohh. That's a bad one. Actually I'm surprised if it works with GB.

Here's a quick fix; I can't guarantee it's correct as I haven't really
played with xmlproc much but it fixes the error for me when parsing
strings. Oh, checking this out at the SourceForge tracker it looks
like the original reporter came up with the same idea, so it might be
okay. :-)

Near the end of method parse_xml_decl (in PyXML 0.8.3 this is at line
723) in _xmlplus.parsers.xmlproc.xmlutils:

    try:
        self.data = self.charset_converter(self.data)
        self.datasize = len(self.data)  ### ADD THIS LINE
    except UnicodeError, e:
        self._handle_decoding_error(self.data, e)
    self.input_encoding = enc1
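
One way to see why the added line could matter: decoding multi-byte input changes the length of the buffer, so a size recorded before the conversion no longer matches the data. The toy class below is purely illustrative, written in Python 3, and is not xmlproc's real code. ToyBuffer and refresh_size are made-up names, and whether xmlproc's failure follows exactly this pattern is an assumption:

```python
class ToyBuffer:
    """Hypothetical stand-in for a parser buffer; not xmlproc's real class."""

    def __init__(self, raw):
        self.data = raw
        self.datasize = len(raw)  # length in bytes at this point

    def convert(self, encoding, refresh_size=True):
        # Decoding turns each multi-byte sequence into a single character,
        # so the length of self.data usually shrinks.
        self.data = self.data.decode(encoding)
        if refresh_size:
            self.datasize = len(self.data)


raw = '\u65e5\u672c\u8a9e'.encode('utf-8')  # three characters, nine bytes

patched = ToyBuffer(raw)
patched.convert('utf-8')

unpatched = ToyBuffer(raw)
unpatched.convert('utf-8', refresh_size=False)

print(patched.datasize, len(patched.data))
print(unpatched.datasize, len(unpatched.data))
```

In the unpatched case the recorded size still counts bytes while the data now holds characters, so any code that trusts datasize would run past the end of the buffer.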

--
Andrew Clover
mailto:an*@doxdesk.com
http://www.doxdesk.com/
Jul 18 '05 #3
