Processing XML files in CJK encodings

gs
Python gurus,

I need to parse XML files in CJK encodings, such as Chinese in GB2312 and
Japanese in UTF-8. I first tried xml.dom.minidom. It works with Japanese in
UTF-8, but doesn't work with GB2312. An article says,

http://mail.python.org/pipermail/xml...er/010034.html

Then I tried xml.parsers.xmlproc. It works fine with GB2312, but now it
doesn't work with Japanese in UTF-8. Another article says,

http://mail.python.org/pipermail/xml...er/009802.html

Is there any way to parse both of them correctly?

Thanks,
-Gen
Jul 18 '05 #1
gs******@gmail.com (gs) wrote in message news:<6e**************************@posting.google.com>...
> Python gurus,
>
> I need to parse XML files in CJK encodings, such as Chinese in GB2312 and
> Japanese in UTF-8. I first tried xml.dom.minidom. It works with Japanese in
> UTF-8, but doesn't work with GB2312. An article says,
>
> http://mail.python.org/pipermail/xml...er/010034.html
>
> Then I tried xml.parsers.xmlproc. It works fine with GB2312, but now it
> doesn't work with Japanese in UTF-8. Another article says,
>
> http://mail.python.org/pipermail/xml...er/009802.html
>
> Is there any way to parse both of them correctly?

You say "doesn't work". Can you be more specific?

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Jul 18 '05 #2
Gen <gs******@gmail.com> wrote:
> I need to parse XML files in CJK encodings like GB2312 and Ja in UTF-8.

I assume you've already got CJKCodecs (or Python 2.4, where it's
built in).

The main problem is that the expat parser (on which much Python XML
kit relies) doesn't understand the DBCS encodings. There are two ways
around this: either use an initial recoding step:

    from xml.dom import minidom

    xml = unicode(bytes, 'gb2312').encode('utf-8')  # Python 2: decode GB2312, re-encode as UTF-8
    doc = minidom.parseString(xml)

(If your input documents have an <?xml ... encoding="gb2312" ?>
declaration this will also have to be changed to encoding="utf-8" or
simply removed.)
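For anyone reading this on Python 3, where unicode() no longer exists, here is a minimal sketch of the same recoding step, including the declaration rewrite described above. The helper name parse_recoded and the regular expression are my own illustration, not part of minidom, and the sketch assumes the XML declaration, if present, sits at the very start of the document.

```python
import re
from xml.dom import minidom

# Hypothetical helper pattern: matches the encoding pseudo-attribute
# inside an XML declaration at the very start of the text.
_DECL_ENCODING = re.compile(
    r'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)


def parse_recoded(raw: bytes, source_encoding: str):
    """Decode raw bytes, relabel the declaration as UTF-8, then parse."""
    text = raw.decode(source_encoding)
    # Rewrite encoding="gb2312" (or similar) so it matches the new bytes.
    text = _DECL_ENCODING.sub(r'\g<1>\g<2>utf-8\g<2>', text, count=1)
    return minidom.parseString(text.encode("utf-8"))


# Sample document built in memory so the example is self-contained.
sample = '<?xml version="1.0" encoding="gb2312"?><doc>中文</doc>'.encode("gb2312")
doc = parse_recoded(sample, "gb2312")
print(doc.documentElement.firstChild.data)
```

The same function can be called with "utf-8" as the source encoding, so both kinds of input go through one code path.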

OR, use a pure-Python XML parser, so it'll have access to CJKCodecs.
That means xmlproc+4DOM (validating) or pxdom (non-validating). This
is, in comparison to the recoding method, rather slow.

[Aside: have just released pxdom 1.2:

http://www.doxdesk.com/software/py/pxdom.html

I've processed a bunch of Shift-JIS material with this before without
problem.]

> Then I tried xml.parsers.xmlproc. It works fine with GB2312, but now it
> doesn't work with Ja in UTF-8.


Ohh. That's a bad one. Actually I'm surprised if it works with GB.

Here's a quick fix; I can't guarantee it's correct as I haven't really
played with xmlproc much but it fixes the error for me when parsing
strings. Oh, checking this out at the SourceForge tracker it looks
like the original reporter came up with the same idea, so it might be
okay. :-)

Near the end of method parse_xml_decl (in PyXML 0.8.3 this is at line
723) in _xmlplus.parsers.xmlproc.xmlutils:

    try:
        self.data = self.charset_converter(self.data)
        self.datasize = len(self.data)  ### ADD THIS LINE
    except UnicodeError, e:
        self._handle_decoding_error(self.data, e)
    self.input_encoding = enc1
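
A guess at why the added line matters, offered as an assumption rather than a traced diagnosis of xmlproc: decoding multi-byte input changes the buffer's length, so a size cached before decoding no longer matches the data. The toy class below is hypothetical and only borrows the attribute names data and datasize from the patch; it is written for Python 3.

```python
class ToyBuffer:
    """Hypothetical stand-in for a parser that caches its buffer length."""

    def __init__(self, data: bytes):
        self.data = data
        self.datasize = len(data)  # size measured in bytes

    def decode(self, encoding: str, refresh_size: bool) -> None:
        # After decoding, self.data holds characters, not bytes.
        self.data = self.data.decode(encoding)
        if refresh_size:
            self.datasize = len(self.data)  # the equivalent of the added line


raw = "<a>日本語</a>".encode("utf-8")

stale = ToyBuffer(raw)
stale.decode("utf-8", refresh_size=False)

fresh = ToyBuffer(raw)
fresh.decode("utf-8", refresh_size=True)

# The stale size still counts UTF-8 bytes, so it overshoots the decoded text.
print(stale.datasize, len(stale.data))
print(fresh.datasize, len(fresh.data))
```

ASCII-only input would not show the mismatch, since byte and character counts agree there.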

--
Andrew Clover
mailto:an*@doxdesk.com
http://www.doxdesk.com/
Jul 18 '05 #3
