Does python's minidom support Chinese?

The following 4 lines of code parse an XML document
very well if the XML document contains only English
words.

But when I insert one Chinese character into the XML
document, then Python starts to complain when it hits
the Chinese character, saying that it is an invalid
token and thus it is not well-formed.

This is the complaint of Python:

ExpatError: not well-formed (invalid token): line 3, column 7

Line 3, column 7 is exactly where the first Chinese
character sits in the XML document.

The problem remains even if I try encoding="UTF-16" or
encoding="GB2312" or encoding="GBK" in the xml
document.

Note that GB2312 and GBK are Chinese encodings.

Please give a hint. Thanks a lot!

The 4 lines of code I used are here:

# -*- coding: cp936 -*-
from xml.dom import minidom
xmldoc = minidom.parse('test.xml')
print(xmldoc.toxml())

Jul 18 '05 #1
Anthony Liu <an***********@yahoo.com> wrote in message news:<ma**************************************@python.org>...
> The following 4 lines of code parses an XML document
> very well if the XML document contains only English
> words.
>
> But when I insert one Chinese character into the XML
> document, then Python starts to complain when it hits
> the Chinese character, saying that it is an invalid
> token and thus it is not well-formed.
>
> This is the complaint of Python:
>
> ExpatError: not well-formed (invalid token): line 3, column 7
>
> line 3 and column 7 exactly pinpoints the 1st Chinese
> character in the XML document.

This is an XML problem on your end, not a minidom problem. That error
probably means that you are either omitting the XML declaration (and
thus defaulting to UTF-8 or UTF-16) or declaring a bogus encoding.
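
To make the encoding point concrete, here is a minimal Python 3 sketch (not the original poster's code). It assumes a tiny in-memory document; the element name doc and all variable names are made up for illustration. The first case declares UTF-8 and really is UTF-8. The second carries GBK bytes with no declaration, so the parser falls back to UTF-8 and should reject the first Chinese byte.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Illustrative document body: two Chinese characters written as escapes
body = "<doc>\u4e2d\u6587</doc>"

# Case 1: the declaration says UTF-8 and the bytes really are UTF-8
good = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")
parsed_text = minidom.parseString(good).documentElement.firstChild.data

# Case 2: no declaration, so the parser assumes UTF-8,
# but the bytes are actually GBK
bad = body.encode("gbk")
try:
    minidom.parseString(bad)
    mismatch_error = None
except ExpatError as exc:
    mismatch_error = exc

print(parsed_text == "\u4e2d\u6587")
print(mismatch_error)
```

The second case fails for the same reason as in the question: the bytes on disk do not match the encoding the parser was told (or left) to assume.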

> The problem remains even if I try encoding="UTF-16" or
> encoding="GB2312" or encoding="GBK" in the xml
> document.


Well, you can't just go shopping about for an encoding. You have to find
out which encoding the file is actually saved in and declare it accordingly.

Back to minidom: even after you fix your XML problems you may still
have trouble with minidom because the expat reader has to understand
the encoding you're using. I think that it may use the Python codecs
model to find the encoding you declared, so you may just need to
install a Python Chinese codecs package, and you'll be all set. I'm
not entirely sure this is the case, though.
--Uche
http://uche.ogbuji.net
Jul 18 '05 #2
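
If the parser in a given setup turns out not to understand the declared encoding, one way around the question is to decode the bytes in Python first and hand the parser plain UTF-8. The following is only a sketch under stated assumptions: it targets Python 3, the helper name parse_with_known_encoding is hypothetical, and the caller is assumed to already know the file's real encoding. Whether expat accepts a multi-byte encoding such as GBK directly is not checked here.

```python
import re
from xml.dom import minidom

# Matches a leading XML declaration such as <?xml version="1.0" encoding="GBK"?>
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>')

def parse_with_known_encoding(raw_bytes, encoding):
    """Hypothetical helper: decode with a codec the caller knows is right,
    then give the parser UTF-8 bytes with no conflicting declaration."""
    text = raw_bytes.decode(encoding)
    # Drop the original declaration so it cannot contradict the new bytes;
    # without a declaration the parser assumes UTF-8.
    text = _XML_DECL.sub('', text, count=1)
    return minidom.parseString(text.encode('utf-8'))

# Illustrative input: a GBK-encoded document that declares GBK
source = '<?xml version="1.0" encoding="GBK"?>\n<doc>\u4e2d\u6587</doc>'
gbk_bytes = source.encode('gbk')

doc = parse_with_known_encoding(gbk_bytes, 'gbk')
recovered = doc.documentElement.firstChild.data
print(recovered == '\u4e2d\u6587')
```

For a file on disk, the same helper could be fed the result of open('test.xml', 'rb').read(). The trade-off is that the encoding is supplied by hand rather than read from the declaration, so a wrong guess raises a decode error or yields wrong characters instead of a parser error.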