Ignoring Chinese characters when parsing an XML file

Hi,
I am parsing an XML file that includes Chinese characters, like
評評啖啖才是眞.細氺長锍才是愛 or ヘアアイロン... The problem is that I get an error like:
UnicodeEncodeError: 'charmap' codec can't encode characters in position....
I would like to ignore these characters and parse everything else.
Could anyone help me? I suppose I can catch the exception and ignore it,
or maybe use a function that detects these Chinese characters so that I
can skip them.

Thanks!!
Fabian
Oct 22 '07 #1
On Mon, 22 Oct 2007 21:24:40 +0200, Fabian López wrote:
I am parsing an XML file that includes chineses characters, like
評評啖啖才是眞.細氺長锍才是愛 or ヘアアイロン... The problem is that I get an error like:
UnicodeEncodeerror:'charmap' codec can't encode characters in
position..

You say you are *parsing* the file but this is an *encode* error. Parsing
means *decoding*.

You have to show some code and the actual traceback to get help. Crystal
balls are not that reliable. ;-)

Ciao,
Marc 'BlackJack' Rintsch
Oct 22 '07 #2
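
To illustrate Marc's distinction between decoding and encoding, here is a minimal sketch. It is written for Python 3 (the thread itself uses Python 2), and the helper name try_encode and the choice of cp1252 as a stand-in for a Windows terminal code page are assumptions made for illustration.

```python
# Sample text taken from the question; any non-Latin text would do.
text = "ヘアアイロン"

# Decoding turns bytes into text. This is what an XML parser does on input.
raw = text.encode("utf-8")
decoded = raw.decode("utf-8")


def try_encode(s, codec):
    """Encode s with codec; return the bytes, or the error if it fails."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError as exc:
        return exc


# Encoding turns text into bytes. This is what printing does on output.
# cp1252 (assumed here as a typical Windows code page) has no katakana,
# so the encode step fails even though the decode step succeeded.
result = try_encode(text, "cp1252")
print(type(result).__name__)
```

The parse step is fine in this sketch; only the conversion back to bytes for output fails, which matches the kind of error reported in the question.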
Fabian López wrote:
Thanks Marc, the code is like this. The attrib name is the problem:

    from lxml import etree

    context = etree.iterparse("file.xml")
    for action, elem in context:
        if elem.tag == "weblog":
            # Python 2 print statement; the trailing comma suppresses the newline
            print action, elem.tag, elem.attrib["name"], elem.attrib["url"],

The problem is the print statement. It looks like your terminal encoding (which
Python has to encode the unicode string to) can't handle these unicode
characters.

Stefan
Oct 23 '07 #3
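
Since the failure happens at output time, one way to get the behaviour Fabian asked for (skipping the characters the terminal cannot show) is to encode with an error handler before printing. This is a minimal Python 3 sketch; the helper name safe_for_terminal, the sample string, and the ASCII fallback are assumptions, not part of lxml or of the original code.

```python
import sys


def safe_for_terminal(text, encoding=None, errors="replace"):
    """Return text with characters the target encoding cannot represent
    replaced by '?' (errors="replace") or dropped (errors="ignore")."""
    # Fall back to ASCII if stdout reports no encoding (e.g. some pipes).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors).decode(encoding)


# Hypothetical attribute value containing one CJK character (U+8A55).
name = "blog \u8a55 name"

# cp1252 is passed explicitly here so the result does not depend on the
# terminal this sketch happens to run in.
print(safe_for_terminal(name, "cp1252", "ignore"))
print(safe_for_terminal(name, "cp1252", "replace"))
```

In the loop from the question, the same helper could be applied to elem.attrib["name"] before it is printed.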
On 10/23/07, Stefan Behnel <st******************@web.de> wrote:
Fabian López wrote:
Thanks Mark, the code is like this. The attrib name is the problem:

    from lxml import etree

    context = etree.iterparse("file.xml")
    for action, elem in context:
        if elem.tag == "weblog":
            print action, elem.tag, elem.attrib["name"], elem.attrib["url"],

The problem is the print statement. Looks like your terminal encoding (that
Python needs to encode the unicode string to) can't handle these unicode
characters.

I agree. For Japanese, you need to know the exact name of the encoding and
encode the text with it, like this (where 'encoding' stands for the name of
the encoding you need):

    print text.encode('encoding')

--
I like python!
UliPad <<The Python Editor>>: http://code.google.com/p/ulipad/
meide <<wxPython UI module>>: http://code.google.com/p/meide/
My Blog: http://www.donews.net/limodou
Oct 23 '07 #4
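
limodou's suggestion can be sketched as follows: pick an encoding that can actually represent the text and encode explicitly before writing. This is Python 3; the helper name write_encoded, the in-memory BytesIO targets, and the choice of Shift_JIS and UTF-8 are assumptions for illustration.

```python
import io


def write_encoded(stream, text, encoding):
    """Encode text explicitly and write the bytes to a binary stream."""
    stream.write(text.encode(encoding))


japanese = "ヘアアイロン"

# Shift_JIS covers katakana, so this succeeds for the Japanese sample.
sjis_out = io.BytesIO()
write_encoded(sjis_out, japanese, "shift_jis")

# UTF-8 can represent any Unicode text, including the Chinese sample.
utf8_out = io.BytesIO()
write_encoded(utf8_out, "評評啖啖才是眞", "utf-8")
```

Writing bytes in a known encoding to a file or binary stream avoids depending on whatever encoding the terminal happens to use.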
