ElementTree cannot parse UTF-8 Unicode?

Hello All,

I am getting an error of not well-formed at the beginning of the Korean
text in the second example. Am I doing something wrong with how I am
encoding my Korean? Do I need more of a wrapper around it than simple
quotes? Is there some sort of XML syntax for indicating a Unicode
string, or does the ElementTree library just not support reading of
Unicode?

here is my test snippet:

from elementtree import ElementTree
vocabXML = ElementTree.parse('test2.xml').getroot()

where I have two data files:

this one works:
<?xml version="1.0" encoding="UTF-8"?>
<Vocab>
<Word L1='Hahha'></Word>
</Vocab>

this one fails:
<?xml version="1.0" encoding="UTF-8"?>
<Vocab>
<Word L1="어녕하세요!"></Word>
</Vocab>

Jul 18 '05 #1
Erik Bethke wrote:
> I am getting an error of not well-formed at the beginning of the Korean
> text in the second example. Am I doing something wrong with how I am
> encoding my Korean? Do I need more of a wrapper around it than simple
> quotes? Is there some sort of XML syntax for indicating a Unicode
> string, or does the ElementTree library just not support reading of
> Unicode?

XML is Unicode, and ElementTree supports all common encodings just
fine (including UTF-8).

> this one fails:
> <?xml version="1.0" encoding="UTF-8"?>
> <Vocab>
> <Word L1="어녕하세요!"></Word>
> </Vocab>


this works just fine on my machine.

what's the exact error message?

what does

print repr(open("test2.xml").read())

print on your machine?

what happens if you attempt to parse

<Vocab>
<Word L1="&#50612;&#45397;&#54616;&#49464;&#50836;!" />
</Vocab>

?

</F>
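
For readers following along today, here is a minimal sketch of the two test documents above, written for Python 3 (where ElementTree lives in `xml.etree`), which is an assumption on my part since the thread predates it. It checks that raw UTF-8 text and numeric character references should parse to the same attribute value. The helper name `l1_value` is made up for illustration.

```python
import xml.etree.ElementTree as ET  # Python 3 location of ElementTree

KOREAN = "\uc5b4\ub155\ud558\uc138\uc694!"  # the L1 value used in the thread

# Variant 1: raw UTF-8 bytes, with an XML declaration that says so.
utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<Vocab>\n<Word L1="' + KOREAN + '"></Word>\n</Vocab>'
).encode("utf-8")

# Variant 2: pure ASCII, each Korean character as a numeric reference.
refs = "".join("&#%d;" % ord(ch) for ch in KOREAN[:-1]) + "!"
ascii_doc = ('<Vocab>\n<Word L1="' + refs + '" />\n</Vocab>').encode("ascii")


def l1_value(data):
    """Parse XML bytes and return the L1 attribute of the first Word."""
    return ET.fromstring(data).find("Word").get("L1")


print(l1_value(utf8_doc) == l1_value(ascii_doc))  # → True
```

If both variants parse on your machine, the parser is fine and the problem is in the bytes of the file on disk.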

Jul 18 '05 #2
Hello Fredrik,

1) The exact error is in line 1160 of self._parser.Parse(data, 0 ):
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 3,
column 16

2) You are right in that the print of the file read works just fine.

3) You are also right in that the digitally encoded unicode also works
fine. However, this solution has two new problems:

1) The xml file is now not human readable
2) After ElementTree gets done parsing it, I am feeding the text to a
wx.TextCtrl via .SetValue() but that is now giving me an error message
of being unable to convert that style of string

So it seems to me, that ElementTree is just not expecting to run into
the Korean characters for it is at column 16 that these begin. Am I
formatting the XML properly?

Thank you,
-Erik
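
One way (not necessarily what happened here, the thread never shows the actual bytes) to get this kind of "not well-formed (invalid token)" error is a file that declares UTF-8 but whose text was saved in a different encoding. The sketch below assumes Python 3 and uses EUC-KR as a stand-in for "some other Korean encoding"; `parse_error` and `build_doc` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

KOREAN = "\uc5b4\ub155\ud558\uc138\uc694!"


def parse_error(data):
    """Return the parser's error message for XML bytes, or None if it parses."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        return str(err)
    return None


def build_doc(codec):
    """Build the test document, encoding the attribute text with `codec`.

    The XML declaration always claims UTF-8, whatever codec is used.
    """
    head = b'<?xml version="1.0" encoding="UTF-8"?>\n<Vocab>\n<Word L1="'
    tail = b'"></Word>\n</Vocab>'
    return head + KOREAN.encode(codec) + tail


good = parse_error(build_doc("utf-8"))   # bytes match the declaration
bad = parse_error(build_doc("euc-kr"))   # bytes contradict the declaration
print(good)
print(bad)
```

The parser only sees bytes, so it reports the first byte sequence that is not valid in the declared encoding, which is where the Korean text starts.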

Jul 18 '05 #3
On Wed, 19 Jan 2005 16:35:23 -0800, Erik Bethke wrote:
> So it seems to me, that ElementTree is just not expecting to run into the
> Korean characters for it is at column 16 that these begin. Am I
> formatting the XML properly?


You should post the file somewhere on the web. (I wouldn't expect Usenet
to transmit it properly.)

(Just jumping in to possibly save you a reply cycle.)

Jul 18 '05 #4
Erik Bethke wrote:
> 2) You are right in that the print of the file read works just fine.

but what does it look like? I saved a raw copy of your original mail,
fixed the quoted-printable encoding, and got an UTF-8 encoded file
that works just fine. the thing you've been parsing, and that you've
cut and pasted into your mail, must be different, in some way.

> 3) You are also right in that the digitally encoded unicode also works
> fine. However, this solution has two new problems:

that was just a test to make sure that your version of elementtree could
handle Unicode characters on your platform.

> 1) The xml file is now not human readable
> 2) After ElementTree gets done parsing it, I am feeding the text to a
> wx.TextCtrl via .SetValue() but that is now giving me an error message
> of being unable to convert that style of string

on my machine, the L1 attribute contains a Unicode string:

>>> print repr(root.find("Word").get("L1"))
u'\uc5b4\ub155\ud558\uc138\uc694!'

what does it give you on your machine? (looks like wxPython cannot handle
Unicode strings, but can that really be true?)

> So it seems to me, that ElementTree is just not expecting to run into
> the Korean characters for it is at column 16 that these begin. Am I
> formatting the XML properly?


nobody knows...

</F>

Jul 18 '05 #5
Hi !
> ...Usenet to transmit it properly


newsgroups (NNTP) : yes, it does it
usenet : perhaps (that depends on the newsgroups)
clp : no

Michel Claveau
Jul 18 '05 #6
Fredrik Lundh, Thursday 20 January 2005 05:17, wrote:
> what does it give you on your machine? (looks like wxPython cannot handle
> Unicode strings, but can that really be true?)


It does support Unicode if it was built to do so...

--
Godoy. <go***@ieee.org>

Jul 18 '05 #7
Jorge Luiz Godoy Filho wrote:
>> what does it give you on your machine? (looks like wxPython cannot handle
>> Unicode strings, but can that really be true?)
>
> It does support Unicode if it was built to do so...

Python has supported Unicode in releases 1.6, 2.0, 2.1, 2.2, 2.3 and 2.4, so
you might think that Unicode should be enabled by default in a UI toolkit for
Python...

</F>

Jul 18 '05 #8
There is something wrong with the physical file... I d/l a trial
version of XML Spy home edition and built an equivalent of the Korean
test file, and tried it and it got past the element tree error and now
I am stuck with the wxEditCtrl error.

To build the xml file in the first place I had code that looked like
this:

d=wxFileDialog( self, message="Choose a file",
        defaultDir=os.getcwd(), defaultFile="", wildcard="*.xml",
        style=wx.SAVE | wxOVERWRITE_PROMPT | wx.CHANGE_DIR)
if d.ShowModal() == wx.ID_OK:
    # This returns a Python list of files that were selected.
    paths = d.GetPaths()
    layout = '<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n'
    L1Word = self.t1.GetValue()
    L2Word = 'undefined'

    layout += '<Vocab>\n'
    layout += ' <Word L1=\'' + L1Word + '\'></Word>\n'
    layout += '</Vocab>'
    open( paths[0], 'w' ).write(layout)
d.Destroy()

So apparently there is something wrong with physically constructing the
file in this manner?

Thank you,
-Erik

Jul 18 '05 #9
Erik Bethke wrote:
> layout += '<Vocab>\n'
> layout += ' <Word L1=\'' + L1Word + '\'></Word>\n'


what does "print repr(L1Word)" print (that is, what does wxPython return?).
it should be a Unicode string, but that would give you an error when you write
it out:
>>> f = open("file.txt", "w")
>>> f.write(u'\uc5b4\ub155\ud558\uc138\uc694!')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

have you hacked the default encoding in site/sitecustomize?

what happens if you replace the L1Word term with L1Word.encode("utf-8")?

can you post the repr() (either of what's in your file or of the thing, whatever
it is, that wxPython returns...)?

</F>
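
A minimal sketch of the `encode("utf-8")` suggestion, written for Python 3 as an assumption (the thread itself is Python 2). The function name `save_vocab` and the temporary path are made up for illustration, and the attribute value is not XML-escaped here, just as in the original snippet.

```python
import os
import tempfile


def save_vocab(path, l1_word):
    """Write the vocab file as UTF-8 bytes; l1_word is a text string."""
    layout = '<?xml version="1.0" encoding="UTF-8"?>\n'
    layout += "<Vocab>\n"
    layout += " <Word L1='" + l1_word + "'></Word>\n"
    layout += "</Vocab>"
    # Encode explicitly and write bytes, so no default codec is involved.
    with open(path, "wb") as f:
        f.write(layout.encode("utf-8"))


path = os.path.join(tempfile.mkdtemp(), "test2.xml")
save_vocab(path, "\uc5b4\ub155\ud558\uc138\uc694!")

# Same diagnostic as earlier in the thread: look at the raw bytes.
with open(path, "rb") as f:
    raw = f.read()
print(repr(raw))
```

Printing the `repr()` of the raw bytes is the quickest way to confirm that what landed on disk matches the encoding named in the XML declaration.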

Jul 18 '05 #10
That was a great clue. I am an idiot and tapped on the wrong download
link... now I can read and parse the xml file fine - as long as I
create it in XML spy - if I create it by this method:

d=wxFileDialog( self, message="Choose a file",
        defaultDir=os.getcwd(), defaultFile="", wildcard="*.xml",
        style=wx.SAVE | wxOVERWRITE_PROMPT | wx.CHANGE_DIR)
if d.ShowModal() == wx.ID_OK:
    # This returns a Python list of files that were selected.
    paths = d.GetPaths()
    layout = '<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n'
    L1Word = self.t1.GetValue()
    L2Word = 'undefined'

    layout += '<Vocab>\n'
    layout += ' <Word L1=\'' + L1Word + '\'></Word>\n'
    layout += '</Vocab>'
    open( paths[0], 'w' ).write(layout)

I get hung up on the write statement, so I am off to look for a
Unicode-capable file write, I think...

-Erik

Jul 18 '05 #11
Woo-hoo! Everything is working now!

Thank you everyone!

The TWO problems I had:

1) I needed to save my XML file in the first place with this code:
f = codecs.open(paths[0], 'w', 'utf8')
2) I needed to download the UNICODE version of wxPython, duh.

So why are there non-UNICODE versions of wxPython??? To save memory or
something???

Thank you all!

Best!
-Erik
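
For anyone landing here with a current Python: `codecs.open(path, 'w', 'utf8')` is what fixed it in this thread, and in Python 3 the built-in `open(path, "w", encoding="utf-8")` does the same job. As an alternative, not what was used above, the sketch below lets ElementTree build and write the file, which also takes care of escaping the attribute value. It assumes Python 3; `save_vocab` and `load_l1` are hypothetical names.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def save_vocab(path, l1_word):
    """Build the document with ElementTree and write it as UTF-8."""
    root = ET.Element("Vocab")
    ET.SubElement(root, "Word", {"L1": l1_word})
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def load_l1(path):
    """Parse the file back and return the L1 attribute of the first Word."""
    return ET.parse(path).getroot().find("Word").get("L1")


path = os.path.join(tempfile.mkdtemp(), "vocab.xml")
# Korean text plus characters that would break hand-built XML.
word = "\uc5b4\ub155\ud558\uc138\uc694! <&'\">"
save_vocab(path, word)
print(load_l1(path) == word)  # → True
```

Building the tree instead of concatenating strings means quotes, `<` and `&` in the text control cannot produce a malformed file.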

Jul 18 '05 #12
Erik Bethke wrote:
> So why are there non-UNICODE versions of wxPython??? To save memory or
> something???


Win95, Win98, WinME have problems with unicode. GTK1 does not support
unicode at all.

--
Jarek Zgoda
http://jpa.berlios.de/ | http://www.zgodowie.org/
Jul 18 '05 #13
Jarek Zgoda wrote:
>> So why are there non-UNICODE versions of wxPython??? To save memory or
>> something???
>
> Win95, Win98, WinME have problems with unicode.


This problem can be solved - on W9x, wxPython would have to
pass all Unicode strings to WideCharToMultiByte, using
CP_ACP, and then pass the result to the API function.

Regards,
Martin
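
A rough Python analogue of what converting through the ANSI code page means, as an illustration only: Python's codecs stand in for the Win32 call here, this is not `WideCharToMultiByte` itself, and `to_ansi` is a made-up helper. It assumes Python 3. The point is that the result depends entirely on which code page the machine uses.

```python
KOREAN = "\uc5b4\ub155\ud558\uc138\uc694!"


def to_ansi(text, code_page):
    """Encode text for a code page, replacing what it cannot represent."""
    return text.encode(code_page, errors="replace")


korean_cp = to_ansi(KOREAN, "cp949")    # Korean Windows code page
western_cp = to_ansi(KOREAN, "cp1252")  # Western European code page

print(korean_cp.decode("cp949") == KOREAN)  # → True
print(western_cp)                           # → b'?????!'
```

On a Korean code page the text survives the round trip; on a Western one every Hangul character is replaced, so the conversion only helps when the text fits the system's code page.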
Jul 18 '05 #14
Martin v. Löwis wrote:
> Jarek Zgoda wrote:
>>> So why are there non-UNICODE versions of wxPython??? To save memory or
>>> something???


Robin Dunn has an explanation here:

http://wiki.wxpython.org/index.cgi/UnicodeBuild

.... which is the first hit from a Google search on
"wxpython unicode build".

Also, from the wxPython downloads page:

"There are two versions of wxPython for each of the supported
Python versions on Win32. They are nearly identical, except one
of them has been compiled with support for the Unicode version of
the platform APIs. If you don't know what that means then you
probably don't need the Unicode version, get the ANSI version
instead. The Unicode version works best on Windows NT/2000/XP. It
will also mostly work on Windows 95/98/Me systems, but it is
based on a Microsoft hack called MSLU (or unicows.dll) that
translates unicode API calls to ansi API calls, but the coverage
of the API is not complete so there are some difficult bugs
lurking in there."

Steve
Jul 18 '05 #15
