problem parsing utf-8 encoded xml - minidom

Hi,
I am trying to parse an xml file using the minidom parser.

<code>
from xml.dom import minidom
xmlfilename = "sample.xml"
xmldoc = minidom.parse(xmlfilename)
</code>

The parser is failing on this line:

<mrcb245-c>Heinrich Kčufner, Norbert Nedopil, Heinz Schčoch (Hrsg.).</
mrcb245-c>

This is the error message I get:

Traceback (most recent call last):
  File "readXML.py", line 11, in <module>
    xmldoc = minidom.parse(xmlfilename)
  File "C:\Python25\lib\xml\dom\minidom.py", line 1913, in parse
    return expatbuilder.parse(file)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2254, column 21

It seems to me that it is having an issue with the 'č' character. I
have even tried the following to make sure it recognises the file as
utf-8 file:

<code>
from xml.dom import minidom
import codecs

xmlfilename = "sample.xml"
xmlfile = codecs.open(xmlfilename,"r","utf-8")
xmlstring = xmlfile.read()
xmldoc = minidom.parse(xmlfilename)
</code>

However, this doesn't work either and I get the following error
message:

Traceback (most recent call last):
  File "readXML.py", line 9, in <module>
    xmlstring = xmlfile.read()
  File "C:\Python25\lib\codecs.py", line 618, in read
    return self.reader.read(size)
  File "C:\Python25\lib\codecs.py", line 424, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 69343-69345: invalid data

I'm assuming here that it is failing at the same place...

Can someone please point me in the right direction?
Thanks,
Ashmir
Jul 4 '08 #1
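One way to check whether the two errors really point at the same place is to decode the raw bytes and convert the failing byte offset into a line and column. The following is only a sketch: it is written for Python 3 (the post uses Python 2.5), the helper name `find_invalid_utf8` and the sample bytes are made up for illustration, and the column is a zero-based byte offset that may not match expat's column exactly.

```python
def find_invalid_utf8(data: bytes):
    """Return (line, column, offending bytes) for the first invalid
    UTF-8 sequence in data, or None if everything decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Line number: newlines before the failing byte, plus one.
        line = data.count(b"\n", 0, err.start) + 1
        # Column: byte offset from the start of that line (zero-based).
        column = err.start - (data.rfind(b"\n", 0, err.start) + 1)
        return line, column, data[err.start:err.end]
    return None


# Hypothetical sample: 0xE4 is "ä" in Latin-1 but not valid UTF-8 here.
sample = b"<a>\n<b>K\xe4ufner</b>\n</a>"
print(find_invalid_utf8(sample))
```

For a real file, the bytes could be read with `open("sample.xml", "rb").read()` and passed to the helper. The offending bytes it returns are a hint about which encoding the file actually uses.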
> The parser is failing on this line:
>
> <mrcb245-c>Heinrich Kčufner, Norbert Nedopil, Heinz Schčoch (Hrsg.).</
> mrcb245-c>

If it is literally this line, it's no surprise: there must not be a line
break between the slash and the closing element name.

However, since you are getting the error in a different column, it's
indeed more likely that there is a problem with the encoding.

Given that the Python UTF-8 codec refuses the data, most likely, the
data is *not* encoded in UTF-8 (but perhaps in Latin-1). If so, you
need to prefix the XML document with a proper XML declaration, such
as

<?xml version="1.0" encoding="iso-8859-1"?>

Alternatively, make sure that the file is really encoded in UTF-8.

Regards,
Martin
Jul 4 '08 #2
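The effect of the XML declaration described above can be reproduced with a small experiment. This is a sketch under assumptions: it uses Python 3 rather than the Python 2.5 of the original post, and the one-element document is a made-up sample, not the poster's file. It relies only on `minidom.parseString` accepting bytes and on the parser assuming UTF-8 when no encoding is declared.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample: "ä" stored as the single Latin-1 byte 0xE4.
body = "<name>Käufner</name>".encode("iso-8859-1")

# Without a declaration the parser assumes UTF-8 and rejects the byte.
try:
    minidom.parseString(body)
    parsed_without_declaration = True
except ExpatError:
    parsed_without_declaration = False

# With a declaration naming the real encoding, the same bytes parse.
declaration = b'<?xml version="1.0" encoding="iso-8859-1"?>\n'
doc = minidom.parseString(declaration + body)
text = doc.documentElement.firstChild.data

print(parsed_without_declaration, text)
```

The declaration only helps if it names the encoding the file is actually stored in. Declaring iso-8859-1 for a file in some other single-byte encoding would parse without error but could yield the wrong characters.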
On Jul 4, 2:36 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> > The parser is failing on this line:
> > <mrcb245-c>Heinrich Kčufner, Norbert Nedopil, Heinz Schčoch (Hrsg.).</
> > mrcb245-c>
>
> If it is literally this line, it's no surprise: there must not be a line
> break between the slash and the closing element name.
>
> However, since you are getting the error in a different column, it's
> indeed more likely that there is a problem with the encoding.
>
> Given that the Python UTF-8 codec refuses the data, most likely, the
> data is *not* encoded in UTF-8 (but perhaps in Latin-1). If so, you
> need to prefix the XML document with a proper XML declaration, such
> as
>
> <?xml version="1.0" encoding="iso-8859-1"?>
>
> Alternatively, make sure that the file is really encoded in UTF-8.
>
> Regards,
> Martin

There is no line break in the xml file. It was just a formatting issue
on this forum.

However, you were right about the encoding not being
utf-8. The xml file is autogenerated by a different script so that's
probably where it is going wrong.
The parser works fine if I change the first line to
<?xml version="1.0" encoding="iso-8859-1"?>

Thank you very much
Jul 4 '08 #3
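If changing the generating script is not an option, the other route mentioned in the thread is to make the file really UTF-8. A minimal sketch follows, assuming Python 3 and assuming the data is in fact Latin-1 (the thread only says "perhaps"). The helper name `latin1_xml_to_utf8` is hypothetical. Note that decoding as Latin-1 never fails, so this cannot confirm that the guess about the encoding is right.

```python
import re
from xml.dom import minidom


def latin1_xml_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 XML bytes as UTF-8 with a matching declaration."""
    text = data.decode("iso-8859-1")
    # Drop any existing XML declaration so the encodings cannot disagree.
    text = re.sub(r"^<\?xml[^>]*\?>\s*", "", text)
    return ('<?xml version="1.0" encoding="utf-8"?>\n' + text).encode("utf-8")


# Hypothetical sample standing in for the autogenerated file's bytes.
original = (
    '<?xml version="1.0" encoding="iso-8859-1"?>\n<name>Schöch</name>'
).encode("iso-8859-1")

converted = latin1_xml_to_utf8(original)
doc = minidom.parseString(converted)
print(doc.documentElement.firstChild.data)
```

The longer-term fix is still the one identified above: have the script that generates the file write UTF-8, or declare the encoding it really uses.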
