Bytes | Software Development & Data Engineering Community
problem parsing utf-8 encoded xml - minidom

Hi,
I am trying to parse an xml file using the minidom parser.

<code>
from xml.dom import minidom
xmlfilename = "sample.xml"
xmldoc = minidom.parse(xmlfilename)
</code>

The parser is failing on this line:

<mrcb245-c>Heinrich Kčufner, Norbert Nedopil, Heinz Schčoch (Hrsg.).</
mrcb245-c>

This is the error message I get:

Traceback (most recent call last):
  File "readXML.py", line 11, in <module>
    xmldoc = minidom.parse(xmlfilename)
  File "C:\Python25\lib\xml\dom\minidom.py", line 1913, in parse
    return expatbuilder.parse(file)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2254, column 21

It seems to me that it is having an issue with the 'č' character. I
have even tried the following to make sure it recognises the file as
a utf-8 file:

<code>
from xml.dom import minidom
import codecs

xmlfilename = "sample.xml"
xmlfile = codecs.open(xmlfilename, "r", "utf-8")
xmlstring = xmlfile.read()
xmldoc = minidom.parse(xmlfilename)
</code>

However, this doesn't work either and I get the following error
message:

Traceback (most recent call last):
  File "readXML.py", line 9, in <module>
    xmlstring = xmlfile.read()
  File "C:\Python25\lib\codecs.py", line 618, in read
    return self.reader.read(size)
  File "C:\Python25\lib\codecs.py", line 424, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 69343-69345: invalid data

I'm assuming here that it is failing at the same place...
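
One way to check where decoding fails is to read the file as raw bytes and look at the exception itself. The sketch below assumes Python 3 (not the Python 2.5 used above); the helper name `find_bad_bytes` and the sample bytes are hypothetical and only stand in for the real file contents.

```python
def find_bad_bytes(data, encoding="utf-8", context=10):
    """Return (start, end, nearby_bytes) for the first undecodable span, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start and err.end are byte offsets of the span the codec rejected.
        lo = max(err.start - context, 0)
        return err.start, err.end, data[lo:err.end + context]
    return None


# Hypothetical sample: Latin-1 bytes that are not valid UTF-8.
sample = "<a>Sch\u00f6ch</a>".encode("iso-8859-1")
result = find_bad_bytes(sample)
print(result)
```

For a real file, the bytes would come from `open(xmlfilename, "rb").read()`, and the reported offsets could be compared with the line and column in the expat error.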

Can someone please point me in the right direction?
Thanks,
Ashmir
Jul 4 '08 #1
> The parser is failing on this line:
>
> <mrcb245-c>Heinrich Kčufner, Norbert Nedopil, Heinz Schčoch (Hrsg.).</
> mrcb245-c>

If it is literally this line, it's no surprise: there must not be a line
break between the slash and the closing element name.

However, since you are getting the error in a different column, it's
indeed more likely that there is a problem with the encoding.

Given that the Python UTF-8 codec refuses the data, most likely, the
data is *not* encoded in UTF-8 (but perhaps in Latin-1). If so, you
need to prefix the XML document with a proper XML declaration, such
as

<?xml version="1.0" encoding="iso-8859-1"?>

Alternatively, make sure that the file is really encoded in UTF-8.
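
Both points can be checked in a few lines. This is a minimal sketch assuming Python 3 (the traceback above is from Python 2.5); the sample document and the helper names `is_valid_utf8` and `parses` are made up for illustration and are not part of the original script.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample: a small document whose bytes are Latin-1, not UTF-8.
body = "<root><name>Sch\u00f6ch</name></root>".encode("iso-8859-1")


def is_valid_utf8(data):
    """True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def parses(data):
    """True if minidom can parse the byte string."""
    try:
        minidom.parseString(data)
        return True
    except ExpatError:
        return False


# Without a declaration the parser assumes UTF-8 and rejects the Latin-1 byte.
without_decl = body
# With a matching declaration the same bytes parse.
with_decl = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body

print(is_valid_utf8(body), parses(without_decl), parses(with_decl))
```

The declaration only helps if it names the encoding the bytes are really in; Latin-1 here is the guess from above, not something the sketch can confirm for the actual file.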

Regards,
Martin
Jul 4 '08 #2
There is no line break in the xml file. It was just a formatting issue
on this forum.

However, you were right about the encoding not being
utf-8. The xml file is autogenerated by a different script so that's
probably where it is going wrong.
The parser works fine if I change the first line to
<?xml version="1.0" encoding="iso-8859-1"?>
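
If the generating script can be changed instead, the declaration and the bytes can be kept in agreement at the source. The following is only a sketch, assuming Python 3 and assuming the generator builds its output with minidom, which the thread does not say; the element content and the temporary file path are hypothetical.

```python
import os
import tempfile
from xml.dom import minidom

# Build a tiny document with a non-ASCII character in a text node.
doc = minidom.Document()
root = doc.createElement("record")
doc.appendChild(root)
field = doc.createElement("mrcb245-c")
field.appendChild(doc.createTextNode("Heinz Sch\u00f6ch (Hrsg.)."))
root.appendChild(field)

# With an explicit encoding, toxml() returns bytes in that encoding and
# emits a matching XML declaration.
xml_bytes = doc.toxml(encoding="utf-8")

# Write the bytes unchanged (binary mode), so no second encoding step applies.
path = os.path.join(tempfile.mkdtemp(), "sample.xml")
with open(path, "wb") as f:
    f.write(xml_bytes)

# Round trip: the file now parses without editing its first line.
reparsed = minidom.parse(path)
text = reparsed.getElementsByTagName("mrcb245-c")[0].firstChild.data
print(text)
```

Writing in binary mode is the design choice that matters: the serializer decides the encoding once, and nothing between it and the disk re-encodes the text.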

Thank you very much
Jul 4 '08 #3
