problem parsing utf-8 encoded xml - minidom

Hi,
I am trying to parse an xml file using the minidom parser.

<code>
from xml.dom import minidom
xmlfilename = "sample.xml"
xmldoc = minidom.parse(xmlfilename)
</code>

The parser is failing on this line:

<mrcb245-c>Heinrich Kčufner, Norbert Nedopil, Heinz Schčoch (Hrsg.).</
mrcb245-c>

This is the error message I get:

Traceback (most recent call last):
  File "readXML.py", line 11, in <module>
    xmldoc = minidom.parse(xmlfilename)
  File "C:\Python25\lib\xml\dom\minidom.py", line 1913, in parse
    return expatbuilder.parse(file)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2254, column 21

It seems to me that it is having an issue with the 'č' character. I
have even tried the following to make sure it recognises the file as
a utf-8 file:

<code>
from xml.dom import minidom
import codecs

xmlfilename = "sample.xml"
xmlfile = codecs.open(xmlfilename, "r", "utf-8")
xmlstring = xmlfile.read()
xmldoc = minidom.parse(xmlfilename)
</code>

However, this doesn't work either and I get the following error
message:

Traceback (most recent call last):
  File "readXML.py", line 9, in <module>
    xmlstring = xmlfile.read()
  File "C:\Python25\lib\codecs.py", line 618, in read
    return self.reader.read(size)
  File "C:\Python25\lib\codecs.py", line 424, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 69343-69345: invalid data

I'm assuming here that it is failing at the same place...

Can someone please point me in the right direction?
Thanks,
Ashmir
Jul 4 '08 #1
> The parser is failing on this line:
>
> <mrcb245-c>Heinrich Kčufner, Norbert Nedopil, Heinz Schčoch (Hrsg.).</
> mrcb245-c>

If it is literally this line, it's no surprise: there must not be a line
break between the slash and the closing element name.

However, since you are getting the error in a different column, it's
indeed more likely that there is a problem with the encoding.

Given that the Python UTF-8 codec refuses the data, most likely, the
data is *not* encoded in UTF-8 (but perhaps in Latin-1). If so, you
need to prefix the XML document with a proper XML declaration, such
as

<?xml version="1.0" encoding="iso-8859-1"?>

Alternatively, make sure that the file is really encoded in UTF-8.
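
One way to check whether a file really is UTF-8 is to decode its raw bytes and see where decoding first fails. The following is only a minimal sketch, assuming Python 3; the helper name `find_invalid_utf8` and the sample string are made up for illustration and are not taken from the file in question.

```python
def find_invalid_utf8(data: bytes):
    """Return (start, end) of the first undecodable span, or None if data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, exc.end
    return None


# Hypothetical sample: "Küfner" encoded as Latin-1, so the ü is the single byte 0xFC.
sample = "<a>K\u00fcfner</a>".encode("iso-8859-1")

bad = find_invalid_utf8(sample)
if bad is not None:
    start, end = bad
    print("invalid UTF-8 at bytes %d-%d: %r" % (start, end, sample[start:end]))
    # Latin-1 maps every byte to some character, so this decode cannot fail.
    # It only shows what the text would be *if* the data were Latin-1.
    print(sample.decode("iso-8859-1"))
```

For a real file, the bytes would come from `open(path, "rb").read()`. Because a Latin-1 decode never fails, a successful decode is not proof that Latin-1 is the right encoding; the decoded text still has to be checked by eye.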

Regards,
Martin
Jul 4 '08 #2
On Jul 4, 2:36 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> > The parser is failing on this line:
> > <mrcb245-c>Heinrich Kčufner, Norbert Nedopil, Heinz Schčoch (Hrsg.).</
> > mrcb245-c>
>
> If it is literally this line, it's no surprise: there must not be a line
> break between the slash and the closing element name.
>
> However, since you are getting the error in a different column, it's
> indeed more likely that there is a problem with the encoding.
>
> Given that the Python UTF-8 codec refuses the data, most likely, the
> data is *not* encoded in UTF-8 (but perhaps in Latin-1). If so, you
> need to prefix the XML document with a proper XML declaration, such
> as
>
> <?xml version="1.0" encoding="iso-8859-1"?>
>
> Alternatively, make sure that the file is really encoded in UTF-8.
>
> Regards,
> Martin

There is no line break in the xml file. It was just a formatting issue
on this forum.

However, you were right about the encoding not being
utf-8. The xml file is autogenerated by a different script so that's
probably where it is going wrong.
The parser works fine if I change the first line to
<?xml version="1.0" encoding="iso-8859-1"?>
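
The effect of the declaration can be reproduced on a small in-memory document. This is a sketch only, assuming Python 3; the element name, the sample text, and the `parses` helper are invented for the example and do not come from the real file.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical one-element document whose ü is stored as the Latin-1 byte 0xFC.
body = "<name>K\u00fcfner</name>".encode("iso-8859-1")


def parses(data: bytes) -> bool:
    """Return True if minidom can parse the given bytes, False on ExpatError."""
    try:
        minidom.parseString(data)
    except ExpatError:
        return False
    return True


# Without a declaration the parser assumes UTF-8 and rejects the 0xFC byte.
without_decl = body
# With the declaration the same bytes are read as ISO-8859-1.
with_decl = b'<?xml version="1.0" encoding="iso-8859-1"?>\n' + body

print(parses(without_decl))
print(parses(with_decl))

doc = minidom.parseString(with_decl)
print(doc.documentElement.firstChild.data)
```

The declaration only tells the parser how to interpret the bytes. If the generating script is meant to produce UTF-8, fixing its output encoding would be the longer-term repair.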

Thank you very much
Jul 4 '08 #3
