XML SAX parser bug?

Hi,

I think I ran into a bug in the XML SAX parser.

Part of my program consists of reading a rather large XML file (about
10 MB) containing a few thousand elements.
I have the following problem: sometimes the SAX parser misreads a
line.
Let me explain. The XML file contains a few thousand lines like this:
"
<TargetRef>WINOSSPI:Storage@@n91c90a.cmc.com</TargetRef>
"
where 'n91c90a.cmc.com' is the name of a system and thus changes per
system.
In a few cases, the SAX parser misreads the line: it splits the
character data of the line into two pieces,
"WINOSSPI:Storage@@n" and "91c90a.cmc.com".
I put a print statement in the 'characters' method of the
handler; that is how I found out.
It only happens for a few of the thousand lines, but you can imagine
that it is very annoying.

I checked for errors in the XML file, but the file seems OK.

Is this a bug or am I doing something wrong?
I am new to Python.

I am using Python 2.4.1, pyWin32 extension 2.4 and PyXML 0.8.4.

Any help would be very much appreciated.

Kris

Jan 19 '06 #1
mi*****@skynet.be wrote:
Is this a bug or am I doing something wrong?


it's not a bug; the parser is free to split up character runs (due to buffering,
entities or character references, etc). it's up to you to merge character runs
into strings.

</F>
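
The behaviour described in this reply can be observed with a small experiment. The following is only a sketch, assuming the standard-library xml.sax module on a current Python; ChunkRecorder and record_chunks are hypothetical names invented for the illustration. It feeds the document to the parser in two reads, much as buffering might do with a large file, and records every characters() call. How many pieces come back depends on the parser and its version, so the only thing to rely on is that the pieces join up to the full text.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ChunkRecorder(ContentHandler):
    """Hypothetical handler that records every characters() call as-is."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, data):
        self.chunks.append(data)


def record_chunks(pieces):
    """Feed the document to the parser piece by piece and return the chunks."""
    recorder = ChunkRecorder()
    parser = xml.sax.make_parser()
    parser.setContentHandler(recorder)
    for piece in pieces:
        parser.feed(piece)
    parser.close()
    return recorder.chunks


# The document arrives in two reads; the split point is chosen for illustration.
chunks = record_chunks([
    "<TargetRef>WINOSSPI:Storage@@n",
    "91c90a.cmc.com</TargetRef>",
])
print(chunks)           # number of pieces is up to the parser
print("".join(chunks))  # joining the pieces gives the complete value
```

Entity and character references (for example &amp;amp; inside the text) are another common place where a parser may hand over the text in several pieces.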

Jan 19 '06 #2

Fredrik Lundh wrote:

it's not a bug; the parser is free to split up character runs (due to buffering,
entities or character references, etc). it's up to you to merge character runs
into strings.

</F>

Thanks for the feedback,

but how do I detect that the parser has split up the characters? I guess
I need to detect it in order to reconstruct the complete string.

Jan 19 '06 #3
mi*****@skynet.be wrote:
but how do I detect that the parser has split up the characters? I guess
I need to detect it in order to reconstruct the complete string.


Don't try to detect it. Instead, assume it always happens, and collect
the strings in characters() rather than processing them there. Do
something like this:

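    from xml.sax.handler import ContentHandler

    class Handler(ContentHandler):
        def startElement(self, name, attrs):
            # Start a fresh buffer whenever an element opens.
            self.chardata = ""

        def characters(self, data):
            # May be called several times per text run; just accumulate.
            self.chardata += data

        def endElement(self, name):
            # Only here is the text known to be complete.
            # process() stands for whatever your program does with it.
            process(self.chardata)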

This is simplified - you might have to deal with nested elements,
somehow.
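
One possible way to deal with nested elements is to keep a separate buffer for each open element on a stack. This is a sketch only, assuming the standard-library xml.sax module; TextCollector, the Targets wrapper element and the second host name are made up for the example and do not come from the original file.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler: one text buffer per open element, kept on a stack."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.stack = []    # one list of text pieces per open element
        self.results = []  # (element name, complete text) pairs

    def startElement(self, name, attrs):
        # A new element gets its own empty buffer.
        self.stack.append([])

    def characters(self, data):
        # Text belongs to the innermost open element.
        if self.stack:
            self.stack[-1].append(data)

    def endElement(self, name):
        # The element is closed, so its text is complete now.
        text = "".join(self.stack.pop())
        self.results.append((name, text))


# Assumed sample input; the wrapper element and second host are invented.
SAMPLE = (
    "<Targets>"
    "<TargetRef>WINOSSPI:Storage@@n91c90a.cmc.com</TargetRef>"
    "<TargetRef>WINOSSPI:Storage@@other.cmc.com</TargetRef>"
    "</Targets>"
)

collector = TextCollector()
xml.sax.parseString(SAMPLE.encode("utf-8"), collector)
target_refs = [text for name, text in collector.results if name == "TargetRef"]
print(target_refs)
```

Collecting the pieces in a list and joining them once at endElement avoids repeated string concatenation, which can matter for a file with thousands of elements.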

Regards,
Martin
Jan 19 '06 #4
mi*****@skynet.be wrote:
Fredrik Lundh wrote:

it's not a bug; the parser is free to split up character runs (due to buffering,
entities or character references, etc). it's up to you to merge character runs
into strings.


but how do I detect that the parser has split up the characters? I guess
I need to detect it in order to reconstruct the complete string.


Here's a recipe:

http://aspn.activestate.com/ASPN/Coo.../Recipe/265881

Using this filter you can then write SAX code that assumes normalized
text events. Also, 4Suite's SAX implementation, Saxlette,
automatically does this text event merging for you at C speed:

http://4suite.org/docs/CoreManual.xml#saxlette
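
For readers who cannot reach the links, the general idea of a text-merging filter can be sketched with the standard library's XMLFilterBase. This is not the linked recipe and not Saxlette, only a minimal illustration under the assumption that flushing on element and processing-instruction events is enough; TextMergingFilter, TextRecorder and merged_texts are hypothetical names.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase


class TextMergingFilter(XMLFilterBase):
    """Hypothetical filter: buffers characters() calls, forwards them as one event."""

    def __init__(self, parent=None):
        XMLFilterBase.__init__(self, parent)
        self._pending = []

    def _flush(self):
        # Send any buffered text downstream as a single characters() event.
        if self._pending:
            text = "".join(self._pending)
            self._pending = []
            XMLFilterBase.characters(self, text)

    def characters(self, content):
        # Hold the text back until an event that ends the text run arrives.
        self._pending.append(content)

    def startElement(self, name, attrs):
        self._flush()
        XMLFilterBase.startElement(self, name, attrs)

    def endElement(self, name):
        self._flush()
        XMLFilterBase.endElement(self, name)

    def startElementNS(self, name, qname, attrs):
        self._flush()
        XMLFilterBase.startElementNS(self, name, qname, attrs)

    def endElementNS(self, name, qname):
        self._flush()
        XMLFilterBase.endElementNS(self, name, qname)

    def processingInstruction(self, target, data):
        self._flush()
        XMLFilterBase.processingInstruction(self, target, data)


class TextRecorder(ContentHandler):
    """Downstream handler that can assume each text run arrives in one call."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.texts = []

    def characters(self, content):
        self.texts.append(content)


def merged_texts(xml_bytes):
    """Parse xml_bytes through the filter and return the text events seen downstream."""
    recorder = TextRecorder()
    reader = TextMergingFilter(xml.sax.make_parser())
    reader.setContentHandler(recorder)
    reader.parse(io.BytesIO(xml_bytes))
    return recorder.texts


print(merged_texts(b"<a>x&amp;y<b>z</b></a>"))
```

The benefit of the filter approach is that the merging logic lives in one place, and every handler placed behind it can be written as if characters() were called once per text run.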

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://fourthought.com
http://copia.ogbuji.net http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/

Feb 7 '06 #5
