
XML file parsing with SAX

I decided to use SAX to parse my XML file,
but the parser crashes with:
File "/usr/lib/python2.3/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
raise exception
xml.sax._exceptions.SAXParseException: NCBI_Entrezgene.dtd:8:0: error in processing external entity reference

This is caused by:
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN"
"NCBI_Entrezgene.dtd">

If I remove it, it parses normally.
I've created my parser like this:
import sys
from xml.sax import make_parser
from handler import EntrezGeneHandler

fopen = open("mouse2.xml", "r")
ch = EntrezGeneHandler()
saxparser = make_parser()
saxparser.setContentHandler(ch)
saxparser.parse(fopen)

And the handler is:
from xml.sax import ContentHandler

class EntrezGeneHandler(ContentHandler):
    """
    A handler to deal with EntrezGene in XML
    """

    def startElement(self, name, attrs):
        print "Start element:", name

So it doesn't do much yet, and still it crashes...
How can I tell the parser not to look at the DOCTYPE declaration?
On a website:
http://www.devarticles.com/c/a/XML/P...-and-Python/1/
it states that the SAX parsers are not validating, so this error shouldn't
even occur?

Cheers,

Willem
Jul 19 '05 #1


On Sat, 2005-04-23 at 15:20 +0200, Willem Ligtenberg wrote:
> How can I tell the parser not to look at the DOCTYPE declaration?
> On a website:
> http://www.devarticles.com/c/a/XML/P...-and-Python/1/
> it states that the SAX parsers are not validating, so this error shouldn't
> even occur?


Just because it's not validating doesn't mean that the parser won't try
to read the external entity.

Maybe you're looking for

"""
feature_external_ges
Value: "http://xml.org/sax/features/external-general-entities"
true: Include all external general (text) entities.
false: Do not include external general entities.
access: (parsing) read-only; (not parsing) read/write
"""

Quote from:

http://docs.python.org/lib/module-xml.sax.handler.html

But you're on pretty shaky ground in any XML 1.x toolkit using a bogus
DTDecl in this way. Why go through the hassle? Why not use a catalog,
or remove the DTDecl?
--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://fourthought.com
http://copia.ogbuji.net http://4Suite.org
Use CSS to display XML, part 2 - http://www-128.ibm.com/developerwork...xmlcss2-i.html
XML Output with 4Suite & Amara - http://www.xml.com/pub/a/2005/04/20/py-xml.html
Use XSLT to prepare XML for import into OpenOffice Calc - http://www.ibm.com/developerworks/xml/library/x-oocalc/
Schema standardization for top-down semantic transparency - http://www-128.ibm.com/developerwork...x-think31.html

Jul 19 '05 #2

I didn't make the XML file, and I don't like messing with other people's
data. So I just want my SAX parser to ignore the DOCTYPE. I can't help it if
other people make it hard for me to read their XML file...

On Sat, 23 Apr 2005 13:48:49 -0600, Uche Ogbuji wrote:
> But you're on pretty shaky ground in any XML 1.x toolkit using a bogus
> DTDecl in this way. Why go through the hassle? Why not use a catalog,
> or remove the DTDecl?


Jul 19 '05 #3

On 4/23/05, Willem Ligtenberg <wl*********@gmail.com> wrote:
> so that will be sax.handler.feature_external_ges = "false"

Yes.

> And it will work?

Honestly, I'm not sure. It should, but I've found these edge cases a
bit hard to predict in the Python built-in libs :-(
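
One caveat on the exchange above: in xml.sax, `feature_external_ges` is a constant that holds the feature's name, so assigning to it does not reconfigure any parser. The feature is switched with `setFeature()` on the parser object, using a boolean rather than the string "false". The following is a minimal sketch under some assumptions: it is written for Python 3 (the thread itself uses Python 2.3), and `SAMPLE`, `NameCollector`, and `parse_without_external_entities` are illustrative names that do not come from the thread. Recent Python 3 releases already ship with this feature off by default, so there the call mostly makes the intent explicit.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, feature_external_ges

# Hypothetical sample document; the DTD it names does not need to exist.
SAMPLE = (
    b'<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN"\n'
    b' "NCBI_Entrezgene.dtd">\n'
    b"<Entrezgene-Set><Entrezgene/></Entrezgene-Set>"
)


class NameCollector(ContentHandler):
    """Record the name of every element the parser reports."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_without_external_entities(stream):
    """Parse `stream` with external general entities switched off."""
    parser = make_parser()
    # The constant only names the feature; setFeature() is what switches it.
    parser.setFeature(feature_external_ges, False)
    handler = NameCollector()
    parser.setContentHandler(handler)
    parser.parse(stream)
    return handler.names


names = parse_without_external_entities(io.BytesIO(SAMPLE))
print(names)
```

The feature has to be set before `parse()` is called, since the documentation quoted earlier lists it as read-only while parsing.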
> But what about using a catalog? I am very new to python and XML...


Catalogs allow you to rewrite the IDs for entities and such. So if
you had an XML file with an entity at a URL, but you were working
offline, you could use a catalog to "redirect" the entity to a copy on
your local filesystem.

The problem, now that I think of it, is that I'm not sure you can specify
a catalog in PySAX. You might instead have to override the method
resolveEntity in your handler (and be sure to [...]). See the example in
listing 1 and the discussion here:

http://www.xml.com/pub/a/2005/03/02/pyxml.html
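
As a rough illustration of the catalog idea and the resolver override described above, here is a minimal sketch. It assumes Python 3, and it is not the listing from the linked article. `CATALOG`, `CatalogResolver`, `NameCollector`, and `SAMPLE` are hypothetical names, and the "local copy" is an in-memory stand-in rather than a real file. The resolver is registered with `setEntityResolver()`, and external general entities are switched on so that the parser actually consults it.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical catalog: public ID -> bytes of a local copy of the DTD.
# A real one would more likely map public IDs to file paths on disk.
CATALOG = {
    "-//NCBI//NCBI Entrezgene/EN": b"<!-- local stand-in for the NCBI DTD -->",
}

# Hypothetical sample document using the DOCTYPE from the question.
SAMPLE = (
    b'<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN"\n'
    b' "NCBI_Entrezgene.dtd">\n'
    b"<Entrezgene-Set><Entrezgene/></Entrezgene-Set>"
)


class CatalogResolver(EntityResolver):
    """Redirect known public IDs to local content; stub out the rest."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.requests = []  # (publicId, systemId) pairs the parser asked for

    def resolveEntity(self, publicId, systemId):
        self.requests.append((publicId, systemId))
        source = InputSource()
        # Unknown entities resolve to an empty stream instead of a lookup.
        source.setByteStream(io.BytesIO(self.catalog.get(publicId, b"")))
        return source


class NameCollector(ContentHandler):
    """Record the name of every element the parser reports."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


resolver = CatalogResolver(CATALOG)
handler = NameCollector()
parser = make_parser()
# Without this feature the parser skips external entities and never
# calls the resolver.
parser.setFeature(feature_external_ges, True)
parser.setEntityResolver(resolver)
parser.setContentHandler(handler)
parser.parse(io.BytesIO(SAMPLE))
print(handler.names, resolver.requests)
```

Compared with switching the feature off, this keeps the source document untouched while still letting you decide, per public ID, what the parser sees in place of the DTD.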

Good luck.

Jul 19 '05 #4
