nonstandard XML character entities?

I'm new to xml mongering so forgive me if there's an obvious
well-known answer to this. It's not real obvious from the library
documentation I've looked at so far. Basically I have to munch a
bunch of xml files which contain character entities like &uacute;
which are apparently nonstandard. They appear in w3.org tables but
xml.etree.cElementTree.ElementTree.parse barfs at them and xmllint
barfs at them.

Basically I want to know if there's a way to supply the regular parser
(preferably xml.etree but I guess I can switch to another one if
necessary) with some kind of entity table, and/or if the info is
supposed to be found in the DTD or someplace like that. Right now I'm
ignoring the DTD and simply figuring out the doc structure by
eyeballing the xml files, maybe not a perfectly approved method but
it seems to be what most people do.

Thanks

Apr 14 '07 #1
> I'm new to xml mongering so forgive me if there's an obvious
> well-known answer to this. It's not real obvious from the library
> documentation I've looked at so far. Basically I have to munch a
> bunch of xml files which contain character entities like &uacute;
> which are apparently nonstandard.

If they contain such things, and do not contain a document type
definition, they are not well-formed XML files (i.e. can't be
called "XML" in a meaningful sense).

It would have been helpful if you had given an example of such
a document.

> Basically I want to know if there's a way to supply the regular parser
> (preferably xml.etree but I guess I can switch to another one if
> necessary) with some kind of entity table, and/or if the info is
> supposed to be found in the DTD or someplace like that. Right now I'm
> ignoring the DTD and simply figuring out the doc structure by
> eyeballing the xml files, maybe not a perfectly approved method but
> it seems to be what most people do.

If there is a document type declaration in the document, the best
way is to parse it in a mode where the parser downloads the DTD
when parsing it, and resolves the entity references itself.

In SAX, you can put an EntityResolver into the parser, and then
return a file-like object from resolveEntity. This can be used
to avoid the network download; the document type declaration
would still have to be present.

Alternatively, you can implement a skippedEntity callback in
the SAX content handler.
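
For what it's worth, here is a minimal Python 3 sketch of the skippedEntity route. The handler class, the helper function, the sample document and the "example.dtd" system id are all made up for illustration, and the HTML entity table from html.entities is only assumed to match what the files use. It relies on the document carrying a document type declaration that points at an external DTD; the DTD itself is never fetched because external general entities are switched off.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges
from html.entities import name2codepoint


class EntityAwareHandler(ContentHandler):
    """Hypothetical handler: collects text, mapping skipped entities to characters."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

    def skippedEntity(self, name):
        # Assumption: the skipped names are the HTML 4 named entities.
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Unknown name: keep the reference visible instead of dropping it.
            self.chunks.append("&%s;" % name)
        else:
            self.chunks.append(chr(codepoint))


def extract_text(data):
    """Parse XML bytes and return all character data as one string."""
    handler = EntityAwareHandler()
    parser = xml.sax.make_parser()
    # Do not try to load the external DTD (no network or file access).
    parser.setFeature(feature_external_ges, False)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return "".join(handler.chunks)


# Made-up sample; "example.dtd" is a placeholder and is never opened.
SAMPLE = b"""<!DOCTYPE doc SYSTEM "example.dtd">
<doc>Diana Montan&eacute;</doc>
"""

text = extract_text(SAMPLE)
```

A real handler would also track startElement/endElement so the substituted text lands in the right element; this one only shows where the callback fits.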

In ElementTree, the XMLTreeBuilder has an attribute entity
which is a dictionary used to map entity names in entity references
to their definitions. Whether you can make the parser download
the DTD itself, I don't know.

Regards,
Martin

Apr 14 '07 #2
Martin v. Löwis wrote this on Sat, 14 Apr 2007 09:10:44 +0200. My
reply is below.
Paul Rubin:
> I'm new to xml mongering so forgive me if there's an obvious
> well-known answer to this. It's not real obvious from the library
> documentation I've looked at so far. Basically I have to munch
> a bunch of xml files which contain character entities like &uacute;
> which are apparently nonstandard.

-snip-

> In ElementTree, the XMLTreeBuilder has an attribute entity which is
> a dictionary used to map entity names in entity references to their
> definitions. Whether you can make the parser download the DTD
> itself, I don't know.

What he said....

Try this on your piano:

    import xml.etree.ElementTree  # or elementtree.ElementTree prior to 2.5
    ElementTree = xml.etree.ElementTree
    import htmlentitydefs

    class XmlFile(ElementTree.ElementTree):

        def __init__(self, file=None, tag='global', **extra):
            ElementTree.ElementTree.__init__(self)
            # Element factory: assumed to be ElementTree.Element
            parser = ElementTree.XMLTreeBuilder(
                target=ElementTree.TreeBuilder(ElementTree.Element))
            # Map HTML entity names (eacute, uacute, ...) to their definitions
            parser.entity = htmlentitydefs.entitydefs
            self.parse(source=file, parser=parser)

It looks goofy as can be, but it works for me.

--
... Chuck Rhode, Sheboygan, WI, USA
... Weather: http://LacusVeris.com/WX
... 32° — Wind Calm
Apr 14 '07 #3
Chuck Rhode wrote this on Sat, 14 Apr 2007 09:04:45 -0500. My reply is
below.

Fixed text wrap:

    import xml.etree.ElementTree  # or elementtree.ElementTree prior to 2.5
    ElementTree = xml.etree.ElementTree
    import htmlentitydefs

    class XmlFile(ElementTree.ElementTree):
        def __init__(self, file=None, tag='global', **extra):
            ElementTree.ElementTree.__init__(self)
            # Element factory: assumed to be ElementTree.Element
            parser = ElementTree.XMLTreeBuilder(
                target=ElementTree.TreeBuilder(ElementTree.Element))
            # Map HTML entity names (eacute, uacute, ...) to their definitions
            parser.entity = htmlentitydefs.entitydefs
            self.parse(source=file, parser=parser)
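
The code above is Python 2. A rough Python 3 equivalent is sketched below, assuming the entity names in the files are the HTML ones: htmlentitydefs is now html.entities, XMLParser stands in for XMLTreeBuilder (which current Python 3 releases no longer provide), and the entity dictionary is updated in place rather than reassigned. The function name and the sample bytes are made up for illustration; the sample reuses the fragment from the original question.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint


def parse_with_html_entities(data):
    """Parse XML bytes, resolving HTML named entities such as &eacute;."""
    parser = ET.XMLParser()
    # Fill the parser's entity table in place: entity name -> replacement text.
    parser.entity.update((name, chr(cp)) for name, cp in name2codepoint.items())
    parser.feed(data)
    return parser.close()  # root Element


# Illustrative sample; the DTD URL is only declared, never downloaded.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
<ONIXmessage release="2.1">
<b036>Diana Montan&eacute;</b036>
</ONIXmessage>
"""

root = parse_with_html_entities(SAMPLE)
contributor = root.find("b036").text
```

As far as I can tell, the entity table is only consulted when the document has a document type declaration with an external DTD, as these files do; without one, the underlying expat parser reports the undefined entity itself.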

--
... Chuck Rhode, Sheboygan, WI, USA
... Weather: http://LacusVeris.com/WX
... 32° — Wind Calm
Apr 14 '07 #4
"Martin v. Löwis" <ma****@v.loewis.de> writes:
> If they contain such things, and do not contain a document type
> definition, they are not well-formed XML files (i.e. can't be
> called "XML" in a meaningful sense).

The documents do have a DTD, however the DTD file doesn't say anything
about these entities.

> It would have been helpful if you had given an example of such
> a document.

I can't post a whole document because these docs are very large
and I'm not sure that the data is public. It does look like the DTD
is public: the document begins with

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
<ONIXmessage release="2.1">
...

and that url points to the DTD which is online.

Basically the doc has elements like

<b036>Diana Montan&eacute;</b036>

and both ElementTree and xmllint complain about the character entities
(and there are a lot of them).
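
A small Python 3 sketch that reproduces the complaint with the standard library parser, using a cut-down document built from the fragments above. The helper name and the sample bytes are made up for illustration; nothing is downloaded, since ElementTree does not fetch the DTD.

```python
import xml.etree.ElementTree as ET

# Cut-down sample built from the fragments quoted in this post.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
<ONIXmessage release="2.1">
<b036>Diana Montan&eacute;</b036>
</ONIXmessage>
"""


def parse_error(data):
    """Return the parser's error message, or None if the data parses cleanly."""
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        return str(exc)
    return None


message = parse_error(SAMPLE)
print(message)
```

The same document parses cleanly once &eacute; is replaced by the numeric reference &#233;, which suggests the named entity itself is the only problem.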

> If there is a document type declaration in the document, the best
> way is to parse it in a mode where the parser downloads the DTD
> when parsing it, and resolves the entity references itself.

Hmm, ok, I see there are a lot of <!ENTITY ...> directives in the
DTD but nothing about those character entities--am I looking in
the right place?

> In ElementTree, the XMLTreeBuilder has an attribute entity
> which is a dictionary used to map entity names in entity references
> to their definitions. Whether you can make the parser download
> the DTD itself, I don't know.

Chuck Rhode posted some code for something like this so I'll try it
on Monday.

Thanks!
Apr 14 '07 #5
