On Aug 4, 6:58 pm, jake <jakedim...@gmail.com> wrote:
I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;". I resorted to
streaming the file first and replacing all the character codes with
their corresponding characters (copying the file while replacing
character codes at the same time) just to get things going. In the
case of &Eacute; I replaced it with "\xC9", the rest follow suit. The
list of characters is long and I doubt if this is the way it should be
handled. The eventual parsed pieces of the xml files will be used as
parts of html web pages, not that that fact should make any
difference. At any rate, is there something I am missing? Some
XmlReader setting perhaps? Your help is greatly appreciated.
XML has the following named character entities predefined in the
absence of any DTDs or entity declarations: amp, lt, gt, apos, quot.
This is just enough to be able to escape characters that are otherwise
reserved in XML.
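To see this rule in action without any .NET setup, here is a minimal sketch using Python's standard-library parser (xml.etree.ElementTree) rather than XmlReader, on the assumption that both follow the XML specification on this point. The helper name try_parse and the sample documents are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def try_parse(xml_text):
    """Return the root element's text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None


# The five predefined entities need no declaration at all.
predefined = try_parse("<root>&amp; &lt; &gt; &apos; &quot;</root>")

# A numeric character reference needs no declaration either.
numeric = try_parse("<root>&#201;</root>")

# &Eacute; is not predefined, so without a DTD the parser reports an error
# and try_parse returns None.
undeclared = try_parse("<root>&Eacute;</root>")
```

The third case is the same situation the XmlReader is complaining about: the name Eacute simply has no definition the parser can see.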
All other named character entities must be declared, either directly
within the XML document, or in the .dtd file referenced by the XML
file's DOCTYPE declaration. As an example, have a look at the DTDs for
XHTML, which contain many character entity declarations:
http://www.w3.org/TR/xhtml1/dtds.html
In particular, if you search for "Eacute" on that page, you'll find
this declaration:
<!ENTITY Eacute "&#201;"> <!-- latin capital letter E with acute,
U+00C9 ISOlat1 -->
So, to parse your XML, you'll need to specify a DTD for it, and
declare the entity within that DTD. If your input XML is actually
XHTML, then you can just download the .dtd and .ent files from the
link I've given earlier, and use them; otherwise, you'll need to write
your own.
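Once you have the .dtd, you can associate it with XmlReader on
creation by creating an instance of XmlParserContext, specifying its
SystemId property (it should be a URI referencing the .dtd file), and
then using the three-argument version of XmlReader.Create (one of the
arguments will be XmlParserContext). Note that DTD processing is
disabled by default, so the XmlReaderSettings you pass in also needs
ProhibitDtd set to false (or DtdProcessing set to DtdProcessing.Parse
on .NET 4 and later).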
Alternatively, if you have control over the original XML (i.e. you can
mandate that it is changed), then you can just put the doctype
definition in the XML file itself. If your file looks something like
this:
<root>...</root>
Then you can change it as follows:
<!DOCTYPE root [
<!ENTITY Eacute "&#201;">
...
]>
<root>...</root>
Technically, if the XML document is supposed to be standalone, this is
the preferred way of doing things.
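As a rough check of the internal-DOCTYPE approach, here is a small sketch, again using Python's standard-library xml.etree.ElementTree instead of XmlReader, so treat it as an illustration of the XML behaviour and not of the .NET API. The sample document and its contents are invented for the example.

```python
import xml.etree.ElementTree as ET

# A standalone document that declares Eacute in its internal DTD subset
# and then uses it in the content.
document = """<!DOCTYPE root [
<!ENTITY Eacute "&#201;">
]>
<root>&Eacute;cole</root>"""

root = ET.fromstring(document)

# The parser expands &Eacute; using the declaration above,
# so the text holds the real character U+00C9.
text = root.text
```

With the declaration in place, no pre-processing pass over the file is needed; the parser performs the substitution itself.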