xml and character codes such as &Eacute;

I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;". I resorted to
streaming the file first and replacing all the character codes with
their corresponding characters (copying the file while replacing
character codes at the same time) just to get things going. In the
case of &Eacute; I replaced it with "\xC9", the rest follow suit. The
list of characters is long and I doubt if this is the way it should be
handled. The eventual parsed pieces of the xml files will be used as
parts of html web pages, not that that fact should make any
difference. At any rate, is there something I am missing? Some
XmlReader setting perhaps? Your help is greatly appreciated.
jake
Aug 4 '08 #1
jake wrote:
I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;".
That is an entity reference. To not "choke" on that entity reference you
need to declare the entity in the DTD you include in the XML document.
Otherwise the XML is not well-formed and the XML parser will reject it.
Note that DTD support is by default disabled in .NET 2.0 and later so
you will need to use
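    XmlReaderSettings settings = new XmlReaderSettings();
    // ProhibitDtd is obsolete from .NET Framework 4 on; there, set
    // settings.DtdProcessing = DtdProcessing.Parse instead.
    settings.ProhibitDtd = false;
    using (XmlReader reader = XmlReader.Create("file.xml", settings))
    {
        ...
    }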
if you want to use a DTD declaring the entities the XML uses.
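
A minimal, self-contained sketch of the setting described above, assuming .NET Framework 4 or later, where DtdProcessing.Parse takes the place of ProhibitDtd = false. The class name DtdEntityDemo, the method ReadRootText, and the sample document are hypothetical and only serve the illustration; the entity is declared in an internal DTD subset so that no external file is needed.

```csharp
using System;
using System.IO;
using System.Xml;

class DtdEntityDemo
{
    // Parses the XML text with DTD processing enabled and returns the
    // text content of the root element, with entities expanded.
    public static string ReadRootText(string xml)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        // Allow the DOCTYPE to be parsed (disabled by default).
        settings.DtdProcessing = DtdProcessing.Parse;
        // Only the internal subset is used here, so no resolver is needed;
        // null keeps the reader from fetching external resources.
        settings.XmlResolver = null;

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            reader.MoveToContent();                     // skip the DOCTYPE node
            return reader.ReadElementContentAsString(); // entity references are expanded
        }
    }

    static void Main()
    {
        // Hypothetical sample document declaring the one entity it uses.
        string xml =
            "<!DOCTYPE root [ <!ENTITY Eacute \"&#201;\"> ]>" +
            "<root>&Eacute;cole</root>";
        Console.WriteLine(ReadRootText(xml));
    }
}
```

Without the DtdProcessing line the same input is rejected, because a reader created with default settings prohibits DTDs.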

--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
Aug 4 '08 #2
On Aug 4, 6:58 pm, jake <jakedim...@gmail.com> wrote:
I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;". I resorted to
streaming the file first and replacing all the character codes with
their corresponding characters (copying the file while replacing
character codes at the same time) just to get things going. In the
case of &Eacute; I replaced it with "\xC9", the rest follow suit. The
list of characters is long and I doubt if this is the way it should be
handled. The eventual parsed pieces of the xml files will be used as
parts of html web pages, not that that fact should make any
difference. At any rate, is there something I am missing? Some
XmlReader setting perhaps? Your help is greatly appreciated.
XML has the following named character entities predefined in the
absence of any DTDs or entity declarations: amp, lt, gt, apos, quot.
This is just enough to be able to escape characters that are otherwise
reserved in XML.
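
The difference can be seen with a small sketch along these lines. The class and method names are hypothetical, and the reader uses default settings (no DTD). The third input is an addition not mentioned in the thread: numeric character references such as &#201; need no declaration either.

```csharp
using System;
using System.IO;
using System.Xml;

class PredefinedEntitiesDemo
{
    // Returns the root element's text, or a string starting with
    // "XmlException" if the parser rejects the input.
    public static string TryReadRootText(string xml)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
            {
                reader.MoveToContent();
                return reader.ReadElementContentAsString();
            }
        }
        catch (XmlException ex)
        {
            return "XmlException: " + ex.Message;
        }
    }

    static void Main()
    {
        // The five predefined entities parse without any DTD.
        Console.WriteLine(TryReadRootText("<root>&amp;&lt;&gt;&apos;&quot;</root>"));
        // An undeclared named entity makes the document ill-formed.
        Console.WriteLine(TryReadRootText("<root>&Eacute;</root>"));
        // A numeric character reference is always allowed.
        Console.WriteLine(TryReadRootText("<root>&#201;</root>"));
    }
}
```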

All other named character entities must be declared, either directly
within the XML document or in the .dtd file referenced by the XML
file's DOCTYPE declaration. As an example, have a look at the DTDs for
XHTML, which contain many character entity declarations:

http://www.w3.org/TR/xhtml1/dtds.html

In particular, if you search for "Eacute" on that page, you'll find
this declaration:

<!ENTITY Eacute "&#201;"> <!-- latin capital letter E with acute,
                               U+00C9 ISOlat1 -->

So, to parse your XML, you'll need to specify a DTD for it, and
declare the entity within that DTD. If your input XML is actually
XHTML, then you can just download the .dtd and .ent files from the
link I've given earlier, and use them; otherwise, you'll need to write
your own.

Once you have the .dtd, you can associate it with XmlReader on
creation by creating an instance of XmlParserContext, specifying its
SystemId property (it should be a URI referencing the .dtd file), and
then using the three-argument version of XmlReader.Create (one of the
arguments will be XmlParserContext).
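
A rough sketch of the XmlParserContext route, under stated assumptions. To stay self-contained it does not point SystemId at a .dtd file as described above; instead it hands the entity declarations to the context as an internal subset string, which is a swapped-in variation. The names ParserContextDemo and ReadRootText, the doctype name "root", and the sample input are hypothetical, and DtdProcessing.Parse assumes .NET Framework 4 or later.

```csharp
using System;
using System.IO;
using System.Xml;

class ParserContextDemo
{
    // Parses XML that has no DOCTYPE of its own, taking the entity
    // declarations from an XmlParserContext instead.
    public static string ReadRootText(string xml, string internalSubset)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse; // DTD info is rejected otherwise
        settings.XmlResolver = null;                  // nothing external to resolve here

        // Arguments: name table, namespace manager, doctype name, public id,
        // system id, internal subset, base URI, xml:lang, xml:space.
        // To use an external .dtd file, the system id would carry its URI
        // and a resolver would have to be supplied (not shown here).
        XmlParserContext context = new XmlParserContext(
            null, null, "root", null, null, internalSubset,
            string.Empty, string.Empty, XmlSpace.None);

        // Three-argument overload: input, settings, parser context.
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings, context))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }

    static void Main()
    {
        string declarations = "<!ENTITY Eacute \"&#201;\">";
        Console.WriteLine(ReadRootText("<root>&Eacute;cole</root>", declarations));
    }
}
```

The appeal of this route is that the input XML stays untouched; the declarations live with the application rather than in each document.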

Alternatively, if you have control over the original XML (i.e. you can
mandate that it is changed), then you can just put the doctype
definition in the XML file itself. If your file looks something like
this:

<root>...</root>

Then you can change it as follows:

<!DOCTYPE root [
<!ENTITY Eacute "É">
...
]>
<root>...</root>

Technically, if the XML document is supposed to be standalone, this is
the preferred way of doing things.
Aug 4 '08 #3
Thank you Martin and Pavel. I understand a little more about it now.
Hoped that xml files would be a shallow wade but "nay" said the
gatekeeper. At least now I can proceed on solid ground. I will most
likely include all the declarations in a separate .dtd file that is
independently editable. This way, I can edit the file and add some
expletives without recompiling!
Thank you both again.
jake
On Aug 4, 12:05 pm, Pavel Minaev <int...@gmail.com> wrote:
[snip]
Aug 4 '08 #4
