473,396 Members | 1,992 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,396 software developers and data experts.

xml and character codes such as É

I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "É". I resorted to
streaming the file first and replacing all the character codes with
their corresponding characters (copying the file while replacing
character codes at the same time) just to get things going. In the
case of É I replaced it with "\xC9", the rest follow suit. The
list of characters is long and I doubt if this is the way it should be
handled. The eventual parsed pieces of the xml files will be used as
parts of html web pages, not that that fact should make any
difference. At any rate, is there something I am missing? Some
XmlReader setting perhaps? Your help is greatly appreciated.
jake
Aug 4 '08 #1
3 7957
jake wrote:
I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "É".
That is an entity reference. To not "choke" on that entity reference you
need to declare the entity in the DTD you include in the XML document.
Otherwise the XML is not well-formed and the XML parser will reject it.
Note that DTD support is by default disabled in .NET 2.0 and later so
you will need to use
XmlReaderSettings settings = new XmlReaderSettings();
settings.ProhibitDtd = false;
using (XmlReader reader = XmlReader.Create("file.xml", settings))
{
...
}
if you want to use a DTD declaring the entities the XML uses.

--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
Aug 4 '08 #2
On Aug 4, 6:58*pm, jake <jakedim...@gmail.comwrote:
I am new to xml. *I have a routine that parses xml files using a
regular XmlReader class. *Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;". *I resorted to
streaming the file first and replacing all the character codes with
their corresponding characters (copying the file while replacing
character codes at the same time) just to get things going. *In the
case of &Eacute; I replaced it with "\xC9", the rest follow suit. *The
list of characters is long and I doubt if this is the way it should be
handled. *The eventual parsed pieces of the xml files will be used as
parts of html web pages, not that that fact should make any
difference. *At any rate, is there something I am missing? *Some
XmlReader setting perhaps? *Your help is greatly appreciated.
XML has the following named character entities predefined in the
absence of any DTDs or entity declarations: amp, lt, gt, apos, quot.
This is just enough to be able to escape characters that are otherwise
reserved in XML.

All other named character entities should be declared, either directly
within the XML document, or in the .dtd file specified by the XML's
file DOCTYPE directive. As an example, have a look at the DTDs for
XHTML, which contain many character entity declarations:

http://www.w3.org/TR/xhtml1/dtds.html

In particular, if you search for "Eacute" on that page, you'll find
this declaration:

<!ENTITY Eacute "É"<!-- latin capital letter E with acute, U
+00C9 ISOlat1 -->

So, to parse your XML, you'll need to specify a DTD for it, and
declare the entity within that DTD. If your input XML is actually
XHTML, then you can just download the .dtd and .ent files from the
link I've given earlier, and use them; otherwise, you'll need to write
your own.

Once you have the .dtd, you can associate it with XmlReader on
creation by creating an instance of XmlParserContext, specifying its
SystemId property (it should be an URI referencing the .dtd file), and
then using the three-argument version of XmlReader.Create (one of the
arguments will be XmlParserContext).

Alternatively, if you have control over the original XML (i.e. you can
mandate that it is changed), then you can just put the doctype
definition in the XML file itself. If your file looks something like
this:

<root... </root>

Then you can change it as follows:

<!DOCTYPE root [
<!ENTITY Eacute "É">
...
]>
<root>...</root>

Technically, if the XML document is supposed to be standalone, this is
the preferred way of doing things.
Aug 4 '08 #3
Thank you Martin and Pavel. I understand a little more about it now.
Hoped that xml files would be a shallow wade but "nay" said the
gatekeeper. At least now I can proceed on solid grounds. I will most
likely include all the declarations in a separate .DTD that is
independently editable. This way, I can edit the file and add some
expletives without recompiling!
Thank you both again.
jake
On Aug 4, 12:05 pm, Pavel Minaev <int...@gmail.comwrote:
On Aug 4, 6:58 pm, jake <jakedim...@gmail.comwrote:
I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;". I resorted to
streaming the file first and replacing all the character codes with
their corresponding characters (copying the file while replacing
character codes at the same time) just to get things going. In the
case of &Eacute; I replaced it with "\xC9", the rest follow suit. The
list of characters is long and I doubt if this is the way it should be
handled. The eventual parsed pieces of the xml files will be used as
parts of html web pages, not that that fact should make any
difference. At any rate, is there something I am missing? Some
XmlReader setting perhaps? Your help is greatly appreciated.

XML has the following named character entities predefined in the
absence of any DTDs or entity declarations: amp, lt, gt, apos, quot.
This is just enough to be able to escape characters that are otherwise
reserved in XML.

All other named character entities should be declared, either directly
within the XML document, or in the .dtd file specified by the XML's
file DOCTYPE directive. As an example, have a look at the DTDs for
XHTML, which contain many character entity declarations:

http://www.w3.org/TR/xhtml1/dtds.html

In particular, if you search for "Eacute" on that page, you'll find
this declaration:

<!ENTITY Eacute "É"<!-- latin capital letter E with acute, U
+00C9 ISOlat1 -->

So, to parse your XML, you'll need to specify a DTD for it, and
declare the entity within that DTD. If your input XML is actually
XHTML, then you can just download the .dtd and .ent files from the
link I've given earlier, and use them; otherwise, you'll need to write
your own.

Once you have the .dtd, you can associate it with XmlReader on
creation by creating an instance of XmlParserContext, specifying its
SystemId property (it should be an URI referencing the .dtd file), and
then using the three-argument version of XmlReader.Create (one of the
arguments will be XmlParserContext).

Alternatively, if you have control over the original XML (i.e. you can
mandate that it is changed), then you can just put the doctype
definition in the XML file itself. If your file looks something like
this:

<root... </root>

Then you can change it as follows:

<!DOCTYPE root [
<!ENTITY Eacute "É">
...
]>
<root>...</root>

Technically, if the XML document is supposed to be standalone, this is
the preferred way of doing things.
Aug 4 '08 #4

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
by: ahsan Imam | last post by:
Hello All, I have this file and when I import the file in the python interpretor I get the following error: "__main__:1: DeprecationWarning: Non-ASCII character '\xc0' in file trans.py on...
1
by: Jayme Assuncao Casimiro | last post by:
I am witing a wrapper to extract data from werb sites. I am storing the data in xml format. But it doesn't produce valid xml because of html codes like the above - typical of latin languages....
7
by: Harlan Messinger | last post by:
How far back in their version history did Netscape and Internet Explorer support — and – codes for em and en dashes in text? In ALT attributes? In TITLE tags? I notice that Netscape 4.7 and 6...
3
by: Jim Higson | last post by:
Does anyone know a technique in javascript to transform from (for example) &hearts; to the char '♥'? I'm doing this because I have to interpret some data I got over XHTMLHTTP that isn't XML,...
50
by: The Bicycling Guitarist | last post by:
A browser conforming to HTML 4.0 is required to recognize &#number; notations. If I use XHTML 1.0 and charset UTF-8 though, does &eacute; have as much support as é ? Sometimes when I run...
26
by: S!mb | last post by:
Hi all, I'm currently developping a tool to convert texts files between linux, windows and mac. the end of a line is coded by 2 characters in windows, and only one in unix & mac. So I have to...
9
by: dnevado | last post by:
Hi, I have developed a javascript script which sends some html code to w3 validator service through xmlHttpRequest interface in IE. I simply request a page, take responseText property and send...
7
by: John Nagle | last post by:
I've been parsing existing HTML with BeautifulSoup, and occasionally hit content which has something like "Design & Advertising", that is, an "&" instead of an "&amp;". Is there some way I can get...
4
by: Dan Andrews | last post by:
Hello, I was wondering what is the correct way to handle special characters via javascript and the DOM. I would like to avoid document.write and innerHTML. What I am doing is dynamically...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.