xml and character codes such as &Eacute;

jake

I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;". I resorted to
streaming the file first and replacing all the character codes with
their corresponding characters (copying the file while replacing
character codes at the same time) just to get things going. In the
case of &Eacute; I replaced it with "\xC9", and the rest follow suit. The
list of characters is long and I doubt that this is the way it should be
handled. The eventual parsed pieces of the xml files will be used as
parts of html web pages, not that that fact should make any
difference. At any rate, is there something I am missing? Some
XmlReader setting perhaps? Your help is greatly appreciated.
jake

Aug 4 '08 #1


Martin Honnen

jake wrote:

I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;".

That is an entity reference. To not "choke" on that entity reference you
need to declare the entity in the DTD you include in the XML document.
Otherwise the XML is not well-formed and the XML parser will reject it.
Note that DTD support is by default disabled in .NET 2.0 and later so
you will need to use
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ProhibitDtd = false;
    using (XmlReader reader = XmlReader.Create("file.xml", settings))
    {
        ...
    }
if you want to use a DTD declaring the entities the XML uses.
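
For reference, here is a minimal self-contained sketch of that setup. It assumes the entity is declared in an internal DTD subset. The sample document string and the class name EntityDemo are made up for illustration. On .NET 4.0 and later, ProhibitDtd is marked obsolete and DtdProcessing = DtdProcessing.Parse is the equivalent setting, so the sketch uses that spelling.

```csharp
using System;
using System.IO;
using System.Xml;

class EntityDemo
{
    static void Main()
    {
        // Hypothetical sample document: the internal DTD subset declares Eacute.
        string xml =
            "<!DOCTYPE root [<!ENTITY Eacute \"&#201;\">]>" +
            "<root>&Eacute;cole</root>";

        XmlReaderSettings settings = new XmlReaderSettings();
        // .NET 4.0+ equivalent of "settings.ProhibitDtd = false" (.NET 2.0/3.5).
        settings.DtdProcessing = DtdProcessing.Parse;

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            // Skip the DOCTYPE node and position the reader on <root>.
            reader.MoveToContent();
            // Readers built by XmlReader.Create expand declared entities,
            // so the text comes back with the character already substituted.
            Console.WriteLine(reader.ReadElementContentAsString());
        }
    }
}
```

Without the DtdProcessing line the reader is expected to throw as soon as it meets the DOCTYPE, which is the default behavior described above.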

--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/

Aug 4 '08 #2

Pavel Minaev

On Aug 4, 6:58 pm, jake <jakedim...@gmail.com> wrote:

I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;". I resorted to
streaming the file first and replacing all the character codes with
their corresponding characters (copying the file while replacing
character codes at the same time) just to get things going. In the
case of &Eacute; I replaced it with "\xC9", the rest follow suit. The
list of characters is long and I doubt if this is the way it should be
handled. The eventual parsed pieces of the xml files will be used as
parts of html web pages, not that that fact should make any
difference. At any rate, is there something I am missing? Some
XmlReader setting perhaps? Your help is greatly appreciated.

XML has the following named character entities predefined in the
absence of any DTDs or entity declarations: amp, lt, gt, apos, quot.
This is just enough to be able to escape characters that are otherwise
reserved in XML.

All other named character entities must be declared, either directly
within the XML document, or in the .dtd file referenced by the XML
file's DOCTYPE declaration. As an example, have a look at the DTDs for
XHTML, which contain many character entity declarations:

http://www.w3.org/TR/xhtml1/dtds.html

In particular, if you search for "Eacute" on that page, you'll find
this declaration:

<!ENTITY Eacute "&#201;">

So, to parse your XML, you'll need to specify a DTD for it, and
declare the entity within that DTD. If your input XML is actually
XHTML, then you can just download the .dtd and .ent files from the
link I've given earlier, and use them; otherwise, you'll need to write
your own.
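
As a rough sketch of the "write your own" case: an external DTD that only has to supply entities can consist of nothing but entity declarations. The file name entities.dtd and the second entity are assumptions for illustration.

```xml
<!-- entities.dtd (hypothetical file name): entity declarations only -->
<!ENTITY Eacute "&#201;">
<!ENTITY eacute "&#233;">
```

A document would then point at that file from its DOCTYPE declaration:

```xml
<!DOCTYPE root SYSTEM "entities.dtd">
<root>&Eacute;cole</root>
```

The reader still has to be allowed to process DTDs, and it must be able to resolve the external file, for these declarations to take effect.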

Once you have the .dtd, you can associate it with XmlReader on
creation by creating an instance of XmlParserContext, specifying its
SystemId property (it should be a URI referencing the .dtd file), and
then using the three-argument version of XmlReader.Create (one of the
arguments will be the XmlParserContext).
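
A hedged sketch of the XmlParserContext route follows. To stay self-contained it passes the declaration through the context's internal-subset argument instead of a SystemId. With a real .dtd file, its URI would go in the system id argument instead. The input string and the class name ParserContextDemo are made up for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

class ParserContextDemo
{
    static void Main()
    {
        // Hypothetical input that carries no DOCTYPE of its own.
        string xml = "<root>&Eacute;cole</root>";

        // The DTD information travels in the parser context instead.
        // Arguments: name table, namespace manager, DOCTYPE name, public id,
        // system id, internal subset, base URI, xml:lang, xml:space.
        XmlParserContext context = new XmlParserContext(
            null, null,
            "root",
            null,
            null,                           // a URI to a .dtd file would go here
            "<!ENTITY Eacute \"&#201;\">",  // declarations supplied inline
            "", "", XmlSpace.None);

        XmlReaderSettings settings = new XmlReaderSettings();
        // DTD processing must still be enabled (ProhibitDtd = false on .NET 2.0/3.5).
        settings.DtdProcessing = DtdProcessing.Parse;

        using (XmlReader reader =
                   XmlReader.Create(new StringReader(xml), settings, context))
        {
            reader.MoveToContent();
            Console.WriteLine(reader.ReadElementContentAsString());
        }
    }
}
```

The point of this variant is that the XML files themselves stay untouched, since the declarations are attached by the calling code.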

Alternatively, if you have control over the original XML (i.e. you can
mandate that it is changed), then you can just put the doctype
definition in the XML file itself. If your file looks something like
this:

<root>...</root>

Then you can change it as follows:

<!DOCTYPE root [
<!ENTITY Eacute "&#201;">
...
]>
<root>...</root>

Technically, if the XML document is supposed to be standalone, this is
the preferred way of doing things.

Aug 4 '08 #3

jake

Thank you Martin and Pavel. I understand a little more about it now.
I had hoped that xml files would be a shallow wade, but "nay" said the
gatekeeper. At least now I can proceed on solid ground. I will most
likely include all the declarations in a separate .DTD that is
independently editable. This way, I can edit the file and add some
expletives without recompiling!
Thank you both again.
jake

Aug 4 '08 #4
