xml and character codes such as &Eacute;

jake

I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;". I resorted to
streaming the file first and replacing all the character codes with
their corresponding characters (copying the file while replacing
character codes at the same time) just to get things going. In the
case of &Eacute; I replaced it with "\xC9", the rest follow suit. The
list of characters is long and I doubt if this is the way it should be
handled. The eventual parsed pieces of the xml files will be used as
parts of html web pages, not that that fact should make any
difference. At any rate, is there something I am missing? Some
XmlReader setting perhaps? Your help is greatly appreciated.
jake

Aug 4 '08 #1

Martin Honnen

jake wrote:

I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;".

That is an entity reference. To not "choke" on that entity reference you
need to declare the entity in the DTD you include in the XML document.
Otherwise the XML is not well-formed and the XML parser will reject it.
Note that DTD support is by default disabled in .NET 2.0 and later so
you will need to use
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ProhibitDtd = false;
    using (XmlReader reader = XmlReader.Create("file.xml", settings))
    {
        ...
    }
if you want to use a DTD declaring the entities the XML uses.

--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/

Aug 4 '08 #2

Pavel Minaev

On Aug 4, 6:58 pm, jake <jakedim...@gmail.com> wrote:

I am new to xml. I have a routine that parses xml files using a
regular XmlReader class. Unfortunately, the XmlReader chokes (throws
an exception) on character codes such as "&Eacute;". I resorted to
streaming the file first and replacing all the character codes with
their corresponding characters (copying the file while replacing
character codes at the same time) just to get things going. In the
case of &Eacute; I replaced it with "\xC9", the rest follow suit. The
list of characters is long and I doubt if this is the way it should be
handled. The eventual parsed pieces of the xml files will be used as
parts of html web pages, not that that fact should make any
difference. At any rate, is there something I am missing? Some
XmlReader setting perhaps? Your help is greatly appreciated.

XML has the following named character entities predefined in the
absence of any DTDs or entity declarations: amp, lt, gt, apos, quot.
This is just enough to be able to escape characters that are otherwise
reserved in XML.
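
To see the difference in practice, here is a minimal sketch (not from the original posts; the class and method names are made up for illustration). It parses three small strings with default XmlReader settings: one using a predefined entity, one using the undeclared &Eacute;, and one using the numeric character reference &#201;, which needs no declaration.

```csharp
using System;
using System.IO;
using System.Xml;

static class PredefinedEntitiesDemo
{
    // Returns the text of the root element, or the parser's error message
    // if the input is rejected. The name ReadRootText is illustrative only.
    public static string ReadRootText(string xml)
    {
        try
        {
            // Default settings: no DTD, so only amp, lt, gt, apos and quot are known.
            using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
            {
                reader.MoveToContent();
                return reader.ReadElementContentAsString();
            }
        }
        catch (XmlException ex)
        {
            return "XmlException: " + ex.Message;
        }
    }

    static void Main()
    {
        // Predefined entity: returns "Tom & Jerry".
        Console.WriteLine(ReadRootText("<root>Tom &amp; Jerry</root>"));

        // Undeclared named entity: the reader throws, so the error text is returned.
        Console.WriteLine(ReadRootText("<root>&Eacute;cole</root>"));

        // Numeric character reference: always allowed, returns "\u00C9cole".
        Console.WriteLine(ReadRootText("<root>&#201;cole</root>"));
    }
}
```

If the files can be changed at the source, writing &#201; instead of &Eacute; is another way around the problem, since numeric character references work without any DTD.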

All other named character entities have to be declared, either directly
within the XML document, or in the .dtd file referenced by the XML
file's DOCTYPE declaration. As an example, have a look at the DTDs for
XHTML, which contain many character entity declarations:

http://www.w3.org/TR/xhtml1/dtds.html

In particular, if you search for "Eacute" on that page, you'll find
this declaration:

<!ENTITY Eacute "&#201;">

So, to parse your XML, you'll need to specify a DTD for it, and
declare the entity within that DTD. If your input XML is actually
XHTML, then you can just download the .dtd and .ent files from the
link I've given earlier, and use them; otherwise, you'll need to write
your own.

Once you have the .dtd, you can associate it with XmlReader on
creation by creating an instance of XmlParserContext, specifying its
SystemId property (it should be a URI referencing the .dtd file), and
then using the three-argument version of XmlReader.Create (one of the
arguments will be XmlParserContext).
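
A rough sketch of the XmlParserContext route follows. It is not from the original posts and makes a few assumptions: to stay self-contained it passes the entity declaration through the context's InternalSubset property instead of pointing SystemId at a real .dtd file, the class and method names are invented, and it uses DtdProcessing, which exists from .NET 4.0 on (on 2.0 to 3.5 the equivalent is settings.ProhibitDtd = false).

```csharp
using System;
using System.IO;
using System.Xml;

static class ParserContextDemo
{
    // Parses a document that has no DOCTYPE of its own; the entity
    // declaration is supplied from outside through XmlParserContext.
    public static string ReadRootText(string xml)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse; // DTDs are prohibited by default

        XmlParserContext context = new XmlParserContext(null, null, null, XmlSpace.None);
        context.DocTypeName = "root";
        // Assumption for this sketch: declare the entity inline.
        context.InternalSubset = "<!ENTITY Eacute \"&#201;\">";
        // To use an external file instead, set the system identifier, e.g.
        // context.SystemId = "entities.dtd";   // hypothetical file name

        // Three-argument overload: input, settings, parser context.
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings, context))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }

    static void Main()
    {
        Console.WriteLine(ReadRootText("<root>&Eacute;cole</root>"));
    }
}
```

The appeal of this approach is that the XML files themselves stay untouched; the declarations live with the code (or in a separate .dtd) and can be changed in one place.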

Alternatively, if you have control over the original XML (i.e. you can
mandate that it is changed), then you can just put the doctype
definition in the XML file itself. If your file looks something like
this:

<root>...</root>

Then you can change it as follows:

<!DOCTYPE root [
<!ENTITY Eacute "&#201;">
...
]>
<root>...</root>

Technically, if the XML document is supposed to be standalone, this is
the preferred way of doing things.
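
For completeness, a small sketch of reading such a document with an internal DOCTYPE. This is an illustration, not code from the thread: the class name, method name and sample content are invented. It uses settings.DtdProcessing = DtdProcessing.Parse, which is the .NET 4.0+ replacement for ProhibitDtd = false (ProhibitDtd is marked obsolete there).

```csharp
using System;
using System.IO;
using System.Xml;

static class InternalSubsetDemo
{
    // A document that declares the entity it uses in its own internal DTD subset.
    public const string Document =
        "<!DOCTYPE root [\n" +
        "  <!ENTITY Eacute \"&#201;\">\n" +
        "]>\n" +
        "<root>&Eacute;cole</root>";

    public static string ReadRootText(string xml)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        // Without this line the reader rejects any document containing a DOCTYPE.
        settings.DtdProcessing = DtdProcessing.Parse;

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            reader.MoveToContent(); // skips the DocumentType node
            // Readers from XmlReader.Create expand declared entities in content.
            return reader.ReadElementContentAsString();
        }
    }

    static void Main()
    {
        Console.WriteLine(ReadRootText(Document)); // the text "\u00C9cole"
    }
}
```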

Aug 4 '08 #3

jake

Thank you Martin and Pavel. I understand a little more about it now.
I had hoped that xml files would be a shallow wade, but "nay" said the
gatekeeper. At least now I can proceed on solid ground. I will most
likely include all the declarations in a separate .dtd file that is
independently editable. This way, I can edit the file and add some
expletives without recompiling!
Thank you both again.
jake
Aug 4 '08 #4
