
XmlTextWriter Encodes HTML Entities?

clintonG
#1: May 28 '07
Can anybody make sense of these crazy and inconsistent results?

// IE7 with Feed Reading View disabled displays this raw XML
<?xml version="1.0" encoding="utf-8" ?>
<!-- AT&T HTML entities & XML <elements> are displayed -->
<rss version="2.0">
<channel>
<title>AT&T HTML entities & XML <elements> are displayed</title>
....
<description>
<![CDATA[ AT&T HTML entities & XML <elements> using CDATA ]]>
</description>
....

The XML comment data comes directly from the TextBox on the Form
as text. The XmlTextWriter call writer.WriteElementString("title", title)
generates the <title> element and writer.WriteCData(description)
generates the <description> element.

// Dragging the testRSS.xml file into NotePad displays
<?xml version="1.0" encoding="utf-8"?>
<!--AT&T HTML entities & XML <elements> are displayed-->
<rss version="2.0">
<channel>
<title>AT&amp;T HTML entities &amp; XML &lt;elements&gt; are displayed</title>
<description><![CDATA[AT&T HTML entities & XML <elements> using CDATA]]></description>

// Enable IE7 Feed Reading View and observe that IE7
// either violates XML by encoding HTML entities and XML elements
// or encodes unencoded XML data for display of RSS
AT&T HTML entities & XML <elements> are displayed


It's bad enough that IE7 is likely still a sloppy parser and will violate
XML validity rules by encoding unencoded feed data, which really makes life
all FUBAR for an application developer. But worse yet, what is encoding the
HTML entities and the XML element in the <title> element when the
testRSS.xml file is dragged into NotePad?

Does the XmlTextWriter encode HTML and XML? How does the data in the
<title> element in the file end up encoded?

<%= Clinton Gallagher
NET csgallagher AT metromilwaukee.com
URL http://clintongallagher.metromilwaukee.com/



Martin Honnen
#2: May 29 '07

re: XmlTextWriter Encodes HTML Entities?


clintonG wrote:
> Does the XmlTextWriter encode HTML and XML? How does the data in the
> <title> element in the file end up encoded?

With XmlWriter or XmlTextWriter you can ensure that your XML markup is
well-formed, as methods like WriteElementString make sure that '&' is
escaped as '&amp;' and '<' is escaped as '&lt;'. So for example
xmlWriter.WriteElementString("title",
"AT & T, <element>content</element>");
yields
<title>AT &amp; T, &lt;element&gt;content&lt;/element&gt;</title>

That has nothing to do with HTML or HTML entities, rather XML defines
entities like amp or gt or lt itself.
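
To make the difference between the three write methods from the original post concrete, here is a minimal sketch. The class and method names (EscapeDemo, Write) and the wrapping "item" element are made up for illustration; only the XmlTextWriter calls come from the thread. It writes the same text as a comment, as element content and as a CDATA section, and only the element content should come out escaped.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
static class EscapeDemo
{
    // Writes the same text three ways and returns the resulting markup.
    public static string Write(string text)
    {
        var sw = new StringWriter();
        using (var w = new XmlTextWriter(sw))
        {
            w.WriteStartElement("item");

            // Comment content is written verbatim (it must not contain "--").
            w.WriteComment(text);

            // Element content: '&', '<' and '>' are escaped by the writer.
            w.WriteElementString("title", text);

            // CDATA content is written verbatim inside <![CDATA[ ... ]]>.
            w.WriteStartElement("description");
            w.WriteCData(text);
            w.WriteEndElement();

            w.WriteEndElement();
        }
        return sw.ToString();
    }
}
```

Calling EscapeDemo.Write("AT&T <elements> here") should give a <title> containing AT&amp;T &lt;elements&gt; here, while the comment and the CDATA section keep the raw text, which matches what NotePad showed for testRSS.xml.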

If you want the 'title' element to have a child element named 'element',
then you need to use
xmlWriter.WriteStartElement("title");
xmlWriter.WriteString("AT & T");
xmlWriter.WriteElementString("element", "content");
xmlWriter.WriteEndElement();
which yields

<title>AT &amp; T<element>content</element></title>


--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
clintonG
#3: May 29 '07

re: XmlTextWriter Encodes HTML Entities?


Thanks for confirming that the XmlTextWriter methods escape and encode
specific text characters as HTML character entities. The HTML character
entity naming conventions you attempt to clarify are defined by W3C (24.4.1
The list of characters, Special characters for HTML [1]). My question should
have asked if the methods escape and encode "text characters" as HTML
entities. Nitpicker ;-)

Anyhow, I didn't see the MSDN documentation make note of this inherent
feature of the class, as the escaping and encoding features are not
explicitly documented in any page I have read so far. There is a pithy
comment within the narrative of the "Writing XML with the XmlWriter"
document [2] but the narrative is poorly written and easily misunderstood.

<%= Clinton Gallagher

[1] http://www.w3.org/TR/html401/sgml/entities.html
[2] http://msdn2.microsoft.com/en-us/lib...hb(VS.80).aspx


Martin Honnen
#4: May 29 '07

re: XmlTextWriter Encodes HTML Entities?


clintonG wrote:
> Thanks for confirming that the XmlTextWriter methods escape and encode
> specific text characters as HTML character entities. The HTML character
> entity naming conventions you attempt to clarify are defined by W3C (24.4.1
> The list of characters, Special characters for HTML [1]). My question should
> have asked if the methods escape and encode "text characters" as HTML
> entities. Nitpicker ;-)

XML defines its own entities and what XmlWriter does is based on the XML
specification and _not_ on the HTML specification.
See <http://www.w3.org/TR/REC-xml/#sec-predefined-ent>.
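
One way to see that this is plain XML escaping is to read the written markup back with an XML parser and check that the original text comes out unchanged. The following is only a sketch: the names RoundTrip and WriteThenRead are hypothetical, and it assumes the text contains no line breaks (a parser may normalize those).

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
static class RoundTrip
{
    // Writes text as the content of a <title> element, parses the result,
    // and returns the text content as an XML parser sees it.
    public static string WriteThenRead(string text)
    {
        var sw = new StringWriter();
        using (var w = new XmlTextWriter(sw))
        {
            w.WriteElementString("title", text); // writer escapes & < >
        }

        var doc = new XmlDocument();
        doc.LoadXml(sw.ToString());              // parser resolves &amp; &lt; &gt;
        return doc.DocumentElement.InnerText;
    }
}
```

RoundTrip.WriteThenRead("AT&T <elements>") should return the input string unchanged, which would also explain why a feed view shows the unescaped text while NotePad shows the escaped file content.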

--

Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
clintonG
#5: May 29 '07

re: XmlTextWriter Encodes HTML Entities?


I kept following links and finally found the arcane documentation: the
XmlWriterSettings.CheckCharacters property [1]. So it seems to me ASP.NET
developers don't have to fool around with regular expressions to validate
and replace text characters that would be illegal when the document is saved
as XML, such as an RSS feed.
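
A rough sketch of how that setting could be exercised follows. The names CheckCharactersDemo and TryWrite are made up for illustration. As far as I understand it, the escaping of '&' and '<' happens with or without CheckCharacters; the setting only governs whether characters outside the legal XML range (most control characters, for instance) are rejected, and the exception type caught below is my expectation rather than something confirmed in this thread.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
static class CheckCharactersDemo
{
    // Writes text into a <title> element with the given CheckCharacters
    // setting and returns either the markup or a note that it was rejected.
    public static string TryWrite(string text, bool checkCharacters)
    {
        var sw = new StringWriter();
        var settings = new XmlWriterSettings
        {
            CheckCharacters = checkCharacters,
            OmitXmlDeclaration = true
        };

        try
        {
            using (XmlWriter w = XmlWriter.Create(sw, settings))
            {
                w.WriteElementString("title", text);
            }
            return sw.ToString();
        }
        catch (ArgumentException ex)
        {
            // Assumption: an illegal XML character surfaces as ArgumentException
            // when CheckCharacters is true.
            return "rejected: " + ex.Message;
        }
    }
}
```

With ordinary text such as "AT&T", both TryWrite("AT&T", true) and TryWrite("AT&T", false) should contain AT&amp;T, so the escaping seen in the feed file does not depend on this property.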

I understand what W3C documents say but XML and HTML derive from SGML and
there are some semantic ambiguities in this context in the W3C documents.
Most of us and most documentation including W3C documentation define &amp;
as an HTML character entity. When we get to the W3C page(s) for XML they
drop the verbiage "HTML" when describing character entities.

As I'm sure you'll have to agree reading the EBNF, the DTDs indicate we're
talking about the same thing using context specific nomenclature.
So we really don't need to quibble about semantics. All I want to do is
write code that will generate valid XML RSS feeds that will be parsed by the
greatest number of aggregators which in itself requires a personal
relationship with all the blessings of Heaven because everybody has been so
FUBAR in their respective implementations.

<%= Clinton Gallagher

[1] http://msdn2.microsoft.com/en-us/lib...rs(VS.80).aspx


Bjoern Hoehrmann
#6: May 29 '07

re: XmlTextWriter Encodes HTML Entities?


* clintonG wrote in microsoft.public.dotnet.xml:
> I understand what W3C documents say but XML and HTML derive from SGML and
> there are some semantic ambiguities in this context in the W3C documents.
> Most of us and most documentation including W3C documentation define &amp;
> as an HTML character entity. When we get to the W3C page(s) for XML they
> drop the verbiage "HTML" when describing character entities.

It would be very confusing otherwise. As an example, &apos; is valid in
XML but not part of HTML, while &ouml; is part of HTML but not of XML;
so if you speak about the pre-defined entities in XML you refer to five,
if you speak about those in HTML you refer to hundreds of them.
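
The difference can be checked with any XML parser. Here is a small sketch using XmlDocument; the names EntityCheck and Parses are hypothetical, and the inputs are assumed to carry no DTD that declares extra entities.

```csharp
using System.Xml;

// Hypothetical helper, for illustration only.
static class EntityCheck
{
    // Returns true if the string loads as a well-formed XML document.
    public static bool Parses(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return true;
        }
        catch (XmlException)
        {
            // Raised, among other cases, for a reference to an undeclared entity.
            return false;
        }
    }
}
```

EntityCheck.Parses("<t>&apos;</t>") should return true because apos is one of the five predefined XML entities, while EntityCheck.Parses("<t>&ouml;</t>") should return false because ouml is only defined by the HTML DTDs.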
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
clintonG
#7: Jun 1 '07

re: XmlTextWriter Encodes HTML Entities?


Nobody argues that point, Björn, except to say that correct English in a
formal document requires both "narrative" and "expository" writing, which
we native speakers of English are taught in grade school.

I value consistency in technical documentation which is considered a formal
use of the language. Consistency should not be compromised for the sake of
brevity which in this context results in the obfuscation of terminology. I
mean what are we talking about being needed here? A single paragraph of
narrative supported by a single expository table of five rows to resolve an
apparent contradiction which is not a contradiction at all?

The people on the W3C working groups do not always make the best decisions
and are not necessarily known for their mastery of the English language,
which is said to be the most difficult language to master. That said, having
observed over the years how software developers will quibble with one
another for weeks or perhaps months about a single term and its meaning, I'm
genuinely surprised this discrepancy has been overlooked.

<%= Clinton

