Bytes IT Community

System.Xml.XmlDocument parses &#x1; as well-formed

With both .NET 1.0 and 1.1 I have found the following strange behaviour
where System.Xml.XmlDocument.LoadXml doesn't throw an error when parsing
a text node with a character reference to an invalid character like &#x1;.
Using the CreateTextNode method I create a text node containing
"\u0001a" (C# string literal notation). As far as I understand the DOM
allows that and an implementation is not required to throw an error.
When OuterXml is used to serialize the document .NET serializes that as
<root>&#x1;a</root>
which is not well-formed in my understanding. Even worse when parsing
that string again using LoadXml no error occurs which I see as a bug.
Here is an example C# program, it first demonstrates that .NET throws an
error when parsing "<root>\u0001a</root>" and then does what I have
described above:

using System;
using System.Xml;

public class Test20040113 {
  public static void Main (string[] args) {
    // Case 1: a literal U+0001 character in the markup is rejected.
    XmlDocument xmlDocument = new XmlDocument();
    string xmlSource = "<root>\u0001a</root>";
    try {
      xmlDocument.LoadXml(xmlSource);
      Console.WriteLine("Successfully parsed " + xmlSource);
    }
    catch (Exception e) {
      Console.WriteLine("Error parsing " + xmlSource + ": " + e);
    }
    Console.WriteLine();

    // Case 2: create the same text through the DOM, serialize it,
    // and parse the serialized string (now containing &#x1;) again.
    xmlDocument = new XmlDocument();
    xmlDocument.LoadXml("<root />");
    XmlText textNode = xmlDocument.CreateTextNode("\u0001a");
    xmlDocument.DocumentElement.AppendChild(textNode);
    xmlSource = xmlDocument.OuterXml;
    Console.WriteLine("Created XML serialized as " + xmlSource);
    try {
      xmlDocument.LoadXml(xmlSource);
      Console.WriteLine("Successfully parsed " + xmlSource);
    }
    catch (Exception e) {
      Console.WriteLine("Error parsing " + xmlSource + ": " + e);
    }
  }
}

I get the following output (the exception message was in German, as I am
on a German Win XP version here; it is given in English below):

Error parsing <root>☺a</root>: System.Xml.XmlException: '☺',
hexadecimal value 0x01, is an invalid character. Line 1, position 7.
   at System.Xml.XmlScanner.ScanContent()
   at System.Xml.XmlTextReader.ParseBeginTagExpandCharEntities()
   at System.Xml.XmlTextReader.Read()
   at System.Xml.XmlValidatingReader.ReadNoCollectTextToken()
   at System.Xml.XmlValidatingReader.Read()
   at System.Xml.XmlLoader.LoadChildren(XmlNode parent)
   at System.Xml.XmlLoader.LoadElementNode()
   at System.Xml.XmlLoader.LoadCurrentNode()
   at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
   at System.Xml.XmlLoader.Load(XmlDocument doc, XmlReader reader, Boolean preserveWhitespace)
   at System.Xml.XmlDocument.Load(XmlReader reader)
   at System.Xml.XmlDocument.LoadXml(String xml)
   at Test20040113.Main(String[] args)

Created XML serialized as <root>&#x1;a</root>
Successfully parsed <root>&#x1;a</root>
Why does .NET parse the &#x1; as well-formed??
--

Martin Honnen
http://JavaScript.FAQTs.com/

Nov 12 '05 #1
1 Reply


Martin Honnen wrote:
With both .NET 1.0 and 1.1 I have found the following strange behaviour
where System.Xml.XmlDocument.LoadXml doesn't throw an error when parsing
a text node with a character reference to an invalid character like &#x1;.


This is a well-known and, as far as I remember, acknowledged (on xml-dev?)
bug in the XmlTextReader implementation. The behaviour is controlled by the
XmlTextReader.Normalization property, which is unfortunately turned off by
default for performance reasons. Here is what MSDN says:

"If Normalization is set to false, this also disables character range
checking for numeric entities. As a result, character entities, such as
&#0;, are allowed."
ms-help://MS.MSDNQTR.2003JUL.1033/cpref/html/frlrfSystemXmlXmlTextReaderClassNormalizationTopic.htm

So to fix your test case, read the XML via an XmlTextReader with its
Normalization property set to true, and load the document from that reader.
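
A minimal sketch of that approach, assuming the serialized string from the test case is the input. The class name StrictXmlLoad and the helper TryLoadStrict are made up for illustration; only XmlTextReader, its Normalization property, and XmlDocument.Load(XmlReader) come from the framework. Note that Normalization also switches on whitespace and attribute value normalization, so this changes more than the character range check.

```csharp
using System;
using System.IO;
using System.Xml;

public static class StrictXmlLoad
{
    // Hypothetical helper (not part of System.Xml): loads the markup through
    // an XmlTextReader with Normalization enabled, so that numeric character
    // references outside the XML character range are rejected.
    public static bool TryLoadStrict(string xml, out XmlDocument document)
    {
        document = new XmlDocument();
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        reader.Normalization = true; // off by default; enables the range check
        try
        {
            document.Load(reader);
            return true;
        }
        catch (XmlException)
        {
            // Raised for references such as &#x1; once the check is enabled.
            return false;
        }
        finally
        {
            reader.Close();
        }
    }

    public static void Main()
    {
        XmlDocument doc;
        string xmlSource = "<root>&#x1;a</root>";
        bool ok = TryLoadStrict(xmlSource, out doc);
        Console.WriteLine((ok ? "Successfully parsed " : "Rejected ") + xmlSource);
    }
}
```

The helper returns a flag instead of letting the exception escape only to keep the demonstration short; in real code you would probably let the XmlException propagate so that its line and position information is not lost.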
--
Oleg Tkachenko
XML Insider
http://www.tkachenko.com/blog
Nov 12 '05 #2
