
XmlDocument.ReadNode() breaking - why?

Hi...

A colleague just referred this question to me. He's getting an xml file
from another party, which he's trying to process into another dom using an
XmlTextReader and XmlDocument.ReadNode(). The problem is that it's breaking
and he doesn't understand why. I didn't exactly either, which is why I'm
posting a question here.

First, his program creates a new DOM like this:
XmlDocument xml = new XmlDocument();
XmlElement root = xml.CreateElement("root");
xml.AppendChild(root);

Then it starts sucking in various xml files on disk like this:
StreamReader streamreader = File.OpenText(fPath);
XmlTextReader reader = new XmlTextReader(streamreader);
reader.MoveToContent();
XmlNode node = xml.ReadNode(reader);
root.AppendChild(node);

What's happening is that for this weird xml file he gets, the
xml.ReadNode(reader); line throws an encoding error.

The file he got has a bunch of high-bit characters (looks like garbage) that
are valid iso-8859-1 (the document's declared encoding) in a CDATA section.
The error that ReadNode() throws appears to be that the XmlTextReader is
trying to read through this CDATA blob as utf-8, trying to mash these
individual high-bit characters back together according to utf-8 rules to
make unicode chars out of them. Specifically, it's trying to mash ED B3 A8
into U+DCE8 (&#xDCE8;), an unpaired surrogate, and ReadNode() throws an error
that that is an invalid character.
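
For anyone following along, the arithmetic behind that can be sketched as below. This is only a hand-rolled illustration of the three-byte UTF-8 bit layout (1110xxxx 10xxxxxx 10xxxxxx), not how XmlTextReader decodes internally; the class and method names are made up for the example.

```csharp
using System;

public static class Utf8DecodeSketch
{
    // Hypothetical helper: combines the payload bits of a three-byte
    // UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) into one code unit.
    public static int DecodeThreeByteSequence(byte b0, byte b1, byte b2)
    {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void Main()
    {
        int codePoint = DecodeThreeByteSequence(0xED, 0xB3, 0xA8);

        // 0xED -> 0xD000, 0xB3 -> 0x0CC0, 0xA8 -> 0x0028
        Console.WriteLine("U+{0:X4}", codePoint);                // U+DCE8
        // U+DC00..U+DFFF are low surrogates, which are not valid on their own.
        Console.WriteLine(char.IsLowSurrogate((char)codePoint)); // True
    }
}
```

Read as iso-8859-1, the same three bytes are simply the three characters U+00ED, U+00B3 and U+00A8.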

It's as though the XmlTextReader is applying the encoding rules of the
parent dom calling ReadNode() rather than paying attention to the encoding
declaration it saw go by.

The xml file in question does parse successfully on its own.

Is there anything to do so that XmlTextReader/ReadNode pay attention to the
information going by them as it parses?

I know I could recommend that he use xml.ImportNode() to suck the result of
the parsing into his main dom, but I'd like to better understand the rules
ReadNode/XmlTextReader are going to be using; if I can get it to handle the
encoding issues better it seems like it would be more efficient than

XmlDocument xFile = new XmlDocument();
xFile.Load(filePath);
XmlNode n = xml.ImportNode(xFile.DocumentElement, true);
xml.DocumentElement.AppendChild(n);

Thanks
-mark

Nov 12 '05 #1


"Mark" <mm******@nospam.nospam> wrote in message news:7E**********************************@microsoft.com...
> XmlTextReader reader = new XmlTextReader(streamreader);
> reader.MoveToContent();
> XmlNode node = xml.ReadNode(reader);
> : :
> Is there anything to do so that XmlTextReader/ReadNode pay attention to the
> information going by them as it parses?


Is reader.Encoding equal to null? (If unspecified, it will be UTF-8.)

If reader.Encoding is not null, then what is reader.Encoding.EncodingName?

Derek Harmon
Nov 12 '05 #2

Hi. The reader.Encoding property is System.Text.UnicodeEncoding. The
reader.Encoding.EncodingName is Unicode. Thanks, David.

Derek Harmon wrote:
> "Mark" <mm******@nospam.nospam> wrote in message news:7E**********************************@microsoft.com...
>> XmlTextReader reader = new XmlTextReader(streamreader);
>> reader.MoveToContent();
>> XmlNode node = xml.ReadNode(reader);
>> : :
>> Is there anything to do so that XmlTextReader/ReadNode pay attention to the
>> information going by them as it parses?
>
> Is reader.Encoding equal to null? (If unspecified, it will be UTF-8.)
>
> If reader.Encoding is not null, then what is reader.Encoding.EncodingName?
>
> Derek Harmon


Nov 12 '05 #3

Hi Derek...

As David noted, the reader encoding appears to be defaulting to unicode, so
that seems to be the problem. But why doesn't the reader pay attention to
the processing directive that passed under its nose? It is a step you can do
manually (check to see if the first node you have is a processing directive
and grab the encoding yourself), but it seems like one of those things that
would have been good to build in under the covers too.

Thanks
-mark
"Derek Harmon" wrote:
> "Mark" <mm******@nospam.nospam> wrote in message news:7E**********************************@microsoft.com...
>> XmlTextReader reader = new XmlTextReader(streamreader);
>> reader.MoveToContent();
>> XmlNode node = xml.ReadNode(reader);
>> : :
>> Is there anything to do so that XmlTextReader/ReadNode pay attention to the
>> information going by them as it parses?
>
> Is reader.Encoding equal to null? (If unspecified, it will be UTF-8.)
>
> If reader.Encoding is not null, then what is reader.Encoding.EncodingName?
>
> Derek Harmon

Nov 12 '05 #5

Okay... Just tried it, and XmlTextReader.Encoding is a read-only property.
The only way I can see to change that is with the constructor that takes an
XmlParserContext - but this leads to the counter-intuitive fact that you have
to guess the document encoding before you create the reader that will read
your xml file.

It doesn't appear that XmlTextReader will pay attention to the encoding in
the processing directive, leaving you kinda high and dry. Is this really the
case? It seems like a bad way to design the tools.

Thanks
-mark
"Mark" wrote:
> Hi Derek...
>
> As David noted, the reader encoding appears to be defaulting to unicode, so
> that seems to be the problem. But why doesn't the reader pay attention to
> the processing directive that passed under its nose? It is a step you can do
> manually (check to see if the first node you have is a processing directive
> and grab the encoding yourself), but it seems like one of those things that
> would have been good to build in under the covers too.
>
> Thanks
> -mark
>
> "Derek Harmon" wrote:
>> "Mark" <mm******@nospam.nospam> wrote in message news:7E**********************************@microsoft.com...
>>> XmlTextReader reader = new XmlTextReader(streamreader);
>>> reader.MoveToContent();
>>> XmlNode node = xml.ReadNode(reader);
>>> : :
>>> Is there anything to do so that XmlTextReader/ReadNode pay attention to the
>>> information going by them as it parses?
>>
>> Is reader.Encoding equal to null? (If unspecified, it will be UTF-8.)
>>
>> If reader.Encoding is not null, then what is reader.Encoding.EncodingName?
>>
>> Derek Harmon

Nov 12 '05 #6

Just closing the loop here a bit - another person here pointed out that it
seemed to depend on how you create the stream you feed to XmlTextReader in
the first place.

If you use File.OpenText() to get a StreamReader and then construct
XmlTextReader with a StreamReader, it appears to lock the encoding in place
and XmlTextReader will not respect the processing directive.

If you use File.OpenRead() to get a simple FileStream and use *that* to
construct XmlTextReader, the XmlTextReader is more responsive to what's in
the stream it's reading.
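
A minimal sketch of that second approach, for reference. It assumes the target document already has a root element, and the AppendFile helper, class name and sample file name are invented for the example rather than taken from the program discussed in this thread.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class ReadNodeFromStreamSketch
{
    // Hypothetical helper: reads the root element of the file at 'path'
    // and appends it under doc's document element (which must exist).
    public static XmlNode AppendFile(XmlDocument doc, string path)
    {
        // A raw FileStream hands XmlTextReader bytes, so the reader can
        // pick the decoder named in the file's own XML declaration.
        using (FileStream stream = File.OpenRead(path))
        {
            XmlTextReader reader = new XmlTextReader(stream);
            reader.MoveToContent();
            XmlNode node = doc.ReadNode(reader);
            doc.DocumentElement.AppendChild(node);
            return node;
        }
    }

    public static void Main()
    {
        // Sample input: iso-8859-1 file with the bytes ED B3 A8 in a CDATA section.
        string path = Path.Combine(Path.GetTempPath(), "readnode_sketch_sample.xml");
        File.WriteAllText(
            path,
            "<?xml version='1.0' encoding='iso-8859-1'?><item><![CDATA[\u00ED\u00B3\u00A8]]></item>",
            Encoding.GetEncoding("iso-8859-1"));

        XmlDocument xml = new XmlDocument();
        xml.AppendChild(xml.CreateElement("root"));

        XmlNode node = AppendFile(xml, path);
        foreach (char c in node.InnerText)
        {
            Console.WriteLine("U+{0:X4}", (int)c);
        }
    }
}
```

The only change from the original snippet is that the reader is constructed from a Stream instead of a StreamReader.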

Thanks
-mark

Nov 12 '05 #7

"Mark" <mm******@nospam.nospam> wrote in message news:A3**********************************@microsoft.com...
> As David noted, the reader encoding appears to be defaulting to unicode, so
> that seems to be the problem.

When the encoding of the XMLDecl and the encoding of the content
presented to the reader are different, then you will have problems.

> If you use File.OpenText() to get a StreamReader and then construct
> XmlTextReader with a StreamReader, it appears to lock the encoding
> in place and XmlTextReader will not respect the processing directive.

The documentation for File.OpenText( ) is clear about interpreting
the file as UTF-8,

http://msdn.microsoft.com/library/en...ntexttopic.asp

Even though the file MAY be encoded as iso-8859-1, doing this will
"present" the file's contents as UTF-8.

The encoding of the I/O StreamReader is paramount because remember,
XmlTextReader depends upon the StreamReader's Read( ) method(s).
The StreamReader is responsible for decoding from whatever bytes
are in the file to characters using its encoding (it knows nothing about
XMLDecl).
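
To make that concrete, here is a small sketch (not from the original program) that decodes the same three bytes through a StreamReader twice, once per encoding. The Decode helper and class name are hypothetical. What the UTF-8 pass produces for this invalid sequence depends on the runtime version, so the sketch only prints the results and makes no claim about them.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamReaderDecodingSketch
{
    // Hypothetical helper: decodes 'bytes' the way a StreamReader built
    // with the given encoding would present them to XmlTextReader.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(new MemoryStream(bytes), encoding))
        {
            return reader.ReadToEnd();
        }
    }

    private static void Dump(string label, string text)
    {
        Console.Write(label + ":");
        foreach (char c in text)
        {
            Console.Write(" U+{0:X4}", (int)c);
        }
        Console.WriteLine();
    }

    public static void Main()
    {
        byte[] bytes = { 0xED, 0xB3, 0xA8 };

        // iso-8859-1 maps each byte to the code point with the same value.
        Dump("iso-8859-1", Decode(bytes, Encoding.GetEncoding("iso-8859-1")));

        // UTF-8 treats the bytes as one multi-byte sequence instead.
        Dump("utf-8", Decode(bytes, Encoding.UTF8));
    }
}
```

Either way, the characters are fixed by the StreamReader before XmlTextReader ever sees the XMLDecl.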

> If you use File.OpenRead() to get a simple FileStream and use *that*
> to construct XmlTextReader, the XmlTextReader is more responsive
> to what's in the stream it's reading.

FileStreams can be binary, therefore choosing a FileStream gives the
XmlTextReader the option to read *bytes* instead of characters. It
then has something to say about what encoding it uses to perform this
translation.

> It is a step you can do manually (check to see if the first node
> you have is a processing directive and grab the encoding yourself)

That's the XmlDeclaration's Encoding property. It won't appear as
an XmlProcessingInstruction. You could use this code to inject an
XMLDecl if one isn't already present,

if ( xml.FirstChild.NodeType != XmlNodeType.XmlDeclaration )
{
    XmlDeclaration decl = xml.CreateXmlDeclaration( "1.0", "iso-8859-1", null);
    xml.InsertBefore( decl, xml.FirstChild);
}

To read the encoding off of an XmlDocument's XMLDecl,

string encodingStr = null;
if ( xml.FirstChild.NodeType == XmlNodeType.XmlDeclaration )
    encodingStr = ((XmlDeclaration) xml.FirstChild).Encoding;
// Fall back to UTF-8 when there is no XMLDecl or it declares no encoding.
encodingStr = ( encodingStr == null || encodingStr.Length == 0 ) ? "UTF-8" : encodingStr;

If the XmlDocument's FirstChild isn't of XmlNodeType.XmlDeclaration
then it doesn't have an XMLDecl. If there is no XMLDecl, or there is
one without an Encoding, then the encoding is UTF-8 by default.

In my experience, when the encoding on the XMLDecl matches the
encoding of the content, there are no problems.

I've tried producing a file to match your example like this,

- - - WriteOut.cs
using System;
using System.IO;
using System.Text;
using System.Xml;

public class WriteOutIso8859_1
{
    public static void Main( )
    {
        FileStream fs = new FileStream( "iso8859_1.xml", FileMode.CreateNew);
        StreamWriter writer = new StreamWriter( fs, Encoding.GetEncoding( "iso-8859-1"));
        writer.WriteLine( "<?xml version='1.0' encoding='iso-8859-1'?>");
        writer.WriteLine( "<root>");
        writer.WriteLine( "\t<first>Hello World</first>");
        writer.Write( "\t<second><![CDATA[");
        writer.Write( new char[] { (char)0xED, (char)0xB3, (char)0xA8} );
        writer.WriteLine( "]]></second>");
        writer.WriteLine( "</root>");
        writer.Flush( );
        writer.Close( );
    }
}
- - -

When I read this file in with the following code I have no problems.

FileStream fs = new FileStream( "iso8859_1.xml", FileMode.Open);
StreamReader sr = new StreamReader( fs, Encoding.GetEncoding( "iso-8859-1"));
XmlTextReader reader = new XmlTextReader( sr);
reader.MoveToContent( );
XmlNode node = xmlDoc.ReadNode( reader);   // xmlDoc is the target XmlDocument

Derek Harmon
Nov 12 '05 #8
