LoadXML and UTF-8 encoding

Hello everybody,
I'm having trouble loading an XML string encoded in UTF-8.

If I try this code:
------------------------------
XmlDocument doc=new XmlDocument();
String s="<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?>"
    + "<a>SchÃ¶nbÃ¼hl</a>";
doc.LoadXml(s);
doc.Save("d:\\temp\\test.xml");
------------------------------

What I get in the test.xml file is:
------------------------------
ï»¿<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<a>SchÃƒÂ¶nbÃƒÂ¼hl</a>
------------------------------

I'm puzzled by two points in the test.xml file:
- What is the "ï»¿" at the beginning?
- Why are the special chars double-encoded?

Am I missing something? Is there any workaround?

Thanks in advance,

jmgonet.
Nov 12 '05 #1
jmgonet wrote:
> I'm puzzled about two points in the test.xml file:
> - What is the "ï»¿" at the beginning?

It's the Unicode byte order mark (BOM). It's OK, but in UTF-8
it's optional and you can get rid of it:

XmlTextWriter w = new XmlTextWriter("d:\\temp\\test.xml",
    new UTF8Encoding(false));
doc.Save(w);
w.Close();

> - Why are the special chars double-encoded?

What do you mean?
--
Oleg Tkachenko [XML MVP, MCP]
http://blog.tkachenko.com
Nov 12 '05 #2
Thanks, Oleg, for your reply.

OK for the "ï»¿". That's interesting.

By double-encoded I mean that the chars are encoded twice to UTF-8:
Originally the string contained in the XML was "Schönbühl".
To put it into the UTF standard, I've transformed it to "SchÃ¶nbÃ¼hl", where
the Ã¶="ö" and the Ã¼="ü". This is the string I'm trying to load:

-------------------------------------------------
XmlDocument doc=new XmlDocument();
String s="<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?>"
    + "<a>SchÃ¶nbÃ¼hl</a>";
doc.LoadXml(s);
doc.Save("d:\\temp\\test.xml");
--------------------------------------------------

But when I open "test.xml" in a text editor, I get:
------------------------------
ï»¿<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<a>SchÃƒÂ¶nbÃƒÂ¼hl</a>
------------------------------

The string is converted to "SchÃƒÂ¶nbÃƒÂ¼hl", where the Ã="Ãƒ", the
¶="Â¶"...


Nov 12 '05 #3
jmgonet wrote:
> By double-encoded I mean that the chars are encoded twice to UTF-8:
> Originally the string contained in the XML was "Schönbühl".
> To put it into the UTF standard, I've transformed it to "SchÃ¶nbÃ¼hl", where
> the Ã¶="ö" and the Ã¼="ü".

I think that's a bad idea. UTF-8 defines how Unicode characters are
represented in bytes. By doubling the characters you just get two
characters instead of one.

What's wrong with "Schönbühl"? Just use it as is.
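
To make the point concrete, here is a minimal sketch of what the hand-made transformation amounts to. It assumes the transformed text is simply the UTF-8 bytes of the name read back as ISO-8859-1; the names DoubleEncodingDemo, ToHex and Misread are made up for this illustration.

```csharp
using System;
using System.Text;

public static class DoubleEncodingDemo
{
    // Hex view of the UTF-8 bytes of a string.
    public static string ToHex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    // Encode as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    // Assumption: this is how the hand-transformed string came about.
    public static string Misread(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8);
    }

    public static void Main()
    {
        string name = "Sch\u00F6nb\u00FChl";   // the real city name, 9 characters
        string once = Misread(name);           // 11 characters, looks like mojibake

        Console.WriteLine(name.Length + " -> " + once.Length);   // 9 -> 11
        Console.WriteLine(ToHex("\u00F6"));                      // C3-B6
        Console.WriteLine(ToHex(Misread("\u00F6")));             // C3-83-C2-B6
    }
}
```

The last line is the double encoding: the two characters that stand in for one ö are each encoded again, giving four bytes where two were expected.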

--
Oleg Tkachenko [XML MVP, MCP]
http://blog.tkachenko.com
Nov 12 '05 #4
> What's wrong with "Schönbühl"?

Well, I see I was oversimplifying my question in an attempt to avoid
discussing other issues. The point is that the XML file is provided by
guys from another company. These guys seem to be very fond of UTF-8, so they
encode everything in UTF-8. They make UTF-8 files and leave them on an
FTP server.

I'm writing an application that logs into the FTP server and gets the XML
files. I then end up with the file contents in a string. I don't have any
control over the content of the file or its format, so I have to accept it
"as is".

Now I have a very long string containing lots of data (about 20 KB); its
header is
<?xml version="1.0" encoding="utf-8" standalone="yes"?>

And there are plenty of elements, some of them containing special chars like
those in the "Schönbühl" city name.

So I want to load this Xml file into a Document. To do this I thought that
the easiest way was to load it from memory:
[... Lot of code here ...]
[... A lot of code here too ...]
[... And a lot more, just to ...]
[... obtain a string with ...]
[... the content of my Xml file in it ...]
XmlDocument doc=new XmlDocument();
doc.LoadXml(s);

At this point I had some issues with badly formatted strings, so I started to
investigate. I investigated the file on the FTP server, the TCP
communication between the server and my application, and the
Encoding in my application. In the end I found that everything seemed to be
correct, BUT the special chars were still broken.

So, as I didn't have any clue about what was wrong, I started reducing my
application in order to obtain the simplest and shortest code reproducing
my error. And I ended up with this:

----------------------------------------------
XmlDocument doc=new XmlDocument();
String s="<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?>"
    + "<a>SchÃ¶nbÃ¼hl</a>";
doc.LoadXml(s);
doc.Save("d:\\temp\\test.xml");
----------------------------------------------

My objective is not to have a file called "test.xml" in a folder called
"temp" containing the name of a bizarre city. It is just that this
small piece of code happens to behave in a way I'm not sure I understand.
In my mind, if you load the string s as an XML document and then save it to
an XML file, the file and the string should have identical content,
or at least equivalent content. I still believe it.

But instead, if I use a text editor to read the "test.xml" file, I obtain
this:

----------------------------------------------
ï»¿<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<a>SchÃƒÂ¶nbÃƒÂ¼hl</a>
----------------------------------------------

where I expected to see
----------------------------------------------
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<a>SchÃ¶nbÃ¼hl</a>
----------------------------------------------

Thanks to you, now I understand the meaning of the "ï»¿" at the beginning.

But I'm still puzzled about the double encoding of the "Schönbühl"'s
special chars.

Regards,
Jean-Michel Gonet.
Nov 12 '05 #5
jmgonet wrote:
> I'm writing an application logging into the FTP server, getting the XML
> files. Then I end up with the file contained in a string.
> [...]
> At the end I found that everything seemed to be
> correct, BUT still the special chars were broken.


Hmmm, actually reading UTF-8 XML as a string should work. In fact,
strings in .NET are always UTF-16 encoded, and XmlTextReader has a special
ability to recognize this case and switch to UTF-16.
Problems usually arise when you read the XML into a string: that has to be
done with respect to the UTF-8 encoding.
For instance, take this XML file:

<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<books>Schönbühl</books>

It's stored in UTF-8 on Windows as
EF BB BF 3C 3F 78 6D 6C │ 20 76 65 72 73 69 6F 6E  ...<?xml version
3D 22 31 2E 30 22 20 65 │ 6E 63 6F 64 69 6E 67 3D  ="1.0" encoding=
22 75 74 66 2D 38 22 20 │ 73 74 61 6E 64 61 6C 6F  "utf-8" standalo
6E 65 3D 22 79 65 73 22 │ 3F 3E 0D 0A 3C 62 6F 6F  ne="yes"?>..<boo
6B 73 3E 53 63 68 C3 B6 │ 6E 62 C3 BC 68 6C 3C 2F  ks>Sch..nb..hl</
62 6F 6F 6B 73 3E 0D 0A │ 0D 0A                    books>....

Note how the letter ö is encoded in UTF-8 as two bytes, C3 B6, and ü as C3 BC.
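
The byte values in the dump can be checked with a few lines of code. This is only a sketch; the class name Utf8BytesDemo and the helper Hex are invented for the example.

```csharp
using System;
using System.Text;

public static class Utf8BytesDemo
{
    // Formats bytes as dash-separated hex, e.g. "C3-B6".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    public static void Main()
    {
        // The three bytes at the start of the file are the UTF-8 byte order mark.
        Console.WriteLine(Hex(Encoding.UTF8.GetPreamble()));        // EF-BB-BF

        // Each accented letter takes two bytes in UTF-8.
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes("\u00F6")));   // C3-B6
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes("\u00FC")));   // C3-BC

        // UTF8Encoding(false) has an empty preamble, so no BOM is written.
        Console.WriteLine(new UTF8Encoding(false).GetPreamble().Length);   // 0
    }
}
```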

Then the following code correctly reads the file into a string and then
loads it into DOM.

StreamReader sr = new StreamReader("foo.xml", Encoding.UTF8);
string xml = sr.ReadToEnd();
sr.Close();
XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
doc.Save("foo2.xml");

The result is the same data.

--
Oleg Tkachenko [XML MVP, MCP]
http://blog.tkachenko.com
Nov 12 '05 #6
Yes, your example is nice, but see that

------------------------------------
<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<books>Schönbühl</books>
------------------------------------

is not correctly formatted for a UTF-8 file. Anyway, it shows the strange
behavior of the "LoadXml" method. Look at this: take your sample, copy it into
a text editor, and save the file as "test.xml". Then try to open it with
Internet Explorer. You get an error:

-----------------------
An invalid character was found in text content. Error
processing resource 'file:///D:/TEMP/test.xml'.
Line 2, Position 11

<books>Sch
--------------------------------------

So, LoadXml can read it, but Internet Explorer can't.

You can also try this other example. Create the following file with a text
editor:

--------------------------------------
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><a>SchÃ¶nbÃ¼hl</a>
--------------------------------------

Save it as "test.xml", and then run the following code:

--------------------------------------
XmlDocument doc=new XmlDocument();
doc.Load("d:\\temp\\test.xml");
doc.Save("d:\\temp\\test2.xml");
--------------------------------------

If you open "test2.xml" with the text editor, you can see that it is
identical to test.xml.

So the "Load" method doesn't behave like "LoadXml":
- LoadXml seems to load any XML string as ISO-8859-1, regardless of the
header. Afterwards, Save uses the encoding information to save the file. But
that is another story.
- Load seems to check the header for encoding information.

But try this:
--------------------------------------
XmlDocument doc=new XmlDocument();
String s="<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
    + "<a>SchÃ¶nbÃ¼hl</a>";
StringReader sr=new StringReader(s);
doc.Load(sr);
doc.Save("d:\\temp\\test2.xml");
--------------------------------------

It is the same example as at the beginning, but using Load instead of
LoadXml. The result is

--------------------------------------
ï»¿<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<a>SchÃƒÂ¶nbÃƒÂ¼hl</a>
--------------------------------------

So the "Load" method behaves differently when it reads from a file and when
it reads from a StringReader!
For now I've worked around my UTF-8 problem, but I'm still convinced
that the Load and LoadXml methods behave bizarrely.
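
One possible reading of this, offered as a sketch rather than a confirmed diagnosis: a .NET string is already decoded text, so LoadXml and Load(StringReader) have no bytes left to which the encoding declaration could apply, whereas Load on a file or a Stream decodes the bytes itself. If the string was built by decoding the UTF-8 bytes as ISO-8859-1 (an assumption, since the download code is not shown in this thread), the original bytes can be recovered and handed to Load as a stream. The names Utf8XmlRepair and LoadMisdecoded are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class Utf8XmlRepair
{
    // Assumption: 'misdecoded' was produced by reading UTF-8 bytes as
    // ISO-8859-1, so every char maps back to exactly one original byte.
    public static XmlDocument LoadMisdecoded(string misdecoded)
    {
        byte[] raw = Encoding.GetEncoding("iso-8859-1").GetBytes(misdecoded);
        XmlDocument doc = new XmlDocument();
        using (MemoryStream stream = new MemoryStream(raw))
        {
            // Reading from a stream lets the parser decode the bytes itself,
            // following the BOM or the encoding in the XML declaration.
            doc.Load(stream);
        }
        return doc;
    }

    public static void Main()
    {
        // The string from the example above, written with escapes.
        string s = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                 + "<a>Sch\u00C3\u00B6nb\u00C3\u00BChl</a>";
        XmlDocument doc = LoadMisdecoded(s);

        // True: the element now holds the 9-character name with real umlauts.
        Console.WriteLine(doc.DocumentElement.InnerText == "Sch\u00F6nb\u00FChl");
    }
}
```

It is probably simpler to avoid the intermediate string altogether: pass the downloaded bytes or the response stream straight to Load, or decode them with Encoding.UTF8 before calling LoadXml.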

Regards,
Jean-Michel Gonet.

Nov 12 '05 #7
