LoadXML and UTF-8 encoding

Hello everybody,
I'm having trouble loading an XML string encoded in UTF-8.

If I try this code:
------------------------------
XmlDocument doc=new XmlDocument();
String s="<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><a>SchÃ¶nbÃ¼hl</a>";
doc.LoadXml(s);
doc.Save("d:\\temp\\test.xml");
------------------------------

What I get in the test.xml file is:
------------------------------
ï»¿<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<a>SchÃƒÂ¶nbÃƒÂ¼hl</a>
------------------------------

I'm puzzled about two points in the test.xml file:
- What is the "ï»¿" at the beginning?
- Why are the special chars double-encoded?

Am I missing something? Is there a workaround?

Thanks in advance,

jmgonet.
Nov 12 '05 #1
jmgonet wrote:
> - What is the "ï»¿" at the beginning?
It's the Unicode byte order mark (BOM). That's fine, but in UTF-8
it's optional, and you can get rid of it:

XmlTextWriter w = new XmlTextWriter("d:\\temp\\test.xml", new UTF8Encoding(false));
doc.Save(w);
w.Close();
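
To check the effect without opening the file in an editor, a minimal sketch along these lines could serialize the document into memory and look at the first three bytes. The names BomCheck and StartsWithBom are made up for this illustration; only XmlDocument, XmlTextWriter and UTF8Encoding come from the framework.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public class BomCheck
{
    // Serializes the document with the given encoding and reports whether
    // the output starts with the UTF-8 BOM bytes EF BB BF.
    public static bool StartsWithBom(XmlDocument doc, Encoding encoding)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            XmlTextWriter w = new XmlTextWriter(ms, encoding);
            doc.Save(w);
            w.Flush();
            byte[] bytes = ms.ToArray();
            return bytes.Length >= 3
                && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
        }
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<a>Sch\u00F6nb\u00FChl</a>");

        // UTF8Encoding(true) asks for the BOM, UTF8Encoding(false) suppresses it.
        Console.WriteLine(StartsWithBom(doc, new UTF8Encoding(true)));
        Console.WriteLine(StartsWithBom(doc, new UTF8Encoding(false)));
    }
}
```

The boolean passed to the UTF8Encoding constructor is the only difference between the two calls, which is why it is enough to change that one argument in the snippet above.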

> - Why are the special chars double-encoded?

What do you mean?
--
Oleg Tkachenko [XML MVP, MCP]
http://blog.tkachenko.com
Nov 12 '05 #2
Thanks, Oleg, for your reply.

OK for the "ï»¿". That's interesting.

By double-encoded I mean that the chars are encoded to UTF-8 twice.
Originally the string contained in the XML was "Schönbühl".
To put it into the UTF-8 standard, I transformed it to "SchÃ¶nbÃ¼hl", where
the ö="Ã¶" and the ü="Ã¼". This is the string I'm trying to load:

-------------------------------------------------
XmlDocument doc=new XmlDocument();
String s="<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><a>SchÃ¶nbÃ¼hl</a>";
doc.LoadXml(s);
doc.Save("d:\\temp\\test.xml");
--------------------------------------------------

But when I open "test.xml" in a text editor, I get:
------------------------------
ï»¿<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<a>SchÃƒÂ¶nbÃƒÂ¼hl</a>
------------------------------

The string is converted to "SchÃƒÂ¶nbÃƒÂ¼hl", where the Ã="Ãƒ", the
¶="Â¶"...

Nov 12 '05 #3
jmgonet wrote:
> By double encoded I mean that the chars are encoded twice to UTF-8:
> Originally the string contained in the xml was "Schönbühl".
> To put it into the UTF standard, I've transformed it to "SchÃ¶nbÃ¼hl", where
> the ö="Ã¶" and the ü="Ã¼".


I think that's a bad idea. UTF-8 defines how Unicode characters are
represented in bytes. By doubling characters you get just two characters.
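
A small sketch may make the byte-level picture concrete. It uses only System.Text; the helper name Utf8Hex is invented for this example. It prints the UTF-8 bytes of the single character ö and of the two-character sequence Ã¶ that results from spelling those bytes out as Latin-1 characters.

```csharp
using System;
using System.Text;

public class EncodingDemo
{
    // Returns the UTF-8 bytes of a string as space-separated hex.
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text)).Replace("-", " ");
    }

    public static void Main()
    {
        // One character, U+00F6 (ö): two bytes in UTF-8.
        Console.WriteLine(Utf8Hex("\u00F6"));        // C3 B6

        // Two characters, U+00C3 U+00B6 (Ã¶): each is encoded again,
        // giving four bytes.
        Console.WriteLine(Utf8Hex("\u00C3\u00B6"));  // C3 83 C2 B6
    }
}
```

Viewed as Latin-1, the four bytes C3 83 C2 B6 are what a text editor shows as a doubled sequence, which matches the symptom described in the question.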

What's wrong with "Schönbühl"? Just use it as is.

--
Oleg Tkachenko [XML MVP, MCP]
http://blog.tkachenko.com
Nov 12 '05 #4
> What's wrong with "Schönbühl"?

Well, I see I was oversimplifying my question, in an attempt to avoid
discussing other issues. The point is that the XML file is provided by
people from another company. They seem to be very fond of UTF-8, so they
encode everything in UTF-8. They create UTF-8 files and leave them on an
FTP server.

I'm writing an application that logs into the FTP server and fetches the XML
files. I then end up with the file contents in a string. I don't have any
control over the content of the file or its format, so I have to accept it
"as is".

Now I have a very long string containing a lot of data (about 20 KB). Its
header is
<?xml version="1.0" encoding="utf-8" standalone="yes"?>

And there are plenty of elements, some of them containing special chars like
those in the "Schönbühl" city name.

So I want to load this Xml file into a Document. To do this I thought that
the easiest way was to load it from memory:
[... Lot of code here ...]
[... A lot of code here too ...]
[... And a lot more, just to ...]
[... obtain a string with ...]
[... the content of my Xml file in it ...]
XmlDocument doc=new XmlDocument();
doc.LoadXml(s);

At this point I had some issues with badly formatted strings, so I started to
investigate. I investigated the file on the FTP server, the TCP
communication between the server and my application, and the
encoding in my application. In the end I found that everything seemed to be
correct, BUT the special chars were still broken.

So, as I didn't have any clue about what was wrong, I started reducing my
application in order to obtain the simplest and shortest code reproducing
my error. And I ended up with this:

----------------------------------------------
XmlDocument doc=new XmlDocument();
String s="<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><a>SchÃ¶nbÃ¼hl</a>";
doc.LoadXml(s);
doc.Save("d:\\temp\\test.xml");
----------------------------------------------
----------------------------------------------

My objective is not to have a file called "test.xml", in a folder called
"temp", containing the name of a bizarre city. It is just that this
small piece of code happens to behave in a way I'm not sure I understand.
In my mind, if you load the string s as an XML document and then save it to
an XML file, the file and the string should have identical content,
or at least equivalent content. I still believe that.

But instead, if I use a text editor to read the "test.xml" file, I get
this:

----------------------------------------------
ï»¿<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<a>SchÃƒÂ¶nbÃƒÂ¼hl</a>
----------------------------------------------

where I expected to see
----------------------------------------------
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<a>SchÃ¶nbÃ¼hl</a>
----------------------------------------------

Thanks to you, I now understand the meaning of the "ï»¿" at the beginning.

But I'm still puzzled about the double encoding of the special chars in
"Schönbühl".

Regards,
Jean-Michel Gonet.

Nov 12 '05 #5
jmgonet wrote:
> At this point I had some issues with badly formatted strings. So I started to
> investigate. I investigated the file on the FTP server, the TCP
> communication between the server and my application, and the
> encoding in my application. In the end I found that everything seemed to be
> correct, BUT still the special chars were broken.

Hmmm, actually, loading UTF-8 XML from a string should work. Strings
in .NET are always UTF-16 encoded, and XmlTextReader is able to
recognize this case and switch to UTF-16.
Problems usually arise earlier, when you read the XML into a string: the
bytes have to be decoded as UTF-8 at that point.
For instance, XML file:

<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<books>Schönbühl</books>

It's stored in UTF-8 on Windows as
EF BB BF 3C 3F 78 6D 6C │ 20 76 65 72 73 69 6F 6E я╗┐<?xml version
3D 22 31 2E 30 22 20 65 │ 6E 63 6F 64 69 6E 67 3D ="1.0" encoding=
22 75 74 66 2D 38 22 20 │ 73 74 61 6E 64 61 6C 6F "utf-8" standalo
6E 65 3D 22 79 65 73 22 │ 3F 3E 0D 0A 3C 62 6F 6F ne="yes"?>♪◙<boo
6B 73 3E 53 63 68 C3 B6 │ 6E 62 C3 BC 68 6C 3C 2F ks>Sch├╢nb├╝hl</
62 6F 6F 6B 73 3E 0D 0A │ 0D 0A books>♪◙♪◙

Note how the letter ö is encoded in UTF-8 as two bytes, C3 B6, and ü as C3 BC.

Then the following code correctly reads the file into a string and then
loads it into DOM.

StreamReader sr = new StreamReader("foo.xml", Encoding.UTF8);
string xml = sr.ReadToEnd();
sr.Close();
XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
doc.Save("foo2.xml");

The result is the same data.
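
The same idea applies when the bytes come from somewhere other than a file, for example a download buffer. The sketch below is an illustration only: the byte array stands in for whatever the FTP code returns, and RootText is a hypothetical helper, not part of the original application. It decodes the same bytes once as UTF-8 and once as ISO-8859-1 before calling LoadXml.

```csharp
using System;
using System.Text;
using System.Xml;

public class DecodeDemo
{
    // Decodes raw bytes with the given encoding, parses the resulting
    // string with LoadXml, and returns the text of the root element.
    public static string RootText(byte[] raw, Encoding encoding)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(encoding.GetString(raw));
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        // Stand-in for the downloaded file contents (made-up sample data, no BOM).
        byte[] raw = Encoding.UTF8.GetBytes(
            "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><a>Sch\u00F6nb\u00FChl</a>");

        // Decoding as UTF-8 recovers the original characters.
        string good = RootText(raw, Encoding.UTF8);
        // Decoding as ISO-8859-1 turns every byte into its own character.
        string bad = RootText(raw, Encoding.GetEncoding("iso-8859-1"));

        Console.WriteLine(good == "Sch\u00F6nb\u00FChl");
        Console.WriteLine(bad == "Sch\u00C3\u00B6nb\u00C3\u00BChl");
    }
}
```

If the second decoding is what happens somewhere between the download and LoadXml, saving the document as UTF-8 afterwards would encode the already split characters again, which would look like double encoding in the output file.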

--
Oleg Tkachenko [XML MVP, MCP]
http://blog.tkachenko.com
Nov 12 '05 #6
Yes, your example is nice, but note that

------------------------------------
<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<books>Schönbühl</books>
------------------------------------

is not correctly formatted for a UTF-8 file. Anyway, it shows the strange
behavior of the "LoadXml" method. Look at this: take your sample, copy it into
a text editor, and save the file as "test.xml". Then try to open it with
Internet Explorer. You get an error:

-----------------------
An invalid character was found in text content. Error
processing resource 'file:///D:/TEMP/test.xml'.
Line 2, Position 11

<books>Sch
--------------------------------------

So, LoadXml can read it, but Internet Explorer can't.

You can also try this other example. Create the following file with a text
editor:

--------------------------------------
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><a>SchÃ¶nbÃ¼hl</a>
--------------------------------------

Save it as "test.xml", and then run the following code:

--------------------------------------
XmlDocument doc=new XmlDocument();
doc.Load("d:\\temp\\test.xml");
doc.Save("d:\\temp\\test2.xml");
--------------------------------------

If you open "test2.xml" with the text editor, you can see that it is
identical to test.xml.

So the "Load" method doesn't behave like "LoadXml":
- LoadXml seems to load any XML string as ISO-8859-1, regardless of the
header. Afterwards, Save uses the encoding information to save the file. But
that is another story.
- Load seems to check the header for encoding information.

But, try this:
--------------------------------------
XmlDocument doc=new XmlDocument();
String s="<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><a>SchÃ¶nbÃ¼hl</a>";
StringReader sr=new StringReader(s);
doc.Load(sr);
doc.Save("d:\\temp\\test2.xml");
--------------------------------------

It is the same example as at the beginning, but using Load instead of
LoadXml. The result is

--------------------------------------
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<a>SchÃƒÂ¶nbÃƒÂ¼hl</a>
--------------------------------------

So the "Load" method behaves differently when it reads from a file and when
it reads from a StringReader!
For now I've worked around my UTF-8 problem, but I'm still convinced
that the Load and LoadXml methods behave bizarrely.
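
One possible reading of the two experiments, sketched here under the assumption that the text editor saved the file one byte per character (ISO-8859-1): loading from bytes lets the parser decode them itself, while a string or StringReader already holds decoded characters that are taken as they are. The class and method names (LoadDemo, TextFromBytes, TextFromString) are invented for the example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public class LoadDemo
{
    // Parses XML from raw bytes; the parser decodes them using the BOM
    // or the encoding declaration.
    public static string TextFromBytes(byte[] raw)
    {
        XmlDocument doc = new XmlDocument();
        using (MemoryStream ms = new MemoryStream(raw))
        {
            doc.Load(ms);
        }
        return doc.DocumentElement.InnerText;
    }

    // Parses XML from a string; its characters are already decoded.
    public static string TextFromString(string xml)
    {
        XmlDocument doc = new XmlDocument();
        using (StringReader sr = new StringReader(xml))
        {
            doc.Load(sr);
        }
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        // The string from the test above, with U+00C3 U+00B6 and U+00C3 U+00BC.
        string xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                   + "<a>Sch\u00C3\u00B6nb\u00C3\u00BChl</a>";

        // What an editor saving one byte per character would write (assumption).
        byte[] fileBytes = Encoding.GetEncoding("iso-8859-1").GetBytes(xml);

        // From bytes: C3 B6 is read as one UTF-8 character, U+00F6.
        Console.WriteLine(TextFromBytes(fileBytes) == "Sch\u00F6nb\u00FChl");
        // From the string: the two characters stay two characters.
        Console.WriteLine(TextFromString(xml) == "Sch\u00C3\u00B6nb\u00C3\u00BChl");
    }
}
```

Under that assumption both results are consistent: the file round-trips unchanged because its bytes were valid UTF-8 all along, and the string version grows because each of its characters is encoded separately on Save.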

Regards,
Jean-Michel Gonet.

Nov 12 '05 #7
