HowTo: Serialize to a String without extra characters

I have a class that I want to serialize to an XML string, and I want the XML to use utf-8 encoding. When I serialize to an XML file, the data looks great. When I try to serialize to a String (à la StringBuilder), I get utf-16, and instead of a quotation mark (") I get a backslash followed by a quotation mark (\"), which makes sense when looking at a character in memory, but not in a string.

Here is my code

XmlSerializer serializer = new XmlSerializer (myObject.GetType ());
StringBuilder builder = new StringBuilder ();
StringWriter stringWriter = new StringWriter (builder);
XmlTextWriter xmlWriter = new XmlTextWriter (stringWriter);
xmlWriter.Formatting = Formatting.Indented;

// Serialize the object to the XML writer
serializer.Serialize (xmlWriter, myObject);

return builder.ToString ();

If I try to write to a memory stream and then convert the byte array to a string via the binary reader, I get exceptions, because for some reason garbage characters that are not ASCII/Unicode characters are written to the front of the byte array.

Any help would be great! Thanks

Brian
Nov 12 '05 #1
5 Replies


Let me restate my problem.

I am trying to serialize a class into a string in memory that is encoded with utf-8 encoding. Using the StringBuilder, StringWriter and XmlSerializer, I can only ever serialize to utf-16. If that output is persisted to a file, it cannot be parsed by XML viewers such as IE, and the utf-16 string cannot be processed by SQL Server either. Is there a way to force these classes to serialize to utf-8 encoding?

I have tried serializing to a MemoryStream, but I am getting 3 strange characters at the front of the stream. I am creating an empty MemoryStream and passing it to an XmlTextWriter as the stream behind it. When I serialize, I get the complete document, but with the three mysterious characters on the front that prevent me from reading the stream as a string.

Any serialization help would be great. Please forgive the rambling in my previous posting.

Brian
Nov 12 '05 #2

"Brian Reed" <an*******@discussions.microsoft.com> wrote in message news:3B**********************************@microsof t.com...
> I am trying to serialize a class into a string in memory that is encoded with utf-8 encoding.
> Using the StringBuilder, StringWriter and XmlSerializer, I can only ever serialize to utf-16, [...]
> Is there a way to force these classes to serialize to utf-8 encoding?
No. UTF-16 is the internal representation of System.String. If you require a different
encoding, you must store it in something other than System.String (MemoryStream or
Byte[] are what commonly come to mind).
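
A minimal sketch of that point, assuming the goal is simply to get UTF-8 bytes out of text that already lives in a System.String. The class name Utf8BytesDemo, the helper ToUtf8 and the sample XML are made up for illustration; only the System.Text calls are real framework APIs.

```csharp
using System;
using System.Text;

public static class Utf8BytesDemo
{
    // Encodes a .NET string (UTF-16 in memory) as UTF-8 bytes.
    // GetBytes never writes a byte order mark, whatever the constructor flag is.
    public static byte[] ToUtf8(string text)
    {
        return new UTF8Encoding(false).GetBytes(text);
    }

    public static void Main()
    {
        string xml = "<note>caf\u00e9</note>";          // 17 characters, one non-ASCII
        byte[] utf8 = ToUtf8(xml);
        byte[] utf16 = Encoding.Unicode.GetBytes(xml);

        Console.WriteLine(utf8.Length);   // 18: the e-acute takes two bytes
        Console.WriteLine(utf16.Length);  // 34: two bytes per character
    }
}
```

The string itself never "is" UTF-8; the encoding only exists once the characters are turned into bytes.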
> I have tried serializing to a MemoryStream, but I am getting 3 strange characters at the
> front of the stream.


That's probably the UTF-8 Byte Order Mark (BOM), although AFAIK the BOM
only introduces a pair of "mysterious" characters, not a trio. On the bright side,
it suggests that you really have UTF-8 encoding there. ;-)

The BOM exists because UTF-16 and UTF-32 can be encoded as little-endian
or big-endian. It's possible to suppress the BOM, and for UTF-8 that is safe:
UTF-8 has a single fixed byte order, so its BOM is only a signature marking the
stream as UTF-8, and an XML parser that finds no BOM falls back to the encoding
declaration (assuming UTF-8 if there is none).

Look for the constructor where you create the System.Text.UTF8Encoding in
your code. If you pass true as the first argument (encoderShouldEmitUTF8Identifier)
to this constructor, try changing it to false instead. This should remove any BOM
from the output.
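
A rough sketch of how that could look with the MemoryStream approach from the question. The Message class, the Utf8XmlDemo class and the SerializeToUtf8 helper are hypothetical stand-ins (the original poster's types were not shown); the constructors used are the standard XmlTextWriter(Stream, Encoding) and UTF8Encoding(bool).

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical sample type standing in for the poster's class.
public class Message
{
    public string Text;
}

public static class Utf8XmlDemo
{
    // Serializes any XmlSerializer-compatible object to UTF-8 bytes without a BOM.
    public static byte[] SerializeToUtf8(object value)
    {
        XmlSerializer serializer = new XmlSerializer(value.GetType());
        using (MemoryStream stream = new MemoryStream())
        {
            // false = do not emit the three-byte UTF-8 byte order mark
            XmlTextWriter xmlWriter = new XmlTextWriter(stream, new UTF8Encoding(false));
            xmlWriter.Formatting = Formatting.Indented;

            serializer.Serialize(xmlWriter, value);
            xmlWriter.Flush();   // make sure everything reaches the stream

            return stream.ToArray();
        }
    }

    public static void Main()
    {
        byte[] bytes = SerializeToUtf8(new Message { Text = "hello" });

        // Decoding is only for display; the byte array is the UTF-8 document.
        Console.WriteLine(Encoding.UTF8.GetString(bytes));
    }
}
```

If the result must end up in a System.String anyway, decoding with Encoding.UTF8.GetString gives a string with no stray leading characters, but in memory that string is UTF-16 again.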
Derek Harmon
Nov 12 '05 #3

Derek Harmon wrote:
> That's probably the UTF-8 Byte Order Mark (BOM), although AFAIK the BOM
> only introduces a pair of "mysterious" characters, not a trio. On the bright side,
> it suggests that you really have UTF-8 encoding there. ;-)


BOM length depends on encoding - in UTF-8 it's 3 bytes, while in UTF-16
it's 2 bytes. See
http://www.w3.org/TR/2000/REC-xml-20...ng-no-ext-info
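
For anyone who wants to check those lengths locally, here is a small sketch that prints the preambles the framework's encodings report. BomLengthDemo and Hex are illustrative names only; GetPreamble and BitConverter.ToString are existing framework calls.

```csharp
using System;
using System.Text;

public static class BomLengthDemo
{
    // Formats a byte array as hyphen-separated hex, e.g. "EF-BB-BF".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    public static void Main()
    {
        // UTF-8 with the BOM enabled: three bytes
        Console.WriteLine(Hex(new UTF8Encoding(true).GetPreamble()));           // EF-BB-BF

        // UTF-16 little-endian and big-endian: two bytes each
        Console.WriteLine(Hex(new UnicodeEncoding(false, true).GetPreamble())); // FF-FE
        Console.WriteLine(Hex(new UnicodeEncoding(true, true).GetPreamble()));  // FE-FF

        // UTF-8 with the BOM suppressed: empty preamble
        Console.WriteLine(new UTF8Encoding(false).GetPreamble().Length);        // 0
    }
}
```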

--
Oleg Tkachenko [XML MVP, XmlInsider]
http://blog.tkachenko.com
Nov 12 '05 #4

Hi Derek,
> That's probably the UTF-8 Byte Order Mark (BOM), although AFAIK the BOM
> only introduces a pair of "mysterious" characters, not a trio. On the bright side,
> it suggests that you really have UTF-8 encoding there. ;-)
>
> The BOM exists because UTF-16 and UTF-32 can be encoded as little-endian
> or big-endian. It's possible to suppress the BOM,


<-- snip -->

Can you tell me how to suppress the BOM?

Thanx,
Timo
Nov 12 '05 #5

Timo,
Have you tried the constructors for System.Text.UTF8Encoding that accept a
boolean parameter which specifies whether to prefix or not prefix an
encoding with a Unicode byte order mark?

Hope this helps
Jay
Nov 12 '05 #6