HowTo: Serialize to a String without extra characters

I have a class that I want to serialize to an XML string, and I want the XML to use utf-8 encoding. When I serialize to an XML file, the data looks great. When I try to serialize to a String (à la StringBuilder), I get utf-16, and instead of a quotation mark (") I get a backslash followed by a quotation mark (\"), which makes sense when looking at a character in memory, but not in a string.

Here is my code:

XmlSerializer serializer = new XmlSerializer (myObject.GetType ());
StringBuilder builder = new StringBuilder ();
StringWriter stringWriter = new StringWriter (builder);
XmlTextWriter xmlWriter = new XmlTextWriter (stringWriter);
xmlWriter.Formatting = Formatting.Indented;

// Serialize the document to the XML writer
serializer.Serialize (xmlWriter, myObject);

// Make sure everything has reached the StringBuilder before reading it
xmlWriter.Flush ();

return builder.ToString ();

If I try to write to a memory stream and then convert the byte array to a string via the binary reader, I get exceptions because, for some reason, garbage characters that are not ASCII/Unicode characters are written to the front of the byte array.

Any help would be great! Thanks

Brian
Nov 12 '05 #1
Let me restate my problem.

I am trying to serialize a class into a string in memory that is encoded with utf-8 encoding. Using the StringBuilder, StringWriter and XmlSerializer, I can only ever serialize to utf-16, which, if persisted to a file, cannot be parsed by XML viewers such as IE; the utf-16 string also cannot be processed by SQL Server. Is there a way to force these classes to serialize to utf-8 encoding?

I have tried serializing to a MemoryStream, but I am getting three strange characters at the front of the stream. I am creating an empty MemoryStream and passing it to an XmlTextWriter as the stream behind it. When I serialize, I get the complete document, but with the three mysterious characters on the front that prevent me from reading the stream as a string.

Any serialization help would be great. Please forgive the rambling in my previous posting.

Brian
Nov 12 '05 #2
"Brian Reed" <an*******@discussions.microsoft.com> wrote in message news:3B**********************************@microsoft.com...
I am trying to serialize a class into a string in memory that is encoded with utf-8 encoding.
Using the StringBuilder, StringWriter and XmlSerializer, I can only ever serialize to utf-16, [...] Is there a way to force these classes to serialize to utf-8 encoding?

No. UTF-16 is the internal representation of System.String. If you require a different
encoding, you must store it in something other than System.String (MemoryStream or
Byte[] are what commonly come to mind).
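
A related workaround that is not suggested in this thread: if the goal is only for the XML declaration to read encoding="utf-8" while the result stays a System.String, a StringWriter subclass can report UTF-8 as its encoding. The sketch below assumes a hypothetical Message class standing in for the poster's type, and Utf8StringWriter is an illustrative name, not a framework class. The characters in the returned string are still UTF-16 in memory, so the text must be encoded as UTF-8 when it is eventually written out as bytes.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Serialization;

// Hypothetical sample type standing in for the poster's class.
public class Message
{
    public string Text = "hello";
}

// Illustrative helper (not part of the framework): StringWriter normally
// reports UTF-16, and XmlSerializer copies that name into the XML
// declaration. Overriding the property changes only the reported name.
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

public static class DeclarationDemo
{
    // Returns a .NET string (UTF-16 in memory) whose declaration says utf-8.
    public static string SerializeWithUtf8Declaration(object value)
    {
        XmlSerializer serializer = new XmlSerializer(value.GetType());
        using (Utf8StringWriter writer = new Utf8StringWriter())
        {
            serializer.Serialize(writer, value);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        Console.WriteLine(SerializeWithUtf8Declaration(new Message()));
    }
}
```

This only helps when the string is later written out with a UTF-8 encoder; otherwise the declaration and the actual bytes disagree.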
I have tried serializing to a MemoryStream, but I am getting 3 strange characters at the
front of the stream.


That's probably the UTF-8 Byte Order Mark (BOM), although AFAIK the BOM
only introduces a pair of "mysterious" characters, not a trio. On the bright side,
it suggests that you really have UTF-8 encoding there. ;-)

The BOM exists because UTF-16 and UTF-32 code units can be stored in little-endian
or big-endian byte order. UTF-8 is a byte-oriented encoding with a single byte order,
so its BOM acts only as a signature that identifies the stream as UTF-8.
It's possible to suppress the BOM, and for UTF-8 that loses no byte-order information.
The receiver can then no longer detect the encoding from the first bytes, though, and
must rely on the XML declaration or on UTF-8 being the XML default.

Look for the place where you create the System.Text.UTF8Encoding in your code.
If you pass true as the first argument (encoderShouldEmitUTF8Identifier) to its
constructor, try changing it to false instead. If you pass the shared Encoding.UTF8
instance, which emits the BOM, replace it with new UTF8Encoding(false). This should
remove any BOM from the output.
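
A minimal sketch of that suggestion, assuming the serializer writes through an XmlTextWriter over a MemoryStream as described in the question. The Message class and the SerializeToUtf8 helper are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical sample type standing in for the poster's class.
public class Message
{
    public string Text = "hello";
}

public static class NoBomDemo
{
    // Serializes value to UTF-8 bytes; emitBom controls the leading EF BB BF.
    public static byte[] SerializeToUtf8(object value, bool emitBom)
    {
        XmlSerializer serializer = new XmlSerializer(value.GetType());
        using (MemoryStream stream = new MemoryStream())
        {
            // UTF8Encoding(false) tells the underlying writer not to emit a BOM.
            XmlTextWriter xmlWriter = new XmlTextWriter(stream, new UTF8Encoding(emitBom));
            xmlWriter.Formatting = Formatting.Indented;
            serializer.Serialize(xmlWriter, value);
            xmlWriter.Flush();
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        byte[] bytes = SerializeToUtf8(new Message(), false);

        // With no BOM present, decoding yields a string that starts with "<?xml".
        string xml = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(xml);
    }
}
```

The byte array is what should be persisted or sent on; the decoded string is UTF-16 in memory again, as noted earlier in this reply.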
Derek Harmon
Nov 12 '05 #3
Derek Harmon wrote:
That's probably the UTF-8 Byte Order Mark (BOM), although AFAIK the BOM
only introduces a pair of "mysterious" characters, not a trio. On the bright side,
it suggests that you really have UTF-8 encoding there. ;-)


BOM length depends on encoding - in UTF-8 it's 3 bytes, while in UTF-16
it's 2 bytes. See
http://www.w3.org/TR/2000/REC-xml-20...ng-no-ext-info
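
The lengths can be checked directly by asking each encoding for its preamble. This is a small sketch; PreambleDemo and Describe are illustrative names.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Formats the BOM (preamble) an encoding would emit, e.g. "3 byte(s): EF-BB-BF".
    public static string Describe(Encoding encoding)
    {
        byte[] bom = encoding.GetPreamble();
        return bom.Length + " byte(s): " + BitConverter.ToString(bom);
    }

    public static void Main()
    {
        // UTF8Encoding(true) emits the three-byte signature.
        Console.WriteLine("UTF-8:     " + Describe(new UTF8Encoding(true)));
        // UnicodeEncoding(bigEndian, byteOrderMark) is UTF-16.
        Console.WriteLine("UTF-16 LE: " + Describe(new UnicodeEncoding(false, true)));
        Console.WriteLine("UTF-16 BE: " + Describe(new UnicodeEncoding(true, true)));
    }
}
```

The UTF-8 line reports three bytes (EF-BB-BF), and both UTF-16 lines report two (FF-FE for little-endian, FE-FF for big-endian).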

--
Oleg Tkachenko [XML MVP, XmlInsider]
http://blog.tkachenko.com
Nov 12 '05 #4
Hi Derek,
It's possible to suppress the BOM,

<-- snip -->

can you tell me how to suppress the BOM?

Thanx,
Timo
Nov 12 '05 #5
Timo,
Have you tried the constructors for System.Text.UTF8Encoding that accept a
boolean parameter which specifies whether to prefix or not prefix an
encoding with a Unicode byte order mark?
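
A short sketch of the difference that boolean makes, using a StreamWriter over a MemoryStream. The Encode and Decode helpers are hypothetical names for illustration. It also shows that a StreamReader skips a leading BOM when turning the bytes back into a string, which may be a simpler way to read an existing stream.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomFlagDemo
{
    // Writes text as UTF-8 bytes, with or without the leading BOM.
    public static byte[] Encode(string text, bool emitBom)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            using (StreamWriter writer = new StreamWriter(stream, new UTF8Encoding(emitBom)))
            {
                writer.Write(text);
            }
            // ToArray still works after the writer has closed the stream.
            return stream.ToArray();
        }
    }

    // Reads UTF-8 bytes back into a string; StreamReader drops a leading BOM.
    public static string Decode(byte[] utf8Bytes)
    {
        using (MemoryStream stream = new MemoryStream(utf8Bytes))
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] withBom = Encode("<a/>", true);
        byte[] withoutBom = Encode("<a/>", false);

        Console.WriteLine(withBom.Length);     // 7: three BOM bytes plus four characters
        Console.WriteLine(withoutBom.Length);  // 4
        Console.WriteLine(Decode(withBom) == Decode(withoutBom));  // True
    }
}
```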

Hope this helps
Jay

"Timo Henne" <th*@startext.de> wrote in message
news:2f**************************@posting.google.com...
can you tell me how to suppress the BOM?

Thanx,
Timo

Nov 12 '05 #6
