Need to reliably detect a text file's encoding for XML deserialization

Folks,

I have a text file which contains some XML. In its XML header, it
claims to be UTF-8 encoded - however, it really isn't; it's an ANSI
/ Windows-1252 / ISO-8859-1 encoding.

Trouble is: when I deserialize objects from that file, all the German
umlauts and other special characters get dropped, some even cause
deserialization errors.

When I open the file in a text editor and save it as a REAL UTF-8
file, everything works just fine as expected.

I then tried to make sure I open the text file with a StreamReader,
telling it to determine the encoding automatically, and I intended to
then store it as real UTF-8 in case it wasn't really in that encoding.

Trouble is: no matter what encoding the file is in, when I tell
StreamReader to auto-detect the encoding, it *ALWAYS* comes back with
UTF-8 and then my deserialization might fail......

I even tried to use the Platform SDK function "IsTextUnicode" on the
first 256 bytes I read from the file using a FileStream - no luck
either, IsTextUnicode always returns false ........

How on earth can I *reliably* detect the encoding of a text file in a
C# app?

Thanks for any hints, pointers, and most notably, CODE SAMPLES !! ;-)

Marc
Apr 6 '06 #1
Marc Scheuner <no*****@for.me> wrote:

<snip>
> How on earth can I *reliably* detect the encoding of a text file in a
> C# app?

You can't. Any Windows-1252 file, for instance, is an equally valid
file in any other single-byte code page that assigns a character to
every byte value.
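
To make the ambiguity concrete, here is a minimal sketch that decodes the
same three bytes under two single-byte code pages. Both decodes succeed and
neither is "wrong" as far as the bytes are concerned. The class and method
names (AmbiguityDemo, DecodeBoth) are made up for illustration, and the
choice of code page 1251 as the second encoding is just an example.

```csharp
using System;
using System.Text;

static class AmbiguityDemo
{
    // Decodes the same bytes under Windows-1252 and Windows-1251.
    public static string[] DecodeBoth(byte[] bytes)
    {
        // Assumption: running on .NET Core / .NET 5+, where code pages such as
        // 1252 and 1251 need this provider. On the classic .NET Framework they
        // are built in and this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string western = Encoding.GetEncoding(1252).GetString(bytes);
        string cyrillic = Encoding.GetEncoding(1251).GetString(bytes);
        return new[] { western, cyrillic };
    }

    static void Main()
    {
        // In Windows-1252 these bytes are the lowercase umlauts a, o, u.
        byte[] bytes = { 0xE4, 0xF6, 0xFC };
        string[] decoded = DecodeBoth(bytes);

        // Same bytes, two different but equally "valid" strings.
        Console.WriteLine(decoded[0] == "\u00E4\u00F6\u00FC"); // U+00E4 U+00F6 U+00FC
        Console.WriteLine(decoded[1] == "\u0434\u0446\u044C"); // U+0434 U+0446 U+044C
    }
}
```

Nothing in the bytes themselves says which of the two readings was intended,
which is why detection can only ever be a guess.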

However, there are probably ways of chaining together readers etc so
that you can sort out your XML problem if you know the correct
encoding. Of course, a better solution would be to ask whatever
produces the file to do the right thing in the first place, if possible
- where are you getting the file from?
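
A minimal sketch of that kind of chaining, assuming the real encoding is
known to be Windows-1252: decode the bytes yourself with a StreamReader and
hand the resulting TextReader to XmlSerializer. Since the parser then
receives characters rather than bytes, the encoding named in the XML header
is not used for decoding. The Customer type and the MislabelledXml.Load
helper are hypothetical names used only for this example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Serialization;

// Hypothetical payload type, standing in for whatever the real file contains.
public class Customer
{
    public string Name;
}

static class MislabelledXml
{
    // Deserializes XML whose bytes are really in `actual`,
    // whatever the encoding declaration in the XML header claims.
    public static T Load<T>(Stream input, Encoding actual)
    {
        // The StreamReader turns bytes into characters using `actual`
        // (a byte order mark, if present, would still take precedence).
        using (StreamReader reader = new StreamReader(input, actual))
        {
            XmlSerializer serializer = new XmlSerializer(typeof(T));
            return (T)serializer.Deserialize(reader);
        }
    }

    static void Main()
    {
        // Assumption: .NET Core / .NET 5+; not needed on the classic .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ansi = Encoding.GetEncoding(1252);

        // Header claims UTF-8, but the bytes are produced as Windows-1252.
        string xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
                     "<Customer><Name>M\u00FCller</Name></Customer>";
        byte[] bytes = ansi.GetBytes(xml);

        Customer c = Load<Customer>(new MemoryStream(bytes), ansi);
        Console.WriteLine(c.Name);
    }
}
```

For a file on disk, a FileStream from File.OpenRead can be passed in place of
the MemoryStream.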

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Apr 6 '06 #2
Hi Jon

> You can't. Any Windows-1252 file, for instance, is an equally valid
> file in other code pages which use all possible values.

Drats..... I was afraid of that answer :-)

> Of course, a better solution would be to ask whatever
> produces the file to do the right thing in the first place, if possible
> - where are you getting the file from?

It's a file being exchanged between a host app and our app at a
customer's site - they *claim* it's UTF-8 and they even put that in the
XML header - yet, it's really an ANSI (Encoding.Default) file, and
that throws off the XML deserialization.....

Thanks!
Marc
Apr 7 '06 #3
Marc Scheuner <no*****@for.me> wrote:
>> You can't. Any Windows-1252 file, for instance, is an equally valid
>> file in other code pages which use all possible values.
>
> Drats..... I was afraid of that answer :-)
>
>> Of course, a better solution would be to ask whatever
>> produces the file to do the right thing in the first place, if possible
>> - where are you getting the file from?
>
> It's a file being exchanged between a host app and our app at a
> customer's site - they *claim* it's UTF-8 and they even put that in the
> XML header - yet, it's really an ANSI (Encoding.Default) file, and
> that throws off the XML deserialization.....

So can you ask the authors of the "host app" to fix things?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Apr 7 '06 #4
>So can you ask the authors of the "host app" to fix things?

I doubt it - they *claim* they're delivering UTF-8, while really
they're sending me an ANSI / Windows-1252 file. Guess I'll just have to
find some technical way to make this configurable or something, since
the stupidity and ignorance on the other side can't be cured ;-)
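
One way such a configurable workaround might look, as a sketch only: try a
strict UTF-8 decode first and, if the bytes are not well-formed UTF-8, fall
back to an encoding chosen by configuration. This is a heuristic, not
reliable detection: pure ASCII passes as UTF-8 either way, and the fallback
code page still has to be known in advance. EncodingGuess and Decode are
hypothetical names, and passing 1252 as the fallback is an assumption about
this particular file.

```csharp
using System;
using System.Text;

static class EncodingGuess
{
    // Returns the text decoded as UTF-8 when the bytes are well-formed UTF-8,
    // otherwise decoded with the caller-supplied fallback encoding.
    // Note: a leading byte order mark is not stripped by GetString.
    public static string Decode(byte[] bytes, Encoding fallback)
    {
        // Second argument (throwOnInvalidBytes) set to true makes malformed
        // sequences raise an exception instead of being silently replaced.
        UTF8Encoding strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(bytes);
        }
    }

    static void Main()
    {
        // Assumption: .NET Core / .NET 5+; not needed on the classic .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // In a real app the code page number would come from configuration.
        Encoding fallback = Encoding.GetEncoding(1252);

        byte[] ansiBytes = fallback.GetBytes("M\u00FCller");      // 0xFC is not valid UTF-8
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("M\u00FCller"); // genuine UTF-8

        Console.WriteLine(Decode(ansiBytes, fallback) == Decode(utf8Bytes, fallback));
    }
}
```

The decoded string can then be wrapped in a StringReader and passed to
XmlSerializer.Deserialize, so the mislabelled XML header no longer matters.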

Marc
Apr 9 '06 #5
