Need to reliably detect a text file's encoding for XML deserialization

Folks,

I have a text file which contains some XML. In its XML header, it
claims to be of UTF-8 encoding - however, it's really not, it's an ANSI
/ Windows-1252 / ISO-8859-1 encoding.

Trouble is: when I deserialize objects from that file, all the German
umlauts and other special characters get dropped, some even cause
deserialization errors.

When I open the file in a text editor and save it as a REAL UTF-8
file, everything works just fine as expected.

I then tried to make sure I open the text file with a StreamReader,
telling it to determine the encoding automatically, and I intended to
then store it as real UTF-8 in case it wasn't really in that encoding.

Trouble is: no matter what encoding the file is in, when I tell
StreamReader to auto-detect the encoding, it *ALWAYS* comes back with
UTF-8 and then my deserialization might fail......

I even tried to use the Platform SDK function "IsTextUnicode" on the
first 256 bytes I read from the file using a FileStream - no luck
either, IsTextUnicode always returns false ........

How on earth can I *reliably* detect the encoding of a text file in a
C# app?

Thanks for any hints, pointers, and most notably, CODE SAMPLES !! ;-)

Marc
Apr 6 '06 #1
Marc Scheuner <no*****@for.me> wrote:

<snip>
> How on earth can I *reliably* detect the encoding of a text file in a
> C# app?


You can't. Any Windows-1252 file, for instance, is an equally valid
file in other code pages which use all possible values.

However, there are probably ways of chaining together readers etc so
that you can sort out your XML problem if you know the correct
encoding. Of course, a better solution would be to ask whatever
produces the file to do the right thing in the first place, if possible
- where are you getting the file from?
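
A minimal sketch of the reader-chaining idea, assuming you already know
the real encoding and pass it in yourself. The class and method names
are made up for illustration. The point is that the XML parser is
handed already-decoded characters by the StreamReader, so the encoding
named in the XML declaration is not used to decode the bytes. As far
as I know, code page 1252 works out of the box on the .NET Framework,
while .NET Core and later need CodePagesEncodingProvider registered
before Encoding.GetEncoding(1252) succeeds.

```csharp
using System.IO;
using System.Text;
using System.Xml.Serialization;

// Hypothetical helper, not part of any library.
public static class MislabelledXml
{
    // Deserializes T from a stream whose real encoding is supplied by
    // the caller instead of being taken from the XML declaration.
    public static T Deserialize<T>(Stream input, Encoding actualEncoding)
    {
        // detectEncodingFromByteOrderMarks: false, so the reader always
        // decodes with the encoding we pass in.
        using (var reader = new StreamReader(input, actualEncoding, false))
        {
            var serializer = new XmlSerializer(typeof(T));
            // The TextReader overload parses characters, not bytes, so
            // the (wrong) encoding="utf-8" label does no harm here.
            return (T)serializer.Deserialize(reader);
        }
    }
}

// Example payload type, assumed purely for the demonstration.
public class Person
{
    public string Name;
}
```

For a file on disk the call would look something like
MislabelledXml.Deserialize<Person>(File.OpenRead(path), Encoding.GetEncoding(1252)),
where path is a placeholder for wherever the file lives.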

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Apr 6 '06 #2
Hi Jon

> You can't. Any Windows-1252 file, for instance, is an equally valid
> file in other code pages which use all possible values.

Drats..... I was afraid of that answer :-)

> Of course, a better solution would be to ask whatever
> produces the file to do the right thing in the first place, if possible
> - where are you getting the file from?


It's a file being exchanged between a host app and our app at a
customer's site - they *claim* it's UTF-8 and they even put that in the
XML header - yet, it's really an ANSI (Encoding.Default) file, and
that throws off the XML deserialization .....

Thanks!
Marc
Apr 7 '06 #3
Marc Scheuner <no*****@for.me> wrote:
> > You can't. Any Windows-1252 file, for instance, is an equally valid
> > file in other code pages which use all possible values.
>
> Drats..... I was afraid of that answer :-)
>
> > Of course, a better solution would be to ask whatever
> > produces the file to do the right thing in the first place, if possible
> > - where are you getting the file from?
>
> It's a file being exchanged between a host app and our app at a
> customer's site - they *claim* it's UTF-8 and they even put that in the
> XML header - yet, it's really an ANSI (Encoding.Default) file, and
> that throws off the XML deserialization .....


So can you ask the authors of the "host app" to fix things?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Apr 7 '06 #4
> So can you ask the authors of the "host app" to fix things?

I doubt it - they *claim* they're delivering UTF-8, while really
they're sending me an ANSI / Windows-1252 file. Guess I'll just have to
find some technical way to make this configurable or something, since
the stupidity and ignorance on the other side can't be cured ;-)

Marc
Apr 9 '06 #5
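
One way the "make this configurable" idea might look, as a rough
sketch rather than real detection: try to decode the bytes as strict
UTF-8 and, if they turn out not to be valid UTF-8, decode them with a
fallback encoding chosen by configuration. The names EncodingGuess and
DecodeWithFallback are invented for this example. StreamReader's own
auto-detection only inspects byte order marks, which is why it keeps
reporting UTF-8 for a file that has none. Note that this remains a
heuristic: a legacy-encoded file whose bytes happen to form valid
UTF-8 is still taken as UTF-8.

```csharp
using System.Text;

// Hypothetical helper, not part of any library.
public static class EncodingGuess
{
    // Decodes bytes as strict UTF-8; if they are not valid UTF-8,
    // decodes them with the caller-supplied fallback encoding instead.
    public static string DecodeWithFallback(byte[] bytes, Encoding fallback)
    {
        // Second argument (throwOnInvalidBytes: true) makes decoding
        // throw on malformed sequences instead of inserting U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            string text = strictUtf8.GetString(bytes);
            // GetString keeps a leading byte order mark as U+FEFF;
            // drop it so an XML parser does not trip over it.
            if (text.Length > 0 && text[0] == '\uFEFF')
            {
                text = text.Substring(1);
            }
            return text;
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: assume the configured legacy encoding.
            return fallback.GetString(bytes);
        }
    }
}
```

The resulting string can be wrapped in a StringReader and handed to
XmlSerializer.Deserialize, with the fallback (for instance code page
1252) read from a setting per customer.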
