Help!! Convert file encoding

Sun
Hi everyone,

I have two files named a.txt and b.txt.
I opened a.txt with ultraeditor.exe. Here is the first row of the file:
neu für

Then I switched to HEX mode:
00000000h: FF FE 6E 00 65 00 75 00 20 00 66 00 FC 00 72 00 20 00 0A 00 0D 00
I opened b.txt with ultraeditor.exe as well. Here is the first row of b.txt:
neu für

And in HEX mode:
00000000h: 6E 65 75 20 66 FC 72 20 0A 0D

The first two bytes of a.txt are FF FE, so I think it is a
Unicode (little-endian) encoded file. b.txt has no BOM,
so I think it is ANSI encoded.

Then I used the following C# code to read the first bytes of each file:
// Requires: using System; using System.IO;
DirectoryInfo di = new DirectoryInfo(Directory.GetCurrentDirectory());
foreach (FileInfo fi in di.GetFiles("*.txt"))
{
    // FullName avoids depending on the current directory when opening.
    using (FileStream fs = new FileStream(fi.FullName, FileMode.Open,
        FileAccess.Read, FileShare.Read))
    {
        Console.WriteLine(fi.Name);
        for (int i = 0; i < 10; i++)
        {
            int value = fs.ReadByte(); // returns -1 at end of file
            if (value < 0) break;
            Console.WriteLine(i + " : " + value);
        }
    }
}

Here is the result (only the first row of each file is shown):
a.txt
0 : 110
1 : 101
2 : 117
3 : 32
4 : 102
5 : 195
6 : 188
7 : 114
8 : 32
9 : 13

b.txt
0 : 110
1 : 101
2 : 117
3 : 32
4 : 102
5 : 252
6 : 114
7 : 32
8 : 13
9 : 10

So, I have three questions:
1. Why does ultraeditor show the BOM at the start of a.txt when the code
does not? Every character should be two bytes long, but the C# stream
does not return the high byte of each character.
2. The character ü is an extended ASCII code in codepage 1252
in a.txt, but the bytes I get from the code are 195 (dec) and
188 (dec); one byte has turned into two. How does byte 252 (dec)
become bytes 195 (dec) and 188 (dec)? I really don't understand
where they come from.
3. I want to convert both files to UTF-8. How should I do that?
Every character should be converted correctly and should also
display correctly when the file is opened in Notepad.

If anyone has any suggestions, thank you very much.
Sep 2 '08 #1
On Mon, 01 Sep 2008 20:09:21 -0700, Sun <Su******@gmail.com> wrote:
[...]
> So, I have three questions here:
> 1 why the ultraeditor show the BOM header of a.txt and the code can
> not. every character is two bytes length. but the C# stream can not
> read the high byte of the character.
> 2 the character ü is an extended code of ASCII with codepage 1252
> in a.txt. But I really don't know why the bytes I get from the code is
> 195(dec) and 188(dec), one byte turn to two bytes. How the byte
> 252(dec) become byte 195(dec) and 188(dec). I really don't know how it
> comes.
> 3 Anyway, I want to convert both files to utf-8 encoded. How should I
> do? each character in the file should be converted correctly, also
> the characters should be shown correctly as well by opened with the
> notepad.
>
> If any one has any suggestion, very thanks.

Unless you can provide links where the actual files can be downloaded, I'm
not sure anyone here will be able to offer much information.

I agree that your observations are inconsistent. But there's no reason
that a FileStream shouldn't return the exact bytes found in the file. So
that leaves a few possibilities: 1) the code you posted isn't actually the
exact code you're using to read the files, 2) the files you're reading
with the code aren't the same files being opened in this "UltraEditor"
program, or 3) the "UltraEditor" program is doing something unexpected (at
least by you...it's possible whatever it's doing is completely intentional
and expected for other users) with the files.
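
One way to rule out possibility 2 is to print the full path and the raw leading bytes of each file in hex, so they can be compared directly with the editor's hex view. The following is only a rough sketch: the class name ByteDump and the helpers DescribeBom and ReadHead are made up for illustration, and the BOM check only covers the common Unicode signatures.

```csharp
using System;
using System.IO;

static class ByteDump
{
    // Hypothetical helper: names the encoding implied by a leading BOM, if any.
    public static string DescribeBom(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return "UTF-8 BOM";
        // Check UTF-32 LE first: its BOM starts with the same FF FE as UTF-16 LE.
        if (head.Length >= 4 && head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00)
            return "UTF-32 LE BOM";
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return "UTF-16 LE BOM";
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return "UTF-16 BE BOM";
        return "no BOM";
    }

    // Hypothetical helper: reads up to 'count' raw bytes from the start of the file.
    public static byte[] ReadHead(string path, int count)
    {
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            byte[] buffer = new byte[count];
            int total = 0;
            int read;
            while (total < count && (read = fs.Read(buffer, total, count - total)) > 0)
            {
                total += read;
            }
            Array.Resize(ref buffer, total); // shorter files yield fewer bytes
            return buffer;
        }
    }

    static void Main()
    {
        // Directory.GetFiles returns full paths, so it is clear which file was read.
        foreach (string path in Directory.GetFiles(Directory.GetCurrentDirectory(), "*.txt"))
        {
            byte[] head = ReadHead(path, 16);
            Console.WriteLine(path);
            Console.WriteLine("  " + BitConverter.ToString(head).Replace("-", " ")
                + "  (" + DescribeBom(head) + ")");
        }
    }
}
```

If the hex printed here differs from what the editor shows for the same path, the editor is most likely converting the file on load.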

Just a wild guess: the first file is already UTF-8, and "UltraEditor"
detects that based on the 2-byte ü character and internally converts that
to a plain UTF-16 file with BOM at the beginning.
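
That guess is consistent with the numbers in the question: 195 188 is C3 BC in hex, which is how UTF-8 encodes ü (U+00FC). A small sketch to see the same character in each encoding follows. The class name UmlautBytes and the helper ToDecimalList are invented for this example, and ISO-8859-1 stands in for codepage 1252 here, since ü is byte 252 in both.

```csharp
using System;
using System.Text;

static class UmlautBytes
{
    // Hypothetical helper: formats bytes as decimal values, like the output in the question.
    public static string ToDecimalList(byte[] bytes)
    {
        return string.Join(" ", Array.ConvertAll(bytes, b => b.ToString()));
    }

    static void Main()
    {
        string text = "\u00FC"; // ü

        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Console.WriteLine("ISO-8859-1: " + ToDecimalList(latin1.GetBytes(text)));           // 252
        Console.WriteLine("UTF-8:      " + ToDecimalList(Encoding.UTF8.GetBytes(text)));    // 195 188
        Console.WriteLine("UTF-16 LE:  " + ToDecimalList(Encoding.Unicode.GetBytes(text))); // 252 0
    }
}
```

So a file whose bytes read 195 188 for ü is UTF-8 on disk, and the FC 00 pair seen in the hex view is what the same character looks like after conversion to UTF-16 LE.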

You'll need to post the exact code, a concise-but-complete code sample
that reproduces the issue you're seeing, as well as provide a couple of
links to copies of the files so people can use the exact data you're
using. Alternatively, include in your sample some setup code to create
the files appropriately; but I'm guessing that if you could do that
easily, you wouldn't have the question in the first place. :)

As far as your third question goes: the simplest approach to converting
the file would be to use the .NET text i/o classes, StreamReader and
StreamWriter. Let .NET auto-detect the input file encoding, or specify it
yourself, and then explicitly specify the encoding for the output (though,
actually...my recollection is that the default is already UTF-8, and if so
you don't really have to specify it). Then just use ReadLine() and
WriteLine() to go through the file and convert it.
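
A minimal sketch of that approach is below. The class Utf8Converter and the method ConvertToUtf8 are hypothetical names. The sketch assumes the input and output paths differ, and it writes a UTF-8 BOM on the assumption that this helps Notepad recognise the result. Note that StreamReader only auto-detects encodings that announce themselves with a BOM; for anything else it uses the fallback encoding passed in, so that has to be chosen per file.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8Converter
{
    // Hypothetical helper: re-encodes a text file as UTF-8 with a BOM.
    // 'fallback' is the encoding assumed when the input carries no BOM.
    public static void ConvertToUtf8(string inputPath, string outputPath, Encoding fallback)
    {
        // The 'true' argument lets a BOM in the input override the fallback.
        using (StreamReader reader = new StreamReader(inputPath, fallback, true))
        using (StreamWriter writer = new StreamWriter(outputPath, false, new UTF8Encoding(true)))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // WriteLine uses Environment.NewLine, so line endings are
                // normalised and the output always ends with a newline.
                writer.WriteLine(line);
            }
        }
    }

    // Usage: Utf8Converter <input> <output> [fallback encoding name]
    static int Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.Error.WriteLine("usage: Utf8Converter <input> <output> [fallback-encoding]");
            return 1;
        }
        Encoding fallback = args.Length > 2 ? Encoding.GetEncoding(args[2]) : Encoding.UTF8;
        ConvertToUtf8(args[0], args[1], fallback);
        return 0;
    }
}
```

Going by the byte dump in the question, a.txt looks like UTF-8 without a BOM, so Encoding.UTF8 would be the fallback to try for it, while b.txt would need "iso-8859-1" or codepage 1252 (on .NET Core, 1252 is only available after registering the code pages encoding provider).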

Pete
Sep 2 '08 #2
Yes, one file is ascii and the other is UTF as you suggest.

I would assume that streams understand the differences and open the files
converting into UTF characters.
Sep 2 '08 #3
Ken Foskey wrote:
> Yes, one file is ascii and the other is UTF as you suggest.

CP-1252/ISO-8859-1/similar not ASCII.

Arne
Sep 3 '08 #4
