Hi everyone,
I have two files named a.txt and b.txt.
I open a.txt with ultraeditor.exe. Here is the first row of the file:
neu für
Then I switch to hex mode:
00000000h: FF FE 6E 00 65 00 75 00 20 00 66 00 FC 00 72 00 20 00 0A 00 0D 00
I open b.txt with ultraeditor.exe as well. Here is the first row of b.txt:
neu für
Then I switch to hex mode again:
00000000h: 6E 65 75 20 66 FC 72 20 0A 0D
The first two bytes of a.txt are FF FE, which is the byte order mark
(BOM) for Unicode (little endian), so I think the file is encoded that
way. b.txt has no BOM at all, so I think that file is ANSI encoded.
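To double-check that guess against the bytes that are actually on disk, rather than against what an editor displays, a small helper along these lines could be used. This is only a sketch: the names BomSniffer, DescribeBom and DescribeFile are made up for illustration, and it only knows the three most common signatures (a UTF-32 little-endian BOM also starts with FF FE and would be reported as UTF-16 here).

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of any library.
public static class BomSniffer
{
    // Returns a short label for the BOM at the start of the data,
    // or "no BOM" when none of the known signatures match.
    public static string DescribeBom(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return "UTF-8";
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return "UTF-16 little endian";
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return "UTF-16 big endian";
        return "no BOM";
    }

    // Reads up to the first three bytes of a file and classifies them.
    public static string DescribeFile(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            byte[] head = new byte[3];
            int read = fs.Read(head, 0, head.Length);
            Array.Resize(ref head, read); // keep only the bytes actually read
            return DescribeBom(head);
        }
    }
}
```

Calling BomSniffer.DescribeFile("a.txt") and BomSniffer.DescribeFile("b.txt") should then show whether the files on disk really start with a BOM.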
Then I use the following C# code to read the first bytes of each file:
using System;
using System.IO;

DirectoryInfo di = new DirectoryInfo(Directory.GetCurrentDirectory());
foreach (FileInfo fi in di.GetFiles("*.txt"))
{
    // Open the file as a raw byte stream; no text decoding happens here.
    using (FileStream fs = new FileStream(fi.FullName, FileMode.Open,
        FileAccess.Read, FileShare.Read))
    {
        Console.WriteLine(fi.Name);
        for (int i = 0; i < 10; i++)
        {
            int b = fs.ReadByte();   // returns -1 at end of file
            if (b < 0) break;
            Console.WriteLine(i + " : " + b);
        }
    }
}
Here is the result (I only display the first row):
a.txt
0 : 110
1 : 101
2 : 117
3 : 32
4 : 102
5 : 195
6 : 188
7 : 114
8 : 32
9 : 13
b.txt
0 : 110
1 : 101
2 : 117
3 : 32
4 : 102
5 : 252
6 : 114
7 : 32
8 : 13
9 : 10
So, I have three questions here:
1 Why does ultraeditor show the BOM header of a.txt while my code does
not? In the hex view every character is two bytes long, but the C#
stream never returns the high byte of each character.
2 The character ü in a.txt is an extended character (byte 252 in
codepage 1252). But the bytes I get from the code are 195 and 188
(decimal), so one byte has turned into two. How does byte 252 become
bytes 195 and 188? I really don't understand where they come from.
3 In any case, I want to convert both files to UTF-8. How should I do
that? Every character in the files should be converted correctly, and
the files should also display correctly when opened with Notepad.
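For anyone who wants to experiment, here is a rough sketch of the kind of conversion I have in mind. It assumes the source encoding of each file is already known, which is exactly the part I am unsure about. The names Utf8Converter, ToUtf8 and ConvertFile are made up. Encoding.Unicode is the .NET name for little-endian UTF-16. For the ANSI case I use ISO-8859-1 (code page 28591) as a stand-in because it agrees with codepage 1252 for ü; real 1252 would be Encoding.GetEncoding(1252), which as far as I know may need an extra code page provider on newer .NET versions.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
public static class Utf8Converter
{
    // Decodes raw bytes with the given source encoding and
    // re-encodes the resulting text as UTF-8 (no BOM is added).
    // A BOM present in the input is not stripped by this method.
    public static byte[] ToUtf8(byte[] raw, Encoding source)
    {
        string text = source.GetString(raw);
        return Encoding.UTF8.GetBytes(text);
    }

    // Reads a file as text and writes it back out as UTF-8 with a BOM.
    // File.ReadAllText honours a BOM if the file has one and otherwise
    // falls back to the encoding passed in.
    public static void ConvertFile(string inputPath, string outputPath, Encoding fallback)
    {
        string text = File.ReadAllText(inputPath, fallback);
        File.WriteAllText(outputPath, text, new UTF8Encoding(true));
    }
}
```

With this, ToUtf8(new byte[] { 0xFC }, Encoding.GetEncoding(28591)) and ToUtf8(new byte[] { 0xFC, 0x00 }, Encoding.Unicode) both give the two bytes 0xC3 0xBC, which are 195 and 188 in decimal, the same pair my program printed for a.txt. Writing the output with a BOM is a choice on my part, made in the hope that Notepad recognises the file as UTF-8.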
If anyone has any suggestions, thank you very much.