
Crazy with character encoding

P: n/a
Hi,
I have a text file with the following content:
"((^)|(.* +))"

If I read it with:
k = new System.IO.StreamReader("file.txt", System.Text.Encoding.ASCII);

k.ReadToEnd()

then "" is lost.

If I read it with UTF7,

then "+" is lost.

Please help: how can I read the file into a string so that I keep all
the characters?

Thanks
Aug 3 '06 #1
37 Replies


P: n/a
Hi Zhiv,

Well, the text isn't ASCII or UTF-7. Most likely it is encoded with the default code page of the system the file was saved on.

Try Encoding.Default or Encoding.GetEncoding(<write name of encoding here>)

--
Happy coding!
Morten Wennevik [C# MVP]
Aug 3 '06 #2
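
A minimal sketch of why the choice of encoding matters when reading the file. It decodes the same four bytes three ways. The byte values are an assumption made for illustration (the actual character in the poster's file is not known); 0xC8 stands in for a non-ASCII character, and ISO-8859-1 (code page 28591) stands in for a single-byte code page.

```csharp
using System;
using System.Text;

class DecodeComparison
{
    static void Main()
    {
        // Hypothetical file content: '(', 0xC8, '+', ')'.
        // 0xC8 is 'È' in ISO-8859-1; it is not a valid ASCII byte.
        byte[] raw = { 0x28, 0xC8, 0x2B, 0x29 };

        Encoding latin1 = Encoding.GetEncoding(28591); // ISO-8859-1

        // Print code points so the result does not depend on the console font.
        Dump("ASCII", Encoding.ASCII.GetString(raw));
        Dump("ISO-8859-1", latin1.GetString(raw));
        Dump("UTF-8", Encoding.UTF8.GetString(raw));
    }

    static void Dump(string label, string text)
    {
        Console.Write(label + ":");
        foreach (char c in text)
        {
            Console.Write(" U+{0:X4}", (int)c);
        }
        Console.WriteLine();
    }
}
```

With ASCII the byte 0xC8 should come back as a question mark (U+003F), with ISO-8859-1 as U+00C8, and with UTF-8 as a replacement character, because 0xC8 followed by '+' is not a valid UTF-8 sequence.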

P: n/a
Zhiv Kurilka <Zh**********@LozhkaVil.ca> wrote:
> I have a text file with following content:
> "((^)|(.* +))"
>
> if I read it with:
> k = new System.IO.StreamReader("file.txt", System.Text.Encoding.ASCII);
>
> k.ReadToEnd()
>
> Then "" is lost

Yes, because that isn't an ASCII character.

> If read it with UTF7
>
> then "+" is lost.

Yes, because it isn't a UTF-7 file. (UTF-7 is very rarely used outside
mail.)

> Please help, how can I read the file into string so that I have all
> characters?

Well, what encoding is the file in? What created it?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Aug 3 '06 #3

P: n/a
I created it by hand; it is just a number of characters. I suppose I
need a byte reader for that. Right?
Aug 3 '06 #4

P: n/a
The question is rather what you created it with, and how you saved it.
I'm guessing it was saved with the default ANSI code page for your computer, in which case using Encoding.Default when reading it should give you the proper string.

--
Happy coding!
Morten Wennevik [C# MVP]
Aug 3 '06 #5

P: n/a
I created it with the Visual Studio text editor. It is a plain text file
(I suppose). I opened a new text document and typed those symbols into it.
Encoding.Default gives a wrong string.
I am trying a byte reader now.
Aug 3 '06 #6

P: n/a
A byte reader won't help you: to make sense of the bytes it needs to know the encoding, just as the StreamReader does.

My Visual Studio 2005 seems to want to save a text file as Windows-1252, so you can try using that.

new StreamReader("file.txt", Encoding.GetEncoding("Windows-1252"));

--
Happy coding!
Morten Wennevik [C# MVP]
Aug 3 '06 #7

P: n/a
Zhiv Kurilka wrote:
> I have created it with Visual Studio text editor. It is a plain text file.
> (I suppose). I opened a new text document and wrote those symbols into it.
> Encoding.default gives a wrong string.
> I am trying byte reader now

Have you tried to read it as UTF8? I think VS saves files in that format.

Max
Aug 3 '06 #8

P: n/a
Some of the files were created using VS2003, others using VS2005. I need some
way to get the encoding from a file automatically. Is that possible?
P.S. I have tried UTF8. For most files it fails.
I am sorry, but I still don't understand what is going on. Why does the VS
editor show the files properly, while I can't write them?
Aug 3 '06 #9

P: n/a
VS Editor shows files properly because it reads them using the correct encoding.
Have you tried Windows-1252?

--
Happy coding!
Morten Wennevik [C# MVP]
Aug 3 '06 #10

P: n/a
Zhiv Kurilka <Zh**********@LozhkaVil.ca> wrote:
> Some of the files are created using VS2003 others VS2005. I need some way to
> get encoding from file automatically. Is it possible?

No. There are ways of making a reasonable guess, but it would still be
a guess.

> P.S. I have tried UTF8. For most files it fails.

So it's not UTF-8 and it's not the default encoding for the system.
That's fairly odd. Perhaps you could mail me some of the files?

> I am sorry, but I still don't understand what is going on. Why VS editor
> shows files properly, but I can't write them?

Visual Studio presumably guesses correctly what encoding they're in.

It sounds like you're still not really sure what an encoding is though.
See if
http://www.pobox.com/~skeet/csharp/unicode.html helps.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Aug 3 '06 #11
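
One common way of making such a guess is to look for a byte order mark (BOM) at the start of the file and fall back to a chosen encoding when none is present. The sketch below illustrates that idea only; the helper name ReadWithBomDetection, the sample text and the ISO-8859-1 fallback are assumptions made for illustration, and a file without a BOM is still decoded on faith.

```csharp
using System;
using System.IO;
using System.Text;

class BomGuess
{
    // Hypothetical helper: honours a BOM if one is present,
    // otherwise decodes with the supplied fallback encoding.
    static string ReadWithBomDetection(string path, Encoding fallback)
    {
        using (StreamReader reader = new StreamReader(path, fallback, true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read.
            Console.WriteLine("Decoded as: {0}", reader.CurrentEncoding.WebName);
            return text;
        }
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Write a sample file as UTF-8 with a BOM.
            File.WriteAllText(path, "((^\u00C8)|(.* +))", new UTF8Encoding(true));

            string text = ReadWithBomDetection(path, Encoding.GetEncoding(28591));
            Console.WriteLine("Characters read: {0}", text.Length);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Files saved in a single-byte code page carry no BOM, so for those the fallback encoding still has to be the right one.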

P: n/a
"Zhiv Kurilka" <Zh**********@LozhkaVil.ca> wrote in message
> Hi,
> I have a text file with following content:
> "((^)|(.* +))"
>
> if I read it with:
> k = new System.IO.StreamReader("file.txt", System.Text.Encoding.ASCII);
>
> k.ReadToEnd()
>
> Then "" is lost
>
> If read it with UTF7
>
> then "+" is lost.
>
> Please help, how can I read the file into string so that I have all
> characters?

I had the same problem: an encoding converts bytes into chars using its own
rules, but I just want the char that corresponds to each byte without any
conversion.

The solution is boring but quite simple: read and write the file as a byte
array, and convert it by casting each byte to a char and back:

// Narrows each char to a byte; code points above 255 are truncated.
public static byte[] GetBytes(string s)
{
    byte[] b = new byte[s.Length];
    for (int i = 0; i < b.Length; ++i)
    {
        b[i] = (byte)s[i];
    }
    return b;
}

public static byte[] GetBytes(char[] c)
{
    byte[] b = new byte[c.Length];
    for (int i = 0; i < b.Length; ++i)
    {
        b[i] = (byte)c[i];
    }
    return b;
}

public static string GetString(byte[] buffer)
{
    return new string(GetChars(buffer));
}

// Widens each byte to the char with the same numeric value.
public static char[] GetChars(byte[] b)
{
    char[] c = new char[b.Length];
    for (int i = 0; i < b.Length; ++i)
    {
        c[i] = (char)b[i];
    }
    return c;
}

--

Free .Net Reporting Tool - http://www.neodatatype.net
Aug 3 '06 #12

P: n/a
Fabio <zn*******@virgilio.it> wrote:
> I had the same problem: encoding convert the byte into char using their
> rules, but I just want the char that correspond to the byte without any
> conversion.

That's like saying you want the English that corresponds to a French
word without any translation.

> The solution is boring but quite simple: read and write the file as a byte
> array and restore it casting each byte into char and back:
> public static byte[] GetBytes(string s)
> {
>     byte[] b = new byte[s.Length];
>     for (int i = 0; i < b.Length; ++i)
>     {
>         b[i] = (byte)s[i];
>     }
>     return b;
> }

That's effectively using ISO-Latin-1 encoding. It's still an encoding.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Aug 3 '06 #13
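
The claim that a byte-to-char cast is ISO-Latin-1 in disguise can be checked directly. The sketch below compares the cast with the framework's ISO-8859-1 decoder (code page 28591) for every byte value; the class name is invented for illustration.

```csharp
using System;
using System.Text;

class CastVersusLatin1
{
    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding(28591); // ISO-8859-1
        int mismatches = 0;

        for (int i = 0; i < 256; i++)
        {
            byte value = (byte)i;
            char viaCast = (char)value;
            char viaEncoding = latin1.GetString(new byte[] { value })[0];

            if (viaCast != viaEncoding)
            {
                mismatches++;
                Console.WriteLine("Byte 0x{0:X2} decodes differently", i);
            }
        }

        Console.WriteLine("Mismatches: {0}", mismatches);
    }
}
```

If the two approaches agree for all 256 values, the hand-written loops add nothing over the built-in encoding, and they share its limitation: any char above U+00FF cannot survive the trip back to a byte.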

P: n/a

"Zhiv Kurilka" <Zh**********@LozhkaVil.ca> wrote in message
news:ef**************@TK2MSFTNGP05.phx.gbl...
> Hi,
> I have a text file with following content:
> "((^)|(.* +))"
>
> if I read it with:
> k = new System.IO.StreamReader("file.txt", System.Text.Encoding.ASCII);
>
> k.ReadToEnd()
>
> Then "" is lost
>
> If read it with UTF7
>
> then "+" is lost.
>
> Please help, how can I read the file into string so that I have all
> characters?

Sounds to me like you are running into Unicode encoding - characters encoded
with both big-endian and little-endian. Try using Encoding.Unicode. See if
that helps.

Aug 3 '06 #14

P: n/a
Dear Sirs,
I have uploaded the file:
http://a1234113.narod.ru/test.zip

I tried all your suggestions.
Dim _sr As New System.IO.StreamReader(_fn, System.Text.Encoding.XXXX)

m_filetext = _sr.ReadToEnd

_sr.Close()

Either "+" or "" is missing, or everything is garbage.

Could you give me some advice?

Thanks a lot
Aug 3 '06 #15

P: n/a
Zhiv Kurilka <Zh**********@LozhkaVil.ca> wrote:
> Dear Sirs,
> I have uploaded the file:
> http://a1234113.narod.ru/test.zip
>
> I tried all your suggestions.
> Dim _sr As New System.IO.StreamReader(_fn, System.Text.Encoding.XXXX)
>
> m_filetext = _sr.ReadToEnd
>
> _sr.Close()
>
> Either "+" or "" is missing, or everything is garbage.
>
> Could you give me some advice?

Encoding.Default works fine for me.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Aug 3 '06 #16

P: n/a
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
> Fabio <zn*******@virgilio.it> wrote:
>> I had the same problem: encoding convert the byte into char using their
>> rules, but I just want the char that correspond to the byte without any
>> conversion.
>
> That's like saying you want the English that corresponds to a French
> word without any translation.

Mmmm... what I want is a two-way conversion.
Each character in the ASCII table has a corresponding byte, so I want a
char converted to a byte to be reversible, giving back the original value.

All the encoders exposed by System.Text that I tried do some transformation
on the value, and the original information can be lost.

> That's effectively using ISO-Latin-1 encoding. It's still an encoding.

Can I get this behavior directly from some .NET encoder?

Thanks
Aug 3 '06 #17

P: n/a
Fabio <zn*******@virgilio.it> wrote:
>> Fabio <zn*******@virgilio.it> wrote:
>>> I had the same problem: encoding convert the byte into char using their
>>> rules, but I just want the char that correspond to the byte without any
>>> conversion.
>> That's like saying you want the English that corresponds to a French
>> word without any translation.
>
> Mmmm... what I want is a "double way" conversion.
> Each character into the ASCII table has a corresponding byte, so I want that
> a char converted to a byte can be reversed having back the original value.

There's no way you can do that with a single byte, as a char is a
16-bit value and a byte is an 8-bit value.

> All the encoders exposed by System.Text I tried do some transformation on
> the value and the original information can be lost.

If you want to encode arbitrary binary data as text data and then
decode it, you should use Base64 - that's what it's there for. Pretty
much any other scheme is asking for trouble.

If you want to encode arbitrary Unicode text data as binary data, I'd
normally suggest using UTF-8. It's efficient for "mainly ASCII" text,
and covers the whole of Unicode.

>> That's effectively using ISO-Latin-1 encoding. It's still an encoding.
>
> Can I have this behavior directly via some .Net encoder?

You can use Encoding.GetEncoding(28591) but be aware that between 128
and 159 there's a bit of a no-man's-land. There's contradictory
evidence, but some of it points to ISO-8859-1 not having any characters
defined in that range.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Aug 3 '06 #18
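
For the text-to-bytes direction recommended above, a short sketch of a UTF-8 round trip may help. The sample string is an assumption chosen for illustration: it mixes ASCII with a Latin-1 letter, a Cyrillic letter and the euro sign, none of which fit into a single legacy code page together.

```csharp
using System;
using System.Text;

class Utf8RoundTrip
{
    static void Main()
    {
        // ASCII, U+00C8 (Latin-1), U+0416 (Cyrillic) and U+20AC (euro sign).
        string original = "((^\u00C8)|(.* +)) \u0416 \u20AC";

        byte[] encoded = Encoding.UTF8.GetBytes(original);
        string decoded = Encoding.UTF8.GetString(encoded);

        // Non-ASCII characters take more than one byte each in UTF-8.
        Console.WriteLine("chars: {0}, bytes: {1}", original.Length, encoded.Length);
        Console.WriteLine("round trip intact: {0}", decoded == original);
    }
}
```

The byte count exceeds the character count because each non-ASCII character is stored as a multi-byte sequence, which is also why such a file cannot be read back correctly with a single-byte encoding.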

P: n/a
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote:
>> I have uploaded the file:
>> http://a1234113.narod.ru/test.zip
>>
>> I tried all your suggestions.
>> Dim _sr As New System.IO.StreamReader(_fn, System.Text.Encoding.XXXX)
>>
>> m_filetext = _sr.ReadToEnd
>>
>> _sr.Close()
>>
>> Either "+" or "" is missing, or everything is garbage.
>>
>> Could you give me some advice?
>
> Encoding.Default works fine for me.

Maybe the OP's version of Windows uses a different default Windows-ANSI
codepage.

--
M S Herfried K. Wagner
M V P <URL:http://dotnet.mvps.org/>
V B <URL:http://classicvb.org/petition/>

Aug 3 '06 #19

P: n/a
Herfried K. Wagner [MVP] <hi***************@gmx.at> wrote:
> "Jon Skeet [C# MVP]" <sk***@pobox.com> wrote:
>>> I have uploaded the file:
>>> http://a1234113.narod.ru/test.zip
>>>
>>> I tried all your suggestions.
>>> Dim _sr As New System.IO.StreamReader(_fn, System.Text.Encoding.XXXX)
>>>
>>> m_filetext = _sr.ReadToEnd
>>>
>>> _sr.Close()
>>>
>>> Either "+" or "" is missing, or everything is garbage.
>>>
>>> Could you give me some advice?
>> Encoding.Default works fine for me.
>
> Maybe the OP's version of Windows uses a different default Windows-ANSI
> codepage.

But in that case, I'd have expected Visual Studio to use that default
encoding too - if it works in Studio and it's CP-1252, I can't think
why Studio would choose 1252 instead of the default code page.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Aug 3 '06 #20

P: n/a

Fabio wrote:
<snip>
> Mmmm... what I want is a "double way" conversion.
> Each character into the ASCII table has a corresponding byte, so I want that
> a char converted to a byte can be reversed having back the original value.
>
> All the encoders exposed by System.Text I tried do some transformation on
> the value and the original information can be lost.
<snip>
> Can I have this behavior directly via some .Net encoder?

The Windows ANSI encoding (encoding number 1252) usually works for me,
because (AFAIK) it doesn't apply any transformation to the individual
byte, i.e., there's a mapping from each byte value to each ANSI char,
for a total of 256 possible chars (control chars included).

Dim E As System.Text.Encoding = _
System.Text.Encoding.GetEncoding(1252)

Many encodings use two or four bytes in the representation of a char;
others use a multibyte system where some specific byte values indicate
that the following sequence is a multibyte char.

This is not the case with the ANSI encoding. In ANSI, each byte value
matches a corresponding char. Of course, if the string you're encoding
contains chars outside the ANSI range, such chars will be
misrepresented. Also, if you read a non-ANSI sequence of bytes and
convert them to a string using ANSI, you'll probably get some strange
results.

HTH.

Regards,

Branco.

Aug 4 '06 #21

P: n/a
Branco Medeiros wrote:
>> All the encoders exposed by System.Text I tried do some transformation on
>> the value and the original information can be lost.
> <snip>
>> Can I have this behavior directly via some .Net encoder?
>
> The Windows ANSI encoding (encoding number 1252) usually works for me,
> because (AFAIK) it doesn't apply any transformation to the individual
> byte, i.e., there's a mapping from each byte value to each ANSI char,
> for a total of 256 possible chars (control chars included).

How can you say there isn't any transformation, and then talk about
there being a mapping from each byte value to a character? That *is*
the transformation.

Talking about "the" Windows ANSI Encoding is like talking about "the"
extended ASCII encoding. There are lots of different encodings which
exhibit the same behaviour as 1252, i.e. they have a mapping from any
byte to one of the 256 characters they represent. Each represents a
different set of 256 characters.

> This is not the case with the ANSI encoding. In ANSI, each byte value
> matches a corresponding char. Of course, if the string you're encoding
> contains chars outside the ANSI range, such chars will be
> misrepresented. Also, if you read a non-ANSI sequence of bytes and
> convert them to a string using ANSI, you'll probably get some strange
> results.

Exactly - so it's like any other encoding: you've got to make sure you
use the right one.

Code page 1252 has no magic powers.

Jon

Aug 4 '06 #22

P: n/a
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.com...

> There's no way you can do that with a single byte, as a char is a
> 16-bit value and a byte is an 8-bit value.

Wait.
Let's take VB6 for a moment.
It uses Unicode strings, but Chr(200) ("È", for example) is always
perfectly reversible into 200 using Asc("È").

This works well if I use (char)200 <--> (byte)'È'.

There is no way if I use an encoder: the char I encode is not returned
correctly when decoding it, i.e. I can encode "" into a byte value (of course
using a NON double byte encoder) and when I decode it back I could get a
"".

This is not so good when I do communications via socket or via RS232.

The ASCII table gives a number (and only one) for each char.
The encoder/decoder seems to assign different chars to the same number, or
seems to lose information, so decoding the number I could get a char that is
not the one encoded.

Aug 4 '06 #23

P: n/a
Fabio Z wrote:
>> There's no way you can do that with a single byte, as a char is a
>> 16-bit value and a byte is an 8-bit value.
>
> Wait.
> Let's take for a moment VB6.
> It uses Unicode strings, but Chr(200) (the "È" for example) is always
> perfectly reversible into 200 using ASC("È").
>
> This works well if I use (char)200 <--> (byte)'È'.

Are you suggesting that VB magically manages to represent 65536
different values in a single byte? I suspect you'll find there are
plenty of Unicode characters (actually UCS-2 characters - let's not go
into full Unicode beyond U+FFFF for the moment) for which ASC doesn't work
on systems with a fixed single-byte default character encoding.

> There is no way if I use an encoder: the char I encode is not returned
> correctly decoding it, i.e. I can encode "" into a byte value (of course
> using a NON double byte encoder) and when I decode it back I could get a
> "".

If you use the same encoding for both encoding and decoding, *and* if
that encoding supports the character you wish to encode, it will always
return the correct character.

> This is not so good when I do communications via socket or via RS232.

Well, it's not so good if you don't use the same encoding on both
sides...

> The ASCII table give a number (and only one) for each char.
> Encoder/Decoder seems to assign different chars to the same number or seems
> to lose information so decoding the number I could get a char that is not
> the one encoded.

You still seem to be confused as to the purpose of encodings. Please
read
http://www.pobox.com/~skeet/csharp/unicode.html

Jon

Aug 4 '06 #24
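
The two conditions above (same encoding on both sides, and an encoding that supports the character) can be seen in a small sketch. ISO-8859-1 and the two sample characters are assumptions chosen for illustration. The second half asks the framework to throw instead of silently substituting, which is one way to notice data loss early.

```csharp
using System;
using System.Text;

class LossyRoundTrip
{
    static void Main()
    {
        string supported = "\u00C8";   // Latin capital E with grave, present in ISO-8859-1
        string unsupported = "\u0416"; // Cyrillic capital Zhe, absent from ISO-8859-1

        // Default behaviour: unsupported characters are replaced on encoding.
        Encoding lenient = Encoding.GetEncoding(28591);
        Console.WriteLine("supported survives:   {0}", RoundTrips(lenient, supported));
        Console.WriteLine("unsupported survives: {0}", RoundTrips(lenient, unsupported));

        // Strict behaviour: unsupported characters raise an exception instead.
        Encoding strict = Encoding.GetEncoding(
            28591,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(unsupported);
            Console.WriteLine("encoded without loss");
        }
        catch (EncoderFallbackException)
        {
            Console.WriteLine("character not representable in ISO-8859-1");
        }
    }

    // Encodes and decodes with the same encoding and compares the result.
    static bool RoundTrips(Encoding encoding, string text)
    {
        return encoding.GetString(encoding.GetBytes(text)) == text;
    }
}
```

The lossy case is the situation described in the post above: the byte that comes out of the encoder no longer identifies the original character, so no decoder can restore it.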

P: n/a
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
> If you use the same encoding for both encoding and decoding, *and* if
> that encoding supports the character you wish to encode, it will always
> return the correct character.

I could be confused about this, but I'm not so stupid as to use different
encoders to encode and decode.
If I get some time I'll provide an example.

Aug 4 '06 #25

P: n/a
Fabio Z wrote:
> "Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
>> If you use the same encoding for both encoding and decoding, *and* if
>> that encoding supports the character you wish to encode, it will always
>> return the correct character.
>
> I could be confused about this but I'm not so stupid to use different
> encoders to encode and decode.

And similarly the designers of encodings aren't so stupid as to stop
you from encoding and then decoding to get back the original text :)

> If I get some time I'll provide an example.

That would be good. I suspect you'll find it hard to provide one
without including characters which aren't supported by the chosen
encoding (or a code error).

Jon

Aug 4 '06 #26

P: n/a
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message

>> If I get some time I'll provide an example.
>
> That would be good. I suspect you'll find it hard to provide one
> without including characters which aren't supported by the chosen
> encoding (or a code error).

:)

OK. While you wait for it, can you give me an example that converts a byte[]
containing all the byte values 0..255 to a string, and then converts it
back to the original byte array?

Aug 4 '06 #27

P: n/a
Fabio Z wrote:
> "Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
>
>>> If I get some time I'll provide an example.
>> That would be good. I suspect you'll find it hard to provide one
>> without including characters which aren't supported by the chosen
>> encoding (or a code error).
>
> :)
>
> Ok, waiting for it, can you give me an example that can convert a byte[]
> that contains all the 0..255 byte values to a string and that convert it
> back to the original byte array.

Sure - although it's not a good idea (see later).

using System;
using System.Text;

class Test
{
    static void Main()
    {
        // Every possible byte value, 0..255.
        byte[] b = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            b[i] = (byte)i;
        }

        // Code page 28591 is ISO-8859-1.
        Encoding enc = Encoding.GetEncoding(28591);
        string x = enc.GetString(b);

        // Encode the string again and compare with the original bytes.
        byte[] o = enc.GetBytes(x);
        Console.WriteLine("Length={0}", o.Length);
        for (int i = 0; i < 256; i++)
        {
            if (o[i] != i)
            {
                Console.WriteLine("Difference at index {0}", i);
            }
        }
    }
}

Now, that's demonstrating that it happens to work, but it's not a good
way of encoding arbitrary binary data. To do that, I'd recommend using
Base64 - Convert.ToBase64String and Convert.FromBase64String.

Encodings should be used when you *start* with text data, encode it to
binary, and then decode that binary to text data. Decoding binary data
which didn't really start off as text and then get encoded is a bad
idea.

Jon

Aug 4 '06 #28

P: n/a
Fabio Z wrote:
> "Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
>
>>> If I get some time I'll provide an example.
>> That would be good. I suspect you'll find it hard to provide one
>> without including characters which aren't supported by the chosen
>> encoding (or a code error).
>
> :)
>
> Ok, waiting for it, can you give me an example that can convert a byte[]
> that contains all the 0..255 byte values to a string and that convert it
> back to the original byte array.

You didn't say anything about requiring the string to contain the same
number of characters as the byte[] array has members, so:

using System;

class Program
{
    static void Main(string[] args)
    {
        byte[] data = new byte[1024];

        // Fill the array with four copies of every byte value 0..255.
        for (int i = 0; i <= 255; i++)
        {
            data[i] =
                data[i + 256] =
                data[i + 512] =
                data[i + 768] = (byte)i;
        }

        // I have some byte data, but I can't print it!

        string printable = Convert.ToBase64String(data);

        // Now I have the data in printable form, look:

        Console.WriteLine(printable);

        // I should be able to get the data back, of course:

        byte[] data2 = Convert.FromBase64String(printable);

        // is it the same?
        bool theSame = data.Length == data2.Length;

        for (int i = 0; theSame && i < data.Length; i++)
        {
            if (data[i] != data2[i])
            {
                theSame = false;
            }
        }

        if (theSame)
            Console.WriteLine("Data is the same after transformation");
        else
            Console.WriteLine("Data is NOT the same!!!!");

        Console.ReadLine();
    }
}


--
Larry Lard
la*******@googlemail.com
The address is real, but unread - please reply to the group
For VB and C# questions - tell us which version
Aug 4 '06 #29

P: n/a

Jon Skeet [C# MVP] wrote (inline):

<snip>
> How can you say there isn't any transformation, and then talk about
> there being a mapping from each byte value to a character? That *is*
> the transformation.

I thought it was clear that the kind of transformation I was talking
about had to do with dropping control chars or composition of chars
outside the ANSI range (codes 0 to 255). Of course, mapping a single
byte to the corresponding (ANSI) char is the actual transformation.
Thanks for pointing it out.

> Talking about "the" Windows ANSI Encoding is like talking about "the"
> extended ASCII encoding. There are lots of different encodings which
> exhibit the same behaviour as 1252, i.e. they have a mapping from any
> byte to one of the 256 characters they represent. Each represents a
> different set of 256 characters.

I guess you're right when you say that there are other encodings that
act like the ANSI encoding, i.e., provide a one-to-one mapping from
byte to char. It would be nice if someone (yourself, perhaps) took the
time to identify them. People having to deal with legacy encodings
would certainly appreciate that.

On the other hand, I assume that there is *the* ANSI encoding,
comprising the 256 chars chosen by Microsoft to represent the Western
European Latin char set, loosely based on an ANSI draft of the time
(thus the characterization as Windows-ANSI), which is code page 1252.
Of course, I may be wrong.

<snip>
> Code page 1252 has no magic powers.

:-)) No, it certainly hasn't.

Best regards,

Branco.

Aug 4 '06 #30

P: n/a
Branco Medeiros <br*************@gmail.com> wrote:
> <snip>
>> How can you say there isn't any transformation, and then talk about
>> there being a mapping from each byte value to a character? That *is*
>> the transformation.
>
> I thought it was clear that the kind of transformation I was talking
> about had to do with dropping control chars or composition of chars
> outside the ANSI range (codes 0 to 255).

No - although *something* has to happen to characters outside the range
of the character set. (Note that Windows-1252 is definitely *not*
Unicode 0-255. They differ in the range 128 to 159 inclusive.)

> Of course, mapping a single
> byte to the corresponding (ANSI) char is the actual transformation.
> Thanks for pointing it out.

And that's the same kind of thing that other encodings do, except they
may not be single byte to single char.

>> Talking about "the" Windows ANSI Encoding is like talking about "the"
>> extended ASCII encoding. There are lots of different encodings which
>> exhibit the same behaviour as 1252, i.e. they have a mapping from any
>> byte to one of the 256 characters they represent. Each represents a
>> different set of 256 characters.
>
> I guess you're right when you say that there are other encodings that
> act like the ANSI encoding, i.e., provide a one-to-one mapping from
> byte to char. It would be nice if someone (yourself, perhaps) took the
> time to identify them. People having to deal with legacy encodings
> would certainly appreciate that.
>
> On the other hand, I assume that there is *the* ANSI encoding,
> comprising the 256 chars chosen by Microsoft to represent the Western
> European Latin char set, loosely based on an ANSI draft of the time
> (thus the characterization as Windows-ANSI), which is code page 1252.
> Of course, I may be wrong.

I *think* you are, I'm afraid.

http://www.stylusstudio.com/xsllist/...post01200.html
and
http://www.stylusstudio.com/xsllist/...post61190.html
have a bit more information.

For another example of a character encoding which could be regarded as
an "ANSI" encoding, consider ASCII. This is also known as
ANSI_X3.4-1968 (according to
http://www.iana.org/assignments/character-sets)

I *believe* people often talk about whatever their default
256-character encoding is as an "ANSI encoding" - and that's not always
Windows-1252.

For more evidence of this, see
http://en.wikipedia.org/wiki/Code_pa....29_code_pages

In particular:
<quote>
Microsoft defined a number of code pages known as the ANSI code pages
(as the first one, 1252 was based on an ansi draft of what became ISO
8859-1).
</quote>

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Aug 4 '06 #31
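
The difference between Windows-1252 and the first 256 Unicode code points can be listed with a short sketch that decodes every byte value both ways. One assumption to note: on .NET Core and later, code page 1252 is only available after registering the code pages provider, as done on the first line of Main; on the classic .NET Framework that line is unnecessary and can be removed.

```csharp
using System;
using System.Text;

class CodePageDifference
{
    static void Main()
    {
        // Assumption: needed on .NET Core / .NET 5+ to make code page 1252
        // available; not required on the classic .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding latin1 = Encoding.GetEncoding(28591); // ISO-8859-1
        Encoding cp1252 = Encoding.GetEncoding(1252);  // Windows-1252

        for (int b = 0; b < 256; b++)
        {
            byte[] one = { (byte)b };
            char fromLatin1 = latin1.GetString(one)[0];
            char from1252 = cp1252.GetString(one)[0];

            if (fromLatin1 != from1252)
            {
                Console.WriteLine(
                    "0x{0:X2}: ISO-8859-1 U+{1:X4}, Windows-1252 U+{2:X4}",
                    b, (int)fromLatin1, (int)from1252);
            }
        }
    }
}
```

Any lines printed should fall between 0x80 and 0x9F, the range mentioned above; byte 0x80, for example, is the euro sign (U+20AC) in Windows-1252.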

P: n/a
Ok, waiting for it, can you give me an example that can convert a byte[]
that contains all the 0..255 byte values to a string and that convert it
back to the original byte array.
Wrong.
Most encodings have undefined areas and do not cover the complete range from
0 to 255. So some values will not be converted to Unicode (because they are
not allocated in the original encoding, to begin with).

If 0..255 is what you need, then it is not text data, it is binary data,
and you should use some other way to convert it to text for transfer
(MIME, BinHex, etc.).
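
To make that concrete, here is a minimal sketch in C#, assuming Base64 is an acceptable transfer format for you. The class and method names are made up for illustration; only Convert.ToBase64String and Convert.FromBase64String are real framework calls. It takes a byte[] holding every value 0..255, turns it into a string, and gets the identical bytes back:

```csharp
using System;

static class Base64RoundTrip
{
    // Hypothetical helper: encodes all 256 byte values and checks the round trip.
    public static bool RoundTrips()
    {
        // Every possible byte value, including 0x00.
        byte[] original = new byte[256];
        for (int i = 0; i < original.Length; i++)
        {
            original[i] = (byte)i;
        }

        // Binary -> text: the result uses only A-Z, a-z, 0-9, '+', '/' and '='.
        string text = Convert.ToBase64String(original);

        // Text -> binary: decodes back to a byte array.
        byte[] decoded = Convert.FromBase64String(text);

        bool same = decoded.Length == original.Length;
        for (int i = 0; same && i < original.Length; i++)
        {
            same = decoded[i] == original[i];
        }
        return same;
    }

    static void Main()
    {
        Console.WriteLine(RoundTrips());
    }
}
```

The string is about a third longer than the raw data, which is the price of staying inside characters that any text channel will pass through unchanged.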
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Aug 5 '06 #32

P: n/a
On the other hand, I assume that there is *the* Ansi encoding,
comprising the 256 chars chosen by Microsoft to represent the Western
European Latin char set, loosely based on an ANSI draft of the time
(thus the characterization as Windows-Ansi), which is code page 1252.
Of course, I may be wrong.
What MS documentation means when it says ANSI code page is not 1252.
It is the "default system code page" and depends on the system locale.
It is 932 on Japanese systems, 1251 on Russian, and so on
(you can get the ANSI CP for a locale by using
GetLocaleInfo with LOCALE_IDEFAULTANSICODEPAGE).
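
From managed code, the same per-locale value is exposed through TextInfo.ANSICodePage. A rough sketch, assuming the listed cultures are installed on the machine; the class name, the helper name and the choice of cultures are mine, just for illustration:

```csharp
using System;
using System.Globalization;

static class AnsiCodePages
{
    // Hypothetical helper: looks up the ANSI code page associated with a culture.
    public static int AnsiCodePageFor(string cultureName)
    {
        return CultureInfo.GetCultureInfo(cultureName).TextInfo.ANSICodePage;
    }

    static void Main()
    {
        // One line per culture: name and its ANSI code page number.
        foreach (string name in new[] { "en-US", "ja-JP", "ru-RU", "pl-PL" })
        {
            Console.WriteLine("{0}: {1}", name, AnsiCodePageFor(name));
        }
    }
}
```

On a typical Windows installation I would expect four different numbers here, which is the point: "the ANSI code page" is a per-locale setting, not a single encoding.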
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Aug 5 '06 #33

P: n/a
"Mihai N." <nm**************@yahoo.com> wrote in message

If 0..255 is what you need, then it is not text data, it is binary data,
and you should use some other way to convert it to text for transfer
(MIME, BinHex, etc.).
A string is not just "text".
It is a sequence of chars (which in memory are bytes).

So I think you are all definitely talking a different language than me
about this issue.

My initial code works well and cannot be replaced by some trick such as MIME
or Base64 encoding, which transforms the original value.

The old CopyMemory() did the job the way I want, because it does not say
to itself "oh! this is not text! I refuse to convert it to bytes".
It treats strings as what they are: a sequence of bytes, nothing more,
nothing less.

;)
Aug 5 '06 #34

P: n/a
Fabio <zn*******@virgilio.it> wrote:
If 0..255 is what you need, then it is not text data, it is binary data,
and you should use some other way to convert it to text for transfer
(MIME, BinHex, etc.).

A string is not just "text".
It is a sequence of chars (which in memory are bytes).
The in-memory encoding happens to be UTF-16. It's almost irrelevant
though.
So I think you are all definitely talking a different language than me
about this issue.

My initial code works well and cannot be replaced by some trick such as MIME
or Base64 encoding, which transforms the original value.
When you're passing binary data around as text, you really want to make
sure it doesn't get screwed up by systems which assume null-terminated
strings etc. Base64 copes with this. Your code doesn't.
The old CopyMemory() did the job the way I want, because it does not say
to itself "oh! this is not text! I refuse to convert it to bytes".
It treats strings as what they are: a sequence of bytes, nothing more,
nothing less.
You're doomed to run into encoding issues with that mentality, I'm
afraid. Treat binary data as binary data, text as text, and encode
between the two in rigidly defined ways. Anything else leads to
problems.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Aug 5 '06 #35

P: n/a
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
You're doomed to run into encoding issues with that mentality, I'm
afraid. Treat binary data as binary data, text as text, and encode
between the two in rigidly defined ways. Anything else leads to
problems.
Ok :)
With my mentality I'm doomed to make serial port and socket communications
work [efficiently] :)
With the "bug-free text encoding mentality" they don't.

I'll accept my doom on this argument :)

I'll leave Base64 and MIME encoding to their role: sending and receiving
e-mails.

Aug 5 '06 #36

P: n/a
Fabio <zn*******@virgilio.it> wrote:
You're doomed to run into encoding issues with that mentality, I'm
afraid. Treat binary data as binary data, text as text, and encode
between the two in rigidly defined ways. Anything else leads to
problems.

Ok :)
With my mentality I'm doomed to make serial port and socket communications
work [efficiently] :)
Serial ports and sockets deal with binary data. If you've got binary
data you want to send across serial ports and sockets, you shouldn't be
converting it to or from a string to start with.
With the "bug-free text encoding mentality" they don't.

I'll accept my doom on this argument :)

I'll leave Base64 and MIME encoding to their role: sending and receiving
e-mails.
I don't remember anyone other than yourself bringing up MIME encoding
(although I could be wrong). Base64 has plenty of uses outside email.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Aug 6 '06 #37

P: n/a
A string is not just "text".
It is a sequence of chars (which in memory are bytes).
I am not sure what you mean. Text is also "a sequence of chars".
What is the difference between text and string?

The main difference between text/string and "just bytes" is that not every
sequence of bytes constitutes valid text.
So I think you are all definitely talking a different language than me
about this issue.
Probably.
My initial code works well and cannot be replaced by some trick such as
MIME or Base64 encoding, which transforms the original value.
Then
The old CopyMemory() did the job the way I want, because it does not say
to itself "oh! this is not text! I refuse to convert it to bytes".
The only code I have seen from you is this:
public static byte[] GetBytes(string s)
{
    byte[] b = new byte[s.Length];
    for (int i = 0; i < b.Length; ++i)
    {
        b[i] = (byte)s[i];
    }
    return b;
}
which casts from a character (16 bits) to a byte (8 bits).
So it is 100% sure to lose information.
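
A small sketch of what that cast does to characters above U+00FF, compared with a real encoding. The class name and the sample string are mine; the loop is the same cast as in the code above, with unchecked written out to make the truncation explicit:

```csharp
using System;
using System.Text;

static class NarrowingCastDemo
{
    // Same cast as the quoted GetBytes: keeps only the low 8 bits of each char.
    public static byte[] CastToBytes(string s)
    {
        byte[] b = new byte[s.Length];
        for (int i = 0; i < b.Length; ++i)
        {
            b[i] = unchecked((byte)s[i]);
        }
        return b;
    }

    static void Main()
    {
        // 'A', e-acute (U+00E9), euro sign (U+20AC)
        string s = "A\u00E9\u20AC";

        // Cast: prints 41-E9-AC, so the euro sign has collapsed to the single byte 0xAC.
        Console.WriteLine(BitConverter.ToString(CastToBytes(s)));

        // UTF-8: prints 41-C3-A9-E2-82-AC, from which the string can be recovered.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));
    }
}
```

After the cast there is no way to tell whether 0xAC came from U+20AC or from U+00AC, so the original string cannot be rebuilt from the bytes.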
It treats strings as what they are: a sequence of bytes, nothing more,
nothing less.
Nope. Strings are "a certain type of sequence of bytes".
Any string is a sequence of bytes, but not every sequence of bytes is a string.
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Aug 6 '06 #38
