Bytes IT Community

StreamReader(), System.Text.Encoding and German character Ä.

Hi,

I am puzzled by the following and seeking some assistance to help me
understand what happened. I have very limited encoding knowledge.

Our SAP system writes out a text file which includes German characters.

1. When I use StreamReader(System.String filepath) without specifying an
encoding, German characters such as Ä are lost when I do a ReadLine().

I understand that by default UTF8Encoding is used.
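Here is a small repro sketch of what I think is happening (the file path and contents are hypothetical stand-ins for the SAP output: a single byte 0xC4, which is Ä in ISO-8859-1/Windows-1252; I use ISO-8859-1 here in place of whatever ANSI code page SAP actually used):

```csharp
using System;
using System.IO;
using System.Text;

class ReproSketch
{
    static void Main()
    {
        // Hypothetical stand-in for the SAP output file: a single byte
        // 0xC4, which is "Ä" in ISO-8859-1 / Windows-1252.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0xC4 });

        // The default StreamReader assumes UTF-8. A lone 0xC4 is an
        // invalid UTF-8 sequence, so it decodes to the replacement
        // character U+FFFD instead of "Ä".
        using (StreamReader reader = new StreamReader(path))
        {
            string text = reader.ReadToEnd();
            Console.WriteLine((int)text[0]); // 65533, i.e. U+FFFD
        }

        // Naming the single-byte encoding explicitly recovers the Ä.
        using (StreamReader reader = new StreamReader(path, Encoding.GetEncoding("iso-8859-1")))
        {
            Console.WriteLine(reader.ReadToEnd()); // Ä
        }
    }
}
```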

2. When I use StreamReader(System.String filepath,
System.Text.Encoding.Unicode), I get garbage data. I am not surprised by
this, as our SAP system is not configured to read/write Unicode text output.

3. When I use StreamReader(System.String filepath,
System.Text.Encoding.Default), reading the German characters is not a
problem. I guess this is also why I am able to read these German characters
in Wordpad.exe. Through the debugger, I was able to find out that
System.Text.Encoding.Default is an instance of System.Text.SBCSCodePageEncoding.

Does anyone have any idea what System.Text.SBCSCodePageEncoding is? Is it
one type of "extended ASCII" encoding?

4. My guess is that SAP's text-writing encoding happens to be "compatible"
with my system's default encoding, and this is why StreamReader(filepath,
System.Text.Encoding.Default) worked. If they were not compatible, it would
not have worked. Is that correct?

5. Is there a "cover all cases, automatic" type of encoding method? That is,
an encoding method which is smart enough to detect the correct encoding and
decode on the fly. (This is an encoding beginner's question....)

Thanks,
--
George
Aug 9 '06 #1
4 Replies


George <wa**@nospam.nospam> wrote:
> I am puzzled by the following and seeking some assistance to help me
> understand what happened. I have very limited encoding knowledge.
>
> Our SAP system writes out a text file which includes German characters.
>
> 1. When I use StreamReader(System.String filepath) without specifying an
> encoding method, the German characters such as Ä are lost when I do a
> ReadLine()
>
> I understand that by default UTF8Encoding is used.

Yes.

> 2. When I use StreamReader(System.String filepath,
> System.Text.Encoding.Unicode), I got garbage data in. I am not surprised by
> this as our SAP system is not configured to read/write Unicode text output.

Okay.

> 3. When I use StreamReader(System.String filepath,
> System.Text.Encoding.Default), reading German characters are not a problem.
> I guess this is also why I am able to read these German characters in
> Wordpad.exe. Through the debugger, I was able to find out that
> System.Text.Encoding.Default is equal to System.Text.SBCSCodePageEncoding.
>
> Does anyone have any idea what System.Text.SBCSCodePageEncoding is? Is it
> one type of "extended ASCII" encoding?

Yes - it's a code page which is capable of encoding 256 characters, one
per possible byte. I'm guessing that SBCS is "Single Byte Code Page" or
"Single Byte Character Set".

> 4. My guess is that SAP's text-writing encoding happens to be "compatible"
> with my system's default encoding, and this is why StreamReader(filepath,
> System.Text.Encoding.Default) worked. If they were not compatible, it would
> not have worked. Is that correct?

My guess is that SAP writes out the text in whatever the default
encoding for the system it's running on is. It would be worth trying to
configure it to use a *specific* encoding though.

> 5. Is there a "cover all cases, automatic" type of encoding method? An
> encoding method which is smart enough to detect and decode to the
> correct encoding on the fly. (This is an encoding beginner's question....)

No, that's not possible. For instance, every UTF-8 file is a valid file
in your default encoding - and other default encodings, too. It's one
of the problems of text decoding - you really need to know what it is.
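To see why detection can't be reliable, note that the same bytes often decode "successfully" under several encodings. A quick sketch (using é rather than Ä, since the Latin-1 reading of é's UTF-8 bytes happens to be printable; ISO-8859-1 stands in for a typical single-byte default encoding):

```csharp
using System;
using System.Text;

class AmbiguitySketch
{
    static void Main()
    {
        // 0xC3 0xA9 is the UTF-8 encoding of "é" ...
        byte[] bytes = { 0xC3, 0xA9 };
        Console.WriteLine(Encoding.UTF8.GetString(bytes)); // é

        // ... but the very same two bytes are also two perfectly valid
        // ISO-8859-1 characters, so both decodings "succeed" and no
        // detector can know which one the writer intended.
        Console.WriteLine(Encoding.GetEncoding("iso-8859-1").GetString(bytes)); // Ã©
    }
}
```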

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Aug 9 '06 #2

Thanks to Jon for his complete and helpful input.

Hi George,

I think Jon has provided you with the right answers. Let me just help
confirm the following things:

Most applications (including the SAP application in your environment) will
save an output text file in one of the following encodings/charsets:

1. A Unicode charset: either UTF-16 (two-byte wide-char encoding) or
UTF-8 (efficient for network transmission).

2. The charset/encoding that matches the machine's "System Locale", which
you can find in your machine's
"Control Panel ---> Regional and Language Options ---> Advanced" panel.

The system locale is the charset/encoding that will be used for manipulating
non-Unicode text files, and it is usually an MBCS (multi-byte charset).
In addition, as Jon said, we'd better make sure of the exact charset/encoding
of a text file before reading it; there is no 100% accurate means of
autodetecting it. Some editors, like Notepad, will try detecting the
charset of a text file among the following ones:

** Unicode (two-byte wide-char encoding, UTF-16)
** Unicode (multi-byte char encoding), UTF-8

** The charset/encoding associated with the machine's current system locale.

If the text file uses an encoding/charset beyond the above ones, Notepad will
be unable to read it correctly.
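As a side note, StreamReader itself performs a similar (byte-order-mark-only) detection. A small sketch, assuming a hypothetical temp file that starts with a UTF-16 LE byte order mark:

```csharp
using System;
using System.IO;
using System.Text;

class BomSketch
{
    static void Main()
    {
        // Hypothetical file: UTF-16 LE byte order mark (FF FE),
        // followed by "Ä" encoded as C4 00.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0xFF, 0xFE, 0xC4, 0x00 });

        // The third argument turns on BOM sniffing: StreamReader sees
        // FF FE, switches to UTF-16 LE, and ignores the fallback
        // encoding we passed in. Without a BOM, the fallback is used.
        using (StreamReader reader = new StreamReader(path, Encoding.GetEncoding("iso-8859-1"), true))
        {
            Console.WriteLine(reader.ReadToEnd()); // Ä
            Console.WriteLine(reader.CurrentEncoding.WebName);
        }
    }
}
```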

Please feel free to post here if there is anything else you wonder.

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead

This posting is provided "AS IS" with no warranties, and confers no rights.
Aug 10 '06 #3

Thanks for the reply.

Regarding auto-detecting the code page for encoding and decoding, the reason
I asked was that, as Steven mentioned earlier, Microsoft applications seem
to be doing some sort of auto-detection as well.

If I go to "Control Panel ---> Regional and Language Options --->
Advanced" panel, I can see that "English (United States)" is selected for my
machine. Many conversion tables are selected by default, such as Code Page
850 (Multilingual Latin), which is probably the extended ASCII conversion
table. Code Pages 852, 855, and so on are selected as well.

I also have the Asian language pack installed, hence Code Pages 10001,
10002, etc. are selected as well.

My guess is that Code Page 852 uses 8 bits while Code Page 10002 uses >= 16
bits (not sure how many bits).

My question here is: when System.Text.Encoding.Default is called, how does
it know which conversion table it should return, as many conversion tables
are selected?

In the case of my own machine, System.Text.SBCSCodePageEncoding is
returned. I am guessing that it is not Code Page 10002, which is Traditional
Chinese Big5.

Thanks for any clarification.
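For reference, here is a small sketch I could run to see which single conversion table Encoding.Default actually resolves to (these are standard Encoding properties; note that on newer .NET Core runtimes Encoding.Default reports UTF-8 rather than the ANSI code page):

```csharp
using System;
using System.Text;

class DefaultEncodingSketch
{
    static void Main()
    {
        // Only one table is ever returned: the one mapped to the
        // system's ANSI code page (or UTF-8 on .NET Core and later).
        Encoding enc = Encoding.Default;
        Console.WriteLine(enc.CodePage);
        Console.WriteLine(enc.EncodingName);
        Console.WriteLine(enc.IsSingleByte);
    }
}
```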

--
George
"Steven Cheng[MSFT]" wrote:
<snip>
Aug 10 '06 #4

Hello George,

As for Encoding.Default, it returns the current ANSI code page of the
system (according to the machine's System Locale setting).

#Encoding.Default Property
http://msdn2.microsoft.com/en-us/lib...g.default.aspx

And from the disassembled code of Encoding.Default, it uses the following
logic:

===========
private static Encoding CreateDefaultEncoding()
{
    int num1 = Win32Native.GetACP();
    if (num1 == 0x4e4)
    {
        return new SBCSCodePageEncoding(num1);
    }
    return Encoding.GetEncoding(num1);
}
===========

So apparently it uses the "GetACP" (get ANSI code page) Windows platform
API to query the current system ANSI code page.

Also, "0x4e4" is 1252 in decimal, which maps to the Windows-1252 (Latin-1)
charset. This is the charset (a single-byte charset) which contains
sufficient characters for English (United States) and most Western European
countries. And on your system, since the system locale is set to "English
(United States)", the current ACP is 0x4e4, so it internally returns an
SBCSCodePageEncoding (an internal class). For other system locales, such as
Chinese (P.R.C.), it will return a different code page number.

I cannot tell you the exact internal algorithm. The supported means of
getting the correct ANSI code page is to first determine the
language-region (locale) info, and then use the Windows API (or a .NET
Framework class) to query the code page info from that locale.

In the Win32 API, we use the "GetLocaleInfo" method (passing the
LOCALE_IDEFAULTANSICODEPAGE flag to query the default ANSI code page of
that locale):

#GetLocaleInfo
http://msdn.microsoft.com/library/en...asp?frame=true

In the .NET Framework, we can use the CultureInfo class to construct a
locale object and then use its "TextInfo" member to get the default ANSI
code page (or other text formatting and encoding information) of that
locale. For example, the following code queries the default ANSI code pages
of the English (United States) and Chinese (P.R.C.) locales:

========================
CultureInfo ci1 = new CultureInfo("en-US");
Encoding enc1 = Encoding.GetEncoding(ci1.TextInfo.ANSICodePage);
Console.WriteLine(enc1.EncodingName);

CultureInfo ci2 = new CultureInfo("zh-CN");
Encoding enc2 = Encoding.GetEncoding(ci2.TextInfo.ANSICodePage);
Console.WriteLine(enc2.EncodingName);
========================

Hope this helps clarify further. If you have anything unclear above, please
feel free to let me know.

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead

This posting is provided "AS IS" with no warranties, and confers no rights.

Aug 11 '06 #5

This discussion thread is closed; replies have been disabled.