Bytes IT Community

StreamReader(), System.Text.Encoding and German character Ä.

Hi,

I am puzzled by the following and seeking some assistance to help me
understand what happened. I have very limited encoding knowledge.

Our SAP system writes out a text file which includes German characters.

1. When I use StreamReader(System.String filepath) without specifying an
encoding, German characters such as Ä are lost when I do a ReadLine().

I understand that by default UTF8Encoding is used.
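Here is a small repro sketch of what I think is happening (the file path and contents are hypothetical stand-ins for the SAP output: a single byte 0xC4, which is Ä in ISO-8859-1/Windows-1252; I use ISO-8859-1 here in place of whatever ANSI code page SAP actually used):

```csharp
using System;
using System.IO;
using System.Text;

class ReproSketch
{
    static void Main()
    {
        // Hypothetical stand-in for the SAP output file: a single byte
        // 0xC4, which is "Ä" in ISO-8859-1 / Windows-1252.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0xC4 });

        // The default StreamReader assumes UTF-8. A lone 0xC4 is an
        // invalid UTF-8 sequence, so it decodes to the replacement
        // character U+FFFD instead of "Ä".
        using (StreamReader reader = new StreamReader(path))
        {
            string text = reader.ReadToEnd();
            Console.WriteLine((int)text[0]); // 65533, i.e. U+FFFD
        }

        // Naming the single-byte encoding explicitly recovers the Ä.
        using (StreamReader reader = new StreamReader(path, Encoding.GetEncoding("iso-8859-1")))
        {
            Console.WriteLine(reader.ReadToEnd()); // Ä
        }
    }
}
```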

2. When I use StreamReader(System.String filepath,
System.Text.Encoding.Unicode), I get garbage data. I am not surprised by
this, as our SAP system is not configured to read/write Unicode text output.

3. When I use StreamReader(System.String filepath,
System.Text.Encoding.Default), reading the German characters is not a
problem. I guess this is also why I am able to read these German characters
in Wordpad.exe. Through the debugger, I was able to find out that
System.Text.Encoding.Default is an instance of System.Text.SBCSCodePageEncoding.

Does anyone have any idea what System.Text.SBCSCodePageEncoding is? Is it
one type of "extended ASCII" encoding?

4. My guess is that SAP's text-writing encoding happens to be "compatible"
with my system's default encoding, and this is why StreamReader(filepath,
System.Text.Encoding.Default) worked. If they were not compatible, it would
not have worked. Is that correct?

5. Is there a "cover all cases, automatic" type of encoding method? That is,
an encoding method which is smart enough to detect the correct encoding and
decode on the fly. (This is an encoding beginner's question....)

Thanks,
--
George
Aug 9 '06 #1
4 Replies


George <wa**@nospam.nospam> wrote:
> I am puzzled by the following and seeking some assistance to help me
> understand what happened. I have very limited encoding knowledge.
>
> Our SAP system writes out a text file which includes German characters.
>
> 1. When I use StreamReader(System.String filepath) without specifying an
> encoding method, the German characters such as Ä are lost when I do a
> ReadLine()
>
> I understand that by default UTF8Encoding is used.

Yes.

> 2. When I use StreamReader(System.String filepath,
> System.Text.Encoding.Unicode), I got garbage data in. I am not surprised by
> this as our SAP system is not configured to read/write Unicode text output.

Okay.

> 3. When I use StreamReader(System.String filepath,
> System.Text.Encoding.Default), reading German characters are not a problem.
> I guess this is also why I am able to read these German characters in
> Wordpad.exe. Through the debugger, I was able to find out that
> System.Text.Encoding.Default is equal to System.Text.SBCSCodePageEncoding.
>
> Does anyone have any idea what System.Text.SBCSCodePageEncoding is? Is it
> one type of "extended ASCII" encoding?

Yes - it's a code page which is capable of encoding 256 characters, one
per possible byte. I'm guessing that SBCS is "Single Byte Code Page" or
"Single Byte Character Set".

> 4. My guess is that SAP's text-writing encoding happens to be "compatible"
> with my system's default encoding, and this is why StreamReader(filepath,
> System.Text.Encoding.Default) worked. If they were not compatible, it would
> not have worked. Is that correct?

My guess is that SAP writes out the text in whatever the default
encoding for the system it's running on is. It would be worth trying to
configure it to use a *specific* encoding though.

> 5. Is there a "cover all cases, automatic" type of encoding method? An
> encoding method which is smart enough to detect and decode to the
> correct encoding on the fly. (This is an encoding beginner's question....)

No, that's not possible. For instance, every UTF-8 file is a valid file
in your default encoding - and other default encodings, too. It's one
of the problems of text decoding - you really need to know what it is.
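To see why detection can't be reliable, note that the same bytes often decode "successfully" under several encodings. A quick sketch (using é rather than Ä, since the Latin-1 reading of é's UTF-8 bytes happens to be printable; ISO-8859-1 stands in for a typical single-byte default encoding):

```csharp
using System;
using System.Text;

class AmbiguitySketch
{
    static void Main()
    {
        // 0xC3 0xA9 is the UTF-8 encoding of "é" ...
        byte[] bytes = { 0xC3, 0xA9 };
        Console.WriteLine(Encoding.UTF8.GetString(bytes)); // é

        // ... but the very same two bytes are also two perfectly valid
        // ISO-8859-1 characters, so both decodings "succeed" and no
        // detector can know which one the writer intended.
        Console.WriteLine(Encoding.GetEncoding("iso-8859-1").GetString(bytes)); // Ã©
    }
}
```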

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Aug 9 '06 #2

Thanks to Jon for his complete and helpful input.

Hi George,

I think Jon has provided you with the right answers. Let me just help
confirm the following things:

Most applications (including the SAP application in your environment) will
save an output text file in one of the following encodings/charsets:

1. A Unicode charset: either UTF-16 (two-byte wide-char encoding) or
UTF-8 (efficient for network transmission).

2. The charset/encoding that matches the machine's "System Locale", which
you can find in your machine's
"Control Panel ---> Regional and Language Options ---> Advanced" panel.

The system locale is the charset/encoding that will be used for manipulating
non-Unicode text files, and it is usually an MBCS (multi-byte charset).
In addition, as Jon said, we'd better make sure of the exact charset/encoding
of a text file before reading it; there is no 100% accurate means of
autodetecting it. Some editors, like Notepad, will try detecting the
charset of a text file among the following ones:

** Unicode (two-byte wide-char encoding, UTF-16)
** Unicode (multi-byte char encoding), UTF-8

** The charset/encoding associated with the machine's current system locale.

If the text file uses an encoding/charset beyond the above ones, Notepad will
be unable to read it correctly.
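As a side note, StreamReader itself performs a similar (byte-order-mark-only) detection. A small sketch, assuming a hypothetical temp file that starts with a UTF-16 LE byte order mark:

```csharp
using System;
using System.IO;
using System.Text;

class BomSketch
{
    static void Main()
    {
        // Hypothetical file: UTF-16 LE byte order mark (FF FE),
        // followed by "Ä" encoded as C4 00.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0xFF, 0xFE, 0xC4, 0x00 });

        // The third argument turns on BOM sniffing: StreamReader sees
        // FF FE, switches to UTF-16 LE, and ignores the fallback
        // encoding we passed in. Without a BOM, the fallback is used.
        using (StreamReader reader = new StreamReader(path, Encoding.GetEncoding("iso-8859-1"), true))
        {
            Console.WriteLine(reader.ReadToEnd()); // Ä
            Console.WriteLine(reader.CurrentEncoding.WebName);
        }
    }
}
```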

Please feel free to post here if there is anything else you wonder.

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead

This posting is provided "AS IS" with no warranties, and confers no rights.
Aug 10 '06 #3

Thanks for the reply.

Regarding auto-detecting the code page for encoding and decoding, the reason
I asked was that, as Steven mentioned earlier, Microsoft applications seem
to be doing some sort of auto-detection as well.

If I go to "Control Panel ---> Regional and Language Options --->
Advanced" panel, I can see that "English (United States)" is selected for my
machine. Many conversion tables are selected by default, such as Code Page
850 (Multilingual Latin), which is probably the extended ASCII conversion
table. Code Pages 852, 855, and so on are selected as well.

I also have the Asian language pack installed, hence Code Pages 10001,
10002, etc. are selected as well.

My guess is that Code Page 852 uses 8 bits while Code Page 10002 uses >= 16
bits (not sure how many bits).

My question here is: when System.Text.Encoding.Default is called, how does
it know which conversion table it should return, as many conversion tables
are selected?

In the case of my own machine, System.Text.SBCSCodePageEncoding is
returned. I am guessing that it is not Code Page 10002, which is Traditional
Chinese Big5.

Thanks for any clarification.
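For reference, here is a small sketch I could run to see which single conversion table Encoding.Default actually resolves to (these are standard Encoding properties; note that on newer .NET Core runtimes Encoding.Default reports UTF-8 rather than the ANSI code page):

```csharp
using System;
using System.Text;

class DefaultEncodingSketch
{
    static void Main()
    {
        // Only one table is ever returned: the one mapped to the
        // system's ANSI code page (or UTF-8 on .NET Core and later).
        Encoding enc = Encoding.Default;
        Console.WriteLine(enc.CodePage);
        Console.WriteLine(enc.EncodingName);
        Console.WriteLine(enc.IsSingleByte);
    }
}
```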

--
George
"Steven Cheng[MSFT]" wrote:
<snip>
Aug 10 '06 #4

Hello George,

As for Encoding.Default, it returns the current ANSI code page of the
system (according to the machine's System Locale setting).

#Encoding.Default Property
http://msdn2.microsoft.com/en-us/lib...g.default.aspx

And from the disassembled code of Encoding.Default, it uses the following
logic:

===========
private static Encoding CreateDefaultEncoding()
{
    int num1 = Win32Native.GetACP();
    if (num1 == 0x4e4)
    {
        return new SBCSCodePageEncoding(num1);
    }
    return Encoding.GetEncoding(num1);
}
===========

So apparently it uses the "GetACP" (get ANSI code page) Windows platform
API to query the current system ANSI code page.

Also, "0x4e4" is 1252 in decimal, which maps to the Windows-1252 (Latin-1)
charset. This is the charset (a single-byte charset) which contains
sufficient characters for English (United States) and most Western European
countries. And on your system, since the system locale is set to "English
(United States)", the current ACP is 0x4e4, so it internally returns an
SBCSCodePageEncoding (an internal class). For other system locales, such as
Chinese (P.R.C.), it will return a different code page number.

I cannot tell you the exact internal algorithm. The supported means of
getting the correct ANSI code page is to first determine the
language-region (locale) info, and then use the Windows API (or a .NET
Framework class) to query the code page info from that locale.

In the Win32 API, we use the "GetLocaleInfo" method (passing the
LOCALE_IDEFAULTANSICODEPAGE flag to query the default ANSI code page of
that locale):

#GetLocaleInfo
http://msdn.microsoft.com/library/en...asp?frame=true

In the .NET Framework, we can use the CultureInfo class to construct a
locale object and then use its "TextInfo" member to get the default ANSI
code page (or other text formatting and encoding information) of that
locale. For example, the following code queries the default ANSI code pages
of the English (United States) and Chinese (P.R.C.) locales:

========================
CultureInfo ci1 = new CultureInfo("en-US");
Encoding enc1 = Encoding.GetEncoding(ci1.TextInfo.ANSICodePage);
Console.WriteLine(enc1.EncodingName);

CultureInfo ci2 = new CultureInfo("zh-CN");
Encoding enc2 = Encoding.GetEncoding(ci2.TextInfo.ANSICodePage);
Console.WriteLine(enc2.EncodingName);
========================

Hope this helps clarify further. If you have anything unclear above, please
feel free to let me know.

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead

This posting is provided "AS IS" with no warranties, and confers no rights.

Aug 11 '06 #5

This discussion thread is closed; replies have been disabled.