Bytes IT Community
character sets

Hi all,

I have an application that reads data in from a text file and stores it in a
database. My problem is that there are some characters in the file that
aren't being handled properly. For instance, one of the characters has an
ASCII code of 150 (it looks like a dash '-'). When I'm debugging, this
character is displayed as the square box that Windows uses for unsupported
characters, and when it's copied to the database it's stored as '?'.

I've played with the encoding while reading the file, but the default
encoding still works the best for all of the data. I can copy this
character to a simple text editor like Notepad and it's displayed properly.
The problem seems to be that the .NET character set used is OEM when what I
want is the ANSI character set. Can anyone help me with reading in all of
the characters in the file? Thanks in advance.

--Paul
Sep 12 '08 #1
10 Replies


On Sep 12, 3:05 am, "Paul W" <nos...@pw-review.com> wrote:
> I have an application that reads data in from a text file and stores it in a
> database. My problem is that there are some characters in the file that
> aren't being handled properly. For instance, one of the characters has an
> ASCII code of 150 (it looks like a dash '-')

There's no such thing as "ASCII code of 150" - ASCII only goes as far
as 127.

I *suspect* that Encoding.Default is what you're after, but read
http://pobox.com/~skeet/csharp/unicode.html and
http://pobox.com/~skeet/csharp/debuggingunicode.html for more
information.

Jon
Sep 12 '08 #2
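
Jon's point about the ASCII range can be checked by decoding the single byte 150 (0x96) under a few encodings. The following is only a sketch: the class and method names are made up for illustration, and it assumes a modern .NET runtime, where the legacy code pages must be registered first (on .NET Framework they are built in and the registration line can be dropped).

```csharp
using System;
using System.Text;

public static class Byte150Demo
{
    // Decodes one byte with the given encoding and returns the
    // resulting UTF-16 code unit formatted as "U+XXXX".
    public static string Describe(byte value, Encoding encoding)
    {
        string text = encoding.GetString(new[] { value });
        return "U+" + ((int)text[0]).ToString("X4");
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where code pages
        // such as 1252 and 437 are only available after registration.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        const byte b = 150; // 0x96

        // ASCII stops at 127, so the byte becomes '?'.
        Console.WriteLine("ASCII:        " + Describe(b, Encoding.ASCII));              // U+003F
        // Windows-1252 (a common "ANSI" code page): en dash.
        Console.WriteLine("Windows-1252: " + Describe(b, Encoding.GetEncoding(1252)));  // U+2013
        // OEM code page 437: u with circumflex.
        Console.WriteLine("OEM 437:      " + Describe(b, Encoding.GetEncoding(437)));   // U+00FB
        // ISO-8859-1 maps the byte straight to the C1 control U+0096.
        Console.WriteLine("ISO-8859-1:   " + Describe(b, Encoding.GetEncoding(28591))); // U+0096
    }
}
```

The same byte yields four different characters, which is why the "character code" alone says nothing until the encoding is known.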

Hi Paul,

It looks like the default encoding is not the correct one. An ANSI
character should be readable in any code page, although it may not display the
correct character. For comparison, ANSI character 150 is û on my system. If
you open the file in Notepad and select Save As ..., does it opt for ANSI,
UTF-8 or Unicode? If ANSI, do you get the file from another country/system
potentially running other code pages?

--
Happy Coding!
Morten Wennevik [C# MVP]
Sep 12 '08 #3

You will indeed get ? characters for extended ASCII characters if you try to
read ANSI-encoded text as ASCII. So, as Jon pointed out, Encoding.Default may
very well be what you need. Encoding.Default uses the default ANSI code page
for your locale. To specify a particular code page, use
Encoding.GetEncoding(nameofencoding).

--
Happy Coding!
Morten Wennevik [C# MVP]
Sep 12 '08 #4
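
As a rough illustration of the Encoding.GetEncoding route Morten mentions, the sketch below reads a file with a named code page. The helper name ReadAll and the choice of "windows-1252" are assumptions for the example; the thread never establishes which code page the file actually uses. The registration call is needed on .NET Core / .NET 5+ and can be dropped on .NET Framework.

```csharp
using System;
using System.IO;
using System.Text;

public static class CodePageReader
{
    // Reads the whole file, decoding its bytes with the named encoding.
    public static string ReadAll(string path, string encodingName)
    {
        Encoding encoding = Encoding.GetEncoding(encodingName);
        using (var reader = new StreamReader(path, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Assumption: modern .NET, where legacy code pages need registering.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Build a tiny test file: 'a', byte 150, 'b'.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { (byte)'a', 150, (byte)'b' });

        // Assumed code page for this example only.
        string text = ReadAll(path, "windows-1252");
        Console.WriteLine(text == "a\u2013b"); // True: byte 150 became an en dash

        File.Delete(path);
    }
}
```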

I've tried all of the Encoding settings available: Encoding.ASCII gives me
'?', Encoding.UTF8 and Encoding.Default give me the square box, and all other
settings give no useful data at all from the file. I'll take a look at
those pages, thanks for sending the links.

--Paul
Sep 12 '08 #5

See my response to Jon regarding the encoding. The reason I mention the
ANSI character set is because I have an editor that provides the character
codes for both OEM and ANSI. OEM shows the same character you are which is
then actually displayed as the square box. ANSI shows character 150 to be
the one actually in the file. This is all very confusing to me but I
believe I've got the correct encoding because the character code I'm
receiving is correct. I believe the problem is the character set. Is there
a way to switch between OEM and ANSI? Thanks for your help.

--Paul
Sep 12 '08 #6

On Sep 12, 2:39 pm, "Paul W" <nos...@pw-review.com> wrote:

<snip>
> See my response to Jon regarding the encoding. The reason I mention the
> ANSI character set is because I have an editor that provides the character
> codes for both OEM and ANSI. OEM shows the same character you are which is
> then actually displayed as the square box. ANSI shows character 150 to be
> the one actually in the file. This is all very confusing to me but I
> believe I've got the correct encoding because the character code I'm
> receiving is correct. I believe the problem is the character set. Is there
> a way to switch between OEM and ANSI? Thanks for your help.

When you say "the character code I'm receiving is correct" what
*exactly* do you mean? If possible, provide a short but complete
example which demonstrates the problem. Obviously in this case *we*
won't be able to run the code because we don't have the file, but it
could still help a lot.

Jon
Sep 12 '08 #7

I don't think a sample of code would help here. What I mean by "the
character code I'm receiving is correct" is that the value of 150 that I
mentioned before is the correct value. In the ANSI character set, that
value maps to a character similar to a '-' and this character displays
exactly as expected in other text editors such as Notepad. However, in the
OEM character set, the character code 150 maps to something different
completely and ultimately is displayed as a square box just like all
unsupported characters are displayed in Windows.

I hope I'm making more sense now. The numeric value I'm receiving is the
correct one, the problem is that the character set, OEM, doesn't map that
value to an appropriate character. There are a couple of other characters
in the data files that do this as well. I don't remember the actual values
off hand though. If I could get my program to use the ANSI character set
instead of the OEM character set my problem would be solved.

Thanks again for taking the time to help me work through this problem.

--Paul
Sep 13 '08 #8

Paul W <no****@pw-review.com> wrote:
> I don't think a sample of code would help here.

Well I really do, I'm afraid.

> What I mean by "the character code I'm receiving is correct" is that
> the value of 150 that I mentioned before is the correct value.

Where are you getting that value from? If you could show it in code, it
would really help.

> In the ANSI character set

Are you aware that there's no one fixed ANSI character encoding?
There's a whole collection of character encodings which use ASCII for
the 7 bit part and then do different things for the next 128 values.

> that value maps to a character similar to a '-' and this character
> displays exactly as expected in other text editors such as Notepad.
> However, in the OEM character set, the character code 150 maps to
> something different completely and ultimately is displayed as a
> square box just like all unsupported characters are displayed in
> Windows.

Unicode 150 (all .NET strings are in Unicode) is a control character
(start of guarded area). So if you're reading

> I hope I'm making more sense now.

Not really, because we still need the code.

> The numeric value I'm receiving is the correct one

It's not the correct one in Unicode, which is what you need to read in
for .NET. We also don't know what you mean by "the numeric value I'm
receiving" because we don't know how you're reading it.

> the problem is that the character set, OEM, doesn't map that
> value to an appropriate character.

OEM character encodings aren't getting involved at all here.

> There are a couple of other characters
> in the data files that do this as well. I don't remember the actual values
> off hand though. If I could get my program to use the ANSI character set
> instead of the OEM character set my problem would be solved.
>
> Thanks again for taking the time to help me work through this problem.

If you could just show us the code you're using to read in the file,
I'm sure we could get to the bottom of it - but without code, there's
nothing I can really suggest other than that using Encoding.Default
probably *will* be the solution when you've got the right code to use
it.

--
Jon Skeet - <sk***@pobox.com>
Web site: http://www.pobox.com/~skeet
Blog: http://www.msmvps.com/jon.skeet
C# in Depth: http://csharpindepth.com
Sep 13 '08 #9
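
One way to answer Jon's "where are you getting that value from?" is to print the numeric value of every character in the string after it has been read. This is a minimal sketch with a hypothetical helper named Dump; it reports UTF-16 code units, which is what a .NET string holds. A wrongly decoded dash would show up as U+0096 (the control character Jon describes), a correctly decoded Windows-1252 dash as U+2013.

```csharp
using System;
using System.Text;

public static class CodePointDump
{
    // Formats each UTF-16 code unit of the string as "U+XXXX",
    // separated by spaces.
    public static string Dump(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0)
            {
                sb.Append(' ');
            }
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // A string holding the C1 control character (value 150).
        Console.WriteLine(Dump("a\u0096b")); // U+0061 U+0096 U+0062
        // A string holding a real en dash.
        Console.WriteLine(Dump("a\u2013b")); // U+0061 U+2013 U+0062
    }
}
```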

You were correct Jon, I thought the two following lines of code were the
same:

using (StreamReader sr = new StreamReader(fileName))

using (StreamReader sr = new StreamReader(fileName, Encoding.Default))

But they aren't. The second one is working now. I had tried all of the
Encoding choices except the Default one, thinking that it would produce the
same results as omitting the encoding. Thanks for all your help.

--Paul


Sep 13 '08 #10
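
The difference between Paul's two lines can be reproduced with a one-byte file. The sketch below is illustrative only: the method names are invented, and it passes code page 1252 explicitly rather than Encoding.Default, because Encoding.Default is the system ANSI code page on .NET Framework (Paul's setting) but UTF-8 on .NET Core / .NET 5+. That the file is Windows-1252 is an assumption here.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReaderComparison
{
    // Same as Paul's first line: no encoding given, so StreamReader
    // assumes UTF-8.
    public static string ReadWithoutEncoding(string fileName)
    {
        using (StreamReader sr = new StreamReader(fileName))
        {
            return sr.ReadToEnd();
        }
    }

    // Same shape as Paul's second line, with the encoding passed in.
    public static string ReadWithEncoding(string fileName, Encoding encoding)
    {
        using (StreamReader sr = new StreamReader(fileName, encoding))
        {
            return sr.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Assumption: modern .NET, where code page 1252 needs registering.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string fileName = Path.GetTempFileName();
        File.WriteAllBytes(fileName, new byte[] { 150 });

        string asUtf8 = ReadWithoutEncoding(fileName);
        string as1252 = ReadWithEncoding(fileName, Encoding.GetEncoding(1252));

        // A lone byte 150 is not valid UTF-8, so it becomes the
        // replacement character, which many fonts draw as a box.
        Console.WriteLine(((int)asUtf8[0]).ToString("X4")); // FFFD
        // Decoded as Windows-1252 it is the en dash.
        Console.WriteLine(((int)as1252[0]).ToString("X4")); // 2013

        File.Delete(fileName);
    }
}
```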

To sum this up, as far as I know, all text reader/writer classes will use
UTF-8 unless told otherwise. If there is an overload taking an Encoding
parameter, consider using it whenever the encoding matters.

--
Happy Coding!
Morten Wennevik [C# MVP]
Sep 16 '08 #11
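
Morten's summary can be checked for the two stream classes discussed here by asking them which encoding they picked when none was supplied. This is a small sketch with invented method names; it covers only StreamReader and StreamWriter, not every text reader/writer class.

```csharp
using System;
using System.IO;
using System.Text;

public static class DefaultEncodingCheck
{
    // Name of the encoding a StreamReader starts with when none is passed.
    public static string ReaderDefault()
    {
        using (var reader = new StreamReader(new MemoryStream()))
        {
            return reader.CurrentEncoding.WebName;
        }
    }

    // Name of the encoding a StreamWriter uses when none is passed.
    public static string WriterDefault()
    {
        using (var writer = new StreamWriter(new MemoryStream()))
        {
            return writer.Encoding.WebName;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ReaderDefault()); // utf-8
        Console.WriteLine(WriterDefault()); // utf-8
    }
}
```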
