UTF8/UTF7/ASCII problem while reading from text file

Lenard Gunda
#1: Nov 16 '05
hi!

I have the following problem. I need to read data from a TXT file our
company receives.
I would use StreamReader, and process it line by line using ReadLine,
however, the following problem occurs.

The file contains characters with ASCII codes above 128. But the file is
still text (nothing like UTF7/8 or the like). It also might contain + signs.
As a result:

UTF8 encoding doesn't read characters above 128.
UTF7 encoding reads everything OK, except that it eats the + signs and some
characters after them.
ASCII encoding reads the + sign OK, but characters above 128
disappear.

Because the file arrives in this form, I have no control over how it
looks. The best idea so far was to write my own ReadLine method that
reads the file byte by byte and converts using UTF7, while taking
special care to feed the + character (ASCII code 43) to an ASCII encoder.
This way I could build a string for each line that contains exactly what's in
the file.

But is there a nicer way, or only this do-it-yourself approach?

thanx

-Lenard



Jon Skeet [C# MVP]
#2: Nov 16 '05

re: UTF8/UTF7/ASCII problem while reading from text file


Lenard Gunda <frenzy@fbi.hu> wrote:
> I have the following problem. I need to read data from a TXT file our
> company receives.
> I would use StreamReader, and process it line by line using ReadLine,
> however, the following problem occurs.
>
> The file contains characters with ASCII codes above 128.

No it doesn't, because there are no such things. ASCII is a 7-bit
encoding.

> But the file is still text (nothing like UTF7/8 or the like).

UTF-7 and UTF-8 are text encodings - a file containing text in UTF-8
encoding is still a text file.

> It also might contain + signs. As a result:
>
> UTF8 encoding doesn't read characters above 128.
> UTF7 encoding reads everything OK, except that it eats the + signs and some
> characters after them.
> ASCII encoding reads the + sign OK, but characters above 128
> disappear.
>
> Because the file arrives in this form, I have no control over how it
> looks. The best idea so far was to write my own ReadLine method that
> reads the file byte by byte and converts using UTF7, while taking
> special care to feed the + character (ASCII code 43) to an ASCII encoder.
> This way I could build a string for each line that contains exactly what's in
> the file.
>
> But is there a nicer way, or only this do-it-yourself approach?

It sounds like you really need to know what encoding your file is
*really* in. Have you tried Encoding.Default?

See http://www.pobox.com/~skeet/csharp/unicode.html for more
information.
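
For illustration, here is a minimal sketch of why the choice of encoding matters. It decodes the same three bytes with ASCII, UTF-8 and ISO-8859-1. The byte values and the `DecodeDemo` class are assumptions made up for this example (the thread never shows the file's real contents), and ISO-8859-1 only stands in for whatever code page the file actually uses. UTF-7 is left out because `Encoding.UTF7` is obsolete in current .NET; it swallows the + sign because + starts a base64-encoded run in UTF-7.

```csharp
using System;
using System.Text;

// Hypothetical helper for this example; not part of the original thread.
public static class DecodeDemo
{
    // Decodes the same bytes with whichever encoding is passed in.
    public static string Decode(byte[] data, Encoding encoding)
    {
        return encoding.GetString(data);
    }

    public static void Main()
    {
        // Assumed sample: "ä+1" stored as ISO-8859-1.
        // 0xE4 is 'ä', 0x2B is '+', 0x31 is '1'.
        byte[] data = { 0xE4, 0x2B, 0x31 };

        // ASCII is 7-bit, so 0xE4 cannot be decoded and is replaced by '?'.
        Console.WriteLine(Decode(data, Encoding.ASCII));

        // 0xE4 on its own is not valid UTF-8; current .NET substitutes U+FFFD.
        Console.WriteLine(Decode(data, Encoding.UTF8));

        // A single-byte code page that defines 0xE4 yields the intended text.
        Console.WriteLine(Decode(data, Encoding.GetEncoding("iso-8859-1")));
    }
}
```

Only the last call gets back the original text, which is why knowing the file's real encoding matters more than trying encodings until one looks right.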

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Lenard Gunda
#3: Nov 16 '05

re: UTF8/UTF7/ASCII problem while reading from text file


> UTF-7 and UTF-8 are text encodings - a file containing text in UTF-8
> encoding is still a text file.

Yup, I know that. I meant a plain text file, one that can be read
without conversion.

> It sounds like you really need to know what encoding your file is
> *really* in. Have you tried Encoding.Default?

Well, it contains ASCII characters, extended-ascii characters that are (in
this case) Finnish language characters. There's probably a code-page number
that would describe it. But ... Encoding.Default solved my problem, so it
would seem. Thanks very much!

-Lenard


Jon Skeet [C# MVP]
#4: Nov 16 '05

re: UTF8/UTF7/ASCII problem while reading from text file


Lenard Gunda <frenzy@fbi.hu> wrote:
> > UTF-7 and UTF-8 are text encodings - a file containing text in UTF-8
> > encoding is still a text file.
>
> Yup, I know that. I meant a plain text file, one that can be read
> without conversion.

There's *always* conversion involved. The file is binary data, and you
want text data. There's a conversion involved, even if it's ASCII.

> > It sounds like you really need to know what encoding your file is
> > *really* in. Have you tried Encoding.Default?
>
> Well, it contains ASCII characters, extended-ascii characters that are (in
> this case) Finnish language characters.

"Extended-ascii" isn't a well-defined character set (there are many
character sets which are extensions of ASCII) and anything above 127 is
*not* ASCII.

> There's probably a code-page number
> that would describe it. But ... Encoding.Default solved my problem, so it
> would seem. Thanks very much!

To find out the code page of Encoding.Default, just look at
Encoding.Default.CodePage.
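
A minimal sketch of that check, for illustration. The `DefaultEncodingInfo` class and its `Describe` helper are names invented for this example. Note one assumption about the runtime: on .NET Framework under Windows, `Encoding.Default` is the system's ANSI code page, whereas on .NET Core and later it is always UTF-8, so the printed value depends on where this runs.

```csharp
using System;
using System.Text;

// Hypothetical helper for this example; not part of the original thread.
public static class DefaultEncodingInfo
{
    // Formats an encoding's registered name together with its code page number.
    public static string Describe(Encoding encoding)
    {
        return encoding.WebName + " (code page " + encoding.CodePage + ")";
    }

    public static void Main()
    {
        // Prints whatever this machine and runtime treat as the default encoding.
        Console.WriteLine(Describe(Encoding.Default));
    }
}
```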

I'm glad it's working for you though.

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Mike Schilling
#5: Nov 16 '05

re: UTF8/UTF7/ASCII problem while reading from text file



"Jon Skeet [C# MVP]" <skeet@pobox.com> wrote in message
news:MPG.1b7d8a4f902d1cd298b105@msnews.microsoft.com...
> Lenard Gunda <frenzy@fbi.hu> wrote:
> > > UTF-7 and UTF-8 are text encodings - a file containing text in UTF-8
> > > encoding is still a text file.
> >
> > Yup I know that. I wanted to mean plain text file, one that can be read
> > without conversion.
>
> There's *always* conversion involved. The file is binary data, and you
> want text data. There's a conversion involved, even if it's ASCII.
>
> > > It sounds like you really need to know what encoding your file is
> > > *really* in. Have you tried Encoding.Default?
> >
> > Well, it contains ASCII characters, extended-ascii characters that are
> > (in this case) Finnish language characters.
>
> "Extended-ascii" isn't a well-defined character set (there are many
> character sets which are extensions of ASCII) and anything above 127 is
> *not* ASCII.
>
> > There's probably a code-page number
> > that would describe it. But ... Encoding.Default solved my problem, so
> > it would seem. Thanks very much!
>
> To find out the code page of Encoding.Default, just look at
> Encoding.Default.CodePage.

And note that it's working because your default encoding is the one with the
Finnish characters. If you need it to work on machines where this is not
the case, take Jon's advice: look at Encoding.Default.CodePage, and add
code that explicitly uses that encoding to read the file.
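
As a rough sketch of what reading with an explicit encoding could look like: the `ExplicitEncodingReader` class is a name made up for this example, and ISO-8859-1 is only an assumed stand-in, since the thread never says which code page the file really uses. Substitute the code page you found, for instance via `Encoding.GetEncoding(codePageNumber)`; on .NET Core and later, Windows code pages such as 1252 also need the code pages encoding provider registered first.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper for this example; not part of the original thread.
public static class ExplicitEncodingReader
{
    // Reads every line of the file, decoding bytes with the given encoding
    // instead of relying on the machine's default.
    public static List<string> ReadLines(string path, Encoding encoding)
    {
        var lines = new List<string>();
        using (var reader = new StreamReader(path, encoding))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }

    public static void Main(string[] args)
    {
        // Assumption: ISO-8859-1 covers the Finnish characters in the file.
        Encoding encoding = Encoding.GetEncoding("iso-8859-1");

        // args[0] is the path of the text file to read.
        foreach (string line in ReadLines(args[0], encoding))
        {
            Console.WriteLine(line);
        }
    }
}
```

Passing the encoding in explicitly means the result no longer changes when the program moves to a machine with different regional settings.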


Lenard Gunda
#6: Nov 16 '05

re: UTF8/UTF7/ASCII problem while reading from text file


Hi,

> And note that it's working because your default encoding is the one with
> the Finnish characters. If you need it to work on machines where this is not
> the case, take Jon's advice: look at Encoding.Default.CodePage, and add
> code that explicitly uses that encoding to read the file.

Yup, I finally managed to understand how these Encoders work, and found out
how to create one for a particular code page. That could be useful in the future,
but because this product is supposed to run on our server, it will have the
correct settings. Good advice anyway.

Thanks for the help.

-Lenard

