UTF8/UTF7/ASCII problem while reading from text file

Lenard Gunda

hi!

I have the following problem. I need to read data from a TXT file our
company receives.
I would use StreamReader, and process it line by line using ReadLine,
however, the following problem occurs.

The file contains characters with ASCII codes above 128. But the file is
still text (nothing like UTF7/8 or the like). It also might contain + signs.
As a result:

UTF8 encoding doesn't read characters above 128
UTF7 encoding reads everything ok, except that it eats the + signs and some
characters after them
ASCII encoding reads the + sign ok, however, characters above 128
disappear.

Because the file arrives in this form, I have no control over what it
looks like. The best idea so far was to write my own ReadLine method that
reads the file byte by byte and converts using UTF7, while taking
special care to feed the + character (ASCII code 43) to an ASCII encoder.
This way I could build a string from a line that contains exactly what's in
the file.

Is there a nicer way, or only this do-it-yourself approach?

thanx

-Lenard

Nov 16 '05 #1

Jon Skeet [C# MVP]

Lenard Gunda <fr****@fbi.hu> wrote:

> I have the following problem. I need to read data from a TXT file our
> company receives.
> I would use StreamReader, and process it line by line using ReadLine,
> however, the following problem occurs.
>
> The file contains characters with ASCII codes above 128.

No it doesn't, because there are no such things. ASCII is a 7-bit
encoding.

> But the file is still text (nothing like UTF7/8 or the like).

UTF-7 and UTF-8 are text encodings - a file containing text in UTF-8
encoding is still a text file.

> It also might contain + signs. As a result:
>
> UTF8 encoding doesn't read characters above 128
> UTF7 encoding reads everything ok, except that it eats the + signs and some
> characters after them
> ASCII encoding reads the + sign ok, however, characters above 128
> disappear.
>
> Because the file arrives in this form, I have no control over what it
> looks like. The best idea so far was to write my own ReadLine method that
> reads the file byte by byte and converts using UTF7, while taking
> special care to feed the + character (ASCII code 43) to an ASCII encoder.
> This way I could build a string from a line that contains exactly what's in
> the file.
>
> Is there a nicer way, or only this do-it-yourself approach?

It sounds like you really need to know what encoding your file is
*really* in. Have you tried Encoding.Default?

See http://www.pobox.com/~skeet/csharp/unicode.html for more
information.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Nov 16 '05 #2
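
Jon's suggestion to try Encoding.Default could look roughly like the sketch below. It is not code from the thread: the class name DefaultEncodingReader and the method ReadAllLines are made up for illustration. One assumption matters a lot here: Encoding.Default is the system ANSI code page on .NET Framework (the runtime this thread was written for), while on .NET Core and .NET 5+ it is always UTF-8, so on those runtimes this would not decode a code-page file correctly.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class DefaultEncodingReader
{
    // Reads the file line by line, decoding its bytes with Encoding.Default.
    public static List<string> ReadAllLines(string path)
    {
        var lines = new List<string>();

        // The third argument (false) turns off byte-order-mark detection,
        // so the reader always uses the encoding passed in.
        using (var reader = new StreamReader(path, Encoding.Default, false))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }

        return lines;
    }
}
```

Unlike UTF-7, this leaves a + sign alone, because + is not an escape character in a single-byte code page (or in UTF-8).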

Lenard Gunda

> UTF-7 and UTF-8 are text encodings - a file containing text in UTF-8
> encoding is still a text file.

Yup, I know that. I meant a plain text file, one that can be read
without conversion.

> It sounds like you really need to know what encoding your file is
> *really* in. Have you tried Encoding.Default?

Well, it contains ASCII characters and extended-ASCII characters that are (in
this case) Finnish language characters. There's probably a code-page number
that would describe it. But ... Encoding.Default solved my problem, or so it
would seem. Thanks very much!

-Lenard

Nov 16 '05 #3

Jon Skeet [C# MVP]

Lenard Gunda <fr****@fbi.hu> wrote:

>> UTF-7 and UTF-8 are text encodings - a file containing text in UTF-8
>> encoding is still a text file.
>
> Yup, I know that. I meant a plain text file, one that can be read
> without conversion.

There's *always* conversion involved. The file is binary data, and you
want text data. There's a conversion involved, even if it's ASCII.

>> It sounds like you really need to know what encoding your file is
>> *really* in. Have you tried Encoding.Default?
>
> Well, it contains ASCII characters and extended-ASCII characters that are (in
> this case) Finnish language characters.

"Extended-ascii" isn't a well-defined character set (there are many
character sets which are extensions of ASCII) and anything above 127 is
*not* ASCII.

> There's probably a code-page number
> that would describe it. But ... Encoding.Default solved my problem, or so it
> would seem. Thanks very much!

To find out the code page of Encoding.Default, just look at
Encoding.Default.CodePage.

I'm glad it's working for you though.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Nov 16 '05 #4
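
Jon's point that there is always a conversion can be illustrated by decoding the same bytes with different encodings. This is only a sketch: the class EncodingComparison, its method names and the sample bytes are invented for illustration, and ISO-8859-1 (code page 28591) is an assumed stand-in for whatever code page the real file uses.

```csharp
using System;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class EncodingComparison
{
    // The characters "ä+b" as stored by a single-byte code page such as
    // ISO-8859-1: 0xE4 is 'ä', 0x2B is '+', 0x62 is 'b'.
    public static readonly byte[] Sample = { 0xE4, 0x2B, 0x62 };

    // Decodes the raw bytes with the given encoding.
    public static string Decode(byte[] raw, Encoding encoding)
    {
        return encoding.GetString(raw);
    }

    // ASCII is 7-bit, so byte 0xE4 has no meaning and is replaced by '?'.
    public static string AsAscii(byte[] raw)
    {
        return Decode(raw, Encoding.ASCII);
    }

    // In UTF-8 a lone 0xE4 is an incomplete multi-byte sequence,
    // so it does not decode to 'ä' either.
    public static string AsUtf8(byte[] raw)
    {
        return Decode(raw, Encoding.UTF8);
    }

    // ISO-8859-1 (assumed here) maps every byte to one character.
    public static string AsLatin1(byte[] raw)
    {
        return Decode(raw, Encoding.GetEncoding(28591));
    }

    // The code page number behind Encoding.Default on this machine.
    public static int DefaultCodePage()
    {
        return Encoding.Default.CodePage;
    }
}
```

Only the decoder that matches how the bytes were written gives back the original text. DefaultCodePage() reports what Encoding.Default resolves to on the current machine, which is the number Jon suggests looking up.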

Mike Schilling

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.com...

> Lenard Gunda <fr****@fbi.hu> wrote:
>>> UTF-7 and UTF-8 are text encodings - a file containing text in UTF-8
>>> encoding is still a text file.
>>
>> Yup, I know that. I meant a plain text file, one that can be read
>> without conversion.
>
> There's *always* conversion involved. The file is binary data, and you
> want text data. There's a conversion involved, even if it's ASCII.
>
>>> It sounds like you really need to know what encoding your file is
>>> *really* in. Have you tried Encoding.Default?
>>
>> Well, it contains ASCII characters and extended-ASCII characters that are
>> (in this case) Finnish language characters.
>
> "Extended-ascii" isn't a well-defined character set (there are many
> character sets which are extensions of ASCII) and anything above 127 is
> *not* ASCII.
>
>> There's probably a code-page number
>> that would describe it. But ... Encoding.Default solved my problem, or so
>> it would seem. Thanks very much!
>
> To find out the code page of Encoding.Default, just look at
> Encoding.Default.CodePage.

And note that it's working because your default encoding is the one with the
Finnish characters. If you need it to work on machines where this is not
the case, take Jon's advice: look at Encoding.Default.CodePage, and add
code that explicitly uses that encoding to read the file.

Nov 16 '05 #5
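
Mike's advice to name the encoding explicitly might be sketched as follows. The names CodePageReader and ReadAllLines are hypothetical, and the code page number is a parameter because the thread never states which one the file uses. On .NET Core and later, code pages other than a few built-in ones (such as 28591, ISO-8859-1) are only available after registering CodePagesEncodingProvider from the System.Text.Encoding.CodePages package.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class CodePageReader
{
    // Reads the file line by line using an explicitly chosen code page,
    // so the result no longer depends on the machine's default settings.
    public static List<string> ReadAllLines(string path, int codePage)
    {
        // Throws if the code page is unknown to the runtime.
        Encoding encoding = Encoding.GetEncoding(codePage);
        var lines = new List<string>();

        // 'false' disables byte-order-mark detection.
        using (var reader = new StreamReader(path, encoding, false))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }

        return lines;
    }
}
```

A call such as CodePageReader.ReadAllLines(path, 28591) would then behave the same on every machine. The value to pass is whatever Encoding.Default.CodePage reports on the machine where the file is known to read correctly.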

Lenard Gunda

Hi,

> And note that it's working because your default encoding is the one with the
> Finnish characters. If you need it to work on machines where this is not
> the case, take Jon's advice: look at Encoding.Default.CodePage, and add
> code that explicitly uses that encoding to read the file.

Yup, I finally managed to understand how these Encoders work, and found
out how to create one for a particular code page. Could be useful in the future,
but because this product is supposed to run on our server, it will have the
correct settings. But good advice, anyway.

Thanks for the help.

-Lenard

Nov 16 '05 #6
