StreamReader / StreamWriter Encoding

Hi,

please help.

Sounds so simple. We receive text files (customer orders) as e-mail
attachments. These text files contain a simple structure of orders, like:
custno, itemno, qty, text

Since these text files are created on different systems, the field "text" causes
some trouble.

Characters like ä, ö, ü are not always converted correctly.

The source code looks like:

- open the text file (StreamReader, encoding = default, detect encoding = true)
- convert to a new structure
- write the new text file (StreamWriter)
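
The steps above can be sketched in C# roughly as follows. This is a minimal sketch, not the code from the original application: the method name CopyAsUtf8, the temporary files and the choice of UTF-8 for the output are assumptions made for illustration. Note that the detect-encoding flag of StreamReader only checks for a byte order mark (BOM) at the start of the file. If there is none, the reader simply uses the encoding passed in, which may explain why umlauts break when the sending system used a different code page.

```csharp
using System;
using System.IO;
using System.Text;

// Reads a text file and rewrites it as UTF-8 (hypothetical helper).
// 'fallback' is used only when the file does not start with a BOM.
static void CopyAsUtf8(string inputPath, string outputPath, Encoding fallback)
{
    // The third argument (detectEncodingFromByteOrderMarks) only inspects
    // the first bytes for a BOM; it does not analyse the file content.
    using (var reader = new StreamReader(inputPath, fallback, true))
    using (var writer = new StreamWriter(outputPath, false, new UTF8Encoding(false)))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // The "convert to a new structure" step would go here.
            writer.WriteLine(line);
        }
    }
}

// Usage: a temporary file holding "Müller" encoded as ISO-8859-1, without a BOM.
string src = Path.GetTempFileName();
string dst = Path.GetTempFileName();
File.WriteAllBytes(src, new byte[] { 0x4D, 0xFC, 0x6C, 0x6C, 0x65, 0x72 });
CopyAsUtf8(src, dst, Encoding.GetEncoding("iso-8859-1"));
Console.WriteLine(File.ReadAllText(dst, Encoding.UTF8));
```

The conversion is only correct when the fallback encoding matches what the sender actually used, so the open question is how to pick it per file.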

What would you suggest? How could we "detect" the encoding of the file in
order to convert the text-field correctly?

Thanks and regards - Jari

Nov 16 '05 #1
Hi Jaroslav,

I recommend using a byte (character) histogram to determine the
frequency of the character codes that occur (from 128 to 255).
If you compare those values with the "special" codes of the GERMAN,
POLISH (or any other) encodings, you can guess the source encoding.

A more sophisticated (e.g. dictionary-based) algorithm can be used
to eliminate errors.
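
A byte histogram of this kind could be built as in the following minimal sketch. The function name HighByteHistogram and the sample bytes are made up for illustration; only the range 128 to 255 comes from the description above.

```csharp
using System;

// Counts how often each byte value from 128 to 255 occurs.
// Index 0 of the result corresponds to byte value 128.
static int[] HighByteHistogram(byte[] data)
{
    var counts = new int[128];
    foreach (byte b in data)
    {
        if (b >= 128) counts[b - 128]++;
    }
    return counts;
}

// Hypothetical sample: the byte 0xFC (252) occurs twice.
byte[] sample = { 0x4D, 0xFC, 0x6C, 0x6C, 0x65, 0x72, 0xFC };
int[] histogram = HighByteHistogram(sample);
for (int i = 0; i < histogram.Length; i++)
{
    if (histogram[i] > 0)
        Console.WriteLine($"byte {i + 128}: {histogram[i]}");
}
// prints "byte 252: 2"
```

The counts can then be compared against the byte values that each candidate encoding uses for its special characters.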

HTH
Marcin
Nov 16 '05 #2
Hi Marcin,

do you have a link to samples or a further description? Sorry, I don't know
how to do that...

Thanks and regards - Jari

Nov 16 '05 #3
hmmm...
I don't know of any links or samples, but I'm sure that your problem
came up in this group some time ago.

I can show you the concept of this algorithm:

int germanEncodingCounter = 0;
int polishEncodingCounter = 0;
// all bytes of the text file ("orders.txt" is a placeholder name)
byte[] bytesOfText = System.IO.File.ReadAllBytes("orders.txt");

// I don't know the German character codes, so I used random numbers
for (int i = 0; i < bytesOfText.Length; i++) {
    switch (bytesOfText[i]) {
        case 170:
            germanEncodingCounter++;
            break;
        case 163:
            germanEncodingCounter++;
            polishEncodingCounter++; // Ł
            break;
        case 175:
            polishEncodingCounter++; // Ż
            break;
    }
}

if (polishEncodingCounter > 0 || germanEncodingCounter > 0) {
    if (germanEncodingCounter > polishEncodingCounter) {
        // it looks like a German encoding
    }
    else if (polishEncodingCounter > germanEncodingCounter) {
        // it looks like a Polish encoding
    }
    else {
        // a tie, I'm confused??
    }
}
else {
    // encoding not found!
}
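
For German text specifically, the same counting idea might look like the sketch below. It is only an illustration under stated assumptions, not a complete detector: the function names and the returned labels are made up here, and the candidates are limited to UTF-8, Windows-1252 and the DOS code page 850. The byte values for ä ö ü Ä Ö Ü ß are the real ones for those code pages. A strict UTF-8 check runs first, because text with umlauts in a single-byte code page is almost never valid UTF-8. When the guessed label is later turned into an Encoding, newer .NET versions need the code pages provider registered before Encoding.GetEncoding accepts 1252 or 850.

```csharp
using System;
using System.Text;

// Returns true if 'data' is well-formed UTF-8.
static bool IsValidUtf8(byte[] data)
{
    try
    {
        // throwOnInvalidBytes: true makes GetString throw on malformed input.
        new UTF8Encoding(false, true).GetString(data);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}

// Guesses among a few candidates; the labels are illustrative only.
static string GuessGermanEncoding(byte[] data)
{
    if (!Array.Exists(data, b => b >= 128)) return "ascii";
    if (IsValidUtf8(data)) return "utf-8";

    // ä ö ü Ä Ö Ü ß in Windows-1252 / ISO-8859-1
    byte[] ansi = { 0xE4, 0xF6, 0xFC, 0xC4, 0xD6, 0xDC, 0xDF };
    // ä ö ü Ä Ö Ü ß in DOS code page 850
    byte[] dos = { 0x84, 0x94, 0x81, 0x8E, 0x99, 0x9A, 0xE1 };

    int ansiCount = 0, dosCount = 0;
    foreach (byte b in data)
    {
        if (Array.IndexOf(ansi, b) >= 0) ansiCount++;
        if (Array.IndexOf(dos, b) >= 0) dosCount++;
    }

    if (ansiCount > dosCount) return "windows-1252";
    if (dosCount > ansiCount) return "ibm850";
    return "unknown"; // a tie or no known umlaut bytes
}

// "Müller" with ü stored as the single byte 0xFC
Console.WriteLine(GuessGermanEncoding(new byte[] { 0x4D, 0xFC, 0x6C, 0x6C, 0x65, 0x72 }));
// prints "windows-1252"
```

With only a handful of umlauts per file the guess can easily be wrong, which is where a dictionary-based check would help.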

HTH
Marcin
Nov 16 '05 #4
Hi Marcin,

thanks! I understand what I have to do now...

Regards - Jari

Nov 16 '05 #5
