Bytes | Software Development & Data Engineering Community

StreamReader / StreamWriter Encoding

Hi,

please help.

It sounds so simple. We receive text files (customer orders) as e-mail
attachments. These text files contain a simple order structure, like:
custno, itemno, qty, text

Since these text files are created on different systems, the "text" field
causes some trouble.

Characters like ä, ö, ü are not always converted correctly.

The source code looks like:

- open the text file (StreamReader, encoding = default, detect encoding = true)
- convert to a new structure
- write the new text file (StreamWriter)

What would you suggest? How could we "detect" the encoding of the file in
order to convert the text field correctly?

Thanks and regards - Jari
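
A note on the "detect encoding = true" step: as far as the StreamReader documentation describes it, that flag only looks for a byte order mark (BOM) at the start of the file. A file without a BOM is decoded with whatever encoding was passed in, which may explain why the umlauts come out wrong for files from some systems. The sketch below reports which encoding the reader settled on; the class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class BomCheck
{
    // Returns the encoding the reader uses after it has looked for a BOM.
    // "fallback" is the encoding used when no BOM is present.
    public static Encoding EncodingSeenByReader(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, true))
        {
            reader.Read();                 // detection only happens on the first read
            return reader.CurrentEncoding; // the fallback unless a BOM was found
        }
    }
}
```

If this returns the fallback for a problem file, the file has no BOM and the encoding has to be guessed from the bytes themselves.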

Nov 16 '05 #1
Hi Jaroslav,

I recommend using a byte (character) histogram to determine the
frequency of the character codes that occur (from 128 to 255).
If you compare those values to the "special" codes of German, Polish
(or any other) encodings, you can guess the source encoding.

A more sophisticated (e.g. dictionary-based) algorithm can be used
to reduce errors.

HTH
Marcin
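
The histogram idea above could be sketched roughly as follows. This is only a minimal illustration: the class and method names are hypothetical, and it only does the counting, not the comparison against known code tables.

```csharp
using System;
using System.IO;

static class ByteHistogram
{
    // Counts how often each byte value from 128 to 255 occurs.
    // The returned array has 256 slots; slots 0..127 are left at zero.
    public static int[] CountHighBytes(byte[] data)
    {
        var counts = new int[256];
        foreach (byte b in data)
        {
            if (b >= 128)
            {
                counts[b]++;
            }
        }
        return counts;
    }
}
```

A call such as ByteHistogram.CountHighBytes(File.ReadAllBytes("orders.txt")) (the file name is a placeholder) gives the counts that would then be compared with the byte values each candidate encoding uses for its special characters.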

Nov 16 '05 #2
Hi Marcin,

do you have a link to samples or a further description? Sorry, I don't
know how to do that...

Thanks and regards - Jari

Nov 16 '05 #3
Hmm...
I don't know of any links or samples, but I'm sure your problem
came up in this group some time ago.

I can show you the concept of this algorithm:

int germanEncodingCounter = 0;
int polishEncodingCounter = 0;
// all bytes of the text file; "path" is a placeholder for the file name
byte[] bytesOfText = System.IO.File.ReadAllBytes(path);

// I don't know the German char codes, so I used random numbers
for (int i = 0; i < bytesOfText.Length; i++) {
    switch (bytesOfText[i]) {
        case 170:
            germanEncodingCounter++;
            break;
        case 163:
            germanEncodingCounter++;
            polishEncodingCounter++; // Ł
            break;
        case 175:
            polishEncodingCounter++; // Ż
            break;
    }
}

if (polishEncodingCounter > 0 || germanEncodingCounter > 0) {
    if (germanEncodingCounter > polishEncodingCounter) {
        // it looks like a German encoding
    }
    else if (polishEncodingCounter > germanEncodingCounter) {
        // it looks like a Polish encoding
    }
    else {
        // equal counts: I'm confused??
    }
}
else {
    // no matching bytes: encoding not found!
}

HTH
Marcin
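
Once an encoding has been guessed, the read / convert / write steps from the original question might look like the sketch below. The names are hypothetical, writing the output as UTF-8 with a BOM is an assumption (any fixed target encoding would do), and the "convert to a new structure" step is left as a pass-through.

```csharp
using System;
using System.IO;
using System.Text;

static class OrderFileConverter
{
    // Reads inputPath using the guessed source encoding and writes
    // outputPath as UTF-8 with a BOM, so later readers can detect it.
    public static void ConvertToUtf8(string inputPath, string outputPath, Encoding sourceEncoding)
    {
        using (var reader = new StreamReader(inputPath, sourceEncoding, true))
        using (var writer = new StreamWriter(outputPath, false, new UTF8Encoding(true)))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // the restructuring of custno, itemno, qty, text would go here
                writer.WriteLine(line);
            }
        }
    }
}
```

On .NET Core and later, code pages such as windows-1250 or windows-1252 are, to my knowledge, only available through Encoding.GetEncoding after the code pages provider has been registered; "iso-8859-1" is built in.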

Nov 16 '05 #4
Hi Marcin,

thanks! I understand what I need to do...

Regards - Jari

Nov 16 '05 #5
