Junk characters when using StreamReader and StreamWriter

Rob
Hi,
I have a VB.Net application that parses an HTML file. This file was an
MS Word document that was saved as a web page.

My application removes all unnecessary code generated by MS Word and
does some custom formatting needed by my client.

I use a StreamReader to read in the file, regular expressions to parse
and clean it up, and a StreamWriter to write the new file. On
some HTML files that I parse, I get this character "ل" showing up in a
lot of places.
I've tried different encodings for my StreamReader and
StreamWriter, which sometimes works, but then my application doesn't work
when I parse a French HTML file.

This is what I'm doing:

Dim filename As String = myfileinfo.OriginalFileName
Dim sr As New StreamReader(filename, System.Text.Encoding.Default)

Dim textstream As String = sr.ReadToEnd()
'close the stream...don't need it anymore
sr.Close()

Dim newtext As String

'send the stream for cleaning
newtext = CleanHTML(textstream)

Dim OutputFileName As String = myfileinfo.NewFolderPathToFile

Dim fs As New FileStream(OutputFileName, FileMode.Create, FileAccess.Write)

Dim sw As New StreamWriter(fs, System.Text.Encoding.Default)

sw.WriteLine(newtext)
sw.Close()

These are the results:

<p class="MsoToc1" ><span
><a href="#_Toc169072483"><span>99(15)0<span>ل </span>INSTALMENT
PROGRAM</span></a></span></p>

<p class="MsoToc1" ><span
><a href="#_Toc169072484"><span>99(15)1<span>ل </span>OVERVIEW OF THE
INSTALMENT PROGRAM</span></a></span></p>

<p class="MsoNormal" ><span>لل </span>OP:<span>ل
</span>BB<span>لللللللللللل </span>ACCT:<span>ل </span>123456789
YR:<span>ل </span>2004<span>لللللللللللللللللللللل </span>PG:<span>ل
</span>1 of 1<span>لللللللللللللللللل </span>26FEB
2004 USER-ID</p>

I also have different characters representing quotes such as "ô" and "ِ"
(not shown here).

When I do a "Save as Web Page" from MS Word 2000, the characters aren't
there, but they show up after running the file through my program.

I'm assuming they represent characters that it doesn't understand, but
does anyone know how to resolve this?
Thanks
Rob

*** Sent via Developersdex http://www.developersdex.com ***
Jun 20 '07 #1
Try skipping the call to the CleanHTML function to see if it still happens.
If it does, this is likely an encoding problem (Word may have saved the file
using UTF-8 encoding). If not, it is something in your CleanHTML code...

--
Patrice

Jun 20 '07 #2

ل is just a single-byte ASCII character and Word just puts it into
the html file without encoding.

Why would you want to use System.Text.Encoding.Default?

Then you do not know what you are getting. You might get something
to work - until you run the application somewhere else in the world.

If you write the data out without going through the CleanHTML
function, do you then get a file which byte for byte is identical to
the original?

Regards,

Joergen Bech
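
The byte-for-byte comparison suggested in this reply could be sketched roughly as follows. This is only an illustration under stated assumptions: the `RoundTripCheck` module and the `CopiesIdentically` function are hypothetical names that do not appear in the thread, and the code relies on the standard `System.IO` and `System.Text` classes.

```vbnet
Imports System.IO
Imports System.Text

Module RoundTripCheck
    ' Hypothetical helper: copies a file through StreamReader/StreamWriter with
    ' the given encoding (no cleaning step) and reports whether the copy is
    ' byte-for-byte identical to the original.
    Function CopiesIdentically(ByVal inputPath As String, ByVal outputPath As String, ByVal enc As Encoding) As Boolean
        Dim content As String
        Using sr As New StreamReader(inputPath, enc)
            content = sr.ReadToEnd()
        End Using

        ' Write (not WriteLine) so that no extra line break is appended.
        Using sw As New StreamWriter(outputPath, False, enc)
            sw.Write(content)
        End Using

        Dim original As Byte() = File.ReadAllBytes(inputPath)
        Dim copied As Byte() = File.ReadAllBytes(outputPath)
        If original.Length <> copied.Length Then Return False
        For i As Integer = 0 To original.Length - 1
            If original(i) <> copied(i) Then Return False
        Next
        Return True
    End Function
End Module
```

A `False` result with the cleaning step skipped would point at the encoding rather than at `CleanHTML`. Note that a byte order mark added or dropped by the writer also counts as a difference.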

Jun 20 '07 #3
Joergen Bech wrote:
> ل is just a single-byte ASCII character and Word just puts it into
> the html file without encoding.
>
> Why would you want to use System.Text.Encoding.Default?

It will return the system's default Windows ANSI codepage. However, the
encoding used to encode the file must be used to decode the file too.

--
M S Herfried K. Wagner
M V P <URL:http://dotnet.mvps.org/>
V B <URL:http://dotnet.mvps.org/dotnet/faqs/>

Jun 20 '07 #4
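
To illustrate the point in post #4 that a file must be decoded with the encoding it was written in, this small sketch decodes the same byte with two different code pages. The byte value and the two code pages are picked purely for illustration and are not taken from the thread; the `DecodeDemo` module is a hypothetical name, and the code assumes the .NET Framework, where both code pages are available without extra setup.

```vbnet
Imports System
Imports System.Text

Module DecodeDemo
    Sub Main()
        ' One byte, &HA0, interpreted under two different code pages.
        Dim raw As Byte() = {&HA0}
        Dim ansi As Encoding = Encoding.GetEncoding(1252) ' Windows ANSI (Western European)
        Dim oem As Encoding = Encoding.GetEncoding(850)   ' DOS/OEM (Western European)

        ' Print the Unicode code point each decoder produces for that byte.
        Console.WriteLine(Hex(AscW(ansi.GetString(raw)))) ' A0 (non-breaking space)
        Console.WriteLine(Hex(AscW(oem.GetString(raw))))  ' E1 (the letter á)
    End Sub
End Module
```

The bytes on disk never change; only the decoder's assumption does, which is why a mismatched encoding can turn an invisible character into a visible one.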
On Wed, 20 Jun 2007 22:09:55 +0200, "Herfried K. Wagner [MVP]"
<hi***************@gmx.at> wrote:
> Joergen Bech wrote:
>> ل is just a single-byte ASCII character and Word just puts it into
>> the html file without encoding.
>>
>> Why would you want to use System.Text.Encoding.Default?
>
> It will return the system's default Windows ANSI codepage. However, the
> encoding used to encode the file must be used to decode the file too.

Yes, that is what I meant. If you grab the system code page it
could be anything - not necessarily suitable for reading a particular
html file.

Perhaps the charset information found in any Word-generated
html file might be of more use here?

/Joergen Bech

Jun 20 '07 #5
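
The idea in post #5 of using the charset information inside the HTML file could look something like the sketch below. It is a rough illustration, not a complete charset detector: `CharsetSniffer` and `DetectHtmlEncoding` are hypothetical names, the 4096-byte window is an arbitrary assumption, and files saved as UTF-16 would need separate handling.

```vbnet
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module CharsetSniffer
    ' Hypothetical helper: looks for "charset=..." near the start of an HTML
    ' file and returns the matching Encoding, or the supplied fallback when
    ' no charset is declared or the declared name is not recognised.
    Function DetectHtmlEncoding(ByVal path As String, ByVal fallback As Encoding) As Encoding
        Dim head As String
        Using fs As New FileStream(path, FileMode.Open, FileAccess.Read)
            Dim buffer(4095) As Byte
            Dim count As Integer = fs.Read(buffer, 0, buffer.Length)
            ' ASCII is enough here: charset names only use ASCII characters.
            head = Encoding.ASCII.GetString(buffer, 0, count)
        End Using

        Dim m As Match = Regex.Match(head, "charset\s*=\s*[""']?([\w-]+)", RegexOptions.IgnoreCase)
        If Not m.Success Then Return fallback

        Try
            Return Encoding.GetEncoding(m.Groups(1).Value)
        Catch ex As ArgumentException
            ' The declared name is not a known encoding.
            Return fallback
        End Try
    End Function
End Module
```

The returned encoding would then be passed to the `StreamReader` constructor in place of `System.Text.Encoding.Default`.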
Rob

Thanks for your input guys, I think I've got it.
I ran the program without calling the CleanHTML function and it
worked fine...so it must be the encoding when running this function.

The reason I used System.Text.Encoding.Default for the StreamReader and
StreamWriter is that no other encoding would work with both
English and French documents. When I used UTF8 for both, the French
documents lost their accented characters altogether, so I used Default for the
StreamReader and UTF8 for the StreamWriter, and it seems to be working
fine for both English and French documents.

I should mention that this application is only going to be used in-house
and only on English and French documents, so I think I should be
fine.

Thanks again
Jun 21 '07 #6
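
The arrangement described in post #6 (one encoding for reading, UTF-8 for writing) can be sketched with the source encoding passed in explicitly instead of relying on the machine's default. This is a minimal illustration only: `Transcode` and `ConvertToUtf8` are hypothetical names, and which source encoding is correct for a given file is an assumption the caller has to supply.

```vbnet
Imports System.IO
Imports System.Text

Module Transcode
    ' Hypothetical helper: reads a file with an explicitly named source
    ' encoding and writes it back out as UTF-8.
    Sub ConvertToUtf8(ByVal inputPath As String, ByVal outputPath As String, ByVal sourceEncoding As Encoding)
        Dim content As String
        Using sr As New StreamReader(inputPath, sourceEncoding)
            content = sr.ReadToEnd()
        End Using

        ' The CleanHTML step from the original post would run on content here.

        ' Write (not WriteLine) so that no extra line break is appended.
        Using sw As New StreamWriter(outputPath, False, Encoding.UTF8)
            sw.Write(content)
        End Using
    End Sub
End Module
```

Naming the source encoding, for example `Encoding.GetEncoding(1252)`, keeps the result the same on every machine, whereas `Encoding.Default` depends on the system's regional settings. If the HTML itself declares a charset, that declaration would also need to agree with the UTF-8 output.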
