Junk characters when using StreamReader and StreamWriter

Rob
Hi,
I have a VB.Net application that parses an HTML file. This file was an
MS Word document that was saved as web page.

My application removes all unnecessary code generated by MS Word and
does some custom formatting needed by my client.

I use a StreamReader to read in the file...regular expressions to parse
and clean up the file...and a StreamWriter to write the new file. On
some HTML files that I parse, I get this character "ل" showing up in a
lot of places.
I've tried different encodings for my StreamReader and StreamWriter.
That works sometimes, but then my application doesn't work
when I parse a French HTML file.

This is what I'm doing:

Dim filename As String = myfileinfo.OriginalFileName
Dim sr As New StreamReader(filename, System.Text.Encoding.Default)

Dim textstream As String = sr.ReadToEnd()
'close the stream...don't need it anymore
sr.Close()

Dim newtext As String

'send the stream for cleaning
newtext = CleanHTML(textstream)

Dim OutputFileName As String = myfileinfo.NewFolderPathToFile

Dim fs As New FileStream(OutputFileName, FileMode.Create, _
FileAccess.Write)

Dim sw As New StreamWriter(fs, System.Text.Encoding.Default)

sw.WriteLine(newtext)
sw.Close()

These are the results:

<p class="MsoToc1" ><span
><a href="#_Toc169072483"><span>99 (15)0<span>ل </span>INSTALMENT
PROGRAM</span></a></span></p>

<p class="MsoToc1" ><span
><a href="#_Toc169072484"><span>99 (15)1<span>ل </span>OVERVIEW OF THE
INSTALMENT PROGRAM</span></a></span></p>

<p class="MsoNormal" ><span>لل </span>OP:<span>ل
</span>BB<span>لل لللللللللل </span>ACCT:<span >ل </span>123456789
YR:<span>ل </span>2004<span> للللللللللللللل للللللل </span>PG:<span>ل
</span>1 of 1<span>لللللللل لللللللللل </span>26FEB
2004 USER-ID</p>

I also have different characters representing quotes such as "ô" and "ِ"
(not shown here).

When I do a "Save as Web Page" from MS Word 2000, the characters aren't
there, but they show up after running the file through my program.

I'm assuming they represent characters that it doesn't understand, but
does anyone know how to resolve this?
Thanks
Rob

*** Sent via Developersdex http://www.developersdex.com ***
Jun 20 '07 #1
Try skipping the call to the CleanHTML function and see whether it still
happens. If it does, this is likely an encoding problem (Word could perhaps
be saving the file using a UTF-8 encoding). If not, it is something in your
CleanHTML code...

--
Patrice
Jun 20 '07 #2

ل is just a single-byte extended-ASCII character, and Word puts it into
the HTML file without encoding it.

Why would you want to use System.Text.Encoding.Default?

Then you do not know what you are getting. You might get something
to work - until you run the application somewhere else in the world.

If you write the data out without going through the CleanHTML
function, do you then get a file which byte for byte is identical to
the original?
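
One way to run that check is sketched below. It reuses the reader/writer setup from the original post but skips CleanHTML, then compares the two files byte by byte. The file paths and the FilesAreIdentical helper are made up for illustration, and the sketch assumes the .NET Framework, where Encoding.Default is the system ANSI code page.

```vbnet
Imports System.IO

Module RoundTripCheck
    ' Hypothetical helper: returns True when both files contain exactly the same bytes.
    Function FilesAreIdentical(ByVal pathA As String, ByVal pathB As String) As Boolean
        Dim a As Byte() = File.ReadAllBytes(pathA)
        Dim b As Byte() = File.ReadAllBytes(pathB)
        If a.Length <> b.Length Then Return False
        For i As Integer = 0 To a.Length - 1
            If a(i) <> b(i) Then Return False
        Next
        Return True
    End Function

    Sub Main()
        ' Hypothetical paths, for illustration only.
        Dim originalPath As String = "C:\temp\input.htm"
        Dim copyPath As String = "C:\temp\roundtrip.htm"

        ' Same encoding on both sides as in the original post, but no CleanHTML call.
        Dim text As String
        Using sr As New StreamReader(originalPath, System.Text.Encoding.Default)
            text = sr.ReadToEnd()
        End Using
        Using sw As New StreamWriter(copyPath, False, System.Text.Encoding.Default)
            ' Write rather than WriteLine, so no extra line break is appended.
            sw.Write(text)
        End Using

        Console.WriteLine("Identical: " & FilesAreIdentical(originalPath, copyPath))
    End Sub
End Module
```

If this prints False even though nothing was cleaned, the reader/writer encodings are a likely suspect; if it prints True, the cleaning step is the place to look. Note that the original code uses WriteLine, which by itself adds a trailing line break to the output.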

Regards,

Joergen Bech

Jun 20 '07 #3
"Joergen Bech" wrote:
>ل is just a single-byte ASCII character and Word just puts it into
>the html file without encoding.
>
>Why would you want to use System.Text.Encoding.Default?

It will return the system's default Windows ANSI codepage. However, the
encoding used to encode the file must be used to decode the file too.

--
Herfried K. Wagner (MS MVP VB)
<URL:http://dotnet.mvps.org/>
<URL:http://dotnet.mvps.org/dotnet/faqs/>

Jun 20 '07 #4
On Wed, 20 Jun 2007 22:09:55 +0200, "Herfried K. Wagner [MVP]"
<hi***************@gmx.at> wrote:
>"Joergen Bech" wrote:
>>ل is just a single-byte ASCII character and Word just puts it into
>>the html file without encoding.
>>
>>Why would you want to use System.Text.Encoding.Default?
>
>It will return the system's default Windows ANSI codepage. However, the
>encoding used to encode the file must be used to decode the file too.

Yes, that is what I meant. If you grab the system code page it
could be anything - not necessarily suitable for reading a particular
html file.

Perhaps the charset information found in any Word-generated
html file might be of more use here?
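
A rough sketch of that idea: look for a charset declaration in the file and hand the matching Encoding to the StreamReader, falling back to a default when nothing usable is found. DetectHtmlEncoding, the regular expression and the path are hypothetical, and the sketch assumes the .NET Framework, where code-page names such as windows-1252 resolve without extra setup.

```vbnet
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module CharsetSniffer
    ' Hypothetical helper: returns the Encoding named by a charset=... declaration
    ' in the file, or the supplied fallback when none is found or the name is unknown.
    Function DetectHtmlEncoding(ByVal htmlPath As String, ByVal fallback As Encoding) As Encoding
        ' Code page 28591 (Latin-1) maps every byte to exactly one character,
        ' so the ASCII-only markup stays readable whatever the real encoding is.
        Dim raw As String = File.ReadAllText(htmlPath, Encoding.GetEncoding(28591))
        Dim m As Match = Regex.Match(raw, "charset\s*=\s*[""']?([\w-]+)", RegexOptions.IgnoreCase)
        If Not m.Success Then Return fallback
        Try
            Return Encoding.GetEncoding(m.Groups(1).Value)
        Catch ex As ArgumentException
            ' The declared name is not one the runtime knows.
            Return fallback
        End Try
    End Function

    Sub Main()
        Dim inputPath As String = "C:\temp\input.htm" ' hypothetical path
        Dim enc As Encoding = DetectHtmlEncoding(inputPath, Encoding.Default)
        Using sr As New StreamReader(inputPath, enc)
            Dim text As String = sr.ReadToEnd()
            Console.WriteLine("Read {0} characters as {1}", text.Length, enc.WebName)
        End Using
    End Sub
End Module
```

This only trusts what the file says about itself, so a file with a wrong or missing declaration still needs a sensible fallback.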

/Joergen Bech

Jun 20 '07 #5
Rob

Thanks for your input guys, I think I've got it.
I ran the program without calling the CleanHTML function and it
worked fine...so it must be the encoding when running this function.

The reason I used System.Text.Encoding.Default for the StreamReader and
StreamWriter is that no other encoding worked with both
English and French documents. When I used UTF8 for both...the French
side would remove French characters altogether, so I used Default for the
StreamReader and UTF8 for the StreamWriter, and it seems to be working
fine for both English and French documents.

I should mention that this application is only going to be used in-house
and only on English and French documents, so I think I should be
fine.
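
For anyone landing here later, a minimal sketch of the same approach with the source code page pinned, so the result does not depend on the machine's regional settings. It assumes the Word-generated files are Windows-1252, which is an assumption rather than something confirmed in this thread; the paths are hypothetical and the CleanHTML call is left as a comment.

```vbnet
Imports System.IO
Imports System.Text

Module Transcode
    Sub Main()
        ' Hypothetical paths, for illustration only.
        Dim inputPath As String = "C:\temp\input.htm"
        Dim outputPath As String = "C:\temp\output.htm"

        ' Assumption: the source files are Windows-1252 (Western European).
        ' Naming the code page explicitly avoids relying on Encoding.Default.
        Dim sourceEncoding As Encoding = Encoding.GetEncoding(1252)
        Dim text As String = File.ReadAllText(inputPath, sourceEncoding)

        ' text = CleanHTML(text)   ' the cleaning step from the original post would go here

        ' If the file declares its charset, keep the declaration in step with the
        ' new encoding. This is a plain, case-sensitive replace and only a sketch.
        text = text.Replace("charset=windows-1252", "charset=utf-8")

        ' Write UTF-8 without a byte order mark.
        File.WriteAllText(outputPath, text, New UTF8Encoding(False))
    End Sub
End Module
```

Reading with one encoding and writing with another is fine as long as the reader's encoding matches how the file was actually saved; the output can then be any encoding that can represent the characters, and UTF-8 can represent all of them.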

Thanks again
Jun 21 '07 #6
