I have a VB.Net application that parses an HTML file. This file was an
MS Word document that was saved as a web page.
My application removes all unnecessary code generated by MS Word and
does some custom formatting needed by my client.
I use a StreamReader to read in the file, regular expressions to parse
and clean it up, and a StreamWriter to write the new file. On some of
the HTML files that I parse, the character "ل" shows up in a lot of
places.
I've tried different encodings for my StreamReader and StreamWriter,
which works sometimes, but then my application doesn't work when I
parse a French HTML file.
This is what I'm doing:
Dim filename As String = myfileinfo.OriginalFileName
Dim sr As New StreamReader(filename, System.Text.Encoding.Default)
Dim textstream As String = sr.ReadToEnd()
'close the stream...don't need it anymore
sr.Close()
Dim newtext As String
'send the stream for cleaning
newtext = CleanHTML(textstream)
Dim OutputFileName As String = myfileinfo.NewFolderPathToFile
Dim fs As New FileStream(OutputFileName, FileMode.Create, FileAccess.Write)
Dim sw As New StreamWriter(fs, System.Text.Encoding.Default)
sw.WriteLine(newtext)
sw.Close()
These are the results:
<p class="MsoToc1" ><span
><a href="#_Toc169072483"><span>99 (15)0<span>ل </span>INSTALMENTPROGRAM</span></a></span></p>
<p class="MsoToc1" ><span
><a href="#_Toc169072484"><span>99 (15)1<span>ل </span>OVERVIEW OF THEINSTALMENT PROGRAM</span></a></span></p>
<p class="MsoNormal" ><span>لل </span>OP:<span>ل
</span>BB<span>لل لللللللللل </span>ACCT:<span >ل </span>123456789
YR:<span>ل </span>2004<span> للللللللللللللل للللللل </span>PG:<span>ل
</span>1 of 1<span>لللللللل لللللللللل </span>26FEB
2004 USER-ID</p>
I also have different characters representing quotes such as "ô" and "ِ"
(not shown here).
When I do a "Save as Web Page" from MS Word 2000, the characters aren't
there, but they show up after running the file through my program.
I'm assuming they represent characters that it doesn't understand, but
does anyone know how to resolve this?
Thanks
Rob