Stripping HTML

I am trying to read in an HTML file and strip out the HTML
code so that all I have left is the text of the body.

Does anyone have any suggestions for doing this?
Any HTML stripping routines or objects that perform the
function?

Nov 20 '05 #1
Public Function StripHTML(ByVal HTML As String) As String
    Dim strContent As String
    Dim mStartPos As Integer, mEndPos As Integer

    ' Turn paragraph ends into line breaks before the tags are removed
    strContent = HTML.Replace("</P>", vbCrLf)
    strContent = strContent.Replace("</p>", vbCrLf)

    ' Remove every <...> span; look for ">" only after the current "<"
    mStartPos = InStr(strContent, "<")
    Do While mStartPos <> 0
        mEndPos = InStr(mStartPos, strContent, ">")
        If mEndPos = 0 Then Exit Do
        strContent = strContent.Remove(mStartPos - 1, mEndPos - mStartPos + 1)
        mStartPos = InStr(strContent, "<")
    Loop

    ' Decode a few common entities. &amp; goes last so that "&amp;lt;"
    ' ends up as the text "&lt;" instead of being decoded twice to "<".
    strContent = strContent.Replace("&nbsp;", " ")
    strContent = strContent.Replace("&quot;", """")
    strContent = strContent.Replace("&#", "#") ' numeric entities are not decoded
    strContent = strContent.Replace("&lt;", "<")
    strContent = strContent.Replace("&gt;", ">")
    strContent = strContent.Replace("&amp;", "&")
    strContent = strContent.Replace("%20", " ")

    ' Drop surrounding spaces, then any leading CR/LF characters
    strContent = Trim(strContent)
    Do While Left(strContent, 1) = Chr(13) Or Left(strContent, 1) = Chr(10)
        strContent = Mid(strContent, 2)
    Loop

    ' Line breaks come back as <br>; return strContent as-is for plain text
    Return strContent.Replace(vbCrLf, "<br>")
End Function

"David Sawyer" <goldwings@_DELETE_msn.com> wrote in message
news:0e****************************@phx.gbl...
I am trying to read in an HTML file and strip out the HTML
code so that all I have left is the text of the body.

Does anyone have any suggestions for doing this?
Any HTML stripping routines or objects that perform the
function?

Nov 20 '05 #2
Cor
David,
Try using mshtml.HTMLDocument.
It gives you the document as a DOM.
If you are used to JavaScript, it is then easy to get the values and
process them in VB.NET
(tagname.innerText).
Hope this helps a little,
Cor
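
A minimal sketch of the DOM approach suggested above, assuming a Windows machine and a project reference to Microsoft.mshtml. The helper name ExtractBodyText is hypothetical and not from the thread. It loads the markup into an mshtml document and reads the body's innerText, so the parser does the tag handling:

```vbnet
Imports System

Module MshtmlSketch
    ' Hypothetical helper (not from the thread). Requires a COM reference
    ' to Microsoft.mshtml, so it only works on Windows.
    Function ExtractBodyText(ByVal html As String) As String
        ' Fully qualified names are used instead of "Imports mshtml"
        Dim doc As New mshtml.HTMLDocument()
        Dim doc2 As mshtml.IHTMLDocument2 = CType(doc, mshtml.IHTMLDocument2)

        ' Feed the markup to the parser, then close the document stream
        doc2.write(html)
        doc2.close()

        ' body can be missing if the markup could not be parsed
        If doc2.body Is Nothing Then Return String.Empty

        Dim text As String = doc2.body.innerText
        If text Is Nothing Then Return String.Empty
        Return text
    End Function

    Sub Main()
        Dim sample As String = "<html><body><p>Hello <b>world</b></p></body></html>"
        Console.WriteLine(ExtractBodyText(sample))
    End Sub
End Module
```

Reading innerText leaves entity decoding and malformed tags to the parser, at the cost of a COM dependency.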
Nov 20 '05 #3
Thanks Charles. I will put this in my program and try it
out. I appreciate your help.
Nov 20 '05 #4
Cor
David,
You have to set a reference to Microsoft.mshtml.
You do that in the IDE via Project, Add Reference, Microsoft.mshtml (you can
also do it by right-clicking References in the Solution Explorer).
It is better not to add an Imports statement for it once the reference is set,
because the IDE then becomes terribly slow.
Use the fully qualified name instead, for example: Dim document As mshtml.HTMLDocument
Good luck,
Cor
Nov 20 '05 #5
Charles,
I tried your function, calling it as
stripHTML(tDataToGet)
I get an "Invalid qualifier" error on this line
strContent = HTML.Replace("</P>", vbCrLf)
and it highlights HTML.

Also, on the line
Return strContent.Replace(vbCrLf, "<br>")
I get the error
Expected: end of statement

Any thoughts or ideas?
Thanks
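
Both messages are typical of the classic VB6/VBA compiler rather than VB.NET: there a String has no methods, so HTML.Replace(...) is an invalid qualifier, and Return cannot take a value. Assuming that is the cause, a minimal sketch of the tag-stripping part using only runtime functions might look like the following. The name StripHTMLClassic is hypothetical. The block is written as a VB.NET module so it can be run as-is. For VB6 or VBA, paste only the function into a standard module.

```vbnet
Option Strict Off ' VB.NET only: allows the implicit Long-to-Integer conversions below
Imports System
Imports Microsoft.VisualBasic

Module ClassicStripSketch
    ' Hypothetical variant (not from the thread) that avoids String
    ' methods and "Return value".
    Public Function StripHTMLClassic(ByVal HTML As String) As String
        Dim strContent As String
        Dim mStartPos As Long, mEndPos As Long

        ' Replace() function instead of the HTML.Replace(...) method
        strContent = Replace(HTML, "</P>", vbCrLf)
        strContent = Replace(strContent, "</p>", vbCrLf)

        ' Cut out each <...> span, looking for ">" only after the "<"
        mStartPos = InStr(strContent, "<")
        Do While mStartPos > 0
            mEndPos = InStr(mStartPos, strContent, ">")
            If mEndPos = 0 Then Exit Do
            strContent = Left(strContent, mStartPos - 1) & Mid(strContent, mEndPos + 1)
            mStartPos = InStr(strContent, "<")
        Loop

        ' Assign to the function name instead of "Return value"
        StripHTMLClassic = strContent
    End Function

    Sub Main()
        Console.WriteLine(StripHTMLClassic("<p>Hello <b>world</b></p>"))
    End Sub
End Module
```

The entity replacements from the original function already use the Replace() function and can be added before the final assignment unchanged.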
Nov 20 '05 #6
