
Download HTML As Plain Text

Good day,

I was wondering how I can download a web page as plain text from a
certain web site. I have tried the OpenURL() method of the INET control
in my VB.NET app, but it returns tags such as <BR> within the plain
text. Is there a way to filter them out, or to simply download the page as
plain text?

Any help would be greatly appreciated.
Nov 20 '05 #1
8 Replies


"Doominato" <Bi******@hotmail.com> wrote in
news:bg*******************@news01.bloor.is.net.cable.rogers.com:
> I was wondering how I can download a web page as plain text from a
> certain web site. I have tried the OpenURL() method of the INET
> control in my VB.NET app, but it returns tags such as <BR>
> within the plain text. Is there a way to filter them out, or to simply
> download the page as plain text?


No. Web pages are not plain text, they are HTML. If you download a page, it
will always arrive in the format it was written in, which is HTML.

To have it as plain text you will need to convert it.
--
Chad Z. Hower (a.k.a. Kudzu) - http://www.hower.org/Kudzu/
"Programming is an art form that fights back"

Make your ASP.NET applications run faster
http://www.atozed.com/IntraWeb/
Nov 20 '05 #2

Thanks for the reply,

I realize that, but I should have said that the page is in HTML format yet
contains only plain text (btw, this is the type of page I'm talking
about:
http://www.wunderground.com/history/...html?format=1).
If you look at the page and its source, you will see that they look pretty
much the same, except that the source contains tags such as <BR>. So the
question is: how do I remove these tags and convert the page to plain text?

Thanks
Nov 20 '05 #3

It seems really stupid of Weather Underground not to provide an actual comma-delimited file instead of that junk. I'd find out whether someone in their computer department could create real CSV files (I don't know whose idea it was to do it like that).

Meanwhile, if you can get the entire document into a string, you can use Replace(wholeDoc, "<BR>", vbCrLf) and then write the result to a real CSV file.

Also, you've probably noticed that there are no line breaks separating the actual data, which makes replacing those <BR> tags with CRLF even more critical!
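
A rough sketch of the Replace idea above, written in Python rather than VB.NET purely for illustration. The function name br_to_crlf and the sample string are hypothetical, not taken from the Weather Underground page. It uses a case-insensitive regular expression so that <br>, <BR> and <br /> are all caught, which a plain Replace on "<BR>" may miss depending on the compare mode.

```python
import re

# Matches <br>, <BR>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def br_to_crlf(whole_doc: str) -> str:
    """Replace every <BR> tag in the document string with a CRLF line break."""
    return BR_TAG.sub("\r\n", whole_doc)


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input.
    sample = "Date,TempC<BR>2005-11-20,4<br />2005-11-21,6<BR>"
    print(br_to_crlf(sample))
```

The result can then be written to a file as-is. Any other tags around the data would still need to be stripped separately.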

Good Luck!
--Michael

Nov 20 '05 #4

Hello,

I got the upper hand on this and was able to clear out all the tags, so now I
have a clean CSV file.

Thank you so much for your help.

Nov 20 '05 #5

Hi Doominato

In addition to what the others said:

Every element of an HTML page has the properties InnerText and OuterHTML.

InnerText is the text between the tags; OuterHTML includes the tags.

The OuterHTML of the HTML element is almost always the complete document,
including all tags and everything else, but without the declaration line
that, strangely enough, precedes more and more HTML pages. As far as I know,
that line is unreachable through the Document Object Model.
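
To illustrate the "text between the tags" idea without MSHTML, here is a minimal sketch in Python using the standard library's html.parser. It only approximates what InnerText returns. The names InnerTextExtractor and inner_text are hypothetical, and treating <BR> as a newline is an assumption chosen to suit the CSV-style page discussed in this thread.

```python
from html.parser import HTMLParser


class InnerTextExtractor(HTMLParser):
    """Collects the text between tags and turns each <br> into a newline."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag names, so <BR> arrives here as "br".
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text between tags; entities such as &amp; are already decoded.
        # Note: the contents of <script> and <style> would be collected too.
        self.parts.append(data)


def inner_text(html: str) -> str:
    """Return the text content of an HTML string, with tags removed."""
    parser = InnerTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(inner_text("<html><body>a,b<BR>c,d<br />e</body></html>"))
```

A parser-based approach like this tolerates attributes and mixed-case tags, at the cost of being slower than a simple string replace.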

I hope this helps.

Cor
Nov 20 '05 #6

* "Doominato" <Bi******@hotmail.com> scripsit:
> I was wondering how I can download a web page as plain text from a
> certain web site. I have tried the OpenURL() method of the INET control
> in my VB.NET app, but it returns tags such as <BR> within the plain
> text. Is there a way to filter them out, or to simply download the page as
> plain text?


Nice algorithm, implemented in VB6:

<URL:http://groups.google.com/groups?selm=ebXm3efoCHA.1976%40TK2MSFTNGP10>

--
Herfried K. Wagner [MVP]
<URL:http://dotnet.mvps.org/>
Nov 20 '05 #7

Hi Herfried,

Take a look at MSHTML sometime; this algorithm is very amateurish in my opinion.

http://msdn.microsoft.com/library/de...LDHTMLAPIs.asp

Cor
Nov 20 '05 #8

* "Cor Ligthert" <no**********@planet.nl> scripsit:
> Take a look at MSHTML sometime; this algorithm is very amateurish in my opinion.


I know that it's possible with MSHTML, but Olaf's algorithm is /very/ fast
in VB6 and is often good enough. I am not sure whether it works when the
"shorttag" option and similar features are enabled.

--
Herfried K. Wagner [MVP]
<URL:http://dotnet.mvps.org/>
Nov 20 '05 #9

This discussion thread is closed
