
Stripping html

I understand that you can strip the HTML out of a text file so that
only the visible information is left (i.e. everything enclosed in < >
is gone). My question: I have a table of information that I need to
feed into a program. I need the program to read the table just as you
would read it on paper, and to be able to use that information as if
it had been entered directly. I am unsure how to strip so much away,
leaving just the information I want, and then use it the way I want.
Any help?

Jun 12 '06 #1
6 Replies


Start with a simple program that reads and saves one character at a
time looking for a '<' character. When it finds a '<', it should throw
it (and following characters) away until it finds a '>'. When the
program reaches end-of-file, hopefully it's saved what you want to
keep.
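
A minimal sketch of that first version, in C. It assumes the whole
input is already in memory as a string (rather than being read one
character at a time from a file), and the name strip_tags is made up
for illustration:

```c
#include <stddef.h>

/* Copy src to dst, discarding everything from a '<' through the next '>'.
   dst must have room for at least strlen(src) + 1 bytes.
   Returns the number of characters written, not counting the terminator. */
size_t strip_tags(const char *src, char *dst)
{
    size_t n = 0;
    int in_tag = 0;   /* nonzero while we are between '<' and '>' */

    for (; *src != '\0'; ++src) {
        if (in_tag) {
            if (*src == '>')
                in_tag = 0;        /* tag ends; the '>' itself is dropped */
        } else if (*src == '<') {
            in_tag = 1;            /* tag starts; the '<' itself is dropped */
        } else {
            dst[n++] = *src;       /* ordinary text: keep it */
        }
    }
    dst[n] = '\0';
    return n;
}
```

Given "<p>Hello, <b>world</b>!</p>" this leaves "Hello, world!". The
same state machine works with getc() and putc() if you would rather
stream a file than hold it in memory.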

You'll probably discover that you want to add refinements (perhaps to
deal with HTML encodings like &nbsp; and &lt;), but those can wait
until the initial version is working.

--
Morris Dovey
DeSoto Solar
DeSoto, Iowa USA
http://www.iedu.com/DeSoto
Jun 12 '06 #2

If the HTML is well-produced, mostly you can simply read characters one by
one. If you hit a '<' character, discard it, and keep discarding everything
until you hit a '>', which again you can discard.

If you hit a & character, though, you have some work to do. You'll need to
save up characters until you hit a semicolon.

The characters between the & and the ; form a keyword, e.g. &amp; for
ampersand, &lt; for '<', &gt; for '>', &copy; for the copyright symbol, and
so on. You will need to have some kind of lookup in your program for
matching these keywords with their replacements.
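
One possible shape for that lookup, sketched in C. The table holds only
a handful of entries, and two of the replacements are assumptions made
to keep the output plain ASCII: &nbsp; becomes an ordinary space and
&copy; becomes "(c)". The names lookup_entity and decode_entities are
invented for this example:

```c
#include <stddef.h>
#include <string.h>

struct entity { const char *name; const char *text; };

/* A deliberately tiny table; a real one would have many more entries. */
static const struct entity entities[] = {
    { "amp", "&" },  { "lt", "<" },   { "gt", ">" },
    { "quot", "\"" }, { "nbsp", " " }, { "copy", "(c)" }
};

/* Return the replacement for the keyword name[0..len-1], or NULL if unknown. */
static const char *lookup_entity(const char *name, size_t len)
{
    size_t i;

    for (i = 0; i < sizeof entities / sizeof entities[0]; ++i)
        if (strlen(entities[i].name) == len &&
            memcmp(entities[i].name, name, len) == 0)
            return entities[i].text;
    return NULL;
}

/* Copy src to dst, replacing known &keyword; sequences.
   Unknown keywords and stray '&' characters are copied unchanged.
   dst needs strlen(src) + 1 bytes: no replacement in the table is
   longer than the sequence it replaces. */
size_t decode_entities(const char *src, char *dst)
{
    size_t n = 0;

    while (*src != '\0') {
        if (*src == '&') {
            const char *semi = strchr(src + 1, ';');
            if (semi != NULL) {
                const char *rep =
                    lookup_entity(src + 1, (size_t)(semi - src - 1));
                if (rep != NULL) {
                    size_t k = strlen(rep);
                    memcpy(dst + n, rep, k);
                    n += k;
                    src = semi + 1;   /* continue after the ';' */
                    continue;
                }
            }
        }
        dst[n++] = *src++;
    }
    dst[n] = '\0';
    return n;
}
```

Run this after the tags have been stripped, not before; otherwise a
decoded &lt; would be mistaken for the start of a tag.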

If you hit a space character, preserve it, but then discard all remaining
whitespace until the next non-whitespace character.
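
The whitespace rule is short enough to sketch in C. This version takes
a small liberty, labelled here as an assumption: a run that starts with
any whitespace character (a newline or tab, not only a space) is
replaced by a single space. The function name is invented:

```c
#include <ctype.h>
#include <stddef.h>

/* Copy src to dst, replacing each run of whitespace with one space.
   dst must have room for at least strlen(src) + 1 bytes. */
size_t collapse_whitespace(const char *src, char *dst)
{
    size_t n = 0;
    int prev_space = 0;   /* nonzero if the previous character was whitespace */

    for (; *src != '\0'; ++src) {
        /* The cast keeps isspace() well-defined for negative char values. */
        if (isspace((unsigned char)*src)) {
            if (!prev_space)
                dst[n++] = ' ';   /* first of the run: keep one space */
            prev_space = 1;       /* the rest of the run is discarded */
        } else {
            dst[n++] = *src;
            prev_space = 0;
        }
    }
    dst[n] = '\0';
    return n;
}
```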

These simple rules will give you a basic translation into plain text, but
you have to be a bit cleverer if you want to split text into paragraphs and
so on, by interpreting tags such as <BR>, <P>, <TD> etc -- at which point
you won't be too far away from having your own text-only but otherwise
full-blown HTML renderer.

If the HTML is /not/ well-produced, the above may not be sufficient.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Jun 12 '06 #3

On Sun, 11 Jun 2006 21:46:03 -0500, "Morris Dovey" <mr*****@iedu.com>
wrote:
Start with a simple program that reads and saves one character at a
time looking for a '<' character. When it finds a '<', it should throw
it (and following characters) away until it finds a '>'. When the
program reaches end-of-file, hopefully it's saved what you want to
keep.

I remember starting with a simple program like that, and finding to my
dismay that between the "script" and "/script" tags the '<' and '>'
characters are used not as tag delimiters but as "less than" and
"greater than" comparison operators. I had to check for those
particular tags and discard everything between them, and not let the
presence of a lone unbalanced '<' in the script cause my logic to miss
finding the "/script" tag.
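
A rough C sketch of that extra check, not the program described above.
It assumes the input is held in a string, matches tag names without
regard to case, and treats anything beginning with "<script" as a
script tag, which is cruder than a real parser. Both function names
are made up:

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive test: does s begin with the all-lowercase prefix? */
static int starts_with_ci(const char *s, const char *prefix)
{
    for (; *prefix != '\0'; ++s, ++prefix)
        if (tolower((unsigned char)*s) != *prefix)
            return 0;
    return 1;
}

/* Copy src to dst without tags and without the contents of script
   elements. dst must have room for at least strlen(src) + 1 bytes. */
size_t strip_tags_and_scripts(const char *src, char *dst)
{
    size_t n = 0;

    while (*src != '\0') {
        if (starts_with_ci(src, "<script")) {
            /* Inside a script only "</script" counts as a delimiter,
               so a lone '<' or '>' in the code is skipped like any
               other character. */
            src += 7;
            while (*src != '\0' && !starts_with_ci(src, "</script"))
                ++src;
            /* src now points at the closing tag (or the end of input);
               the ordinary tag rule below discards the closing tag. */
        }
        if (*src == '<') {
            while (*src != '\0' && *src != '>')
                ++src;
            if (*src == '>')
                ++src;
        } else if (*src != '\0') {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

The same treatment would presumably be wanted for "style" blocks and
for comments, but those are left out here.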

Bill

Jun 12 '06 #4

Bill Latvin (in 44*****************@news.verizon.net) said:

| I remember starting with a simple program like that, and finding to
| my dismay that between the "script" and "/script" tags the '<' and
| '>' characters are used not as tag delimiters but as "less than"
| and "greater than" comparison operators. I had to check for those
| particular tags and discard everything between them, and not let
| the presence of a lone unbalanced '<' in the script cause my logic
| to miss finding the "/script" tag.

Welcome to the club. It's because of things like that that I added my
second paragraph:

"You'll probably discover that you want to add refinements (perhaps to
deal with HTML encodings like &nbsp; and &lt;), but those can wait
until the initial version is working."

The refinements will depend on whether the OP wants a general solution
or just enough to extract data from one particular page. On
re-reading, I'd guess that <table>, <tr>, and <td> tags may be his
first refinement - but the question indicated that he'll probably need
to start at the most basic level.
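
For what it's worth, here is one way that table refinement might look
in C. It is only a sketch under assumptions: the input is a string, the
table is simple (no nesting, no rowspan or colspan), cells are written
out separated by tabs, and each </tr> ends a line. The names tag_is and
table_to_tsv are invented:

```c
#include <ctype.h>
#include <stddef.h>

/* s points at '<'. Does the tag have the given lowercase name
   (for example "td" or "/tr"), followed by '>' or whitespace? */
static int tag_is(const char *s, const char *name)
{
    ++s;   /* skip the '<' */
    for (; *name != '\0'; ++s, ++name)
        if (tolower((unsigned char)*s) != *name)
            return 0;
    return *s == '>' || isspace((unsigned char)*s);
}

/* Copy src to dst without tags, putting a tab between the cells of a
   row and a newline after each row.
   dst must have room for at least strlen(src) + 1 bytes. */
size_t table_to_tsv(const char *src, char *dst)
{
    size_t n = 0;
    int cells = 0;   /* cells seen so far in the current row */

    while (*src != '\0') {
        if (*src == '<') {
            if (tag_is(src, "td") || tag_is(src, "th")) {
                if (cells++ > 0)
                    dst[n++] = '\t';   /* separator before 2nd, 3rd... cell */
            } else if (tag_is(src, "/tr")) {
                dst[n++] = '\n';       /* end of row */
                cells = 0;
            }
            /* Whatever the tag was, discard it up to and including '>'. */
            while (*src != '\0' && *src != '>')
                ++src;
            if (*src == '>')
                ++src;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Each output line can then be read with fgets() and split on tabs, which
is about as close to "reading it as you would on paper" as a C program
gets. Whitespace between the tags in the original page is copied
through as-is, so real input will probably need trimming as well.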

--
Morris Dovey
DeSoto Solar
DeSoto, Iowa USA
http://www.iedu.com/DeSoto
Jun 12 '06 #5

Richard Heathfield wrote:
If the HTML is /not/ well-produced, the above may not be sufficient.

HTMLtidy (http://tidy.sourceforge.net/) is your friend in these cases.
This little program has prevented much pain and suffering!

--
Ian Collins.
Jun 12 '06 #6

lynx -dump ?

Asbjørn
Jun 12 '06 #7
