Bytes IT Community

Tool needed: to strip some HTML

P: n/a
G'day

I have some pages written by a bot and much of the code does not
concern the visible content on the site. I'd like to strip all the
codes that do not affect or influence the visible stuff (although I'd
like to keep the nested tables, if possible). Some of this can be
stripped using Search/Replace, but some of it contains codes which
differ from page to page.

How many pages? About 750, totalling 80 megabytes of data, which I'm
hoping to reduce when I "clean" the code.

Do you know of any tool that can do this? A tool that can be set to
strip all codes except HTML 2.0 would, for example, also be useful
except I'll lose the nested tables (which is not a *gigantic*
loss...).

I tried converting everything to TXT but most HTML2TXT programs
deliver very poor results. I did find some code strippers that
attempt to maintain the tables layout (but that is even less
preferred). If the stuff is gonna be in plaintext, then there should
be an intelligent way of dealing with nested tables.

Any advice, people? What tool can you recommend? Preferably for W95x
(but Linux would be fine too as long as it is newbie-friendly),
preferably freeware (or shareware, but I don't intend buying).
Jul 20 '05 #1
16 Replies


P: n/a
On Wed, 07 Apr 2004 02:34:39 -0700, Voetleuce en f?nsievry wrote:
> Any advice, people? What tool can you recommend?


If you only knew Perl...

--

..

Jul 20 '05 #2

P: n/a
On 7 Apr 2004 02:34:39 -0700, Voetleuce en f?nsievry
<ca******@websurfer.co.za> wrote:
> G'day
>
> I have some pages written by a bot and much of the code does not
> concern the visible content on the site. I'd like to strip all the
> codes that do not affect or influence the visible stuff (although I'd
> like to keep the nested tables, if possible). Some of this can be
> stripped using Search/Replace, but some of it contains codes which
> differ from page to page.
>
> How many pages? About 750, totalling 80 megabytes of data, which I'm
> hoping to reduce when I "clean" the code.
>
> Do you know of any tool that can do this? A tool that can be set to
> strip all codes except HTML 2.0 would, for example, also be useful
> except I'll lose the nested tables (which is not a *gigantic*
> loss...).


HTML Tidy might help a lot. It can be set to 'clean' the pages; it will
then drop all presentational markup.

http://tidy.sf.net/
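
If Tidy alone does not go far enough, the same "drop what is not content" idea can be roughed out in a few lines of Python. This is not Tidy and not how Tidy works internally; it is only a sketch using the standard library's tolerant `html.parser`. The tag whitelist (`KEEP`), the attribute policy (`KEEP_ATTRS`) and the names `Stripper` and `strip_html` are assumptions made up for illustration, so adjust them to whatever the pages really need. Table tags are kept so that nested tables survive.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: tags worth keeping. Any other tag is dropped, but its text is kept.
KEEP = {"p", "br", "b", "i", "a", "pre", "ul", "ol", "li",
        "table", "tr", "td", "th", "h1", "h2", "h3"}
# Elements whose content is never visible, so the content is dropped as well.
DROP_CONTENT = {"script", "style"}
# Hypothetical attribute policy: only these attributes survive.
KEEP_ATTRS = {"a": {"href"},
              "td": {"colspan", "rowspan"},
              "th": {"colspan", "rowspan"}}


class Stripper(HTMLParser):
    """Re-emit only whitelisted tags, whitelisted attributes and visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            allowed = KEEP_ATTRS.get(tag, set())
            kept = "".join(
                f' {name}="{escape(value or "")}"'
                for name, value in attrs if name in allowed
            )
            self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and tag != "br" and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again before writing it out.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def strip_html(source: str) -> str:
    """Return `source` reduced to the whitelisted markup and its text."""
    parser = Stripper()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Comments and the doctype disappear because the parser's default handlers ignore them. Looping `strip_html` over the 750 files would be a short script on top of this.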

--
Rijk van Geijtenbeek

The Web is a procrastination apparatus:
It can absorb as much time as is required to ensure that you
won't get any real work done. - J.Nielsen
Jul 20 '05 #3

P: n/a
"Voetleuce en f?nsievry" <ca******@websurfer.co.za> wrote in
message news:f0**************************@posting.google.com
> I have some pages written by a bot and much of the code does not
> concern the visible content on the site. I'd like to strip all the
> codes that do not affect or influence the visible stuff (although I'd
> like to keep the nested tables, if possible). Some of this can be
> stripped using Search/Replace, but some of it contains codes which
> differ from page to page.


Can you give a URL for a sample? The answer will depend on where the text
to delete is located in your HTML pages...

Jul 20 '05 #4

P: n/a
Voetleuce en f?nsievry <ca******@websurfer.co.za> wrote:
> Any advice, people? What tool can you recommend? Preferably for W95x
> (but Linux would be fine too as long as it is newbie-friendly),
> preferably freeware (or shareware, but I don't intend buying).


Any example?

--
William Park, Open Geometry Consulting, <op**********@yahoo.ca>
Linux solution/training/migration, Thin-client
Jul 20 '05 #5

P: n/a
"Pierre Goiffon" <pg******@nowhere.invalid> wrote in message news:<40***********************@news.free.fr>...
> "Voetleuce en f?nsievry" <ca******@websurfer.co.za> wrote in
> message news:f0**************************@posting.google.com
> > I have some pages written by a bot and much of the code does not
> > concern the visible content on the site. I'd like to strip all the
> > codes that do not affect or influence the visible stuff (although I'd
> > like to keep the nested tables, if possible). Some of this can be
> > stripped using Search/Replace, but some of it contains codes which
> > differ from page to page.
>
> Can you give a URL for a sample? The answer will depend on where the text
> to delete is located in your HTML pages...


True. Here goes:
http://leuce.com/translate/tempfile/22265.html (100 kb)
http://leuce.com/translate/tempfile/22265.zip (30 kb)

These files are Yahoo group message files from a mailing list group we
have, but you see, Yahoo's message search feature is rubbish and we'd
like to make the archive of old messages available for new members so
that they can *search* the old messages and not ask the same questions
over and over again.

The file for download mentioned above is from a guest login, but the
files I have were saved while logged in, which means the e-mail addresses
show up (we'll remove these manually later).

Anything to reduce the fluff would be nice. We're considering
putting the messages on a web site for Google to index (which would be
*excellent*), but at their present size our bandwidth bill would kill us.

TIA.
Jul 20 '05 #6

P: n/a
Vigil <me@privacy.net> wrote in message news:<pa**************************@privacy.net>...
> On Wed, 07 Apr 2004 02:34:39 -0700, Voetleuce en f?nsievry wrote:
> > Any advice, people? What tool can you recommend?
>
> If you only knew Perl...


I have a Perl interpreter installed here... :-)
Jul 20 '05 #8

P: n/a
"Pierre Goiffon" <pg******@nowhere.invalid> wrote in message news:<40***********************@news.free.fr>...
> Can you give a URL for a sample? The answer will depend on where the text
> to delete is located in your HTML pages...


SORRY! WRONG URL! Here are the correct ones:

http://leuce.com/tempfile/22265.html (100 kb)
http://leuce.com/tempfile/22265.zip (30 kb)
Jul 20 '05 #10

P: n/a
On 7 Apr 2004 02:34:39 -0700, ca******@websurfer.co.za (Voetleuce en
f?nsievry) wrote:
> Do you know of any tool that can do this?


Use HTMLTidy to turn it into XHTML, then run XSLT on that.
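
To give a feel for the second step, here is a rough sketch of the kind of transformation you would apply once Tidy has produced well-formed XHTML. It uses Python's standard `xml.etree.ElementTree` as a stand-in for an XSLT processor, so it is not XSLT itself. Assumptions: the input has no named entities such as `&nbsp;` (Tidy can be asked to emit numeric entities instead), and the `DROP` and `KEEP_ATTRS` sets and the function names are hypothetical choices, not anything prescribed by Tidy or by the pages in question.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"
# Assumption: elements that never contribute visible content.
DROP = {"script", "style"}
# Assumption: the only attributes worth keeping.
KEEP_ATTRS = {"href", "colspan", "rowspan"}


def local(tag: str) -> str:
    """Strip the XHTML namespace prefix from an element tag."""
    return tag[len(XHTML_NS):] if tag.startswith(XHTML_NS) else tag


def clean(elem: ET.Element) -> None:
    """Recursively drop unwanted elements and attributes, in place."""
    elem.tag = local(elem.tag)
    for name in list(elem.attrib):
        if name not in KEEP_ATTRS:
            del elem.attrib[name]
    previous = None
    for child in list(elem):
        if local(child.tag) in DROP:
            # Preserve the text that followed the removed element.
            if child.tail:
                if previous is None:
                    elem.text = (elem.text or "") + child.tail
                else:
                    previous.tail = (previous.tail or "") + child.tail
            elem.remove(child)
        else:
            clean(child)
            previous = child


def clean_xhtml(source: str) -> str:
    """Parse well-formed XHTML and return the cleaned markup as a string."""
    root = ET.fromstring(source)
    clean(root)
    return ET.tostring(root, encoding="unicode")
```

Tables, including nested ones, pass through untouched apart from losing their presentational attributes. A real XSLT stylesheet would express the same rules as templates, which is easier to maintain if the rules grow.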
Jul 20 '05 #12

P: n/a
"Voetleuce en f?nsievry" <ca******@websurfer.co.za> wrote in
message news:f0**************************@posting.google.com
> These files are Yahoo group message files from a mailing list group we
> have, but you see, Yahoo's message search feature is rubbish and we'd
> like to make the archive of old messages available for new members so
> that they can *search* the old messages and not ask the same questions
> over and over again.


OK, here are some ideas:
- a batch job that fetches the messages from a POP or IMAP account subscribed
to your list. It would just have to insert the messages into a database
(really easy to do with a PHP script)
- a script that extracts the data from the Yahoo Groups pages

For this kind of need, I would definitely choose the first solution! The
Yahoo Groups HTML is not really standards compliant, and its internal
structure seems to vary a lot from one page to another, so it would be
almost impossible to use the DOM: you would have to do it all yourself,
using string manipulation functions such as regular expressions. It could
work, but why not simply capture structured data, i.e. the messages sent
via email?
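
As a rough illustration of the first idea, here is a minimal sketch in Python with SQLite rather than PHP (my substitution, purely to keep it self-contained). It assumes the raw message text has already been fetched from the mailbox, for example with `poplib` or `imaplib`; the table layout and the names `open_archive`, `store_message` and `search` are made up for this example.

```python
import sqlite3
from email import message_from_string, policy


def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the archive database. The schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages "
        "(sender TEXT, subject TEXT, sent TEXT, body TEXT)"
    )
    return db


def store_message(db: sqlite3.Connection, raw: str) -> None:
    """Parse one raw RFC 822 message and insert it into the archive."""
    msg = message_from_string(raw, policy=policy.default)
    # Prefer the plain-text part; HTML-only mails are stored with an empty body.
    part = msg.get_body(preferencelist=("plain",))
    text = part.get_content() if part is not None else ""
    db.execute(
        "INSERT INTO messages (sender, subject, sent, body) VALUES (?, ?, ?, ?)",
        (str(msg["From"]), str(msg["Subject"]), str(msg["Date"]), text),
    )
    db.commit()


def search(db: sqlite3.Connection, word: str) -> list:
    """Return the subjects of all messages whose body contains `word`."""
    rows = db.execute(
        "SELECT subject FROM messages WHERE body LIKE ?", (f"%{word}%",)
    )
    return [row[0] for row in rows]
```

The point is that the mail already arrives as structured data (sender, subject, date, body), so nothing has to be scraped out of the Yahoo pages. This only covers messages that arrive after the mailbox is subscribed, though, not the existing archive.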

Jul 20 '05 #14

P: n/a
ca******@websurfer.co.za (Voetleuce en f?nsievry) wrote in
news:f0**************************@posting.google.com:
> "Pierre Goiffon" <pg******@nowhere.invalid> wrote in message
> news:<40***********************@news.free.fr>...
> > "Voetleuce en f?nsievry" <ca******@websurfer.co.za> wrote in
> > message news:f0**************************@posting.google.com
> >
> > > I have some pages written by a bot and much of the code does not
> > > concern the visible content on the site. I'd like to strip all the
> > > codes that do not affect or influence the visible stuff (although
> > > I'd like to keep the nested tables, if possible). Some of this can
> > > be stripped using Search/Replace, but some of it contains codes
> > > which differ from page to page.
> >
> > Can you give a URL for a sample? The answer will depend on where
> > the text to delete is located in your HTML pages...
>
> True. Here goes:
> http://leuce.com/translate/tempfile/22265.html (100 kb)
> http://leuce.com/translate/tempfile/22265.zip (30 kb)
>
> These files are Yahoo group message files from a mailing list group we
> have, but you see, Yahoo's message search feature is rubbish and we'd
> like to make the archive of old messages available for new members so
> that they can *search* the old messages and not ask the same questions
> over and over again.


You can use Personal Groupware to archive Yahoo Group messages:
http://www.personalgroupware.com/index.htm
and then export the messages to a file, which they can search.

--
Dave Patton
Canadian Coordinator, Degree Confluence Project
http://www.confluence.org/
My website: http://members.shaw.ca/davepatton/
Jul 20 '05 #16
