
Looking for a tool to make plain text document out of a simple HTML document

Hi,

Hopefully this is not too much offtopic.

I'm working on a FAQ. I want to make two versions of it, plain text and
HTML. I'm looking for a tool that will make a plain text doc out of the
HTML doc. The HTML version doesn't have anything fancy, just internal
links. So the tool must be able to delete internal links and anchors from
the HTML version, but leave external links in simplified form. That is, the
HTML version would say <a href="http://foo/bar.html">Bar</a> and the plain
text version would just say http://foo/bar.html. The tool should also be
able to make mailto links into plain text, that is, change <a
href="mailto:fo*@bar.com">Foo </a> into fo*@bar.com. In fact, I think both
of these changes might be possible with a regex search and replace engine. I
have one in my editor, but I don't know how to use it. So I'm looking for
either regex strings or a ready-made tool; either will be fine.
Jul 20 '05 #1
14 Replies


Akseli Mäki wrote:

I forgot to say that the tool should be a DOS or Windows one.
Jul 20 '05 #2

Jon
open the file in Word, select all, copy, open Notepad, paste, save :)

Jon

"Akseli Mäki" <ne********@akseli-yok.utu.fi> wrote in message
news:0p********************************@4ax.com...
> Akseli Mäki wrote:
>
> I forgot to say, that the tool should be Dos or Windows one.
Jul 20 '05 #3

On Sun, 21 Dec 2003 14:12:45 +0200, Akseli Mäki wrote:
> Hi,
>
> Hopefully this is not too much offtopic.
>
> I'm working on a FAQ. I want to make two versions of it, plain text and
> HTML. I'm looking for a tool that will make a plain text doc out of the
> HTML doc. The HTML version doesn't have anything fancy, just internal
> links. So the tool must be able to delete internal links and anchors from
> the HTML version, but leave external links in simplified form. That is, the
> HTML version would say <a href="http://foo/bar.html">Bar</a> and the plain
> text version would just say http://foo/bar.html. The tool should also be
> able to make mailto links into plain text, that is, change <a
> href="mailto:fo*@bar.com">Foo </a> into fo*@bar.com. In fact, I think both
> of these changes might be possible with regex search and replace engine. I
> have one at my editor, but I don't know how to use it. So I'm either
> looking for a regex strings or a ready tool, both will be find.


While it doesn't *exactly* match your requirements, lynx is a very good
tool for doing this. "lynx --dump http://host/dir/page.ext" will produce
a plain-text output with links replaced with '[1]link text'; at the bottom
of the output is a list of all the links' destination URLs.

It is available for Windows at <http://jim.spath.com/lynx_win32/>.

--
Some say the Wired doesn't have political borders like the real world,
but there are far too many nonsense-spouting anarchists or idiots who
think that pranks are a revolution.

Jul 20 '05 #4

Jon wrote:

Please direct your attention to: http://www.allmyfaqs.com/faq.pl?How_to_post

> open the file in Word, select all, copy, open Notepad, paste, save :)

How does that preserve external hyperlinks?

--
David Dorward <http://dorward.me.uk/>
Jul 20 '05 #5

Akseli Mäki wrote:

> Hopefully this is not too much offtopic.

Perhaps slightly, but the ciwa-tools group seems to generate little
traffic outside of spam.

> I'm working on a FAQ. I want to make two versions of it, plain text
> and HTML.

May we ask why?

> I'm looking for a tool that will make a plain text doc out of the
> HTML doc. The HTML version doesn't have anything fancy, just
> internal links. So the tool must be able to delete internal links
> and anchors from the HTML version, but leave external links in
> simplified form. That is, the HTML version would say <a
> href="http://foo/bar.html">Bar</a> and the plain text version would
> just say http://foo/bar.html. The tool should also be able to make
> mailto links into plain text, that is, change <a
> href="mailto:fo*@bar.com">Foo </a> into fo*@bar.com.

And in both cases, you want to *remove* the anchor text, is that right?

> In fact, I think both of these changes might be possible with regex
> search and replace engine.

That's how I'd probably do it, but it would be a little time consuming
for me, because I'd need several steps to do it. My editor has a
search/replace dialogue box. If I were going to try to do what you're
doing, I'd copy the html files to a new directory, each file with a
new .txt extension. Then I'd run the search/replace.

Search: <a href="{[a-z/]*}">[a-zA-Z]*</a>

Replace: \1

This almost works in my text editor, NoteTab Light. Perhaps it'll
help you get started.
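
For readers without NoteTab, the same search/replace idea can be sketched in Python. This is a minimal sketch, not a general HTML converter: `simplify_links` is a hypothetical name, the patterns assume simple, well-formed `<a>` tags with double-quoted attributes, and it assumes that "deleting" an internal link means keeping its text while dropping the markup. Unlike the `[a-z/]*` class in the pattern above, these patterns accept any character up to the closing quote, so the colon and dots in a real URL also match.

```python
import re

# Assumption: links look like <a href="...">text</a> with double quotes,
# no nested tags inside the attribute, and no nested <a> elements.
MAILTO = re.compile(r'<a\s+href="mailto:([^"]+)"[^>]*>.*?</a>', re.I | re.S)
EXTERNAL = re.compile(
    r'<a\s+href="((?:https?|ftp)://[^"]+)"[^>]*>.*?</a>', re.I | re.S
)
# Internal links (href="#...") and named anchors (name="...").
INTERNAL = re.compile(
    r'<a\s+(?:href="#[^"]*"|name="[^"]*")[^>]*>(.*?)</a>', re.I | re.S
)


def simplify_links(html: str) -> str:
    """Reduce links to plain text; all other tags are left untouched."""
    text = MAILTO.sub(r"\1", html)       # mailto link -> bare address
    text = EXTERNAL.sub(r"\1", text)     # external link -> bare URL
    text = INTERNAL.sub(r"\1", text)     # internal link/anchor -> its text
    return text


if __name__ == "__main__":
    sample = (
        'See <a href="#q1">Q1</a>, <a href="http://foo/bar.html">Bar</a> '
        'or mail <a href="mailto:foo@example.com">Foo </a>.'
    )
    print(simplify_links(sample))
```

The remaining markup (headings, paragraphs and so on) would still need a separate pass, which is where a tool such as Lynx is more convenient.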

> I have one at my editor, but I don't know how to use it.


Google "regex" or "regular expression" -- lots of links to go
through. If you are going to go this route, then I doubt there's any
acceptable substitute for learning regular expressions. But then, if
your editor has them,

--
Brian
follow the directions in my address to email me

Jul 20 '05 #6

In article <fg********************************@4ax.com> in
comp.infosystems.www.authoring.html, Akseli Mäki wrote:
> Hi,
>
> Hopefully this is not too much offtopic.
>
> I'm working on a FAQ. I want to make two versions of it, plain text and
> HTML. I'm looking for a tool that will make a plain text doc out of the
> HTML doc. The HTML version doesn't have anything fancy, just internal
> links. So the tool must be able to delete internal links and anchors from
> the HTML version, but leave external links in simplified form.


Lynx can almost do what you want, and it has the great virtue that
you can do the job with a batch file rather than navigate menus. The
form (from memory; check with "lynx -help") is
lynx -dump URL_or_file >outputfile
A local file can be done either as file:///c:/zonk/file or without
the leading "file:///".
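
The batch-file idea could equally be written as a short Python script. This is only a sketch under stated assumptions: `lynx` is on the PATH, the FAQ pages are local `*.html` files in one directory, and `lynx_command`, `text_path_for` and `convert_directory` are hypothetical helper names. The only Lynx option used is the `-dump` switch mentioned above.

```python
import subprocess
from pathlib import Path


def lynx_command(source: str) -> list[str]:
    """Argument list equivalent to: lynx -dump URL_or_file"""
    return ["lynx", "-dump", source]


def text_path_for(html_path: Path) -> Path:
    """faq.html -> faq.txt, in the same directory."""
    return html_path.with_suffix(".txt")


def convert_directory(directory: Path) -> None:
    """Dump every *.html file in `directory` to a .txt file next to it."""
    for html_path in sorted(directory.glob("*.html")):
        result = subprocess.run(
            lynx_command(str(html_path)),
            capture_output=True,
            text=True,
            check=True,  # raise if lynx exits with an error
        )
        text_path_for(html_path).write_text(result.stdout)


if __name__ == "__main__":
    convert_directory(Path("."))
```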

Lynx will insert bracketed numbers [1], [2], etc. in the text before
each link, then put a list at the end, so you have a record of what
each link is. I don't think it makes any distinction among external
and internal links and mailtos, however. You could postprocess the
output to remove internal links from the link list.
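
One possible way to do that postprocessing, as a hedged sketch in Python. It assumes the dump ends with a `References` heading followed by numbered lines such as `   1. http://foo/bar.html` (check the output of your own Lynx version), and by default it treats any URL containing `#` as internal, which is an assumption that would also drop external links carrying a fragment. `drop_internal_links` and `has_fragment` are hypothetical names.

```python
import re
from typing import Callable

# Assumed shape of one entry in the link list: "   12. http://host/page"
REF_LINE = re.compile(r"^\s*(\d+)\.\s+(\S+)\s*$")


def has_fragment(url: str) -> bool:
    """Default guess at 'internal': the URL points at an anchor."""
    return "#" in url


def drop_internal_links(
    dump: str, is_internal: Callable[[str], bool] = has_fragment
) -> str:
    """Remove internal entries from the link list and their [n] markers."""
    body, sep, refs = dump.rpartition("\nReferences\n")
    if not sep:  # no link list found: return the dump unchanged
        return dump

    dropped = set()
    kept = []
    for line in refs.splitlines():
        match = REF_LINE.match(line)
        if match and is_internal(match.group(2)):
            dropped.add(match.group(1))
        else:
            kept.append(line)

    # Strip only the markers whose entries were removed; no renumbering.
    body = re.sub(
        r"\[(\d+)\]",
        lambda m: "" if m.group(1) in dropped else m.group(0),
        body,
    )
    return body + sep + "\n".join(kept) + "\n"
```

The remaining entries keep their original numbers, so the markers left in the text still match the list.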

http://www.fdisk.com/doslynx/lynxport.htm

--
Stan Brown, Oak Road Systems, Cortland County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/
Jul 20 '05 #7

Jon
Whoops! missed that!!!!

Although this could be sorted with some Word VBA script... not really the
place for that

Jon
"David Dorward" <do*****@yahoo.com> wrote in message
news:bs*******************@news.demon.co.uk...
> Jon wrote:
>
> Please direct your attention to: http://www.allmyfaqs.com/faq.pl?How_to_post
>> open the file in Word, select all, copy, open Notepad, paste, save :)
>
> How does that preserve extenal hyperlinks?
>
> --
> David Dorward <http://dorward.me.uk/>
Jul 20 '05 #8

Akseli Mäki wrote:

OK, thanks for all the suggestions. I already have Lynx, so I'll use it.
Jul 20 '05 #9

Brian wrote:
>> I'm working on a FAQ. I want to make two versions of it, plain text
>> and HTML.
> May we ask why?

Well, some people might prefer an HTML file, I don't know yet. I might decide
to drop the idea if no one downloads it. Naturally I would post only the
plaintext version to the NG.

> And in both cases, you want to *remove* the anchor text, is that right?

Yes.
Jul 20 '05 #10

On Mon, 22 Dec 2003, Akseli Mäki wrote:
> Ok thanks for all the suggestins. I already have Lynx so I'll use it.


Yes, Lynx does that job very well - except for tables. Real tables, I
mean - it can be actually beneficial what it does with
tables-for-layout, but tabular data can become unusable in Lynx.

(If you're in control of the HTML, I have an ancient web page
about how to make HTML tables which also present acceptably on
Lynx, but it's not really been updated since 1998: if you still
want to read it after that low-key introduction, then
http://ppewww.ph.gla.ac.uk/~flavell/www/tablejob.html

The nobreak-space stuffing technique is the one I would recommend now,
if you want to do anything at all.)
Jul 20 '05 #11

Alan J. Flavell wrote:
> On Mon, 22 Dec 2003, Akseli Mäki wrote:
>> Ok thanks for all the suggestins. I already have Lynx so I'll use it.
>
> Yes, Lynx does that job very well - except for tables. Real tables, I
> mean - it can be actually beneficial what it does with
> tables-for-layout, but tabular data can become unusable in Lynx.


Lynx actually does a pretty good job with tables these days. It guesses whether
it's a layout table or a real table, so it's not 100% reliable, though.

To take the example from the URI I snipped, it comes out quite happily as:

Deutsch     British  USA
Haube       Bonnet   Hood
Kofferraum  Boot     Trunk
Benzin      Petrol   Gas(oline)

(With the <th>s rendered in brown)

(That said, it is rather early here, and I only skimmed your document, so I
could be missing something).
--
David Dorward <http://dorward.me.uk/>
Jul 20 '05 #12

In article <Pi*******************************@ppepc56.ph.gla. ac.uk>, one of infinite monkeys
at the keyboard of "Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
> Yes, Lynx does that job very well - except for tables. Real tables, I
> mean - it can be actually beneficial what it does with
> tables-for-layout, but tabular data can become unusable in Lynx.


Um - recent Lynx does a very nice job on tables.
At least, that's my experience with the Lynx bundled in Slackware 9.

--
Nick Kew

In urgent need of paying work - see http://www.webthing.com/~nick/cv.html
Jul 20 '05 #13

On Mon, 22 Dec 2003, David Dorward wrote:
> Lynx actually does a pretty good job with tables these days, it guesses if
> its a layout table or a real table, so its not 100% though.

OK, I knew they had been working on it. Care to mention which version
you're using?

> (That said, it is rather early here, and I only skimmed your document, so I
> could be missing something).


OK, I *did* stress that my page is from about 5 years back. However,
I might remark that there used to be plenty of obsolete versions of
Lynx around. Hmmm, interesting: if I trawl the logs now, the oldest
version of Lynx that seems to be well represented is 2.8.3rel.1,
although I did see just the occasional 2.7.1

A quick hunt around different machines here didn't show up too many
usable versions, but I tried a couple as below.

This one doesn't space the display out usefully: Lynx 2.8.2rel.1

This one does: Lynx 2.8.4rel.1

So it must have got sorted out somewhere in between.

I think the bottom line is that I withdraw my previous posting,
provided that the user has a recent-enough version of Lynx. Thanks.
Jul 20 '05 #14

Alan J. Flavell wrote:
> On Mon, 22 Dec 2003, David Dorward wrote:
>> Lynx actually does a pretty good job with tables these days, it guesses
>> if its a layout table or a real table, so its not 100% though.
> OK, I knew they had been working on it. Care to mention which version
> you're using?


david $ lynx --version
Lynx Version 2.8.4rel.1 (17 Jul 2001)
libwww-FM 2.14, SSL-MM 1.4.1, OpenSSL 0.9.7c
Built on linux-gnu Dec 7 2003 10:16:09

> I think the bottom line is that I withdraw my previous posting,
> provided that the user has a recent-enough version of Lynx. Thanks.

It's always nice to see browsers improve. (Hint to Microsoft :D)

--
David Dorward <http://dorward.me.uk/>
Jul 20 '05 #15
