
Converting HTML to ASCII

gf gf
#1: Jul 18 '05
Hans,

Thanks for the tip. I took a look at Beautiful Soup,
and it looked like it was a framework to parse HTML.
I'm not really interested in going through it tag by
tag - just to get it converted to ASCII. How can I do
this with B. Soup?

--Thanks

PS William - thanks for the reference to lynx, but I
need a Python solution - forking and execing for each
file I need to convert is too slow for my application
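
For reference, a minimal in-process sketch of this kind of conversion. It assumes Python 3 and uses the standard library's html.parser rather than Beautiful Soup, so no fork/exec is needed per file. The names TextExtractor and html_to_ascii are made up for illustration, and replacing non-ASCII characters with "?" is just one possible policy.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML document and ignore the tags."""

    # Contents of these elements are never shown to a reader.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_ascii(html):
    """Return the visible text of `html` as a single ASCII-only line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush text still buffered after the last tag
    text = " ".join("".join(parser.parts).split())  # collapse whitespace
    # Characters outside ASCII become "?" (an assumed policy).
    return text.encode("ascii", "replace").decode("ascii")
```

html.parser does not raise on unclosed tags, so something like html_to_ascii("<p>Hello <b>world") still returns the text. It makes no attempt at layout.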


Hans wrote:
Try Beautiful Soup!
> 1) Be able to handle badly formed, or illegal, HTML,
> as best as possible.
From the description:
"It won't choke if you give it ill-formed markup: it'll just give you
access to a correspondingly ill-formed data structure."
> Can anyone direct me to something which could help me
> for this?
http://www.crummy.com/software/BeautifulSoup/

Hans Christian



Jorgen Grahn
#2: Jul 18 '05

re: Converting HTML to ASCII


On Fri, 25 Feb 2005 10:51:47 -0800 (PST), gf gf <unknownsoldier93@yahoo.com> wrote:
> Hans,
>
> Thanks for the tip. I took a look at Beautiful Soup,
> and it looked like it was a framework to parse HTML.

This is my understanding, too.
> I'm not really interested in going through it tag by
> tag - just to get it converted to ASCII. How can I do
> this with B. Soup?

You should probably do what some other poster suggested -- download lynx or
some other text-only browser and make your code execute it in -dump mode to
get the text-formatted html. You'll get that working in an hour or so, and
then you can see if you need something more complicated.
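
A rough sketch of what that could look like from Python, assuming Python 3, a lynx binary on the PATH, and lynx's -dump and -stdin options. The function names are invented for this example, and the 30-second timeout is an arbitrary choice.

```python
import shutil
import subprocess


def build_dump_command(browser="lynx"):
    """Command line that reads HTML on stdin and prints rendered text."""
    return [browser, "-dump", "-stdin"]


def html_to_text_via_lynx(html, browser="lynx", timeout=30):
    """Render `html` to plain text by piping it through a text browser."""
    cmd = build_dump_command(browser)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(cmd[0] + " is not installed or not on PATH")
    result = subprocess.run(
        cmd,
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8", "replace")
```

Each call starts one new process, which is exactly the per-file cost the original poster is worried about, so it is worth timing on real input before ruling it in or out.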

/Jorgen

--
// Jorgen Grahn <jgrahn@ Ph'nglui mglw'nafh Cthulhu
\X/ algonet.se> R'lyeh wgah'nagl fhtagn!
Paul Rubin
#3: Jul 18 '05

re: Converting HTML to ASCII


Jorgen Grahn <jgrahn-nntq@algonet.se> writes:
> You should probably do what some other poster suggested -- download
> lynx or some other text-only browser and make your code execute it
> in -dump mode to get the text-formatted html. You'll get that
> working in an hour or so, and then you can see if you need something
> more complicated.

Lynx is pathetically slow for large files. It seems to use a
quadratic algorithm for remembering where the links point, or
something. I wrote a very crude but very fast renderer in C that I
can post if someone wants it, which is what I use for this purpose.
Jorgen Grahn
#4: Jul 18 '05

re: Converting HTML to ASCII


On 26 Feb 2005 02:36:31 -0800, Paul Rubin <> wrote:
> Jorgen Grahn <jgrahn-nntq@algonet.se> writes:
>> You should probably do what some other poster suggested -- download
>> lynx or some other text-only browser and make your code execute it
>> in -dump mode to get the text-formatted html. You'll get that
>> working in an hour or so, and then you can see if you need something
>> more complicated.
>
> Lynx is pathetically slow for large files. It seems to use a
> quadratic algorithm for remembering where the links point, or
> something. I wrote a very crude but very fast renderer in C that I
> can post if someone wants it, which is what I use for this purpose.

That may be so, but it's fast enough for all the people who use it as a
general html->plaintext tool, so it's probably good enough for the OP.

w3m and links are other options. They provide better formatting than lynx,
and at least w3m has the -dump option.

I wouldn't mind if there was a reusable library for rendering HTML to text,
from various languages. I'd also like to see one (CSS-aware) for rendering
to troff or Postscript.
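
To give an idea of what the core of such a library might look like, here is a very crude sketch, assuming Python 3 and only the standard library (html.parser and textwrap). It is not an existing package: TextRenderer, render_text and the BLOCK_TAGS set are invented for illustration, and it handles only paragraphs, headings and list items, with no tables, links or CSS.

```python
import textwrap
from html.parser import HTMLParser

# Tags assumed to start a new block of text (a deliberately short list).
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "tr", "table",
              "blockquote", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextRenderer(HTMLParser):
    """Split a document into text blocks at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []    # finished blocks, in document order
        self.current = []   # text pieces of the block being built
        self.prefix = ""    # "* " while inside a list item

    def flush(self):
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(self.prefix + text)
            self.prefix = ""
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
            if tag == "li":
                self.prefix = "* "

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
            if tag == "li":
                self.prefix = ""

    def handle_data(self, data):
        self.current.append(data)


def render_text(html, width=72):
    """Return `html` as wrapped plain text, one blank line between blocks."""
    renderer = TextRenderer()
    renderer.feed(html)
    renderer.close()
    renderer.flush()  # emit trailing text that no end tag closed
    wrapped = []
    for block in renderer.blocks:
        indent = "  " if block.startswith("* ") else ""
        wrapped.append(textwrap.fill(block, width=width,
                                     subsequent_indent=indent))
    return "\n\n".join(wrapped)
```

For example, render_text("<h1>Title</h1><p>Some text</p>") gives the two blocks separated by a blank line. The output stays as Unicode; converting to ASCII would be a separate step.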

/Jorgen

--
// Jorgen Grahn <jgrahn@ Ph'nglui mglw'nafh Cthulhu
\X/ algonet.se> R'lyeh wgah'nagl fhtagn!