Divya Rao wrote:
Hi,
I need to parse an HTML file and extract all the text in it (not the
images or tags). I cannot figure out how to do it. I have the HTML file
saved in my local directory. I need to have the text printed/saved in
my local directory. I would really appreciate any help in this regard.
unix% cat /usr/local/bin/nohtml
#!/usr/bin/perl -w
# Name: nohtml Author: Jo*******@inwap.com 07-Nov-2001
# Purpose: Extracts just the text portions of a document.
use strict;
use HTML::Parser ();
sub text_handler { # Ordinary text
print @_;
}
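# api_version 3 selects the event-handler interface; the "dtext" argspec
# passes each text chunk to the handler with HTML entities decoded.
my $p = HTML::Parser->new(api_version => 3);
$p->handler( text => \&text_handler, "dtext");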
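# Read STDIN when no file is given or the argument is "-"; passing the
# handle avoids relying on open() treating "-" specially.
my $file = shift;
$file = \*STDIN if !defined $file || $file eq "-";
$p->parse_file($file) || die $!;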
1;
unix% cat /usr/local/bin/nh
#!/bin/sh
PATH=$PATH:/usr/local/bin; export PATH
nohtml - | less -s
Usage: while reading e-mail, pipe the message into '|nh'.
-Joe
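
For the original question (HTML files on local disk, text saved back to disk), the script can also be run directly with its output redirected, e.g. `nohtml page.html > page.txt`. Below is a minimal sketch of a batch version, assuming nohtml is installed on PATH as shown above. The function name html2txt_all and the NOHTML override variable are hypothetical, made up for illustration only.

```sh
#!/bin/sh
# Convert every .html file in a directory to a .txt file next to it.
# NOHTML (hypothetical) overrides the converter command; it defaults to
# the nohtml script, which is assumed to be on PATH.
html2txt_all() {
    dir=${1:-.}                      # default to the current directory
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue      # skip if the glob matched nothing
        # page.html -> page.txt; stop at the first failed conversion
        "${NOHTML:-nohtml}" "$f" > "${f%.html}.txt" || return 1
    done
}
```

Something like `html2txt_all "$HOME/pages"` should then leave one .txt file beside each .html file in that directory.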