I want to read an .htm file, remove all the tables and the stray HTML entities, and get back plain text.
Here is what the code below does:
It removes all the HTML tags, such as <TABLE>, </TABLE>, <..> and so on, but it does not remove anything between <TABLE> and </TABLE>, and some entities such as &nbsp; and &#151; are left in the output.
Please help me modify the script so that it does this. Thanks very much!
#!/usr/bin/perl -w
use strict;
use HTML::Parser;

# define the subclass
package IdentityParse;
use base "HTML::Parser";

my @processed_html;

sub text {
    my ($self, $text) = @_;
    # print out the text
    push(@processed_html, $text);
}

sub comment {
}

sub start {
    my $self = shift;
    $self->{table_seen}++ if $_[0] eq "table";
    $self->SUPER::start(@_);
}

sub end {
    my $self = shift;
    $self->SUPER::end(@_);
    $self->{table_seen}-- if $_[0] eq "table";
}

sub output
{
    my $self = shift;
    unless ($self->{table_seen}) {
        $self->SUPER::output(@_);
    }
}

my $p = new IdentityParse;
$p->parse_file("H:/Test Data/wmt2004.htm");
open OUT, '>', "H:/Test Data/text1-processed.txt" or die;
print OUT @processed_html;
close OUT;
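
For comparison, here is a minimal sketch of one possible way to get the behaviour described above. It is not tested against the real file. In the script above, text() pushes every piece of text without ever checking table_seen, which would explain why table contents still end up in the output. The sketch uses the version 3 handler interface of HTML::Parser instead of a subclass, keeps a table depth counter, and asks for the 'dtext' argspec so entities arrive already decoded. The function name html_to_text is made up for illustration, and the replacements chosen for &nbsp; and &#151; are assumptions about what the plain text should contain.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the text of $html with tags removed,
# table contents dropped, and entities decoded.
sub html_to_text {
    my ($html) = @_;
    my $table_depth = 0;    # greater than 0 while inside <table> ... </table>
    my @chunks;

    my $p = HTML::Parser->new(
        api_version => 3,
        # 'tagname' is passed lower-cased, so <TABLE> and <table> both match
        start_h => [ sub { $table_depth++ if $_[0] eq 'table' }, 'tagname' ],
        end_h   => [ sub { $table_depth-- if $_[0] eq 'table' && $table_depth > 0 }, 'tagname' ],
        # 'dtext' is the text with entities such as &nbsp; already decoded
        text_h  => [ sub { push @chunks, $_[0] unless $table_depth }, 'dtext' ],
    );
    $p->parse($html);
    $p->eof;

    my $text = join '', @chunks;
    $text =~ tr/\xA0/ /;      # &nbsp; decodes to a non-breaking space
    $text =~ s/\x97/--/g;     # assumption: &#151; decodes to character 151 (Windows-1252 dash)
    return $text;
}
```

To use it on a file, read the file into a string (for example with a slurped filehandle), pass it to html_to_text, and print the result to the output file as before. Comments are dropped because no comment handler is registered. Text inside script and style elements would still come through the text handler, so those tag names would need the same depth-counter treatment if the input contains them.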