
A question about HTML::Parser.

P: 3
I am sorry to bother you again.
What I want to do is read in an .htm file, delete all the tables and the strange HTML symbols, and return plain text.
The code below deletes all the HTML tags, such as <TABLE>, </TABLE>, and so on, but it does not delete anything between <TABLE> and </TABLE>, and some symbols such as &nbsp; and &#151; are left in the output.
Please help me modify the code to do the job. Thanks very much!
    #!/usr/bin/perl -w
    use strict;
    use HTML::Parser;
    # define the subclass
    package IdentityParse;
    use base "HTML::Parser";

    my @processed_html;

    sub text {
        my ($self, $text) = @_;
        # print out the text
        push(@processed_html, $text);
    }

    sub comment {
    }

    sub start {
        my $self = shift;
        $self->{table_seen}++ if $_[0] eq "table";
        $self->SUPER::start(@_);
    }

    sub end {
        my $self = shift;
        $self->SUPER::end(@_);
        $self->{table_seen}-- if $_[0] eq "table";
    }

    sub output {
        my $self = shift;
        unless ($self->{table_seen}) {
            $self->SUPER::output(@_);
        }
    }

    my $p = new IdentityParse;
    $p->parse_file("H:/Test Data/wmt2004.htm");
    open OUT, '>', "H:/Test Data/text1-processed.txt" or die;
    print OUT @processed_html;
    close OUT;

Jun 9 '10 #1
2 Replies


P: 15
@zhengmath
This will remove everything in tables.

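    # Slurp all of STDIN into $_.
    local $/;
    $_ = <STDIN>;
    # Each pass removes the last <table>...</table> pair, so nested tables
    # go from the inside out; the loop stops when no complete pair remains.
    1 while s/(.*)<table\b[^>]*>.*?<\/table>/$1/is;

    print;
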
Converting HTML entities to text is harder, as many of them have no plain-text equivalent. If you want to roll your own mapping from entities to single characters, just use s///.
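
Both halves of the question can also be handled inside HTML::Parser itself. What follows is only a minimal sketch, assuming the version 3 handler API of HTML::Parser: a depth counter suppresses text while inside a table, and the 'dtext' argspec delivers text with entities already decoded. The sub name html_to_text and the choice of characters folded to ASCII are assumptions made for illustration, not part of the original code.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the text of $html with everything inside
# <table>...</table> dropped and entities decoded.
sub html_to_text {
    my ($html) = @_;
    my $table_depth = 0;    # greater than 0 while inside a table
    my @chunks;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'tagname' passes the lower-cased tag name as the only argument.
        start_h => [ sub { $table_depth++ if $_[0] eq 'table' }, 'tagname' ],
        end_h   => [ sub { $table_depth-- if $_[0] eq 'table' && $table_depth > 0 },
                     'tagname' ],
        # 'dtext' passes the text with entities already decoded.
        text_h  => [ sub { push @chunks, $_[0] unless $table_depth }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;

    my $text = join '', @chunks;
    # Hand-rolled folding to ASCII (an assumption; extend as needed).
    $text =~ s/\x{A0}/ /g;              # non-breaking space -> space
    $text =~ s/[\x{97}\x{2014}]/-/g;    # &#151; and &mdash; -> hyphen
    return $text;
}

print html_to_text('<p>Tom &amp; Jerry</p><table><tr><td>hidden</td></tr></table>'), "\n";
```

Comments are skipped simply because no comment handler is registered. Text inside script and style elements is still passed through, so those would need the same depth-counter treatment if the input contains them.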
Jun 10 '10 #2

Expert
P: 80
IMHO, XPath is better suited for such a job.
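
One way the XPath route might look, as a rough sketch only: it assumes the CPAN module HTML::TreeBuilder::XPath is installed, which the post does not name, and the sub name text_without_tables is hypothetical. The XPath expression selects only outermost tables, so nested tables are removed together with their parent.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: parse $html, remove every table subtree, and return
# the remaining text content.
sub text_without_tables {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    # Select tables that are not inside another table, then remove each
    # one (and its whole subtree) from the document.
    $_->delete for $tree->findnodes('//table[not(ancestor::table)]');

    my $text = $tree->as_text;    # text content of what is left
    $tree->delete;                # free the tree
    return $text;
}

print text_without_tables(
    '<html><body><p>keep</p><table><tr><td>drop</td></tr></table></body></html>'
), "\n";
```

Because the tree builder decodes entities while parsing, characters such as the non-breaking space still arrive as Unicode characters and may need folding to ASCII afterwards.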
Jun 11 '10 #3
