HTML::Parser problem parsing special characters in HTML file.

Hello All,

I am trying to extract the text from an HTML file using the following code:

    use strict;
    use warnings;
    use HTML::Parser 3.00 ();

    my %inside;    # nesting depth per tag name

    # Start/end handler: track how deeply we are nested inside each tag.
    sub tag
    {
        my ($tag, $num) = @_;
        $inside{$tag} += $num;
    }

    # Text handler: print decoded text unless inside <script> or <style>.
    sub text
    {
        return if $inside{script} || $inside{style};
        print $_[0];
    }

    my $file = shift;
    defined $file or die "Usage: $0 file.html\n";
    open(my $fh, '<', $file) or die "Can't open $file: $!\n";

    my $p = HTML::Parser->new(api_version => 3,
              handlers    => [start => [\&tag, "tagname, '+1'"],
                      end   => [\&tag, "tagname, '-1'"],
                      text  => [\&text, "dtext"],
                     ],
              marked_sections => 1,
        );

    while (<$fh>)
    {
        $p->parse($_);
    }
    $p->eof;    # signal end of document so buffered text is flushed
    close($fh);
It works, but it falters when it encounters special characters and the like. I would like to either ignore these characters or replace them with sensible ASCII characters. I tried the Encode module, but it did not yield anything useful.

Any help is appreciated.

Regards,
Atul.
Mar 19 '08 #1
1 Reply


numberwhun
Sorry, I don't know the answer, but hopefully one of our experts will be able to assist you.

My apologies for the delay in getting an answer to your question.

Regards,

Jeff
Apr 17 '08 #2
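
For the goal stated in the question (ignoring special characters or replacing them with sensible ASCII ones), one possible approach is to filter the decoded text before printing it. The following is only a minimal sketch, not a tested fix for the original file: the helper name to_ascii and the mapping table %ascii_for are hypothetical, the table covers just a few common characters, and it assumes the text reaches the handler as decoded characters rather than raw bytes (for example, a UTF-8 file opened with '<:encoding(UTF-8)').

```perl
use strict;
use warnings;

# Hypothetical mapping from a few common non-ASCII characters to ASCII
# stand-ins. Extend it to suit the documents being processed.
my %ascii_for = (
    "\x{00A0}" => ' ',      # no-break space (&nbsp;)
    "\x{2018}" => "'",      # left single quotation mark
    "\x{2019}" => "'",      # right single quotation mark
    "\x{201C}" => '"',      # left double quotation mark
    "\x{201D}" => '"',      # right double quotation mark
    "\x{2013}" => '-',      # en dash
    "\x{2014}" => '-',      # em dash
    "\x{2026}" => '...',    # horizontal ellipsis
);

# Replace every non-ASCII character with its mapped stand-in,
# or drop it when the table has no entry for it.
sub to_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/exists $ascii_for{$1} ? $ascii_for{$1} : ''/ge;
    return $text;
}

# Demonstration on a string containing already-decoded characters.
print to_ascii("\x{201C}Hello\x{201D}\x{00A0}world \x{2014} caf\x{00E9}"), "\n";
# prints: "Hello" world - caf
```

To use it with the code from the question, the text handler would call print to_ascii($_[0]); instead of print $_[0];. Dropping unmapped characters loses information (the accented letter in the demonstration simply disappears), so a larger table or a transliteration module may be a better fit if accented text matters.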
