
help with extracting contents from HTML file

Newbie
 
Join Date: Nov 2006
Posts: 7
#1: Nov 2 '06
Hi,

I am a beginner in Perl programming. My task is to extract the content within form tags from an HTML file. I tried doing it with regular expressions but could not get the desired result, because the HTML in the web page is not properly formatted.

Can it be done using HTML::Parser or HTML::TreeBuilder?
I found many tutorials, but I have not been able to do it by following them. Can someone help with this?

Thank you in advance,
Priscilla.

Moderator
 
Join Date: Oct 2006
Location: San Francisco, CA
Posts: 830
#2: Nov 2 '06

re: help with extracting contents from HTML file


I would focus your efforts on HTML::Parser. There is an example for extracting the contents of a title tag directly in the POD documentation:

http://search.cpan.org/~gaas/HTML-Parser-3.55/Parser.pm#EXAMPLES

Even more importantly, there is actually an example that they provide for decoding the form contents of an HTML page. It can be found in the /eg/ directory of the CPAN distribution for this module.

http://search.cpan.org/src/GAAS/HTML-Parser-3.55/eg/

You need to be able to do this on your own from this point though. If you have any specific trouble, feel free to ask, but you have plenty of specific resources at your disposal now to be able to solve this problem.
Newbie
 
Join Date: Nov 2006
Posts: 7
#3: Nov 2 '06

re: help with extracting contents from HTML file


Thank you for your suggestion.
I tried running this example from the link you gave:
use HTML::Parser ();

sub start_handler
{
    return if shift ne "title";
    my $self = shift;
    $self->handler(text => sub { print shift }, "dtext");
    $self->handler(end => sub { shift->eof if shift eq "title"; },
                   "tagname,self");
}

my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => \&start_handler, "tagname,self");
$p->parse_file(shift || die) || die $!;
print "\n";


It gives me the message "died at line 15".

Can you please tell me what shift means here?

I am unable to understand where this program takes its input from to get the content of the title element.
Moderator
 
Join Date: Oct 2006
Location: San Francisco, CA
Posts: 830
#4: Nov 3 '06

re: help with extracting contents from HTML file


The line that is dying for you is this:
$p->parse_file(shift || die) || die $!;

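This is because the example is meant to be run as a script with an HTML file as a parameter.

For example:
./yourScript.pl yourHtmlFile.html

The shift function in the above code removes the first element from the @ARGV array (the command-line arguments), and the script then parses that file. If no argument is given, shift returns undef and the "|| die" part stops the script, which is the message you are seeing.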
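
To illustrate how shift picks its array, here is a minimal sketch. The file name is only a placeholder, and @ARGV is assigned by hand purely to simulate running the script with one argument; in a real run Perl fills @ARGV from the command line.

```perl
use strict;
use warnings;

# Simulate "./yourScript.pl yourHtmlFile.html" (placeholder name, set by hand
# for this demo only).
@ARGV = ('yourHtmlFile.html');

# Outside any subroutine, a bare shift is the same as shift(@ARGV).
my $file = shift;
print "file: $file\n";

# Inside a subroutine, a bare shift works on @_ (the sub's arguments) instead.
sub first_arg { return shift }
my $arg = first_arg('title', 'ignored');
print "arg: $arg\n";

# @ARGV is now empty, so shift returns undef. This is the case in which
# "shift || die" stops the script.
my $missing = shift;
print "no more arguments\n" unless defined $missing;
```

This is why the handler in the title example can use shift to read "tagname" and "self": inside the subroutine it is reading the handler's arguments, not the command line.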
Moderator
 
Join Date: Oct 2006
Location: San Francisco, CA
Posts: 830
#5: Nov 3 '06

re: help with extracting contents from HTML file


Today is your lucky day. As a learning project I decided to try to get a working version of this code for you. The script below accepts an HTML file as a parameter and then parses out the raw text of all forms found within that file. It saves them in the @forms array, which is then printed at the end of the script.

You'll have to decode how this is done on your own, and of course adapt it to your own purposes, since you did not state more explicitly what your end goal was. If you have some quick questions, I might answer them, but I will not waste my time trying to teach you what this does. I was able to figure it out simply by going through all of the examples they provide and, of course, by reading the documentation. I admit, though, that the documentation could definitely use a little more verbose explanation.

http://search.cpan.org/src/GAAS/HTML-Parser-3.55/eg/
http://search.cpan.org/~gaas/HTML-Parser-3.55/Parser.pm

use HTML::Parser;

use strict;

my $file = shift || '20061101form.html';

my @forms = ();

sub start_form {
    my ($tagname, $self, $text) = @_;

    return if $tagname ne 'form';

    # Setup Handlers
    # - No longer look for start conditions, instead let the
    # default handler pick those up.
    $self->handler(start => undef);
    $self->handler(default => \&save_form, "text");
    $self->handler(end => \&end_form, "tagname,self,text");

    # Start New Form
    push @forms, '';
    save_form($text);
}

sub save_form {
    # Save all raw text in the current form.
    $forms[-1] .= shift;
}

sub end_form {
    my ($tagname, $self, $text) = @_;

    save_form($text);

    # End Processing, Wait for new Start Form
    if ($tagname eq 'form') {
        $self->handler(start => \&start_form, "tagname,self,text");
        $self->handler(default => undef);
        $self->handler(end => undef);
    }
}


my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => \&start_form, "tagname,self,text");
$p->parse_file($file) || die $!;

# Prints all found forms.
print @forms;

1;

__END__

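
For anyone who wants to experiment without an HTML file on disk, the following is a rough sketch of the same idea applied to a string. It is not the script above: it is a simplified variant that tracks an $in_form flag instead of swapping handlers, and the sample HTML and the $in_form name are made up for illustration. It assumes HTML::Parser is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical sample input, deliberately sloppy like many real pages
# (unquoted attribute, unclosed <b>).
my $html = <<'HTML';
<html><body><p>before
<form action="/search"><input name=q><b>Go</form>
<p>after</body></html>
HTML

my @forms;          # one string of raw text per form found
my $in_form = 0;    # true while the parser is inside a <form> element

my $p = HTML::Parser->new(
    api_version => 3,
    # Start tags: open a new entry on <form>, and keep the raw tag text
    # of anything seen while inside a form.
    start_h => [ sub {
        my ($tagname, $text) = @_;
        if ($tagname eq 'form') { $in_form = 1; push @forms, '' }
        $forms[-1] .= $text if $in_form;
    }, 'tagname,text' ],
    # End tags: keep the raw text, then leave form mode on </form>.
    end_h => [ sub {
        my ($tagname, $text) = @_;
        $forms[-1] .= $text if $in_form;
        $in_form = 0 if $tagname eq 'form';
    }, 'tagname,text' ],
    # Everything else (plain text, comments, ...) is kept only inside a form.
    default_h => [ sub {
        my ($text) = @_;
        $forms[-1] .= $text if $in_form;
    }, 'text' ],
);

$p->parse($html);   # parse() takes a string; parse_file() takes a file name
$p->eof;            # signal end of input so buffered text is flushed

print "$_\n" for @forms;
```

The flag approach is a little easier to follow than replacing handlers on the fly, at the cost of calling every handler for the whole document rather than only while a form is open.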
Newbie
 
Join Date: Nov 2006
Posts: 7
#6: Nov 6 '06

re: help with extracting contents from HTML file


Thanks a lot.