help with extracting contents from HTML file

priscilla
7 New Member
Hi,

I am a beginner in Perl programming. My task is to extract the content within form tags from an HTML file. I tried doing it with regular expressions but could not get the desired result, because the HTML in the web page is not properly formatted.

Can it be done using HTML::Parser or HTML::TreeBuilder?
I found many tutorials, but I have not been able to do it by following them. Can someone help with this?

Thank you in advance,
Priscilla.
Nov 2 '06 #1
miller
1,089 Recognized Expert Top Contributor
I would focus your efforts on HTML::Parser. There is an example for extracting the contents of a title tag directly in the POD documentation:

http://search.cpan.org/~gaas/HTML-Parser-3.55/Parser.pm#EXAMPLES

Even more importantly, though, there is actually an example that they provide for decoding the form contents of an HTML page. It can be found in the /eg/ directory of the CPAN dist for this module.

http://search.cpan.org/src/GAAS/HTML-Parser-3.55/eg/

You need to be able to do this on your own from this point though. If you have any specific trouble, feel free to ask, but you have plenty of specific resources at your disposal now to be able to solve this problem.
Nov 2 '06 #2
priscilla
7 New Member
thank you for your suggestion
I tried running this example in the link you have given
use HTML::Parser ();

sub start_handler
{
    return if shift ne "title";
    my $self = shift;
    $self->handler(text => sub { print shift }, "dtext");
    $self->handler(end  => sub { shift->eof if shift eq "title"; },
                           "tagname,self");
}

my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => \&start_handler, "tagname,self");
$p->parse_file(shift || die) || die $!;
print "\n";


It's giving me the message "died at line 15".

Can you please tell me what shift means here?

I am unable to understand where this program takes its input from to get the content of the title element.
Nov 2 '06 #3
miller
1,089 Recognized Expert Top Contributor
The line that is dying for you is this:
$p->parse_file(shift || die) || die $!;
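This is because the example is meant to be run as a script with an HTML file as a parameter.

For example:
./yourScript.pl yourHtmlFile.html

The shift in the above code removes the first element from the @ARGV array (the command-line arguments) and returns it, so that file gets parsed. If you run the script with no argument, shift returns undef and the "|| die" kicks in.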
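
To make the shift behaviour concrete, here is a minimal sketch, not part of the CPAN example. The helper name input_file and the usage message are made up for illustration. It does explicitly what the bare shift does implicitly on @ARGV, and dies with a readable message instead of a bare "Died":

```perl
use strict;
use warnings;

# Hypothetical helper (not from HTML::Parser): take the file name from a
# list of arguments, or die with a usage message if none was given.
sub input_file {
    my @args = @_;
    my $file = shift @args;    # same operation a bare shift performs on @ARGV
    die "Usage: $0 file.html\n" unless defined $file;
    return $file;
}

# Only runs when the script is actually given an argument.
print "Would parse: ", input_file(@ARGV), "\n" if @ARGV;
```

Outside of a sub, a bare shift works on @ARGV. Inside a sub it works on @_, which is why the handlers in the example can use shift to pick up their own arguments.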
Nov 3 '06 #4
miller
1,089 Recognized Expert Top Contributor
Today is your lucky day. As a learning project I decided to try to get a working version of this code for you. The script below accepts an HTML file as a parameter and parses out the raw text of all forms found within that file. It saves them in the @forms array, which is then printed out at the end of the script.

You'll have to decode how this is done on your own, and of course adapt it to your own purposes, since you did not state more explicitly what your end goal was. If you have some quick questions, I might answer them, but I will not waste my time trying to teach you what this does. I was able to figure it out by simply going through all of the examples that they provide, and of course by reading the documentation. Although, I admit, it could definitely use more verbose explanation.

http://search.cpan.org/src/GAAS/HTML-Parser-3.55/eg/
http://search.cpan.org/~gaas/HTML-Parser-3.55/Parser.pm

use HTML::Parser;

use strict;

my $file = shift || '20061101form.html';

my @forms = ();

sub start_form {
    my ($tagname, $self, $text) = @_;

    return if $tagname ne 'form';

    # Setup Handlers
    # - No longer look for start conditions, instead let the
    # default handler pick those up.
    $self->handler(start => undef);
    $self->handler(default => \&save_form, "text");
    $self->handler(end => \&end_form, "tagname,self,text");

    # Start New Form
    push @forms, '';
    save_form($text);
}

sub save_form {
    # Save all raw text in the current form.
    $forms[-1] .= shift;
}

sub end_form {
    my ($tagname, $self, $text) = @_;

    save_form($text);

    # End Processing, Wait for new Start Form
    if ($tagname eq 'form') {
        $self->handler(start => \&start_form, "tagname,self,text");
        $self->handler(default => undef);
        $self->handler(end => undef);
    }
}


my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => \&start_form, "tagname,self,text");
$p->parse_file($file) || die $!;

# Prints all found forms.
print @forms;

1;

__END__

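
If the end goal is the form fields rather than the raw markup, the same module can be adapted along these lines. This is only a sketch under assumptions: the sub name form_field_names and the sample HTML are invented for illustration, and it parses a string with parse/eof instead of a file. It collects the name attribute of every input, select and textarea found inside each form:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns one array ref of field names per <form>.
sub form_field_names {
    my ($html) = @_;
    my @forms;
    my $in_form = 0;

    my $p = HTML::Parser->new(
        api_version => 3,
        # "attr" hands the callback a hash ref of the tag's attributes.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'form') {
                $in_form = 1;
                push @forms, [];
            }
            elsif ($in_form && $tag =~ /^(?:input|select|textarea)$/) {
                push @{ $forms[-1] }, $attr->{name} if defined $attr->{name};
            }
        }, 'tagname,attr' ],
        end_h => [ sub { $in_form = 0 if $_[0] eq 'form' }, 'tagname' ],
    );

    $p->parse($html);    # parse a string instead of a file
    $p->eof;             # signal end of document
    return @forms;
}

# Made-up sample input.
my $sample = '<p>intro</p><form action="/go"><input name="q">'
           . '<input type="submit" name="go"></form>';
print join(',', @$_), "\n" for form_field_names($sample);
```

Fields that sit outside any form are skipped because of the $in_form flag. To read a file instead, parse_file could replace the parse/eof pair, as in the script above.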
Nov 3 '06 #5
priscilla
7 New Member
thanks a lot
Nov 6 '06 #6
