Parsing title from a web page

43 New Member
Hello,

I have a recurring problem with a parsing routine that I have written.

Whenever I try to extract the text between a pair of HTML tags, sometimes (not always) I end up capturing a large part of the file, far more than the intended target.

Here is the code:

    $variable_start = "<title>";
    $variable_stop = "</title>";

    $start_string = $variable_start;
    $stop_string = $variable_stop;

    my ($name2) = $line =~ m/$start_string(.+)$stop_string/si;
    $name = $name2;

This code is embedded in a while loop that processes the file one line at a time and extracts the target.

Sometimes it works perfectly and other times it does not. There must be a better way to extract the text between a pair of HTML tags.

Thanks,

Yves
Apr 2 '11 #1

yjulien
43 New Member
I have discovered something which might help. When I write it this way, it always works:

    $start_string = "<title>";
    $stop_string = "</title>";

    my ($name2) = $line =~ m/$start_string(.+)$stop_string/si;
    $name = $name2;

When I pass in the content from a variable, that is when it does not work properly:

    $start_string = $variable_start;
    $stop_string = $variable_stop;

    my ($name2) = $line =~ m/$start_string(.+)$stop_string/si;
    $name = $name2;

where

    $variable_start = "<title>";
    $variable_stop = "</title>";

come from another routine.

Does that help?

Thanks,

Yves
Apr 2 '11 #2
miller
1,089 Recognized Expert Top Contributor
I made the mistake of creating a detailed post that took too long and therefore bytes.com lost it. Very annoying. I'm therefore going to make this one really quick.

Using a regex, be sure to not use greedy matching:

    use strict;

    my $data = do {local $/; <DATA>};

    my $title = $data =~ m{<title>(.*?)</title>}s ? $1 : warn "Can't find title";

    print "$title\n";

    __DATA__
    <html>
    <head>
    <title>My Title</title>
    </head>
    <body>
    Hello World
    </body>
    </html>

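To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample string is made up for illustration (it is not from the page in question) and simply contains two `<title>` pairs, which is enough to make `.+` overshoot:

```perl
use strict;
use warnings;

# Hypothetical input: two title elements, as can happen once a whole
# file (rather than a single line) is matched with the 's' modifier.
my $html = "<title>First</title>\n<p>body</p>\n<title>Second</title>";

# Greedy: .+ grabs as much as it can and backs off only as far as
# the LAST </title> in the string.
my ($greedy) = $html =~ m{<title>(.+)</title>}si;

# Non-greedy: .+? stops at the FIRST </title> after the opening tag.
my ($lazy) = $html =~ m{<title>(.+?)</title>}si;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

With the greedy pattern the capture runs from the first opening tag to the last closing tag, which matches the "large part of the file" symptom described in the question. The non-greedy pattern captures only the first title.
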
Using HTML::Parser, pulled straight from its example directory:

    #!/usr/bin/perl

    # This program will print out the title of an HTML document.

    use HTML::Parser ();

    use strict;

    my $data = do {local $/; <DATA>};

    sub title_handler {
        my $self = shift;
        $self->handler(text => sub { print @_ }, "dtext");
        $self->handler(end  => "eof", "self");
    }

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [\&title_handler, "self"],
        report_tags => ['title'],
    );
    $p->parse($data);
    $p->eof();

    __DATA__
    <html>
    <head>
    <title>My Title</title>
    </head>
    <body>
    Hello World
    </body>
    </html>

Using XPath and HTML::TreeBuilder::XPath:

    #!/usr/bin/perl

    use HTML::TreeBuilder::XPath;

    use strict;

    my $data = do {local $/; <DATA>};

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse( $data );
    my $title = $tree->findvalue( '/html/head/title' );

    print "$title";

    __DATA__
    <html>
    <head>
    <title>My Title</title>
    </head>
    <body>
    Hello World
    </body>
    </html>

As you can see, each of these methods has its own advantages and specialties. I believe XPath is probably your best choice for this particular problem, but it's good to be familiar with each.

- Miller
Apr 3 '11 #3
yjulien
43 New Member
Hi Miller,

Great answer as usual :D

I will try the first one now, since it seems to me the easiest way to replace the existing code without major changes. However, I like the other one better in terms of neatness:

    my $title = $data =~ m{<title>(.*?)</title>}s ? $1 : warn "Can't find title";

Can it be modified like this?

    my $title = $data =~ m{$variable_start(.*?)$variable_stop}s ? $1 : warn "Can't find title";

As I explained above, I'm passing the value <title> in via an external file. But the title of the page I'm visiting will sometimes be found within other tags, like <h1> </h1>, and sometimes it's not even within a tag. It could be anything...

It's for the project I was telling you about the other day.

For my personal understanding: what was the problem with my own code?

    my ($name2) = $line =~ m/$start_string(.+)$stop_string/si;
    $name = $name2;

Why was it not always working?

Thanks,

Yves
Apr 3 '11 #4
yjulien
43 New Member
Hello again Miller,

There is something I must be doing wrong here or don't understand (probably both...).

I'm reading an external file that contains the web page and using a loop to read each line. It's like this:

    while (<DATABASE>) {
        $line = $_;
        chomp $line;

        my $title = $line =~ m{<title>(.*?)</title>}s ? $1 :
            warn "Can't find title";

        if ($title ne "") {print "title = $title<br>";}

    }                   # end of while

The output is always "1" and not the actual title or string I'm trying to extract. How can I modify the line to extract the text I'm looking for?
Apr 3 '11 #5
miller
1,089 Recognized Expert Top Contributor
I still say that XPath is the best choice for this type of problem, but here's the regex code with some additional comments that might help.

    use strict;

    local *DATABASE = *DATA;

    # Your code begins here

    # Before and after strings to use in match
    my $pre = "<title>";
    my $post = "</title>";

    # Slurp entire file in case pattern occurs across multiple lines
    my $data = do {local $/; <DATABASE>};

    # Use .*? so it doesn't use greedy matching.
    # Use \Q...\E around boundaries so that special regex characters are escaped.
    #    boundaries are meant to be literal strings
    # Use 's' modifier so that '.' will match return characters
    # Use 'i' modifier to make case insensitive matching
    my $text = $data =~ m{\Q$pre\E(.*?)\Q$post\E}si
        ? $1
        : warn "Can't find text";

    print "$text\n";

    # End your code

    __DATA__
    <html>
    <head>
    <title>My Title</title>
    </head>
    <body>
    Hello World
    </body>
    </html>

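Since the boundary strings come from an external file and "could be anything", here is a small sketch of what `\Q...\E` buys you. The helper name `extract_between` and the `[b]`/`[/b]` delimiters are made up for illustration and are not part of the code above. The delimiters are chosen only because they contain regex metacharacters:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the text between two
# literal delimiters, or undef when the delimiters are not found.
sub extract_between {
    my ($data, $pre, $post) = @_;
    return $data =~ m{\Q$pre\E(.*?)\Q$post\E}si ? $1 : undef;
}

# Made-up input and delimiters; "[" and "]" are special inside a regex.
my $data = "abc [b]My Title[/b] xyz";
my ($pre, $post) = ("[b]", "[/b]");

# Without \Q...\E, "[b]" is interpolated as a character class matching a
# single "b", so the match starts and stops in the wrong places.
my ($unquoted) = $data =~ m{$pre(.*?)$post}s;

# With \Q...\E the delimiters are treated as literal strings.
my $quoted  = extract_between($data, $pre, $post);
my $missing = extract_between($data, "<h1>", "</h1>");

print "unquoted: [$unquoted]\n";
print "quoted:   [$quoted]\n";
print "missing:  ", (defined $missing ? $missing : "(not found)"), "\n";
```

Plain `<title>` and `</title>` contain no regex metacharacters, so quoting is not what broke the original code. It only matters once arbitrary strings are read from a file. Returning `undef` on failure is one possible design choice that lets the caller test `defined` instead of comparing against an empty string.
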
Apr 3 '11 #6
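
On the "1" reported in post #5: `warn` returns a true value, so in `COND ? $1 : warn "..."` the variable receives 1 whenever the match fails. In a line-by-line loop that is almost every line. The sketch below reproduces this under stated assumptions: the page text is made up, an in-memory filehandle stands in for the real file, and the `__WARN__` handler is there only to keep the demo quiet.

```perl
use strict;
use warnings;

# Hypothetical page; the title is deliberately split across two lines.
my $page = "<html>\n<head>\n<title>My\nTitle</title>\n</head>\n</html>\n";

# Demo only: suppress warn's message so just its return value is visible.
local $SIG{__WARN__} = sub { };

# Line by line, as in the loop from post #5. No single line contains both
# tags, so every iteration takes the warn branch.
open my $fh, '<', \$page or die "Cannot open in-memory file: $!";
my @per_line;
while (my $line = <$fh>) {
    chomp $line;
    my $title = $line =~ m{<title>(.*?)</title>}s ? $1 : warn "Can't find title";
    push @per_line, $title;
}
close $fh;

# Slurped: match once against the whole text and test the match itself,
# so a failed match never ends up stored in the result.
my $slurped;
if ($page =~ m{<title>(.*?)</title>}si) {
    $slurped = $1;
}

print "per line: @per_line\n";
print "slurped:  ", (defined $slurped ? $slurped : "(no title)"), "\n";
```

Every element collected in the loop is 1, while the slurped match finds the title even though it spans a line break. If the loop has to stay, testing the match with `if (...) { ... }` rather than the ternary avoids storing `warn`'s return value.
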
