Parsing title from a web page

43 New Member

Hello,

I have a recurring problem with a parsing routine that I have written.

Whenever I try to extract text from between a pair of HTML tags, sometimes (not always) I end up capturing a large part of the file, much more than the intended target.

Here is the code:


 
$variable_start = "<title>";

$variable_stop= "</title>";
 
$start_string = $variable_start;

$stop_string = $variable_stop;
 
my ($name2) = $line =~ m/$start_string(.+)$stop_string/si;

$name = $name2;

This code is embedded in a while loop that goes through the file line by line and extracts the target.

Sometimes it works perfectly and other times it does not. There must be better code to extract info from a pair of HTML tags.

Thanks,

Yves

Apr 2 '11 #1


2178

yjulien

New Member

I have discovered something which might help. When I write it this way, it always works:


 
$start_string = "<title>";

$stop_string = "</title>"; 
 
my ($name2) = $line =~ m/$start_string(.+)$stop_string/si; 

$name = $name2;

When I pass in the content from a variable, that is when it does not work properly:


 
$start_string = $variable_start; 

$stop_string = $variable_stop; 
 
my ($name2) = $line =~ m/$start_string(.+)$stop_string/si; 

$name = $name2;

where

$variable_start = "<title>";
$variable_stop = "</title>";

come from another routine.

Does that help?

Thanks,

Yves

Apr 2 '11 #2

miller

1,089

Recognized Expert Top Contributor

I made the mistake of creating a detailed post that took too long and therefore bytes.com lost it. Very annoying. I'm therefore going to make this one really quick.

Using a regex, be sure to not use greedy matching:


 
use strict;
 
my $data = do {local $/; <DATA>};
 
my $title = $data =~ m{<title>(.*?)</title>}s ? $1 : warn "Can't find title";
 
print "$title\n";
 
__DATA__

<html>

<head>

<title>My Title</title>

</head>

<body>

Hello World

</body>

</html>
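
For anyone wondering what "greedy" means in practice, here is a minimal sketch. It assumes the string being searched contains more than one closing tag, which can easily happen once a whole page is in one string. The sample string and the variable names `$html`, `$greedy` and `$lazy` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: two <title>...</title> pairs in one string.
my $html = "<title>First</title><p>body</p><title>Second</title>";

# Greedy: .+ grabs as much as it can, so the match runs to the LAST </title>.
my ($greedy) = $html =~ m{<title>(.+)</title>}si;

# Non-greedy: .*? grabs as little as it can, so the match stops at the FIRST </title>.
my ($lazy) = $html =~ m{<title>(.*?)</title>}si;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture swallows everything up to the final closing tag, which would explain catching "a large part of the file", while the non-greedy one stops at the first closing tag.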

Using HTML::Parser, pulled straight from its example directory:


 
#!/usr/bin/perl
 
# This program will print out the title of an HTML document.
 
use HTML::Parser ();
 
use strict;
 
my $data = do {local $/; <DATA>};
 
sub title_handler {

    my $self = shift;

    $self->handler(text => sub { print @_ }, "dtext");

    $self->handler(end  => "eof", "self");

}
 
my $p = HTML::Parser->new(

    api_version => 3,

    start_h => [\&title_handler, "self"],

    report_tags => ['title'],

);

$p->parse($data);

$p->eof();
 
__DATA__

<html>

<head>

<title>My Title</title>

</head>

<body>

Hello World

</body>

</html>

Using XPath and HTML::TreeBuilder::XPath


 
#!/usr/bin/perl
 
use HTML::TreeBuilder::XPath;
 
use strict;
 
my $data = do {local $/; <DATA>};
 
my $tree = HTML::TreeBuilder::XPath->new;

$tree->parse( $data );

# Signal end of input so the tree is fully built before querying it
$tree->eof;

my $title = $tree->findvalue( '/html/head/title' );
 
print "$title\n";
 
__DATA__

<html>

<head>

<title>My Title</title>

</head>

<body>

Hello World

</body>

</html>

As you can see, each of these methods has its own advantages and specialties. I believe XPath is probably your best bet for this particular problem, but it's good to be familiar with each.

- Miller

Apr 3 '11 #3

yjulien

New Member

Hi Miller,

Great answer as usual :D

I will try the first one now, since it seems to me the easiest way to replace the existing code without major changes. However, I like the other one better in terms of neatness:


 my $title = $data =~ m{<title>(.*?)</title>}s ? $1 : warn "Can't find title";
 

Can it be modified like this?:


 
my $title = $data =~ m{$variable_start(.*?)$variable_stop}s ? $1 : warn "Can't find title";

As I explained above, I'm passing the value <title> in via an external file. But the title of the page I'm visiting will sometimes be found within other tags like <h1> </h1>, and sometimes it's not even within tags. It could be anything...

It's for the project I was telling you about the other day.

For my own understanding, what was the problem with my own code?


  
my ($name2) = $line =~ m/$start_string(.+)$stop_string/si; 

$name = $name2;

Why was it not always working?

Thanks,

Yves

Apr 3 '11 #4

yjulien

New Member

Hello again Miller,

There is something I must be doing wrong here or don't understand (probably both...).

I'm reading an external file that contains the web page and using a loop to read each line. It's like this:


 
 while (<DATABASE>) {

  $line = $_;

    chomp $line;
 
my $title = $line =~ m{<title>(.*?)</title>}s ? $1 : 

warn "Can't find title";
 
if ($title ne "") {print "title = $title<br>";}
 
 }                   # end of while

The output is always "1" and not the actual title or string I'm trying to extract. How can I modify the line to extract the text I'm looking for?

Apr 3 '11 #5

miller

1,089

Recognized Expert Top Contributor

I still say that XPath is the best choice for this type of problem, but here's the regex code with some additional comments that might help.


  
use strict;
 
local *DATABASE = *DATA;
 
# Your code begins here;
 
# Before and after strings to use in match

my $pre = "<title>";

my $post = "</title>";
 
# Slurp entire file in case pattern occurs across multiple lines;

my $data = do {local $/; <DATABASE>};
 
# Use .*? so the match is non-greedy.

# Use \Q...\E around the boundaries so that special regex characters are escaped;

#    the boundaries are meant to be literal strings.

# Use the 's' modifier so that '.' also matches newline characters.

# Use the 'i' modifier for case-insensitive matching.

my $text = $data =~ m{\Q$pre\E(.*?)\Q$post\E}si

    ? $1

    : warn "Can't find text";
 
print "$text\n";
 
# End your code
 
__DATA__

<html>

<head>

<title>My Title</title>

</head>

<body>

Hello World

</body>

</html>
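
A note on the "1" seen in the loop from post #5: `warn` itself returns a true value (1), so in a `? $1 : warn ...` expression the variable ends up holding 1 whenever the pattern does not match, which in a line-by-line loop is almost every line. One way around that is sketched below. It assumes a hypothetical helper named `extract_between` (not from the thread) that returns undef when nothing is found, so the caller can test for that explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between two literal markers,
# or undef if the markers are not found in $data.
sub extract_between {
    my ($data, $pre, $post) = @_;
    # A match in list context returns the capture, or an empty list on failure,
    # so $text stays undef when there is no match.
    my ($text) = $data =~ m{\Q$pre\E(.*?)\Q$post\E}si;
    return $text;
}

# Sample page held in one string, with the title split across two lines.
my $page = "<html>\n<head>\n<title>My\nTitle</title>\n</head>\n</html>\n";

my $title = extract_between($page, "<title>", "</title>");
if (defined $title) {
    print "title = $title\n";
} else {
    warn "Can't find title\n";
}
```

Because the markers are wrapped in \Q...\E, they can come from a file or another routine and contain characters such as `[` or `.` without being treated as regex syntax.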

Apr 3 '11 #6
