LWP questions


I'm returning to Perl and Linux after many years away and while I
know/knew way back when about Perl and Unix I'm new to this world
today.

I'm considering using LWP as the heart of a Web application and have a
number of questions.

It appears to me that the get method returns ONLY the content of the
single object referenced by the URL. Is this correct? To what
degree, if any, does LWP's get deal with scripts on the page that may be
involved in building the page content?

In the end, I need to get a page in much the same way a browser does
and then examine it, looking at the text on the page (as it would be
rendered by IE or Mozilla) for a bunch of stuff. I also need to
examine the HTML, as it exists in the abstract for the page as actually
displayed, for a bunch of stuff. On XP (no flames please, surely Perl
programmers can forgive an attachment to the ugly real world) the IE
object model has two properties, innerText and innerHTML. innerText is a
linearized version of the text as displayed on the page AFTER all
scripts have executed. innerHTML seems to be the HTML that would
exist to create the page AFTER all scripts have executed. It is this
kind of structure that I need. Can LWP help me here? What is the
basic attack? Are there any examples in the Perl world?

Thanks for any help/clues.

R
Jul 19 '05 #1
On Tue, 16 Mar 2004 at 18:01 GMT, Richard Bell <rb********@earthlink.net> wrote:
I'm considering using LWP as the heart of a Web application and have a
number of questions.


LWP does not render the page, nor does it execute (client-side)
scripts, nor does it provide you with a DOM. However, you can
get the HTML using LWP and parse that with any of the available
HTML parsers (e.g., HTML-TreeBuilder).
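#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use LWP::Simple;

# Local file that holds the mirrored copy of the page.
my $cachefile = 'mirrored.htm';

# mirror() only downloads when the remote copy is newer than the local file.
my $status = mirror('http://cpan.org', $cachefile);
die "mirror failed with HTTP status $status\n"
    unless is_success($status) || $status == RC_NOT_MODIFIED;

# Parse the saved HTML into a tree.
my $tree = HTML::TreeBuilder->new_from_file($cachefile);

# Find the first <table> element and print its text content.
my $table = $tree->look_down('_tag', 'table');
print $table->as_text, "\n" if $table;

# Free the tree explicitly.
$tree->delete;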
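
For the innerText/innerHTML comparison, the nearest thing a parsed tree offers is probably as_text and as_HTML on an element. What follows is only a rough sketch: body_views is a made-up helper name, the sample markup is invented, and both views reflect the static HTML as it came from the server, with no scripts executed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of any module): returns the text view and
# the markup view of the <body> of a static HTML string.
sub body_views {
    my ($html) = @_;
    my $tree   = HTML::TreeBuilder->new_from_content($html);
    my $body   = $tree->look_down(_tag => 'body');

    my $text   = $body ? $body->as_text : '';   # text parts only
    my $markup = $body ? $body->as_HTML : '';   # serialized HTML

    $tree->delete;                              # free the tree
    return ($text, $markup);
}

# Invented sample page, used only for illustration.
my ($text, $markup) =
    body_views('<html><body><h1>Hello</h1><p>World</p></body></html>');
print "text: $text\n";
print "html: $markup\n";
```

Note that as_text simply concatenates the text parts, so it is not a rendering of the page layout the way a browser's linearized text is.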
Jul 19 '05 #2
Thanks Roel, that was very helpful.

For my application, I need something that will do all such things as
might happen in a real browser that would create user visible content
on the screen. For many of the pages I'll be working with that
includes various client side scripts and includes. While LWP gets
part of the way, it doesn't seem to go as far as this project needs.

As I mentioned, I'm newly returned to Unix/Linux and Perl. Is there
something that might be more appropriate? I have some previous
experience with IE COM automation under XP. Can I play the same sort of
game (or hopefully a simpler one) under Linux? What do I use for an
engine? Can I get by with wget (it seems to do a good job of
mirroring)? Will I need to work with Mozilla?

I'd appreciate any advice.

Thanks again.

R

On 17 Mar 2004 00:55:24 GMT, Roel van der Steen <ro*******@st2x.net>
wrote:
LWP does not render the page, nor does it execute (client-side)
scripts, nor does it provide you with a DOM. However, you can
get the HTML using LWP and parse that with any of the available
HTML parsers (e.g., HTML-TreeBuilder).
Jul 19 '05 #3
(Top-posting reordered.)

On Wed, 17 Mar 2004 at 01:50 GMT, Richard Bell <rb********@earthlink.net> wrote:
For many of the pages I'll be working with that
includes various client side scripts and includes.

Maybe HTML::Display is more in the direction you want. Or WWW::Mechanize.
Have you already had a look at http://cpan.org ?
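
To give an idea of what WWW::Mechanize looks like in use, here is a minimal sketch. The helper name page_summary is made up for illustration and the URL is supplied on the command line. Like LWP, which it is built on, Mechanize does not execute client-side scripts, so this only sees what the server sent.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper (not part of WWW::Mechanize): fetch a page and
# return its title plus the absolute URL of every link found on it.
sub page_summary {
    my ($url) = @_;
    my $mech = WWW::Mechanize->new(autocheck => 1);   # die on HTTP errors
    $mech->get($url);

    my @links = map { $_->url_abs->as_string } $mech->links;
    return { title => $mech->title, links => \@links };
}

# Only fetch when a URL is given, e.g.:  perl summary.pl http://cpan.org
if (@ARGV) {
    my $summary = page_summary($ARGV[0]);
    my $title   = defined $summary->{title} ? $summary->{title} : '(none)';
    print "Title: $title\n";
    print "Link:  $_\n" for @{ $summary->{links} };
}
```

Mechanize also has methods for following links and submitting forms, which may matter more for your application than the link listing shown here.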
Jul 19 '05 #4
On 17 Mar 2004 03:12:38 GMT, Roel van der Steen <ro*******@st2x.net>
wrote:
Maybe HTML::Display is more in the direction you want. Or WWW::Mechanize.


Thanks, I'll look into HTML::Display and WWW::Mechanize. I picked up
the O'Reilly books and am also checking the web on these packages, but
the learning curve right now is a bit stiff, particularly when I'm not
really sure where to look or what to look at. Thanks for your help.

Did you already have a look at http://cpan.org ?


I have checked cpan. Lots of apparently good stuff there, but again
I'm faced with not knowing what is really appropriate for my needs.

I've thought about trying to automate Mozilla and accessing its DOM
object to get at what I want. Do you have any reflections on that
attack?

Thanks again for the new clues.

R

Jul 19 '05 #5
Richard Bell wrote:
Thanks Roel, that was very helpful.

For my application, I need something that will do all such things as
might happen in a real browser that would create user visible content
on the screen. For many of the pages I'll be working with that
includes various client side scripts and includes. While LWP gets
part of the way, it doesn't seem to go as far as this project needs.


When LWP requests a page from a server, it is no different from any
other browser's request, in that the server will process server-side
includes.

If the HTML returned contains JavaScript, it is up to you to provide
a JavaScript interpreter. I've seen many JavaScript functions that
do things like ask the graphical browser they are running in for the
size (in pixels) of the currently active window so that they can
decide on the layout of the text they will be writing to the
document window. Other JavaScript uses include reading or
modifying the text being displayed in a field of a form. (Think of
<input type="text" name="clock" value="12:45:00 pm">.)

In other words, to handle a full range of client-side scripts,
you will have to re-invent a very large wheel: a complete browser
with graphical display and GUI widgets.

LWP is good at getting the raw HTML from the server. Postprocessing
the HTML on the client side before, during, and after rendering is
an entirely different kettle of fish.
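
To see how much of a given page depends on scripts, one rough approach is to list the script elements that a static parse leaves untouched. The sketch below assumes HTML::TreeBuilder is installed; unexecuted_scripts is a made-up helper name and the sample page, including layout.js, is invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of any module): describe every <script>
# element that a static parse finds but never runs.
sub unexecuted_scripts {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @found;
    for my $script ($tree->look_down(_tag => 'script')) {
        my $src = $script->attr('src');
        push @found, defined $src ? "external: $src" : 'inline';
    }

    $tree->delete;    # free the tree
    return @found;
}

# Invented sample page: one external script and one inline script whose
# output a browser would add to the page but a parser never sees.
my $page = <<'HTML';
<html><head><script src="layout.js"></script></head>
<body><p>Static text</p>
<script>document.write("Generated text");</script>
</body></html>
HTML

print "$_\n" for unexecuted_scripts($page);
```

Anything these scripts would have written into the page is simply absent from the parsed tree, which is the gap a real browser engine fills.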

I certainly would not want to emulate the quirks (features, bugs) of
IE 6 vs IE 5 vs Netscape vs Mozilla vs Opera.
-Joe

Jul 19 '05 #6

No one ever said it would be easy.

I'm now looking into automating Mozilla (let it do the heavy lifting),
possibly from Perl, possibly using the Mozilla application
environment. Any ideas where I can get clues/examples/insight into
the issues from the Perl side? I've got the O'Reilly book for the app
environment, so I'm reasonably armed there.

Richard

On Sat, 20 Mar 2004 22:33:08 GMT, Joe Smith <Jo*******@inwap.com>
wrote:
In other words, to handle a full range of client-side scripts,
you will have to re-invent a very large wheel: a complete browser
with graphical display and GUI widgets.


Jul 19 '05 #7
