
Help to automatically traverse a login session

The subject may sound a little cryptic, so I'll try my best to explain.
Details are unavailable, as I am under a nondisclosure agreement, but
I'm looking for general principles and tips, not necessarily fixes for
existing code.

There is a website that requires me to log in using a web form.
Obviously, POST vars are sent and verified, and on success I'm given a
session and/or cookie. Within this logged-in area, there are links
leading to data query result pages - "Click here for your recent
transactions" kind of thing.

Those results pages are what I want to get to, but through some kind of
script that parses the results that get served out, not through user
interaction. I want to send a request for a link within that logged-in
area and have the results served to my script, then parse specific
data out of those results and in turn serve it to a user in my own
page.

I know that sounds shady, but the login is legitimate, the data access
is legitimate, and the credentials are valid. The problem is, I
can't request a direct database link to the server hosting the actual
data because of the nondisclosure agreement: it would require
divulging the reasons for needing such access, which my employer
is not willing to reveal at this time.

If anyone can offer ideas or help and wishes to keep possible
answers off the public board, please email me. I realize this
is a long shot, and I doubt that even if there IS a way, anyone
would be willing or able to help. But I've got to try.

Thanks all.
-joe

Aug 10 '06 #1


joe t. wrote:
> The subject may sound a little cryptic, so I'll try my best to explain.
> Details are unavailable, as I am under a nondisclosure agreement, but
> I'm looking for general principles and tips, not necessarily fixes for
> existing code.
<snip long-winded explanation>

So you want to copy someone else's data, and the only interface you have
to the remote system is an HTTP one intended for humans.

There are plenty of companies doing this already - no need to be shy.

How simple it is depends on how well their site is written - assuming it is
well written, you should be able to parse the pages with an XML parser. How
to get the pages? That's rather up to you - you could use a site ripper
like pavuk or write your own spider, e.g. using Snoopy.
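
For instance, a rough sketch with Snoopy plus the DOM extension might look
like this (untested; the URLs, form field names and table id are made-up
placeholders, and if I remember right Snoopy carries the session cookie
across requests on the same object by itself):

<?php
// Assumed: Snoopy.class.php is on the include path; the login URL, the
// form field names and the table id below are placeholders.
require 'Snoopy.class.php';

$snoopy = new Snoopy;

// POST the login form, then fetch the protected page with the same object
// so the session cookie from the login response gets sent back.
$snoopy->submit('https://example.com/login.php',
                array('user' => 'me', 'pass' => 'secret'));
$snoopy->fetch('https://example.com/recent_transactions.php');

// Parse the returned HTML with the DOM extension and pick out table rows.
$dom = new DOMDocument();
@$dom->loadHTML($snoopy->results);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//table[@id="transactions"]//tr') as $row) {
    echo trim($row->textContent), "\n";
}
?>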

HTH

C.
Aug 10 '06 #2

On 10 Aug 2006 14:25:33 -0700, "joe t." <th*******@gmail.com> wrote:
> There is a website that requires me to log in using a web form.
> Obviously, POST vars are sent and verified, and on success I'm given a
> session and/or cookie. Within this logged-in area, there are links
> leading to data query result pages - "Click here for your recent
> transactions" kind of thing.
>
> Those results pages are what I want to get to, but through some kind of
> script that parses the results that get served out, not through user
> interaction. I want to send a request for a link within that logged-in
> area and have the results served to my script, then parse specific
> data out of those results and in turn serve it to a user in my own
> page.
>
> I know that sounds shady, but the login is legitimate, the data access
> is legitimate, and the credentials are valid. The problem is, I
> can't request a direct database link to the server hosting the actual
> data because of the nondisclosure agreement: it would require
> divulging the reasons for needing such access, which my employer
> is not willing to reveal at this time.
>
> If anyone can offer ideas or help and wishes to keep possible
> answers off the public board, please email me. I realize this
> is a long shot, and I doubt that even if there IS a way, anyone
> would be willing or able to help. But I've got to try.
Whilst this sort of situation is never the best way of doing things, sometimes
it's the only way. If you really do have to go down this route, then there is a
particularly nice Perl module called WWW::Mechanize.

Obviously it's not PHP, but you can call Perl from PHP.

http://search.cpan.org/search?query=...anize&mode=all

Whilst you're in Perl, it also has various HTML parsing modules, the most
obvious being HTML::Parser, which can deal with HTML even if it's of dubious
quality.

http://search.cpan.org/~gaas/HTML-Parser-3.55/Parser.pm

So combined you can have a Perl script that does all the hard stuff and then
returns its results in an easily machine-readable form to PHP.
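
The PHP side of that can stay trivial - something like this (just a sketch;
fetch_transactions.pl is a made-up name for the Mechanize script, which I'm
assuming prints one record per line, tab-separated):

<?php
// Hypothetical glue: fetch_transactions.pl (not shown) is assumed to be the
// WWW::Mechanize script that logs in, grabs the results page and prints one
// record per line, with fields separated by tabs.
$user = 'me';
$pass = 'secret';

$raw = shell_exec('perl fetch_transactions.pl '
                  . escapeshellarg($user) . ' ' . escapeshellarg($pass));

$transactions = array();
foreach (explode("\n", trim($raw)) as $line) {
    $transactions[] = explode("\t", $line);  // naive split, fine for a sketch
}

print_r($transactions);
?>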

--
Andy Hassall :: an**@andyh.co.uk :: http://www.andyh.co.uk
http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool
Aug 10 '06 #3

In article <11*********************@74g2000cwt.googlegroups.com>,
joe t. <th*******@gmail.com> wrote:
> There is a website that requires me to log in using a web form.
> ...
> Those results pages are what I want to get to, but through some kind of
> script that parses the results that get served out, not through user
> interaction.
I once did this to gather a huge amount of historical data from a
horse-racing web site. I had to write the application in Java. It
would log in with my userID and password, submit queries to forms,
save the HTML result pages sent back, then parse the tabular data in
those pages into comma-delimited text data.

It was a much bigger project than I anticipated. I suspect there
are some macro automation tools out there that will let you do it
more easily.

-Alex
Aug 11 '06 #4


axlq wrote:
> In article <11*********************@74g2000cwt.googlegroups.com>,
> joe t. <th*******@gmail.com> wrote:
>> There is a website that requires me to log in using a web form.
>> ...
>> Those results pages are what I want to get to, but through some kind of
>> script that parses the results that get served out, not through user
>> interaction.
>
> I once did this to gather a huge amount of historical data from a
> horse-racing web site. I had to write the application in Java. It
> would log in with my userID and password, submit queries to forms,
> save the HTML result pages sent back, then parse the tabular data in
> those pages into comma-delimited text data.
>
> It was a much bigger project than I anticipated. I suspect there
> are some macro automation tools out there that will let you do it
> more easily.
>
> -Alex

Thanks all of you for the suggestions. I will investigate these options
and report back on any success.
-joe

Aug 11 '06 #5

joe t. wrote:
<snip>
> There is a website that requires me to log in using a web form.
> Obviously, POST vars are sent and verified, and on success I'm given a
> session and/or cookie. Within this logged-in area, there are links
> leading to data query result pages - "Click here for your recent
> transactions" kind of thing.
>
> Those results pages are what I want to get to, but through some kind of
> script that parses the results that get served out, not through user
> interaction. I want to send a request for a link within that logged-in
> area and have the results served to my script, then parse specific
> data out of those results and in turn serve it to a user in my own
> page.
<snip>

Such "web scraping" can be done with cURL <http://in.php.net/curl>
(need to set cookie support). Not all sites would allow web scraping
and will try to block automation with "CAPTCHA" (google it). Some sites
will even use Ajax based rendering which will then make the cURL
process a big tough (though I heard that cURL can work with Mozilla
JavaScript engine). In that case, it will be better to go for Delphi or
VB 6 as we can use WebBrowser component and can automate clicks, etc
with DOM object.
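
A bare-bones login-then-fetch with the cURL extension might look roughly
like this (sketch only - the URLs, form field names and cookie-file path
are placeholders, not details of any real site):

<?php
// Rough sketch only - every URL and field name here is a placeholder.
$jar = '/tmp/cookies.txt';

// 1. POST the login form. CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE make cURL
//    store the session cookie and send it back on later requests.
$ch = curl_init('https://example.com/login.php');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS,
            http_build_query(array('user' => 'me', 'pass' => 'secret')));
curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);
curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);

// 2. Request the protected results page with the same handle and cookie jar.
curl_setopt($ch, CURLOPT_URL, 'https://example.com/recent_transactions.php');
curl_setopt($ch, CURLOPT_HTTPGET, true);
$html = curl_exec($ch);
curl_close($ch);

// $html now holds the results page, ready for whatever parsing you need.
?>

CURLOPT_FOLLOWLOCATION is there because many login forms redirect after a
successful POST; re-using one handle plus the cookie jar is what keeps the
session alive between the two requests.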

--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/

Aug 13 '06 #6

