
How can I get my https file fetch working?

MimiMi
Hi y'all!
I'm new at perl, and I'm trying to automate a file fetch.

I have this url (in this example called 'https://GetMyFile') which, when I paste it into a browser, gives me the "File Download" pop-up asking "Do you want to open or save this file?". Clicking 'Save' gives me the file I want.

I would like to achieve the same result automatically, without having to paste the url into a browser, click 'Save', and specify where to save my file.

So, here's my first attempt:
---------------------------------------------------------------------
use strict;
use warnings;
use LWP::UserAgent;
use WWW::Mechanize;
use LWP::Debug qw(+);

# Note: the proxy is configured on $ua here, but the separate $mech
# object below never uses it (hence "Not proxied" in the debug output).
my $ua = LWP::UserAgent->new;
$ua->proxy( [qw( https http )], "myProxyAddress" );

my $url = "https://GetMyFile";

my $mech = WWW::Mechanize->new();

print "Fetching $url\n";
$mech->get( $url, ':content_file' => 'C:\Tmp\myFile.zip' );
die "Ooops, that didn't work: ", $mech->response->status_line
    unless $mech->success;
--------------------------------------------------------------------
The thing is, I don't get the "Ooops" printout; instead, myFile.zip is downloaded to the correct location, but the file is corrupted. It seems it isn't downloaded entirely: when I download it manually, the file is much bigger.
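
For what it's worth, here is a quick diagnostic I could bolt on right after the get() call, just to see what actually came back (a sketch only, using standard HTTP::Response accessors on $mech->response; the path matches my script above):

# Diagnostic sketch: inspect the final response after the get() above.
my $res = $mech->response;
print "Status: ", $res->status_line, "\n";
print "Type:   ", $res->content_type, "\n";
print "Length: ", ( $res->header('Content-Length') || 'unknown' ), "\n";
print "Bytes on disk: ", ( -s 'C:\Tmp\myFile.zip' ), "\n";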

Here are some of the debug printouts I get:

LWP::UserAgent::new: ()
LWP::UserAgent::proxy: ARRAY(someHexNumber) myProxyAddress
LWP::UserAgent::proxy: https myProxyAddress
LWP::UserAgent::proxy: http myProxyAddress
LWP::UserAgent::new: ()
LWP::UserAgent::request: ()
HTTP::Cookies::add_cookie_header: Checking GetMyFile for cookies
LWP::UserAgent::send_request: GET https://GetMyFile
LWP::UserAgent::_need_proxy: Not proxied
LWP::Protocol::http::request: ()
LWP::Protocol::collect: read 336 bytes
LWP::UserAgent::request: Simple response: Found
LWP::UserAgent::request: ()
HTTP::Cookies::add_cookie_header: Checking GetMyFile for cookies
LWP::UserAgent::send_request: GET https://GetMyFile
LWP::UserAgent::_need_proxy: Not proxied
LWP::Protocol::http::request: ()
LWP::Protocol::collect: read 439 bytes
LWP::Protocol::collect: read 176 bytes
LWP::UserAgent::request: Simple response: Found

... Then these printouts are repeated
...

LWP::UserAgent::_need_proxy: Not proxied
LWP::Protocol::http::request: ()
LWP::Protocol::collect: read 869 bytes
LWP::Protocol::collect: read 4096 bytes
LWP::Protocol::collect: read 4096 bytes
LWP::Protocol::collect: read 2395 bytes
LWP::UserAgent::request: Simple response: OK
Fetching https://GetMyFile
Any help or suggestions as to why I don't get the entire file would be greatly appreciated!

Cheers
Feb 4 '09 #1
12 Replies


KevinADC
Looks like it should work. Don't know what the problem is.
Feb 4 '09 #2

numberwhun
I agree with Kevin. Right off the bat, it looks like it might work, but I haven't gone through it thoroughly. What I can say is that you want to look at the book "Spidering Hacks". Specifically, this part here:

http://books.google.com/books?id=4M2...PXiAg#PPA60,M1

That will help you with a fetch using the Mechanize module.

Regards,

Jeff
Feb 5 '09 #3

MimiMi
Hi! Thank you so much, guys, for the quick feedback!
I now know, though, that the problem is related to credentials...
When, in the script, I change
$mech->get( $url, ':content_file' => 'C:\Tmp\myFile.zip' );
to
$mech->get( $url, ':content_file' => 'C:\Tmp\myFile.html' );
I can see that the downloaded file is indeed a webpage, and more specifically a login page.

I don't really know how to solve this, though; I will have to investigate further.
There is some auto-login ASP session involved when fetching files from where I want to fetch them. The browser probably handles a lot of that behind the scenes, and I don't know exactly what's going on, which of course I must figure out in order to get my script to work. These enterprise networks... *sigh* :)
Feb 5 '09 #4

numberwhun
@MimiMi
Check out the module documentation on CPAN for WWW::Mechanize. I am pretty positive that it provides options for logging in to such pages; you just have to code for it.

I don't know if it will help any, but here is a script I wrote a while ago that logs into a website (you had to log in before you could see the list of files) and then downloads everything that was there:

#!/usr/bin/perl

use strict;
use warnings;
use File::Basename;
use WWW::Mechanize;
use MIME::Base64;

$|++;    # turn off output buffering

my $username = "username";
my $password = "password";
my $url      = "http://www.site.com/page.asp";
my $realm;   # undef realm matches any realm for these credentials
my $tempfile = "temp.txt";

my $agent = WWW::Mechanize->new();

# Build a Basic auth header by hand; the '' second argument stops
# encode_base64() from appending a newline, which would break the header.
my @args = (
    Authorization => "Basic " . MIME::Base64::encode_base64( $username . ':' . $password, '' )
);

$agent->credentials( $url, $realm, $username, $password );

$agent->get( $url, @args );

Obviously, the site name, username, and password have all been changed to protect the innocent; replace the values above with whatever you are using.


Regards,

Jeff
Feb 5 '09 #5

KevinADC
Look into Win32::IE::Mechanize, which can handle a lot more things than WWW::Mechanize can.
Feb 5 '09 #6

MimiMi
Hello again!
I appreciate all your efforts to help me out here!

I've been working on other things, but now it's time to get back to this. (I still haven't got it working.)

Here's my current status:

The myFile.html I get from

$mech->get( $url, ':content_file' => 'C:\Tmp\myFile.html' );

(see previous posts if I'm unclear) has JavaScript in it. Here are some parts of the HTML file (including the JavaScript):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
    <head>
        <title>TheCompany Portal Login</title>
        <link type="text/css" rel="stylesheet" href="styles.css">
        <META HTTP-EQUIV="Pragma" CONTENT="no-cache">
        <META HTTP-EQUIV="Expires" CONTENT="-1">
        <meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type">

        <SCRIPT LANGUAGE="JavaScript">
            function resetCredFields()
            {
                document.Login.PASSWORD.value = "";
            }

            function submitForm()
            {
                document.Login.submit();
            }

            function cancelLogin()
            {
                window.history.go(-1);
            }

            if (top.frames.length > 1)
            {
                top.location.href = document.location;
            }

            function checkEnter(event)
            {
                var code = 0;
                NS4 = (document.layers) ? true : false;
                if (NS4)
                    code = event.which;
                else
                    code = event.keyCode;
                if (code==13)
                    document.Login.submit();
            }

        </SCRIPT>

    </head>

<BODY topmargin="0" leftmargin="0" marginwidth="0" marginheight="0">

<table height="95%" width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
... And so on and so forth..

I don't know anything about how WWW::Mechanize could work with JavaScript... is that even possible? If so, how can I provide the JavaScript with the right credentials?
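
Or could I perhaps skip the JavaScript entirely and submit the form itself? Something like this rough sketch, maybe (the form name 'Login' and the PASSWORD field are taken from the HTML above; 'USERNAME' and the credential values are my guesses, so the real <input> names would need checking in the page source):

use strict;
use warnings;
use WWW::Mechanize;

my $url      = "https://GetMyFile";   # the file URL from my earlier posts
my $username = "user";                # placeholder
my $password = "pwd";                 # placeholder

my $mech = WWW::Mechanize->new();
$mech->get( $url );                   # this currently returns the login page

# Bypass the JavaScript and submit the login form directly.
# Form name "Login" and field "PASSWORD" come from the HTML above;
# "USERNAME" is a guess -- check the real <input> names in the source.
$mech->submit_form(
    form_name => 'Login',
    fields    => {
        USERNAME => $username,
        PASSWORD => $password,
    },
);

# If the login sticks (cookies are kept by the same $mech object),
# fetching the file URL again should return the real file.
$mech->get( $url, ':content_file' => 'C:\Tmp\myFile.zip' );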

Cheers
Feb 27 '09 #7

MimiMi
Sorry, sorry... I shouldn't waste your time with questions such as whether WWW::Mechanize works with JavaScript; that wasn't hard to find out for myself. The answer is NO. Unfortunately.

Have to figure out how to solve this some other way, then... :/

Cheers
Feb 27 '09 #8

KevinADC
I guess you missed my previous post:

Look into Win32::IE::Mechanize which can handle a lot more things than WWW::Mechanize can
Feb 27 '09 #9

MimiMi
Hi!
KevinADC: Yes, that's right, I missed looking into Win32::IE::Mechanize, sorry about that!
Now I've started looking into it, and it seems to fill my needs somewhat better. It feels like I'm almost there, but I still don't see how to get my files downloaded without manually providing any user input whatsoever.

As of now, an IE browser starts up and I get to the download-file prompt, but I don't want to have to click 'Save' manually and provide a location etc. Plus, I don't want IE to show at all. Is that possible?

This script is to be run on a server, so I want everything to be "invisible"...

Here's my current script:
use strict;
use warnings;
use Win32::IE::Mechanize;

# visible => 1 shows the IE window; visible => 0 should keep it hidden.
my $ie = Win32::IE::Mechanize->new( visible => 1 );

my $username = "user";
my $password = "pwd";

my $url = "http://weblink.To.TheFile";
my $realm;

$ie->credentials( 'myHostname:myPort', $realm, $username, $password );

print "Fetching $url\n";
$ie->get( $url, ':content_file' => 'C:\Temp\result\result.zip' );
die "Ooops, this didn't work: ", $ie->response->status_line unless $ie->success;

Mar 9 '09 #10

KevinADC
Sorry, but I don't know the answer or have any suggestions for your last questions. All I can suggest is to carefully read the module's documentation and see if there is anything there that can help you solve those parts of your problem.
Mar 9 '09 #11

Icecrack
That is not a Perl issue; that is a browser issue. You would have to look into your browser settings, or use the first version of Google Chrome, which started the download as soon as a file was clicked. (This behavior was changed in newer versions because it is a security risk, which is why they have a 'Save' option.)
Mar 9 '09 #12

Why not use wget?

No need for a big Perl script, assuming you're running *nix.

man wget

You can use it to imitate a browser, including login information and site cookies, while downloading files or webpages.

Example:

wget -m -c --convert-links --user="Mister Man" --password=PreTTyPlease --load-cookies cookie.txt --user-agent="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)" http://www.your-special-site.com/get-that-archive.zip
You can automate the process via a cronjob on your *nix server to get those files from the remote location.
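
For example, a crontab entry along these lines (the schedule and download directory here are made up) would re-run the fetch every night at 2 a.m.:

# Hypothetical crontab entry: fetch the archive every night at 02:00.
0 2 * * * /usr/bin/wget -q -c --user="Mister Man" --password=PreTTyPlease -P /home/user/downloads http://www.your-special-site.com/get-that-archive.zip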

Of course, there are some security issues with putting your password into a shell command, and anyone with access to your crontab will be able to see it in plain text... but there are other options if you need more security.
Apr 17 '10 #13
