I am trying to write a web scraper and am having trouble accessing pages that require authentication. I am attempting to use the mechanize library, but am having difficulties. The site I am trying to log in to is http://www.princetonreview.com/Login3.aspx?uidbadge=
user: bugmenot2008@yahoo.com
pass: letmeinalready
Previously I did something similar with another site, schoolfinder.com. Here is my code for that:
import cookielib
import sys
import urllib
import urllib2

# Keep cookies between requests so the login session is remembered.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
resp = opener.open('http://schoolfinder.com')  # first visit, saves a cookie

theurl = 'http://schoolfinder.com/login/login.asp'  # the login URL, which sets a cookie
body = {'usr': 'greenman', 'pwd': 'greenman'}
txdata = urllib.urlencode(body)  # encode the form fields for the POST request
# Fake a user agent; some websites (like Google) don't like automated exploration.
txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

try:
    req = urllib2.Request(theurl, txdata, txheaders)  # create a request object
    handle = opener.open(req)  # open it to get a handle on the URL
    HTMLSource = handle.read()
    f = file('test.html', 'w')
    f.write(HTMLSource)
    f.close()
except IOError, e:
    print 'We failed to open "%s".' % theurl
    if hasattr(e, 'code'):
        print 'We failed with error code - %s.' % e.code
    elif hasattr(e, 'reason'):
        print "The error object has the following 'reason' attribute :", e.reason
        print "This usually means the server doesn't exist, is down, or we don't have an internet connection."
    sys.exit()
else:
    print 'Here are the headers of the page :'
    # handle.read() returns the page; handle.geturl() returns the true URL of
    # the page fetched (in case the opener has followed any redirects).
    print handle.info()

This method does not work on the Princeton Review site, however. Interestingly, I cannot even get mechanize to access the schoolfinder.com site. Here is the code I am using:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import mechanize

theurl = 'http://www.princetonreview.com/Login3.aspx?uidbadge='
mech = mechanize.Browser()
mech.open(theurl)

mech.select_form(nr=0)
mech["ctl00$MasterMainBodyContent$txtUsername"] = "bugmenot2008@yahoo.com"
mech["ctl00$MasterMainBodyContent$txtPassword"] = "letmeinalready"
results = mech.submit().read()

f = file('test.html', 'w')
f.write(results)  # write to a test file
f.close()

This code is so short and I just cannot figure out what I am doing wrong. What is incorrect about this? Thank you in advance.
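
One thing that may be worth checking, offered as an assumption rather than a diagnosis: the field names above (ctl00$...) look like those generated by ASP.NET WebForms, and such login forms normally carry hidden inputs (for example __VIEWSTATE) that the server expects to receive back along with the credentials. A hand-built POST body containing only a user name and password, as in the urllib2 approach, would omit them. The sketch below is written for Python 3 using only the standard library, unlike the Python 2 code above. It shows one way to collect the hidden inputs from a login page and merge them with the credentials. The sample markup, the helper names (HiddenFieldCollector, build_login_body) and the placeholder credentials are all hypothetical; the real page may use different fields.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_login_body(page_html, credentials):
    """Return a URL-encoded POST body: hidden fields plus the credentials."""
    collector = HiddenFieldCollector()
    collector.feed(page_html)
    body = dict(collector.fields)
    body.update(credentials)  # credentials override any same-named field
    return urlencode(body)


# Hypothetical markup standing in for the downloaded login page.
SAMPLE_PAGE = """
<form method="post" action="Login3.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="text" name="ctl00$MasterMainBodyContent$txtUsername" />
  <input type="password" name="ctl00$MasterMainBodyContent$txtPassword" />
</form>
"""

# Placeholder credentials, not a real account.
body = build_login_body(SAMPLE_PAGE, {
    "ctl00$MasterMainBodyContent$txtUsername": "user@example.com",
    "ctl00$MasterMainBodyContent$txtPassword": "secret",
})
print(body)
```

The resulting string could be passed as the data argument of a POST request in place of a hand-built dictionary. Whether this is what the Princeton Review page actually requires is not confirmed here.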