I am trying to write a web scraper and am having trouble accessing pages that require authentication. I am attempting to use the mechanize library, but cannot get it to work. The site I am trying to log in to is http://www.princetonreview.com/Login3.aspx?uidbadge=
user: bugmenot2008@yahoo.com
pass: letmeinalready
Previously I did something similar for another site, schoolfinder.com. Here is my code for that:

    import cookielib
    import sys
    import urllib
    import urllib2

    # Keep cookies between requests so the login session persists.
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    resp = opener.open('http://schoolfinder.com')  # save a cookie

    theurl = 'http://schoolfinder.com/login/login.asp'  # the login URL, which sets a cookie
    body = {'usr': 'greenman', 'pwd': 'greenman'}
    txdata = urllib.urlencode(body)  # encode the form values for the POST request
    txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}  # fake a user agent, some websites (like google) don't like automated exploration

    try:
        req = urllib2.Request(theurl, txdata, txheaders)  # create a request object
        handle = opener.open(req)  # and open it to return a handle on the url
        HTMLSource = handle.read()
        f = open('test.html', 'w')
        f.write(HTMLSource)
        f.close()

    except IOError, e:
        print 'We failed to open "%s".' % theurl
        if hasattr(e, 'code'):
            print 'We failed with error code - %s.' % e.code
        elif hasattr(e, 'reason'):
            print "The error object has the following 'reason' attribute :", e.reason
            print "This usually means the server doesn't exist, is down, or we don't have an internet connection."
        sys.exit()

    else:
        print 'Here are the headers of the page :'
        print handle.info()  # handle.read() returns the page, handle.geturl() returns the true url of the page fetched (in case urlopen has followed any redirects, which it sometimes does)

This method does not work on the Princeton Review site, however. Interestingly, I cannot even get mechanize to access the schoolfinder.com site. Here is the mechanize code I am using for the Princeton Review login:

    #!/usr/bin/env python
    # -*- coding: UTF-8 -*-
    import mechanize

    theurl = 'http://www.princetonreview.com/Login3.aspx?uidbadge='
    mech = mechanize.Browser()
    mech.open(theurl)

    mech.select_form(nr=0)
    mech["ctl00$MasterMainBodyContent$txtUsername"] = "bugmenot2008@yahoo.com"
    mech["ctl00$MasterMainBodyContent$txtPassword"] = "letmeinalready"
    results = mech.submit().read()

    f = open('test.html', 'w')
    f.write(results)  # write to a test file
    f.close()

This code is so short, yet I cannot figure out what I am doing wrong. What is incorrect about it? Thank you in advance.
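
One thing I have not ruled out: the login page is an .aspx page, and ASP.NET forms normally carry hidden fields such as __VIEWSTATE and __EVENTVALIDATION that are expected to come back with the POST. My urllib2 approach sends only the username and password, so that may be why it fails there. Below is a rough sketch of how the hidden fields could be collected and merged with the credentials. The sample page, its field values, and the helper names (HiddenFieldParser, build_login_fields) are made up for illustration; only the two ctl00$... field names come from my code above, and I have not confirmed that this is what the real site requires.

```python
try:  # Python 3
    from html.parser import HTMLParser
    from urllib.parse import urlencode
except ImportError:  # Python 2, as used in the rest of this post
    from HTMLParser import HTMLParser
    from urllib import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of every <input type="hidden"> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get('type') or '').lower() == 'hidden'
        if is_hidden and attributes.get('name'):
            self.fields[attributes['name']] = attributes.get('value') or ''


def build_login_fields(page_html, username, password):
    """Merge the page's hidden fields with the login credentials."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    fields = dict(parser.fields)
    fields['ctl00$MasterMainBodyContent$txtUsername'] = username
    fields['ctl00$MasterMainBodyContent$txtPassword'] = password
    return fields


# Hypothetical stand-in for the real login page; the values are invented.
SAMPLE_PAGE = '''
<form method="post" action="Login3.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="ctl00$MasterMainBodyContent$txtUsername" />
  <input type="password" name="ctl00$MasterMainBodyContent$txtPassword" />
</form>
'''

fields = build_login_fields(SAMPLE_PAGE, 'user@example.com', 'secret')
post_body = urlencode(fields)  # would take the place of txdata in the urllib2 code
```

In the real script, page_html would be the result of opener.open(theurl).read() on the login page, and post_body would be passed to urllib2.Request in place of txdata. The form may also expect the submit button's own name/value pair, which this sketch does not handle.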