Help | Site Map
Connecting Tech Pros Worldwide
Reply
 
LinkBack Thread Tools
  #1  
Old September 5th, 2008, 05:29 AM
Newbie
 
Join Date: Feb 2008
Posts: 6
Default Using mechanize to do website authentication

I am trying to write a web scraper and am having trouble accessing pages that require authentication. I am attempting to utilise the mechanize library, but am having difficulties. The site I am trying to login is http://www.princetonreview.com/Login3.aspx?uidbadge=

user: bugmenot2008@yahoo.com
pass: letmeinalready

Previously I did something similar to another site: schoolfinder.com. Here is my code for that:

Expand|Select|Wrap|Line Numbers
  1. import cookielib
  2. import urllib
  3. import urllib2
  4.  
  5. cj = cookielib.CookieJar()
  6. opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
  7. resp = opener.open('http://schoolfinder.com') # save a cookie
  8.  
  9. theurl = 'http://schoolfinder.com/login/login.asp' # an example url that sets a cookie, try different urls here and see the cookie collection you can make !
  10. body={'usr':'greenman','pwd':'greenman'}
  11. txdata = urllib.urlencode(body) # if we were making a POST type request, we could encode a dictionary of values here - using urllib.urlencode
  12. txheaders =  {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'} # fake a user agent, some websites (like google) don't like automated exploration
  13.  
  14.  
  15. try:
  16.     req = urllib2.Request(theurl, txdata, txheaders) # create a request object
  17.     handle = opener.open(req) # and open it to return a handle on the url
  18.     HTMLSource = handle.read()
  19.     f = file('test.html', 'w')
  20.     f.write(HTMLSource)
  21.     f.close()
  22.  
  23. except IOError, e:
  24.     print 'We failed to open "%s".' % theurl
  25.     if hasattr(e, 'code'):
  26.         print 'We failed with error code - %s.' % e.code
  27.     elif hasattr(e, 'reason'):
  28.         print "The error object has the following 'reason' attribute :", e.reason
  29.         print "This usually means the server doesn't exist, is down, or we don't have an internet connection."
  30.         sys.exit()
  31.  
  32. else:
  33.     print 'Here are the headers of the page :'
  34.     print handle.info() # handle.read() returns the page, handle.geturl() returns the true url of the page fetched (in case urlopen has followed any redirects, which it sometimes does)
  35.  
This method does not work on the Princeton Review site however. Interestingly I cannot even get mechanize to access the schoolfinder.com site. Here is the code I am using:

Expand|Select|Wrap|Line Numbers
  1. #!/usr/bin/env python
  2. # -*- coding: UTF-8 -*-
  3. import mechanize
  4.  
  5. theurl = 'http://www.princetonreview.com/Login3.aspx?uidbadge='
  6. mech = mechanize.Browser()
  7. mech.open(theurl)
  8.  
  9. mech.select_form(nr=0)
  10. mech["ctl00$MasterMainBodyContent$txtUsername"] = "bugmenot2008@yahoo.com"
  11. mech["ctl00$MasterMainBodyContent$txtPassword"] = "letmeinalready"
  12. results = mech.submit().read()
  13.  
  14. f = file('test.html', 'w')
  15. f.write(results) # write to a test file
  16. f.close()
  17.  
This code is so short and I just cannot figure out what I am doing wrong. What is incorrect about this? Thank you in advance.
Reply
Reply

Bookmarks

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are Off
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On

What is Bytes?

We are a network of experts and professionals in IT and software development that help one another with answers to tough questions and share insights. Get the best answers to your questions from over network members.
Post your question now . . .
It's fast and it's free

Popular Articles