
Help with cookies/authentication

Hi, I am trying to pull some data from a Web site: http://schoolfinder.com

The issue is that I want to use the advanced search feature, which requires logging into the Web site. I have a username and password, but I want to connect programmatically from Python. I have done data capture from the Web before, so the only new part for me is the authentication. I need cookies, as this page describes: http://schoolfinder.com/login/login.asp

I already know how to add POST/GET data to a request, but how do I handle cookies and authentication? I have read a few articles without success:

urllib2:
http://www.voidspace.org.uk/python/a...lib2.shtml#id6

urllib2 Cookbook:
http://personalpages.tds.net/~kent37/kk/00010.html

basic authentication:
http://www.voidspace.org.uk/python/a...ion.shtml#id19

cookielib:
http://www.voidspace.org.uk/python/a...ookielib.shtml

Is there some other resource I am missing? Could someone set up a basic script that would let me connect to schoolfinder.com with my username and password? My username is "greenman", password is "greenman". All I need to know is how to access pages as if I had logged in through a Web browser.

Thank you very much.
Aug 10 '08 #1
Try this code; it will log in and save all the cookies registered by schoolfinder.com to a file:

#!/usr/local/bin/python

COOKIEFILE = 'cookies.lwp'          # the path and filename to save your cookies in

import os.path
import sys
from urllib import urlencode

cj = None
ClientCookie = None
cookielib = None

try:                                # Let's see if cookielib is available
    import cookielib
except ImportError:
    pass
else:
    import urllib2
    urlopen = urllib2.urlopen
    cj = cookielib.LWPCookieJar()   # a subclass of FileCookieJar with useful load and save methods
    Request = urllib2.Request

if not cookielib:                   # If importing cookielib fails, let's try ClientCookie
    try:
        import ClientCookie
    except ImportError:             # Neither is available: fall back to plain urllib2, without cookie support
        import urllib2
        urlopen = urllib2.urlopen
        Request = urllib2.Request
    else:
        urlopen = ClientCookie.urlopen
        cj = ClientCookie.LWPCookieJar()
        Request = ClientCookie.Request

####################################################
# We've now imported the relevant library. Whichever library is being used,
# urlopen is bound to the right function for retrieving URLs and Request is
# bound to the right class for creating Request objects.
# Let's load the cookies, if they exist, and install our CookieJar so that it
# is used as the default CookieProcessor in the default opener.

if cj is not None:
    if os.path.isfile(COOKIEFILE):
        cj.load(COOKIEFILE)
    if cookielib:
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        urllib2.install_opener(opener)
    else:
        opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
        ClientCookie.install_opener(opener)

# If one of the cookie libraries is available, any call to urlopen will now
# handle cookies using the CookieJar instance we've created.
# (Note that if we are using ClientCookie we haven't explicitly imported urllib2.)

theurl = 'http://schoolfinder.com/login/login.asp'    # the login page, which sets the session cookie
body = {'usr': 'greenman', 'pwd': 'greenman'}
txdata = urlencode(body)            # encode the POST data dictionary
txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}    # fake a user agent; some websites don't like automated exploration

try:
    req = Request(theurl, txdata, txheaders)    # create a request object
    handle = urlopen(req)                       # and open it to return a handle on the url
except IOError, e:
    print 'We failed to open "%s".' % theurl
    if hasattr(e, 'code'):
        print 'We failed with error code - %s.' % e.code
    elif hasattr(e, 'reason'):
        print "The error object has the following 'reason' attribute :", e.reason
        print "This usually means the server doesn't exist, is down, or we don't have an internet connection."
        sys.exit()
else:
    print 'Here are the headers of the page :'
    print handle.info()             # handle.read() returns the page; handle.geturl() returns the true url of the page fetched (in case urlopen has followed any redirects)

print
if cj is None:
    print "We don't have a cookie library available - sorry."
    print "I can't show you any cookies."
else:
    print 'These are the cookies we have received so far :'
    for index, cookie in enumerate(cj):
        print index, ' : ', cookie
    cj.save(COOKIEFILE)             # save the cookies again
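
Once the login POST succeeds, install_opener means every later urlopen call goes through the cookie-aware opener, so protected pages come back as if you had logged in with a browser. For example (the advanced-search URL below is only a guess; substitute the real one from the site):

handle = urlopen('http://schoolfinder.com/search/advanced.asp')    # hypothetical URL, check the site for the real one
print handle.read()    # should now be the logged-in version of the page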
Aug 10 '08 #2
Thanks for the help. Your code by itself did not work, but it pushed me in the right direction. Here is what worked for me and let me see the protected pages:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import cookielib
import sys
import urllib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
resp = opener.open('http://schoolfinder.com')    # visit the site first so any initial cookies get set

theurl = 'http://schoolfinder.com/login/login.asp'    # the login page, which sets the session cookie
body = {'usr': 'greenman', 'pwd': 'greenman'}
txdata = urllib.urlencode(body)    # encode the POST data dictionary
txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}    # fake a user agent; some websites don't like automated exploration

try:
    req = urllib2.Request(theurl, txdata, txheaders)    # create a request object
    handle = opener.open(req)                           # and open it to return a handle on the url
    HTMLSource = handle.read()
    f = open('test.html', 'w')
    f.write(HTMLSource)
    f.close()

except IOError, e:
    print 'We failed to open "%s".' % theurl
    if hasattr(e, 'code'):
        print 'We failed with error code - %s.' % e.code
    elif hasattr(e, 'reason'):
        print "The error object has the following 'reason' attribute :", e.reason
        print "This usually means the server doesn't exist, is down, or we don't have an internet connection."
        sys.exit()

else:
    print 'Here are the headers of the page :'
    print handle.info()    # handle.read() returns the page; handle.geturl() returns the true url of the page fetched (in case of redirects)
Aug 30 '08 #3
Your script works for me, but the version below, adapted for another site, does not. The test.html it saves is not the logged-in page, the way it is when I run your script.

The only lines of code I changed are:
resp = opener.open('http://www.amm.com/')
theurl = 'http://www.amm.com/login.asp'
body={'username':'AMMT54590570','password':'AMMT32564288'}

What am I doing wrong?

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import cookielib
import sys
import urllib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
resp = opener.open('http://www.amm.com/login.asp')    # visit the site first so any initial cookies get set

theurl = 'http://www.amm.com/login.asp'    # the login page
body = {'username': 'AMMT54590570', 'password': 'AMMT32564288'}
txdata = urllib.urlencode(body)    # encode the POST data dictionary
txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}    # fake a user agent; some websites don't like automated exploration

try:
    req = urllib2.Request(theurl, txdata, txheaders)    # create a request object
    handle = opener.open(req)                           # and open it to return a handle on the url
    HTMLSource = handle.read()
    f = open('test.html', 'w')
    f.write(HTMLSource)
    f.close()

except IOError, e:
    print 'We failed to open "%s".' % theurl
    if hasattr(e, 'code'):
        print 'We failed with error code - %s.' % e.code
    elif hasattr(e, 'reason'):
        print "The error object has the following 'reason' attribute :", e.reason
        print "This usually means the server doesn't exist, is down, or we don't have an internet connection."
        sys.exit()

else:
    print 'Here are the headers of the page :'
    print handle.info()    # handle.read() returns the page; handle.geturl() returns the true url of the page fetched (in case of redirects)
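
In case it helps, here is how I have been checking whether the login took, just dumping the cookies and the final URL (I am only guessing that the form wants different field names, or a hidden field, than the ones I am posting):

for index, cookie in enumerate(cj):
    print index, ' : ', cookie    # what the server actually set
print handle.geturl()             # where we ended up after any redirects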
Oct 29 '08 #4
