I have a test script below which I use to fetch URLs into strings,
either over https or http. For https I use M2Crypto's m2urllib, and
for http the standard urllib. However, whenever I import socket and
call setdefaulttimeout, using m2urllib tends to raise
httplib.BadStatusLine, even if the timeout is set to be very large.
All of the documents in the test script can be accessed publicly.
Any ideas? Is there a better/easier way to get https docs in Python?
Thanks,
JDH
import urllib, socket
from cStringIO import StringIO
from M2Crypto import Rand, SSL, m2urllib

# comment out this line and the script generally works, but without it
# my zope process, which is using this code, hangs.
socket.setdefaulttimeout(200)

def url_to_string(source):
    """
    get url as string, for https and http
    """
    if source.startswith('https:'):
        sh = StringIO()
        url = m2urllib.FancyURLopener()
        url.addheader('Connection', 'close')
        u = url.open(source)
        while 1:
            data = u.read()
            if not data:
                break
            sh.write(data)
        return sh.getvalue()
    else:
        return urllib.urlopen(source).read()

if __name__ == '__main__':
    s1 = url_to_string('https://crcdocs.bsd.uchicago.edu/crcdocs/Files/informatics.doc')
    s2 = url_to_string('http://yahoo.com')
    s3 = url_to_string('https://crcdocs.bsd.uchicago.edu/crcdocs/Files/facepage.doc')
    print len(s1), len(s2), len(s3)
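For what it's worth, here is a minimal stdlib-only sketch (no M2Crypto, no network) of why the global setdefaulttimeout call may be the culprit: it silently switches every socket created afterwards into timeout mode, which an SSL wrapper may not expect. Setting a timeout per socket instead leaves global state alone. The specific numbers below are just illustrative, not from the original script.

```python
import socket

# setdefaulttimeout applies to EVERY socket created afterwards; even a
# very large value still changes sockets from blocking to timeout mode,
# which some wrapper layers (e.g. SSL) may not handle.
socket.setdefaulttimeout(200)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(s.gettimeout())  # 200.0 -- inherited from the global default

# A per-socket timeout avoids touching global state: restore the
# default, then set the timeout only on the socket that needs it.
socket.setdefaulttimeout(None)
t = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
t.settimeout(30)
print(t.gettimeout())  # 30.0 -- only this socket is affected
s.close()
t.close()
```

This does not prove where the BadStatusLine comes from, but it narrows the experiment: try leaving the global default at None and applying settimeout only to the http side.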