mysteries of urllib/urllib2

I'm trying to use urllib2 to download a page (I'd rather use urllib,
but I need to change the User-Agent header to look like a browser or
G**gle won't send it to me, the big meanies). The following (pinched
from Dive Into Python) seems to work perfectly in Idle, but falls at
the final hurdle when run as a cgi script - can anyone suggest
anything I may have overlooked?

request = urllib2.Request(some_URL)
request.add_header('User-Agent', 'some_plausible_string')
opener = urllib2.build_opener()
data = opener.open(request).read()

Jul 3 '07 #1
On Jul 3, 9:43 am, Adrian Smith <adrian_p_sm...@yahoo.com> wrote:
> The following (pinched
> from Dive Into Python) seems to work perfectly in Idle, but falls at
> the final hurdle when run as a cgi script - can anyone suggest
> anything I may have overlooked?
>
> request = urllib2.Request(some_URL)
> request.add_header('User-Agent', 'some_plausible_string')
> opener = urllib2.build_opener()
> data = opener.open(request).read()

Most likely the account that cgi script is running as does not have
permissions to access the net. Check the traceback to be sure. Put
this at the top of your cgi script:

import cgitb; cgitb.enable()

--Ben

Jul 3 '07 #2
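For anyone following along: the point of cgitb.enable() is to get the traceback into the CGI response instead of losing it in the server's error log. Below is a minimal sketch of the same idea using only the traceback module, written in Python 3 syntax. The names run_and_report and failing_fetch are made up for illustration, and this is not how cgitb itself is implemented.

```python
import traceback


def run_and_report(func):
    """Call func(); return its result, or a plain-text traceback if it raises."""
    try:
        return func()
    except Exception:
        # format_exc() returns the same text the interpreter would print
        # to stderr, so a CGI script can send it back to the browser.
        return "Content-Type: text/plain\n\n" + traceback.format_exc()


def failing_fetch():
    # Stand-in for the real download code; it always fails for the demo.
    raise IOError("no route to host")


report = run_and_report(failing_fetch)
print(report)
```

With the real download code in place of failing_fetch, the exception type and message shown in the report are the details worth posting when asking for help.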
On Jul 3, 11:25 pm, Ben Cartwright <benc.nos...@gmail.com> wrote:
> On Jul 3, 9:43 am, Adrian Smith <adrian_p_sm...@yahoo.com> wrote:
> > The following (pinched
> > from Dive Into Python) seems to work perfectly in Idle, but
> > falls at the final hurdle when run as a cgi script - can
> > anyone suggest anything I may have overlooked?
> >
> > request = urllib2.Request(some_URL)
> > request.add_header('User-Agent', 'some_plausible_string')
> > opener = urllib2.build_opener()
> > data = opener.open(request).read()
>
> Most likely the account that cgi script is running as does not
> have permissions to access the net. Check the traceback to be
> sure. Put this at the top of your cgi script:
>
> import cgitb; cgitb.enable()

Well, it worked with urllib (resulting in a G**gle 403 your-client-
does-not-have-permission-to-get-urlX page), so I think it must have
some access. Apparently there's a way to change the user-agent string
by subclassing urllib's URLopener class, but that's beyond my comfort
zone at present.

Jul 3 '07 #3
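One difference that may matter here: the old urllib.urlopen hands back the 403 page as if it were a normal response, whereas urllib2 raises HTTPError for it, so an unhandled 403 would end a CGI script with a traceback. Whether that is what happened in this case is only a guess. Below is a rough sketch in Python 3 spelling (urllib.request in place of urllib2) that sets the header on a Request and catches the error. The function name fetch and its opener parameter are illustrative, not from the thread.

```python
import urllib.error
import urllib.request


def fetch(url, user_agent, opener=None):
    """Return (status, body) for url, sending the given User-Agent."""
    if opener is None:
        opener = urllib.request.build_opener()
    request = urllib.request.Request(url)
    request.add_header('User-Agent', user_agent)
    try:
        response = opener.open(request)
        return response.status, response.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the error object still
        # carries the status code and can be read like a response.
        return err.code, err.read()
```

A call such as fetch(some_URL, 'some_plausible_string') would then return the 403 status and page body rather than dying with an uncaught exception.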
On Jul 3, 11:14 am, Adrian Smith <adrian_p_sm...@yahoo.com> wrote:
> > > The following (pinched
> > > from Dive Into Python) seems to work perfectly in Idle, but
> > > falls at the final hurdle when run as a cgi script
> > Put this at the top of your cgi script:
> > import cgitb; cgitb.enable()

Did you even try this? Asking for Python help without posting the
traceback is like phoning your mechanic and saying, "My car is making
a generic rattling noise, can you tell me what the problem is without
looking under the hood?"

> Apparently there's a way to change the user-agent string
> by subclassing urllib's URLopener class, but that's beyond my comfort
> zone at present.

Untested:

import urllib
url = 'http://groups.google.com/group/Google-AJAX-Search-API/browse_thread/thread/a0eb87ad13b11762'
opener = urllib.FancyURLopener()
opener.addheaders = [('User-Agent', 'Fauxzilla 4.0')]
data = opener.open(url).read()

Hope that helps,
--Ben

Jul 3 '07 #4
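For readers on Python 3, where urllib and urllib2 were merged into urllib.request, roughly the same trick can be done with build_opener, which the original post already used. This is a sketch only: make_opener is a hypothetical helper name, and the 'Fauxzilla 4.0' string is reused from the post above. Assigning to addheaders replaces the opener's default User-agent entry for every request made through it.

```python
import urllib.request


def make_opener(user_agent):
    """Build an opener that sends user_agent on every request."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs; by default it holds a
    # single Python-urllib User-agent entry, which this overwrites.
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


opener = make_opener('Fauxzilla 4.0')
print(opener.addheaders)
# data = opener.open(url).read()   # network call, left commented out
```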
Adrian Smith wrote:
> I'm trying to use urllib2 to download a page (I'd rather use urllib,
> but I need to change the User-Agent header to look like a browser or
> G**gle won't send it to me, the big meanies). The following (pinched
> from Dive Into Python) seems to work perfectly in Idle, but falls at
> the final hurdle when run as a cgi script - can anyone suggest
> anything I may have overlooked?
>
> request = urllib2.Request(some_URL)
> request.add_header('User-Agent', 'some_plausible_string')
> opener = urllib2.build_opener()
> data = opener.open(request).read()

I doubt that's the problem here, but don't use a USER-AGENT string
that ends in "m" without a preceding "m" when the USER-AGENT
string is the last element of the header. Coyote Point load balancers
will drop the packet.

(Coyote Point uses regular expressions to parse HTTP headers, and
I think somebody wrote "\m" where they meant "\n".)

John Nagle
Jul 3 '07 #5
On Jul 4, 12:42 am, Ben Cartwright <benc.nos...@gmail.com> wrote:
> On Jul 3, 11:14 am, Adrian Smith <adrian_p_sm...@yahoo.com> wrote:
> > > > The following (pinched
> > > > from Dive Into Python) seems to work perfectly in Idle, but
> > > > falls at the final hurdle when run as a cgi script
> > > Put this at the top of your cgi script:
> > > import cgitb; cgitb.enable()
>
> Did you even try this? Asking for Python help without posting the
> traceback is like phoning your mechanic and saying, "My car is
> making a generic rattling noise, can you tell me what the problem
> is without looking under the hood?"

Sorry, I thought as the cgi did appear to have web access it wasn't
applicable, and it's amazing what some mechanics can infer from engine
noise. cgitb certainly does send back an impressive amount of
information, I'll be sure to use it in future.

> > Apparently there's a way to change the user-agent string
> > by subclassing urllib's URLopener class, but that's beyond my
> > comfort zone at present.
>
> Untested:
>
> import urllib
> url = 'http://groups.google.com/group/Google-AJAX-Search-API/browse_thread/thread/a0eb87ad13b11762'
> opener = urllib.FancyURLopener()
> opener.addheaders = [('User-Agent', 'Fauxzilla 4.0')]
> data = opener.open(url).read()

That works a treat, thanks!

Jul 3 '07 #6