mysteries of urllib/urllib2

I'm trying to use urllib2 to download a page (I'd rather use urllib,
but I need to change the User-Agent header to look like a browser or
G**gle won't send it to me, the big meanies). The following (pinched
from Dive Into Python) seems to work perfectly in Idle, but falls at
the final hurdle when run as a cgi script - can anyone suggest
anything I may have overlooked?

request = urllib2.Request(some_URL)
request.add_header('User-Agent', 'some_plausible_string')
opener = urllib2.build_opener()
data = opener.open(request).read()

Jul 3 '07 #1
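
For readers on Python 3, where urllib2 was folded into urllib.request, the snippet in the question might be sketched as follows. This is only a sketch under assumptions, not the poster's code: the helper name build_request and the example.com URL are hypothetical, and nothing is downloaded unless the commented-out open() call is enabled.

```python
import urllib.request


def build_request(url, user_agent):
    """Return a Request for url that carries a custom User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header('User-Agent', user_agent)
    return request


# Hypothetical URL and agent string, for illustration only.
request = build_request('http://example.com/', 'some_plausible_string')

# urllib stores header names capitalized, so look it up as 'User-agent'.
print(request.get_header('User-agent'))

# To actually download the page (needs network access):
# data = urllib.request.build_opener().open(request).read()
```

Building the request separately from opening it makes it possible to check the headers without touching the network.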
On Jul 3, 9:43 am, Adrian Smith <adrian_p_sm...@yahoo.com> wrote:
> The following (pinched
> from Dive Into Python) seems to work perfectly in Idle, but falls at
> the final hurdle when run as a cgi script - can anyone suggest
> anything I may have overlooked?
>
> request = urllib2.Request(some_URL)
> request.add_header('User-Agent', 'some_plausible_string')
> opener = urllib2.build_opener()
> data = opener.open(request).read()

Most likely the account that cgi script is running as does not have
permissions to access the net. Check the traceback to be sure. Put
this at the top of your cgi script:

import cgitb; cgitb.enable()

--Ben

Jul 3 '07 #2
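
The cgitb module suggested above was deprecated in Python 3.11 and removed in 3.13, so on newer interpreters the same idea (get the traceback into the response) can be approximated by hand. The following is a rough sketch, assuming a script that writes its own CGI headers; run_cgi and failing_handler are hypothetical names, and the handler raises on purpose to stand in for a failed download.

```python
import io
import sys
import traceback


def run_cgi(handler, out=sys.stdout):
    """Write handler()'s result as a CGI response, or the traceback if it raises."""
    try:
        body = handler()
    except Exception:
        # Report the failure as plain text so the traceback is visible.
        out.write("Content-Type: text/plain\r\n\r\n")
        out.write(traceback.format_exc())
    else:
        out.write("Content-Type: text/html\r\n\r\n")
        out.write(body)


def failing_handler():
    # Stand-in for the urllib2 download; raises to simulate a failure.
    raise IOError("simulated download failure")


# Capture the response in memory here; a real script would use sys.stdout.
buffer = io.StringIO()
run_cgi(failing_handler, out=buffer)
print(buffer.getvalue())
```

Unlike cgitb, this shows only the plain traceback, without local variables or HTML formatting.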
On Jul 3, 11:25 pm, Ben Cartwright <benc.nos...@gmail.com> wrote:
> On Jul 3, 9:43 am, Adrian Smith <adrian_p_sm...@yahoo.com> wrote:
>> The following (pinched
>> from Dive Into Python) seems to work perfectly in Idle, but
>> falls at the final hurdle when run as a cgi script - can
>> anyone suggest anything I may have overlooked?
>>
>> request = urllib2.Request(some_URL)
>> request.add_header('User-Agent', 'some_plausible_string')
>> opener = urllib2.build_opener()
>> data = opener.open(request).read()
>
> Most likely the account that cgi script is running as does not
> have permissions to access the net. Check the traceback to be
> sure. Put this at the top of your cgi script:
>
> import cgitb; cgitb.enable()

Well, it worked with urllib (resulting in a G**gle 403 your-client-
does-not-have-permission-to-get-urlX page), so I think it must have
some access. Apparently there's a way to change the user-agent string
by subclassing urllib's URLopener class, but that's beyond my comfort
zone at present.

Jul 3 '07 #3
On Jul 3, 11:14 am, Adrian Smith <adrian_p_sm...@yahoo.com> wrote:
>>> The following (pinched
>>> from Dive Into Python) seems to work perfectly in Idle, but
>>> falls at the final hurdle when run as a cgi script
>>
>> Put this at the top of your cgi script:
>> import cgitb; cgitb.enable()

Did you even try this? Asking for Python help without posting the
traceback is like phoning your mechanic and saying, "My car is making
a generic rattling noise, can you tell me what the problem is without
looking under the hood?"

> Apparently there's a way to change the user-agent string
> by subclassing urllib's URLopener class, but that's beyond my comfort
> zone at present.

Untested:

import urllib
url = 'http://groups.google.com/group/Google-AJAX-Search-API/browse_thread/thread/a0eb87ad13b11762'
opener = urllib.FancyURLopener()
opener.addheaders = [('User-Agent', 'Fauxzilla 4.0')]
data = opener.open(url).read()

Hope that helps,
--Ben

Jul 3 '07 #4
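
FancyURLopener is deprecated in Python 3, but the addheaders trick used above carries over to the openers returned by urllib.request.build_opener(). A minimal sketch, assuming Python 3; make_opener is a hypothetical helper name and the example.com URL is a placeholder, so nothing is fetched as written.

```python
import urllib.request


def make_opener(user_agent):
    """Return an opener that sends user_agent on requests lacking their own."""
    opener = urllib.request.build_opener()
    # Replaces the default [('User-agent', 'Python-urllib/x.y')] entry.
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


opener = make_opener('Fauxzilla 4.0')
print(opener.addheaders)

# To fetch a page (needs network access; the URL is a placeholder):
# data = opener.open('http://example.com/').read()
```

Setting the header on the opener rather than on each Request means every URL opened through it gets the same agent string.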
Adrian Smith wrote:
> I'm trying to use urllib2 to download a page (I'd rather use urllib,
> but I need to change the User-Agent header to look like a browser or
> G**gle won't send it to me, the big meanies). The following (pinched
> from Dive Into Python) seems to work perfectly in Idle, but falls at
> the final hurdle when run as a cgi script - can anyone suggest
> anything I may have overlooked?
>
> request = urllib2.Request(some_URL)
> request.add_header('User-Agent', 'some_plausible_string')
> opener = urllib2.build_opener()
> data = opener.open(request).read()

I doubt that's the problem here, but don't use a USER-AGENT string
that ends in "m" without a preceding "m" when the USER-AGENT
string is the last element of the header. Coyote Point load balancers
will drop the packet.

(Coyote Point uses regular expressions to parse HTTP headers, and
I think somebody wrote "\m" where they meant "\n".)

John Nagle
Jul 3 '07 #5
On Jul 4, 12:42 am, Ben Cartwright <benc.nos...@gmail.com> wrote:
> On Jul 3, 11:14 am, Adrian Smith <adrian_p_sm...@yahoo.com> wrote:
>>>> The following (pinched
>>>> from Dive Into Python) seems to work perfectly in Idle, but
>>>> falls at the final hurdle when run as a cgi script
>>>
>>> Put this at the top of your cgi script:
>>> import cgitb; cgitb.enable()
>
> Did you even try this? Asking for Python help without posting the
> traceback is like phoning your mechanic and saying, "My car is
> making a generic rattling noise, can you tell me what the problem
> is without looking under the hood?"

Sorry, I thought as the cgi did appear to have web access it wasn't
applicable, and it's amazing what some mechanics can infer from engine
noise. cgitb certainly does send back an impressive amount of
information, I'll be sure to use it in future.

>> Apparently there's a way to change the user-agent string
>> by subclassing urllib's URLopener class, but that's beyond my
>> comfort zone at present.
>
> Untested:
>
> import urllib
> url = 'http://groups.google.com/group/Google-AJAX-Search-API/browse_thread/thread/a0eb87ad13b11762'
> opener = urllib.FancyURLopener()
> opener.addheaders = [('User-Agent', 'Fauxzilla 4.0')]
> data = opener.open(url).read()

That works a treat, thanks!

Jul 3 '07 #6
