Bytes | Software Development & Data Engineering Community

mysteries of urllib/urllib2

I'm trying to use urllib2 to download a page (I'd rather use urllib,
but I need to change the User-Agent header to look like a browser or
G**gle won't send it to me, the big meanies). The following (pinched
from Dive Into Python) seems to work perfectly in Idle, but falls at
the final hurdle when run as a cgi script - can anyone suggest
anything I may have overlooked?

request = urllib2.Request(some_URL)
request.add_header('User-Agent', 'some_plausible_string')
opener = urllib2.build_opener()
data = opener.open(request).read()

Jul 3 '07 #1
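
For readers on Python 3, where urllib2 was merged into urllib.request, the request-building step from the question can be sketched roughly as below. This is a minimal sketch rather than the poster's code: `build_request` and `fetch` are hypothetical helper names, and the URL and User-Agent string are placeholders.

```python
import urllib.request


def build_request(url, user_agent):
    """Return a Request that carries a custom User-Agent header."""
    # Hypothetical helper: headers passed here apply to this request only.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=15):
    """Download the page body as bytes (performs a real network call)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Placeholder values; nothing is downloaded here.
    req = build_request("http://example.com/", "some_plausible_string")
    # Request stores header names capitalized, so the key is 'User-agent'.
    print(req.get_header("User-agent"))
```

Keeping the request construction separate from the download makes it possible to inspect the headers without touching the network.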
On Jul 3, 9:43 am, Adrian Smith <adrian_p_sm...@yahoo.com> wrote:
> The following (pinched
> from Dive Into Python) seems to work perfectly in Idle, but falls at
> the final hurdle when run as a cgi script - can anyone suggest
> anything I may have overlooked?
>
> request = urllib2.Request(some_URL)
> request.add_header('User-Agent', 'some_plausible_string')
> opener = urllib2.build_opener()
> data = opener.open(request).read()

Most likely the account that the cgi script is running as does not have
permissions to access the net. Check the traceback to be sure. Put
this at the top of your cgi script:

import cgitb; cgitb.enable()

--Ben

Jul 3 '07 #2
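
The cgitb advice above fits the Python 2 era of this thread; cgitb was later deprecated and removed from the standard library in Python 3.13. On newer interpreters a similar effect can be approximated with the traceback module. The sketch below is a stand-in under stated assumptions, not cgitb itself: `run_with_traceback` and `main` are hypothetical names, and it assumes the script has not yet written its own HTTP headers when the exception occurs.

```python
import sys
import traceback


def run_with_traceback(main, out=None):
    """Call main(); on failure, emit the traceback as a plain-text CGI response."""
    out = sys.stdout if out is None else out
    try:
        main()
    except Exception:
        # A CGI response needs a header block followed by a blank line.
        out.write("Content-Type: text/plain\r\n\r\n")
        out.write(traceback.format_exc())
        return False
    return True
```

Wrapping the script's entry point this way sends the traceback back to the browser instead of leaving only a generic server error.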
On Jul 3, 11:25 pm, Ben Cartwright <benc.nos...@gmail.com> wrote:
> On Jul 3, 9:43 am, Adrian Smith <adrian_p_sm...@yahoo.com> wrote:
> > The following (pinched
> > from Dive Into Python) seems to work perfectly in Idle, but
> > falls at the final hurdle when run as a cgi script - can
> > anyone suggest anything I may have overlooked?
> > request = urllib2.Request(some_URL)
> > request.add_header('User-Agent', 'some_plausible_string')
> > opener = urllib2.build_opener()
> > data = opener.open(request).read()
>
> Most likely the account that cgi script is running as does not
> have permissions to access the net. Check the traceback to be
> sure. Put this at the top of your cgi script:
>
> import cgitb; cgitb.enable()

Well, it worked with urllib (resulting in a G**gle 403 your-client-
does-not-have-permission-to-get-urlX page), so I think it must have
some access. Apparently there's a way to change the user-agent string
by subclassing urllib's URLopener class, but that's beyond my comfort
zone at present.

Jul 3 '07 #3
On Jul 3, 11:14 am, Adrian Smith <adrian_p_sm...@yahoo.com> wrote:
> > > The following (pinched
> > > from Dive Into Python) seems to work perfectly in Idle, but
> > > falls at the final hurdle when run as a cgi script
> > Put this at the top of your cgi script:
> > import cgitb; cgitb.enable()

Did you even try this? Asking for Python help without posting the
traceback is like phoning your mechanic and saying, "My car is making
a generic rattling noise, can you tell me what the problem is without
looking under the hood?"

> Apparently there's a way to change the user-agent string
> by subclassing urllib's URLopener class, but that's beyond my comfort
> zone at present.

Untested:

import urllib
url = 'http://groups.google.com/group/Google-AJAX-Search-API/browse_thread/thread/a0eb87ad13b11762'
opener = urllib.FancyURLopener()
# Replace the opener's default headers with a browser-like User-Agent
opener.addheaders = [('User-Agent', 'Fauxzilla 4.0')]
data = opener.open(url).read()

Hope that helps,
--Ben

Jul 3 '07 #4
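
The addheaders approach in the post above also carries over to Python 3, where FancyURLopener has been deprecated since 3.3; the opener returned by urllib.request.build_opener() exposes the same attribute. A hedged sketch, with `make_opener` as a hypothetical helper name and the User-Agent value borrowed from the post:

```python
import urllib.request


def make_opener(user_agent):
    """Return an opener that sends user_agent on every request it makes."""
    opener = urllib.request.build_opener()
    # Overwrites the default Python-urllib User-agent entry.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


if __name__ == "__main__":
    opener = make_opener("Fauxzilla 4.0")
    print(opener.addheaders)
    # opener.open(url).read() would perform the actual download.
```

Setting the header on the opener rather than on each request suits scripts that fetch several pages from the same site.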
Adrian Smith wrote:
> I'm trying to use urllib2 to download a page (I'd rather use urllib,
> but I need to change the User-Agent header to look like a browser or
> G**gle won't send it to me, the big meanies). The following (pinched
> from Dive Into Python) seems to work perfectly in Idle, but falls at
> the final hurdle when run as a cgi script - can anyone suggest
> anything I may have overlooked?
>
> request = urllib2.Request(some_URL)
> request.add_header('User-Agent', 'some_plausible_string')
> opener = urllib2.build_opener()
> data = opener.open(request).read()

I doubt that's the problem here, but don't use a USER-AGENT string
that ends in "m" without a preceding "m" when the USER-AGENT
string is the last element of the header. Coyote Point load balancers
will drop the packet.

(Coyote Point uses regular expressions to parse HTTP headers, and
I think somebody wrote "\m" where they meant "\n".)

John Nagle
Jul 3 '07 #5
On Jul 4, 12:42 am, Ben Cartwright <benc.nos...@gmail.com> wrote:
> On Jul 3, 11:14 am, Adrian Smith <adrian_p_sm...@yahoo.com> wrote:
> > > > The following (pinched
> > > > from Dive Into Python) seems to work perfectly in Idle, but
> > > > falls at the final hurdle when run as a cgi script
> > > Put this at the top of your cgi script:
> > > import cgitb; cgitb.enable()
>
> Did you even try this? Asking for Python help without posting the
> traceback is like phoning your mechanic and saying, "My car is
> making a generic rattling noise, can you tell me what the problem
> is without looking under the hood?"

Sorry, I thought as the cgi did appear to have web access it wasn't
applicable, and it's amazing what some mechanics can infer from engine
noise. cgitb certainly does send back an impressive amount of
information, I'll be sure to use it in future.

> > Apparently there's a way to change the user-agent string
> > by subclassing urllib's URLopener class, but that's beyond my
> > comfort zone at present.
>
> Untested:
>
> import urllib
> url = 'http://groups.google.com/group/Google-AJAX-Search-API/browse_thread/thread/a0eb87ad13b11762'
> opener = urllib.FancyURLopener()
> opener.addheaders = [('User-Agent', 'Fauxzilla 4.0')]
> data = opener.open(url).read()

That works a treat, thanks!

Jul 3 '07 #6
