urllib to cache 301 redirections?

Hi,
There is an open tracker item against the urllib2 library, python.org/sf/735515,
which states:
urllib / urllib2 should cache the results of 301 (permanent) redirections.
This shouldn't break anything, since it's just an internal optimisation
from one point of view -- but it's also what the RFC (2616, section 10.3.2, first para) says
SHOULD happen.

I am trying to understand what this means.
Should the original URL be available to the user on request, given that urllib
automatically calls redirect_request and provides only the redirected URL?

I do not fully understand what "caching a redirection" implies, or what should
be done in the urllib2 module. Any pointers?

Thanks,
--
O.R.Senthil Kumaran
http://uthcode.sarovar.org
Jul 6 '07 #1
"O.R.Senthil Kumaran" <or*******@users.sourceforge.net> writes:
> Hi,
> There is an open tracker item against the urllib2 library, python.org/sf/735515,
> which states:
> urllib / urllib2 should cache the results of 301 (permanent) redirections.
> This shouldn't break anything, since it's just an internal optimisation
> from one point of view -- but it's also what the RFC (2616, section 10.3.2, first para) says
> SHOULD happen.
>
> I am trying to understand what this means.
> Should the original URL be available to the user on request, given that urllib
> automatically calls redirect_request and provides only the redirected URL?

urllib2, you mean.

Regardless of this bug, Request.get_full_url() should be (and is)
whatever URL the request instance was originally constructed with.
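
As a small illustration of that point, here is a sketch using Python 3's urllib.request, the successor to urllib2 (so the import path is an assumption relative to this thread, and example.com is only a placeholder). The Request keeps the URL it was built with; the URL actually fetched after any redirects is reported separately by the response's geturl().

```python
import urllib.request

# Placeholder URL; nothing is fetched when a Request is constructed.
req = urllib.request.Request("http://example.com/old")
original_url = req.get_full_url()  # the URL the Request was constructed with
print(original_url)

# After opening (needs network access, so left commented out here):
# resp = urllib.request.urlopen(req)
# resp.geturl()  # the final URL, after any redirects were followed
```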

> I am not completely getting what "cache - redirection" implies and what should
> be done with the urllib2 module. Any pointers?

When a 301 redirect occurs after a request for URL U, via
urllib2.urlopen(U), urllib2 should remember the result of that
redirection, viz a second URL, V. Then, when another
urllib2.urlopen(U) takes place, urllib2 should send an HTTP request
for V, not U. urllib2 does not currently do this. (Obviously the
cache -- that is, the dictionary or whatever that stores the mapping
from URLs U to V -- should not be maintained by function urlopen
itself. Perhaps it should live on the redirect handler.)

302 redirections are temporary and are handled correctly in this
respect already by urllib2.
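
The U-to-V mapping described above might be sketched as follows. This is only an illustration under stated assumptions, not the fix that belongs in the library: it targets Python 3's urllib.request rather than urllib2, the names CachingRedirectHandler, resolve and open_cached are hypothetical, and it ignores any cache-control headers on the 301 response. The dictionary lives on the redirect handler, and a lock guards it because one handler may be shared between threads.

```python
import threading
import urllib.request


class CachingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler that remembers 301 targets (U -> V)."""

    def __init__(self):
        self.cache = {}                # original URL U -> redirected URL V
        self._lock = threading.Lock()  # the handler may be shared by threads

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Let the stock handler build (or refuse) the follow-up request.
        new_req = super().redirect_request(req, fp, code, msg, headers, newurl)
        # Only permanent redirects are remembered; 302 and friends are not.
        if code == 301 and new_req is not None:
            with self._lock:
                self.cache[req.get_full_url()] = newurl
        return new_req

    def resolve(self, url, max_hops=10):
        """Follow cached mappings from url, stopping on a repeat or hop limit."""
        seen = {url}
        with self._lock:
            for _ in range(max_hops):
                target = self.cache.get(url)
                if target is None or target in seen:
                    break
                seen.add(target)
                url = target
        return url


handler = CachingRedirectHandler()
opener = urllib.request.build_opener(handler)


def open_cached(url):
    """Open url, going straight to V if a 301 from U was seen earlier."""
    return opener.open(handler.resolve(url))
```

The first open_cached(U) goes through the normal redirect and records V; later calls with the same U would request V directly. The cache is per handler instance, so it only helps callers that reuse the same opener.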
John
Jul 6 '07 #2
Thank you for the reply, Mr. John and I apologize for a very late response
from my end.

* John J. Lee <jj*@pobox.com> [2007-07-06 18:53:09]:
> "O.R.Senthil Kumaran" <or*******@users.sourceforge.net> writes:
>> Hi,
>> There is an Open Tracker item against urllib2 library python.org/sf/735515
>> I am not completely getting what "cache - redirection" implies and what should
>> be done with the urllib2 module. Any pointers?
>
> When a 301 redirect occurs after a request for URL U, via
> urllib2.urlopen(U), urllib2 should remember the result of that
> redirection, viz a second URL, V. Then, when another
> urllib2.urlopen(U) takes place, urllib2 should send an HTTP request
> for V, not U. urllib2 does not currently do this. (Obviously the
> cache -- that is, the dictionary or whatever that stores the mapping
> from URLs U to V -- should not be maintained by function urlopen
> itself. Perhaps it should live on the redirect handler.)

I spent a little time thinking about a solution and figured out that the
following changes to HTTPRedirectHandler might be helpful in implementing
this.

    class HTTPRedirectHandler(BaseHandler):
        # ... omitted ...

        # Initialize a dictionary to hold the cache.
        def __init__(self):
            self.cache = {}

        # Handle 301 errors separately, in a function that maintains the cache.
        def http_error_301(self, req, fp, code, msg, headers):
            if req in self.cache:
                # Look for a loop: if a particular URL appears as both a key
                # and a value, then there is a loop, so raise HTTPError.
                if len(set(self.cache.keys()) & set(self.cache.values())) > 0:
                    raise HTTPError(req.get_full_url(), code,
                                    self.inf_msg + msg, headers, fp)
                return self.cache[req]

            self.cache[req] = self.http_error_302(req, fp, code, msg, headers)
            return self.cache[req]

John, let me know your comments on this approach.
I have not yet tested this code in a real scenario with a 301 redirect.
If it's okay, I shall test it and submit a patch for the tracker item.

Thanks,
Senthil

--
O.R.Senthil Kumaran
http://uthcode.sarovar.org
Jul 16 '07 #3
On Tue, 17 Jul 2007, O.R.Senthil Kumaran wrote:
> [...]
> I spent a little time thinking about a solution and figured out that the
> following changes to HTTPRedirectHandler might be helpful in implementing
> this.
> [...]

Did you post it on the Python SF patch tracker?

If not, please do, and point us at it. I'll comment there.
John

Jul 16 '07 #4
O.R.Senthil Kumaran wrote:
> Thank you for the reply, Mr. John and I apologize for a very late response
> from my end.
>
> * John J. Lee <jj*@pobox.com> [2007-07-06 18:53:09]:
>
>> "O.R.Senthil Kumaran" <or*******@users.sourceforge.net> writes:
>>
>>> Hi,
>>> There is an Open Tracker item against urllib2 library python.org/sf/735515
>>> I am not completely getting what "cache - redirection" implies and what should
>>> be done with the urllib2 module. Any pointers?
>>
>> When a 301 redirect occurs after a request for URL U, via
>> urllib2.urlopen(U), urllib2 should remember the result of that
>> redirection, viz a second URL, V. Then, when another
>> urllib2.urlopen(U) takes place, urllib2 should send an HTTP request
>> for V, not U. urllib2 does not currently do this. (Obviously the
>> cache -- that is, the dictionary or whatever that stores the mapping
>> from URLs U to V -- should not be maintained by function urlopen
>> itself. Perhaps it should live on the redirect handler.)
>
> I spent a little time thinking about a solution and figured out that the
> following changes to HTTPRedirectHandler might be helpful in implementing
> this.
>
>     class HTTPRedirectHandler(BaseHandler):
>         # ... omitted ...
>
>         # Initialize a dictionary to hold the cache.
>         def __init__(self):
>             self.cache = {}
>
>         # Handle 301 errors separately, in a function that maintains the cache.
>         def http_error_301(self, req, fp, code, msg, headers):
>             if req in self.cache:
>                 # Look for a loop: if a particular URL appears as both a key
>                 # and a value, then there is a loop, so raise HTTPError.
>                 if len(set(self.cache.keys()) & set(self.cache.values())) > 0:
>                     raise HTTPError(req.get_full_url(), code,
>                                     self.inf_msg + msg, headers, fp)
>                 return self.cache[req]
>
>             self.cache[req] = self.http_error_302(req, fp, code, msg, headers)
>             return self.cache[req]
>
> John, let me know your comments on this approach.
> I have not yet tested this code in a real scenario with a 301 redirect.
> If it's okay, I shall test it and submit a patch for the tracker item.

That assumes you're reusing the same object to reopen another URL.

Is this thread-safe?

That's also an inefficient way to test for an empty dictionary.

John Nagle
Jul 16 '07 #5
* John J Lee <jj*@pobox.com> [2007-07-16 20:17:40]:
>> I spent a little time thinking about a solution and figured out that the
>> following changes to HTTPRedirectHandler might be helpful in implementing
>> this.
>
> Did you post it on the Python SF patch tracker?
>
> If not, please do, and point us at it. I'll comment there.

Posted: http://www.python.org/sf/1755841
Thanks,
--
O.R.Senthil Kumaran
http://uthcode.sarovar.org
Jul 18 '07 #6
* John Nagle <na***@animats.com> [2007-07-16 12:34:00]:
> That assumes you're reusing the same object to reopen another URL.
>
> Is this thread-safe?

I don't know. I looked at how a few other caches (such as the FTP cache) are
implemented, and I don't see why this would not be thread-safe.

> That's also an inefficient way to test for an empty dictionary.

How should it be done otherwise? I am looking for alternative methods as
well.
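
One possible alternative, offered as a sketch rather than a reviewed patch: instead of intersecting all keys with all values, follow the chain from the URL in question and report a loop only if a URL repeats. The function name has_redirect_loop and the short string keys are hypothetical (the code in this thread keys the cache by request object), and the snippet assumes Python 3, where dict key views provide isdisjoint(). It also shows that the key/value overlap test is true for an ordinary chain of redirects that never loops.

```python
def has_redirect_loop(cache, start):
    """Return True if following cache entries from start revisits a URL."""
    seen = set()
    url = start
    while url in cache:
        if url in seen:
            return True
        seen.add(url)
        url = cache[url]
    return False


# A plain chain a -> b -> c: "b" is both a key and a value, but nothing loops.
chain = {"a": "b", "b": "c"}
# Asks the same question as the len(set(keys) & set(values)) > 0 test,
# without building two sets and their intersection.
overlap = not chain.keys().isdisjoint(chain.values())
print(overlap, has_redirect_loop(chain, "a"))

# A genuine loop a -> b -> a.
loop = {"a": "b", "b": "a"}
print(has_redirect_loop(loop, "a"))
```

As for testing emptiness in general, a set or dictionary can be used directly in a condition (for example `if overlap_set:`) without calling len() on it.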

--
O.R.Senthil Kumaran
http://uthcode.sarovar.org
Jul 18 '07 #7
