urllib2 rate limiting

Hello list,

I want to limit the download speed when using urllib2. In particular,
having several parallel downloads, I want to make sure that their total
speed doesn't exceed a maximum value.

I can't find a simple way to achieve this. After researching, I can try
some things, but I'm stuck on the details:

1) Can I overload some method in _socket.py to achieve this, and perhaps
make this generic enough to work even with other libraries than urllib2?

2) There is the urllib.urlretrieve() function which accepts a reporthook
parameter. Perhaps I can have reporthook increment a global counter and
sleep as necessary when a threshold is reached.
However there is nothing similar in urllib2. Isn't urllib2 supposed to be
a superset of urllib in functionality? Why is there no reporthook
parameter in any of urllib2's functions?
Moreover, even the existing reporthook interface doesn't seem quite
right: reporthook(blocknum, bs, size) is always called with bs=8K, even
for the last block, and (blocknum*bs > size) is sometimes possible, if
the server sends a wrong Content-Length HTTP header. (A defensive hook
sketch follows these three ideas.)

3) Perhaps I can use filehandle.read(1024) and manually read as many
chunks of data as I need. However, I think this would generally be
inefficient, and I'm not sure how it would interact with urllib2's
internal buffering.
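
Regarding the reporthook caveats in idea 2, a hook could at least clamp
the byte count itself. A rough sketch (the clamping is just my guess at a
workaround, and it only helps when the server's Content-Length is sane):

------------------------------------------------------------
def safe_hook(blocknum, bs, size):
    # bs is the requested block size (8 KB), not the number of bytes
    # actually read, so blocknum * bs can overshoot the true total;
    # clamp it against the reported size when one is available.
    downloaded = blocknum * bs
    if size >= 0:
        downloaded = min(downloaded, size)
    print "%d of %d bytes" % (downloaded, size)
------------------------------------------------------------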

So how do you think I can achieve rate limiting in urllib2?
Thanks in advance,
Dimitris

P.S. And something simpler: How can I disallow urllib2 to follow
redirections to foreign hosts?
Jan 10 '08 #1
Dimitrios Apostolou <ji***@gmx.net> writes:
> P.S. And something simpler: How can I disallow urllib2 to follow
> redirections to foreign hosts?
You need to subclass `urllib2.HTTPRedirectHandler`, override the
`http_error_301` and `http_error_302` methods, and throw an
`urllib2.HTTPError` exception.

http://diveintopython.org/http_web_s...redirects.html
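
Something along these lines might work (an untested sketch; the host
comparison and the SameHostRedirectHandler name are just for
illustration):

------------------------------------------------------------
import urllib2
import urlparse

class SameHostRedirectHandler(urllib2.HTTPRedirectHandler):
    """Refuse redirects whose target is on a different host."""
    def http_error_302(self, req, fp, code, msg, hdrs):
        # The redirect target is in the Location: (or uri:) header.
        newurl = hdrs.get('location') or hdrs.get('uri', '')
        old_host = urlparse.urlsplit(req.get_full_url())[1]
        new_host = urlparse.urlsplit(newurl)[1]
        if new_host and new_host != old_host:
            raise urllib2.HTTPError(req.get_full_url(), code,
                                    "redirect to foreign host refused",
                                    hdrs, fp)
        return urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, hdrs)
    http_error_301 = http_error_302

opener = urllib2.build_opener(SameHostRedirectHandler)
------------------------------------------------------------

Then fetch through opener.open(url) instead of urllib2.urlopen(url).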

HTH,
Rob
Jan 10 '08 #2
On Thu, 10 Jan 2008, Rob Wolfe wrote:
> Dimitrios Apostolou <ji***@gmx.net> writes:
>> P.S. And something simpler: How can I disallow urllib2 to follow
>> redirections to foreign hosts?
>
> You need to subclass `urllib2.HTTPRedirectHandler`, override the
> `http_error_301` and `http_error_302` methods, and throw an
> `urllib2.HTTPError` exception.
Thanks! I think for my case it's better to override the redirect_request
method, and return a Request only when the redirection goes to the same
site. Just another question, because I can't find the meaning of the
(req, fp, code, msg, hdrs) parameters in the docs: to read the URL I get
redirected to (the 'Location:' HTTP header?), should I check the hdrs
parameter, or is there a better way?
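
From a look at urllib2's source, http_error_302 itself extracts the
Location: (falling back to uri:) header from hdrs and hands it to
redirect_request as an extra newurl argument, so there may be no need to
parse hdrs at all. An untested sketch of what I have in mind (the host
check is my own guess):

------------------------------------------------------------
import urllib2
import urlparse

class SameSiteRedirectHandler(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, hdrs, newurl):
        # newurl is the redirect target, already pulled out of hdrs.
        old_host = urlparse.urlsplit(req.get_full_url())[1]
        new_host = urlparse.urlsplit(newurl)[1]
        if new_host and new_host != old_host:
            raise urllib2.HTTPError(newurl, code,
                                    "foreign-host redirect refused",
                                    hdrs, fp)
        # Same host: defer to the stock implementation.
        return urllib2.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, hdrs, newurl)
------------------------------------------------------------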
Thanks,
Dimitris

> http://diveintopython.org/http_web_s...redirects.html
>
> HTH,
> Rob
Jan 10 '08 #3
Dimitrios Apostolou <ji***@gmx.net> wrote:
> I want to limit the download speed when using urllib2. In particular,
> having several parallel downloads, I want to make sure that their total
> speed doesn't exceed a maximum value.
>
> I can't find a simple way to achieve this. After researching, I can try
> some things, but I'm stuck on the details:
>
> 1) Can I overload some method in _socket.py to achieve this, and perhaps
> make this generic enough to work even with other libraries than urllib2?
>
> 2) There is the urllib.urlretrieve() function which accepts a reporthook
> parameter.
Here is an implementation based on that idea. I've used urllib rather
than urllib2 as that is what I'm familiar with.

------------------------------------------------------------
#!/usr/bin/python

"""
Fetch a url rate limited

Syntax: rate-limited-fetch.py "rate in kBytes/s" URL local_file_name
"""

import sys
import urllib
from time import time, sleep

class RateLimit(object):
    """Rate limit a url fetch"""
    def __init__(self, rate_limit):
        """rate limit in kBytes / second"""
        self.rate_limit = rate_limit
        self.start = time()

    def __call__(self, block_count, block_size, total_size):
        # block_size is the requested read size (usually 8 KB), so
        # block_count * block_size can overshoot on the last block.
        total_kb = total_size / 1024
        downloaded_kb = (block_count * block_size) / 1024
        elapsed_time = time() - self.start
        if elapsed_time != 0:
            rate = downloaded_kb / elapsed_time
            print "%d kb of %d kb downloaded %.1f kBytes/s" % (downloaded_kb, total_kb, rate)
        # Sleep for however long we are ahead of the target rate.
        expected_time = downloaded_kb / self.rate_limit
        sleep_time = expected_time - elapsed_time
        print "Sleep for", sleep_time
        if sleep_time > 0:
            sleep(sleep_time)

def main():
    """Fetch the contents of a url with a rate limit"""
    if len(sys.argv) != 4:
        print 'Syntax: %s "rate in kBytes/s" URL "local output path"' % sys.argv[0]
        raise SystemExit(1)
    rate_limit, url, out_path = sys.argv[1:]
    rate_limit = float(rate_limit)
    print "Fetching %r to %r with rate limit %.1f" % (url, out_path, rate_limit)
    urllib.urlretrieve(url, out_path, reporthook=RateLimit(rate_limit))

if __name__ == "__main__":
    main()
------------------------------------------------------------

Use it like this:

$ ./rate-limited-fetch.py 16 http://some/url/or/other z
Fetching 'http://some/url/or/other' to 'z' with rate limit 16.0
0 kb of 10118 kb downloaded 0.0 kBytes/s
Sleep for -0.0477550029755
8 kb of 10118 kb downloaded 142.1 kBytes/s
Sleep for 0.443691015244
16 kb of 10118 kb downloaded 32.1 kBytes/s
Sleep for 0.502038002014
24 kb of 10118 kb downloaded 24.0 kBytes/s
Sleep for 0.498028993607
32 kb of 10118 kb downloaded 21.3 kBytes/s
Sleep for 0.497982025146
40 kb of 10118 kb downloaded 20.0 kBytes/s
Sleep for 0.497948884964
48 kb of 10118 kb downloaded 19.2 kBytes/s
Sleep for 0.498008966446
....
1416 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499262094498
1424 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499293088913
1432 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499292135239
1440 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499267101288
....
--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jan 11 '08 #4
On Fri, 11 Jan 2008, Nick Craig-Wood wrote:
> Here is an implementation based on that idea. I've used urllib rather
> than urllib2 as that is what I'm familiar with.
Thanks! Really nice implementation. However, I'm stuck with urllib2
because of its extra functionality, so I'll try to implement something
similar using handle.read(1024) to read in small chunks.
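
Something along these lines, perhaps (an untested sketch; the
fetch_limited name and the 16 kB/s default are made up):

------------------------------------------------------------
import time
import urllib2

def fetch_limited(url, out_path, rate_limit=16 * 1024):
    """Copy url to out_path, staying roughly under rate_limit bytes/s."""
    handle = urllib2.urlopen(url)
    out = open(out_path, 'wb')
    downloaded = 0
    start = time.time()
    while True:
        chunk = handle.read(1024)
        if not chunk:
            break
        out.write(chunk)
        downloaded += len(chunk)
        # Sleep just long enough that downloaded/elapsed <= rate_limit.
        expected = downloaded / float(rate_limit)
        elapsed = time.time() - start
        if expected > elapsed:
            time.sleep(expected - elapsed)
    out.close()
    handle.close()
------------------------------------------------------------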

It really seems weird that urllib2 is missing reporthook functionality!
Thank you,
Dimitris

Jan 12 '08 #5
