urllib2 rate limiting

Hello list,

I want to limit the download speed when using urllib2. In particular,
having several parallel downloads, I want to make sure that their total
speed doesn't exceed a maximum value.

I can't find a simple way to achieve this. After researching, I can try
some things, but I'm stuck on the details:

1) Can I overload some method in _socket.py to achieve this, and perhaps
make this generic enough to work even with other libraries than urllib2?

2) There is the urllib.urlretrieve() function which accepts a reporthook
parameter. Perhaps I can have reporthook increment a global counter and
sleep as necessary when a threshold is reached.
However there is nothing similar in urllib2. Isn't urllib2 supposed to be
a superset of urllib in functionality? Why is there no reporthook
parameter in any of urllib2's functions?
Moreover, even the existing reporthook interface doesn't seem quite
right: reporthook(blocknum, bs, size) is always called with bs=8K, even
for the last block, and (blocknum*bs > size) is sometimes possible, if
the server sends a wrong Content-Length HTTP header. (A defensive hook
sketch follows these three ideas.)

3) Perhaps I can use filehandle.read(1024) and manually read as many
chunks of data as I need. However, I think this would generally be
inefficient, and I'm not sure how it would interact with urllib2's
internal buffering.
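
Regarding the reporthook caveats in idea 2, a hook could at least clamp
the byte count itself. A rough sketch (the clamping is just my guess at a
workaround, and it only helps when the server's Content-Length is sane):

------------------------------------------------------------
def safe_hook(blocknum, bs, size):
    # bs is the requested block size (8 KB), not the number of bytes
    # actually read, so blocknum * bs can overshoot the true total;
    # clamp it against the reported size when one is available.
    downloaded = blocknum * bs
    if size >= 0:
        downloaded = min(downloaded, size)
    print "%d of %d bytes" % (downloaded, size)
------------------------------------------------------------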

So how do you think I can achieve rate limiting in urllib2?
Thanks in advance,
Dimitris

P.S. And something simpler: How can I disallow urllib2 to follow
redirections to foreign hosts?
Jan 10 '08 #1
Dimitrios Apostolou <ji***@gmx.net> writes:
> P.S. And something simpler: How can I disallow urllib2 to follow
> redirections to foreign hosts?
You need to subclass `urllib2.HTTPRedirectHandler`, override the
`http_error_301` and `http_error_302` methods, and throw an
`urllib2.HTTPError` exception.

http://diveintopython.org/http_web_s...redirects.html
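
Something along these lines might work (an untested sketch; the host
comparison and the SameHostRedirectHandler name are just for
illustration):

------------------------------------------------------------
import urllib2
import urlparse

class SameHostRedirectHandler(urllib2.HTTPRedirectHandler):
    """Refuse redirects whose target is on a different host."""
    def http_error_302(self, req, fp, code, msg, hdrs):
        # The redirect target is in the Location: (or uri:) header.
        newurl = hdrs.get('location') or hdrs.get('uri', '')
        old_host = urlparse.urlsplit(req.get_full_url())[1]
        new_host = urlparse.urlsplit(newurl)[1]
        if new_host and new_host != old_host:
            raise urllib2.HTTPError(req.get_full_url(), code,
                                    "redirect to foreign host refused",
                                    hdrs, fp)
        return urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, hdrs)
    http_error_301 = http_error_302

opener = urllib2.build_opener(SameHostRedirectHandler)
------------------------------------------------------------

Then fetch through opener.open(url) instead of urllib2.urlopen(url).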

HTH,
Rob
Jan 10 '08 #2
On Thu, 10 Jan 2008, Rob Wolfe wrote:
> Dimitrios Apostolou <ji***@gmx.net> writes:
>> P.S. And something simpler: How can I disallow urllib2 to follow
>> redirections to foreign hosts?
>
> You need to subclass `urllib2.HTTPRedirectHandler`, override the
> `http_error_301` and `http_error_302` methods, and throw an
> `urllib2.HTTPError` exception.
Thanks! I think for my case it's better to override the redirect_request
method, and return a Request only when the redirection goes to the same
site. Just another question, because I can't find the meaning of the
(req, fp, code, msg, hdrs) parameters in the docs: to read the URL I get
redirected to (the 'Location:' HTTP header?), should I check the hdrs
parameter, or is there a better way?
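
From a look at urllib2's source, http_error_302 itself extracts the
Location: (falling back to uri:) header from hdrs and hands it to
redirect_request as an extra newurl argument, so there may be no need to
parse hdrs at all. An untested sketch of what I have in mind (the host
check is my own guess):

------------------------------------------------------------
import urllib2
import urlparse

class SameSiteRedirectHandler(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, hdrs, newurl):
        # newurl is the redirect target, already pulled out of hdrs.
        old_host = urlparse.urlsplit(req.get_full_url())[1]
        new_host = urlparse.urlsplit(newurl)[1]
        if new_host and new_host != old_host:
            raise urllib2.HTTPError(newurl, code,
                                    "foreign-host redirect refused",
                                    hdrs, fp)
        # Same host: defer to the stock implementation.
        return urllib2.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, hdrs, newurl)
------------------------------------------------------------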
Thanks,
Dimitris

> http://diveintopython.org/http_web_s...redirects.html
>
> HTH,
> Rob
Jan 10 '08 #3
Dimitrios Apostolou <ji***@gmx.net> wrote:
> I want to limit the download speed when using urllib2. In particular,
> having several parallel downloads, I want to make sure that their total
> speed doesn't exceed a maximum value.
>
> I can't find a simple way to achieve this. After researching, I can try
> some things, but I'm stuck on the details:
>
> 1) Can I overload some method in _socket.py to achieve this, and perhaps
> make this generic enough to work even with other libraries than urllib2?
>
> 2) There is the urllib.urlretrieve() function which accepts a reporthook
> parameter.
Here is an implementation based on that idea. I've used urllib rather
than urllib2 as that is what I'm familiar with.

------------------------------------------------------------
#!/usr/bin/python

"""
Fetch a url rate limited

Syntax: rate-limited-fetch.py "rate in kBytes/s" URL local_file_name
"""

import sys
import urllib
from time import time, sleep

class RateLimit(object):
    """Rate limit a url fetch"""
    def __init__(self, rate_limit):
        """rate limit in kBytes / second"""
        self.rate_limit = rate_limit
        self.start = time()

    def __call__(self, block_count, block_size, total_size):
        # block_size is the requested read size (usually 8 KB), so
        # block_count * block_size can overshoot on the last block.
        total_kb = total_size / 1024
        downloaded_kb = (block_count * block_size) / 1024
        elapsed_time = time() - self.start
        if elapsed_time != 0:
            rate = downloaded_kb / elapsed_time
            print "%d kb of %d kb downloaded %.1f kBytes/s" % (downloaded_kb, total_kb, rate)
        # Sleep for however long we are ahead of the target rate.
        expected_time = downloaded_kb / self.rate_limit
        sleep_time = expected_time - elapsed_time
        print "Sleep for", sleep_time
        if sleep_time > 0:
            sleep(sleep_time)

def main():
    """Fetch the contents of a url with a rate limit"""
    if len(sys.argv) != 4:
        print 'Syntax: %s "rate in kBytes/s" URL "local output path"' % sys.argv[0]
        raise SystemExit(1)
    rate_limit, url, out_path = sys.argv[1:]
    rate_limit = float(rate_limit)
    print "Fetching %r to %r with rate limit %.1f" % (url, out_path, rate_limit)
    urllib.urlretrieve(url, out_path, reporthook=RateLimit(rate_limit))

if __name__ == "__main__":
    main()
------------------------------------------------------------

Use it like this:

$ ./rate-limited-fetch.py 16 http://some/url/or/other z
Fetching 'http://some/url/or/other' to 'z' with rate limit 16.0
0 kb of 10118 kb downloaded 0.0 kBytes/s
Sleep for -0.0477550029755
8 kb of 10118 kb downloaded 142.1 kBytes/s
Sleep for 0.443691015244
16 kb of 10118 kb downloaded 32.1 kBytes/s
Sleep for 0.502038002014
24 kb of 10118 kb downloaded 24.0 kBytes/s
Sleep for 0.498028993607
32 kb of 10118 kb downloaded 21.3 kBytes/s
Sleep for 0.497982025146
40 kb of 10118 kb downloaded 20.0 kBytes/s
Sleep for 0.497948884964
48 kb of 10118 kb downloaded 19.2 kBytes/s
Sleep for 0.498008966446
....
1416 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499262094498
1424 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499293088913
1432 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499292135239
1440 kb of 10118 kb downloaded 16.1 kBytes/s
Sleep for 0.499267101288
....
--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jan 11 '08 #4
On Fri, 11 Jan 2008, Nick Craig-Wood wrote:
> Here is an implementation based on that idea. I've used urllib rather
> than urllib2 as that is what I'm familiar with.
Thanks! Really nice implementation. However, I'm stuck with urllib2
because of its extra functionality, so I'll try to implement something
similar using handle.read(1024) to read in small chunks.
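
Something along these lines, perhaps (an untested sketch; the
fetch_limited name and the 16 kB/s default are made up):

------------------------------------------------------------
import time
import urllib2

def fetch_limited(url, out_path, rate_limit=16 * 1024):
    """Copy url to out_path, staying roughly under rate_limit bytes/s."""
    handle = urllib2.urlopen(url)
    out = open(out_path, 'wb')
    downloaded = 0
    start = time.time()
    while True:
        chunk = handle.read(1024)
        if not chunk:
            break
        out.write(chunk)
        downloaded += len(chunk)
        # Sleep just long enough that downloaded/elapsed <= rate_limit.
        expected = downloaded / float(rate_limit)
        elapsed = time.time() - start
        if expected > elapsed:
            time.sleep(expected - elapsed)
    out.close()
    handle.close()
------------------------------------------------------------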

It really seems weird that urllib2 is missing reporthook functionality!
Thank you,
Dimitris

Jan 12 '08 #5
