
Making HTTP requests using Twisted

I'm hoping to write a program that will read any number of urls from
stdin (1 per line), download them, and process them. So far my script
(below) works well for small numbers of urls. However, it does not
scale to more than 200 urls or so, because it issues HTTP requests for
all of the urls simultaneously, and terminates after 25 seconds.
Ideally, I'd like this script to download at most 50 pages in parallel,
and to time out if and only if any HTTP request is not answered in 3
seconds. What changes do I need to make?

Is Twisted the best library for me to be using? I do like Twisted, but
it seems more suited to batch mode operations. Is there some way that I
could continue registering url requests while the reactor is running?
Is there a way to specify a timeout per page request, rather than for
a batch of page requests?

Thanks!

#-------------------------------------------------

from twisted.internet import reactor
from twisted.web import client
import re, urllib, sys, time

def extract(html):
    # do some processing on html, writing to stdout
    pass

def printError(failure):
    print >>sys.stderr, "Error:", failure.getErrorMessage()

def stopReactor():
    print "Now stopping reactor..."
    reactor.stop()

for url in sys.stdin:
    url = url.rstrip()
    client.getPage(url).addCallback(extract).addErrback(printError)

reactor.callLater(25, stopReactor)
reactor.run()

Jul 11 '06 #1
4 Replies


rzimerman wrote:
I'm hoping to write a program that will read any number of urls from
stdin (1 per line), download them, and process them. So far my script
(below) works well for small numbers of urls. However, it does not
scale to more than 200 urls or so, because it issues HTTP requests for
all of the urls simultaneously, and terminates after 25 seconds.
Ideally, I'd like this script to download at most 50 pages in parallel,
and to time out if and only if any HTTP request is not answered in 3
seconds. What changes do I need to make?

Is Twisted the best library for me to be using? I do like Twisted, but
it seems more suited to batch mode operations. Is there some way that I
could continue registering url requests while the reactor is running?
Is there a way to specify a timeout per page request, rather than for
a batch of page requests?
Have a look at pyCurl. (http://pycurl.sourceforge.net)

Regards
Sreeram


Jul 11 '06 #2

"rzimerman" wrote:
Is Twisted the best library for me to be using? I do like Twisted, but
it seems more suited to batch mode operations. Is there some way that I
could continue registering url requests while the reactor is running?
Is there a way to specify a timeout per page request, rather than for
a batch of page requests?
There are probably ways to solve this with Twisted, but in case you want a
simpler alternative, you could use Python's standard asyncore module and
the stuff described here:

http://effbot.org/zone/effnews.htm

especially

http://effbot.org/zone/effnews-1.htm...g-the-rss-data
http://effbot.org/zone/effnews-3.htm#managing-downloads

</F>

Jul 11 '06 #3

rzimerman wrote:
I'm hoping to write a program that will read any number of urls from
stdin (1 per line), download them, and process them. So far my script
(below) works well for small numbers of urls. However, it does not
scale to more than 200 urls or so, because it issues HTTP requests for
all of the urls simultaneously, and terminates after 25 seconds.
Ideally, I'd like this script to download at most 50 pages in parallel,
and to time out if and only if any HTTP request is not answered in 3
seconds. What changes do I need to make?
Take a look at
http://svn.twistedmatrix.com/cvs/tru...rkup&rev=15456

And read
http://twistedmatrix.com/documents/c...ntFactory.html

You can pass a timeout to the constructor.
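The per-request deadline can also be sketched outside Twisted. The helper below is hypothetical (not part of any library): it runs a blocking call in a worker thread and gives up after a fixed number of seconds, which is roughly what a per-request timeout gives you.

```python
import threading

def call_with_timeout(func, args=(), timeout=3.0):
    """Run func(*args) in a worker thread; raise TimeoutError if it
    has not finished within `timeout` seconds."""
    result = {}

    def runner():
        try:
            result["value"] = func(*args)
        except Exception as exc:
            result["error"] = exc

    t = threading.Thread(target=runner)
    t.daemon = True
    t.start()
    t.join(timeout)
    if t.is_alive():
        raise TimeoutError("request did not finish in %.1fs" % timeout)
    if "error" in result:
        raise result["error"]
    return result["value"]
```

Note that the abandoned worker thread keeps running in the background; plain threads cannot be killed, which is acceptable for a sketch but wasteful in a real crawler.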

To download at most 50 pages in parallel you can use a download queue.

Here is a quick example, ABSOLUTELY NOT TESTED:

from twisted.internet.defer import Deferred, DeferredList
from twisted.web.client import getPage

class DownloadQueue(object):
    SIZE = 50

    def __init__(self):
        self.requests = []   # queued requests
        self.deferreds = []  # waiting requests

    def addRequest(self, url, timeout):
        if len(self.deferreds) >= self.SIZE:
            # wait for completion of all previous requests
            DeferredList(self.deferreds).addCallback(self._callback)
            self.deferreds = []

            # queue the request
            deferred = Deferred()
            self.requests.append((url, timeout, deferred))

            return deferred
        else:
            # execute the request now
            deferred = getPage(url, timeout=timeout)
            self.deferreds.append(deferred)

            return deferred

    def _callback(self, results):
        if len(self.requests) > self.SIZE:
            queue = self.requests[:self.SIZE]
            self.requests = self.requests[self.SIZE:]
        else:
            queue = self.requests[:]
            self.requests = []

        # execute the requests
        for (url, timeout, deferredHelper) in queue:
            deferred = getPage(url, timeout=timeout)
            self.deferreds.append(deferred)
            deferred.chainDeferred(deferredHelper)
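For comparison, the same bounded-parallelism idea can be sketched in plain Python with threads and no Twisted dependency. Here `fetch` is a stand-in for a real HTTP call and `limit` plays the role of SIZE:

```python
import threading

def fetch_all(urls, fetch, limit=50):
    """Call fetch(url) for every url, at most `limit` at a time.
    Exceptions are collected as results rather than raised, so one
    bad url cannot stop the whole batch."""
    sem = threading.Semaphore(limit)
    lock = threading.Lock()
    results = {}

    def worker(url):
        try:
            value = fetch(url)
        except Exception as exc:
            value = exc
        finally:
            sem.release()
        with lock:
            results[url] = value

    threads = []
    for url in urls:
        sem.acquire()  # blocks while `limit` fetches are in flight
        t = threading.Thread(target=worker, args=(url,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results
```

Unlike the Twisted version this blocks one OS thread per download, so it trades the reactor's efficiency for simplicity.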


Regards Manlio Perillo
Jul 11 '06 #4

Manlio Perillo wrote:
[...]
Here is a quick example, ABSOLUTELY NOT TESTED:

class DownloadQueue(object):
    SIZE = 50

    def __init__(self):
        self.requests = []   # queued requests
        self.deferreds = []  # waiting requests

    def addRequest(self, url, timeout):
        if len(self.deferreds) >= self.SIZE:
            # wait for completion of all previous requests
            DeferredList(self.deferreds).addCallback(self._callback)
            self.deferreds = []
The deferreds list should be cleared in the _callback method, not here.
Please note that there are probably other bugs.
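In outline, the corrected bookkeeping (modeled here with plain lists instead of Deferreds, so the structure can be checked without Twisted) would look like:

```python
class BatchQueue(object):
    """Plain-Python model of the corrected DownloadQueue bookkeeping:
    the in-flight list is only reset inside the completion callback."""
    SIZE = 2  # small limit so the behaviour is easy to trace

    def __init__(self):
        self.pending = []    # jobs waiting for a free slot
        self.in_flight = []  # jobs currently "running"

    def add(self, job):
        if len(self.in_flight) >= self.SIZE:
            self.pending.append(job)
        else:
            self.in_flight.append(job)

    def batch_done(self):
        # clear in_flight here, in the callback, then start waiting jobs
        self.in_flight = self.pending[:self.SIZE]
        self.pending = self.pending[self.SIZE:]
```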
Regards Manlio Perillo
Jul 11 '06 #5
