Making HTTP requests using Twisted

I'm hoping to write a program that will read any number of urls from
stdin (1 per line), download them, and process them. So far my script
(below) works well for small numbers of urls. However, it does not
scale to more than 200 urls or so, because it issues HTTP requests for
all of the urls simultaneously, and terminates after 25 seconds.
Ideally, I'd like this script to download at most 50 pages in parallel,
and to time out if and only if any HTTP request is not answered in 3
seconds. What changes do I need to make?

Is Twisted the best library for me to be using? I do like Twisted, but
it seems more suited to batch mode operations. Is there some way that I
could continue registering url requests while the reactor is running?
Is there a way to specify a timeout per page request, rather than for
a batch of page requests?

Thanks!

#-------------------------------------------------

from twisted.internet import reactor
from twisted.web import client
import re, urllib, sys, time

def extract(html):
    # do some processing on html, writing to stdout
    pass

def printError(failure):
    print >>sys.stderr, "Error:", failure.getErrorMessage()

def stopReactor():
    print "Now stopping reactor..."
    reactor.stop()

# issue one request per url read from stdin, all at once
for url in sys.stdin:
    url = url.rstrip()
    client.getPage(url).addCallback(extract).addErrback(printError)

# stop the whole batch after 25 seconds, finished or not
reactor.callLater(25, stopReactor)
reactor.run()

Jul 11 '06 #1
rzimerman wrote:
> I'm hoping to write a program that will read any number of urls from
> stdin (1 per line), download them, and process them. So far my script
> (below) works well for small numbers of urls. However, it does not
> scale to more than 200 urls or so, because it issues HTTP requests for
> all of the urls simultaneously, and terminates after 25 seconds.
> Ideally, I'd like this script to download at most 50 pages in parallel,
> and to time out if and only if any HTTP request is not answered in 3
> seconds. What changes do I need to make?
>
> Is Twisted the best library for me to be using? I do like Twisted, but
> it seems more suited to batch mode operations. Is there some way that I
> could continue registering url requests while the reactor is running?
> Is there a way to specify a timeout per page request, rather than for
> a batch of page requests?

Have a look at pyCurl. (http://pycurl.sourceforge.net)

Regards
Sreeram

Jul 11 '06 #2
"rzimerman" wrote:
> Is Twisted the best library for me to be using? I do like Twisted, but
> it seems more suited to batch mode operations. Is there some way that I
> could continue registering url requests while the reactor is running?
> Is there a way to specify a timeout per page request, rather than for
> a batch of page requests?

There are probably ways to solve this with Twisted, but in case you want a
simpler alternative, you could use Python's standard asyncore module and
the stuff described here:

http://effbot.org/zone/effnews.htm

especially

http://effbot.org/zone/effnews-1.htm...g-the-rss-data
http://effbot.org/zone/effnews-3.htm#managing-downloads

</F>

Jul 11 '06 #3
rzimerman wrote:
> I'm hoping to write a program that will read any number of urls from
> stdin (1 per line), download them, and process them. So far my script
> (below) works well for small numbers of urls. However, it does not
> scale to more than 200 urls or so, because it issues HTTP requests for
> all of the urls simultaneously, and terminates after 25 seconds.
> Ideally, I'd like this script to download at most 50 pages in parallel,
> and to time out if and only if any HTTP request is not answered in 3
> seconds. What changes do I need to make?

Take a look at
http://svn.twistedmatrix.com/cvs/tru...rkup&rev=15456

And read
http://twistedmatrix.com/documents/c...ntFactory.html

You can pass a timeout to the constructor.

To download at most 50 pages in parallel you can use a download queue.

Here is a quick example, ABSOLUTELY NOT TESTED:

from twisted.internet.defer import Deferred, DeferredList
from twisted.web.client import getPage


class DownloadQueue(object):
    SIZE = 50

    def __init__(self):
        self.requests = []   # queued requests
        self.deferreds = []  # waiting requests

    def addRequest(self, url, timeout):
        if len(self.deferreds) >= self.SIZE:
            # wait for completion of all previous requests
            DeferredList(self.deferreds
                         ).addCallback(self._callback)
            self.deferreds = []

            # queue the request
            deferred = Deferred()
            self.requests.append((url, timeout, deferred))

            return deferred
        else:
            # execute the request now
            deferred = getPage(url, timeout=timeout)
            self.deferreds.append(deferred)

            return deferred

    def _callback(self, results):
        # DeferredList passes the list of results; it is not used here
        if len(self.requests) > self.SIZE:
            queue = self.requests[:self.SIZE]
            self.requests = self.requests[self.SIZE:]
        else:
            queue = self.requests[:]
            self.requests = []

        # execute the requests
        for (url, timeout, deferredHelper) in queue:
            deferred = getPage(url, timeout=timeout)
            self.deferreds.append(deferred)

            # forward the page (or failure) to the deferred handed to the caller
            deferred.chainDeferred(deferredHelper)


Regards Manlio Perillo
Jul 11 '06 #4
Manlio Perillo wrote:
> [...]
> Here is a quick example, ABSOLUTELY NOT TESTED:
>
> class DownloadQueue(object):
>     SIZE = 50
>
>     def __init__(self):
>         self.requests = []   # queued requests
>         self.deferreds = []  # waiting requests
>
>     def addRequest(self, url, timeout):
>         if len(self.deferreds) >= self.SIZE:
>             # wait for completion of all previous requests
>             DeferredList(self.deferreds
>                          ).addCallback(self._callback)
>             self.deferreds = []

The deferreds list should be cleared in the _callback method, not here.
Please note that there are probably other bugs.
Regards Manlio Perillo
Jul 11 '06 #5
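
For comparison, here is a minimal sketch of the two requirements from the original question (at most N downloads in flight, and a timeout applied to each request rather than to the whole batch). It swaps in the standard library's asyncio instead of Twisted, so it is not the approach proposed in the replies above. The names fake_fetch, fetch_limited, download_all and run_demo are hypothetical, and fake_fetch only simulates a download with a sleep; a real script would replace it with an actual HTTP client call.

```python
import asyncio

MAX_PARALLEL = 50      # at most 50 downloads in flight
REQUEST_TIMEOUT = 3.0  # seconds allowed per request


async def fake_fetch(url, delay):
    # Hypothetical stand-in for a real HTTP download: sleeps, then returns a body.
    await asyncio.sleep(delay)
    return "<html>%s</html>" % url


async def fetch_limited(semaphore, url, delay, timeout, stats):
    # The semaphore caps how many fetches run at once; wait_for enforces
    # the timeout on this request only, not on the whole batch.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            body = await asyncio.wait_for(fake_fetch(url, delay), timeout)
            return (url, body)
        except asyncio.TimeoutError:
            return (url, None)  # None marks a request that timed out
        finally:
            stats["active"] -= 1


async def download_all(jobs, max_parallel=MAX_PARALLEL, timeout=REQUEST_TIMEOUT):
    # jobs is a list of (url, simulated_delay) pairs.
    semaphore = asyncio.Semaphore(max_parallel)
    stats = {"active": 0, "peak": 0}
    tasks = [fetch_limited(semaphore, url, delay, timeout, stats)
             for url, delay in jobs]
    # gather returns once every request has finished or timed out,
    # so no fixed "stop after 25 seconds" timer is needed.
    results = await asyncio.gather(*tasks)
    return results, stats["peak"]


def run_demo():
    # Twenty fast simulated pages plus one that is slower than the timeout.
    jobs = [("http://example.com/%d" % i, 0.01) for i in range(20)]
    jobs.append(("http://example.com/slow", 1.0))
    return asyncio.run(download_all(jobs, max_parallel=5, timeout=0.2))
```

Unlike the batch approach in the DownloadQueue example, which waits for a whole group of requests to finish before starting the next group, a semaphore lets a new request start as soon as any slot frees up. Twisted offers comparable building blocks, but this sketch does not use them.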
