On Wed, Mar 31, 2004 at 07:33:45PM +0200, Markus Franz wrote:
> Hi.
>
> How can I grab websites with a command-line Python script? I want to start
> the script like this:
>
>     ./script.py ---xxx--- http://www.address1.com http://www.address2.com http://www.address3.com
>
> The script should load these 3 websites (or more if specified) in parallel
> (maybe processes? threads?) and show their contents separated by ---xxx---.
> The whole output should be printed on the command line. Each website should
> have at most 15 seconds to return its contents, to avoid a never-ending script.
>
> How can I do this?
You could use Twisted <http://twistedmatrix.com>:
from twisted.internet import reactor
from twisted.web.client import getPage
import sys

def gotPage(page):
    # A page arrived: print the separator, then its contents.
    print separator
    print page

def failed(failure):
    # The download failed (or timed out): say so and show the traceback.
    print separator + ' FAILED'
    failure.printTraceback()

def decrement(ignored):
    # Stop the reactor once every URL has either succeeded or failed.
    global count
    count -= 1
    if count == 0:
        reactor.stop()

separator = sys.argv[1]
urlList = sys.argv[2:]
count = len(urlList)

for url in urlList:
    getPage(url, timeout=15).addCallbacks(gotPage, failed).addBoth(decrement)

reactor.run()
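
You'd invoke it just as in your example:

./script.py ---xxx--- http://www.address1.com http://www.address2.com http://www.address3.com
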
It will grab the sites in parallel, printing them in the order they arrive,
and it doesn't use multiple processes or multiple threads :)
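
If you'd rather avoid the Twisted dependency, a rough thread-based version
using only the standard library should also do the job (untested sketch; it
assumes Python 2.3 or later for socket.setdefaulttimeout, and note that the
timeout there applies to each socket operation rather than to the whole
download):

import socket
import sys
import threading
import urllib2

socket.setdefaulttimeout(15)   # cap each socket operation at 15 seconds

separator = sys.argv[1]
lock = threading.Lock()        # keeps two pages from interleaving on stdout

def fetch(url):
    try:
        page = urllib2.urlopen(url).read()
    except Exception, e:
        page = 'FAILED: %s' % e
    lock.acquire()
    try:
        print separator
        print page
    finally:
        lock.release()

threads = [threading.Thread(target=fetch, args=(url,)) for url in sys.argv[2:]]
for t in threads:
    t.start()
for t in threads:
    t.join()

Each page is printed as soon as its thread finishes, so, like the Twisted
version, the output order depends on which site answers first.
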
-Andrew.