
Fetching websites with Python

Hi.

How can I grab websites with a command-line python script? I want to start
the script like this:

./script.py ---xxx--- http://www.address1.com http://www.address2.com
http://www.address3.com

The script should load these 3 websites (or more if specified) in parallel
(maybe with processes? threads?) and show their contents separated by ---xxx---.
The whole output should be printed on the command line. Each website should
have at most 15 seconds to return its contents, to avoid a never-ending script.

How can I do this?

Thanks.

Yours sincerely

Markus Franz
Jul 18 '05 #1
6 Replies


Markus Franz wrote:
> How can I grab websites with a command-line python script? [...]


Markus,
I think there's a timeout in urllib; not sure.
import urllib
import sys
#--------------------------------------
if __name__ == "__main__":
    if len(sys.argv) < 3:
        print 'arg error'
        sys.exit(1)
    sep = sys.argv[1]                  # separator, e.g. ---xxx---
    for url in sys.argv[2:]:           # fetch each URL in turn (serial, no timeout)
        try:
            f = urllib.urlopen(url)
            lines = f.readlines()
            f.close()
            for line in lines:
                print line[:-1]        # strip the trailing newline; print adds its own
        except:
            print url, 'get error'
        print sep

Jul 18 '05 #2

Markus Franz <mf@orase.com> wrote:
> How can I grab websites with a command-line python script? [...]
> The script should load these 3 websites (or more if specified) in parallel

In parallel? Hmm... play around with
lynx -dump http://... > a1 &
lynx -dump http://... > a2 &
lynx -dump http://... > a3 &
sleep 15
kill %1 %2 %3
for i in a1 a2 a3; do
cat $i
echo ---xxx---
done
rm a1 a2 a3

In serial, the code becomes
for i in http://... http://... http://... ; do
lynx -connect_timeout=15 -dump $i
echo ---xxx---
done


--
William Park, Open Geometry Consulting, <op**********@yahoo.ca>
Linux solution for data processing and document management.
Jul 18 '05 #3

wes weston <ww*****@att.net> wrote:
> Markus,
> I think there's a timeout in urllib; not sure.


No there isn't, bit of a shame that. There is in httplib.
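One way to read this: httplib exposes the connection's underlying socket, so
you can set a per-socket timeout yourself. A rough, untested sketch of that
idea (this is an assumption about what is meant; the host is only an example):

import httplib

conn = httplib.HTTPConnection('www.address1.com')   # example host, not from the thread
conn.connect()
conn.sock.settimeout(15)    # per-socket timeout on the underlying socket (Python 2.3+)
conn.request('GET', '/')
response = conn.getresponse()
print response.status, response.reason
print response.read()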
Jul 18 '05 #4

> How can I do this?

Perhaps something like this:

import urllib2, thread, time, sys

thread_count = len(sys.argv) - 1
pages = []
lock = thread.allocate_lock()

def timer():
    # Release the main lock after 15 seconds so the script never hangs.
    global lock
    time.sleep(15)
    lock.release()

def get_page(url):
    # Fetch one page; the last thread to finish releases the main lock.
    global thread_count, pages, lock
    try: pages.append(urllib2.urlopen(url).read())
    except: pass
    thread_count -= 1
    if thread_count == 0:
        lock.release()

lock.acquire()
thread.start_new_thread(timer, ())
for url in sys.argv[1:]:
    thread.start_new_thread(get_page, (url,))
lock.acquire()                      # blocks until the timeout or all pages arrive
print '\n---xxx---\n'.join(pages)

Please have a nice day.

Regards,
Technoumena
Jul 18 '05 #5

> > Markus,
> > I think there's a timeout in urllib; not sure.
>
> No there isn't, bit of a shame that. There is in httplib.


Sure there is: use urllib or urllib2 as usual, but also import the socket
module and call socket.setdefaulttimeout(secs) before requesting any
pages with urlopen.
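
For instance, a minimal untested sketch of that approach, reusing the
separator-as-first-argument convention from the original question:

import socket
import sys
import urllib

socket.setdefaulttimeout(15)    # applies to every socket created from here on

sep = sys.argv[1]
for url in sys.argv[2:]:
    try:
        print urllib.urlopen(url).read()
    except (IOError, socket.error):
        print url, 'failed or timed out'
    print sep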

f29
Jul 18 '05 #6

On Wed, Mar 31, 2004 at 07:33:45PM +0200, Markus Franz wrote:
> How can I grab websites with a command-line python script? [...]
> How can I do this?

You could use Twisted <http://twistedmatrix.com>:

from twisted.internet import reactor
from twisted.web.client import getPage
import sys

def gotPage(page):
    print separator
    print page

def failed(failure):
    print separator + ' FAILED'
    failure.printTraceback()

def decrement(ignored):
    # Stop the reactor once every URL has either succeeded or failed.
    global count
    count -= 1
    if count == 0:
        reactor.stop()

separator = sys.argv[1]
urlList = sys.argv[2:]
count = len(urlList)
for url in urlList:
    getPage(url, timeout=15).addCallbacks(gotPage, failed).addBoth(decrement)

reactor.run()

It will grab the sites in parallel, printing them in the order they arrive,
and it doesn't use multiple processes or threads :)

-Andrew.
Jul 18 '05 #7
