Fetching websites with Python

Hi.

How can I grab websites with a command-line Python script? I want to start
the script like this:

./script.py ---xxx--- http://www.address1.com http://www.address2.com
http://www.address3.com

The script should load these 3 websites (or more if specified) in parallel
(maybe with processes? threads?) and show their contents separated by ---xxx---.
The whole output should be printed on the command line. Each website should
have at most 15 seconds to return its contents, in order to avoid a
never-ending script.

How can I do this?

Thanks.

Yours sincerely

Markus Franz
Jul 18 '05 #1
Markus Franz wrote:
Hi.

How can I grab websites with a command-line Python script? I want to start
the script like this:

./script.py ---xxx--- http://www.address1.com http://www.address2.com
http://www.address3.com

The script should load these 3 websites (or more if specified) in parallel
(maybe with processes? threads?) and show their contents separated by ---xxx---.
The whole output should be printed on the command line. Each website should
have at most 15 seconds to return its contents, in order to avoid a
never-ending script.

How can I do this?

Thanks.

Yours sincerely

Markus Franz


Markus,
I think there's a timeout in urllib; not sure.
import urllib
import sys
#--------------------------------------
if __name__ == "__main__":
    if len(sys.argv) < 3:
        print 'arg error'
        sys.exit(1)
    sep = sys.argv[1]
    for url in sys.argv[2:]:
        try:
            f = urllib.urlopen(url)
            lines = f.readlines()
            f.close()
            for line in lines:
                print line[:-1]
        except:
            print url, 'get error'
        print sep

Jul 18 '05 #2
Markus Franz <mf@orase.com> wrote:
Hi.

How can I grab websites with a command-line Python script? I want to start
the script like this:

./script.py ---xxx--- http://www.address1.com http://www.address2.com
http://www.address3.com

The script should load these 3 websites (or more if specified) in parallel
In parallel? Hmm... play around with
lynx -dump http://... > a1 &
lynx -dump http://... > a2 &
lynx -dump http://... > a3 &
sleep 15
kill %1 %2 %3
for i in a1 a2 a3; do
    cat $i
    echo ---xxx---
done
rm a1 a2 a3

In serial, the code becomes
for i in http://... http://... http://... ; do
    lynx -connect_timeout=15 -dump $i
    echo ---xxx---
done
(maybe with processes? threads?) and show their contents separated by ---xxx---.
The whole output should be printed on the command line. Each website should
have at most 15 seconds to return its contents, in order to avoid a
never-ending script.

How can I do this?

Thanks.

Yours sincerely

Markus Franz


--
William Park, Open Geometry Consulting, <op**********@yahoo.ca>
Linux solution for data processing and document management.
Jul 18 '05 #3
wes weston <ww*****@att.net> wrote:
Markus,
I think there's a timeout in urllib; not sure.


No there isn't, bit of a shame that. There is in httplib.
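
One way to get a per-connection timeout at the httplib level is to set it on the
underlying socket after connecting; a rough sketch (the host name is just the
first address from the original question):

import httplib

conn = httplib.HTTPConnection('www.address1.com')
conn.connect()
conn.sock.settimeout(15)   # limit blocking socket operations to 15 seconds
                           # (note: the connect() call itself is not covered)
conn.request('GET', '/')
response = conn.getresponse()
print response.read()
conn.close()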
Jul 18 '05 #4
> How can I do this?

Perhaps something like this:

import urllib2, thread, time, sys

thread_count = len(sys.argv) - 1
pages = []
lock = thread.allocate_lock()

def timer():
    global lock
    time.sleep(15)
    lock.release()

def get_page(url):
    global thread_count, pages, lock
    try: pages.append(urllib2.urlopen(url).read())
    except: pass
    thread_count -= 1
    if thread_count == 0:
        lock.release()

lock.acquire()
thread.start_new_thread(timer, ())
for url in sys.argv[1:]:
    thread.start_new_thread(get_page, (url,))
lock.acquire()
print '\n---xxx---\n'.join(pages)

Please have a nice day.

Regards,
Technoumena
Jul 18 '05 #5
> > Markus,
> > I think there's a timeout in urllib; not sure.
>
> No there isn't, bit of a shame that. There is in httplib.


Sure there is, use urllib or urllib2 as usual, but also import socket
module and call "socket.setdefaulttimeout(secs)" before requesting any
pages with urlopen.
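
A rough sketch of that approach, using the first URL from the original question:

import socket
import urllib2

socket.setdefaulttimeout(15)   # applies to every socket created after this call

try:
    page = urllib2.urlopen('http://www.address1.com').read()
    print page
except (IOError, socket.error), e:   # timeouts surface as socket errors / URLError
    print 'fetch failed:', e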

f29
Jul 18 '05 #6
On Wed, Mar 31, 2004 at 07:33:45PM +0200, Markus Franz wrote:
Hi.

How can I grab websites with a command-line Python script? I want to start
the script like this:

./script.py ---xxx--- http://www.address1.com http://www.address2.com
http://www.address3.com

The script should load these 3 websites (or more if specified) in parallel
(maybe with processes? threads?) and show their contents separated by ---xxx---.
The whole output should be printed on the command line. Each website should
have at most 15 seconds to return its contents, in order to avoid a
never-ending script.

How can I do this?


You could use Twisted <http://twistedmatrix.com>:

from twisted.internet import reactor
from twisted.web.client import getPage
import sys

def gotPage(page):
    print seperator
    print page

def failed(failure):
    print seperator + ' FAILED'
    failure.printTraceback()

def decrement(ignored):
    global count
    count -= 1
    if count == 0:
        reactor.stop()

seperator = sys.argv[1]
urlList = sys.argv[2:]
count = len(urlList)
for url in urlList:
    getPage(url, timeout=15).addCallbacks(gotPage, failed).addBoth(decrement)

reactor.run()

It will grab the sites in parallel, printing them in the order they arrive,
and doesn't use multiple processes or multiple threads :)

-Andrew.
Jul 18 '05 #7
