Fetching websites with Python

Hi.

How can I grab websites with a command-line Python script? I want to start
the script like this:

./script.py ---xxx--- http://www.address1.com http://www.address2.com
http://www.address3.com

The script should load these 3 websites (or more if specified) in parallel
(maybe with processes or threads?) and show their contents separated by ---xxx---.
The whole output should be printed on the command line. Each website should
have at most 15 seconds to return its contents, in order to avoid a
never-ending script.

How can I do this?

Thanks.

Yours sincerely

Markus Franz
Jul 18 '05 #1
Markus Franz wrote:
> How can I do this?

Markus,
I think there's a timeout in urllib; not sure.
import urllib
import sys
#--------------------------------------
if __name__ == "__main__":
    if len(sys.argv) < 3:
        print 'arg error'
        sys.exit(1)
    sep = sys.argv[1]             # first argument is the separator
    for url in sys.argv[2:]:      # remaining arguments are the URLs
        try:
            f = urllib.urlopen(url)
            lines = f.readlines()
            f.close()
            for line in lines:
                # strip only the newline, so a last line without one is not cut
                print line.rstrip('\n')
        except Exception:
            print url, 'get error'
        print sep
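
On Python 3 the same serial loop can be sketched with urllib.request, whose urlopen accepts a timeout argument. This is a minimal sketch rather than the code posted above: the names fetch and main are illustrative, the UTF-8 fallback is an assumption, and the timeout applies to each blocking socket operation (the connect, each read), not to the download as a whole.

```python
import http.client
import sys
import urllib.request


def fetch(url, timeout=15):
    """Return the body of url as text, or a 'get error' line if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Assumption: fall back to UTF-8 when the server names no charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError, LookupError, http.client.HTTPException) as exc:
        return f"{url} get error: {exc}"


def main(argv):
    """argv[1] is the separator, argv[2:] are the URLs, fetched one after another."""
    if len(argv) < 3:
        print("usage: script.py SEPARATOR URL [URL ...]", file=sys.stderr)
        return 1
    separator = argv[1]
    for url in argv[2:]:
        print(fetch(url))
        print(separator)
    return 0
```

Ending the file with sys.exit(main(sys.argv)) gives the ./script.py ---xxx--- URL ... invocation from the question; the pages are still fetched serially, not in parallel.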

Jul 18 '05 #2
Markus Franz <mf@orase.com> wrote:
> The script should load these 3 websites (or more if specified) in parallel

In parallel? Hmm... play around with

lynx -dump http://... > a1 &
lynx -dump http://... > a2 &
lynx -dump http://... > a3 &
sleep 15
kill %1 %2 %3
for i in a1 a2 a3; do
    cat "$i"
    echo ---xxx---
done
rm a1 a2 a3

In serial, the code becomes

for i in http://... http://... http://... ; do
    lynx -connect_timeout=15 -dump "$i"
    echo ---xxx---
done

--
William Park, Open Geometry Consulting, <op**********@yahoo.ca>
Linux solution for data processing and document management.
Jul 18 '05 #3
wes weston <ww*****@att.net> wrote:
> Markus,
> I think there's a timeout in urllib; not sure.

No, there isn't, which is a bit of a shame. There is one in httplib.
Jul 18 '05 #4
> How can I do this?

Perhaps something like this:

import urllib2, thread, time, sys

thread_count = len(sys.argv) - 1
pages = []
lock = thread.allocate_lock()

def timer():
    global lock
    time.sleep(15)
    lock.release()

def get_page(url):
    global thread_count, pages, lock
    try: pages.append(urllib2.urlopen(url).read())
    except: pass
    thread_count -= 1
    if thread_count == 0:
        lock.release()

lock.acquire()
thread.start_new_thread(timer, ())
for url in sys.argv[1:]:
    thread.start_new_thread(get_page, (url,))
lock.acquire()
print '\n---xxx---\n'.join(pages)
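
The thread module used above was renamed _thread in Python 3, and the shared counter and lock can be replaced by joining the worker threads against a single deadline. A minimal Python 3 sketch of that idea follows; fetch_all and its deadline parameter are illustrative names, not part of the post. The workers are daemon threads, so a fetch that is still running when the deadline passes is abandoned instead of being waited for.

```python
import threading
import time
import urllib.request


def fetch_all(urls, deadline=15.0):
    """Fetch every URL in its own daemon thread; wait at most `deadline` seconds overall."""
    results = [None] * len(urls)  # one slot per URL keeps command-line order

    def worker(index, url):
        try:
            with urllib.request.urlopen(url, timeout=deadline) as response:
                results[index] = response.read().decode("utf-8", errors="replace")
        except Exception:
            pass  # a failed or timed-out fetch simply leaves its slot empty

    threads = [
        threading.Thread(target=worker, args=(index, url), daemon=True)
        for index, url in enumerate(urls)
    ]
    for thread in threads:
        thread.start()

    # Join against one shared deadline instead of giving each thread its own.
    stop_at = time.monotonic() + deadline
    for thread in threads:
        thread.join(max(0.0, stop_at - time.monotonic()))

    return [page for page in results if page is not None]
```

With the separator taken from the command line, print(("\n" + separator + "\n").join(fetch_all(urls))) reproduces the output format of the script above. Decoding as UTF-8 is an assumption made to keep the sketch short.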

Please have a nice day.

Regards,
Technoumena
Jul 18 '05 #5
f29
> > Markus,
> > I think there's a timeout in urllib; not sure.
>
> No there isn't, bit of a shame that. There is in httplib.

Sure there is: use urllib or urllib2 as usual, but also import the socket
module and call "socket.setdefaulttimeout(secs)" before requesting any
pages with urlopen.
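
The same call exists in Python 3: socket.setdefaulttimeout sets a process-wide default that urllib.request.urlopen uses when no timeout argument is passed. A minimal sketch, assuming Python 3; the helper name fetch and the step that restores the previous default afterwards are illustrative additions, not part of the advice above.

```python
import socket
import urllib.request


def fetch(url, seconds=15):
    """Fetch url as bytes while a global default socket timeout is in effect."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)  # process-wide: affects every socket created from now on
    try:
        # No timeout argument here, so urlopen falls back to the global default.
        with urllib.request.urlopen(url) as response:
            return response.read()
    finally:
        socket.setdefaulttimeout(previous)  # put the old default back
```

Because the setting is global, it also affects any other code in the process that opens sockets while it is in effect, which is why the sketch restores the old value.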

f29
Jul 18 '05 #6
On Wed, Mar 31, 2004 at 07:33:45PM +0200, Markus Franz wrote:
> How can I do this?


You could use Twisted <http://twistedmatrix.com>:

from twisted.internet import reactor
from twisted.web.client import getPage
import sys

def gotPage(page):
    print seperator
    print page

def failed(failure):
    print seperator + ' FAILED'
    failure.printTraceback()

def decrement(ignored):
    global count
    count -= 1
    if count == 0:
        reactor.stop()

seperator = sys.argv[1]
urlList = sys.argv[2:]
count = len(urlList)
for url in urlList:
    getPage(url, timeout=15).addCallbacks(gotPage, failed).addBoth(decrement)

reactor.run()

It grabs the sites in parallel, prints them in the order they arrive,
and doesn't use multiple processes or multiple threads :)
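
Twisted is a third-party package. A comparable event-loop approach using only the Python 3 standard library can be sketched with asyncio; this swaps Twisted's reactor for asyncio's event loop and is not the code from this post. The names get_page, fetch_all and main are illustrative. The sketch speaks bare HTTP/1.0, so it follows no redirects, and it assumes UTF-8 bodies. All requests run on one thread, although the event loop may still use a helper thread for DNS lookups.

```python
import asyncio
import sys
from urllib.parse import urlsplit


async def get_page(url):
    """Fetch url with a bare HTTP/1.0 GET and return the response body as text."""
    parts = urlsplit(url)
    use_tls = parts.scheme == "https"
    port = parts.port or (443 if use_tls else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    reader, writer = await asyncio.open_connection(
        parts.hostname, port, ssl=True if use_tls else None
    )
    try:
        request = f"GET {path} HTTP/1.0\r\nHost: {parts.netloc}\r\n\r\n"
        writer.write(request.encode("ascii"))
        await writer.drain()
        raw = await reader.read()  # HTTP/1.0: the server closes the connection when done
    finally:
        writer.close()
    _headers, _, body = raw.partition(b"\r\n\r\n")
    return body.decode("utf-8", errors="replace")


async def fetch_all(urls, timeout=15):
    """Run all requests concurrently on one event loop, each limited to `timeout` seconds."""
    tasks = [asyncio.wait_for(get_page(url), timeout) for url in urls]
    # return_exceptions=True puts each failure in the result list instead of raising it
    return await asyncio.gather(*tasks, return_exceptions=True)


def main(argv):
    """argv[1] is the separator, argv[2:] are the URLs."""
    if len(argv) < 3:
        print("usage: script.py SEPARATOR URL [URL ...]", file=sys.stderr)
        return 1
    separator, urls = argv[1], argv[2:]
    for url, result in zip(urls, asyncio.run(fetch_all(urls))):
        print(separator)
        if isinstance(result, Exception):
            print(f"{url} FAILED: {result!r}")
        else:
            print(result)
    return 0
```

Unlike the Twisted version, this sketch prints the pages in command-line order once every request has finished or timed out, not in the order they arrive.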

-Andrew.
Jul 18 '05 #7
