
Fetching websites with Python

Hi.

How can I grab websites with a command-line python script? I want to start
the script like this:

./script.py ---xxx--- http://www.address1.com http://www.address2.com
http://www.address3.com

The script should load these 3 websites (or more if specified) in parallel
(maybe with processes? threads?) and show their contents separated by ---xxx---.
The whole output should be printed on the command line. Each website should
have at most 15 seconds to return its contents, to avoid a never-ending script.

How can I do this?

Thanks.

Yours sincerely

Markus Franz
Jul 18 '05 #1
6 Replies


Markus Franz wrote:
> How can I grab websites with a command-line python script? [...]


Markus,
I think there's a timeout in urllib; not sure.
import urllib
import sys
#--------------------------------------
if __name__ == "__main__":
    if len(sys.argv) < 3:
        print 'arg error'
        sys.exit(1)
    sep = sys.argv[1]                  # separator, e.g. ---xxx---
    for url in sys.argv[2:]:           # fetch each URL in turn (serial, no timeout)
        try:
            f = urllib.urlopen(url)
            lines = f.readlines()
            f.close()
            for line in lines:
                print line[:-1]        # strip the trailing newline; print adds its own
        except:
            print url, 'get error'
        print sep

Jul 18 '05 #2

Markus Franz <mf@orase.com> wrote:
> How can I grab websites with a command-line python script? [...]
> The script should load these 3 websites (or more if specified) in parallel

In parallel? Hmm... play around with
lynx -dump http://... > a1 &
lynx -dump http://... > a2 &
lynx -dump http://... > a3 &
sleep 15
kill %1 %2 %3
for i in a1 a2 a3; do
cat $i
echo ---xxx---
done
rm a1 a2 a3

In serial, the code becomes
for i in http://... http://... http://... ; do
lynx -connect_timeout=15 -dump $i
echo ---xxx---
done


--
William Park, Open Geometry Consulting, <op**********@yahoo.ca>
Linux solution for data processing and document management.
Jul 18 '05 #3

wes weston <ww*****@att.net> wrote:
> Markus,
> I think there's a timeout in urllib; not sure.


No there isn't, bit of a shame that. There is in httplib.
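One way to read this: httplib exposes the connection's underlying socket, so
you can set a per-socket timeout yourself. A rough, untested sketch of that
idea (this is an assumption about what is meant; the host is only an example):

import httplib

conn = httplib.HTTPConnection('www.address1.com')   # example host, not from the thread
conn.connect()
conn.sock.settimeout(15)    # per-socket timeout on the underlying socket (Python 2.3+)
conn.request('GET', '/')
response = conn.getresponse()
print response.status, response.reason
print response.read()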
Jul 18 '05 #4

> How can I do this?

Perhaps something like this:

import urllib2, thread, time, sys

thread_count = len(sys.argv) - 1
pages = []
lock = thread.allocate_lock()

def timer():
    # Release the main lock after 15 seconds so the script never hangs.
    global lock
    time.sleep(15)
    lock.release()

def get_page(url):
    # Fetch one page; the last thread to finish releases the main lock.
    global thread_count, pages, lock
    try: pages.append(urllib2.urlopen(url).read())
    except: pass
    thread_count -= 1
    if thread_count == 0:
        lock.release()

lock.acquire()
thread.start_new_thread(timer, ())
for url in sys.argv[1:]:
    thread.start_new_thread(get_page, (url,))
lock.acquire()                      # blocks until the timeout or all pages arrive
print '\n---xxx---\n'.join(pages)

Please have a nice day.

Regards,
Technoumena
Jul 18 '05 #5

> > Markus,
> > I think there's a timeout in urllib; not sure.
>
> No there isn't, bit of a shame that. There is in httplib.


Sure there is: use urllib or urllib2 as usual, but also import the socket
module and call socket.setdefaulttimeout(secs) before requesting any
pages with urlopen.
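
For instance, a minimal untested sketch of that approach, reusing the
separator-as-first-argument convention from the original question:

import socket
import sys
import urllib

socket.setdefaulttimeout(15)    # applies to every socket created from here on

sep = sys.argv[1]
for url in sys.argv[2:]:
    try:
        print urllib.urlopen(url).read()
    except (IOError, socket.error):
        print url, 'failed or timed out'
    print sep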

f29
Jul 18 '05 #6

On Wed, Mar 31, 2004 at 07:33:45PM +0200, Markus Franz wrote:
> How can I grab websites with a command-line python script? [...]
> How can I do this?

You could use Twisted <http://twistedmatrix.com>:

from twisted.internet import reactor
from twisted.web.client import getPage
import sys

def gotPage(page):
    print separator
    print page

def failed(failure):
    print separator + ' FAILED'
    failure.printTraceback()

def decrement(ignored):
    # Stop the reactor once every URL has either succeeded or failed.
    global count
    count -= 1
    if count == 0:
        reactor.stop()

separator = sys.argv[1]
urlList = sys.argv[2:]
count = len(urlList)
for url in urlList:
    getPage(url, timeout=15).addCallbacks(gotPage, failed).addBoth(decrement)

reactor.run()

It will grab the sites in parallel, printing them in the order they arrive,
and it doesn't use multiple processes or threads :)

-Andrew.
Jul 18 '05 #7
