Web Spider - Python

Thomas Lindgaard

Hello

I'm a newcomer to the world of Python trying to write a web spider. I
downloaded the skeleton from

http://starship.python.net/crew/aahz...dPoolSpider.py

Some of the source shown below.

A couple of questions:

1) Why use the

if __name__ == '__main__':

construct?

2) In RetrievePool.__init__, Retriever.__init__ is called with
self.inputQueue and self.outputQueue as arguments. Does this mean that
each Retriever thread has a reference to RetrievePool.inputQueue and
RetrievePool.outputQueue (i.e. there is only one input queue and one output
queue, and the threads all share them, pushing and popping whenever they
want, which is safe due to the synchronized nature of Queue)?

3) How many threads will be running? Spider.run initializes the
RetrievePool and this will consist of MAX_THREADS threads, so once the
crawler is running there will be the main thread (caught in the while loop
in Spider.run) and MAX_THREADS Retriever threads running, right?

Hmm... I think that's about it for now.

---------------------------------------------------------------------

MAX_THREADS = 3

....

class Retriever(threading.Thread):
    def __init__(self, inputQueue, outputQueue):
        threading.Thread.__init__(self)
        self.inputQueue = inputQueue
        self.outputQueue = outputQueue

    def run(self):
        while 1:
            self.URL = self.inputQueue.get()
            self.getPage()
            self.outputQueue.put(self.getLinks())

...
class RetrievePool:
    def __init__(self, numThreads):
        self.retrievePool = []
        self.inputQueue = Queue.Queue()
        self.outputQueue = Queue.Queue()
        for i in range(numThreads):
            retriever = Retriever(self.inputQueue, self.outputQueue)
            retriever.start()
            self.retrievePool.append(retriever)

...
class Spider:
    def __init__(self, startURL, maxThreads):
        self.URLs = []
        self.queue = [startURL]
        self.URLdict = {startURL: 1}
        self.include = startURL
        self.numPagesQueued = 0
        self.retriever = RetrievePool(maxThreads)

    def run(self):
        self.startPages()
        while self.numPagesQueued > 0:
            self.queueLinks()
            self.startPages()
        self.retriever.shutdown()
        self.URLs = self.URLdict.keys()
        self.URLs.sort()

...
if __name__ == '__main__':
    startURL = sys.argv[1]
    spider = Spider(startURL, MAX_THREADS)
    spider.run()
    print
    for URL in spider.URLs:
        print URL
--
Regards
/Thomas

Jul 18 '05 #1


Peter Hansen

Thomas Lindgaard wrote:

> A couple of questions:
>
> 1) Why use the
> if __name__ == '__main__':
> construct?

Answered indirectly in this FAQ:
http://www.python.org/doc/faq/progra...nt-module-name

> 2) In RetrievePool.__init__ the Retriever.__init__ is called with
> self.inputQueue and self.outputQueue as arguments. Does this mean that
> each Retriever thread has a reference to RetrievePool.inputQueue and
> RetrievePool.outputQueue

Yes, and that's sort of the whole point of the thing.
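
The shared-queue arrangement confirmed above can be sketched as follows. This is a minimal illustration, not the skeleton's actual code: it assumes Python 3 (where the module is named queue rather than Queue), and the names worker, run_pool, NUM_WORKERS and STOP are hypothetical. Doubling a number stands in for fetching a page and extracting its links.

```python
import queue
import threading

NUM_WORKERS = 3  # plays the role of MAX_THREADS in the thread (assumption)
STOP = None      # hypothetical sentinel that tells a worker to exit


def worker(input_queue, output_queue):
    """Pull items from the one shared input queue, push results to the one shared output queue."""
    while True:
        item = input_queue.get()
        if item is STOP:
            break
        output_queue.put(item * 2)  # stand-in for "fetch page, return links"


def run_pool(items):
    """Process items with NUM_WORKERS threads that all hold the same two queues."""
    input_queue = queue.Queue()
    output_queue = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(input_queue, output_queue))
        for _ in range(NUM_WORKERS)
    ]
    for thread in workers:
        thread.start()
    for item in items:
        input_queue.put(item)
    for _ in workers:
        input_queue.put(STOP)  # one sentinel per worker
    for thread in workers:
        thread.join()
    results = []
    while not output_queue.empty():  # safe here: all workers have finished
        results.append(output_queue.get())
    return sorted(results)
```

Every worker receives the same two queue objects, so no extra locking is needed around get() and put(); the queue does its own synchronization.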

> 3) How many threads will be running? Spider.run initializes the
> RetrievePool and this will consist of MAX_THREADS threads, so once the
> crawler is running there will be the main thread (caught in the while loop
> in Spider.run) and MAX_THREADS Retriever threads running, right?

Yep. Good analysis. :-) You could inject this somewhere to
check:

print len(threading.enumerate()), 'threads exist'
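
A small sketch of that check, assuming Python 3 (print is a function there) and a hypothetical helper named count_with_workers. It compares the length of threading.enumerate() before, during and after the lifetime of a few idle worker threads, so the result does not depend on how many threads already exist in the process.

```python
import threading

MAX_THREADS = 3  # same value as in the posted code


def count_with_workers(num_workers):
    """Return thread counts (before, during, after) around a pool of idle workers."""
    stop = threading.Event()
    before = len(threading.enumerate())
    # Each worker simply blocks until the event is set.
    workers = [threading.Thread(target=stop.wait) for _ in range(num_workers)]
    for thread in workers:
        thread.start()
    during = len(threading.enumerate())
    stop.set()
    for thread in workers:
        thread.join()
    after = len(threading.enumerate())
    return before, during, after


if __name__ == "__main__":
    before, during, after = count_with_workers(MAX_THREADS)
    print(during - before, "extra threads while the pool is running")
```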

-Peter

Jul 18 '05 #2

Thomas Lindgaard

On Tue, 06 Jul 2004 11:19:01 -0400, Peter Hansen wrote:

> Answered indirectly in this FAQ:
> http://www.python.org/doc/faq/progra...nt-module-name

Let me just see if I understood this correctly...

The reason for using the construct is to have two "modes" for the script:
one for running the script by itself (i.e. run main()) and one for when it
is imported from somewhere else (i.e. main() should not be run unless
called from the surrounding code).
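
A minimal sketch of those two modes, written for Python 3 rather than the Python 2 used in the posted code. The functions main and describe_mode are hypothetical names introduced only for illustration.

```python
def main():
    """Entry point: called automatically only when the file is run directly."""
    message = "running as a script"
    print(message)
    return message


def describe_mode(module_name):
    """Say which mode a module is in, given the value of its __name__."""
    return "script" if module_name == "__main__" else "imported"


# When the file is executed directly, Python sets __name__ to "__main__".
# When it is imported, __name__ is the module's own name, so main() is
# skipped and the importing code decides whether to call it.
if __name__ == "__main__":
    main()
```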

>> 2) In RetrievePool.__init__ the Retriever.__init__ is called with
>> self.inputQueue and self.outputQueue as arguments. Does this mean that
>> each Retriever thread has a reference to RetrievePool.inputQueue and
>> RetrievePool.outputQueue
>
> Yes, and that's sort of the whole point of the thing.

Okidoki :)

>> 3) How many threads will be running? Spider.run initializes the
>> RetrievePool and this will consist of MAX_THREADS threads, so once the
>> crawler is running there will be the main thread (caught in the while
>> loop in Spider.run) and MAX_THREADS Retriever threads running, right?
>
> Yep. Good analysis. :-) You could inject this somewhere to check:
>
> print len(threading.enumerate()), 'threads exist'

Thanks - sometimes it actually helps to closely read the code you want to
elaborate on :)

Can a thread die spontaneously if for instance an exception is thrown?

--
Mvh.
/Thomas

Jul 18 '05 #3

Peter Hansen

Thomas Lindgaard wrote:

> On Tue, 06 Jul 2004 11:19:01 -0400, Peter Hansen wrote:
>> Answered indirectly in this FAQ:
>> http://www.python.org/doc/faq/progra...nt-module-name
>
> Let me just see if I understood this correctly...
>
> The reason for using the construct is to have two "modes" for the script:
> One for running the script by itself (ie. run main()) and one for when it
> is included from somewhere else (ie. main() should not be run unless
> called from the surrounding code).

Yep.

> Can a thread die spontaneously if for instance an exception is thrown?

The interactive prompt is your friend for such questions in Python.
Good to get in the habit of being able to check such stuff out
easily:

c:\>python
Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import time, threading
>>> class Test(threading.Thread):
...     def run(self):
...         while 1:
...             time.sleep(5)
...             1/0
...
>>> a = Test()
>>> threading.enumerate()
[<_MainThread(MainThread, started)>]
>>> a.start()
>>> threading.enumerate()
[<Test(Thread-2, started)>, <_MainThread(MainThread, started)>]
>>> # wait a few seconds here
Exception in thread Thread-2:
Traceback (most recent call last):
  File "c:\a\python23\lib\threading.py", line 436, in __bootstrap
    self.run()
  File "<stdin>", line 5, in run
ZeroDivisionError: integer division or modulo by zero

>>> threading.enumerate()
[<_MainThread(MainThread, started)>]

Tada! The answer is yes. :-)
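
The same experiment can be written as a script. This is a sketch under stated assumptions: it targets Python 3, and the classes Fragile and Guarded and the helper run_to_end are hypothetical names, not part of the spider skeleton. Fragile dies on its first uncaught exception, as in the session above; Guarded shows one possible way to keep a worker loop going by catching the exception inside run().

```python
import threading


class Fragile(threading.Thread):
    """Ends as soon as run() raises; only this thread dies, not the program."""

    def run(self):
        1 / 0  # uncaught ZeroDivisionError, traceback goes to stderr


class Guarded(threading.Thread):
    """Catches exceptions inside run() so the loop keeps going."""

    def __init__(self):
        super().__init__()
        self.errors = []
        self.completed = 0

    def run(self):
        for divisor in (0, 1, 2):
            try:
                1 / divisor
                self.completed += 1
            except ZeroDivisionError as exc:
                self.errors.append(exc)  # record the failure and carry on


def run_to_end(thread):
    """Start a thread, wait for it to finish, and hand it back for inspection."""
    thread.start()
    thread.join()
    return thread
```

For a pool like the one in the spider, a worker that dies silently reduces the number of threads serving the queue, so catching exceptions around the per-URL work may be worth considering.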

-Peter

Jul 18 '05 #4
