Web Spider

Hello

I'm a newcomer to the world of Python trying to write a web spider. I
downloaded the skeleton from

http://starship.python.net/crew/aahz...dPoolSpider.py

Some of the source is shown below.

A couple of questions:

1) Why use the

if __name__ == '__main__':

construct?

2) In RetrievePool.__init__ the Retriever.__init__ is called with
self.inputQueue and self.outputQueue as arguments. Does this mean that
each Retriever thread has a reference to RetrievePool.inputQueue and
RetrievePool.outputQueue (i.e. there is only one input queue and one output
queue, and the threads all share them, pushing and popping whenever they
want, which is safe due to the synchronized nature of Queue)?

3) How many threads will be running? Spider.run initializes the
RetrievePool and this will consist of MAX_THREADS threads, so once the
crawler is running there will be the main thread (caught in the while loop
in Spider.run) and MAX_THREADS Retriever threads running, right?

Hmm... I think that's about it for now.

---------------------------------------------------------------------

MAX_THREADS = 3

....

class Retriever(threading.Thread):
    def __init__(self, inputQueue, outputQueue):
        threading.Thread.__init__(self)
        self.inputQueue = inputQueue
        self.outputQueue = outputQueue

    def run(self):
        while 1:
            self.URL = self.inputQueue.get()
            self.getPage()
            self.outputQueue.put(self.getLinks())

...
class RetrievePool:
    def __init__(self, numThreads):
        self.retrievePool = []
        self.inputQueue = Queue.Queue()
        self.outputQueue = Queue.Queue()
        for i in range(numThreads):
            retriever = Retriever(self.inputQueue, self.outputQueue)
            retriever.start()
            self.retrievePool.append(retriever)

...
class Spider:
    def __init__(self, startURL, maxThreads):
        self.URLs = []
        self.queue = [startURL]
        self.URLdict = {startURL: 1}
        self.include = startURL
        self.numPagesQueued = 0
        self.retriever = RetrievePool(maxThreads)

    def run(self):
        self.startPages()
        while self.numPagesQueued > 0:
            self.queueLinks()
            self.startPages()
        self.retriever.shutdown()
        self.URLs = self.URLdict.keys()
        self.URLs.sort()

...
if __name__ == '__main__':
    startURL = sys.argv[1]
    spider = Spider(startURL, MAX_THREADS)
    spider.run()
    print
    for URL in spider.URLs:
        print URL
--
Regards
/Thomas

Jul 18 '05 #1
Thomas Lindgaard wrote:
> A couple of questions:
>
> 1) Why use the
> if __name__ == '__main__':
> construct?

Answered indirectly in this FAQ:
http://www.python.org/doc/faq/progra...nt-module-name

> 2) In RetrievePool.__init__ the Retriever.__init__ is called with
> self.inputQueue and self.outputQueue as arguments. Does this mean that
> each Retriever thread has a reference to RetrievePool.inputQueue and
> RetrievePool.outputQueue

Yes, and that's sort of the whole point of the thing.

> 3) How many threads will be running? Spider.run initializes the
> RetrievePool and this will consist of MAX_THREADS threads, so once the
> crawler is running there will be the main thread (caught in the while loop
> in Spider.run) and MAX_THREADS Retriever threads running, right?

Yep. Good analysis. :-) You could inject this somewhere to
check:

    print len(threading.enumerate()), 'threads exist'

-Peter
Jul 18 '05 #2
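
The two answers above (one shared pair of queues, and MAX_THREADS workers plus the main thread) can be illustrated with a small sketch. This is not the spider from the thread: it is written for current Python 3 (where the module is named `queue`), and the `Worker` class, the `run_pool` function, and the `None` stop sentinel are hypothetical names invented for the illustration. The workers double numbers instead of fetching pages.

```python
import queue
import threading

MAX_THREADS = 3  # same value as in the posted skeleton


class Worker(threading.Thread):
    """Hypothetical stand-in for Retriever: doubles numbers instead of fetching pages."""

    def __init__(self, input_queue, output_queue):
        super().__init__(daemon=True)
        # Every worker keeps a reference to the same two Queue objects.
        self.input_queue = input_queue
        self.output_queue = output_queue

    def run(self):
        while True:
            item = self.input_queue.get()
            if item is None:  # assumed sentinel: tells this worker to stop
                break
            self.output_queue.put(item * 2)


def run_pool(items, num_threads=MAX_THREADS):
    """Process a list of numbers and report what the pool looked like."""
    input_queue = queue.Queue()
    output_queue = queue.Queue()
    workers = [Worker(input_queue, output_queue) for _ in range(num_threads)]
    for worker in workers:
        worker.start()

    # All workers are blocked in get() here, so each one is still alive.
    alive = sum(1 for t in threading.enumerate() if isinstance(t, Worker))
    shared = all(
        w.input_queue is input_queue and w.output_queue is output_queue
        for w in workers
    )

    for item in items:
        input_queue.put(item)
    for _ in workers:
        input_queue.put(None)  # one sentinel per worker
    for worker in workers:
        worker.join()

    results = sorted(output_queue.get() for _ in items)
    return results, shared, alive
```

Calling `run_pool([1, 2, 3, 4])` returns the doubled values, a flag confirming that every worker holds the identical queue objects, and the number of live worker threads counted through `threading.enumerate()`. The main thread is not included in that count because the sketch filters on `Worker`.
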
On Tue, 06 Jul 2004 11:19:01 -0400, Peter Hansen wrote:
> Answered indirectly in this FAQ:
> http://www.python.org/doc/faq/progra...nt-module-name

Let me just see if I understood this correctly...

The reason for using the construct is to have two "modes" for the script:
one for running the script by itself (i.e. run main()) and one for when it
is included from somewhere else (i.e. main() should not be run unless
called from the surrounding code).

>> 2) In RetrievePool.__init__ the Retriever.__init__ is called with
>> self.inputQueue and self.outputQueue as arguments. Does this mean that
>> each Retriever thread has a reference to RetrievePool.inputQueue and
>> RetrievePool.outputQueue
>
> Yes, and that's sort of the whole point of the thing.

Okidoki :)

>> 3) How many threads will be running? Spider.run initializes the
>> RetrievePool and this will consist of MAX_THREADS threads, so once the
>> crawler is running there will be the main thread (caught in the while
>> loop in Spider.run) and MAX_THREADS Retriever threads running, right?
>
> Yep. Good analysis. :-) You could inject this somewhere to check:

Thanks - sometimes it actually helps to closely read the code you want to
elaborate on :)

> print len(threading.enumerate()), 'threads exist'

Can a thread die spontaneously if, for instance, an exception is thrown?

--
Regards
/Thomas

Jul 18 '05 #3
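
The two "modes" described above can be sketched as follows. This is a minimal illustration in current Python 3, not code from the spider; the file name `spider_demo.py`, the `main` function, and the placeholder URL are assumptions made up for the example.

```python
import sys


def main(argv):
    """Entry point; runs automatically only when the file is executed as a script."""
    # Assumed placeholder default so the sketch also works without arguments.
    start_url = argv[1] if len(argv) > 1 else "http://example.com/"
    return "would crawl " + start_url


if __name__ == "__main__":
    # True when run as `python spider_demo.py <url>`.
    # False when another module does `import spider_demo`; in that case
    # nothing happens until the importing code calls main() itself.
    print(main(sys.argv))
```

When the file is imported, `__name__` holds the module's own name rather than `'__main__'`, so the guarded block is skipped and the importer decides whether and when to call `main()`.
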
Thomas Lindgaard wrote:
> On Tue, 06 Jul 2004 11:19:01 -0400, Peter Hansen wrote:
>> Answered indirectly in this FAQ:
>> http://www.python.org/doc/faq/progra...nt-module-name
>
> Let me just see if I understood this correctly...
>
> The reason for using the construct is to have two "modes" for the script:
> one for running the script by itself (i.e. run main()) and one for when it
> is included from somewhere else (i.e. main() should not be run unless
> called from the surrounding code).

Yep.

> Can a thread die spontaneously if, for instance, an exception is thrown?

The interactive prompt is your friend for such questions in Python.
Good to get in the habit of being able to check such stuff out
easily:

c:\>python
Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import time, threading
>>> class Test(threading.Thread):
...     def run(self):
...         while 1:
...             time.sleep(5)
...             1/0
...
>>> a = Test()
>>> threading.enumerate()
[<_MainThread(MainThread, started)>]
>>> a.start()
>>> threading.enumerate()
[<Test(Thread-2, started)>, <_MainThread(MainThread, started)>]
>>> # wait a few seconds here
Exception in thread Thread-2:
Traceback (most recent call last):
  File "c:\a\python23\lib\threading.py", line 436, in __bootstrap
    self.run()
  File "<stdin>", line 5, in run
ZeroDivisionError: integer division or modulo by zero

>>> threading.enumerate()
[<_MainThread(MainThread, started)>]

Tada! The answer is yes. :-)

-Peter
Jul 18 '05 #4
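
The same experiment can be run non-interactively. The sketch below is an assumption-laden port to current Python 3 (it relies on `threading.excepthook`, available from Python 3.8, to capture the error instead of printing a traceback); the names `Failing` and `run_failing_thread` are hypothetical and do not come from the thread.

```python
import threading


class Failing(threading.Thread):
    """Thread whose run() raises immediately, like the 1/0 in the session above."""

    def run(self):
        1 / 0  # raises ZeroDivisionError inside the worker thread


def run_failing_thread():
    """Start a failing thread and report its state afterwards."""
    seen = []
    old_hook = threading.excepthook
    # Record the exception type instead of writing the traceback to stderr.
    threading.excepthook = lambda args: seen.append(args.exc_type)
    try:
        t = Failing()
        t.start()
        t.join()  # returns once the thread has ended, even though it failed
    finally:
        threading.excepthook = old_hook
    return t.is_alive(), seen, t in threading.enumerate()
```

The call returns normally, which shows that the uncaught exception ends only the thread that raised it and not the main thread. For a pool like the spider's, this suggests that a worker which lets an exception escape from `run()` silently drops out of the pool, so wrapping the loop body in a `try`/`except` is worth considering.
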
