Web Spider

Hello

I'm a newcomer to the world of Python trying to write a web spider. I
downloaded the skeleton from

http://starship.python.net/crew/aahz...dPoolSpider.py

Some of the source is shown below.

A couple of questions:

1) Why use the

if __name__ == '__main__':

construct?

2) In RetrievePool.__init__ the Retriever.__init__ is called with
self.inputQueue and self.outputQueue as arguments. Does this mean that
each Retriever thread has a reference to RetrievePool.inputQueue and
RetrievePool.outputQueue (i.e. there is only one input and one output queue,
and the threads all share them, pushing and popping whenever they want,
which is safe due to the synchronized nature of Queue)?

3) How many threads will be running? Spider.run initializes the
RetrievePool and this will consist of MAX_THREADS threads, so once the
crawler is running there will be the main thread (caught in the while loop
in Spider.run) and MAX_THREADS Retriever threads running, right?

Hmm... I think that's about it for now.

---------------------------------------------------------------------

MAX_THREADS = 3

....

class Retriever(threading.Thread):
    def __init__(self, inputQueue, outputQueue):
        threading.Thread.__init__(self)
        self.inputQueue = inputQueue
        self.outputQueue = outputQueue

    def run(self):
        while 1:
            self.URL = self.inputQueue.get()
            self.getPage()
            self.outputQueue.put(self.getLinks())

...
class RetrievePool:
    def __init__(self, numThreads):
        self.retrievePool = []
        self.inputQueue = Queue.Queue()
        self.outputQueue = Queue.Queue()
        for i in range(numThreads):
            retriever = Retriever(self.inputQueue, self.outputQueue)
            retriever.start()
            self.retrievePool.append(retriever)

...
class Spider:
    def __init__(self, startURL, maxThreads):
        self.URLs = []
        self.queue = [startURL]
        self.URLdict = {startURL: 1}
        self.include = startURL
        self.numPagesQueued = 0
        self.retriever = RetrievePool(maxThreads)

    def run(self):
        self.startPages()
        while self.numPagesQueued > 0:
            self.queueLinks()
            self.startPages()
        self.retriever.shutdown()
        self.URLs = self.URLdict.keys()
        self.URLs.sort()

...
if __name__ == '__main__':
    startURL = sys.argv[1]
    spider = Spider(startURL, MAX_THREADS)
    spider.run()
    print
    for URL in spider.URLs:
        print URL
--
Regards
/Thomas

Jul 18 '05 #1
Thomas Lindgaard wrote:
> A couple of questions:
>
> 1) Why use the
> if __name__ == '__main__':
> construct?

Answered indirectly in this FAQ:
http://www.python.org/doc/faq/progra...nt-module-name

> 2) In RetrievePool.__init__ the Retriever.__init__ is called with
> self.inputQueue and self.outputQueue as arguments. Does this mean that
> each Retriever thread has a reference to RetrievePool.inputQueue and
> RetrievePool.outputQueue

Yes, and that's sort of the whole point of the thing.
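
For anyone reading along, here is a minimal sketch of that sharing, written for Python 3 (where the Queue module is spelled queue) rather than the Python 2 of the original skeleton. The names worker, run_pool and NUM_WORKERS are made up for the illustration, and doubling a number stands in for fetching a page; none of it comes from the skeleton itself.

```python
import queue
import threading

NUM_WORKERS = 3  # plays the role of MAX_THREADS in the post (assumed value)


def worker(in_q, out_q):
    """Pull items from the shared input queue until a None sentinel arrives."""
    while True:
        item = in_q.get()
        if item is None:
            break
        # Stand-in for "fetch the page and extract its links".
        out_q.put((threading.current_thread().name, item * 2))


def run_pool(items):
    """Start NUM_WORKERS threads that all share one input and one output queue."""
    in_q = queue.Queue()
    out_q = queue.Queue()
    threads = [threading.Thread(target=worker, args=(in_q, out_q))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for item in items:
        in_q.put(item)
    for _ in threads:
        in_q.put(None)  # one sentinel per worker so every thread exits
    for t in threads:
        t.join()
    # All workers have finished, so draining with empty() is safe here.
    results = []
    while not out_q.empty():
        results.append(out_q.get())
    return results
```

Every thread receives the same two queue objects, so whichever worker is free picks up the next item. The order of the results depends on thread scheduling, so sort them if you need a stable order.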

> 3) How many threads will be running? Spider.run initializes the
> RetrievePool and this will consist of MAX_THREADS threads, so once the
> crawler is running there will be the main thread (caught in the while loop
> in Spider.run) and MAX_THREADS Retriever threads running, right?

Yep. Good analysis. :-) You could inject this somewhere to
check:

print len(threading.enumerate()), 'threads exist'
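
As a self-contained variant of the same check, the sketch below (Python 3, with a hypothetical helper name count_extra_threads) starts a few idle threads and reports how many more live threads threading.enumerate() sees while they are running. It measures the difference rather than the absolute count, since the host program may already have threads of its own.

```python
import threading

MAX_THREADS = 3  # same value as in the posted source


def count_extra_threads(num_workers=MAX_THREADS):
    """Return how many additional live threads exist while the workers run."""
    stop = threading.Event()
    before = len(threading.enumerate())
    # Each worker simply blocks until the event is set.
    workers = [threading.Thread(target=stop.wait) for _ in range(num_workers)]
    for w in workers:
        w.start()
    during = len(threading.enumerate())
    stop.set()
    for w in workers:
        w.join()
    return during - before


if __name__ == "__main__":
    print(count_extra_threads(), "extra threads existed")
```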

-Peter
Jul 18 '05 #2
On Tue, 06 Jul 2004 11:19:01 -0400, Peter Hansen wrote:
> Answered indirectly in this FAQ:
> http://www.python.org/doc/faq/progra...nt-module-name

Let me just see if I understood this correctly...

The reason for using the construct is to have two "modes" for the script:
one for running the script by itself (i.e. run main()) and one for when it
is included from somewhere else (i.e. main() should not be run unless
called from the surrounding code).

>> 2) In RetrievePool.__init__ the Retriever.__init__ is called with
>> self.inputQueue and self.outputQueue as arguments. Does this mean that
>> each Retriever thread has a reference to RetrievePool.inputQueue and
>> RetrievePool.outputQueue
>
> Yes, and that's sort of the whole point of the thing.

Okidoki :)

>> 3) How many threads will be running? Spider.run initializes the
>> RetrievePool and this will consist of MAX_THREADS threads, so once the
>> crawler is running there will be the main thread (caught in the while
>> loop in Spider.run) and MAX_THREADS Retriever threads running, right?
>
> Yep. Good analysis. :-) You could inject this somewhere to check:

Thanks - sometimes it actually helps to read code you want to elaborate on
closely :)

> print len(threading.enumerate()), 'threads exist'

Can a thread die spontaneously if for instance an exception is thrown?

--
Regards
/Thomas

Jul 18 '05 #3
Thomas Lindgaard wrote:
> On Tue, 06 Jul 2004 11:19:01 -0400, Peter Hansen wrote:
>> Answered indirectly in this FAQ:
>> http://www.python.org/doc/faq/progra...nt-module-name
>
> Let me just see if I understood this correctly...
>
> The reason for using the construct is to have two "modes" for the script:
> one for running the script by itself (i.e. run main()) and one for when it
> is included from somewhere else (i.e. main() should not be run unless
> called from the surrounding code).

Yep.
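
A minimal sketch of those two modes, in Python 3 syntax. The function names find_urls and main and the placeholder URL are assumptions for the illustration, not part of the spider skeleton.

```python
def find_urls(start_url):
    """Stand-in for the real crawling work; returns the URLs 'found'."""
    return [start_url]


def main(start_url="http://example.com/"):
    """Entry point used when the file is run as a script."""
    for url in find_urls(start_url):
        print(url)
    return 0


if __name__ == "__main__":
    # Run directly (python thisfile.py): __name__ is "__main__", so main() runs.
    # Imported (import thisfile): __name__ is the module's name, so this block
    # is skipped and the importer decides whether and when to call main().
    main()
```

Importing the file gives access to find_urls and main without starting a crawl, which is handy for testing or for reusing the code from another program.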

> Can a thread die spontaneously if for instance an exception is thrown?

The interactive prompt is your friend for such questions in Python.
Good to get in the habit of being able to check such stuff out
easily:

c:\>python
Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import time, threading
>>> class Test(threading.Thread):
...     def run(self):
...         while 1:
...             time.sleep(5)
...             1/0
...
>>> a = Test()
>>> threading.enumerate()
[<_MainThread(MainThread, started)>]
>>> a.start()
>>> threading.enumerate()
[<Test(Thread-2, started)>, <_MainThread(MainThread, started)>]
>>> # wait a few seconds here
...
Exception in thread Thread-2:
Traceback (most recent call last):
  File "c:\a\python23\lib\threading.py", line 436, in __bootstrap
    self.run()
  File "<stdin>", line 5, in run
ZeroDivisionError: integer division or modulo by zero

>>> threading.enumerate()
[<_MainThread(MainThread, started)>]

Tada! The answer is yes. :-)
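
For a spider this means a Retriever that hits an unexpected error would quietly drop out of the pool. One common way to guard against that, sketched below in Python 3, is to catch exceptions inside the worker loop and report them through the output queue. The names fetch, safe_worker and crawl_safely, the example URLs and the simulated failure are all hypothetical; this is not how the original skeleton handles errors.

```python
import queue
import threading


def fetch(url):
    """Hypothetical stand-in for Retriever.getPage(); fails for one kind of URL."""
    if "bad" in url:
        raise OSError("could not fetch %s" % url)
    return "contents of %s" % url


def safe_worker(in_q, out_q):
    """Process URLs until a None sentinel; an exception does not end the thread."""
    while True:
        url = in_q.get()
        if url is None:
            break
        try:
            out_q.put((url, fetch(url)))
        except Exception as exc:
            # Hand the error back to the caller instead of dying silently.
            out_q.put((url, exc))


def crawl_safely(urls):
    """Run one guarded worker over urls and return one result per URL."""
    in_q, out_q = queue.Queue(), queue.Queue()
    thread = threading.Thread(target=safe_worker, args=(in_q, out_q))
    thread.start()
    for url in urls:
        in_q.put(url)
    in_q.put(None)
    thread.join()
    # A single worker processes the queue in order, so results line up with urls.
    return [out_q.get() for _ in urls]
```

With this guard the failing URL produces an exception object in the results while the URLs queued after it are still processed.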

-Peter
Jul 18 '05 #4
