Bytes | Developer Community
Web Crawling/Threading and Things That Go Bump in the Night

Hello all

I am trying to write a reliable web-crawler. I tried to write my own
using recursion and quickly hit the "too many open sockets" problem,
so I looked for a threaded version that I could easily extend.

The simplest/most reliable I found was called Spider.py (see attached).

At this stage I want a spider that I can point at a site, let it do
its thing, and reliably get a callback of sorts... including the html
(for me to parse), the url of the page in question (so I can log it),
and the urls found on that page (so I can strip out any I really
don't want and add the rest to the "seen" list).
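To make that concrete, here is roughly the handler signature I have in mind (the names are mine for illustration, not the actual Spider.py interface):

```python
# Sketch of the callback I'm after -- names are illustrative only,
# not taken from Spider.py.
def handle(url, html, links):
    # log the page, hand the html off for parsing, and return only
    # the links the spider should add to its seen-list
    print("fetched %s (%d bytes, %d links)" % (url, len(html), len(links)))
    return [link for link in links if not link.endswith(".pdf")]

kept = handle("http://example.com/", "<html></html>",
              ["http://example.com/a", "http://example.com/doc.pdf"])
```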
Now, this is my question.

The code above ALMOST works fine. The crawler crawls and I get the
data I need, BUT... every now and again the code just pauses; if I
hit control-C it reports an error as if it has hit an exception, and
then carries on! I like the fact that my spider_usage.py file has the
minimum amount of spider stuff in it... really just a main() and a
handle() callback.

How does this happen... is a thread being killed and a new one made,
or what? I suspect it may have something to do with sockets timing
out, but I have no idea...
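For what it's worth, one thing I mean to try (a guess on my part, not something Spider.py already does) is a global socket timeout, so a dead connection raises an error instead of blocking its thread forever:

```python
import socket

# Every socket created after this call inherits a 10-second timeout,
# so a stalled connect()/recv() raises socket.timeout rather than
# hanging the thread indefinitely.
socket.setdefaulttimeout(10.0)

s = socket.socket()
print(s.gettimeout())  # freshly created sockets pick up the default
s.close()
```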

By the way, on small sites (100s of pages) it never stalls; it's on
larger sites such as Amazon that it "fails".

This is my other question

It would be great to know, when the code is stalled, if it is doing
anything... is there any way to even print a full stop to screen?
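On the "is it doing anything" front, two things I've seen suggested (sketches, assuming CPython): print an unbuffered heartbeat dot per page, and dump every thread's stack when it stalls:

```python
import sys
import traceback

def heartbeat():
    # One dot per fetched page; flush so it shows up immediately
    # even when stdout is block-buffered.
    sys.stdout.write('.')
    sys.stdout.flush()

def dump_threads():
    # Print what every live thread is doing right now -- useful for
    # seeing where a stalled crawler is stuck (CPython-specific API;
    # stack traces themselves go to stderr via traceback).
    for thread_id, frame in sys._current_frames().items():
        print("\nThread %s:" % thread_id)
        traceback.print_stack(frame)
```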

This is my last question

Given python's suitability for this sort of thing (doesn't google use
it?), I can't believe that there isn't a kick-ass crawler already out
there...

regards

tom

http://www.theotherblog.com/Articles...rawler-spider/

Aug 4 '06 #1

1 Reply
Rem, what OS are you trying this on? Windows XP SP2 has a limit of
around 40 TCP connections per second...
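(If an OS connection cap is the culprit, one workaround -- a rough sketch, not tested against Spider.py -- is to gate the fetching threads with a semaphore so only a handful hold sockets open at once:)

```python
import threading

MAX_FETCHES = 10  # stay comfortably under any OS connection cap
_slots = threading.Semaphore(MAX_FETCHES)

def fetch(url):
    # At most MAX_FETCHES threads get past this point at once; the
    # rest block here instead of piling up half-open sockets.
    with _slots:
        # a real implementation would open the socket / urlopen here
        return "<html>stub for %s</html>" % url

threads = [threading.Thread(target=fetch, args=("http://example.com/page%d" % i,))
           for i in range(25)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```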

Aug 4 '06 #2

This discussion thread is closed. Replies have been disabled for this discussion.