Web Crawling/Threading and Things That Go Bump in the Night

Hello all

I am trying to write a reliable web crawler. I tried to write my own
using recursion and found I quickly hit the "too many open sockets"
problem. So I looked for a threaded version that I could easily extend.

The simplest/most reliable I found was called Spider.py (see attached).

At this stage I want a spider that I can point at a site, let it do
its thing, and reliably get a callback of sorts... including the html
(for me to parse), the url of the page in question (so I can log it)
and the urls-found-on-that-page (so I can strip out any ones I really
don't want and add them to the "seen-list").
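
For readers who want to see the shape being described, here is a minimal Python 3 sketch of a crawler with a fixed pool of worker threads and a handle() callback. It is not Spider.py (which is not shown in this post) and does not reproduce its API. The names crawl, fetch, extract_links and num_workers are assumptions made up for illustration. Keeping the pool at a fixed size is what bounds the number of sockets open at once, unlike unbounded recursion.

```python
import queue
import threading


def crawl(start_url, fetch, extract_links, handle, num_workers=4):
    """Crawl from start_url with a fixed number of worker threads.

    All three callables are supplied by the caller (hypothetical names):
      fetch(url)               -> page text, or None on failure
      extract_links(url, html) -> iterable of URLs found on the page
      handle(url, html, links) -> iterable of URLs the caller wants followed
    Returns the set of URLs that were queued (the "seen-list").
    """
    todo = queue.Queue()
    seen = {start_url}
    seen_lock = threading.Lock()
    todo.put(start_url)

    def worker():
        while True:
            url = todo.get()
            if url is None:  # sentinel: time to shut down
                todo.task_done()
                return
            try:
                html = fetch(url)
                if html is not None:
                    links = list(extract_links(url, html))
                    # The callback decides which links are worth following.
                    for link in handle(url, html, links):
                        with seen_lock:
                            if link in seen:
                                continue
                            seen.add(link)
                        todo.put(link)
            except Exception as exc:
                # One bad page should not kill the worker thread.
                print(f"error on {url}: {exc!r}")
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    todo.join()  # returns once every queued URL has been processed
    for _ in threads:
        todo.put(None)
    for t in threads:
        t.join()
    return seen
```

Note that handle() is called from the worker threads, so anything it shares with the rest of the program (a log file, a results list) needs to be safe to use from several threads at once.
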
Now, this is my question.

The Spider.py code ALMOST works fine. The crawler crawls and I get the
data I need, BUT... every now and again the code just pauses. I hit
Control-C, it reports an error as if it has hit an exception, and then
it carries on!!! I like the fact that my spider_usage.py file has the
minimum amount of spider stuff in it... really just a main() and a
handle() handler.

How does this happen... is a thread being killed and then a new one is
made or what? I suspect it may have something to do with sockets timing
out, but I have no idea...
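
If sockets blocking forever are the cause (an untested guess, as noted above), one way to rule it out is to give every request an explicit timeout. The sketch below uses Python 3's urllib.request.urlopen with its timeout parameter. The fetch name and the 10-second default are assumptions for illustration, not something taken from Spider.py.

```python
import http.client
import urllib.request


def fetch(url, timeout=10.0):
    """Return the page body as bytes, or None if the request fails or stalls.

    The timeout applies to each blocking socket operation (connect, read),
    so a server that stops responding raises an error instead of hanging
    the calling thread indefinitely.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, http.client.HTTPException, ValueError) as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses;
        # ValueError covers malformed URLs.
        print(f"giving up on {url}: {exc!r}")
        return None
```

A timeout only bounds how long a single socket operation may block. A server that trickles data slowly can still keep a download going for longer than the timeout value.
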

By the way, on small sites (100s of pages) it never stalls; it's on
larger sites such as Amazon that it "fails".

This is my other question

It would be great to know, when the code is stalled, if it is doing
anything... is there any way to even print a full stop to screen?
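
Printing a full stop is possible as long as the output is flushed, otherwise buffering can hide it. A rough Python 3 sketch of a heartbeat thread follows. The Heartbeat class and its tick() method are hypothetical names, and the idea is that handle() calls tick() for each page. The thread prints "." when pages were handled since the last beat and "!" when nothing happened, which makes a stall visible.

```python
import sys
import threading


class Heartbeat:
    """Background thread that prints '.' on progress and '!' on no progress."""

    def __init__(self, interval=5.0, stream=None):
        self.interval = interval
        self.stream = stream if stream is not None else sys.stdout
        self._count = 0
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def tick(self):
        """Record that one more page was handled (safe to call from any thread)."""
        with self._lock:
            self._count += 1

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        last = 0
        # Event.wait returns False on timeout, True once stop() was called.
        while not self._stop.wait(self.interval):
            with self._lock:
                current = self._count
            self.stream.write("." if current != last else "!")
            self.stream.flush()  # without this the dot may sit in a buffer
            last = current
```

A long run of "!" characters means no worker has finished a page for several intervals, which is the stalled state described above.
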

This is my last question

Given Python's suitability for this sort of thing (isn't Google written
in it?) I can't believe that there isn't a kick-ass crawler
already out there...

regards

tom

http://www.theotherblog.com/Articles...rawler-spider/

Aug 4 '06 #1
Rem, what OS are you trying this on? Windows XP SP2 has a limit of
around 40 TCP connections per second...

Aug 4 '06 #2
