
Web Crawling/Threading and Things That Go Bump in the Night

Hello all

I am trying to write a reliable web-crawler. I tried to write my own
using recursion and found I quickly hit the "too many open sockets"
problem. So I looked for a threaded version that I could easily extend.
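
One common way around the socket problem described above is to drop recursion in favour of an explicit queue, and to close each connection before the next page is fetched. The following is only a minimal sketch of that idea, not the Spider.py code: `crawl`, `fetch`, `get_links` and `max_pages` are hypothetical names chosen for illustration, and `get_links(url, html)` is assumed to return the URLs found on a page.

```python
from collections import deque
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page; the with-block closes the socket when done."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, get_links, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl driven by an explicit queue instead of recursion.

    get_links(url, html) is a hypothetical helper that returns the URLs
    found on a page; it is passed in so the loop stays independent of parsing.
    """
    seen = {start_url}              # every URL ever queued
    frontier = deque([start_url])   # URLs still waiting to be fetched
    visited = []                    # URLs fetched successfully, in order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # network error or timeout: skip this page, keep going
        visited.append(url)
        for link in get_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

Because only one response is open at a time and the call depth never grows, the number of open sockets stays constant however large the site is. A threaded version would replace the single loop with a fixed number of workers reading from the same queue.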

The simplest/most reliable I found was called Spider.py (see attached).

At this stage I want a spider that I can point at a site, let it do
its thing, and reliably get a callback of sorts... including the html
(for me to parse), the url of the page in question (so I can log it)
and the urls-found-on-that-page (so I can strip out any ones I really
don't want and add them to the "seen-list").
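
The callback described above could look roughly like the sketch below. It is an assumption about the interface, not how Spider.py actually works: `handle(url, html, links)` is a hypothetical signature in which the callback returns the links it wants crawled, and `LinkParser`, `extract_links` and `process_page` are illustrative names. Only standard-library calls are used.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Make the link absolute and drop any #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(url, html):
    """Return the absolute URLs of all links found in html."""
    parser = LinkParser(url)
    parser.feed(html)
    return parser.links


def process_page(url, html, handle, seen):
    """Call handle(url, html, links) and record the links it keeps.

    handle is the user's callback (hypothetical signature); it returns the
    subset of links that should be crawled. The return value is the list of
    kept links that were not already in the seen set.
    """
    wanted = handle(url, html, extract_links(url, html))
    fresh = []
    for link in wanted:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh
```

With this shape the usage file needs little more than a `main()` and a `handle()` that logs the URL, parses the html and filters the link list.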
Now, this is my question.

The code above ALMOST works fine. The crawler crawls, I get the data I
need BUT... every now and again the code just pauses, I hit control-C
and it reports an error as if it has hit an exception and then carries
on!!! I like the fact that my spider_usage.py file has the minimum
amount of spider stuff in it... really just a main() and handle()
handler.

How does this happen... is a thread being killed and then a new one is
made or what? I suspect it may have something to do with sockets timing
out, but I have no idea...
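
If blocked sockets are indeed the cause (an assumption, since the Spider.py source is not shown here), one way to test the theory is to give every request an explicit timeout and have each worker thread report failures instead of hanging or dying quietly. This is a sketch only: `TIMEOUT_SECONDS`, `worker` and `fetch_all` are illustrative names, and the 15-second value is arbitrary.

```python
import queue
import threading
from urllib.request import urlopen

TIMEOUT_SECONDS = 15  # arbitrary value chosen for illustration


def fetch(url):
    # An explicit timeout makes a silent server raise an error
    # instead of blocking the thread indefinitely.
    with urlopen(url, timeout=TIMEOUT_SECONDS) as response:
        return response.read()


def worker(jobs, results, fetch_page):
    """Pull URLs off jobs until a None sentinel arrives."""
    while True:
        url = jobs.get()
        if url is None:
            jobs.task_done()
            break
        try:
            results.put((url, fetch_page(url), None))
        except Exception as exc:
            # Broad on purpose: a failure on one page is reported
            # to the caller rather than killing the thread.
            results.put((url, None, exc))
        finally:
            jobs.task_done()


def fetch_all(urls, fetch_page=fetch, num_threads=4):
    """Fetch urls with a fixed pool; return (url, body, error) tuples."""
    urls = list(urls)
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, fetch_page), daemon=True)
        for _ in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for url in urls:
        jobs.put(url)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker so each one exits
    jobs.join()
    return [results.get() for _ in urls]
```

Each result carries either the page body or the exception that stopped it, so timed-out URLs show up in the log rather than as an unexplained pause.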

By the way, on small sites (100s of pages) it never gets to the stall;
it's only on larger sites such as Amazon that it "fails".

This is my other question

It would be great to know, when the code is stalled, if it is doing
anything... is there any way to even print a full stop to screen?
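
Printing a full stop while the crawler runs can be done from a small background thread. The sketch below is one possible approach, not part of Spider.py: the `Heartbeat` class and its `page_done()` method are hypothetical names, and the crawler is assumed to call `page_done()` after each page it handles. Every interval it prints "." if pages were completed since the last tick and "?" if none were.

```python
import sys
import threading


class Heartbeat:
    """Print '.' each interval if work was done since the last tick, else '?'."""

    def __init__(self, interval=5.0, out=sys.stderr):
        self.interval = interval
        self.out = out
        self.pages_done = 0
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def page_done(self):
        """Call this from the crawler each time a page has been handled."""
        with self._lock:
            self.pages_done += 1

    def _run(self):
        last = 0
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self.interval):
            with self._lock:
                count = self.pages_done
            self.out.write("." if count > last else "?")
            self.out.flush()  # flush so the mark appears immediately
            last = count

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

A run of "?" marks means the process is alive but no page has completed, which points at blocked workers rather than a crashed program.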

This is my last question

Given Python's suitability for this sort of thing (isn't Google written
in it?) I can't believe that there isn't a kick ass crawler
already out there...

regards

tom

http://www.theotherblog.com/Articles...rawler-spider/

Aug 4 '06 #1
Rem, what OS are you trying this on? Windows XP SP2 has a limit of
around 40 TCP connections per second...

Aug 4 '06 #2
