
web crawler in python or C?

Hi guys. I have to implement a topical crawler as part of my
project. Which language should I implement it in,
C or Python? Python has a fast development cycle, but speed is
also a concern. I want to strike a balance between development speed and
crawler speed. Since Python is an interpreted language, it is rather
slow. The crawler, which will be working on a huge set of pages, should be
as fast as possible. One possible implementation would be to write it
partly in C and partly in Python so that I can have the best of both
worlds, but I don't know how to approach that. Can anyone guide me on
which parts I should implement in C and which in Python?

Feb 16 '06 #1
abhinav wrote:
> Hi guys. I have to implement a topical crawler as part of my
> project. Which language should I implement it in,
> C or Python? Python has a fast development cycle, but speed is
> also a concern. I want to strike a balance between development speed and
> crawler speed.

Web crawling is an inherently network-limited activity. The way to
speed up crawling is through parallel downloading; the performance of the
language is not going to have a relevant effect. Python's threads cannot
run Python bytecode in parallel because of the global interpreter lock, but
the lock is released during blocking I/O, so threaded downloads still
overlap; Python also supports weak coroutines (generators). (Of
course, C does not support any kind of multithreading, except by
platform-specific extensions -- but these extensions are widespread.)
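
To illustrate the parallel-downloading point, here is a minimal sketch in present-day Python. It assumes the standard `concurrent.futures` and `urllib.request` modules, which are newer than this thread; the names `fetch`, `fetch_all`, `fetch_one` and `max_workers` are made up for the example and are not from the original post. Because the workers spend nearly all their time waiting on the network, a plain thread pool is usually enough to keep many downloads in flight.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=16):
    """Fetch every URL concurrently; return {url: bytes or the exception raised}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the downloads overlap.
        futures = {url: pool.submit(fetch_one, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad page should not stop the crawl
                results[url] = exc
    return results
```

Calling `fetch_all(list_of_urls)` returns a dictionary keyed by URL. Passing a different `fetch_one` lets you swap in your own downloader (or a fake one for testing) without touching the pool logic.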

For the problem of parsing and handling data structures for this
activity, however, Python is *FAR* superior to C in terms of
development speed.
> [...] Since Python is an interpreted language, it is rather
> slow. The crawler, which will be working on a huge set of pages, should be
> as fast as possible. One possible implementation would be to write it
> partly in C and partly in Python so that I can have the best of both
> worlds, but I don't know how to approach that. Can anyone guide me on
> which parts I should implement in C and which in Python?


I have, in fact, done it this way myself in the past (before
Python had weak coroutines). I wrote a command-line tool in C that
pulls down a collection of URLs listed in a control file (the URLs
are downloaded in a multithreaded manner), then I drove this tool
from a Python program. Asymptotically, this pegs my download
bandwidth for the majority of the runtime, which puts it within
striking distance of the theoretical optimum.
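
The Python side of such a split might look roughly like the sketch below. It is not the actual program described above: the control-file format (one URL per line), the function name `run_downloader`, and the convention that the tool takes the control-file path as its last argument and prints one line per URL are all assumptions made for illustration.

```python
import os
import subprocess
import tempfile


def run_downloader(urls, command):
    """Write urls to a control file, run `command <control file>`, return stdout lines.

    `command` is the argv list of a hypothetical external downloader.
    """
    # Assumed control-file format: one URL per line.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as control:
        control.write("\n".join(urls) + "\n")
        control_path = control.name
    try:
        completed = subprocess.run(
            list(command) + [control_path],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError if the tool exits non-zero
        )
    finally:
        os.remove(control_path)
    return completed.stdout.splitlines()
```

With a real tool you would call something like `run_downloader(urls, ["./fetcher"])`, where `./fetcher` stands in for whatever the C program is called. All the parsing and bookkeeping stays in Python, and the C side only has to move bytes.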

The problem is that you've picked completely the wrong newsgroup to ask
this question. Unfortunately, there is no clue to this fact in the
name of this newsgroup. This is actually a newsgroup that discusses
only the ANSI/ISO C standard as it exists, and none of the
platform-specific extensions (including sockets and multithreading). Nor is
the discussion of the development of real applications considered
on-topic in this newsgroup. Neither is performance considered on-topic
-- by the standard, apparently you can't know even the *relative* speed
of anything in C. comp.programming would probably have been a better
place to post this.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Feb 16 '06 #2
