
web crawler in python or C?

Hi guys. I have to implement a topical crawler as part of my
project. Which language should I use,
C or Python? Python has a fast development cycle, but my concern is
speed too. I want to strike a balance between development speed and
crawler speed. Since Python is an interpreted language, it is rather
slow. The crawler, which will be working on a huge set of pages, should be
as fast as possible. One possible implementation would be doing it
partly in C and partly in Python, so that I can have the best of both
worlds. But I don't know how to approach it. Can anyone guide me on
which parts I should implement in C and which in Python?

Feb 16 '06 #1
13 Replies


"abhinav" <ab***********@gmail.com> writes:
The crawler, which will be working on a huge set of pages, should be
as fast as possible.


What kind of network connection do you have that's fast enough
that even a fairly CPU-inefficient crawler won't saturate it?
Feb 16 '06 #2

It is 128 kbps DSL broadband. But that's not the point. What I am saying
is: would Python be fine for implementing fast crawler algorithms, or
should I use C? Handling huge data, multithreading, file
handling, heuristics for ranking, and maintaining huge data
structures. Which language should I choose so as not to compromise too
much on speed? What is the performance of Python-based crawlers vs.
C-based crawlers? Should I use both languages (partly C and partly
Python)? How should I decide which parts to implement in C and which
in Python?
Please guide me. Thanks.

Feb 16 '06 #3


abhinav wrote:
It is 128 kbps DSL broadband. But that's not the point. What I am saying
is: would Python be fine for implementing fast crawler algorithms, or
should I use C?
But a web crawler is going to be *mainly* I/O bound - so language
efficiency won't be the main issue. There are several web crawlers
implemented in Python.
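To make the I/O-bound point concrete, here is a minimal sketch of the crawl loop itself. The `fetch_links` function is a hypothetical stand-in for the real HTTP fetch and link extraction (e.g. via urllib), which is where nearly all of the wall-clock time would go, regardless of the implementation language:

```python
from collections import deque

# Hypothetical stand-in for fetching a page and extracting its links;
# in a real crawler this is a network round-trip that dwarfs any
# per-URL bookkeeping cost in the loop below.
PAGES = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}

def fetch_links(url):
    return PAGES.get(url, [])

def crawl(seed):
    """Breadth-first crawl from a seed, visiting each page once."""
    seen = {seed}
    frontier = deque([seed])
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("a"))  # → ['a', 'b', 'c']
```

The loop body is a handful of dict and deque operations per URL; swapping that bookkeeping to C saves microseconds while each real fetch costs milliseconds to seconds.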
Handling huge data, multithreading, file
handling, heuristics for ranking, and maintaining huge data
structures. Which language should I choose so as not to compromise too
much on speed? What is the performance of Python-based crawlers vs.
C-based crawlers? Should I use both languages (partly C and partly Python)?
If your data processing requirements are fairly heavy, you will
*probably* get a speed advantage coding them in C and accessing them
from Python.

The usual advice (which seems applicable to you) is to
prototype in Python (which will be much more fun than in C), then test.

Profile to find your real bottlenecks (if the Python one isn't fast
enough - which it may be), and move your bottlenecks to C.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml


Feb 16 '06 #4

"abhinav" <ab***********@gmail.com> writes:
It is 128 kbps DSL broadband. But that's not the point.
But it is the point.
What I am saying is: would Python be fine for implementing fast
crawler algorithms, or should I use C? Handling huge
data, multithreading, file handling, heuristics for ranking, and
maintaining huge data structures. Which language should I choose so
as not to compromise too much on speed? What is the performance of
Python-based crawlers vs. C-based crawlers? Should I use both
languages (partly C and partly Python)? How should I decide which
parts to implement in C and which in Python? Please guide
me. Thanks.


I think if you don't know how to answer these questions for yourself,
you're not ready to take on projects of that complexity. My advice
is to start in Python, since development will be much easier. If and when
you start hitting performance problems, you'll have to examine many
combinations of tactics for dealing with them, and switching languages
is just one such tactic.
Feb 16 '06 #5


Paul Rubin wrote:
"abhinav" <ab***********@gmail.com> writes:

maintaining huge data structures. Which language should I choose so
as not to compromise too much on speed? What is the performance of
Python-based crawlers vs. C-based crawlers? Should I use both
languages (partly C and partly Python)? How should I decide which
parts to implement in C and which in Python? Please guide
me. Thanks.


I think if you don't know how to answer these questions for yourself,
you're not ready to take on projects of that complexity. My advice
is to start in Python, since development will be much easier. If and when
you start hitting performance problems, you'll have to examine many
combinations of tactics for dealing with them, and switching languages
is just one such tactic.


There's another potential bottleneck: parsing HTML and extracting the
text you want, especially when you hit pages that don't meet the HTML 4
or XHTML spec.
http://sig.levillage.org/?p=599

Paul's advice is very sound, given what little info you've provided.

http://trific.ath.cx/resources/python/optimization/
(And look at Psyco, Pyrex, Boost.Python, SWIG, and ctypes for bridging C
and Python; you have a lot of options. Also look at HarvestMan, mechanize,
and other existing libs.)
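On the parsing side, the standard library's HTML parser (today `html.parser`; Python 2 spelled the module `HTMLParser`) is reasonably tolerant of sloppy markup. A small sketch, using a deliberately malformed hypothetical snippet rather than any real page:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes, tolerating unclosed/malformed tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, names lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Deliberately sloppy markup: unclosed <p> and <a> tags.
messy = '<p>Hi <a href="/one">first <p><a href="/two">second'
parser = LinkExtractor()
parser.feed(messy)
print(parser.links)  # → ['/one', '/two']
```

For seriously broken real-world pages, a dedicated lenient parser (e.g. BeautifulSoup) is usually the sturdier choice; the point here is just that link extraction need not choke on invalid HTML.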

Feb 16 '06 #6


abhinav wrote:
Hi guys. I have to implement a topical crawler as part of my
project. Which language should I use?


Oh, and there are some really good books out there, besides the O'Reilly
Spidering Hacks. Springer Verlag has a couple of books on "Text Mining"
and at least a couple of books with "web intelligence" in the title.
Expensive, but worth it.

Feb 16 '06 #7

On 15 Feb 2006 21:56:52 -0800, abhinav <ab***********@gmail.com> wrote:
Hi guys. I have to implement a topical crawler as part of my
project. Which language should I use,
C or Python?


Why does this keep coming up on here as of late? If you search the
archives, you can find numerous posts about spiders. One interesting
fact is that Google itself started with their spiders in Python.
http://www-db.stanford.edu/~backrub/google.html I'm _sure_ it'll work
for you.

--
Andrew Gwozdziewycz <ap****@gmail.com>
http://ihadagreatview.org
http://plasticandroid.org
Feb 16 '06 #8

On Wed, 15 Feb 2006 21:56:52 -0800, abhinav wrote:
Hi guys. I have to implement a topical crawler as part of my
project. Which language should I use,
C or Python? Python has a fast development cycle, but my concern is
speed too. I want to strike a balance between development speed and
crawler speed. Since Python is an interpreted language, it is rather
slow.
Python is no more interpreted than Java. Like Java, it is compiled to
byte-code. Unlike Java, it doesn't take three weeks to start the runtime
environment. (Okay, maybe it just *seems* like three weeks.)

The nice clean distinctions between "compiled" and "interpreted" languages
haven't existed in most serious programming languages for a decade or
more. In these days of tokenizers and byte-code compilers and processors
emulating other processors, the difference is more of degree than kind.

It is true that standard Python doesn't compile to platform dependent
machine code, but that is rarely an issue since the bottleneck for most
applications is I/O or human interaction, not language speed. And for
those cases where it is a problem, there are solutions, like Psyco.

After all, it is almost never true that your code must run as fast as
physically possible. That's called "over-engineering". It just needs to
run as fast as needed, that's all. And that's a much simpler problem to
solve cheaply.
The crawler, which will be working on a huge set of pages, should be
as fast as possible.
Web crawler performance is almost certainly going to be I/O bound. Sounds
to me like you are guilty of trying to optimize your code before even
writing a single line of code. What you call "huge" may not be huge to
your computer. Have you tried? The great thing about Python is you can
write a prototype in maybe a tenth the time it would take you to do the
same thing in C. Instead of trying to guess what the performance
bottlenecks will be, you can write your code and profile it and find the
bottlenecks with accuracy.

One possible implementation would be doing it
partly in C and partly in Python, so that I can have the best of both
worlds.
Sure you can do that, if you need to.
But I don't know how to approach it. Can anyone guide me on
which parts I should implement in C and which in Python?


Yes. Write it all in Python. Test it, debug it, get it working.

Once it is working, and not before, rigorously profile it. You may find it
is fast enough.

If it is not fast enough, find the bottlenecks. Replace them with better
algorithms. We had an example on comp.lang.python just a day or two ago
where a function which was taking hours to complete was re-written with a
better algorithm which took only seconds. And still in Python.
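The "better algorithm" point is easy to demonstrate with a hypothetical crawler chore: deduplicating a list of URLs. The two functions below produce identical results, but the first rescans its output list for every input (O(n^2)) while the second uses a set for constant-time membership checks (O(n)); on a large frontier that gap is exactly the hours-versus-seconds difference described above.

```python
def dedupe_quadratic(urls):
    """O(n^2): a linear scan of the output list for every input URL."""
    out = []
    for u in urls:
        if u not in out:      # list membership is a linear scan
            out.append(u)
    return out

def dedupe_linear(urls):
    """O(n): constant-time membership checks against a set."""
    out, seen = [], set()
    for u in urls:
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out

urls = ["/a", "/b", "/a", "/c", "/b"] * 1000
assert dedupe_quadratic(urls) == dedupe_linear(urls) == ["/a", "/b", "/c"]
```

Both are plain Python; the speedup comes entirely from the data structure, not the language.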

If it is still too slow after using better algorithms, or if there are no
better algorithms, then and only then re-write those bottlenecks in C for
speed.
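And when that last resort is reached, a bottleneck can often be pushed into C without writing an extension module at all: ctypes can call an existing C library directly. A minimal sketch, assuming a Unix-like system where the platform C library can be located (on other platforms the lookup may need adjusting):

```python
import ctypes
import ctypes.util

# Locate and load the platform C library (assumes a Unix-like OS).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare strlen's C signature so ctypes converts arguments correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello, world"))  # → 12
```

For heavier interfaces, the SWIG/Pyrex route mentioned elsewhere in the thread gives more control, but ctypes is the lowest-friction way to test whether C actually helps.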

--
Steven.

Feb 16 '06 #9

abhinav wrote:
Hi guys. I have to implement a topical crawler as part of my
project. Which language should I use,
C or Python? Python has a fast development cycle, but my concern is
speed too. I want to strike a balance between development speed and
crawler speed. Since Python is an interpreted language, it is rather
slow. The crawler, which will be working on a huge set of pages, should be
as fast as possible. One possible implementation would be doing it
partly in C and partly in Python, so that I can have the best of both
worlds. But I don't know how to approach it. Can anyone guide me on
which parts I should implement in C and which in Python?

Get real. Any web crawler is bound to spend huge amounts of its time
waiting for data to come in over network pipes. Or do you have plans for
massive parallelism previously unheard of in the Python world?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Feb 17 '06 #10

This is following the pattern of your previous post on language choice
wrt. writing a mail server. It is very common for beginners to
over-emphasize performance requirements, the size of the executable,
etc. More is always good, right? Yes! But at what cost?

The rule of thumb for all your Python vs. C questions is...
1.) Choose Python by default.
2.) If your program is slow, it's your algorithm that you need to check
first. Python, strictly speaking, will be slow because of its dynamism.
However, most of what is performance-critical in Python is already
implemented in C. And the speed difference of well-written Python
programs, with properly chosen extensions and algorithms, is not far off.
3.) Remember that you can always drop back to C wherever you need to
without throwing away all of your code. And even if you had to, Python is
very valuable as a prototyping tool, since it is very agile. You would
have figured out what you needed to do by then, so rewriting it in C
will only take a fraction of the time compared to writing it in C
directly.

Don't even start with the question "is it fast enough?" till
you have already written it in Python and it turns out that it is not
running fast enough despite the correctness of your code. If it isn't,
you can fix it relatively easily. It is easy to write bad code in C, and
poorly written C code performs worse than well-written Python
code.
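The claim that Python's performance-critical pieces are already C is easy to check with the standard library's timeit. This hypothetical micro-benchmark sums a million integers with a hand-written Python loop and with the built-in sum, which iterates in C; both give the same answer, and the built-in is typically several times faster (exact ratios vary by machine, so no figure is asserted here):

```python
import timeit

N = 1_000_000

def manual_sum():
    # Pure-Python loop: one round of bytecode dispatch per element.
    total = 0
    for i in range(N):
        total += i
    return total

def builtin_sum():
    # sum() iterates in C, avoiding per-element interpreter overhead.
    return sum(range(N))

# Both match the closed form for 0 + 1 + ... + (N-1).
assert manual_sum() == builtin_sum() == N * (N - 1) // 2

print("manual :", timeit.timeit(manual_sum, number=3))
print("builtin:", timeit.timeit(builtin_sum, number=3))
```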

Remember Donald Knuth's quote.
"Premature optimization is the root of all evil in programming".

C is a language intended to be used when you NEED tight control over
memory allocation. It has few advantages in other scenarios. Don't
abuse it by choosing it by default.

Feb 17 '06 #11

Ravi Teja <we*********@gmail.com> wrote:
...
The rule of thumb for all your Python vs. C questions is...
1.) Choose Python by default.
+1 QOTW!-)

2.) If your program is slow, it's your algorithm that you need to check
Seriously: yes, and (often even more importantly) data structure.

However, often most important tip, particularly for large-scale systems,
is to consider your program's _architecture_ (algorithms are about
details of computation, architecture is about partitioning systems into
components, locating their deployment, and so forth). At a generic and
lowish level: are you for example creating a lot of threads each for a
small amount of work? Then consider reusing threads from a "worker
threads" pool. Or maybe you could avoid threads and use event-driven
programming; or, at the other extreme, have multiple processes
communicating by TCP/IP so you can scale up your system to tens or
hundreds of processors -- in the latter case, partitioning your system
appropriately to minimize inter-process communication may be the
bottleneck. Consider UDP, when you can afford missing a packet once in a
while -- sometimes it may let you reduce overheads compared to TCP
connections.

Database connections, and less importantly database cursors, are well
worth reusing. What are you "caching", and what instead is getting
recomputed over and over? It's possible to undercache (needless
repeated computation) but also to overcache (tying up memory and causing
paging). Are you making lots of system calls that you might be able to
avoid? Each system call has a context-switching cost, after all...
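The worker-thread-pool idea above can be sketched with the standard library's queue and threading modules. In this hypothetical skeleton, `process` stands in for the real fetch-and-parse work, and a `None` sentinel per worker signals shutdown:

```python
import queue
import threading

def process(url, results, lock):
    # Stand-in for fetch+parse; a real worker would do network I/O here.
    with lock:
        results.append(url.upper())

def worker(tasks, results, lock):
    while True:
        url = tasks.get()
        if url is None:          # sentinel: shut this worker down
            tasks.task_done()
            break
        process(url, results, lock)
        tasks.task_done()

def run_pool(urls, num_workers=4):
    """Process urls with a fixed pool of reused worker threads."""
    tasks = queue.Queue()
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)          # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    return results

out = run_pool(["a", "b", "c", "d"])
assert sorted(out) == ["A", "B", "C", "D"]
```

The pool is created once and reused for every task, which is exactly the fix for the "lot of threads each for a small amount of work" anti-pattern described above.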

Any or all of these hints may be irrelevant to a specific category of
applications, but then, so can the hint about algorithms be. One cool
thing about Python is that it makes it easy and fast for you to try out
different approaches (particularly to architecture, but to algorithms as
well), even drastically different ones, when simple reasoning about the
issues leaves you undecided and you need to settle them empirically.

Remember Donald Knuth's quote.
"Premature optimization is the root of all evil in programming".


I believe Knuth himself said he was quoting Tony Hoare, and indeed
referred to this as "Hoare's dictum".
Alex
Feb 17 '06 #12

abhinav wrote:
I want to strike a balance between development speed and crawler speed.


"The best performance improvement is the transition from the
nonworking state to the working state." - J. Osterhout

Try to get there as soon as possible. You can figure out what
that means. ;^)

When you do all your programming in Python, most of the code that
is relevant for speed *is* written in C already. If performance
is slow, measure! Use the profiler to see if you are spending a
lot of time in Python code. If that is your problem, take a close
look at your algorithms and perhaps your data structures and see
what you can improve with Python. In the long run, going from
e.g. O(n^2) to O(n log n) might mean much more than going from
Python to C. A poor algorithm in machine code still sucks when you
have to handle enough data. Changing your code to improve on
algorithms and structure is a lot easier in Python than in C.

If you've done all these things, still have performance problems,
and have identified a bottleneck in your Python code, it might
be time to get that piece rewritten in C. The easiest and least
intrusive way to do that might be with pyrex. You might also want
to try Psyco before you do this.
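Measuring before rewriting is a few lines with the standard library's cProfile and pstats. This hypothetical sketch profiles two toy stages (stand-ins for whatever you suspect is slow) and prints the most expensive calls by cumulative time:

```python
import cProfile
import io
import pstats

def tokenize(text):
    """Toy stand-in for a stage you suspect is slow."""
    return [w.strip(".,") for w in text.split()]

def rank(docs):
    """Another toy stage: count token occurrences across documents."""
    counts = {}
    for doc in docs:
        for tok in tokenize(doc):
            counts[tok] = counts.get(tok, 0) + 1
    return counts

docs = ["the quick brown fox.", "the lazy dog."] * 2000
profiler = cProfile.Profile()
counts = profiler.runcall(rank, docs)

# Report the ten most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())

assert counts["the"] == 4000
```

The report shows which functions actually dominate; only those are candidates for Pyrex, Psyco, or C.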

Even if you end up writing a whole program in C, it's not unlikely
that you will get to your goal faster if your first version is
written in Python.

Good luck!

P.S. Why someone would want to write yet another web crawler is
a puzzle to me. Surely there are plenty of good ideas that haven't
been properly implemented yet! It's probably very difficult to
beat Google on their home turf now, but I'd really like to see
a good tool to manage all that information I got from the net,
or through mail or wrote myself. I don't think they wrote that
yet--although I'm sure they are trying.
Feb 20 '06 #13

I think something that may be even more important to consider than just
the pure speed of your program would be ease of design, as well as the
overall stability of your code.

My opinion would be that writing in Python would have many benefits
over the speed gains of using C. For instance, your crawler will have to
handle all types of input from all over the web. Who can say what types
of malformed or poorly written data it will come across? I think it
would be easier to create a system to handle this type of data in
Python than in C.

I don't want to pigeonhole your project, but if it is for any use
other than a commercial product, I would say speed would be a concern
lower on the list than accuracy or time to develop. As others have
pointed out, if you hit many performance barriers, chances are the
problem is the algorithm and not Python itself.

I wish you luck and hope you will experiment in Python first. If your
crawler is still not up to par, at the very least you might come up
with some ideas for how Python could be improved.

Feb 20 '06 #14
