web crawler in python or C?

Hi guys. I have to implement a topical crawler as a part of my
project. What language should I implement it in,
C or Python? Python has a fast development cycle, but my concern is
speed also. I want to strike a balance between development speed and
crawler speed. Since Python is an interpreted language, it is rather
slow. The crawler, which will be working on a huge set of pages, should be
as fast as possible. One possible implementation would be implementing
partly in C and partly in Python so that I can have the best of both
worlds. But I don't know how to approach it. Can anyone guide me on
what part I should implement in C and what should be in Python?

Feb 16 '06 #1
"abhinav" <ab***********@ gmail.com> writes:
> The crawler, which will be working on a huge set of pages, should be
> as fast as possible.


What kind of network connection do you have, that's fast enough
that even a fairly cpu-inefficient crawler won't saturate it?
Feb 16 '06 #2
It is DSL broadband, 128 kbps. But that's not the point. What I am asking is
whether Python would be fine for implementing fast crawler algorithms, or
should I use C? It involves handling huge data, multithreading, file
handling, heuristics for ranking, and maintaining huge data
structures. What should the language be so as not to compromise that
much on speed? What is the performance of Python-based crawlers vs. C-based
crawlers? Should I use both languages (partly C and partly Python)? How
should I decide what part to implement in C and what should be
done in Python?
Please guide me. Thanks.

Feb 16 '06 #3

abhinav wrote:
> It is DSL broadband, 128 kbps. But that's not the point. What I am asking is
> whether Python would be fine for implementing fast crawler algorithms, or
> should I use C?

But a web crawler is going to be *mainly* I/O bound, so language
efficiency won't be the main issue. There are several web crawlers
implemented in Python.

> It involves handling huge data, multithreading, file
> handling, heuristics for ranking, and maintaining huge data
> structures. What should the language be so as not to compromise that
> much on speed? What is the performance of Python-based crawlers vs. C-based
> crawlers? Should I use both languages (partly C and partly Python)? How

If your data processing requirements are fairly heavy you will
*probably* get a speed advantage coding them in C and accessing them
from Python.

The usual advice (which seems to be applicable to you) is to
prototype in Python (which will be much more fun than in C), then test.

Profile to find your real bottlenecks (if the Python one isn't fast
enough - which it may be), and move your bottlenecks to C.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
> should I decide what part to implement in C and what should be
> done in Python?
> Please guide me. Thanks.


Feb 16 '06 #4
"abhinav" <ab***********@ gmail.com> writes:
> It is DSL broadband, 128 kbps. But that's not the point.

But it is the point.

> What I am asking is whether Python would be fine for implementing fast
> crawler algorithms, or should I use C? It involves handling huge
> data, multithreading, file handling, heuristics for ranking, and
> maintaining huge data structures. What should the language be so as
> not to compromise that much on speed? What is the performance of
> Python-based crawlers vs. C-based crawlers? Should I use both
> languages (partly C and partly Python)? How should I decide what part to
> implement in C and what should be done in Python? Please guide
> me. Thanks.


I think if you don't know how to answer these questions for yourself,
you're not ready to take on projects of that complexity. My advice
is start in Python since development will be much easier. If and when
you start hitting performance problems, you'll have to examine many
combinations of tactics for dealing with them, and switching languages
is just one such tactic.
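
To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch in Python. The 128 kbps figure comes from the thread; the average page size is purely an assumption for illustration, and the function name is hypothetical.

```python
# Back-of-the-envelope crawl rate for a bandwidth-limited crawler.
# AVG_PAGE_BYTES is an assumed figure for illustration, not a measurement.

LINK_BITS_PER_SEC = 128_000      # the 128 kbps DSL line from the thread
AVG_PAGE_BYTES = 25_000          # assumption: average page size in bytes


def pages_per_day(link_bits_per_sec, avg_page_bytes):
    """Upper bound on pages fetched per day if the link is fully used."""
    bytes_per_sec = link_bits_per_sec / 8
    pages_per_sec = bytes_per_sec / avg_page_bytes
    return pages_per_sec * 24 * 60 * 60


print(round(pages_per_day(LINK_BITS_PER_SEC, AVG_PAGE_BYTES)))
```

With these assumed numbers the link tops out at roughly 55,000 pages a day, whatever language the crawler is written in. That ceiling is set by the connection, not the CPU.
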
Feb 16 '06 #5

Paul Rubin wrote:
> "abhinav" <ab***********@ gmail.com> writes:
>
>> maintaining huge data structures. What should the language be so as
>> not to compromise that much on speed? What is the performance of
>> Python-based crawlers vs. C-based crawlers? Should I use both
>> languages (partly C and partly Python)? How should I decide what part to
>> implement in C and what should be done in Python? Please guide
>> me. Thanks.
>
> I think if you don't know how to answer these questions for yourself,
> you're not ready to take on projects of that complexity. My advice
> is start in Python since development will be much easier. If and when
> you start hitting performance problems, you'll have to examine many
> combinations of tactics for dealing with them, and switching languages
> is just one such tactic.


There's another potential bottleneck: parsing HTML and extracting the
text you want, especially when you hit pages that don't meet the HTML 4 or
XHTML spec.
http://sig.levillage.org/?p=599
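
As a small illustration of coping with sloppy markup, here is a sketch that pulls links out of a deliberately broken page using the standard library's `html.parser`. The class and function names and the sample input are made up for this example; a real crawler might well prefer a more forgiving third-party parser.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from the parser.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values found in html_text, in document order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical malformed input: unclosed tags, an unquoted attribute,
# a stray </b>, and upper-case tag names.
broken = (
    "<html><body><p>Hi<a href=one.html>one"
    "<a href='two.html'>two</b><A HREF=\"three\">"
)
print(extract_links(broken))
```

The parser does not validate the document; it just reports the tags it can recognise, which is usually what a crawler wants from pages that ignore the spec.
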

Paul's advice is very sound, given what little info you've provided.

http://trific.ath.cx/resources/python/optimization/
Also look at Psyco, Pyrex, Boost, SWIG, and ctypes for bridging C and
Python; you have a lot of options. Look at HarvestMan, mechanize, and
other existing libs as well.
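
For a taste of what bridging C and Python looks like, here is a minimal ctypes sketch that calls the C library's `strlen` from Python. It assumes a POSIX system (Linux or macOS) where the C library can be loaded this way; the wrapper name `c_strlen` is made up for the example.

```python
import ctypes
import ctypes.util

# Load the C standard library. If find_library returns None, CDLL(None)
# opens the running process itself, whose symbols include libc on typical
# POSIX builds. This is an assumption about the platform, not a guarantee.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Return the length of a NUL-terminated byte string using C's strlen."""
    return libc.strlen(data)


print(c_strlen(b"crawler"))
```

The same pattern (load a shared library, declare argument and return types, call) applies to a C routine of your own once profiling shows it is worth writing.
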

Feb 16 '06 #6

abhinav wrote:
> Hi guys. I have to implement a topical crawler as a part of my
> project. What language should I implement it in


Oh, and there are some really good books out there besides O'Reilly's
Spidering Hacks. Springer Verlag has a couple of books on "Text Mining"
and at least a couple of books with "web intelligence" in the title.
Expensive but worth it.

Feb 16 '06 #7
On 15 Feb 2006 21:56:52 -0800, abhinav <ab***********@ gmail.com> wrote:
> Hi guys. I have to implement a topical crawler as a part of my
> project. What language should I implement it in,
> C or Python?


Why does this keep coming up on here as of late? If you search the
archives, you can find numerous posts about spiders. One interesting
fact is that Google itself started out with its spiders in Python.
http://www-db.stanford.edu/~backrub/google.html I'm _sure_ it'll work
for you.

--
Andrew Gwozdziewycz <ap****@gmail.com>
http://ihadagreatview.org
http://plasticandroid.org
Feb 16 '06 #8
On Wed, 15 Feb 2006 21:56:52 -0800, abhinav wrote:
> Hi guys. I have to implement a topical crawler as a part of my
> project. What language should I implement it in,
> C or Python? Python has a fast development cycle, but my concern is
> speed also. I want to strike a balance between development speed and
> crawler speed. Since Python is an interpreted language, it is rather
> slow.

Python is no more interpreted than Java. Like Java, it is compiled to
byte-code. Unlike Java, it doesn't take three weeks to start the runtime
environment. (Okay, maybe it just *seems* like three weeks.)

The nice clean distinctions between "compiled" and "interpreted" languages
haven't existed in most serious programming languages for a decade or
more. In these days of tokenizers and byte-code compilers and processors
emulating other processors, the difference is more of degree than kind.
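
To see the byte-code step for yourself, here is a short sketch using the built-in `compile()` and the standard `dis` module. It assumes CPython, the standard implementation; the source string and variable names are made up for the example.

```python
import dis

source = "total = price * quantity"

# compile() turns source text into a code object holding byte-code,
# which the CPython virtual machine then executes.
code = compile(source, "<example>", "exec")

print(type(code.co_code))   # the raw byte-code is a bytes object
dis.dis(code)               # human-readable listing of the instructions

# Run the compiled code object in a namespace that supplies the inputs.
namespace = {"price": 3, "quantity": 4}
exec(code, namespace)
print(namespace["total"])
```

The exact instruction names in the listing vary between Python versions, but the compile-then-execute pipeline is the same.
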

It is true that standard Python doesn't compile to platform dependent
machine code, but that is rarely an issue since the bottleneck for most
applications is I/O or human interaction, not language speed. And for
those cases where it is a problem, there are solutions, like Psyco.

After all, it is almost never true that your code must run as fast as
physically possible. That's called "over-engineering". It just needs to
run as fast as needed, that's all. And that's a much simpler problem to
solve cheaply.
> The crawler, which will be working on a huge set of pages, should be
> as fast as possible.

Web crawler performance is almost certainly going to be I/O bound. Sounds
to me like you are guilty of trying to optimize your code before even
writing a single line of code. What you call "huge" may not be huge to
your computer. Have you tried? The great thing about Python is you can
write a prototype in maybe a tenth the time it would take you to do the
same thing in C. Instead of trying to guess what the performance
bottlenecks will be, you can write your code and profile it and find the
bottlenecks with accuracy.
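
A minimal sketch of that profiling step with the standard `cProfile` and `pstats` modules follows. The functions `slow_tokenize` and `process_page` are hypothetical stand-ins for a crawler's per-page work, written to be slow on purpose so the profiler has something to find.

```python
import cProfile
import io
import pstats


def slow_tokenize(text):
    """Hypothetical hot spot: builds the word list inefficiently on purpose."""
    words = []
    for word in text.split():
        words = words + [word.lower()]   # copies the list on every step
    return words


def process_page(text):
    """Hypothetical stand-in for the per-page work of a crawler."""
    return len(set(slow_tokenize(text)))


def profile_report(func, *args):
    """Run func under cProfile and return (result, stats report as text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


page = "Python C crawler speed " * 1000
result, report = profile_report(process_page, page)
print(report)
```

The report lists the functions with the largest cumulative time first, so the real bottleneck shows up by name instead of by guesswork.
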

> One possible implementation would be implementing
> partly in C and partly in Python so that I can have the best of both
> worlds.

Sure you can do that, if you need to.

> But I don't know how to approach it. Can anyone guide me on
> what part I should implement in C and what should be in Python?


Yes. Write it all in Python. Test it, debug it, get it working.

Once it is working, and not before, rigorously profile it. You may find it
is fast enough.

If it is not fast enough, find the bottlenecks. Replace them with better
algorithms. We had an example on comp.lang.python just a day or two ago
where a function which was taking hours to complete was re-written with a
better algorithm which took only seconds. And still in Python.
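
The comp.lang.python example itself isn't reproduced here, but a crawler-flavoured illustration of the same idea is easy to sketch: tracking already-seen URLs in a list versus a set. The function names and the URL workload below are hypothetical.

```python
import time


def dedupe_with_list(urls):
    """Quadratic: each membership test scans the whole list."""
    seen = []
    for url in urls:
        if url not in seen:
            seen.append(url)
    return seen


def dedupe_with_set(urls):
    """Linear on average: set membership is a hash lookup."""
    seen = set()
    ordered = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


# Hypothetical workload: 2,000 distinct URLs, each seen twice.
urls = ["http://example.com/page/%d" % i for i in range(2000)] * 2

for func in (dedupe_with_list, dedupe_with_set):
    start = time.perf_counter()
    unique = func(urls)
    elapsed = time.perf_counter() - start
    print(func.__name__, len(unique), "unique URLs in %.4f s" % elapsed)
```

Both versions return the same result; the gap between them widens quickly as the URL count grows, and no C is involved.
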

If it is still too slow after using better algorithms, or if there are no
better algorithms, then and only then re-write those bottlenecks in C for
speed.

--
Steven.

Feb 16 '06 #9
abhinav wrote:
> Hi guys. I have to implement a topical crawler as a part of my
> project. What language should I implement it in,
> C or Python? Python has a fast development cycle, but my concern is
> speed also. I want to strike a balance between development speed and
> crawler speed. Since Python is an interpreted language, it is rather
> slow. The crawler, which will be working on a huge set of pages, should be
> as fast as possible. One possible implementation would be implementing
> partly in C and partly in Python so that I can have the best of both
> worlds. But I don't know how to approach it. Can anyone guide me on
> what part I should implement in C and what should be in Python?

Get real. Any web crawler is bound to spend huge amounts of its time
waiting for data to come in over network pipes. Or do you have plans for
massive parallelism previously unheard of in the Python world?
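
The waiting described here can be simulated without touching the network. In the sketch below, `fake_fetch` is a made-up stand-in that just sleeps, and the URLs and latency are assumptions. The point is only that overlapping the waits with a thread pool cuts wall-clock time even though the CPU does almost nothing.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, latency=0.05):
    """Stand-in for a network request: just waits, as a slow server would."""
    time.sleep(latency)          # the CPU is idle here, like during real I/O
    return url, latency


def crawl(urls, workers):
    """Fetch all urls with a pool of worker threads.

    Returns (results, elapsed seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_fetch, urls))
    return results, time.perf_counter() - start


urls = ["http://example.com/%d" % i for i in range(10)]   # hypothetical URLs
_, sequential = crawl(urls, workers=1)
_, pooled = crawl(urls, workers=10)
print("1 worker: %.2f s, 10 workers: %.2f s" % (sequential, pooled))
```

Since the time goes to waiting and not computing, rewriting `fake_fetch` in C would change nothing; adding concurrency does.
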

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Feb 17 '06 #10
