
Web Crawler - Python or Perl?

Hi all,
I am currently planning to write my own web crawler. I know Python but
not Perl, and I am interested in knowing which of these two is the
better choice given the following scenario:

1) I/O issues: my biggest resource constraint will be the bandwidth
bottleneck.
2) Efficiency issues: The crawlers have to be fast, robust, and as
"memory efficient" as possible. I am running all of my crawlers on
cheap PCs with about 500 MB of RAM and P3 to P4 processors.
3) Compatibility issues: Most of these crawlers will run on Unix
(FreeBSD), so there should be a pretty good compiler that can
optimize my code under these environments.

What are your opinions?
Jun 27 '08 #1
12 Replies


On Jun 9, 11:48 pm, disappeare...@gmail.com wrote:
> Hi all,
> I am currently planning to write my own web crawler. I know Python but
> not Perl, and I am interested in knowing which of these two is the
> better choice given the following scenario:
>
> 1) I/O issues: my biggest resource constraint will be the bandwidth
> bottleneck.
> 2) Efficiency issues: The crawlers have to be fast, robust, and as
> "memory efficient" as possible. I am running all of my crawlers on
> cheap PCs with about 500 MB of RAM and P3 to P4 processors.
> 3) Compatibility issues: Most of these crawlers will run on Unix
> (FreeBSD), so there should be a pretty good compiler that can
> optimize my code under these environments.
>
> What are your opinions?
It really doesn't matter whether you use Perl or Python for writing
web crawlers; I have used both. The scenarios you mention (I/O,
efficiency, compatibility) don't differ too much between the two
languages. Both have fast I/O. In Python you can use the urllib2
module and/or Beautiful Soup for developing a crawler; in Perl you can
use the Mechanize or LWP modules. Both languages have good support for
regular expressions. I have heard Perl is slightly faster, though I
don't notice the difference myself. Both are compatible with *nix. For
writing a good crawler the language is not important; it's the
technology that is important.
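
For instance, fetching a page with urllib2 takes only a few lines. This
is just a rough sketch (the URL and User-Agent string are placeholders),
not production code:

import urllib2

# Identify the crawler politely; many sites block the default urllib2 agent.
request = urllib2.Request('http://example.com/',
                          headers={'User-Agent': 'MyCrawler/0.1'})
response = urllib2.urlopen(request)
page = response.read()       # raw HTML as a string
print response.geturl()      # final URL after any redirects
print len(page), 'bytes'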

regards,
Subeen.
http://love-python.blogspot.com/
Jun 27 '08 #2

di***********@gmail.com wrote:
> 1) I/O issues: my biggest resource constraint will be the bandwidth
> bottleneck.
> 2) Efficiency issues: The crawlers have to be fast, robust, and as
> "memory efficient" as possible. I am running all of my crawlers on
> cheap PCs with about 500 MB of RAM and P3 to P4 processors.
> 3) Compatibility issues: Most of these crawlers will run on Unix
> (FreeBSD), so there should be a pretty good compiler that can
> optimize my code under these environments.
You should rethink your requirements. You expect to be I/O bound, so why do
you require a good "compiler"? Especially when asking about two interpreted
languages...

Consider using lxml (with Python); it has pretty much everything you need
for a web crawler, supports threaded parsing directly from HTTP URLs, and
it's plenty fast and pretty memory efficient.

http://codespeak.net/lxml/
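
As a quick sketch of what that looks like (the URL below is just a
placeholder, and this assumes a reasonably recent lxml):

from lxml import html

# lxml can fetch and parse straight from a URL in one step.
doc = html.parse('http://example.com/')
root = doc.getroot()
root.make_links_absolute('http://example.com/')   # resolve relative hrefs

# iterlinks() yields (element, attribute, link, pos) for every link-like attribute.
for element, attribute, link, pos in root.iterlinks():
    if attribute == 'href':
        print link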

Stefan
Jun 27 '08 #3

subeen wrote:
> can use urllib2 module and/or beautiful soup for developing crawler
Not if you care about a) speed and/or b) memory efficiency.

http://blog.ianbicking.org/2008/03/3...r-performance/

Stefan
Jun 27 '08 #4

On Jun 10, 12:15 am, Stefan Behnel <stefan...@behnel.de> wrote:
> subeen wrote:
> > can use urllib2 module and/or beautiful soup for developing crawler
>
> Not if you care about a) speed and/or b) memory efficiency.
>
> http://blog.ianbicking.org/2008/03/3...r-performance/
>
> Stefan
Yes, Beautiful Soup is slower, so it's better to use urllib2 for
fetching data and regular expressions for parsing it.
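
Something along these lines (a naive sketch with a placeholder URL; a
real crawler needs error handling, and the regex will choke on unusual
markup):

import re
import urllib2

html_data = urllib2.urlopen('http://example.com/').read()

# Very rough href extraction; this is the part a real HTML parser does better.
links = re.findall(r'<a\s[^>]*href=["\']?([^"\'\s>]+)', html_data, re.IGNORECASE)
for link in links:
    print link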
regards,
Subeen.
http://love-python.blogspot.com/
Jun 27 '08 #5

At 11:21 AM -0700 6/9/08, subeen wrote:
> On Jun 10, 12:15 am, Stefan Behnel <stefan...@behnel.de> wrote:
> > subeen wrote:
> > > can use urllib2 module and/or beautiful soup for developing crawler
> >
> > Not if you care about a) speed and/or b) memory efficiency.
> >
> > http://blog.ianbicking.org/2008/03/3...r-performance/
> >
> > Stefan
>
> Yes, Beautiful Soup is slower, so it's better to use urllib2 for
> fetching data and regular expressions for parsing it.
> regards,
> Subeen.
> http://love-python.blogspot.com/
Beautiful Soup is a bit slower, but it will actually parse some of
the bizarre HTML you'll download off the web. We've written a couple
of crawlers to run over specific clients' sites (I note, we did _not_
create the content on these sites).

Expect to find html code that looks like this:

<ul>
<li>
<form>
</li>
</form>
</ul>
[from a real example, and yes, it did indeed render in IE.]

I don't know if some of the quicker parsers discussed require
well-formed HTML since I've not used them. You may want to consider
using one of the quicker HTML parsers and, when they throw a fit on
the downloaded HTML, drop back to Beautiful Soup -- which usually
gets _something_ useful off the page.
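
A sketch of that fallback idea (illustrative only; it assumes lxml and
BeautifulSoup 3.x are both installed, and the URL handling is up to you):

import urllib2
from lxml import etree
from BeautifulSoup import BeautifulSoup

def extract_links(url):
    data = urllib2.urlopen(url).read()
    try:
        # Fast path: lxml's forgiving HTML parser.
        root = etree.fromstring(data, etree.HTMLParser())
        return list(root.xpath('//a/@href'))
    except etree.XMLSyntaxError:
        # Slow path: Beautiful Soup usually gets _something_ off the page.
        soup = BeautifulSoup(data)
        return [a['href'] for a in soup.findAll('a', href=True)]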

--Ray

--

Raymond Cote
Appropriate Solutions, Inc.
PO Box 458 ~ Peterborough, NH 03458-0458
Phone: 603.924.6079 ~ Fax: 603.924.8668
rgacote(at)AppropriateSolutions.com
www.AppropriateSolutions.com
Jun 27 '08 #6

subeen <ta************@gmail.com> wrote on Monday, 09 June 2008 20:21:
> On Jun 10, 12:15 am, Stefan Behnel <stefan...@behnel.de> wrote:
> > subeen wrote:
> > > can use urllib2 module and/or beautiful soup for developing crawler
> >
> > Not if you care about a) speed and/or b) memory efficiency.
> >
> > http://blog.ianbicking.org/2008/03/3...r-performance/
> >
> > Stefan
>
> Yes, Beautiful Soup is slower, so it's better to use urllib2 for
> fetching data and regular expressions for parsing it.
BeautifulSoup is implemented on top of regular expressions. I doubt that
you can achieve a great performance gain by using plain regular
expressions, and even if you could, the gain is certainly not worth the
effort. Parsing markup with regular expressions is hard, and the result
will most likely not be as fast and as memory-efficient as lxml.html.

I personally am absolutely happy with lxml.html. It's fast, memory
efficient, yet powerful and easy to use.

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)
Jun 27 '08 #7

di***********@gmail.com <di***********@gmail.com> wrote:
> I am currently planning to write my own web crawler. I know Python but
> not Perl, and I am interested in knowing which of these two is the
> better choice given the following scenario:
>
> 1) I/O issues: my biggest resource constraint will be the bandwidth
> bottleneck.
> 2) Efficiency issues: The crawlers have to be fast, robust, and as
> "memory efficient" as possible. I am running all of my crawlers on
> cheap PCs with about 500 MB of RAM and P3 to P4 processors.
> 3) Compatibility issues: Most of these crawlers will run on Unix
> (FreeBSD), so there should be a pretty good compiler that can
> optimize my code under these environments.
>
> What are your opinions?
Use Python with Twisted.

With a friend I wrote a crawler. Our first attempt was standard
Python. Our second attempt was with Twisted. Twisted absolutely blew
the socks off our first attempt - mainly because you can fetch 100s or
1000s of pages simultaneously, without threads.

Python with Twisted will satisfy 1-3. You'll have to get your head
around its asynchronous nature, but once you do you'll be writing a
killer crawler ;-)

As for Perl - once upon a time I would have done this with perl, but I
wouldn't go back now!
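
A bare-bones sketch of the idea (placeholder URLs, no politeness or retry
logic, and error handling kept to a minimum):

from twisted.internet import reactor, defer
from twisted.web.client import getPage

urls = ['http://example.com/', 'http://example.org/']

def page_received(data, url):
    print url, '->', len(data), 'bytes'

def page_failed(err, url):
    print url, 'failed:', err.getErrorMessage()

deferreds = []
for u in urls:
    d = getPage(u)
    # callbackArgs/errbackArgs pass the URL through to the handlers.
    d.addCallbacks(page_received, page_failed,
                   callbackArgs=(u,), errbackArgs=(u,))
    deferreds.append(d)

# Stop the reactor once every fetch has either succeeded or failed.
defer.DeferredList(deferreds).addCallback(lambda _: reactor.stop())
reactor.run()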

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jun 27 '08 #8

Ray Cote wrote:
> Beautiful Soup is a bit slower, but it will actually parse some of the
> bizarre HTML you'll download off the web.
> [...]
> I don't know if some of the quicker parsers discussed require
> well-formed HTML since I've not used them. You may want to consider
> using one of the quicker HTML parsers and, when they throw a fit on the
> downloaded HTML, drop back to Beautiful Soup -- which usually gets
> _something_ useful off the page.
So does lxml.html. And if you still feel like needing BS once in a while,
there's lxml.html.soupparser.

http://codespeak.net/lxml/elementsoup.html
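
A tiny sketch of that combination (assumes BeautifulSoup is installed
alongside lxml):

from lxml.html import soupparser

# Parse hopeless markup through Beautiful Soup, but keep lxml's API.
broken = "<ul><li><form></li></form></ul>"
root = soupparser.fromstring(broken)
print [el.tag for el in root.iter()]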

Stefan
Jun 27 '08 #9

As to the why rather than the what: I am attempting to build a search
engine right now that will crawl not just HTML but other things too.

I am open to learning, but I don't want to learn anything that doesn't
really contribute to building my search engine for the moment. Hence I
want to see whether learning Perl will be helpful for the later parts
of my search engine.

Victor
Jun 27 '08 #10

di***********@gmail.com wrote:
> As to the why rather than the what: I am attempting to build a search
> engine right now that will crawl not just HTML but other things too.
>
> I am open to learning, but I don't want to learn anything that doesn't
> really contribute to building my search engine for the moment. Hence I
> want to see whether learning Perl will be helpful for the later parts
> of my search engine.
I honestly don't think there's anything useful in Perl that you can't do
in Python. There are tons of ugly ways to write unreadable code, though,
so if you prefer that, that's something that's harder to do in Python.

Stefan
Jun 27 '08 #11

On Mon, 09 Jun 2008 10:48:03 -0700, disappearedng wrote:
> I know Python but not Perl, and I am interested in knowing which of
> these two is the better choice.
I'm partial to *Python*, but, the last time I looked, *urllib2* didn't
provide a time-out mechanism that worked under all circumstances. My
client-side scripts would usually hang when the server quit
responding, which happened a lot.

You can get around this by starting an *html* retrieval in its own
thread, giving it a deadline, and killing it if it doesn't finish
gracefully.

A quicker and considerably grittier solution is to supply timeout
parameters to the *curl* command through the shell. Execute the command
and retrieve its output through the *subprocess* module.
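
Roughly like this (illustrative only; the URL is a placeholder and a real
crawler would want logging and retries):

import subprocess

def fetch(url, connect_timeout=10, total_timeout=60):
    # --connect-timeout caps the handshake; --max-time caps the whole transfer.
    proc = subprocess.Popen(
        ['curl', '--silent', '--connect-timeout', str(connect_timeout),
         '--max-time', str(total_timeout), url],
        stdout=subprocess.PIPE)
    data, _ = proc.communicate()
    if proc.returncode != 0:
        return None  # timed out or otherwise failed
    return data

page = fetch('http://example.com/')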

--
... Chuck Rhode, Sheboygan, WI, USA
... 1979 Honda Goldwing GL1000 (Geraldine)
... Weather: http://LacusVeris.com/WX
... 64° — Wind SE 5 mph — Sky partly cloudy.
Jun 27 '08 #12

On Jun 13, 1:26 am, Chuck Rhode <CRh...@LacusVeris.com> wrote:
> On Mon, 09 Jun 2008 10:48:03 -0700, disappearedng wrote:
> > I know Python but not Perl, and I am interested in knowing which of
> > these two is the better choice.
>
> I'm partial to *Python*, but, the last time I looked, *urllib2* didn't
> provide a time-out mechanism that worked under all circumstances. My
> client-side scripts would usually hang when the server quit
> responding, which happened a lot.
You can avoid the problem using the following code:
import socket

# Applies to all new socket connections, including those made by urllib2.
timeout = 300  # seconds
socket.setdefaulttimeout(timeout)

regards,
Subeen.
http://love-python.blogspot.com/
Jun 27 '08 #13
