Bytes | Software Development & Data Engineering Community

C# Crawler and performance (speed of crawling)

I am currently developing a web crawler, mainly crawling mobile pages (WML,
mobile XHTML) but not only those (also HTML/XML/...), and I wonder what speed
I can reach.
The crawler is written in C# using multithreading and HttpWebRequest.
At the moment it downloads and crawls around 5 pages per second, running on
a development machine with 512 MB of RAM and a shared ADSL connection
(2 Mbit). Is that ridiculous? What speed could I expect if I improve my
code, and how?
I would be very interested in feedback from people who have already worked
on this kind of thing.

/Benjamin

N.B.: sorry for my poor English (I am French ;)).
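As a sanity check on the question, the shared 2 Mbit link itself puts a ceiling on throughput. A back-of-envelope calculation (the ~20 KB average page size is an assumption, not a figure from the thread):

```csharp
using System;
using System.Globalization;

class BandwidthCeiling
{
    static void Main()
    {
        // Assumed averages (hypothetical, not from the thread):
        double linkBitsPerSec = 2000000.0; // 2 Mbit ADSL downlink
        double avgPageBytes = 20000.0;     // ~20 KB per mobile page

        double bytesPerSec = linkBitsPerSec / 8.0;       // 250 KB/s
        double pagesPerSec = bytesPerSec / avgPageBytes; // link-level ceiling

        Console.WriteLine("Ceiling: "
            + pagesPerSec.ToString(CultureInfo.InvariantCulture) + " pages/s");
    }
}
```

Under those assumptions the line tops out around 12.5 pages/s before any parsing or storage cost, so 5 pages/s on a shared connection is not an unreasonable figure.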
Nov 22 '05 #1
Hi Benjamin,

I'm working on a crawler too, and I'd be interested in swapping notes. It's
hard to know just how much you should be getting out of your crawler.

The current log file indicates that my app can process up to about 8 pages
per second, although it's often in the 2-5 range and varies wildly depending
on many factors. This processing includes converting pages to XHTML,
analysing them, filtering out unwanted pages or areas of a page, duplicate
checking, and finally writing the pages to the database (which is full-text
indexed). We're scanning 4,000+ sites and taking in 12,000 new pages a day,
and the app can cope with double that throughput. That's still quite low,
though, because it has to pause to make time for tasks other than crawling.
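The duplicate-checking step mentioned above could be sketched roughly like this (the class name and the whitespace/case normalisation rule are hypothetical, not Tobin's actual code): hash a normalised copy of the page body and skip anything whose digest has been seen before.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Minimal sketch of page duplication checking: hash the normalised
// body and remember the digests we have already stored.
class DuplicateFilter
{
    private readonly HashSet<string> seen = new HashSet<string>();

    // Returns true the first time this content is seen, false afterwards.
    public bool IsNew(string pageBody)
    {
        // Crude normalisation: lower-case and collapse whitespace, so
        // trivially reformatted copies of a page hash identically.
        string normalised = string.Join(" ",
            pageBody.ToLowerInvariant()
                    .Split((char[])null, StringSplitOptions.RemoveEmptyEntries));

        using (var sha = SHA256.Create())
        {
            string digest = BitConverter.ToString(
                sha.ComputeHash(Encoding.UTF8.GetBytes(normalised)));
            return seen.Add(digest); // HashSet.Add is false on a repeat
        }
    }

    static void Main()
    {
        var filter = new DuplicateFilter();
        Console.WriteLine(filter.IsNew("<p>Hello  World</p>")); // True
        Console.WriteLine(filter.IsNew("<p>hello world</p>"));  // False (same content)
    }
}
```

In a real pipeline the digest would go into the database alongside the page rather than an in-memory set, so the check survives restarts.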

I'm no expert, but how much improving the code buys you comes down to your
performance factors, which depend on both code and hardware. I'm spinning up
20 threads, using in-memory queues, etc.
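The "20 threads plus in-memory queues" layout can be sketched as below. This uses BlockingCollection, a later .NET type (in 2005 you would hand-roll the equivalent with Queue&lt;T&gt; and Monitor); the fetch step is stubbed out where a real crawler would issue its HttpWebRequest.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Sketch of a crawl frontier drained by a fixed pool of worker threads.
class CrawlerQueue
{
    static void Main()
    {
        var frontier = new BlockingCollection<string>();
        int crawled = 0;

        // Producer: seed the frontier, then signal that no more work is coming.
        for (int i = 0; i < 100; i++)
            frontier.Add("http://example.com/page/" + i);
        frontier.CompleteAdding();

        var workers = new Thread[20];
        for (int t = 0; t < workers.Length; t++)
        {
            workers[t] = new Thread(() =>
            {
                // GetConsumingEnumerable blocks until work arrives and exits
                // cleanly once the queue is drained and completed.
                foreach (string url in frontier.GetConsumingEnumerable())
                {
                    // A real crawler would fetch and parse `url` here
                    // (HttpWebRequest), then Add() any discovered links.
                    Interlocked.Increment(ref crawled);
                }
            });
            workers[t].Start();
        }
        foreach (var w in workers) w.Join();

        Console.WriteLine("Crawled " + crawled + " pages");
    }
}
```

With real fetching, newly discovered links would be fed back into the frontier before CompleteAdding is called, typically guarded by the duplicate check so the crawl terminates.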

The database is our biggest problem at the moment, since it can't cope with
simultaneous indexing and searching. I've spent hours in Query Analyzer and
following traces to get it tuned, but pinpointing bottlenecks is like
peeling an onion - you have to pick away at it methodically because there
are so many possible problem areas.

We're also considering scaling up to read-only databases that serve queries
only, so that we can index around the clock. Technologies such as Lucene for
text searching may also yield performance gains.

If you want to swap notes, email me at t0bin_<at>_t0binharris_<dot>_c0m.
Replace 0 with o.

Hope this helps

Tobin

"Benjamin Lefevre" <fa**@nospam.com> wrote in message
news:42***********************@news.wanadoo.fr...
I am currently developping a web crawler, mainly crawling mobile page (wml,
mobile xhtml) but not only (also html/xml/...), and I ask myself which
speed I can reach.
This crawler is developped in C# using multithreading and HttpWebRequest.
Actually my crawler is able to download and crawl pages at the speed of
around 5 pages per second. It's running on a development machine with
512Mb Ram and a shared ADSL-connection (2Mbits). Is it ridiculous ? Which
speed may I expect if I improve my code (how ?) ?
I would be very interested to have feedback from some people having
already worked on such stuff.

/Benjamin

N.B.: sorry for my poor english (I am french ;)).

Nov 22 '05 #2

This thread has been closed and replies have been disabled. Please start a new discussion.

