
C# Crawler and performance (speed of crawling)

I am currently developing a web crawler, mainly for mobile pages (WML,
mobile XHTML) but also for other formats (HTML, XML, ...), and I am wondering
what speed I can reach.
The crawler is written in C# using multithreading and HttpWebRequest.
At the moment it downloads and crawls around 5 pages per second. It is
running on a development machine with 512 MB of RAM and a shared 2 Mbit/s
ADSL connection. Is that ridiculously slow? What speed can I expect if I
improve my code, and how should I improve it?
I would be very interested in feedback from people who have already
worked on this kind of thing.

/Benjamin

N.B.: sorry for my poor English (I am French ;)).
Nov 22 '05 #1
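
One way to judge whether 5 pages per second is slow is to work out the ceiling the 2 Mbit/s line imposes. The sketch below does that arithmetic. It is in Python rather than C# only to keep it short; the average page sizes are hypothetical placeholders, not measurements, and the calculation ignores protocol overhead and the fact that the line is shared.

```python
def max_pages_per_second(link_bits_per_second: float, avg_page_bytes: float) -> float:
    """Upper bound on pages/s if the network link were the only bottleneck."""
    bytes_per_second = link_bits_per_second / 8
    return bytes_per_second / avg_page_bytes


# The 2 Mbit/s ADSL line from the post (overhead and sharing ignored).
ADSL_BITS_PER_SECOND = 2_000_000

# Hypothetical average page sizes in bytes; measure your own crawl instead.
ASSUMED_PAGE_SIZES = [("small mobile page", 5_000), ("larger HTML page", 50_000)]

for label, size in ASSUMED_PAGE_SIZES:
    ceiling = max_pages_per_second(ADSL_BITS_PER_SECOND, size)
    print(f"{label} ({size} bytes): at most {ceiling:.1f} pages/s")
```

Under these assumptions, 5 pages per second would already saturate the line if pages averaged about 50,000 bytes. If the pages are much smaller than that, the link is probably not the limit and the time is more likely going to latency, thread count, or parsing.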
1 Reply


Hi Benjamin,

I'm working on a crawler too, and I'd be interested in swapping notes. It's
hard to know just how much you should be getting from your crawler.

The current log file indicates that my app can process up to about 8 pages
per second, although it's often around the 2-5 mark and varies wildly
depending on many factors. This processing includes converting pages to
XHTML, analysing them, filtering out unwanted pages or areas of the page,
checking for duplicate pages, and finally writing the pages to the database
(which is full-text indexed). We're scanning 4000+ sites and taking in
12,000 new pages a day. The app can cope with double this throughput.
However, this is still quite low, because it has to pause to make time for
tasks other than crawling.
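
The post doesn't say how the duplicate check is done. One simple approach, shown here as a sketch and not as Tobin's actual method, is to hash each page after normalising whitespace and case, and skip any page whose hash has been seen before. The class and method names are made up for illustration, and the example is in Python rather than C# for brevity.

```python
import hashlib


class DuplicateFilter:
    """Remembers a fingerprint of every page seen so far (hypothetical helper)."""

    def __init__(self) -> None:
        self._seen = set()

    @staticmethod
    def fingerprint(text: str) -> str:
        # Collapse runs of whitespace and ignore case, so trivially
        # reformatted copies of a page produce the same hash.
        normalised = " ".join(text.split()).lower()
        return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

    def is_new(self, text: str) -> bool:
        """Return True the first time a page is seen, False for repeats."""
        digest = self.fingerprint(text)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

This only catches exact matches after normalisation. Pages that differ by a timestamp or an advert would need some kind of near-duplicate detection, which is a different technique.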

I'm no expert, but improving the code comes down to your performance
factors, which depend on both code and hardware. I'm running 20 threads,
using in-memory queues, etc.
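
For anyone wondering what "20 threads plus an in-memory queue" looks like in outline, here is a minimal sketch. It is an assumption about the general shape, not the actual code from either crawler, and it is in Python rather than C# to keep it short; in C# the same idea would be worker threads pulling URLs from a shared queue and fetching them with HttpWebRequest. The names crawl and fetch_url are invented for the example.

```python
import queue
import threading
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_url(url: str, timeout: float = 10.0) -> bytes:
    """Download one page body; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_url,
    num_workers: int = 20,
) -> Dict[str, bytes]:
    """Fetch a fixed set of URLs with num_workers threads fed from one queue."""
    pending: "queue.Queue[str]" = queue.Queue()
    for url in urls:
        pending.put(url)

    results: Dict[str, bytes] = {}
    results_lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, so this worker is finished
            try:
                body = fetch(url)
            except Exception:
                continue  # a real crawler would log the failure and maybe retry
            with results_lock:
                results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling crawl(list_of_urls) returns a dict mapping each successfully fetched URL to its body. The sketch stops when the initial queue is empty; a real crawler would also push newly discovered links back onto the queue and throttle requests per site.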

The database is our biggest problem at the moment, since it can't cope with
simultaneous indexing and searching. I've spent hours in both the query
analyser and trace output to get it tuned, but pinpointing bottlenecks is
like peeling an onion: you have to pick away at it methodically, because
there are so many possible problem areas.

We're also considering scaling up to read-only databases that are used for
querying only, so that we can index around the clock. Using technologies
such as Lucene for text searching may also yield performance increases.

If you want to swap notes, email me at t0bin_<at>_t0binharris_<dot>_c0m.
Replace 0 with o.

Hope this helps

Tobin

"Benjamin Lefevre" <fa**@nospam.com> wrote in message
news:42***********************@news.wanadoo.fr...
[snip]

Nov 22 '05 #2
