
Use of Xenu Link Sleuth on very large sites?

Does anyone have any experience running Xenu Link Sleuth:
http://home.snafu.de/tilman/xenulink.html
version 1.2e on very large sites?

I'm having problems running it against our site, in that
on my PC it will, for extended periods of time, consume
100% of the CPU cycles (usually with no internet activity).

I've been in touch with the program's author, but this
problem doesn't happen on his system.

Because of the 100% CPU utilisation, the program is of
course much slower than it should be, and I just terminated
the program after trying it again.
In 4 hours, it had checked 96,146 of 112,138 links, but
I know there are many more to check.

--
Dave Patton
Canadian Coordinator, the Degree Confluence Project
http://www.confluence.org dpatton at confluence dot org
My website: http://members.shaw.ca/davepatton/
Vancouver/Whistler - host of the 2010 Winter Olympics
Jul 20 '05 #1
14 Replies


In article <Xn******************************@24.71.223.159>, one of infinite monkeys
at the keyboard of Dave Patton <dp*****@remove-for-nospam.confluence.org> wrote:
> In 4 hours, it had checked 96,146 of 112,138 links, but
> I know there are many more to check.


96000 links in 4 hours is nearly 7 links per second, and if it's
consuming CPU without internet activity that's all hits to your
own server! That seems to imply an extremely ill-behaved robot.
Of course if it's only your server that's getting hit then it's
your business, but your question suggests it is a problem.
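
As a quick check of that figure, here is a minimal arithmetic sketch in Python using the counts quoted from the original post (the variable names are only illustrative):

```python
# Figures quoted from the original post: links checked and elapsed time.
links_checked = 96_146
elapsed_hours = 4

# Convert the elapsed time to seconds and compute the average rate.
elapsed_seconds = elapsed_hours * 3600
links_per_second = links_checked / elapsed_seconds

print(f"{links_per_second:.2f} links per second")
```

The average works out to a little under 7 links per second, consistent with the estimate above.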

You'd probably be better off using a well-behaved robot like Site Valet.
It'll take longer (by definition) even if you configure a short
revisit-site time, but the system load is negligible even when it's
running at many[1] hits per second (all to different servers so as
not to expose any one server to rapid-fire, of course).

[1] I can't tell you an upper limit to "many", it's bandwidth-limited.

--
Nick Kew
Jul 20 '05 #2

Nick Kew wrote:
> That seems to imply an extremely ill-behaved robot.
> Of course if it's only your server that's getting hit then it's
> your business, but your question suggests it is a problem.
>
> You'd probably be better off using a well-behaved robot like Site Valet.


You're jumping to an inappropriate conclusion; there is no indication
that Xenu is anything less than a well-behaved robot. The OP's
description indicates that the problem is local to the OP, as the
author's system does not exhibit this behaviour.

I've been using Xenu for years and it's not only well behaved, it's also
lightning fast and free (is Site Valet free?).

--
Spartanicus
Jul 20 '05 #3

in post <news:Xn******************************@24.71.223.159>
Dave Patton said:
> Does anyone have any experience running Xenu Link Sleuth:
> http://home.snafu.de/tilman/xenulink.html
> version 1.2e on very large sites?
>
> I'm having problems running it against our site, in that
> on my PC it will, for extended periods of time, consume
> 100% of the CPU cycles(usually with no internet activity).


i can't remember the version (it was years ago) but i used to check a
site locally made up of 15k+ pages with about 300k internal links (no
external). everything else had to be shut down and it took about 7 hours
to complete. it often appeared there was no activity and that the
computer had crashed. it was on a spare computer so it didn't really
matter.

--
brucie
19/December/2003 07:01:08 pm kilo
Jul 20 '05 #4

Dave Patton wrote:
> Does anyone have any experience running Xenu Link Sleuth:
> http://home.snafu.de/tilman/xenulink.html
> version 1.2e on very large sites?
>
> I'm having problems running it against our site, in that
> on my PC it will, for extended periods of time, consume
> 100% of the CPU cycles(usually with no internet activity).


I used to have a problem like that with an earlier version (1.2c IIRC; I
use 1.2d currently): it opened and parsed multimedia files (local drive
scan). I brought this to the author's attention and it was fixed in the
subsequently released 1.2d version.

--
Spartanicus
Jul 20 '05 #5

Spartanicus <me@privacy.net> writes:
> Nick Kew wrote:
>> That seems to imply an extremely ill-behaved robot.
>> Of course if it's only your server that's getting hit then it's
>> your business, but your question suggests it is a problem.
>>
>> You'd probably be better off using a well-behaved robot like Site Valet.
>
> You're jumping to an inappropriate conclusion, there is no indication
> that Xenu is anything less that a well behaved robot. The OP's


I've seen a Xenu (a recent one, at that) hit a server with hundreds of
consecutive page requests, with no delay between each hit. It's not
particularly well-behaved, though at least it didn't parallelise.

--
Chris
Jul 20 '05 #6

ni**@fenris.webthing.com (Nick Kew) wrote in
news:c1***********@jarl.webthing.com:
> In article <Xn******************************@24.71.223.159>, one of
> infinite monkeys at the keyboard of Dave Patton
> <dp*****@remove-for-nospam.confluence.org> wrote:
>> In 4 hours, it had checked 96,146 of 112,138 links, but
>> I know there are many more to check.
>
> 96000 links in 4 hours is nearly 7 links per second, and if it's
> consuming CPU without internet activity that's all hits to your
> own server!


Maybe I didn't explain things properly. Xenu is running on my PC.
When there is no internet activity, it isn't using the internet to
check our website. Your comment would seem to indicate you thought
Xenu was running on the same platform as the webserver.

--
Dave Patton
Jul 20 '05 #7

On Fri, 19 Dec 2003 19:17:06 +1000, brucie <sh**@bruciesusenetshit.info>
wrote in <br************@ID-117621.news.uni-berlin.de>:
> in post <news:Xn******************************@24.71.223.159>
> Dave Patton said:
>> Does anyone have any experience running Xenu Link Sleuth:
>> http://home.snafu.de/tilman/xenulink.html
>> version 1.2e on very large sites?
>>
>> I'm having problems running it against our site, in that
>> on my PC it will, for extended periods of time, consume
>> 100% of the CPU cycles(usually with no internet activity).
>
> i cant remember the version (it was years ago) but i used to check a
> site locally made up of 15k+ pages with about 300k internal links (no
> external). everything else had to be shut down and it took about 7 hours
> to complete. it often appeared there was no activity and that the
> computer had crashed. it was on a spare computer so it didn't really
> matter.


I made a speed improvement in 2001 (starting with 1.2a) by adding a hash
table, so that new links no longer needed to be looked up sequentially
in the URL table.
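
To illustrate why that kind of change matters, here is a rough Python sketch contrasting a sequential scan of a URL table with a hash-based lookup. It is not Xenu's actual code, which is not shown in this thread; the names `UrlTable`, `add_sequential` and `add_hashed` are made up for the example.

```python
class UrlTable:
    """Hypothetical URL table: a list for ordering plus a dict as hash index."""

    def __init__(self):
        self.urls = []    # URLs in the order they were discovered
        self.index = {}   # URL -> position in self.urls (the hash table)

    def add_sequential(self, url):
        """Find the URL by scanning the whole list: O(n) work per link."""
        for position, known in enumerate(self.urls):
            if known == url:
                return position
        self.urls.append(url)
        self.index[url] = len(self.urls) - 1
        return len(self.urls) - 1

    def add_hashed(self, url):
        """Find the URL through the hash index: O(1) work per link on average."""
        position = self.index.get(url)
        if position is None:
            position = len(self.urls)
            self.urls.append(url)
            self.index[url] = position
        return position


table = UrlTable()
first = table.add_hashed("http://example.com/a")
second = table.add_hashed("http://example.com/b")
again = table.add_hashed("http://example.com/a")   # already known, no new entry
```

With a few hundred thousand links, the sequential version compares each new link against up to every known URL, while the hashed version stays roughly constant per link.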

Tilman
Jul 20 '05 #8

Spartanicus <me@privacy.net> wrote in
news:5i********************************@news.spartanicus.utvinternet.ie:
> Dave Patton wrote:
>> Does anyone have any experience running Xenu Link Sleuth:
>> http://home.snafu.de/tilman/xenulink.html
>> version 1.2e on very large sites?
>>
>> I'm having problems running it against our site, in that
>> on my PC it will, for extended periods of time, consume
>> 100% of the CPU cycles(usually with no internet activity).
>
> I used to have a problem like that with an earlier version (1.2c IIRC, I
> use 1.2d currently), it opened and parsed multimedia files (local drive
> scan). I brought this to the author's attention and it was fixed in the
> subsequently released 1.2d version.


I've been in touch with the author, Tilman, who has been quite helpful.
He has made some suggestions, and also said that he can confirm
the behaviour I mention.
I'm going to try his suggestions, and I've also suggested some
enhancements to Xenu Link Sleuth, although whether they are
good ideas, or get implemented, is up to Tilman :-)

--
Dave Patton
Jul 20 '05 #9

On Fri, 19 Dec 2003 02:44:34 GMT, Dave Patton
<dp*****@remove-for-nospam.confluence.org> wrote in
<Xn******************************@24.71.223.159>:
> Does anyone have any experience running Xenu Link Sleuth:
> http://home.snafu.de/tilman/xenulink.html
> version 1.2e on very large sites?
>
> I'm having problems running it against our site, in that
> on my PC it will, for extended periods of time, consume
> 100% of the CPU cycles(usually with no internet activity).
>
> I've been in touch with the program's author, but this
> problem doesn't happen on his system.
>
> Because of the 100% CPU utilisation, the program is of
> course much slower than it should be, and I just terminated
> the program after trying it again.
> In 4 hours, it had checked 96,146 of 112,138 links, but
> I know there are many more to check.


(To others, since I already told Dave about this)

I was able to find out in the meantime why Xenu would go "100%" with no
internet activity for some time: your site has several pages with
several thousand links each. An example is this page:
http://www.confluence.org/showworld....w=true&scale=1
Xenu then needs 1-2 minutes to process all these links, i.e. look them
up, find out if they are new or not, and add them to the correct
location. In the meantime, background threads terminate normally but no
new threads are created.

The solution would be to exclude certain pages that have only
"automatic" links, i.e. links calculated by the server.

I may add something to the FAQ... another problem with big websites is
people who insist on making a site map. This takes forever, especially
if the website has a forum. While I haven't investigated this fully, it
seems that only the first report option works fast enough.
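
The exclusion idea mentioned above (skipping pages that contain only server-calculated links) could look roughly like the following Python sketch. It is only an illustration of prefix-based filtering, not a description of how Xenu implements it; `should_check` and the example.org URLs are hypothetical and not taken from the real site.

```python
def should_check(url, excluded_prefixes):
    """Return False for URLs that start with any excluded prefix."""
    return not any(url.startswith(prefix) for prefix in excluded_prefixes)


# Hypothetical prefix for server-generated pages; not taken from the real site.
excluded = ["http://www.example.org/generated/"]

queue = [
    "http://www.example.org/index.html",
    "http://www.example.org/generated/map?page=1",
    "http://www.example.org/generated/map?page=2",
]

# Only URLs outside the excluded prefixes remain to be checked.
to_check = [url for url in queue if should_check(url, excluded)]
```

Filtering before the links are added to the table means the expensive per-link lookup never runs for the generated pages.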

Tilman
Jul 20 '05 #10

Chris Morris <c.********@durham.ac.uk> wrote in
news:87************@dinopsis.dur.ac.uk:
> Spartanicus <me@privacy.net> writes:
>> Nick Kew wrote:
>>> That seems to imply an extremely ill-behaved robot.
>>> Of course if it's only your server that's getting hit then it's
>>> your business, but your question suggests it is a problem.
>>>
>>> You'd probably be better off using a well-behaved robot like Site
>>> Valet.
>>
>> You're jumping to an inappropriate conclusion, there is no indication
>> that Xenu is anything less that a well behaved robot. The OP's
>
> I've seen a Xenu (a recent one, at that) hit a server with hundreds of
> consecutive page requests, with no delay between each hit. It's not
> particularly well-behaved, though at least it didn't parallelise.


Xenu uses "preemptive multithreading", with up to 100 threads:
http://home.snafu.de/tilman/xenulink.html
"It means that the link checking software retrieves several web
pages at the same time; the competition uses the same technique."
Can you explain why you consider it to be "not particularly
well-behaved", and why you say it "didn't parallelise"?
Thanks

--
Dave Patton
Jul 20 '05 #11

In article <23********************************@news.spartanicus.utvinternet.ie>, one of infinite monkeys
at the keyboard of Spartanicus <me@privacy.net> wrote:

> You're jumping to an inappropriate conclusion, there is no indication
> that Xenu is anything less that a well behaved robot.

A well-behaved robot doesn't submit a server to rapid-fire hits.
In some circumstances, like a busy webserver, a rapid-fire robot
becomes a DOS attack. The "general" convention is no more than
one hit per minute on any one server, although that can of course
be increased within reason on a private network.

> I've been using Xenu for years and it's not only well behaved, it's also
> lightning fast and free (is Site Valet free?).


Not for a website on the scale of the OP's, though the limited online
service is of course free. But if I allowed a rapid-fire service to
spider beyond a few tens of links at any one site, I'd expect to get
firewalled off from many sites pretty quickly.
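
The one-hit-per-minute convention described above can be sketched as a small per-host scheduler. This is a minimal Python illustration under stated assumptions, not how Site Valet or Xenu actually work; `PoliteScheduler` and its parameters are hypothetical names, and the clock is injectable only so the logic can be exercised without waiting.

```python
import time
from urllib.parse import urlsplit


class PoliteScheduler:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds=60.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock      # injectable so the logic can be tested
        self.last_hit = {}      # host -> time of the most recent request

    def wait_time(self, url):
        """Seconds to wait before this URL's host may be hit again."""
        last = self.last_hit.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_hit(self, url):
        """Remember that a request to this URL's host was just made."""
        self.last_hit[urlsplit(url).netloc] = self.clock()


# Usage with a fake clock: 45 seconds after hitting example.com.
now = [0.0]
scheduler = PoliteScheduler(min_delay_seconds=60.0, clock=lambda: now[0])
scheduler.record_hit("http://example.com/a")
now[0] = 45.0
remaining = scheduler.wait_time("http://example.com/b")   # same host
other = scheduler.wait_time("http://example.org/")        # different host
```

Because the delay is tracked per host, a robot can still keep many requests in flight overall as long as they go to different servers.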

--
Nick Kew
Jul 20 '05 #12

Dave Patton <dp*****@remove-for-nospam.confluence.org> writes:
> Chris Morris <c.********@durham.ac.uk> wrote in
>> I've seen a Xenu (a recent one, at that) hit a server with hundreds of
>> consecutive page requests, with no delay between each hit. It's not
>> particularly well-behaved, though at least it didn't parallelise.
>
> "It means that the link checking software retrieves several web
> pages at the same time; the competition uses the same technique."
> Can you explain why you consider it to be "not particularly
> well-behaved", and why you say it "didn't parallelise"?


"not particularly well-behaved" - it left no time between retrievals.
This put a very high load on the server, since it got into some fairly
deep pages that take a while to generate but are _usually_ asked for
infrequently.

"didn't parallelise" - it had no more than one open request to _this
particular server_ (and only one domain on that server, as it happens)
at any one time, which was fortunate or it would have caused even more
severe problems.

--
Chris
Jul 20 '05 #13

On 19 Dec 2003 16:09:49 +0000, Chris Morris <c.********@durham.ac.uk>
wrote in <87************@dinopsis.dur.ac.uk>:
> Spartanicus <me@privacy.net> writes:
>> Nick Kew wrote:
>>> That seems to imply an extremely ill-behaved robot.
>>> Of course if it's only your server that's getting hit then it's
>>> your business, but your question suggests it is a problem.
>>>
>>> You'd probably be better off using a well-behaved robot like Site Valet.
>>
>> You're jumping to an inappropriate conclusion, there is no indication
>> that Xenu is anything less that a well behaved robot. The OP's
>
> I've seen a Xenu (a recent one, at that) hit a server with hundreds of
> consecutive page requests, with no delay between each hit. It's not
> particularly well-behaved, though at least it didn't parallelise.


Normally it should send parallel requests... on the client side, it
starts background threads. I was able to run 100 parallel threads on my
machine to check Dave's site. The funny thing is that I noticed I got
more performance by reducing the number of threads to 10.

Normally, Xenu tries to "randomize" the load somewhat. When people are
checking both internal and external URLs of a site, Xenu will choose URLs
at random from the current list of URLs to check, in an attempt to
reduce the load per server. Plus, many URLs are checked with HEAD
instead of GET.
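
A rough Python sketch of the idea of picking the next URL at random from the pending list follows. It is not Xenu's actual code; `next_url` and `request_method` are hypothetical names, and the rule used to choose HEAD versus GET is an assumption for illustration only.

```python
import random


def next_url(pending, rng=random):
    """Remove and return a randomly chosen URL from the pending list."""
    index = rng.randrange(len(pending))
    # Swap the chosen entry with the last one so removal is O(1).
    pending[index], pending[-1] = pending[-1], pending[index]
    return pending.pop()


def request_method(url, site_prefix):
    """Assumed rule: GET pages on the checked site (their links must be
    parsed), HEAD everything else (only the status is needed)."""
    return "GET" if url.startswith(site_prefix) else "HEAD"


pending = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
]
rng = random.Random(0)   # seeded only to make the example repeatable
order = [next_url(pending, rng) for _ in range(3)]
```

Random selection only spreads the load when the pending list mixes several servers; if every URL belongs to one site, that server still receives every request.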

While Xenu might be used for a simple DoS attack, it is easy to counter
this by excluding user-agents that start with the word "Xenu" on a
website. I suspect that some websites do so, e.g. Google.
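
Blocking by user-agent prefix, as described above, might be sketched in Python as a small WSGI wrapper. This is an illustration under assumptions rather than a recommended configuration: `is_blocked` and `block_agents` are hypothetical names, and the exact user-agent string a given Xenu version sends is not stated in this thread.

```python
def is_blocked(user_agent, blocked_prefixes=("Xenu",)):
    """Return True if the User-Agent header starts with a blocked prefix."""
    return user_agent is not None and user_agent.startswith(blocked_prefixes)


def block_agents(app, blocked_prefixes=("Xenu",)):
    """Wrap a WSGI app and answer 403 Forbidden to blocked user agents."""
    def wrapper(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT"), blocked_prefixes):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Any other client is passed through to the wrapped application.
        return app(environ, start_response)
    return wrapper
```

A client can send any user-agent string it likes, so a check like this only stops tools that identify themselves honestly.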

Tilman
Jul 20 '05 #14

Tilman Hausherr <ti****@berlin.snafu.de> writes:
> On 19 Dec 2003 16:09:49 +0000, Chris Morris <c.********@durham.ac.uk>
>> I've seen a Xenu (a recent one, at that) hit a server with hundreds of
>> consecutive page requests, with no delay between each hit. It's not
>
> Normally, Xenu tries to "randomize" the load somewhat. When people are
> checking both internal and external URls of a site, Xenu will chose URLs

As far as I can tell from the server logs this one was set to single
domain, on a site with very few external URLs.

> at random from the current list of URLs to check, in an attempt to
> reduce the load per server. Plus, many URLs are checked with HEAD
> instead of GET.

Unfortunately that doesn't help much if it hits a complex server-side
script, which needs to do a lot of processing just to work out the
HEAD section. And it does do GET requests for anything on the site
being checked.

> While Xenu might be used for a simple DoS attack, it is easy to counter
> this by excluding user-agents that start with the word "Xenu" on a
> website. I suspect that some websites do so, e.g. google.


I almost did that on this occasion. Given that it's the recommended
link checking tool here, though, it would block far too many
legitimate applications for it to be anything other than a temporary
measure.

Then again, the server survived without actual intervention being
necessary, so I'm not too concerned about Xenu-using DoS attacks.

--
Chris
Jul 20 '05 #15
