Use of Xenu Link Sleuth on very large sites?

Does anyone have any experience running Xenu Link Sleuth:
http://home.snafu.de/tilman/xenulink.html
version 1.2e on very large sites?

I'm having problems running it against our site, in that
on my PC it will, for extended periods of time, consume
100% of the CPU cycles (usually with no internet activity).

I've been in touch with the program's author, but this
problem doesn't happen on his system.

Because of the 100% CPU utilisation, the program is of
course much slower than it should be, and I just terminated
the program after trying it again.
In 4 hours, it had checked 96,146 of 112,138 links, but
I know there are many more to check.

--
Dave Patton
Canadian Coordinator, the Degree Confluence Project
http://www.confluence.org dpatton at confluence dot org
My website: http://members.shaw.ca/davepatton/
Vancouver/Whistler - host of the 2010 Winter Olympics
Jul 20 '05 #1
In article <Xn******************************@24.71.223.159> , one of infinite monkeys
at the keyboard of Dave Patton <dp*****@remove-for-nospam.confluence.org> wrote:
In 4 hours, it had checked 96,146 of 112,138 links, but
I know there are many more to check.


96000 links in 4 hours is nearly 7 links per second, and if it's
consuming CPU without internet activity that's all hits to your
own server! That seems to imply an extremely ill-behaved robot.
Of course if it's only your server that's getting hit then it's
your business, but your question suggests it is a problem.
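The rate quoted above can be checked with a quick calculation. This is only a sketch of the arithmetic, using the figures from the original post; the function name is illustrative, and the result is an average over the whole run that says nothing about where the time was actually spent.

```python
# Figures quoted in the original post.
links_checked = 96_146
elapsed_hours = 4


def links_per_second(links: int, hours: float) -> float:
    """Average link-check rate over the whole run."""
    return links / (hours * 3600)


rate = links_per_second(links_checked, elapsed_hours)
print(f"{rate:.2f} links per second")  # prints "6.68 links per second"
```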

You'd probably be better off using a well-behaved robot like Site Valet.
It'll take longer (by definition) even if you configure a short
revisit-site time, but the system load is negligible even when it's
running at many[1] hits per second (all to different servers so as
not to expose any one server to rapid-fire, of course).

[1] I can't tell you an upper limit to "many", it's bandwidth-limited.

--
Nick Kew
Jul 20 '05 #2
Nick Kew wrote:
That seems to imply an extremely ill-behaved robot.
Of course if it's only your server that's getting hit then it's
your business, but your question suggests it is a problem.

You'd probably be better off using a well-behaved robot like Site Valet.


You're jumping to an inappropriate conclusion, there is no indication
that Xenu is anything less than a well-behaved robot. The OP's
description indicates that it is a problem local to the OP as the
author's system does not exhibit this behaviour.

I've been using Xenu for years and it's not only well behaved, it's also
lightning fast and free (is Site Valet free?).

--
Spartanicus
Jul 20 '05 #3
in post <news:Xn******************************@24.71.223.159>
Dave Patton said:
Does anyone have any experience running Xenu Link Sleuth:
http://home.snafu.de/tilman/xenulink.html
version 1.2e on very large sites?

I'm having problems running it against our site, in that
on my PC it will, for extended periods of time, consume
100% of the CPU cycles (usually with no internet activity).


i cant remember the version (it was years ago) but i used to check a
site locally made up of 15k+ pages with about 300k internal links (no
external). everything else had to be shut down and it took about 7 hours
to complete. it often appeared there was no activity and that the
computer had crashed. it was on a spare computer so it didn't really
matter.

--
brucie
19/December/2003 07:01:08 pm kilo
Jul 20 '05 #4
Dave Patton wrote:
Does anyone have any experience running Xenu Link Sleuth:
http://home.snafu.de/tilman/xenulink.html
version 1.2e on very large sites?

I'm having problems running it against our site, in that
on my PC it will, for extended periods of time, consume
100% of the CPU cycles (usually with no internet activity).


I used to have a problem like that with an earlier version (1.2c IIRC, I
use 1.2d currently), it opened and parsed multimedia files (local drive
scan). I brought this to the author's attention and it was fixed in the
subsequently released 1.2d version.

--
Spartanicus
Jul 20 '05 #5
Spartanicus <me@privacy.net> writes:
Nick Kew wrote:
That seems to imply an extremely ill-behaved robot.
Of course if it's only your server that's getting hit then it's
your business, but your question suggests it is a problem.

You'd probably be better off using a well-behaved robot like Site Valet.


You're jumping to an inappropriate conclusion, there is no indication
that Xenu is anything less than a well-behaved robot. The OP's


I've seen a Xenu (a recent one, at that) hit a server with hundreds of
consecutive page requests, with no delay between each hit. It's not
particularly well-behaved, though at least it didn't parallelise.

--
Chris
Jul 20 '05 #6
ni**@fenris.webthing.com (Nick Kew) wrote in
news:c1***********@jarl.webthing.com:
In article <Xn******************************@24.71.223.159> , one of
infinite monkeys
at the keyboard of Dave Patton
<dp*****@remove-for-nospam.confluence.org> wrote:
In 4 hours, it had checked 96,146 of 112,138 links, but
I know there are many more to check.


96000 links in 4 hours is nearly 7 links per second, and if it's
consuming CPU without internet activity that's all hits to your
own server!


Maybe I didn't explain things properly. Xenu is running on my PC.
When there is no internet activity, it isn't using the internet to
check our website. Your comment would seem to indicate you thought
Xenu was running on the same platform as the webserver.

--
Dave Patton
Canadian Coordinator, the Degree Confluence Project
http://www.confluence.org dpatton at confluence dot org
My website: http://members.shaw.ca/davepatton/
Vancouver/Whistler - host of the 2010 Winter Olympics
Jul 20 '05 #7
On Fri, 19 Dec 2003 19:17:06 +1000, brucie <sh**@bruciesusenetshit.info>
wrote in <br************@ID-117621.news.uni-berlin.de>:
in post <news:Xn******************************@24.71.223.159>
Dave Patton said:
Does anyone have any experience running Xenu Link Sleuth:
http://home.snafu.de/tilman/xenulink.html
version 1.2e on very large sites?

I'm having problems running it against our site, in that
on my PC it will, for extended periods of time, consume
100% of the CPU cycles (usually with no internet activity).


i cant remember the version (it was years ago) but i used to check a
site locally made up of 15k+ pages with about 300k internal links (no
external). everything else had to be shut down and it took about 7 hours
to complete. it often appeared there was no activity and that the
computer had crashed. it was on a spare computer so it didn't really
matter.


I made a speed improvement in 2001 (starting with 1.2a) by adding a hash
table, so that new links no longer needed to be looked up sequentially
in the URL table.
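To illustrate why that change matters on a large site, here is a minimal Python sketch contrasting a sequential lookup with a hashed one. It is not Xenu's code; the function names and the table layout are assumptions made purely for illustration.

```python
def add_link_sequential(url_table: list, url: str) -> bool:
    """Linear scan: every new link is compared against all stored URLs."""
    if url in url_table:  # O(n) membership test on a list
        return False
    url_table.append(url)
    return True


def add_link_hashed(url_table: list, index: dict, url: str) -> bool:
    """Hash lookup: average cost does not grow with the table size."""
    if url in index:  # O(1) average membership test on a dict
        return False
    index[url] = len(url_table)  # remember the URL's position in the table
    url_table.append(url)
    return True


table, index = [], {}
add_link_hashed(table, index, "http://www.example.org/a")  # new, stored
add_link_hashed(table, index, "http://www.example.org/a")  # duplicate, ignored
```

With a few hundred thousand links, the sequential version performs on the order of n comparisons per new link, so total work grows roughly quadratically; the hashed version keeps each lookup close to constant time.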

Tilman
Jul 20 '05 #8
Spartanicus <me@privacy.net> wrote in
news:5i********************************@news.spartanicus.utvinternet.ie:
Dave Patton wrote:
Does anyone have any experience running Xenu Link Sleuth:
http://home.snafu.de/tilman/xenulink.html
version 1.2e on very large sites?

I'm having problems running it against our site, in that
on my PC it will, for extended periods of time, consume
100% of the CPU cycles (usually with no internet activity).


I used to have a problem like that with an earlier version (1.2c IIRC, I
use 1.2d currently), it opened and parsed multimedia files (local drive
scan). I brought this to the author's attention and it was fixed in the
subsequently released 1.2d version.


I've been in touch with the author, Tilman, who has been quite helpful.
He has made some suggestions, and also said that he can confirm
the behaviour I mention.
I'm going to try his suggestions, and I've also made some suggestions
for some enhancements to Xenu Link Sleuth, although whether they are
good ideas, or get implemented, is up to Tilman :-)

--
Dave Patton
Canadian Coordinator, the Degree Confluence Project
http://www.confluence.org dpatton at confluence dot org
My website: http://members.shaw.ca/davepatton/
Vancouver/Whistler - host of the 2010 Winter Olympics
Jul 20 '05 #9
On Fri, 19 Dec 2003 02:44:34 GMT, Dave Patton
<dp*****@remove-for-nospam.confluence.org> wrote in
<Xn******************************@24.71.223.159> :
Does anyone have any experience running Xenu Link Sleuth:
http://home.snafu.de/tilman/xenulink.html
version 1.2e on very large sites?

I'm having problems running it against our site, in that
on my PC it will, for extended periods of time, consume
100% of the CPU cycles (usually with no internet activity).

I've been in touch with the program's author, but this
problem doesn't happen on his system.

Because of the 100% CPU utilisation, the program is of
course much slower than it should be, and I just terminated
the program after trying it again.
In 4 hours, it had checked 96,146 of 112,138 links, but
I know there are many more to check.


(To others, since I already told Dave about this)

I was able to find out in the meantime why Xenu would go "100%" with no
internet activity for some time: your site has several pages which have
several thousand links each. An example is this page:
http://www.confluence.org/showworld....w=true&scale=1
Xenu then needs 1-2 minutes to process all these links, i.e. look them
up, find out if they are new or not, and add them to the correct
location. In the meantime, background threads terminate normally but no
new threads are created.

The solution would be to exclude certain pages that have only
"automatic" links, i.e. links calculated by the server.
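The exclusion idea can be sketched in Python as a prefix filter applied before links are queued. This is an illustration only, not how Xenu implements its exclusions; the prefix shown is a hypothetical placeholder, and the function names are made up for the example.

```python
# Hypothetical prefixes; substitute the generated pages found on the real site.
EXCLUDED_PREFIXES = (
    "http://www.example.org/generated-map",
)


def should_check(url: str, excluded_prefixes=EXCLUDED_PREFIXES) -> bool:
    """Return False for URLs that fall under an excluded prefix."""
    return not url.startswith(tuple(excluded_prefixes))


def filter_links(links):
    """Keep only the links that are not excluded, preserving order."""
    return [u for u in links if should_check(u)]
```

Filtering at queue time means the several thousand server-calculated links on such a page are never looked up or stored at all.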

I may add something to the FAQ... Another problem with big websites is
people who insist on making a site map. This takes forever, especially
if the website has a forum. While I haven't investigated this fully, it
seems that only the first report option works fast enough.

Tilman
Jul 20 '05 #10
Chris Morris <c.********@durham.ac.uk> wrote in
news:87************@dinopsis.dur.ac.uk:
Spartanicus <me@privacy.net> writes:
Nick Kew wrote:
>That seems to imply an extremely ill-behaved robot.
>Of course if it's only your server that's getting hit then it's
>your business, but your question suggests it is a problem.
>
>You'd probably be better off using a well-behaved robot like Site
>Valet.


You're jumping to an inappropriate conclusion, there is no indication
that Xenu is anything less than a well-behaved robot. The OP's


I've seen a Xenu (a recent one, at that) hit a server with hundreds of
consecutive page requests, with no delay between each hit. It's not
particularly well-behaved, though at least it didn't parallelise.


Xenu uses "preemptive multithreading", with up to 100 threads:
http://home.snafu.de/tilman/xenulink.html
"It means that the link checking software retrieves several web
pages at the same time; the competition uses the same technique."
Can you explain why you consider it to be "not particularly
well-behaved", and why you say it "didn't parallelise"?
Thanks

--
Dave Patton
Canadian Coordinator, the Degree Confluence Project
http://www.confluence.org dpatton at confluence dot org
My website: http://members.shaw.ca/davepatton/
Vancouver/Whistler - host of the 2010 Winter Olympics
Jul 20 '05 #11
In article <23********************************@news.spartanicus.utvinternet.ie>, one of infinite monkeys
at the keyboard of Spartanicus <me@privacy.net> wrote:

You're jumping to an inappropriate conclusion, there is no indication
that Xenu is anything less than a well-behaved robot.

A well-behaved robot doesn't subject a server to rapid-fire hits.
In some circumstances, such as a busy webserver, a rapid-fire robot
becomes a DoS attack. The "general" convention is no more than
one hit per minute on any one server, although that can of course
be increased within reason on a private network.

I've been using Xenu for years and it's not only well behaved, it's also
lightning fast and free (is Site Valet free?).


Not for a website on the scale of the OP's, though the limited online
service is of course free. But if I allowed a rapid-fire service to
spider beyond a few tens of links at any one site, I'd expect to get
firewalled off from many sites pretty quickly.
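A per-server delay of the kind described in this post could look roughly like the following Python sketch. It is not taken from Site Valet or any other tool; the class name, the 60-second default (matching the one-hit-per-minute convention mentioned above), and the injectable clock are all assumptions for illustration.

```python
import time
from urllib.parse import urlsplit


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request is allowed."""

    def __init__(self, min_delay_seconds: float = 60.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self._next_allowed = {}

    @staticmethod
    def _host(url: str) -> str:
        return urlsplit(url).netloc.lower()

    def wait_time(self, url: str) -> float:
        """Seconds to wait before this URL's host may be hit again (0 if ready)."""
        now = self.clock()
        return max(0.0, self._next_allowed.get(self._host(url), now) - now)

    def record_hit(self, url: str) -> None:
        """Note that a request was just sent to this URL's host."""
        self._next_allowed[self._host(url)] = self.clock() + self.min_delay
```

A crawler using this would pick, from its pending URLs, one whose `wait_time` is zero, which is how many hits per second overall can coexist with a long delay on any single server.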

--
Nick Kew
Jul 20 '05 #12
Dave Patton <dp*****@remove-for-nospam.confluence.org> writes:
Chris Morris <c.********@durham.ac.uk> wrote in
I've seen a Xenu (a recent one, at that) hit a server with hundreds of
consecutive page requests, with no delay between each hit. It's not
particularly well-behaved, though at least it didn't parallelise.


"It means that the link checking software retrieves several web
pages at the same time; the competition uses the same technique."
Can you explain why you consider it to be "not particularly
well-behaved", and why you say it "didn't parallelise"?


"not particularly well-behaved" - it left no time between retrievals.
This put a very high load on the server, since it got into some fairly
deep pages that take a while to generate but are _usually_ asked for
infrequently.

"didn't parallelise" - it had no more than one open request to _this
particular server_ (and only one domain on that server, as it happens)
at any one time, which was fortunate or it would have caused even more
severe problems.

--
Chris
Jul 20 '05 #13
On 19 Dec 2003 16:09:49 +0000, Chris Morris <c.********@durham.ac.uk>
wrote in <87************@dinopsis.dur.ac.uk>:
Spartanicus <me@privacy.net> writes:
Nick Kew wrote:
>That seems to imply an extremely ill-behaved robot.
>Of course if it's only your server that's getting hit then it's
>your business, but your question suggests it is a problem.
>
>You'd probably be better off using a well-behaved robot like Site Valet.


You're jumping to an inappropriate conclusion, there is no indication
that Xenu is anything less than a well-behaved robot. The OP's


I've seen a Xenu (a recent one, at that) hit a server with hundreds of
consecutive page requests, with no delay between each hit. It's not
particularly well-behaved, though at least it didn't parallelise.


Normally it should send parallel requests... on the client side, it
starts background threads. I was able to run 100 parallel threads on my
machine to check Dave's site. The funny thing is that I noticed I got
more performance by reducing the number of threads to 10.

Normally, Xenu tries to "randomize" the load somewhat. When people are
checking both internal and external URLs of a site, Xenu will choose URLs
at random from the current list of URLs to check, in an attempt to
reduce the load per server. Plus, many URLs are checked with HEAD
instead of GET.
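The two ideas in the paragraph above, random selection from the pending list and HEAD in place of GET, can be sketched in Python as follows. This is not Xenu's code: the function names are invented, and the rule "GET only when the body is needed to extract further links" is an assumption made for the example.

```python
import random
import urllib.request


def next_url(pending: list, rng: random.Random) -> str:
    """Remove and return a randomly chosen URL from the pending list."""
    i = rng.randrange(len(pending))
    # Swap the chosen entry to the end so removal is O(1).
    pending[i], pending[-1] = pending[-1], pending[i]
    return pending.pop()


def build_check_request(url: str, need_body: bool) -> urllib.request.Request:
    """Build a GET when the page must be parsed for links, else a HEAD."""
    method = "GET" if need_body else "HEAD"
    return urllib.request.Request(url, method=method)
```

Random selection only spreads the load when the pending list mixes several hosts; on a single-domain check every pick still lands on the same server.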

While Xenu might be used for a simple DoS attack, it is easy to counter
this by excluding user-agents that start with the word "Xenu" on a
website. I suspect that some websites do so, e.g. google.
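On the server side, that kind of user-agent exclusion might look like the Python sketch below. It is a generic illustration, not a configuration for any particular web server; the function names, the case-sensitive prefix match, and the sample user-agent strings in the usage lines are assumptions.

```python
def is_blocked_agent(user_agent: str, blocked_prefixes=("Xenu",)) -> bool:
    """True if the User-Agent header starts with a blocked prefix."""
    return user_agent.strip().startswith(tuple(blocked_prefixes))


def status_for(headers: dict) -> int:
    """Return 403 for blocked agents and 200 otherwise."""
    user_agent = headers.get("User-Agent", "")
    return 403 if is_blocked_agent(user_agent) else 200


status_for({"User-Agent": "Xenu Link Sleuth"})  # blocked: 403
status_for({"User-Agent": "ExampleBrowser/1.0"})  # allowed: 200
```

Because the header is supplied by the client, this stops only tools that identify themselves honestly.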

Tilman
Jul 20 '05 #14
Tilman Hausherr <ti****@berlin.snafu.de> writes:
On 19 Dec 2003 16:09:49 +0000, Chris Morris <c.********@durham.ac.uk>
I've seen a Xenu (a recent one, at that) hit a server with hundreds of
consecutive page requests, with no delay between each hit. It's not
Normally, Xenu tries to "randomize" the load somewhat. When people are
checking both internal and external URLs of a site, Xenu will choose URLs


As far as I can tell from the server logs this one was set to single
domain, on a site with very few external URLs.

at random from the current list of URLs to check, in an attempt to
reduce the load per server. Plus, many URLs are checked with HEAD
instead of GET.

Unfortunately that doesn't help much if it hits a complex server-side
script, which needs to do a lot of processing just to work out the
HEAD section. And it does do GET requests for anything on the site
being checked.

While Xenu might be used for a simple DoS attack, it is easy to counter
this by excluding user-agents that start with the word "Xenu" on a
website. I suspect that some websites do so, e.g. google.


I almost did that on this occasion. Given that it's the recommended
link checking tool here, though, it would block far too many
legitimate applications for it to be anything other than a temporary
measure.

Then again, the server survived without actual intervention being
necessary, so I'm not too concerned about Xenu-using DoS attacks.

--
Chris
Jul 20 '05 #15
