Warning: robots.txt unreliable in Apache servers


Subject: Warning: robots.txt unreliable in Apache servers
From: Philip Ronan <in*****@invalid.invalid>
Newsgroups: alt.internet.search-engines
Message-ID: <BF89BF33.39FDF%in*****@invalid.invalid>
Date: Sat, 29 Oct 2005 23:07:46 GMT

Hi,

I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.

For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.

Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in their
robots.txt file:

Disallow: /foundation/

But if I posted a link to //////foundation/ somewhere, the search engines
will be quite happy to index it because it isn't covered by this rule.

As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(

You might want to check the behaviour of your servers to see if you're
vulnerable to the same sort of problem.

If anyone's interested, I've put together a .htaccess rule and a PHP script
that seem to sort things out.

Phil

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 30 '05
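
As a rough illustration of the mismatch described in the post above, the sketch below models a Disallow rule as a plain prefix test on the request path. The helper name `is_disallowed` and the rule list are made up for this example, and real crawlers may normalise URLs in ways this does not capture.

```python
def is_disallowed(path, disallow_rules):
    """Return True if path starts with any non-empty Disallow prefix."""
    # Hypothetical helper: a bare prefix comparison, nothing more.
    return any(rule and path.startswith(rule) for rule in disallow_rules)


rules = ["/foundation/"]  # the example rule from the post

print(is_disallowed("/foundation/", rules))       # True: covered by the rule
print(is_disallowed("//////foundation/", rules))  # False: the prefix differs
```

Under that assumption, the multi-slash path is simply a different string, so the rule never applies to it even though the server returns the same page.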
On Mon, 31 Oct 2005, Big Bill wrote:
>>> I am still hoping that one of the .htaccess experts will come up
>>> with a way to make all multiple-slash requests 301 redirect to
>>> their single-slash versions.
>>
>> Trivial. Do it yourself.
>
> Umm, no, I think we'll hand the floor over to you at this point.

Use the rewrite engine to compare the URI to the pattern "(.*)//(.*)", using
apache's syntax. How to do this in other webservers is your problem.

Since the URI is the resource pathname, it won't contain the double slash after
the protocol and colon. I'm not here to spoon feed you.
Nov 7 '05 #51
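
The point about the resource pathname can be checked outside Apache. The sketch below uses Python's `urllib.parse.urlsplit` to separate the path from the scheme and host, then looks for a slash cluster in the path only. The helper name `path_has_slash_cluster` is invented for this example, and this says nothing about what Apache itself passes to the rewrite engine.

```python
from urllib.parse import urlsplit


def path_has_slash_cluster(url):
    """True if the path component (not the scheme separator) contains '//'."""
    return "//" in urlsplit(url).path


# The "//" after "http:" belongs to the scheme/authority part, not the path.
print(urlsplit("http://apache.org/foundation/").path)
print(path_has_slash_cluster("http://apache.org/foundation/"))
print(path_has_slash_cluster("http://apache.org//////foundation/"))
```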
On Mon, 31 Oct 2005, Philip Ronan wrote:
> "Guy Macon" wrote:
>> One would think that if such a trivial fix existed that someone
>> in the last 40+ posts would have posted it, thus solving the
>> problem...
>
> Guy, if you've seen my solution at <http://tinyurl.com/89bmv> and you
> haven't got access to PHP, you could try a recursive solution using
> .htaccess by itself:
>
> RewriteEngine On
> RewriteCond %{REQUEST_URI} //+
> RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]

The begin and end of string markers, ^ and $, should not be necessary. You
want to rewrite it if it occurs anywhere, so position markers are superfluous.

> I haven't tested this, but -- in theory -- if the server detects a cluster
> of forward slashes in a request URI, it will redirect the client to a URI
> containing a single slash in its place. If a request contains more than one
> cluster of forward slashes, then the client will be redirected more than
> once, but it should eventually get to the right place.

You see, trivial - ONE rule.
Nov 7 '05 #52
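
The repeated-redirect behaviour described in the quoted post can be modelled with the same regular expression applied to a plain string. This is only a sketch of the pattern's matching, assuming the raw path reaches the rule unchanged, which the thread does not confirm for a .htaccess context. The function names `redirect_once` and `follow_redirects` and the hop limit are invented for the example.

```python
import re

# Same pattern as the quoted RewriteRule, applied here to a plain string.
PATTERN = re.compile(r"^(.*)//+(.*)$")


def redirect_once(path):
    """Return the rewritten path, or None if the pattern does not match."""
    match = PATTERN.match(path)
    if match is None:
        return None
    return match.group(1) + "/" + match.group(2)


def follow_redirects(path, limit=20):
    """Apply the rewrite until it stops matching; return (final_path, hops)."""
    hops = 0
    while hops < limit:
        rewritten = redirect_once(path)
        if rewritten is None:
            break
        path, hops = rewritten, hops + 1
    return path, hops


final_path, hops = follow_redirects("//////foundation/")
print(final_path, hops)  # /foundation/ 5
```

In this model each hop removes only one slash, because the greedy first group absorbs the rest of the cluster, so a run of six slashes takes five hops to collapse. The end result is still the single-slash path.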
On Mon, 07 Nov 2005 08:44:45 GMT, "D. Stussy"
<at******@bde-arc.ampr.org> wrote:
> On Mon, 31 Oct 2005, Big Bill wrote:
>>>> I am still hoping that one of the .htaccess experts will come up
>>>> with a way to make all multiple-slash requests 301 redirect to
>>>> their single-slash versions.
>>>
>>> Trivial. Do it yourself.
>>
>> Umm, no, I think we'll hand the floor over to you at this point.
>
> Use the rewrite engine to compare the URI to the pattern "(.*)//(.*)", using
> apache's syntax. How to do this in other webservers is your problem.
>
> Since the URI is the resource pathname, it won't contain the double slash after
> the protocol and colon. I'm not here to spoon feed you.

Me you have to with stuff like that.

BB
--
www.kruse.co.uk/ se*@kruse.demon.co.uk
The buffalo have gone
Nov 7 '05 #53
On Mon, 07 Nov 2005 09:38:30 +0100, D. Stussy <at******@bde-arc.ampr.org>
wrote:
>> Quoting RFC1738 (BNF description of url): (...) So it seems you are
>> wrong - multiple slashes in URLs are valid.
>
> However, this is usually further restricted by the filesystem naming
> conventions and that's where it's not proper.

Give a reference. cd //user//local////www/data works under Linux
and FreeBSD, cd winnt\\cache works under Windows. They are not
restricted by the filesystem, so they are proper.

Best,
Borek
--
http://www.chembuddy.com - chemical calculators for labs and education
BATE - program for pH calculations
CASC - Concentration and Solution Calculator
pH lectures - guide to hand pH calculation with examples
Nov 7 '05 #54
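
The filesystem side of this argument can be poked at with Python's `posixpath.normpath`, used here only as a stand-in for POSIX-style path handling and not as the kind of reference being asked for above. It collapses repeated slashes inside a path but keeps exactly two leading slashes, a case POSIX leaves implementation-defined.

```python
import posixpath

# Repeated slashes inside the path collapse to one.
print(posixpath.normpath("/user//local////www/data"))

# Exactly two leading slashes are preserved by normpath.
print(posixpath.normpath("//user//local////www/data"))
```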
On Mon, 07 Nov 2005 09:44:45 +0100, D. Stussy <at******@bde-arc.ampr.org>
wrote:
> Since the URI is the resource pathname, it won't contain the double
> slash after the protocol and colon.

As I said: give a reference. So far it is only PbBA (i.e. Proof by Bold
Assertion).

Best,
Borek
--
http://www.chembuddy.com - chemical calculators for labs and education
BATE - program for pH calculations
CASC - Concentration and Solution Calculator
pH lectures - guide to hand pH calculation with examples
Nov 7 '05 #55

Borek wrote:
> D. Stussy wrote:
>>> Quoting RFC1738 (BNF description of url): (...) So it seems you are
>>> wrong - multiple slashes in URLs are valid.
>>
>> However, this is usually further restricted by the filesystem naming
>> conventions and that's where it's not proper.
>
> Give a reference. cd //user//local////www/data works under Linux
> and FreeBSD, cd winnt\\cache works under Windows. They are not
> restricted by the filesystem, so they are proper.
>
>> Since the URI is the resource pathname, it won't contain the double
>> slash after the protocol and colon.
>
> As I said: give a reference. So far it is only PbBA (i.e. Proof by Bold
> Assertion).

Gosh, it sure got quiet all of a sudden... :)

Nov 9 '05 #56
Philip Ronan wrote:
> "Dave0x01" wrote:
>> Philip Ronan wrote:
>>> OK, so as long as the robots.txt documentation includes a note saying that
>>> you have to patch your server software to get reliable results, then we'll
>>> all be fine.
>>
>> I wouldn't consider patching of the Apache source code either necessary
>> or desirable in this situation.
>
> I was being sarcastic. (You're American, right?)

Yeah, I could tell. And I *wasn't* being sarcastic. What about my
comment do you think implies otherwise?

>> Does the URL in question appear in the index as
>> <http://###.####.###//contact/>, or as <http://###.####.###/contact/>?
>> My assumption is the latter.
>
> Then what the hell do you think this thread is all about??

[snip]

One could obviously be concerned about any number of things resulting
from the behavior described.

Nov 23 '05 #57
