
Warning: robots.txt unreliable in Apache servers


Subject: Warning: robots.txt unreliable in Apache servers
From: Philip Ronan <in*****@invalid.invalid>
Newsgroups: alt.internet.search-engines
Message-ID: <BF89BF33.39FDF%in*****@invalid.invalid>
Date: Sat, 29 Oct 2005 23:07:46 GMT

Hi,

I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.

For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.

Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in their
robots.txt file:

Disallow: /foundation/

But if I posted a link to //////foundation/ somewhere, the search engines
would be quite happy to index it because it isn't covered by this rule.

As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(

You might want to check the behaviour of your servers to see if you're
vulnerable to the same sort of problem.

If anyone's interested, I've put together a .htaccess rule and a PHP script
that seem to sort things out.
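
A minimal sketch of the general approach (not the actual rule referred to
above, and assuming the duplicate slashes are still visible in
%{REQUEST_URI} on your server):

# Sketch only: collapse runs of two or more slashes in the request path
# by sending an external 301 redirect to the cleaned-up URL.
RewriteEngine On
# Capture the path either side of the first run of two or more slashes.
RewriteCond %{REQUEST_URI} ^(.*?)//+(.*)$
# Redirect to the collapsed form. If the path contains several separate
# runs of slashes, each redirect collapses one run and the client comes
# back until the path is clean.
RewriteRule .* %1/%2 [R=301,L]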

Phil

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 30 '05
On Mon, 31 Oct 2005, Philip Ronan wrote:
"D. Stussy" wrote:
On Sun, 30 Oct 2005, Philip Ronan wrote:
if even the legitimate spiders can be misdirected then the whole point of
having a robots.txt file goes out the window.


No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED
robots into our web sites. Those robots which behave and respect the
robots.txt file will NEVER fall into these traps.


You seem to have misunderstood the problem. These robots CAN and DO access
pages prohibited by robots.txt files due to the way servers process
consecutive slashes in request URIs.


No - not in a PROPERLY set up system they won't. If one is trapping for a "//"
(any number of slashes greater than one), then the robots (or anyone/anything
else) will never get there.

One can also state that if they do, they are not properly behaved bots.
To the original poster:


Yes, that's me.
That you haven't planned for the malicious to happen shows us that you are
closed-minded. Your reliance ONLY on robots.txt shows this too. Open up
your thinking.


You also seem to have misunderstood the whole point of this thread. I'm not
asking for help here. I'm just warning people about the unreliability of
robots.txt as a means of excluding your pages from search engines.
You should also probably use the "robots" meta-tag on each HTML page.


That was the first thing I did when I noticed there was a problem.
Have you considered using the rewrite engine to trap for "//" in the URI (the
part of the URL after the protocol and domain name is removed)?


You haven't been paying attention, have you?
<http://groups.google.com/group/alt.i...g/9a0f7baad24c74dc?hl=en&>


I'm NOT reading this message from that group. If all the messages in the
thread weren't also crossposted to the group I'm reading this from -
comp.infosystems.www.authoring.html, TFB. Deal with it.
Oct 31 '05 #21
On Mon, 31 Oct 2005, Guy Macon wrote:
Philip Ronan wrote:

"D. Stussy" wrote:
On Sun, 30 Oct 2005, Philip Ronan wrote:

if even the legitimate spiders can be misdirected then the whole point of
having a robots.txt file goes out the window.

No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED
robots into our web sites. Those robots which behave and respect the
robots.txt file will NEVER fall into these traps.
You seem to have misunderstood the problem. These robots CAN and DO access
pages prohibited by robots.txt files due to the way servers process
consecutive slashes in request URIs.


Which means that the operators of the bad robots can put up a few
multiple-slash links so as to lure good robots into those honeypots
and traps, thus discouraging their use. The good news is that all
of the good robots that I know of obey metas as well as robots.txt,
so they can only do that to honypot owners who are under the same
the mistaken belief that Mr. Stussy expressed above - that "Those
robots which behave and respect the robots.txt file will never fall
into these traps."


That's not a mistaken belief. Technically, a double (or more) slash, other
than following a colon when separating a protocol from a domain name and not
counting the query string, is NOT a valid URL construct. Robots which accept
them are misbehaved.
I am still hoping that one of the .htaccess experts will come up
with a way to make all multiple-slash requests 301 redirect to
their single-slash versions.


Trivial. Do it yourself.
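
For what it's worth, one way the 301 redirect being asked for here is often
written, using plain mod_alias rather than the rewrite engine (a sketch only,
not something either poster supplied in the thread):

# Sketch: permanently redirect any path containing a run of two or more
# slashes to the same path with that run collapsed. As with the
# mod_rewrite sketch above, paths with several separate runs are
# cleaned up over successive redirects.
RedirectMatch permanent ^(.*)//+(.*)$ $1/$2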
Oct 31 '05 #22
"D. Stussy" wrote:
On Mon, 31 Oct 2005, Philip Ronan wrote:

You seem to have misunderstood the problem. These robots CAN and DO access
pages prohibited by robots.txt files due to the way servers process
consecutive slashes in request URIs.
No - not in a PROPERLY set up system they won't.


If by "properly set up" you are referring to a system that has been fixed to
redirect or deny requests for URIs containing consecutive slashes, then
that's correct. In fact that's what I've been suggesting all along in this
thread.
If one is trapping for a "//" (any number of slashes greater than one),
then the robots (or anyone/anything else) will never get there.


I don't understand what you mean. If you think the addition of a rule
"Disallow: //" will completely fix the problem then you're mistaken. I've
already explained why.
You haven't been paying attention, have you?
<http://groups.google.com/group/alt.i...g/9a0f7baad24c74dc?hl=en&>


I'm NOT reading this message from that group. If all the messages in the
thread weren't also crossposted to the group I'm reading this from -
comp.infosystems.www.authoring.html, TFB. Deal with it.


Maybe there's something wrong with your newsreader then.
<http://groups.google.com/group/comp.....html/msg/9a0f7baad24c74dc>

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/
Oct 31 '05 #23
Tim
Philip Ronan:
the robots.txt protocol is ineffective on (probably) most servers because
it can be circumvented without your knowledge by a third party.


It always has been, anyway. For numerous reasons. Your multiple slash
example is just one of them. Some robots will ignore them altogether,
others will deliberately look at what you tell them to ignore.

Likewise with Google's advice:
Because this page falls outside your robots.txt file, you may want to
use meta tags to remove this page from our index.


In either case, such restrictions only help reduce the load on your server
from well-meaning robots. If you want to truly restrict access, you need
to use some form of authentication.
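
A bare-bones illustration of that last point in .htaccess terms (the
password file path is hypothetical):

# Sketch: require a valid username/password for everything under this
# directory, regardless of what robots.txt says or ignores.
AuthType Basic
AuthName "Members only"
AuthUserFile /home/example/.htpasswd
Require valid-user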

There were moves to suggest that the robots exclusion standard ought to let
you specify what you allow as well as what you disallow. In some cases it'd
be easier to exclude everything by default, only allowing through what you
want. I don't think that ever took off, though.

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please destroy some files yourself.

Oct 31 '05 #24
On Mon, 31 Oct 2005 11:58:36 +0100, D. Stussy <at******@bde-arc.ampr.org>
wrote:
That's not a mistaken belief. Technically, a double (or more) slash,
other than following a colon when separating a protocol from a domain
name and not counting the query string, is NOT a valid URL construct.
Robots which accept them are misbehaved.


Quoting RFC 1738 (BNF description of a URL):

; HTTP

httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

* means 0 or infinite repetitions, thus hsegment can be empty.
If hsegment can be empty, hpath may contain multiple consecutive slashes.
So it seems you are wrong - multiple slashes in URLs are valid.
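
Spelling that out for the example at the start of the thread (one reading of
the grammar above, with empty hsegments written as ""):

http://apache.org//foundation/ breaks down as

  httpurl  = "http://" hostport "/" hpath
  hostport = "apache.org"
  hpath    = hsegment "/" hsegment "/" hsegment
           = "" "/" "foundation" "/" ""

i.e. hpath is "/foundation/" with an empty first hsegment, so the path part
of the URL is "/" + hpath = "//foundation/", which the grammar permits.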

Best,
Borek
--
http://www.chembuddy.com - chemical calculators for labs and education
BATE - program for pH calculations
CASC - Concentration and Solution Calculator
pH lectures - guide to hand pH calculation with examples
Oct 31 '05 #25
"Tim" wrote:
Philip Ronan:
the robots.txt protocol is ineffective on (probably) most servers because
it can be circumvented without your knowledge by a third party.


It always has been, anyway. For numerous reasons. Your multiple slash
example is just one of them. Some robots will ignore them altogether,
others will deliberately look at what you tell them to ignore.


What you're saying is that it's pointless putting absolute faith in
robots.txt files because they are ignored by some robots. I'm not disputing
that. What I'm saying is that even genuine well-behaved robots like
Googlebot can be made to crawl content prohibited by robots.txt files.

So for example, if you're using a honeypot to block badly behaved robots
from your website automatically, then I can *remove your site from Google*
and probably other search engines simply by publishing a link to your
honeypot directory with an extra slash inserted somewhere. That's why this
issue is important.

I hope you understand now.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 31 '05 #26
On Mon, 31 Oct 2005 10:58:36 GMT, "D. Stussy"
<at******@bde-arc.ampr.org> wrote:
On Mon, 31 Oct 2005, Guy Macon wrote:
Philip Ronan wrote:
>
>"D. Stussy" wrote:
>
>> On Sun, 30 Oct 2005, Philip Ronan wrote:
>>
>>> if even the legitimate spiders can be misdirected then the whole point of
>>> having a robots.txt file goes out the window.
>>
>> No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED
>> robots into our web sites. Those robots which behave and respect the
>> robots.txt file will NEVER fall into these traps.
>
>You seem to have misunderstood the problem. These robots CAN and DO access
>pages prohibited by robots.txt files due to the way servers process
>consecutive slashes in request URIs.


Which means that the operators of the bad robots can put up a few
multiple-slash links so as to lure good robots into those honeypots
and traps, thus discouraging their use. The good news is that all
of the good robots that I know of obey metas as well as robots.txt,
so they can only do that to honeypot owners who are under the same
mistaken belief that Mr. Stussy expressed above - that "Those
robots which behave and respect the robots.txt file will never fall
into these traps."


That's not a mistaken belief. Technically, a double (or more) slash, other
than following a colon when separating a protocol from a domain name and not
counting the query string, is NOT a valid URL construct. Robots which accept
them are misbehaved.
I am still hoping that one of the .htaccess experts will come up
with a way to make all multiple-slash requests 301 redirect to
their single-slash versions.


Trivial. Do it yourself.


Umm, no, I think we'll hand the floor over to you at this point.

BB
--
www.kruse.co.uk/ se*@kruse.demon.co.uk
Elvis does my SEO
Oct 31 '05 #27
On Mon, 31 Oct 2005 12:48:47 +0100, Borek
<bo***@parts.bpp.to.com.remove.pl> wrote:
That's not a mistaken belief. Technically, a double (or more) slash,
other than following a colon when separating a protocol from a domain
name and not counting the query string, is NOT a valid URL construct.
Robots which accept them are misbehaved.
Quoting RFC1738 (BNF description of url):

; HTTP

httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

* means 0 or infinite repetitions, thus hsegment can be empty.


Small correction - 0 _to_ infinite repetitions, not OR infinite
repetitions. But it doesn't change the final conclusion.
If hsegment can be empty, hpath may contain multiple consecutive slashes.
So it seems you are wrong - multiple slashes in URLs are valid.


Best,
Borek
--
http://www.chembuddy.com
http://www.chembuddy.com/?left=BATE&...ion_equilibria
http://www.chembuddy.com/?left=CASC&...ion_calculator

Oct 31 '05 #28
Philip Ronan wrote in message news:BF********************@invalid.invalid...
"D. Stussy" wrote:
On Mon, 31 Oct 2005, Philip Ronan wrote: [...]
You haven't been paying attention, have you?
<http://groups.google.com/group/alt.i...g/9a0f7baad24c74dc?hl=en&>


I'm NOT reading this message from that group. If all the messages in the
thread weren't also crossposted to the group I'm reading this from -
comp.infosystems.www.authoring.html, TFB. Deal with it.


Maybe there's something wrong with your newsreader then.
<http://groups.google.com/group/comp.....html/msg/9a0f7baad24c74dc>


I don't know what is worse than telling someone
"there's something wrong with your newsreader"
and at the same time posting b0rken links.
Oct 31 '05 #29
Philip Ronan wrote:
If by "properly set up" you are referring to a system that has been fixed to
redirect or deny requests for URIs containing consecutive slashes, then
that's correct. In fact that's what I've been suggesting all along in this
thread.


Feel free to set up your server like that. Apache provides a range of
mechanisms for doing so, which you can read all about at apache.org.
It only applies default rules (map to the filesystem) if you haven't
asked it to do otherwise.

--
Nick Kew
Oct 31 '05 #30
