Bytes | Software Development & Data Engineering Community

Warning: robots.txt unreliable in Apache servers


Subject: Warning: robots.txt unreliable in Apache servers
From: Philip Ronan <in*****@invalid.invalid>
Newsgroups: alt.internet.search-engines
Message-ID: <BF89BF33.39FDF%in*****@invalid.invalid>
Date: Sat, 29 Oct 2005 23:07:46 GMT

Hi,

I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.

For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.

Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in their
robots.txt file:

Disallow: /foundation/

But if I posted a link to //////foundation/ somewhere, the search engines
will be quite happy to index it because it isn't covered by this rule.

As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(

You might want to check the behaviour of your servers to see if you're
vulnerable to the same sort of problem.

If anyone's interested, I've put together a .htaccess rule and a PHP script
that seem to sort things out.
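For the sake of illustration, a rule along those lines might look like the following (a hypothetical sketch, not the poster's actual rule; it assumes mod_rewrite is enabled and the file sits at the document root). THE_REQUEST holds the raw request line, so it still shows the repeated slashes even where Apache has already merged them; the per-directory path that mod_rewrite matches has the prefix stripped and the slashes collapsed, so redirecting to it yields the clean URL:

```apache
RewriteEngine On
# If the raw request line contains "//" in the path, issue an external
# 301 redirect to the path as Apache itself has normalized it.
RewriteCond %{THE_REQUEST} \s[^?\s]*//
RewriteRule ^(.*)$ /$1 [R=301,L]
```

With this in place, a request for //////foundation/ should come back as a permanent redirect to /foundation/, so crawlers only ever index the canonical form.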

Phil

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 30 '05 #1
Benjamin Niemann, quoting Philip Ronan, wrote:
I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.

For some reason, the Apache developers decided to treat multiple
consecutive forward slashes in a request URI as a single forward slash. So
for example, <http://apache.org/foundation/> and
<http://apache.org//////foundation/> both resolve to the same page.
I could not find anything about the semantics of empty path segments in http
URLs, but this behaviour seems to be common practice. What about IIS or
other webservers?
Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in
their robots.txt file:

Disallow: /foundation/

But if I posted a link to //////foundation/ somewhere, the search engines
will be quite happy to index it because it isn't covered by this rule.

As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(


I would tend to blame googlebot (and any other affected robot). Unless a
different behaviour ('...foo//bar...' and '...foo/bar...' resolving to
different resources on the server) is common practice, the robot should
normalize such paths (removing empty segments) before matching them against
the rules in the robots.txt file.
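That crawler-side normalization could be sketched like this (a hypothetical illustration; the function names are made up, and real robots.txt processing involves more rules, such as per-user-agent records, than shown here):

```python
import re

def normalize_path(path: str) -> str:
    """Collapse runs of consecutive slashes into one, mirroring what
    Apache does when it maps the URL to a file."""
    return re.sub(r"/{2,}", "/", path)

def is_disallowed(path: str, rules: list[str]) -> bool:
    """robots.txt Disallow matching is plain prefix matching on the
    URL path; normalizing first makes //////foundation/ fall under
    the /foundation/ rule."""
    path = normalize_path(path)
    return any(path.startswith(rule) for rule in rules)

rules = ["/foundation/"]
# Without normalization the prefix test fails, because the literal
# path "//////foundation/" does not start with "/foundation/".
print(is_disallowed("//////foundation/", rules))  # True
```

A robot that normalizes this way treats all the slash-padded spellings of a URL as one resource, which is exactly the behaviour the original post found missing.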

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
Oct 30 '05 #2
Philip Ronan wrote:
For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.


Yep. If you apply filesystem semantics to that, you have a whopping
great security hole. Of course you could just return "bad request",
but that just transfers the risk leaving server admins to shoot
their own feet.

There was a story in The Register a couple of weeks ago about someone
who got a criminal conviction (for attempted unauthorized access)
after he requested a URL like that and it triggered an intrusion
detection alarm.

If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.

--
Nick Kew
Oct 30 '05 #3
"Nick Kew" wrote:
If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.


But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution. I realize not all
robots bother with the robots.txt protocol, but if even the legitimate
spiders can be misdirected then the whole point of having a robots.txt file
goes out the window.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 30 '05 #4
Philip Ronan wrote:

[please don't crosspost without warning. Or with inadequate context]
"Nick Kew" wrote:

If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.

But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution.


Who said anything about that? What's impractical about "Disallow //" ?

--
Nick Kew
Oct 30 '05 #5
Sun, 30 Oct 2005 09:34:36 +0000 from Nick Kew
<ni**@asgard.webthing.com>:
If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.


Wouldn't it be more effective to have any URL containing http://.*//
return a 403 Forbidden or a 404 Not Found? This could be done in
.htaccess or perhaps httpd.conf. I may be having a failure of
imagination, but I can't think of any legitimate reason for such a
link.
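Such a blanket refusal might be sketched as follows (hypothetical, assuming mod_rewrite; a <LocationMatch> block with "Deny from all" would be another route). The [F] flag makes Apache answer 403 Forbidden, and THE_REQUEST is used because it preserves the raw, unmerged request path:

```apache
RewriteEngine On
# Refuse any request whose raw request line contains "//" in the path.
RewriteCond %{THE_REQUEST} \s[^?\s]*//
RewriteRule ^ - [F]
```

This is stricter than redirecting: a crawler that fetches //////foundation/ simply gets 403 and has nothing to index.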

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:
http://diveintomark.org/archives/200..._wont_help_you
Oct 30 '05 #6
"Nick Kew" wrote:
[please don't crosspost without warning. Or with inadequate context]
My original post was copied over to ciwah, so now there are two threads with
the same subject. I'm trying to tie them together, mkay?
Philip Ronan wrote:

But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution.
Who said anything about that?


You did, in your earlier post:
If you have links to things like
"////" and dumb robots, put the paths in your robots.txt.
What's impractical about "Disallow //" ?


It's a partial solution. If you're trying to protect content at deeper
levels in the hierarchy, you will also need:

Disallow: /path//to/file
Disallow: /path/to//file
Disallow: /path//to//file
Disallow: /path///to/file
etc..

As I said, robots.txt is inadequate for this purpose because it doesn't
support pattern matching.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/
Oct 30 '05 #7
In comp.infosystems.www.authoring.html, "Stan Brown" wrote:
Wouldn't it be more effective to have any URL containing http://.*//
return a 403 Forbidden or a 404 Not Found? This could be done in
.htaccess or perhaps httpd.conf. I may be having a failure of
imagination, but I can't think of any legitimate reason for such a
link.


That would also be effective, but maybe it's better to do something useful
with the URL if you can.

Most servers will redirect to a URL with a trailing slash when the name of a
directory is requested. Why not treat multiple slashes in a similar way?

Besides, it might help in terms of page rank.

[[Crossposted to alt.internet.search-engines, with apologies to Nick]]

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 30 '05 #8
On Sun, 30 Oct 2005 11:15:03 +0000, Philip Ronan
<in*****@invalid.invalid> wrote:
"Nick Kew" wrote:
If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.


But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution. I realize not all
robots bother with the robots.txt protocol, but if even the legitimate
spiders can be misdirected then the whole point of having a robots.txt file
goes out the window.


A simple solution would be to add the robots meta tag to all pages you
don't want indexed, as a backup for when someone links with //. Kind
of defeats the whole point of using a robots.txt file, but what else
can you do?
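For reference, the tag in question goes in the <head> of each page and looks like this:

```html
<!-- Per-page backup for when a robots.txt prefix rule is bypassed:
     well-behaved crawlers honour this however the URL was spelled,
     because the directive travels with the page itself. -->
<meta name="robots" content="noindex, nofollow">
```

Unlike a robots.txt rule, this is immune to the multiple-slash trick, since the crawler sees it only after fetching the page, by which point the path spelling no longer matters.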

David
--
Free Search Engine Optimization Tutorial
http://www.seo-gold.com/tutorial/
Oct 30 '05 #9
David E. Ross, quoting Philip Ronan, wrote:
I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.
[...]
But if I posted a link to //////foundation/ somewhere, the search engines
will be quite happy to index it because it isn't covered by this rule.

As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(

I thought that parsing and processing a robots.txt file is the
responsibility of the bot and not the Web server. All the Web
server has to do is deliver the robots.txt file to the bot.

If that is true, the problem lies within Google and not Apache.

--

David E. Ross
<URL:http://www.rossde.com/>

I use Mozilla as my Web browser because I want a browser that
complies with Web standards. See <URL:http://www.mozilla.org/>.
Oct 30 '05 #10