Subject: Warning: robots.txt unreliable in Apache servers
From: Philip Ronan <in*****@invalid.invalid>
Newsgroups: alt.internet.search-engines
Message-ID: <BF89BF33.39FDF%in*****@invalid.invalid>
Date: Sat, 29 Oct 2005 23:07:46 GMT
Hi,
I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.
For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.
Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in their
robots.txt file:
Disallow: /foundation/
But if I posted a link to //////foundation/ somewhere, the search engines
would be quite happy to index it, because that path doesn't begin with
"/foundation/" and so isn't covered by this rule.
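To make the mismatch concrete, here is a minimal Python sketch. It assumes a robot that applies Disallow rules as plain path prefixes, and it models the server's slash handling with a regular expression. The function names (is_disallowed, collapse_slashes) are made up for illustration and don't come from any real crawler or from Apache.

```python
import re


def is_disallowed(path, disallow_rules):
    """Simplified robots.txt check (an assumption, not a real crawler's code):
    a path is blocked if it starts with any non-empty Disallow rule."""
    return any(path.startswith(rule) for rule in disallow_rules if rule)


def collapse_slashes(path):
    """Model of the server behaviour described above:
    a run of consecutive slashes is treated as a single slash."""
    return re.sub(r"/{2,}", "/", path)


rules = ["/foundation/"]
plain = "/foundation/"
padded = "//////foundation/"

print(is_disallowed(plain, rules))        # True: the rule covers this path
print(is_disallowed(padded, rules))       # False: the prefix doesn't match
print(collapse_slashes(padded) == plain)  # True: the server serves the same page
```

So the robot and the server disagree about whether the two paths are the same thing, and that disagreement is the whole problem.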
As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(
You might want to check the behaviour of your servers to see if you're
vulnerable to the same sort of problem.
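One rough way to run that check is sketched below in Python, using only the standard library. The helper names (with_extra_slashes, fetch, serves_padded_url) are hypothetical, and the test is only a heuristic: it treats a server as affected if the padded URL comes back with status 200 and without being redirected somewhere else.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def with_extra_slashes(url, count=6):
    """Return the same URL with the leading slash of its path repeated."""
    parts = urlsplit(url)
    padded_path = "/" * count + parts.path.lstrip("/")
    query = "?" + parts.query if parts.query else ""
    return f"{parts.scheme}://{parts.netloc}{padded_path}{query}"


def fetch(url, timeout=10):
    """Return (status code, final URL after any redirects)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.geturl()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; report their status code.
        return err.code, url


def serves_padded_url(url):
    """Heuristic: True if the padded URL is served directly (200, no redirect)."""
    padded = with_extra_slashes(url)
    status, final_url = fetch(padded)
    return status == 200 and final_url == padded


# Example (makes a real network request, so it is left commented out):
# print(serves_padded_url("http://apache.org/foundation/"))
```

A True result only shows that the padded URL is reachable. Whether a search engine would actually index it still depends on someone linking to it.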
If anyone's interested, I've put together a .htaccess rule and a PHP script
that seem to sort things out.
Phil
--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/