Subject: Warning: robots.txt unreliable in Apache servers
From: Philip Ronan <in*****@invalid.invalid>
Newsgroups: alt.internet.search-engines
Message-ID: <BF89BF33.39FDF%in*****@invalid.invalid>
Date: Sat, 29 Oct 2005 23:07:46 GMT
Hi,
I recently discovered that robots.txt files can't be relied on to keep
robots out on Apache servers.
For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.
Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in their
robots.txt file:
    Disallow: /foundation/
But if I post a link to //////foundation/ somewhere, the search engines will
quite happily index it, because that path doesn't start with /foundation/ and
so isn't covered by the rule.
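To make the mismatch concrete, here is a minimal Python sketch of plain prefix matching, which is how Disallow rules are conventionally applied. The function names are made up for illustration, and real crawlers may do more than this (some may normalise paths themselves), so treat it as a model of the problem rather than a description of any particular robot.

```python
import re


def is_disallowed(path, disallow_rules):
    """Return True if the path starts with any non-empty Disallow prefix."""
    return any(rule and path.startswith(rule) for rule in disallow_rules)


def collapse_slashes(path):
    """Collapse runs of '/' into one, roughly what the server does here."""
    return re.sub(r"/{2,}", "/", path)


rules = ["/foundation/"]

# The robot compares the path exactly as it appears in the link.
print(is_disallowed("/foundation/", rules))       # True
print(is_disallowed("//////foundation/", rules))  # False

# The server, meanwhile, treats both paths as the same page.
print(is_disallowed(collapse_slashes("//////foundation/"), rules))  # True
```

The robot and the server disagree about whether the two paths are the same resource, and the rule only helps when they agree.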
As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(
You might want to check the behaviour of your servers to see if you're
vulnerable to the same sort of problem.
If anyone's interested, I've put together a .htaccess rule and a PHP script
that seem to sort things out.
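One possible approach, sketched in Python and not to be confused with the .htaccess rule and PHP script mentioned above (which aren't reproduced here), is to answer any request whose path contains repeated slashes with a permanent redirect to the collapsed path. The idea is that a well-behaved robot following the redirect ends up at a URL the Disallow rule does cover. The names `canonical_path` and `redirect_repeated_slashes` are hypothetical, and the sketch assumes the application runs behind a WSGI server that passes the path through unmodified.

```python
import re


def canonical_path(path):
    """Collapse runs of '/' in a request path into a single '/'."""
    return re.sub(r"/{2,}", "/", path)


def redirect_repeated_slashes(app):
    """Wrap a WSGI app so paths with repeated slashes get a 301 redirect."""
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        fixed = canonical_path(path)
        if fixed != path:
            # Preserve the query string, if any, on the redirect target.
            query = environ.get("QUERY_STRING", "")
            location = fixed + ("?" + query if query else "")
            start_response(
                "301 Moved Permanently",
                [("Location", location), ("Content-Length", "0")],
            )
            return [b""]
        # Path is already canonical: hand the request to the wrapped app.
        return app(environ, start_response)
    return middleware
```

Wrapping an existing application would then look like `application = redirect_repeated_slashes(application)`.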
Phil
--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/