Bytes | Software Development & Data Engineering Community

Warning: robots.txt unreliable in Apache servers


Subject: Warning: robots.txt unreliable in Apache servers
From: Philip Ronan <in*****@invalid.invalid>
Newsgroups: alt.internet.search-engines
Message-ID: <BF89BF33.39FDF%in*****@invalid.invalid>
Date: Sat, 29 Oct 2005 23:07:46 GMT

Hi,

I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.

For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.

Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in their
robots.txt file:

Disallow: /foundation/

But if I posted a link to //////foundation/ somewhere, the search engines
will be quite happy to index it because it isn't covered by this rule.

As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(

You might want to check the behaviour of your servers to see if you're
vulnerable to the same sort of problem.

If anyone's interested, I've put together a .htaccess rule and a PHP script
that seem to sort things out.
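Phil's actual .htaccess rule isn't reproduced in the thread, but a minimal mod_rewrite sketch along the lines he describes (the rule and status code here are my assumptions, not his) would collapse runs of slashes with an external redirect:

```apache
RewriteEngine On

# If the requested URI contains a doubled slash anywhere, redirect to
# the same URI with that pair collapsed.  Because the redirect is
# external, the rule fires again on the new request until no "//" remains.
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . %1/%2 [R=301,L]
```

An alternative, also raised later in the thread, is simply to deny such requests (e.g. `RewriteRule .* - [R=404,L]` behind the same condition) rather than redirect them.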

Phil

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 30 '05
Borek wrote:
On Sun, 30 Oct 2005 21:45:32 +0100, Dave0x1 <as*@example.com> wrote:

It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?

<snip>
All of these generated 404s in the last few weeks on my site.

No additional slashes inside the URL, although several times
they were added at the end.

& vs &amp; and wrong capitalization (bate, casc instead of BATE, CASC)
are the most prominent sources of errors. But it seems every error is possible
:)


Sorry, I should've been more clear. I wanted to know whether anyone
could point to an actual URL (e.g., a search query) demonstrating that
URLs with multiple adjacent forward slashes are actually being indexed
by any of the major search engines. I haven't seen one.

However, I don't think that the original poster was concerned with
whether these multiple slashed URLs appear in the index as such, so it's
probably not terribly important.
Dave
Nov 2 '05 #41
Guy Macon wrote:
Dave0x1 wrote:

It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results.

If there exists a way for someone else on the Internet to override
your spidering decisions as defined in robots.txt, there will be
those who use that ability to inconvenience, harass or hurt others.


A robots.txt file doesn't make any decisions about which parts of a site
are indexed; it merely offers suggestions.

Dave
Nov 2 '05 #42
Philip Ronan wrote:
"Dave0x1" wrote:

I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.

OK, so as long as the robots.txt documentation includes a note saying that
you have to patch your server software to get reliable results, then we'll
all be fine.


I wouldn't consider patching of the Apache source code either necessary
or desirable in this situation.
It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?

Which bit didn't I explain properly? I'm not going to post a link for you to
check, but here's the response I got from Google on the issue:

Thank you for your note. We apologize for our delayed response.
We understand you're concerned about the inclusion of
http://###.####.###//contact/ in our index.


Does the URL in question appear in the index as
<http://###.####.###//contact/>, or as <http://###.####.###/contact/>?
My assumption is the latter.

Dave

Nov 2 '05 #43
On Wed, 02 Nov 2005 17:45:05 -0500, Dave0x01 <as*@example.com> wrote:
Guy Macon wrote:
Dave0x1 wrote:

It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results.

If there exists a way for someone else on the Internet to override
your spidering decisions as defined in robots.txt, there will be
those who use that ability to inconvenience, harass or hurt others.


A robots.txt file doesn't make any decisions about which parts of a site
are indexed; it merely offers suggestions.

Dave


Which is a good way of putting it.

BB
--
www.kruse.co.uk / se*@kruse.demon.co.uk
Elvis does my SEO
Nov 3 '05 #44

Dave0x01 wrote:

Guy Macon wrote:
Dave0x1 wrote:
It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results.


If there exists a way for someone else on the Internet to override
your spidering decisions as defined in robots.txt, there will be
those who use that ability to inconvenience, harass or hurt others.


A robots.txt file doesn't make any decisions about which parts of a site
are indexed; it merely offers suggestions.


A robots.txt file most certainly does decide which parts of a site
are indexed - by good robots. It offers suggestions that every good
robot obeys. The effect we are discussing allows someone else on the
Internet to override your good-robot spidering decisions as defined
in robots.txt.
Nov 3 '05 #45
"Dave0x01" wrote:
Philip Ronan wrote:
"Dave0x1" wrote:
I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.
OK, so as long as the robots.txt documentation includes a note saying that
you have to patch your server software to get reliable results, then we'll
all be fine.


I wouldn't consider patching of the Apache source code either necessary
or desirable in this situation.


I was being sarcastic. (You're American, right?)
Does the URL in question appear in the index as
<http://###.####.###//contact/>, or as <http://###.####.###/contact/>?
My assumption is the latter.


Then what the hell do you think this thread is all about??

For all you doubting Thomases out there:

Exhibit A: http://freespace.virgin.net/phil.ron...bad-google.png

Exhibit B: http://www.japanesetranslator.co.uk/robots.txt
(Last-Modified: Tue, 01 Mar 2005 08:45:29 GMT)

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Nov 3 '05 #46
Philip Ronan <in*****@invali d.invalid> writes:
"Nick Kew" wrote:
[please don't crosspost without warning. Or with inadequate context]


My original post was copied over to ciwah, so now there are two threads with
the same subject. I'm trying to tie them together, mkay?
Philip Ronan wrote:

But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution.


Who said anything about that?


You did, in your earlier post:
If you have links to things like
"////" and dumb robots, put the paths in your robots.txt.
What's impractical about "Disallow //" ?


It's a partial solution. If you're trying to protect content at deeper
levels in the hierarchy, you will also need:

Disallow: /path//to/file
Disallow: /path/to//file
Disallow: /path//to//file
Disallow: /path///to/file
etc..
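That "etc.." hides a combinatorial blow-up: the number of slash variants grows exponentially with path depth. A short illustrative sketch (`slash_variants` is a hypothetical helper, not anything from the thread):

```python
from itertools import product

def slash_variants(path: str, max_repeat: int = 2) -> list[str]:
    """Enumerate every way of writing 'path' with each slash repeated
    up to max_repeat times, excluding the original single-slash form."""
    parts = path.lstrip("/").split("/")
    variants = []
    for reps in product(range(1, max_repeat + 1), repeat=len(parts)):
        if set(reps) == {1}:
            continue  # skip the canonical path itself
        variants.append("".join("/" * r + p for r, p in zip(reps, parts)))
    return variants

# Three slash positions, each single or doubled: 2**3 - 1 = 7 extra rules
# beyond the original -- and that's before allowing triple slashes.
for v in slash_variants("/path/to/file"):
    print("Disallow:", v)
```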


You do not want that kind of specificity in your robots.txt file.

Hostile robots will use robots.txt as a menu of "protected" pages to
crawl.

Here's what you want:

Disallow: /unlisted
Disallow: //

Then keep your unlisted contact info under /unlisted/contact/
or better, under /unlisted/86ghb3qx/

This is what I do for ourdoings.com family photo sites that aren't
intended for public view. Additionally I use the meta tags google
recommends. There's still the possibility of someone creating an
external link to the site, but having "unlisted" in the URL advises
people that although they can share it they shouldn't. If someone
creates such a link anyway, good search engines won't follow it.

Nov 3 '05 #47
"Bruce Lewis" wrote:
You do not want that kind of specificity in your robots.txt file.

Hostile robots will use robots.txt as a menu of "protected" pages to
crawl.

Here's what you want:

Disallow: /unlisted
Disallow: //


Yeah, that'll work. I wasn't actually *recommending* putting every
conceivable combination of slashes into the robots.txt file, I was just
trying to point out that "Disallow: //" on its own is inadequate.

As long as you're aware of the problem and doing something about it, then
that's fine.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Nov 3 '05 #48
On Mon, 31 Oct 2005, Philip Ronan wrote:
"D. Stussy" wrote:
On Mon, 31 Oct 2005, Philip Ronan wrote:

You seem to have misunderstood the problem. These robots CAN and DO access
pages prohibited by robots.txt files due to the way servers process
consecutive slashes in request URIs.


No - not in a PROPERLY set up system they won't.


If by "properly set up" you are referring to a system that has been fixed to
redirect or deny requests for URIs containing consecutive slashes, then
that's correct. In fact that's what I've been suggesting all along in this
thread.
If one is trapping for a "//" (any number of slashes greater than one),
then the robots (or anyone/anything else) will never get there.


I don't understand what you mean. If you think the addition of a rule
"Disallow: //" will completely fix the problem then you're mistaken. I've
already explained why.


Nowhere did I say that trapping "//" would be a robots.txt rule. It should be
a rewrite-engine rule (at least for Apache).
You haven't been paying attention, have you?
<http://groups.google.com/group/alt.i...g/9a0f7baad24c74dc?hl=en&>


I'm NOT reading this message from that group. If all the messages in the
thread weren't also crossposted to the group I'm reading this from -
comp.infosystems.www.authoring.html, TFB. Deal with it.


Maybe there's something wrong with your newsreader then.
<http://groups.google.com/group/comp.....html/msg/9a0f7baad24c74dc>


Like you expect me to read EVERY message. Also, remember that NNTP is a flood
technology - and not all servers get flooded at the same time. Nothing need be
wrong here for one to miss a message that hasn't yet arrived.
Nov 7 '05 #49
On Mon, 31 Oct 2005, Borek wrote:
On Mon, 31 Oct 2005 11:58:36 +0100, D. Stussy <at******@bde-arc.ampr.org>
wrote:
That's not a mistaken belief. Technically, a double (or more) slash, other
than following a colon when separating a protocol from a domain name and not
counting the query string, is NOT a valid URL construct. Robots which accept
them are misbehaved.


Quoting RFC1738 (BNF description of url):

; HTTP

httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

* means 0 or more repetitions, thus hsegment can be empty.
If hsegment can be empty, hpath may contain multiple consecutive slashes.
So it seems you are wrong - multiple slashes in URLs are valid.


However, this is usually further restricted by filesystem naming
conventions, and that's where it's not proper.
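Borek's reading of the grammar can be checked mechanically. Below is a rough regex translation of the hpath production (the uchar class is simplified here; RFC 1738's full definition also admits escape sequences and a few more punctuation characters):

```python
import re

# Simplified translation of RFC 1738's http-URL path grammar.
hsegment = r"[A-Za-z0-9._~;:@&=-]*"   # *[ uchar | ";" | ":" | "@" | "&" | "=" ]
hpath = re.compile(rf"{hsegment}(?:/{hsegment})*")  # hsegment *[ "/" hsegment ]

def is_valid_hpath(path: str) -> bool:
    """True if 'path' matches the (simplified) hpath production."""
    return hpath.fullmatch(path) is not None

# Because hsegment may be empty, consecutive slashes are grammatical:
print(is_valid_hpath("foundation/"))       # True
print(is_valid_hpath("/////foundation/"))  # True (empty segments between slashes)
```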
Nov 7 '05 #50
