Bytes IT Community

Warning: robots.txt unreliable in Apache servers


Subject: Warning: robots.txt unreliable in Apache servers
From: Philip Ronan <in*****@invalid.invalid>
Newsgroups: alt.internet.search-engines
Message-ID: <BF89BF33.39FDF%in*****@invalid.invalid>
Date: Sat, 29 Oct 2005 23:07:46 GMT

Hi,

I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.

For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.

Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in their
robots.txt file:

Disallow: /foundation/

But if I posted a link to //////foundation/ somewhere, the search engines
will be quite happy to index it because it isn't covered by this rule.

As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(

You might want to check the behaviour of your servers to see if you're
vulnerable to the same sort of problem.

If anyone's interested, I've put together a .htaccess rule and a PHP script
that seem to sort things out.

Phil

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 30 '05 #1
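[What Phil describes is easy to reproduce with a short simulation of the literal prefix matching that robots.txt relies on. This Python fragment is only an illustrative sketch, not the .htaccess rule or PHP script he mentions:]

```python
def is_disallowed(path: str, rules: list[str]) -> bool:
    """Classic robots.txt semantics: a URL path is blocked if it
    literally starts with any Disallow prefix."""
    return any(path.startswith(rule) for rule in rules)

rules = ["/foundation/"]

# The canonical URL is covered by the rule...
print(is_disallowed("/foundation/about.html", rules))       # True
# ...but the doubled-slash spelling, which Apache serves identically,
# slips straight past the prefix comparison.
print(is_disallowed("//////foundation/about.html", rules))  # False
```

A robot obeying these semantics will therefore happily fetch the second URL even though the server returns the very page the rule was meant to protect.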
56 Replies


Philip Ronan wrote:
I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.

For some reason, the Apache developers decided to treat multiple
consecutive forward slashes in a request URI as a single forward slash. So
for example, <http://apache.org/foundation/> and
<http://apache.org//////foundation/> both resolve to the same page.
I could not find anything about the semantics of empty path segments in http
URLs, but this behaviour seems to be common practice. What about IIS or
other web servers?
Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in
their robots.txt file:

Disallow: /foundation/

But if I posted a link to //////foundation/ somewhere, the search engines
will be quite happy to index it because it isn't covered by this rule.

As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(


I would tend to blame googlebot (and any other affected robot). Unless a
different behaviour ('...foo//bar...' and '...foo/bar...' resolving to
different resources on the server) is common practice, the robot should
normalize such paths (removing empty segments) before matching them against
the rules from the robots.txt file.

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
Oct 30 '05 #2
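[Benjamin's suggested fix on the robot side amounts to collapsing empty path segments before matching. A minimal sketch of that normalization, hypothetical rather than any real crawler's code:]

```python
import re

def normalize(path: str) -> str:
    # Collapse every run of two or more slashes into a single slash,
    # i.e. remove the empty path segments before rule matching.
    return re.sub(r"/{2,}", "/", path)

print(normalize("//////foundation/"))   # /foundation/
print(normalize("/path//to///file"))    # /path/to/file
```

After normalization, a rule like "Disallow: /foundation/" covers the doubled-slash spellings as well.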

Philip Ronan wrote:
For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.


Yep. If you apply filesystem semantics to that, you have a whopping
great security hole. Of course you could just return "bad request",
but that just transfers the risk, leaving server admins to shoot
themselves in the foot.

There was a story in The Register a couple of weeks ago about someone
who got a criminal conviction (for attempted unauthorized access)
after he requested a URL like that and it triggered an intrusion
detection alarm.

If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.

--
Nick Kew
Oct 30 '05 #3

"Nick Kew" wrote:
If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.


But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution. I realize not all
robots bother with the robots.txt protocol, but if even the legitimate
spiders can be misdirected then the whole point of having a robots.txt file
goes out the window.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 30 '05 #4

Philip Ronan wrote:

[please don't crosspost without warning. Or with inadequate context]
"Nick Kew" wrote:

If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.

But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution.


Who said anything about that? What's impractical about "Disallow: //"?

--
Nick Kew
Oct 30 '05 #5

Sun, 30 Oct 2005 09:34:36 +0000 from Nick Kew
<ni**@asgard.webthing.com>:
If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.


Wouldn't it be more effective to have any URL containing http://.*//
return a 403 Forbidden or a 404 Not Found? This could be done in
.htaccess or perhaps httpd.conf. I may be having a failure of
imagination, but I can't think of any legitimate reason for such a
link.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You:
http://diveintomark.org/archives/200..._wont_help_you
Oct 30 '05 #6

"Nick Kew" wrote:
[please don't crosspost without warning. Or with inadequate context]
My original post was copied over to ciwah, so now there are two threads with
the same subject. I'm trying to tie them together, mkay?
Philip Ronan wrote:

But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution.
Who said anything about that?


You did, in your earlier post:
If you have links to things like
"////" and dumb robots, put the paths in your robots.txt.
What's impractical about "Disallow: //"?


It's a partial solution. If you're trying to protect content at deeper
levels in the hierarchy, you will also need:

Disallow: /path//to/file
Disallow: /path/to//file
Disallow: /path//to//file
Disallow: /path///to/file
etc.

As I said, robots.txt is inadequate for this purpose because it doesn't
support pattern matching.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/
Oct 30 '05 #7
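[Phil's combinatorial point can be made concrete. Even restricting each slash to appear at most twice, a path with k slashes needs 2^k Disallow variants; allow arbitrary repetition and the list is unbounded. A hypothetical illustration:]

```python
from itertools import product

segments = ["path", "to", "file"]

# Every spelling of /path/to/file in which each of the three slashes
# appears either once or twice.
variants = {
    "".join("/" * n + seg for n, seg in zip(counts, segments))
    for counts in product((1, 2), repeat=len(segments))
}

print(len(variants))                  # 8 = 2**3 rules needed
print("/path//to/file" in variants)   # True
```

And that is still the truncated list: spellings with three or more consecutive slashes would each need rules of their own, which is why enumeration in robots.txt cannot work.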

In comp.infosystems.www.authoring.html, "Stan Brown" wrote:
Wouldn't it be more effective to have any URL containing http://.*//
return a 403 Forbidden or a 404 Not Found? This could be done in
.htaccess or perhaps httpd.conf. I may be having a failure of
imagination, but I can't think of any legitimate reason for such a
link.


That would also be effective, but maybe it's better to do something useful
with the URL if you can.

Most servers will redirect to a URL with a trailing slash when the name of a
directory is requested. Why not treat multiple slashes in a similar way?

Besides, it might help in terms of page rank.

[[Crossposted to alt.internet.search-engines, with apologies to Nick]]

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 30 '05 #8

On Sun, 30 Oct 2005 11:15:03 +0000, Philip Ronan
<in*****@invalid.invalid> wrote:
"Nick Kew" wrote:
If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.


But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution. I realize not all
robots bother with the robots.txt protocol, but if even the legitimate
spiders can be misdirected then the whole point of having a robots.txt file
goes out the window.


A simple solution would be to add the robots meta tag to all pages you
don't want indexed, as a backup for when someone links with //. Kind
of defeats the whole point of using a robots.txt file, but what else
can you do?

David
--
Free Search Engine Optimization Tutorial
http://www.seo-gold.com/tutorial/
Oct 30 '05 #9
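[For reference, the backup David describes is the standard robots meta element, placed once per page:]

```html
<!-- In each page's <head>; honoured by well-behaved crawlers no matter
     how many slashes the linking URL contained. -->
<meta name="robots" content="noindex, nofollow">
```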

Philip Ronan wrote:

I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.
[rest of the original post snipped; see post #1]


I thought that parsing and processing a robots.txt file is the
responsibility of the bot and not the Web server. All the Web
server has to do is deliver the robots.txt file to the bot.

If that is true, the problem lies within Google and not Apache.

--

David E. Ross
<URL:http://www.rossde.com/>

I use Mozilla as my Web browser because I want a browser that
complies with Web standards. See <URL:http://www.mozilla.org/>.
Oct 30 '05 #10



David Ross wrote:

Philip Ronan wrote:

I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.
[rest of the original post snipped; see post #1]


I thought that parsing and processing a robots.txt file is the
responsibility of the bot and not the Web server. All the Web
server has to do is deliver the robots.txt file to the bot.

If that is true, the problem lies within Google and not Apache.


I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?

--
Guy Macon <http://www.guymacon.com/>

Oct 30 '05 #11

Guy Macon <http://www.guymacon.com/> wrote:
I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?

Don't know, but it seems to be the case on Unix/Linux filesystems too.

If I 'cd //////usr////////////local////apache2' I end up
in /usr/local/apache2

The web servers are probably mimicking this behaviour.
--
Brian Wakem
Email: http://homepage.ntlworld.com/b.wakem/myemail.png
Oct 30 '05 #12
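[Brian's shell experiment can be reproduced with Python's POSIX path handling, which follows the same convention:]

```python
import posixpath

# Runs of slashes collapse, just as in Brian's `cd` example.
print(posixpath.normpath("//////usr////////////local////apache2"))
# -> /usr/local/apache2

# One POSIX wrinkle: a path beginning with *exactly two* slashes is
# implementation-defined, so normpath deliberately preserves it.
print(posixpath.normpath("//usr/local"))
# -> //usr/local
```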

Guy Macon wrote:

I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?

You are referring to which specs?
This behavior for resolving paths comes from Unix, where the system
treats repeated slashes as one. It is simply applied to URLs as well. There
may even be a requirement in the POSIX specification about paths.

--
jmm (hyphen) list (at) sohnen-moe (dot) com
(Remove .AXSPAMGN for email)
Oct 30 '05 #13

Guy Macon wrote:

I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?


Hint: Read the documentation offered at either of the first two URLs.

I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.

It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?

Dave

Oct 30 '05 #14

"Dave0x1" wrote:
I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.
OK, so as long as the robots.txt documentation includes a note saying that
you have to patch your server software to get reliable results, then we'll
all be fine.
It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?


Which bit didn't I explain properly? I'm not going to post a link for you to
check, but here's the response I got from Google on the issue:
Thank you for your note. We apologize for our delayed response.
We understand you're concerned about the inclusion of
http://###.####.###//contact/ in our index.

It's important to note that we visited the live page in question
and found that it currently exists on the web as listed above.
Because this page falls outside your robots.txt file, you may
want to use meta tags to remove this page from our index. For
more information about using meta tags, please visit
http://www.google.com/remove.html

[remainder snipped]


I didn't publish the link to //contact/, someone else did. So that means the
robots.txt protocol is ineffective on (probably) most servers because it can
be circumvented without your knowledge by a third party.

Hope that's all clear now.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/
Oct 30 '05 #15

On Sun, 30 Oct 2005 21:45:32 +0100, Dave0x1 <as*@example.com> wrote:
It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?


/%3Fleft%3DpH-calculation%26right%3Dtoc&hl=pt-BR&lr=lang_pt&sa=G
/?left=BATE&amp%3Bright=phcalculation
/?left=BATE&amp;amp;right=dissociation_constants
/?left=BATE&right=basic_acid_titr
/?left=BATE&right=basic_acid_titration_equilbria
/?left=BATE&right=basic_acid_titration_equilibri
/?left=BATE&right=basic_acid_titration_equilibria"> pH
/?left=BATE&right=basic_acid_titration_equilibria%2 2%3EpH
/?left=BATE&right=basic_acid_titration_equilibria/////////////////////////////////////////////////////
/?left=BATE&right=dissociation_constants]</td></tr><tr>
/?left=casc&amp/
/?left=casc&amp;right=download
/?left=faq/
/?left=dave-is-great
/?left=BATE&right=basic_acid_titration_equilibria/
/index.php[left]BATE[right]overview[SiteID]simtel.net
/pHlecimg/3-f.png
/pHlecimg/3-g.png
/?left=pH-calculation
/?left=casc&right=concentration_and_solution_calcul ator
/?left=casc&right=density_tables
/files/CASCInstall.ziphttp:/www.chembuddy.com/files/CASCInstall.exe
/?left=bate&right=dissociation_constants
/?left=bate&right=download
/?left=bate&right=screenshots
/this_is_a_test_of_404_response
/?left=CASC&amp;right=buy
/?left=CASC&right=concentration_and_solution_calcul ator://
/?left=CASC&amp;right=density_tables
/?left=BATE&right=right=basic_acid_titration_equili bria

All of these generated 404 in last few weeks on my site.

No additional slashes inside the URL, although several times
they were added at the end.

& vs &amp; and wrong capitalization (bate, casc instead of BATE, CASC)
are the most prominent sources of errors. But it seems every error is
possible :)

Best,
Borek
--
http://www.chembuddy.com - chemical calculators for labs and education
BATE - program for pH calculations
CASC - Concentration and Solution Calculator
pH lectures - guide to hand pH calculation with examples
Oct 30 '05 #16

On Sun, 30 Oct 2005, Philip Ronan wrote:
"Nick Kew" wrote:
If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.


But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution. I realize not all
robots bother with the robots.txt protocol, but if even the legitimate
spiders can be misdirected then the whole point of having a robots.txt file
goes out the window.


No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED robots
into our web sites. Those robots which behave and respect the robots.txt file
will NEVER fall into these traps. Those that don't get their IPs auto-added
into a deny-access map file.

On my system, the deny-access file is also shared with the mail server to deny
mail from those abusive systems.

To the original poster:

Just because you haven't planned for the malicious to happen shows us that you
are closed minded. Your reliance ONLY on robots.txt shows this also. Open up
your thinking.

You should also probably use the "robots" meta-tag on each HTML page.

Have you considered using the rewrite engine to trap for "//" in the URI (the
part of the URL after the protocol and domain name is removed)?
Oct 30 '05 #17



Dave0x1 wrote:
It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results.


If there exists a way for someone else on the Internet to override
your spidering decisions as defined in robots.txt, there will be
those who use that ability to inconvenience, harass or hurt others.

Oct 31 '05 #18

"D. Stussy" wrote:
On Sun, 30 Oct 2005, Philip Ronan wrote:
if even the legitimate spiders can be misdirected then the whole point of
having a robots.txt file goes out the window.
No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED
robots into our web sites. Those robots which behave and respect the
robots.txt file will NEVER fall into these traps.


You seem to have misunderstood the problem. These robots CAN and DO access
pages prohibited by robots.txt files due to the way servers process
consecutive slashes in request URIs.
To the original poster:
Yes, that's me.
Just because you haven't planned for the malicious to happen shows us that you
are closed minded. Your reliance ONLY on robots.txt shows this also. Open up
your thinking.
You also seem to have misunderstood the whole point of this thread. I'm not
asking for help here. I'm just warning people about the unreliability of
robots.txt as a means of excluding your pages from search engines.
You should also probably use the "robots" meta-tag on each HTML page.
That was the first thing I did when I noticed there was a problem.
Have you considered using the rewrite engine to trap for "//" in the URI (the
part of the URL after the protocol and domain name is removed)?


You haven't been paying attention, have you?
<http://groups.google.com/group/alt.i...g/9a0f7baad24c
74dc?hl=en&>

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 31 '05 #19


Philip Ronan wrote:

"D. Stussy" wrote:
On Sun, 30 Oct 2005, Philip Ronan wrote:
if even the legitimate spiders can be misdirected then the whole point of
having a robots.txt file goes out the window.


No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED
robots into our web sites. Those robots which behave and respect the
robots.txt file will NEVER fall into these traps.


You seem to have misunderstood the problem. These robots CAN and DO access
pages prohibited by robots.txt files due to the way servers process
consecutive slashes in request URIs.


Which means that the operators of the bad robots can put up a few
multiple-slash links so as to lure good robots into those honeypots
and traps, thus discouraging their use. The good news is that all
of the good robots that I know of obey metas as well as robots.txt,
so they can only do that to honeypot owners who are under the same
mistaken belief that Mr. Stussy expressed above - that "Those
robots which behave and respect the robots.txt file will never fall
into these traps."

I am still hoping that one of the .htaccess experts will come up
with a way to make all multiple-slash requests 301 redirect to
their single-slash versions.
Oct 31 '05 #20
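[One commonly circulated .htaccess recipe for what Guy is asking for looks like the following. It is a sketch under the usual mod_rewrite caveats, not a tested or guaranteed rule: in per-directory context the RewriteRule pattern sees the path with the slashes already collapsed, so the raw request line (THE_REQUEST) has to carry the test.]

```apache
RewriteEngine On
# THE_REQUEST is the raw request line, e.g. "GET //////foundation/ HTTP/1.1",
# so doubled slashes are still visible here even though Apache collapses
# them before per-directory rewriting runs.
RewriteCond %{THE_REQUEST} \s[^?\s]*//
# $1 is the already-collapsed path, so redirecting to it yields the
# single-slash canonical URL.
RewriteRule ^(.*)$ /$1 [R=301,L]
```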

On Mon, 31 Oct 2005, Philip Ronan wrote:
"D. Stussy" wrote:
On Sun, 30 Oct 2005, Philip Ronan wrote:
if even the legitimate spiders can be misdirected then the whole point of
having a robots.txt file goes out the window.


No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED
robots into our web sites. Those robots which behave and respect the
robots.txt file will NEVER fall into these traps.


You seem to have misunderstood the problem. These robots CAN and DO access
pages prohibited by robots.txt files due to the way servers process
consecutive slashes in request URIs.


No - not in a PROPERLY set up system they won't. If one is trapping for a "//"
(any number of slashes greater than one), then the robots (or anyone/anything
else) will never get there.

One can also state that if they do, they are not properly behaved bots.
To the original poster:


Yes, that's me.
Just because you haven't planned for the malicious to happen shows us that you
are closed minded. Your reliance ONLY on robots.txt shows this also. Open up
your thinking.


You also seem to have misunderstood the whole point of this thread. I'm not
asking for help here. I'm just warning people about the unreliability of
robots.txt as a means of excluding your pages from search engines.
You should also probably use the "robots" meta-tag on each HTML page.


That was the first thing I did when I noticed there was a problem.
Have you considered using the rewrite engine to trap for "//" in the URI (the
part of the URL after the protocol and domain name is removed)?


You haven't been paying attention, have you?
<http://groups.google.com/group/alt.i...g/9a0f7baad24c
74dc?hl=en&>


I'm NOT reading this message from that group. If all the messages in the
thread weren't also crossposted to the group I'm reading this from -
comp.infosystems.www.authoring.html, TFB. Deal with it.
Oct 31 '05 #21

On Mon, 31 Oct 2005, Guy Macon wrote:
Philip Ronan wrote:

"D. Stussy" wrote:
On Sun, 30 Oct 2005, Philip Ronan wrote:

if even the legitimate spiders can be misdirected then the whole point of
having a robots.txt file goes out the window.

No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED
robots into our web sites. Those robots which behave and respect the
robots.txt file will NEVER fall into these traps.
You seem to have misunderstood the problem. These robots CAN and DO access
pages prohibited by robots.txt files due to the way servers process
consecutive slashes in request URIs.


Which means that the operators of the bad robots can put up a few
multiple-slash links so as to lure good robots into those honeypots
and traps, thus discouraging their use. The good news is that all
of the good robots that I know of obey metas as well as robots.txt,
so they can only do that to honeypot owners who are under the same
mistaken belief that Mr. Stussy expressed above - that "Those
robots which behave and respect the robots.txt file will never fall
into these traps."


That's not a mistaken belief. Technically, a double (or more) slash, other
than following a colon when separating a protocol from a domain name and not
counting the query string, is NOT a valid URL construct. Robots which accept
them are misbehaved.
I am still hoping that one of the .htaccess experts will come up
with a way to make all multiple-slash requests 301 redirect to
their single-slash versions.


Trivial. Do it yourself.
Oct 31 '05 #22

"D. Stussy" wrote:
On Mon, 31 Oct 2005, Philip Ronan wrote:

You seem to have misunderstood the problem. These robots CAN and DO access
pages prohibited by robots.txt files due to the way servers process
consecutive slashes in request URIs.
No - not in a PROPERLY set up system they won't.


If by "properly set up" you are referring to a system that has been fixed to
redirect or deny requests for URIs containing consecutive slashes, then
that's correct. In fact that's what I've been suggesting all along in this
thread.
If one is trapping for a "//" (any number of slashes greater than one),
then the robots (or anyone/anything else) will never get there.


I don't understand what you mean. If you think the addition of a rule
"Disallow: //" will completely fix the problem then you're mistaken. I've
already explained why.
You haven't been paying attention, have you?
<http://groups.google.com/group/alt.i...g/9a0f7baad24c
74dc?hl=en&>


I'm NOT reading this message from that group. If all the messages in the
thread weren't also crossposted to the group I'm reading this from -
comp.infosystems.www.authoring.html, TFB. Deal with it.


Maybe there's something wrong with your newsreader then.
<http://groups.google.com/group/comp.....html/msg/9a0f
7baad24c74dc>

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/
Oct 31 '05 #23

Tim
Philip Ronan:
the robots.txt protocol is ineffective on (probably) most servers because
it can be circumvented without your knowledge by a third party.


It always has been, anyway. For numerous reasons. Your multiple slash
example is just one of them. Some robots will ignore them altogether,
others will deliberately look at what you tell them to ignore.

Likewise with Google's advice:
Because this page falls outside your robots.txt file, you may want to
use meta tags to remove this page from our index.


In either case, such restrictions only help reduce the load on your server
from well meaning robots. If you want to truly restrict access, you need
to use some form of authentication.

There were moves to suggest the robots exclusion standard ought to let you
specify what you allow and disallow. For some cases it'd be easier to exclude
everything by default, only allowing what you want through. Though I don't
think that ever took off.

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please destroy some files yourself.

Oct 31 '05 #24

On Mon, 31 Oct 2005 11:58:36 +0100, D. Stussy <at******@bde-arc.ampr.org>
wrote:
That's not a mistaken belief. Technically, a double (or more) slash,
other than following a colon when separating a protocol from a domain
name and not counting the query string, is NOT a valid URL construct.
Robots which accept them are misbehaved.


Quoting RFC1738 (BNF description of url):

; HTTP

httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

* means 0 or infinite repetitions, thus hsegment can be empty.
If hsegment can be empty, hpath may contain multiple not separated slashes.
So it seems you are wrong - multiple slashes in URLs are valid.

Best,
Borek
--
http://www.chembuddy.com - chemical calculators for labs and education
BATE - program for pH calculations
CASC - Concentration and Solution Calculator
pH lectures - guide to hand pH calculation with examples
Oct 31 '05 #25
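[Borek's reading of the grammar matches what standards-based URL parsers actually do: empty segments are accepted and preserved. Python's urllib, shown here as one example:]

```python
from urllib.parse import urlsplit

# The BNF quoted above allows empty hsegments, and the parser agrees:
# the multi-slash path is syntactically valid and passes through intact.
parts = urlsplit("http://apache.org//////foundation/")
print(parts.hostname)   # apache.org
print(parts.path)       # //////foundation/
```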

"Tim" wrote:
Philip Ronan:
the robots.txt protocol is ineffective on (probably) most servers because
it can be circumvented without your knowledge by a third party.


It always has been, anyway. For numerous reasons. Your multiple slash
example is just one of them. Some robots will ignore them altogether,
others will deliberately look at what you tell them to ignore.


What you're saying is that it's pointless putting absolute faith in
robots.txt files because they are ignored by some robots. I'm not disputing
that. What I'm saying is that even genuine well-behaved robots like
Googlebot can be made to crawl content prohibited by robots.txt files.

So for example, if you're using a honeypot to block badly behaved robots
from your website automatically, then I can *remove your site from Google*
and probably other search engines simply by publishing a link to your
honeypot directory with an extra slash inserted somewhere. That's why this
issue is important.

I hope you understand now.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 31 '05 #26

On Mon, 31 Oct 2005 10:58:36 GMT, "D. Stussy"
<at******@bde-arc.ampr.org> wrote:
On Mon, 31 Oct 2005, Guy Macon wrote:
Philip Ronan wrote:
>
>"D. Stussy" wrote:
>
>> On Sun, 30 Oct 2005, Philip Ronan wrote:
>>
>>> if even the legitimate spiders can be misdirected then the whole point of
>>> having a robots.txt file goes out the window.
>>
>> No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED
>> robots into our web sites. Those robots which behave and respect the
>> robots.txt file will NEVER fall into these traps.
>
>You seem to have misunderstood the problem. These robots CAN and DO access
>pages prohibited by robots.txt files due to the way servers process
>consecutive slashes in request URIs.


Which means that the operators of the bad robots can put up a few
multiple-slash links so as to lure good robots into those honeypots
and traps, thus discouraging their use. The good news is that all
of the good robots that I know of obey metas as well as robots.txt,
so they can only do that to honeypot owners who are under the same
mistaken belief that Mr. Stussy expressed above - that "Those
robots which behave and respect the robots.txt file will never fall
into these traps."


That's not a mistaken belief. Technically, a double (or more) slash, other
than following a colon when separating a protocol from a domain name and not
counting the query string, is NOT a valid URL construct. Robots which accept
them are misbehaved.
I am still hoping that one of the .htaccess experts will come up
with a way to make all multiple-slash requests 301 redirect to
their single-slash versions.


Trivial. Do it yourself.


Umm, no, I think we'll hand the floor over to you at this point.

BB
--
www.kruse.co.uk/ se*@kruse.demon.co.uk
Elvis does my SEO
Oct 31 '05 #27

P: n/a
On Mon, 31 Oct 2005 12:48:47 +0100, Borek
<bo***@parts.bpp.to.com.remove.pl> wrote:
That's not a mistaken belief. Technically, a double (or more) slash,
other than following a colon when separating a protocol from a domain
name and not counting the query string, is NOT a valid URL construct.
Robots which accept them are misbehaved.
Quoting RFC1738 (BNF description of url):

; HTTP

httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

* means 0 or infinite repetitions, thus hsegment can be empty.


Small correction - 0 _to_ infinite repetitions, not OR infinite
repetitions. But it doesn't change final conclusion.
If hsegment can be empty, hpath may contain multiple consecutive
slashes.
So it seems you are wrong - multiple slashes in URLs are valid.
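One way to sanity-check this reading of the grammar is to transliterate the BNF into a regular expression. The sketch below is simplified (uchar is reduced to a handful of unreserved characters, which is enough to settle the empty-segment question, so this is not a full URL validator); both spellings of the apache.org URL match:

```python
import re

# Rough transliteration of the RFC 1738 BNF quoted above.
# "uchar" is simplified to a small unreserved subset.
hsegment = r"[A-Za-z0-9\-_.;:@&=]*"        # the * allows an empty segment
hpath = rf"{hsegment}(?:/{hsegment})*"
httpurl = re.compile(rf"^http://[^/?#]+(?:/{hpath})?$")

for url in ("http://apache.org/foundation/",
            "http://apache.org//////foundation/"):
    print(url, "->", bool(httpurl.match(url)))
# Both print True: empty hsegments make consecutive slashes valid.
```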


Best,
Borek
--
http://www.chembuddy.com
http://www.chembuddy.com/?left=BATE&...ion_equilibria
http://www.chembuddy.com/?left=CASC&...ion_calculator

Oct 31 '05 #28

P: n/a
Philip Ronan wrote in message news:BF********************@invalid.invalid...
"D. Stussy" wrote:
On Mon, 31 Oct 2005, Philip Ronan wrote: [...]
You haven't been paying attention, have you?
<http://groups.google.com/group/alt.i...g/9a0f7baad24c
74dc?hl=en&>


I'm NOT reading this message from that group. If all the messages in the
thread weren't also crossposted to the group I'm reading this from -
comp.infosystems.www.authoring.html, TFB. Deal with it.


Maybe there's something wrong with your newsreader then.
<http://groups.google.com/group/comp.....html/msg/9a0f
7baad24c74dc>


I don't know what is worse than telling someone
"there's something wrong with newsreader"
and at the same time posting b0rken links.
Oct 31 '05 #29

P: n/a
Philip Ronan wrote:
If by "properly set up" you are referring to a system that has been fixed to
redirect or deny requests for URIs containing consecutive slashes, then
that's correct. In fact that's what I've been suggesting all along in this
thread.


Feel free to set up your server like that. Apache provides a range of
mechanisms for doing so, which you can read all about at apache.org.
It only applies default rules (map to the filesystem) if you haven't
asked it to do otherwise.

--
Nick Kew
Oct 31 '05 #30

P: n/a
On Mon, 31 Oct 2005 15:58:43 +0100, Robi <me@privacy.net> wrote:
and at the same time posting b0rken links.


Is it a typo, or bad joke? ;)

Best,
Borek
--
http://www.chembuddy.com
http://www.chembuddy.com/?left=BATE&...ion_equilibria
http://www.chembuddy.com/?left=CASC&...ion_calculator

Oct 31 '05 #31

P: n/a
Borek wrote in message news:op.szim0cam584cds@borek...
On Mon, 31 Oct 2005 15:58:43 +0100, Robi wrote:
and at the same time posting b0rken links.


Is it a typo, or bad joke? ;)


http://www.bennetyee.org/http_webste...isindex=borken

nothing to do with your name, sorry ;-)
Oct 31 '05 #32

P: n/a
"Robi" wrote:
Philip Ronan wrote in message news:BF********************@invalid.invalid...

Maybe there's something wrong with your newsreader then.
<http://groups.google.com/group/comp.....html/msg/9a0f
7baad24c74dc>


I don't know what is worse than telling someone
"there's something wrong with newsreader"
and at the same time posting b0rken links.


... using a crap newsreader and blaming everyone else when it doesn't work?

If your newsreader can't handle this link:
<http://groups.google.com/group/comp.....html/msg/9a0f
7baad24c74dc>

then try this one instead: <http://tinyurl.com/89bmv>

If you're not too busy then try this one too:
<http://rfc.net/rfc2396.html#sE.>

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 31 '05 #33

P: n/a

Tim wrote:

Philip Ronan:
the robots.txt protocol is ineffective on (probably) most servers because
it can be circumvented without your knowledge by a third party.


It always has been, anyway. For numerous reasons. Your multiple slash
example is just one of them. Some robots will ignore them altogether,
others will deliberately look at what you tell them to ignore.


The robots.txt protocol has always been ineffective on bad
robots, but this is, as far as I know, the first example of
it being ineffective on good robots.

--
Guy Macon <http://www.guymacon.com>
Oct 31 '05 #34

P: n/a

D. Stussy wrote:

Guy Macon wrote:
I am still hoping that one of the .htaccess experts will come up
with a way to make all multiple-slash requests 301 redirect to
their single-slash versions.


Trivial. Do it yourself.


What I described appears to be not only non-trivial but impossible.
Feel free to prove me wrong by posting a counterexample that redirects
all multiple-slash requests to their single-slash versions. I don't
think that you can do it, but I am not an expert on .htaccess wizardry,
so I may be wrong.

One would think that if such a trivial fix existed that someone
in the last 40+ posts would have posted it, thus solving the
problem...

--
Guy Macon <http://www.guymacon.com/>

Oct 31 '05 #35

P: n/a
"Guy Macon" wrote:
One would think that if such a trivial fix existed that someone
in the last 40+ posts would have posted it, thus solving the
problem...


Guy, if you've seen my solution at <http://tinyurl.com/89bmv> and you
haven't got access to PHP, you could try a recursive solution using
.htaccess by itself:

RewriteEngine On
RewriteCond %{REQUEST_URI} //+
RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]

I haven't tested this, but -- in theory -- if the server detects a cluster
of forward slashes in a request URI, it will redirect the client to a URI
containing a single slash in its place. If a request contains more than one
cluster of forward slashes, then the client will be redirected more than
once, but it should eventually get to the right place.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 31 '05 #36

P: n/a
On Mon, 31 Oct 2005, Philip Ronan wrote:
RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]
Actually, a RewriteMatch would suffice, it doesn't need the
full panoply of mod_rewrite...

Your regex doesn't do quite what you hope, due to the greedy nature of
the first "(.*)"

Incidentally, I recommend "pcretest" for this kind of fun.

$ pcretest
PCRE version 3.9 02-Jan-2002

re> "^(.*)//+(.*)$"
data> /one////two/three
0: /one////two/three
1: /one//
2: two/three

As you see, $1 captures a pair of slashes which you really wanted
to be captured by your "//+" portion. As I say, I made the same
mistake at first.

I'd then got closer, with ^(.*?)/{2,}(.*)$ $1/$2

re> "^(.*?)/{2,}(.*)$"
data> /one////two/three
0: /one////two/three
1: /one
2: two/three

with the end result being /one/two/three , as desired.

I think your "//+" is pretty much synonymous with my "/{2,}";
the key difference is to make the first regex non-greedy.
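For readers without pcretest to hand, the same greedy-versus-non-greedy behaviour can be reproduced with Python's re module (an illustrative sketch, not part of any server configuration):

```python
import re

path = "/one////two/three"

# Greedy first group: (.*) backtracks only far enough for /{2,} to
# match, so it keeps two of the four slashes in $1.
greedy = re.match(r"^(.*)/{2,}(.*)$", path)
print(greedy.group(1))                      # /one//

# Non-greedy first group: (.*?) stops before the first slash cluster,
# so joining the groups with a single slash normalizes the path.
lazy = re.match(r"^(.*?)/{2,}(.*)$", path)
print(lazy.group(1) + "/" + lazy.group(2))  # /one/two/three
```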

If a request contains more than one cluster of forward slashes, then
the client will be redirected more than once, but it should
eventually get to the right place.


Indeed.

But aren't there also analogous abuse possibilities with things like
/././ and /.././ and so on?
Oct 31 '05 #37

P: n/a
On Mon, 31 Oct 2005, Alan J. Flavell wrote:
On Mon, 31 Oct 2005, Philip Ronan wrote:
RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]


Actually, a RewriteMatch would suffice,


*RATS*: I meant of course "RedirectMatch". Sorry.

But I think the rest of what I posted is OK.
Oct 31 '05 #38

P: n/a
"Alan J. Flavell" wrote:
On Mon, 31 Oct 2005, Philip Ronan wrote:
RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]
Actually, a [RedirectMatch] would suffice, it doesn't need the
full panoply of mod_rewrite...

Your regex doesn't do quite what you hope, due to the greedy nature of
the first "(.*)"


Ah, well spotted. :-)

In which case, this ought to do the trick:

# Eliminate forward slash clusters
RedirectMatch 301 ^(.*?)//+(.*)$ $1/$2
But aren't there also analogous abuse possibilities with things like
/././ and /.././ and so on?


Another good point. I thought my server was already redirecting those, but
apparently not -- it was the browser correcting my URLs for me.

Perhaps someone can debug these for me?

# Replace /./ with /
RedirectMatch 301 ^(.*?)/\./(.*)$ $1/$2

# Replace /../foo/bar with /foo/bar (at beginning of URI)
RedirectMatch 301 ^/\.\./(.*)$ /$1

# Replace /foo/../bar with /bar
RedirectMatch 301 ^(.*?)/[^/]+/\.\./(.*)$ $1/$2
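A quick way to debug rules like these offline is to simulate the redirect chain: apply each pattern as a single substitution and loop until the path stops changing, the way a client following 301s would. The Python sketch below transliterates the rules in this thread (treating the stray backslash in the last rule as a typo for $1/$2, and //+ as the equivalent /{2,}):

```python
import re

# Each matched rule is one 301 hop; the loop plays the client
# re-requesting until the path is stable.
rules = [
    (re.compile(r"^(.*?)/{2,}(.*)$"),        r"\1/\2"),  # collapse //
    (re.compile(r"^(.*?)/\./(.*)$"),         r"\1/\2"),  # drop /./
    (re.compile(r"^/\.\./(.*)$"),            r"/\1"),    # leading /../
    (re.compile(r"^(.*?)/[^/]+/\.\./(.*)$"), r"\1/\2"),  # segment/../
]

def follow_redirects(path):
    changed = True
    while changed:
        changed = False
        for pattern, repl in rules:
            new = pattern.sub(repl, path, count=1)
            if new != path:
                path, changed = new, True
                break            # a real client re-requests here
    return path

print(follow_redirects("/one////./two/../three//"))  # /one/three/
```

Each hop fixes one cluster, so a pathological URI takes several redirects, but it converges, which matches the "redirected more than once" behaviour described earlier in the thread.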

Phil

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Nov 1 '05 #39

P: n/a
Tim
Tim:
It always has been, anyway. For numerous reasons. Your multiple slash
example is just one of them. Some robots will ignore them altogether,
others will deliberately look at what you tell them to ignore.

Guy Macon:
The robots.txt protocol has always been ineffective on bad robots, but
this is, as far as I know, the first example of it being ineffective on
good robots.


I'm not so sure that it's a fault with robots.txt. After all,
strangeness notwithstanding ///example isn't the same as /example.
Personally, I think this is an issue you'd need to deal with within the
server (e.g. filter requests to disallow access to URIs with multiple
concurrent slashes in them, rather than work around such conditions).

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please destroy some files yourself.

Nov 1 '05 #40

P: n/a
Borek wrote:
On Sun, 30 Oct 2005 21:45:32 +0100, Dave0x1 <as*@example.com> wrote:

It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?

<snip>
All of these generated 404 in last few weeks on my site.

No additional slashes inside of the url, although several times
they were added at the end.

& vs &amp; and wrong capitalization (bate, casc instead of BATE, CASC)
are most prominent sources of errors. But it seems every error is possible
:)


Sorry, I should've been more clear. I wanted to know whether anyone
could point to an actual URL (e.g., a search query) demonstrating that
URLs with multiple adjacent forward slashes are actually being indexed
by any of the major search engines. I haven't seen one.

However, I don't think that the original poster was concerned with
whether these multiple slashed URLs appear in the index as such, so it's
probably not terribly important.
Dave
Nov 2 '05 #41

P: n/a
Guy Macon wrote:
Dave0x1 wrote:

It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results.

If there exists a way for someone else on the Internet to override
your spidering decisions as defined in robots.txt, there will be
those who use that ability to inconvenience, harass or hurt others.


A robots.txt file doesn't make any decisions about which parts of a site
are indexed; it merely offers suggestions.

Dave
Nov 2 '05 #42

P: n/a
Philip Ronan wrote:
"Dave0x1" wrote:

I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.

OK, so as long as the robots.txt documentation includes a note saying that
you have to patch your server software to get reliable results, then we'll
all be fine.


I wouldn't consider patching of the Apache source code either necessary
or desirable in this situation.
It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?

Which bit didn't I explain properly? I'm not going to post a link for you to
check, but here's the response I got from Google on the issue:

Thank you for your note. We apologize for our delayed response.
We understand you're concerned about the inclusion of
http://###.####.###//contact/ in our index.


Does the URL in question appear in the index as
<http://###.####.###//contact/>, or as <http://###.####.###/contact/>?
My assumption is the latter.

Dave

Nov 2 '05 #43

P: n/a
On Wed, 02 Nov 2005 17:45:05 -0500, Dave0x01 <as*@example.com> wrote:
Guy Macon wrote:
Dave0x1 wrote:

It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results.

If there exists a way for someone else on the Internet to override
your spidering decisions as defined in robots.txt, there will be
those who use that ability to inconvenience, harass or hurt others.


A robots.txt file doesn't make any decisions about which parts of a site
are indexed; it merely offers suggestions.

Dave


Which is a good way of putting it.

BB
--
www.kruse.co.uk/ se*@kruse.demon.co.uk
Elvis does my SEO
Nov 3 '05 #44

P: n/a

Dave0x01 wrote:

Guy Macon wrote:
Dave0x1 wrote:
It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results.


If there exists a way for someone else on the Internet to override
your spidering decisions as defined in robots.txt, there will be
those who use that ability to inconvenience, harass or hurt others.


A robots.txt file doesn't make any decisions about which parts of a site
are indexed; it merely offers suggestions.


A robots.txt file most certainly does decide which parts of a site
are indexed - by good robots. It offers suggestions that every good
robot obeys. The effect we are discussing lets someone else on the
Internet override your good-robot spidering decisions as defined in
robots.txt.
Nov 3 '05 #45

P: n/a
"Dave0x01" wrote:
Philip Ronan wrote:
"Dave0x1" wrote:
I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.
OK, so as long as the robots.txt documentation includes a note saying that
you have to patch your server software to get reliable results, then we'll
all be fine.


I wouldn't consider patching of the Apache source code either necessary
or desirable in this situation.


I was being sarcastic. (You're American, right?)
Does the URL in question appear in the index as
<http://###.####.###//contact/>, or as <http://###.####.###/contact/>?
My assumption is the latter.


Then what the hell do you think this thread is all about??

For all you doubting Thomases out there:

Exhibit A: http://freespace.virgin.net/phil.ron...bad-google.png

Exhibit B: http://www.japanesetranslator.co.uk/robots.txt
(Last-Modified: Tue, 01 Mar 2005 08:45:29 GMT)

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Nov 3 '05 #46

P: n/a
Philip Ronan <in*****@invalid.invalid> writes:
"Nick Kew" wrote:
[please don't crosspost without warning. Or with inadequate context]


My original post was copied over to ciwah, so now there are two threads with
the same subject. I'm trying to tie them together, mkay?
Philip Ronan wrote:

But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution.


Who said anything about that?


You did, in your earlier post:
If you have links to things like
"////" and dumb robots, put the paths in your robots.txt.
What's impractical about "Disallow //" ?


It's a partial solution. If you're trying to protect content at deeper
levels in the hierarchy, you will also need:

Disallow: /path//to/file
Disallow: /path/to//file
Disallow: /path//to//file
Disallow: /path///to/file
etc..


You do not want that kind of specificity in your robots.txt file.

Hostile robots will use robots.txt as a menu of "protected" pages to
crawl.

Here's what you want:

Disallow: /unlisted
Disallow: //

Then keep your unlisted contact info under /unlisted/contact/
or better, under /unlisted/86ghb3qx/

This is what I do for ourdoings.com family photo sites that aren't
intended for public view. Additionally I use the meta tags google
recommends. There's still the possibility of someone creating an
external link to the site, but having "unlisted" in the URL advises
people that although they can share it they shouldn't. If someone
creates such a link anyway, good search engines won't follow it.
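The prefix matching everyone is reasoning about here can be sketched in a few lines. This is a toy matcher following the robots.txt draft's simple prefix rule, not any particular robot's implementation; it shows the double-slash hole, Bruce's fix, and why deeper-level double slashes still slip through:

```python
def disallowed(path, rules):
    """Toy robots.txt check: a path is blocked if any Disallow
    value is a literal prefix of it (per the original draft)."""
    return any(path.startswith(rule) for rule in rules)

rules = ["/foundation/"]
print(disallowed("/foundation/", rules))    # True: blocked as intended
print(disallowed("//foundation/", rules))   # False: the double-slash hole

rules.append("//")                          # Bruce's extra rule
print(disallowed("//foundation/", rules))   # True: hole closed at the root
print(disallowed("/path//to/file", rules))  # False: deeper slashes still open
```

The last line is why "Disallow: //" is only a partial fix on its own: it only catches clusters at the start of the path.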

Nov 3 '05 #47

P: n/a
"Bruce Lewis" wrote:
You do not want that kind of specificity in your robots.txt file.

Hostile robots will use robots.txt as a menu of "protected" pages to
crawl.

Here's what you want:

Disallow: /unlisted
Disallow: //


Yeah, that'll work. I wasn't actually *recommending* putting every
conceivable combination of slashes into the robots.txt file, I was just
trying to point out that "Disallow: //" on its own is inadequate.

As long as you're aware of the problem and doing something about it, then
that's fine.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Nov 3 '05 #48

P: n/a
On Mon, 31 Oct 2005, Philip Ronan wrote:
"D. Stussy" wrote:
On Mon, 31 Oct 2005, Philip Ronan wrote:

You seem to have misunderstood the problem. These robots CAN and DO access
pages prohibited by robots.txt files due to the way servers process
consecutive slashes in request URIs.


No - not in a PROPERLY set up system they won't.


If by "properly set up" you are referring to a system that has been fixed to
redirect or deny requests for URIs containing consecutive slashes, then
that's correct. In fact that's what I've been suggesting all along in this
thread.
If one is trapping for a "//" (any number of slashes greater than one),
then the robots (or anyone/anything else) will never get there.


I don't understand what you mean. If you think the addition of a rule
"Disallow: //" will completely fix the problem then you're mistaken. I've
already explained why.


Nowhere did I say that trapping "//" would be a robots.txt rule. It should be
a rewrite engine rule (at least for apache).
You haven't been paying attention, have you?
<http://groups.google.com/group/alt.i...g/9a0f7baad24c
74dc?hl=en&>


I'm NOT reading this message from that group. If all the messages in the
thread weren't also crossposted to the group I'm reading this from -
comp.infosystems.www.authoring.html, TFB. Deal with it.


Maybe there's something wrong with your newsreader then.
<http://groups.google.com/group/comp.....html/msg/9a0f
7baad24c74dc>


Like you expect me to read EVERY message. Also, remember that NNTP is a flood
technology - and not all servers get flooded at the same time. Nothing need be
wrong here for one to miss a message that hasn't yet arrived.
Nov 7 '05 #49

P: n/a
On Mon, 31 Oct 2005, Borek wrote:
On Mon, 31 Oct 2005 11:58:36 +0100, D. Stussy <at******@bde-arc.ampr.org>
wrote:
That's not a mistaken belief. Technically, a double (or more) slash, other
than following a colon when separating a protocol from a domain name and not
counting the query string, is NOT a valid URL construct. Robots which accept
them are misbehaved.


Quoting RFC1738 (BNF description of url):

; HTTP

httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

* means 0 or infinite repetitions, thus hsegment can be empty.
If hsegment can be empty, hpath may contain multiple consecutive slashes.
So it seems you are wrong - multiple slashes in URLs are valid.


However, URL paths are usually mapped onto filesystem naming
conventions, and that's where multiple consecutive slashes are not proper.
Nov 7 '05 #50
