Warning: robots.txt unreliable in Apache servers


Subject: Warning: robots.txt unreliable in Apache servers
From: Philip Ronan <in*****@invalid.invalid>
Newsgroups: alt.internet.search-engines
Message-ID: <BF89BF33.39FDF%in*****@invalid.invalid>
Date: Sat, 29 Oct 2005 23:07:46 GMT

Hi,

I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.

For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.

Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in their
robots.txt file:

Disallow: /foundation/

But if I posted a link to //////foundation/ somewhere, the search engines
will be quite happy to index it because it isn't covered by this rule.
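
A minimal Python sketch of the mismatch described above. It assumes the simple prefix matching of the original robots.txt convention and models the server as collapsing runs of slashes; the function names `is_disallowed` and `server_view` are hypothetical, and real crawlers and servers may normalise paths differently.

```python
import re


def is_disallowed(path, disallow_rules):
    """Simplified robots.txt check: a path is blocked if it starts with a rule."""
    return any(path.startswith(rule) for rule in disallow_rules if rule)


def server_view(path):
    """Assumed server behaviour: runs of slashes are treated as a single slash."""
    return re.sub(r"/{2,}", "/", path)


rules = ["/foundation/"]  # from "Disallow: /foundation/"

for requested in ("/foundation/", "//////foundation/"):
    print(
        requested,
        "-> served as", server_view(requested),
        "| blocked for robots:", is_disallowed(requested, rules),
    )
```

Both requests map to the same page on the server, but only the single-slash form matches the Disallow prefix, so a robot that compares the raw path sees no rule against the multi-slash link.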

As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(

You might want to check the behaviour of your servers to see if you're
vulnerable to the same sort of problem.

If anyone's interested, I've put together a .htaccess rule and a PHP script
that seem to sort things out.

Phil

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 30 '05


David Ross wrote:

Philip Ronan wrote:

I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.

For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.

Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in their
robots.txt file:

Disallow: /foundation/

But if I posted a link to //////foundation/ somewhere, the search engines
will be quite happy to index it because it isn't covered by this rule.

As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(

You might want to check the behaviour of your servers to see if you're
vulnerable to the same sort of problem.

If anyone's interested, I've put together a .htaccess rule and a PHP script
that seem to sort things out.


I thought that parsing and processing a robots.txt file is the
responsibility of the bot and not the Web server. All the Web
server has to do is deliver the robots.txt file to the bot.

If that is true, the problem lies within Google and not Apache.


I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?

--
Guy Macon <http://www.guymacon.com/>

Oct 30 '05 #11
Guy Macon <http://www.guymacon.com/> wrote:
I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?

Don't know, but it seems to be the case on unix/linux filesystems too,

If I 'cd //////usr////////////local////apache2' I end up
in /usr/local/apache2

The web servers are probably mimicking this behaviour.
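
The filesystem behaviour mentioned above can be illustrated lexically in Python. This is only a sketch: `posixpath.normpath` applies POSIX-style normalisation to the string without touching the disk, and it is an assumption that web servers follow the same rule.

```python
import posixpath

messy = "//////usr////////////local////apache2"

# Runs of slashes inside the path collapse to one, and three or more
# leading slashes are treated as a single leading slash.
print(posixpath.normpath(messy))  # /usr/local/apache2

# POSIX leaves exactly two leading slashes implementation-defined,
# so normpath keeps them as they are.
print(posixpath.normpath("//usr/local"))  # //usr/local
```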
--
Brian Wakem
Email: http://homepage.ntlworld.com/b.wakem/myemail.png
Oct 30 '05 #12
Guy Macon wrote:

I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?

You are referring to which specs?
This behavior for following paths is from unix and is how all C
compilers handle paths. It is simply applied to URLs as well. There may
even be a requirement in the C specification about paths.

--
jmm (hyphen) list (at) sohnen-moe (dot) com
(Remove .AXSPAMGN for email)
Oct 30 '05 #13
Guy Macon wrote:

I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?


Hint: Read the documentation offered at either of the first two URLs.

I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.

It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?

Dave

Oct 30 '05 #14
"Dave0x1" wrote:
I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.

OK, so as long as the robots.txt documentation includes a note saying that
you have to patch your server software to get reliable results, then we'll
all be fine.

It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?


Which bit didn't I explain properly? I'm not going to post a link for you to
check, but here's the response I got from Google on the issue:

Thank you for your note. We apologize for our delayed response.
We understand you're concerned about the inclusion of
http://###.####.###//contact/ in our index.

It's important to note that we visited the live page in question
and found that it currently exists on the web as listed above.
Because this page falls outside your robots.txt file, you may
want to use meta tags to remove this page from our index. For
more information about using meta tags, please visit
http://www.google.com/remove.html

[remainder snipped]


I didn't publish the link to //contact/, someone else did. So that means the
robots.txt protocol is ineffective on (probably) most servers because it can
be circumvented without your knowledge by a third party.

Hope that's all clear now.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/
Oct 30 '05 #15
On Sun, 30 Oct 2005 21:45:32 +0100, Dave0x1 <as*@example.com> wrote:
It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?


/%3Fleft%3DpH-calculation%26right%3Dtoc&hl=pt-BR&lr=lang_pt&sa=G
/?left=BATE&amp%3Bright=phcalculation
/?left=BATE&amp;amp;right=dissociation_constants
/?left=BATE&right=basic_acid_titr
/?left=BATE&right=basic_acid_titration_equilbria
/?left=BATE&right=basic_acid_titration_equilibri
/?left=BATE&right=basic_acid_titration_equilibria">pH
/?left=BATE&right=basic_acid_titration_equilibria%22%3EpH
/?left=BATE&right=basic_acid_titration_equilibria/////////////////////////////////////////////////////
/?left=BATE&right=dissociation_constants]</td></tr><tr>
/?left=casc&amp/
/?left=casc&amp;right=download
/?left=faq/
/?left=dave-is-great
/?left=BATE&right=basic_acid_titration_equilibria/
/index.php[left]BATE[right]overview[SiteID]simtel.net
/pHlecimg/3-f.png
/pHlecimg/3-g.png
/?left=pH-calculation
/?left=casc&right=concentration_and_solution_calculator
/?left=casc&right=density_tables
/files/CASCInstall.zip http:/www.chembuddy.com/files/CASCInstall.exe
/?left=bate&right=dissociation_constants
/?left=bate&right=download
/?left=bate&right=screenshots
/this_is_a_test_of_404_response
/?left=CASC&amp;right=buy
/?left=CASC&right=concentration_and_solution_calculator://
/?left=CASC&amp;right=density_tables
/?left=BATE&right=right=basic_acid_titration_equilibria

All of these generated 404 in last few weeks on my site.

No additional slashes inside of the url, although several times
they were added at the end.

& vs &amp; and wrong capitalization (bate, casc instead of BATE, CASC)
are most prominent sources of errors. But it seems every error is possible
:)

Best,
Borek
--
http://www.chembuddy.com - chemical calculators for labs and education
BATE - program for pH calculations
CASC - Concentration and Solution Calculator
pH lectures - guide to hand pH calculation with examples
Oct 30 '05 #16
On Sun, 30 Oct 2005, Philip Ronan wrote:
"Nick Kew" wrote:
If you have links to things like "////" and dumb robots, put the
paths in your robots.txt. Don't forget that robots.txt is only
advisory and is commonly ignored by evil and/or broken robots.


But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution. I realize not all
robots bother with the robots.txt protocol, but if even the legitimate
spiders can be misdirected then the whole point of having a robots.txt file
goes out the window.


No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED robots
into our web sites. Those robots which behave and respect the robots.txt file
will NEVER fall into these traps. Those that don't get their IPs auto-added
into a deny-access map file.

On my system, the deny-access file is also shared with the mail server to deny
mail from those abusive systems.

To the original poster:

Just because you haven't planned for the malicious to happen shows us that you
are closed minded. Your reliance ONLY on robots.txt shows this also. Open up
your thinking.

You should also probably use the "robots" meta-tag on each HTML page.

Have you considered using the rewrite engine to trap for "//" in the URI (the
part of the URL after the protocol and domain name is removed)?
Oct 30 '05 #17


Dave0x1 wrote:
It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results.


If there exists a way for someone else on the Internet to override
your spidering decisions as defined in robots.txt, there will be
those who use that ability to inconvenience, harass or hurt others.

Oct 31 '05 #18
"D. Stussy" wrote:
On Sun, 30 Oct 2005, Philip Ronan wrote:
if even the legitimate spiders can be misdirected then the whole point of
having a robots.txt file goes out the window.

No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED
robots into our web sites. Those robots which behave and respect the
robots.txt file will NEVER fall into these traps.


You seem to have misunderstood the problem. These robots CAN and DO access
pages prohibited by robots.txt files due to the way servers process
consecutive slashes in request URIs.

To the original poster:

Yes, that's me.

Just because you haven't planned for the malicious to happen shows us that you
are closed minded. Your reliance ONLY on robots.txt shows this also. Open up
your thinking.

You also seem to have misunderstood the whole point of this thread. I'm not
asking for help here. I'm just warning people about the unreliability of
robots.txt as a means of excluding your pages from search engines.

You should also probably use the "robots" meta-tag on each HTML page.

That was the first thing I did when I noticed there was a problem.

Have you considered using the rewrite engine to trap for "//" in the URI (the
part of the URL after the protocol and domain name is removed)?


You haven't been paying attention, have you?
<http://groups.google.com/group/alt.i...g/9a0f7baad24c74dc?hl=en&>

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/

Oct 31 '05 #19

Philip Ronan wrote:

"D. Stussy" wrote:
On Sun, 30 Oct 2005, Philip Ronan wrote:
if even the legitimate spiders can be misdirected then the whole point of
having a robots.txt file goes out the window.


No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED
robots into our web sites. Those robots which behave and respect the
robots.txt file will NEVER fall into these traps.


You seem to have misunderstood the problem. These robots CAN and DO access
pages prohibited by robots.txt files due to the way servers process
consecutive slashes in request URIs.


Which means that the operators of the bad robots can put up a few
multiple-slash links so as to lure good robots into those honeypots
and traps, thus discouraging their use. The good news is that all
of the good robots that I know of obey metas as well as robots.txt,
so they can only do that to honeypot owners who are under the same
mistaken belief that Mr. Stussy expressed above - that "Those
robots which behave and respect the robots.txt file will never fall
into these traps."

I am still hoping that one of the .htaccess experts will come up
with a way to make all multiple-slash requests 301 redirect to
their single-slash versions.
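
The redirect asked for above can be sketched at the application level in Python as a small WSGI wrapper. This is not the .htaccess rule mentioned earlier in the thread, which is not shown here; the name `collapse_slashes` is hypothetical, and the sketch assumes the server passes the raw multi-slash path through in PATH_INFO.

```python
import re
from urllib.parse import quote

_MULTI_SLASH = re.compile(r"/{2,}")


def collapse_slashes(app):
    """Wrap a WSGI app so multi-slash paths get a 301 to the single-slash form."""

    def wrapper(environ, start_response):
        path = environ.get("PATH_INFO", "")
        canonical = _MULTI_SLASH.sub("/", path)
        if canonical != path:
            # PATH_INFO arrives decoded (as latin-1 text), so re-encode it
            # before putting it in the Location header.
            location = quote(canonical, encoding="latin-1")
            query = environ.get("QUERY_STRING", "")
            if query:
                location += "?" + query
            start_response(
                "301 Moved Permanently",
                [("Location", location), ("Content-Length", "0")],
            )
            return [b""]
        # Path was already canonical: hand the request to the real app.
        return app(environ, start_response)

    return wrapper
```

With a redirect like this in place, a robot that follows a link to //contact/ is sent to /contact/, where the normal robots.txt rule applies again.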
Oct 31 '05 #20
