On Tue, 15 Mar 2005, Steve wrote:
> I notice that search engines are now finding robots.txt files and
> cataloguing their contents. Is this wise, I wonder? Is it a possible
> security risk?
Possibly. It depends on how you write your robots.txt file.
It also depends on how secure you want to be. If you *really* want to
block access to a URL, then you need some kind of access control,
rather than hoping that the existence of the URL will remain a
secret.
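For example (a minimal sketch, assuming Apache - the directory path
and password file location here are made up), HTTP Basic
authentication looks roughly like this:

    <Directory "/var/www/private">
        AuthType Basic
        AuthName "Private area"
        # Hypothetical location for the htpasswd file
        AuthUserFile /etc/apache/htpasswd
        Require valid-user
    </Directory>

Unlike robots.txt, this actually refuses the request, rather than
politely asking crawlers to stay away.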
But let's suppose, for the moment, that you're trying for the weak
security approach of creating an unpublished URL and telling only
your friends about it (i.e. not linking to it from any of your public
pages).
If you use robots.txt to "Disallow" explicit paths that were supposed
to stay hidden in this way, then obviously you have now revealed the
existence of those paths to anyone who cares to look.
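To make that concrete, a robots.txt along these lines (the paths are
invented for illustration) is effectively a signpost:

    User-agent: *
    Disallow: /secret-report.html
    Disallow: /admin/staging/

Anyone can fetch /robots.txt and read off exactly the URLs you were
hoping to keep quiet.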
But if you apply the disallow only in a wildcard fashion, without
revealing explicit URLs, then you haven't given much away. For
example, if you "Disallow: /private", then everybody can guess that
you have a hierarchy called "/private", but they still have no idea
what the individual URLs in there are called. As long as you
configure your server to block directory listings and URL guessing (à
la mod_speling), they'd have a hard time finding anything by chance.
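As a sketch of that combination (the directory names are again
hypothetical), the robots.txt names only the top of the hierarchy,
and the Apache config switches off auto-generated listings and
mod_speling's URL correction:

    # --- robots.txt ---
    User-agent: *
    Disallow: /private

    # --- httpd.conf or .htaccess ---
    <Directory "/var/www/private">
        # No auto-generated directory index pages
        Options -Indexes
        <IfModule mod_speling.c>
            # Don't "correct" near-miss URLs into hits
            CheckSpelling Off
        </IfModule>
    </Directory>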
If you "Disallow foo.html", then everybody can suspect that you have
file(s) called foo.html, but they don't know exactly where to look for
them.
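For what it's worth, the wildcard form that some crawlers support as
an extension looks like this (the filename is, of course, made up):

    User-agent: *
    # "*" is a nonstandard extension to the 1994 spec;
    # crawlers that don't support it will treat this as a
    # literal prefix and match nothing
    Disallow: /*foo.html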
But in all of these cases, keep in mind that a single incautious
mention of one of these hidden URLs - in a published web page, a
usenet posting, an archived mailing list, etc. - is all it takes to
expose it. robots.txt addresses itself to properly-behaved robots,
but in no way prevents rogues from doing whatever they please.