
robotparser behavior on 403 (Forbidden) robots.txt files

I just discovered that the "robotparser" module interprets
a 403 ("Forbidden") status on a "robots.txt" file as meaning
"all access disallowed". That's unexpected behavior.

A major site ("http://www.aplus.net/robot.txt") has their
"robots.txt" file set up that way.

There's no real "robots.txt" standard, unfortunately.
So it's not definitively a bug.
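
For anyone who wants to reproduce this, here's a minimal sketch (the host is
just a placeholder for any site whose "robots.txt" comes back 403, and the
user agent string is made up):

    import robotparser

    rp = robotparser.RobotFileParser()
    # Placeholder URL -- substitute any host whose robots.txt returns 403.
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()

    # After a 401/403 on robots.txt, read() leaves disallow_all set, so
    # can_fetch() answers False for every user agent and every path.
    print(rp.can_fetch("MyCrawler/1.0", "http://www.example.com/some/page.html"))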

John Nagle
SiteTruth
Jun 27 '08 #1


> I just discovered that the "robotparser" module interprets
> a 403 ("Forbidden") status on a "robots.txt" file as meaning
> "all access disallowed". That's unexpected behavior.
That's specified in the "norobots RFC":

http://www.robotstxt.org/norobots-rfc.txt

- On server response indicating access restrictions (HTTP Status
Code 401 or 403) a robot should regard access to the site
completely restricted.

So if a site returns 403, we should assume that it did so
deliberately, and doesn't want to be indexed.
> A major site ("http://www.aplus.net/robot.txt") has their
> "robots.txt" file set up that way.
You should try "http://www.aplus.net/robots.txt" instead,
which can be accessed just fine.
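
If you want to see for yourself which status each URL returns before handing
it to robotparser, a quick check along these lines works (Python 2's urllib2
here; robots_status is just a made-up helper name):

    import urllib2

    def robots_status(url):
        # Return the HTTP status code for a URL (made-up helper).
        try:
            return urllib2.urlopen(url).code
        except urllib2.HTTPError, err:
            return err.code

    print(robots_status("http://www.aplus.net/robot.txt"))    # the misspelled path
    print(robots_status("http://www.aplus.net/robots.txt"))   # the correct file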

Regards,
Martin
Jun 27 '08 #2
