
Problem with Python's "robots.txt" file parser in module robotparser

Python's "robots.txt" file parser may be misinterpreting a
special case. Given a robots.txt file like this:

User-agent: *
Disallow: //
Disallow: /account/registration
Disallow: /account/mypro
Disallow: /account/myint
...

the Python library class "robotparser.RobotFileParser" considers all pages of the
site to be disallowed. Apparently "Disallow: //" is being interpreted as
"Disallow: /". Even the home page of the site is locked out. This may be incorrect.

This is the robots.txt file for "http://ibm.com".
Some IBM operating systems recognize filenames starting with "//"
as a special case like a network root, so they may be trying to
handle some problem like that.

The spec for "robots.txt", at

http://www.robotstxt.org/wc/norobots.html

says "Disallow: The value of this field specifies a partial URL that is not to
be visited. This can be a full path, or a partial path; any URL that starts with
this value will not be retrieved." That suggests that "//" should only disallow
paths beginning with "//".
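
To make that reading of the spec concrete, here is a minimal sketch of plain prefix matching. It is not how robotparser is implemented; the function name is_disallowed and the rule list are just illustrations built from the excerpt above, and treating an empty Disallow value as "matches nothing" is the usual convention rather than something quoted here.

```python
def is_disallowed(path, disallow_rules):
    """Return True if path starts with any non-empty Disallow value.

    Hypothetical helper: a literal reading of "any URL that starts
    with this value will not be retrieved".
    """
    # An empty Disallow value is skipped, so it blocks nothing.
    return any(rule and path.startswith(rule) for rule in disallow_rules)


# Disallow values taken from the robots.txt excerpt in this post.
rules = ["//", "/account/registration", "/account/mypro", "/account/myint"]

print(is_disallowed("/", rules))                 # False
print(is_disallowed("/foo.html", rules))         # False
print(is_disallowed("//foo.html", rules))        # True
print(is_disallowed("/account/mypro/x", rules))  # True
```

Under this reading, "//" only blocks paths that themselves begin with two slashes, and the home page stays fetchable.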

John Nagle
SiteTruth
Jul 11 '07 #1
In article <0T*****************@newssvr19.news.prodigy.net>,
John Nagle <na***@animats.com> wrote:
> Python's "robots.txt" file parser may be misinterpreting a
> special case. Given a robots.txt file like this:
>
> User-agent: *
> Disallow: //
> Disallow: /account/registration
> Disallow: /account/mypro
> Disallow: /account/myint
> ...
>
> the python library "robotparser.RobotFileParser()" considers all pages of the
> site to be disallowed. Apparently "Disallow: //" is being interpreted as
> "Disallow: /". Even the home page of the site is locked out. This may be
> incorrect.
>
> This is the robots.txt file for "http://ibm.com".

Hi John,
Are you sure you're not confusing your sites? The robots.txt file at
www.ibm.com contains the double slashed path. The robots.txt file at
ibm.com is different and contains this which would explain why you
think all URLs are denied:
User-agent: *
Disallow: /

I don't see the bug to which you're referring:
>>> import robotparser
>>> r = robotparser.RobotFileParser()
>>> r.set_url("http://www.ibm.com/robots.txt")
>>> r.read()
>>> r.can_fetch("WhateverBot", "http://www.ibm.com/foo.html")
1
>>> r.can_fetch("WhateverBot", "http://www.ibm.com//foo.html")
0
>>>
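
For anyone who wants to repeat the comparison without hitting the network, here is a rough offline sketch. It assumes Python 3, where the module lives at urllib.robotparser, and it feeds the parser text copied from this thread rather than the live files. The helper name parser_from_text is made up for the example, and the "//" rule is deliberately left out because I have not verified how current versions normalize it.

```python
from urllib.robotparser import RobotFileParser


def parser_from_text(text):
    """Hypothetical helper: build a parser from robots.txt text, no fetch."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp


# Contents as quoted in this thread, not fetched from the live sites.
bare_host = parser_from_text("User-agent: *\nDisallow: /\n")
www_host = parser_from_text(
    "User-agent: *\n"
    "Disallow: /account/registration\n"
    "Disallow: /account/mypro\n"
)

# "Disallow: /" blocks every path on the bare host.
print(bare_host.can_fetch("WhateverBot", "http://ibm.com/foo.html"))
# The www rules only block the listed /account/ prefixes.
print(www_host.can_fetch("WhateverBot", "http://www.ibm.com/foo.html"))
print(www_host.can_fetch("WhateverBot", "http://www.ibm.com/account/mypro"))
```

The first call should report the URL as blocked and the second as allowed, which is the difference between the two files described above.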
I'll use this opportunity to shamelessly plug an alternate robots.txt
parser that I wrote to address some small bugs in the parser in the
standard library.
http://NikitaTheSpider.com/python/rerp/

Cheers

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Jul 11 '07 #2
Nikita the Spider wrote:
> Hi John,
> Are you sure you're not confusing your sites? The robots.txt file at
> www.ibm.com contains the double slashed path. The robots.txt file at
> ibm.com is different and contains this which would explain why you
> think all URLs are denied:
> User-agent: *
> Disallow: /

Ah, that's it. The problem is that "ibm.com" redirects to
"http://www.ibm.com", but "ibm.com/robots.txt" does not
redirect. For comparison, try "microsoft.com/robots.txt",
which does redirect.
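
The trap here is that a crawler looks up robots.txt by host, so the bare and "www" names are separate lookups. A small sketch of that, assuming Python 3's urllib.parse; the function name robots_url_for is hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url):
    """Hypothetical helper: robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("http://ibm.com/foo.html"))
# http://ibm.com/robots.txt
print(robots_url_for("http://www.ibm.com/foo.html"))
# http://www.ibm.com/robots.txt
```

A crawler that starts from "ibm.com" and does not re-derive the robots.txt URL after following the page redirect ends up applying the wrong file.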

John Nagle
Jul 11 '07 #3
In article <IE******************@newssvr23.news.prodigy.net>,
John Nagle <na***@animats.com> wrote:
> Nikita the Spider wrote:
>
>> Hi John,
>> Are you sure you're not confusing your sites? The robots.txt file at
>> www.ibm.com contains the double slashed path. The robots.txt file at
>> ibm.com is different and contains this which would explain why you
>> think all URLs are denied:
>> User-agent: *
>> Disallow: /
>
> Ah, that's it. The problem is that "ibm.com" redirects to
> "http://www.ibm.com", but "ibm.com/robots.txt" does not
> redirect. For comparison, try "microsoft.com/robots.txt",
> which does redirect.

Strange thing for them to do, isn't it? Especially with two such
different robots.txt files.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Jul 12 '07 #4
Nikita the Spider wrote:
> In article <IE******************@newssvr23.news.prodigy.net>,
> John Nagle <na***@animats.com> wrote:
>
>> Nikita the Spider wrote:
>>
>>> Hi John,
>>> Are you sure you're not confusing your sites? The robots.txt file at
>>> www.ibm.com contains the double slashed path. The robots.txt file at
>>> ibm.com is different and contains this which would explain why you
>>> think all URLs are denied:
>>> User-agent: *
>>> Disallow: /
>>
>> Ah, that's it. The problem is that "ibm.com" redirects to
>> "http://www.ibm.com", but "ibm.com/robots.txt" does not
>> redirect. For comparison, try "microsoft.com/robots.txt",
>> which does redirect.
>
> Strange thing for them to do, isn't it? Especially with two such
> different robots.txt files.

I asked over at Webmaster World, and over there, they recommend against
using redirects on robots.txt files, because they questioned whether all of
the major search engines understand that. Does a redirect for
"foo.com/robots.txt" mean that the robots.txt file applies to the domain
being redirected from, or the domain being redirected to?

John Nagle
Jul 13 '07 #5
In article <yP*******************@newssvr21.news.prodigy.net>,
John Nagle <na***@animats.com> wrote:
> I asked over at Webmaster World, and over there, they recommend against
> using redirects on robots.txt files, because they questioned whether all of
> the major search engines understand that. Does a redirect for
> "foo.com/robots.txt" mean that the robots.txt file applies to the domain
> being redirected from, or the domain being redirected to?

Good question. I'd guess the latter, but it's a little ambiguous. I
agree that redirecting a request for robots.txt is probably not a good
idea. Given that the robots.txt standard isn't as standard as it could
be, I think it's a good idea in general to apply the KISS principle when
dealing with things robots.txt-y.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Jul 13 '07 #6
