Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?

For some reason, Python's parser for "robots.txt" files
doesn't like Wikipedia's "robots.txt" file:
>>> import robotparser
>>> url = 'http://wikipedia.org/robots.txt'
>>> chk = robotparser.RobotFileParser()
>>> chk.set_url(url)
>>> chk.read()
>>> testurl = 'http://wikipedia.org'
>>> chk.can_fetch('Mozilla', testurl)
False
>>>
The Wikipedia robots.txt file passes robots.txt validation,
and it doesn't disallow unknown user agents. But the Python
parser doesn't see it that way. No matter what user agent or URL is
specified, the only answer for that robots.txt file is "False".
It fails in both Python 2.4 on Windows and Python 2.5 on Fedora Core.

I use "robotparser" on lots of other robots.txt files, and it
normally works. It even used to work on Wikipedia's older file.
But there's something in there now that robotparser doesn't like.
Any ideas?

John Nagle
Oct 2 '07 #1
In message <HY****************@newssvr21.news.prodigy.net>, John Nagle
wrote:
For some reason, Python's parser for "robots.txt" files
doesn't like Wikipedia's "robots.txt" file:
>>> import robotparser
>>> url = 'http://wikipedia.org/robots.txt'
>>> chk = robotparser.RobotFileParser()
>>> chk.set_url(url)
>>> chk.read()
>>> testurl = 'http://wikipedia.org'
>>> chk.can_fetch('Mozilla', testurl)
False
>>> chk.errcode
403

Significant?
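
One quick way to see where that 403 comes from is to request
robots.txt directly with urllib2, which sends the default
"Python-urllib" user agent. A minimal sketch (assuming Wikipedia
still rejects that default agent):

import urllib2

try:
    urllib2.urlopen('http://wikipedia.org/robots.txt')
except urllib2.HTTPError, e:
    # The request is refused before robots.txt is ever served,
    # so robotparser has nothing to parse.
    print e.code   # expected: 403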

Oct 2 '07 #2
On 02/10/2007, John Nagle <na***@animats.com> wrote:
> But there's something in there now that robotparser doesn't like.
> Any ideas?
Wikipedia denies _all_ access for the standard urllib user agent, and
when robotparser gets a 401 or 403 response while trying to fetch
robots.txt, it treats that as equivalent to disallowing everything.

http://infix.se/2006/05/17/robotparser
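
If you still want to use robotparser against Wikipedia, one possible
workaround is to fetch robots.txt yourself with a different
User-Agent header and hand the lines to RobotFileParser.parse(),
bypassing read() and its errcode handling. A rough sketch; the
"MyCrawler/0.1" agent string is just a placeholder for your own
crawler's name:

import urllib2
import robotparser

# Fetch robots.txt with a non-default user agent so the request
# isn't rejected outright.
req = urllib2.Request('http://wikipedia.org/robots.txt',
                      headers={'User-Agent': 'MyCrawler/0.1'})
data = urllib2.urlopen(req).read()

chk = robotparser.RobotFileParser()
chk.parse(data.splitlines())
print chk.can_fetch('MyCrawler', 'http://wikipedia.org')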

It could also be worth mentioning that if you were planning on
crawling a lot of Wikipedia pages, you may be better off downloading
the whole thing instead: <http://download.wikimedia.org/>
(perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
wiki markup to HTML).
--
filip salomonsson
Oct 2 '07 #3
