Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?

For some reason, Python's parser for "robots.txt" files
doesn't like Wikipedia's "robots.txt" file:

>>> import robotparser
>>> url = 'http://wikipedia.org/robots.txt'
>>> chk = robotparser.RobotFileParser()
>>> chk.set_url(url)
>>> chk.read()
>>> testurl = 'http://wikipedia.org'
>>> chk.can_fetch('Mozilla', testurl)
False
>>>
The Wikipedia robots.txt file passes robots.txt validation,
and it doesn't disallow unknown user agents. But the Python
parser doesn't see it that way: no matter what user agent or URL
is specified, the only answer for that robots.txt file is "False".
It fails in Python 2.4 on Windows and in Python 2.5 on Fedora Core.

I use "robotparser" on lots of other robots.txt files, and it
normally works. It even used to work on Wikipedia's older file.
But there's something in there now that robotparser doesn't like.
Any ideas?

John Nagle
Oct 2 '07 #1
In message <HY****************@newssvr21.news.prodigy.net>, John Nagle
wrote:
> For some reason, Python's parser for "robots.txt" files
> doesn't like Wikipedia's "robots.txt" file:
>
> >>> import robotparser
> >>> url = 'http://wikipedia.org/robots.txt'
> >>> chk = robotparser.RobotFileParser()
> >>> chk.set_url(url)
> >>> chk.read()
> >>> testurl = 'http://wikipedia.org'
> >>> chk.can_fetch('Mozilla', testurl)
> False
> >>>

>>> chk.errcode
403

Significant?
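For what it's worth, the 403 explains the blanket "False": in the Python 2 robotparser module, read() records the HTTP status of the robots.txt fetch, and a 401/403 flips an internal disallow-everything flag. A minimal check, using the same URL as above (disallow_all is an undocumented attribute, so treat this as a diagnostic sketch rather than a supported API):

import robotparser

chk = robotparser.RobotFileParser()
chk.set_url('http://wikipedia.org/robots.txt')
chk.read()

# read() stores the HTTP status of the robots.txt request; on 401/403
# the parser marks the whole site as off-limits.
print chk.errcode        # 403 for this fetch
print chk.disallow_all   # True, which is why can_fetch() always says False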

Oct 2 '07 #2
On 02/10/2007, John Nagle <na***@animats.com> wrote:
> But there's something in there now that robotparser doesn't like.
> Any ideas?
Wikipedia denies _all_ access for the standard urllib user agent, and
when robotparser gets a 401 or 403 response while trying to fetch
robots.txt, it treats that as equivalent to "Disallow: /" for every user agent.

http://infix.se/2006/05/17/robotparser
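If the goal is to keep honoring robots.txt while crawling under your own agent string, one workaround is to fetch robots.txt yourself with a non-default User-Agent and feed the lines to the parser via parse(), which skips the fetch that was getting the 403. A rough sketch (the "MyCrawler/0.1" agent string is just a placeholder, and this hasn't been tested against Wikipedia's current setup):

import urllib2
import robotparser

# Fetch robots.txt with a custom User-Agent, since the default
# urllib agent string is what Wikipedia answers with a 403.
req = urllib2.Request('http://wikipedia.org/robots.txt',
                      headers={'User-Agent': 'MyCrawler/0.1'})
lines = urllib2.urlopen(req).read().splitlines()

chk = robotparser.RobotFileParser()
chk.parse(lines)          # no HTTP fetch here, so no 403 to trip over
print chk.can_fetch('MyCrawler/0.1', 'http://wikipedia.org')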

It could also be worth mentioning that if you are planning to
crawl a lot of Wikipedia pages, you may be better off downloading
the whole thing instead: <http://download.wikimedia.org/>
(perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
wiki markup to HTML).
--
filip salomonsson
Oct 2 '07 #3
