Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?

For some reason, Python's parser for "robots.txt" files
doesn't like Wikipedia's "robots.txt" file:

>>> import robotparser
>>> url = 'http://wikipedia.org/robots.txt'
>>> chk = robotparser.RobotFileParser()
>>> chk.set_url(url)
>>> chk.read()
>>> testurl = 'http://wikipedia.org'
>>> chk.can_fetch('Mozilla', testurl)
False
>>>
The Wikipedia robots.txt file passes robots.txt validation,
and it doesn't disallow unknown user agents. But the Python
parser doesn't see it that way: no matter what user agent or URL
is specified, the only answer for that robots.txt file is "False".
It fails in Python 2.4 on Windows and in Python 2.5 on Fedora Core.

I use "robotparser" on lots of other robots.txt files, and it
normally works. It even used to work on Wikipedia's older file.
But there's something in there now that robotparser doesn't like.
Any ideas?

John Nagle
Oct 2 '07 #1
In message <HY****************@newssvr21.news.prodigy.net>, John Nagle
wrote:
> For some reason, Python's parser for "robots.txt" files
> doesn't like Wikipedia's "robots.txt" file:
>
> >>> import robotparser
> >>> url = 'http://wikipedia.org/robots.txt'
> >>> chk = robotparser.RobotFileParser()
> >>> chk.set_url(url)
> >>> chk.read()
> >>> testurl = 'http://wikipedia.org'
> >>> chk.can_fetch('Mozilla', testurl)
> False
> >>>

>>> chk.errcode
403

Significant?
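For what it's worth, the 403 explains the blanket "False": in the Python 2 robotparser module, read() records the HTTP status of the robots.txt fetch, and a 401/403 flips an internal disallow-everything flag. A minimal check, using the same URL as above (disallow_all is an undocumented attribute, so treat this as a diagnostic sketch rather than a supported API):

import robotparser

chk = robotparser.RobotFileParser()
chk.set_url('http://wikipedia.org/robots.txt')
chk.read()

# read() stores the HTTP status of the robots.txt request; on 401/403
# the parser marks the whole site as off-limits.
print chk.errcode        # 403 for this fetch
print chk.disallow_all   # True, which is why can_fetch() always says False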

Oct 2 '07 #2
On 02/10/2007, John Nagle <na***@animats.com> wrote:
> But there's something in there now that robotparser doesn't like.
> Any ideas?
Wikipedia denies _all_ access for the standard urllib user agent, and
when robotparser gets a 401 or 403 response while trying to fetch
robots.txt, it treats that as equivalent to "Disallow: /" for every user agent.

http://infix.se/2006/05/17/robotparser
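If the goal is to keep honoring robots.txt while crawling under your own agent string, one workaround is to fetch robots.txt yourself with a non-default User-Agent and feed the lines to the parser via parse(), which skips the fetch that was getting the 403. A rough sketch (the "MyCrawler/0.1" agent string is just a placeholder, and this hasn't been tested against Wikipedia's current setup):

import urllib2
import robotparser

# Fetch robots.txt with a custom User-Agent, since the default
# urllib agent string is what Wikipedia answers with a 403.
req = urllib2.Request('http://wikipedia.org/robots.txt',
                      headers={'User-Agent': 'MyCrawler/0.1'})
lines = urllib2.urlopen(req).read().splitlines()

chk = robotparser.RobotFileParser()
chk.parse(lines)          # no HTTP fetch here, so no 403 to trip over
print chk.can_fetch('MyCrawler/0.1', 'http://wikipedia.org')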

It could also be worth mentioning that if you are planning to
crawl a lot of Wikipedia pages, you may be better off downloading
the whole thing instead: <http://download.wikimedia.org/>
(perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
wiki markup to HTML).
--
filip salomonsson
Oct 2 '07 #3
