473,237 Members | 1,248 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,237 software developers and data experts.

Why doesn't Python's "robotparser" like Wikipedia's "robots.txt"file?

For some reason, Python's parser for "robots.txt" files
doesn't like Wikipedia's "robots.txt" file:
>>import robotparser
url = 'http://wikipedia.org/robots.txt'
chk = robotparser.RobotFileParser()
chk.set_url(url)
chk.read()
testurl = 'http://wikipedia.org'
chk.can_fetch('Mozilla', testurl)
False
>>>
The Wikipedia robots.txt file passes robots.txt validation,
and it doesn't disallow unknown browsers. But the Python
parser doesn't see it that way. No matter what user agent or URL is
specified; for that robots.txt file, the only answer is "False".
It's failing in Python 2.4 on Windows and 2.5 on Fedora Core.

I use "robotparser" on lots of other robots.txt files, and it
normally works. It even used to work on Wikipedia's older file.
But there's something in there now that robotparser doesn't like.
Any ideas?

John Nagle
Oct 2 '07 #1
2 2595
In message <HY****************@newssvr21.news.prodigy.net>, John Nagle
wrote:
For some reason, Python's parser for "robots.txt" files
doesn't like Wikipedia's "robots.txt" file:
>>import robotparser
>>url = 'http://wikipedia.org/robots.txt'
>>chk = robotparser.RobotFileParser()
>>chk.set_url(url)
>>chk.read()
>>testurl = 'http://wikipedia.org'
>>chk.can_fetch('Mozilla', testurl)
False
>>>
>>chk.errcode
403

Significant?

Oct 2 '07 #2
On 02/10/2007, John Nagle <na***@animats.comwrote:
>
But there's something in there now that robotparser doesn't like.
Any ideas?
Wikipedia denies _all_ access for the standard urllib user agent, and
when the robotparser gets a 401 or 403 response when trying to fetch
robots.txt, it is equivalent to "Disallow: *".

http://infix.se/2006/05/17/robotparser

It could also be worth mentioning that if you were planning on
crawling a lot of Wikipedia pages, you may be better off downloading
the whole thing instead: <http://download.wikimedia.org/>
(perhaps adding <http://code.google.com/p/wikimarkup/to convert the
wiki markup to HTML).
--
filip salomonsson
Oct 2 '07 #3

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

5
by: John Nagle | last post by:
This bug, " robotparser interactively prompts for username and password", has been open since 2003. It killed a big batch job of ours last night. Module "robotparser" naively uses "urlopen" to...
5
by: John Nagle | last post by:
Python's "robots.txt" file parser may be misinterpreting a special case. Given a robots.txt file like this: User-agent: * Disallow: // Disallow: /account/registration Disallow: /account/mypro...
3
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 3 Jan 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). For other local times, please check World Time Buddy In...
0
by: abbasky | last post by:
### Vandf component communication method one: data sharing ​ Vandf components can achieve data exchange through data sharing, state sharing, events, and other methods. Vandf's data exchange method...
2
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 7 Feb 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:30 (7.30PM). In this month's session, the creator of the excellent VBE...
1
by: davi5007 | last post by:
Hi, Basically, I am trying to automate a field named TraceabilityNo into a web page from an access form. I've got the serial held in the variable strSearchString. How can I get this into the...
0
by: MeoLessi9 | last post by:
I have VirtualBox installed on Windows 11 and now I would like to install Kali on a virtual machine. However, on the official website, I see two options: "Installer images" and "Virtual machines"....
0
by: DolphinDB | last post by:
The formulas of 101 quantitative trading alphas used by WorldQuant were presented in the paper 101 Formulaic Alphas. However, some formulas are complex, leading to challenges in calculation. Take...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
by: Aftab Ahmad | last post by:
So, I have written a code for a cmd called "Send WhatsApp Message" to open and send WhatsApp messaage. The code is given below. Dim IE As Object Set IE =...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.